Query Engines

Our list of commercial, open source and cloud-based query engines, with information on each, including Hive, Impala, Drill, Pig, Kognitio, Jethro, Amazon Athena and Azure Data Lake Analytics, plus alternatives to these.

Category Definition

Engines that allow analytical queries expressed in a high-level language (often SQL) to be run over one or more underlying data stores or databases. These commonly include HDFS (often using table definitions from the Hive Metastore), with support for other Hadoop, relational and NoSQL databases also common. They support exploitation of raw data (focusing on schema on read and the ability to query across sources) and/or exploitation of data prepared for analytics (focusing on competing with analytical databases). Many technologies started as batch query engines (with high query startup costs and limited support for concurrent queries), but most can now be considered interactive, with support for multiple concurrent low-latency queries. Because so much of this querying is SQL over Hadoop data, many of these technologies are often referred to as SQL-on-Hadoop technologies.
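As a rough illustration of schema on read and querying across sources, the sketch below applies a schema to raw CSV text only when it is read, then joins it with a second table using SQL. It uses Python's built-in sqlite3 module purely as a stand-in for a query engine; the data, table names and function names are hypothetical and do not reflect how any engine listed on this page is implemented.

```python
import csv
import io
import sqlite3

# Hypothetical raw file contents, standing in for a file held in HDFS or S3.
RAW_EVENTS_CSV = """user_id,amount
1,10.50
2,4.25
1,3.00
"""


def load_raw_events(conn, raw_text):
    """Apply a schema to raw text at read time (schema on read)."""
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    reader = csv.DictReader(io.StringIO(raw_text))
    # Types are imposed here, at query time, not when the file was written.
    rows = [(int(r["user_id"]), float(r["amount"])) for r in reader]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)


def load_users(conn):
    """Stand-in for a second source, such as a relational database table."""
    conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")]
    )


def total_spend_by_user():
    """Run one SQL query that joins the raw data with the second source."""
    conn = sqlite3.connect(":memory:")
    load_raw_events(conn, RAW_EVENTS_CSV)
    load_users(conn)
    query = """
        SELECT u.name, SUM(e.amount) AS total
        FROM events e JOIN users u ON u.user_id = e.user_id
        GROUP BY u.name
        ORDER BY u.name
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for name, total in total_spend_by_user():
        print(name, total)
```

A real engine would typically read the files in place and distribute the work across a cluster rather than copying rows into a local database; the sketch only shows the shape of the idea.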

Further Information

See our Data Storage Formats page for information on file formats to use with these query engines.

See our Analytical Databases page for information on technologies that provide analytical capabilities as a self contained storage and query engine stack.

The following analyst material covers a number of technologies in this category:

Open Source Technologies

Apache Hive: Supports the execution of SQL queries over data in HDFS using MapReduce, Spark or Tez based on tables defined in the Hive Metastore
Apache Impala: An MPP query engine that supports the execution of SQL queries over data in HDFS, HBase, Kudu and S3 based on tables defined in the Hive Metastore
Presto: Distributed SQL query engine over data in HDFS, NoSQL and relational databases and Kafka, originally created and open sourced by Facebook - https://prestodb.io/
Apache Drill: An MPP query engine that supports queries over one or more underlying databases or datasets without first defining a schema and with the ability to join data from multiple data stores together
Apache Lens: Provides a cube based federated view over a range of data stores including HDFS, HBase, relational databases, S3 and Redshift - http://lens.apache.org/
Apache Spark SQL: Hive compatible SQL query engine that uses Spark to execute queries over any Spark supported data source
Apache Pig: Technology for running analytical and data processing jobs written in Pig Latin against data in Hadoop using MapReduce, Tez and Spark
Apache MRQL (Incubating): Supports the execution of MRQL queries over data in Hadoop using MapReduce, Hama, Spark or Flink - http://mrql.apache.org/
Kylin: Supports the querying of Hive tables as OLAP cubes

Commercial Technologies

Kognitio: In-memory database engine that can run as a YARN application on Hadoop over data in HDFS (as a free offering) or as a standalone cluster over data in HDFS, the cloud and other databases (as a commercial offering with a free trial) - https://kognitio.com/
Jethro: SQL query engine over HDFS and S3 that supports indexing, auto generation of cubes and results caching - https://jethro.io/
AtScale: Cube based semantic layer with query optimisation, virtual cube caching and row level security over Hadoop, Redshift and SQL data sources - https://atscale.com/
IBM Big SQL: SQL engine that runs on Hadoop over Hive tables, but that can also federate into RDBMS and NoSQL databases and object stores - https://www.ibm.com/us-en/marketplace/big-sql
Oracle Big Data SQL: Allows federated queries from an Oracle database over Hadoop and NoSQL databases, with push down of logic and support for Oracle security - https://www.oracle.com/database/big-data-sql/index.html
Arcadia Data: Analytics engine that runs over Hadoop, with integrated drag and drop visual analytics and dashboards and a free tier (Arcadia Instant) - https://www.arcadiadata.com/product/
Kyvos Insights: OLAP cubes on Hadoop - http://www.kyvosinsights.com/olap-cubes-on-hadoop/

Cloud Technologies

Amazon Athena: SQL query service over data in Amazon S3 - https://aws.amazon.com/athena/
Azure Data Lake Analytics: Massively parallel analytics job service, with support for U-SQL, R, Python, and .NET - https://azure.microsoft.com/en-us/services/data-lake-analytics/
Qubole Quantum: Serverless query engine based on Presto - https://www.qubole.com/product/data-platform/quantum-by-qubole/

Blog Posts