The Apache Software Foundation

The Apache Software Foundation is a non-profit organisation that supports a wide range of open source projects. It provides and mandates a standard governance model (including use of the Apache License), holds all trademarks for project names and logos, and provides legal protection to developers. It was founded in 1999 and now oversees nearly 200 projects.

Vendor Information

Other Names: Apache

Analytical Query Capabilities

HAWQ: A port of the Greenplum MPP database (which is itself based on PostgreSQL) to run over YARN and HDFS.
Tajo: Distributed analytical database engine supporting queries over data in HDFS, Amazon S3, Google Cloud Storage, OpenStack Swift and local storage, as well as querying over Postgres, HBase and Hive tables.
Kudu: Columnar storage technology for tables of structured data, supporting low latency reads, updates and deletes by primary key, as well as analytical column/table scans.
Quickstep (Retired): High performance database engine supporting SQL queries, based on a University of Wisconsin-Madison project - https://github.com/apache/incubator-quickstep
Hive: Supports the execution of SQL queries over data in HDFS using MapReduce, Spark or Tez, based on tables defined in the Hive Metastore (see the query sketch after this list).
Pig: Technology for running analytical and data processing jobs written in Pig Latin against data in Hadoop using MapReduce, Tez or Spark.
MRQL (Incubating): Supports the execution of MRQL queries over data in Hadoop using MapReduce, Hama, Spark or Flink - http://mrql.apache.org/
Impala: An MPP query engine that supports the execution of SQL queries over data in HDFS, HBase, Kudu and S3, based on tables defined in the Hive Metastore.
Drill: An MPP query engine that supports queries over one or more underlying databases or datasets without first defining a schema, with the ability to join data from multiple datastores together.
Lens: Provides a federated view over multiple data stores using a single shared schema server based on the Hive Metastore - http://lens.apache.org/
Kylin: Supports the creation and querying of OLAP cubes on Hadoop, building cubes from star schema data in Hive into HBase, and then providing a SQL interface that queries across Hive and HBase as required - http://kylin.apache.org/
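
Most of the engines above expose a SQL interface over tables registered in the Hive Metastore. As a rough illustration, here is a minimal sketch of running a Hive query from Python using the third-party PyHive client; the host, database and table names are assumptions:

```python
# Query Hive (or any HiveServer2-compatible endpoint) from Python.
# Requires the third-party PyHive package and a reachable HiveServer2;
# host, database and table names are illustrative only.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

# Aggregate over a table defined in the Hive Metastore
cursor.execute(
    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")
for page, hits in cursor.fetchall():
    print(page, hits)

cursor.close()
conn.close()
```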

Analytical Search Capabilities

Solr: A search server built on Apache Lucene with a REST-like API for loading and searching data.
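
Solr's REST-like API can be driven from any HTTP client. A minimal sketch using Python's requests library; the collection and field names are assumptions and must match the collection's schema:

```python
# Index a document and run a query against Solr's REST-like API.
# Assumes Solr running locally with a collection named "articles" (illustrative)
# whose schema accepts the "title_t" field.
import requests

SOLR = "http://localhost:8983/solr/articles"

# Add (and commit) a document
requests.post(f"{SOLR}/update?commit=true",
              json=[{"id": "1", "title_t": "Apache Solr quick start"}])

# Query it back
resp = requests.get(f"{SOLR}/select", params={"q": "title_t:solr", "wt": "json"})
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc.get("title_t"))
```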

Compute Cluster Management

Hadoop/YARN: Resource management and job scheduling & monitoring for the Hadoop ecosystem (see the REST API sketch after this list).
Slider (Incubating): Application for deploying long-running cluster applications on YARN, now effectively dead following the plan to add support for long-running services directly into YARN.
Twill: Abstraction over YARN that reduces the complexity of developing distributed applications - http://twill.apache.org/
Mesos: Resource management over large clusters of machines.
Aurora: Mesos framework for long-running services and cron jobs.
ZooKeeper: Service for managing coordination (e.g. configuration information and synchronisation) of distributed and clustered systems.
Curator: A set of Java libraries that make using Apache ZooKeeper much easier - http://curator.apache.org/
Myriad (Incubating): Tool that allows YARN applications to run over Apache Mesos, allowing them to co-exist and share cluster resources.
REEF: A framework for developing distributed apps on top of cluster frameworks such as YARN or Mesos - http://reef.apache.org/
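
The YARN ResourceManager also exposes cluster and application state over a REST API, which is often the simplest way to monitor jobs programmatically. A minimal sketch; the ResourceManager address is an assumption:

```python
# List running applications via the YARN ResourceManager REST API.
# The ResourceManager host is illustrative (8088 is the usual default port).
import requests

RM = "http://resourcemanager.example.com:8088"

apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()
for app in (apps.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["state"], app["queue"])
```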

Data Formats

Avro: Data serialisation framework that supports both messaging and data storage, primarily using a compact binary format but also supporting a JSON format.
Parquet: Data serialisation framework that supports a columnar storage format to enable efficient querying of data (see the sketch after this list).
Arrow: In memory columnar data format supporting high performance data exchange and fast analytical access.
ORCFile: Evolution of RCFile, spun out into its own Apache project.
CarbonData: Columnar format created by Huawei to address a number of perceived shortcomings in existing formats.
Iceberg (Incubating): File based table format for large, slow-moving tabular data - http://iceberg.incubator.apache.org/
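
Arrow and Parquet are complementary: Arrow defines an in-memory columnar layout and Parquet an on-disk one, and the pyarrow library converts between the two. A minimal round-trip sketch; the file path is illustrative:

```python
# Build an Arrow table in memory and round-trip it through a Parquet file.
# Requires the pyarrow package; the file path is illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "score": [0.5, 0.9, 0.1],
})

pq.write_table(table, "scores.parquet")           # columnar on-disk format
round_tripped = pq.read_table("scores.parquet")   # back into Arrow memory
print(round_tripped.schema)
print(round_tripped.to_pydict())
```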

Data Ingestion

Nifi: General purpose technology for the movement of data between systems, including the ingestion of data into an analytical platform.
Gobblin (Incubating): Framework for managing big data ingestion, including replication, organization and lifecycle management.
Flume: Specialist technology for the continuous movement of data using a set of independent agents connected together into pipelines.
Sqoop: Specialist technology for moving bulk data between Hadoop and structured (relational) databases (see the sketch after this list).
ManifoldCF: Framework for replicating data from content repositories to analytical search technologies - http://manifoldcf.apache.org/
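
Sqoop is driven from the command line, with a typical job importing a relational table into HDFS. A minimal sketch that shells out to the sqoop CLI from Python; the JDBC URL, credentials and paths are assumptions:

```python
# Import a relational table into HDFS using the Sqoop CLI.
# Connection details, table name and target directory are illustrative only.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com:3306/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",   # avoids passwords on the command line
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "--num-mappers", "4",
], check=True)
```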

Data Processing

Hadoop/MapReduce: A data transformation and aggregation technology proven at extreme scale that works on key value pairs.
Spark: A high performance general purpose distributed data processing engine based on directed acyclic graphs that primarily runs in memory, but can spill to disk if required (see the sketch after this list).
Tez: Data processing framework based on Directed Acyclic Graphs (DAGs) that runs natively on YARN and was designed to be a replacement for the use of MapReduce within Hadoop analytical tools.
Crunch: An abstraction layer over MapReduce (and now Spark) that provides a high level Java API for creating data transformation pipelines.
Nemo (Incubating): A runtime for data processing languages that dynamically adjusts to the runtime environment - https://nemo.incubator.apache.org/
Crail (Incubating): High performance distributed and tiered (in memory, flash and disk) storage layer for temporary data that provides memory, storage and network access bypassing the JVM and OS, with support for Spark and Hadoop - http://crail.incubator.apache.org/
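
Spark's DataFrame API is the usual entry point for general purpose data processing. A minimal PySpark sketch; the input/output paths and column names are assumptions:

```python
# A small Spark job: read CSV, aggregate, write Parquet.
# Input/output paths and column names are illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-summary").getOrCreate()

orders = spark.read.csv("hdfs:///data/raw/orders", header=True, inferSchema=True)

summary = (orders
           .groupBy("customer_id")
           .agg(F.sum("amount").alias("total_spend"),
                F.count("*").alias("order_count")))

summary.write.mode("overwrite").parquet("hdfs:///data/curated/order_summary")
spark.stop()
```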

Graph Technologies

Giraph: An iterative, highly scalable graph processing system built on top of MapReduce and based on Pregel.
Hama: A general purpose BSP (Bulk Synchronous Parallel) processing engine inspired by Pregel and DistBelief that runs over Mesos or YARN.
Commons RDF: Commons library for working with RDF data - https://commons.apache.org/proper/commons-rdf/
Jena: Framework for developing Semantic Web and Linked Data applications in Java - http://jena.apache.org/
Rya (Incubating): RDF triple store built on Apache Accumulo - http://rya.apache.org/
S2Graph (Incubating): OLTP graph database built on Apache HBase - https://s2graph.incubator.apache.org/
TinkerPop: Graph compute framework for transactional and analytical use cases that's integrated with a number of graph database technologies (see the sketch after this list) - http://tinkerpop.apache.org
Spark/GraphX: Spark library for processing graphs and running graph algorithms.
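
TinkerPop's Gremlin traversal language has an official Python variant (gremlinpython) that can talk to any TinkerPop-enabled graph via a Gremlin Server. A minimal sketch; the endpoint and property names are assumptions:

```python
# Run a Gremlin traversal against a TinkerPop-enabled graph via Gremlin Server.
# Requires the gremlinpython package; the endpoint, labels and properties are illustrative.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Who does "alice" know, two hops out?
names = (g.V().has("person", "name", "alice")
          .out("knows").out("knows")
          .values("name").dedup().toList())
print(names)

conn.close()
```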

Hadoop and Related Technologies

Hadoop: A distributed storage and compute platform consisting of a distributed filesystem (HDFS), a cluster resource management layer (YARN), and MapReduce, a solution built on HDFS and YARN for massive scale parallel processing of data (see the WebHDFS sketch after this list).
Bigtop: Apache open source distribution of Hadoop.
Ambari: Platform for installing, managing and monitoring Apache Hadoop clusters.
Atlas: A metadata and data governance solution for Hadoop.
Knox: A stateless gateway for the Apache Hadoop ecosystem that provides perimeter security.
Ranger: A centralised security framework for managing access to data in Hadoop.
Sentry: A centralised security framework for managing access to data in Hadoop.
Eagle: Security and performance monitoring solution for Hadoop, donated by eBay - http://eagle.apache.org/
Falcon: Data feed management system for Hadoop, although it no longer appears to be under development and has been deprecated in HDP.
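
HDFS itself can be reached without a Java client through the WebHDFS REST API on the NameNode. A minimal sketch; the NameNode address and paths are assumptions:

```python
# List a directory and read a file through the WebHDFS REST API.
# NameNode address (9870 is the usual default on Hadoop 3) and paths are illustrative.
import requests

NAMENODE = "http://namenode.example.com:9870/webhdfs/v1"

# Directory listing
listing = requests.get(f"{NAMENODE}/data/raw", params={"op": "LISTSTATUS"}).json()
for entry in listing["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])

# Read a file (WebHDFS redirects the client to a DataNode holding the data)
content = requests.get(f"{NAMENODE}/data/raw/README.txt",
                       params={"op": "OPEN"}, allow_redirects=True)
print(content.text[:200])
```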

In Memory Technologies

Ignite: A distributed in-memory data fabric/grid, supporting a range of different use cases and capabilities (see the sketch after this list).
Geode: In memory data management platform, born of Pivotal GemFire - http://geode.apache.org/
Mnemonic: Hybrid memory / storage object model framework - http://mnemonic.apache.org/
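
Ignite ships thin clients for several languages; the Python one (pyignite) provides cache-style key/value access to the data grid. A minimal sketch; the host, port and cache name are assumptions:

```python
# Key/value access to an Ignite cluster via the Python thin client (pyignite).
# Host, port (10800 is the default thin-client port) and cache name are illustrative.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

cache = client.get_or_create_cache("session_cache")
cache.put("user:42", "alice")
print(cache.get("user:42"))

client.close()
```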

Machine Learning Technologies

Spark/MLLib: Spark library for running Machine Learning algorithms (see the sketch after this list).
Mahout: Machine learning technology comprising a Scala based linear algebra engine (codenamed Samsara) with an R-like DSL/API that runs over Spark (with experimental support for H2O and Flink).
MADlib: Machine learning in SQL for PostgreSQL, Greenplum and Apache HAWQ - http://madlib.apache.org/
OpenNLP: Machine learning based toolkit for the processing of natural language text - http://opennlp.apache.org/
SAMOA (Incubating): Machine learning framework that runs over multiple stream processing engines including Storm, Flink and Samza - http://samoa.apache.org/
SINGA (Incubating): Framework for developing machine learning libraries over a range of hardware - https://singa.apache.org/
SystemML: Declarative machine learning over local, Spark or MapReduce execution engines - http://systemml.apache.org/
Hivemall (Incubating): Scalable machine learning library implemented as Hive UDFs/UDAFs/UDTFs - http://hivemall.incubator.apache.org/
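
MLlib's DataFrame-based API (pyspark.ml) covers feature preparation and model training. A minimal logistic regression sketch using inline toy data:

```python
# Train a logistic regression model with Spark MLlib's DataFrame API.
# The toy data is inline, so the sketch needs nothing beyond a local Spark install.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.0, 0.1), (1.0, 3.5, 2.2), (0.0, 0.5, 0.3), (1.0, 4.0, 1.8)],
    ["label", "f1", "f2"])

features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()

spark.stop()
```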

NoSQL Wide Column Stores

Accumulo: NoSQL wide-column datastore based on Google BigTable that runs on Hadoop and HDFS.
Cassandra: Distributed wide-column datastore based on Amazon Dynamo and Google BigTable (see the sketch after this list).
HBase: NoSQL wide-column datastore based on Google BigTable that runs on Hadoop and HDFS.
Fluo: Implementation of Google Percolator for maintaining aggregations in Accumulo - https://fluo.apache.org/
Omid (Incubating): ACID transaction support over MVCC key/value NoSQL datastores, with support for Apache HBase - http://omid.apache.org/
Tephra (Incubating): ACID transaction support over Apache HBase, used by Tigon and Apache Phoenix - http://tephra.apache.org/
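
Cassandra is normally accessed through CQL; the third-party DataStax Python driver (cassandra-driver) is the usual client. A minimal sketch; the contact point, keyspace and table are assumptions:

```python
# Create a table and read/write rows in Cassandra using CQL.
# Requires the cassandra-driver package; contact point, keyspace and table are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        user_id text, event_time timestamp, action text,
        PRIMARY KEY (user_id, event_time))""")

session.execute(
    "INSERT INTO demo.events (user_id, event_time, action) VALUES (%s, toTimestamp(now()), %s)",
    ("user-1", "login"))
for row in session.execute("SELECT * FROM demo.events WHERE user_id = %s", ("user-1",)):
    print(row.user_id, row.event_time, row.action)

cluster.shutdown()
```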

OLTP Databases

Phoenix: An OLTP SQL query engine over Apache HBase tables that supports a subset of SQL-92 (including joins) and comes with a JDBC driver (see the sketch after this list).
Trafodion: OLTP on Hadoop solution based on Tandem NonStop database IP, with commercial support from Esgyn - https://trafodion.incubator.apache.org/
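
Phoenix is usually reached via its JDBC driver, but the Phoenix Query Server also has a Python DB-API client (phoenixdb). A minimal sketch; the query server URL and table name are assumptions:

```python
# SQL over HBase tables via Apache Phoenix and its Query Server.
# Requires the phoenixdb package; the query server URL (8765 is the default port)
# and the table name are illustrative.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

cursor.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, username VARCHAR)")
cursor.execute("UPSERT INTO users VALUES (?, ?)", (1, "alice"))
cursor.execute("SELECT id, username FROM users")
print(cursor.fetchall())

conn.close()
```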

IoT Databases

IoTDB (Incubating): Massive scale IoT time series database - http://iotdb.incubator.apache.org/; https://wiki.apache.org/incubator/IoTDBProposal

Streaming Analytics

Storm: Specialised distributed stream processing technology based on a single record (not micro batch) model with at least once processing semantics.
Flink: Specialised stream processing technology inspired by the Google Dataflow model, based on a single record (not micro batch) model, with exactly once processing semantics (for supported sources and sinks) via lightweight checkpointing, and support for batch processing.
Spark/Streaming: Spark library for continuous stream processing that allows stream and batch processing (including Spark SQL and MLlib operations) to be combined (see the sketch after this list).
Kafka Streams: Stream processing framework built over Apache Kafka, with support for stateful tables.
Beam: Model and SDKs for running batch and streaming workflows over Apex, Flink, Spark and Google Dataflow - https://beam.apache.org/
Apex: Data transformation engine based on Directed Acyclic Graph (DAG) flows configured through a Java API or via JSON, that runs over YARN and HDFS with native support for both micro-batch streaming and batch use cases.
Heron (Incubating): The stream processing framework that Twitter built after Storm, with a Storm compatible API - http://heron.incubator.apache.org/
Samza: Stream processing framework built on Kafka and YARN - http://samza.apache.org/
Bahir: A suite of streaming connectors for Spark and Flink, including support for Akka, MQTT, Twitter and ZeroMQ - http://bahir.apache.org/
Gearpump (Retired): Real-time streaming engine based on the micro-service Actor model, now retired - http://gearpump.apache.org/
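
Spark Structured Streaming expresses a streaming job with the same DataFrame operations used for batch work, which is what allows the two to be combined. A minimal word-count sketch reading from a local socket; the host and port are assumptions:

```python
# Streaming word count with Spark Structured Streaming.
# Reads lines from a local socket (e.g. started with `nc -lk 9999`); host/port are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-wordcount").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

counts = (lines
          .select(F.explode(F.split(lines.value, " ")).alias("word"))
          .groupBy("word").count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```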

Streaming Data Stores

Kafka: Technology for buffering and storing real-time streams of data between producers and consumers, with a focus on high throughput at low latency (see the sketch after this list).
BookKeeper: Distributed log storage service from Yahoo - http://bookkeeper.apache.org/
DistributedLog: Distributed log service from Twitter supporting durability, replication and strong consistency, built over Apache BookKeeper - http://bookkeeper.apache.org/distributedlog/
Pulsar: Distributed pub-sub messaging from Yahoo, with persistent message storage based on Apache BookKeeper - http://pulsar.incubator.apache.org/
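
Kafka decouples producers from consumers through durable, partitioned topics. A minimal produce-and-consume sketch using the third-party kafka-python client; the broker address and topic name are assumptions:

```python
# Produce and consume messages on a Kafka topic.
# Requires the kafka-python package; broker and topic names are illustrative.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", key=b"user-1", value=b'{"page": "/home"}')
producer.flush()

consumer = KafkaConsumer("clickstream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```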

Workflow Management

Oozie: Technology for managing workflows of jobs on Hadoop clusters.
Airflow: Workflow automation and scheduling system that can be used to author and manage data pipelines.
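
Airflow pipelines are plain Python: a DAG object plus operators wired together with dependencies. A minimal sketch of a two-step daily pipeline; the task commands are placeholders and the import path follows the Airflow 1.x layout:

```python
# A minimal Airflow DAG: extract then load, once a day.
# Task commands are placeholders; the operator import path follows Airflow 1.x.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="daily_ingest",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

extract >> load   # run extract before load
```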

Other Technologies

DataFu: A set of libraries for working with data in Hadoop, consisting of two sub-projects - DataFu Pig (a set of Pig User Defined Functions) and DataFu Hourglass (a framework for incremental processing using MapReduce).
AsterixDB: Scalable “Big Data Management System” - https://asterixdb.apache.org/
Chukwa: Specialist technology for the ingestion of continuous data flows into a Hadoop cluster, and the subsequent management and analysis of that data - https://chukwa.apache.org/
Edgent (Incubating): Stream processing programming model and lightweight runtime to execute analytics at devices on the edge or at the gateway, previously known as Quarks - http://edgent.apache.org/
Gora: ORM with support for a range of NoSQL, Search and Hadoop data formats - http://gora.apache.org/
Helix: A framework for building long lived persistent distributed systems - http://helix.apache.org/
Kerby: Java Kerberos binding - http://directory.apache.org/kerby/
MetaModel: Technology for reading and writing database metadata, with connectors for a wide range of databases - http://metamodel.apache.org/
Toree (Incubating): Framework to allow interactive applications to communicate with a remote Spark cluster - http://toree.apache.org/
Calcite: A framework for building SQL based data access capabilities, providing a SQL parser and validator and tools for the transformation and (cost based) optimisation of SQL expression trees.
Livy (Incubating): A service that allows Spark jobs (pre-compiled JARs) or code snippets (Scala or Python) to be executed by remote systems over a REST API or via clients for Java, Scala and Python (see the sketch after this list).
Superset (Incubating): Web based tool for interactive exploration of OLAP style data, supporting interactive drag and drop querying, composable dashboards and a SQL workspace (SQL Lab).
Zeppelin: A web based notebook for interactive data analytics.
Commons Compress: Suite of Java libraries for working with a range of compression and packaging formats - https://commons.apache.org/proper/commons-compress/
Commons CSV: Suite of Java libraries for working with CSV files - https://commons.apache.org/proper/commons-csv/
Griffin: Data Quality Service platform built on Apache Hadoop and Apache Spark - http://griffin.apache.org/
Tika: Toolkit for extracting text from a wide range of document formats - http://tika.apache.org/
UIMA: Framework for unstructured data analysis - http://uima.apache.org/
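
Livy makes a remote Spark cluster callable from any HTTP client: create a session, submit code, poll for the result. A minimal sketch using Python's requests library; the Livy endpoint is an assumption (8998 is the usual default port) and the polling is simplified:

```python
# Submit a PySpark code snippet to a remote Spark cluster through Apache Livy.
# The Livy endpoint is illustrative; error handling and polling are simplified.
import time
import requests

LIVY = "http://livy.example.com:8998"

# 1. Start an interactive PySpark session and wait for it to become idle
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# 2. Run a code snippet remotely and poll for its result
stmt = requests.post(f"{session_url}/statements",
                     json={"code": "sc.parallelize(range(100)).sum()"}).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"
result = requests.get(stmt_url).json()
while result["state"] != "available":
    time.sleep(2)
    result = requests.get(stmt_url).json()
print(result["output"])

# 3. Clean up the session
requests.delete(session_url)
```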

Links

News
Blog Posts