Technologies

A catalogue of data transformation, data platform and other technologies used within the Data Engineering space

AlluxioA distributed virtual storage layer, supporting key-value and filesystem interfaces (including HDFS compatibility and a FUSE driver) with support for a range of computation and storage frameworks (including Spark, MapReduce, HBase and Hive) over multiple storage layers (including in-memory, local, network, cloud and cluster file systems) with the ability to create unified and tiered storage, for example to create an in-memory filesystem backed by disk to accelerate analytics jobs. Supports a POSIX-like access control model, a CLI and web interface for browsing the storage layer, and an S3-compatible API. Java based, Open Source under the Apache 2.0 licence, hosted on GitHub, with development led by Alluxio (with significant external contributions), although they don't appear to yet provide commercial support (but do provide training). Started in December 2012, open sourced in April 2013, with a v1.0 release in February 2016. Formerly known as Tachyon.
Amazon EMRService for dynamically provisioning Hadoop clusters on Amazon EC2 infrastructure, with the ability to select one or more Hadoop-based services to be pre-installed and configured. Supports selection of EC2 instance types, EC2 spot and reserved instances, programmatic execution of service jobs (steps), persistent or transient (terminate after pre-defined steps have been executed) clusters, automatic or manual scaling of live clusters, cloning of clusters, HDFS on local (EBS) node storage, an HDFS compatible filesystem (EMR File System - EMRFS) for accessing Amazon S3 storage (that supports consistency using DynamoDB for metadata), automatic configuration of Hadoop clusters and firewalls, integration with AWS CloudWatch and AWS Identity and Access Management, Hadoop encryption and Kerberos authentication, persistent storage of Hive metadata in AWS Glue Data Catalog, and bootstrap actions for custom configuration or installation of other services (with a GitHub repo of open source bootstrap action extensions). Manageable via the AWS Management Console, the AWS CLI, a REST API and a range of SDKs. Priced at an hourly rate (charged per second) based on the EC2 instance types being used, which is in addition to any EC2 or EBS charges.
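As an illustration of programmatic provisioning, the sketch below uses the boto3 SDK to create a transient cluster that runs a single Spark step and then terminates; the region, release label, instance types, roles and S3 path are placeholder assumptions rather than details taken from the entry above.

import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Create a transient cluster that runs one Spark step and then terminates.
# Roles, instance types and S3 locations below are placeholders.
response = emr.run_job_flow(
    Name="example-transient-cluster",
    ReleaseLabel="emr-5.20.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://example-bucket/jobs/etl.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])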
Amazon S3An object store service with eventual consistency, focusing on massive durability and scalability, with support for multiple storage tiers (including Amazon Glacier and Glacier Deep Archive) and deep integration to the AWS ecosystem. Objects are organised into buckets and indexed by string, with the option to list objects by prefix and to summarise results based on a delimiter allowing a filesystem to be approximated. Metadata against objects is managed via S3 Object Tags, key-value pairs applied to objects that can be added, modified or deleted at any time. Lifecycle management policies can be assigned to name prefixes or object tags to automatically delete objects or move them between storage tiers. Supports versioning of objects, access control (at the bucket or object level), retrieving subsets of objects via server side queries (S3/Glacier Select), batch operations (including Lambda function execution), replication of objects and metadata to a bucket in a different AWS region (cross-region replication), encryption of objects and support for SSL connections, immutable blobs (via Glacier Vault Lock), full auditing of all object operations, analytics on object operations, multi-part uploads, multi-object deletions, a flat-file output of object names and metadata (S3 Inventory), downloads via the BitTorrent protocol, static website hosting and time limited object download URLs. Quotes a 99.999999999% guarantee that data won't be lost, with data stored redundantly across multiple devices and facilities within the chosen region, and scalability past trillions of objects. Provides a web based management console, mobile management app, a REST API and SDKs for a wide range of languages. First launched in March 2006.
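A minimal sketch of the object model described above using the boto3 SDK; the bucket name, keys and tag are placeholders.

import boto3

s3 = boto3.client("s3")
bucket = "example-bucket"  # placeholder bucket name

# Write an object, applying an object tag on upload
s3.put_object(Bucket=bucket, Key="raw/2019/01/events.json",
              Body=b'{"event": "signup"}',
              Tagging="retention=90days")

# Approximate a directory listing using a prefix and delimiter
listing = s3.list_objects_v2(Bucket=bucket, Prefix="raw/2019/", Delimiter="/")
for common_prefix in listing.get("CommonPrefixes", []):
    print(common_prefix["Prefix"])

# Generate a time-limited download URL (valid for one hour)
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": bucket, "Key": "raw/2019/01/events.json"},
    ExpiresIn=3600)
print(url)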
Apache AccumuloNoSQL wide-column datastore based on BigTable. Supports horizontal scalability, cell based access control (based on arbitrary boolean expressions of user security labels), high availability, atomic read-modify-write operations, MapReduce support (both as a source and sink), table constraints, LDAP and Kerberos integration, the use of HDFS for underlying storage, and replication between instances. Comes with a web based monitoring interface (Accumulo Monitor) and a CLI. Written in Java, with a Thrift-based API allowing access from other languages including C++, Python and Ruby. Originally developed at the NSA, donated to the Apache Foundation in September 2011, before graduating in March 2012, and is still under active development.
Apache AirflowA workflow management system designed for orchestrating repeated data integration tasks on a schedule, with workflows configured in Python as a Directed Acyclic Graph (DAG) of tasks. A scheduler is responsible for identifying tasks to be run, with an executor responsible for determining where tasks should run (with support for local execution or remote execution using Celery, Dask, Mesos and Kubernetes, with the ability to define custom executors). Supports periodic execution of workflows (based on a schedule interval), sensor operators (that wait until some condition is true, e.g. a file exists), automatic retry of failed tasks, catchup of historic task executions, task templating, triggers and complex dependencies, shared connection configuration, configurable job parallelism, variables that can be configured through the UI, re-usable sub DAGs and experimental support for data lineage with integration to Apache Atlas. Packaged with a wide variety of prebuilt 'operators' for data integration: databases (MySQL, PostgreSQL, Oracle), Hadoop (Hive, Pig, Sqoop) and cloud services (Amazon Web Services, Google Cloud Platform and Microsoft Azure services), with the ability to write your own. Comes with a command line and web interface to manage and monitor workflows and perform administrative actions on the environment, and an experimental REST API. Persists workflow management state and operational metadata in either a MySQL or PostgreSQL relational database, which is queryable using SQL via the web interface to create simple charts. Includes a security model with support for a range of authentication methods including LDAP, Kerberos (limited), OAuth and Google Authentication. Originally developed at Airbnb and donated to the Apache Foundation's incubator program in June 2015. Under active development with a wide range of contributors. Commercial support is available from a variety of vendors who distribute it as a standalone managed service (Astronomer and Google), to run on Kubernetes (Astronomer), or as part of a wider managed data service offering (Qubole).
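To illustrate how workflows are expressed in Python, a minimal DAG sketch using Airflow 1.x-era imports; the DAG id, schedule and tasks are hypothetical.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,                              # automatic retry of failed tasks
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_daily_ingest",             # hypothetical pipeline name
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",                # periodic execution
    catchup=True,                              # backfill historic runs
)

def transform(**context):
    print("transforming partition", context["ds"])

extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = PythonOperator(task_id="transform_and_load", python_callable=transform,
                      provide_context=True, dag=dag)

extract >> load                                # declare the task dependency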
Apache AmbariPlatform for installing, managing and monitoring Apache Hadoop clusters. Supports the installation of different versions of different distributions of Hadoop through Stack definitions (with support for HDP out of the box, and further stacks and add ons available through management packs), and the specification of Blueprints (cluster layouts and configuration for a given Stack) that can be used to programmatically create multiple clusters (e.g. dev, test and production). Also supports both rolling (no downtime) and express (faster but with downtime) upgrades; cluster administration (including adding and removing nodes/services, viewing the status of nodes/services, and configuring services with the versioning of configuration and the ability to rollback changes); the automated Kerberization of clusters; the collection, storage (in HBase) and visualisation (via Grafana or through dashboards in Ambari) of system and Hadoop component metrics via the Ambari Metrics System (AMS); alerting on statuses and metrics; the collection, storage (in Solr) and searching/viewing of log entries from across the Hadoop cluster (currently in technical preview); and a framework for UI components within Ambari (Ambari Views, treated here as a sub-project). Web based, with a REST API, and backed by a backend database (Oracle, MySQL or Postgres). Donated to the Apache Foundation by Hortonworks, IBM and Yahoo in August 2011 as the Hadoop Management System (HMS), graduating in December 2013 after changing its name to Ambari. Still under active development with a large number of contributors.
Apache Ambari >  Ambari ViewsFramework within Ambari that allows new applications or views to be added to Ambari, based on new client side code (HTML, JavaScript and CSS) supported by new backend code (Java) that exposes REST API end points for the UI to consume. Comes with support for a number of views out of the box, including YARN Queue Manager (supports the creation and configuration of YARN capacity schedule queues), Files (supports copying and moving, uploading and setting permissions on files in HDFS), Falcon (supports defining, scheduling and monitoring data management pipelines), Hive (supports browsing databases, executing queries and viewing explain plans, saving queries, viewing query history and uploading data to Hive tables), Pig (supports executing Pig scripts and viewing execution history), SmartSense (supports capture and download of bundles), Storm (supports viewing cluster status, monitoring topologies, performing topology management and accessing metrics and logs) and Tez (supports viewing and debugging Tez jobs), along with technical previews of Workflow Designer, Zeppelin and Hue migration views. Views can be deployed into a standalone Ambari instance to separate these from the primary Ambari management instance and to support scaling out.
Apache ApexData transformation engine based on Directed Acyclic Graph (DAG) flows configured through a Java API or via JSON, with a stated focus on performance, code re-use, testability and ease of operations. Runs over YARN and HDFS with native support for both micro-batch streaming and batch use cases, and includes a range of standard operators and connectors (called Apex Malhar). An Apache project, graduating in April 2016, having been originally donated in August 2015 by DataTorrent from their DataTorrent RTS product which launched in June 2014. Java based, with development led by DataTorrent who distribute it as DataTorrent RTS in two editions - a Community Edition (which also includes a basic management GUI and a tool for configuring Apex for data ingestion), and an Enterprise Edition (which further includes a graphical transformation editor, a self service dashboard, security integration and commercial support, and is also available as a cloud offering).
Apache ArrowIn-memory data structure specification for building columnar based data systems. Provides a standard interchange format to allow sharing of data between processes on a node without the overhead of moving or transforming the data, permits O(1) random access and has the ability to represent both flat relational structures and complex hierarchical nested data. Data is organised using a columnar in-memory layout, making it cache efficient for analytical workloads (which typically group all data relevant to a column operation together) and allowing execution engines to take advantage of modern CPU SIMD (Single Instruction Multiple Data) instructions which work on multiple data values simultaneously in a single CPU clock cycle. Supports Java, C, C++, JavaScript, Python, Go, Ruby and Rust. Seeded from the Apache Drill project and promoted directly to a top level Apache project in February 2016, followed by an initial 0.1 release in October 2016. Used in a range of other projects including Drill, Spark, Impala, Kudu, Pandas and others. Has not yet reached a v1.0 milestone, but is still under active development with a range of contributors from a number of other Apache and non-Apache data projects.
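A small pyarrow sketch showing the columnar in-memory table and the Arrow IPC stream format used to hand data between processes; the column names and values are purely illustrative.

import pyarrow as pa

# Build a columnar, in-memory table from plain Python data
table = pa.Table.from_pydict({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "country": pa.array(["GB", "DE", "FR"]),
})
print(table.schema)

# Serialise the table using the Arrow IPC (stream) format so another
# process can read it back without transforming the data
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, table.schema)
writer.write_table(table)
writer.close()

buf = sink.getvalue()
reader = pa.RecordBatchStreamReader(buf)
print(reader.read_all().num_rows)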
Apache AtlasA metadata and data governance solution for Hadoop. Supports an extensible metadata model with out of the box support for Hive datasets and data lineage from Hive queries and Sqoop imports, with limited support for Falcon, Storm and Kafka. Allows datasets and data items to be tagged (and for these tags to be used for access control by Apache Ranger), and includes support for business taxonomies as a technical preview. Implemented as a graph based database using Titan (which by default uses HBase and Solr), with a web based user interface and a REST API for searching and visualising/retrieving metadata, and Kafka topics for the ingest of metadata (primarily from hooks in metadata sources such as Hive or Sqoop) and the publishing of metadata change events. Donated to the Apache Foundation in May 2015 by the Hortonworks Data Governance Initiative in partnership with Aetna, Merck, Target, Schlumberger and SAS, graduating in June 2017. Has not yet reached a v1.0 milestone, but is still under active development.
Apache AuroraA service scheduler for defining and managing bundled tasks as jobs across a cluster of servers using Mesos, leveraging Mesos for resource allocation and isolation at the task level. Operates as a Mesos framework, a Python based domain specific language (DSL) for job template definition, an executor for carrying out the workload described in the DSL, an associated command line interface for schedule management and a web interface providing read-only status of jobs and associated diagnostic information. Defines a fine-grained task state model to support resource allocation, rolling upgrades, health checking, priority-based scheduling and application maintenance. Handles cross-cutting concerns like observability and log collection. Supports priority-based scheduling, using pre-emption so that when resources are low, lower priority jobs can be stopped to make room for the higher priority tasks. An Apache project, originally created at Twitter, donated to the Apache Foundation in October 2013, graduating in March 2015 (0.8.0 Released). Hasn't yet reached a v1.0 milestone, however still under development from a range of contributors.
Apache AvroData serialisation framework that supports both messaging and data storage. Primarily uses a compact binary format but also supports a JSON format. Supports a range of data structures (including records, enumerations, arrays and maps) with APIs for a wide range of both static and dynamically typed languages. Schema based, with schemas primarily specified in JSON, and support for both code generation from schema definitions as well as dynamic runtime usage. Schemas are serialised alongside data, with support for automatic schema resolution if the schema used to read the data differs from that used to write it. Started as an Hadoop sub-project by Cloudera in April 2009, with an initial v1.0 release in July 2009, before becoming a top level Apache project in May 2010. Has seen significant adoption in the Hadoop ecosystem.
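A minimal sketch using the third-party fastavro library to show the schema-alongside-data model described above; the record schema here is hypothetical.

from io import BytesIO
from fastavro import writer, reader, parse_schema

# Schemas are specified in JSON; here expressed as an equivalent Python dict
schema = parse_schema({
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
})

buf = BytesIO()
writer(buf, schema, [{"user_id": 42, "url": "/home", "referrer": None}])

# The schema travels with the data, so the reader needs no prior knowledge of it
buf.seek(0)
for record in reader(buf):
    print(record)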
Apache BeamUnified batch and streaming programming model to define portable data processing pipelines and execute these using a range of different engines. Originating from the Google Dataflow model, focuses on unifying both styles of processing by treating static data sets as streams (which happen to have a beginning and an end), while achieving data correctness and the ability to handle late-arriving data through a set of abstractions and concepts that give users control over estimated quality of arrived data (completeness), duration to wait for results (latency) and how much speculative/redundant computation to do (cost). Allows business logic, data characteristics and trade-off strategies to be defined via different programming languages through pluggable language SDKs (with out of the box support for Java and Python). Supports a range of pluggable runtime platforms through pipeline runners, with support for a direct runner (for development and testing pipelines in a non-distributed environment), Apache Apex, Flink, Spark, and (under development) Gearpump runners, and a Google Cloud Dataflow runner. Also supports a growing set of connectors that allow pipelines to read and write data to various data storage systems (IOs). An Apache project, open sourced by Google in January 2016, graduated in January 2017, with a first stable release (2.0) in May 2017. Written in Java and Python and under active development with a large number of contributors including Google, data Artisans, Talend and PayPal.
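A minimal word-count sketch using the Beam Python SDK and the direct runner; changing the --runner option (and supplying the runner's own settings) targets one of the other pipeline runners.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs locally on the direct runner; the same pipeline can be submitted to
# Flink, Spark or Dataflow by changing the runner option.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.Create(["alpha beta", "beta gamma"])
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))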
Apache BigtopAn Apache open source distribution of Hadoop. Packages up a number of Apache Hadoop components as RPM/DEB packages for most flavours of Linux, certifying their interoperability using an automated integration test suite. Also includes virtual machine images and Vagrant, Docker and Puppet recipes for deploying and working with Hadoop. Does not patch projects for distribution, but requires any fixes to be made upstream. An Apache Open Source project, started by Cloudera, donated to the Apache foundation in June 2011, graduating in September 2012, with a 1.0 release in August 2015 based on Hadoop 2.6. Since donating the project, Cloudera have backed away from it, with the project lead moving to Pivotal in December 2013. Now has a broad range of contributors, however usage by the major distributors is not clear.
Apache CalciteA framework for building SQL based data access capabilities. Supports a SQL parser and validator, tools for the transformation and (cost based) optimisation of SQL expression trees, and an adapter framework for accessing metadata and executing queries (including out of the box adapters for a number of database technologies as well as CSV files and POJO objects), along with specific support for streaming SQL queries and optimising data cube queries to use materialised views. Also includes (as a sub-project named Avatica), a framework for building database drivers with support for a standard JDBC driver, server and wire protocols, plus a local embeddable JDBC driver. Used in a range of other projects including Drill, Flink, Hive, Kylin, Phoenix, Samza, Storm and Cascading. An Apache project, originally created by Julian Hyde in May 2012 as Optiq, donated to the Apache Foundation in May 2014, graduating in October 2015 following a v1.0 release in January 2015. Under active development with a range of contributors.
Apache CarbonDataUnified storage solution for Hadoop based on an indexed columnar data format, focusing on providing efficient processing and querying capabilities for disparate data access patterns. Data is loaded in batch, encoded, indexed using multiple strategies, compressed and written to HDFS using a columnar file format. Provides a number of highly configurable indexes (multi-dimensional key, min/max index, and inverted index), global dictionary encoding and column grouping to support interactive style OLAP queries, high throughput scan queries, low latency point queries and individual record queries. Also supports batch updates and deletes using delta bitmap files and compaction. Written in Java using Apache Thrift, supports all common primitive data types and complex nested data types including array and structures. Consists of several modules, the format specification and core implementation (columnar storage, indexing, compression, encoding), Hadoop input/output format interface, deep integration with Spark, interfacing to Spark SQL and the DataFrame API and connectors for Hive and Presto. Started back in 2013 at Huawei's India R&D center, donated to the Apache Foundation in 2015, graduated in April 2017, with a stable (1.1.0) release in May 2017, and under active development.
Apache CassandraDistributed wide-column datastore based on Amazon Dynamo and Google BigTable. Focuses on fault tolerance, linear scalability and operational simplicity with zero downtime based on a distributed masterless node and peer-to-peer design. Supports high availability using network topology aware data replication to avoid single points of failure, fast real-time and durable ingestion of data using an append-only log, strong query performance based on an in-memory index (log-structured merge-tree) that is persisted as a sorted string table (SST) for fast sequential retrieval, and tunable consistency (between strong and eventual) allowing availability (the number of replicas on which a write must succeed), data accuracy (the number of replicas that must respond to a read request before returning data) and performance to be traded off on a global or per-operation basis. Does not support joins or subqueries; rather, it emphasises denormalisation through features like collections. Comes with a command line shell (cqlsh) for using Cassandra Query Language (resembling SQL), a wide number of drivers for many languages including Java, Python, Ruby, C++ and Go, and Nodetool, a CLI for cluster management. Metrics can be queried via JMX or pushed to external monitoring systems, SSL encryption provides secure communication, and authentication and authorisation is provided based on internally controlled rolename/passwords and object permission management. An Apache project, graduating in February 2010, having been originally open sourced in July 2008 by Facebook. Written in Java and under active development with major contributions from DataStax who distribute it as a part of their DataStax Enterprise offering. Other commercial vendors include Instaclustr and Winguzone who provide hosted and managed Apache Cassandra as a service on a number of major cloud providers.
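A minimal sketch using the DataStax Python driver (cassandra-driver); the contact point, keyspace, table and replication settings are illustrative.

import datetime
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # contact points; placeholder address
session = cluster.connect()

# Denormalised, query-driven table design: events clustered under each user
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        user_id uuid, event_time timestamp, payload text,
        PRIMARY KEY (user_id, event_time)
    )
""")

session.execute(
    "INSERT INTO demo.events (user_id, event_time, payload) VALUES (%s, %s, %s)",
    (uuid.uuid4(), datetime.datetime.utcnow(), '{"action": "login"}'))

for row in session.execute("SELECT * FROM demo.events LIMIT 10"):
    print(row.user_id, row.payload)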
Apache CrunchAn abstraction layer over MapReduce (and now Spark) that provides a high level Java API for creating data transformation pipelines, originally designed to make working with MapReduce easier based on the Google FlumeJava paper. Also includes connectors for HBase, Hive and Kafka, Java 8 lambda support, an experimental Scala wrapper for the API (Scrunch), and support for in memory pipelines and helper classes to support testing. Open sourced by Cloudera in October 2011, donated to the Apache Foundation in May 2012, before graduating in February 2013. Support for Spark was added as part of v0.10 in June 2014. Still being maintained, and appears to have been adopted at a number of large companies, but with limited new development.
Apache DataFuA set of libraries for working with data in Hadoop. Consists of two sub-projects - DataFu Pig (a set of Pig User Defined Functions) and DataFu Hourglass (a framework for incremental processing using MapReduce). Originally created at LinkedIn, with the Pig UDFs being open sourced in January 2012 as DataFu, with a v1.0 release in September 2013. Split into sub-projects in October 2013 when LinkedIn open sourced DataFu Hourglass and added it to the project. Donated to the Apache Foundation in January 2014, graduating in February 2018. Last major release was v1.3 in November 2015, with a handful of bug fix releases but little development activity since then.
Apache DataFu >  DataFu HourglassA framework over MapReduce that supports the efficient generation of statistics of dated data by incrementally updating the previous days output. Supports both fixed length and fixed start point windows, and the generation of statistics by input partition or as a total over all input data.
Apache DataFu >  DataFu PigA set of user defined functions for Apache Pig, including support for statistical calculations, bag and set operations, sessionisation of streams of data, cardinality estimation, sampling, hashing, PageRank and others.
Apache DrillAn MPP query engine that supports queries over one or more underlying databases or datasets without first defining a schema and with the ability to join data from multiple datastores together. Supports a range of underlying technologies including HDFS, NAS, HBase, MongoDB, MapR-DB, MapR-FS, Kafka, OpenTSDB, Amazon S3, Azure Blob Storage, Google Cloud Storage, JDBC, Avro, JSON and Parquet. Pushes queries down to underlying datastores where possible, and supports an in-memory columnar datastore based on a schema free JSON document model for performing cross datastore query operations. Supports dynamic schema discovery, with support for complex and nested types, including a number of SQL extensions. Supports standard SQL, UDFs (including Hive UDFs) and comes with JDBC and ODBC drivers, a REST API, plus a shell, web console and C++ API. Designed to be horizontally scalable and to support high throughput and low latency use cases, and can run over YARN. Supports Kerberos and username/password authentication, plus a full authorisation model. Created by MapR based on Google's Dremel paper, donated to the Apache Foundation in September 2012, graduating in November 2014, with a 1.0 release in May 2015, and is still under active development.
Apache Druid (Incubating)An open source distributed database built to support sub-second OLAP / star schema style queries on both real-time and historical data, based on columnar storage and inverted indexes. All data must have a timestamp, one or more dimension fields, and then one or more measures, with data being aggregated by timestamp and dimension fields on ingest. Comes with a batch ingestor (with support for reading from HDFS, S3 and local files), a streaming ingestor (with support for local files and an HTTP endpoint), and a streaming data endpoint (Tranquility, with support for Kafka, Storm and Spark Streaming and an API for use with other systems), with real-time ingests not guaranteed under failure, but with support for hybrid architectures whereby real-time data ingests are replaced with batch refreshes when available. Architecture based on a number of different node types - historical nodes (which serve queries against a local cache of data that's been persisted in S3 or HDFS), real-time nodes (which support ingest and querying of streaming data, with data persisted and handed over to an historical node once aged), and broker nodes (which distribute queries to appropriate real-time and historical nodes and then collate the results). All data is segmented by date and time, with a metadata database (e.g. MySQL, PostgreSQL, or Derby) tracking segments and which nodes are serving them, and Apache ZooKeeper used for co-ordination and communication between nodes. Supports low latency lock free ingestion, a JSON REST endpoint for queries (with support for a range of query types including timeseries, TopN, groupBy and select), a range of SDKs, approximate and exact computations, multiple storage tiers (including data lifecycle rules on tiering and dropping data), metrics (for queries, ingestion, and coordination), rolling upgrades, HTTP authentication (including Kerberos, but no further security controls), and a number of experimental features including small dimension lookups (note that general joins are not supported), multi-value dimension fields and a SQL interface based on Apache Calcite. Started in 2011 by Metamarkets, open sourced under the GPL licence in October 2012, moving to an Apache licence in February 2015 and donated to the Apache Incubator in February 2018. Has a wide range of companies listed on the Druid website as users, and natively supported by Apache Superset and Grafana (via a plugin). Commercial support available from Imply (who distribute their own product based on Druid including a SQL interface and a data exploration tool called Pivot), and currently in tech preview as part of the Hortonworks Data Platform, where it's being integrated with Apache Hive.
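A sketch of a native timeseries query posted to a broker node's JSON REST endpoint using the requests library; the broker address, datasource and metric names are placeholders, and 8082 is assumed to be the default broker port.

import requests

# Native timeseries query: daily sums of a "count" metric over January 2019.
# Datasource, field names and host below are illustrative assumptions.
query = {
    "queryType": "timeseries",
    "dataSource": "web_events",
    "granularity": "day",
    "intervals": ["2019-01-01/2019-02-01"],
    "aggregations": [
        {"type": "longSum", "name": "events", "fieldName": "count"}
    ],
}

resp = requests.post("http://broker-host:8082/druid/v2/", json=query)
for bucket in resp.json():
    print(bucket["timestamp"], bucket["result"]["events"])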
Apache FalconData feed management system for Hadoop. Supports the definition, scheduling and orchestration (including support for late data and retry policies) of data processing pipelines (referred to as processes, with support for Oozie, Spark, Hive and Pig jobs), the management of the data produced and consumed by these pipelines (referred to as feeds, with support for data in HDFS and Hive) and the generation and visualisation of pipeline lineage information, all across multiple Hadoop clusters. Also includes the ability to mirror or replicate HDFS and Hive data between clusters, to failover processing between clusters and to import and export data using Sqoop. Supports both a web and command line interface and a REST API. An Apache project, graduating in December 2014, having been originally donated by inMobi in April 2013. Hasn't yet reached a v1.0 milestone, is seeing very little development activity, and as of HDP 3.0 will no longer be distributed by Hortonworks.
Apache FlinkSpecialised stream processing technology inspired by the Google Dataflow model. Based on a single record (not micro batch) model, with exactly once processing semantics (for supported sources and sinks) via lightweight checkpointing, and focusing on high throughput, low latency use cases. Supports both a Java and Scala API, with a fluent DataStream API for working with continuous data flows (including a flexible windowing API that supports both event time and processing time windows and support for out of order or late data), and a DataSet API for working with batch data sets (that uses the same streaming execution engine). Also supports a number of connectors and extra libraries, including experimental support for SQL expressions, a CEP library (FlinkCEP) that can be used to detect complex event patterns, a beta package for running Storm apps on Flink, a graph processing library (Gelly) and a machine learning library (FlinkML). Clustered, with support for YARN, Mesos and Kubernetes as well as standalone clusters. Open sourced by Data Artisans in April 2013, donated to the Apache Foundation in April 2014 before graduating in August 2014. Under active development with a large number of contributors and a range of user case studies. Sold as a hosted managed service (dA Platform) by Data Artisans who also supply training.
Apache FlumeSpecialist technology for the continuous movement of data using a set of independent agents connected together into pipelines. Supports a wide range of sources, targets and buffers (channels), along with the ability to chain agents together and to modify and drop events in-flight. Designed to be highly reliable, and to support reconfiguration without the need for a restart. Heavily integrated with the Hadoop ecosystem. An Apache project, donated by Cloudera in June 2011, graduating in June 2012, with a v1.2 release (the first considered ready for production use) in July 2012. Java based, with commercial support available as part of most Hadoop distributions.
Apache GiraphAn iterative, highly scalable graph processing system built on top of MapReduce and based on Pregel, with a number of features added including a framework for creating re-usable code (called blocks). An Apache project, graduating in May 2012, having been originally donated by Yahoo in August 2011. Java based, no commercial support available, but is mature and has been adopted by a number of companies (including LinkedIn and most famously Facebook who scaled it to process a trillion edges), and has a number of active developers.
Apache Gobblin (Incubating)Java based framework for ingesting data into Hadoop. Ingestion jobs are defined through job configuration files, and are made up of a number of stages - a Source identifies work to be done and generates Work Units which are then processed by Tasks, with Tasks consisting of an Extractor (reads the records to be processed), one or more Converters (a 1:N transformation of records), a Quality Checker (covers both record and file checks), a Fork Operator (allows data to be written to multiple targets) and a Writer (writes out completed records), with the output of a completed task being committed by a Publisher. Gobblin ships with a number of standard components, including support for a range of sources and targets, as well as supporting custom implementations of any stage. Jobs can be run using a number of frameworks, including MapReduce (with all tasks running as mapper only jobs), YARN, and as Java threads within a single JVM, with some modes also supporting an internal scheduler and job management engine. Supports job locks (to ensure multiple instances of the same job don't run at the same time), job history metadata (via a job execution history store that supports a REST API that can be used to monitor jobs), exactly-once processing support (via Publisher commits), failure handling (retrying both within and across jobs), capture and forwarding of execution and data quality metrics, post processing of data (e.g. to remove duplicates or generate aggregations), partitioned writers, job configuration file templates, Hive table registration, high availability, data retention management (automatically deleting old data according to a number of retention rules), and data purging (Gobblin Compliance). Developed at LinkedIn from late 2013, first announced in November 2014 and open sourced shortly afterwards, before being donated to the Apache Foundation in February 2017, and with stated deployments at a number of large organisations.
Apache HadoopA distributed storage and compute platform consisting of a distributed filesystem (HDFS) and a cluster workload and resource management layer (YARN), along with MapReduce, a solution built on HDFS and YARN for massive scale parallel processing of data. Has an extensive ecosystem of compatible technologies. An Apache Open Source project, started in January 2006 as a Lucene sub-project, becoming a top level project in January 2008, with a 1.0 release in December 2011 (containing HDFS and MapReduce), and a 2.2 release (the first 2.x GA release) in October 2013 (adding YARN). Work is currently underway to split out the data storage layer of HDFS (the HDDS sub-project) and to implement an object store on top of this that can co-exist with HDFS (the Ozone sub-project). Very active, with a deep and broad range of contributors, and backing from multiple commercial vendors.
Apache Hadoop >  HDDSA common distributed and resilient block storage layer that will eventually underpin HDFS and Ozone, delivering increased scalability. Implemented as a Storage Container Manager (SCM) service (that performs block management) and DataNode services (inherited from HDFS that run on storage nodes and manage block IO). Blocks are arranged into containers (with the replication strategy defined at the container level). Currently under active development as part of the development of Ozone. Previously known as HDSL (Hadoop Distributed Storage Layer)
Apache Hadoop >  HDFSA highly resilient distributed cluster file system proven at extreme scale. Consists of a single NameNode service (that's responsible for all metadata management, including the filesystem namespace and block management) plus DataNode services that run on all storage nodes (that manage block IO). Supports NameNode high availability, metadata resilience (via a transaction log), data resilience (via block replication or erasure coding), user authentication, extended ACLs, snapshots, quotas, central caching, a REST API, an NFS gateway, rolling upgrades, rack awareness, transparent encryption, NameNode federation (support for multiple independent NameNodes on the same cluster serving different namespaces) and support for heterogeneous storage. Part of the original Hadoop code base, becoming an Apache Hadoop sub-project in July 2009. Currently being updated to run over the new HDDS (Hadoop Distributed Data Storage) layer, moving block management from the NameNode to a new Storage Container Manager to increase scalability.
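A small sketch using the third-party hdfs Python package, which talks to the NameNode's WebHDFS REST endpoint mentioned above; the host, user and paths are placeholders (9870 is the default NameNode HTTP port in Hadoop 3.x).

from hdfs import InsecureClient

# WebHDFS client; host, port, user and paths below are placeholders
client = InsecureClient("http://namenode-host:9870", user="etl")

client.makedirs("/data/raw/2019-01-01")
client.write("/data/raw/2019-01-01/events.csv",
             data="id,action\n1,login\n", overwrite=True)

# Read the file back and list the parent directory
with client.read("/data/raw/2019-01-01/events.csv") as reader:
    print(reader.read())

print(client.list("/data/raw"))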
Apache Hadoop >  MapReduceA data transformation and aggregation technology proven at extreme scale that works on key value pairs and consists of three transformation stages - map (a general transformation of the input key value pairs), shuffle (brings all pairs with the same key together) and reduce (an aggregation of all pairs with the same key). Part of the original Hadoop code base, becoming an Apache Hadoop sub-project in July 2009.
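A minimal word-count sketch of the map and reduce stages; a script like this would typically be submitted via the Hadoop Streaming jar, which runs it over stdin/stdout on the cluster (the "map"/"reduce" argument convention is an assumption of this example, not part of Hadoop itself).

#!/usr/bin/env python
# Word count for Hadoop Streaming: the same script acts as the map stage
# ("map" argument) and the reduce stage ("reduce" argument), reading lines
# on stdin and writing tab-separated key/value pairs on stdout.
import sys

def mapper():
    # map: emit (word, 1) for every word in the input
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # reduce: input arrives grouped (sorted) by key after the shuffle,
    # so counts for a word can be summed as soon as the key changes
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()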
Apache Hadoop >  OzoneAn object store built on top of the new Hadoop HDDS block storage layer that can co-exist with HDFS. Implemented as an Ozone Manager (OM) service that manages the object store namespace, utilising the HDDS Storage Container Manager for block management. Objects are arranged into buckets, which themselves are arranged into volumes. Supports consistent writes, an RPC API, an Amazon S3 compatible REST API, a CLI, a load generation tool (Freon, previously Corona), and an Hadoop Compatible File System (OzoneFS), with a stated plan for mountable LUN storage (Quadra). Originally announced in October 2014, re-invigorated under the Hortonworks Open Hybrid Architecture Initiative in September 2018, and currently under active development with a suggested release as part of HDP 3.2.
Apache Hadoop >  YARNResource management and job scheduling & monitoring for the Hadoop ecosystem. Includes support for capacity guarantees amongst other scheduling options, long running services, GPU and FPGA scheduling and isolation and experimental support for launching applications within docker containers. Added as an Apache Hadoop sub-project as part of Hadoop 2.x (with a GA release as part of 2.2 in October 2013) having been started in January 2008.
Apache HamaA general purpose BSP (Bulk Synchronous Parallel) processing engine inspired by Pregel and DistBelief that runs over Mesos or YARN. Supports BSP, graph computing and machine learning programming models, as well as Apache MRQL. An Apache project, donated in 2008, and graduated in 2012. Java based, with no commercial support available, limited case studies for its use and limited active developers, with the last release being in June 2015.
Apache HAWQA port of the Greenplum MPP database (which itself is based on PostgreSQL) to run over YARN and HDFS. Supports all the features of Greenplum (ACID transactions, broad SQL support and in database language and analytics support, including support for Apache MADLib), integration with Apache Ambari, an Input Format for MapReduce to read HAWQ tables, and both row and Parquet (column) based storage of data managed by HAWQ. Also supports queries over data not managed by HAWQ via external tables, with a Java based framework (PXF) for accessing external data, and out of the box support for accessing data in HDFS (text, Avro, JSON), Hive and HBase, with a number of open source connectors also available. Fault tolerant and horizontally scalable, with the ability to scale up or down on the fly. Originally created as Pivotal HAWQ based on a fork of Greenplum in 2011, with an initial 1.0 release as part of Pivotal HD in July 2013. Open sourced and donated to the Apache Foundation in September 2015, becoming Apache HAWQ, with the first open source release (2.0) in October 2016, and graduating in August 2018. Development led by Pivotal, who also distribute binaries as Pivotal HDB and provide training, consultancy and support. Pivotal HDB is also available as Hortonworks HDB, with support and consultancy provided by Hortonworks.
Apache HBaseNoSQL wide-column datastore based on Google BigTable. Data for an HBase table is distributed across regions, with each region made up of a store per column family (with stores either hosted in memory or on disk), with regions served and managed by region servers, which in turn are monitored and managed by master servers (which are also responsible for metadata changes and can run in a multi-master configuration), with the architecture supporting horizontal scalability and high availability. Supports strongly consistent reads and writes (with all reads and writes going through a single region server), with the option to perform non-consistent reads from data replicated between multiple region servers, giving more consistent performance during region server failure. Supports get, put (insert/update), scan (iterating over a set of rows) and delete operations, the option to bulk load via MapReduce and Spark, and the option to execute custom code within the HBase cluster via co-processors (observer co-processors execute either before or after specific events, endpoint co-processors allow execution of batch analytics). Also supports medium sized binary objects (up to 10Mb), versioning and fine grained RBAC security controls, including visibility expressions at the cell level for authorising end user access. Runs on Hadoop and HDFS, and is heavily integrated with the Hadoop ecosystem. Supports a CLI plus Java, Thrift and REST APIs, along with MapReduce and Spark integration as both a source and sink. An Apache project, first released as part of Hadoop 0.15 in October 2007 before graduating as a top level project in May 2010. Java based, with commercial support available as part of most Hadoop distributions.
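A minimal sketch using the third-party happybase client, which goes via the HBase Thrift gateway; the gateway host, table, column family and row key design are illustrative.

import happybase

# Connects via the HBase Thrift gateway; host and names are placeholders
connection = happybase.Connection("thrift-gateway-host")

# One column family ("d"), keeping up to three versions of each cell
connection.create_table("clicks", {"d": dict(max_versions=3)})

table = connection.table("clicks")
table.put(b"user1|20190101", {b"d:url": b"/home", b"d:referrer": b"google"})

# Point get by row key
print(table.row(b"user1|20190101"))

# Scan a contiguous range of row keys for one user
for key, data in table.scan(row_start=b"user1|", row_stop=b"user1~"):
    print(key, data)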
Apache HiveTechnology that supports the exposure of data in Hadoop as structured tables and the execution of analytical SQL queries over these. Consists of a number of distinct components (that we treat as sub-projects) including Hive Metastore (stores the definitions of the structured tables), Hive Server (supports the execution of analytical SQL queries as MapReduce, Spark or Tez jobs) and HCatalog (allows MapReduce and Pig jobs to read and write Hive tables). First released by Facebook as an Hadoop contrib module in September 2008, becoming an Hadoop sub-project in November 2008, and a top level Apache project in September 2010, following a first official stable release (0.3) in April 2009. Java based, under active development from a number of large commercial sponsors, with commercial support available as part of most Hadoop distributions.
Apache Hive >  HCatalogLibraries for MapReduce and Pig to read and write data to and from Hive tables, albeit with some limitations. Also supports a CLI for querying and updating the Hive Metastore, however this doesn't support the full range of Hive DDL commands. Includes WebHCat, a REST API over the HCatalog CLI that also supports the execution of MapReduce, Pig, Hive and Sqoop jobs. Donated to the Apache foundation by Yahoo in March 2011, had WebHCat folded in in July 2012, graduating as a top level project in February 2013, before almost immediately being folded into Hive in March 2013 as part of the Hive 0.11 release. Has seen limited development since this time.
Apache Hive >  Hive MetastoreA metadata service that allows structured tables to be defined over files in HDFS (and also HBase or Accumulo), providing an API that allows the metadata to be queried and updated by other tools including Impala, Spark SQL or RecordService. Supports partitioned and clustered tables, as well as complex field types such as arrays, maps and structs. Backed by a relational database (MySQL, Postgres or Oracle). Part of the original Hive code base.
Apache Hive >  Hive ServerSupports the execution of SQL queries over data in HDFS based on tables defined in the Hive Metastore, as well as DDL to query and update the Hive Metastore. Focus is on analytical (OLAP) use cases, with some support for batch updates to data. Originally executed queries as MapReduce jobs, but significant investment has seen support added for executing queries as Spark and Tez jobs, with work underway to support sub second query times using Tez (Hive LLAP). Recent changes have also seen it achieve significant SQL compliance, with support for SQL:2011 analytical functions on-going. Accepts queries over an API with JDBC and ODBC drivers available, and includes Beeline, a command line JDBC client. Technically referred to as Hive Server 2, and was introduced in Hive 0.11 as a replacement for the original Hive Server to address a number of concurrency and security issues.
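A minimal sketch using the PyHive library to submit a query to HiveServer2 over its Thrift interface; the host, credentials, database and table are placeholders.

from pyhive import hive

# Connect to HiveServer2; host, user and database are placeholders
conn = hive.Connection(host="hiveserver2-host", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# An analytical (OLAP-style) query over a hypothetical partitioned table
cursor.execute("""
    SELECT country, count(*) AS visits
    FROM web_logs
    WHERE dt = '2019-01-01'
    GROUP BY country
    ORDER BY visits DESC
    LIMIT 10
""")
for country, visits in cursor.fetchall():
    print(country, visits)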
Apache IgniteA distributed in-memory data fabric/grid with the ability to persist data to disk, supporting a number of use cases including a key value store (with SQL support), real time stream/event processing engine, arbitrary compute, long running service management, an in-memory HDFS compatible file system for acceleration of Hadoop jobs, an in-memory machine learning grid and in-memory shared Spark RDDs and Data Frames. An Apache project, graduating in September 2015, having been originally donated by GridGain from their In-Memory Data Fabric product launched in 2007. Java based, with development led by GridGain who supply commercial support for Ignite, as well as a range of Ignite based products and services including GridGain Professional (with ongoing Q&A and bug fixes before they're included in Ignite), GridGain Enterprise and Ultimate (which include extra features such as a management GUI, enterprise security, rolling upgrades, backup and recovery) and GridGain Cloud (beta).
Apache ImpalaAn MPP query engine that supports the execution of SQL queries over data in HDFS, HBase, Kudu and S3 based on tables defined in the Hive Metastore. Focus is on analytical (OLAP) use cases, and more specifically on low latency interactive queries (rather than long running batch queries), with some support for batch inserts of data. Supports DDL statements for updating the Hive Metastore, uses (broadly) the same SQL syntax as Hive (including UDFs and a range of aggregate and analytical functions), as well as the same JDBC / ODBC drivers, and is therefore compatible with any Hive query tool (such as Beeline). Supports querying over data in Parquet, Text, Avro, RCFile and SequenceFile formats, with the ability to write Parquet and Text data. Supports Kerberos and LDAP authentication, and integration with Apache Sentry for authorisation. Includes a shell (Impala Shell) that supports some shell only commands for tuning performance and diagnosing problems. Created by Cloudera, started in May 2011 and first announced in October 2012, with a 1.0 GA release in May 2013. Donated to the Apache Foundation in December 2015, graduating in November 2017, and is still under active development.
Apache KafkaTechnology for buffering and storing real-time streams of data between producers and consumers, with a focus on high throughput at low latency. Based on a distributed, horizontally scalable architecture, with messages organised into topics which are partitioned and replicated across nodes (called brokers by Kafka) to provide resilience and written to disk to provide persistence. Topics may have multiple producers and consumers, with the ability to do fault tolerant reads and to load balance across consumers (consumer groups). Records consist of a key, value and timestamp, with the ability to compact topics to remove updates and deletes by key. Supports rolling upgrades, a full security model (including secure and authenticated connections and ACLs for controlling access to topics), the ability to set quotas (for data produced or consumed), Yammer metrics for both servers and clients, and tools to mirror data to a second cluster (MirrorMaker) and re-distribute partitions across nodes (for example when adding new nodes). Comes with a Java client, but clients for a wide range of languages are also available. Has two sub-projects (Kafka Connect and Kafka Streams) that are bundled with the main product. Originally developed at LinkedIn, being open sourced in January 2011, before being donated to the Apache Foundation in July 2011 and graduating in October 2012. Development is primarily led by Confluent (which was founded by the team that built Kafka at LinkedIn), who have a number of open source and commercial offerings based around Kafka. Commercial support is also available from most Hadoop vendors.
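A minimal producer/consumer sketch using the third-party kafka-python client; the broker address, topic and consumer group are placeholders.

from kafka import KafkaProducer, KafkaConsumer

brokers = ["broker1:9092"]                  # placeholder broker address
topic = "page-views"                        # placeholder topic name

# Producer: records are keyed, so all events for a user land on one partition
producer = KafkaProducer(bootstrap_servers=brokers)
producer.send(topic, key=b"user-42", value=b'{"url": "/home"}')
producer.flush()

# Consumer: instances sharing a group_id split the topic's partitions between them
consumer = KafkaConsumer(topic,
                         bootstrap_servers=brokers,
                         group_id="analytics-service",
                         auto_offset_reset="earliest")
for record in consumer:                     # blocks, polling for new records
    print(record.partition, record.offset, record.key, record.value)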
Apache Kafka >  Kafka ConnectFramework for building scalable and reliable integrations between Kafka and other technologies, either for importing or exporting data. Part of the core Apache Kafka open source technology, connectors are available for a wide range of systems, including Hadoop, relational, NoSQL and analytical databases, search technologies and message queues amongst others, with an API for developing custom connectors. Supports lightweight transformations, and runs separately to Kafka, in either a stand-alone or distributed cluster mode, with a REST API for managing connectors. Introduced in Kafka 0.9, previously known as Copycat
Apache Kafka >  Kafka StreamsA stream processing technology that's tightly integrated with Apache Kafka, consuming and publishing events from and to Kafka topics (and potentially writing output to external systems). Based on an event-at-a-time model (i.e. not micro batch), with support for stateful processing, windowing, aggregations, joining and re-processing data. Supports a low level Processor API, as well as a high level DSL that provides both stream and table abstractions (where tables present the latest record for each key). Executes as a stand-alone process, with support for parallel processing across threads within a single instance and across multiple instances, with the ability to dynamically scale the number of instances. Introduced in Kafka 0.10.
Apache KnoxA stateless gateway for the Apache Hadoop ecosystem that provides perimeter security. Includes support for user authentication (via LDAP, Active Directory and a number of single sign on solutions), access authorisation on a per service basis, transitions to Kerberos authentication, reverse proxying, auditing, extension points for supporting new services, and out of the box support for a number of Hadoop technology end points. An Apache project, started by Hortonworks in February 2013, donated to the Apache Foundation two months later in April, before graduating in February 2014. Hit v1.0 in February 2018, and still under active development.
Apache KuduColumnar storage technology for tables of structured data, supporting low latency reads, updates and deletes by primary key, as well as analytical column/table scans. Provides Java, C++ and Python APIs, is queryable via Impala and Spark SQL, and provides Spark, Flume and MapReduce connectors. Supports cluster deployments (including co-existence with Hadoop), with tables partitioned into tablets (configurable on a per table basis), with tablets then replicated and distributed across the cluster, using the Raft Consensus Algorithm for consistency. Also supports variable column encoding (including bit shuffle, run length, dictionary and prefix encoding) and compression. Includes a web UI for reporting operational information, and metrics available from the command line, via HTTP or via a log file. Started in November 2012, with an initial beta release in September 2015. Donated to the Apache Foundation in December 2015, graduating in July 2016, with a 1.0 release in September 2016. Implemented in C++.
Apache KylinAn open source distributed analytic engine built to support sub-second OLAP / star schema style queries using SQL on extremely large datasets on Hadoop. Data is read from a star schema data model in Hive to build a data cube of pre-calculated metrics by dimensions using MapReduce or Spark, with the results stored in a key-value datastore (HBase). SQL queries can be submitted to the query engine, with results returned with sub-second latency if the required data exists in an HBase cube, otherwise the query is optionally routed back to its original source on Hadoop. Supports compression of large datasets by dictionary encoding cube data using a trie data structure, combination pruning and aggregation grouping of dimensions for efficient data storage, and uses approximation query capability (HyperLogLog) to estimate distinct items and TopN to answer top-k queries. Row keys are composed of dimension encoded values, and HBase's fuzzy row filtering is performed directly on the storage nodes to implement low latency lookups. Simple additive and aggregation operations (such as sum and count) are also performed on the storage nodes using HBase coprocessors to provide efficient computational parallelism and minimise network latency. Uses Apache Calcite for SQL parsing and optimisation, and comes with an ODBC driver, a JDBC driver and a REST API to integrate with third party business intelligence tools such as Tableau, Microsoft Excel and PowerBI. Includes a web interface and REST API for model building and cube design (with support for hierarchy, joint and derived dimensions), job management (full, incremental and streaming builds) and monitoring and permission management (providing security at a project or cube level). New beta features include building cubes from Kafka streaming data and cube building using Spark instead of MapReduce. Originally developed at eBay, donated to the Apache Foundation in November 2014, graduating in November 2015, with a 1.0 release in September 2015, and still under active development. Commercial support available from Kyligence, who distribute their own product based on Kylin replacing HBase with a custom columnar storage engine (with cell level access control and integration with LDAP), along with a web based BI tool for self service analysis and a dashboard for Kylin cluster management.
Apache LivyA service that allows Spark jobs (pre-compiled JARs) or code snippets (Scala or Python) to be executed by remote systems over a REST API or via clients for Java, Scala and Python. Supports re-use of Spark Contexts (and caching and sharing of RDDs across jobs and clients), multiple concurrent clients, secure authenticated communications and batch job submissions. Started in November 2015 based on code from Hue, with a formal announcement and first release in June 2016 based on development led by Cloudera, Hortonworks and Microsoft, before being donated to the Apache Foundation in June 2017. Hasn't yet graduated, but under active development, and used by tools such as Hue and Zeppelin.
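A sketch of the REST workflow using the requests library: create a PySpark session, wait for it to become idle, submit a code snippet, poll for the result, then delete the session. The Livy host is a placeholder, and 8998 is the default Livy port.

import time
import requests

livy = "http://livy-host:8998"              # placeholder Livy server address

# Start an interactive PySpark session
session = requests.post(livy + "/sessions", json={"kind": "pyspark"}).json()
session_url = "%s/sessions/%d" % (livy, session["id"])

# Wait for the session (and its Spark context) to become ready
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(5)

# Submit a code snippet and poll for its result
stmt = requests.post(session_url + "/statements",
                     json={"code": "sc.parallelize(range(100)).sum()"}).json()
stmt_url = "%s/statements/%d" % (session_url, stmt["id"])
while True:
    result = requests.get(stmt_url).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)

requests.delete(session_url)                # tear the session down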
Apache MahoutMachine learning technology comprising a Scala based linear algebra engine (codenamed Samsara) with an R-like DSL/API that runs over Spark (with experimental support for H2O and Flink), an optimiser, a wide variety of pre-made algorithms, and a Scala REPL (based on Spark Shell) for interactive execution. Can be embedded and integrated within larger applications, for example alongside MLlib when running over Spark. Also includes some original, now deprecated, algorithms implemented over MapReduce. Created in January 2008 as a Lucene sub-project, becoming a top level Apache project in April 2010. The original MapReduce algorithms were deprecated and Samsara introduced as part of v0.10 in April 2015.
Apache MesosOpen source cluster manager for providing efficient resource utilisation across a cluster of servers through resource sharing and isolation. Allows a cluster of servers to be shared across diverse cluster computing frameworks so that different distributed workloads such as container orchestration, machine learning, analytics and stateful big data technologies can be run without interfering with each other. Has the ability to dynamically allocate resources across the servers as needed and delegates control over scheduling to the frameworks through an abstraction layer called a resource offer to support a wide array of computing frameworks. Resource isolation is implemented using a universal containeriser, supporting numerous containers including native Mesos containers and Docker containers. Fault tolerance of the Mesos instance in control of the cluster is implemented using ZooKeeper. Started as a research project in the UC Berkeley RAD Lab, open sourced in 2011, with a v1.0 release in July 2016, which included the 'unified containeriser' and GPU-based scheduling. Written in C++, uses Google Protocol Buffers for messaging and serialisation to allow frameworks to be written in a variety of languages including C++, Java, Python, Go, Haskell, and Scala. Under active development, open sourced under the Apache 2.0 license, hosted on the Apache git repository and mirrored on GitHub. Software startup Mesosphere sells the Datacenter Operating System, a distributed operating system based on Apache Mesos.
Apache MyriadTool that allows YARN applications to run over Apache Mesos, allowing them to co-exist and share cluster resources. Consists of Myriad Executor, a Mesos managed task that in turn manages a YARN Node Manager, and Myriad Scheduler, a plugin for the YARN Resource Manager that delegates resource negotiation to Mesos (and launches Myriad Executor processes on required nodes via Mesos). Supports fixed resource allocation to YARN Node Managers, as well as fine-grained scaling where resources are dynamically requested from Mesos. Includes a web based user interface and REST API that includes support for scaling YARN resources when using fixed resource allocation. Originally created by eBay, MapR and Mesosphere and donated to the Apache Foundation in March 2015. Has not yet graduated or reached a 1.0 release, with development activity seeming very quiet since October 2016.
Apache NiFiGeneral purpose technology for the movement of data between systems, including the ingestion of data into an analytical platform. Based on a directed acyclic graph of Processors and Connections, with the unit of work being a FlowFile (a blob of data plus a set of key/value pair attributes). Supports guaranteed delivery of FlowFiles, with NiFi resiliently storing state (by default to a local write ahead log) and data blobs (by default a set of local partitions on disk), with all FlowFile transformations executed via a thread pool within the NiFi instance (with the option to deploy multiple NiFi instances as a cluster). All flows are configured in a graphical user interface, which is also used for management and operations (starting/stopping individual Processors and viewing real time statuses, statistics and other information). Also supports some record level operations (via RecordReaders and RecordSetWriters), data provenance (reporting on the processing events and lineage of individual FlowFiles), scheduling of Processor execution (based on periodic execution timers or cron specifications), multi-threaded Processor execution, configuration of Processor batch sizes (to enable low latency or high throughput), prioritised queues within Connections (allowing FlowFiles to be processed based on their age or a priority attribute as an alternative to FIFO), back pressure (based on counts or data volume against individual Connections) and pressure release (automatic discarding of FlowFiles based on their age), the ability to stream data to and from other NiFi instances and other streaming technologies, the ability to import and export flows as XML (flow templates), an expression language for setting Processor configuration and populating FlowFile attributes, Controller Services to provide shared services to processors (e.g. access to credentials, shared state), Reporting Tasks to output status and statistics information and a user security model. Extensible through the addition of custom Processors, Controller Services, Reporting Tasks and Prioritizers, and integrates with Apache Ranger and Apache Ambari. Originally developed at the NSA as Niagara Files, before being donated to the Apache Foundation in November 2014, graduating in July 2015. Java based, with development led by Hortonworks after their acquisition of Onyara (which was set up by original NiFi developers to provide commercial support and services).
Apache NiFi >  MiNiFiLightweight headless version of NiFi used to collect and process data at its source, before forwarding it on for centralised processing. Supports all key NiFi functionality including all NiFi processors, guaranteed delivery, data buffering (including back pressure and pressure release) and prioritised queuing, however flows are specified in configuration files, status information and statistics are only available via Reporting Tasks or via a CLI, and provenance can only be viewed by exporting events via Reporting Tasks to log files or a full NiFi instance. Supports warm re-deployments, automatically restarting to load a new configuration written to disk or pushed or pulled over HTTP. Available as a Java or Native C++ executable. Started in March 2016, with a 0.1 release in December 2016.
Apache NiFi >  NiFi RegistryA solution for the configuration management of NiFi flows. Integrates with NiFi to allow users to store, retrieve and upgrade flows, keeping a full history of all changes to a flow committed to the registry, with flows stored and organised by buckets. Supports local users and groups, or authentication via certificates, LDAP or Kerberos, with access control policies allowing read, write and delete permissions to be specified for buckets, users and groups. Has a Web based UI and a REST interface for managing buckets, local users and groups, viewing flow history and for managing access control. First released in January 2018.
Apache OozieTechnology for managing workflows of jobs on Hadoop clusters. Primary concepts include workflows (a sequence of jobs modelled as a directed acyclic graph), coordinators (which schedule the execution of workflows based on time or the presence of data) and bundles (collections of coordinators), with all configuration specified in XML. Supports a range of technologies, including MapReduce, Pig, Hive, Sqoop, Spark, Java executables and shell scripts. Includes a server component, a metadata database for holding definitions and state (with support for a range of database technologies), a command line interface and a read only web interface for viewing the status of jobs. Also supports the parameterisation of workflows, the modelling of datasets (and the use of these to manage dependencies between workflows within coordinators), automatic retry and failure handling, and the ability to send job status notifications via HTTP or JMS. Open sourced by Yahoo in June 2010. Donated to the Apache Foundation in July 2011, graduating in August 2012. Commercial support available as part of most Hadoop distributions.
Apache ORCSelf-describing, type-aware, columnar file format to enable efficient querying and storage of data on Hadoop. Provides built-in storage indexes, column statistics and bloom filters to allow execution engines to implement predicate and projection push-down, partition pruning and cost based optimisation for low latency reads. Uses multi-version concurrency control to support ACID transactions and allow Hive to implement bulk insert, update, delete and streaming ingest (micro batch) use cases. Implements type-aware encoding for efficient compression (run-length for integer and dictionary for string). Schema definition is stored alongside the data and supports all primitive data types and complex nested data structures. Uses protocol buffers to store metadata. Comes with a Java library for reading and writing the file format and includes a MapReduce compatible API, a C++ library for reading the file format (donated by Vertica) and a set of Java and C++ tools for inspecting and benchmarking ORC files. Created by Hortonworks in January 2013 as part of the initiative to massively speed up Hive and improve the storage efficiency of data stored in Hadoop, split off from Apache Hive to become a separate top level Apache project in April 2015 with a 1.0 release in January 2016.
Apache ParquetData serialisation framework that supports a columnar storage format to enable efficient querying of data. Built using Apache Thrift, and supports complex nested data structures, using techniques from the Google Dremel paper. Consists of three sub-projects, the specification and Thrift definitions (Parquet Format), the Java and Hadoop libraries (Parquet MR) and the C++ implementation (Parquet CPP). Created as a joint effort between Twitter and Cloudera based on work started as part of Avro Trevni, with an initial v1.0 release in July 2013. Donated to the Apache Foundation in May 2014, graduating in April 2015. Has seen significant adoption in the Hadoop ecosystem.
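The sketch below shows the columnar format in use from Python via the pyarrow bindings (one of several implementations alongside the Java Parquet MR libraries); the file and column names are illustrative:

```python
# Minimal sketch of writing and reading a Parquet file with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small columnar table and persist it as Parquet
table = pa.Table.from_pydict({
    "user_id": [1, 2, 3],
    "country": ["GB", "US", "DE"],
})
pq.write_table(table, "users.parquet", compression="snappy")

# Read it back, projecting only the columns needed (columnar formats make
# this cheap because unneeded columns are never read from disk)
subset = pq.read_table("users.parquet", columns=["country"])
print(subset.to_pydict())
```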
Apache PhoenixA SQL query engine over Apache HBase tables that supports a subset of SQL 92 (including joins), and comes with a JDBC driver. Supports a range of features including ACID transactions (via Apache Tephra), user defined functions, secondary indexes, atomic upserts, views, multi tenancy tables (where each user or tenant can only see their data) and dynamic columns (which are only specified at query time). Supports a range of SQL DDL commands, creating and modifying underlying HBase tables as required, or can run over existing HBase tables in a read only mode. Comes with connectors to allow Spark, Hive, Pig, Flume and MapReduce to read and write Phoenix tables, and a number of utilities, including a bulk loader and a command line SQL tool. Open sourced by SalesForce in January 2013 at v1.0, donated to the Apache foundation in December 2013, before graduating in May 2014. Commercial support available through Hortonworks as part of HDP, with Cloudera making it available via Cloudera Labs without support. Active project with a range of contributors, including many from SalesForce and Hortonworks.
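A hedged sketch of querying Phoenix from Python is shown below; it assumes the separate Phoenix Query Server is running and uses the community phoenixdb driver rather than the JDBC driver described above, with the server URL and table purely illustrative:

```python
# Hedged sketch: SQL over HBase via Phoenix, using the phoenixdb driver which
# talks to the Phoenix Query Server (assumed to be on its default port 8765).
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# DDL and atomic upserts are issued as ordinary SQL statements
cursor.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT PRIMARY KEY, msg VARCHAR)")
cursor.execute("UPSERT INTO events VALUES (1, 'hello from HBase via Phoenix')")

cursor.execute("SELECT id, msg FROM events")
print(cursor.fetchall())
```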
Apache PigTechnology for running analytical and data processing jobs against data in Hadoop. Jobs are written in Pig Latin (a custom procedural language that can be extended using user defined functions in a range of languages), which is then translated into MapReduce or Tez (with Spark in preview) for execution. Supports both a batch mode for running pre-defined scripts and an interactive mode, and connectors for reading and writing to HBase and Accumulo as well as HDFS. Originally developed at Yahoo in 2006 before being donated to the Apache Foundation in October 2007. Graduated as an Hadoop sub-project in October 2008, before becoming a top level project in September 2010. Although it has not had a v1.0 release, it has been production quality for many years. Commercial support available as part of most Hadoop distributions.
Apache RangerA centralised security framework for managing access to data in Hadoop. Supports integration with LDAP and Active Directory for user authentication, a central policy server/store (with a web based administration interface and REST API), and plugins for Hadoop components (including HDFS, Hive, HBase, Storm, Knox, Solr, Kafka, YARN, Atlas and NiFi) to manage authorisation of user access to data. Supports data masking and row level access policies (currently only supported by Hive), the ability to define policies against tags as well as directly against resources (with tags assigned to resources externally, e.g. in Apache Atlas), and the ability to use more complex conditions (e.g. denying access after an expiration date or based on a user's location). Extendable with the ability to add support for new services (Ranger Stacks) and to add custom decision rules (via content enrichers and condition evaluators). Also supports a full audit capability of access requests and decisions, and a key management service for HDFS encryption keys. An Apache project, donated in July 2014 as Argus by Hortonworks following their acquisition of XA Secure, graduating in February 2017. Reached v1.0 in March 2018, and is still under active development with a range of contributors.
Apache SentryA centralised security framework for managing access to data in Hadoop. Supports integration with LDAP and Active Directory for user authentication, a central policy server/store, and plugins for Hadoop components (including Hive, Solr, Impala and HDFS, with support for Kafka and Sqoop2 in preview) to manage authorisation of user access to data, although HDFS support is limited to Hive data only. Also supports row level filtering policies for Solr, and historical support for defining policies in files per service (Sentry Policy Files). Integrates with the Hue security app (to manage permissions) and with Cloudera Navigator (for authorisation audit events). Started in 2012 as Cloudera Access, with an initial 1.0 release in 2013 as Sentry. Donated to the Apache Foundation in August 2013, graduating in March 2016.
Apache Slider (retired)Framework for hosting long running distributed applications on YARN, allowing YARN to manage the resources these applications use. Can handle any application that supports a base set of requirements (including being able to install and run from a tarball), with experimental support for docker packaged applications. Operates as a YARN application master (the Slider AM), an associated command line interface and lightweight agents to manage running components. Supports manual scaling, automatic recovery, rolling upgrades and component placement controls, and includes out of the box configuration for a number of applications including Accumulo, HBase, Kafka, Memcached, Solr, Storm and Tomcat. Originally donated to the Apache Foundation in April 2014 based on the Hortonworks Hoya (HBase on YARN) project, and subsequently consumed the DataTorrent Koya (Kafka on YARN) project. Retired before graduating in May 2018 following the plan to add support for long running services directly into YARN under YARN-4692.
Apache SolrA search server built on Apache Lucene with a REST-like API for loading and searching data. Supports a distributed deployment (SolrCloud) that can run over HDFS on an Hadoop cluster. Includes an administration web interface, an extensible plugin architecture, support for schemaless indexing, faceted, grouped and clustered results, hit highlighting, geo-spatial and graph searches, near real time indexing and searching, (experimental) streaming expressions for parallel compute (including support for MapReduce and SQL) and broad authentication and security capabilities. A sub-project of the Apache Lucene project, originally donated to the Apache foundation by CNET Networks in January 2006, graduating as a top level project in January 2007, before merging with the Lucene project in March 2010. Java based, with commercial support available as part of most Hadoop distributions (although this is bundled as Cloudera Search with CDH and HDP Search with HDP), as well as from Lucidworks.
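The following is a minimal sketch of loading and querying documents over the REST-like API described above, assuming a local Solr instance and an illustrative core name and field:

```python
# Minimal sketch of indexing and searching documents over Solr's HTTP API;
# the core name ("catalogue") and field ("title_t") are illustrative.
import requests

SOLR = "http://localhost:8983/solr/catalogue"  # hypothetical core

# Index (or update) documents; commit immediately for demo purposes
docs = [{"id": "1", "title_t": "Apache Solr"},
        {"id": "2", "title_t": "Apache Lucene"}]
requests.post(f"{SOLR}/update?commit=true", json=docs)

# Query with hit highlighting on the title field
params = {"q": "title_t:solr", "hl": "true", "hl.fl": "title_t", "wt": "json"}
results = requests.get(f"{SOLR}/select", params=params).json()
print(results["response"]["numFound"])
```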
Apache SparkA high performance general purpose distributed data processing engine based on directed acyclic graphs that primarily runs in memory, but can spill to disk if required, and which supports processing applications written in Java, Scala, Python and R (SparkR). Includes a number of sub-projects that support more specialised analytics including Spark SQL (batch and streaming analytics using declarative logic over structured data), Spark Streaming (micro-batch stream processing), MLlib (machine learning) and GraphX (graph analytics). Requires a cluster manager (YARN, EC2, Kubernetes and Mesos are supported as well as standalone clusters) and can access data in a wide range of technologies (including HDFS, other Hadoop data sources, relational databases and NoSQL databases). An Apache project, originally started at UC Berkeley in 2009, open sourced in 2010, and donated to the Apache foundation in June 2013, graduating in February 2014. v1.0 was released in May 2014, with a v2.0 release in July 2016. JVM based (written primarily in Scala), with development led by Databricks (who sell a Spark hosted service), and with commercial support available as part of most Hadoop distributions.
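The sketch below illustrates the DAG-based execution model from PySpark: transformations are composed lazily and only run when an action is called; the input path is illustrative:

```python
# Minimal PySpark sketch of the DAG-based model: transformations are lazily
# composed and only executed when an action is called.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Build a DAG of transformations over a (hypothetical) text file
lines = spark.sparkContext.textFile("hdfs:///data/access.log")
errors = (lines
          .filter(lambda line: "ERROR" in line)
          .map(lambda line: (line.split(" ")[0], 1))
          .reduceByKey(lambda a, b: a + b))

# The action triggers execution, spilling to disk only if memory is exhausted
print(errors.take(10))
spark.stop()
```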
Apache Spark >  GraphXSpark library for processing graphs and running graph algorithms, based on a graph model that supports directional edges with properties on both vertices and edges. Graphs are constructed from a pair of collections representing the edges and vertices, either directly from data on disk using builders, or prepared using other Spark functionality, with the ability to also view the graph as a set of triples. Supports a range of graph operations, as well as an optimised variant of the Pregel API, and a set of out of the box algorithms (including PageRank, connected components and triangle count). First introduced in Spark 0.9, with a production release as part of Spark 1.2, however has seen almost no new functionality since then.
Apache Spark >  MLlibSpark library for running Machine Learning algorithms. Supports a range of algorithms (including classifications, regressions, decision trees, recommendations, clustering and topic modelling), including iterative algorithms. As of Spark 2.0 utilises a DataFrame (Spark SQL) based API, with the original RDD based API now in maintenance only. First introduced in Spark 0.8 after being collaboratively developed with the UC Berkeley MLbase project, and still under active development.
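A minimal sketch of the DataFrame-based MLlib API is shown below, assembling features and fitting a logistic regression within a Pipeline; the column names and toy data are illustrative:

```python
# Hedged sketch of the DataFrame-based MLlib API: assemble features, fit a
# logistic regression, and apply the fitted pipeline.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-example").getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.5, 0.0), (1.0, 0.2, 1.0)],
    ["label", "f1", "f2"])

# Combine raw columns into a single feature vector, then fit the model
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[assembler, lr]).fit(df)

model.transform(df).select("label", "prediction").show()
```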
Apache Spark >  Spark SQLSpark library for processing structured data, using either SQL statements or a DataFrame API. Supports querying and writing to local datasets (including JSON, Parquet, Avro, Orc and CSV) as well as external data sources (including Hive and JDBC), including the ability to query across data sources. Includes Catalyst, a cost based optimiser that turns high level operations into low level Spark DAGs for execution. Also includes a Hive compatible Thrift JDBC/ODBC server that's compatible with Beeline and the Hive JDBC and ODBC drivers, and a REPL CLI for interactive queries. Introduced in Spark 1.0 with a production release in Spark 1.3, with substantially improved SQL functionality in Spark 2.0.
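The sketch below shows the same query expressed through both the SQL and DataFrame APIs, assuming an illustrative JSON input file:

```python
# Minimal Spark SQL sketch: register a DataFrame read from JSON as a view and
# query it with SQL; the file path and field names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

people = spark.read.json("hdfs:///data/people.json")
people.createOrReplaceTempView("people")

# Catalyst turns this declarative query into an optimised Spark DAG
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
adults.show()

# The same logic expressed through the DataFrame API
people.filter(people.age >= 18).select("name", "age").orderBy("age").show()
```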
Apache Spark >  Spark StreamingSpark library for continuous stream processing, using a DStream (discretized stream) API. Uses a micro-batch execution model leveraging core Spark to execute the specified logic against each micro-batch (a DStream is a sequence of Spark RDDs), with the ability to also use other Spark batch operations (including Spark SQL and MLlib) against each micro-batch. This model also provides fault tolerance through exactly-once processing semantics. Supports a number of data sources (including HDFS, sockets, Flume, Kafka, Kinesis and messaging buses), as well as functions to maintain state and to execute windowed operations. First introduced in Spark 0.7, with a production release as part of Spark 0.9, however development appears to have largely stopped following the introduction of Structured Streaming in Spark 2.0.
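A minimal DStream sketch is shown below: a word count over a socket source processed in ten second micro-batches, with the host and port illustrative:

```python
# Minimal DStream sketch: each 10-second micro-batch is processed as an RDD.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-example")
ssc = StreamingContext(sc, batchDuration=10)

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```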
Apache Spark >  Structured StreamingExtension to the Spark SQL DataFrame API to allow Spark SQL queries to be executed over streams of data, with the engine continuously updating and maintaining the result as new data arrives. Uses the full Spark SQL engine (including the Catalyst optimiser), and supports end-to-end exactly-once semantics via checkpointing when sources have sequential offsets. Supports aggregations over sliding event-time windows, including support for late data and watermarking. Introduced in Spark 2.0 with a production release in Spark 2.2.
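The following sketch shows a continuously updated word count over a socket source written to the console sink, with the host and port illustrative:

```python
# Minimal Structured Streaming sketch: a continuously maintained word count.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-example").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")   # maintain and re-emit the full result table
         .format("console")
         .start())
query.awaitTermination()
```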
Apache SqoopSpecialist technology for moving bulk data between Hadoop and structured (relational) databases. Command line based, with the ability to import and export data between a range of databases (including mainframe partitioned datasets) and HDFS, Hive, HBase and Accumulo. Executes as MapReduce jobs, supports parallel partitioned unloads, writing to Avro, Sequence File, Parquet and text files, incremental imports and saved jobs that can be shared via a simple metadata store. An Apache project, started in May 2009 as an Hadoop contrib module, migrating to a Cloudera GitHub project in April 2010 (with a v1.0 release shortly after), before being donated to the Apache foundation in June 2011, graduating in March 2012. The last major release (v1.4) was in November 2011, with only minor releases since then. However in January 2012 a significant re-write was announced as part of a proposed v2.0 release to address a number of usability, security and architectural issues. This will introduce a new Sqoop Server and Metadata Repository, supporting both a CLI and web UI, centralising job definitions, database connections and credentials, as well as enabling support for a wider range of connectors including NoSQL databases, Kafka and (S)FTP folders. Java based, with commercial support available as part of most Hadoop distributions.
Apache StormSpecialised distributed stream processing technology based on a single record (not micro batch) model with at least once processing semantics. Processing flows are called topologies based on a directed acyclic graph of spouts (which produce unbounded streams of tuples) and bolts (which process streams and optionally produce output streams). Supports high throughput and low latency use cases, horizontal scalability, fault tolerance (failed workers are automatically restarted and failed over to new nodes if required), back pressure, windowing (with support for sliding and tumbling windows based on time or event counts), stateful bolts and a shared bolt storage cache (that's updatable from the command line). Also includes a higher level micro batch API (Trident) that supports exactly-once processing semantics, fault-tolerant state management and higher level operations including joins, aggregations and groupings, support for SQL (StormSQL) and frameworks and utilities to make defining and deploying topologies easier (Flux). Has both a graphical web based and command line interface, plus a REST API. Primarily written in Clojure, JVM based, but supports multiple languages through the use of Thrift for defining and submitting topologies, and the use of spouts that can interface to other languages using JSON over stdin/stdout. Originally created at BackType, before being open sourced in September 2011 after the acquisition of BackType by Twitter. Donated to the Apache Foundation in September 2013, graduating in September 2014, with a 1.0 release in April 2016. Has multiple reference cases for being deployed at scale, including Twitter, and is still under active development.
Apache Superset (incubating)Web based tool for interactive exploration of OLAP style data, supporting interactive drag and drop querying, composable dashboards and a SQL workspace (SQL Lab). Originally built to query Druid, but now supports a wide range of SQL (and NoSQL) databases, with a lightweight semantic layer allowing control of how data sources are displayed in the UI and which fields can be filtered and aggregated. Users can create Slices (a visualisation of the results of an OLAP style query, with support for a range of visualisations including charts, heat maps, maps, pivot tables, and word clouds amongst others, the ability to configure the query using UI controls, and the ability to configure and customise the visualisation), with multiple slices then composable into a Dashboard (that also supports interactive filters that connect to multiple slices). Also supports a full SQL IDE (SQL Lab) that supports multiple tabs, a full query history, the ability to apply any data visualisation to results and to browse database metadata, and support for long-running queries using a backend query handler and results store. Other features include query results caching, a plug-in and extensibility framework, the ability to brand and skin the web application, and a robust security model for controlling access to slices, dashboards and data, with support for a range of authentication methods including OpenID, LDAP and OAuth. Originally developed by AirBnB in 2015 as Panoramix, before being renamed to Caravel and then to Superset. Donated to the Apache Foundation in June 2017 and still incubating, with development now led by AirBnB and Hortonworks.
Apache TajoDistributed analytical database engine. Supports HDFS, Amazon S3, Google Cloud Storage, OpenStack Swift and local storage, and querying over Postgres, HBase and Hive tables. Provides a standard SQL interface, JDBC driver, and supports partitioning, compression and indexing (currently experimental). An Apache project, donated by Gruter in March 2013, and graduated in April 2014. Java based, with development led by Gruter who also supply commercial support, a Tajo managed service, a data analytics hub (Qrytica) built on Tajo, and a Tajo Data Warehouse appliance.
Apache TezData processing framework based on Directed Acyclic Graphs (DAGs), that runs natively on YARN and was designed to be a replacement for the use of MapReduce within Hadoop analytical tools (primarily Hive and Pig), and therefore offer better performance with similar scalability. Targeted more at application developers than data engineers, it includes a number of performance optimisations (including dynamic DAG re-configuration during execution and re-use of sessions and containers), and comes with a UI for viewing live and historic Tez job executions based on information in the YARN Application Timeline Server. Created by Hortonworks and donated to the Apache Foundation in February 2013 before graduating in July 2014. Still under active development, and now used by Cascading and Flink in addition to Hive and Pig.
Apache WhirrA set of libraries (now moved to the Apache Attic and no longer maintained) for deploying and managing a supported set of services in a cloud environment. Written in Java, with explicit support for a set of standard services (including Hadoop, Cassandra, HBase, Elasticsearch and Solr) configured through property files. Uses jclouds to provision and manage cloud infrastructure, and provides both a CLI and Java API. Originally a set of python scripts maintained as an Hadoop contrib project. Donated to the Apache Foundation in May 2010, graduating in August 2011. Development ceased in September 2012, with the project being moved to the Apache Attic in March 2015.
Apache ZeppelinA web based notebook for interactive data analytics. Supports a wide range of interpreters (including Spark, JDBC SQL, Pig, Elasticsearch, Beam, Flink, Shell, Python amongst many others), a range of output formats (plain text, HTML, mathematical expressions using MathJax and tabular data), a range of visualisations for tabular data (including the ability to add more via a JavaScript NPM based plugin system called Helium), forms for user entry of parameters, and an Angular API to enable dynamic and interactive functionality within notebooks. Has pluggable storage for notebooks (with out of the box support for git, S3, Azure and ZeppelinHub), support for multi-user environments and a security model. Open sourced by NFLabs (now called ZEPL) in 2013 before being donated to the Apache Foundation in December 2014, graduating in May 2016. Under active development with a wide range of contributors, led by ZEPL, who sell Zeppelin as a managed service (previously called ZeppelinHub, now just called Zepl).
Apache ZooKeeperService for managing coordination (e.g. configuration information and synchronisation) of distributed and clustered systems. Based on a hierarchical key-value store, with support for things such as sequential nodes (whose names are automatically assigned a sequence number suffix), ephemeral nodes (which only exist whilst their owner's session exists) and the ability to watch nodes. Guarantees that all writes are serial and ordered (i.e. all clients will see them in the same order), meaning it's more appropriate for low write high read scenarios. Can run as a highly available cluster called an ensemble. Originally an Hadoop sub-project, but graduated to a top level Apache project in January 2011. Java based, still under active development, and used by a range of technologies including Hadoop, Mesos, HBase, Kafka and Solr.
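The sketch below uses the third-party kazoo Python client to illustrate the core primitives described above (persistent, ephemeral and sequential znodes, plus watches); the connection string and paths are illustrative:

```python
# Hedged sketch using the kazoo client (a third-party Python library) to show
# ZooKeeper's core primitives: znodes, ephemeral/sequential nodes and watches.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Plain persistent node holding a small piece of configuration
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_flag=on")

# Ephemeral + sequential node, a common building block for membership and
# leader election: it disappears automatically when this client's session ends
zk.ensure_path("/app/members")
member = zk.create("/app/members/node-", b"", ephemeral=True, sequence=True)
print("registered as", member)

# Watch a node for changes
@zk.DataWatch("/app/config")
def on_change(data, stat):
    print("config changed:", data)

zk.stop()
```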
Azure HDInsightService for dynamically provisioning Hadoop clusters on Azure Virtual Machines based on a set of pre-defined cluster templates for Hadoop, Spark, HBase, Storm, Hive LLAP, Kafka or Machine Learning. Based on the Hortonworks HDP distribution of Hadoop, with support for Azure Blob Storage and Azure Data Lake Storage (both strongly consistent) but not local HDFS. Supports manual scaling of in-flight clusters, integration with Azure Log Analytics, encryption, use of an external SQL database for Hive metadata and script actions (scripts that can be run during or after cluster creation). Comes with an Enterprise Security Package add-on that adds integration with Azure Active Directory, role based access control for Hive and Spark using Apache Ranger, and security audit logs. Manageable via the Azure Portal, Powershell, a REST API and integrates with a number of development IDEs (e.g. for interactive development of Spark jobs). Priced at an hourly rate (billed per minute) based on the VM instance types being used, in addition to any Virtual Machine charges.
ChronosA framework for Apache Mesos to schedule and orchestrate jobs to periodically run at fixed times, dates or intervals in a clustered environment. Leverages Mesos for resource allocation and isolation and provides a REST API and web interface for job definition and job management. Recurring jobs are defined using ISO8601 repeating interval notation and may also be triggered by the completion of other jobs to create dependency based jobs. Uses ZooKeeper for state management and typically deployed as a service under Marathon for high-availability. Supports writing and exporting of job metrics to various systems for further analysis and notifications to various endpoints such as email and chat messaging systems. Originally created at AirBnB and written in Scala, open sourced in March 2013 under the Apache 2.0 license, hosted under the Apache Mesos Community Projects group-owned repositories on GitHub.
CloudbreakSolution for deploying and managing Hadoop clusters on cloud infrastructure based on automatically provisioned infrastructure running base docker images with Hadoop provisioned on top via Apache Ambari using Blueprints. Includes out of the box support for Amazon Web Services, Microsoft Azure, Google Cloud Platform and OpenStack, plus a Service Provider Interface (SPI) for adding support for new providers. Supports automated scaling of clusters based on Ambari Metrics and Alerts (Periscope), custom scripts that can be run on hosts before or after deployment (Recipes), a number of out of the box Blueprints, the use of custom docker images, data locality specifiers, Kerberized clusters and support for external AD/LDAP servers. Manageable through a web UI, a REST API, a CLI and an interactive shell. Originally created by SequenceIQ, with an initial beta release in July 2014, with SequenceIQ then acquired by Hortonworks in April 2015, and a 1.0 release of Cloudbreak included in HDP 2.3 in July 2015. Open sourced under the Apache 2.0 licence, with a stated plan for the code to be donated to the Apache Foundation.
Cloudera AltusPlatform for accessing individual CDH capabilities as services. Currently supports the deployment and management of CDH clusters on cloud infrastructure (Altus Director, previously Cloudera Director), the execution of Spark, MapReduce or Hive (over Spark or MapReduce) jobs (Altus Data Engineering), the dynamic provisioning of Impala clusters (Altus Data Warehouse), with a stated future plan for R- and Python-based machine learning workloads (Altus Data Science) and an HBase based operational database service. Runs on Amazon Web Services or Microsoft Azure over external data in Amazon S3 or Azure Data Lake Storage, with a stated plan to expand support to other cloud service providers (specifically the Google Cloud Platform) in the future. Includes Altus SDX, allowing metadata (e.g. Hive table definitions) to be automatically persisted across transient workloads, referenced via a namespace. Supports a web based UI, a (Python) CLI and a Java SDK, with full user authentication and role based access management, and integration with AWS and Azure security. Launched in May 2017, with a per node / per hour pricing model.
Cloudera Altus >  Cloudera Altus Data EngineeringManaged service for the execution of Spark, MapReduce or Hive (over MapReduce or Spark) jobs using managed CDH clusters on AWS and Azure cloud infrastructure over data in Amazon S3 or Azure Data Lake Storage (ADLS). Jobs run on clusters within a defined AWS or Azure environment, which can be transient (created and terminated on demand) or persistent, with each cluster supporting one service type (Hive, Spark, MapReduce) with a fixed node count. Jobs can then be queued individually or in batch for execution against an existing cluster or against a dynamically created cluster, with jobs specified either by uploading a JAR to S3 (for Spark or MapReduce) or via a Hive script (either directly uploaded or uploaded to S3), and the ability to either halt or continue the queue on job failure. Supports access to clusters via SSH, read only access to Cloudera Manager, a SOCKS proxy to cluster web UIs (including the CM admin console, YARN history server and Spark history server), and access to server and workload logs (including the ability to write these to S3 for access after clusters have been terminated). All nodes managed by Altus are tagged with the cluster name and node role (master, worker or Cloudera Manager) and bootstrap scripts can be specified for execution on nodes after cluster startup.
Cloudera Altus >  Cloudera Altus Data WarehouseImpala as a managed service, supporting the dynamic provisioning of Impala clusters on AWS and Azure cloud infrastructure over data in Amazon S3 or Azure Data Lake Storage (ADLS). Clusters consist of a coordinator node and multiple worker nodes, with read-only access to a Cloudera Manager instance, with the node count fixed on creation. Supports JDBC and ODBC access to data, along with access to clusters via SSH, read only access to Cloudera Manager and a SOCKS proxy for access to the Impala web UIs. Previously known as Cloudera Altus Analytical DB.
Cloudera Altus >  Cloudera Altus DirectorSolution for deploying and managing Cloudera CDH Hadoop clusters on cloud infrastructure based on automatically provisioned infrastructure with Hadoop provisioned on top via Cloudera Manager. Includes out of the box support for Amazon Web Services, Microsoft Azure and Google Cloud Platform, with support for vSphere available from VMWare, with a Service Provider Interface (SPI) for adding support for new providers. Server component must be manually deployed via an RPM. Supports the ability to scale clusters up and down, clone clusters, run post deployment scripts, and create Kerberized and highly available clusters. Manageable through a web UI, a REST API (with Python and Java APIs) and a CLI. Released as Cloudera Director at 1.0 in October 2014 as part of Cloudera Enterprise 5.2, being renamed to Cloudera Altus Director in September 2018 as part of CDH 6. Free to download and use, with commercial support available as part of a Cloudera Enterprise subscription.
Cloudera CDHA distribution of Hadoop based on the addition of a number of closed source products, including Cloudera Manager (for installing and managing clusters) and Cloudera Navigator (for managing metadata and the encryption of data). Bundled projects tend to lag the open source versions and pull forward more patches than other distributions. Also comes with a number of add-ons, including ODBC and JDBC drivers for Hive and Impala, a number of Apache projects that aren't (yet) part of the core CDH distribution, and Workload XM (a cloud based service for analysing job logs). Available via RPMs, or can be installed using Cloudera Manager (for local installs) or Cloudera Director (for installation on cloud platforms). Comes in a number of editions including Cloudera Enterprise (under an annual per node or elastic cloud licence model with commercial support) and Cloudera Express (a free version without some enterprise features), with Cloudera Enterprise coming in a range of licence options (listed on the Cloudera website under products) with each including support for different Apache products. First released in March 2009.
Cloudera Data Science WorkbenchA web based notebook for interactive data analytics on Hadoop (with both CDH and HDP supported) that uses docker to provide custom execution environments for each notebook. Supports Python, R and Scala interpreters, plus remote execution of Spark with out of the box support for Hadoop security. Notebook code is run within a docker container in a managed Kubernetes instance, allowing different libraries to be installed and used by different notebooks, and other dependencies to be installed via terminal access to the container or via custom Docker images. Also includes support for version control (via git), tracking of model tests (Experiments), automatic deployment of models and all dependencies behind a REST endpoint (Models), collaboration via shared projects, sharing of notebooks via HTTP URLs, publishing of notebooks as HTML and scheduled execution of notebooks via workflows (including dependencies on other jobs). Originally created by Sense.io, which was acquired by Cloudera in March 2016. Initial GA release was 1.0 in April 2017, with support for HDP added in January 2019.
Cloudera ManagerPlatform for installing, managing and monitoring Cloudera CDH Hadoop clusters. Supports creation of clusters using step by step wizards, plus cluster templates for creating multiple clusters with the same configuration (e.g. dev, test and production), using either native OS packages or parcels (a Cloudera Manager distribution format that has a number of advantages over packages). Also supports the administration and configuration of clusters (including user and resource management, and the ability to manage multiple clusters); the automated Kerberization of clusters; monitoring of cluster, host and service statuses, health and metrics; generation of events and the use of custom triggers to take action on these; the visualisation of metrics; centralised log management; HDFS reports and automatic replication of data to a backup/DR cluster. Also integrates directly with Cloudera Support to enable proactive support. Web based, with a REST API and a full security model with auditing of all actions, and the ability to add support for custom services. Introduced in January 2012 as a replacement for the Cloudera Management Suite (CMS). Available for free without some enterprise features, or as part of a Cloudera CDH subscription.
Cloudera NavigatorA suite of solutions including Navigator Data Management (technical metadata management, lineage, cluster activity and analytics, cluster audit and automated policy actions), Navigator Encryption (filesystem level encryption, key management and integration with HDFS transparent encryption), and Navigator Optimizer (a solution for identifying SQL workloads that are candidates for migration to Hadoop and then optimising these once on Hadoop) built around the Cloudera CDH Hadoop distribution. All products are commercial closed source products that are only available with an appropriate Cloudera Enterprise licence.
Cloudera Navigator Data EncryptionA suite of products that complement HDFS transparent encryption to provide data at rest encryption across an Hadoop cluster. Includes Navigator Encrypt (a solution for encrypting Linux filesystems, with access granted to approved processes), Navigator Key Trustee Server (a software based solution for managing encryption keys), Navigator Key HSM (allows Navigator Key Trustee Server to use a Hardware Security Module as the root of trust for keys), Navigator Key Trustee KMS (an Hadoop Key Management Service that uses Navigator Key Trustee Server as the underlying key store) and Navigator HSM KMS (an Hadoop Key Management Service backed by an HSM where encryption zone keys originate on and never leave the HSM). First released in 2014 following the acquisition of Gazzang.
Cloudera Navigator Data ManagementSolution for managing data on a CDH Hadoop cluster. Automatically extracts metadata relating to HDFS, Hive, Impala, MapReduce, Oozie, Pig, S3, Spark, Sqoop and YARN, including data structures (databases, tables and columns) and jobs (relating to data transformation) based on activity within a cluster (rather than statically analysing code), allowing it to be searched, filtered and viewed, including displaying lineage diagrams showing how data moves through the system, a Data Stewardship dashboard of key data management information (including statistics on the data held in the cluster and the activity relating to this data), analytics on the data held in HDFS, and a full audit capability of all activity on the cluster. Allows custom metadata to be added to objects, including descriptions, key-value pairs and tags, with the option to define metadata namespaces and data types / value constraints (managed metadata), plus the ability to pre-set custom attributes (via job properties for MapReduce jobs and JSON .navigator files for HDFS files), and the ability to define data lifecycle management policies (allowing actions to be specified based on metadata, e.g. to archive any files that haven't been accessed for six months). Web based, with a full user security model, and a REST API and Java SDK for integrating external applications with metadata held in Navigator. Initial 1.0 release was in February 2013.
Cloudera Navigator OptimizerA web based hosted service for analysing SQL logs from a range of relational databases to provide guidance on offloading workloads to Hadoop, and from Hive and Impala to provide guidance on optimising workloads running on Hadoop. Can analyse query logs, query metadata, schemas and statistics, and includes a Java utility to mask literal values in SQL queries and logs, and to encrypt schema identifiers before files are uploaded. Provides analytics on the overall query workload (including by similarity and risk, as well as by uploaded metrics such as cpu usage, memory usage and file system reads/writes) and recommendations for improvements to queries (to reduce risk, and to make external queries Hadoop compatible), with risk representing the level of Hadoop compatibility. Formerly Xplain.io, which was founded in 2013, acquired by Cloudera in February 2015, with a GA release as Cloudera Navigator Optimizer in July 2016.
Cloudera SearchA distribution of Apache Solr that also includes a number of tools for integrating with Solr using Morphlines. Includes two utilities for loading data from HDFS, the Crunch Indexer Tool (direct Solr inserts using Crunch over Spark or MapReduce), and the MapReduce Indexer Tool (creates Solr index files using MapReduce, optionally putting these live), plus two utilities for loading data from HBase based on the Lily HBase Indexer, the Batch Indexer (for batch loads) and the NRT (Near Real Time) Indexer (for continuous replication of HBase events). First released in June 2013, with a GA release in September 2013 as part of CDH 4.3. Included tools are open sourced under an Apache 2.0 licence.
Confluent EnterpriseA package of software built around Apache Kafka and the Confluent Open Source product, with the addition of a number of commercial closed source products including a JMS client, Control Centre (for managing Kafka clusters), Multi DC Replication (active-active replication between Kafka clusters) and Auto Data Balancing. The JMS client is an implementation of the standard JMS provider interface over a Kafka topic. Control Centre is a web based UI that supports system health monitoring (broker and topic metrics and statuses based on information from the Confluent Metrics Reporter, a plugin for Kafka clusters that reports metrics to a Kafka topic), real time stream monitoring (statistics on the production and consumption of messages including the level of consumption and latency based on statistics from Confluent Monitoring Interceptors, a plugin for Kafka producers and consumers that reports statistics to a Kafka topic), the GUI based creation of Kafka Connect pipelines, viewing of cluster and topic information, and e-mail alerting based on custom triggers on topic, consumer group or broker metrics. Multi DC Replication is an optional licenced connector for Kafka Connect that enables replication between two remote Kafka clusters, including active-active synchronisation. Auto Data Balancing is a tool for re-balancing topic partitions across cluster nodes, recommending moves based on information from the Confluent Metrics Reporter and rack awareness to ensure load is distributed evenly across the cluster, and easily allowing for the addition or removal of nodes. Also includes the Confluent Support Metrics feature, which collects broker and cluster metadata and metrics and forwards these on to Confluent for proactive support. Confluent Enterprise is the commercial version of their Confluent Platform, with an open source version also available as Confluent Open Source. Includes full commercial support for all open and closed source products. First GA release was version 1.0 of the Confluent Platform in February 2015.
Confluent Open SourceA package of open source projects built around Apache Kafka with the addition of the Confluent Schema Registry, Kafka REST Proxy, a number of connectors for Kafka Connect and a number of Kafka clients (language SDKs). The Schema Registry allows Kafka Avro message schemas to be defined and versioned centrally, with schemas stored in a Kafka topic, a REST interface for managing schemas, support for schema evolution (with support for backwards, forwards and full compatibility between versions), plugins for Kafka clients to serialise / deserialise messages using the schemas, and support for running as a distributed service. The REST Proxy provides a REST interface onto a Kafka cluster, with support for viewing cluster metadata (covering brokers, topics, partitions and configuration) and both submitting and consuming messages, with support for JSON, JSON-encoded Avro and base64 messages, and integration to the Schema Registry for Avro messages. Bundled connectors for Kafka Connect include HDFS, JDBC, Elasticsearch and S3. Bundled client libraries (all open source) include those for C/C++, Go, .NET and Python. Also includes a Version Collector that reports version information to Confluent. Used to include Camus, a tool for unloading Kafka topics to HDFS, but this has now been deprecated in favour of Kafka Connect. Development of the open source projects is led by Confluent, who then bundle and distribute them for free as the Confluent Open Source version of their Confluent Platform, with the Confluent Enterprise version adding a number of closed source features and commercial support for all open and closed source products. Available as a zip, tar, deb or rpm package from Confluent, with all source code hosted on GitHub. First GA release was version 1.0 of the Confluent Platform in February 2015.
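As an illustration of the Schema Registry REST interface described above, the following hedged sketch registers and retrieves an Avro schema from Python, assuming the registry's default port and an illustrative subject name:

```python
# Hedged sketch of registering and fetching an Avro schema through the Schema
# Registry REST interface; the registry URL and subject name are illustrative.
import json
import requests

REGISTRY = "http://localhost:8081"          # default Schema Registry port
SUBJECT = "page-views-value"                # hypothetical topic value subject

schema = {
    "type": "record",
    "name": "PageView",
    "fields": [{"name": "url", "type": "string"},
               {"name": "ts", "type": "long"}],
}

# Register a new schema version for the subject (the schema is passed as an
# escaped JSON string inside the request body)
resp = requests.post(
    f"{REGISTRY}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(schema)}))
print("schema id:", resp.json()["id"])

# Fetch the latest registered version back
latest = requests.get(f"{REGISTRY}/subjects/{SUBJECT}/versions/latest").json()
print(latest["version"])
```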
Databricks DeltaStorage layer for tabular structured data within the Databricks Unified Analytics Platform that supports ACID transactions and data skipping. Data is persisted to Amazon S3 or Azure Blob Storage as Parquet files with metadata stored in a Hive Metastore, and includes full integration with Spark Structured Streaming and Spark SQL. Supports batch appends, overwrites, updates, upserts and deletes and streaming appends or overwrites, with new data written as new delta files (with changes collapsed during reads) supported by a transaction log. Allows multiple writers to simultaneously modify a dataset, and ensures readers are always presented with a consistent view through the use of snapshots. Includes support for a number of SQL management extensions, including viewing the transaction history (describe history), accessing previous versions of datafiles (by timestamp or version), collapsing delta files to improve performance (optimize) and removing old files left around to support snapshotted reads (vacuum). Supports performant reads through standard Hive partitioning (including support for partition pruning) and data skipping (reducing data read based on recorded min/max values for data files, which can be enhanced by Z-ordering data). Also supports views over tables and backward compatible schema changes, including support for auto addition of new fields based on input data. Currently in preview, having been first announced in October 2018.
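A hedged sketch of the core read/write path from PySpark is shown below, assuming an environment (such as the Databricks runtime) where the delta data source is available and using an illustrative table path:

```python
# Hedged sketch of writing and reading a Delta table from PySpark; assumes the
# "delta" data source is available in the runtime and the path is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

events = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["id", "action"])

# Batch append to a Delta table backed by Parquet files plus a transaction log
events.write.format("delta").mode("append").save("/delta/events")

# Readers always see a consistent snapshot, even while writers are appending
spark.read.format("delta").load("/delta/events").show()

# Management extensions such as OPTIMIZE and VACUUM are exposed through SQL,
# e.g. spark.sql("OPTIMIZE delta.`/delta/events`")  (Databricks-specific)
```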
DataKitchen DataOps PlatformPlatform to enable adoption of DataOps practices by data engineering, data science and analytics teams, combining ideas from Agile development, DevOps, statistical process control, data science model deployment and test and data automation into a collaborative workflow. Work is organised into "Kitchens", each representing a place to do work (production environments, development sandboxes etc.) and consisting of the data, data stores, tools (ETL, data science, visualisation), the code or configuration used by those tools, a git branch, and the necessary servers and software; Kitchens can be created, merged or shut down, can wrap an environment already available in the organisation, and can be individual or shared across groups. Within a Kitchen, team members create and run "Recipes", directed graphs of steps representing the workflow pipeline used to deliver analytics (acquire data, transform data, call a machine learning model, visualise data), with Recipes utilising the tools that customers already own. Tests embedded in Recipes check values, ranges, distributions, frequencies, implied and enforced integrity, and other business based checks on data or processing, with alerts delivered if selected tests fail, and with "Order" metadata captured covering not just lineage and descriptors of the data and jobs, but also statistics such as wall-clock time, processing requirements and test outputs. In production these tests provide surveillance and alerting, while in development the same tests ensure that changes to code or configuration do not cause regressions, functional errors or performance problems further down the pipeline, allowing changes requested by business users to be continually and automatically deployed, with the aim of reducing data errors, incorrect results, late deliveries and the time it takes to move changes from development into production. Also supports collaboration between users on different teams and in different locations, including "Ingredients" (sharable, reusable component services with a standard, function-like API that can be called from multiple Recipes), is designed for secure multi-tenant, multi-cloud and multi-environment deployments, and can accept data and metadata from existing processes via a user interface, command line or APIs. A commercial product, available as a managed service with optional on-prem agent installation, first released in 2014.
Denodo PlatformData Virtualisation platform, enabling a logical schema to be defined over a range of relational, NoSQL, flat file and application / data APIs that can then be queried through a range of end points. Supported data sources include a wide range of databases (relational, in memory, MPP, Hadoop, cloud, OLAP cube and NoSQL), flat files (Hadoop, text, binary, Office, including support for (S)FTP, compression and encryption), application APIs (e.g. Salesforce, SAP, Oracle e-business, Twitter), RDF semantic repositories via SPARQL, mainframes, data APIs (SOAP and REST), JMS queues and the ability to scrape web pages, with data accessible via SQL (JDBC, ODBC), data APIs (SOAP and REST) and web widgets (Sharepoint, Java, AJAX), with the ability to transform, cleanse and combine data from multiple sources into a single semantic model using the relevant source system query language. Supports a dynamic query optimiser (which pushes query logic down to the underlying data source, with the ability to move data between sources and take advantage of data replicated in multiple sources to maximise logic pushdown), caching (either by query or by full materialisation, allowing tables in the semantic layer to be pre-generated, with support for scheduled and incremental updates and the use of external ETL tools), data writes back to source (with support for a distributed transaction manager and 2-phase commits), a full security model (role based access control at the row/column level, with authentication pass-through to data sources), resource and workload management, metadata visualisation (including search and lineage views), self service data discovery (execution of ad-hoc queries outside of the semantic layer), search (over data and metadata, including support for semantic mining and extraction of text sources), an SDK (for adding new source adapters, custom functions and stored procedures), and a graphical UI (supporting wizard-driven configuration, integration with external configuration management tools and release management). Can be deployed standalone or as a cluster (supporting both active/active and active/passive configurations and shared caches, with support for geo-replication), and is also available for AWS and as a free Denodo Express version (with limits on the number of active queries and results). A commercial product from Denodo Technologies Inc, who were founded in 1999 with the first release of the Denodo Platform shortly afterwards.
Elastic CloudSeries of solutions for accessing Elasticsearch (including Kibana and the X-Pack commercial extensions) as a service, or for automated provisioning, management and monitoring of Elasticsearch clusters in the cloud or on premises. Includes Elasticsearch Service (automated deployment and management of Elasticsearch on AWS, Azure or Google Cloud Platform), Elastic Cloud Enterprise (orchestration of Elasticsearch on local or cloud infrastructure), Elastic App Search Service (Elasticsearch as a service for your applications) and Elastic Site Search Service (Elasticsearch as a service for your website). Elasticsearch Service includes support for scripting and plugins, high availability across multiple zones, automated security configuration, upgrades, scaling and backups through a management web UI (Elastic Cloud Console / Cloud UI). Elastic Cloud Enterprise supports the same capability on your own infrastructure, and includes an API in addition to the web UI for configuring and managing clusters, with Elasticsearch and Kibana provisioned using Docker containers. Each service offering is available under a range of subscription licence tiers with differing levels of support and some feature differences; Elastic Cloud Enterprise is freely available but requires you to provide your own Elasticsearch licences. Elasticsearch Service was launched in July 2015 as Elastic Cloud; Elastic Cloud Enterprise was first released as alpha in December 2016, with a 1.0 GA release in May 2017. Elastic Cloud is the only Elasticsearch service offering that includes the Elastic X-Pack features.
Elastic X-PackA set of commercial extensions to the Elastic open source products (Elasticsearch, Kibana and Logstash) that were discontinued in June 2018 with the release of version 6.3 of the Elastic stack, with the individual components now open sourced under an Elastic licence and bundled with the relevant Elastic open source products, although the majority still require a commercial licence from Elastic to be enabled. Included Security (formerly Shield), Alerting (formerly Watcher), Monitoring (formerly Marvel), Reporting, Graph, Machine Learning and Application Performance Monitoring (APM).
ElasticsearchA distributed search server built on Apache Lucene that supports a number of advanced analytics over search results. Data is stored in indexes, with each index able to support multiple schemas (types), with the data itself sharded to support distributed parallel queries, with multiple replicas of each shard providing resilience and redundancy. Supports both pre-defined and schemaless types, all standard Lucene functionality (including faceting, grouping, clustering, hit highlighting, geo support, near real time indexing), the ability to update and delete documents (by id or query), upsert operations, batch operations, re-indexing (from one index into a second index), generated or calculated fields, document versioning and optimistic concurrency control, nested searches based on sub-documents or explicit parent-child document links, templated searches, a range of aggregations (including support for metrics, bucketing results, matrix calculations and custom aggregations using pipelines), custom analysers for indexing data, custom transformation pipelines prior to indexing (via an ingest node), the ability to query across clusters (cross cluster search), a plugin framework, registered queries that are executed against newly indexed data (percolation) and the ability to snapshot and restore indexes using HDFS, S3, Azure and Google Cloud. Also now includes a number of features that were previously bundled separately in the Elastic X-Pack, including Security (encryption of data and links, authentication via LDAP and Active Directory, authorisation at the cluster, index, document and field level, and full audit logging), Monitoring (export of cluster, node and index metrics), Alerting (via Watcher, allowing registration of scheduled queries over monitoring data that can perform a number of extensible actions), Graph (APIs for working with relationships, with connections between indexed terms generated on the fly using Elasticsearch aggregations and relevance scoring), SQL access (via a REST API, CLI or JDBC interface), Machine Learning (support for automated anomaly detection jobs over time-series data run on the Elasticsearch cluster) and Rollup (aggregation of historical data), the majority of which require a commercial licence from Elastic in order to be enabled. Comes with a REST API, with clients available for a range of languages including Java, C#, Python, JavaScript, PHP, Perl and Ruby. First released in February 2010, with a 1.0 release in February 2014. Open source under the Apache licence, with the exception of the X-Pack components which are under an Elastic licence following the open sourcing of X-Pack in version 6.3. Development is led by Elastic, who were formed in 2012 by the creator of Elasticsearch and a lead Lucene contributor, and who provide commercial support, licences to enable the commercial X-Pack features, and an on-site or public cloud service offering (Elastic Cloud).
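The sketch below indexes and searches a document over the REST API from Python, assuming a local cluster and using an illustrative index name (the _doc type shown applies to 6.x/7.x clusters):

```python
# Minimal sketch of indexing and searching a document over the Elasticsearch
# REST API; host and index name are illustrative.
import requests

ES = "http://localhost:9200"
INDEX = "articles"

# Index a document with an explicit id, refreshing so it is immediately searchable
doc = {"title": "Elasticsearch overview", "tags": ["search", "lucene"]}
requests.put(f"{ES}/{INDEX}/_doc/1", params={"refresh": "true"}, json=doc)

# Full-text search using the query DSL
query = {"query": {"match": {"title": "elasticsearch"}}}
hits = requests.get(f"{ES}/{INDEX}/_search", json=query).json()
print(hits["hits"]["total"])
```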
Elasticsearch-HadoopA suite of open source components for querying and writing documents to Elasticsearch from a range of Hadoop technologies, including MapReduce, Hive, Pig, Spark, Cascading and Storm. Specific functionality includes InputFormat and OutputFormat libraries for MapReduce, a Hive storage handler allowing external tables to be defined over Elasticsearch indexes, read and write functions for Pig, Java and Scala RDD based libraries for Spark, Spark SQL support, Spark Streaming support, an Elasticsearch Tap for Cascading and a dedicated Spout and Bolt for Storm. Used to include functionality for writing snapshots of Elasticsearch indexes to HDFS which is now part of the Snapshot and Restore functionality in Elasticsearch. Certified with CDH, MapR and HDP.
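As a hedged sketch of the Spark SQL support (it assumes the elasticsearch-hadoop/elasticsearch-spark jar is on the Spark classpath and that a "logs" index exists), an index can be read as a DataFrame roughly as follows:

```python
from pyspark.sql import SparkSession

# Assumes the elasticsearch-hadoop connector jar has been added to the Spark
# classpath (e.g. via --jars or --packages); index and field names are illustrative.
spark = SparkSession.builder.appName("es-hadoop-sketch").getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")   # the connector's Spark SQL data source
      .option("es.nodes", "localhost")
      .option("es.port", "9200")
      .load("logs"))                            # Elasticsearch index to read

df.printSchema()
df.filter(df.status == 500).show()
```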
Google Cloud ComposerManaged workflow orchestration service built on Apache Airflow that's designed for running data integration tasks on a repeated schedule. Implemented on a micro-service architecture, the Airflow database and web server are implemented on App Engine and access protected using Identity-Aware Proxy (an enterprise security model that enables employees to work from untrusted networks without the use of a VPN), while the scheduler, executor and worker nodes are implemented on Kubernetes Engine. Integrated with Cloud Storage (for staging DAGs, plugins and data dependencies) and Stackdriver (for real-time logging and monitoring of Airflow service and workflow logs). Manageable via web interfaces (Cloud Platform Console and the Airflow web UI), a command line interface (Cloud SDK) or an RPC and REST API. Allows custom Airflow plugins and Python dependencies from the Python Package Index to be installed. Priced at an hourly rate (charged per minute) based on the size of a Cloud Composer environment, which is in addition to any Kubernetes Engine, Compute Engine or Persistent Disk and network egress charges.
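A minimal illustrative Airflow DAG of the kind staged in the environment's Cloud Storage DAGs folder (task ids, schedule and commands are assumptions) might look like this:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Once copied into the environment's Cloud Storage dags/ folder, the Airflow
# scheduler managed by Cloud Composer picks this DAG up automatically.
default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_extract",              # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")
    extract >> load                      # load runs only after extract succeeds
```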
Google Cloud DataProcService for dynamically provisioning Hadoop clusters on Google Compute Engine based on a single standard set of Hadoop services. Supports selection of virtual machines (including custom machine types and machines with GPUs), usage of custom VM images, a claimed cluster startup time of less than 90 seconds, local storage and HDFS filesystem, programmatic execution of jobs, workflows (parameterisable operations that create clusters, run jobs and then delete the cluster), manual and automatic scaling, initialisation actions (to install extra services or run scripts, with a set of open source actions available), optional components (automatic addition of extra services), automatic deletion of clusters (based on time, usage or idleness), integration with Stackdriver Logging and Monitoring and encryption of data in HDFS and Cloud Storage. Manageable via the Google Cloud Console Web UI and SDK plus an RPC and REST API. Priced at an hourly rate (charged per second) based on the specification of the VMs being used, which is in addition to any Compute Engine or Persistent Disk charges.
Google Cloud StorageAn object store service with strong consistency, multiple storage tiers and deep integration to the Google Cloud ecosystem. Objects are organised into buckets and indexed by string, with the option to list objects by prefix and to summarise results based on a delimiter allowing a filesystem to be approximated. Storage tiers supported include multi-regional (data is distributed across regions in a geo area), dual-regional, regional, nearline and coldline (designed for data accessed less than once per month/year respectively). Supports object lifecycle management allowing for automatic deletion or moving of objects between storage tiers. Supports versioning of objects, access control (via Google Cloud IAM, bucket and object ACLs and time-limited access via signed URLs), encryption of objects and support for SSL connections, auditing of object operations via Google Cloud Audit, gzip decompression on read, custom domains, multi-part uploads via merging of objects after upload (Composite Objects), access and storage logs as downloadable CSV files and batching of requests. Quotes a 99.999999999% guarantee that data won't be lost, and availability of 99.9% for regional and 99.95% for multi-regional storage tiers. Provides a web based management console (Google Cloud Platform Console), CLI (gsutil), JSON and XML REST API and SDKs for a wide range of languages.
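A minimal sketch using the google-cloud-storage Python client (bucket and object names are assumptions, and credentials are assumed to come from the environment) showing an upload and a prefix/delimiter listing:

```python
from google.cloud import storage  # pip install google-cloud-storage

# Credentials are resolved from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
client = storage.Client()
bucket = client.bucket("my-example-bucket")   # hypothetical bucket

# Upload an object.
blob = bucket.blob("raw/2018/01/01/events.json")
blob.upload_from_filename("events.json")

# List objects by prefix with a delimiter, approximating a directory listing.
for obj in bucket.list_blobs(prefix="raw/2018/01/01/", delimiter="/"):
    print(obj.name, obj.size)
```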
GreenplumA shared nothing, massively parallel processing (MPP) database optimised for analytical / OLAP workloads that includes support for deployment on Kubernetes. Based on a PostgreSQL fork, it is essentially multiple PostgreSQL databases working together as a single logical database. Supports a cost-based query optimiser optimised for large analytical workloads, multiple storage models (including append only, columnar and heap), full ACID compliance and concurrent transactions, multiple index types, broad SQL support, a range of client connectors (including ODBC and JDBC), high capacity bulk load and unload tools, in database query language support (including Python, R, Perl, Java and C), and in database analytics support (including machine learning via Apache MADLib, search using Solr via GPText, geographic analytics via PostGIS and encryption via PGCrypto). Originally created by Greenplum (the company) which was founded in September 2003 before being bought by EMC in 2010, with Greenplum (the database) then spun out as part of Pivotal Software in 2013 before being open sourced in October 2015 under the Apache 2.0 licence with the source code hosted on GitHub. Development is still led by Pivotal (with little evidence of outside contributions), who also distribute binaries as Pivotal Greenplum (with a number of extra enhancements, detailed in the release notes) and provide training, consultancy and support.
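Because Greenplum speaks the PostgreSQL protocol, standard PostgreSQL drivers can be used; a sketch (connection details and table definition are illustrative) of creating an append-optimised, column-oriented table distributed across segments by a key:

```python
import psycopg2  # any PostgreSQL driver can connect to Greenplum

conn = psycopg2.connect(host="gp-master", port=5432, dbname="analytics",
                        user="gpadmin", password="secret")
cur = conn.cursor()

# Append-optimised columnar storage; DISTRIBUTED BY hashes rows across
# Greenplum segments so scans and joins run in parallel.
cur.execute("""
    CREATE TABLE sales (
        sale_id   bigint,
        sale_date date,
        amount    numeric(12,2)
    )
    WITH (appendonly=true, orientation=column)
    DISTRIBUTED BY (sale_id)
""")
conn.commit()
cur.close()
conn.close()
```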
Hortonworks Data Cloud for AWSService that supports the creation and management of HDP clusters on Amazon Web Services (AWS). Management is done through a Cloud Controller AWS Product that provides a web interface and CLI for orchestrating the creation of AWS resources and the deployment of clusters using Ambari, and the subsequent scaling or cloning of the cluster. Supports a number of standard cluster types, including Data Science (Spark, Zeppelin), EDW-ETL (Hive, Spark) and EDW-Analytics (Hive, Zeppelin), with clusters also including Tez, Pig and Sqoop, along with a number of standard node types, including worker nodes (that support HDFS and YARN) and compute nodes (that only support YARN). Clusters are designed to be ephemeral, however Amazon RDS can be used to provide persistent storage of Cloud Controller and Hive metadata, and Amazon S3 can be used to provide persistent cluster storage. Also supports Hortonworks SmartSense, cluster templates, the use of Spot Instances for compute nodes, and node recipes for executing custom scripts pre/post the Ambari cluster setup. Comes with free community support from Hortonworks. First launched in November 2016, but appears to be discontinued as of HDP 3.0 with Hortonworks' move to a multi-cloud strategy via Cloudbreak.
Hortonworks DataFlowA distribution of a set of Apache and Hortonworks open source technologies for processing and running analytics on data 'in motion', with all products integrated with Apache Ranger for security, Apache Ambari for management and Schema Registry for central schema management. All bundled Apache open source projects are based on official Apache project releases, with any patches for bug fixes or new features being official Apache project patches from later releases of the relevant project. Available as RPMs or through Apache Ambari (via a management pack), and as an on-site or in the cloud managed service (as Hortonworks Operational Services), but is not currently available via Cloudbreak or as a cloud service. The HDF software is provided free of charge, with training, consultancy and support available from Hortonworks. First released in September 2015 as a distribution of just NiFi following the acquisition by Hortonworks of Onyara (a company founded by some of the original creators of NiFi).
Hortonworks Data PlatformA distribution of Hadoop based on a commitment to the Apache open source ecosystem. All bundled projects are Apache open source projects based on official Apache project releases, with any patches for bug fixes or new features being official Apache project patches pulled from later releases of the relevant project. Available as RPMs, through Apache Ambari (for local installs) or Cloudbreak (for installation on cloud platforms), and as an on-site or in the cloud managed service (as Hortonworks Operational Services). Comes with a number of add-ons that aren't part of the core product, including HDP Search, Hortonworks HDB and ODBC and JDBC drivers for Hive, Spark SQL and Apache Phoenix. The HDP software is available free of charge, with training, consultancy and support available from Hortonworks, including a flex support subscription, a consumption based model for the use of HDP on-premise or in the cloud. Also available for IBM Power Systems. The Hortonworks Data Platform was first released in June 2012.
Hortonworks Data Platform for WindowsA version of the Hortonworks Data Platform natively compiled for Windows that was discontinued as of HDP 2.5 in August 2016. First announced in March 2013, with a GA release in May 2013. Didn't use Apache Ambari for installation and management (instead being installed via a standard Windows installer), didn't support SmartSense, and didn't include some technologies (such as Accumulo, Atlas, Kafka, Solr, Spark and Hue).
Hortonworks Data Platform SearchAn add on package to HDP that bundles up Solr, Banana, and a suite of libraries and tools for integrating with Solr from Hadoop (utilities for loading data from HDFS), Hive (a SerDe to allow Solr data to be read and written as a Hive table), Pig (store and load functions), HBase (replication of HBase events to Solr based on the Lily HBase indexer), Storm and Spark (both SDKs for integrating with Solr). Available as an add on Ambari management pack or as a set of RPMs. Built, maintained and supported by Lucidworks on behalf of Hortonworks, first announced in April 2014 as part of the introduction of Solr with HDP 2.1.
Hortonworks DataPlaneAn extensible platform for managing data ecosystems, with capabilities delivered through pluggable applications. Supports the registration and management of DataPlane applications and the registration of Ambari managed clusters that are then accessible to these applications. Supports role based access control, with LDAP integration for users and groups and support for app specific roles. Runs on Docker, with state held in an external database, and integrates with Knox (for SSO and access to clusters). Future services referenced include Cloudbreak and IBM DSX. Stated plan is for this to be a cloud service, however this is not currently generally available, and the documentation currently details installation steps for a local machine. First released in November 2017.
Hortonworks DataPlane >  Data Analytics StudioA DataPlane application for running Hive queries, managing Hive tables, and diagnosing Hive query performance issues. Supports a query editor (with autocomplete, a visual explain plan, performance improvement recommendations, saved queries and results downloading), a query search tool (with pre-defined queries for expensive, long running, non-optimised and failed queries, a range of filters and saved searches), a database management tool (supporting searching, browsing, interrogation, creation and modification of databases, tables, partitions and columns as well as uploading of data from local storage or HDFS) and table impact reporting (showing reads, writes, projections, aggregations, filters and joins by table and column, with support for dynamic heatmaps overlaid on entity relationship diagrams). Requires an Ambari management pack (the DAS engine) to be installed on all clusters.
Hortonworks DataPlane >  Data Lifecycle ManagerA DataPlane application for replicating HDFS and Hive data between two clusters along with any associated metadata and security policies. Clusters already registered with DataPlane can be paired, at which point replication policies can be defined, which result in replication jobs running at the selected interval. Supports replicating between HDFS and cloud object storage (with some caveats around replication of security policies), replication of encrypted HDFS data, TLS encryption of replication streams, one to many replication, support for Atlas metadata replication, reporting on and management of replication jobs and HDFS snapshottable directories, with jobs executed by DLM Engine processes on the appropriate cluster. Stated future plans include support for automatic tiering of data between clusters and point in time backup and restore.
Hortonworks DataPlane >  Data Steward StudioA DataPlane application for viewing and understanding data assets, with supported data assets currently limited to Hive tables on clusters with Atlas and Ranger installed. Supports viewing metadata associated with data assets (including properties, lineage, security policies and audit logs), profiling of data (with profiling performed by a background Spark process, with support for data summarisation, identifying sensitive/personal data and profiling user access to data), grouping of data assets into asset collections, tagging and rating of data assets and collections and dashboard views of metadata by cluster and collection.
Hortonworks DataPlane >  Streams Messaging ManagerA DataPlane application for monitoring Apache Kafka clusters. Provides an overview of producers, topics (and their partitions), brokers and consumer groups, showing key statistics and the connections between them, with the ability to propagate filters based on these connections. Also provides detail views, profiles and historic graphs for each producer, topic, broker and consumer group, with the ability to link out to Atlas to see end to end lineage and Ambari Grafana for detailed metrics. Metrics and status information is also provided over a REST API, with a REST Server Agent running on each cluster being monitored.
Hortonworks SmartSenseSupports the capture of diagnostic information from HDP and HDF clusters (including configuration, metrics and logs from both Hadoop and the Operating System) into a bundle for upload (either manually or automatically) to the Hortonworks support portal to assist in the resolution of support issues and the delivery of cluster optimisation and preventative action recommendations, with support for anonymisation (including IP addresses and host names, with support for further custom rules) and encryption of information in bundles and a SmartSense gateway to proxy uploads if direct internet access isn't available. Also includes functionality to help understand and analyse cluster activity, including the Activity Analyser (aggregates data from YARN, Tez, MapReduce and HDFS into Ambari Metrics) and Activity Explorer (an embedded instance of Apache Zeppelin with pre-built notebooks for exploring and visualising cluster activity). Installable and manageable through Apache Ambari. Part of the Hortonworks support offering, introduced in June 2015 as part of HDP 2.3.
HudiSpark library for managing tabular structured data on Hadoop that supports atomic transactions, near real time ingestion and querying, incremental reading of data for further processing and upserts, updates & deletes. Data is stored in HDFS, with a folder for each table partition, and with data files chunked by Hadoop block size (with each chunk allocated a unique fileid). Supports two storage mechanisms - Copy on Write (maintains data as a Parquet file for each chunk that's re-written for updates and deletes) and Merge on Read (also maintains data as a Parquet file, however new data for a chunk is written to an Avro delta file, with an async background compaction process to merge all new delta files into the Parquet file for a chunk). Data is queryable via Hive, Presto and SparkSQL via custom InputReaders through three views - Read Optimised (only queries Parquet files), Real Time (queries both Parquet and Avro delta files, merging in the deltas at query time) and Incremental (only reads Avro files to provide new data since a given commit). Supports strongly consistent atomic transactional commits (with a commit log (the timeline) used to prevent data from being queried until it is committed, and with support for automatically rolling back failed commits and the ability to manually rollback specific commits) and read isolation (all data filenames include the commit id meaning data files are never modified once committed, with a cleanup process to remove old redundant files). Compactions are non blocking, lock free and asynchronous, with pluggable strategies for prioritising compactions. All records must have a unique key, with a key lookup (either via bloom filter or an external HBase table) used to identify updates and identify which chunk that update should be applied to. Also pluggable to support alternative storage formats to Parquet and Avro if required. Spark APIs include support for incremental reads, bulk inserts, upserts and Spark SQL, and include integration with Hive and Presto (including a Hive Metadata sync tool that incrementally pushes table and partition metadata to the Hive metastore for Hive and Presto), a CLI, the ability to generate Graphite metrics and a number of utilities (including the ability to stream data from Kafka and Sqoop into Hudi). Created at Uber where it's used in production, and open sourced in December 2016. Name stands for Hadoop Upserts anD Incrementals.
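A hedged PySpark sketch of an upsert write (it assumes a Hudi release that registers the org.apache.hudi data source and that the Hudi bundle jar is on the classpath; table, field and path names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, "2018-01-01", "alice"), (2, "2018-01-01", "bob")],
    ["id", "event_date", "user"],
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",            # unique record key
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "event_date",   # picks the latest version on upsert
    "hoodie.datasource.write.operation": "upsert",
}

(df.write
   .format("org.apache.hudi")
   .options(**hudi_options)
   .mode("append")
   .save("hdfs:///data/hudi/events"))
```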
HueWeb application to allow users and administrators to work with a Hadoop cluster. Features include a SQL query tool (with auto-complete, a SQL expression builder, plotting results as a graph or on a map, and the ability to refine results) over any JDBC compatible database, a Pig query tool (with auto-complete and parameterised queries), a Solr search tool (drag and drop creation of Solr dashboards with grid, timeline, graph, map and filter widgets, a tool for indexing data into Solr and a Solr index browser), a query notebook (Spark, PySpark, Scala, Hive, Impala, Pig and R queries along with visualisation of results as graphs and maps), an Oozie management tool (graphical Oozie workflow, coordinator and bundle editors and an Oozie monitoring and management dashboard), an Apache Sentry configuration tool (for managing permissions to Hive tables and Solr collections), an HDFS and S3 file browser and manager (including the ability to upload and edit data), a YARN job browser (viewing logs and statistics), a Hive Metastore manager (browse, view sample data, create and manage databases and tables), an HBase table manager (browse, view, edit, create and manage tables), a Sqoop2 manager (create, manage and execute Sqoop2 jobs), a ZooKeeper manager (list, view and edit) and a user workspace for saving work done in Hue, organising this in folders and sharing it with other users. Originally released by Cloudera as Cloudera Desktop in October 2009, before being open sourced as Hue in June 2010. Python/Django based, under active development with a wide range of contributors, and available for all major Hadoop distributions.
InfluxDBA time series database implemented in Go available in both open source and enterprise editions. Each data point consists of a metric name (measurement), a UNIX nano timestamp, a set of tag key value pairs, and a set of value key value pairs, with the combination of measurement and tag keys referred to as a series. Data is stored in a custom time series index (TSI) engine which supports very large numbers of series allowing for huge cardinalities of tag and value keys. Queries are written in InfluxQL (a variant of SQL), which includes support for creating and managing databases and series, listing series metadata (including measurements, tag keys and values and field keys), managing queries, writing the results of queries back into InfluxDB into a new series, a range of analytical SQL functions including aggregations (e.g. sum, count, spread, stddev), selections (e.g. first, last, percentile, sample) and transformations (e.g. ceiling, derivative, moving_average), and support for registering continuous queries that are run automatically and periodically within a database to create aggregate tables. Also supports retention policies for the automatic deletion of historic data, basic authentication and authorisation (at the database level), HTTPS connections, service plugins that allow data to be written to InfluxDB in alternative protocols (with out of the box support for UDP, Graphite, CollectD, Prometheus and OpenTSDB protocols), snapshot backups, statistics and diagnostic information, and an HTTP API and CLI for writing and querying data. Available as an open source version (under an MIT licence but limited to a single node), and as two commercial products - InfluxEnterprise (with support for clustering, access control and incremental backups) and InfluxCloud (InfluxEnterprise as a cloud based service). Originally created in 2013, and is part of the open source TICK suite along with Telegraf (ingestion of data), Chronograf (admin UI and visualisation) and Kapacitor (streaming analytics and actions).
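A minimal sketch using the influxdb Python client (database, measurement and tag names are illustrative) writing a point and running an InfluxQL aggregation:

```python
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086, database="metrics")
client.create_database("metrics")

# Each point: measurement + tags + fields (+ optional timestamp).
client.write_points([{
    "measurement": "cpu",
    "tags": {"host": "web01", "region": "eu-west-1"},
    "fields": {"usage_user": 42.5},
}])

# InfluxQL: average the last hour into 5 minute buckets per host.
result = client.query(
    'SELECT mean("usage_user") FROM "cpu" '
    'WHERE time > now() - 1h GROUP BY time(5m), "host"'
)
for point in result.get_points():
    print(point)
```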
KiteA set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem. Consists of three sub-projects - Kite Data (a logical dataset abstraction over Hadoop), Morphlines (embeddable configuration driven transformation pipelines) and Kite Maven Plugin (a Maven plugin for deploying Hadoop applications). Java based, Open Source under the Apache 2.0 licence and hosted on GitHub. First released in May 2013 by Cloudera as the Cloudera Development Kit (CDK), renamed to Kite in December 2013, and reached a v1.0 release in February 2015 with a number of external contributors. Last release was v1.1 in June 2015, with very little development activity since this time.
Kite >  Kite DataLibrary that provides a logical dataset and record abstraction over HDFS, S3, local filesystems and HBase, including support for partitioning and views (which allow datasets to be filtered and supports automatic partition pruning). Provides a command line interface and Maven plugin for managing and viewing datasets. Supports Crunch, Flume, Spark and MapReduce, and can integrate with a Hive Metastore to make datasets available through Hive and Impala. Stores data using Avro (utilising Avro schema evolution / resolution) or Parquet.
Kite >  Kite Maven PluginA Maven plugin that supports the packaging, deployment and execution of applications onto Hadoop.
Kite >  MorphlinesA configuration driven in-memory transformation pipeline that can be embedded into any Java code base, with specific support for Flume, MapReduce, HBase, Spark and Solr. Supports multiple different file types including CSV, Avro, JSON, Parquet, RCFile, SequenceFile, ProtoBuf and XML plus gzip, bzip2, tar, zip and jar files. Also supports a number of transformation steps out of the box, including integration with Apache Tika for reading common file formats.
LlamaFramework for long running low-latency distributed applications to request resources from YARN, built to support Apache Impala. Operates as an un-managed YARN application master (that handles resource requests over a Thrift API and delivers resource notifications) and a node manager plugin (that delivers resource availability information to co-located services). Created by Cloudera in August 2013 and hosted on GitHub under an Apache 2.0 licence. Maintained by Cloudera to support new Impala and CDH releases, but now deprecated and will no longer be included in CDH from v6.0 onwards.
MapR Converged Data PlatformA data platform that provides Hadoop compatibility (via YARN and the MapR-FS HDFS compatible API), NoSQL and streaming data storage via MapR-DB and MapR-ES respectively, and a bundle of open source Hadoop projects via the MapR Ecosystem Pack. Comes with an installer (MapR Installer), a web based user interface for management (MapR Control System), and a monitoring and alerting solution (MapR Monitoring). Available as a free community edition (which excludes some enterprise features such as snapshots, high availability, disaster recovery and replication), a full commercial edition, and as MapR Edge (a small footprint edition that can run on low power and embedded hardware close to data sources to perform initial data filtering and processing before forwarding data on to a central cluster via MapR replication), MapR-XD (an edition that focuses on MapR-FS plus the Orbit Cloud Suite to provide web scale file and container storage), MapR Converged Data Platform for Docker (a marketing name for using the Converged Data Platform as persistent storage for docker containers) and MapR Data Fabric for Kubernetes (ditto but for Kubernetes). Supports a number of add-ons, including the Persistent Application Client Container (PACC, a docker image containing the client libraries required to connect to a MapR Converged Data Platform), MapR Orbit Cloud Suite (which adds support for deployment of cloud infrastructure along with MapR, integration with cloud object stores, plus mirroring and replication, with support for multi-tenancy, object tiering and OpenStack integration announced) and MapR Data Science Refinery (a docker based analytics notebook powered by Apache Zeppelin that fully integrates with the MapR Converged Data Platform). Supports deployment in the cloud (AWS and Azure), and is available as a managed service. First released as MapR v1.0 in 2010.
MapR-DBNoSQL database built over MapR-FS, supporting wide column and JSON document tables and HBase and OJAI (Open JSON application interface) APIs. Tables are stored as first class objects in MapR-FS volumes, and are sharded into table regions / tablets. JSON document tables are schemaless, support read and write access to individual document fields, subsets of fields or whole documents, finding documents by id or native secondary indexes, a set of atomic operations for mutating documents, a change data capture API, and integration with Spark, Hive and MapReduce. Wide column (binary) tables are largely equivalent to HBase tables, and partially support the HBase API, but without support for custom HBase filters or co-processors. Supports replication at the table, column family or column level, either synchronously or asynchronously, and in either master-master or master-slave configurations, with support for replicating to Elasticsearch. Authorisation is managed through access control expressions (ACEs) at the field level (for JSON document tables) or at the column level (for wide column tables). Introduced in MapR v4.0 in Sept 2014, with document support added in MapR 5.1 in Feb 2016.
MapR-ESTechnology for buffering and storing real-time streams of data, built over MapR-FS, with support for a Kafka compatible API. Messages (key/value pairs where the key is optional) are organised into topics, which are partitioned and stored as first class objects within MapR-FS volumes, with topics then grouped into streams. Supports encryption of streams, automatic deletion of messages (via a time to live), consumer groups, authorisation using ACEs (access control expressions), plus replication of topics to one or more remote MapR-ES instances either synchronously or asynchronously, including support for Kafka's MirrorMaker. Comes with Java, C and Python libraries and includes a Kafka compatible API. Introduced in MapR 5.1 in Feb 2016. Previously called MapR Streams.
MapR Expansion PackA package of open source Hadoop projects certified to work together against one or more versions of the MapR Converged Data Platform. Has new major releases roughly once a quarter, with most components kept reasonably up to date with the open source version, with any patching done publicly in GitHub. Available as RPMs, and installable via the MapR Installer. These components were originally bundled as part of the MapR Converged Data Platform, but were broken out as the MapR Ecosystem Pack in September 2016 to allow them to be released independently. Renamed to the MapR Expansion Pack as of version 4.0.
MapR-FSResilient distributed cluster file system that supports HDFS and NFS/POSIX (v3/v4) access. An S3 compatible API is provided by the MapR Object Store gateway bundled as part of the MapR Expansion Pack. Supports POSIX compliance, arbitrary in place updates to files (unlike HDFS which is append only), distributed metadata (it has no equivalent of the HDFS Name Node), block level mirroring to a remote cluster for DR or load balancing, encryption at rest, automatic storage tiering (including to external object storage) and snapshots (which provide point in time read only views). Data is stored in containers (which manage data blocks and the replication of these over the cluster), and logically organised into volumes (which manage files, directories and block allocation across one or more containers), which also provide multi-tenancy support, with administrative control, data placement, job execution, snapshots and mirroring all configurable against a volume. Supports encrypted communications, full auditing capabilities, Kerberos and Linux PAM for authentication, authorisation via ACLs (against clusters, volumes and job queues), POSIX file permissions (against files and directories) and Access Control Expressions (ACEs, arbitrary boolean expressions against volumes, files and directories). First released as part of MapR v1.0 in 2010.
MapR MonitoringA collection of open source components used to capture, store and visualise metrics and log messages across a MapR Converged Data Platform. Components used include collectd (to capture metrics), OpenTSDB (a time-series database that runs on top of MapR-DB to store metrics), Grafana (to visualise and graph metrics into dashboards), FluentD (to collect and parse log messages), Elasticsearch (to store and index log messages for search) and Kibana (to search and view log messages). Metrics captured include cpu, disk, memory and network metrics, plus metrics for Drill, YARN and the MapR components. Log messages are captured from system logs, plus YARN, ZooKeeper, Drill, HBase, Hive, Oozie, Spark, the MapR component logs, and the logs for the MapR Monitoring components. Both Grafana and Kibana include starter sample dashboards. First released in June 2016 as part of the Spyglass initiative.
Mesosphere DC/OSA distributed, hybrid-cloud operating system for elastic stateless micro services running in containers and stateful big data services, ensuring high datacenter utilization. At its core, Apache Mesos handles job scheduling, resource management and abstraction, high availability, infrastructure-level processes and pluggable containerizers for both Docker and native Mesos containers. Combined with Marathon, provides a container orchestration platform with support for launching, managing, scaling and networking containers. Focused on ease of use, provides an app-store-like service catalog (Universe) to install complex distributed systems including HDFS, Apache Spark, Apache Kafka, Apache Cassandra and CI/CD applications such as Jenkins, all of which have been optimised to run on Apache Mesos, and a web interface for monitoring and management. Comes in two flavors: a free community edition for installation in the cloud, and a commercial enterprise edition for on-premises, cloud or hybrid environments that includes monitoring tools, support for enterprise security and compliance tools, advanced networking, and load balancing features. Offered via a subscription license, the enterprise edition also includes product support. Open sourced in April 2016 under the Apache 2.0 license, under active development led by Mesosphere with a range of contributors including Microsoft.
Mesosphere MarathonA framework for Apache Mesos and Mesosphere's Datacenter Operating System (DC/OS) to launch long-running services in a clustered environment and ensure that they continue to run in the event of a hardware or software failure. Implemented as a Mesos framework, leverages Mesos for resource allocation and isolation and provides a REST API and web interface for service definition, discovery and management. Provides constraint controls to support service placement for high availability and locality, an event bus and health checking to support rolling deployments and upgrades. Provides local and external persistent storage and resurrection on the same node in the event of a failure to support stateful services (in beta). Often used as an orchestrator for other applications and services, can be run in highly-available mode by running multiple copies of the framework and using ZooKeeper to perform leader election in the event of a failure. Written in Scala, open sourced under the Apache 2.0 license, hosted on GitHub, with development led by Mesosphere who also distribute it as part of their Datacenter Operating System (DC/OS) commercial offering.
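A sketch of deploying a long-running service via the Marathon REST API (the Marathon URL, app id and health check settings are assumptions for illustration):

```python
import requests

# Marathon keeps the requested number of instances running, restarting or
# rescheduling tasks elsewhere in the cluster on failure.
app = {
    "id": "/web-service",                      # hypothetical app id
    "cmd": "python3 -m http.server 8080",
    "cpus": 0.1,
    "mem": 64,
    "instances": 2,
    "healthChecks": [{
        "protocol": "TCP",
        "gracePeriodSeconds": 30,
        "intervalSeconds": 10,
    }],
}

resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
resp.raise_for_status()
print(resp.json()["id"], "submitted for deployment")
```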
Microsoft Azure Blob StorageAn object store service with strong consistency, with support for multiple blob types (block, page and append), multiple storage tiers (premium, hot, cool and archive) and deep integration to the Azure ecosystem. Block blobs are composed of one or more blocks with operations done at the block level with changes made visible via a final commit; page blobs are collections of 512-byte pages optimised for random read and write operations against one or more pages; and append blobs only support modification via the addition of new data to the end of the blob. Objects are organised into containers and indexed by string, with the option to list objects by prefix and to summarise results based on a delimiter allowing a filesystem to be approximated. Supports name-value pair metadata against containers and objects, both optimistic and pessimistic (lock based) concurrency, snapshots (providing read only access to objects as they were when the snapshot was taken), soft deletes (allowing previous versions of objects to be recovered), immutable blobs, lifecycle management, access control via access tokens (shared access signatures), public access to containers, configurable geo redundancy, encryption of objects (Azure Storage Service Encryption - SSE) and support for SSL connections, multi-part uploads, the use of custom domains, and logging and metrics (Azure Storage Analytics). Provides a REST API, web app (Azure Storage Explorer), a range of SDKs, a CLI and PowerShell integration.
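A minimal sketch using the azure-storage-blob Python SDK (connection string, container and blob names are assumptions) showing a block blob upload with metadata and a prefix listing:

```python
from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("raw-data")   # hypothetical container

# Upload a block blob and attach name-value metadata.
blob = container.get_blob_client("2018/01/01/events.json")
with open("events.json", "rb") as data:
    blob.upload_blob(data, metadata={"source": "ingest", "schema": "v1"}, overwrite=True)

# List blobs by prefix, approximating a directory listing.
for item in container.list_blobs(name_starts_with="2018/01/01/"):
    print(item.name, item.size)
```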
Microsoft Azure Data Lake StoreMassively scalable HDFS compatible filesystem as a service, based on Microsoft's Cosmos technology. Claims support for up to trillions of files and single files larger than one petabyte, with no limits on account sizes, file sizes or the amount of data that can be stored, and optimisation of parallel analytics workloads, with high throughput and IOPS performance. Supports user authentication via Azure Active Directory (AAD) (combined with OAuth and OpenID), role based access control for account management, POSIX ACLs for controlling access to data, encryption for both stored data and data in transit over the network, and built-in auditing (of both data access and account management activities). Supports a standard WebHDFS API, an HDFS compatible interface (adl://) that's bundled with Apache Hadoop, a blob storage API, a web UI (Data Explorer) and SDKs for a range of languages. Does not natively support geo-replication with filesystems limited to a region, but data can be manually replicated via a number of routes if required. First announced in April 2015, with a general availability release in November 2016.
OpenLink Virtuoso Universal ServerMulti-model database (RDBMS, VDBMS) supporting tabular relational (SQL), graph relational (SPARQL), hybrid (SPARQL-in-SQL a/k/a SPASQL), XML (XPath, XQuery, XSLT), filesystem/objects, and other forms of data; First shipped in 1999, available as Open Source or Enterprise Edition; various add-ons available for Enterprise Edition; virtualizes local and/or remote tabular relational databases and/or other data sources as RDF semantic web data sources.
OpenStack SwiftAn open source object store with eventual consistency, that's available from a number of vendors as both an on site solution and a cloud based service offering. Objects are organised into containers and indexed by string, with the option to list objects by prefix and to summarise results based on a delimiter allowing a filesystem to be approximated. Supports configurable storage policies (each using a different storage ring allowing for differing hardware and replication levels to be used), erasure coding as well as standard replication (with erasure coding providing smaller storage overheads at the cost of higher CPU usage and read and write load), and multi-region clusters (based on configuring affinity for local operations). Also supports container and object metadata, object versioning, container to container mirroring via background synchronisation, authorisation via tokens from OpenStack Keystone, access control via container ACLs, support for large objects via segmentation (multi-part uploads combined with a special manifest file), scheduled and bulk object deletion, time limited access URLs, and encryption of data at rest. Provides a REST API and client SDKs. Originally created by Rackspace in 2009, becoming one of the first OpenStack technologies, with contributors now including SwiftStack, RedHat, HP, Intel, IBM among others.
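A hedged sketch using the python-swiftclient library (the auth URL, credentials and container names are assumptions, and the example uses simple v1-style authentication rather than Keystone):

```python
import swiftclient  # pip install python-swiftclient

conn = swiftclient.Connection(
    authurl="http://swift-proxy.example.com:8080/auth/v1.0",  # illustrative endpoint
    user="account:user",
    key="secret",
)

conn.put_container("raw-data")
conn.put_object("raw-data", "2018/01/01/events.json",
                contents=b'{"event": "login"}',
                content_type="application/json")

# List objects by prefix with a delimiter, approximating a directory listing.
headers, objects = conn.get_container("raw-data", prefix="2018/01/01/", delimiter="/")
for obj in objects:
    print(obj.get("name"), obj.get("bytes"))
```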
OpenTSDBA time series database built on top of Apache HBase (with support for Google BigTable and Apache Cassandra recently added). Each data point consists of a metric name, a UNIX timestamp, a value (either integer or floating point) and a set of key value pair tags, where the tags define the potential aggregations required. Data is stored with one row per metric, tag combination and hour, with all data points for that hour stored in that row under different column qualifiers based on the timestamp, allowing for more efficient in memory aggregations. Supports the recording (but not generation) of pre-aggregated data that will be used to accelerate queries, annotations (short text strings associated with timestamps and optionally time series that represent events), the organisation of metrics and tags into hierarchical trees, and the generation of statistics relating to performance, however currently does not support incrementing counters. Consists of a Time Series Daemon (TSD) (that exposes a Telnet RPC and HTTP JSON REST APIs and a simple web based UI for querying data) and a CLI (including the ability to batch import data), with each TSD operating independently of the others with no master or shared state allowing for horizontal scalability over a single underlying database. Supports a range of plugins, including the ability to support different deserialisation and authentication for HTTP REST calls, emission of metadata (metrics, tags and annotations) to a search engine, real time publishing of data points to another destination and support for other RPC protocols. Open sourced on GitHub under both an LGPLv2.1+ and GPLv3+ licence, with development started in 2010, and has been adopted by a number of large organisations including MapR, Yahoo, Tumblr and eBay.
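A minimal sketch against the TSD's HTTP JSON API (host, metric and tag names are illustrative), writing a data point and querying it back with an aggregation driven by the tags:

```python
import requests

# Write a single data point via /api/put; the tags determine which
# aggregations are possible at query time.
point = {
    "metric": "sys.cpu.user",
    "timestamp": 1356998400,        # UNIX timestamp in seconds
    "value": 42.5,
    "tags": {"host": "web01", "dc": "lga"},
}
resp = requests.post("http://tsd.example.com:4242/api/put", json=point)
resp.raise_for_status()

# Query the last hour of the metric via /api/query, summed across hosts in one dc.
query = {
    "start": "1h-ago",
    "queries": [{"aggregator": "sum", "metric": "sys.cpu.user",
                 "tags": {"dc": "lga"}}],
}
print(requests.post("http://tsd.example.com:4242/api/query", json=query).json())
```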
PravegaTechnology for the buffering and long term storage of streaming data, designed for low latency and high throughput, with support for exactly once semantics, durable writes, strict ordering, dynamic scaling, transactions and long term storage backed by HDFS. Data is stored in named streams (continuous streams of bytes organised into Events, with serialisation and de-serialisation done in clients), with streams partitioned by an Event Routing Key into stream segments. Data is stored in two tiers, the first using Apache BookKeeper for recent data, the second using HDFS for long term storage, with automatic ageing of data and seamless reads across tiers. Operates on a publish/subscribe model, with subscribers able to select any point in history to read from. Supports automatic scaling of streams (dynamically increasing or decreasing the number of stream segments based on the operations per second on the stream), exactly once semantics (ensuring events are read once and once only even after failure), durable writes (data is persisted before write operations are acknowledged), transactions (multiple events can be committed in a single operation), ordered streams (events for a given Routing Key will always be read in the same order they're written), ReaderGroups (allows multiple subscribers to co-ordinate reads from a single stream) and a state synchroniser API (allowing multiple clients to synchronise arbitrary state through Pravega). Supports a Java SDK and out of the box integration with Flink, along with support for deployment using Docker Swarm, DC/OS and AWS (all currently in development). Open sourced under an Apache 2.0 licence, started in July 2016 within Dell EMC, and does not yet have a first formal release, but is under active development by a wide range of contributors. Stated plans for future functionality include automatic deletion of data based on a retention period, support for other tier 2 storage technologies, access control, runtime metrics and Spark support.
PrestoAn MPP query engine that supports queries over one or more underlying databases with the ability to join data from multiple datastores together. Supports a range of underlying technologies including Accumulo, Cassandra, Hive (HDFS), Kafka, Kudu, Redshift and a number of relational databases, with schemas read from the underlying database and cached within Presto. Architecture consists of a coordinator node that parses and plans queries, and worker nodes that execute tasks and process data. Extensible for new database connectors, data types, functions, access control schemas and event listeners. Supports resource management, spilling to disk when processing large result sets, a cost based optimiser, Kerberos and LDAP authentication, a CLI, a web interface for monitoring and managing queries, and JDBC and ODBC drivers. Created at Facebook, announced and open sourced in 2013. Commercial support was originally provided by Hadapt, which was acquired by Teradata in 2015, before being spun out as Starburst in late 2017, who now provide an enterprise distribution and commercial support and services.
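A hedged sketch of a federated query (assuming the presto-python-client package and hypothetical hive and mysql catalogs, schemas and tables) that joins data from two datastores using Presto's catalog.schema.table addressing:

```python
import prestodb  # pip install presto-python-client (an assumption for illustration)

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# Fully qualified names let a single query span connectors/datastores.
cur.execute("""
    SELECT o.order_id, o.total, c.segment
    FROM hive.sales.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    WHERE o.order_date >= DATE '2018-01-01'
""")
for row in cur.fetchmany(10):
    print(row)
```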
Quantcast File SystemOpen source HDFS compatible distributed file system, which focuses on improving performance and scalability over HDFS. Uses erasure coding (specifically Reed-Solomon error correction) allowing each data block to be stored with a 50% overhead over 9 nodes with data able to be read from any 6 (half the space required by HDFS with 3 way replication). Also supports online addition of new data (chunk) nodes, automatic re-balancing and re-replication of data, Unix style permissions support and C++ and Java client libraries. Published benchmarks suggest a 50/75% read/write performance increase over HDFS, and significantly faster metadata operations. Now also runs over Amazon S3. Built and maintained by Quantcast, who open sourced it in August 2012. An evolution of the Kosmos File System (KFS), an open source project started by Kosmix in 2005, which Quantcast first adopted in 2007. Built in C++ and released under the Apache 2.0 licence.
Qubole Data ServiceHadoop as a managed service over AWS, Azure, Google Cloud Platform and Oracle Cloud. Supports Airflow, Hadoop, Presto and Spark cluster types, automatic management (starting, stopping and scaling) of clusters based on workload, automatic shared Hive metastores within accounts, role based access control (to accounts, clusters and UI/API functionality, with Hive authorisation to manage access to data), connectivity to external databases (Data Stores), labelling of clusters and routing of commands by label (allowing graceful cluster upgrades), custom node bootstrap commands, encryption, auditing, data caching (on AWS only, via the open source Rubix project) and ODBC/JDBC drivers. Has a rich web based user interface that supports exploration of data (in Hadoop, object stores and connected external databases), a command composer (supporting Hive, Presto, Pig, Shell, Spark and Workflow commands) with auto completion and command history, parameterisable command templates, data management (import, export and upload), a visual query builder (Smart Query), Zeppelin based notebooks (including publication of public read only notebook views), command schedulers, cluster management and a range of usage and cluster metrics and graphs. Also supports a REST API. Priced per hour based on the cloud infrastructure being used, which is in addition to any cloud vendor costs. Launched in 2013.
RecordServiceAbstraction layer for accessing structured data in Hadoop that enforces fine grained access control (via Apache Sentry). Started in January 2015 and announced with an initial beta release in September 2015 and a stated plan to donate it to the Apache Foundation, however all code and documentation were taken down at the end of 2017, with the download page on Cloudera's website now simply stating that 'RecordService is in development'. Functionality that was available included support for reading data from HDFS and S3 in Parquet, Text, Sequence File, RC and Avro formats via a Hive table/view definition or a file path, with support for HBase and Kudu planned, direct access to data via C++ and Java APIs plus integration with MapReduce, Spark, Impala and Pig, with support for Hive planned, and support for the Apache Sentry security model, including table, view, file (via grants on uris to create external tables) and column level security, with row level filtering and data masking planned.
REX-RayOpen source storage management solution that provides containers with access to external storage systems outside of the container's host, enabling stateful applications such as databases to be run inside containers. Allows applications to save data beyond the lifecycle of a container and provides high-availability features for container restarts across hosts. Operates as a command line interface and lightweight agent that can be integrated into container runtimes (e.g. Docker, Mesos) to provide storage functionality such as volume creation, attaching, and mounting processes as well as container orchestrators (e.g. Docker Swarm, Kubernetes, or Marathon for Mesos) to attach a volume to a new host and resume state in the event of a host failure. Built on top of the libStorage library (also from Dell EMC), provides a storage plugin framework that allows access to multiple storage providers and platforms (Amazon EBS, EFS, S3FS, Dell EMC ScaleIO, Isilon etc.) and a flexible architecture that allows it to be deployed in a standalone, decentralised fashion on each container host or as a centralised service for easier management at large scale. Written in Go, open sourced under the Apache 2.0 licence, hosted on GitHub, with development led by Dell EMC. Has not yet reached a v1.0 milestone, but is still under active development.
Scality RINGA massively scalable commercial object store available as software for deployment on premises on commodity hardware. Based around a native object store core, but with POSIX filesystem support, and support for a range of APIs including file based (NFS, SMB and Linux FUSE), object based (S3 compatible and native Scality REST APIs), and OpenStack compatible (Swift, Cinder, Glance and Manila). Supports both variable level data replication and erasure coding, object encryption, file and object versioning, multi-site support (via synchronous and asynchronous replication, including support for replicating to Amazon S3), data location control, support for arbitrarily large objects, rolling upgrades and full authentication and access controls (including support for LDAP, Active Directory, AWS IAM and Kerberos). Comes with a CLI and web based GUI, and an add on solution (Scality Cloud Monitor) is available for monitoring and management. Sold by Scality, who were founded in 2010 and who focus on selling and supporting Scality RING, but who have also open sourced their S3 API as Zenko Cloudserver (previously S3 Server).
Schema RegistryA centralised registry for data schemas with support for NiFi, Kafka and Streaming Analytics Manager, allowing schemas to be defined and versioned centrally and removing the need to attach schema to every piece of data. Supports versioning of schemas (with a definable compatibility policy that validates that schemas are forward compatible, backward compatible, both or none), the ability to store and serve JAR files for serialising and de-serialising data, a REST API, Java SDK and web based user interface for managing schemas. NiFi integration supports record level operations (via RecordReaders and RecordSetWriters); Kafka integration supports Kafka Producers and Consumers. Requires a MySQL backend for schema storage, and either local or HDFS storage for serialiser/de-serialiser JAR files. Stated plan is to support a wider range of schema types (currently only Avro schemas are supported), a range of other registry requirements (e.g. templates, machine learning models or business rules), and for integration with Apache Atlas and Ranger. Started by Hortonworks in October 2016, with an initial release as part of HDF 3.0 in June 2017.
Streaming Analytics ManagerA suite of open source web based tools to develop and operate stream analytics solutions and analyse the results, with pluggable support for the underlying streaming engine. Consists of Stream Builder (a web based GUI for building streaming data flows), Stream Operations (a web based management and operations tools for streaming applications) and Stream Insight (a bundling of Druid and Apache Superset to serve and analyse the results of streaming applications). Stream Builder supports creation of streaming flows using a drag and drop GUI, with support for a range of sources (including Kafka and HDFS), processors (including joins, window/aggregate functions, normalisation/projection and PMML model execution), and sinks (including email, HDFS, HBase, Hive, JDBC, Druid, Cassandra, Kafka, OpenTSDB and Solr), as well as support for custom sources, processors, sinks and functions (including window functions), and the ability to automatically deploy and execute applications. Stream Operations supports the management of multiple execution environments, the deployment, execution and management of applications within an environment, the capture of stream metrics via pluggable metrics storage (with support for Ambari and OpenTSDB), and web based dashboards to monitor applications and visualise key metrics. Started by Hortonworks in May 2015, with an initial release as part of HDF 3.0 in June 2017.
StreamSets Data CollectorGeneral purpose technology for the movement of data between systems, including the ingestion of batch and streaming data into an analytical platform. Pipelines are configured in a graphical user interface, and consist of a single origin, one or more processor stages and then one or more destinations, with support for a wide range of source/destination technologies and processor transformations. Supports a wide range of data formats, executors (tasks that can be triggered based on events from pipelines, e.g. to send e-mails or run a shell script), handling of erroneous records, support for CDC CRUD records, previewing of data within the editor UI, real-time reporting and alerting on a range of execution and data quality metrics, the ability to dynamically handle changes to schemas and the semantic meaning of data and a full Python SDK. Can run in standalone mode (as a single process, with the option to run single or multi-threaded), as a Spark Streaming or MapReduce job on a cluster, or in an ultralight agent (StreamSets Data Collector Edge). Java based, Open Source under the Apache 2.0 licence, hosted on GitHub, with development led by StreamSets who also provide commercial support and a number of commercial add-ons, including Control Hub (cloud service for developing and managing pipelines), Dataflow Performance Manager (for managing data metrics) and Data Protector (for managing sensitive data). Started in October 2014, with a v1.0 release in September 2015.
ZenkoA distributed and resilient Amazon S3 API compatible object storage gateway / proxy. Utilises Zenko CloudServer (previously S3 Server) to provide an S3 compatible API, to proxy requests to either Scality RING, Amazon S3, Azure Blob Storage or Google Cloud Storage, and to provide persistent local storage or transient in-memory storage. Current solution is a Docker Swarm stack of a cluster of Zenko CloudServer instances with nginx as a front end load balancer. Manageable via Zenko Orbit, a cloud based portal. Roadmap includes support for Azure Blob Storage, support for other container management systems such as Kubernetes, plus two new sub-projects - Backbeat (which will provide policy-based data workflows such as replication or migration) and Clueso (which will provide object metadata search and analytics using Apache Spark). First released in July 2017, and hosted on GitHub under an Apache 2.0 licence.
Zenko >  Zenko CloudServerOpen source object storage server based on the S3 compatible API from Scality RING, with the ability to proxy requests to other S3 services (with support for Scality RING, Amazon S3, Azure Blob Storage and Google Cloud), or to store data in persistent local storage or transient in-memory storage, with support for concurrent use of multiple backends. Supports broad compatibility with the Amazon S3 API including bucket and object versioning, and has been tested against a range of Amazon S3 utilities, CLIs and SDKs. Written in Node.js, available as a Docker container, and can be deployed and used independently of the rest of Zenko. Metadata and (locally persisted) data is managed by a data and metadata daemon (dmd), with the option to use a shared remote daemon (for example when running a cluster of CloudServers). First released in June 2016 as S3 Server before being renamed to CloudServer and becoming part of Zenko in July 2017. Hosted on GitHub under an Apache 2.0 licence.
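Because the API is S3 compatible, standard S3 tooling can be pointed at a CloudServer instance; a sketch using boto3 (the local endpoint, port and the docker image's demo credentials are assumptions):

```python
import boto3

# Point a standard S3 client at a locally running CloudServer instance.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",       # assumed local CloudServer endpoint
    aws_access_key_id="accessKey1",             # assumed demo credentials
    aws_secret_access_key="verySecretKey1",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from CloudServer")
listing = s3.list_objects_v2(Bucket="demo-bucket")
print([obj["Key"] for obj in listing.get("Contents", [])])
```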