Streaming Data Stores

Our list of commercial, open source and cloud-based streaming data stores, with summary information on each, including Kafka, Confluent, MapR-ES and their alternatives.

Category Definition

Technologies for the persistent storage of continuous streams of data, with data access based on a publish/subscribe model. Technologies in this category should support: multiple independent publishers and subscribers; the ability to add new subscribers and replay the history of a stream; horizontal scalability and load balancing; durable writes; ordered streams (data is always read in the order it was written); high throughput and low latency; handling of updates and deletes to source records; and the ability to secure the data.
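To make the definition concrete, the following is a minimal in-memory sketch of the core model: an append-only log, independent subscribers that each track their own read position, and the ability to replay history. The class and method names (StreamStore, publish, poll, seek) are hypothetical and are not the API of any product listed on this page. Durability, scaling and security are deliberately left out.

```python
from collections import defaultdict


class StreamStore:
    """Toy stream store (hypothetical): an append-only log plus per-subscriber offsets."""

    def __init__(self):
        self._log = []                    # records kept in write order
        self._offsets = defaultdict(int)  # subscriber name -> next index to read

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._log.append(record)
        return len(self._log) - 1

    def poll(self, subscriber, max_records=10):
        """Return the next unread records for this subscriber, in write order."""
        start = self._offsets[subscriber]
        batch = self._log[start:start + max_records]
        self._offsets[subscriber] = start + len(batch)
        return batch

    def seek(self, subscriber, offset=0):
        """Move a subscriber's read position so it can replay history."""
        if not 0 <= offset <= len(self._log):
            raise ValueError("offset out of range")
        self._offsets[subscriber] = offset


store = StreamStore()
for event in ["a", "b", "c"]:
    store.publish(event)

first = store.poll("billing")     # an existing subscriber reads the stream
late = store.poll("audit")        # a subscriber added later still sees the full history
store.seek("billing", 1)          # rewind "billing" to offset 1
replayed = store.poll("billing")  # re-reads the records from offset 1 onwards
```

Because each subscriber holds its own offset, reading never removes data from the log. This is what allows new subscribers to be added and history to be replayed without affecting other readers.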

Open Source Technologies

The following are open source Streaming Data Store technologies:

Apache Kafka	Technology for buffering and storing real-time streams of data between publishers and subscribers, with a focus on high throughput at low latency.
Confluent Open Source	A package of open source projects built around Apache Kafka with the addition of the Confluent Schema Registry, Kafka REST Proxy, a number of connectors for Kafka Connect and a number of Kafka clients (language SDKs).
Pravega	Technology for the buffering and long-term storage of streaming data, designed for low latency and high throughput. Supports exactly-once semantics, durable writes, strict ordering, dynamic scaling and transactions, with long-term storage backed by HDFS.
Apache BookKeeper	Distributed log storage service from Yahoo - http://bookkeeper.apache.org/
Apache DistributedLog	Distributed log service from Twitter supporting durability, replication and strong consistency built over Apache BookKeeper - http://bookkeeper.apache.org/distributedlog/
Apache Pulsar	Distributed pub-sub messaging from Yahoo, with persistent message storage based on Apache BookKeeper - http://pulsar.incubator.apache.org/
LogDevice	Open source distributed data store for sequential data from Facebook - https://logdevice.io/

Note that Apache Kafka is bundled with a number of Hadoop distributions.

Commercial Technologies

The following are commercial Streaming Data Store technologies:

Confluent Enterprise	A commercial version of the Confluent Open Source product, with the addition of a number of commercial closed source products including a JMS client, Control Centre (for managing Kafka clusters), Multi DC Replication (active-active replication between Kafka clusters) and Auto Data Balancing.
MapR-ES	Part of the MapR Converged Data Platform - supports streaming data storage capabilities and a Kafka compatible API
AMQ Streams	Kafka distribution from Red Hat that runs on OpenShift - https://access.redhat.com/products/red-hat-amq-streams

Technologies Available as a Service

The following are Streaming Data Store technologies available as a managed service in the cloud:

Confluent Cloud	Confluent Enterprise as a service - https://www.confluent.io/confluent-cloud/
Amazon Kinesis Streams	Streaming data storage and publish service - https://aws.amazon.com/kinesis/streams/
Amazon Managed Streaming for Kafka (MSK) (public preview)	Fully managed, highly available, and secure Apache Kafka service - https://aws.amazon.com/msk/
Azure Event Hubs	Elastic service for the buffering and publishing of streaming event data with a Kafka compatible end point - https://azure.microsoft.com/en-us/services/event-hubs/
Google Cloud Pub/Sub	Real time message and streaming data service with “at least once” delivery - https://cloud.google.com/pubsub/
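
With "at least once" delivery the same message may be delivered more than once, so subscribers are usually written to be idempotent. The following is a minimal sketch of one way to do that, by remembering the IDs of messages already processed. The function name make_idempotent_handler and the message IDs are hypothetical. It does not use any cloud client library, and a real subscriber would need to keep the set of seen IDs in durable storage rather than in memory.

```python
def make_idempotent_handler(apply):
    """Wrap `apply` so that each message ID is processed at most once (hypothetical helper)."""
    seen = set()  # assumption: in-memory only; a real subscriber would persist this

    def handle(message_id, payload):
        if message_id in seen:
            return False       # duplicate delivery, skip it
        apply(payload)
        seen.add(message_id)   # record the ID only after processing succeeds
        return True

    return handle


totals = []
handle = make_idempotent_handler(totals.append)

# "m1" is delivered twice, as can happen under at-least-once delivery.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
results = [handle(message_id, payload) for message_id, payload in deliveries]
```

After the three deliveries, the payload for "m1" has been applied only once.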

Blog Posts

Streaming Data Stores	2017-06-30	Streaming Data Stores, Kafka, Pravega	Peter