Apache Spark

A high-performance, general-purpose distributed data processing engine, based on directed acyclic graphs (DAGs), that primarily runs in memory but can spill to disk if required, and that supports applications written in Java, Scala, Python and R (SparkR). Includes a number of sub-projects that support more specialised analytics, including Spark SQL (batch and streaming analytics using declarative logic over structured data), Spark Streaming (micro-batch stream processing), MLlib (machine learning) and GraphX (graph analytics). Requires a cluster manager (YARN, Kubernetes and Mesos are supported, as well as standalone clusters, with scripts available for launching on EC2) and can access data in a wide range of technologies (including HDFS, other Hadoop data sources, relational databases and NoSQL databases). An Apache project, originally started at UC Berkeley in 2009, open sourced in 2010, and donated to the Apache Software Foundation in June 2013, graduating in February 2014. v1.0 was released in May 2014, with a v2.0 release in July 2016. JVM based, with development led by Databricks (who sell a hosted Spark service), and with commercial support available as part of most Hadoop distributions.
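
As an illustration of the core programming model (a minimal sketch only: the application name, local master and input path below are hypothetical), a word count expressed as RDD transformations builds a DAG that is only executed when an action is invoked:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // App name, master and path are illustrative; on a real cluster the
        // master would be provided by YARN, Mesos or Kubernetes
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()

        // Transformations (flatMap, map, reduceByKey) only describe the DAG;
        // intermediate results are held in memory, spilling to disk if required
        val counts = spark.sparkContext
          .textFile("hdfs:///data/input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)   // the action triggers execution of the DAG
        spark.stop()
      }
    }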

Technology Information

Other Names: Spark
Vendors: The Apache Software Foundation
Type: Commercial Open Source
Last Updated: November 2018 (v2.4)

Sub-projects

Apache Spark > GraphX: Spark library for processing graphs and running graph algorithms, based on a graph model that supports directional edges with properties on both vertices and edges. Graphs are constructed from a pair of collections representing the edges and vertices, either directly from data on disk using builders or prepared using other Spark functionality, with the ability to also view the graph as a set of triplets. Supports a range of graph operations, as well as an optimised variant of the Pregel API, and a set of out-of-the-box algorithms (including PageRank, connected components and triangle count). First introduced in Spark 0.9, with a production release as part of Spark 1.2, however it has seen almost no new functionality since then.
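
A minimal sketch of this model (the vertices and edges below are made up): a graph is built from a vertex collection and an edge collection, viewed as triplets, and run through the built-in PageRank algorithm:

    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.sql.SparkSession

    object GraphXSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("GraphXSketch").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // A property graph: vertex properties plus directional edges with properties
        val vertices = sc.parallelize(Seq[(VertexId, String)](
          (1L, "alice"), (2L, "bob"), (3L, "carol")))
        val edges = sc.parallelize(Seq(
          Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
        val graph = Graph(vertices, edges)

        // The triplet view joins vertex and edge properties
        graph.triplets.collect()
          .foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))

        // One of the out-of-the-box algorithms, run to a convergence tolerance
        graph.pageRank(0.001).vertices.collect().foreach(println)

        spark.stop()
      }
    }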
Apache Spark > MLlib: Spark library for running machine learning algorithms. Supports a range of algorithms (including classification, regression, decision trees, recommendation, clustering and topic modelling), including iterative algorithms. As of Spark 2.0 the primary API is DataFrame (Spark SQL) based, with the original RDD-based API now in maintenance mode. First introduced in Spark 0.8 after being collaboratively developed with the UC Berkeley MLbase project, and still under active development.
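
A minimal sketch of the DataFrame-based API (the training data below is made up): a LogisticRegression estimator is fitted to a DataFrame of labels and feature vectors, producing a model that can transform further DataFrames:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    object MLlibSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("MLlibSketch").master("local[*]").getOrCreate()

        // A tiny in-memory training set of (label, features) rows
        val training = spark.createDataFrame(Seq(
          (1.0, Vectors.dense(0.0, 1.1, 0.1)),
          (0.0, Vectors.dense(2.0, 1.0, -1.0)),
          (0.0, Vectors.dense(2.0, 1.3, 1.0)),
          (1.0, Vectors.dense(0.0, 1.2, -0.5))
        )).toDF("label", "features")

        // Estimators are configured, then fitted to produce a model (a transformer)
        val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
        val model = lr.fit(training)

        model.transform(training).select("features", "label", "prediction").show()
        spark.stop()
      }
    }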
Apache Spark > Spark SQL: Spark library for processing structured data, using either SQL statements or a DataFrame API. Supports querying and writing to local datasets (including JSON, Parquet, Avro, ORC and CSV) as well as external data sources (including Hive and JDBC), with the ability to query across data sources. Includes Catalyst, an extensible query optimiser (with cost-based optimisation from Spark 2.2) that turns high-level operations into low-level Spark DAGs for execution. Also includes a Hive-compatible Thrift JDBC/ODBC server that works with Beeline and the Hive JDBC and ODBC drivers, and a REPL CLI for interactive queries. Introduced in Spark 1.0, with a production release in Spark 1.3 and substantially improved SQL functionality in Spark 2.0.
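
A minimal sketch of the two query styles (the file paths and column names are illustrative): the DataFrame API and SQL statements over a temporary view both compile to the same Catalyst plans:

    import org.apache.spark.sql.SparkSession

    object SparkSQLSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("SparkSQLSketch").master("local[*]").getOrCreate()

        // Read a local JSON dataset into a DataFrame
        val people = spark.read.json("data/people.json")

        // Equivalent queries via the DataFrame API and via SQL
        people.filter(people("age") > 21).show()
        people.createOrReplaceTempView("people")
        spark.sql("SELECT name, age FROM people WHERE age > 21").show()

        // Write the result back out in another supported format
        people.write.parquet("data/people.parquet")
        spark.stop()
      }
    }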
Apache Spark > Spark Streaming: Spark library for continuous stream processing, using a DStream (discretized stream) API. Uses a micro-batch execution model that leverages core Spark to execute the specified logic against each micro-batch (a DStream is a sequence of Spark RDDs), with the ability to also use other Spark batch operations (including Spark SQL and MLlib) against each micro-batch. This model also provides fault tolerance through exactly-once processing semantics. Supports a number of data sources (including HDFS, sockets, Flume, Kafka, Kinesis and messaging buses), as well as functions to maintain state and to execute windowed operations. First introduced in Spark 0.7, with a production release as part of Spark 0.9, however development appears to have largely stopped following the introduction of Structured Streaming in Spark 2.0.
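
A minimal sketch of the DStream model (the host, port and checkpoint directory are illustrative): each 10-second micro-batch is an RDD, here word-counted over a 60-second sliding window:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamSketch {
      def main(args: Array[String]): Unit = {
        // Micro-batches of 10 seconds; local[2] leaves a core free for the receiver
        val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("/tmp/dstream-checkpoint")

        // The socket source is the simplest; Kafka, Flume, Kinesis etc. have their own
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map((_, 1))
          .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))   // 60s window, sliding every 10s

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }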
Apache Spark > Structured Streaming: Extension to the Spark SQL DataFrame API that allows Spark SQL queries to be executed over streams of data, with the engine continuously updating and maintaining the result as new data arrives. Uses the full Spark SQL engine (including the Catalyst optimiser), and supports end-to-end exactly-once semantics via checkpointing when sources have sequential offsets. Supports aggregations over sliding event-time windows, including support for late data and watermarking. Introduced in Spark 2.0, with a production release in Spark 2.2.
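
A minimal sketch using the built-in rate source (which generates timestamped test rows): a sliding event-time window with a watermark bounding how late data may arrive, with the maintained result continuously updated on the console:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}

    object StructuredStreamingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("StructuredSketch").master("local[*]").getOrCreate()

        // The rate source emits (timestamp, value) rows, useful for testing
        val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

        // 1-minute windows sliding every 30 seconds; the watermark lets the engine
        // accept data up to 1 minute late and discard state for older windows
        val counts = stream
          .withWatermark("timestamp", "1 minute")
          .groupBy(window(col("timestamp"), "1 minute", "30 seconds"))
          .count()

        val query = counts.writeStream
          .outputMode("update")   // the maintained result is updated as new data arrives
          .format("console")
          .start()

        query.awaitTermination()
      }
    }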

Related Technologies

Is packaged by: Apache Bigtop, Hortonworks Data Platform, Cloudera CDH, MapR Expansion Pack, Cloudera Altus Data Engineering, Amazon EMR, Google Cloud Dataproc, Qubole Data Service

Release History

version | release date | release links                  | release comment
2.2     | 2017-07-11   | release notes; databricks view | cost-based optimiser
2.3     | 2018-02-28   | release notes; databricks view | Kubernetes support; stream-to-stream joins; continuous streaming
2.4     | 2018-11-02   | release notes; databricks view |
