The Mid Week News - 07/06/2017
Technologies: Apache Hadoop, HDFS, Apache Solr, Apache Spark, Spark SQL, Apache Kafka, Apache Ignite, Apache Apex, Apache Flink, Hortonworks Data Platform, Apache NiFi, MiNiFi, Cloudera Altus, Cloudera Director
Peter
We interrupt the current broadcast for another (semi-regular) catch-up on the news…
New technology releases (details are on the relevant technology pages):
- Apache Apex and DataTorrent RTS have both seen new releases
- MiNiFi has seen 0.2 releases of its Java and C++ versions
- Apache Flink has seen a 1.3 release
- Hadoop has seen its latest (alpha3) release of 3.0 - details here
Other technology news:
- Cloudera have released Altus (blog post; tech details; announcement), a service for running data engineering jobs (Spark, Hive and MapReduce) in the cloud on on-demand clusters. One for us to dig into further in the not too distant future, I think.
- Confluent have announced a cloud-based offering of their Apache Kafka-based solution, although it’s only in early access at the moment
- CockroachDB, a distributed SQL database that could be regarded as an open source version of Google Spanner, has hit 1.0
- Airbnb’s Superset has been donated to the Apache Software Foundation. This is well worth a look - it looks like an extremely capable data exploration platform
- Pravega is a new open source streaming storage system from Dell/EMC - see here for an introduction
Technology updates:
- A write-up of the new capabilities in NiFi 1.2 for processing large volumes of records more efficiently and running SQL on event streams
- If you’re interested in Apache Ignite, there’s a two-part getting started set of blog posts from GridGain here and here
- Solr is getting a new API
- Options for doing system maintenance on HDFS from Cloudera
- Hortonworks are working on Spark SQL integration with Apache Ranger, giving row/column level access control
- Cloudera are trumpeting their work on Spark and some of the new features they’ve enabled
- Cloudera’s pitch for why you should use Cloudera Director to give you cloud independence
- A summary of the history of Flink from dataArtisans
- There’s a new cloud data access guide for the Hortonworks Data Platform
Interesting blog posts:
- Automating the testing of data pipelines and then doing continuous integration is definitely a topic I want to talk more about (but I say that about everything). In the meantime, Databricks have an article on using Cucumber with Spark
- Again from Databricks, and this feels topical for us - 5 reasons for choosing S3 over HDFS
- Some thoughts from Bloor on open source, licences and whether some commercial open source products are really open source - here
- A post from Cloudera on Envelope, a pre-developed Spark application for doing bi-temporal change management
- Another one from The Morning Paper - processing a trillion-edge graph on a single machine
- The latest update from Bloor on graph technologies
- A case study on the use of OpenTSDB, Grafana, Kafka and Riemann for metrics collection and monitoring at Robinhood Engineering
- Confluent’s view on why streaming is the new ETL