2017 edit

We interrupt the current broadcast for another (semi-regular) catch-up on the news…

New technology releases (details are on the relevant technology pages):

Other technology news:

Cloudera have released Altus (blog post; tech details; announcement), a service for running data engineering jobs (Spark, Hive and MapReduce) in the cloud on on-demand clusters. One for us to dig into further in the not-too-distant future, I think.
Confluent have announced a cloud-based offering of their Apache Kafka-based solution, although it’s only in early access at the moment
CockroachDB, a distributed SQL database that could be regarded as an open source version of Google Spanner, has hit 1.0
Airbnb’s Superset has been donated to the Apache Software Foundation. This is well worth a look - it looks like an extremely capable data exploration platform
Pravega is a new open source streaming storage system from Dell/EMC - see here for an introduction

Technology updates:

A write up of the new capabilities in NiFi 1.2 for processing large volumes of records more efficiently and running SQL on event streams
If you’re interested in Apache Ignite, there’s a two-part getting started set of blog posts from GridGain here and here
Solr is getting a new API
Options for doing system maintenance on HDFS from Cloudera
Hortonworks are working on Spark SQL integration with Apache Ranger, giving row/column-level access control
Cloudera are trumpeting their work on Spark and some of the new features they’ve enabled
Cloudera’s pitch for why you should use Cloudera Director to give you cloud independence
A summary of the history of Flink from data Artisans
There’s a new cloud data access guide for the Hortonworks Data Platform

Interesting blog posts:

Automating testing of data pipelines and then doing continuous integration is definitely a topic I want to talk more about (but I say that of everything). In the meantime Databricks have an article on using Cucumber with Spark
Again from Databricks, and this feels topical for us - 5 reasons for choosing S3 over HDFS
Some thoughts on Open Source, licences and whether some commercial open source products are really open source from Bloor - here
A post from Cloudera on Envelope, a pre-developed Spark application for doing bi-temporal change management
Another one from The Morning Paper - processing a trillion-edge graph on a single machine
The latest update from Bloor on graph technologies
A case study on the use of OpenTSDB, Grafana, Kafka and Riemann for metrics collection and monitoring at Robinhood Engineering
Confluent’s view on why streaming is the new ETL