2018

2018-01-10 Peter

It’s the first news back after the Christmas break, so brace yourself - it’s a massive, jam-packed bumper edition…

Technology updates (details are on the relevant technology pages):

The big one this week is Apache Hadoop 3.0 - there are links to the release notes on our Hadoop page and some links below to some commentary
Elasticsearch has hit 6.1, along with X-Pack and Elasticsearch Hadoop
Apache Solr 7.2 is out
Apache HBase 1.4 is out
Apache Drill is up to 1.12 - Kafka support is interesting
Apache Knox has hit 0.14
Apache Arrow has hit 0.8
Pravega - the Kafka challenger - has hit 0.2
MiNiFi has seen 0.3 releases of its Java version
Cloudbreak has seen its second 2.x technology preview release - 2.2

Other technology news:

Both ZDNet and Datanami have posts on Hadoop 3.0 and what the roadmap past this looks like - ZDNet; Datanami
Blog posts for Kudu 1.6 and Greenplum 5.3 have appeared and been added to their technology pages. Greenplum is looking to move to a fully containerised deployment model - which is interesting.
Azure HDInsight has seen a big price reduction and a bunch of new announcements - link; ZDNet commentary
An excellent article from Ehud Kaldor and SwiftStack on the differences between NFS and Object Storage - link
Hortonworks have published a set of pre-canned streaming analytics projects using HDP and HDF, including Ad Serving, Clickstream Analysis and Predictive Maintenance - link
A couple of old Databricks announcements we didn’t cover at the time for some reason:
- Databricks Unified Analytics Platform - Databricks runtime + interactive collaborative notebooks and dashboards + production job / notebook scheduling + enterprise security - homepage; blog
- Databricks Delta - a service over cloud blob stores like S3 that adds ACID transactions and support for automatic data indexing - homepage; blog
- And some thoughts from ZDNet - Spark in the cloud; Databricks strategy
Merv Adrian’s latest Hadoop tracker is up detailing the component versions used by the major Hadoop vendors - link
If you’ve got some time for reading, AtScale have a list of their top 10 posts and articles from 2017 - link
From ZDNet, their thoughts on big data in 2018 and the move to the cloud - link
The excellent DB-Engines site has announced its database of the year - link
DZone have published a Refcard for Kafka covering a whole pile of useful getting started information - link
A good write up of the features in Elasticsearch 6.0 from Logz.io - link
Are you running Kafka? We have a couple of posts this week from New Relic and Confluent on monitoring it - New Relic; Confluent
Azure Blob Storage now supports an archive level tier - link
A deep dive into the YARN capacity scheduler from Hortonworks - link
From Apache Flink - 2017 in review and plans for 2018 - link
dataArtisans have responded to the Databricks Spark Streaming vs Flink benchmark - link
Apache Mnemonic and Trafodion have graduated from the Apache Incubator - link; link
The Apache NiFi project has released the first (0.1) version of the NiFi Registry for the configuration management of flows - link
A write-up from ZDNet on StreamSets - link
It’s an old article, but still interesting - ZDNet looked at graph vs RDF databases - link
By comparison this is ancient (from 2015), but looks like a really good intro to the HBase architecture from MapR - link
At the risk of this becoming a ZDNet fest - their views on big data in 2017 and 2018 - link
An update from the Pravega blog on their architecture and design principles - link
For the deeply technical - how to build a distributed log (streaming data store) - link
And last but not least, from Sonra - dimensional modelling on Hadoop - link