The Week That Was - 16/06/2017

Let’s remind ourselves of the plan for this week - Azure Data Lake Store, Druid, Cloudera Altus, Apache Superset and Pravega. How did I do? Three out of five.

I’ll come back to Cloudera Altus first thing next week, and to Pravega when I look at streaming data stores in the near future, but this week ended up being dominated by serendipity, by Hortonworks’ HDF 3.0 release (and their two new technologies - Schema Registry and Streaming Analytics Manager), and by a desire to have some content on new and breaking stuff.

Oh, and this week’s news post ended up being a bit of a bumper post, with some stuff we need to dig into.

Let’s take the week in chronological order.

We started off by looking at Azure Data Lake Store, a wrap-up from our look at Hadoop Compatible Filesystems. If you’re working in the cloud, your options to date have been one of the big object stores (such as Amazon S3), but those come with limitations on the size of an individual file, and a performance hit from the inability to read files in a massively parallel way. Azure Data Lake Store appears to be a pretty unique offering in the space, giving an HDFS-compatible filesystem that addresses these limitations at huge scale in the cloud.
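
If you’re wondering what that looks like in practice, here’s a minimal sketch (in Java, against the standard Hadoop FileSystem API) of listing files on the store via the adl:// scheme. The OAuth2 property names are the ones I believe the hadoop-azure-datalake connector uses, but check your version, and the account, tenant and credential values are obviously placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdlsListingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Service-to-service OAuth2 credentials for the store - all placeholder values,
        // and the property names may differ between Hadoop versions.
        conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential");
        conf.set("dfs.adls.oauth2.client.id", "<application-id>");
        conf.set("dfs.adls.oauth2.credential", "<client-secret>");
        conf.set("dfs.adls.oauth2.refresh.url",
                 "https://login.microsoftonline.com/<tenant-id>/oauth2/token");

        // The adl:// scheme exposes the store through the normal Hadoop FileSystem API,
        // so this is the same code you'd write against HDFS.
        FileSystem fs = FileSystem.get(
                new URI("adl://myaccount.azuredatalakestore.net/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```

The point being that anything that talks to storage through that API - MapReduce, Hive, Spark - should be able to point at an adl:// path without caring that it’s not HDFS underneath.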

Druid came next, by dint of it being included in tech preview in HDP 2.6. It seems a far more popular open source project than I’d realised, with some serious deployments, and it delivers on a use case that traditionally would have required significant hardware and software investment to do at scale. What’s interesting is Hortonworks’ interest: they’ve been significant committers for a while, they plan to integrate it with Hive, and they’re now bundling it with both HDP and HDF (although it’s in tech preview in both). That means they see a significant future for it, and to be honest I think that’s a pretty good bet.
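
To give a flavour of what using it looks like, here’s a rough sketch of posting one of Druid’s native JSON queries to a broker over HTTP - an hourly sum over an event stream. The datasource name, metric and intervals are made-up placeholders, and I’m assuming the broker is sitting on its default port.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class DruidQuerySketch {
    public static void main(String[] args) throws Exception {
        // A native Druid "timeseries" query: hourly sums of a count metric over a
        // fortnight. Datasource, metric and intervals are placeholders.
        String query = "{"
                + "\"queryType\": \"timeseries\","
                + "\"dataSource\": \"pageviews\","
                + "\"granularity\": \"hour\","
                + "\"intervals\": [\"2017-06-01/2017-06-16\"],"
                + "\"aggregations\": [{\"type\": \"longSum\", \"name\": \"views\", \"fieldName\": \"count\"}]"
                + "}";

        // POST the query to the broker (assuming the default broker port of 8082).
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8082/druid/v2/").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(query.getBytes(StandardCharsets.UTF_8));
        }

        // The broker responds with a JSON array of {timestamp, result} rows.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```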

Which leads us on to Apache Superset, recently donated to the Apache Foundation, and the tool of choice for use with Druid. Again, it hits a use case that’s traditionally been the preserve of commercial products, and again Hortonworks are going in, having been committers for a while and now bundling it (indirectly) with HDF (see later). Along with Druid, it’s got to be well worth a look if you have any sort of requirement for delivering OLAP / cube type capabilities to end users.

As part of the mid week news, we took a quick stop off to look at the bucket load of technologies that have been deprecated as part of HDP 2.6. Apache Accumulo, Apache Kafka and Apache Storm are going as they’re being moved into other Hortonworks products - Kafka and Storm into HDF (one presumes), but I’m less sure about Accumulo. Perhaps it’s destined to become an add-on like Hawq and Solr - time will tell I guess. Then there’s Apache Flume (advice is to consider HDF instead), Apache Mahout (advice is to consider Spark MLLib instead), Apache Slider (being folded into YARN) and Hue (advice is to consider Ambari Views instead). The more interesting one is Apache Falcon, which looks dead in the water with no commits for a number of months now, and with no clear replacement. The suggestion is that something’s coming, but it’s not clear what. If we get NiFi with intermediate files stored in HDFS and the ability to run arbitrary Spark / MapReduce jobs as processors then that would be lovely!

And so on to HDF 3.0. This looks like a big release for Hortonworks - they’re going all in on streaming data (IoT / analysis of data in motion / however else they’re selling it), and I can’t shake the feeling that they’re stealing a march on a bunch of competitors by selling an integrated set of technologies that fit this space, combined with the required security and governance bits. And yes, both MapR and Cloudera bundle Kafka and some sort of streaming tech (Storm for MapR, Spark Streaming for Cloudera), but it’s not a focus in the same way it is for Hortonworks. And the announcements this week around the new technologies they’re adding to HDF just reinforce the fact that they’re taking streaming analytics far more seriously than their competitors, and I think they’re going to reap the rewards.

So first up for HDF 3.0 was Schema Registry. It fills a gap that Confluent had filled for Kafka with a commercial solution, but across all the technologies in the HDF stack. Having multiple jobs reading and writing the same data means that they all need to know the schema and understand / be updated as and when the schema changes. You can solve this by bundling the schema in with the data (e.g. with Avro), but the overhead of doing that for individual records is huge. So instead you stick a schema version number on the record, and go and get the actual schema from the registry. Interoperability between jobs and the ability to evolve schemas in streaming solutions - done. But it seems like Hortonworks have bigger plans for this product, with the idea that it could support other items such as business rules and machine learning models that need to be re-used in multiple places.
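
To make the pattern concrete, here’s a little sketch of the envelope idea - note this is a hand-rolled in-memory stand-in to illustrate the concept, not the actual HDF Schema Registry client API. The writer registers the schema once and prepends its version id to every record; the reader pulls the id off the front, fetches the schema from the registry (and would cache it), and decodes the payload.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class SchemaEnvelopeSketch {

    // Stand-in for a real schema registry service: hands out ids and stores schema text.
    static class InMemorySchemaRegistry {
        private final Map<Integer, String> schemasById = new HashMap<>();
        private int nextId = 1;

        int register(String schemaText) {
            int id = nextId++;
            schemasById.put(id, schemaText);
            return id;
        }

        String fetch(int id) {
            return schemasById.get(id);
        }
    }

    // Envelope layout: [4-byte schema version id][payload bytes]
    static byte[] serialize(int schemaId, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(schemaId);
        buf.put(payload);
        return buf.array();
    }

    public static void main(String[] args) {
        InMemorySchemaRegistry registry = new InMemorySchemaRegistry();

        // Producer side: register the schema once, then tag every record with its id.
        int schemaId = registry.register(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[{\"name\":\"user\",\"type\":\"string\"}]}");
        byte[] message = serialize(schemaId,
            "{\"user\":\"alice\"}".getBytes(StandardCharsets.UTF_8));

        // Consumer side: read the id, look the schema up (and cache it), then decode.
        ByteBuffer buf = ByteBuffer.wrap(message);
        int id = buf.getInt();
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        System.out.println("schema: " + registry.fetch(id));
        System.out.println("record: " + new String(payload, StandardCharsets.UTF_8));
    }
}
```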

And then there’s their Streaming Analytics Manager - a collection of bits that are designed to make building and operating streaming analytics solutions easier and simpler, and to lower the barrier to entry. The operations stuff, and the bundling of Druid and Apache Superset to analyse the results of streaming analytics, feel like a no-brainer. The graphical approach to building streaming apps over your streaming engine of choice (starting with Storm, obviously) is going to be an interesting one to track to see what sort of uptake there is. I’m sure there’ll be views that say this is going to be a lowest common denominator solution, that it will lack flexibility and control, and although that’s true I’m not sure it matters. For 80% of streaming analytical use cases it will probably do the job, and allow you to deliver solutions in a fraction of the time it would take otherwise. For everything else - you can still drop down into the underlying tech just as you did before.

That’s pretty much it for this week (finally!), but one last thing I’d like to call out. Hortonworks’ other big announcement this week was their new partnership with IBM. IBM are going to drop their Hadoop distribution and resell HDP, and Hortonworks will resell IBM’s Big SQL and Data Science Experience (DSX). This feels like the natural evolution of the market consolidation that’s been going on for a while, and is exactly the model that Pivotal took with their Hadoop distribution. Hortonworks get access to IBM’s Hadoop customer base and to a data-science-as-a-service solution, and IBM get to continue and enhance their Hadoop offering whilst reducing their costs. And it puts a new light on Hortonworks’ announcements around support for PowerPC and IBM Spectrum Scale.

Right - I’m done with this week. See you back on Monday when I’ll summarise the plan for next week (that I then won’t keep to), but we’re definitely going to start the week by looking at Cloudera Altus.