The Week That Was - 25/08/2017

So, a bit of a mixed bag this week, but what did we look at?

We started with three technology summaries from Jeff Moszuti, looking at Apache Kylin, Apache Beam and Dell EMC REX-Ray. We then finished up the week by looking at Zenko and Spark Structured Streaming.

We last looked at OLAP cube technologies with Druid, which like Apache Kylin is tightly integrated with the Hadoop ecosystem. Unlike Druid however, Kylin doesn’t introduce new data storage, instead leveraging Hive and HBase, which potentially makes it more palatable if you don’t want more data management engines running on your cluster. What it lacks however is Druid’s support for combining streaming and batch data. Kylin is also an Apache project (if that’s important to you); however, given Hortonworks’ recent commitment to Druid, it wouldn’t surprise me if Druid was heading that way as well.
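To make the Kylin side of that a bit more concrete, here’s a rough sketch of firing an aggregate query at Kylin’s REST query API. The endpoint, credentials, project and table are the sandbox defaults and bundled sample data, so treat them as assumptions rather than anything canonical - Kylin also exposes the same ANSI SQL over JDBC/ODBC if you’d rather go that way.

```python
# Minimal sketch: running an aggregate query against Apache Kylin's REST query API.
# The endpoint, credentials, project and table names are illustrative sandbox
# defaults and will almost certainly differ in a real environment.
import requests

KYLIN_URL = "http://localhost:7070/kylin/api/query"  # assumed local instance

query = {
    "sql": (
        "SELECT part_dt, SUM(price) AS total_sales, COUNT(DISTINCT seller_id) AS sellers "
        "FROM kylin_sales GROUP BY part_dt ORDER BY part_dt"
    ),
    "project": "learn_kylin",  # the sample project shipped with Kylin
    "limit": 50,
}

# Kylin answers this from its pre-built cube in HBase rather than scanning Hive,
# which is where the fast OLAP response times come from.
response = requests.post(KYLIN_URL, json=query, auth=("ADMIN", "KYLIN"))
response.raise_for_status()
for row in response.json()["results"]:
    print(row)
```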

I have to say I’m a little conflicted about Apache Beam. It aims to introduce a standard model for batch and stream processing that a range of different technologies can then support. The model feels ok - certainly Data Artisans liked it enough to completely rework the Flink API to be more closely aligned (details here) - but I’m struggling to see the value. I have to admit to being slightly biased against abstractions like these: my feeling is that they’re great in concept, but there’s always a cost associated with an abstraction layer, either in not being able to achieve something easily because you’re fighting the abstraction, or in performance overheads from the translations involved, and switching between back-end runners will never be as easy as you hope. And does anyone really care about being able to take batch/streaming code and easily migrate it between different back-end execution engines? I can see what’s in it for Google - having Google Cloud Dataflow as the de-facto runner for Beam code in the cloud puts them in a good position. Perhaps I’m just being cynical.
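For a sense of what the model looks like in practice, here’s a minimal sketch using the Beam Python SDK - a word count whose execution engine is chosen purely through pipeline options. The file paths are purely illustrative.

```python
# Minimal sketch of the Beam model: the same pipeline definition can be handed to
# different runners (DirectRunner locally, DataflowRunner, a Flink runner, etc.)
# purely by changing pipeline options. Input/output paths are illustrative.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap "DirectRunner" for "DataflowRunner" (plus the relevant cloud options) to
# change the execution engine without touching the pipeline code below.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("events.txt")       # hypothetical input file
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("word_counts")
    )
```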

I’ve not much to say about Dell EMC REX-Ray, but hopefully at some point in the future we’ll look more at containerisation technologies, how they support persistent storage, and how you might use them for analytics.

We looked at Scality’s open sourced S3 Server back when we were looking at object stores. Just as a reminder, it was a Node-based, single-process (i.e. not clustered or distributed) S3-compatible object store service that could either proxy requests on to Scality Ring or Amazon S3, or serve them from local or in-memory storage. Useful for development and test, but probably not anything significant in production. It seems like it was pretty successful (they keep banging on about how it was downloaded over 600,000 times), and they’re therefore trying to make something more significant of it.
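As a rough illustration of why it’s handy for development and test, here’s a sketch of pointing a standard S3 client at a local S3 Server container - the endpoint and credentials are the development defaults the Docker image documents (assumptions on my part), and the bucket name is made up.

```python
# Minimal sketch: treating a local S3 Server container as if it were Amazon S3.
# Endpoint and credentials are the development defaults the Docker image ships
# with - assumptions here; a real deployment would differ.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",     # assumed local container
    aws_access_key_id="accessKey1",           # default dev credentials
    aws_secret_access_key="verySecretKey1",
)

s3.create_bucket(Bucket="test-bucket")
s3.put_object(Bucket="test-bucket", Key="hello.txt", Body=b"hello from the object store")

for obj in s3.list_objects_v2(Bucket="test-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```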

The result is that they’ve renamed S3 Server to Zenko Cloudserver, and made it part of a new, much larger open source project called Zenko. At the moment all Zenko does is provide a Docker Swarm stack definition that allows multiple Cloudservers to be clustered behind a load balancer, but what they have planned is more interesting. First up is support for more backend services, including Azure Blob Storage and potentially Google Cloud Storage, plus support for other container management systems such as Kubernetes, and then two new sub-projects - Backbeat (which will provide policy-based data workflows such as replication or migration) and Clueso (which will provide object metadata search and analytics using Apache Spark). The aim is to provide a gateway into multiple back-end object stores with federation capabilities over the top. Sounds like a nice idea, and probably one to track.

And finally, I’ve refreshed all our Apache Spark technology summaries (including all the sub-projects) to make sure they’re up to date (which is odd given there’s only been one point release since they were written). The big bit that was missing was information on Spark Structured Streaming, which provides the ability to run a DataFrame or SQL query over streaming data (using the standard Spark SQL APIs) and have the result calculated and then updated/maintained as new data comes in. The upshot is that I think it’s probably time we took a deeper look into streaming technologies and understood the differences. It does feel like Spark is moving forward at a pace however - both the original Spark RDD API and the original Spark Streaming API now appear to be effectively in maintenance mode, with DataFrames being the future across both, including for machine learning with MLlib.
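To give a flavour of what that looks like, here’s a minimal Structured Streaming sketch in PySpark - a streaming word count expressed with the ordinary DataFrame API. The socket source and port are just the simplest thing to illustrate with, not a recommendation.

```python
# Minimal sketch of Spark Structured Streaming: a running count over a stream,
# expressed with the normal DataFrame API and kept up to date as new data arrives.
# The socket source and port are purely illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Treat lines arriving on a socket as an unbounded DataFrame.
lines = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)

# Standard DataFrame operations - the same code would work on a static DataFrame.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# The result table is maintained incrementally, with the full counts re-emitted
# to the console each time new data comes in.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```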

Right - that’s enough for this week. See you after the weekend.