The Week That Was - 31/03/2017
No technology summary today for various reasons, one of which is that I’m taking a break next week and we’ve probably got to a pretty good place to pause. We’ve finished looking at the new open source technologies in the Cloudera stack this week, with their proprietary closed source technologies to come, but let’s save those for a fresh week.
So what exactly have we looked at this week? We started by looking at Apache Sentry (Cloudera's competitor to Apache Ranger) and Cloudera Search (Cloudera's competitor to Hortonworks' HDP Search). We then looked at Apache Kudu, a structured data store that supports both updates and deletes by primary key as well as efficient analytical table scans, and RecordService, a new technology, still in beta, that provides an API for tools (such as Spark and MapReduce) to access structured data in Hadoop with fine-grained access control.
At some point, and I realise I keep saying this, I want to take a deeper look into the state of the security technologies in Hadoop, and more specifically do a decent comparison of Apache Sentry and Apache Ranger. There's some history behind how we've ended up with two Apache technologies essentially trying to solve the same problem; each has its pros and cons, but either will probably do what you need, with Ranger possibly looking slightly further ahead functionality-wise. Note that I've made some minor tweaks to the Apache Ranger technology summary this week.
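To make that comparison a little more concrete, here's a rough sketch of how the two are typically administered: Sentry policies are expressed as role-based SQL grants issued through Hive or Impala, while Ranger policies are managed centrally via its admin UI or REST API. Everything below - host names, database, role and group names, the Ranger service name and payload shape - is an illustrative assumption rather than anything specific from the technology summaries.

```python
# Rough sketch only - hostnames, names and payload shapes are assumptions.

# Sentry: role-based grants are issued as SQL, e.g. through Impala or Hive.
from impala.dbapi import connect  # impyla

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()
cur.execute("CREATE ROLE analysts")
cur.execute("GRANT ROLE analysts TO GROUP analysts")
cur.execute("GRANT SELECT ON DATABASE sales TO ROLE analysts")

# Ranger: policies are managed centrally through its admin UI or REST API.
import requests

policy = {
    "service": "hadoop_hive",          # assumed Ranger service name
    "name": "sales-read-only",
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["*"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {"groups": ["analysts"], "accesses": [{"type": "select", "isAllowed": True}]}
    ],
}
requests.post(
    "http://ranger-host.example.com:6080/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "admin"),
)
```

Same outcome either way - the analysts group can read the sales database - just reached through two very different administration models.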
I’m not entirely sure where the whole “wrap Solr up with some tools and utilities and give it a new name” approach came from, but there are some interesting differences between Cloudera Search and HDP Search in the wrappers they put around Solr. Hortonworks bundle Banana (the Solr port of Kibana); both support utilities for loading data from HDFS and moving data from HBase into Solr; and Hortonworks also bundle Solr integrations for Hive, Pig, Storm and Spark. Note again that I’ve made some minor tweaks to the HDP Search technology summary this week.
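Whichever wrapper you choose, the engine underneath is still plain Solr, so the core query path is identical. As a minimal sketch (host, port, collection and field names are all assumptions), querying a collection over Solr's standard HTTP API looks something like this:

```python
# Minimal sketch of querying Solr directly - host, port, collection
# and field names are assumptions.
import requests

SOLR_URL = "http://solr-host.example.com:8983/solr/my_collection/select"

response = requests.get(
    SOLR_URL,
    params={
        "q": "title:hadoop",   # query against an assumed 'title' field
        "rows": 10,            # number of results to return
        "wt": "json",          # ask for a JSON response
    },
)
for doc in response.json()["response"]["docs"]:
    print(doc)
```

The loading utilities and dashboards differ between the two distributions, but this query path is the same underneath both.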
Cloudera position Apache Kudu as the missing link between HDFS and HBase - able to perform updates and deletes by primary key (à la HBase), as well as analytical queries involving column and table scans (à la HDFS). It’s not going to replace either of them at their respective specialisms, but running analytical workloads over mutable data feels like a bit of a gap in the Hadoop ecosystem at the moment (one that probably contributed to the rise of the lambda architecture). Kudu only provides the storage engine, but pairing it with Apache Impala to provide a SQL interface on top feels like yet more evidence that Cloudera is firmly targeting the established OLAP database vendors. The Kudu whitepaper is worth a skim if you’re interested, including its performance comparison between Kudu, Parquet on HDFS and Phoenix on HBase.
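To make that positioning a bit more tangible, here's a rough sketch of the Impala-on-Kudu combination: row-level inserts, updates and deletes by primary key sitting alongside analytical scans over the same table. The connection details and table are made up, and the exact Kudu DDL syntax has varied across Impala releases, so treat this as illustrative only.

```python
# Sketch of Impala on Kudu - connection details and table are illustrative,
# and Kudu DDL syntax has changed between Impala releases.
from impala.dbapi import connect  # impyla

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()

# A Kudu-backed table with a primary key (syntax per later Impala releases).
cur.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        host STRING,
        ts BIGINT,
        value DOUBLE,
        PRIMARY KEY (host, ts)
    )
    PARTITION BY HASH (host) PARTITIONS 4
    STORED AS KUDU
""")

# Mutations by primary key, à la HBase...
cur.execute("INSERT INTO metrics VALUES ('web01', 1490914800, 0.42)")
cur.execute("UPDATE metrics SET value = 0.55 WHERE host = 'web01' AND ts = 1490914800")
cur.execute("DELETE FROM metrics WHERE host = 'web01' AND ts = 1490914800")

# ...alongside analytical scans, à la Parquet on HDFS.
cur.execute("SELECT host, avg(value) FROM metrics GROUP BY host")
print(cur.fetchall())
```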
And finally RecordService, the benefits of which seem twofold. Firstly, it provides a unified data access path for Hadoop technologies - rather than every technology (MapReduce, Spark, etc.) having to implement its own reader for each file format, they can just integrate with RecordService. This then enables the second benefit: fine-grained access control over this data. At the moment, when tools like MapReduce and Spark read data from HDFS, access can only be granted at the file level - RecordService will enable finer-grained access control in this scenario, which can only be a good thing. Two thoughts, however. First, you can achieve something similar today by forcing these tools to read data via Hive, although my guess is that there are performance limitations at scale that RecordService will address. Second, it's very early days for RecordService, and it's tied to Apache Sentry, so it's going to be interesting to see how much traction it gains in the wider Hadoop ecosystem.
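For what it's worth, the "read via Hive" workaround mentioned above looks roughly like the sketch below: expose only the permitted columns (and rows) through a view and grant on the view, so anything querying through Hive or Impala only ever sees that projection - it's this kind of column- and row-level restriction that RecordService aims to enforce for tools reading the underlying files directly. The table, view, role and connection details are illustrative assumptions.

```python
# Sketch of the 'read via Hive' workaround for fine-grained access control.
# Table, view, role and connection details are illustrative assumptions.
from impala.dbapi import connect  # impyla, talking to Impala or HiveServer2

conn = connect(host="impala-host.example.com", port=21050)
cur = conn.cursor()

# Expose only non-sensitive columns (and optionally a row filter) through a view.
cur.execute("""
    CREATE VIEW IF NOT EXISTS sales.orders_restricted AS
    SELECT order_id, order_date, amount
    FROM sales.orders
    WHERE region = 'EMEA'
""")

# Grant Sentry access on the view only, not the underlying table or its files.
cur.execute("GRANT SELECT ON TABLE sales.orders_restricted TO ROLE analysts")
```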
As mentioned above, I’m taking a break next week so there’ll be no updates, but we’ll be back on Monday the 10th with a look at Cloudera’s closed source commercial technologies.