Why Choose Hadoop?
So our Hadoop Distros week is going to be a bit longer than a week, but hey, we’re all flexible and adaptable, right?
Today I’d like to spout some thoughts about why you might consider Hadoop, and I’d like to start by looking at its history.
Hadoop was originally designed for the single specific use case of doing aggregations over enormous volumes of data, and the combination of HDFS and MapReduce delivered this capability in a way that (at least at the very large scale) was not possible before and hasn’t really been superseded since. Pig and Hive were introduced as nicer ways to write MapReduce code, but didn’t fundamentally change anything. Hadoop, being open source, was then picked up by a number of companies (both vendors and users) and taken in a couple of (largely complementary) directions.
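To make that original use case a bit more concrete, here’s a minimal sketch of the kind of MapReduce job Hadoop was built for - summing a numeric value per key across enormous files in HDFS. The input layout (tab-separated key and value), the paths and the class names are my own illustrative assumptions rather than anything from a real deployment, and this is exactly the sort of boilerplate that Pig and Hive let you avoid writing by hand.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch: sum a numeric value per key across large files in HDFS.
// Assumes tab-separated lines of "key<TAB>value"; paths and names are hypothetical.
public class SumByKey {

  public static class SumMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length == 2) {
        // Emit (key, value) so the framework groups values by key for the reducer.
        context.write(new Text(fields[0]), new LongWritable(Long.parseLong(fields[1])));
      }
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long total = 0;
      for (LongWritable value : values) {
        total += value.get();
      }
      context.write(key, new LongWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sum-by-key");
    job.setJarByClass(SumByKey.class);
    job.setMapperClass(SumMapper.class);
    job.setCombinerClass(SumReducer.class);   // combine map output locally to cut shuffle volume
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. hdfs:///data/events
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. hdfs:///output/sums
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```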
First, HDFS was positioned as a great place to put all your data so that you could run a range of analytics over the top - the mythical Data Lake (although we’ll definitely talk about the challenges of building a Data Lake with Hadoop at some point). This required new technologies to bring data in (Flume and Sqoop, for example), new technologies to exploit this data (Spark and Mahout, for example), and a way to make all these technologies play nicely together (YARN). However, there’s only so far that data in a filesystem will get you in terms of analytics - if you’re looking to do anything outside of batch appends and scanning workloads, you’re going to struggle.
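As a rough illustration of the sort of workload HDFS is happy with, here’s a small Spark sketch (using the Java API) that scans a whole directory of files and aggregates over it before writing the result back out. The path, file format and column name are assumptions for the example, not a recommendation.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Illustrative sketch: a batch scan-and-aggregate over files already landed in HDFS.
// The paths, file format and column name are hypothetical.
public class BatchScan {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hdfs-batch-scan")
        .getOrCreate();

    // Scan everything under the (hypothetical) events directory in HDFS.
    Dataset<Row> events = spark.read().parquet("hdfs:///data/events");

    // The kind of workload HDFS suits: read it all, aggregate, write the result back.
    events.groupBy("event_type")
        .count()
        .write()
        .mode("overwrite")
        .parquet("hdfs:///output/event_counts");

    spark.stop();
  }
}
```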
And so other technologies were introduced that used HDFS as an underlying storage layer to enable new data storage capabilities - HBase as a NoSQL database and Solr for search indexing, for example. What that created was not a single place to put all your data, but a range of complementary technologies that support different use cases while sharing underlying infrastructure. All of which is good.
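To show the contrast with plain HDFS access, here’s a small sketch of the HBase Java client doing the random reads and writes that a NoSQL database gives you on top of that shared storage. The table, column family, qualifier and row key are entirely made up for the example.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative sketch of random reads and writes via HBase - the access pattern
// HDFS alone can't offer. Table, column family and row key names are hypothetical.
public class ProfileStore {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("user_profiles"))) {

      // Random write: update a single cell for one row key.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_login"), Bytes.toBytes("2016-05-01"));
      table.put(put);

      // Random read: fetch that row back by key, no full scan required.
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      byte[] lastLogin = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("last_login"));
      System.out.println("last_login = " + Bytes.toString(lastLogin));
    }
  }
}
```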
But there were always going to be challenges trying to move Hadoop from where it started to a more general purpose capability - you’re always going to be pushing against its original architectural and design decisions. HDFS is not a general purpose cluster filesystem - it was designed for a specific use case (for example, it can only do file appends rather than random updates, and has a hard limit on the number of files it can hold based on the memory capacity of the Name Node), which can cause limitations when trying to use it for more general purpose analytics or to underpin other technologies (Kudu has chosen not to run over HDFS, for example). And YARN was a relatively late addition to Hadoop, meaning many Hadoop technologies don’t support it (including Flume, Solr and HBase, although Hortonworks is trying to address this through Slider). Which means we’re now in a position whereby you potentially have multiple technologies competing rather than co-existing on your Hadoop cluster, which feels like it’s starting to dilute some of the potential value. If you’re interested in which technologies do or don’t run over HDFS / YARN, I’ve tried to summarise it in diagrammatic form here.
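As a rough illustration of that Name Node limit, here’s a back-of-envelope sketch using the often-quoted rule of thumb of around 150 bytes of Name Node heap per namespace object (file, directory or block). The heap size, per-object cost and blocks-per-file figures are all assumptions, so treat the result as indicative only.

```java
// Back-of-envelope sketch of why Name Node heap caps the number of files HDFS can hold.
// The ~150 bytes per namespace object figure is an often-quoted rule of thumb, not an
// exact number - treat the result as illustrative.
public class NameNodeCapacity {
  public static void main(String[] args) {
    long heapBytes = 64L * 1024 * 1024 * 1024;  // assume a 64 GB Name Node heap
    long bytesPerObject = 150;                   // rough cost of one file/directory/block entry
    long objectsPerFile = 2;                     // assume roughly one block per file on average

    long maxFiles = heapBytes / (bytesPerObject * objectsPerFile);
    System.out.printf("Roughly %,d files before the Name Node heap becomes the limit%n", maxFiles);
    // ~229 million files - plenty for big files, but easy to exhaust with lots of small ones.
  }
}
```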
So what are your options around Hadoop? Firstly, as an ecosystem it contains a good set of technologies. HDFS plus Hive/Spark/etc. is a great platform for batch scanning analytical workloads. HBase and Solr are great technologies that stand up well to their competition, and Hive and Impala are starting to provide some serious competition for the established MPP database vendors. Deploying any one of these technologies to fulfil a role in your wider analytical ecosystem will serve you well, but if you’re going to use a commercial service you’ll need to find a cost effective way of doing this - you don’t want to be buying the entire ecosystem if you’re not going to use it all. This is where Cloudera are going - starting to offer tailored packages that include subsets of the components focused on specific use cases, and most of the Cloud or Hadoop as a Service offerings allow you to pay for only what you use.
Or you can deploy Hadoop as a common analytical platform - a single set of infrastructure and a single purchase that gives you a single platform able to deliver a range of capabilities and fulfil a range of roles in a cost effective way. Note, however, that this generally only works if you’re deploying on site rather than using the Cloud. This is where Hortonworks and MapR are focusing - Hortonworks is investing in technologies such as Slider to allow everything to integrate with YARN, and MapR have built their entire offering around their own shared multi-tenancy storage capability (MapR-FS), which is designed and built for exactly this use case (and fulfils it better than HDFS).
In summary, Hadoop, like any technology, has its strengths and weaknesses. It’s not the be-all and end-all, it’s certainly not going to solve all your problems and magically make you a data driven organisation, it’s not going to dramatically decrease your costs, and deploying it is going to be as much hard work as deploying any other technology. However, I’m hoping that the material I’ve added to this site over the last few months will help you start to understand what Hadoop is and how it might meet one or more of your use cases, whether that’s helping you start an evaluation of how one of its technologies stacks up against its competition, or helping you understand the ecosystem and how a specific distribution or offering can meet your range of use cases.
So that’s some thoughts on why Hadoop - on Monday we’ll summarise the options for how you can get it…