Hadoop Technology Options
In the last post we looked at why you might want to use Hadoop - today I want to dig into the options for deploying or using Hadoop capabilities. Consider this a companion piece to the list of those options that's now available on our Hadoop Distributions page.
Broadly speaking, if you're looking at deploying Hadoop, the options you have fall into three categories.
Firstly, you can use an Hadoop managed service. This automates the provisioning and management of Hadoop, allowing you to easily create, scale and destroy clusters. You'll obviously need to either have or get your data into the cloud, but this can be a great, very cost effective option for dipping your toe in the water and exploring what Hadoop can do. However, if you want a persistent cluster, or if you're doing anything at significant scale, this can get expensive very quickly, so it's worth doing your sums first. Hadoop as a service does open up another use case, however - if you already have your data in the cloud then you can use an Hadoop managed service to analyse that data in place using transient processing clusters - just spin up a cluster for a specific workload and then terminate it, meaning you're only paying whilst your workload is running. Note, however, that cloud object stores are significantly slower than HDFS running on local storage in a cloud based cluster.
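To make the transient cluster pattern concrete, here's a minimal sketch using AWS EMR via boto3 as one possible managed service (this post doesn't endorse any particular one) - the bucket names, job script and instance sizes are all hypothetical placeholders:

```python
# Sketch: run a single workload on a transient managed Hadoop cluster (AWS EMR),
# letting the cluster terminate itself once the work is done.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="transient-analysis",
    ReleaseLabel="emr-5.29.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://my-logs-bucket/emr/",          # hypothetical bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        # Transient cluster: shut down automatically once all steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "analysis-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # hypothetical Spark job reading data already sitting in S3
            "Args": ["spark-submit", "s3://my-code-bucket/job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster", response["JobFlowId"])
```

The key point is `KeepJobFlowAliveWhenNoSteps: False` - you only pay for the cluster while the step is running, at the cost of reading and writing cloud object storage rather than local HDFS.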
Secondly, you can deploy and manage Hadoop yourself, either on your own infrastructure or on cloud infrastructure. The choice of your own vs cloud infrastructure is subject to all the usual considerations around TCO, manageability and getting your data into the cloud. Both Cloudera and Hortonworks include tools (Cloudera Director and Cloudbreak respectively) for managing cloud based deployments - including provisioning cloud infrastructure and scaling clusters up and down - and these are also probably your prime options if you're running an internal cloud such as OpenStack. Whatever infrastructure you use, you're responsible for your Hadoop installation, which means you'll need to decide which Hadoop distribution to use, and whether to use a free open source version or purchase commercial support. You'll also need to make sure you have access to all the specialist skills required to deploy, secure and manage your cluster - although the various management tools have improved significantly over the last few years (see the sketch below), there are still significant decisions to be made and configuration work to be done.
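As a small taste of what day to day management of a self-managed cluster looks like, here's a sketch that polls the REST API of Ambari (the management tool bundled with HDP) to report the state of each service - the host, cluster name and credentials are placeholders, and a Cloudera Manager deployment would use its own, different API:

```python
# Sketch: list the services on a self-managed HDP cluster and their current
# state (e.g. STARTED, INSTALLED) via the Ambari REST API.
import requests

AMBARI_URL = "http://ambari.example.com:8080/api/v1"   # hypothetical host
CLUSTER = "mycluster"                                   # hypothetical cluster name
AUTH = ("admin", "admin")                               # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}

resp = requests.get(f"{AMBARI_URL}/clusters/{CLUSTER}/services",
                    auth=AUTH, headers=HEADERS)
resp.raise_for_status()

for item in resp.json()["items"]:
    name = item["ServiceInfo"]["service_name"]
    detail = requests.get(f"{AMBARI_URL}/clusters/{CLUSTER}/services/{name}",
                          auth=AUTH, headers=HEADERS).json()
    print(name, detail["ServiceInfo"]["state"])
```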
Lastly, you can use an Hadoop appliance - a prepackaged bundle of dedicated infrastructure and Hadoop software. These generally come at a price premium, but they're typically well architected for performance and can massively accelerate an onsite deployment, although they may or may not be easier to manage than a custom Hadoop deployment.
Selecting an Hadoop technology is never going to be a quick or easy process - the range of options, the different ways it can be deployed, the different capabilities on offer, the different cost models and levels of support, the skills required and the management and maintenance costs all make it a complex decision. However, I'm hoping that the material on this site will give you a starting point for understanding the different options and help inform this decision making process. The Hadoop Distributions page that lists the various options should be your first port of call, and I've started a comparison of the technologies bundled in the major distributions we've looked at to date on an Hadoop Distributions Comparison page. Neither of these is complete by a long stretch, so let me know of any obvious gaps, or even better send through a pull request with any new information.