Thoughts on Query Engines and Analytical Databases

Right, we’ve finally completed our look at analytical databases, with a short stop to look at query engines.

So let’s summarise and spew some thoughts…

Let’s start with analytical databases, with a bit of noddy pseudo-history tale telling. Once upon a time the relational database was invented, but these initially focused on transactional use cases - creating, finding, updating and deleting records - what’s now sometimes referred to as OLTP (online transaction processing). However, SQL was great, and so people started using it to try and generate reports on the data in their relational databases - queries that focused on aggregating and joining significant portions of their data. And that was generally fine, until data volumes increased to the point where databases designed for transactional workloads struggled to handle the loads generated by these reports and analytics - fetching lots of data off disk and aggregating it was expensive.

And so the analytical OLAP (online analytical processing) databases were born - designed primarily to support large full table scan aggregation queries, while still retaining the core relational database support for transactions. These were marketed as data warehouses or analytical databases, and with them came a bunch of new technologies: parallelism and the invention of the MPP (massively parallel processing) database, as full table scans and aggregations lend themselves well to partitioning and parallelisation (even if joins do not); columnar compression, which enables faster and more efficient scans of individual columns from database tables; pre-aggregation of data, through materialized views and pre-generated cubes; and a range of new functionality designed to complement SQL as an analytical tool, such as support for machine learning, geographical analytics, map reduce and custom analytical functions. Today, these are often sold as appliances, bundling clusters of compute and storage servers with huge bandwidth interconnects between them; however, many have now also started embracing the cloud, being available as cloud services and, to a lesser extent, as cloud native software. Open source projects in this space, however, are scarce.
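To make the columnar compression point concrete, here’s a tiny hand-rolled sketch using only the Python standard library (real columnar formats are far more sophisticated, with per-column encodings like run-length and dictionary encoding - this just shows the layout effect). The table and its columns are made up for illustration:

```python
import json
import zlib

# A toy table: the "country" column has very low cardinality,
# which is typical of analytical data.
rows = [{"id": i, "country": ["UK", "US", "DE"][i % 3], "amount": i % 100}
        for i in range(5000)]

# Row-oriented layout: values from different columns interleaved.
row_bytes = json.dumps(rows).encode()

# Column-oriented layout: all values of one column stored contiguously,
# so repeated and similar values sit next to each other.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_bytes = json.dumps(columns).encode()

row_compressed = len(zlib.compress(row_bytes))
col_compressed = len(zlib.compress(col_bytes))

print(f"row-oriented compressed:    {row_compressed} bytes")
print(f"column-oriented compressed: {col_compressed} bytes")
assert col_compressed < row_compressed
```

The column-oriented layout compresses better because repeated values end up adjacent - and it’s the same property that makes scanning just the one or two columns a query needs cheaper than reading whole rows.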

Instead, the charge of open source software into this space ended up being spearheaded by Hadoop. Whilst analytical databases have long separated storage and compute (with some interesting abilities to push parts of a query down to the storage layer), it was Hadoop that formalised this by separating them into completely separate and interchangeable components. This can be seen from the very first versions of Hadoop, which were made up of two products - HDFS and MapReduce - storage and compute - and over time there’s been evolution on both sides. In storage, HDFS has remained a constant, but there’s been huge innovation in the storage formats of data (see our Data Storage Formats page). On the compute side, there’s been a veritable explosion of query engines.

These started out as batch query engines - every query starting up a new job (initially MapReduce) to go and read all the data and execute the query. However, there’s been a significant push into low latency, high concurrency query engines, led by the big Hadoop vendors - Cloudera with Impala and Hortonworks with Hive - both trying to make Hadoop a realistic competitor in the analytical database market. Compared to analytical databases, Hadoop query engines are new technologies - their query optimisers and SQL compliance generally lag the mature analytical databases - however they’re seeing significant investment, and they’re now at the point where they’ll probably support most of your use cases.

And this split of storage and compute has brought some interesting benefits. The first is support for the whole “schema-on-read” shtick - the idea that you can query your raw data without having to do any preparation first. It’s slow, and painful, but for an initial exploratory analysis it’s a valuable tool. And of course if there’s value in the data, you’re still free to do some work that makes it quicker, easier and more efficient to query. The second is that these tools have naturally evolved some level of query federation - if I want to exploit raw unprepared data I have to accept that this data may be in multiple places in a range of formats, and if I have a query engine that’s separated from the underlying storage, why not make that storage pluggable and support multiple storage platforms? And so many of these tools support querying over a range of data sources, from HDFS, to S3, to HBase, to relational and NoSQL databases, and (interestingly) some emerging support for Kafka. They don’t have the level of sophistication around semantic layers, caching and materialisation of data that Data Virtualization technologies do, but it can still be a hugely valuable capability.
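As a toy illustration of the schema-on-read idea (plain Python standard library, with made-up sample data - a real query engine would be pointed at raw files sitting in HDFS or S3), here’s an aggregation run directly over raw CSV with no table defined up front:

```python
import csv
import io
from collections import defaultdict

# Raw, unprepared data as it might land in a data lake: nobody created
# a table for it ahead of time.
raw_csv = """region,product,amount
north,widget,120
south,widget,80
north,gadget,200
south,gadget,50
"""

# Schema-on-read: the column names come from the header and the types
# are applied only at query time, as each row is read.
totals = defaultdict(int)
for row in csv.DictReader(io.StringIO(raw_csv)):
    totals[row["region"]] += int(row["amount"])  # type applied on read

print(dict(totals))  # {'north': 320, 'south': 130}
```

Parsing and casting every value on every read is exactly why this approach is slower than a prepared, typed table - but nothing stopped us querying the data immediately.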

And now we’re starting to see commercial vendors get involved - both large established vendors (Teradata, IBM, Oracle) who are updating their products to run over HDFS and external data stores, and new vendors who’ve seen an opportunity to sell commercial products that offer a level of functionality and maturity that maybe some of the open source products can’t match. If you’re running open source Hadoop, some of these may well be worth a look.

Finally - a couple of footnotes…

Although many of these tools support SQL, there are many that have their own query languages - Pig has Pig Latin, MRQL has its own language - and of course you could probably count most of the graph and machine learning projects as query languages, which suggests I’ve probably not named this category particularly well. SQL is always going to be the dominant language in this space however, and anything with its own query language is unlikely to make an impact.

It’s also worth commenting on the role of the Hive Metastore in the Hadoop query engine ecosystem. If you’re separating compute and storage, you need some way of telling the compute what data’s available for query and what format it’s in. Within Hive, this was the Hive Metastore, and it’s now gradually being adopted as the standard in this space, giving some interesting interoperability options: if you define a table in the Hive Metastore, you can query it with either Hive or Impala (or any other technology that uses the Metastore). And now there’s a proposal to break the Metastore out of the Hive project into its own top level Apache project, to reflect the wider role it has in the Hadoop ecosystem.
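As a loose sketch of the interoperability this enables (the function names, the in-memory dict and the table details below are purely illustrative - not the real Metastore API, which is a Thrift service backed by a relational database), the core idea is just a shared catalog that any engine can consult:

```python
# A toy stand-in for a metastore: a shared catalog mapping a table name
# to where its data lives and how it's encoded.
catalog = {}

def register_table(name, location, fmt):
    """Define a table once: record its storage location and file format."""
    catalog[name] = {"location": location, "format": fmt}

def resolve(name):
    """What any engine (Hive, Impala, ...) would ask at query planning time."""
    return catalog[name]

# Define the table once...
register_table("sales", "hdfs:///warehouse/sales", "parquet")

# ...and any engine talking to the same catalog can find and query it.
table = resolve("sales")
print(table)  # {'location': 'hdfs:///warehouse/sales', 'format': 'parquet'}
```

Because compute and storage are separate, the catalog is the only thing the engines need to agree on - which is why a shared metastore matters so much in this ecosystem.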

Right - that will do, and hopefully brings the protracted birth of our analytical databases content to an end. Have a good weekend everyone, and we’ll see you on Monday for a week or two of random catch up and clean up.