Thoughts on Hadoop Data Formats

This week we looked at two file formats for Hadoop, ORC and CarbonData, as well as a new in-memory data structure specification, Arrow. Earlier this year we looked at two data serialisation frameworks, Avro and Parquet. File formats, data serialisation frameworks, specifications… Ahh, what a minefield! Today I’d like to try to make sense of it all by looking at the evolution of these various data formats to see how we got here.

Back in the old days, it wasn’t uncommon to process gzipped, plain-text delimited data using MapReduce. However, if the data was supplied as lots of small files, pressure would eventually be put on HDFS, whose NameNode tracks every file in memory and so can only handle a limited number of them. At the same time, if the data arrived as a handful of large gzipped files, MapReduce couldn’t efficiently carve it up for parallel processing, because gzip isn’t a splittable compression format. SequenceFile was the first file format to address both of these issues: it provided a container into which many small files could be packed as key/value records within a single larger file, with synchronisation markers to permit efficient splitting of the file to distribute the workload. The format also allowed different types of compression to be used inside the file, per record or per block, giving a finer level of control over compression whilst still maintaining splittability. And for a while, SequenceFile was fine if you were just doing large-scale distributed batch processing or building a self-contained system in Java, such as a storage manager.
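
To make the container idea concrete, here is a minimal sketch using the Hadoop Java API that packs the contents of small files into a single block-compressed SequenceFile, keyed by filename. The output path, class name and record contents are invented for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("packed.seq"); // hypothetical output path

        // Each small file becomes one key/value record inside a single container
        // file. BLOCK compression compresses groups of records together, while
        // the sync markers written by the format keep the file splittable.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK))) {
            byte[] contents = "contents of one small file".getBytes("UTF-8");
            writer.append(new Text("small-file-001.txt"), new BytesWritable(contents));
        }
    }
}
```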

As a file format, Avro builds on the container and synchronisation marker ideas but adds the ability to model relational and complex data. It achieves this through the concept of a schema, which specifies the structure, types and meaning of your data. Distribute the schema together with the data and you have self-describing data that can be used both as a wire format for communication and as a serialisation format for persistent data. But Avro’s creator, Doug Cutting, didn’t just stop there; Avro had a much grander aspiration. Its design accommodated schema evolution, which allowed your data to be long-lived and reusable by other applications outside of the Hadoop and Java ecosystems.
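
A rough sketch of that idea using Avro’s generic Java API; the record type and field names are invented, and the optional email field with a null default is the kind of addition that schema evolution makes safe for older data.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
    // The schema captures the structure and types of the data. The optional
    // "email" field with a null default is what lets readers and writers on
    // different versions of the schema still understand each other.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1L);
        user.put("name", "Ada");

        // The schema is written into the container file's header, so the
        // resulting file is self-describing to any reader.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```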

So far, when it comes to serialising data for persistence, our file formats have been writing the consecutive elements of a row next to each other on disk. Around 2010/2011, several research projects were experimenting with column-oriented data layouts on HDFS. The idea behind a columnar data structure was to lay out your data so that the values of each column were adjacent to one another. For read queries that only process a small subset of columns but a large number of rows at a time (the so-called typical analytical workload), it was then possible to read only the required column values off disk, thereby minimising disk I/O. With the more traditional row-oriented layout, the same type of query forced you to read every column off disk, keep the columns required by the query and discard the unused ones. Column data, being uniform in type, also had the additional benefit of being highly compressible, allowing the same data to be stored on disk in a smaller form and further reducing I/O. This approach to optimising data placement was popularised by databases such as MonetDB and HP Vertica. The trade-off was slower writes, but that proved a good bargain for analytical workloads. The Record Columnar File format (RCFile) came out of one of these research projects and became widely adopted in the Hadoop ecosystem.
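
As a toy, in-memory Java illustration of the layout difference (nothing to do with how any of these formats are actually implemented on disk): with a row layout, a single-column scan drags every field along, while with a column layout the same scan touches one contiguous, uniformly typed array.

```java
public class ColumnarSketch {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Row-oriented: the three fields of each record sit next to each other,
        // so scanning column 0 still walks past columns 1 and 2.
        long[][] rows = new long[n][3];
        long rowSum = 0;
        for (int i = 0; i < n; i++) {
            rowSum += rows[i][0];
        }

        // Column-oriented: every value of column "a" is contiguous, so the same
        // query reads one dense array and nothing else; its uniform type also
        // makes it a good target for run-length, dictionary or delta encoding.
        long[] columnA = new long[n];
        long colSum = 0;
        for (int i = 0; i < n; i++) {
            colSum += columnA[i];
        }

        System.out.println(rowSum + " " + colSum);
    }
}
```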

By 2013, there was gathering interest in using Hadoop for interactive, data warehouse-style SQL queries, and combining Hive with RCFile for data storage was a popular choice. Hortonworks launched the Stinger initiative with the goal of dramatically speeding up Hive and making it more enterprise-ready. Building on RCFile’s columnar layout, the team introduced Avro’s schema concept so the format could model complex nested structures, something RCFile didn’t previously support. Armed with this metadata, it was also possible to intelligently select an appropriate compression scheme based on a column’s data type. The format also collected additional metadata, such as min and max values per column, which could later be used to skip irrelevant parts of the data without the need for large, complex or manually maintained indexes. Another improvement over Avro and RCFile was the ability to identify the boundaries on which a file could be split without having to scan for synchronisation markers. This new file format became known as the Optimized Row Columnar (ORC) file format.

The Avro team had also been experimenting with a columnar file format design called Trevni, which Cloudera and Twitter picked up and developed into Parquet. ORC, on the other hand, was spun out of Hive into a separate project but initially remained closely integrated with it. Parquet was positioned as the more general-purpose columnar file format for use with any Hadoop framework, although essentially both projects share the same fundamental ideas. Today, they both stand in their own right and are integrated with a number of different frameworks.
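
To make the ORC side of this concrete, here is a minimal writer sketch using the ORC core Java API; the file name and columns are invented, and the per-column statistics mentioned above are gathered automatically as row batches are written.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The schema is embedded in the file, and the writer keeps per-column
        // statistics (such as min/max) that readers can later use to skip data.
        TypeDescription schema =
            TypeDescription.fromString("struct<id:bigint,name:string>");

        Writer writer = OrcFile.createWriter(new Path("people.orc"),
                OrcFile.writerOptions(conf).setSchema(schema));

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];

        int row = batch.size++;
        id.vector[row] = 1L;
        name.setVal(row, "Ada".getBytes("UTF-8"));

        writer.addRowBatch(batch);
        writer.close();
    }
}
```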

Up until now we’ve been looking at data formats that are primarily optimised for a single type of query analysis. That’s not ideal if you want your data processing platform to support a wide spectrum of query types. CarbonData from Huawei aims to tackle the ‘one format to rule them all’ idea head on. Building on the previous formats, CarbonData introduces multi-dimensional key indexes, inspired by the likes of Mondrian (an early open-source OLAP server) and Apache Kylin, to support multi-dimensional OLAP-style queries; inverted indexes for count-distinct-like operations; and the ability to group columns together to support detailed queries that fetch many columns out of a wide table.

Finally, we come to Apache Arrow. Unlike the previous data formats we’ve discussed, Arrow isn’t about serialising data to disk; it’s an in-memory data format that focuses on CPU throughput for efficient processing and on exchanging data between processes and systems without serialisation and deserialisation. Similar to ORC and Parquet, the data is laid out column by column, so for analytical workloads only the required data needs to be supplied to the CPU. This data placement strategy takes full advantage of on-chip cache storage (roughly 100x faster to access than main memory), pipelining, and SIMD (Single Instruction, Multiple Data) instructions, which operate on multiple data values simultaneously. The concept of a columnar in-memory format wasn’t new; both Drill and Hive had already implemented their own. What is special about Arrow is its goal of defining a standard interchange format that allows data to be shared between processes without the overhead of copying or transforming it. This is important when you want to put together a data processing platform that isn’t limited to Hadoop alone.
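
As a tiny sketch with Arrow’s Java vector API (the column name and values are invented): all the values of a column live contiguously in off-heap buffers whose layout is defined by the Arrow specification and is identical across language implementations, which is what makes cache-friendly scans and zero-copy sharing possible.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ArrowSketch {
    public static void main(String[] args) {
        // The vector's values live in contiguous, off-heap Arrow buffers, so
        // another process or library can consume the same memory without any
        // serialisation or deserialisation step.
        try (RootAllocator allocator = new RootAllocator();
             IntVector ages = new IntVector("age", allocator)) {
            ages.allocateNew(3);
            ages.set(0, 34);
            ages.set(1, 27);
            ages.set(2, 45);
            ages.setValueCount(3);

            long sum = 0;
            for (int i = 0; i < ages.getValueCount(); i++) {
                sum += ages.get(i); // a tight scan over contiguous memory
            }
            System.out.println("sum of ages = " + sum);
        }
    }
}
```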

In summary, we’ve seen that the original Hadoop data formats were designed to solve very specific use cases. As the drive to develop Hadoop into a more general-purpose analytics platform and to expand its use outside of the Java ecosystem has gathered pace, the data formats have evolved rapidly, with each new format building on the ideas of its predecessors. As the Hadoop ecosystem continues to evolve to cater for new use cases, I expect to see continued innovation in this space.

So I hope this little journey has helped you navigate the complex and rapidly evolving collection of Hadoop data formats; on Monday I’ll hand you back over to Peter.