Data Storage Formats

Our list of, and information on, data storage formats, including Avro, Parquet, ORCFile and CarbonData, as well as alternatives to these.

Category Definition

Libraries that support the storage of data on disk for real-time or batch analytics. Popularised by the use of distributed file systems in analytical platforms, common features include schema evolution (the ability to change the schema and still read all historical data), both row and columnar data layouts (supporting efficient batch processing and analytical workloads respectively), complex record types including nested objects and arrays, indexing to support random data access, and efficient inserts, updates and deletes as well as ACID transactions.
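
To make the row versus columnar distinction concrete, the following minimal Python sketch (purely illustrative, not tied to any specific format below) holds the same three records in a row layout and in a columnar layout, and shows why a query over a single field only needs to touch one column in the latter.

    # The same three records, laid out two ways.
    records = [
        {"id": 1, "name": "alice", "score": 9.5},
        {"id": 2, "name": "bob",   "score": 7.2},
        {"id": 3, "name": "carol", "score": 8.8},
    ]

    # Row layout: each record is stored contiguously, which suits writing and
    # reading whole records at a time.
    row_layout = [(r["id"], r["name"], r["score"]) for r in records]

    # Columnar layout: each field is stored contiguously, so an analytical query
    # that only needs "score" reads just that column, and similar values sit
    # together for better compression.
    column_layout = {
        "id":    [r["id"] for r in records],
        "name":  [r["name"] for r in records],
        "score": [r["score"] for r in records],
    }

    print(sum(column_layout["score"]) / len(records))  # scans a single column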

Further Information

A good introduction from Silicon Valley Data Science to text, SequenceFile, Avro, Parquet and ORC, along with some benchmarks - http://www.svds.com/dataformats/

Comparison of JSON, Avro, ORC and Parquet, including benchmarks, from August 2016 - https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet-65740483

Comparison of Parquet, Avro, HBase and Kudu, from January 2017 - https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines

Text Based Formats

Plain Text: Often formatted as either delimited fields (e.g. comma separated values (CSV) or pipe separated values (PSV)) or fixed width fields
XML: Supports the use of external schema definitions, but the format can be verbose and serialisation/deserialisation performance is often poor
JSON (JavaScript Object Notation): More efficient and terse than XML, but still suffers from poor serialisation/deserialisation performance (a small size comparison follows this list)
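
As a rough illustration of the relative verbosity of these text formats, this standard-library Python sketch serialises the same record (with made-up field names and values) to CSV, JSON and XML and prints the size of each.

    import csv, io, json
    import xml.etree.ElementTree as ET

    record = {"id": "1", "name": "alice", "city": "Leeds"}

    # CSV: compact, but the schema lives outside the data (a header row at best).
    csv_buf = io.StringIO()
    csv_writer = csv.DictWriter(csv_buf, fieldnames=list(record))
    csv_writer.writeheader()
    csv_writer.writerow(record)

    # JSON: self-describing, with field names repeated on every record.
    json_text = json.dumps(record)

    # XML: the most verbose of the three for the same content.
    root = ET.Element("record")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    xml_text = ET.tostring(root, encoding="unicode")

    print(len(csv_buf.getvalue()), len(json_text), len(xml_text))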

Row Based Formats

SequenceFile: Part of the Hadoop project, a container of binary key/value pairs used to store multiple smaller files inside a single large file - https://wiki.apache.org/hadoop/SequenceFile
Avro: Binary serialisation format supporting schema evolution and complex record types (see the schema evolution sketch after this list)
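
As a concrete example of schema evolution, the sketch below (assuming the fastavro Python library, which is not referenced above and is used here purely for illustration) writes a record with a v1 schema and reads it back with an evolved v2 schema whose new field has a default.

    from io import BytesIO
    from fastavro import writer, reader, parse_schema

    # v1 schema: just a name field.
    schema_v1 = parse_schema({
        "type": "record", "name": "User",
        "fields": [{"name": "name", "type": "string"}],
    })

    # v2 schema adds an optional field with a default, so files written
    # against v1 remain readable.
    schema_v2 = parse_schema({
        "type": "record", "name": "User",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "email", "type": ["null", "string"], "default": None},
        ],
    })

    buf = BytesIO()
    writer(buf, schema_v1, [{"name": "alice"}])  # "historical" data in the v1 layout
    buf.seek(0)

    # Read the old data with the new schema; the missing field takes its default.
    for record in reader(buf, reader_schema=schema_v2):
        print(record)  # {'name': 'alice', 'email': None}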

Column Based Formats

RCFile: Early implementation of a columnar record format as part of the Apache Hive project - https://cwiki.apache.org/confluence/display/Hive/RCFile
ORCFile: Evolution of RCFile, spun out into its own Apache project
Parquet: Columnar format spun out from the Avro Trevni format (see the column pruning sketch after this list)
CarbonData: Columnar format created by Huawei to address a number of perceived shortcomings in existing formats
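
To illustrate what a columnar on-disk format buys you in practice, the sketch below (assuming the pyarrow Python library, used here purely as an illustration) writes a small Parquet file and then reads back only one column, so the other columns are never deserialised.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id":    [1, 2, 3],
        "name":  ["alice", "bob", "carol"],
        "score": [9.5, 7.2, 8.8],
    })

    pq.write_table(table, "users.parquet")

    # Column pruning: only the "score" column is read back from the file.
    scores = pq.read_table("users.parquet", columns=["score"])
    print(scores.to_pydict())  # {'score': [9.5, 7.2, 8.8]}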

Map Based Formats

MapFile, SetFile, ArrayFile, BloomMapFile: Part of the Hadoop project and built on SequenceFile, with MapFile being the original file format used by HBase - http://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/
TFile: Container of key-value pairs, part of the Hadoop project - http://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/io/file/tfile/TFile.html
HFile: HBase storage format based on TFile and the SSTable format from the Google BigTable paper (see the sketch below) - http://hbase.apache.org/book.html#_hfile_format_2

And of course, any object store, NoSQL key value store or NoSQL wide column store can be used to store data by key.
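
The common idea behind these map based formats (and the SSTable design that HFile borrows from) is a file of key/value pairs sorted by key plus a sparse index, so a single key can be found without scanning the whole file. A minimal, purely illustrative Python sketch of that structure:

    import bisect

    # Records sorted by key, standing in for the data blocks of a MapFile/HFile.
    data = sorted(("k%04d" % i, "value-%d" % i) for i in range(1000))

    # Sparse index: remember every 64th key and its position, much like a block index.
    BLOCK = 64
    index_keys = [data[i][0] for i in range(0, len(data), BLOCK)]
    index_pos = list(range(0, len(data), BLOCK))

    def get(key):
        # Binary search the index for the block that could hold the key,
        # then scan only that block.
        block = bisect.bisect_right(index_keys, key) - 1
        for k, v in data[index_pos[block]:index_pos[block] + BLOCK]:
            if k == key:
                return v
        return None

    print(get("k0123"))  # value-123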

In Memory Formats

Arrow: In-memory columnar data format supporting high performance data exchange and fast analytical access (see the sketch after this list)
Mnemonic: Hybrid memory/storage object model framework with Spark and MapReduce support - http://mnemonic.apache.org/
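
As an example of the data exchange Arrow enables, the sketch below (again assuming the pyarrow Python library, used purely for illustration) builds an in-memory columnar table and round-trips it through Arrow's IPC stream format, which is how the same buffers can be handed between processes or libraries without row-by-row re-serialisation.

    import pyarrow as pa

    table = pa.table({"id": [1, 2, 3], "score": [9.5, 7.2, 8.8]})

    # Write the columnar buffers out in Arrow's IPC stream format...
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as stream_writer:
        stream_writer.write_table(table)

    # ...and read them back; a consumer in another process would do the same.
    received = pa.ipc.open_stream(sink.getvalue()).read_all()
    print(received.column("score").to_pylist())  # [9.5, 7.2, 8.8]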

Data Storage Services

Kudu: Cluster-based columnar storage engine that supports updates and deletes by primary key

Blog Posts