Apache ORC

Self-describing, type-aware, columnar file format to enable efficient querying and storage of data on Hadoop. Provides built-in storage indexes, column statistics and bloom filters to allow execution engines to implement predicate and projection push-down, partition pruning and cost-based optimisation for low-latency reads. Uses multi-version concurrency control to support ACID transactions and allow Hive to implement bulk insert, update, delete and streaming ingest (micro batch) use cases. Implements type-aware encoding for efficient compression (run-length for integers and dictionary for strings). The schema definition is stored alongside the data and supports all primitive data types and complex nested data structures. Uses protocol buffers to store metadata. Comes with a Java library for reading and writing the file format that includes a MapReduce-compatible API, a C++ library for reading the file format (donated by Vertica) and a set of Java and C++ tools for inspecting and benchmarking ORC files. Created by Hortonworks in January 2013 as part of the initiative to massively speed up Hive and improve the storage efficiency of data stored in Hadoop. Split off from Apache Hive to become a separate top-level Apache project in April 2015, with a 1.0 release in January 2016.
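
The two ideas in the description that are easiest to misread are type-aware encoding and statistics-based predicate push-down, so a minimal Python sketch of both may help. It is an illustration only and does not use or reproduce the ORC library: every function name and the `stripe_size` parameter are hypothetical, and ORC's real integer and string encodings and index layout are more elaborate than this.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs (simplified RLE)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def dictionary_encode(strings):
    """Replace each string with an index into a dictionary of distinct values."""
    dictionary = sorted(set(strings))
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[s] for s in strings]


def dictionary_decode(dictionary, codes):
    """Look each code up in the dictionary to recover the strings."""
    return [dictionary[code] for code in codes]


def build_chunks(column, stripe_size):
    """Split a column into fixed-size chunks and record min/max per chunk.

    Hypothetical stand-in for per-chunk column statistics.
    """
    chunks = []
    for start in range(0, len(column), stripe_size):
        values = column[start:start + stripe_size]
        chunks.append({"min": min(values), "max": max(values), "values": values})
    return chunks


def scan_equal(chunks, target):
    """Return matching values and the number of chunks that had to be read."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # statistics prove the chunk cannot contain the target
        chunks_read += 1
        hits.extend(v for v in chunk["values"] if v == target)
    return hits, chunks_read


if __name__ == "__main__":
    print(run_length_encode([7, 7, 7, 8, 8, 9]))
    print(dictionary_encode(["us", "uk", "us", "de"]))
    print(scan_equal(build_chunks(list(range(100)), 10), 42))
```

In this sketch a reader looking for a single value in a sorted column touches one chunk instead of all ten, which is the effect that min/max statistics are meant to give an execution engine. The benefit shrinks when the data is unsorted and value ranges overlap between chunks.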

Technology Information

Other Names	ORC
Vendors	The Apache Software Foundation
Type	Commercial Open Source
Last Updated	September 2019 - v1.6

Release History

version	release date	release links	release comment
1.5	2018-05-14	release notes
1.6	2019-08-03	release notes

News

https://orc.apache.org/news/ - news page
https://orc.apache.org/docs/releases.html - details of releases
