Hadoop Compatible Filesystems

Our list of commercial, open source and cloud based Hadoop compatible filesystems, with information on each, including HDFS, MapR-FS, Alluxio, Ignite and Azure Data Lake Store, plus alternatives to these.

Category Definition

A parallel distributed filesystem that implements the Hadoop FileSystem API and conforms to the Hadoop Compatible Filesystem (HCFS) specification, allowing it to be used in place of HDFS. Using the FileSystem API (rather than native filesystem access) allows for parallel reads and location aware block placement, while the HCFS specification covers runtime behaviour. Note that Hadoop Compatible Filesystems (like HDFS itself) are not fully POSIX compliant, that there is no formal compatibility test suite (although a test suite that will highlight potential issues is available), and that some implementations (for example object stores) do not, and cannot, fully conform to the specification.
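To illustrate why exposing block locations matters, the following is a minimal sketch of a scheduler assigning the blocks of a file to workers, preferring a worker that already holds a replica. It is a self-contained model rather than Hadoop code: the `Block` class, the `assign_blocks` function and the node names are all hypothetical, standing in for the kind of information a client would obtain from the FileSystem API (for example via `FileSystem.getFileBlockLocations`).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for one block of a file and the hosts holding its replicas."""
    offset: int
    length: int
    hosts: tuple


def assign_blocks(blocks, workers):
    """Assign each block to a worker, preferring a worker that holds a replica.

    Returns a dict mapping worker -> list of (block offset, read_is_local).
    """
    load = defaultdict(int)
    plan = defaultdict(list)
    for block in blocks:
        local = [w for w in workers if w in block.hosts]
        # Fall back to any worker (a remote read) when no replica is local.
        candidates = local or list(workers)
        # Pick the least-loaded candidate so reads spread across the nodes.
        chosen = min(candidates, key=lambda w: (load[w], w))
        load[chosen] += 1
        plan[chosen].append((block.offset, bool(local)))
    return dict(plan)


blocks = [
    Block(0, 128, ("node1", "node2")),
    Block(128, 128, ("node2", "node3")),
    Block(256, 128, ("node1", "node3")),
    Block(384, 64, ("node9",)),  # replica lives outside this set of workers
]
plan = assign_blocks(blocks, ["node1", "node2", "node3"])
for worker, reads in sorted(plan.items()):
    print(worker, reads)
```

In this model the first three blocks are read locally and in parallel on three different nodes, while the last block has to be read remotely because none of the workers holds a replica.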

Further Information

The specification for Hadoop compatible filesystems is included in the Hadoop documentation.

Specialist Hadoop Compatible Filesystems

The following technologies are all designed specifically to be Hadoop compatible and to be drop-in replacements for HDFS within a Hadoop cluster, meaning that they can co-exist with YARN and other analytical compute workloads on the same nodes:

HDFS: The Hadoop Distributed Filesystem, bundled with Hadoop
MapR-FS: Hadoop compatible, highly resilient and scalable distributed filesystem that also supports NoSQL table and streaming data native storage
Quantcast File System: Open source HDFS replacement, which focuses on improving performance and scalability over HDFS
Hops-HDFS: Experimental solution based on HDFS but using a distributed MySQL cluster for metadata to increase performance and scalability - http://www.hops.io/?q=content/hdfs

See our Hadoop Distributions page for options on deploying Hadoop clusters utilising these technologies.

In Memory Hadoop Compatible Filesystems

There are also a number of in memory data grids / storage layers that provide Hadoop compatibility and the option to run Hadoop jobs entirely in memory or across tiered storage:

Alluxio: Distributed virtual storage layer over memory and tiered storage with support for a range of interfaces. Previously known as Tachyon
GridGain/Apache Ignite: Distributed in-memory data fabric/grid, including support for an in-memory Hadoop compatible filesystem

Cloud Hadoop Compatible Filesystems

Azure offers a Hadoop compatible filesystem as a service:

Azure Data Lake Store: Cloud based massively scalable HDFS compatible filesystem based on Microsoft Cosmos

Other Technologies

DataStax Enterprise file system: Distributed Hadoop compatible filesystem that runs on DataStax Enterprise, replacing the now deprecated Cassandra File System (CFS) - http://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/analytics/dsefsTOC.html

Parallel Distributed Filesystems

Parallel distributed filesystems provide similar capabilities to HDFS, including the ability to scale horizontally and to read a file in parallel from multiple nodes. Their general focus is on providing direct filesystem access plus NFS and object store APIs, and although most offer a Hadoop compatible API, this is generally just to allow data to be exploited by Hadoop compatible workloads as a remote filesystem. Some may support installation on a Hadoop cluster as a drop-in replacement for HDFS, but there are often compatibility issues and performance is often not as good as HDFS.

Object Stores

Most object stores also provide Hadoop compatible APIs, and although this means that Hadoop can natively read from and write to them using the Hadoop FileSystem API, they are not considered Hadoop Compatible Filesystems because they do not comply with the HCFS specification. More details can be found in the “Object Stores vs. Filesystems” section of the specification page.
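One commonly cited example of this gap is directory rename. A filesystem can rename a directory as a single metadata operation, whereas an object store has only a flat key space, so a client has to emulate the rename by copying and deleting each object under the prefix. The sketch below models that behaviour in memory; the `FlatObjectStore` class and its methods are hypothetical and are not the API of any real object store or Hadoop connector.

```python
class FlatObjectStore:
    """Hypothetical in-memory object store: a flat key -> bytes mapping with no real directories."""

    def __init__(self):
        self.objects = {}
        self.operations = 0  # counts per-object copy and delete requests

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, src, dst):
        """Emulate a directory rename by copying, then deleting, every key under src."""
        src = src.rstrip("/") + "/"
        dst = dst.rstrip("/") + "/"
        # sorted() materialises the key list, so the dict can be modified in the loop.
        for key in sorted(k for k in self.objects if k.startswith(src)):
            self.objects[dst + key[len(src):]] = self.objects[key]  # copy
            del self.objects[key]                                    # delete
            self.operations += 2


store = FlatObjectStore()
for i in range(3):
    store.put(f"tmp/job/part-{i}", b"x")
store.rename_prefix("tmp/job", "out/job")
print(sorted(store.objects), store.operations)
```

The cost grows with the number of objects under the prefix, and a failure part way through the loop would leave data under both prefixes, which is why code that relies on an atomic directory rename behaves differently on an object store.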

One of the potential differences is the use of an eventual consistency model by object stores, which particularly affects Amazon S3. The Hadoop S3A client now includes S3Guard, a feature which uses a database such as DynamoDB to provide a consistent view of the object store - see the Hadoop docs page for S3Guard for more details.
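The general idea of pairing an eventually consistent store with a consistent metadata table can be sketched as follows. This is a deliberately simplified model and not S3Guard's actual implementation: `LaggingStore`, `GuardedStore` and the `settle` method are hypothetical names, and a Python set stands in for the external database.

```python
class LaggingStore:
    """Hypothetical object store whose listings lag behind writes."""

    def __init__(self):
        self.data = {}
        self.listed = set()  # keys the listing currently reports

    def put(self, key, value):
        self.data[key] = value  # stored, but not yet visible in listings

    def list_keys(self):
        return set(self.listed)

    def settle(self):
        """Simulate the store's listing eventually catching up with its contents."""
        self.listed = set(self.data)


class GuardedStore:
    """Wraps a store and records every write in a consistent metadata table."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()  # stand-in for an external consistent database

    def put(self, key, value):
        self.store.put(key, value)
        self.metadata.add(key)

    def list_keys(self):
        # Merge the possibly stale listing with the recorded writes.
        return self.store.list_keys() | self.metadata


raw = LaggingStore()
guarded = GuardedStore(raw)
guarded.put("data/part-0", b"x")
stale_listing = raw.list_keys()
guarded_listing = guarded.list_keys()
print(stale_listing, guarded_listing)
```

Immediately after the write, the raw listing is still empty while the guarded listing already includes the new key, which is the consistent view that a job listing its own output depends on.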

See our Object Stores page for our list of object store technologies.

Blog Posts