Steve is co-author of the Swift connector, author of the Hadoop FS spec and general mentor of the S3A work. He has been full time on S3A, using Spark as the integration test suite, since March.
Rajesh has been doing lots of scale runs and profiling, initially for Hive/Tez, and is now looking at LLAP performance problems in the cloud. Chris has done work on HDFS, Azure WASB and most recently S3A.
Everything uses the Hadoop FS API to talk to HDFS, Hadoop Compatible Filesystems and object stores.
Under the FS API go filesystems and object stores.
HDFS is a "real" filesystem; WASB/Azure is close enough. What is "real"? Best test: it can support HBase.
Object stores are often eventually consistent. Objects are replicated across servers for availability. Changes to a replica take time to propagate to the other replicas; the store is inconsistent during this process.
Directory rename and delete may be performed as a series of operations on the client. Specifically, delete(path, recursive=true) may be implemented as "list the objects, and delete them singly or in batches", and rename(source, dest) may be implemented as "copy all the objects, and then delete them".
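The client-side sequences described above can be sketched with a toy in-memory store. This is an illustration only, not the S3A implementation: `ToyObjectStore`, `delete_recursive` and `rename` are hypothetical names, and the request counter simply shows that the cost grows with the number of objects.

```python
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store: key -> bytes."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # number of simulated store requests

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete_batch(self, keys):
        self.requests += 1
        for key in keys:
            self.objects.pop(key, None)


def delete_recursive(store, path, batch_size=2):
    """delete(path, recursive=true): list the objects, delete them in batches."""
    keys = store.list_prefix(path.rstrip("/") + "/")
    for i in range(0, len(keys), batch_size):
        store.delete_batch(keys[i:i + batch_size])


def rename(store, source, dest):
    """rename(source, dest): copy all the objects, then delete the originals."""
    src_prefix = source.rstrip("/") + "/"
    dst_prefix = dest.rstrip("/") + "/"
    keys = store.list_prefix(src_prefix)
    for key in keys:
        store.copy(key, dst_prefix + key[len(src_prefix):])
    store.delete_batch(keys)


store = ToyObjectStore()
for name in ("part-0000", "part-0001", "part-0002"):
    store.put("work/_temporary/" + name, b"data")

store.requests = 0
rename(store, "work/_temporary", "work/output")
# 1 list + 3 copies + 1 batch delete = 5 requests for a 3-object "directory"
rename_requests = store.requests
```

Neither operation is atomic: a failure partway through the loop leaves some objects copied or deleted and others not.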
Create consistency: a newly created object will always be immediately visible to HEAD/GET.
Update: overwritten objects may take time to become visible.
Delete: delete operations may not be immediately visible to all callers.
Listing: list operations may take time to show newly created objects, may lag in reflecting changed metadata about existing objects, and may lag in observing deleted objects.
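A toy model of the create and listing cases may make the difference concrete. This is a sketch under stated assumptions: `LaggingStore`, its `lag` parameter and the tick-based clock are invented for illustration and do not model the timing of any real store.

```python
class LaggingStore:
    """Hypothetical store: HEAD sees a new object at once, LIST lags behind."""

    def __init__(self, lag=2):
        self.lag = lag        # ticks before a new key appears in listings
        self.clock = 0
        self.objects = {}     # key -> data, visible to HEAD/GET immediately
        self.listed_at = {}   # key -> tick at which LIST starts returning it

    def tick(self, n=1):
        self.clock += n

    def put(self, key, data):
        self.objects[key] = data
        self.listed_at[key] = self.clock + self.lag

    def head(self, key):
        return key in self.objects

    def list_prefix(self, prefix):
        return sorted(
            k for k in self.objects
            if k.startswith(prefix) and self.listed_at[k] <= self.clock
        )


store = LaggingStore(lag=2)
store.put("table/part-0000", b"rows")

visible_to_head = store.head("table/part-0000")   # create consistency
listing_before = store.list_prefix("table/")      # listing has not caught up
store.tick(2)
listing_after = store.list_prefix("table/")       # listing is now complete
```

A job that plans its work from `listing_before` would silently skip the new file, which is the failure mode that matters for query planning.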
At the beginning of any query, slow and incomplete listing operations hamper the "partitioning" phase, the selection of files with relevant data. At the end of jobs, the lack of instantaneous renames stops S3 being safely used as a destination for work.
Amazon EMR has reimplemented something similar, storing all that directory information in Amazon DynamoDB. But there's never been any equivalent in the open source S3 client(s) in Apache Hadoop.
The EMR File System (EMRFS) and the Hadoop Distributed File System (HDFS) are both installed on your EMR cluster. EMRFS is an implementation of HDFS which allows EMR clusters to store data on Amazon S3. You can enable Amazon S3 server-side and client-side encryption as well as consistent view for EMRFS using the AWS Management Console, AWS CLI, or you can use a bootstrap action (with CLI or SDK) to configure additional settings for EMRFS.
Enabling Amazon S3 server-side encryption allows you to encrypt objects written to Amazon S3 by EMRFS. EMRFS support for Amazon S3 client-side encryption allows your cluster to work with S3 objects that were previously encrypted using an Amazon S3 encryption client. Consistent view provides consistency checking for list and read-after-write (for new put requests) for objects in Amazon S3. Enabling consistent view requires you to store EMRFS metadata in Amazon DynamoDB. If the metadata is not present, it is created for you.
That's changed with the S3Guard extension to Hadoop's "s3a" client.
You can now use Amazon DynamoDB as a complete, high-performance reference "store of record" for all the S3 directory information. This speeds up output as well as input, allowing S3 to be used as the intermediate store of queries and the direct destination of output, with a consistent model.
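The "store of record" idea can be sketched as a wrapper that records every write in a consistent table and answers listings from that table. This is a toy illustration and not S3Guard's actual code: `LaggingStore` and `GuardedStore` are hypothetical names, and a Python set stands in for the DynamoDB table.

```python
class LaggingStore:
    """Hypothetical object store whose listings lag behind writes."""

    def __init__(self, lag=2):
        self.lag = lag
        self.clock = 0
        self.objects = {}
        self.listed_at = {}

    def tick(self, n=1):
        self.clock += n

    def put(self, key, data):
        self.objects[key] = data
        self.listed_at[key] = self.clock + self.lag

    def list_prefix(self, prefix):
        return sorted(
            k for k in self.objects
            if k.startswith(prefix) and self.listed_at[k] <= self.clock
        )


class GuardedStore:
    """Wraps a store and keeps directory information in a consistent table."""

    def __init__(self, store):
        self.store = store
        self.table = set()  # stand-in for the DynamoDB table (assumption)

    def put(self, key, data):
        self.store.put(key, data)
        self.table.add(key)  # record the new path as soon as it is written

    def list_prefix(self, prefix):
        # Answer from the table, so the listing never waits on the store.
        return sorted(k for k in self.table if k.startswith(prefix))


raw = LaggingStore(lag=2)
guarded = GuardedStore(raw)
guarded.put("out/part-0000", b"rows")

raw_listing = raw.list_prefix("out/")          # still empty: store is lagging
guarded_listing = guarded.list_prefix("out/")  # complete straight away
```

The sketch only holds if every writer goes through the wrapper; an object written directly to the underlying store never reaches the table.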