1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
S3Guard: What’s in Your
Consistency Model?
Mingliang Liu @liuml07
Steve Loughran @steveloughran
December 2016
Steve Loughran
Hadoop committer & PMC, ASF Member
Mingliang Liu
Apache Hadoop committer
Chris Nauroth
Hadoop committer & PMC, ASF member
Rajesh Balamohan
Tez Committer & PMC
S3A:
Hadoop File System for S3
(EMR: use Amazon's s3:// )
Storage Use Evolution
Goal: evolution towards cloud storage as the primary Data Lake
[Diagram: three stages of an HDFS + application deployment. Cloud storage serves first as a backup/restore target, then as a copy-in/copy-out store for input and output, and finally as the direct input/output store, with HDFS holding only tmp data]
org.apache.hadoop.fs.FileSystem
hdfs  s3a  wasb  adl  swift  gs
Hadoop File System - One Interface Fits All
[Diagram: directory tree / -> work -> pending (part-00, part-01) and complete (part-01); blocks 00 and 01 are each replicated three times across the cluster]
rename("/work/pending/part-01", "/work/complete")
A FileSystem: Directories, Files, Data
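On a real filesystem, the rename above is a single atomic metadata operation, independent of how much data the file holds. A minimal sketch against the local filesystem (the paths are the slide's example, recreated under a scratch root):

```python
import os
import tempfile

# Build /work/pending/part-01 under a scratch root (illustrative paths).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "work", "pending"))
os.makedirs(os.path.join(root, "work", "complete"))
src = os.path.join(root, "work", "pending", "part-01")
dst = os.path.join(root, "work", "complete", "part-01")
with open(src, "wb") as f:
    f.write(b"x" * 1024 * 1024)  # 1 MB of payload

# rename() is one metadata update: O(1), atomic, independent of file size.
os.rename(src, dst)

assert not os.path.exists(src)
assert os.path.getsize(dst) == 1024 * 1024
```

This is exactly the guarantee an object store cannot give, which is what the following slides build up to.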
S3A: An Object Store Pretending to Be a FileSystem
• Cloud object stores are designed for
– Scale
– Cost
– Geographic distribution
– Availability
• Cloud-native apps deal directly with cloud storage semantics and limitations
• Hadoop apps should work on cloud storage transparently
– S3A partially adheres to the FileSystem specification
[Diagram: servers s01-s04, each holding replicated blobs; an object's name hashes to the set of servers storing it]
hash("/work/pending/part-00") -> ["s01", "s02", "s04"]
hash("/work/pending/part-01") -> ["s02", "s03", "s04"]
hash(name) -> blob
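The hash-to-servers placement in the diagram can be sketched as follows. The hash function, server list, and replica count are invented for illustration; they are not S3's actual internals:

```python
import hashlib

SERVERS = ["s01", "s02", "s03", "s04"]

def replicas(name, n=3):
    """Map an object name to the n servers holding its replicas.
    A toy stand-in for the store's real placement logic."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    start = h % len(SERVERS)
    return [SERVERS[(start + i) % len(SERVERS)] for i in range(n)]

# Neighbouring names can land on entirely different server sets,
# so "a directory" has no physical locality in the store:
print(replicas("/work/pending/part-00"))
print(replicas("/work/pending/part-01"))
```

The point of the sketch: the namespace is flat, placement is per-object, and nothing ties `/work/pending/*` together on the servers.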
What Is The Problem?
• Performance
– storage is separated from compute
– cloud storage is not designed for file-like access patterns
• Limitations in APIs
– delete(path, recursive=true)
– rename(source, dest)
• Eventual consistency
– Create
– Update
– Delete
– Listing
• listings take time to show newly created objects
• listings lag behind changed metadata on existing objects
• listings lag in observing deleted objects
[Diagram: against servers s01-s04, the client performs]
copy("/work/pending/part-01", "/work/complete/part-01")
delete("/work/pending/part-01")
rename(): A Series of Operations on the Client
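That client-side emulation can be sketched with a plain dict standing in for the object store (names are invented; S3A's real code batches copies and deletes and handles failures):

```python
def client_side_rename(store, src_prefix, dst_prefix):
    """Emulate rename() as LIST + per-object COPY + per-object DELETE.
    Non-atomic: a failure between the two loops leaves objects under
    both prefixes -- one reason object-store 'rename' is unsafe as a
    commit step."""
    keys = [k for k in store if k.startswith(src_prefix)]
    for k in keys:                                       # COPY phase
        store[dst_prefix + k[len(src_prefix):]] = store[k]
    for k in keys:                                       # DELETE phase
        del store[k]

store = {"/work/pending/part-00": b"a", "/work/pending/part-01": b"b"}
client_side_rename(store, "/work/pending/", "/work/complete/")
assert sorted(store) == ["/work/complete/part-00", "/work/complete/part-01"]
```

The cost is also proportional to the data copied, not O(1) as on HDFS.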
Eventual Consistency From the FileSystem's View
• When listing a directory
– Newly created files may not yet be visible; deleted ones may still be listed
• After updating an object
– Opening and reading the object may still return the previous data
• After deleting an object
– Opening the object may succeed, returning the old data
• While reading an object
– If the object is updated or deleted mid-read, the reader may see stale or inconsistent data
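These behaviours can be modelled with a toy two-copy store. This is a sketch only: real stores replicate asynchronously across many servers, and the class and method names here are invented:

```python
class EventuallyConsistentStore:
    """Toy model: writes hit the primary; reads are served from a
    replica that only catches up when replicate() runs (the lag)."""

    def __init__(self):
        self._primary = {}
        self._replica = {}

    def put(self, key, data):
        self._primary[key] = data

    def delete(self, key):
        self._primary.pop(key, None)

    def get(self, key):
        # Read-after-delete can still return the old data.
        return self._replica.get(key)

    def list(self, prefix):
        # Listings lag behind creates and deletes.
        return sorted(k for k in self._replica if k.startswith(prefix))

    def replicate(self):
        # Eventually, the replica converges on the primary.
        self._replica = dict(self._primary)

store = EventuallyConsistentStore()
store.put("/work/pending/part-00", b"data")
store.replicate()
store.delete("/work/pending/part-00")
# Deleted, yet still readable and listable until replication catches up:
assert store.get("/work/pending/part-00") == b"data"
assert store.list("/work/pending") == ["/work/pending/part-00"]
store.replicate()
assert store.get("/work/pending/part-00") is None
```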
[Diagram: DELETE /work/pending/part-00 returns 200, yet a later HEAD and GET of the same key also return 200, serving the old data]
Eventually Consistent – Seeing Deleted Data
S3Guard:
Fast, Consistent S3 Metadata
(EMR: use Amazon's EMRFS)
S3Guard: Fast, Consistent S3 Metadata
• Inspired by the Apache-licensed S3mper project from Netflix
• Uses DynamoDB as the consistent metadata store
• Mutating file system operations
– Update both S3 and DynamoDB
• Read operations
– First check results against the metadata in DynamoDB
– Return results to callers as sourced from S3
– S3A waits and rechecks both S3 and DynamoDB until they agree
• Goals
– Provide consistent list and get-status operations on S3 objects written with S3Guard enabled
• listStatus() after put and delete
• getFileStatus() after put and delete
– Provide tools to manage associated metadata and caching policies
– Configurable error handling when inconsistency is detected
– Performance improvements that impact real workloads
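The read-path idea (check S3 against the metadata store, retry until they agree) can be sketched like this. The function names, the simple subset check, and the retry policy are all invented for illustration and are not S3A's actual implementation:

```python
def guarded_list(s3_list, metadata_list, prefix, retries=5):
    """Return a listing only once S3 shows every object the metadata
    store (DynamoDB in S3Guard) has recorded under the prefix."""
    for _ in range(retries):
        visible = set(s3_list(prefix))
        recorded = set(metadata_list(prefix))
        if recorded <= visible:          # S3 has caught up
            return sorted(visible)
        # Inconsistent: recheck (real code sleeps with backoff here).
    raise IOError("S3 listing still missing objects recorded in metadata")

# An S3 stub whose listing lags one call behind the metadata store:
responses = [["/work/part-00"], ["/work/part-00", "/work/part-01"]]
s3_stub = lambda prefix: responses.pop(0)
meta_stub = lambda prefix: ["/work/part-00", "/work/part-01"]

assert guarded_list(s3_stub, meta_stub, "/work") == [
    "/work/part-00", "/work/part-01"]
```

A fuller model would also record deletes (tombstones) so that stale entries still listed by S3 can be filtered out.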
[Diagram: DELETE part-00 -> 200; a stale HEAD part-00 still returns 200, but checking against DynamoDB lets the client retry until HEAD returns 404; a subsequent PUT part-00 then returns 200]
DynamoDB As The Consistent Metadata Store
Demo
https://issues.apache.org/jira/browse/HADOOP-13345
Questions?
Backup Slides
[Diagram: the REST calls behind file system operations, against servers s01-s04]
HEAD /work/complete/part-01
PUT /work/complete/part-01
x-amz-copy-source: /work/pending/part-01
DELETE /work/pending/part-01
PUT /work/pending/part-01
... DATA ...
GET /work/pending/part-01
Content-Length: 1-8192
GET /?prefix=/work&delimiter=/
REST APIs


Editor's Notes

  • #3 Steve is co-author of the Swift connector, author of the Hadoop FS spec, and a general mentor of the S3A work; he has been full time on S3A, using Spark as the integration test suite, since March. Rajesh has been doing lots of scale runs and profiling, initially for Hive/Tez, now looking at LLAP performance problems on cloud. Chris has done work on HDFS, Azure WASB, and most recently S3A.
  • #6 Everything uses the Hadoop FS API to talk to HDFS, Hadoop-compatible filesystems, and object stores. Under the FS API go filesystems and object stores. HDFS is a "real" filesystem; WASB/Azure is close enough. What is "real"? Best test: can it support HBase?
  • #10 Object stores are often eventually consistent. Objects are replicated across servers for availability. Changes to a replica take time to propagate to the other replicas; the store is inconsistent during this process. Directory rename and delete may be performed as a series of operations on the client. Specifically, delete(path, recursive=true) may be implemented as "list the objects, and delete them singly or in batches", and rename(source, dest) may be implemented as "copy all the objects, and then delete them". Create Consistency: a newly created object will always be immediately visible HEAD/GET Update: overwritten objects may take time to be visible Delete: delete operations may not be visible to all callers Listing: list operations may take time to list created objects, lag in changed metadata about existing objects, and lag in observing deleted objects.
  • #12 At the beginning of any query, slower and incomplete listing operations hamper the "partitioning" phase, the selection of files with relevant data. At the end of jobs, the lack of instantaneous renames stops S3 being safely used as a destination for work.
  • #14 Amazon EMR reimplementing something similar: storing all that directory information in Amazon DynamoDB. But there's never been any equivalent in the open source S3 client(s) in Apache Hadoop. The EMR File System (EMRFS) and the Hadoop Distributed File System (HDFS) are both installed on your EMR cluster. EMRFS is an implementation of HDFS which allows EMR clusters to store data on Amazon S3. You can enable Amazon S3 server-side and client-side encryption as well as consistent view for EMRFS using the AWS Management Console, AWS CLI, or you can use a bootstrap action (with CLI or SDK) to configure additional settings for EMRFS. Enabling Amazon S3 server-side encryption allows you to encrypt objects written to Amazon S3 by EMRFS. EMRFS support for Amazon S3 client-side encryption allows your cluster to work with S3 objects that were previously encrypted using an Amazon S3 encryption client. Consistent view provides consistency checking for list and read-after-write (for new put requests) for objects in Amazon S3. Enabling consistent view requires you to store EMRFS metadata in Amazon DynamoDB. If the metadata is not present, it is created for you.
  • #16 That's changed with the S3Guard extension to Hadoop's "s3a" client. You can now use Amazon DynamoDB as a complete, high-performance reference "store of record" for all the S3 directory information. This speeds up output as well as input, allowing S3 to be used as the intermediate store of queries and the direct destination of output, with a consistent model.