Cloudy with a chance of Hadoop
Running Hadoop in the cloud(s)
Ram Venkatesh
Mingliang Liu
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Presenters
Mingliang Liu
Software Engineer, Hortonworks
Apache Hadoop Committer
Ram Venkatesh
Senior Director of Engineering, Hortonworks
Agenda
 Use cases and scenarios
 Problems encountered and lessons learned
 A couple deep dives
– Fault tolerance
– Object storage consistency
 Wrap-up
Hadoop-in-the-cloud Use Cases
 Full on-premise multi-user, multi-tenant cluster with all Hadoop ecosystem components
 Per-workload clusters for specific use cases
– Clusters for processing batch jobs such as MapReduce, Pig, Hive or Spark jobs
– Clusters for running interactive workloads such as Hive LLAP or Livy
 Dev, QA, UAT setups for non-production and pre-production use cases
 Production setups with SLAs, monitoring, and more
 Self-service vs full-service
– Some clusters set up by sophisticated DevOps and admin groups
– Some clusters spun up by end users such as data engineers or data scientists
 Long-running vs ephemeral
 Varying security and compliance requirements
No one-size-fits-all solution possible!
Hortonworks Cloud Solutions
                            Microsoft         AWS                              Google
  Managed                   Azure HDInsight   –                                –
  Non-Managed/Marketplace   –                 Hortonworks Data Cloud for AWS   –
  Cloud IaaS                Hortonworks Data Platform (via Cloudbreak and Ambari)
                            ^ FOCUS OF THIS TALK
Easily Launch HDP on Any Cloud with Cloudbreak
Use cases: Dev/Test, BI/Analytics, IoT, On-Premises.
Cloudbreak is a tool for provisioning clusters on cloud infrastructure. It allows enterprises to simplify the provisioning of clusters in the cloud and optimize their use of cloud resources as workloads change.
Cloudbreak Goals and Motivations
 Declarative/full Hadoop stack provisioning in all major cloud providers
 Automate and unify the process
 Same process through the cluster lifecycle (Dev, QA, UAT, Prod)
 Provide first-class DevOps tooling – UI, REST API and CLI/shell
 Flexible cluster shapes – security, HA, cluster topologies
 Cloud friendly – elasticity, auto-scaling, fault tolerance, auto recovery
Lessons Learned
 Not all cloud providers are the same
– Difference in performance, storage and functionality
 Know your customer – capacity planning is fundamentally different
– Based on workload type (batch / interactive and ad-hoc / long running)
– Use heterogeneous clusters
– Cluster size is a variable in your calculations
– Trial and error – mistakes are cheap, iterate until you find your best fit
 Storage and what you do with it matters
– Multiple choices (ephemeral, block storage and object stores)
– Speed, Cost, Reliability are all important factors to think about
– defaultFS vs cloud object store connector architecture
– Default Hive warehouse directory configuration
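As an illustration of that last point, pointing the Hive warehouse at cloud storage rather than the cluster-local defaultFS might look like the following hive-site.xml fragment; the bucket name here is hypothetical:

```xml
<property>
  <name>hive.metastore.warehouse.dir</name>
  <!-- hypothetical bucket; keeps warehouse data outside the ephemeral cluster -->
  <value>s3a://example-bucket/apps/hive/warehouse</value>
</property>
```

With this, table data survives cluster teardown even though the cluster's HDFS does not.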
Lessons Learned: Cloud Provider Specific
 Compute
– Find your instance types for the workload, use heterogeneous clusters
– Different instance types for transient (e.g. C4, M4) and long running (e.g. H2, D2) clusters
– Dedicated instances (to avoid noisy neighbors and for regulations, e.g. HIPAA)
 Network
– Use enhanced networking (enabled by default on Amazon Linux; for RHEL-based images, apply a patch)
– Placement groups, cross AZ deployments
– Not all instance types can use the 10 Gbit network (e.g. only the larger 8x sizes can)
 Storage
– Azure: the ephemeral disk is faster than the root disk, but its contents do not survive auto-updates
– Multiple storage choices – WASB, ADLS
– Multiple connector choices for S3 – pick S3A, it’s the latest
Lessons You DON’T Want to Learn
 Security considerations – defense in depth, layered from the outside in: Network → Compute → Workload → Data
Some Things to Think About
 Network security
– Private networks, internet connectivity, ports and protocols, security groups
 Compute
– Edge nodes vs cluster nodes, SSH access and who needs it, IAM roles are your friend
 Workloads
– Authenticated end points, API security, workload-specific authentication and authorization
 Data Protection
– At rest, in motion and everything in between
 Extra Credit
– Audit and traceability
– DevOps as code
– Declarative automation for review and change management
And Finally: “Cloud Readiness” and Fault Tolerance
 VM != Bare Metal
 Cloud Storage != HDFS
 HDFS NameNode & YARN ResourceManager HA != Cloud fault tolerance
 All the parts matter… it’s all ephemeral, y’all
 Externalize everything – files, tables, schemas, policies, Ambari state, Cloudbreak state
Demo
Auto-scaling and Fault Tolerance
Cloud Storage Made Better
Bridging S3 consistency model and Hadoop applications
HDFS And Cloud Storage Are Not Mutually Exclusive
[Diagram: evolution towards cloud storage as the primary Data Lake]
1. HDFS-only: the application reads its input from and writes its output to HDFS.
2. Transitional: the application still uses HDFS for input and output, with backup/restore and copies to cloud storage.
3. Goal: the application reads input from and writes output to cloud storage directly, keeping HDFS only for temporary/intermediate data.
An Object Store Pretending to Be a FileSystem
 Cloud Object Stores designed for
– Scale
– Cost
– Geographic Distribution
– Availability
 Cloud-native apps deal directly with cloud storage semantics and limitations
 Hadoop apps should work on cloud storage transparently
– S3A and WASB partially adhere to the FileSystem specification
– ADL supports the WebHDFS REST API
Hadoop FileSystem: One Interface Fits All
org.apache.hadoop.fs.FileSystem
hdfs | s3a | wasb | adl | swift | gs
Practical Problems Using Cloud Storage Service in Hadoop
 P1: Performance
– Separated from compute (e.g. data locality)
– Slow metadata read
 P2: Limitations in APIs
– File formats and access patterns vs object oriented streaming
– Non-atomic operations
• delete(path, recursive=true)
• rename(source, dest)
 P3: Eventual consistency (S3 specific)
– List
– Delete
– Update
P2: Not Atomic API: rename()
 On an object store, rename() is not atomic – it is a series of operations on the client. For example, renaming /work/pending/part-01 to /work/complete/part-01 across servers s01–s04:
– hash("/work/pending/part-01") → ["s02", "s03", "s04"]
– copy("/work/pending/part-01", "/work/complete/part-01")
– delete("/work/pending/part-01")
– hash("/work/pending/part-00") → ["s01", "s02", "s04"], and the copy/delete sequence repeats for the next file
A client failure between the copy and the delete leaves the namespace in an inconsistent, half-renamed state.
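The client-side rename can be sketched with a toy in-memory model of an object store (illustrative only, not the real S3A code); it shows that rename is a copy followed by a delete, with a window in which both paths exist:

```python
# Toy model of client-side rename() on an object store (not the real S3A code).
class ToyObjectStore:
    def __init__(self):
        self.objects = {}  # key -> bytes

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dest):
        self.objects[dest] = self.objects[src]

    def delete(self, key):
        self.objects.pop(key, None)

    def rename(self, src, dest, fail_between=False):
        # rename() is emulated as copy-then-delete: two separate calls.
        self.copy(src, dest)
        if fail_between:
            raise RuntimeError("client died after copy, before delete")
        self.delete(src)

store = ToyObjectStore()
store.put("/work/pending/part-01", b"data")
try:
    store.rename("/work/pending/part-01", "/work/complete/part-01",
                 fail_between=True)
except RuntimeError:
    pass
# After the simulated failure, BOTH paths exist: the rename was not atomic.
print(sorted(store.objects))
```

On HDFS the same rename is a single atomic metadata operation, which is why committers that rely on rename() behave differently on object stores.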
P3: Eventual Consistency From FileSystem’s View
 When listing a directory
– Newly created files may not yet be visible, deleted ones still present
 After updating a file
– Opening and reading the file may still return the previous data
 After deleting a file
– Opening the file may succeed, returning the data
 While reading an object
– If object is updated or deleted during the process
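These behaviors can be reproduced with a toy eventually-consistent store (illustrative only; real S3 behavior depends on timing and replication). Writes land on a primary copy, while reads and listings are served from a replica that lags behind:

```python
# Toy eventually-consistent store: writes go to a primary copy, but reads
# and listings are served from a stale replica until propagate() is called.
class EventuallyConsistentStore:
    def __init__(self):
        self.primary = {}   # latest writes
        self.replica = {}   # stale view served to readers

    def put(self, key, data):
        self.primary[key] = data

    def delete(self, key):
        self.primary.pop(key, None)

    def get(self, key):
        return self.replica.get(key)   # may return deleted or old data

    def list_keys(self, prefix):
        return sorted(k for k in self.replica if k.startswith(prefix))

    def propagate(self):
        self.replica = dict(self.primary)   # replica catches up

store = EventuallyConsistentStore()
store.put("/work/part-00", b"v1")
store.propagate()
store.delete("/work/part-00")       # delete acknowledged by the primary...
print(store.get("/work/part-00"))   # ...but a read still returns b'v1'
store.put("/work/part-01", b"new")
print(store.list_keys("/work/"))    # deleted file still listed, new one missing
store.propagate()
print(store.list_keys("/work/"))    # now only the new file remains
```

This is exactly the window in which a Hive or Spark job can compute over incomplete or stale input.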
P3: Eventually Consistent – Seeing Deleted Data
[Diagram: across servers s01–s04, the client issues DELETE /work/pending/part-00 and receives 200; subsequent GET /work/pending/part-00 requests also return 200 with the old data, because not all replicas have observed the delete yet.]
S3Guard: Fast And Consistent S3 Metadata
 Goals
– Provide consistent list and get status operations on S3 objects written with S3Guard enabled
• listStatus() after put and delete
• getFileStatus() after put and delete
– Performance improvements that impact real workloads
– Provide tools to manage associated metadata and caching policies.
 Again, 100% open source in the Apache Hadoop community
– Hortonworks, Cloudera, Western Digital, Disney, …
 Inspired by the Apache-licensed S3mper project from Netflix
 Seamless integration with S3AFileSystem
S3Guard: Core Ideas
 Using a consistent store (DynamoDB) for indexing metadata
 Mutating file system operations
– Update both S3 and DynamoDB
 Read operations
– First check results against the metadata in DynamoDB
– Return results to callers as sourced from S3
– On disagreement, S3A waits and rechecks both S3 and DynamoDB
 Try it today!
<property>
<name>fs.s3a.metadatastore.impl</name>
<value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
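A minimal sketch of the core idea (not the actual S3Guard implementation): mutating operations record creations and deletions in a consistent metadata store, and listings union the possibly-stale S3 results with that metadata, applying tombstones for deletes:

```python
# Sketch of the S3Guard idea: a consistent metadata store alongside an
# eventually-consistent object store. Not the real implementation.
class GuardedStore:
    def __init__(self, stale_list):
        self.stale_list = stale_list  # callable simulating a (possibly stale) S3 LIST
        self.created = set()          # metadata store: keys known to exist
        self.deleted = set()          # metadata store: tombstones for deletes

    def create(self, key):
        # Mutations update both S3 (elided here) and the metadata store.
        self.created.add(key)
        self.deleted.discard(key)

    def delete(self, key):
        self.deleted.add(key)
        self.created.discard(key)

    def list_keys(self, prefix):
        from_s3 = set(self.stale_list(prefix))
        from_meta = {k for k in self.created if k.startswith(prefix)}
        # The union adds keys S3 hasn't surfaced yet;
        # tombstones drop entries S3 still (stalely) reports.
        return sorted((from_s3 | from_meta) - self.deleted)

# A stale S3 LIST: still shows a deleted key, misses a newly created one.
stale = lambda prefix: ["/work/old", "/work/deleted"]
g = GuardedStore(stale)
g.create("/work/old")
g.create("/work/new")        # written with S3Guard enabled
g.delete("/work/deleted")    # tombstoned in the metadata store
print(g.list_keys("/work/"))  # ['/work/new', '/work/old']
```

S3 remains the source of truth for object data; the metadata store only corrects the view of the namespace.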
S3Guard Write/Read Path
[Diagram] The Hadoop application issues FileSystem operations against S3AFileSystem (the Hadoop FileSystem for Amazon S3), which wraps a DynamoDB client and an S3 client.
Write path: (1) write object data to Amazon S3, then (2) write fs metadata to Amazon DynamoDB.
Read path: (1) read fs metadata from DynamoDB, (2) list object info from S3, (3) read object data from S3.
S3Guard: Faster Metadata Listing
[Bar chart: Hive query performance with and without S3Guard, runtime in seconds (0–70); bars compare total runtime of the query and split computation runtime. The lower, the better – S3Guard reduces both.]
Practical Problems Using Amazon S3 in Hadoop
 P1: Performance
– Separated from compute
– Slow metadata read
 P2: Limitations in APIs
 P3: Eventual consistency
– Listing inconsistency
– Delete inconsistency
– Update inconsistency
Learn More
 Try Cloudbreak Today
– https://hortonworks.com/open-source/cloudbreak/
 Try Hortonworks Data Cloud Today
– GA: https://aws.amazon.com/marketplace/pp/B01LXOQBOU
– Technical Preview: http://hortonworks.github.io/hdp-aws/
 BREAKOUT SESSIONS
– Wednesday, June 14 @ 5:50p, Don’t Let Spark Burn Your House: Perspectives on Securing Spark
 CRASH COURSE
– Thursday, June 15 @ 3:00p – 6:00p, Apache Spark and Apache Hive processing on the Cloud
 BIRDS OF A FEATHER
– Thursday, June 15 @ 5:00p, Security and Governance
– Thursday, June 15 @ 5:00p, Cloud and Operations
Thank you


Editor's Notes

  • #15 Next we will talk about the challenges of using cloud storage options for real-world applications. Specifically, we will discuss the Amazon S3 consistency model and how to solve the problems that arise when Hadoop applications work directly with Amazon S3.
  • #16 So, you may have heard opinions stating that, since migrating Hadoop from on-premise clusters to the cloud brings many benefits, cloud storage can replace HDFS. Some argue that Amazon S3, for example, is 10X better than HDFS. There are indeed aspects where S3 is better than HDFS: lower total cost, elasticity, high availability and durability, reduced operational burden, etc. However, there are also problems with using S3 to replace HDFS. For example, HDFS provides much higher read/write throughput and strong file system guarantees. We will talk about the practical problems shortly. In fact, they have different pros and cons for different use cases, and fortunately they are not mutually exclusive. We suggest our customers use the right storage in the right place. So in this picture, the cloud object storage service is the final input and output for the Hadoop application, while HDFS serves as intermediate data storage during the lifecycle of the virtual cluster in the cloud. In this way, we can exploit the advantages of the cloud storage service while keeping high throughput, so overall performance is not compromised.
  • #17 How do we do that? Ideally, the cloud storage connectors adhere to the file system specification, so the upper-level applications are not aware of the underlying implementation and migration is extremely easy. However, a cloud storage service like Amazon S3 has its own REST API, so we have to wrap its specific semantics and present it as a file system.
  • #18 Why Hadoop FileSystem? Because in the Hadoop world, we have one interface that fits all. In this picture, the upper level is the Hadoop applications and the lower level is the storage system implementations, including HDFS, WASB, S3, …
  • #19 Listing inconsistency: it can take time for newly created objects to appear in listings; there can be a lag in observing changed metadata on existing objects; and there can be a lag in observing deleted objects.
  • #21 Some real-world use cases that can be impacted by the S3 eventual consistency model: - Listing files: newly created files might not be visible to data processing. In Hive, Spark and MapReduce, this can lead to erroneous results from incomplete source data or failure to commit all intermediate results. - Extract, Transform, Load (ETL) workflows: systems like Oozie rely on marker files to trigger subsequent workflows. Any delay in the visibility of these files can delay the subsequent workflows.
  • #23 As it stands, using Amazon S3 as the direct output filesystem for queries is dangerous, as the eventually-consistent nature of the store means that invalid results may be generated. This is more likely with larger datasets and longer queries — the kind used in production, rather than development.
  • #24 Currently S3Guard uses Amazon DynamoDB as the metadata store because of its low latency, high availability and seamless scalability. More importantly, because users are already using the Amazon S3 web service, it makes perfect sense for them to use another fully-managed web service like DynamoDB instead of maintaining a secondary metadata store themselves. S3 is still the source of truth.
  • #25  BACKUP As indicated in the above figure, hadoop applications use the S3A filesystem client whose reads and writes we have already sped up. Only now, the client has had the transparent S3Guard extension enabled. Now any write operations that mutate the file system tree such as file creation and deletion will firstly go to S3 for persisting objects, after which they will update the DynamoDB metadata store accordingly. Read operations continue to return results to callers as sourced from S3 as the system of record, but those operations first check their results against the metadata in the consistent store. By unioning the results from both S3 and DynamoDB, listing operations (e.g. listStatus, listLocatedStatus and listFiles) will have latest view of the file system tree information. Overall, S3Guard enables S3 to be used as the intermediate store of queries and the direct destination of output with a consistent model.
  • #26 Meanwhile, S3Guard reduces the number of calls to S3 and helps improve the performance of listing files – a core operation in the "split calculation" at the start of queries. S3Guard provides tangible benefits when executing queries, as well as the less visible but more critical prevention of inconsistent listings. Our early performance benchmarking shows that S3Guard can cut split computation time in half on datasets involving a large number of partitions. S3Guard doesn't just deliver consistency – it delivers speed. Work is still in progress to improve performance further by making S3Guard data authoritative, which can reduce the runtime by a further large margin.
  • #27 Future work: our ongoing effort is to enhance delete consistency, to implement retry policies for dealing with disagreements between S3 and the metadata store, and to further improve list performance when the subtree of the file system is fully tracked in the metadata store. Alongside this, we are working on the next big step change: a zero-rename committer for S3 using S3Guard.