Learn how to set up a highly scalable, robust, and secure Hadoop platform using Amazon EMR. We'll perform a demonstration using a 100-node Amazon EMR cluster and take you through the best practices and performance tuning required for different workloads to ensure they are production-ready.
Speaker: Amo Abeyaratne, Big Data Consultant, Amazon Web Services
Featured Customer - Ambidata
3. Size: Growing in Petabytes
“If every byte was a word, and you took a second to read a word, it will take you 32 million years to read a whole Petabyte”
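The figure in the quote can be checked with a little arithmetic. The sketch below assumes a petabyte of 10^15 bytes and an average year of 365.25 days; both are assumptions, since the slide does not say which definitions it used.

```python
# Back-of-the-envelope check of the "32 million years" figure.
# Assumptions: 1 petabyte = 10**15 bytes, one byte (word) read per second.
BYTES_PER_PETABYTE = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days


def years_to_read(num_bytes, bytes_per_second=1.0):
    """Years needed to read num_bytes at a constant rate."""
    return num_bytes / bytes_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = years_to_read(BYTES_PER_PETABYTE)
    print(f"{years / 1e6:.1f} million years")
```

Under these assumptions the result lands just under 32 million years, which matches the rounded figure on the slide.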
5. Strategy: Divide and Conquer
Hadoop
Amazon EMR: A managed Hadoop framework in the cloud
Hadoop on EC2: Managing on your own can be a LOT of work
Hadoop has to scale?
8. Agenda
Why EMR?
Well Architected EMR: Design for Production
DEMO
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR
12. Why EMR? Decouple Storage and Compute
Amazon Kinesis
(Streams, Firehose)
Hadoop Jobs
Persistent Cluster – Interactive Queries
(Spark-SQL | Presto | Impala)
Transient Cluster - Batch Jobs
(X hours nightly) – Add/Remove Nodes
ETL Jobs
Hive External Metastore (e.g. Amazon RDS)
Workload specific clusters
(Different sizes, Different Versions)
Amazon S3 for Storage
CREATE EXTERNAL TABLE t_name (..) ...
LOCATION 's3://bucketname/path-to-file/';
13. Why EMR? Elastic
Intelligent resize: Wait for work to finish before stopping instances
Core nodes for scaling HDFS
Task nodes for scaling processing power
Use instance groups to manage different instance types in the same cluster
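As a rough illustration of instance groups, the sketch below builds a master, core and task group in the shape that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The helper name, instance types, counts and bid price are all illustrative assumptions, not values from the talk.

```python
# Sketch of an instance-group layout for an EMR cluster.
# The helper name, instance types and bid price are hypothetical examples.

def build_instance_groups(core_count, task_count, task_bid_price_usd):
    """Return master, core and task groups; only task nodes use Spot."""
    return [
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m3.xlarge", "InstanceCount": 1},
        # Core nodes hold HDFS, so they stay On-Demand in this sketch.
        {"Name": "core-hdfs", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m3.2xlarge", "InstanceCount": core_count},
        # Task nodes only add processing power and can be resized freely.
        {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
         "BidPrice": f"{task_bid_price_usd:.2f}",  # the API expects a string
         "InstanceType": "c3.2xlarge", "InstanceCount": task_count},
    ]


if __name__ == "__main__":
    for group in build_instance_groups(4, 10, 0.25):
        print(group["InstanceRole"], group["InstanceType"], group["InstanceCount"])
```

Keeping HDFS on core nodes and pure compute on task nodes means the task group can be grown or shrunk without touching stored data.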
14. Why EMR? Current
Application | Open source release | EMR release
Spark 1.5   | September 9, 2015   | September 2015
Spark 1.5.2 | November 9, 2015    | November 2015
Spark 1.6   | January 4, 2016     | January 2016
Spark 1.6.1 | March 9, 2016       | April 4, 2016
15. Why EMR? Easy Integration with AWS services
- AWS Data Pipeline for data flow orchestration
- AWS KMS for encryption key management
- Kinesis Connector for streaming data access
- Spark Streaming with Kinesis Client Library (KCL)
- S3 via EMRFS
- Wrap Amazon EMR cluster with IAM roles for access policies
- Map Amazon DynamoDB tables with DynamoDB connector for Hive
- Connect to Amazon Redshift through redshift-spark connector
- EBS support for scaling HDFS
16. Why EMR? Low-cost
Spot instances: Bid for unused EC2 capacity at up to 90% below the On-Demand price
Transient clusters: Terminate the cluster when not in use
Reserved instances: For persistent clusters, make use of EC2 reserved instances to save up to 50%
17. Agenda
Why EMR?
Well Architected EMR: Design for Production
DEMO
- Automation
- Decouple
- Elastic
- Integration
- Current
- Cost-efficient
19. Demo: EMR cluster with 100 nodes
- All data stored on Amazon S3
- Data accessed through EMRFS
- Spark on EMR for processing
- Ganglia for monitoring the workload
Stack: Spark-SQL / Spark 1.6.1 / YARN / Hadoop / EMRFS / Amazon S3
20. Agenda
Why EMR?
Well Architected EMR: Design for Production
DEMO
- Automation
- Decouple
- Elastic
- Integration
- Current
- Cost-efficient
- SparkSQL
- EMRFS (S3://)
- Ganglia
- EMR CLI
- EMR Console
22. Well Architected EMR: Design for Production
Security | Reliability | Performance | Cost Efficiency
23. Well Architected EMR: Performance Efficiency
- Choice of instance type and cluster size
- Choice of storage
- Framework performance
24. Performance: Choice of instance type - Master
Less than 50 nodes?
- YES: M3.xlarge
- NO: M3.2xlarge or larger
Heavy network I/O?
- YES: C3 family or R3 with Enhanced Networking
25. Performance: Choice of instance type – Workers
Category | Typical workload     | Instance types
Compute  | Machine Learning     | C1 Family, C3 Family, CC1.4xlarge, CC2.8xlarge
Memory   | Interactive Analysis | M2 Family, R3 Family, CR1.8xlarge
Storage  | Large HDFS           | D2 Family, I2 Family
General  | Batch Process        | M3 Family, M1 Family
27. Performance: Cluster Sizing
Guidelines:
- Size based on HDFS storage first if needed
- Add enough (task) nodes to handle processing
- Do not add more than 5 task nodes per core node
- Prefer smaller clusters of larger machines
It’s a space-time trade-off
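The sizing guidelines above can be sketched as a small calculation: size the core nodes for HDFS first, then cap the task nodes at five per core node. The function name, the replication factor of 3 and the usable storage per node are assumptions for illustration; real values depend on the instance type and cluster configuration.

```python
import math


def size_cluster(hdfs_data_tb, usable_tb_per_core_node, wanted_task_nodes,
                 replication_factor=3, max_task_per_core=5):
    """Return (core_nodes, task_nodes) following the slide's guidelines.

    replication_factor=3 is an assumed HDFS setting, not a value from the talk.
    """
    # 1. Size core nodes from the raw HDFS footprint (data x replication).
    raw_tb = hdfs_data_tb * replication_factor
    core_nodes = max(1, math.ceil(raw_tb / usable_tb_per_core_node))
    # 2. Add task nodes for processing, capped at 5 per core node.
    task_nodes = min(wanted_task_nodes, core_nodes * max_task_per_core)
    return core_nodes, task_nodes


if __name__ == "__main__":
    # 10 TB of data, 2 TB usable per core node, 100 task nodes requested.
    print(size_cluster(10, 2, 100))
```

Capping the task nodes is only one way to honour the ratio; adding core nodes until the requested task count fits would be an equally valid reading of the guideline.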
28. Performance: Choice of storage
EMRFS (S3)
- Ability to decouple
- Reliable and durable
- Cost efficient
- Works well for jobs that read a dataset once per run
HDFS
- Need a persistent cluster
- Reliability is configurable. But need multiple nodes to achieve replication factor
- Great for jobs with iterative reads on the same dataset like machine learning
Combine with s3-dist-cp and move from S3 once to HDFS for iterative workloads
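One way to express the "copy once from S3 to HDFS" pattern is an EMR step that runs s3-dist-cp. The sketch below only builds the step definition, in the shape boto3's `add_job_flow_steps` accepts, and assumes an EMR release that provides `command-runner.jar`. The helper name, bucket and paths are hypothetical.

```python
# Sketch: an EMR step definition that copies a dataset from S3 into HDFS
# with s3-dist-cp, so iterative jobs can then read it locally.

def s3_dist_cp_step(src, dest, name="Copy S3 data to HDFS"):
    """Build a step dict; src and dest are caller-supplied URIs."""
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


if __name__ == "__main__":
    # Hypothetical bucket and HDFS path, for illustration only.
    step = s3_dist_cp_step("s3://example-bucket/training-data/",
                           "hdfs:///data/training/")
    print(step["HadoopJarStep"]["Args"])
```

S3 remains the source of truth; the HDFS copy is a working set that disappears with the cluster.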
30. S3 Performance: Range GET vs Data Locality?
S3 object (use larger files)
GET Range 0-64MB
GET Range 64-128MB
GET Range 128-192MB
...
GET Range (n-64)-nMB
EMR worker nodes
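To make the range-GET idea concrete, the sketch below splits an object of a given size into 64 MB byte ranges and formats each as an HTTP `Range` header value, which is how a worker could fetch only its slice of a large S3 object. The 64 MB split size comes from the slide; the function names are illustrative, and the actual split logic inside EMRFS is not shown in the talk.

```python
# Sketch: dividing one large object into fixed-size byte ranges,
# one per worker request.

MB = 1024 * 1024


def byte_ranges(object_size, split_size=64 * MB):
    """Yield inclusive (start, end) byte offsets covering the whole object."""
    for start in range(0, object_size, split_size):
        end = min(start + split_size, object_size) - 1
        yield start, end


def range_header(start, end):
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    for start, end in byte_ranges(200 * MB):
        print(range_header(start, end))
```

Larger files give each worker a full-sized range to read, which is why the slide recommends them over many small objects.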
31. S3: Real world heavy EMRFS users
- 1.2 TB/day logs
- 30 TB/day data
- 250 Hadoop jobs
- 75 billion transactions/day
- 5 petabytes of data
- 25 PB data warehouse on Amazon S3
- > 1 PB read each day
33. S3: EMRFS at NASDAQ
Access needs drop off dramatically over time – But, never throw anything away!
Yesterday >> last month >> last quarter >> last year..
34. Performance: Framework Performance
Count these words
Count = 1, These = 1, Words = 1
Embarrassingly parallel?
Can it be optimized with a DAG?
[Diagram: DAG with nodes A, B, C, D, E]
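The word-count example on this slide is the classic embarrassingly parallel job: each partition can be counted independently and the partial counts merged afterwards. A minimal single-machine sketch of that map and reduce pattern, with no Hadoop or Spark involved, could look like this (function names are illustrative):

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Count words in one partition, independently of all others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(partitions):
    """Map every partition, then merge the partial counts (the reduce step)."""
    partials = [map_partition(p) for p in partitions]
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    result = word_count([["Count these words"], ["count these"]])
    print(dict(result))
```

Because the map step shares no state between partitions, it can be spread across as many worker nodes as there are partitions; only the merge needs data to move.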
35. Reliability
Failure Management
- Store your metadata outside the cluster
- Multi-AZ RDS cluster will give you HA
Disaster Recovery
- Keep data and applications on S3
- Maintain source of truth for data on S3 (an immutable data set)
Change Management
Automate with:
- Bootstrap actions
- Config options
- CloudFormation
36. Security
Data Protection: Encryption
- Server side
- Client side
- HDFS Transparent
- RPC with SSL
- File system with LUKS
Privilege Management
- IAM roles
- Secure integration with AWS services
- Hue, HiveServer2 or 3rd party tools support for role based access
Infrastructure Protection
- VPC
- Private Subnets
- S3 endpoints
- NAT
- Security Groups
- Audit with logs on S3
37. Security: End to End Encryption
Amazon S3 Bucket
AWS KMS
AWS S3 SDK
AmazonS3EncryptionClient()
Encrypted Object
EMRFS with Client-side Encryption
HDFS transparent encryption with Hadoop KMS
spark.ssl.enabled
hadoop.rpc.protection
hadoop.ssl.enabled
mapreduce.shuffle.ssl.enabled
Output writes via EMRFS with Client-side Encryption enabled
Amazon S3 Bucket
LUKS with bootstrap action for local file systems
Server Side Encryption for S3 via KMS or any other Key Management service
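The four in-transit properties named on this slide can be grouped into EMR configuration classifications. The sketch below shows one plausible grouping (`core-site`, `mapred-site`, `spark-defaults`) and the values `privacy` and `true`; these are assumptions rather than settings from the talk. A working setup also needs keystores and certificates distributed to the nodes, which this sketch does not cover.

```python
import json

# Sketch: the slide's in-transit encryption properties arranged as a list
# of EMR configuration classifications. Values are illustrative assumptions.

def in_transit_encryption_config():
    """Return configuration objects for Hadoop RPC, shuffle and Spark SSL."""
    return [
        {"Classification": "core-site",
         "Properties": {"hadoop.rpc.protection": "privacy",
                        "hadoop.ssl.enabled": "true"}},
        {"Classification": "mapred-site",
         "Properties": {"mapreduce.shuffle.ssl.enabled": "true"}},
        {"Classification": "spark-defaults",
         "Properties": {"spark.ssl.enabled": "true"}},
    ]


if __name__ == "__main__":
    # JSON of this shape can be supplied when creating a cluster.
    print(json.dumps(in_transit_encryption_config(), indent=2))
```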
38. Cost Efficiency
Matching Supply and Demand
• Is the cluster big enough?
• Can we make it transient?
• Monitor the usage with Ganglia and Amazon CloudWatch alarms
Using cost-effective resources
• S3 instead of HDFS for larger datasets?
• Taking advantage of Spot and Reserved instances?
Optimise over time
• Monitor and watch out for new instance types, features that may reduce cost.
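As one example of monitoring with CloudWatch alarms, the sketch below builds the keyword arguments for a `put_metric_alarm` call on the EMR `IsIdle` metric, so that a cluster left idle for about 30 minutes raises an alarm. The helper name, alarm name, cluster ID and thresholds are hypothetical, and the code only constructs the arguments without calling AWS.

```python
# Sketch: alarm definition for an idle EMR cluster, suitable for passing
# as **kwargs to a CloudWatch client's put_metric_alarm.

def idle_cluster_alarm(cluster_id, idle_periods=6):
    """Alarm when IsIdle stays at 1 for idle_periods x 5 minutes."""
    return {
        "AlarmName": f"emr-{cluster_id}-idle",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,  # seconds; one five-minute window per evaluation
        "EvaluationPeriods": idle_periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


if __name__ == "__main__":
    # "j-EXAMPLE" is a placeholder cluster ID.
    print(idle_cluster_alarm("j-EXAMPLE"))
```

An alarm like this answers the "can we make it transient?" question with data: a cluster that is idle most of the day is a candidate for termination between jobs.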
39. Agenda
Why EMR?
Well Architected EMR: Design for Production
DEMO
- Automation
- Decouple
- Elastic
- Integration
- Current
- Cost-efficient
- SparkSQL
- EMRFS (S3://)
- Ganglia
- EMR CLI
- EMR Console
- Presto
- Performance tuning
- Reliability
- Security facts
- Cost efficiency
71-73. [Diagram, built up over three slides: jobs A to E on one shared 100-node cluster, compared with separate clusters sized per job]
Shared 100-node cluster: A 2 hrs, B 4 hrs, C 2 hrs, D 2 hrs, E 5 hrs
Separate clusters (50, 100 and 250 nodes): A 2 hrs, B 8 hrs, C 2 hrs, D 2 hrs, E 2 hrs
1,500 compute-hrs in 12 hrs
1,700 compute-hrs in 17 hrs
74. 1 Job Per Cluster
Increased Predictability
Ephemeral Clusters
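The compute-hour totals on the preceding diagram slides can be reproduced with simple arithmetic. The mapping of node counts to jobs below is an assumption reconstructed from the slide residue (A, C and D on 100 nodes, B on 50, E on 250), as is the reading of the 1,700 figure as one 100-node cluster held for 17 hours.

```python
# Sketch: comparing compute-hours for one cluster per job against a single
# shared cluster. The node-to-job mapping is an assumed reading of the diagram.

def compute_hours(jobs):
    """Sum nodes x hours over a mapping of job name -> (nodes, hours)."""
    return sum(nodes * hours for nodes, hours in jobs.values())


# One right-sized, ephemeral cluster per job (assumed mapping).
per_job_clusters = {
    "A": (100, 2),
    "B": (50, 8),
    "C": (100, 2),
    "D": (100, 2),
    "E": (250, 2),
}

# One shared 100-node cluster kept running for the whole 17-hour pipeline.
shared_cluster_hours = 100 * 17

if __name__ == "__main__":
    print("per-job clusters:", compute_hours(per_job_clusters))
    print("shared cluster:  ", shared_cluster_hours)
```

Under these assumptions the per-job layout does the same work for fewer compute-hours and finishes sooner, which is the case the slides make for one job per ephemeral cluster.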