Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level with Amazon EMR - Technical 301

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amo Abeyaratne
Big Data and Analytics Consultant, Amazon Web Services
Ben Lever
CTO, Ambiata
Tune your Big Data Platform to Work at Scale
Taking Hadoop to the Next Level with Amazon EMR
Technical 301

Challenge: Data is Everywhere
Phones
Sensors
Websites

Size: Growing in Petabytes
“If every byte was a word, and you took a second to read a word, it would take you 32 million years to read a whole petabyte.”
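The figure in the quote can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a decimal petabyte (10**15 bytes) and a 365.25-day year; neither assumption is stated on the slide.

```python
# Assumptions (not from the slide): 1 PB = 10**15 bytes, 1 year = 365.25 days.
BYTES_PER_PETABYTE = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_read(num_bytes, seconds_per_byte=1.0):
    """Years needed to read num_bytes at a fixed pace per byte."""
    return num_bytes * seconds_per_byte / SECONDS_PER_YEAR


years = years_to_read(BYTES_PER_PETABYTE)
print(f"{years / 1e6:.1f} million years")  # prints "31.7 million years"
```

Under these assumptions the result rounds to the 32 million years quoted above.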

Strategy: Divide and Conquer
Hadoop
Hadoop has to scale?
Hadoop on EC2: Managing on your own can be a LOT of work
Amazon EMR: A Managed Hadoop Framework in the Cloud

Let’s Launch a Cluster

Tool: Amazon EMR
aws emr create-cluster --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=99,InstanceType=m3.xlarge \
  --auto-terminate
aws emr create-cluster \
  --applications Name=Hadoop Name=Hive Name=Hue Name=Spark Name=Ganglia Name=Zeppelin-Sandbox \
  --tags 'name=summitdemo' \
  --ec2-attributes '{"KeyName":"amo_ubuntu","InstanceProfile":"EMR_EC2_DefaultRole","AvailabilityZone":"us-east-1b","EmrManagedSlaveSecurityGroup":"sg-07fa956c","EmrManagedMasterSecurityGroup":"sg-0dfa9566"}' \
  --service-role EMR_DefaultRole \
  --release-label emr-4.4.0 \
  --name 'Summit_Demo_EMR' \
  --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master instance group - 1"},{"InstanceCount":9,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core instance group - 2"},{"InstanceCount":50,"EbsConfiguration":{"EbsBlockDeviceConfigs":[],"EbsOptimized":false},"InstanceGroupType":"TASK","InstanceType":"m3.xlarge","Name":"Task instance group - 7"}]' \
  --configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.consistent.retryPeriodSeconds":"10","fs.s3.consistent":"true","fs.s3.consistent.retryCount":"5","fs.s3.consistent.metadata.tableName":"EmrFSMetadata"},"Configurations":[]},{"Classification":"hive-site","Properties":{"javax.jdo.option.ConnectionUserName":"admin","javax.jdo.option.ConnectionPassword":"Passw0rd!","javax.jdo.option.ConnectionURL":"jdbc:mysql://testmysql51.cotvmrgi63jf.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true"},"Configurations":[]}]' \
  --region us-east-1
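Hand-editing the long JSON arguments of the command above is error-prone. As a rough sketch, the `--instance-groups` value could be generated rather than typed. The helper name `instance_group` is hypothetical, and only the three keys shown on the slide are emitted.

```python
import json
import shlex


def instance_group(group_type, count, instance_type):
    """One entry of the --instance-groups JSON list (hypothetical helper)."""
    return {
        "InstanceGroupType": group_type,
        "InstanceCount": count,
        "InstanceType": instance_type,
    }


# Group sizes taken from the second command on the slide.
groups = [
    instance_group("MASTER", 1, "m3.xlarge"),
    instance_group("CORE", 9, "m3.xlarge"),
    instance_group("TASK", 50, "m3.xlarge"),
]

# Compact JSON, shell-quoted so it can be pasted after the flag.
argument = json.dumps(groups, separators=(",", ":"))
print("--instance-groups " + shlex.quote(argument))
```

Generating the argument keeps the node counts in one place and guarantees the JSON is well formed before the CLI ever sees it.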

Agenda
Why EMR?
Well Architected EMR
Design for Production
DEMO
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR

Why EMR?
Automation
Decouple
Elastic
Integration
Low-cost
Current

Why EMR? Automation
EC2 Provisioning
Cluster Setup
Hadoop Configuration
Installing Applications
Job Submission
Monitoring and Failure Handling

Why EMR? Decouple Storage and Compute
Amazon Kinesis (Streams, Firehose)
Hadoop Jobs
Persistent Cluster: Interactive Queries (Spark-SQL | Presto | Impala)
Transient Cluster: Batch Jobs (X hours nightly), Add/Remove Nodes
ETL Jobs
Hive External Metastore, e.g. Amazon RDS
Workload-specific clusters (different sizes, different versions)
Amazon S3 for Storage
create external table t_name(..)... location 's3://bucketname/path-to-file/'

Why EMR? Elastic
Intelligent resize: wait for work to finish before stopping instances
Core nodes for scaling HDFS
Task nodes for scaling processing power
Instance groups: manage different instance types in the same cluster

Why EMR? Current
Application    Open source release    EMR release
Spark 1.5      September 9, 2015      September 2015
Spark 1.5.2    November 9, 2015       November 2015
Spark 1.6      January 4, 2016        January 2016
Spark 1.6.1    March 9, 2016          April 4, 2016

Why EMR? Easy Integration with AWS services
• AWS Data Pipeline for data flow orchestration
• AWS KMS for encryption key management
• Kinesis Connector for streaming data access
• Spark Streaming with Kinesis Client Library (KCL)
• S3 via EMRFS
• Wrap Amazon EMR cluster with IAM roles for access policies
• Map Amazon DynamoDB tables with DynamoDB connector for Hive
• Connect to Amazon Redshift through redshift-spark connector
• EBS support for scaling HDFS

Why EMR? Low-cost
Spot instances: Bid for unused EC2 capacity at up to 90% below the On-Demand price
Transient clusters: Terminate the cluster when not in use
Reserved instances: For persistent clusters, use EC2 Reserved Instances to save up to 50%
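To make the three levers concrete, here is a small cost sketch. The hourly price, node count and nightly run length are hypothetical placeholders, and the 90% and 50% figures are the "up to" ceilings from the slide, so the Spot and Reserved results are best cases rather than expected savings.

```python
def hourly_cost(on_demand_price, nodes, discount=0.0):
    """Cluster cost per hour at a fractional discount off the On-Demand price."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return on_demand_price * nodes * (1.0 - discount)


def daily_cost(cost_per_hour, hours_per_day=24):
    """Cost per day for a cluster that runs hours_per_day hours."""
    return cost_per_hour * hours_per_day


PRICE = 0.30   # hypothetical On-Demand price per node-hour, in USD
NODES = 100    # hypothetical cluster size

on_demand = hourly_cost(PRICE, NODES)
spot_best_case = hourly_cost(PRICE, NODES, discount=0.90)
reserved_best_case = hourly_cost(PRICE, NODES, discount=0.50)

persistent_day = daily_cost(on_demand)                  # always-on cluster
transient_day = daily_cost(on_demand, hours_per_day=4)  # hypothetical 4-hour nightly batch

for label, value in [
    ("On-Demand per hour", on_demand),
    ("Spot per hour (best case)", spot_best_case),
    ("Reserved per hour (best case)", reserved_best_case),
    ("Persistent per day", persistent_day),
    ("Transient per day", transient_day),
]:
    print(f"{label}: ${value:,.2f}")
```

The transient case shows why terminating an idle cluster matters: the saving scales with the hours the cluster is not running, independent of the purchase option.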

Agenda
Why EMR?
- Automation
- Decouple
- Elastic
- Integration
- Current
- Cost-efficient
Well Architected EMR: Design for Production
DEMO
Tool: Amazon EMR

Demo
EMR cluster with 100 nodes
- All data stored on Amazon S3
- Data accessed through EMRFS
- Spark on EMR for processing
- Ganglia for monitoring the workload
Stack: Spark-SQL / Spark 1.6.1 / YARN / Hadoop / EMRFS / Amazon S3

Agenda
Why EMR?
- Automation
- Decouple
- Elastic
- Integration
- Current
- Cost-efficient
Well Architected EMR: Design for Production
DEMO
- SparkSQL
- EMRFS (S3://)
- Ganglia
- EMR CLI
- EMR Console
Tool: Amazon EMR

Well Architected Amazon EMR

Well Architected EMR: Design for Production
Security
Reliability
Performance
Cost Efficiency

Well Architected EMR: Performance Efficiency
Choice of instance type and cluster size
Choice of storage
Framework performance

Performance: Choice of instance type - Master
[Decision-tree diagram with YES/NO branches]
Questions: Less than 50 nodes? Heavy network I/O?
Outcomes:
- M3.xlarge
- C3 family or R3 with enhanced networking
- M3.2xlarge or larger

Performance: Choice of instance type – Workers
Compute (Machine Learning): C1 family, C3 family, cc1.4xlarge, cc2.8xlarge
Memory (Interactive Analysis): M2 family, R3 family, cr1.8xlarge
Storage (Large HDFS): D2 family, I2 family
General (Batch Process): M3 family, M1 family

How Many Nodes Do I Need?

Performance: Cluster Sizing
Guidelines:
- Size based on HDFS storage first if needed
- Add enough (task) nodes to handle processing
- Do not add more than 5 task nodes per core node
- Prefer smaller clusters of larger machines
It’s a space-time trade-off
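The first and third guidelines above can be turned into a quick sizing calculation. This is a sketch under stated assumptions: a replication factor of 3 and 1.5 TB of usable HDFS capacity per core node are illustrative values, not figures from the talk, and both function names are hypothetical.

```python
import math


def core_nodes_for_hdfs(dataset_tb, usable_tb_per_node, replication=3):
    """Core nodes needed to hold dataset_tb in HDFS at the given replication factor."""
    raw_tb = dataset_tb * replication
    return max(1, math.ceil(raw_tb / usable_tb_per_node))


def max_task_nodes(core_nodes, ratio=5):
    """Upper bound on task nodes, following the 5-per-core-node guideline."""
    return core_nodes * ratio


# Illustrative inputs: 10 TB of data, 1.5 TB usable per core node.
core = core_nodes_for_hdfs(dataset_tb=10, usable_tb_per_node=1.5)
print(core)                  # prints 20
print(max_task_nodes(core))  # prints 100
```

Sizing storage first fixes the core group; the task group can then grow or shrink with processing demand up to the ratio cap.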

Performance: Choice of storage
EMRFS (S3)
- Ability to decouple
- Reliable and durable
- Cost efficient
- Works well for jobs that read a dataset once per run
HDFS
- Needs a persistent cluster
- Reliability is configurable, but needs multiple nodes to achieve the replication factor
- Great for jobs with iterative reads on the same dataset, like machine learning
Combine with s3-dist-cp: move data from S3 to HDFS once for iterative workloads

Storage Performance: S3 vs HDFS at Netflix

S3 Performance: Range GET vs Data Locality?
GET Range 128-192MB
GET Range 0-64MB
GET Range 64-128MB
GET Range (n-64)-nMB
EMR worker nodes
S3 object (use larger files)
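The diagram shows workers each fetching a 64 MB slice of one large S3 object. A minimal sketch of how such slices map to HTTP `Range` header values follows; the function name is hypothetical, the 64 MB chunk size comes from the slide, and the end offset is inclusive as in standard HTTP range requests.

```python
MB = 1024 * 1024


def range_headers(object_size, chunk_size=64 * MB):
    """Split an object of object_size bytes into inclusive HTTP Range header values."""
    headers = []
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1  # Range end is inclusive
        headers.append(f"bytes={start}-{end}")
        start = end + 1
    return headers


# A 200 MB object yields three full 64 MB ranges and one 8 MB remainder.
headers = range_headers(200 * MB)
for header in headers:
    print(header)
```

Larger objects produce more ranges, so more workers can read the same object in parallel, which is the reason for the "use larger files" note on the slide.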

S3: Real world heavy EMRFS users
- 1.2 TB/day logs
- 30 TB/day data
- 250 Hadoop jobs
- 75 billion transactions/day
- 5 petabytes of data
- 25 PB data warehouse on Amazon S3
- > 1 PB read each day

S3: EMRFS at NASDAQ
Access needs drop off dramatically over time – But, never throw anything away!
Yesterday >> last month >> last quarter >> last year..

Performance: Framework Performance
Count these words
Count = 1 These = 1 Words = 1
Embarrassingly parallel?
Count = 1
These = 1
Words = 1
Can it be optimized with a DAG?
A
B
C
D
E
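The word-count example on this slide is the classic embarrassingly parallel job: each line can be counted independently and the partial counts merged afterwards. A minimal single-process sketch of that map and reduce split, using only the standard library (function names are illustrative):

```python
from collections import Counter
from functools import reduce


def map_words(line):
    """Map step: count the words of one line independently of all others."""
    return Counter(line.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


lines = ["Count these words", "count these"]
partials = [map_words(line) for line in lines]  # could run on separate workers
total = reduce(reduce_counts, partials, Counter())

print(partials[0])  # Counter({'count': 1, 'these': 1, 'words': 1})
print(total["count"], total["these"], total["words"])  # prints 2 2 1
```

Because the map step has no dependencies between lines, it parallelises trivially; jobs with several dependent stages are where a DAG-based engine has room to optimise.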

Reliability
Failure Management
- Store your metadata outside the cluster
- Multi-AZ RDS cluster will give you HA
Disaster Recovery
- Keep data and applications on S3
- Maintain source of truth for data on S3 (an immutable data set)
Change Management
Automate with:
- Bootstrap actions
- Config options
- CloudFormation

Security
Data Protection: Encryption
- Server side
- Client side
- HDFS Transparent
- RPC with SSL
- File system with LUKS
Privilege Management
- IAM roles
- Secure integration with AWS services
- Hue, HiveServer2 or 3rd party tools support for role-based access
Infrastructure Protection
- VPC
- Private Subnets
- S3 endpoints
- NAT
- Security Groups
- Audit with logs on S3

Security: End to End Encryption
Amazon S3 Bucket (encrypted objects)
AWS KMS
AWS S3 SDK: AmazonS3EncryptionClient()
EMRFS with client-side encryption
HDFS transparent encryption with Hadoop KMS
spark.ssl.enabled
hadoop.rpc.protection
hadoop.ssl.enabled
mapreduce.shuffle.ssl.enabled
Output writes via EMRFS with client-side encryption enabled
Amazon S3 Bucket
LUKS with bootstrap action for local file systems
Server-side encryption for S3 via KMS or any other key management service
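The four property names on this slide could be supplied to a cluster through the same `--configurations` JSON format used in the launch command earlier. The sketch below only builds that JSON. The grouping into the `core-site`, `mapred-site` and `spark-defaults` classifications and the value `privacy` are assumptions based on standard Hadoop and Spark settings, not something shown in the talk, and a working setup would also need keystores and certificates that are out of scope here.

```python
import json

# Assumed mapping of the slide's properties to EMR configuration classifications.
configurations = [
    {
        "Classification": "core-site",
        "Properties": {
            "hadoop.rpc.protection": "privacy",  # assumed value: encrypt RPC traffic
            "hadoop.ssl.enabled": "true",
        },
    },
    {
        "Classification": "mapred-site",
        "Properties": {"mapreduce.shuffle.ssl.enabled": "true"},
    },
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.ssl.enabled": "true"},
    },
]

configurations_json = json.dumps(configurations, indent=2)
print(configurations_json)
```

Keeping these settings in a generated file makes it easier to apply the same in-transit encryption options to every cluster that is launched.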

Cost Efficiency
Matching Supply and Demand
• Is the cluster big enough?
• Can we make it transient?
• Monitor the usage with Ganglia and Amazon CloudWatch alarms
Using cost-effective resources
• S3 instead of HDFS for larger datasets?
• Taking advantage of Spot and Reserved instances?
Optimise over time
• Monitor and watch out for new instance types and features that may reduce cost.

Agenda
Why EMR?
Well Architected EMR: Design for Production
DEMO
- Automation
- Decouple
- Elastic
- Integration
- Current
- Cost-efficient
- SparkSQL
- EMRFS (S3://)
- Ganglia
- EMR CLI
- EMR Console
- Presto
- Performance tuning
- Reliability
- Security facts
- Cost efficiency
Tool: Amazon EMR

AWS Sydney Summit 2016
EMR @ Ambiata
Ben Lever
CTO, Ambiata

1 Petabyte
25M customer decisions daily

1. Ephemeral Clusters
2. S3 as the Data Lake
3. IAM Roles for Clusters

[Diagram, built up over several slides: jobs A to E run on one shared 100-node EMR cluster, compared with separate ephemeral clusters of 50, 100 and 250 nodes sized per job.]
Shared 100-node cluster: 1,700 compute-hrs in 17 hrs
Ephemeral clusters: 1,500 compute-hrs in 12 hrs
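The compute-hour totals on this slide are nodes multiplied by hours, summed over cluster runs. The sketch below reproduces them. The shared-cluster figure follows directly from 100 nodes for 17 hours; the per-job cluster sizes and durations for the ephemeral case are one reading of the diagram labels (50 nodes for 8 hrs, 100 nodes for 2 hrs, 250 nodes for 2 hrs), chosen because it reproduces the stated 1,500 total, and should be treated as an assumption.

```python
def compute_hours(runs):
    """Total node-hours for a list of (label, nodes, hours) cluster runs."""
    return sum(nodes * hours for _label, nodes, hours in runs)


# One shared cluster, busy for the whole 17-hour schedule.
shared = [("all jobs", 100, 17)]

# Assumed per-job ephemeral clusters (a reading of the diagram, not confirmed).
ephemeral = [
    ("A", 100, 2),
    ("B", 50, 8),
    ("C", 100, 2),
    ("D", 100, 2),
    ("E", 250, 2),
]

print(compute_hours(shared))     # prints 1700
print(compute_hours(ephemeral))  # prints 1500
```

Sizing each cluster to its job lets a long job run on fewer nodes and a heavy job on more, which is how total compute-hours can fall while the schedule also finishes sooner.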

1 Job Per Cluster
Increased Predictability
Ephemeral Clusters

[Diagram, built up over three slides: units of local HDFS storage with four CPUs each, growing from two units to four, then shown alongside an S3 Data Lake.]

Hadoop: EMR
Spark: EMR
R/C/Python: EC2
BI: Redshift

[Diagram, built up over several slides: four clusters, each with its own IAM role. Each role grants read (RD) and write (WT) access to specific datasets from the set A to H.]
Live: Where production data pipelines are executed
Lab: Where data scientists can experiment
Dev: Where new data pipelines are tested

AWS Training & Certification
Intro Videos & Labs: Free videos and labs to help you learn to work with 30+ AWS services in minutes!
Training Classes: In-person and online courses to build technical skills, taught by accredited AWS instructors
Online Labs: Practice working with AWS services in a live environment and learn how related services work together
AWS Certification: Validate technical skills and expertise; identify qualified IT talent or show you are AWS cloud ready
Learn more: aws.amazon.com/training

Your Training Next Steps:
ü Visit the AWS Training & Certification pod to discuss your
training plan & AWS Summit training offer
ü Register & attend AWS instructor led training
ü Get Certified
AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag
Learn more: aws.amazon.com/training
