© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amo Abeyaratne
Big Data and Analytics Consultant, Amazon Web Services
Ben Lever
CTO, Ambiata
Tune your Big Data Platform to Work at Scale
Taking Hadoop to the Next Level with Amazon EMR
Technical 301
Challenge: Data is Everywhere
Phones
Sensors
Websites
Size: Growing in Petabytes
“If every byte were a word, and you took a second to read each word, it would take you about 32 million years to read a whole petabyte”
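The arithmetic behind the quote can be checked in a few lines. This is a rough sketch that assumes a decimal petabyte (10^15 bytes) and a 365.25-day year; neither assumption is stated on the slide.

```python
# Rough check of the "32 million years per petabyte" figure.
# Assumptions (not from the slide): 1 PB = 10**15 bytes, one byte is
# read per second, and a year has 365.25 days.
BYTES_PER_PETABYTE = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

years_to_read = BYTES_PER_PETABYTE / SECONDS_PER_YEAR
print(f"{years_to_read / 1e6:.1f} million years")
```

Under these assumptions the result lands just under 32 million years, which agrees with the rounded figure in the quote.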
How Do We Face It?
Strategy: Divide and Conquer
Hadoop
Amazon EMR: A Managed Hadoop Framework in the Cloud
Hadoop on EC2: Managing on your own can be a LOT of work
Hadoop has to scale?
Let’s Launch a Cluster
Tool: Amazon EMR
aws emr create-cluster --release-label emr-4.0.0 --instance-groups \
  InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
  InstanceGroupType=CORE,InstanceCount=99,InstanceType=m3.xlarge --auto-terminate
aws emr create-cluster \
  --applications Name=Hadoop Name=Hive Name=Hue Name=Spark Name=Ganglia Name=Zeppelin-Sandbox \
  --tags 'name=summitdemo' \
  --ec2-attributes '{"KeyName":"amo_ubuntu","InstanceProfile":"EMR_EC2_DefaultRole","AvailabilityZone":"us-east-1b","EmrManagedSlaveSecurityGroup":"sg-07fa956c","EmrManagedMasterSecurityGroup":"sg-0dfa9566"}' \
  --service-role EMR_DefaultRole \
  --release-label emr-4.4.0 \
  --name 'Summit_Demo_EMR' \
  --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master instance group - 1"},{"InstanceCount":9,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core instance group - 2"},{"InstanceCount":50,"EbsConfiguration":{"EbsBlockDeviceConfigs":[],"EbsOptimized":false},"InstanceGroupType":"TASK","InstanceType":"m3.xlarge","Name":"Task instance group - 7"}]' \
  --configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.consistent.retryPeriodSeconds":"10","fs.s3.consistent":"true","fs.s3.consistent.retryCount":"5","fs.s3.consistent.metadata.tableName":"EmrFSMetadata"},"Configurations":[]},{"Classification":"hive-site","Properties":{"javax.jdo.option.ConnectionUserName":"admin","javax.jdo.option.ConnectionPassword":"Passw0rd!","javax.jdo.option.ConnectionURL":"jdbc:mysql://testmysql51.cotvmrgi63jf.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true"},"Configurations":[]}]' \
  --region us-east-1
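Long JSON arguments like the `--instance-groups` value are easy to mistype by hand. One option, sketched here, is to generate the string from a small script. The helper name `instance_group` is hypothetical; the group values are copied from the demo command, and nothing here calls AWS.

```python
import json

def instance_group(group_type, instance_type, count, name):
    """Return one entry for the --instance-groups JSON array (hypothetical helper)."""
    return {
        "InstanceGroupType": group_type,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Name": name,
    }

# Values copied from the demo command; only the JSON string is built here.
groups = [
    instance_group("MASTER", "m3.xlarge", 1, "Master instance group - 1"),
    instance_group("CORE", "m3.xlarge", 9, "Core instance group - 2"),
    instance_group("TASK", "m3.xlarge", 50, "Task instance group - 7"),
]

instance_groups_arg = json.dumps(groups)
total_nodes = sum(g["InstanceCount"] for g in groups)
print(instance_groups_arg)
print("total nodes:", total_nodes)
```

The printed string can be pasted inside single quotes after `--instance-groups`. Summing the counts is a quick sanity check on cluster size before launching.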
Agenda
- Why EMR?
- Well Architected EMR Design for Production
- DEMO
Challenge: Data is Everywhere
Size: Growing in PBs
Strategy: Divide & Conquer
Tool: Amazon EMR
Why EMR?
- Automation
- Decouple
- Elastic
- Integration
- Low-cost
- Current
Why EMR? Automation
- EC2 Provisioning
- Cluster Setup
- Hadoop Configuration
- Installing Applications
- Job submission
- Monitoring and Failure Handling
Why EMR? Decouple Storage and Compute
- Amazon Kinesis (Streams, Firehose)
- Hadoop Jobs
- Persistent Cluster: Interactive Queries (Spark-SQL | Presto | Impala)
- Transient Cluster: Batch Jobs (X hours nightly), Add/Remove Nodes
- ETL Jobs
- Hive External Metastore, e.g. Amazon RDS
- Workload specific clusters (Different sizes, Different Versions)
- Amazon S3 for Storage
create external table t_name(..)...
location 's3://bucketname/path-to-file/'
Why EMR? Elastic
- Intelligent resize: Wait for work to finish before stopping instances
- Core nodes for scaling HDFS
- Task nodes for scaling processing power
- Use instance groups to manage different instance types in the same cluster
Why EMR? Current
Application    Open source release    EMR release
Spark 1.5      September 9, 2015      September 2015
Spark 1.5.2    November 9, 2015       November 2015
Spark 1.6      January 4, 2016        January 2016
Spark 1.6.1    March 9, 2016          April 4, 2016
Why EMR? Easy Integration with AWS services
- AWS Data Pipeline for data flow orchestration
- Amazon KMS for encryption key management
- Kinesis Connector for streaming data access
- Spark Streaming with Kinesis Client Library (KCL)
- S3 via EMRFS
- Wrap Amazon EMR cluster with IAM roles for access policies
- Map Amazon DynamoDB tables with DynamoDB-connector for Hive
- Connect to Amazon Redshift through redshift-spark connector
- EBS support for scaling HDFS
Why EMR? Low-cost
- Spot instances: Bid for unused EC2 capacity at up to 90% below the regular price
- Transient clusters: Terminate the cluster when not in use
- Reserved instances: For persistent clusters, make use of EC2 reserved instances to save up to 50%
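To see how these three levers combine, a small cost model helps. The sketch below uses a made-up on-demand price and cluster size (both hypothetical) and applies the best-case discounts quoted on the slide, so the output is an upper bound on savings rather than a forecast.

```python
# Hypothetical inputs: not real AWS prices or a recommended cluster size.
ON_DEMAND_PRICE = 0.30   # dollars per instance-hour (assumed)
NODES = 100

def monthly_cost(hourly_price, nodes, hours_per_day, days=30):
    """Cost of running `nodes` instances for `hours_per_day` hours every day."""
    return hourly_price * nodes * hours_per_day * days

# Persistent cluster, running around the clock.
persistent_on_demand = monthly_cost(ON_DEMAND_PRICE, NODES, 24)
# Best case from the slide: reserved instances save up to 50%.
persistent_reserved = monthly_cost(ON_DEMAND_PRICE * 0.5, NODES, 24)
# Transient cluster that only exists for a 4-hour nightly batch (assumed).
transient_on_demand = monthly_cost(ON_DEMAND_PRICE, NODES, 4)
# Best case from the slide: Spot at up to 90% below on-demand.
transient_spot = monthly_cost(ON_DEMAND_PRICE * 0.1, NODES, 4)

for label, cost in [
    ("persistent, on-demand", persistent_on_demand),
    ("persistent, reserved", persistent_reserved),
    ("transient, on-demand", transient_on_demand),
    ("transient, spot", transient_spot),
]:
    print(f"{label:24s} ${cost:,.0f} per month")
```

In this toy model, shutting the cluster down when idle saves more than reserving it, and Spot pricing compounds that saving.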
Agenda
- Why EMR?
  - Automation
  - Decouple
  - Elastic
  - Integration
  - Current
  - Cost-efficient
- Well Architected EMR Design for Production
- DEMO
Demo
EMR Cluster with 100 nodes
- All data stored on Amazon S3
- Data accessed through EMRFS
- Spark on EMR for processing
- Ganglia for monitoring the workload
Demo
Stack: Spark-SQL, Spark 1.6.1, YARN, Hadoop, EMRFS, Amazon S3
Agenda
- Why EMR?
  - Automation
  - Decouple
  - Elastic
  - Integration
  - Current
  - Cost-efficient
- Well Architected EMR Design for Production
- DEMO
  - SparkSQL
  - EMRFS (S3://)
  - Ganglia
  - EMR CLI
  - EMR Console
Well Architected Amazon EMR
Well Architected EMR: Design for Production
- Security
- Reliability
- Performance
- Cost Efficiency
Well Architected EMR: Performance Efficiency
- Choice of Instance Type and cluster size
- Choice of Storage
- Framework Performance
Performance: Choice of instance type - Master
- Less than 50 nodes? YES: M3.xlarge. NO: M3.2xlarge or larger.
- Heavy network I/O? YES: C3 family or R3 with Enhanced networking.
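The decision points on this slide can be written as a small function. This is only a sketch: the function name is hypothetical, and because the flattened slide text does not make the branch order explicit, it assumes that heavy network I/O takes priority over cluster size.

```python
def choose_master_instance(node_count, heavy_network_io=False):
    """Suggest a master instance type from the slide's two questions.

    Assumption: the heavy-network-I/O branch is checked first; the
    original flowchart may order the questions differently.
    """
    if heavy_network_io:
        return "C3 family or R3 with Enhanced networking"
    if node_count < 50:
        return "m3.xlarge"
    return "m3.2xlarge or larger"

print(choose_master_instance(20))
print(choose_master_instance(120))
print(choose_master_instance(120, heavy_network_io=True))
```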
Performance: Choice of instance type - Workers
- Compute (Machine Learning): C1 Family, C3 Family, CC1.4xlarge, CC2.8xlarge
- Memory (Interactive Analysis): M2 Family, R3 Family, CR1.8xlarge
- Storage (Large HDFS): D2 Family, I2 Family
- General (Batch Process): M3 Family, M1 Family
How Many Nodes Do I Need?
Performance: Cluster Sizing
Guidelines:
- Size based on HDFS storage first if needed
- Add enough (task) nodes to handle processing
- Do not add more than 5 task nodes per core node
- Prefer smaller clusters of larger machines
It’s a space-time trade-off
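The guidelines translate into a simple sizing calculation. The sketch below is illustrative only: the function name, the replication factor of 3, and the per-node capacity are assumptions, and real sizing would also account for headroom and instance storage limits.

```python
import math

def size_cluster(hdfs_data_tb, usable_tb_per_core_node, replication=3,
                 task_nodes_wanted=0):
    """Estimate core and task node counts from the guidelines.

    All inputs are hypothetical planning numbers. HDFS capacity is sized
    first; the requested task nodes are then capped at 5 per core node.
    """
    raw_tb = hdfs_data_tb * replication
    core_nodes = max(1, math.ceil(raw_tb / usable_tb_per_core_node))
    task_nodes = min(task_nodes_wanted, 5 * core_nodes)
    return core_nodes, task_nodes

# 20 TB of data, 2 TB usable per core node, 200 task nodes requested.
core, task = size_cluster(20, 2, replication=3, task_nodes_wanted=200)
print(core, task)
```

With these inputs the storage requirement fixes 30 core nodes, and the 5-to-1 guideline trims the 200 requested task nodes to 150.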
Performance: Choice of storage
EMRFS (S3)
- Ability to decouple
- Reliable and durable
- Cost efficient
- Works well for jobs that read a dataset once per run
HDFS
- Need a persistent cluster
- Reliability is configurable, but you need multiple nodes to achieve the replication factor
- Great for jobs with iterative reads on the same dataset, like machine learning
Combine with s3-dist-cp and move from S3 once to HDFS for iterative workloads
Storage Performance: S3 vs HDFS at Netflix
S3 Performance: Range GET vs Data Locality?
S3 object (use larger files)
- GET Range 0-64MB
- GET Range 64-128MB
- GET Range 128-192MB
- GET Range (n-64)-nMB
EMR worker nodes
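The range-GET idea is that each worker asks for its own slice of one large object, so locality matters less than parallelism. A minimal sketch of how such slices could be computed follows; the function names are hypothetical, and 64 MB is taken as 64 * 2^20 bytes, which is an assumption. It only computes offsets and does not call S3.

```python
MB = 1024 * 1024

def split_into_ranges(object_size, chunk_size=64 * MB):
    """Return inclusive (start, end) byte offsets that cover the object."""
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

def range_header(start, end):
    """Format an HTTP Range header value for one slice."""
    return f"bytes={start}-{end}"

# A 200 MB object is read as four slices; the last one is shorter.
for start, end in split_into_ranges(200 * MB):
    print(range_header(start, end))
```

Larger files give each worker a full-sized slice, which is one reading of the slide's "use larger files" note.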
1.2 TB/day logs
30 TB/day data
250 Hadoop jobs
75 billion transactions/day
5 petabytes of data
S3: Real world heavy EMRFS users
25 PB data warehouse on Amazon S3
> 1 PB read each day
S3: EMRFS at FINRA
S3: EMRFS at NASDAQ
Access needs drop off dramatically over time. But never throw anything away!
Yesterday >> last month >> last quarter >> last year...
Performance: Framework Performance
Count these words
Count = 1, These = 1, Words = 1
Embarrassingly parallel?
Can it be optimized with a DAG?
[Figure: DAG with nodes A, B, C, D, E]
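The word-count example is embarrassingly parallel because each chunk of text can be counted on its own and the partial results merged afterwards. A minimal single-process sketch of that map and reduce split follows; the chunking and function names are illustrative, not taken from the talk.

```python
from collections import Counter
from functools import reduce

def map_count(chunk):
    """Map step: count the words in one chunk, independent of the others."""
    return Counter(chunk.lower().split())

def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

chunks = ["Count these words", "count these", "words words"]
partials = [map_count(c) for c in chunks]   # each could run on a separate worker
totals = reduce(reduce_counts, partials, Counter())
print(totals)
```

Because `map_count` never looks outside its own chunk, the map step needs no coordination; only the merge requires data to move between workers.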
Reliability
Failure Management
- Store your metadata outside the cluster
- Multi-AZ RDS cluster will give you HA
Disaster Recovery
- Keep data and applications on S3
- Maintain source of truth for data on S3 (an immutable data set)
Change Management
- Automate with: Bootstrap actions, Config options, CloudFormation
Security
Data Protection: Encryption
- Server side
- Client side
- HDFS Transparent
- RPC with SSL
- File system with LUKS
Privilege Management
- IAM roles
- Secure Integration with AWS services
- Hue, HiveServer2 or 3rd party tools support for role based access
Infrastructure Protection
- VPC
- Private Subnets
- S3 endpoints
- NAT
- Security Groups
- Audit with logs on S3
Security: End to End Encryption
- Amazon S3 Bucket
- AWS KMS
- AWS S3 SDK AmazonS3EncryptionClient()
- Encrypted Object
- EMRFS with Client-side Encryption
- HDFS transparent encryption with Hadoop KMS
- spark.ssl.enabled, hadoop.rpc.protection, hadoop.ssl.enabled, mapreduce.shuffle.ssl.enabled
- Output writes via EMRFS with Client-side Encryption enabled
- Amazon S3 Bucket
- LUKS with bootstrap action for local file systems
- Server Side Encryption for S3 via KMS or any other Key Management service
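The four property names on this slide belong to different configuration files, so on EMR they would be supplied under separate classifications. The sketch below builds a `--configurations` style JSON string for them. The grouping into `core-site`, `mapred-site` and `spark-defaults` and the values shown are assumptions for illustration; enabling SSL also needs keystores and certificates, which are not covered here.

```python
import json

# Illustrative values only; a working setup also needs certificates and
# keystore settings that are not shown on the slide.
encryption_in_transit = [
    {"Classification": "core-site",
     "Properties": {"hadoop.rpc.protection": "privacy",
                    "hadoop.ssl.enabled": "true"}},
    {"Classification": "mapred-site",
     "Properties": {"mapreduce.shuffle.ssl.enabled": "true"}},
    {"Classification": "spark-defaults",
     "Properties": {"spark.ssl.enabled": "true"}},
]

configurations_arg = json.dumps(encryption_in_transit)
print(configurations_arg)
```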
Cost Efficiency
Matching Supply and Demand
- Is the cluster big enough?
- Can we make it transient?
- Monitor the usage with Ganglia and Amazon CloudWatch alarms
Using cost-effective resources
- S3 instead of HDFS for larger datasets?
- Taking advantage of Spot and Reserved instances?
Optimise over time
- Monitor and watch out for new instance types and features that may reduce cost.
Agenda
- Why EMR?
  - Automation
  - Decouple
  - Elastic
  - Integration
  - Current
  - Cost-efficient
- Well Architected EMR Design for Production
  - Performance tuning
  - Reliability
  - Security facts
  - Cost efficiency
- DEMO
  - SparkSQL
  - EMRFS (S3://)
  - Ganglia
  - EMR CLI
  - EMR Console
  - Presto
AWS Sydney Summit 2016
EMR @ Ambiata
Ben Lever
CTO, Ambiata
Big Data Hypothesis
Data, Personalisation, Revenue
Customer Personalisation
Data Science
Operational
Scalable
Reliable
1 Petabyte
25M customer decisions daily
S3, EC2, RDS, ELB, SES, SNS, EMR
EMR Patterns
1. Ephemeral Clusters
2. S3 as the Data Lake
3. IAM Roles for Clusters
[Figure: a permanent 100-node cluster runs jobs A (2 hrs), B (4 hrs), C (2 hrs), D (2 hrs) and E (5 hrs) one after another, with a further 2 hrs shown on the timeline: 1,700 compute-hrs in 17 hrs. With ephemeral EMR clusters, A runs on 100 nodes for 2 hrs, B on 50 nodes for 8 hrs, C and D on 100 nodes for 2 hrs each, and E on 250 nodes for 2 hrs: 1,500 compute-hrs in 12 hrs.]
1 Job Per Cluster
Increased Predictability
Ephemeral Clusters
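The two compute-hour totals on these slides can be reproduced by multiplying nodes by hours for each cluster. The grouping of jobs onto clusters below is read off the flattened diagram text, so treat it as an interpretation rather than Ambiata's exact schedule.

```python
# Each entry is (nodes, hours) for one cluster.
permanent = [(100, 17)]   # one shared 100-node cluster kept up for 17 hrs
ephemeral = [
    (100, 2),   # job A
    (50, 8),    # job B
    (100, 4),   # jobs C and D, 2 hrs each on the same cluster
    (250, 2),   # job E
]

def compute_hours(clusters):
    """Total node-hours consumed across all clusters."""
    return sum(nodes * hours for nodes, hours in clusters)

permanent_hours = compute_hours(permanent)
ephemeral_hours = compute_hours(ephemeral)
print(permanent_hours, ephemeral_hours)
```

Sizing each cluster to its job gives 1,500 compute-hours against 1,700 for the permanent cluster, matching the slide.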
1. Ephemeral Clusters
2. S3 as the Data Lake
3. IAM Roles for Clusters
[Figure: many clusters, each pairing its own HDFS with its CPUs, contrasted with a single S3 Data Lake shared by several consumers: Hadoop on EMR, Spark on EMR, R/C/Python on EC2, and BI on Redshift. EMR clusters are shown at spot prices of $0.10 / hr and $0.20 / hr.]
1. Ephemeral Clusters
2. S3 as the Data Lake
3. IAM Roles for Clusters
[Figure: four clusters (HDFS plus CPUs), each with its own IAM role and RD/WT markers against data sets A B C D E F G H; the example selections shown are A, C, D and D, E, G.]
IAM Roles @ Ambiata
- Live: Where production data pipelines are executed
- Lab: Where data scientists can experiment
- Dev: Where new data pipelines are tested
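One way to picture per-environment roles is a policy document per environment that scopes S3 reads and writes to prefixes in the data lake. Everything specific in this sketch is hypothetical: the bucket name, the prefixes, the helper `s3_policy`, and the choice of which environment may read which data are invented for illustration and are not described in the talk.

```python
import json

def s3_policy(bucket, read_prefixes, write_prefixes):
    """Build an IAM policy document granting prefix-scoped S3 access."""
    def arns(prefixes):
        return [f"arn:aws:s3:::{bucket}/{p}/*" for p in prefixes]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:GetObject"],
             "Resource": arns(read_prefixes)},
            {"Effect": "Allow", "Action": ["s3:PutObject"],
             "Resource": arns(write_prefixes)},
        ],
    }

# Hypothetical bucket, prefixes and access rules for the three environments.
policies = {
    "live": s3_policy("example-data-lake", ["live"], ["live"]),
    "lab": s3_policy("example-data-lake", ["live", "lab"], ["lab"]),
    "dev": s3_policy("example-data-lake", ["dev"], ["dev"]),
}
print(json.dumps(policies["lab"], indent=2))
```

Attaching a role like this to each cluster means the data a job can touch is decided when the cluster is launched, not by credentials stored on the nodes.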
1. Ephemeral Clusters
2. S3 as the Data Lake
3. IAM Roles for Clusters
EMR
AWS Training & Certification
Intro Videos & Labs: Free videos and labs to help you learn to work with 30+ AWS services in minutes!
Training Classes: In-person and online courses to build technical skills, taught by accredited AWS instructors
Online Labs: Practice working with AWS services in a live environment. Learn how related services work together
AWS Certification: Validate technical skills and expertise. Identify qualified IT talent or show you are AWS cloud ready
Learn more: aws.amazon.com/training
Your Training Next Steps:
- Visit the AWS Training & Certification pod to discuss your training plan & AWS Summit training offer
- Register & attend AWS instructor led training
- Get Certified
AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag
Thank  you!

 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level with Amazon EMR - Technical 301

  • 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amo Abeyaratne, Big Data and Analytics Consultant, Amazon Web Services. Ben Lever, CTO, Ambiata. Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level with Amazon EMR. Technical 301
  • 2. Challenge: Data is Everywhere. Phones, Sensors, Websites
  • 3. Size: Growing in Petabytes. “If every byte was a word, and you took a second to read a word, it will take you 32 million years to read a whole Petabyte”
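The 32-million-year figure on slide 3 can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a decimal petabyte (10^15 bytes) and a 365.25-day year; neither assumption is stated on the slide, and the function name is illustrative.

```python
# Assumption: decimal petabyte and an average year of 365.25 days.
PETABYTE_BYTES = 10**15
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_read(num_bytes, bytes_per_second=1):
    """Years needed to read num_bytes at a fixed rate (one byte = one word)."""
    return num_bytes / bytes_per_second / SECONDS_PER_YEAR


years = years_to_read(PETABYTE_BYTES)
print(f"{years / 1e6:.1f} million years")
```

Under these assumptions the result rounds to the 32 million years quoted on the slide.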
  • 4. How Do We Face It?
  • 5. Strategy: Divide and Conquer. Hadoop. Amazon EMR: A Managed Hadoop Framework in the Cloud. Hadoop on EC2: Managing on your own can be a LOT of work. Hadoop has to scale?
  • 7. Tool: Amazon EMR
    aws emr create-cluster --release-label emr-4.0.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=99,InstanceType=m3.xlarge --auto-terminate
    aws emr create-cluster \
      --applications Name=Hadoop Name=Hive Name=Hue Name=Spark Name=Ganglia Name=Zeppelin-Sandbox \
      --tags 'name=summitdemo' \
      --ec2-attributes '{"KeyName":"amo_ubuntu","InstanceProfile":"EMR_EC2_DefaultRole","AvailabilityZone":"us-east-1b","EmrManagedSlaveSecurityGroup":"sg-07fa956c","EmrManagedMasterSecurityGroup":"sg-0dfa9566"}' \
      --service-role EMR_DefaultRole \
      --release-label emr-4.4.0 \
      --name 'Summit_Demo_EMR' \
      --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master instance group - 1"},{"InstanceCount":9,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core instance group - 2"},{"InstanceCount":50,"EbsConfiguration":{"EbsBlockDeviceConfigs":[],"EbsOptimized":false},"InstanceGroupType":"TASK","InstanceType":"m3.xlarge","Name":"Task instance group - 7"}]' \
      --configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.consistent.retryPeriodSeconds":"10","fs.s3.consistent":"true","fs.s3.consistent.retryCount":"5","fs.s3.consistent.metadata.tableName":"EmrFSMetadata"},"Configurations":[]},{"Classification":"hive-site","Properties":{"javax.jdo.option.ConnectionUserName":"admin","javax.jdo.option.ConnectionPassword":"Passw0rd!","javax.jdo.option.ConnectionURL":"jdbc:mysql://testmysql51.cotvmrgi63jf.us-east-1.rds.amazonaws.com:3306/hive?createDatabaseIfNotExist=true"},"Configurations":[]}]' \
      --region us-east-1
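Long inline JSON such as the `--instance-groups` argument on slide 7 is easy to mistype. The sketch below generates that argument from a small helper; the helper name `instance_group` is hypothetical, and the group sizes (1 master, 9 core, 50 task, all m3.xlarge) are copied from the second command on the slide.

```python
import json


def instance_group(group_type, count, instance_type="m3.xlarge"):
    """Build one entry of the JSON list passed to --instance-groups (hypothetical helper)."""
    return {
        "InstanceGroupType": group_type,
        "InstanceCount": count,
        "InstanceType": instance_type,
    }


# Sizes taken from the second create-cluster command on the slide.
groups = [
    instance_group("MASTER", 1),
    instance_group("CORE", 9),
    instance_group("TASK", 50),
]

# Compact JSON string, suitable for quoting on the command line.
instance_groups_arg = json.dumps(groups, separators=(",", ":"))
total_nodes = sum(g["InstanceCount"] for g in groups)

print(instance_groups_arg)
print("total nodes:", total_nodes)
```

Generating the string keeps node counts in one place, which helps when the same cluster shape is launched repeatedly as a transient cluster.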
  • 8. Agenda: Why EMR?; Well Architected EMR: Design for Production; DEMO. Challenge: Data is Everywhere. Size: Growing in PBs. Strategy: Divide & Conquer. Tool: Amazon EMR
  • 10. Why EMR? Automation, Decouple, Elastic, Integration, Low-cost, Current
  • 11. Why EMR? Automation: EC2 Provisioning, Cluster Setup, Hadoop Configuration, Installing Applications, Job Submission, Monitoring and Failure Handling
  • 12. Why EMR? Decouple Storage and Compute. Amazon Kinesis (Streams, Firehose). Hadoop Jobs. Persistent Cluster: Interactive Queries (Spark-SQL | Presto | Impala). Transient Cluster: Batch Jobs (X hours nightly), Add/Remove Nodes. ETL Jobs. Hive External Metastore, i.e. Amazon RDS. Workload-specific clusters (Different sizes, Different Versions). Amazon S3 for Storage: create external table t_name(..)... location s3://bucketname/path-to-file/
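The `create external table ... location s3://...` fragment on slide 12 is what lets several clusters share one dataset in S3. As a rough sketch, the helper below renders a complete Hive statement of that shape. The function name, column names and types, and the comma delimiter are all assumptions for illustration; only the table name and S3 path come from the slide.

```python
def external_table_ddl(table, columns, s3_location, delimiter=","):
    """Render a Hive CREATE EXTERNAL TABLE statement that points at data in S3."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}' "
        f"LOCATION '{s3_location}'"
    )


# Hypothetical columns; table name and location mirror the slide.
ddl = external_table_ddl(
    "t_name",
    [("event_id", "STRING"), ("amount", "DOUBLE")],
    "s3://bucketname/path-to-file/",
)
print(ddl)
```

Because the table is external, dropping it on a transient cluster removes only the metadata; the files under the S3 location stay in place.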
  • 13. Why EMR? Elastic. Intelligent resize: wait for work to finish before stopping instances. Core nodes for scaling HDFS. Task nodes for scaling processing power. Use instance groups to manage different instance types in the same cluster
  • 14. Why EMR? Current. Application / Open source release / EMR release: Spark 1.5 / September 9, 2015 / September 2015; Spark 1.5.2 / November 9, 2015 / November 2015; Spark 1.6 / January 4, 2016 / January 2016; Spark 1.6.1 / March 9, 2016 / April 4, 2016
  • 15. Why EMR? Easy Integration with AWS services. AWS Data Pipeline for data flow orchestration. Amazon KMS for encryption key management. Kinesis Connector for streaming data access. Spark Streaming with Kinesis Client Library (KCL). S3 via EMRFS. Wrap Amazon EMR cluster with IAM roles for access policies. Map Amazon DynamoDB tables with DynamoDB connector for Hive. Connect to Amazon Redshift through redshift-spark connector. EBS support for scaling HDFS
  • 16. Why EMR? Low-cost. Spot instances: bid for unused EC2 capacity at up to 90% below the regular price. Transient clusters: terminate the cluster when not in use. Reserved instances: for persistent clusters, use EC2 reserved instances to save up to 50%
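To see how the three levers on slide 16 combine, here is a small cost sketch. The hourly rate, node count, and the 4-hours-per-night schedule are hypothetical; the 90% and 50% discounts are the best-case "up to" figures from the slide, so real savings would usually be smaller.

```python
ON_DEMAND_RATE = 0.25  # hypothetical USD per instance-hour, not a real price
NODES = 10             # hypothetical cluster size
DAYS = 30


def cluster_cost(nodes, hours, rate, discount=0.0):
    """Cost of running `nodes` instances for `hours` at `rate`, less a fractional discount."""
    return nodes * hours * rate * (1.0 - discount)


persistent_on_demand = cluster_cost(NODES, 24 * DAYS, ON_DEMAND_RATE)
persistent_reserved = cluster_cost(NODES, 24 * DAYS, ON_DEMAND_RATE, discount=0.50)
transient_on_demand = cluster_cost(NODES, 4 * DAYS, ON_DEMAND_RATE)
transient_spot = cluster_cost(NODES, 4 * DAYS, ON_DEMAND_RATE, discount=0.90)

for label, cost in [
    ("persistent, on-demand", persistent_on_demand),
    ("persistent, reserved (best case)", persistent_reserved),
    ("transient 4h/night, on-demand", transient_on_demand),
    ("transient 4h/night, spot (best case)", transient_spot),
]:
    print(f"{label}: ${cost:,.2f}")
```

The point of the comparison is that running fewer hours (transient clusters) and paying less per hour (Spot or Reserved) multiply together.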
  • 17. Agenda: Why EMR? (Automation, Decouple, Elastic, Integration, Current, Cost-efficient); Well Architected EMR: Design for Production; DEMO. Challenge: Data is Everywhere. Size: Growing in PBs. Strategy: Divide & Conquer. Tool: Amazon EMR
  • 18. Demo
  • 19. Demo: EMR Cluster with 100 nodes. All data stored on Amazon S3. Data accessed through EMRFS. Spark on EMR for processing. Ganglia for monitoring the workload. Stack: Spark-SQL, Spark 1.6.1, YARN, Hadoop, EMRFS, Amazon S3
  • 20. Agenda: Why EMR? (Automation, Decouple, Elastic, Integration, Current, Cost-efficient); Well Architected EMR: Design for Production; DEMO (SparkSQL, EMRFS (S3://), Ganglia, EMR CLI, EMR Console). Challenge: Data is Everywhere. Size: Growing in PBs. Strategy: Divide & Conquer. Tool: Amazon EMR
  • 22. Well Architected EMR: Design for Production. Security, Reliability, Performance, Cost Efficiency
  • 23. Well Architected EMR: Performance Efficiency. Choice of instance type and cluster size. Choice of storage. Framework performance
  • 24. Performance: Choice of instance type - Master (flowchart). Less than 50 nodes? Heavy network I/O? M3.xlarge; YES; NO; C3 family or R3 with Enhanced networking; YES; M3.2xlarge or larger
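The flowchart on slide 24 lost its arrows in the transcript, so the branch order below is one plausible reading rather than the authors' confirmed logic: small clusters get an m3.xlarge master, larger clusters with heavy network I/O get a C3 or R3 instance with enhanced networking, and other large clusters get an m3.2xlarge or larger. The function name and parameters are illustrative.

```python
def choose_master_instance(node_count, heavy_network_io=False):
    """Suggest a master instance class (assumed reading of the slide's flowchart)."""
    if node_count < 50:
        # "Less than 50 nodes?" -> YES
        return "m3.xlarge"
    if heavy_network_io:
        # "Heavy network I/O?" -> YES
        return "C3 family or R3 with enhanced networking"
    # Assumed NO branch for large clusters without heavy network I/O.
    return "m3.2xlarge or larger"


print(choose_master_instance(20))
print(choose_master_instance(100, heavy_network_io=True))
print(choose_master_instance(100))
```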
  • 25. Performance: Choice of instance type - Workers. Compute (Machine Learning): C1 Family, C3 Family, CC1.4xlarge, CC2.8xlarge. Memory (Interactive Analysis): M2 Family, R3 Family, CR1.8xlarge. Storage (Large HDFS): D2 Family, I2 Family. General (Batch Process): M3 Family, M1 Family
  • 26. How Many Nodes Do I Need?
  • 27. Performance: Cluster Sizing. Guidelines: - Size based on HDFS storage first, if needed - Add enough (task) nodes to handle processing - Do not add more than 5 task nodes per core node - Prefer smaller clusters of larger machines. It's a space-time trade-off
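The sizing guidelines on slide 27 can be turned into a rough calculator. This is a sketch under stated assumptions: an HDFS replication factor of 3 (a common Hadoop default, not given on the slide), a caller-supplied usable capacity per core node, and hypothetical function and parameter names. The only rule taken directly from the slide is the cap of 5 task nodes per core node.

```python
import math


def size_cluster(hdfs_data_tb, usable_tb_per_core_node, desired_task_nodes,
                 replication_factor=3, max_task_per_core=5):
    """Return (core_nodes, task_nodes): size core nodes for HDFS first, then cap task nodes."""
    # Raw capacity needed once every block is replicated.
    raw_tb = hdfs_data_tb * replication_factor
    core_nodes = max(1, math.ceil(raw_tb / usable_tb_per_core_node))
    # Slide guideline: no more than 5 task nodes per core node.
    task_nodes = min(desired_task_nodes, core_nodes * max_task_per_core)
    return core_nodes, task_nodes


core, task = size_cluster(hdfs_data_tb=10, usable_tb_per_core_node=1.5,
                          desired_task_nodes=200)
print(f"core nodes: {core}, task nodes: {task}")
```

If the data lives in S3 via EMRFS rather than HDFS, the storage term shrinks and the core-node count is driven mostly by the task-node ratio instead.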
  • 28. Performance: Choice of storage. EMRFS (S3): - Ability to decouple - Reliable and durable - Cost efficient - Works well for jobs that read a dataset once per run. HDFS: - Needs a persistent cluster - Reliability is configurable, but needs multiple nodes to achieve the replication factor - Great for jobs with iterative reads on the same dataset, like machine learning. Combine with s3-dist-cp and move from S3 once to HDFS for iterative workloads
  • 29. Storage Performance: S3 vs HDFS at Netflix
  • 30. S3 Performance: Range GET vs Data Locality? GET Range 0-64MB; GET Range 64-128MB; GET Range 128-192MB; GET Range (n-64)-nMB. EMR worker nodes. S3 object (use larger files)
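Slide 30 shows worker nodes reading different 64 MB slices of one S3 object in parallel. The sketch below computes the HTTP `Range` header values for such slices; it does not call S3, the function name is hypothetical, and 64 MiB is assumed as the meaning of "64MB".

```python
CHUNK = 64 * 1024 * 1024  # assumption: the slide's "64MB" read as 64 MiB


def byte_ranges(object_size, chunk_size=CHUNK):
    """Split an object of `object_size` bytes into inclusive HTTP Range header values."""
    ranges = []
    start = 0
    while start < object_size:
        # HTTP byte ranges are inclusive at both ends.
        end = min(start + chunk_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


# A 200 MiB object yields three full chunks and one partial chunk.
for header in byte_ranges(200 * 1024 * 1024):
    print(header)
```

Larger objects produce more ranges per file, which is one way to read the slide's "use larger files" advice: each worker gets a substantial slice instead of a tiny object.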
  • 31. S3: Real world heavy EMRFS users. 1.2TB/Day logs. 30TB/Day data. 250 Hadoop Jobs. 75 Billion transactions/Day. 5 Petabytes of Data. 25 PB Data Warehouse on Amazon S3. > 1PB read each day
  • 32. S3: EMRFS at FINRA
  • 33. S3: EMRFS at NASDAQ. Access needs drop off dramatically over time. But, never throw anything away! Yesterday >> last month >> last quarter >> last year..
  • 34. Performance: Framework Performance. Count these words: Count = 1, These = 1, Words = 1. Embarrassingly parallel? Count = 1, These = 1, Words = 1. Can it be optimized with a DAG? A, B, C, D, E
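The "count these words" example on slide 34 is embarrassingly parallel because each partition can be counted independently and the partial counts merged afterwards. A minimal single-process sketch of that map and reduce shape follows; it illustrates the idea only and is not how Hadoop or Spark implement it. Function names are illustrative.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition, independently of the others."""
    return Counter(word.lower() for line in lines for word in line.split())


def word_count(partitions):
    """Reduce step: merge the per-partition counts into one total."""
    partials = (map_partition(p) for p in partitions)
    return reduce(lambda left, right: left + right, partials, Counter())


counts = word_count([["Count these words"], ["count these"]])
print(dict(counts))
```

Because the merge is associative, the partitions could be processed on separate nodes in any order, which is what makes the job easy to scale out.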
  • 35. Reliability. Failure Management: - Store your metadata outside the cluster - Multi-AZ RDS cluster will give you HA. Disaster Recovery: - Keep data and applications on S3 - Maintain source of truth for data on S3 (an immutable data set). Change Management: Automate with: - Bootstrap actions - Config options - CloudFormation
  • 36. Security. Data Protection: Encryption - Server side - Client side - HDFS Transparent - RPC with SSL - File system with LUKS. Privilege Management: - IAM roles - Secure integration with AWS services - Hue, HiveServer2 or 3rd party tools support for role based access. Infrastructure Protection: - VPC - Private Subnets - S3 endpoints - NAT - Security Groups - Audit with logs on S3
  • 37. Security: End to End Encryption. Amazon S3 Bucket. AWS KMS. AWS S3 SDK AmazonS3EncryptionClient(). Encrypted Object. EMRFS with Client-side Encryption. HDFS transparent encryption with Hadoop KMS. spark.ssl.enabled, hadoop.rpc.protection, hadoop.ssl.enabled, mapreduce.shuffle.ssl.enabled. Output writes via EMRFS with Client-side Encryption enabled. Amazon S3 Bucket. LUKS with bootstrap action for local file systems. Server Side Encryption for S3 via KMS or any other Key Management service
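The four property names on slide 37 can be supplied through the same `--configurations` JSON used on slide 7. The sketch below builds such a list. The grouping into `core-site`, `mapred-site`, and `spark-defaults` classifications and the value `privacy` for `hadoop.rpc.protection` are assumptions, not taken from the slide, and turning these flags on is not sufficient by itself: keystores and certificates also have to be provisioned.

```python
import json

# Assumption: which classification each property belongs to, and the values used.
configurations = [
    {
        "Classification": "core-site",
        "Properties": {
            "hadoop.rpc.protection": "privacy",
            "hadoop.ssl.enabled": "true",
        },
    },
    {
        "Classification": "mapred-site",
        "Properties": {"mapreduce.shuffle.ssl.enabled": "true"},
    },
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.ssl.enabled": "true"},
    },
]

configurations_arg = json.dumps(configurations, separators=(",", ":"))
print(configurations_arg)
```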
  • 38. Cost Efficiency. Matching Supply and Demand: • Is the cluster big enough? • Can we make it transient? • Monitor the usage with Ganglia and Amazon CloudWatch alarms. Using cost-effective resources: • S3 instead of HDFS for larger datasets? • Taking advantage of Spot and Reserved instances? Optimise over time: • Monitor and watch out for new instance types and features that may reduce cost.
  • 39. Agenda: Why EMR? (Automation, Decouple, Elastic, Integration, Current, Cost-efficient); DEMO (SparkSQL, EMRFS (S3://), Ganglia, EMR CLI, EMR Console); Well Architected EMR: Design for Production (Presto, Performance tuning, Reliability, Security facts, Cost efficiency). Challenge: Data is Everywhere. Size: Growing in PBs. Strategy: Divide & Conquer. Tool: Amazon EMR
  • 40. AWS Sydney Summit 2016 EMR @ Ambiata Ben Lever CTO, Ambiata
  • 56. EMR
  • 58. 1. Ephemeral Clusters 2. S3 as the Data Lake 3. IAM Roles for Clusters
  • 59. 1. Ephemeral Clusters 2. S3 as the Data Lake 3. IAM Roles for Clusters
  • 64. 100 2 hrs A B 4 hrs2 hrs
  • 65. 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs2 hrs
  • 66. 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs2 hrs
  • 67. EMR Cluster 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs2 hrs
  • 68. 100 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs2 hrs
  • 69. 100 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs 2 hrs A 2 hrs
  • 70. 100 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs 50 2 hrs A 8 hrs B 2 hrs
  • 71. 100 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs 50 2 hrs A 100 2 hrs C 2 hrs D 8 hrs B 2 hrs
  • 72. 100 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs 50 2 hrs A 100 2 hrs C 2 hrs D 250 2 hrs E 8 hrs B 2 hrs
  • 73. 100 100 2 hrs A B 4 hrs C D 2 hrs 2 hrs E 5 hrs 50 8 hrs B 2 hrs A 100 2 hrs C 2 hrs D 250 2 hrs C 1,500 compute-hrs in 12 hrs 1,700 compute-hrs in 17 hrs 2 hrs
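The two totals on slide 73 can be reproduced from the numbers in the diagram. The sketch assumes that the shared 100-node cluster runs six sequential segments (the five labelled jobs plus the unlabelled 2-hour segment), and that the 250-node, 2-hour cluster belongs to job E, as on the preceding slide. Elapsed time depends on job dependencies that the transcript does not show, so only compute-hours are calculated here.

```python
# Shared cluster: 100 nodes held for every segment, whatever each job needs.
SHARED_NODES = 100
shared_segment_hours = [2, 2, 4, 2, 2, 5]  # unlabelled 2 hrs, then A, B, C, D, E

# Ephemeral clusters: (job, nodes, hours), one cluster per job.
ephemeral_jobs = [
    ("A", 100, 2),
    ("B", 50, 8),
    ("C", 100, 2),
    ("D", 100, 2),
    ("E", 250, 2),  # assumption: the slide's second "C" label is job E
]

shared_compute_hours = SHARED_NODES * sum(shared_segment_hours)
ephemeral_compute_hours = sum(nodes * hours for _, nodes, hours in ephemeral_jobs)

print("shared cluster:", shared_compute_hours, "compute-hrs")
print("ephemeral clusters:", ephemeral_compute_hours, "compute-hrs")
```

These match the slide's 1,700 and 1,500 compute-hours: sizing each cluster to its job avoids paying for 100 nodes while a job that needs only 50 is running.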
  • 74. Ephemeral Clusters: 1 Job Per Cluster, Increased Predictability
  • 75. 1. Ephemeral Clusters 2. S3 as the Data Lake 3. IAM Roles for Clusters
  • 80. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU
  • 81. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU
  • 82. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU
  • 83. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU
  • 84. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU
  • 85. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU S3 Data Lake
  • 92. $0.10 / hr spot price EMR
  • 93. $0.10 / hr spot price EMR
  • 94.
  • 95. $0.20 / hr spot price EMR
  • 96.
  • 97. 1. Ephemeral Clusters 2. S3 as the Data Lake 3. IAM Roles for Clusters
  • 98. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU
  • 99. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU IAM IAM IAM IAM
  • 100. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU RD WT RD WT RD WT RD WT
  • 101. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU RD A B C D E F G H WT RD WT RD WT RD WT
  • 102. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU A A B C D E F G H C D RD WT RD WT RD WT
  • 103. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU A A B C D E F G H C D RD WT RD WT RD WT
  • 104. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU A A B C D E F G H C D RD WT RD WT RD WT
  • 105. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU A A B C D E F G H C D RD WT RD WT RD WT
  • 106. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU A A B C D E F G H C D RD WT RD WT RD WT
  • 107. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU A A B C D E F G H C D RD WT RD WT D E G
  • 108. HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU HDFS CPU CPU CPU CPU A A B C D E F G H C D RD WT RD WT D E G
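The RD/WT diagrams on slides 100 to 108 suggest that each cluster's IAM role is limited to reading some datasets and writing others. As a hedged sketch of what such a role's policy might look like, the helper below builds a standard IAM policy document. The bucket name, the mapping of datasets to S3 prefixes, and the specific read and write sets are hypothetical and not taken from the slides.

```python
import json


def s3_prefix_policy(bucket, read_prefixes, write_prefixes):
    """Build an IAM policy allowing reads and writes only under the given prefixes."""
    def arns(prefixes):
        return [f"arn:aws:s3:::{bucket}/{prefix}/*" for prefix in prefixes]

    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["s3:GetObject"], "Resource": arns(read_prefixes)},
            {"Effect": "Allow", "Action": ["s3:PutObject"], "Resource": arns(write_prefixes)},
        ],
    }


# Hypothetical: this cluster may read datasets A and C and write dataset D.
policy = s3_prefix_policy("example-data-lake", ["A", "C"], ["D"])
print(json.dumps(policy, indent=2))
```

A real role would usually also need list permissions on the bucket and whatever EMR itself requires; those are omitted to keep the read/write split visible.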
  • 109. IAM Roles @ Ambiata
  • 110.
  • 112. Live
  • 113. Live
  • 114. Live
  • 115. Live
  • 119. Live Lab Dev Where new data pipelines are tested
  • 124. 1. Ephemeral Clusters 2. S3 as the Data Lake 3. IAM Roles for Clusters
  • 125. EMR
  • 128. AWS Training & Certification. Intro Videos & Labs: free videos and labs to help you learn to work with 30+ AWS services in minutes! Training Classes: in-person and online courses to build technical skills, taught by accredited AWS instructors. Online Labs: practice working with AWS services in a live environment and learn how related services work together. AWS Certification: validate technical skills and expertise; identify qualified IT talent or show you are AWS cloud ready. Learn more: aws.amazon.com/training
  • 129. Your Training Next Steps: ✓ Visit the AWS Training & Certification pod to discuss your training plan & AWS Summit training offer ✓ Register & attend AWS instructor-led training ✓ Get Certified. AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag. Learn more: aws.amazon.com/training