© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon EMR: Optimize Transient Clusters for Data Processing & ETL (ANT341)
Anthony Virtuoso, Principal Engineer, AWS
Eric Mills, Senior Software Engineer, AWS EMR
Esther Kundin, Senior Software Engineer, AWS EMR
Agenda
• Stateless clusters
• Scaling clusters
• Reducing costs
• Cluster orchestration
Amazon EMR basics
• Master node
  • Manages cluster
  • NameNode and YARN ResourceManager (JobTracker in Hadoop 1)
• Core nodes
  • NodeManager (compute; TaskTracker in Hadoop 1)
  • DataNode (HDFS)
• Task nodes
  • NodeManager only
  • No HDFS
[Diagram: an Amazon EMR cluster made up of a master instance group, a core instance group with HDFS, and a task instance group]
Example cluster
• Mixed workloads
• Capacity exhausted at peak
• Paying for idle time
Making our cluster stateless
• Maintain metastores off cluster
• Faster startup time lowers cost
[Diagram: three EMR clusters alongside Amazon RDS, Amazon Redshift, Amazon Athena, AWS Glue, and the AWS Glue Data Catalog]
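One way to keep the metastore off the cluster is to point Hive and Spark SQL at the AWS Glue Data Catalog through EMR configuration classifications. The sketch below only builds that configuration as plain Python data; the function name is hypothetical, and whether you also need other classifications depends on which applications you run.

```python
import json

# Client factory class that directs Hive metastore calls to the AWS Glue Data Catalog.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configurations():
    """Return EMR configuration objects that keep table metadata off the cluster.

    The function name is illustrative; the classifications and the property
    key are the ones EMR uses for Hive and Spark SQL.
    """
    props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(props)},
        {"Classification": "spark-hive-site", "Properties": dict(props)},
    ]


if __name__ == "__main__":
    print(json.dumps(glue_metastore_configurations(), indent=2))
```

The resulting list is the shape expected by the `Configurations` parameter when a cluster is created, so every transient cluster launched with it sees the same tables.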
Making our cluster stateless
[Diagram: two models side by side. Old clustering/localized model: a master node plus four nodes, each with CPU, memory, and HDFS storage. Amazon EMR decoupled model: a master node plus four nodes with CPU and memory only, using Amazon S3 as streaming HDFS through EMRFS.]
HDFS has 3x replication
A 500 TB dataset requires a 1.5 PB cluster with replication
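The replication arithmetic can be checked in a couple of lines. This is a minimal sketch that assumes decimal units (1 PB = 1,000 TB) and ignores any headroom HDFS needs beyond the replicas; the function name is illustrative.

```python
def raw_hdfs_capacity_tb(dataset_tb, replication_factor=3):
    """Raw disk needed to hold a dataset in HDFS at the given replication factor."""
    return dataset_tb * replication_factor


if __name__ == "__main__":
    raw_tb = raw_hdfs_capacity_tb(500)
    print(f"{raw_tb} TB raw = {raw_tb / 1000} PB")
```

With data in Amazon S3 instead, the cluster no longer has to be sized for this raw capacity, only for compute.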
Scaling our cluster
• Scale out or in cluster task instances using automatic scaling
• Control through policies which monitor Amazon CloudWatch metrics
• Popular metrics include 'YARNMemoryAvailablePercentage' and 'ContainerPendingRatio'
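A scaling policy of the kind described above might look like the following sketch. It builds the policy document as plain Python data in the shape EMR automatic scaling uses for instance groups. The thresholds (15% and 75%), adjustment sizes, cooldown, and function names are assumptions chosen for illustration, not recommendations from the talk.

```python
import json


def _rule(name, operator, threshold, adjustment, cooldown=300):
    """Build one scaling rule keyed on YARNMemoryAvailablePercentage."""
    return {
        "Name": name,
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,
                "CoolDown": cooldown,
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": operator,
                "EvaluationPeriods": 1,
                "MetricName": "YARNMemoryAvailablePercentage",
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold,
                "Unit": "PERCENT",
                # EMR substitutes the cluster ID for this placeholder.
                "Dimensions": [{"Key": "JobFlowId", "Value": "${emr.clusterId}"}],
            }
        },
    }


def build_scaling_policy(min_nodes, max_nodes):
    """Scale out when free YARN memory is low, scale in when it is plentiful."""
    return {
        "Constraints": {"MinCapacity": min_nodes, "MaxCapacity": max_nodes},
        "Rules": [
            _rule("scale-out-low-memory", "LESS_THAN", 15.0, 2),
            _rule("scale-in-high-memory", "GREATER_THAN", 75.0, -1),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_scaling_policy(2, 20), indent=2))
```

Scaling in more slowly than scaling out (here, minus one node versus plus two) is a common way to avoid thrashing, but the right values depend on the workload.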
Example cluster with Auto Scaling
• Clusters adapt to demand
• Peak throughput: finish faster
• Pay less for idle time
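The idle-time saving can be illustrated with a toy calculation. The hourly demand profile below is entirely hypothetical, and the sketch assumes the fixed cluster is sized for peak demand and that a scaled cluster tracks demand exactly, which real scaling policies only approximate.

```python
def node_hours_fixed(demand):
    """Node-hours for a cluster permanently sized for the peak hour."""
    return max(demand) * len(demand)


def node_hours_scaled(demand, min_nodes=1):
    """Node-hours for a cluster that follows demand, never below min_nodes."""
    return sum(max(d, min_nodes) for d in demand)


if __name__ == "__main__":
    # Hypothetical day: 8 quiet hours, a 4-hour peak, 12 moderate hours.
    demand = [2] * 8 + [10] * 4 + [4] * 12
    print("fixed :", node_hours_fixed(demand), "node-hours")
    print("scaled:", node_hours_scaled(demand), "node-hours")
```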
Turning on Spot Instances
[Diagram: master node, core instances, task instances]
• Master node and at least one core node should be On-Demand
• Launch clusters in the optimal Availability Zone based on capacity and price
Instance fleets for advanced Spot provisioning
[Diagram: master node, core instances, task instances]
• Mix and match instance types and Spot versus On-Demand
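An instance fleet layout along these lines could be sketched as follows. The structure mirrors the `InstanceFleets` shape used when creating an EMR cluster, but the instance types, capacities, and timeout are placeholder assumptions, and the function name is hypothetical.

```python
def build_instance_fleets(core_on_demand=2, task_spot=8):
    """Master and core fleets On-Demand; task fleet on Spot across several types."""
    return [
        {
            "Name": "master",
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
        },
        {
            "Name": "core",
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": core_on_demand,
            "InstanceTypeConfigs": [
                {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
            ],
        },
        {
            "Name": "task",
            "InstanceFleetType": "TASK",
            "TargetOnDemandCapacity": 0,
            "TargetSpotCapacity": task_spot,
            # Several types give Spot more capacity pools to draw from.
            "InstanceTypeConfigs": [
                {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
                {"InstanceType": "r5.2xlarge", "WeightedCapacity": 2},
                {"InstanceType": "r4.2xlarge", "WeightedCapacity": 2},
            ],
            "LaunchSpecifications": {
                "SpotSpecification": {
                    "TimeoutDurationMinutes": 20,
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                }
            },
        },
    ]


if __name__ == "__main__":
    for fleet in build_instance_fleets():
        types = [c["InstanceType"] for c in fleet["InstanceTypeConfigs"]]
        print(fleet["InstanceFleetType"], types)
```

Falling back to On-Demand when Spot capacity is not fulfilled in time is one choice; terminating the cluster instead is the other option the same setting supports.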
Example cluster with Spot Instances
Options for orchestrating a cluster
• Amazon EMR JobFlow/Step API: configure a cluster and launch a job using the Step API
• AWS Data Pipeline: create a pipeline to schedule cluster creation and job scheduling
• AWS Lambda: use AWS Lambda to launch clusters using the Amazon EMR Step API
• Oozie on Amazon EMR: use Oozie on your cluster to build DAGs of jobs
• Airflow, Luigi, or other schedulers on Amazon Elastic Compute Cloud (Amazon EC2): use an external orchestration system to launch clusters and jobs
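Whichever orchestrator is used, launching a transient cluster comes down to a cluster definition plus one or more steps. The sketch below builds such a request as plain Python data. The release label, instance types, role names, S3 URIs, and function name are placeholder assumptions; the key setting is `KeepJobFlowAliveWhenNoSteps`, which makes the cluster terminate once its steps finish.

```python
def build_transient_cluster_request(name, script_s3_uri, log_s3_uri):
    """Build a cluster request that runs one Spark step and then shuts down."""
    step = {
        "Name": "etl-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            # command-runner.jar runs a command on the master node.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.20.0",  # assumed release; use the one you target
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "r5.xlarge", "InstanceCount": 2},
            ],
            # Transient behavior: terminate when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [step],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_transient_cluster_request(
        "nightly-etl",
        "s3://example-bucket/jobs/etl.py",   # hypothetical script location
        "s3://example-bucket/logs/",         # hypothetical log location
    )
    print(request["Steps"][0]["HadoopJarStep"]["Args"])
```

A Lambda function or an Airflow task could pass a dictionary like this to `boto3.client("emr").run_job_flow(**request)`; that call is left out here so the sketch runs without AWS credentials.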
Example cluster with orchestration
Takeaways
• Use Amazon Simple Storage Service (Amazon S3) for storage
• Use Amazon Aurora or AWS Glue for remote metastore
• Use Auto Scaling for task instances
• Use Spot Instances for lower costs
• Interact with clusters and submit steps with JobFlow/Step API, AWS Data Pipeline, or Airflow on Amazon EC2
Thank you!
Anthony Virtuoso
Eric Mills
Esther Kundin
Appendix: Job optimization
• Hadoop
  • Compress input and output data
  • Adjust number of mappers and reducers
  • Skewed joins
  • Tez
• Spark
  • Dynamic allocation settings
  • RDD reuse
  • Correct join type: broadcast join
  • https://spark.apache.org/docs/latest/tuning.html
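Several of these tuning items map onto configuration properties. The sketch below collects a few of them as EMR configuration classifications in plain Python data. The property names are standard Hadoop and Spark settings, but every value shown (executor bounds, the 50 MB broadcast threshold, the Snappy codec) is an illustrative assumption, and the function name is hypothetical.

```python
def job_tuning_configurations(max_executors=50, broadcast_threshold_mb=50):
    """Return configuration objects covering compression, dynamic allocation,
    and the broadcast-join size threshold. Values must be strings."""
    return [
        {
            "Classification": "mapred-site",
            "Properties": {
                # Compress intermediate map output and final job output.
                "mapreduce.map.output.compress": "true",
                "mapreduce.map.output.compress.codec":
                    "org.apache.hadoop.io.compress.SnappyCodec",
                "mapreduce.output.fileoutputformat.compress": "true",
            },
        },
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.dynamicAllocation.enabled": "true",
                "spark.dynamicAllocation.minExecutors": "1",
                "spark.dynamicAllocation.maxExecutors": str(max_executors),
                # Tables smaller than this many bytes are broadcast in joins.
                "spark.sql.autoBroadcastJoinThreshold":
                    str(broadcast_threshold_mb * 1024 * 1024),
            },
        },
    ]


if __name__ == "__main__":
    for config in job_tuning_configurations():
        print(config["Classification"], sorted(config["Properties"]))
```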

Amazon EMR: Optimize Transient Clusters for Data Processing & ETL (ANT341) - AWS re:Invent 2018
