© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Abhishek Sinha, Amazon Web Services
Gaurav Agrawal, AOL Inc
October 2015
BDT208
A Technical Introduction to
Amazon EMR
What to Expect from the Session
• Technical introduction to Amazon EMR
• Basic tenets
• Amazon EMR feature set
• Real-Life experience of moving a 2-PB, on-premises
Hadoop cluster to the AWS cloud
• Is not a technical introduction to Apache Spark, Apache
Hadoop, or other frameworks
Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR
distribution
• Leverage the elasticity of the cloud
• Baked in security features
• Pay by the hour and save with Spot
• Flexibility to customize
Make it easy, secure, and
cost-effective to run
data-processing frameworks
on the AWS cloud
What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
An Example EMR Cluster
Master Node
r3.2xlarge
Slave Group - Core
c3.2xlarge
Slave Group – Task
m3.xlarge
Slave Group – Task
m3.2xlarge (EC2 Spot)
HDFS (DataNode).
YARN (NodeManager).
NameNode (HDFS)
ResourceManager
(YARN)
Choice of Multiple Instances
CPU
c3 family
cc1.4xlarge
cc2.8xlarge
Memory
m2 family
r3 family
Disk/IO
d2 family
i2 family
General
m1 family
m3 family
Machine
Learning
Batch
Processing
In-memory
(Spark &
Presto)
Large HDFS
Select an Instance
Choose Your Software (Quick Bundles)
Choose Your Software – Custom
Hadoop Applications Available in Amazon EMR
Choose Security and Access Control
You Are Up and Running!
Master Node DNS
Information about the software you are running, logs and features
Infrastructure for this cluster
Security Groups and Roles
Use the CLI
aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK
Programmatic Access to Cluster Provisioning
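As one possible illustration of the SDK route, the sketch below mirrors the CLI example (one master and two core m3.xlarge nodes on emr-4.0.0) using the boto3 `run_job_flow` call. The cluster name, the region, and the use of the default EMR roles are assumptions for illustration and are not from the talk; actually launching requires boto3 and configured AWS credentials.

```python
def build_run_job_flow_request(name="example-cluster"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Mirrors the CLI example: 1 MASTER + 2 CORE m3.xlarge on emr-4.0.0.
    The cluster name is a hypothetical placeholder.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.0.0",
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge", "InstanceCount": 2},
            ],
            # Keep the cluster alive after steps finish (long-running style).
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(region="us-east-1"):
    """Submit the request and return the new cluster (job flow) id."""
    import boto3  # imported lazily so the request builder runs without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.run_job_flow(**build_run_job_flow_request())
    return response["JobFlowId"]


if __name__ == "__main__":
    request = build_run_job_flow_request()
    total = sum(g["InstanceCount"] for g in request["Instances"]["InstanceGroups"])
    print(f"Would request {total} instances on {request['ReleaseLabel']}")
```

Keeping the request as plain data makes it easy to review or store as configuration before anything is launched.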
Now that I have a cluster, I need to process
some data
Amazon EMR can process data from multiple sources
Hadoop Distributed File
System (HDFS)
Amazon S3 (EMRFS)
Amazon DynamoDB
Amazon Kinesis
On an On-premises Environment
Tightly coupled
Compute and Storage Grow Together
Tightly coupled
Storage grows along with
compute
Compute requirements vary
Underutilized or Scarce Resources
[Chart: values on a 0 to 120 scale across periods 1 to 26. Labels: Steady state, Weekly peaks, Re-processing, Provisioned capacity, Underutilized capacity]
Contention for Same Resources
Compute
bound
Memory
bound
Separation of Resources Creates Data Silos
Team A
Replication Adds to Cost
3x
Single datacenter
So how does Amazon EMR solve these problems?
Decouple Storage and Compute
Amazon S3 is Your Persistent Data Store
11 9’s of durability
$0.03 / GB / month in US-East
Lifecycle policies
Versioning
Distributed by default
EMRFS, Amazon S3
The Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file-system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than
open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- WITH SERDEPROPERTIES (...) elided on the slide
LOCATION 'samples/pig-apache/input/'
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- WITH SERDEPROPERTIES (...) elided on the slide
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
Benefit 1: Switch Off Clusters
Amazon S3
Auto-Terminate Clusters
You Can Build a Pipeline
Or You Can Use AWS Data Pipeline
Input data
Use Amazon EMR to
transform unstructured
data to structured
Push to
Amazon S3
Ingest into
Amazon
Redshift
Sample Pipeline
Run Transient or Long-Running Clusters
Run a Long-Running Cluster
Amazon EMR cluster
Benefit 2: Resize Your Cluster
Resize the Cluster
Scale Up, Scale Down, Stop a resize,
issue a resize on another
How do you scale up and save cost?
Spot Instance: Bid Price, On-Demand (OD) Price
Spot Integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
The Spot Bid Advisor
Spot Integration with Amazon EMR
• Can provision instances from the Spot market
• Replaces a Spot instance in case of interruption
• Impact of interruption
• Master node – Can lose the cluster
• Core node – Can lose intermediate data
• Task nodes – Jobs will restart on other nodes (application
dependent)
Scale up with Spot Instances
10 node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
Resize Nodes with Spot Instances
20 node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35
Total = $105
Resize Nodes with Spot Instances
50% less run-time (14 → 7 hours)
25% less cost ($140 → $105)
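The arithmetic on these slides can be checked with a few lines of Python. This is a minimal sketch using the slide's figures ($1.00 per node-hour On-Demand, $0.50 per node-hour on Spot) and the slide's implicit assumption that doubling the node count halves the run-time; the function and variable names are illustrative.

```python
ON_DEMAND_RATE = 1.0  # $ per node-hour, from the slide
SPOT_RATE = 0.5       # $ per node-hour, from the slide


def cluster_cost(groups, hours):
    """Total cost of (node_count, hourly_rate) groups running for `hours`."""
    return sum(nodes * rate * hours for nodes, rate in groups)


# 10 On-Demand nodes for 14 hours.
baseline = cluster_cost([(10, ON_DEMAND_RATE)], hours=14)

# Same 10 On-Demand nodes plus 10 Spot nodes, finishing in 7 hours
# (assumes the job scales linearly with node count).
resized = cluster_cost([(10, ON_DEMAND_RATE), (10, SPOT_RATE)], hours=7)

runtime_saving = (14 - 7) / 14
cost_saving = (baseline - resized) / baseline

print(baseline, resized, runtime_saving, cost_saving)
# prints: 140.0 105.0 0.5 0.25
```

Real jobs rarely scale perfectly linearly, so the run-time halving should be read as the best case.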
Scaling Hadoop Jobs with Spot
http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/
1500 to 2000 clusters
6000 Jobs
maxCpuPerUnitPrice = 0
For each instance_type in (Availability Zone, Region)
{
    cpuPerUnitPrice = instance.cpuCores / instance.spotPrice
    if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
        maxCpuPerUnitPrice = cpuPerUnitPrice;
        optimalInstanceType = instance_type;
    }
}
Source: Github /Bloomreach/ Briefly
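The selection loop above can be sketched in runnable form as follows. This is an illustration of the idea only, not Bloomreach's actual code: the `SpotOffer` type and the prices are hypothetical, and the vCPU counts are the published figures for those instance types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotOffer:
    instance_type: str
    availability_zone: str
    cpu_cores: int
    spot_price: float  # $ per hour (hypothetical sample values below)


def pick_optimal_offer(offers):
    """Return the offer with the most CPU cores per dollar, or None if empty."""
    best, best_ratio = None, 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            continue  # skip malformed prices rather than divide by zero
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            best, best_ratio = offer, ratio
    return best


offers = [
    SpotOffer("m3.xlarge", "us-east-1a", 4, 0.05),
    SpotOffer("c3.2xlarge", "us-east-1b", 8, 0.08),
    SpotOffer("m3.2xlarge", "us-east-1a", 8, 0.10),
]
best = pick_optimal_offer(offers)
print(best.instance_type, best.availability_zone)
# prints: c3.2xlarge us-east-1b
```

Tracking the running maximum alongside the best offer is what makes the comparison meaningful across iterations.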
Intelligent Scale Down
Intelligent Scale Down: HDFS
Effectively Utilize Clusters
[Chart: values on a 0 to 120 scale across periods 1 to 26]
Benefit 3: Logical Separation of Jobs
Hive, Pig,
Cascading
Prod
Presto Ad-Hoc
Amazon S3
Benefit 4: Disaster Recovery Built In
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone Availability Zone
Amazon S3 as a Data Lake
Nate Sammons, Principal Architect – NASDAQ
Reference – AWS Big Data Blog
Re-cap
Rapid provisioning of clusters
Hadoop, Spark, Presto, and other applications
Standard open-source packaging
De-couple storage and compute and scale them
independently
Resize clusters to manage demand
Save costs with Spot instances
How AOL Inc. moved a 2 PB Hadoop
cluster to the AWS cloud
Gaurav Agrawal
Senior Software Engineer, AOL Inc.
AWS Certified Associate Solutions Architect
AOL Data Platforms Architecture 2014
Data Stats & Insights
Cluster Size: 2 PB
In-House Cluster: 100 Nodes
Raw Data/Day: 2-3 TB
Data Retention: 13-24 Months
Challenges with In-House Infrastructure
Fixed Cost
Slow Deployment Cycle
Always On Self Serve
Static : Not Scalable Outages Impact Production Upgrade
Storage Compute
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
Web Console and CLI
Web Console for Training
Setup IAM for users
AWS Services Options
S3 Data upload
EMR Creation & Steps
Try & Test multiple approaches
CLI is your friend..!!!
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
Copy Existing Data to S3
Environment Level Buckets
Dev, QA, Production, Analyst
bucket-dev, bucket-qa, bucket-prod, bucket-analyst
Project Level Buckets
Code, Data, Log, Extract and Control
bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control
Compressed Snappy Data to GZIP
Multi Platforms Support
Best Compression
Lowest storage cost
Low cost for Data OUT
76% Less Storage
70K Saving/Year
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
EMR Design Options
Transient vs. Persistent Cluster
Amazon S3 vs. local HDFS
Elastic Cluster vs. Static Cluster
On-Demand vs. Reserved vs. Spot
Core Nodes vs. Task Nodes
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission - CLI
EMR Jobs Submission - CLI
In-house scheduler
Common Utilities
Provision EMR
Push/Pull Data to S3
Job submission to Scheduler
Database Load
JSON Files
Applications, Steps, Bootstrap, EC2 attributes, Instance Groups
Future : Event Driven Design – Lambda, SQS
EMR Jobs Submission - CLI
aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" \
  --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" \
    "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" \
  --visible-to-all-users \
  --ec2-attributes file://omni_awssot.generic.ec2_attributes.json \
  --ami-version "3.7.0" \
  --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ \
  --enable-debugging \
  --instance-groups file://omni_awssot.generic.instance_groups.json \
  --auto-terminate \
  --applications file://omni_awssot.generic.applications.json \
  --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json \
  --steps file://omni_awssot.generic.steps.json
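A scheduler utility that provisions clusters this way might assemble the command from a few job parameters. The sketch below is an assumption about how such a helper could look, not AOL's actual code: the function name and parameters are hypothetical, the naming convention and file names are copied from the command above, and the log path simply reuses the dataset and subdataset values.

```python
import shlex

CONFIG_PREFIX = "file://omni_awssot.generic"  # file prefix taken from the slide


def build_create_cluster_args(env, project, dataset, subdataset, run_date,
                              owner, region="us-east-1"):
    """Assemble the argument list for `aws emr create-cluster`."""
    # Naming convention from the slide: env_dataset_subdataset_date, lowercase.
    name = f"{env}_{dataset}_{subdataset}_{run_date}".lower()
    tags = [
        f"Env={env}", f"Project={project}", f"Dataset={dataset}",
        f"Owner={owner}", f"Subdataset={subdataset}",
        f"Date={run_date}", f"Region={region}",
    ]
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--tags", *tags,
        "--visible-to-all-users",
        "--ec2-attributes", f"{CONFIG_PREFIX}.ec2_attributes.json",
        "--ami-version", "3.7.0",
        "--log-uri", f"s3://bucket-{env}-log/{dataset}/{subdataset}/",
        "--enable-debugging",
        "--instance-groups", f"{CONFIG_PREFIX}.instance_groups.json",
        "--auto-terminate",
        "--applications", f"{CONFIG_PREFIX}.applications.json",
        "--bootstrap-actions", f"{CONFIG_PREFIX}.bootstrap_actions.json",
        "--steps", f"{CONFIG_PREFIX}.steps.json",
    ]


args = build_create_cluster_args(
    "prod", "Omniture", "DATASET", "SUBDATASET", "2015-10-08", "gaurav")
# Print a shell-safe rendering; pass `args` to subprocess.run to execute it.
print(" ".join(shlex.quote(a) for a in args))
```

Building an argument list rather than a single string avoids quoting mistakes when tag values contain spaces.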
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
Monitoring
EMR WatchDog : Node.js
Duplicate Clusters
Failed Clusters
Long-running Clusters
Long-provisioning Clusters
CloudWatch Alarms
Monthly Billing
S3 Bucket Size
SNS Email Notifications
Amazon CloudWatch
Amazon SNS
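The four WatchDog checks listed above can be sketched as a small classifier. The talk's WatchDog is written in Node.js; this Python version is only an illustration of the logic, with hypothetical thresholds and a hypothetical record layout (`name`, `state`, `created`). The state names are the standard EMR cluster states.

```python
from collections import Counter
from datetime import datetime, timedelta

ACTIVE = {"STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"}
PROVISIONING = {"STARTING", "BOOTSTRAPPING"}


def find_issues(clusters, now,
                max_runtime=timedelta(hours=6),            # hypothetical limit
                max_provisioning=timedelta(minutes=30)):   # hypothetical limit
    """Return (cluster_name, issue) pairs for clusters that need attention."""
    issues = []
    active_names = Counter(c["name"] for c in clusters if c["state"] in ACTIVE)
    for c in clusters:
        age = now - c["created"]
        if c["state"] == "TERMINATED_WITH_ERRORS":
            issues.append((c["name"], "failed"))
        elif c["state"] in PROVISIONING and age > max_provisioning:
            issues.append((c["name"], "long-provisioning"))
        elif c["state"] in ACTIVE and age > max_runtime:
            issues.append((c["name"], "long-running"))
        # Two active clusters with the same name suggest a duplicate submission.
        if c["state"] in ACTIVE and active_names[c["name"]] > 1:
            issues.append((c["name"], "duplicate"))
    return issues


now = datetime(2015, 10, 8, 12, 0)
clusters = [
    {"name": "prod_a", "state": "RUNNING", "created": now - timedelta(hours=8)},
    {"name": "prod_b", "state": "STARTING", "created": now - timedelta(minutes=45)},
    {"name": "prod_b", "state": "RUNNING", "created": now - timedelta(hours=1)},
    {"name": "prod_c", "state": "TERMINATED_WITH_ERRORS", "created": now - timedelta(hours=2)},
]
for name, issue in find_issues(clusters, now):
    print(name, issue)
```

In a setup like the one described, each reported pair would be forwarded to an SNS topic for email notification.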
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
Elasticity
Why be Elastic?
[Chart: Core Nodes Demand, 09/05/2015, x-axis 1 to 24. Label: Daily Processes]
[Chart: Core Nodes Demand, 09/20/2015, x-axis 1 to 24. Labels: No Clusters, Spike in Demand]
[Chart: Core Nodes Demand, 06/01/2015, x-axis 1 to 24. Labels: Major Restatement, Demand > 10K EC2]
Elasticity
Why be Elastic?
True Cloud Architecture
Spot is an Open Market
Scale Horizontally
Our Limit : 3,000 EC2/Region
Multiple Regions
Multiple Instance Types
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Cost Management & BCDR
Cost Management & BCDR
Multi Region Deployment
Best AZ for pricing
Design for failure
Global. BC-DR.
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Cost Management & BCDR
• Optimization
Optimization
Data Management
Partition Data on S3
S3 Versioning/Lifecycle
How many nodes?
Based on Data Volume
Complete hour for pricing
Hadoop Run-time Params
Memory Tuning
Compress M & R Output
Combine Splits Input format
Security
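The "How many nodes?" and "Complete hour for pricing" points can be illustrated with a short calculation. This is a sketch under stated assumptions: every started hour is billed in full per node (the pay-by-the-hour model described earlier in the talk), the job scales linearly with node count, and the workload size and rate are hypothetical.

```python
import math


def billed_cost(node_hours_of_work, nodes, hourly_rate):
    """Return (cost, runtime_hours) when each started hour is billed in full."""
    runtime_hours = node_hours_of_work / nodes  # assumes perfect linear scaling
    billed_hours = math.ceil(runtime_hours)     # partial hours round up
    return billed_hours * nodes * hourly_rate, runtime_hours


# Hypothetical job needing 20 node-hours at $1.00 per node-hour.
for nodes in (10, 15, 20, 25):
    cost, runtime = billed_cost(node_hours_of_work=20, nodes=nodes, hourly_rate=1.0)
    print(f"{nodes} nodes: {runtime:.2f} h run-time, ${cost:.2f} billed")
```

Under these assumptions, 20 nodes finish in one complete hour for the same bill as 10 nodes over two hours, while 15 nodes pay for a second hour they mostly do not use.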
Score Card
Feature AWS
Pay for what you use ✔
Decouple Storage and Compute ✔
True Cloud Architecture ✔
Self Service Model ✔
Elastic & Scalable ✔
Global Infrastructure. BCDR. ✔
Quick & Easy Deployments ✔
Redshift External Tables on S3 ?
More languages for Lambda ?
AWS vs. In-House Cost
[Chart: Service Cost Comparison, scale 0 to 6. In-House: 4x / Month. AWS: 1x / Month]
Source: AOL & AWS Billing Tool
** In-House cluster includes Storage, Power and Network cost.
AWS vs. In-House Cost
10/8/2015
Amazon Web Services
1/4th Cost of In-House Hadoop Infrastructure
1/4th Cost
Data Platforms. AOL Inc.
[Chart: Core Nodes Demand, 06/01/2015, x-axis 1 to 24]
Restatement Use Case
• Restate historical data going back 6 months
10 Availability Zones
550 EMR Clusters
24,000 Spot EC2 Instances
[Chart: Timing Comparison, In-House vs. AWS, scale 0 to 70]
Tag All Resources
Infrastructure as Code
Command Line Interface
JSON as configuration files
IAM Roles and Policies
Use of Application ID
Enable CloudTrail
S3 Lifecycle Management
S3 Versioning
Separate Code/Data/Logs buckets
Keyless EMR Clusters
Hybrid Model
Enable Debugging
Create Multiple CLI Profiles
Multi-Factor Authentication
CloudWatch Billing Alarms
Spot EC2 Instances
SNS notifications for failures
Loosely coupled Apps
Scale Horizontally
Best Practices & Suggestions
Remember to complete
your evaluations!
Thank you!
Photo Credits
• Key Board : http://bit.ly/1LRQMdR
• Compression : http://bit.ly/1MtT3Pa
• Optimization : http://bit.ly/1FlidQD
• WatchDog : http://bit.ly/1OX50j6
• Elasticity : http://bit.ly/1YFfCr4
• Fish Bowl : http://bit.ly/1VjrcJd
• Blank Cheque : http://bit.ly/1RkTgGe

ICT role in 21st century education and its challenges
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

(BDT208) A Technical Introduction to Amazon Elastic MapReduce

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Abhishek Sinha, Amazon Web Services Gaurav Agrawal, AOL Inc October 2015 BDT208 A Technical Introduction to Amazon EMR
  • 2. What to Expect from the Session • Technical introduction to Amazon EMR • Basic tenets • Amazon EMR feature set • Real-Life experience of moving a 2-PB, on-premises Hadoop cluster to the AWS cloud • Is not a technical introduction to Apache Spark, Apache Hadoop, or other frameworks
  • 3. Amazon EMR • Managed platform • MapReduce, Apache Spark, Presto • Launch a cluster in minutes • Open source distribution and MapR distribution • Leverage the elasticity of the cloud • Baked in security features • Pay by the hour and save with Spot • Flexibility to customize
  • 4. Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud
  • 5. What Do I Need to Build a Cluster ? 1. Choose instances 2. Choose your software 3. Choose your access method
  • 6. An Example EMR Cluster Master Node r3.2xlarge Slave Group - Core c3.2xlarge Slave Group – Task m3.xlarge Slave Group – Task m3.2xlarge (EC2 Spot) HDFS (DataNode). YARN (NodeManager). NameNode (HDFS) ResourceManager (YARN)
  • 7. Choice of Multiple Instances CPU c3 family cc1.4xlarge cc2.8xlarge Memory m2 family r3 family Disk/IO d2 family i2 family General m1 family m3 family Machine Learning Batch Processing In-memory (Spark & Presto) Large HDFS
  • 9. Choose Your Software (Quick Bundles)
  • 10. Choose Your Software – Custom
  • 12. Choose Security and Access Control
  • 13. You Are Up and Running!
  • 14. You Are Up and Running! Master Node DNS
  • 15. You Are Up and Running! Information about the software you are running, logs and features
  • 16. You Are Up and Running! Infrastructure for this cluster
  • 17. You Are Up and Running! Security Groups and Roles
  • 18. Use the CLI: aws emr create-cluster --release-label emr-4.0.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge. Or use your favorite SDK
  • 19. Programmatic Access to Cluster Provisioning
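
As a minimal sketch of what programmatic provisioning might look like from an SDK, the Python below builds the keyword arguments that boto3's EMR client accepts in `run_job_flow`, mirroring the CLI example (one m3.xlarge master, two m3.xlarge core nodes, release emr-4.0.0). The helper name `build_run_job_flow_request`, the cluster name, and the choice of the default role names are assumptions for illustration and are not taken from the slides. The actual API call is left in a comment so the sketch runs without AWS credentials.

```python
def build_run_job_flow_request(name, release_label, groups):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    groups is a list of (role, instance_type, count, bid_price) tuples;
    bid_price is None for On-Demand groups (hypothetical helper format).
    """
    instance_groups = []
    for role, instance_type, count, bid_price in groups:
        group = {
            "Name": f"{role.lower()}-group",
            "InstanceRole": role,  # MASTER, CORE, or TASK
            "InstanceType": instance_type,
            "InstanceCount": count,
            "Market": "ON_DEMAND" if bid_price is None else "SPOT",
        }
        if bid_price is not None:
            # The API expects the Spot bid as a string.
            group["BidPrice"] = str(bid_price)
        instance_groups.append(group)
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Instances": {
            "InstanceGroups": instance_groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the account uses the EMR default roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_run_job_flow_request(
    "intro-cluster",  # hypothetical cluster name
    "emr-4.0.0",
    [("MASTER", "m3.xlarge", 1, None), ("CORE", "m3.xlarge", 2, None)],
)

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

Keeping the request as plain data makes it easy to review or store as JSON before anything is launched.
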
  • 20. Now that I have a cluster, I need to process some data
  • 21. Amazon EMR can process data from multiple sources Hadoop Distributed File System (HDFS) Amazon S3 (EMRFS) Amazon DynamoDB Amazon Kinesis
  • 24. On an On-premises Environment Tightly coupled
  • 25. Compute and Storage Grow Together Tightly coupled Storage grows along with compute Compute requirements vary
  • 26. Underutilized or Scarce Resources [chart: y-axis 0 to 120, x-axis 1 to 26]
  • 27. Underutilized or Scarce Resources [same chart, annotated: Re-processing, Weekly peaks, Steady state]
  • 28. Underutilized or Scarce Resources [same chart, annotated: Underutilized capacity, Provisioned capacity]
  • 29. Contention for Same Resources Compute bound Memory bound
  • 30. Separation of Resources Creates Data Silos Team A
  • 31. Replication Adds to Cost 3x Single datacenter
  • 32. So how does Amazon EMR solve these problems?
  • 34. Amazon S3 is Your Persistent Data Store • 11 9's of durability • $0.03 / GB / month in US-East • Lifecycle policies • Versioning • Distributed by default [diagram labels: EMRFS, Amazon S3]
  • 35. The Amazon EMR File System (EMRFS) • Allows you to leverage Amazon S3 as a file-system • Streams data directly from Amazon S3 • Uses HDFS for intermediates • Better read/write performance and error handling than open source components • Consistent view – consistency for read after write • Support for encryption • Fast listing of objects
  • 36. Going from HDFS to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 'samples/pig-apache/input/'
  • 37. Going from HDFS to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
  • 38. Benefit 1: Switch Off Clusters [diagram label: Amazon S3]
  • 40. You Can Build a Pipeline
  • 41. Or You Can Use AWS Data Pipeline Input data Use Amazon EMR to transform unstructured data to structured Push to Amazon S3 Ingest into Amazon Redshift
  • 43. Run Transient or Long-Running Clusters
  • 44. Run a Long-Running Cluster Amazon EMR cluster
  • 45. Benefit 2: Resize Your Cluster
  • 46. Resize the Cluster Scale Up, Scale Down, Stop a resize, issue a resize on another
  • 47. How do you scale up and save cost ?
  • 49. Spot Integration aws emr create-cluster --name "Spot cluster" --ami-version 3.3 --instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
  • 50. The Spot Bid Advisor
  • 51. Spot Integration with Amazon EMR • Can provision instances from the Spot market • Replaces a Spot instance in case of interruption • Impact of interruption • Master node – Can lose the cluster • Core node – Can lose intermediate data • Task nodes – Jobs will restart on other nodes (application dependent)
  • 52. Scale up with Spot Instances 10 node cluster running for 14 hours Cost = 1.0 * 10 * 14 = $140
  • 53. Resize Nodes with Spot Instances Add 10 more nodes on Spot
  • 54. Resize Nodes with Spot Instances 20 node cluster running for 7 hours. On-Demand cost = 1.0 * 10 * 7 = $70; Spot cost = 0.5 * 10 * 7 = $35; Total $105
  • 55. Resize Nodes with Spot Instances 50% less run-time (14 → 7 hours), 25% less cost ($140 → $105)
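
The arithmetic on slides 52 to 55 can be checked with a few lines of Python. This is only a sketch under the slides' own simplifications: an On-Demand price of $1.00 per node-hour, a Spot price of $0.50 per node-hour, and a job that scales linearly, so doubling the node count halves the run-time. The function name `cluster_cost` is hypothetical.

```python
def cluster_cost(node_groups, hours):
    """Total cost for a cluster made of (node_count, hourly_price) groups."""
    return sum(count * price * hours for count, price in node_groups)


# Slide 52: 10 On-Demand nodes for 14 hours.
baseline = cluster_cost([(10, 1.0)], hours=14)

# Slide 54: 10 On-Demand nodes plus 10 Spot nodes for 7 hours.
resized = cluster_cost([(10, 1.0), (10, 0.5)], hours=7)

runtime_saving = 1 - 7 / 14           # fraction of run-time saved
cost_saving = 1 - resized / baseline  # fraction of cost saved

print(f"baseline ${baseline:.0f}, resized ${resized:.0f}")
print(f"run-time saved {runtime_saving:.0%}, cost saved {cost_saving:.0%}")
```

Real Spot prices fluctuate and few jobs scale perfectly, so actual savings will differ from these figures.
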
  • 56. Scaling Hadoop Jobs with Spot http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/ 1500 to 2000 clusters 6000 Jobs
  • 57. For each instance_type in (Availability Zone, Region) { cpuPerUnitPrice = instance.cpuCores / instance.spotPrice; if (maxCpuPerUnitPrice < cpuPerUnitPrice) { maxCpuPerUnitPrice = cpuPerUnitPrice; optimalInstanceType = instance_type; } } Source: GitHub, Bloomreach/Briefly
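
The selection loop on slide 57 could be written in Python roughly as follows. This is a sketch of the idea (pick the offer with the most CPU cores per dollar of Spot price), not the Bloomreach Briefly implementation. The `SpotOffer` type, the function name, and the prices in the example are hypothetical; real prices would come from the Spot price history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotOffer:
    instance_type: str
    availability_zone: str
    cpu_cores: int
    spot_price: float  # USD per instance-hour


def pick_optimal_offer(offers):
    """Return the offer with the highest CPU cores per unit of Spot price."""
    best, best_ratio = None, 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            continue  # skip malformed price data
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            # Track the running maximum as well as the winning offer.
            best, best_ratio = offer, ratio
    return best


# Hypothetical prices, for illustration only.
offers = [
    SpotOffer("m3.xlarge", "us-east-1a", 4, 0.05),
    SpotOffer("c3.2xlarge", "us-east-1b", 8, 0.08),
    SpotOffer("m3.2xlarge", "us-east-1a", 8, 0.12),
]
best = pick_optimal_offer(offers)
print(best.instance_type, best.availability_zone)
```

Ranking on cores per dollar ignores memory and disk, so a memory-bound job would need a different metric.
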
  • 60. Effectively Utilize Clusters [chart: y-axis 0 to 120, x-axis 1 to 26]
  • 61. Benefit 3: Logical Separation of Jobs Hive, Pig, Cascading Prod Presto Ad-Hoc Amazon S3
  • 62. Benefit 4: Disaster Recovery Built In Cluster 1 Cluster 2 Cluster 3 Cluster 4 Amazon S3 Availability Zone Availability Zone
  • 63. Amazon S3 as a Data Lake Nate Sammons, Principal Architect – NASDAQ Reference – AWS Big Data Blog
  • 64. Re-cap Rapid provisioning of clusters Hadoop, Spark, Presto, and other applications Standard open-source packaging De-couple storage and compute and scale them independently Resize clusters to manage demand Save costs with Spot instances
  • 65. How AOL Inc. moved a 2 PB Hadoop cluster to the AWS cloud Gaurav Agrawal Senior Software Engineer, AOL Inc. AWS Certified Associate Solutions Architect
  • 66. AOL Data Platforms Architecture 2014
  • 67. Data Stats & Insights Cluster Size 2 PB In-House Cluster 100 Nodes Raw Data/Day 2-3 TB Data Retention 13-24 Months
  • 68. Challenges with In-House Infrastructure Fixed Cost Slow Deployment Cycle Always On Self Serve Static : Not Scalable Outages Impact Production Upgrade Storage Compute
  • 69. AOL Data Platforms Architecture 2015 [diagram with numbered callouts 1 to 6]
  • 71. Web Console and CLI Web Console for Training Setup IAM for users AWS Services Options S3 Data upload EMR Creation & Steps Try & Test multiple approaches CLI is your friend..!!!
  • 72. Migration • Web Console vs. CLI • Copy Existing Data to S3
  • 73. Copy Existing Data to S3 • Environment Level Buckets (Dev, QA, Production, Analyst): bucket-dev, bucket-qa, bucket-prod, bucket-analyst • Project Level Buckets (Code, Data, Log, Extract and Control): bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control • Compressed Snappy Data to GZIP: Multi Platforms Support, Best Compression, Lowest storage cost, Low cost for Data OUT • 76% Less Storage • 70K Saving/Year
  • 74. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options
  • 75. EMR Design Options • Transient vs. Persistent Cluster • Amazon S3 vs. local HDFS • Elastic vs. Static Cluster • On-Demand vs. Reserved vs. Spot • Core Nodes vs. Task Nodes
  • 76. AOL Data Platforms Architecture 2015
  • 77. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission - CLI
  • 78. EMR Jobs Submission - CLI In-house scheduler Common Utilities Provision EMR Push/Pull Data to S3 Job submission to Scheduler Database Load JSON Files Applications, Steps, Bootstrap,EC2 attributes, Instance Groups Future : Event Driven Design – Lambda, SQS
  • 79. EMR Jobs Submission - CLI aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" --visible-to-all-users --ec2-attributes file://omni_awssot.generic.ec2_attributes.json --ami-version "3.7.0" --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ --enable-debugging --instance-groups file://omni_awssot.generic.instance_groups.json --auto-terminate --applications file://omni_awssot.generic.applications.json --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json --steps file://omni_awssot.generic.steps.json
  • 80. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring
  • 81. Monitoring EMR WatchDog : Node.js Duplicate Clusters Failed Clusters Long-running Clusters Long-provisioning Clusters CloudWatch Alarms Monthly Billing S3 Bucket Size SNS Email Notifications Amazon CloudWatch Amazon SNS
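
The four watchdog checks on slide 81 (duplicate, failed, long-running, and long-provisioning clusters) might be sketched as below. The slide says AOL's WatchDog was written in Node.js; this is an independent Python illustration, not their code. The record layout (`name`, `state`, `created_at`), the thresholds, and the sample clusters are all assumptions. The state names are the standard EMR cluster states.

```python
from collections import Counter
from datetime import datetime, timedelta

ACTIVE = {"STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"}
PROVISIONING = {"STARTING", "BOOTSTRAPPING"}


def find_problems(clusters, now,
                  max_runtime=timedelta(hours=6),
                  max_provisioning=timedelta(minutes=30)):
    """Return (cluster_name, problem) pairs for clusters needing attention."""
    problems = []
    # Count active clusters per name to spot duplicates.
    active_names = Counter(c["name"] for c in clusters if c["state"] in ACTIVE)
    for c in clusters:
        age = now - c["created_at"]
        if c["state"] == "TERMINATED_WITH_ERRORS":
            problems.append((c["name"], "failed"))
        elif c["state"] in PROVISIONING and age > max_provisioning:
            problems.append((c["name"], "long-provisioning"))
        elif c["state"] in ACTIVE and age > max_runtime:
            problems.append((c["name"], "long-running"))
        if c["state"] in ACTIVE and active_names[c["name"]] > 1:
            problems.append((c["name"], "duplicate"))
    return problems


# Hypothetical cluster records, for illustration only.
now = datetime(2015, 10, 8, 12, 0)
clusters = [
    {"name": "prod_a", "state": "RUNNING", "created_at": datetime(2015, 10, 8, 3, 0)},
    {"name": "prod_b", "state": "STARTING", "created_at": datetime(2015, 10, 8, 11, 0)},
    {"name": "prod_c", "state": "TERMINATED_WITH_ERRORS", "created_at": datetime(2015, 10, 8, 9, 0)},
    {"name": "prod_d", "state": "RUNNING", "created_at": datetime(2015, 10, 8, 11, 30)},
    {"name": "prod_d", "state": "RUNNING", "created_at": datetime(2015, 10, 8, 11, 45)},
]
problems = find_problems(clusters, now)
for name, problem in problems:
    print(name, problem)
```

In practice the records would come from the EMR cluster listing, and each problem would be published to an SNS topic for email notification, as the slide describes.
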
  • 82. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity
  • 83. Elasticity: Why be Elastic? [three charts of core-node demand by hour of day (1 to 24): 09/05/2015, annotated "Daily Processes"; 09/20/2015, annotated "No Clusters" and "Spike in Demand"; 06/01/2015, annotated "Major Restatement", Demand > 10K EC2]
  • 84. Elasticity Why be Elastic? True Cloud Architecture Spot is an Open Market Scale Horizontally Our Limit : 3,000 EC2/Region Multiple Regions Multiple Instance Types
  • 85. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity • Cost Management & BCDR
  • 86. Cost Management & BCDR Multi Region Deployment Best AZ for pricing Design for failure Global. BC-DR.
  • 87. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity • Cost Management & BCDR • Optimization
  • 88. Optimization Data Management Partition Data on S3 S3 Versioning/Lifecycle How many nodes? Based on Data Volume Complete hour for pricing Hadoop Run-time Params Memory Tuning Compress M & R Output Combine Splits Input format Security
  • 89. Score Card Feature AWS Pay for what you use ✔ Decouple Storage and Compute ✔ True Cloud Architecture ✔ Self Service Model ✔ Elastic & Scalable ✔ Global Infrastructure. BCDR. ✔ Quick & Easy Deployments ✔ Redshift External Tables on S3 ? More languages for Lambda ?
  • 90. AWS vs. In-House Cost [bar chart: Service Cost Comparison, AWS vs. In-House, scale 0 to 6] In-House: 4x / month; AWS: 1x / month. Source: AOL & AWS Billing Tool. ** In-House cluster includes Storage, Power and Network cost.
  • 91. AWS vs. In-House Cost: Amazon Web Services at 1/4th the cost of in-house Hadoop infrastructure (10/8/2015, Data Platforms, AOL Inc.)
  • 92. Restatement Use Case • Restate historical data going back 6 months • 10 Availability Zones • 550 EMR Clusters • 24,000 Spot EC2 Instances [charts: core-node demand by hour (1 to 24) on 06/01/2015; Timing Comparison, In-House vs. AWS, scale 0 to 70]
  • 93. Best Practices & Suggestions • Tag All Resources • Infrastructure as Code • Command Line Interface • JSON as configuration files • IAM Roles and Policies • Use of Application ID • Enable CloudTrail • S3 Lifecycle Management • S3 Versioning • Separate Code/Data/Logs buckets • Keyless EMR Clusters • Hybrid Model • Enable Debugging • Create Multiple CLI Profiles • Multi-Factor Authentication • CloudWatch Billing Alarms • Spot EC2 Instances • SNS notifications for failures • Loosely coupled Apps • Scale Horizontally
  • 95. Thank you! Photo Credits • Key Board : http://bit.ly/1LRQMdR • Compression : http://bit.ly/1MtT3Pa • Optimization : http://bit.ly/1FlidQD • WatchDog : http://bit.ly/1OX50j6 • Elasticity : http://bit.ly/1YFfCr4 • Fish Bowl : http://bit.ly/1VjrcJd • Blank Cheque : http://bit.ly/1RkTgGe