© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Abhishek Sinha, Amazon Web Services
Gaurav Agrawal, AOL Inc
October 2015
BDT208
A Technical Introduction to
Amazon EMR
What to Expect from the Session
• Technical introduction to Amazon EMR
• Basic tenets
• Amazon EMR feature set
• Real-life experience of moving a 2-PB, on-premises
Hadoop cluster to the AWS cloud
• Is not a technical introduction to Apache Spark, Apache
Hadoop, or other frameworks
Amazon EMR
• Managed platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR
distribution
• Leverage the elasticity of the cloud
• Built-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
Make it easy, secure, and
cost-effective to run
data-processing frameworks
on the AWS cloud
What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
An Example EMR Cluster
Master Node
r3.2xlarge
Slave Group - Core
c3.2xlarge
Slave Group – Task
m3.xlarge
Slave Group – Task
m3.2xlarge (EC2 Spot)
HDFS (DataNode).
YARN (NodeManager).
NameNode (HDFS)
ResourceManager
(YARN)
Choice of Multiple Instances
CPU
c3 family
cc1.4xlarge
cc2.8xlarge
Memory
m2 family
r3 family
Disk/IO
d2 family
i2 family
General
m1 family
m3 family
Machine
Learning
Batch
Processing
In-memory
(Spark &
Presto)
Large HDFS
Select an Instance
Choose Your Software (Quick Bundles)
Choose Your Software – Custom
Hadoop Applications Available in Amazon EMR
Choose Security and Access Control
You Are Up and Running!
Master Node DNS
Information about the software you are running, logs and features
Infrastructure for this cluster
Security Groups and Roles
Use the CLI
aws emr create-cluster \
  --release-label emr-4.0.0 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
Or use your favorite SDK
Programmatic Access to Cluster Provisioning
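As a rough illustration of the SDK route, the sketch below builds the request that the AWS SDK for Python (boto3) would send to launch the same two-group cluster as the CLI example. The cluster name and the default role names are assumptions for illustration, and the launch call itself is left as a comment because it needs boto3 and AWS credentials.

```python
def build_cluster_request(name, release_label, instance_type, core_count):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            # Keep the cluster alive after its steps finish (long-running).
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the account uses the default EMR role names.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# "example-cluster" is a hypothetical name.
request = build_cluster_request("example-cluster", "emr-4.0.0", "m3.xlarge", 2)

# To actually launch (requires boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("emr").run_job_flow(**request)
#   print(response["JobFlowId"])
```

Keeping the request as plain data makes it easy to review or store as configuration before anything is provisioned.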
Now that I have a cluster, I need to process
some data
Amazon EMR can process data from multiple sources
Hadoop Distributed File
System (HDFS)
Amazon S3 (EMRFS)
Amazon DynamoDB
Amazon Kinesis
On an On-premises Environment
Tightly coupled
Compute and Storage Grow Together
Tightly coupled
Storage grows along with
compute
Compute requirements vary
Underutilized or Scarce Resources
[Chart: demand over periods 1-26 on a 0-120 scale. Labels: Re-processing, Weekly peaks, Steady state, Underutilized capacity, Provisioned capacity]
Contention for Same Resources
Compute
bound
Memory
bound
Separation of Resources Creates Data Silos
Team A
Replication Adds to Cost
3x
Single datacenter
So how does Amazon EMR solve these problems?
Decouple Storage and Compute
Amazon S3 is Your Persistent Data Store
11 9’s of durability
$0.03 / GB / month in US-East
Lifecycle policies
Versioning
Distributed by default
EMRFS
Amazon S3
The Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file-system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than
open source components
• Consistent view – consistency for read after write
• Support for encryption
• Fast listing of objects
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- WITH SERDEPROPERTIES (...) clause not shown on the slide
LOCATION 'samples/pig-apache/input/'
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
host STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
-- WITH SERDEPROPERTIES (...) clause not shown on the slide
LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
Benefit 1: Switch Off Clusters
Amazon S3
Auto-Terminate Clusters
You Can Build a Pipeline
Or You Can Use AWS Data Pipeline
Input data
Use Amazon EMR to
transform unstructured
data to structured
Push to
Amazon S3
Ingest into
Amazon
Redshift
Sample Pipeline
Run Transient or Long-Running Clusters
Run a Long-Running Cluster
Amazon EMR cluster
Benefit 2: Resize Your Cluster
Resize the Cluster
Scale Up, Scale Down, Stop a resize,
issue a resize on another
How do you scale up and save cost?
Spot Instance
Bid
Price
OD
Price
Spot Integration
aws emr create-cluster --name "Spot cluster" --ami-version 3.3 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
The Spot Bid Advisor
Spot Integration with Amazon EMR
• Can provision instances from the Spot market
• Replaces a Spot instance in case of interruption
• Impact of interruption
• Master node – Can lose the cluster
• Core node – Can lose intermediate data
• Task nodes – Jobs will restart on other nodes (application
dependent)
Scale up with Spot Instances
10 node cluster running for 14 hours
Cost = 1.0 * 10 * 14 = $140
Resize Nodes with Spot Instances
Add 10 more nodes on Spot
Resize Nodes with Spot Instances
20 node cluster running for 7 hours
On-Demand cost = 1.0 * 10 * 7 = $70
Spot cost = 0.5 * 10 * 7 = $35
Total $105
Resize Nodes with Spot Instances
50% less run-time (14 → 7 hours)
25% less cost ($140 → $105)
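The arithmetic on these slides can be checked with a few lines of Python. This is only a sketch under the slides' own assumptions: $1.00 per On-Demand node-hour, $0.50 per Spot node-hour, and run-time that halves when the node count doubles. The function and constant names are illustrative.

```python
ON_DEMAND_PRICE = 1.0  # $ per node-hour, as on the slide
SPOT_PRICE = 0.5       # $ per node-hour, as on the slide


def cluster_cost(on_demand_nodes, spot_nodes, hours):
    """Total cost of a cluster that mixes On-Demand and Spot nodes."""
    hourly = on_demand_nodes * ON_DEMAND_PRICE + spot_nodes * SPOT_PRICE
    return hourly * hours


baseline = cluster_cost(on_demand_nodes=10, spot_nodes=0, hours=14)
resized = cluster_cost(on_demand_nodes=10, spot_nodes=10, hours=7)
savings_pct = 100 * (baseline - resized) / baseline

print(f"baseline ${baseline:.0f}, resized ${resized:.0f}, saved {savings_pct:.0f}%")
```

The saving depends on the job scaling close to linearly; a job that does not speed up with more nodes would simply cost more.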
Scaling Hadoop Jobs with Spot
http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/
1500 to 2000 clusters
6000 Jobs
maxCpuPerUnitPrice = 0
For each instance_type in (Availability Zone, Region)
{
  cpuPerUnitPrice = instance.cpuCores / instance.spotPrice
  if (maxCpuPerUnitPrice < cpuPerUnitPrice) {
    maxCpuPerUnitPrice = cpuPerUnitPrice;
    optimalInstanceType = instance_type;
  }
}
Source: GitHub, Bloomreach/Briefly
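The selection loop above can be sketched in Python as follows. This is not Bloomreach's code: the `SpotOffer` type is hypothetical and the Spot prices are made-up sample values, chosen only to show the "most CPU cores per dollar" rule.

```python
from typing import NamedTuple, Optional, Sequence


class SpotOffer(NamedTuple):
    """One instance type in one Availability Zone (hypothetical record)."""
    instance_type: str
    availability_zone: str
    cpu_cores: int
    spot_price: float  # $ per hour


def best_offer(offers: Sequence[SpotOffer]) -> Optional[SpotOffer]:
    """Return the offer with the most CPU cores per dollar, or None if empty."""
    best = None
    best_ratio = 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            continue  # skip invalid prices rather than divide by zero
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            best_ratio = ratio
            best = offer
    return best


# Sample prices are invented for illustration only.
offers = [
    SpotOffer("m3.xlarge", "us-east-1a", 4, 0.05),
    SpotOffer("c3.2xlarge", "us-east-1b", 8, 0.08),
    SpotOffer("m3.2xlarge", "us-east-1a", 8, 0.12),
]
choice = best_offer(offers)
print(choice.instance_type, choice.availability_zone)
```

A fuller version might also weigh memory or interruption rates; this sketch optimizes for CPU per dollar only, as the slide does.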
Intelligent Scale Down
Intelligent Scale Down: HDFS
Effectively Utilize Clusters
[Chart: values on a 0-120 scale over periods 1-26]
Benefit 3: Logical Separation of Jobs
Hive, Pig,
Cascading
Prod
Presto Ad-Hoc
Amazon S3
Benefit 4: Disaster Recovery Built In
Cluster 1 Cluster 2
Cluster 3 Cluster 4
Amazon S3
Availability Zone Availability Zone
Amazon S3 as a Data Lake
Nate Sammons, Principal Architect – NASDAQ
Reference – AWS Big Data Blog
Re-cap
Rapid provisioning of clusters
Hadoop, Spark, Presto, and other applications
Standard open-source packaging
De-couple storage and compute and scale them
independently
Resize clusters to manage demand
Save costs with Spot instances
How AOL Inc. moved a 2 PB Hadoop
cluster to the AWS cloud
Gaurav Agrawal
Senior Software Engineer, AOL Inc.
AWS Certified Associate Solutions Architect
AOL Data Platforms Architecture 2014
Data Stats & Insights
Cluster Size
2 PB
In-House
Cluster
100 Nodes
Raw
Data/Day
2-3 TB
Data
Retention
13-24 Months
Challenges with In-House Infrastructure
Fixed Cost
Slow Deployment
Cycle
Always On Self Serve
Static : Not Scalable Outages Impact Production Upgrade
Storage Compute
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
Web Console and CLI
Web Console for Training
Setup IAM for users
AWS Services Options
S3 Data upload
EMR Creation & Steps
Try & Test multiple approaches
CLI is your friend..!!!
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
Copy Existing Data to S3
Environment Level Buckets
Dev, QA, Production, Analyst
bucket-dev, bucket-qa, bucket-prod, bucket-analyst
Project Level Buckets
Code, Data, Log, Extract and Control
bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control
Compressed Snappy Data to GZIP
Multi Platforms Support
Best Compression
Lowest storage cost
Low cost for Data OUT
76% Less Storage
70K Saving/Year
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
EMR Design Options
Transient vs. Persistent Cluster
Amazon S3 vs. local HDFS
Elastic Cluster vs. Static Cluster
On-Demand vs. Reserved vs. Spot
Core Nodes vs. Task Nodes
AOL Data Platforms Architecture 2015
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission - CLI
EMR Jobs Submission - CLI
In-house scheduler
Common Utilities
Provision EMR
Push/Pull Data to S3
Job submission to Scheduler
Database Load
JSON Files
Applications, Steps, Bootstrap,EC2 attributes, Instance Groups
Future : Event Driven Design – Lambda, SQS
EMR Jobs Submission - CLI
aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" \
  --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" \
    "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" \
  --visible-to-all-users \
  --ec2-attributes file://omni_awssot.generic.ec2_attributes.json \
  --ami-version "3.7.0" \
  --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ \
  --enable-debugging \
  --instance-groups file://omni_awssot.generic.instance_groups.json \
  --auto-terminate \
  --applications file://omni_awssot.generic.applications.json \
  --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json \
  --steps file://omni_awssot.generic.steps.json
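The command above reads its instance groups from a JSON file. The talk does not show the contents of AOL's files, so the following is only a sketch of how such a file could be generated; the instance type, counts, and bid price are hypothetical values, and the keys follow the AWS CLI's JSON form of `--instance-groups`.

```python
import json


def instance_groups(instance_type, core_count, task_count=0, task_bid=None):
    """Build the list that the CLI's --instance-groups option accepts as JSON."""
    groups = [
        {
            "Name": "Master",
            "InstanceGroupType": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceGroupType": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        task = {
            "Name": "Task",
            "InstanceGroupType": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        }
        if task_bid is not None:
            # A bid price marks the group as Spot; the value is a string.
            task["BidPrice"] = str(task_bid)
        groups.append(task)
    return groups


# Hypothetical sizing, not AOL's actual configuration.
config_text = json.dumps(
    instance_groups("m3.xlarge", core_count=2, task_count=3, task_bid="0.10"),
    indent=2,
)
print(config_text)
```

Saved to a file, the output could be passed with `--instance-groups file://...` in the same way as the files in the command above.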
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
Monitoring
EMR WatchDog : Node.js
Duplicate Clusters
Failed Clusters
Long-running Clusters
Long-provisioning Clusters
CloudWatch Alarms
Monthly Billing
S3 Bucket Size
SNS Email Notifications
Amazon CloudWatch
Amazon SNS
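The four WatchDog checks listed above can be sketched as a single pass over a list of cluster records. AOL's WatchDog was written in Node.js and its code is not shown in the talk, so everything here is an assumption: the record format, the thresholds, and the function name. The state names are the standard EMR cluster states.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical thresholds; tune to the workload.
LONG_RUNNING = timedelta(hours=6)
LONG_PROVISIONING = timedelta(minutes=30)

PROVISIONING_STATES = {"STARTING", "BOOTSTRAPPING"}
WORKING_STATES = {"RUNNING", "WAITING"}
ACTIVE_STATES = PROVISIONING_STATES | WORKING_STATES


def find_problems(clusters, now):
    """Return (cluster name, problem) pairs for clusters that need attention.

    Each cluster is a dict with "name", "state", and "created" (a datetime).
    """
    active_names = Counter(
        c["name"] for c in clusters if c["state"] in ACTIVE_STATES
    )
    problems = []
    for c in clusters:
        age = now - c["created"]
        if c["state"] == "TERMINATED_WITH_ERRORS":
            problems.append((c["name"], "failed"))
        elif c["state"] in ACTIVE_STATES:
            if active_names[c["name"]] > 1:
                problems.append((c["name"], "duplicate"))
            if c["state"] in PROVISIONING_STATES and age > LONG_PROVISIONING:
                problems.append((c["name"], "long-provisioning"))
            elif c["state"] in WORKING_STATES and age > LONG_RUNNING:
                problems.append((c["name"], "long-running"))
    return problems


now = datetime(2015, 10, 8, 12, 0)
clusters = [
    {"name": "prod_a", "state": "RUNNING", "created": now - timedelta(hours=1)},
    {"name": "prod_a", "state": "STARTING", "created": now - timedelta(minutes=5)},
    {"name": "prod_b", "state": "TERMINATED_WITH_ERRORS", "created": now - timedelta(hours=2)},
    {"name": "prod_c", "state": "RUNNING", "created": now - timedelta(hours=8)},
    {"name": "prod_d", "state": "BOOTSTRAPPING", "created": now - timedelta(minutes=45)},
]
problems = find_problems(clusters, now)
for name, problem in problems:
    print(name, problem)
```

In practice the records would come from the EMR API and each problem would be published as a notification, for example through Amazon SNS as the slide suggests.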
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
Elasticity
Why be Elastic?
[Chart: Core Nodes Demand by hour (1-24), 09/05/2015. Daily Processes]
[Chart: Core Nodes Demand by hour (1-24), 09/20/2015. No Clusters; Spike in Demand]
[Chart: Core Nodes Demand by hour (1-24), 06/01/2015. Major Restatement; Demand > 10K EC2]
Elasticity
Why be Elastic?
True Cloud Architecture
Spot is an Open Market
Scale Horizontally
Our Limit : 3,000 EC2/Region
Multiple Regions
Multiple Instance Types
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Cost Management & BCDR
Cost Management & BCDR
Multi Region Deployment
Best AZ for pricing
Design for failure
Global. BC-DR.
Migration
• Web Console vs. CLI
• Copy Existing Data to S3
• EMR Design options
• EMR Jobs Submission – CLI
• Monitoring
• Elasticity
• Cost Management & BCDR
• Optimization
Optimization
Data Management
Partition Data on S3
S3 Versioning/Lifecycle
How many nodes?
Based on Data Volume
Complete hour for pricing
Hadoop Run-time Params
Memory Tuning
Compress M & R Output
Combine Splits Input format
Security
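The "How many nodes?" items above can be illustrated with a small calculation. This is a sketch under stated assumptions: billing is rounded up to whole hours (the talk describes paying by the hour), work scales linearly with node count, and the per-node throughput and price are hypothetical numbers.

```python
import math


def billed_cost(nodes, runtime_hours, price_per_node_hour):
    """Cost when every started hour is billed as a full hour."""
    return nodes * math.ceil(runtime_hours) * price_per_node_hour


def nodes_for_one_hour(data_gb, gb_per_node_hour):
    """Smallest node count that finishes within one hour, assuming linear scaling."""
    return max(1, math.ceil(data_gb / gb_per_node_hour))


# Hypothetical workload: 2,500 GB per run, 100 GB processed per node-hour.
nodes = nodes_for_one_hour(data_gb=2500, gb_per_node_hour=100)

# 20 nodes would need 1.25 hours and be billed for 2 full hours.
undersized = billed_cost(nodes=20, runtime_hours=1.25, price_per_node_hour=1.0)
right_sized = billed_cost(nodes=nodes, runtime_hours=1.0, price_per_node_hour=1.0)

print(nodes, undersized, right_sized)
```

Under hourly billing, a slightly larger cluster that finishes inside the hour can cost less than a smaller one that spills into the next hour.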
Score Card
Feature AWS
Pay for what you use ✔
Decouple Storage and Compute ✔
True Cloud Architecture ✔
Self Service Model ✔
Elastic & Scalable ✔
Global Infrastructure. BCDR. ✔
Quick & Easy Deployments ✔
Redshift External Tables on S3 ?
More languages for Lambda ?
AWS vs. In-House Cost
[Chart: Service Cost Comparison, AWS vs. In-House]
In-House: 4x / Month
AWS: 1x / Month
Source: AOL & AWS Billing Tool
** In-House cluster includes Storage, Power and Network cost.
AWS vs. In-House Cost
10/8/2015
Amazon Web Services
1/4th Cost of In-House Hadoop Infrastructure
1/4th Cost
Data Platforms. AOL Inc.
[Chart: Core Nodes Demand by hour (1-24), 06/01/2015]
Restatement Use Case
• Restate historical data going back 6 months
10 Availability Zones
550 EMR Clusters
24,000 Spot EC2 Instances
[Chart: Timing Comparison, In-House vs. AWS]
Tag All Resources
Infrastructure as Code
Command Line Interface
JSON as configuration files
IAM Roles and Policies
Use of Application ID
Enable CloudTrail
S3 Lifecycle Management
S3 Versioning
Separate Code/Data/Logs buckets
Keyless EMR Clusters
Hybrid Model
Enable Debugging
Create Multiple CLI Profiles
Multi-Factor Authentication
CloudWatch Billing Alarms
Spot EC2 Instances
SNS notifications for failures
Loosely coupled Apps
Scale Horizontally
Best Practices & Suggestions
Remember to complete
your evaluations!
Thank you!
Photo Credits
• Key Board : http://bit.ly/1LRQMdR
• Compression : http://bit.ly/1MtT3Pa
• Optimization : http://bit.ly/1FlidQD
• WatchDog : http://bit.ly/1OX50j6
• Elasticity : http://bit.ly/1YFfCr4
• Fish Bowl : http://bit.ly/1VjrcJd
• Blank Cheque : http://bit.ly/1RkTgGe

More Related Content

What's hot

Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceAmazon Web Services
 
Databases - Choosing the right Database on AWS
Databases - Choosing the right Database on AWSDatabases - Choosing the right Database on AWS
Databases - Choosing the right Database on AWSAmazon Web Services
 
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel AvivFinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel AvivAmazon Web Services
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
 
AWS S3 and GLACIER
AWS S3 and GLACIERAWS S3 and GLACIER
AWS S3 and GLACIERMahesh Raj
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
Introduction to AWS Cost Management
Introduction to AWS Cost ManagementIntroduction to AWS Cost Management
Introduction to AWS Cost ManagementAmazon Web Services
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewAmazon Web Services
 
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나Amazon Web Services Korea
 
Enterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWSEnterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWSAmazon Web Services
 
ABCs of AWS: S3
ABCs of AWS: S3ABCs of AWS: S3
ABCs of AWS: S3Mark Cohen
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 

What's hot (20)

Getting Started with AWS Database Migration Service
Getting Started with AWS Database Migration ServiceGetting Started with AWS Database Migration Service
Getting Started with AWS Database Migration Service
 
Databases - Choosing the right Database on AWS
Databases - Choosing the right Database on AWSDatabases - Choosing the right Database on AWS
Databases - Choosing the right Database on AWS
 
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel AvivFinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
FinOps - AWS Cost and Operational Efficiency - Pop-up Loft Tel Aviv
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
AWS S3 and GLACIER
AWS S3 and GLACIERAWS S3 and GLACIER
AWS S3 and GLACIER
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Introduction to AWS Cost Management
Introduction to AWS Cost ManagementIntroduction to AWS Cost Management
Introduction to AWS Cost Management
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution Overview
 
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
AWS Lake Formation을 통한 손쉬운 데이터 레이크 구성 및 관리 - 윤석찬 :: AWS Unboxing 온라인 세미나
 
Enterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWSEnterprise-Database-Migration-Strategies-and-Options-on-AWS
Enterprise-Database-Migration-Strategies-and-Options-on-AWS
 
Cost optimization on AWS
Cost optimization on AWSCost optimization on AWS
Cost optimization on AWS
 
AWS 101
AWS 101AWS 101
AWS 101
 
ABCs of AWS: S3
ABCs of AWS: S3ABCs of AWS: S3
ABCs of AWS: S3
 
Amazon EMR Masterclass
Amazon EMR MasterclassAmazon EMR Masterclass
Amazon EMR Masterclass
 
Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)Amazon Relational Database Service (Amazon RDS)
Amazon Relational Database Service (Amazon RDS)
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Intro to AWS: Storage Services
Intro to AWS: Storage ServicesIntro to AWS: Storage Services
Intro to AWS: Storage Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 

Viewers also liked

AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...Amazon Web Services
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMRAmazon Web Services
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesAmazon Web Services
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMRABC Talks
 
Optimizing Costs and Efficiency of AWS Services
Optimizing Costs and Efficiency of AWS Services Optimizing Costs and Efficiency of AWS Services
Optimizing Costs and Efficiency of AWS Services Amazon Web Services
 
Account Separation and Mandatory Access Control Partner Summit
Account Separation and Mandatory Access Control Partner SummitAccount Separation and Mandatory Access Control Partner Summit
Account Separation and Mandatory Access Control Partner SummitAmazon Web Services
 
Putting it All Together: Securing Systems at Cloud Scale
Putting it All Together: Securing Systems at Cloud ScalePutting it All Together: Securing Systems at Cloud Scale
Putting it All Together: Securing Systems at Cloud ScaleAmazon Web Services
 
(SEC203) Journey to Securing Time Inc's Move to the Cloud
(SEC203) Journey to Securing Time Inc's Move to the Cloud(SEC203) Journey to Securing Time Inc's Move to the Cloud
(SEC203) Journey to Securing Time Inc's Move to the CloudAmazon Web Services
 
Financial Services Analytics on AWS
Financial Services Analytics on AWSFinancial Services Analytics on AWS
Financial Services Analytics on AWSAmazon Web Services
 
(SEC316) Harden Your Architecture w/ Security Incident Response Simulations
(SEC316) Harden Your Architecture w/ Security Incident Response Simulations(SEC316) Harden Your Architecture w/ Security Incident Response Simulations
(SEC316) Harden Your Architecture w/ Security Incident Response SimulationsAmazon Web Services
 
(SPOT303) Security Operations at Massive Scale
(SPOT303) Security Operations at Massive Scale(SPOT303) Security Operations at Massive Scale
(SPOT303) Security Operations at Massive ScaleAmazon Web Services
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with ExamplesJoe McTee
 
(NET405) Build a Remote Access VPN Solution on AWS
(NET405) Build a Remote Access VPN Solution on AWS(NET405) Build a Remote Access VPN Solution on AWS
(NET405) Build a Remote Access VPN Solution on AWSAmazon Web Services
 
基幹業務もHadoop(EMR)で!!のその後
基幹業務もHadoop(EMR)で!!のその後基幹業務もHadoop(EMR)で!!のその後
基幹業務もHadoop(EMR)で!!のその後Keigo Suda
 
(ARC401) Cloud First: New Architecture for New Infrastructure
(ARC401) Cloud First: New Architecture for New Infrastructure(ARC401) Cloud First: New Architecture for New Infrastructure
(ARC401) Cloud First: New Architecture for New InfrastructureAmazon Web Services
 

Viewers also liked (20)

AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
AWS re:Invent 2016: Deep Dive: Amazon EMR Best Practices & Design Patterns (B...
 
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR
 
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar SeriesIntroducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
Map Reduce along with Amazon EMR
Map Reduce along with Amazon EMRMap Reduce along with Amazon EMR
Map Reduce along with Amazon EMR
 
A Mayo Clinic Big Data Implementation
A Mayo Clinic Big Data ImplementationA Mayo Clinic Big Data Implementation
A Mayo Clinic Big Data Implementation
 
Optimizing Costs and Efficiency of AWS Services
Optimizing Costs and Efficiency of AWS Services Optimizing Costs and Efficiency of AWS Services
Optimizing Costs and Efficiency of AWS Services
 
Account Separation and Mandatory Access Control Partner Summit
Account Separation and Mandatory Access Control Partner SummitAccount Separation and Mandatory Access Control Partner Summit
Account Separation and Mandatory Access Control Partner Summit
 
Enterprise IT in the Cloud
Enterprise IT in the Cloud Enterprise IT in the Cloud
Enterprise IT in the Cloud
 
Putting it All Together: Securing Systems at Cloud Scale
Putting it All Together: Securing Systems at Cloud ScalePutting it All Together: Securing Systems at Cloud Scale
Putting it All Together: Securing Systems at Cloud Scale
 
(SEC203) Journey to Securing Time Inc's Move to the Cloud
(SEC203) Journey to Securing Time Inc's Move to the Cloud(SEC203) Journey to Securing Time Inc's Move to the Cloud
(SEC203) Journey to Securing Time Inc's Move to the Cloud
 
Financial Services Analytics on AWS
Financial Services Analytics on AWSFinancial Services Analytics on AWS
Financial Services Analytics on AWS
 
(SEC316) Harden Your Architecture w/ Security Incident Response Simulations
(SEC316) Harden Your Architecture w/ Security Incident Response Simulations(SEC316) Harden Your Architecture w/ Security Incident Response Simulations
(SEC316) Harden Your Architecture w/ Security Incident Response Simulations
 
(SPOT303) Security Operations at Massive Scale
(SPOT303) Security Operations at Massive Scale(SPOT303) Security Operations at Massive Scale
(SPOT303) Security Operations at Massive Scale
 
Hadoop tools with Examples
Hadoop tools with ExamplesHadoop tools with Examples
Hadoop tools with Examples
 
(NET405) Build a Remote Access VPN Solution on AWS
(NET405) Build a Remote Access VPN Solution on AWS(NET405) Build a Remote Access VPN Solution on AWS
(NET405) Build a Remote Access VPN Solution on AWS
 
Accelerate Track
Accelerate TrackAccelerate Track
Accelerate Track
 
Amazon WorkSpaces for Education
Amazon WorkSpaces for EducationAmazon WorkSpaces for Education
Amazon WorkSpaces for Education
 
基幹業務もHadoop(EMR)で!!のその後
基幹業務もHadoop(EMR)で!!のその後基幹業務もHadoop(EMR)で!!のその後
基幹業務もHadoop(EMR)で!!のその後
 
(ARC401) Cloud First: New Architecture for New Infrastructure
(ARC401) Cloud First: New Architecture for New Infrastructure(ARC401) Cloud First: New Architecture for New Infrastructure
(ARC401) Cloud First: New Architecture for New Infrastructure
 

Similar to (BDT208) A Technical Introduction to Amazon Elastic MapReduce

Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...
(BDT308) Using Amazon Elastic MapReduce as Your Scalable Data Warehouse | AWS...Amazon Web Services
 
Data freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSData freedom: come migrare i carichi di lavoro Big Data su AWS
Data freedom: come migrare i carichi di lavoro Big Data su AWSAmazon Web Services
 
Real world High Performance & High Throughput Computing on AWS
Real world High Performance & High Throughput Computing on AWSReal world High Performance & High Throughput Computing on AWS
Real world High Performance & High Throughput Computing on AWSAmazon Web Services
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Amazon Web Services
 
Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Design patterns and best practices for data analytics with amazon emr (ABD305)
Design patterns and best practices for data analytics with amazon emr (ABD305)Amazon Web Services
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Data Analytics on AWS
Data Analytics on AWSData Analytics on AWS
Data Analytics on AWSDanilo Poccia
 

(BDT208) A Technical Introduction to Amazon Elastic MapReduce

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Abhishek Sinha, Amazon Web Services Gaurav Agrawal, AOL Inc October 2015 BDT208 A Technical Introduction to Amazon EMR
  • 2. What to Expect from the Session • Technical introduction to Amazon EMR • Basic tenets • Amazon EMR feature set • Real-Life experience of moving a 2-PB, on-premises Hadoop cluster to the AWS cloud • Is not a technical introduction to Apache Spark, Apache Hadoop, or other frameworks
  • 3. Amazon EMR • Managed platform • MapReduce, Apache Spark, Presto • Launch a cluster in minutes • Open source distribution and MapR distribution • Leverage the elasticity of the cloud • Baked in security features • Pay by the hour and save with Spot • Flexibility to customize
  • 4. Make it easy, secure, and cost-effective to run data-processing frameworks on the AWS cloud
  • 5. What Do I Need to Build a Cluster? 1. Choose instances 2. Choose your software 3. Choose your access method
  • 6. An Example EMR Cluster. Master Node (r3.2xlarge): NameNode (HDFS), ResourceManager (YARN). Slave Group - Core (c3.2xlarge): HDFS (DataNode), YARN (NodeManager). Slave Group – Task (m3.xlarge). Slave Group – Task (m3.2xlarge, EC2 Spot).
  • 7. Choice of Multiple Instances. CPU: c3 family, cc1.4xlarge, cc2.8xlarge. Memory: m2 family, r3 family. Disk/IO: d2 family, i2 family. General: m1 family, m3 family. Use cases: Machine Learning, Batch Processing, In-memory (Spark & Presto), Large HDFS.
  • 9. Choose Your Software (Quick Bundles)
  • 10. Choose Your Software – Custom
  • 12. Choose Security and Access Control
  • 13. You Are Up and Running!
  • 14. You Are Up and Running! Master Node DNS
  • 15. You Are Up and Running! Information about the software you are running, logs and features
  • 16. You Are Up and Running! Infrastructure for this cluster
  • 17. You Are Up and Running! Security Groups and Roles
  • 18. Use the CLI: aws emr create-cluster --release-label emr-4.0.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge (or use your favorite SDK)
  • 19. Programmatic Access to Cluster Provisioning
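The "use your favorite SDK" option can be illustrated with a minimal Python sketch. It is not the presenters' code: it only builds a request dictionary in the shape that boto3's EMR `run_job_flow` call accepts, mirroring the two instance groups from the CLI example. The cluster name, the group naming scheme, and the helper functions are hypothetical, and the actual API call is left as a comment because it needs AWS credentials and IAM roles.

```python
def instance_group(role, instance_type, count):
    """Describe one instance group in the shape used by boto3's run_job_flow."""
    return {
        "Name": f"{role.lower()}-group",  # hypothetical naming choice
        "InstanceRole": role,             # "MASTER", "CORE", or "TASK"
        "InstanceType": instance_type,
        "InstanceCount": count,
        "Market": "ON_DEMAND",
    }


def build_cluster_request(name, release_label, groups):
    """Assemble keyword arguments for an EMR cluster launch."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Instances": {
            "InstanceGroups": groups,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


request = build_cluster_request(
    "example-cluster",  # hypothetical cluster name
    "emr-4.0.0",
    [
        instance_group("MASTER", "m3.xlarge", 1),
        instance_group("CORE", "m3.xlarge", 2),
    ],
)

# With AWS credentials and the EMR IAM roles in place, the request could be
# sent roughly like this (JobFlowRole and ServiceRole would also be needed):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
print(request["Instances"]["InstanceGroups"])
```

Keeping the request as plain data makes it easy to review or store before anything is provisioned.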
  • 20. Now that I have a cluster, I need to process some data
  • 21. Amazon EMR can process data from multiple sources Hadoop Distributed File System (HDFS) Amazon S3 (EMRFS) Amazon DynamoDB Amazon Kinesis
  • 24. On an On-premises Environment Tightly coupled
  • 25. Compute and Storage Grow Together Tightly coupled Storage grows along with compute Compute requirements vary
  • 26. Underutilized or Scarce Resources [Chart: y-axis 0 to 120, x-axis 1 to 26]
  • 27. Underutilized or Scarce Resources [Same chart, annotated: re-processing, weekly peaks, steady state]
  • 28. Underutilized or Scarce Resources [Same chart, annotated: underutilized capacity, provisioned capacity]
  • 29. Contention for Same Resources Compute bound Memory bound
  • 30. Separation of Resources Creates Data Silos Team A
  • 31. Replication Adds to Cost 3x Single datacenter
  • 32. So how does Amazon EMR solve these problems?
  • 34. Amazon S3 is Your Persistent Data Store: 11 9’s of durability; $0.03 / GB / month in US-East; lifecycle policies; versioning; distributed by default. [Diagram: EMRFS, Amazon S3]
  • 35. The Amazon EMR File System (EMRFS) • Allows you to leverage Amazon S3 as a file system • Streams data directly from Amazon S3 • Uses HDFS for intermediates • Better read/write performance and error handling than open source components • Consistent view – consistency for read after write • Support for encryption • Fast listing of objects
  • 36. Going from HDFS to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 'samples/pig-apache/input/'
  • 37. Going from HDFS to Amazon S3 CREATE EXTERNAL TABLE serde_regex( host STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' LOCATION 's3://elasticmapreduce.samples/pig-apache/input/'
  • 38. Benefit 1: Switch Off Clusters [Diagram: Amazon S3]
  • 40. You Can Build a Pipeline
  • 41. Or You Can Use AWS Data Pipeline Input data Use Amazon EMR to transform unstructured data to structured Push to Amazon S3 Ingest into Amazon Redshift
  • 43. Run Transient or Long-Running Clusters
  • 44. Run a Long-Running Cluster Amazon EMR cluster
  • 45. Benefit 2: Resize Your Cluster
  • 46. Resize the Cluster: scale up, scale down, stop a resize, issue a resize on another
  • 47. How do you scale up and save cost?
  • 49. Spot Integration aws emr create-cluster --name "Spot cluster" --ami-version 3.3 --instance-groups InstanceGroupType=MASTER,InstanceType=m3.xlarge,InstanceCount=1 InstanceGroupType=CORE,BidPrice=0.03,InstanceType=m3.xlarge,InstanceCount=2 InstanceGroupType=TASK,BidPrice=0.10,InstanceType=m3.xlarge,InstanceCount=3
  • 50. The Spot Bid Advisor
  • 51. Spot Integration with Amazon EMR • Can provision instances from the Spot market • Replaces a Spot instance in case of interruption • Impact of interruption • Master node – Can lose the cluster • Core node – Can lose intermediate data • Task nodes – Jobs will restart on other nodes (application dependent)
  • 52. Scale up with Spot Instances 10 node cluster running for 14 hours Cost = 1.0 * 10 * 14 = $140
  • 53. Resize Nodes with Spot Instances Add 10 more nodes on Spot
  • 54. Resize Nodes with Spot Instances 20 node cluster running for 7 hours. On-Demand cost = 1.0 * 10 * 7 = $70; Spot cost = 0.5 * 10 * 7 = $35; Total $105
  • 55. Resize Nodes with Spot Instances 50% less run-time (14 → 7) 25% less cost (140 → 105)
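The arithmetic behind the Spot resize slides can be checked with a short Python sketch. The per-node-hour prices (1.0 for On-Demand, 0.5 for Spot) and the halving of run time when the node count doubles are taken from the slides; the function and variable names are illustrative only.

```python
def cluster_cost(nodes, hours, price_per_node_hour):
    """Cost of running `nodes` instances for `hours` at a flat hourly price."""
    return nodes * hours * price_per_node_hour


# Baseline from the slides: 10 On-Demand nodes for 14 hours.
baseline = cluster_cost(10, 14, 1.0)

# Resized: the same 10 On-Demand nodes plus 10 Spot nodes, all for 7 hours.
resized = cluster_cost(10, 7, 1.0) + cluster_cost(10, 7, 0.5)

runtime_saving = 1 - 7 / 14
cost_saving = 1 - resized / baseline

print(f"baseline ${baseline:.0f}, resized ${resized:.0f}")
print(f"run time down {runtime_saving:.0%}, cost down {cost_saving:.0%}")
```

The saving depends on the job scaling linearly with node count, which the slides assume and real workloads may not achieve.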
  • 56. Scaling Hadoop Jobs with Spot http://engineering.bloomreach.com/strategies-for-reducing-your-amazon-emr-costs/ 1500 to 2000 clusters 6000 Jobs
  • 57. For each instance_type in (Availability Zone, Region) { cpuPerUnitPrice = instance.cpuCores / instance.spotPrice; if (maxCpuPerUnitPrice < cpuPerUnitPrice) { maxCpuPerUnitPrice = cpuPerUnitPrice; optimalInstanceType = instance_type; } } Source: GitHub, Bloomreach/Briefly
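The selection loop on this slide can be sketched in runnable Python. This is an illustration of the idea (pick the offer with the most CPU cores per dollar), not the Bloomreach implementation; the `SpotOffer` type, the zone names, and the sample prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpotOffer:
    instance_type: str
    availability_zone: str
    cpu_cores: int
    spot_price: float  # USD per instance-hour (sample values, not real quotes)


def best_offer(offers):
    """Return the offer with the highest CPU cores per unit price."""
    best = None
    best_ratio = 0.0
    for offer in offers:
        if offer.spot_price <= 0:
            continue  # skip malformed price data
        ratio = offer.cpu_cores / offer.spot_price
        if ratio > best_ratio:
            best, best_ratio = offer, ratio
    return best


offers = [
    SpotOffer("m3.xlarge", "us-east-1a", 4, 0.04),
    SpotOffer("c3.2xlarge", "us-east-1b", 8, 0.10),
    SpotOffer("m3.2xlarge", "us-east-1c", 8, 0.07),
]

choice = best_offer(offers)
print(choice.instance_type, choice.availability_zone)
```

Note that the running maximum is updated together with the chosen offer; without that update the loop would simply return the last offer seen.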
  • 60. Effectively Utilize Clusters [Chart: y-axis 0 to 120, x-axis 1 to 26]
  • 61. Benefit 3: Logical Separation of Jobs [Diagram: Hive, Pig, Cascading (Prod); Presto (Ad-Hoc); Amazon S3]
  • 62. Benefit 4: Disaster Recovery Built In Cluster 1 Cluster 2 Cluster 3 Cluster 4 Amazon S3 Availability Zone Availability Zone
  • 63. Amazon S3 as a Data Lake Nate Sammons, Principal Architect – NASDAQ Reference – AWS Big Data Blog
  • 64. Re-cap Rapid provisioning of clusters Hadoop, Spark, Presto, and other applications Standard open-source packaging De-couple storage and compute and scale them independently Resize clusters to manage demand Save costs with Spot instances
  • 65. How AOL Inc. moved a 2 PB Hadoop cluster to the AWS cloud Gaurav Agrawal Senior Software Engineer, AOL Inc. AWS Certified Associate Solutions Architect
  • 66. AOL Data Platforms Architecture 2014
  • 67. Data Stats & Insights. Cluster Size: 2 PB. In-House Cluster: 100 Nodes. Raw Data/Day: 2-3 TB. Data Retention: 13-24 Months.
  • 68. Challenges with In-House Infrastructure Fixed Cost Slow Deployment Cycle Always On Self Serve Static : Not Scalable Outages Impact Production Upgrade Storage Compute
  • 69. AOL Data Platforms Architecture 2015 [Diagram with numbered callouts]
  • 71. Web Console and CLI Web Console for Training Setup IAM for users AWS Services Options S3 Data upload EMR Creation & Steps Try & Test multiple approaches CLI is your friend..!!!
  • 72. Migration • Web Console vs. CLI • Copy Existing Data to S3
  • 73. Copy Existing Data to S3. Environment Level Buckets (Dev, QA, Production, Analyst): bucket-dev, bucket-qa, bucket-prod, bucket-analyst. Project Level Buckets (Code, Data, Log, Extract and Control): bucket-prod-code, bucket-prod-data, bucket-prod-log, bucket-prod-extract, bucket-prod-control. Compressed Snappy Data to GZIP: multi-platform support, best compression, lowest storage cost, low cost for data out. 76% Less Storage, 70K Saving/Year.
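The bucket layout on this slide follows a simple naming convention, which a small Python helper can make explicit. This is a sketch under the assumption that names are built as prefix, environment, then project area, as the slide's examples suggest; the helper itself and its validation rules are hypothetical.

```python
ENVIRONMENTS = ("dev", "qa", "prod", "analyst")
PROJECT_AREAS = ("code", "data", "log", "extract", "control")


def bucket_name(environment, area=None, prefix="bucket"):
    """Build an environment-level or project-level bucket name."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    if area is None:
        return f"{prefix}-{environment}"
    if area not in PROJECT_AREAS:
        raise ValueError(f"unknown project area: {area}")
    return f"{prefix}-{environment}-{area}"


print(bucket_name("qa"))
print([bucket_name("prod", area) for area in PROJECT_AREAS])
```

Centralizing the convention in one function keeps job scripts from hard-coding bucket names per environment.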
  • 74. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options
  • 75. EMR Design Options (Amazon EMR): Transient vs. Persistent Cluster; Amazon S3 vs. local HDFS; Elastic Cluster vs. Static Cluster; On-Demand vs. Reserved vs. Spot; Core Nodes vs. Task Nodes
  • 76. AOL Data Platforms Architecture 2015
  • 77. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission - CLI
  • 78. EMR Jobs Submission - CLI. In-house scheduler. Common Utilities: Provision EMR, Push/Pull Data to S3, Job submission to Scheduler, Database Load. JSON Files: Applications, Steps, Bootstrap, EC2 attributes, Instance Groups. Future: Event Driven Design – Lambda, SQS
  • 79. EMR Jobs Submission - CLI aws emr create-cluster --name "prod_dataset_subdataset_2015-10-08" --tags "Env=prod" "Project=Omniture" "Dataset=DATASET" "Owner=gaurav" "Subdataset=SUBDATASET" "Date=2015-10-08" "Region=us-east-1" --visible-to-all-users --ec2-attributes file://omni_awssot.generic.ec2_attributes.json --ami-version "3.7.0" --log-uri s3://bucket-prod-log/DATASET_NAME/SUBDATASET_NAME/ --enable-debugging --instance-groups file://omni_awssot.generic.instance_groups.json --auto-terminate --applications file://omni_awssot.generic.applications.json --bootstrap-actions file://omni_awssot.generic.bootstrap_actions.json --steps file://omni_awssot.generic.steps.json
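The command above reads its instance groups from a JSON file whose contents the slides do not show. As a rough sketch, a scheduler utility might generate the cluster name and that JSON as follows. The group names, instance type, and counts are hypothetical; the keys follow the JSON form that the AWS CLI accepts for `--instance-groups`, and the name pattern copies the one visible on the slide.

```python
import json
from datetime import date


def cluster_name(env, dataset, subdataset, run_date):
    """Follow the env_dataset_subdataset_date pattern seen on the slide."""
    return f"{env}_{dataset}_{subdataset}_{run_date.isoformat()}"


def instance_groups(core_count, instance_type="m3.xlarge"):
    """Instance group definitions in the AWS CLI's JSON form (sample values)."""
    return [
        {
            "Name": "master",
            "InstanceGroupType": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceGroupType": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]


name = cluster_name("prod", "dataset", "subdataset", date(2015, 10, 8))
groups_json = json.dumps(instance_groups(2), indent=2)

print(name)
print(groups_json)  # could be saved to a file and passed via file://
```

Generating these files per job lets the same launch command be reused across datasets with different sizing.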
  • 80. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring
  • 81. Monitoring. EMR WatchDog (Node.js): Duplicate Clusters, Failed Clusters, Long-running Clusters, Long-provisioning Clusters. CloudWatch Alarms: Monthly Billing, S3 Bucket Size. SNS: Email Notifications. [Icons: Amazon CloudWatch, Amazon SNS]
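The four WatchDog checks named on this slide can be sketched in a few lines. The slide says the real tool is written in Node.js and gives no further detail, so the following Python is purely illustrative: the cluster record shape, the thresholds, and the alert labels are assumptions. The state names are the standard EMR cluster states.

```python
from collections import Counter
from datetime import datetime, timedelta

PROVISIONING = ("STARTING", "BOOTSTRAPPING")
ACTIVE = ("RUNNING", "WAITING")


def check_clusters(clusters, now,
                   max_runtime=timedelta(hours=6),          # hypothetical threshold
                   max_provisioning=timedelta(minutes=30)):  # hypothetical threshold
    """Return (cluster name, problem) pairs for the four WatchDog conditions."""
    alerts = []
    for c in clusters:
        age = now - c["created"]
        if c["state"] == "TERMINATED_WITH_ERRORS":
            alerts.append((c["name"], "failed"))
        elif c["state"] in PROVISIONING and age > max_provisioning:
            alerts.append((c["name"], "long-provisioning"))
        elif c["state"] in ACTIVE and age > max_runtime:
            alerts.append((c["name"], "long-running"))
    # Two live clusters with the same name suggest a job was submitted twice.
    live = Counter(c["name"] for c in clusters
                   if c["state"] in PROVISIONING + ACTIVE)
    alerts.extend((name, "duplicate") for name, n in live.items() if n > 1)
    return alerts


now = datetime(2015, 10, 8, 12, 0)
clusters = [
    {"name": "prod_a", "state": "RUNNING", "created": now - timedelta(hours=8)},
    {"name": "prod_b", "state": "STARTING", "created": now - timedelta(minutes=45)},
    {"name": "prod_c", "state": "TERMINATED_WITH_ERRORS", "created": now - timedelta(hours=2)},
    {"name": "prod_d", "state": "RUNNING", "created": now - timedelta(hours=1)},
    {"name": "prod_d", "state": "RUNNING", "created": now - timedelta(minutes=30)},
]

alerts = check_clusters(clusters, now)
for alert in alerts:
    print(alert)
```

In a real deployment the alerts would be published to a notification channel such as Amazon SNS rather than printed.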
  • 82. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity
  • 83. Elasticity: Why be Elastic? [Charts: Core Nodes Demand by hour of day, 1 to 24] • 09/05/2015: Daily Processes • 09/20/2015: No Clusters; Spike in Demand • 06/01/2015: Major Restatement, Demand > 10K EC2
  • 84. Elasticity: Why be Elastic? • True Cloud Architecture • Spot is an Open Market • Scale Horizontally • Our Limit: 3,000 EC2/Region • Multiple Regions • Multiple Instance Types
  • 85. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity • Cost Management & BCDR
  • 86. Cost Management & BCDR • Multi Region Deployment • Best AZ for pricing • Design for failure • Global. BC-DR.
  • 87. Migration • Web Console vs. CLI • Copy Existing Data to S3 • EMR Design options • EMR Jobs Submission – CLI • Monitoring • Elasticity • Cost Management & BCDR • Optimization
  • 88. Optimization • Data Management: Partition Data on S3, S3 Versioning/Lifecycle • How many nodes? Based on Data Volume; Complete hour for pricing • Hadoop Run-time Params: Memory Tuning, Compress M & R Output, Combine Splits Input format • Security
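The "Complete hour for pricing" point follows from the per-hour billing the deck describes: a started hour is charged in full, so a job that runs slightly past an hour boundary pays for a second hour on every node. A minimal Python sketch of that arithmetic follows. The function names and the node counts and runtimes in the usage note are illustrative assumptions, not figures from the talk.

```python
import math


def billed_instance_hours(nodes: int, runtime_minutes: float) -> int:
    """Instance-hours billed when every started hour is charged in full."""
    if nodes <= 0 or runtime_minutes <= 0:
        return 0
    return nodes * math.ceil(runtime_minutes / 60)


def hour_utilization(runtime_minutes: float) -> float:
    """Fraction of the billed time that the cluster actually ran."""
    billed_minutes = 60 * math.ceil(runtime_minutes / 60)
    return runtime_minutes / billed_minutes
```

Under these assumptions, 10 nodes running for 70 minutes are billed 20 instance-hours, while 12 nodes finishing in 58 minutes are billed 12. Sizing the cluster so the job completes just inside an hour boundary can therefore cost less than a smaller cluster that spills into the next hour.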
  • 89. Score Card (Feature: AWS) • Pay for what you use: ✔ • Decouple Storage and Compute: ✔ • True Cloud Architecture: ✔ • Self Service Model: ✔ • Elastic & Scalable: ✔ • Global Infrastructure. BCDR.: ✔ • Quick & Easy Deployments: ✔ • Redshift External Tables on S3: ? • More languages for Lambda: ?
  • 90. AWS vs. In-House Cost [Chart: Service Cost Comparison, AWS vs. In-House] • In-House: 4x / Month • AWS: 1x / Month • Source: AOL & AWS Billing Tool • ** In-House cluster includes Storage, Power and Network cost.
  • 91. AWS vs. In-House Cost • Amazon Web Services: 1/4th Cost of In-House Hadoop Infrastructure • Data Platforms, AOL Inc., 10/8/2015
  • 92. Restatement Use Case • Restate historical data going back 6 months • 10 Availability Zones • 550 EMR Clusters • 24,000 Spot EC2 Instances [Charts: Core Nodes Demand by hour, 06/01/2015; Timing Comparison, In-House vs. AWS]
  • 93. Best Practices & Suggestions • Tag All Resources • Infrastructure as Code • Command Line Interface • JSON as configuration files • IAM Roles and Policies • Use of Application ID • Enable CloudTrail • S3 Lifecycle Management • S3 Versioning • Separate Code/Data/Logs buckets • Keyless EMR Clusters • Hybrid Model • Enable Debugging • Create Multiple CLI Profiles • Multi-Factor Authentication • CloudWatch Billing Alarms • Spot EC2 Instances • SNS notifications for failures • Loosely coupled Apps • Scale Horizontally
  • 95. Thank you! Photo Credits • Key Board : http://bit.ly/1LRQMdR • Compression : http://bit.ly/1MtT3Pa • Optimization : http://bit.ly/1FlidQD • WatchDog : http://bit.ly/1OX50j6 • Elasticity : http://bit.ly/1YFfCr4 • Fish Bowl : http://bit.ly/1VjrcJd • Blank Cheque : http://bit.ly/1RkTgGe