© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. 
Amazon Elastic MapReduce: 
Deep Dive and Best Practices 
Ian Meyers, AWS (meyersi@) 
October 29th, 2014
Outline 
Introduction to Amazon EMR 
Amazon EMR Design Patterns 
Amazon EMR Best Practices 
Observations from AWS
What is EMR? 
Hadoop-as-a-Service: a Map-Reduce engine with a vibrant ecosystem 
Massively parallel and cost effective 
An AWS wrapper, integrated with AWS services
Amazon EMR Architecture 
EMRFS and HDFS provide the storage layer, with data management and analytics languages on top 
Amazon EMR integrates with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, and AWS Data Pipeline
Amazon EMR Introduction 
Launch clusters of any size in a matter of minutes 
Use a variety of instance sizes to match your workload 
Don’t get stuck with hardware 
Don’t deal with capacity planning 
Run multiple clusters with different sizes, specs, and node types
Elastic MapReduce & Amazon S3 
EMR has an optimised driver for Amazon S3 
64 MB range-offset reads to increase performance 
EMR Consistent View further increases performance and addresses consistency 
S3 cost: $0.03/GB, with volume-based price tiering
Outline 
Introduction to Amazon EMR 
Amazon EMR Design Patterns 
Amazon EMR Best Practices 
Observations from AWS
Amazon EMR Design Patterns 
Pattern #1: Transient vs. Alive Clusters 
Pattern #2: Core Nodes and Task Nodes 
Pattern #3: Amazon S3 & HDFS
Pattern #1: Transient vs. Alive Clusters
Pattern #1: Transient Clusters 
Cluster lives for the duration of the job 
Shut down the cluster when the job is done 
Input and output data persist on Amazon S3
Benefits of Transient Clusters 
1. Control your cost 
2. Minimum maintenance 
• Cluster goes away when job is done 
3. Practice cloud architecture 
• Pay for what you use 
• Data processing as a workflow
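The transient pattern above can be sketched with the AWS CLI: launch a cluster, run a step, and let the cluster terminate itself. The cluster name, bucket, JAR path, and AMI version below are placeholders (not from the talk), and the command needs AWS credentials and an existing bucket to actually run.

```shell
# Hypothetical transient-cluster launch: the cluster runs the step,
# then shuts itself down thanks to --auto-terminate.
aws emr create-cluster \
  --name "nightly-transient-job" \
  --ami-version 3.3 \
  --instance-type m3.xlarge \
  --instance-count 5 \
  --log-uri s3://mybucket/emr-logs/ \
  --steps Type=CUSTOM_JAR,Name=my-job,Jar=s3://mybucket/jars/my-job.jar \
  --auto-terminate
```

You pay only while the step runs; logs land in S3 for inspection after the cluster is gone.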
Alive Clusters 
Very similar to traditional Hadoop deployments 
Cluster stays around after the job is done 
Data persistence models: 
Amazon S3 
Amazon S3 copied to HDFS 
HDFS, with Amazon S3 as backup
Alive Clusters 
Always keep data safe on Amazon S3, even if you’re using HDFS for primary storage 
Get in the habit of shutting down your cluster and starting a new one, once a week or month 
Design your data processing workflow to account for failure 
You can use workflow management tools such as AWS Data Pipeline
Pattern #2: Core & Task nodes
Core Nodes 
Run TaskTrackers (compute) and DataNodes (HDFS) 
Form the core instance group, alongside the master instance group, in the Amazon EMR cluster
Core Nodes 
Can add core nodes 
More HDFS space 
More CPU/memory
Core Nodes 
Can’t remove core nodes, because of HDFS
Amazon EMR Task Nodes 
Run TaskTrackers 
No HDFS 
Read from core node HDFS 
Form a task instance group, alongside the master and core instance groups
Amazon EMR Task Nodes 
Can add task nodes 
More CPU power 
More memory
Amazon EMR Task Nodes 
You can remove task nodes when processing is completed
Task Node Use-Cases 
Speed up job processing using the Spot market 
Run task nodes on the Spot market 
Get a discount on the hourly price 
Nodes can come and go without interruption to your cluster 
When you need extra horsepower for a short amount of time 
Example: need to pull a large amount of data from Amazon S3
Pattern #3: Amazon S3 & HDFS
Option 1: Amazon S3 as HDFS 
Use Amazon S3 as your permanent data store 
HDFS for temporary storage of data between jobs 
No additional step to copy data to HDFS
Benefits: Amazon S3 as HDFS 
Ability to shut down your cluster 
HUGE Benefit!! 
Use Amazon S3 as your durable storage 
11 9s of durability
Benefits: Amazon S3 as HDFS 
No need to scale HDFS 
Capacity 
Replication for durability 
Amazon S3 scales with your data 
Both in IOPS and data storage
Benefits: Amazon S3 as HDFS 
Ability to share data between multiple clusters 
Hard to do with HDFS
Benefits: Amazon S3 as HDFS 
Take advantage of Amazon S3 features 
Amazon S3 Server Side Encryption 
Amazon S3 Lifecycle Policies 
Amazon S3 versioning to protect against corruption 
Build elastic clusters 
Add nodes to read from Amazon S3 
Remove nodes with data safe on Amazon S3
EMR Consistent View 
Provides a ‘consistent view’ of data on S3 within a cluster 
Ensures that all files created by a step are available to subsequent steps 
Index of data from S3, managed by DynamoDB 
Configurable retry & metastore 
New Hadoop config file: emrfs-site.xml 
fs.s3.consistent* system properties
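As a sketch, the consistent-view settings in emrfs-site.xml might look like the following. The property names follow the fs.s3.consistent* family mentioned above, per the EMR documentation of the time; the values here are purely illustrative.

```xml
<!-- emrfs-site.xml: illustrative values only -->
<configuration>
  <property>
    <name>fs.s3.consistent</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3.consistent.retryCount</name>
    <value>5</value>
  </property>
  <property>
    <name>fs.s3.consistent.retryPeriodSeconds</name>
    <value>10</value>
  </property>
  <property>
    <name>fs.s3.consistent.metadata.tableName</name>
    <value>EmrFSMetadata</value>
  </property>
</configuration>
```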
EMR Consistent View 
EMRFS sits over HDFS and Amazon S3 in the Amazon EMR cluster; a registry of processed files is kept in Amazon DynamoDB
EMR Consistent View 
Manage data in EMRFS using the emrfs client: 
describe-metadata, set-metadata-capacity, delete-metadata, create-metadata, list-metadata-stores - work with metadata stores 
diff - show what in a bucket is missing from the index 
delete - remove index entries 
sync - ensure that the index is synced with a bucket 
import - import bucket items into the index
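For illustration, a few emrfs invocations as they might be run on the cluster's master node; the bucket and path are hypothetical, and the tool only exists on an EMR node with consistent view enabled.

```shell
# Compare the EMRFS index with the actual bucket contents
emrfs diff s3://mybucket/data

# Bring the index back in line with the bucket
emrfs sync s3://mybucket/data

# Inspect the backing DynamoDB metadata store
emrfs describe-metadata
```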
What About Data Locality? 
Run your job in the same region as your Amazon S3 bucket 
Amazon EMR nodes have high-speed connectivity to Amazon S3 
If your job is CPU/memory-bound, locality doesn’t make a huge difference
Performance & Scalability 
Amazon S3 streaming performance scales near linearly with reader connections: 
100 VMs: 9.6 GB/s ($26/hr) 
350 VMs: 28.7 GB/s ($90/hr) 
34 secs per terabyte
When HDFS is a Better Choice… 
Iterative workloads 
If you’re processing the same dataset more than once 
Disk I/O intensive workloads
Option 2: Optimise for Latency with HDFS 
1. Data persisted on Amazon S3
Option 2: Optimise for Latency with HDFS 
2. Launch Amazon EMR and copy data to HDFS with S3DistCp 
Option 2: Optimise for Latency with HDFS 
3. Start processing data on HDFS
Benefits: HDFS instead of S3 
Better pattern for I/O-intensive workloads 
Amazon S3 as system of record 
Durability 
Scalability 
Cost 
Features: lifecycle policy, security
Outline 
Introduction to Amazon EMR 
Amazon EMR Design Patterns 
Amazon EMR Best Practices 
Observations from AWS
Amazon EMR Nodes and Size 
Use m1.small instances for functional testing 
Use xlarge or larger nodes for production workloads 
Use CC2/C3 for memory- and CPU-intensive jobs 
HS1, HI1, I2 instances for HDFS workloads 
Prefer a smaller cluster of larger nodes
Holy Grail Question 
How many nodes do I need?
Instance Resource Allocation 
• Hadoop 1 - static number of mappers/reducers configured for the cluster nodes 
• Hadoop 2 - variable number of Hadoop applications, based on file splits and available memory 
• Useful to understand old vs. new sizing
Instance Resources 
Chart: memory (GB), mappers, reducers, CPU (ECU units), and local storage (GB) by instance type
Cluster Sizing Calculation 
1. Estimate the number of tasks your job requires. 
2. Pick an instance and note down the number of tasks it can run in parallel. 
3. Pick some sample data files to run a test workload; the number of sample files should be the same number from step #2. 
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
Cluster Sizing Calculation 
Estimated number of nodes = 
(Total Tasks * Time To Process Sample Files) / (Instance Task Capacity * Desired Processing Time)
Example: Cluster Sizing Calculation 
1. Estimate the number of tasks your job requires 
150 
2. Pick an instance and note down the number of tasks it can run in parallel 
m1.xlarge, with 8 task capacity per instance 
Example: Cluster Sizing Calculation 
3. Pick some sample data files to run a test workload; the number of sample files should be the same number from step #2 
8 files selected for our sample test 
Example: Cluster Sizing Calculation 
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3; note down the amount of time taken to process your sample files 
3 min to process 8 files
Cluster Sizing Calculation 
Estimated number of nodes = 
(Total Tasks For Your Job * Time To Process Sample Files) / (Per-Instance Task Capacity * Desired Processing Time) 
= (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge
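The same estimate can be reproduced as a quick shell calculation, using the example's numbers (integer division truncates 11.25 to 11, matching the slide; round up if you must finish within the desired time):

```shell
# Numbers from the worked example above (all hypothetical figures):
total_tasks=150        # step 1: tasks the job requires
sample_minutes=3       # step 4: minutes to process the sample files on one node
task_capacity=8        # step 2: parallel task slots per m1.xlarge
desired_minutes=5      # target wall-clock time for the whole job

# nodes = (total tasks * sample time) / (per-instance capacity * desired time)
nodes=$(( (total_tasks * sample_minutes) / (task_capacity * desired_minutes) ))
echo "$nodes m1.xlarge"
```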
File Best Practices 
Avoid small files (smaller than 100 MB) at all costs 
Use compression
Holy Grail Question 
What if I have small file issues?
Dealing with Small Files 
Use S3DistCp to combine smaller files together 
S3DistCp takes a pattern and target size to combine smaller input files into larger ones 
./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,
  --dest,hdfs:///local,
  --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
  --targetSize,128'
Compression 
Always Compress Data Files On Amazon S3 
Reduces Bandwidth Between Amazon S3 and 
Amazon EMR 
Speeds Up Your Job 
Compress Task Output
Compression 
Compression types: 
Some are fast BUT offer less space reduction 
Some are space efficient BUT slower 
Some are splittable and some are not 
Algorithm   % Space Remaining   Encoding Speed   Decoding Speed 
GZIP        13%                 21 MB/s          118 MB/s 
LZO         20%                 135 MB/s         410 MB/s 
Snappy      22%                 172 MB/s         409 MB/s
Changing Compression Type 
You may decide to change compression type 
Use S3DistCp to change the compression type of your files 
Example: 
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
  /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,
  --dest,hdfs:///local,
  --outputCodec,lzo'
Outline 
Introduction to Amazon EMR 
Amazon EMR Design Patterns 
Amazon EMR Best Practices 
Observations from AWS
M1/C1 Instance Families 
Heavily used by EMR customers 
However, HDFS utilisation is typically very low 
M3/C3 offers better performance/$
M1 vs M3 
Instance Cost / Map Task Cost / Reduce Task 
m1.large $0.08 $0.15 
m1.xlarge $0.06 $0.15 
m3.xlarge $0.04 $0.07 
m3.2xlarge $0.04 $0.07
C1 vs C3 
Instance Cost / Map Task Cost / Reduce Task 
c1.medium $0.13 $0.13 
c1.xlarge $0.35 $0.70 
c3.xlarge $0.05 $0.11 
c3.2xlarge $0.05 $0.11
Orc vs Parquet 
File Formats designed for SQL/Data Warehousing 
on Hadoop 
Columnar File Formats 
Compress Well 
High Row Count, Low Cardinality
OrcFile Format 
Optimised Row Columnar format 
Zlib or Snappy external compression 
250 MB stripe of 1 column and index 
Run-length or dictionary encoding 
1 output file per container task 
Parquet File Format 
Gzip or Snappy external compression 
Array data structures 
Limited data type support for Hive 
Batch creation 
1 GB files
Orc vs Parquet 
Depends on the Tool you are using 
Consider Future Architecture & Requirements 
Test Test Test
In Summary 
• Practice Cloud Architecture with Transient Clusters 
• Utilize S3 as the system of record for durability 
• Utilize Task Nodes on Spot for Increased performance and 
Lower Cost 
• Move to new Instance Families for Better Performance/$ 
• Exciting Developments around Columnar File Formats
 
Building Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS CloudBuilding Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS CloudAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 

Similar to AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices (20)

AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice...
 
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
 Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
AWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explainedAWS EMR (Elastic Map Reduce) explained
AWS EMR (Elastic Map Reduce) explained
 
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
(BDT322) How Redfin & Twitter Leverage Amazon S3 For Big Data
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Optimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsOptimizing Storage for Big Data Workloads
Optimizing Storage for Big Data Workloads
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
ABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWSABD312_Deep Dive Migrating Big Data Workloads to AWS
ABD312_Deep Dive Migrating Big Data Workloads to AWS
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
AWS re:Invent 2016: Workshop: Stretching Scalability: Doing more with Amazon ...
 
Building Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS CloudBuilding Analytics Applications in the AWS Cloud
Building Analytics Applications in the AWS Cloud
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
EMR Training
EMR TrainingEMR Training
EMR Training
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 

Recently uploaded (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

  • 15. Pattern #1: Transient Clusters Cluster lives for the duration of the job Shut down the cluster when the job is done Data persists on Amazon S3 Input & output data on Amazon S3
  • 16. Benefits of Transient Clusters 1. Control your cost 2. Minimum maintenance • Cluster goes away when job is done 3. Practice cloud architecture • Pay for what you use • Data processing as a workflow
  • 17. Alive Clusters Very similar to traditional Hadoop deployments Cluster stays around after the job is done Data persistence models: Amazon S3 only; Amazon S3 copied to HDFS; HDFS with Amazon S3 as backup
  • 18. Alive Clusters Always keep data safe on Amazon S3, even if you’re using HDFS for primary storage Get in the habit of shutting down your cluster and starting a new one, once a week or month Design your data processing workflow to account for failure You can use workflow management tools such as AWS Data Pipeline
  • 19. Pattern #2: Core & Task nodes
  • 20. Core Nodes Master instance group Amazon EMR cluster Core instance group HDFS HDFS Run TaskTrackers (Compute) Run DataNode (HDFS)
  • 21. Core Nodes Can add core nodes More HDFS space More CPU/memory Master instance group Amazon EMR cluster Core instance group HDFS HDFS HDFS
  • 22. Core Nodes Can’t remove core nodes because of HDFS Master instance group Core instance group HDFS HDFS HDFS Amazon EMR cluster
  • 23. Amazon EMR Task Nodes Run TaskTrackers No HDFS Reads from core node HDFS Task instance group Master instance group Core instance group HDFS HDFS Amazon EMR cluster
  • 24. Amazon EMR Task Nodes Can add task nodes Task instance group Master instance group Core instance group HDFS HDFS Amazon EMR cluster
  • 25. Amazon EMR Task Nodes More CPU power More memory Task instance group Master instance group Core instance group HDFS HDFS Amazon EMR cluster
  • 26. Amazon EMR Task Nodes You can remove task nodes when processing is completed Task instance group Master instance group Core instance group HDFS HDFS Amazon EMR cluster
  • 27. Amazon EMR Task Nodes You can remove task nodes when processing is completed Task instance group Master instance group Core instance group HDFS HDFS Amazon EMR cluster
  • 28. Task Node Use-Cases Speed up job processing using Spot market Run task nodes on Spot market Get discount on hourly price Nodes can come and go without interruption to your cluster When you need extra horsepower for a short amount of time Example: Need to pull large amount of data from Amazon S3
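The Spot task-node pattern above can be scripted. The sketch below (cluster id, instance type, and bid price are all hypothetical) only assembles the AddInstanceGroups payload; uncomment the boto3 lines to actually submit it:

```python
# import boto3  # uncomment to call the EMR API (not needed to build the payload)

# Hypothetical values for illustration only.
CLUSTER_ID = "j-XXXXXXXXXXXXX"

# Task instance group: runs TaskTrackers only (no HDFS), so it can be
# added for a burst of work and removed again without risking data.
task_group = {
    "Name": "spot-task-nodes",
    "InstanceRole": "TASK",
    "InstanceType": "m1.xlarge",
    "InstanceCount": 4,
    "Market": "SPOT",
    "BidPrice": "0.10",  # USD/hour bid on the Spot market
}

def build_request(cluster_id, group):
    """Assemble the AddInstanceGroups request body."""
    return {"JobFlowId": cluster_id, "InstanceGroups": [group]}

request = build_request(CLUSTER_ID, task_group)
# boto3.client("emr").add_instance_groups(**request)
```

When the burst is over, the same group can be shrunk or removed, which is what makes task nodes (unlike core nodes) safe to run on the Spot market.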
  • 29. Pattern #3: Amazon S3 & HDFS
  • 30. Option 1: Amazon S3 as HDFS Use Amazon S3 as your permanent data store HDFS for temporary storage of data between jobs No additional step to copy data to HDFS Amazon EMR Cluster Task Instance Group Core Instance Group HDFS HDFS Amazon S3
  • 31. Benefits: Amazon S3 as HDFS Ability to shut down your cluster HUGE Benefit!! Use Amazon S3 as your durable storage 11 9s of durability
  • 32. Benefits: Amazon S3 as HDFS No need to scale HDFS Capacity Replication for durability Amazon S3 scales with your data Both in IOPs and data storage
  • 33. Benefits: Amazon S3 as HDFS Ability to share data between multiple clusters Hard to do with HDFS Amazon S3 EMR EMR
  • 34. Benefits: Amazon S3 as HDFS Take advantage of Amazon S3 features Amazon S3 Server Side Encryption Amazon S3 Lifecycle Policies Amazon S3 versioning to protect against corruption Build elastic clusters Add nodes to read from Amazon S3 Remove nodes with data safe on Amazon S3
  • 35. EMR Consistent View Provides a ‘consistent view’ of data on S3 within a cluster Ensures that all files created by a step are available to subsequent steps Index of data in S3, managed by DynamoDB Configurable retry & metastore New Hadoop config file emrfs-site.xml with fs.s3.consistent* system properties
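As a sketch, an emrfs-site.xml fragment enabling the consistent view might look like this (values are illustrative; check your EMR release for the exact property set and defaults):

```xml
<!-- emrfs-site.xml: EMRFS consistent view settings (illustrative values) -->
<configuration>
  <property>
    <name>fs.s3.consistent</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3.consistent.retryCount</name>
    <value>5</value>
  </property>
  <property>
    <name>fs.s3.consistent.metadata.tableName</name>
    <value>EmrFSMetadata</value>
  </property>
</configuration>
```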
  • 36. EMR Consistent View EMRfs HDFS Amazon EMR Amazon S3 Amazon DynamoDB File Data Processed Files Registry
  • 37. EMR Consistent View Manage data in EMRFS using the emrfs client: emrfs describe-metadata, set-metadata-capacity, delete-metadata, create-metadata, list-metadata-stores - work with metadata stores diff - show which objects in a bucket are missing from the index delete - remove index entries sync - ensure that the index is in sync with a bucket import - import bucket items into the index
  • 38. What About Data Locality? Run your job in the same region as your Amazon S3 bucket Amazon EMR nodes have high speed connectivity to Amazon S3 If your job is CPU/memory-bound, locality doesn’t make a huge difference
  • 39. Performance & Scalability Amazon S3 provides near-linear scalability (chart: GB/second vs. reader connections): 100 VMs at 9.6 GB/s ($26/hr); 350 VMs at 28.7 GB/s ($90/hr); 34 secs per terabyte
  • 40. When HDFS is a Better Choice… Iterative workloads If you’re processing the same dataset more than once Disk I/O intensive workloads
  • 41. Option 2: Optimise for Latency with HDFS 1. Data persisted on Amazon S3
  • 42. Option 2: Optimise for Latency with HDFS 2. Launch Amazon EMR and copy data to HDFS with S3DistCp
  • 43. Option 2: Optimise for Latency with HDFS 3. Start processing data on HDFS
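Step 2 can be sketched as an S3DistCp argument list (bucket names are hypothetical; --src, --dest, --groupBy, and --targetSize are documented S3DistCp options, the latter two combining many small input files into fewer, larger ones during the copy):

```python
def s3distcp_args(src, dest, group_by=None, target_size_mb=None):
    """Build the argument list for an S3DistCp copy (e.g. S3 -> HDFS)."""
    args = [f"--src={src}", f"--dest={dest}"]
    if group_by:
        # Regex whose capture group names the combined output file.
        args.append(f"--groupBy={group_by}")
    if target_size_mb:
        # Approximate size (MB) of each combined output file.
        args.append(f"--targetSize={target_size_mb}")
    return args

# Hypothetical bucket: copy input from S3 into the cluster's HDFS,
# grouping hourly log files and aiming for ~128 MB output files.
args = s3distcp_args("s3://my-bucket/input/", "hdfs:///input/",
                     group_by=r".*/(\w+)-\d+\.log", target_size_mb=128)
```

These arguments would then be handed to the S3DistCp jar as an EMR step (the jar's location varies by AMI/release).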
  • 44. Benefits: HDFS instead of S3 Better pattern for I/O-intensive workloads Amazon S3 as system of record Durability Scalability Cost Features: lifecycle policy, security
  • 45. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Observations from AWS
  • 46. Amazon EMR Nodes and Size Use m1.small instances for functional testing Use xlarge+ nodes for production workloads Use CC2/C3 for memory- and CPU-intensive jobs HS1, HI1, I2 instances for HDFS workloads Prefer a smaller cluster of larger nodes
  • 47. Holy Grail Question How many nodes do I need?
  • 48. Instance Resource Allocation • Hadoop 1 - Static Number of Mappers/Reducers configured for the Cluster Nodes • Hadoop 2 - Variable Number of Hadoop Applications based on File Splits and Available Memory • Useful to understand Old vs New Sizing
  • 49. Instance Resources (chart: Memory (GB), Mappers*, Reducers*, CPU (ECU Units), and Local Storage (GB) by instance type)
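As a back-of-the-envelope sketch of the Hadoop 2 sizing model above, the number of parallel tasks per node falls out of a memory division. The figures below are illustrative assumptions chosen for the example, not EMR defaults:

```python
# Hadoop 2 / YARN-style estimate: parallel tasks per node are driven by
# available memory, not by a fixed slot count as in Hadoop 1.
# Both numbers below are illustrative assumptions, not EMR defaults.
node_yarn_memory_mb = 23040   # assumed memory YARN can allocate on one node
map_container_mb = 1440       # assumed memory requested per map container

containers_per_node = node_yarn_memory_mb // map_container_mb
print(containers_per_node)    # 16 parallel map tasks per node here
```

Under Hadoop 1 the equivalent figure would simply be the static mapper/reducer slot count configured for the instance type.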
  • 50. Cluster Sizing Calculation 1. Estimate the number of tasks your job requires. 2. Pick an instance and note down the number of tasks it can run in parallel 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2. 4. Run an Amazon EMR cluster with a single Core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
  • 51. Cluster Sizing Calculation Estimated Number Of Nodes = (Total Tasks * Time To Process Sample Files) / (Instance Task Capacity * Desired Processing Time)
  • 52. Example: Cluster Sizing Calculation 1. Estimate the number of tasks your job requires 150 2. Pick an instance and note down the number of tasks it can run in parallel m1.xlarge with 8 task capacity per instance
  • 53. Example: Cluster Sizing Calculation 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2. 8 files selected for our sample test
  • 54. Example: Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files. 3 min to process 8 files
  • 55. Cluster Sizing Calculation Estimated number of nodes = (Total Tasks For Your Job * Time To Process Sample Files) / (Per Instance Task Capacity * Desired Processing Time) = (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge
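The worked example can be checked with a few lines of Python. This is just a sketch of the slide's formula; the raw result of 11.25 is rounded to the 11 nodes shown:

```python
# Estimated nodes = (total tasks * time to process sample files)
#                   / (instance task capacity * desired processing time)
def estimate_nodes(total_tasks, sample_time_min, task_capacity, desired_time_min):
    return (total_tasks * sample_time_min) / (task_capacity * desired_time_min)

nodes = estimate_nodes(total_tasks=150, sample_time_min=3,
                       task_capacity=8, desired_time_min=5)
print(nodes)  # 11.25, i.e. roughly 11 m1.xlarge instances
```

Rounding up rather than down is the safer choice if finishing inside the desired window matters more than cost.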
  • 56. File Best Practices Avoid small files at all costs (smaller than 100MB) Use Compression
  • 57. Holy Grail Question What if I have small file issues?
  • 58. Dealing with Small Files Use S3DistCP to combine smaller files together S3DistCP takes a grouping pattern and target size to combine smaller input files into larger ones ./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'
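To see how the --groupBy pattern buckets input files, the same regex can be exercised in Python. The file names below are invented for illustration; the capture-group behaviour is the point:

```python
import re

# The capture-group regex from the S3DistCp --groupBy argument: files whose
# names yield the same captured group are combined into one output file.
group_by = re.compile(r".*XABCD12345678\.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*")

# Hypothetical CloudFront-style log file names, made up for this example
files = [
    "XABCD12345678.2014-10-29-01.abcd1234.gz",
    "XABCD12345678.2014-10-29-01.efgh5678.gz",
    "XABCD12345678.2014-10-29-02.ijkl9012.gz",
]

groups = {}
for name in files:
    match = group_by.match(name)
    if match:
        groups.setdefault(match.group(1), []).append(name)

print(sorted(groups))                 # ['2014-10-29-01', '2014-10-29-02']
print(len(groups["2014-10-29-01"]))   # 2 small files merged into one group
```

Files that do not match the pattern at all are simply copied through by S3DistCp without being combined.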
  • 59. Compression Always Compress Data Files On Amazon S3 Reduces Bandwidth Between Amazon S3 and Amazon EMR Speeds Up Your Job Compress Task Output
  • 60. Compression Compression Types: Some are fast but offer less space reduction; some are space efficient but slower; some are splittable and some are not.
Algorithm: % Space Remaining / Encoding Speed / Decoding Speed
GZIP: 13% / 21MB/s / 118MB/s
LZO: 20% / 135MB/s / 410MB/s
Snappy: 22% / 172MB/s / 409MB/s
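As a quick, self-contained illustration of the bandwidth point, Python's standard gzip module on synthetic log-style data shows how much compression cuts the bytes moved between Amazon S3 and the cluster:

```python
import gzip

# Repetitive, log-like data of the kind typically staged on S3
record = b"2014-10-29\tGET\t/index.html\t200\t1234\n"
log_data = record * 10000           # roughly 360KB uncompressed

compressed = gzip.compress(log_data)
ratio = len(compressed) / len(log_data)

# Compressed input means less data transferred between S3 and Amazon EMR
print(len(compressed) < len(log_data))  # True
print(ratio < 0.1)                      # True: repetitive data shrinks a lot
```

Real-world ratios depend on the data; the table above gives more representative trade-offs between speed and space for GZIP, LZO, and Snappy.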
  • 61. Changing Compression Type You May Decide To Change Compression Type Use S3DistCP to change the compression types of your files Example: ./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--outputCodec,lzo'
  • 62. Outline Introduction to Amazon EMR Amazon EMR Design Patterns Amazon EMR Best Practices Observations from AWS
  • 63. M1/C1 Instance Families Heavily used by EMR Customers However, HDFS Utilisation is typically very Low M3/C3 Offers better Performance/$
  • 64. M1 vs M3 (Cost / Map Task, Cost / Reduce Task):
m1.large: $0.08 / $0.15
m1.xlarge: $0.06 / $0.15
m3.xlarge: $0.04 / $0.07
m3.2xlarge: $0.04 / $0.07
  • 65. C1 vs C3 (Cost / Map Task, Cost / Reduce Task):
c1.medium: $0.13 / $0.13
c1.xlarge: $0.35 / $0.70
c3.xlarge: $0.05 / $0.11
c3.2xlarge: $0.05 / $0.11
  • 66. Orc vs Parquet File Formats designed for SQL/Data Warehousing on Hadoop Columnar File Formats Compress Well High Row Count, Low Cardinality
  • 67. OrcFile Format Optimised Row Columnar Format Zlib or Snappy External Compression 250MB Stripes of Column Data and Index Run-Length or Dictionary Encoding 1 Output File per Container Task
  • 68. Parquet File Format Gzip or Snappy External Compression Array Data Structures Limited Data Type Support for Hive Batch Creation 1GB Files
  • 69. Orc vs Parquet Depends on the Tool you are using Consider Future Architecture & Requirements Test Test Test
  • 70. In Summary • Practice Cloud Architecture with Transient Clusters • Utilize S3 as the system of record for durability • Utilize Task Nodes on Spot for Increased performance and Lower Cost • Move to new Instance Families for Better Performance/$ • Exciting Developments around Columnar File Formats

Editor's Notes

  1. EMR is a managed Hadoop offering that takes the burden of deploying and maintaining Hadoop clusters away from developers. EMR uses the Apache Hadoop MapReduce engine and integrates with a variety of different tools.
  2. Transient clusters are clusters that are only around for the duration of the job. Once the job is done, the cluster shuts down. This model of running Hadoop clusters is very different from traditional Hadoop deployments. With a traditional Hadoop deployment, the Hadoop cluster stays up and running regardless of whether there are any jobs for it to process, mostly due to the fact that the cluster also hosts HDFS storage. With HDFS storage, you don’t have the luxury of shutting down the cluster. With transient EMR clusters, your data persists on S3, meaning that you don’t use HDFS to store your data. Once data is safe and secure on S3, you have the ability to shut down the cluster after your job is done, knowing that you won’t lose data.
  3. There are many reasons you want to use EMR transient clusters. 1) Cost: shutting down resources you don’t need is the best path towards optimizing your workload for cost efficiency. Don’t pay for what you’re not using; at AWS that’s all we talk about. 2) No maintenance: obviously, if you’re shutting down the cluster, then you’re only maintaining it for the duration of your job. This helps tremendously in reducing the headache of maintaining long-running clusters. 3) Practice cloud architecture best practices: it also makes you a better cloud practitioner. Again, you’re getting into the habit of paying only for what you’re using, which is great. You’ll also get into the habit of thinking of your data processing as a workflow where resources come and go as needed.
  4. I hear this a lot: “EMR is only good for short-lived/transient clusters; do I need to run my own Hadoop on EC2 if I need long-running clusters?” That’s not true at all. EMR can be designed for longer-running clusters. You would still want to keep your data on S3 for durability, or you can copy data from S3 to HDFS first, or you can use HDFS as your primary storage and use S3 as the data store backup.
  5. Notice that we don’t remove S3 from our design. It’s important to keep your data safe on S3 just in case you experience cluster failures. In fact, I want you to always plan for failure. Just because you’re running a long-running cluster doesn’t mean you won’t see failures, so architecting your data processing workflow to be able to deal with cluster failures is super important. One way to do that is to use a workflow management tool such as AWS Data Pipeline.
  6. EMR has three node types: one master, which runs the NameNode and JobTracker, plus core nodes and task nodes, which are two different types of slave node. Let’s review the two different slave node types.
  7. We’ll start with core nodes. Core nodes run the TaskTracker and DataNode. Core nodes are very similar to traditional Hadoop slave nodes: they can process data with mappers and reducers and can also store data in HDFS via the DataNode.
  8. However, once you add core nodes to your cluster, you can’t remove them later. That’s the only caveat with core nodes, and there’s a good reason for it: because core nodes hold HDFS data, removing nodes from the cluster can potentially cause data loss.
  9. Task nodes are a bit different from what you usually see in traditional Hadoop deployments. Task nodes run the TaskTracker only. They don’t run a DataNode, which means no HDFS data is stored on task nodes.
  10. Similar to core nodes, you can increase/expand your cluster’s TaskNode capacity by adding more nodes.
  11. example
  12. But unlike core nodes, TaskNodes can be removed from the cluster. And you can probably guess why. Because TaskNodes don’t hold any HDFS data. So you’re free to add/remove them at any given time.
  13. example
  14. Use task nodes to speed up your data processing using the Spot market. Task nodes are a great use case for Spot Instances. Remember that task nodes can be added/removed easily. That ability gives you the peace of mind to use task nodes on the Spot market, and if at some point your Spot Instance gets taken away because the price went up too much, your cluster can withstand losing nodes.
  15. In the next few slides, we’ll talk about data persistence models with EMR. The first pattern is S3 as HDFS. With this data persistence model, data gets stored on S3; HDFS does not play any role in storing data. As a matter of fact, HDFS is only there for temporary storage. Another common thing I hear is that storing data on S3 instead of HDFS slows jobs down a lot because data has to get copied to HDFS/disk first before processing starts. That’s incorrect. If you tell Hadoop that your data is on S3, Hadoop reads directly from S3 and streams data to mappers without touching the disk. To be completely correct, data does touch HDFS when it has to shuffle from mappers to reducers, but as I mentioned, HDFS acts as the temp space and nothing more.
  16. One of the biggest and most important benefits of using S3 instead of HDFS is the fact that you can shut down your cluster when your job is done, knowing that the data is safe on S3. That’s a huge plus! Can you imagine doing that with traditional Hadoop deployments? I haven’t come across anyone who could easily do that.
  17. The other important benefit of S3 over HDFS is avoiding the HDFS capacity game. You’re dealing with big data problems, which means you have a ton of data coming in. The last thing you want to do is play the guessing game of how much HDFS space you need. With S3 you don’t have to play that game: S3 scales with your data, both in terms of IO and storage space. With HDFS you have to provision 3x the space you want to account for durability. With S3, the space you pay for includes replication and everything else behind the scenes that makes your data durable.
  18. Imagine you’re hosting data on HDFS and another team in your company asks to get access to it. What would you do? You can either copy data from HDFS to some other storage; if you’re dealing with a large amount of data, say 400TB or 1PB, this is going to be a nightmare. Or you can give them access to your cluster to run jobs, but man it would suck if they ran a large job and took over the entire cluster. With data on S3, you can share data between multiple jobs in parallel without sacrificing storage or cluster resources. S3 can scale with as many jobs as you want it to.
  19. And everything else comes free with S3: features such as SSE, lifecycle policies, etc. And again, keep in mind that S3 as the storage layer is the main reason why we can build elastic clusters where nodes get added and removed dynamically without any data loss.
  20. With this pattern, you still store your data on S3 and use S3 as your primary storage, but for processing, data gets copied to HDFS first. Copying data to HDFS can be done with the S3DistCp tool provided by the EMR team. S3DistCp is very similar to the DistCp tool that comes with Hadoop for distributed copy jobs, i.e., copying data between clusters. However, S3DistCp was written with S3 in mind, meaning that it can perform much better than DistCp.
  21. Use-case for this slide
  22. Use-case for this slide
  23. The benefit of this pattern is better IO when we’re dealing with IO-intensive workloads. Or, as mentioned previously, if data needs to be processed multiple times, copying it to HDFS first is a more optimized approach. And because we’re not using HDFS as the primary storage, we can still take advantage of S3 features.
  24. Do not use smaller nodes for production workloads unless you’re 100% sure you know what you’re doing. The majority of jobs I’ve seen require more CPU and memory than the smaller instances have to offer, which most of the time causes job failures if the cluster is not fine-tuned. Instead of spending time fine-tuning small nodes, get a larger node and run your workload with peace of mind. Anything m1.xlarge and larger is a good candidate: m1.xlarge, c1.xlarge, m2.4xlarge, and all cluster compute instances are good choices.
  25. This is my fav question: given this much data, how many nodes do I need?
  26. Avoid small files at all costs. Small files can cause a lot of issues; I call them the termites of big data. The reason small files are trouble is that each file, as discussed previously, eventually becomes a mapper, and each mapper is a Java JVM. Small files cause a JVM to get spun up, but as soon as the mapper is up and running, the content of the file is processed in a short amount of time and the mapper goes away. That’s a waste of CPU and memory. Ideally you’d like your mapper to do as much processing as possible before shutting down.
  27. Decompression is less intensive
  28. Give guidance