November 13th, 2014 | Las Vegas, NV 
Ian Meyers, Amazon Web Services
Compute 
Storage 
AWS Global Infrastructure 
Database 
App Services 
Deployment & Administration 
Networking 
Analytics 
Amazon Elastic MapReduce 
Managed, elastic Hadoop (1.x & 2.x) cluster 
Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis and Amazon Redshift 
Install Storm, Spark, Presto, Hive, Pig, Impala, & end-user tools automatically 
Native support for Spot Instances 
Integrated HBase NoSQL database 
Amazon EMR
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop 
--keyword-config-file: merge values in new config to existing 
--keyword-key-value: override values provided 

Configuration File Name   Configuration File Keyword   File Name Shortcut   Key-Value Pair Shortcut
core-site.xml             core                         C                    c
hdfs-site.xml             hdfs                         H                    h
mapred-site.xml           mapred                       M                    m
yarn-site.xml             yarn                         Y                    y
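As a rough illustration of how the shortcuts in the table above could be used, the following Python sketch assembles argument lists for the configure-hadoop bootstrap action. The helper names, the example bucket, and the exact argument layout (a flag followed by a path or a key=value pair) are assumptions for illustration; the slides do not spell them out.

```python
# Shortcut letters from the table: upper case selects a config file to merge,
# lower case supplies a single key-value override.
SHORTCUTS = {
    "core-site.xml": "c",
    "hdfs-site.xml": "h",
    "mapred-site.xml": "m",
    "yarn-site.xml": "y",
}


def key_value_args(config_file, settings):
    """Build key-value override arguments for one Hadoop config file."""
    flag = "-" + SHORTCUTS[config_file]
    args = []
    for key, value in settings.items():
        args += [flag, f"{key}={value}"]
    return args


def config_file_args(config_file, path):
    """Build the argument pair that merges a whole config file (hypothetical path)."""
    flag = "-" + SHORTCUTS[config_file].upper()
    return [flag, path]


if __name__ == "__main__":
    print(key_value_args("mapred-site.xml",
                         {"mapred.tasktracker.map.tasks.maximum": 2}))
    print(config_file_args("core-site.xml",
                           "s3://example-bucket/core-overrides.xml"))
```

Keeping the shortcut letters in one mapping means the upper-case and lower-case forms cannot drift apart.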
Set number of mappers per task tracker 
Useful for small memory footprint map tasks 
More work done with a given instance
Set HDFS block size to 1MB 
Useful for smaller files when HDFS is used
Reuse mappers 
Mapper startup time ~ 2-20 seconds 
Useful for tasks with large number of mappers 
Mappers must be “clean” after run (relevant for Java)
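The three tuning ideas above correspond to ordinary Hadoop properties. A minimal sketch, assuming the Hadoop 1 property names mapred.tasktracker.map.tasks.maximum, dfs.block.size and mapred.job.reuse.jvm.num.tasks (the slides do not name them); the function name and the example values are illustrative only.

```python
def tuning_properties(mappers_per_tracker, block_size_mb, reuse_jvm):
    """Return Hadoop 1 style properties for the three tuning options."""
    return {
        # Number of map slots each task tracker offers.
        "mapred.tasktracker.map.tasks.maximum": str(mappers_per_tracker),
        # HDFS block size is expressed in bytes.
        "dfs.block.size": str(block_size_mb * 1024 * 1024),
        # -1 reuses a JVM for any number of tasks; 1 starts a fresh JVM per task.
        "mapred.job.reuse.jvm.num.tasks": "-1" if reuse_jvm else "1",
    }


if __name__ == "__main__":
    for name, value in tuning_properties(4, 1, True).items():
        print(f"{name}={value}")
```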
Configure process heap size, Java opts, and allow for replacing the hadoop-user-env.sh 
Hadoop 1 
Hadoop 2 
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons 
--args --{namenode}-heap-size=2048, 
--{namenode}-opts=-XX:GCTimeRatio=19
[Diagram: Amazon EMR using EMRFS and HDFS, with Amazon S3 and Amazon DynamoDB; labels: Processed Files Registry, File Data]
55
5
60 sec * 15 MB ≈ 1 GB
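Read as throughput arithmetic, the figure above can be checked in a few lines. This sketch assumes the 15 MB is a per-second read rate for a single mapper and the 60 sec is how long that mapper runs; neither reading is stated explicitly on the slide.

```python
def data_per_mapper_mb(seconds, mb_per_second):
    """Data one mapper reads in the given time, in MB."""
    return seconds * mb_per_second


mb = data_per_mapper_mb(60, 15)
gb = mb / 1024  # convert MB to GB

if __name__ == "__main__":
    print(f"{mb} MB is about {gb:.2f} GB")
```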
aws emr add-steps --cluster-id <cluster> 
--steps Name=GroupSmallFiles, 
Type=CUSTOM_JAR, 
Args=files,home/hadoop/lib/emr-s3distcp-1.0.jar, 
src,s3://myawsbucket/cf, 
dest,hdfs:///local, 
groupBy,.*(i-w.log).*, 
targetSize,128…
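The groupBy option combines files whose names yield the same value for the capture group in the regular expression. The Python sketch below imitates only that grouping logic, not S3DistCp itself; the pattern and the file names are hypothetical, since the expression on the slide did not survive extraction intact.

```python
import re
from collections import defaultdict


def group_by_capture(paths, pattern):
    """Group paths by the first capture group; non-matching paths are skipped."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for path in paths:
        match = regex.match(path)
        if match:
            groups[match.group(1)].append(path)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical small log files, one set per instance ID.
    files = [
        "cf/i-abc123.log.1",
        "cf/i-abc123.log.2",
        "cf/i-def456.log.1",
        "cf/readme.txt",
    ]
    for key, members in group_by_capture(files, r".*/(i-\w+)\.log.*").items():
        print(key, members)
```

Each resulting group stands for one combined output file; targetSize would then cap how large each combined file may grow.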
Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21MB/s           118MB/s
LZO         20%                 135MB/s          410MB/s
Snappy      22%                 172MB/s          409MB/s
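The "% Space Remaining" column is the compressed size as a percentage of the original size. A small sketch of that measurement for GZIP, using only the Python standard library; LZO and Snappy need third-party packages and are not shown. The sample data is made up, so the result will not match the table.

```python
import gzip


def space_remaining_percent(data):
    """Compressed size as a percentage of the original size."""
    compressed = gzip.compress(data)
    return 100.0 * len(compressed) / len(data)


if __name__ == "__main__":
    # Hypothetical, highly repetitive log data.
    sample = b"2014-11-13 GET /index.html 200\n" * 1000
    print(f"{space_remaining_percent(sample):.1f}% space remaining")
```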
-outputCodec,lzo
Amazon EMR Cluster 
Task Instance Group 
Core Instance Group 
HDFS 
HDFS 
Amazon S3
HUGE Benefit!!
EMR 
EMR 
Amazon S3
S3DistCP
EMR 
HDFS 
Pig
Hive 0.13.1 
• Support for ORC 
• Window functions 
• Decimal types 
• TRUNCATE command 
• Better optimiser (less need for hinting) 
Pig 0.12.0 
• Streaming UDFs (UDFs not written in Java) 
• Native support for Avro 
• Native support for Parquet 
• Improved data types 
Impala 1.1 
• In-memory SQL engine 
• Support for HBase tables 
• Support for Parquet, a column-oriented file format 
• Query and interactive shells 
HBase 0.94.18 
• Database snapshotting 
• Improved read caching and seek optimisation 
• Improved transactions
Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams 
No Intermediate Data Persistence Required 
Simple way to introduce real time sources into Batch Oriented Systems 
Multi-Application Support & Automatic Checkpointing 
Amazon EMR Integration with Amazon Kinesis
DROP TABLE call_data_records; 
CREATE TABLE call_data_records ( 
  start_time bigint, 
  end_time bigint, 
  phone_number STRING, 
  carrier STRING, 
  recorded_duration bigint, 
  calculated_duration bigint, 
  lat double, 
  long double 
) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY "," 
STORED BY 
'com.amazon.emr.kinesis.hive.KinesisStorageHandler' 
TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream"); 
EC2 Instance                MapTasks   ReduceTasks
m1.small                    2          1
m1.large                    3          1
m1.xlarge                   8          3
m2.xlarge                   3          1
m2.2xlarge                  6          2
m2.4xlarge                  14         4
m3.xlarge                   6          1
m3.2xlarge                  12         3
cg1.4xlarge                 12         3
cc2.8xlarge                 24         6
c3.4xlarge                  24         6
hi1.4xlarge                 24         6
hs1.8xlarge                 24         6
cr1.8xlarge & c3.8xlarge    48         12
[Chart with series: Memory (GB), Mappers*, Reducers*, CPU (ECU Units), Local Storage (GB)]
Instance     Cost / MapTask   Cost / ReduceTask
m1.large     $0.08            $0.15
m1.xlarge    $0.06            $0.15
m3.xlarge    $0.04            $0.07
m3.2xlarge   $0.04            $0.07

Instance     Cost / MapTask   Cost / ReduceTask
c1.medium    $0.13            $0.13
c1.xlarge    $0.35            $0.70
c3.xlarge    $0.05            $0.11
c3.2xlarge   $0.05            $0.11
Estimated number of nodes = (Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)
1. Estimate the number of tasks your job requires: 150 
2. Pick an instance type and note the number of tasks it can run in parallel: m1.xlarge with 8 task capacity per instance 
3. Pick some sample data files to run a test workload. The number of sample files should equal the task capacity from step 2: 8 files selected for our sample test 
4. Run an Amazon EMR cluster with a single core node and process the sample files from step 3. Note the time taken to process this dataset: 3 min to process 8 files
Estimated number of nodes = (Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time) 
= (150 * 3 min) / (8 * 5 min) 
= 11.25, or about 11 m1.xlarge instances
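The estimate above is a single division and can be scripted for other instance types. A minimal sketch in Python; the function and parameter names are illustrative, and the inputs are the values from the worked example (150 tasks, 3 min for the sample, 8 tasks per m1.xlarge, 5 min desired processing time).

```python
def estimate_nodes(total_tasks, sample_minutes, tasks_per_instance, desired_minutes):
    """Estimated node count from the sizing formula in the slides."""
    return (total_tasks * sample_minutes) / (tasks_per_instance * desired_minutes)


nodes = estimate_nodes(150, 3, 8, 5)

if __name__ == "__main__":
    print(nodes)  # 11.25
```

The slides round this to 11 m1.xlarge instances; rounding up instead would favour meeting the desired processing time over cost.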
Master instance group 
Amazon EMR cluster 
HDFS 
HDFS 
Run TaskTrackers (Compute) 
Run DataNode (HDFS) 
Core instance group
Can add core nodes 
More HDFS space 
More CPU/memory 
Master instance group 
Amazon EMR cluster 
HDFS 
HDFS 
HDFS 
Core instance group
Can’t remove core nodes because of HDFS 
Master instance group 
HDFS 
HDFS 
HDFS 
Amazon EMR cluster 
Core instance group
Run TaskTrackers 
No HDFS 
Reads from core node HDFS 
Master instance group 
HDFS 
HDFS 
Amazon EMR cluster 
Task instance group 
Core instance group
Can add task nodes 
Master instance group 
HDFS 
HDFS 
Amazon EMR cluster 
Task instance group 
Core instance group
More CPU power 
More memory 
Master instance group 
HDFS 
HDFS 
Amazon EMR cluster 
Task instance group 
Core instance group
You can remove task nodes when processing is completed 
Task instance group 
Master instance group 
Core instance group 
HDFS 
HDFS 
Amazon EMR cluster
Amazon CloudWatch
http://bit.ly/awsevals

Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

(SDD401) Amazon Elastic MapReduce Deep Dive and Best Practices | AWS re:Invent 2014

  • 1. November 13th, 2014 | Las Vegas, NV Ian Meyers, Amazon Web Services
  • 2. Amazon Elastic MapReduce (Amazon EMR)
    Diagram labels: Compute, Storage, AWS Global Infrastructure, Database, App Services, Deployment & Administration, Networking, Analytics
    • Managed, elastic Hadoop (1.x & 2.x) cluster
    • Integrates with Amazon S3, Amazon DynamoDB, Amazon Kinesis and Amazon Redshift
    • Install Storm, Spark, Presto, Hive, Pig, Impala, & end-user tools automatically
    • Native support for Spot Instances
    • Integrated HBase NoSQL database
  • 6. --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
    --keyword-config-file: merge values in new config to existing
    --keyword-key-value: override values provided
    Configuration File Name | Configuration File Keyword | File Name Shortcut | Key-Value Pair Shortcut
    core-site.xml | core | C | c
    hdfs-site.xml | hdfs | H | h
    mapred-site.xml | mapred | M | m
    yarn-site.xml | yarn | Y | y
  • 7. Set number of mappers per task tracker Useful for small memory footprint map tasks More work done with a given instance
  • 8. Set HDFS block size to 1MB Useful for smaller files when HDFS is used
  • 9. Reuse mappers Mapper startup time ~ 2-20 seconds Useful for tasks with large number of mappers Mappers must be “clean” after run (relevant for Java)
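The shortcut table on slide 6 can be sketched as a small lookup that builds an argument list for the configure-hadoop bootstrap action. This is only an illustration: the helper names (`SHORTCUTS`, `key_value_args`, `config_file_args`) are hypothetical, the single-hyphen flag form is an assumption (the slide lists only the letters), and the property `mapred.tasktracker.map.tasks.maximum` is an assumed Hadoop 1 setting for the "mappers per task tracker" case on slide 7, since the slides do not name the property.

```python
# Shortcut letters transcribed from the table on slide 6:
# keyword -> (file name shortcut, key-value pair shortcut)
SHORTCUTS = {
    "core": ("C", "c"),
    "hdfs": ("H", "h"),
    "mapred": ("M", "m"),
    "yarn": ("Y", "y"),
}


def key_value_args(keyword, settings):
    """Build arguments that override individual key-value pairs.

    Assumption: the shortcut is passed as a single-hyphen flag
    followed by key=value.
    """
    _, kv_flag = SHORTCUTS[keyword]
    args = []
    for key, value in settings.items():
        args += [f"-{kv_flag}", f"{key}={value}"]
    return args


def config_file_args(keyword, path):
    """Build arguments that merge a whole config file into the existing one."""
    file_flag, _ = SHORTCUTS[keyword]
    return [f"-{file_flag}", path]


# Hypothetical example: cap the number of map tasks per task tracker.
example = key_value_args("mapred", {"mapred.tasktracker.map.tasks.maximum": 2})
print(",".join(example))
```

The comma-joined string is the form usually handed to the bootstrap action's arguments; check the exact flag syntax against the EMR documentation for the AMI version in use.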
  • 10. Configure process heap size, Java opts, and allow for replacing the hadoop-user-env.sh (Hadoop 1, Hadoop 2)
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons
    --args --{namenode}-heap-size=2048,--{namenode}-opts=-XX:GCTimeRatio=19
  • 13. EMRfs HDFS Amazon EMR Amazon S3 Amazon DynamoDB Processed Files Registry File Data
  • 20. 60 sec * 15 MB ≈ 1 GB
  • 22. aws emr add-steps --cluster-id <cluster> --steps Name=GroupSmallFiles, Type=CUSTOM_JAR, Args=files,home/hadoop/lib/emr-s3distcp-1.0.jar, src,s3://myawsbucket/cf, dest,hdfs:///local, groupBy,.*(i-w.log).*, targetSize,128…
  • 24. Algorithm | % Space Remaining | Encoding Speed | Decoding Speed
    GZIP | 13% | 21 MB/s | 118 MB/s
    LZO | 20% | 135 MB/s | 410 MB/s
    Snappy | 22% | 172 MB/s | 409 MB/s
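To make the trade-off in the compression table on slide 24 concrete, the sketch below estimates compressed size and encode/decode time for a given input. It assumes the quoted speeds are measured against the uncompressed data and that "% Space Remaining" is the compressed size as a fraction of the original; the slide states neither, and the function name `compression_profile` is hypothetical.

```python
# Figures transcribed from slide 24:
# codec -> (fraction of space remaining, encode MB/s, decode MB/s)
CODECS = {
    "GZIP": (0.13, 21, 118),
    "LZO": (0.20, 135, 410),
    "Snappy": (0.22, 172, 409),
}


def compression_profile(codec, input_mb):
    """Estimate compressed size and codec time for input_mb of raw data.

    Assumption: speeds apply to the uncompressed byte count.
    """
    remaining, encode_speed, decode_speed = CODECS[codec]
    return {
        "compressed_mb": input_mb * remaining,
        "encode_seconds": input_mb / encode_speed,
        "decode_seconds": input_mb / decode_speed,
    }


for name in CODECS:
    profile = compression_profile(name, 1000)
    print(name, {k: round(v, 1) for k, v in profile.items()})
```

Under these assumptions GZIP produces the smallest output but takes several times longer to encode than LZO or Snappy.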
  • 27. Amazon EMR Cluster Task Instance Group Core Instance Group HDFS HDFS Amazon S3
  • 33. Amazon EMR Cluster Task Instance Group Core Instance Group HDFS HDFS Amazon S3
  • 42. Hive 0.13.1
    • Support for ORC
    • Window functions
    • Decimal types
    • TRUNCATE command
    • Better optimiser (less need for hinting)
    Pig 0.12.0
    • Streaming UDFs (not written in Java)
    • Native support for Avro
    • Native support for Parquet
    • Improved data types
    Impala 1.1
    • In-memory SQL engine
    • Support for HBase tables
    • Support for Parquet, a column-oriented file format
    • Query and interactive shells
    HBase 0.94.18
    • Database snapshotting
    • Improved read caching and seek optimisation
    • Improved transactions
  • 43. Amazon EMR Integration with Amazon Kinesis
    • Read data directly into Hive, Pig, Streaming and Cascading from Kinesis streams
    • No intermediate data persistence required
    • Simple way to introduce real-time sources into batch-oriented systems
    • Multi-application support & automatic checkpointing
  • 44. Amazon EMR Integration with Amazon Kinesis
    drop table call_data_records;
    CREATE TABLE call_data_records (
      start_time bigint,
      end_time bigint,
      phone_number STRING,
      carrier STRING,
      recorded_duration bigint,
      calculated_duration bigint,
      lat double,
      long double
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
    STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
    TBLPROPERTIES("kinesis.stream.name"="TestAggregatorStream");
  • 51. EC2 Instance | Map Tasks | Reduce Tasks
    m1.small | 2 | 1
    m1.large | 3 | 1
    m1.xlarge | 8 | 3
    m2.xlarge | 3 | 1
    m2.2xlarge | 6 | 2
    m2.4xlarge | 14 | 4
    m3.xlarge | 6 | 1
    m3.2xlarge | 12 | 3
    cg1.4xlarge | 12 | 3
    cc2.8xlarge | 24 | 6
    c3.4xlarge | 24 | 6
    hi1.4xlarge | 24 | 6
    hs1.8xlarge | 24 | 6
    cr1.8xlarge & c3.8xlarge | 48 | 12
    Chart legend: Memory (GB), Mappers*, Reducers*, CPU (ECU Units), Local Storage (GB)
  • 54. Instance | Cost / Map Task | Cost / Reduce Task
    m1.large | $0.08 | $0.15
    m1.xlarge | $0.06 | $0.15
    m3.xlarge | $0.04 | $0.07
    m3.2xlarge | $0.04 | $0.07
  • 55. Instance | Cost / Map Task | Cost / Reduce Task
    c1.medium | $0.13 | $0.13
    c1.xlarge | $0.35 | $0.70
    c3.xlarge | $0.05 | $0.11
    c3.2xlarge | $0.05 | $0.11
  • 58. Estimated number of nodes = (Total tasks * Time to process sample files) / (Instance task capacity * Desired processing time)
  • 59. 1. Estimate the number of tasks your job requires: 150
    2. Pick an instance and note down the number of tasks it can run in parallel: m1.xlarge with 8 task capacity per instance
  • 60. 3. Pick some sample data files to run a test workload. The number of sample files should be the same as the number from step #2: 8 files selected for our sample test
  • 61. 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process this dataset: 3 min to process 8 files
  • 62. Estimated number of nodes = (Total tasks for your job * Time to process sample files) / (Per instance task capacity * Desired processing time) = (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge
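The sizing steps on slides 58 to 62 can be sketched as a short calculation. The function name `estimate_nodes` and its parameter names are hypothetical; the inputs are the figures quoted on the slides. The raw ratio is 11.25, which the slide quotes as 11; rounding up with `math.ceil` is shown only as a more conservative option and is an assumption, not the presenter's stated method.

```python
import math


def estimate_nodes(total_tasks, sample_minutes, tasks_per_instance, desired_minutes):
    """Return the unrounded node estimate from the slide 58 formula."""
    return (total_tasks * sample_minutes) / (tasks_per_instance * desired_minutes)


# Figures from slides 59-62: 150 tasks, 3 min to process the 8-file sample
# on one m1.xlarge (8 tasks in parallel), 5 min desired processing time.
raw = estimate_nodes(total_tasks=150, sample_minutes=3,
                     tasks_per_instance=8, desired_minutes=5)
print(raw)             # 11.25
print(math.ceil(raw))  # 12, if rounding up to guarantee the target time
```

Because the estimate scales linearly, halving the desired processing time doubles the node count.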
  • 65. Core instance group: runs TaskTrackers (compute) and DataNode (HDFS). Diagram: Amazon EMR cluster with a master instance group and a core instance group holding HDFS.
  • 66. Can add core nodes: more HDFS space, more CPU/memory
  • 67. Can’t remove core nodes because of HDFS
  • 68. Task instance group: runs TaskTrackers; no HDFS; reads from core node HDFS. Diagram: Amazon EMR cluster with master, core (HDFS) and task instance groups.
  • 69. Can add task nodes
  • 70. More CPU power, more memory
  • 71. You can remove task nodes when processing is completed
  • 72. You can remove task nodes when processing is completed