Masterclass
Elastic MapReduce
Ryan Shuttleworth – Technical Evangelist
@ryanAWS
A technical deep dive beyond the basics
Help educate you on how to get the best from AWS technologies
Show you how things work and how to get things done
Broaden your knowledge in ~45 mins
Masterclass
A key tool in the toolbox to help with ‘Big Data’ challenges
Makes possible analytics processes previously not feasible
Cost effective when leveraged with EC2 spot market
Broad ecosystem of tools to handle specific use cases
Amazon Elastic MapReduce
What is EMR?
Map-Reduce engine
Integrated with tools
Hadoop-as-a-service
Massively parallel
Cost effective AWS wrapper
Integrated to AWS services
A framework
Splits data into pieces
Lets processing occur
Gathers the results
Very large click log (e.g. TBs)
Lots of actions by John Smith
Split the log into many small pieces
Process in an EMR cluster
Aggregate the results from all the nodes
What John Smith did
Insight in a fraction of the time
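The split, process and aggregate flow above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not how Hadoop or EMR is implemented; the log format (user, tab, action), the sample records and all function names are assumptions made for the example.

```python
from collections import defaultdict

def split_log(lines, pieces):
    """Split the log into `pieces` chunks, one per (imaginary) worker node."""
    return [lines[i::pieces] for i in range(pieces)]

def map_chunk(chunk, user):
    """Map phase: emit (action, 1) for every log line that belongs to `user`."""
    pairs = []
    for line in chunk:
        who, action = line.split("\t")
        if who == user:
            pairs.append((action, 1))
    return pairs

def reduce_pairs(all_pairs):
    """Reduce phase: add up the counts emitted for each action."""
    totals = defaultdict(int)
    for action, count in all_pairs:
        totals[action] += count
    return dict(totals)

def what_user_did(lines, user, pieces=3):
    """Split the log, map every chunk, then aggregate the results."""
    chunks = split_log(lines, pieces)
    mapped = [pair for chunk in chunks for pair in map_chunk(chunk, user)]
    return reduce_pairs(mapped)

# Hypothetical sample data: "<user>\t<action>" per line
click_log = [
    "john.smith\tview_product",
    "jane.doe\tview_product",
    "john.smith\tadd_to_cart",
    "john.smith\tview_product",
    "jane.doe\tcheckout",
]
print(what_user_did(click_log, "john.smith"))
```

On a real cluster each chunk would be mapped on a different node in parallel, which is where the time saving comes from.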
HDFS
Reliable storage
MapReduce
Data analysis
Big Data != Hadoop
Velocity, Variety, Volume
When you need to innovate to collect, store, analyze, and manage your data
Compute Storage Big Data
How do you get your slice of it?
AWS Direct Connect: dedicated low latency bandwidth
Queuing: highly scalable event buffering
AWS Storage Gateway: sync local storage to the cloud
AWS Import/Export: physical media shipping
Compute Storage Big Data
Where do you put your slice of it?
Relational Database Service: fully managed database (MySQL, Oracle, MSSQL)
DynamoDB: NoSQL, schemaless, provisioned throughput database
S3: object datastore up to 5TB per object, 99.999999999% durability
SimpleDB: NoSQL, schemaless, smaller datasets
Elastic MapReduce
Getting started, tools & operating modes
EMR cluster
Start an EMR cluster using console or CLI tools
Master instance group: created to control the cluster (runs MySQL)
Core instance group: created for the life of the cluster
Core instances run DataNode and TaskTracker daemons (HDFS)
Task instance group: optional task instances can be added or subtracted to perform work
Amazon S3: can be used as the underlying ‘file system’ for input/output data
Master node coordinates distribution of work and manages cluster state
Core and Task instances read-write to S3
[Diagram: four EC2 instances, each taking an input file through map and reduce to an output file]
[Diagram: tools around EMR]
EMR, HDFS
Analytics languages: Pig
Data management: S3, DynamoDB
RDS, Redshift, AWS Data Pipeline
EMR job flow
Create Job Flow: cluster created with Master, Core & Task instance groups
Run Job Step 1: processing step, execute a script such as Java, Hive script, Python etc.
Run Job Step 2: chained step, execute subsequent step using data from previous step
Run Job Step n: chained step, add steps dynamically or predefine them as a pipeline
End Job Flow: cluster terminated, cluster instances terminated on end of job flow
Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node.
A step can be a compiled Java .jar or a script executed by Hadoop Streaming.
Data is typically passed between steps using the HDFS file system on Core Instance Group nodes, or S3.
Output data is generally placed in S3 for subsequent use.
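The idea of chained steps, where each step consumes the output of the previous one, can be sketched as a simple Python pipeline. This is an assumption-laden toy: plain in-memory values stand in for the HDFS or S3 locations a real job flow would use, and the step functions and names are hypothetical.

```python
def run_job_flow(steps, input_data):
    """Run the steps in order; each step receives the previous step's output."""
    data = input_data
    for name, step in steps:
        data = step(data)
        print("finished step:", name)
    return data

def split_words(lines):
    """Step 1: break every line into lower-case words."""
    return [word.lower() for line in lines for word in line.split()]

def count_words(words):
    """Step 2: count how often each word occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def top_words(counts):
    """Step n: keep the two most frequent words (ties broken alphabetically)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]

# Steps can be predefined as a pipeline, or appended to the list later
steps = [("split", split_words), ("count", count_words), ("top", top_words)]
result = run_job_flow(steps, ["the cat", "the hat", "a cat"])
print(result)
```

Because the list of steps is ordinary data, appending another tuple before calling `run_job_flow` mirrors adding a step to a job flow that has been kept alive.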
Hadoop Streaming
Utility that comes with the Hadoop distribution
Allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer
The mapper reads its input from standard input and the reducer writes its output to standard output
By default, each line of input/output represents a record with tab separated key/value
Input source
Output location
Mapper script location (S3)
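Streaming: word count python script (Stdin > Stdout)
#!/usr/bin/python
import sys
import re

def main(argv):
    # A word is a letter followed by any number of letters or digits
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    # Read stdin line by line and emit one tab-delimited record per word
    for line in sys.stdin:
        for word in pattern.findall(line):
            # The LongValueSum: prefix asks the aggregate reducer to sum the values
            print("LongValueSum:" + word.lower() + "\t" + "1")

if __name__ == "__main__":
    main(sys.argv)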
Read words from StdIn line by line
Output to StdOut tab delimited record
e.g. “LongValueSum:abacus 1”
Reducer script or function
Aggregate: sort inputs and add up totals
e.g. “Abacus 1” and “Abacus 1” become “Abacus 2”
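The summing behaviour described above can be sketched as a hand-written streaming reducer. This is not the implementation of the built-in aggregate reducer; it assumes plain "key, tab, count" records that arrive already sorted by key (as Hadoop delivers them to a reducer), and the function names are illustrative.

```python
from itertools import groupby

def parse(line):
    """Turn 'key<TAB>count' into a (key, int) pair."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, int(value)

def reduce_counts(lines):
    """Sum the counts for each key. Input must already be sorted by key."""
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)

# In a streaming job the lines would come from sys.stdin; a list stands in here
sample = ["abacus\t1\n", "abacus\t1\n", "accord\t1\n"]
for key, total in reduce_counts(sample):
    print(key + "\t" + str(total))
```

Relying on sorted input means the reducer only ever holds one key's running total in memory, which matters when the key space is large.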
Instance groups
Master instance group
Manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups.
It also tracks the status of each task performed, and monitors the health of the instance groups.
To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop
Core instance group
Contains all of the core nodes of a job flow.
A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS).
The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run.
Core nodes run both the DataNode and TaskTracker Hadoop daemons.
Task instance group
Contains all of the task nodes in a job flow.
The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress.
You can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses.
SSH onto master node
with this keypair
Keep job flow running
so you can add more
to it
…
aakar 3
abdal 3
abdelaziz 18
abolish 3
aboriginal 12
abraham 6
abseron 9
abstentions 15
accession 73
accord 90
achabar 3
achievements 3
acic 4
acknowledged 6
acquis 9
acquisitions 3
…
./elastic-mapreduce --create --stream \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --input s3://elasticmapreduce/samples/wordcount/input \
  --output s3n://myawsbucket \
  --reducer aggregate
Launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance
When your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes
You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters
elastic-mapreduce --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --log-uri s3n://mybucket/EMR/log
Instance type/count
Don’t terminate cluster –
keep it running so steps can
be added
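When cluster launches are scripted, it can help to assemble the command from parameters instead of editing a long command line by hand. The sketch below only builds the argument list from the flags shown on these slides; the helper name and its defaults are assumptions, and it does not invoke the CLI.

```python
import shlex

def build_create_command(name, num_instances, instance_type,
                         key_pair, region, log_uri, keep_alive=True):
    """Assemble an argv list for `elastic-mapreduce --create` (hypothetical helper)."""
    argv = [
        "elastic-mapreduce", "--create",
        "--key-pair", key_pair,
        "--region", region,
        "--name", name,
        "--num-instances", str(num_instances),
        "--instance-type", instance_type,
        "--log-uri", log_uri,
    ]
    if keep_alive:
        # Keep the cluster running so that more steps can be added later
        argv.append("--alive")
    return argv

argv = build_create_command(
    name="MyJobFlow",
    num_instances=5,
    instance_type="m2.4xlarge",
    key_pair="micro",
    region="eu-west-1",
    log_uri="s3n://mybucket/EMR/log",
)
# Print a shell-safe rendering of the command for inspection
print(" ".join(shlex.quote(arg) for arg in argv))
```

Assuming the CLI is installed and on the path, the same list could be handed to `subprocess.run(argv)` to launch the job flow.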
elastic-mapreduce --create \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --pig-interactive --pig-versions latest \
  --hive-interactive --hive-versions latest \
  --hbase \
  --log-uri s3n://mybucket/EMR/log
Adding Hive, Pig and HBase to the job flow
elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-s,dfs.block.size=1048576" \
  --key-pair micro \
  --region eu-west-1 \
  --name MyJobFlow \
  --num-instances 5 \
  --instance-type m2.4xlarge \
  --alive \
  --pig-interactive --pig-versions latest \
  --hive-interactive --hive-versions latest
Using a bootstrapping action to configure the cluster
You can specify up to 16 bootstrap actions per job flow by
providing multiple bootstrap-action parameters
Configure JVM properties
s3://elasticmapreduce/bootstrap-actions/configure-daemons
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
Configure cluster wide hadoop settings
s3://elasticmapreduce/bootstrap-actions/configure-hadoop
./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2"
Configure cluster wide memory settings
s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
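The comma-separated `--args` strings above are easy to get wrong by hand, so a small helper can assemble them. This is a sketch under assumptions: the helper names are hypothetical, the argument layout is copied from the configure-hadoop example above, and the limit of 16 is the figure quoted on the slide.

```python
MAX_BOOTSTRAP_ACTIONS = 16  # limit per job flow, as stated on the slide

def configure_hadoop_args(site_settings, site_config_file=None):
    """Build the comma-separated --args value for the configure-hadoop action."""
    parts = []
    if site_config_file:
        parts += ["--site-config-file", site_config_file]
    for key, value in site_settings.items():
        # Each site setting is passed as "-s,key=value"
        parts += ["-s", "{}={}".format(key, value)]
    return ",".join(parts)

def bootstrap_flags(actions):
    """Turn [(script_uri, args_or_None), ...] into command-line flags."""
    if len(actions) > MAX_BOOTSTRAP_ACTIONS:
        raise ValueError("at most 16 bootstrap actions per job flow")
    flags = []
    for script_uri, args in actions:
        flags += ["--bootstrap-action", script_uri]
        if args:
            flags += ["--args", args]
    return flags

hadoop_args = configure_hadoop_args(
    {"mapred.tasktracker.map.tasks.maximum": 2},
    site_config_file="s3://myawsbucket/config.xml",
)
flags = bootstrap_flags([
    ("s3://elasticmapreduce/bootstrap-actions/configure-hadoop", hadoop_args),
])
print(flags)
```

Checking the limit before launch gives a clear local error instead of a rejected request.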
Conditionally run a command
s3://elasticmapreduce/bootstrap-actions/run-if
./elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if \
  --args "instance.isMaster=true,echo running on master node"
Shutdown actions: run scripts on shutdown
A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. Each script must run and complete within 60 seconds.
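A bootstrap action written in Python could register a shutdown action roughly as follows. This is a sketch, not a documented recipe: the helper, the script name `flush-logs.sh` and its contents are hypothetical, and marking the file executable is an assumption. The demo writes to a temporary directory so that it can run anywhere; on a cluster node the default directory from the slide would be used.

```python
import os
import stat
import tempfile

SHUTDOWN_DIR = "/mnt/var/lib/instance-controller/public/shutdown-actions/"

def write_shutdown_action(name, script_body, directory=SHUTDOWN_DIR):
    """Write a script into the shutdown-actions directory and mark it executable."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write(script_body)
    # Assumption: the script needs the executable bit to be run at shutdown
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path

# Demo against a temporary directory instead of the real cluster path
demo_dir = tempfile.mkdtemp()
path = write_shutdown_action(
    "flush-logs.sh",  # hypothetical script name
    "#!/bin/bash\necho 'node shutting down'\n",
    directory=demo_dir,
)
print(path)
```

Whatever the script does, it has to fit inside the 60 second window, so long uploads are better started earlier in the job flow.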
Custom actions: run a custom script
--bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
Integrated tools
Selecting the right one
Hadoop streaming
Unstructured data with wide variation of
formats, high flexibility on language
Broad language support (Python, Perl,
Bash…)
Mapper, reducer, input and output all
supplied by the jobflow
Scale from 1 to unlimited number of
processes
Doing analytics in Eclipse is wrong
2-stage dataflow is rigid
Joins are lengthy and error-prone
Prototyping/exploration requires compile & deploy
Hive
Structured data which is lower value
or where higher latency is tolerable
Pig
Semi-structured data which can be
mined using set operations
HBase
Near real time K/V store for structured
data
Hive integration
Schema on read
SQL Interface for working with data
Simple way to use Hadoop
Create Table statement references data location on
S3
Language called HiveQL, similar to SQL
An example of a query could be:
SELECT COUNT(1) FROM sometable;
Requires setting up a mapping to the input data
Uses SerDes to make different input formats queryable
Powerful data types (Array & Map..)
Feature                 SQL                      HiveQL
Updates                 UPDATE, INSERT, DELETE   INSERT OVERWRITE TABLE
Transactions            Supported                Not supported
Indexes                 Supported                Not supported
Latency                 Sub-second               Minutes
Functions               Hundreds                 Dozens
Multi-table inserts     Not supported            Supported
Create table as select  Not valid SQL-92         Supported
JDBC Listener started by default on EMR
Port 10003 for Hive 0.8.1, Ports 9100-9101 for Hadoop
Download hive-jdbc-0.8.1.jar
Access by:
Open port on security group ElasticMapReduce-master
SSH Tunnel
./elastic-mapreduce --create \
  --name "Hive job flow" \
  --hive-script \
  --args s3://myawsbucket/myquery.q \
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output
HiveQL to execute
./elastic-mapreduce --create --alive \
  --name "Hive job flow" \
  --num-instances 5 --instance-type m1.large \
  --hive-interactive
Interactive hive session
CREATE EXTERNAL TABLE impressions (
  requestBeginTime string,
  adId string,
  impressionId string,
  referrer string,
  userAgent string,
  userCookie string,
  ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
  serde 'com.amazon.elasticmapreduce.JsonSerde'
  with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://mybucketsource/tables/impressions' ;
Table structure
to create
(happens fast as
just mapping to
source)
Source data in S3
hive> select * from impressions limit 5;
Selecting from source
data directly via Hadoop
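"Schema on read" means the table definition is only applied when a record is read, which is why creating the table is fast. The Python sketch below illustrates the idea with the column list from the table definition; it is not how the JSON SerDe is implemented, and the sample record and its values are invented for the example.

```python
import json

# Column names taken from the 'paths' property of the table definition
COLUMNS = ["requestBeginTime", "adId", "impressionId", "referrer",
           "userAgent", "userCookie", "ip"]

def read_row(json_line, columns=COLUMNS):
    """Apply the schema to one raw JSON line at read time.

    Fields missing from the record become None; fields that are not in
    the schema are ignored. The stored data itself is never modified.
    """
    record = json.loads(json_line)
    return tuple(record.get(name) for name in columns)

# Hypothetical log line, as it might sit in the S3 source location
sample = ('{"requestBeginTime": "1239610346000", "adId": "ad-1", '
          '"ip": "10.0.0.1", "extra": "ignored"}')
row = read_row(sample)
print(row)
```

Changing the schema is then just a matter of changing the column list, since nothing was converted or loaded up front.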
Columns: Streaming | Hive | Pig | DynamoDB | Redshift
Unstructured Data: ✓ ✓
Structured Data: ✓ ✓ ✓ ✓
Language Support: Any* | HQL | Pig Latin | Client | SQL
SQL: ✓ ✓
Volume: Unlimited | Unlimited | Unlimited | Relatively Low | 1.6 PB
Latency: Medium | Medium | Medium | Ultra Low | Low
Cost considerations
Making the most of EC2 resources
1 instance for 100 hours
=
100 instances for 1 hour
Small instance = $8
1 instance for 1,000 hours
=
1,000 instances for 1 hour
Small instance = $80
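The equivalence above is plain instance-hour arithmetic, sketched below. The hourly rate of $0.08 is not stated on the slide; it is inferred from "$8 for 100 instance-hours" and should be treated as an assumption, as should the function name.

```python
from decimal import Decimal

# Assumed rate: $8 for 100 instance-hours implies $0.08 per instance-hour
SMALL_INSTANCE_RATE = Decimal("0.08")

def cost(instances, hours, hourly_rate=SMALL_INSTANCE_RATE):
    """Total cost is instances x hours x hourly rate."""
    return instances * hours * hourly_rate

# One instance for a long time costs the same as many instances briefly
print(cost(1, 100), cost(100, 1))
print(cost(1, 1000), cost(1000, 1))
```

Since the bill is the same either way, the wide and short shape is usually preferable because the answer arrives sooner.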
Achieving economies of scale
[Chart: utilization (up to 100%) over time, covered by reserved capacity, then on-demand, then spot]
EMR with spot instances
Scenario #1: Job Flow, Duration: 14 Hours
Scenario #2: Job Flow, Duration: 7 Hours
#1: Cost without Spot
4 instances * 14 hrs * $0.50 = $28
#2: Cost with Spot
4 instances * 7 hrs * $0.50 = $14 +
5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time Savings: 50%
Cost Savings: ~19%
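The two scenarios can be re-derived with a short calculation. The instance counts, durations and hourly prices are the ones given above; the function name is illustrative. Measured against the $28 baseline, the cost saving works out to 18.75%.

```python
from decimal import Decimal

def job_cost(groups):
    """Sum instances x hours x hourly rate over every instance group."""
    return sum(Decimal(n) * Decimal(hours) * rate for n, hours, rate in groups)

ON_DEMAND = Decimal("0.50")  # hourly on-demand price used on the slide
SPOT = Decimal("0.25")       # hourly spot price used on the slide

# Scenario 1: 4 on-demand instances for 14 hours
without_spot = job_cost([(4, 14, ON_DEMAND)])
# Scenario 2: the same 4 instances plus 5 spot instances, finishing in 7 hours
with_spot = job_cost([(4, 7, ON_DEMAND), (5, 7, SPOT)])

time_saving = Decimal(14 - 7) / Decimal(14)
cost_saving = (without_spot - with_spot) / without_spot

print("without spot:", without_spot)
print("with spot:", with_spot)
print("time saving (%):", float(time_saving) * 100)
print("cost saving (%):", float(cost_saving) * 100)
```

The extra spot capacity pays for itself here only because it halves the hours billed for the on-demand instances.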
Some other things…
Hints and tips
Managing big data
Dynamic Creation of Clusters for Processing Jobs
Bulk of Data Stored in S3
Use Reduced Redundancy Storage for Lower Value / Re-creatable Data, and Lower Cost by 30%
DynamoDB for High Performance Data Access
Migrate data to Glacier for Archive and Periodic Access
VPC & IAM for Network Security
Shark can return results up to 30 times faster
than the same queries run on Hive.
elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
  --bootstrap-action s3://elasticmapreduce/samples/spark/install-spark-shark.sh \
  --bootstrap-name "install Mesos/Spark/Shark" \
  --hive-interactive --hive-versions 0.7.1.3 \
  --ami-version 2.0 \
  --instance-type m1.xlarge --instance-count 3 \
  --key-pair ec2-keypair
Summary
What is EMR?
Map-Reduce engine
Integrated with tools
Hadoop-as-a-service
Massively parallel
Cost effective AWS wrapper
Integrated to AWS services
Try the tutorial:
aws.amazon.com/articles/2855
Find out more:
aws.amazon.com/big-data

More Related Content

What's hot

Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...
Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...
Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...Amazon Web Services
 
Serverless computing with AWS Lambda
Serverless computing with AWS Lambda Serverless computing with AWS Lambda
Serverless computing with AWS Lambda Apigee | Google Cloud
 
Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...Amazon Web Services
 
Introduction to Block and File storage on AWS
Introduction to Block and File storage on AWSIntroduction to Block and File storage on AWS
Introduction to Block and File storage on AWSAmazon Web Services
 
(CMP201) All You Need To Know About Auto Scaling
(CMP201) All You Need To Know About Auto Scaling(CMP201) All You Need To Know About Auto Scaling
(CMP201) All You Need To Know About Auto ScalingAmazon Web Services
 
Intro to AWS: EC2 & Compute Services
Intro to AWS: EC2 & Compute ServicesIntro to AWS: EC2 & Compute Services
Intro to AWS: EC2 & Compute ServicesAmazon Web Services
 
An Introduction to AWS
An Introduction to AWSAn Introduction to AWS
An Introduction to AWSIan Massingham
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
 
Introduction to Amazon Web Services by i2k2 Networks
Introduction to Amazon Web Services by i2k2 NetworksIntroduction to Amazon Web Services by i2k2 Networks
Introduction to Amazon Web Services by i2k2 Networksi2k2 Networks (P) Ltd.
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!Chris Taylor
 
Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3
Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3
Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3Amazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 

What's hot (20)

Deep Dive on AWS Lambda
Deep Dive on AWS LambdaDeep Dive on AWS Lambda
Deep Dive on AWS Lambda
 
Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...
Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...
Access Control for the Cloud: AWS Identity and Access Management (IAM) (SEC20...
 
Serverless computing with AWS Lambda
Serverless computing with AWS Lambda Serverless computing with AWS Lambda
Serverless computing with AWS Lambda
 
Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...Webinar aws 101   a walk through the aws cloud- introduction to cloud computi...
Webinar aws 101 a walk through the aws cloud- introduction to cloud computi...
 
Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2
 
Amazon EC2 Masterclass
Amazon EC2 MasterclassAmazon EC2 Masterclass
Amazon EC2 Masterclass
 
Introduction to Block and File storage on AWS
Introduction to Block and File storage on AWSIntroduction to Block and File storage on AWS
Introduction to Block and File storage on AWS
 
(CMP201) All You Need To Know About Auto Scaling
(CMP201) All You Need To Know About Auto Scaling(CMP201) All You Need To Know About Auto Scaling
(CMP201) All You Need To Know About Auto Scaling
 
ElastiCache & Redis
ElastiCache & RedisElastiCache & Redis
ElastiCache & Redis
 
Intro to AWS: EC2 & Compute Services
Intro to AWS: EC2 & Compute ServicesIntro to AWS: EC2 & Compute Services
Intro to AWS: EC2 & Compute Services
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Deep dive into AWS IAM
Deep dive into AWS IAMDeep dive into AWS IAM
Deep dive into AWS IAM
 
An Introduction to AWS
An Introduction to AWSAn Introduction to AWS
An Introduction to AWS
 
Getting Started with Amazon EC2
Getting Started with Amazon EC2Getting Started with Amazon EC2
Getting Started with Amazon EC2
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Introduction to Amazon Web Services by i2k2 Networks
Introduction to Amazon Web Services by i2k2 NetworksIntroduction to Amazon Web Services by i2k2 Networks
Introduction to Amazon Web Services by i2k2 Networks
 
Intro to AWS: Database Services
Intro to AWS: Database ServicesIntro to AWS: Database Services
Intro to AWS: Database Services
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 
Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3
Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3
Hands-on Setup and Overview of AWS Console, AWS CLI, AWS SDK, Boto 3
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 

Viewers also liked

Amazon Machine Learning #AWSLoft Berlin
Amazon Machine Learning #AWSLoft BerlinAmazon Machine Learning #AWSLoft Berlin
Amazon Machine Learning #AWSLoft BerlinAWS Germany
 
Advanced Topics - Session 2 - Introducing AWS OpsWorks
Advanced Topics - Session 2 - Introducing AWS OpsWorksAdvanced Topics - Session 2 - Introducing AWS OpsWorks
Advanced Topics - Session 2 - Introducing AWS OpsWorksAmazon Web Services
 
End Note - AWS India Summit 2012
End Note - AWS India Summit 2012End Note - AWS India Summit 2012
End Note - AWS India Summit 2012Amazon Web Services
 
AWS Enterprise Summit London 2013 - Stuart Lynn - Sage
AWS Enterprise Summit London 2013 - Stuart Lynn - SageAWS Enterprise Summit London 2013 - Stuart Lynn - Sage
AWS Enterprise Summit London 2013 - Stuart Lynn - SageAmazon Web Services
 
AWS Canberra WWPS Summit 2013 - Extending your Datacentre with Amazon VPC
AWS Canberra WWPS Summit 2013 - Extending your Datacentre with Amazon VPCAWS Canberra WWPS Summit 2013 - Extending your Datacentre with Amazon VPC
AWS Canberra WWPS Summit 2013 - Extending your Datacentre with Amazon VPCAmazon Web Services
 
Monetise your content with Amazon CloudFront
Monetise your content with Amazon CloudFrontMonetise your content with Amazon CloudFront
Monetise your content with Amazon CloudFrontAmazon Web Services
 
AWS Summit 2013 | Auckland - Extending your Datacentre with Amazon VPC
AWS Summit 2013 | Auckland - Extending your Datacentre with Amazon VPCAWS Summit 2013 | Auckland - Extending your Datacentre with Amazon VPC
AWS Summit 2013 | Auckland - Extending your Datacentre with Amazon VPCAmazon Web Services
 
AWS Summit 2013 | Singapore - Public Sector Keynote, Teresa Carlson
AWS Summit 2013 | Singapore - Public Sector Keynote, Teresa CarlsonAWS Summit 2013 | Singapore - Public Sector Keynote, Teresa Carlson
AWS Summit 2013 | Singapore - Public Sector Keynote, Teresa CarlsonAmazon Web Services
 
Empowering Publishers Event - Intro - May-15-2013
Empowering Publishers Event - Intro - May-15-2013Empowering Publishers Event - Intro - May-15-2013
Empowering Publishers Event - Intro - May-15-2013Amazon Web Services
 
AWS Sydney Summit 2013 - Continuous Deployment Practices, with Production, Te...
AWS Sydney Summit 2013 - Continuous Deployment Practices, with Production, Te...AWS Sydney Summit 2013 - Continuous Deployment Practices, with Production, Te...
AWS Sydney Summit 2013 - Continuous Deployment Practices, with Production, Te...Amazon Web Services
 
AWS Summit 2013 | Singapore - Understanding AWS Storage Options
AWS Summit 2013 | Singapore - Understanding AWS Storage OptionsAWS Summit 2013 | Singapore - Understanding AWS Storage Options
AWS Summit 2013 | Singapore - Understanding AWS Storage OptionsAmazon Web Services
 
AWS Summit 2013 | Singapore - Extending your Datacenter with Amazon VPC
AWS Summit 2013 | Singapore - Extending your Datacenter with Amazon VPCAWS Summit 2013 | Singapore - Extending your Datacenter with Amazon VPC
AWS Summit 2013 | Singapore - Extending your Datacenter with Amazon VPCAmazon Web Services
 
AWS 101 Lunch & Learn March 2013
AWS 101 Lunch & Learn March 2013AWS 101 Lunch & Learn March 2013
AWS 101 Lunch & Learn March 2013Amazon Web Services
 
Viaggio attraverso il cloud come costruire architetture web scalabili e rob...
Viaggio attraverso il cloud   come costruire architetture web scalabili e rob...Viaggio attraverso il cloud   come costruire architetture web scalabili e rob...
Viaggio attraverso il cloud come costruire architetture web scalabili e rob...Amazon Web Services
 
MED303 Addressing Security in Media Workflows - AWS re: Invent 2012
MED303 Addressing Security in Media Workflows - AWS re: Invent 2012MED303 Addressing Security in Media Workflows - AWS re: Invent 2012
MED303 Addressing Security in Media Workflows - AWS re: Invent 2012Amazon Web Services
 
Focus on your app with Amazon RDS
Focus on your app with Amazon RDSFocus on your app with Amazon RDS
Focus on your app with Amazon RDSAmazon Web Services
 
Cloud Storage Transformation – Keynote - AWS Cloud Storage for the Enterprise...
Cloud Storage Transformation – Keynote - AWS Cloud Storage for the Enterprise...Cloud Storage Transformation – Keynote - AWS Cloud Storage for the Enterprise...
Cloud Storage Transformation – Keynote - AWS Cloud Storage for the Enterprise...Amazon Web Services
 
Best Practices for Getting Started with AWS
Best Practices for Getting Started with AWSBest Practices for Getting Started with AWS
Best Practices for Getting Started with AWSAmazon Web Services
 
SVC103 The Whys and Hows of Integrating Amazon Simple Email Service into your...
SVC103 The Whys and Hows of Integrating Amazon Simple Email Service into your...SVC103 The Whys and Hows of Integrating Amazon Simple Email Service into your...
SVC103 The Whys and Hows of Integrating Amazon Simple Email Service into your...Amazon Web Services
 

Masterclass Webinar: Amazon Elastic MapReduce (EMR)

  • 1. Masterclass Elastic MapReduce Ryan Shuttleworth – Technical Evangelist @ryanAWS
  • 2. A technical deep dive beyond the basics Help educate you on how to get the best from AWS technologies Show you how things work and how to get things done Broaden your knowledge in ~45 mins Masterclass
  • 3. A key tool in the toolbox to help with ‘Big Data’ challenges Makes possible analytics processes previously not feasible Cost effective when leveraged with EC2 spot market Broad ecosystem of tools to handle specific use cases Amazon Elastic MapReduce
  • 4. What is EMR? Map-Reduce engine Integrated with tools Hadoop-as-a-service Massively parallel Cost effective AWS wrapper Integrated to AWS services
  • 5. A framework Splits data into pieces Lets processing occur Gathers the results
  • 7. Very large click log (e.g. TBs) Lots of actions by John Smith
  • 8. Very large click log (e.g. TBs) Lots of actions by John Smith Split the log into many small pieces
  • 9. Very large click log (e.g. TBs) Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster
  • 10. Very large click log (e.g. TBs) Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster Aggregate the results from all the nodes
  • 11. Very large click log (e.g. TBs) What John Smith did Lots of actions by John Smith Split the log into many small pieces Process in an EMR cluster Aggregate the results from all the nodes
  • 12. What John Smith did Very large click log (e.g. TBs) Insight in a fraction of the time
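The split, process and aggregate flow on the slides above can be sketched in plain Python. This is a single-machine illustration only, not how EMR or Hadoop is implemented; the log format (`user<TAB>action`), the sample records and the function names are all assumptions made for the example.

```python
from collections import Counter

def split(records, chunk_size):
    """Split the log into small pieces (lists of at most chunk_size lines)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def process(chunk, user):
    """Map phase: count the actions one user performed within a single chunk."""
    counts = Counter()
    for line in chunk:
        who, action = line.split("\t")
        if who == user:
            counts[action] += 1
    return counts

def aggregate(partials):
    """Reduce phase: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical click log: one "user<TAB>action" record per line.
click_log = [
    "john.smith\tview",
    "jane.doe\tview",
    "john.smith\tclick",
    "john.smith\tview",
    "jane.doe\tpurchase",
]

chunks = split(click_log, chunk_size=2)
result = aggregate(process(chunk, "john.smith") for chunk in chunks)
print(dict(result))
```

In a real cluster each chunk would be processed on a different node; here the chunks are simply processed one after another.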
  • 15. Big Data != Hadoop
  • 16. Big Data != Hadoop Velocity Variety Volume
  • 17. Big Data != Hadoop When you need to innovate to collect, store, analyze, and manage your data
  • 18. Compute, Storage, Big Data: How do you get your slice of it? AWS Direct Connect: Dedicated low latency bandwidth. Queuing: Highly scalable event buffering. Amazon Storage Gateway: Sync local storage to the cloud. AWS Import/Export: Physical media shipping.
  • 19. Compute, Storage, Big Data: Where do you put your slice of it? Relational Database Service: Fully managed database (MySQL, Oracle, MSSQL). DynamoDB: NoSQL, schemaless, provisioned throughput database. S3: Object datastore, up to 5TB per object, 99.999999999% durability. SimpleDB: NoSQL, schemaless, smaller datasets.
  • 20. Elastic MapReduce Getting started, tools & operating modes
  • 21. EMR cluster Start an EMR cluster using console or cli tools
  • 22. Master instance group EMR cluster Master instance group created that controls the cluster (runs MySQL)
  • 23. Master instance group EMR cluster Core instance group Core instance group created for life of cluster
  • 24. Master instance group EMR cluster Core instance group HDFS HDFS Core instances run DataNode and TaskTracker daemons
  • 25. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Optional task instances can be added or subtracted to perform work
  • 26. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Amazon S3 S3 can be used as underlying ‘file system’ for input/output data
  • 27. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Amazon S3 Master node coordinates distribution of work and manages cluster state
  • 28. Master instance group EMR cluster Task instance group Core instance group HDFS HDFS Amazon S3 Core and Task instances read-write to S3
  • 30. map Input file reduce Output file map Input file reduce Output file map Input file reduce Output file EC2 instance EC2 instance EC2 instance
  • 36. HDFS EMR Pig S3 DynamoDB RDS Analytics languages Data management RedShift AWS DataPipeline
  • 37. EMR: Create Job Flow, Run Job Step 1, Run Job Step 2, Run Job Step n, End Job Flow. Cluster created: Master, Core & Task instance groups. Processing step: Execute a script such as Java, Hive script, Python, etc. Chained step: Execute subsequent step using data from previous step. Chained step: Add steps dynamically or predefine them as a pipeline. Cluster terminated: Cluster instances terminated on end of job flow.
  • 38. Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node. A step can be a compiled Java .jar or a script executed by the Hadoop Streamer. Data is typically passed between steps using the HDFS file system on Core Instance Group nodes, or S3. Output data is generally placed in S3 for subsequent use.
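The idea of chained steps, where each step consumes the output of the previous one, can be illustrated with a small local sketch. It does not call EMR; `run_job_flow` and the three step functions are hypothetical names invented for this example, and in-memory Python values stand in for the HDFS or S3 data passed between real steps.

```python
def run_job_flow(steps, data):
    """Run each step in order, feeding it the previous step's output."""
    for name, step in steps:
        data = step(data)
        print("finished step: " + name)
    return data

def tokenize(lines):
    """Step 1: break lines into lower-cased words."""
    return [word.lower() for line in lines for word in line.split()]

def count(words):
    """Step 2: count how often each word occurs."""
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals

def top_words(totals):
    """Step 3: keep the two most frequent words (ties broken alphabetically)."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:2]

# Steps can be predefined as a pipeline...
steps = [("tokenize", tokenize), ("count", count)]
# ...or appended later, before the flow reaches them.
steps.append(("top words", top_words))

output = run_job_flow(steps, ["Abacus accord", "accord Abacus accord"])
print(output)
```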
  • 41. Hadoop Streaming: Utility that comes with the Hadoop distribution. Allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. The mapper and the reducer each read their input from standard input and write their output to standard output. By default, each line of input/output represents a record with a tab-separated key/value.
  • 46. Streaming: word count python script (Stdin > Stdout)

        #!/usr/bin/python
        import sys
        import re

        def main(argv):
            # A word is a letter followed by any number of letters or digits.
            pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
            # Read words from stdin line by line.
            for line in sys.stdin:
                for word in pattern.findall(line):
                    # Emit a tab-delimited record, e.g. "LongValueSum:abacus<TAB>1".
                    print("LongValueSum:" + word.lower() + "\t" + "1")

        if __name__ == "__main__":
            main(sys.argv)

  • 47. Streaming: word count python script (Stdin > Stdout). Read words from StdIn line by line.
  • 48. Streaming: word count python script (Stdin > Stdout). Output to StdOut a tab-delimited record, e.g. “LongValueSum:abacus 1”.
  • 50. Aggregate: Sort inputs and add up totals, e.g. “Abacus 1” and “Abacus 1” become “Abacus 2”
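The sort-and-sum behaviour described for the aggregate step can be sketched locally as follows. This is an illustration of the idea, not Hadoop's built-in aggregate reducer; the function names and the sample records are assumptions, and the sketch leaves out the `LongValueSum:` prefix that the slide's mapper emits.

```python
from itertools import groupby

def parse(record):
    """Split a tab-separated streaming record into (key, integer value)."""
    key, value = record.rstrip("\n").split("\t")
    return key, int(value)

def aggregate(records):
    """Sort records by key, then add up the values for each key."""
    pairs = sorted(parse(r) for r in records)
    return [(key, sum(value for _, value in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]

# Hypothetical mapper output: one "word<TAB>1" record per occurrence.
mapper_output = ["abacus\t1", "accord\t1", "abacus\t1"]

totals = aggregate(mapper_output)
for key, total in totals:
    print(key + "\t" + str(total))
```

Sorting first is what lets all records for one key be summed in a single pass, which is why the slide lists the sort before the totals.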
  • 52. Manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data, to the core and task instance groups It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop Master instance group Instance groups
  • 53. Manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data, to the core and task instance groups It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop Master instance group Contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Core nodes run both the DataNodes and TaskTracker Hadoop daemons. Core instance group Instance groups
  • 54. Instance groups. Task instance group: contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress. You can increase and decrease the number of task nodes. Because they don't store data and can be added to and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses.
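As a rough sketch, the instance groups from slides 52 to 54 can be written down as plain data. The dictionary keys follow the shape that boto3's `run_job_flow` accepts for `InstanceGroups`; that is an assumption on our part, since the slides use the `elastic-mapreduce` CLI rather than boto3. The helper name, instance type and counts are illustrative only.

```python
def instance_groups(core_count, task_count=0, instance_type="m1.large"):
    """Describe master, core and (optionally) task instance groups.

    Hypothetical helper: it only builds the description, it does not
    call AWS.
    """
    groups = [
        # One master node manages the job flow.
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and store HDFS data for the whole run.
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # The task group is optional; task nodes hold no HDFS data,
        # so their number can be increased and decreased.
        groups.append({"Name": "task", "InstanceRole": "TASK",
                       "InstanceType": instance_type,
                       "InstanceCount": task_count})
    return groups


if __name__ == "__main__":
    for group in instance_groups(core_count=4, task_count=5):
        print(group["InstanceRole"], group["InstanceCount"])
```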
  • 56. SSH onto master node with this keypair
  • 57. Keep job flow running so you can add more to it
  • 58. … aakar 3 abdal 3 abdelaziz 18 abolish 3 aboriginal 12 abraham 6 abseron 9 abstentions 15 accession 73 accord 90 achabar 3 achievements 3 acic 4 acknowledged 6 acquis 9 acquisitions 3 …
  • 64. Launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. When your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters.
  • 65. elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --log-uri s3n://mybucket/EMR/log
  • 66. Same command as slide 65. Instance type/count: --num-instances 5 --instance-type m2.4xlarge
  • 67. Same command as slide 65. --alive: don't terminate the cluster; keep it running so steps can be added
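The command from slide 65 can also be assembled programmatically, which helps when the instance count or type varies between runs. A minimal Python sketch follows, using only the flags shown on the slides. The helper name `build_create_command` is hypothetical, and the sketch only builds and prints the argument list; it does not run the CLI.

```python
import shlex


def build_create_command(name, key_pair, region, num_instances,
                         instance_type, log_uri, alive=True):
    """Build the argument list for an `elastic-mapreduce --create` call."""
    args = [
        "elastic-mapreduce", "--create",
        "--key-pair", key_pair,
        "--region", region,
        "--name", name,
        "--num-instances", str(num_instances),
        "--instance-type", instance_type,
    ]
    if alive:
        # Keep the cluster running so that more steps can be added later.
        args.append("--alive")
    args += ["--log-uri", log_uri]
    return args


if __name__ == "__main__":
    cmd = build_create_command("MyJobFlow", "micro", "eu-west-1", 5,
                               "m2.4xlarge", "s3n://mybucket/EMR/log")
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Starting with `num_instances=1` on sample data and raising it once the steps run correctly mirrors the workflow described on slide 64.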
  • 68. Adding Hive, Pig and HBase to the job flow: elastic-mapreduce --create --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest --hbase --log-uri s3n://mybucket/EMR/log
  • 69. Using a bootstrap action to configure the cluster: elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-s,dfs.block.size=1048576" --key-pair micro --region eu-west-1 --name MyJobFlow --num-instances 5 --instance-type m2.4xlarge --alive --pig-interactive --pig-versions latest --hive-interactive --hive-versions latest
  • 70. You can specify up to 16 bootstrap actions per job flow by providing multiple bootstrap-action parameters.
      s3://elasticmapreduce/bootstrap-actions/configure-daemons: Configure JVM properties. Example: --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
      s3://elasticmapreduce/bootstrap-actions/configure-hadoop: Configure cluster wide hadoop settings. Example: ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2"
      s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive: Configure cluster wide memory settings. Example: ./elastic-mapreduce --create --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
  • 71. You can specify up to 16 bootstrap actions per job flow by providing multiple bootstrap-action parameters.
      s3://elasticmapreduce/bootstrap-actions/run-if: Conditionally run a command. Example: ./elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true,echo running on master node"
      Shutdown actions: Run scripts on shutdown. A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. Each script must run and complete within 60 seconds.
      Custom actions: Run a custom script. Example: --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"
  • 74. Hadoop streaming: unstructured data with wide variation of formats, high flexibility on language. Broad language support (Python, Perl, Bash…). Mapper, reducer, input and output all supplied by the job flow. Scale from 1 to unlimited number of processes.
  • 75. Hadoop streaming (as on slide 74). Doing analytics in Eclipse is wrong. 2-stage dataflow is rigid. Joins are lengthy and error-prone. Prototyping/exploration requires compile & deploy.
  • 76. Hadoop streaming: unstructured data with wide variation of formats, high flexibility on language. Hive: structured data which is lower value or where higher latency is tolerable. Pig: semi-structured data which can be mined using set operations. HBase: near real time K/V store for structured data.
  • 78. SQL interface for working with data. Simple way to use Hadoop. Create Table statement references data location on S3. Language called HiveQL, similar to SQL. An example of a query could be: SELECT COUNT(1) FROM sometable; Requires setting up a mapping to the input data. Uses SerDes to make different input formats queryable. Powerful data types (Array & Map...).
  • 79. SQL compared with HiveQL:
      Feature                  SQL                      HiveQL
      Updates                  UPDATE, INSERT, DELETE   INSERT, OVERWRITE TABLE
      Transactions             Supported                Not supported
      Indexes                  Supported                Not supported
      Latency                  Sub-second               Minutes
      Functions                Hundreds                 Dozens
      Multi-table inserts      Not supported            Supported
      Create table as select   Not valid SQL-92         Supported
  • 80. JDBC listener started by default on EMR. Port 10003 for Hive 0.8.1, ports 9100-9101 for Hadoop. Download hive-jdbc-0.8.1.jar. Access by: open port on security group ElasticMapReduce-master, or SSH tunnel.
  • 81. ./elastic-mapreduce --create --name "Hive job flow" --hive-script --args s3://myawsbucket/myquery.q --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output (s3://myawsbucket/myquery.q is the HiveQL to execute)
  • 82. Interactive hive session: ./elastic-mapreduce --create --alive --name "Hive job flow" --num-instances 5 --instance-type m1.large --hive-interactive
  • 83. CREATE EXTERNAL TABLE impressions ( requestBeginTime string, adId string, impressionId string, referrer string, userAgent string, userCookie string, ip string ) PARTITIONED BY (dt string) ROW FORMAT serde 'com.amazon.elasticmapreduce.JsonSerde' with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' ) LOCATION 's3://mybucketsource/tables/impressions' ;
  • 84. Same CREATE EXTERNAL TABLE statement as slide 83. Table structure to create (happens fast, as it is just a mapping to the source data).
  • 85. Same CREATE EXTERNAL TABLE statement as slide 83. The LOCATION clause points at the source data in S3.
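To make the `paths` mapping in the CREATE EXTERNAL TABLE statement more concrete, the following Python sketch mimics what a JSON SerDe does for one line of input: parse the JSON record and pull out the listed fields as columns. It is only an illustration and not the `com.amazon.elasticmapreduce.JsonSerde` implementation; the sample record and the helper name `row_from_json` are assumptions, as is the choice to return `None` for a missing field.

```python
import json

# Column names taken from the 'paths' SerDe property of the table.
PATHS = ["requestBeginTime", "adId", "impressionId", "referrer",
         "userAgent", "userCookie", "ip"]


def row_from_json(line, paths=PATHS):
    """Map one JSON-encoded log line onto the table's columns.

    Hypothetical helper: fields missing from the record become None.
    """
    record = json.loads(line)
    return [record.get(path) for path in paths]


if __name__ == "__main__":
    # Made-up record for illustration; real impression logs will differ.
    sample = '{"adId": "ad-1", "impressionId": "imp-42", "ip": "10.0.0.1"}'
    print(row_from_json(sample))
```

Because the table only describes how to read the files, no data is copied when it is created; the parsing happens when a query runs.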
  • 86. hive> select * from impressions limit 5; Selecting from source data directly via Hadoop
  • 87. Comparison of Streaming, Hive, Pig, DynamoDB and Redshift (five-value rows are listed in that column order):
      Unstructured data: ✓ ✓
      Structured data: ✓ ✓ ✓ ✓
      Language support: Any*, HQL, Pig Latin, Client, SQL
      SQL: ✓ ✓
      Volume: Unlimited, Unlimited, Unlimited, Relatively low, 1.6 PB
      Latency: Medium, Medium, Medium, Ultra low, Low
  • 88. Cost considerations Making the most of EC2 resources
  • 89. 1 instance for 100 hours = 100 instances for 1 hour
  • 91. 1 instance for 1,000 hours = 1,000 instances for 1 hour
  • 98. EMR with spot instances. Scenario #1: job flow duration 14 hours. #1: Cost without Spot: 4 instances * 14 hrs * $0.50 = $28
  • 99. EMR with spot instances. Scenario #1: job flow duration 14 hours. Scenario #2: job flow duration 7 hours.
  • 100. EMR with spot instances. #2: Cost with Spot: 4 instances * 7 hrs * $0.50 = $14, plus 5 instances * 7 hrs * $0.25 = $8.75. Total = $22.75
  • 101. EMR with spot instances. Scenario #1 (14 hours, without Spot): $28. Scenario #2 (7 hours, with Spot): $22.75. Time savings: 50%. Cost savings: ~19%
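The arithmetic behind the two scenarios can be checked with a few lines of Python. This is a sketch using the illustrative hourly prices from the slides ($0.50 on-demand, $0.25 Spot), not real price data; the helper name `cluster_cost` is hypothetical.

```python
def cluster_cost(groups):
    """Total cost for (instance_count, hours, hourly_price) groups."""
    return sum(count * hours * price for count, hours, price in groups)


# Scenario #1: 4 on-demand instances for 14 hours.
without_spot = cluster_cost([(4, 14, 0.50)])

# Scenario #2: the same 4 on-demand instances plus 5 Spot instances,
# finishing in 7 hours.
with_spot = cluster_cost([(4, 7, 0.50), (5, 7, 0.25)])

# Savings expressed as a fraction of the scenario #1 figures.
cost_saving = (without_spot - with_spot) / without_spot
time_saving = (14 - 7) / 14

print("Without Spot: ${:.2f}".format(without_spot))  # Without Spot: $28.00
print("With Spot:    ${:.2f}".format(with_spot))     # With Spot:    $22.75
print("Cost saving:  {:.2%}".format(cost_saving))    # Cost saving:  18.75%
print("Time saving:  {:.0%}".format(time_saving))    # Time saving:  50%
```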
  • 103. Managing big data: Dynamic creation of clusters for processing jobs. Bulk of data stored in S3. Use Reduced Redundancy Storage for lower value / re-creatable data, and lower cost by 30%. DynamoDB for high performance data access. Migrate data to Glacier for archive and periodic access. VPC & IAM for network security.
  • 104. Shark can return results up to 30 times faster than the same queries run on Hive. elastic-mapreduce --create --alive --name "Spark/Shark Cluster" --bootstrap-action s3://elasticmapreduce/samples/spark/install-spark-shark.sh --bootstrap-name "install Mesos/Spark/Shark" --hive-interactive --hive-versions 0.7.1.3 --ami-version 2.0 --instance-type m1.xlarge --instance-count 3 --key-pair ec2-keypair
  • 106. What is EMR? Map-Reduce engine; Integrated with tools; Hadoop-as-a-service; Massively parallel; Cost effective AWS wrapper; Integrated to AWS services
  • 107. Try the tutorial: aws.amazon.com/articles/2855 Find out more: aws.amazon.com/big-data