2. Amazon Elastic MapReduce
• Enables customers to easily and cost-effectively process vast amounts of data.
• Utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon.
• Launched in the US in April and EU in July of 2009.
3. Amazon Elastic MapReduce
• Large scale data processing has a lot of MUCK and we want to remove it for our customers
• Hard to manage compute clusters
• Hard to tune Hadoop
• Hadoop issues preventing smooth operation in the cloud
Amazon.com Confidential
5. Amazon Elastic MapReduce
[Diagram: an application is deployed through the web console or command line tools; Elastic MapReduce runs Hadoop on Amazon EC2 instances, reads the input dataset from an Amazon S3 input bucket, writes output results to an Amazon S3 output bucket, and notifies the user at the end, who then gets the results.]
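The flow in this slide (input in S3, Hadoop on EC2, results back in S3) can be sketched as a request for a job flow. This is a minimal sketch, assuming the parameter layout of the present-day boto3 `run_job_flow` call, which postdates these slides. The helper name `build_job_flow_request`, the bucket and jar paths, the instance type and the release label are all hypothetical placeholders. The request is only built here, not submitted.

```python
def build_job_flow_request(name, input_uri, output_uri, jar_uri, instance_count=3):
    """Build keyword arguments shaped like boto3's EMR run_job_flow call.

    Everything passed in (URIs, jar, name) is a placeholder chosen by the caller.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: any current release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption: illustrative size
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": instance_count,
            # Shut the cluster down once the last step finishes, matching the
            # "turn off resources when the job flow is done" behaviour.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-input",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": jar_uri,
                    # Input and output locations are passed to the Hadoop job.
                    "Args": [input_uri, output_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_job_flow_request(
    "example-job-flow",
    "s3://example-input-bucket/dataset/",
    "s3://example-output-bucket/results/",
    "s3://example-code-bucket/job.jar",
)
```

With credentials configured, such a dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. The call is left out so the sketch runs without an AWS account.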
6. Amazon Elastic MapReduce Benefits
• Elastic: Uses as many or as few EC2 instances as needed. Spin up large or small job flows in minutes.
• Easy to use: Get up and running quickly with easy-to-use web console, robust command line clients and sample jobs. No configuration necessary.
• Reliable: Fault tolerant service built on top of battle-tested AWS infrastructure. Automatically retries failed tasks.
• Cost Effective: We monitor progress of your jobs and turn off resources when job flow is done.
7. Problems customers solve with Elastic MapReduce
• Data mining (Log processing, click stream analysis, similarities, etc.)
• Bio-informatics (Genome analysis)
• Financial simulation (Monte Carlo simulation)
• File processing (resize jpegs)
• Web indexing
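Monte Carlo simulation fits the map/reduce shape well because every batch of samples is independent. The sketch below is an illustration only: it estimates pi as a stand-in for a financial model, runs in a single process rather than on Hadoop, and the names `simulate_chunk` and `combine` are hypothetical.

```python
import random


def simulate_chunk(seed, samples):
    """Map step: one independent task counts random points inside the unit quarter circle."""
    rng = random.Random(seed)  # a per-task seed keeps the tasks independent and repeatable
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, samples


def combine(results):
    """Reduce step: sum the partial counts and convert them into an estimate of pi."""
    total_hits = sum(hits for hits, _ in results)
    total_samples = sum(samples for _, samples in results)
    return 4.0 * total_hits / total_samples


# Eight "map tasks" of 50,000 samples each, followed by a single reduce.
partials = [simulate_chunk(seed, 50_000) for seed in range(8)]
pi_estimate = combine(partials)
```

On a cluster, each call to `simulate_chunk` would be a separate task, and adding instances would shorten the wall-clock time without changing the reduce step.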
8. Customer Feedback
• Pros:
    • Amazon Elastic MapReduce makes it easy to run Hadoop applications.
    • Reliable platform for production data-processing
• Challenges:
    • Simple tasks such as log processing require fluency in MapReduce
    • Hadoop applications are difficult to develop
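To make the first challenge concrete, here is what even a simple log-processing task looks like once it is phrased as a map step and a reduce step. This is a minimal in-process sketch, not Hadoop code. The log layout (client, method, URL, status) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (url, 1) for one access-log line; skip malformed lines."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[2], 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Hypothetical log lines: client address, method, URL, status code.
sample_log = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /about.html 200",
    "10.0.0.1 GET /index.html 304",
    "garbage",
]

pairs = [pair for line in sample_log for pair in map_line(line)]
hits_per_url = reduce_counts(pairs)
```

Counting hits per URL takes two functions plus the plumbing between them. A query language would express the same task as a single grouped count.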
9. New Features
• Support for Apache Pig (August 2009)
• Batch and interactive mode
• Concurrent access to multiple file systems
• Loading resources from Amazon S3
• Additional Piggybank functions
• Integration with Elastic MapReduce Client and Web Console
10. New Features
• Support for Apache Hive 0.4 (today)
• Batch and interactive mode
• Integration with Elastic MapReduce Client and Web Console
• Additions to Hive
    • Load table partitions automatically from Amazon S3
    • Specify an off-instance metadata store
    • Optimized data writes to Amazon S3
    • Reference resources on Amazon S3
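The slides do not say how partitions are discovered, so the following is only a sketch of the idea behind loading table partitions from Amazon S3. It assumes the usual Hive layout in which each partition lives under a `column=value` path segment. The object keys, the `logs/` prefix and both function names are hypothetical, and no S3 call is made.

```python
def partition_spec(key, table_prefix):
    """Return the partition columns encoded in an object key, or None if there are none."""
    if not key.startswith(table_prefix):
        return None
    relative = key[len(table_prefix):].strip("/")
    spec = {}
    # Every segment except the last (the file name) must look like column=value.
    for segment in relative.split("/")[:-1]:
        if "=" not in segment:
            return None
        column, value = segment.split("=", 1)
        spec[column] = value
    return spec or None


def discover_partitions(keys, table_prefix):
    """Collect the distinct partition specs found under a table's prefix, in first-seen order."""
    found = []
    for key in keys:
        spec = partition_spec(key, table_prefix)
        if spec is not None and spec not in found:
            found.append(spec)
    return found


# Hypothetical listing of object keys under one table's location.
keys = [
    "logs/dt=2009-10-01/part-00000",
    "logs/dt=2009-10-01/part-00001",
    "logs/dt=2009-10-02/part-00000",
    "logs/_README",
]
partitions = discover_partitions(keys, "logs/")
```

Each discovered spec corresponds to one partition that would otherwise have to be registered by hand.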
11. Amazon Elastic MapReduce Ecosystem
• Karmasphere Studio for Hadoop: NetBeans IDE for development, debugging, deployment and management of Hadoop jobs
• Deploy Hadoop jobs to Elastic MapReduce
• Monitor progress of Elastic MapReduce job flows
• Amazon S3 file browser
• Elastic MapReduce HDFS browser
12. Amazon Elastic MapReduce Ecosystem
• Support for Cloudera’s Hadoop distribution (private beta)
• Optionally use Cloudera’s Hadoop while executing Elastic MapReduce job flows
• Get support from Cloudera for the Elastic MapReduce job flows