Introduction To Elastic MapReduce at WHUG

Possible real-world situation
● We have big data and/or a very long,
embarrassingly parallel computation
● Our data may grow fast
● We want to start trying Hadoop as soon as possible

● We do not have our own infrastructure
● We do not have Hadoop administrators
● We have limited funds

Possible solution
Amazon Elastic MapReduce (EMR)
● Hadoop framework running on Amazon's
web-scale infrastructure

EMR Benefits
Elastic (scalable)
● Use one, a hundred, or even thousands of
instances to process up to petabytes of data
● Modify the number of instances while the job
flow is running
● Start computation within minutes

EMR Benefits
Easy to use
● No configuration necessary
○ Do not worry about setting up hardware and
networking, running, managing and tuning the
performance of Hadoop cluster
● Easy-to-use tools and plugins available
○ AWS Web Management Console
○ Command Line Tools by Amazon
○ Amazon EMR API, SDK, Libraries
○ Plugins for IDEs (e.g. Eclipse & Karmasphere Studio
for EMR)

EMR Benefits
Reliable
● Built on Amazon's highly available and
battle-tested infrastructure
● Provision new nodes to replace those that
fail
● Used by e.g.:

EMR Benefits
Cost effective
● Pay for what you use (for each started hour)
● Choose from various instance types to meet
your requirements
● Reserve instances for 1 or 3 years to pay
less per hour
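
Billing per started hour means that a partial hour counts as a whole one. A minimal sketch of that rounding follows; the function name and the hourly rate are hypothetical values chosen for illustration, not Amazon's actual prices.

```python
import math

def billed_cost(runtime_minutes, instances, hourly_rate):
    """Estimate cost when every started hour is billed in full.

    hourly_rate is a hypothetical per-instance price, not a real quote.
    """
    # Any started hour is rounded up to a full hour.
    billed_hours = math.ceil(runtime_minutes / 60)
    return billed_hours * instances * hourly_rate

# A 61-minute job on 10 instances is billed as 2 hours per instance.
print(billed_cost(61, 10, 0.10))
```

Under this model a job that finishes just after an hour boundary costs as much as one that runs for the whole next hour, which is worth keeping in mind when sizing a cluster.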

EMR Overview
Amazon Elastic MapReduce (Amazon EMR)
works in conjunction with
● Amazon EC2 to rent computing instances
(with Hadoop installed)
● Amazon S3 to store input and output data,
scripts/applications and logs

EMR Architectural Overview

* image from the Internet

EC2 Instance Types

* image from Big Data University, Course: "Hadoop and the Amazon Cloud"

EMR Pricing - "On-demand" instances
Standard Family Instances (US East Region)

http://aws.amazon.com/elasticmapreduce/pricing/

EC2 & S3 Pricing - Real-world example
New York Times wanted to host all public
domain articles from 1851 to 1922.
● 11 million articles
● 4 TB of raw image TIFF input data converted
to 1.5 TB of PDF documents
● 100 EC2 Instances rented
● < 24 hours of computation
● $240 paid (not including storage & bandwidth)
● 1 employee assigned to this task
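
The figures above imply an average price per instance-hour. A back-of-the-envelope check, assuming all 100 instances were billed for a full 24 hours (the slide only says "< 24 hours", so the rate below is an inference, not a quoted price):

```python
instances = 100
billed_hours = 24          # assumption: each instance billed for 24 started hours
total_paid = 240.0         # USD, from the example above

instance_hours = instances * billed_hours
implied_rate = total_paid / instance_hours
print(f"{instance_hours} instance-hours, about ${implied_rate:.2f} per instance-hour")
```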

EC2 & S3 Pricing - Real-world example

How much did they pay for storage and bandwidth?

S3 Pricing

http://aws.amazon.com/s3/pricing/

EC2 & S3 Pricing Calculator
Simple Monthly Calculator:
http://calculator.s3.amazonaws.com/calc5.html

AWS Free Usage Tier (Per Month)
Available for free to new AWS customers for 12
months following the AWS sign-up date, e.g.:
● 750 hours of Amazon EC2 Micro Instance
usage
○ 613 MB of memory and 32-bit or 64-bit platform
● 5 GB of Amazon S3 standard storage,
20,000 Get and 2,000 Put Requests
● 15 GB of bandwidth out aggregated across
all AWS services
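
The 750-hour allowance is enough to keep a single micro instance running around the clock for a whole month. A quick check of that arithmetic (the helper name is made up for illustration):

```python
FREE_HOURS_PER_MONTH = 750   # from the free tier description above

def hours_in_month(days):
    """Hours needed to run one instance non-stop for a month of `days` days."""
    return days * 24

for days in (28, 30, 31):
    needed = hours_in_month(days)
    print(days, "days:", needed, "hours, covered:", needed <= FREE_HOURS_PER_MONTH)
```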

EMR - Support for Hadoop Ecosystem
Develop and run MapReduce applications using:
● Java
● Streaming (e.g. Ruby, Perl, Python, PHP, R,
or C++)
● Pig
● Hive

HBase can be easily installed using a set of EC2
scripts

EMR - Featured Users

* logos from http://aws.amazon.com/elasticmapreduce/

EMR - Case Study - Yelp

● help people connect
with great local businesses
● share reviews and insights

● as of November 2010:
○ 39 million monthly unique visitors
○ in total, 14 million reviews posted

EMR - Case Study - Yelp
● uses S3 to store daily logs (~100GB/day)
and photos
● uses EMR to power features like
○ People who viewed this also viewed
○ Review highlights
○ Autocomplete in search box
○ Top searches
● implements jobs in Python and uses their
own open-source library, mrjob, to run them
on EMR

mrjob - WordCount example
from mrjob.job import MRJob

class MRWordCounter(MRJob):
    def mapper(self, key, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, occurrences):
        # sum all counts emitted for this word
        yield word, sum(occurrences)

if __name__ == '__main__':
    MRWordCounter.run()
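
To see what the job computes without installing mrjob or starting a cluster, the same mapper and reducer logic can be simulated in plain Python. This is only a sketch of the map, shuffle and reduce phases; the function names and the sample lines are made up for illustration, and it does not use the mrjob API.

```python
from collections import defaultdict

def mapper(line):
    # Same logic as MRWordCounter.mapper: emit (word, 1) per word.
    for word in line.split():
        yield word, 1

def reducer(word, occurrences):
    # Same logic as MRWordCounter.reducer: sum the counts for one word.
    yield word, sum(occurrences)

def run_local(lines):
    # Shuffle phase: group all mapper values by key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    # Reduce phase: one reducer call per distinct key, in sorted key order.
    results = {}
    for word in sorted(grouped):
        for key, total in reducer(word, grouped[word]):
            results[key] = total
    return results

print(run_local(["to be or not to be", "to do"]))
```

On a real cluster the grouping step is performed by Hadoop between the map and reduce phases; here a dictionary stands in for it.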

mrjob - run on EMR
$ python wordcount.py \
    --ec2-instance-type c1.medium \
    --num-ec2-instances 10 \
    -r emr 's3://input-bucket/*.txt' > output

Million Song Dataset
● Contains detailed acoustic and contextual
data for one million popular songs
● ~300 GB of data
● Publicly available
○ for download:
http://www.infochimps.com/collections/million-songs
○ for processing using EMR:
http://tbmmsd.s3.amazonaws.com/

Million Song Dataset
Contains data such as:
● Song's title, year and hotness
● Song's tempo, duration, danceability,
energy, loudness, segments count, preview
(URL to mp3 file) and so on
● Artist's name and hotness

Million Song Dataset - Song's density
Song's density* can be defined as the average
number of notes or atomic sounds (called
segments) per second in a song.

density = segmentCnt / duration

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
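
As a quick illustration of the formula, here is a minimal Python sketch. The function name and the sample numbers are hypothetical; only the formula itself comes from the slide.

```python
def song_density(segment_count, duration_seconds):
    """Average number of segments per second of a song."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return segment_count / duration_seconds

# Hypothetical song: 900 segments over 240 seconds.
print(song_density(900, 240))
```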

Million Song Dataset - Task*
Simple music recommendation system
● Calculate density for each song
● Find hot songs with similar density

* based on Paul Lamere's blog - http://bit.ly/qUbLdQ
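
The second step can be pictured with a small in-memory sketch. This is not the MapReduce job from the talk, just an illustration of the idea on a handful of records; the tolerance, the hotness threshold and the sample songs are arbitrary assumptions.

```python
def similar_hot_songs(songs, target_density, tolerance=0.5, min_hotness=0.7):
    """Return titles of hot songs whose density is close to target_density.

    songs is a list of (title, density, hotness) tuples. The tolerance and
    min_hotness defaults are arbitrary illustration values.
    """
    matches = [
        (abs(density - target_density), title)
        for title, density, hotness in songs
        if hotness >= min_hotness and abs(density - target_density) <= tolerance
    ]
    # Closest density first.
    return [title for _, title in sorted(matches)]

songs = [
    ("Song A", 3.2, 0.9),
    ("Song B", 3.4, 0.4),    # similar density but not hot
    ("Song C", 5.1, 0.8),    # hot but much denser
    ("Song D", 3.6, 0.75),
]
print(similar_hot_songs(songs, target_density=3.3))
```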

Million Song Dataset - MapReduce
Input data
● 339 files
● Each file contains ~3 000 songs
● Each song is represented by one line in
input file
● Fields are separated by a tab character

Mapper
● Reads song's data from each line of input
text
● Calculates song's density
● Emits song's density as key with some other
details as value

<line_offset, song_data> ->
<density, (artist_name, song_title, song_url)>

public void map(LongWritable key, Text value,
        OutputCollector<FloatWritable, TripleTextWritable> output,
        Reporter reporter) throws IOException {

    song.parseLine(value.toString());
    // skip songs with missing tempo or duration
    if (song.tempo > 0 && song.duration > 0) {
        // calculate density
        float density = ((float) song.segmentCnt) / song.duration;

        denstyWritable.set(density);
        songWritable.set(song.artistName, song.title, song.preview);

        output.collect(denstyWritable, songWritable);
    }
}
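
For comparison with the Python word count shown earlier, the same mapper logic could be sketched as a plain Python generator. The column positions below are hypothetical, since the slides do not show the field order of the input files; only the tab separator and the field names come from the talk.

```python
# Hypothetical column positions in a tab-separated song line.
ARTIST, TITLE, TEMPO, DURATION, SEGMENT_CNT, PREVIEW = range(6)

def map_song(line):
    """Yield (density, (artist, title, preview_url)) for one valid song line."""
    fields = line.rstrip("\n").split("\t")
    tempo = float(fields[TEMPO])
    duration = float(fields[DURATION])
    # Same filter as the Java mapper: skip songs without tempo or duration.
    if tempo > 0 and duration > 0:
        density = int(fields[SEGMENT_CNT]) / duration
        yield density, (fields[ARTIST], fields[TITLE], fields[PREVIEW])

line = "Some Artist\tSome Title\t120.0\t200.0\t700\thttp://example.com/preview.mp3"
print(list(map_song(line)))
```

Songs that fail the filter simply produce no output, so they never reach the reducers.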

Reducer
● Identity Reducer
● Each Reducer gets density values from
different range: [i, i+1)*,**

<density, [(artist_name, song_title, song_url)]> ->
<density, (artist_name, song_title, song_url)>

* thanks to a custom Partitioner
** not optimal partitioning (partitions are not balanced)
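
One way such a range partitioner could work, sketched in Python rather than as a Hadoop Partitioner class: take the integer part of the density as the partition index and clamp it to the number of reducers. The clamping rule and the function name are assumptions; the talk does not show the actual partitioner.

```python
import math

def density_partition(density, num_reducers):
    """Map a density in [i, i+1) to reducer i, clamped to the last reducer."""
    index = math.floor(density)
    return max(0, min(index, num_reducers - 1))

for d in (0.4, 2.0, 2.99, 3.0, 17.5):
    print(d, "->", density_partition(d, num_reducers=10))
```

Because real densities cluster in a few integer ranges, a scheme like this sends most records to a handful of reducers, which is the imbalance the second footnote refers to.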

Demo - used software
● Karmasphere Studio for EMR (Eclipse
plugin)
○ graphical environment that supports the complete
lifecycle for developing for Amazon Elastic
MapReduce, including prototyping, developing,
testing, debugging, deploying and optimizing
Hadoop jobs
(http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html)

images from:
http://www.karmasphere.com/ksc/karmasphere-studio-for-amazon.html

Please watch the video on the WHUG channel on YouTube:
http://www.youtube.com/watch?v=Azwilbn8GCs
