2. Agenda
1. Introduction to Hadoop Streaming and Elastic
MapReduce
2. Simple EMR web interface demo
3. Introduction to our dataset
4. Using EMR from command line with boto
All presentation material is available at
https://github.com/gofore/aws-emr
3. Hadoop Streaming
Utility that allows you to create and run
Map/Reduce jobs with any executable or script as
the mapper and/or the reducer.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input my/Input/Directories \
  -output my/Output/Directory \
  -mapper myMapperProgram.py \
  -reducer myReducerProgram.py
cat input_data.txt | mapper.py | reducer.py > output_data.txt
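The pipe above can be emulated locally before paying for a cluster. A minimal sketch of a Streaming-style mapper and reducer pair (word count, with made-up input; in a real job the two functions would live in separate mapper.py and reducer.py scripts reading stdin):

```python
from itertools import groupby

def mapper(lines):
    # emit one tab-separated "word<TAB>1" record per word, as Streaming expects
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(records):
    # Streaming sorts mapper output by key before the reducer sees it,
    # so equal keys arrive in contiguous runs and groupby is enough
    for word, run in groupby(records, key=lambda r: r.split("\t")[0]):
        yield "%s\t%d" % (word, sum(int(r.split("\t")[1]) for r in run))

# local emulation of: cat input_data.txt | mapper.py | sort | reducer.py
input_lines = ["jams on ring road", "ring road clear"]
for record in reducer(sorted(mapper(input_lines))):
    print(record)
```

Note the explicit `sorted()` standing in for the shuffle/sort phase that Hadoop performs between the map and reduce stages.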
5. Amazon Elastic MapReduce
Hadoop-based MapReduce cluster as a service
Can run either Amazon-optimized Hadoop or
MapR
Managed from a web UI or through API
9. Cluster creation steps
Cluster: name, logging
Tags: keywords for the cluster
Software: Hadoop distribution and version, pre-
installed applications (Hive, Pig,...)
File System: encryption, consistency
Hardware: number and type of instances
Security and Access: ssh keys, node access roles
Bootstrap Actions: scripts to customize the cluster
Steps: a queue of MapReduce jobs for the cluster
11. Filesystems
EMRFS is an implementation of HDFS that reads
and writes files directly to and from S3.
HDFS should be used to cache results of
intermediate steps.
The s3 filesystem is block-based just like HDFS. The
s3n filesystem is file-based and can be accessed
with other tools, but file size is limited to 5 GB.
12. S3 is not a file system, it is a REST-like object
store.
S3 has eventual consistency: files written to S3
might not be immediately available for reading.
EMRFS can be configured to encrypt files in S3
and to monitor the consistency of files, which can
detect operations that try to use inconsistent files.
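Because of eventual consistency, a job that writes to S3 and immediately reads back can fail. One common workaround (a generic sketch with a stand-in existence check, not the actual EMRFS mechanism) is to poll with backoff until a freshly written object becomes visible:

```python
import time

def wait_until_visible(exists, key, attempts=5, delay=1.0):
    """Poll an existence check (e.g. an S3 HEAD request) until a freshly
    written object becomes readable, backing off between attempts."""
    for attempt in range(attempts):
        if exists(key):
            return True
        time.sleep(delay * (2 ** attempt))  # exponential backoff
    return False

# usage with a stand-in check; with boto this would wrap bucket.get_key(key)
visible_keys = {"output/part-00000"}
print(wait_until_visible(lambda k: k in visible_keys, "output/part-00000"))
```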
http://wiki.apache.org/hadoop/AmazonS3
14. Digitraffic
Digitraffic is a service offering real-time
information and data about the traffic, weather
and condition information on the Finnish main
roads.
The service is provided by the Finnish Transport
Agency (Liikennevirasto), and produced by Gofore
and Infotripla.
15. Traffic fluency
Our data consists of traffic fluency information, i.e.
how quickly vehicles have been identified to pass
through a road segment (a link).
Data is gathered with camera-based Automatic
License Plate Recognition (ALPR), and more
recently with mobile-device-based Floating Car
Data (FCD).
19. Some numbers
6.5 years worth of data from January 2008 to June
2014
3.9 million XML files (525600 files per year)
6.3 GB of compressed archives (with 7.5GB of
additional median data as CSV)
42 GB of data as XML (and 13 GB as CSV)
20. Potential research questions
1. Do people drive faster during the night?
2. Does winter time have less recognitions (either
due to less cars or snowy plates)?
3. How well does the number of recognitions correlate
with speed (rush hour probably slows travel, but are
speeds higher on days with less traffic)?
4. Is it possible to identify speed limits from the
travel times? How much dispersion in speeds?
5. When do speed limits change (winter and summer
limits)?
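As a sketch of how question 1 could be attacked with Streaming, a mapper could emit (hour, speed) pairs and a reducer could average them per hour of day. The JSON field names below (`time`, `speed`) are hypothetical, not the real schema:

```python
import json
from collections import defaultdict

def map_record(line):
    """Map one JSON measurement line to an (hour, speed) pair.
    Field names 'time' and 'speed' are invented for illustration."""
    rec = json.loads(line)
    hour = int(rec["time"][11:13])  # "2014-06-01T03:10:00" -> 3
    return hour, float(rec["speed"])

def reduce_mean(pairs):
    """Average the speeds per hour of day."""
    sums = defaultdict(lambda: [0.0, 0])
    for hour, speed in pairs:
        sums[hour][0] += speed
        sums[hour][1] += 1
    return {h: s / n for h, (s, n) in sorted(sums.items())}

lines = [
    '{"time": "2014-06-01T03:10:00", "speed": 104.0}',
    '{"time": "2014-06-01T03:40:00", "speed": 98.0}',
    '{"time": "2014-06-01T16:05:00", "speed": 71.0}',
]
print(reduce_mean(map(map_record, lines)))  # -> {3: 101.0, 16: 71.0}
```

Comparing the per-hour means for night versus day hours would answer whether people drive faster during the night.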
22. The small files problem
Unpacked the tar.gz archives and uploaded the
XML files as such to S3 (using AWS CLI tools).
Turns out 4 million ~11 kB small files with Hadoop
is not fun: Hadoop does not cope well with files
significantly smaller than the HDFS block size
(default 64 MB) [1] [2] [3]
And well, XML is not fun either, so...
23. JSONify all the things!
Wrote scripts to parse, munge and upload data
Concatenated data into bigger files, calculated
some extra data, and converted it into JSON. Size
reduced to 60% of the original XML.
First munged 1-day files (10-20MB each) and later
1-month files (180-540MB each)
Munging 6.5 years' worth of XML takes 8.5 hours
on a single t2.medium instance
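The munging step might be sketched like this; the element and attribute names below are invented for illustration, not the real Digitraffic schema:

```python
import json
import xml.etree.ElementTree as ET

def munge(xml_text):
    """Convert one XML measurement document into compact JSON lines.
    Element/attribute names ('link', 'traveltime') are hypothetical."""
    root = ET.fromstring(xml_text)
    records = []
    for link in root.iter("link"):
        records.append({
            "link": link.get("id"),
            "traveltime": float(link.findtext("traveltime")),
        })
    # one JSON object per line concatenates cleanly into big monthly files
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

xml_text = """<measurement>
  <link id="310"><traveltime>124.5</traveltime></link>
  <link id="312"><traveltime>98.0</traveltime></link>
</measurement>"""
print(munge(xml_text))
```

Dropping XML tag overhead and whitespace is where the "60% of original size" saving largely comes from.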
37. Some statistics
We experimented with different input files and
cluster sizes.
Execution time was about half an hour with a small
cluster and 30 small (15-20 MB) files.
The same input parsed with a simple Python script
took about 5 minutes.
A larger cluster and 6 larger (500 MB) files took 17
minutes.
"Too small problem for EMR/Hadoop"
39. Takeaways
Make sure your problem is big enough for Hadoop
Munging wisely makes streaming programs easier
and faster
Always use Spot instances with EMR
40. Further reading
Ubuntu MaaS blog: Scaling a 2000-node Hadoop
cluster on EC2
Big Data Borat: "Quiz: Is it a Pokemon or a bigdata
technology?"
Amazon EMR Developer Guide
Amazon EMR Best practices