2. Agenda
1. Introduction to Hadoop Streaming and Elastic
MapReduce
2. Simple EMR web interface demo
3. Introduction to our dataset
4. Using EMR from command line with boto
All presentation material is available at
https://github.com/gofore/aws-emr
3. Hadoop Streaming
Utility that allows you to create and run
Map/Reduce jobs with any executable or script as
the mapper and/or the reducer.
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input my/Input/Directories \
  -output my/Output/Directory \
  -mapper myMapperProgram.py \
  -reducer myReducerProgram.py
cat input_data.txt | mapper.py | reducer.py > output_data.txt
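The pipe above can be emulated locally before paying for a cluster. A minimal sketch of a Streaming-style mapper and reducer pair (word count, with made-up input; in a real job the two functions would live in separate mapper.py and reducer.py scripts reading stdin):

```python
from itertools import groupby

def mapper(lines):
    # emit one tab-separated "word<TAB>1" record per word, as Streaming expects
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(records):
    # Streaming sorts mapper output by key before the reducer sees it,
    # so equal keys arrive in contiguous runs and groupby is enough
    for word, run in groupby(records, key=lambda r: r.split("\t")[0]):
        yield "%s\t%d" % (word, sum(int(r.split("\t")[1]) for r in run))

# local emulation of: cat input_data.txt | mapper.py | sort | reducer.py
input_lines = ["jams on ring road", "ring road clear"]
for record in reducer(sorted(mapper(input_lines))):
    print(record)
```

Note the explicit `sorted()` standing in for the shuffle/sort phase that Hadoop performs between the map and reduce stages.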
5. Amazon Elastic MapReduce
Hadoop-based MapReduce cluster as a service
Can run either Amazon-optimized Hadoop or
MapR
Managed from a web UI or through API
9. Cluster creation steps
Cluster: name, logging
Tags: keywords for the cluster
Software: Hadoop distribution and version, pre-
installed applications (Hive, Pig,...)
File System: encryption, consistency
Hardware: number and type of instances
Security and Access: ssh keys, node access roles
Bootstrap Actions: scripts to customize the cluster
Steps: a queue of MapReduce jobs for the cluster
11. Filesystems
EMRFS is an implementation of HDFS that reads
and writes files directly to and from S3.
HDFS should be used to cache results of
intermediate steps.
The s3 filesystem is block-based just like HDFS. The
s3n filesystem is file-based and can be accessed
with other tools, but file size is limited to 5 GB.
12. S3 is not a file system, it is a REST-like object
store.
S3 has eventual consistency: files written to S3
might not be immediately available for reading.
EMRFS can be configured to encrypt files in S3
and to monitor the consistency of files, which can
detect operations that try to use inconsistent files.
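Because of eventual consistency, a job that writes to S3 and immediately reads back can fail. One common workaround (a generic sketch with a stand-in existence check, not the actual EMRFS mechanism) is to poll with backoff until a freshly written object becomes visible:

```python
import time

def wait_until_visible(exists, key, attempts=5, delay=1.0):
    """Poll an existence check (e.g. an S3 HEAD request) until a freshly
    written object becomes readable, backing off between attempts."""
    for attempt in range(attempts):
        if exists(key):
            return True
        time.sleep(delay * (2 ** attempt))  # exponential backoff
    return False

# usage with a stand-in check; with boto this would wrap bucket.get_key(key)
visible_keys = {"output/part-00000"}
print(wait_until_visible(lambda k: k in visible_keys, "output/part-00000"))
```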
http://wiki.apache.org/hadoop/AmazonS3
14. Digitraffic
Digitraffic is a service offering real-time
information and data about the traffic, weather
and condition information on the Finnish main
roads.
The service is provided by the Finnish Transport
Agency (Liikennevirasto), and produced by Gofore
and Infotripla.
15. Traffic fluency
Our data consists of traffic fluency information, i.e.
how quickly vehicles have been identified to pass
through a road segment (a link).
Data is gathered with camera-based Automatic
License Plate Recognition (ALPR), and more
recently with mobile-device-based Floating Car
Data (FCD).
19. Some numbers
6.5 years worth of data from January 2008 to June
2014
3.9 million XML files (525600 files per year)
6.3 GB of compressed archives (with 7.5GB of
additional median data as CSV)
42 GB of data as XML (and 13 GB as CSV)
20. Potential research questions
1. Do people drive faster during the night?
2. Does winter time have less recognitions (either
due to less cars or snowy plates)?
3. How well does the number of recognitions correlate
with speed (rush hour probably slows travel, but are
speeds higher on days with less traffic)?
4. Is it possible to identify speed limits from the
travel times? How much dispersion in speeds?
5. When do speed limits change (winter and summer
limits)?
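As a sketch of how question 1 could be attacked with Streaming, a mapper could emit (hour, speed) pairs and a reducer could average them per hour of day. The JSON field names below (`time`, `speed`) are hypothetical, not the real schema:

```python
import json
from collections import defaultdict

def map_record(line):
    """Map one JSON measurement line to an (hour, speed) pair.
    Field names 'time' and 'speed' are invented for illustration."""
    rec = json.loads(line)
    hour = int(rec["time"][11:13])  # "2014-06-01T03:10:00" -> 3
    return hour, float(rec["speed"])

def reduce_mean(pairs):
    """Average the speeds per hour of day."""
    sums = defaultdict(lambda: [0.0, 0])
    for hour, speed in pairs:
        sums[hour][0] += speed
        sums[hour][1] += 1
    return {h: s / n for h, (s, n) in sorted(sums.items())}

lines = [
    '{"time": "2014-06-01T03:10:00", "speed": 104.0}',
    '{"time": "2014-06-01T03:40:00", "speed": 98.0}',
    '{"time": "2014-06-01T16:05:00", "speed": 71.0}',
]
print(reduce_mean(map(map_record, lines)))  # -> {3: 101.0, 16: 71.0}
```

Comparing the per-hour means for night versus day hours would answer whether people drive faster during the night.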
22. The small files problem
Unpacked the tar.gz archives and uploaded the
XML files as such to S3 (using AWS CLI tools).
Turns out 4 million ~11 kB small files with Hadoop
is not fun: Hadoop does not cope well with files
significantly smaller than the HDFS block size
(default 64 MB) [1] [2] [3]
And well, XML is not fun either, so...
23. JSONify all the things!
Wrote scripts to parse, munge and upload data
Concatenated data into bigger files, calculated
some extra data, and converted it into JSON. Size
reduced to 60% of the original XML.
First munged 1-day files (10-20MB each) and later
1-month files (180-540MB each)
Munging 6.5 years' worth of XML takes 8.5 hours
on a single t2.medium instance
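The munging step might be sketched like this; the element and attribute names below are invented for illustration, not the real Digitraffic schema:

```python
import json
import xml.etree.ElementTree as ET

def munge(xml_text):
    """Convert one XML measurement document into compact JSON lines.
    Element/attribute names ('link', 'traveltime') are hypothetical."""
    root = ET.fromstring(xml_text)
    records = []
    for link in root.iter("link"):
        records.append({
            "link": link.get("id"),
            "traveltime": float(link.findtext("traveltime")),
        })
    # one JSON object per line concatenates cleanly into big monthly files
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

xml_text = """<measurement>
  <link id="310"><traveltime>124.5</traveltime></link>
  <link id="312"><traveltime>98.0</traveltime></link>
</measurement>"""
print(munge(xml_text))
```

Dropping XML tag overhead and whitespace is where the "60% of original size" saving largely comes from.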
37. Some statistics
We experimented with different input files and
cluster sizes.
Execution time was about half an hour with a small
cluster and 30 small (15-20 MB) files.
The same input parsed with a simple Python script
took about 5 minutes.
A larger cluster and 6 larger (500 MB) files took 17
minutes.
"Too small problem for EMR/Hadoop"
39. Takeaways
Make sure your problem is big enough for Hadoop
Munging wisely makes streaming programs easier
and faster
Always use Spot instances with EMR
40. Further reading
Ubuntu MaaS blog: Scaling a 2000-node Hadoop
cluster on EC2
Big Data Borat: "Quiz: Is it a Pokemon or a bigdata
technology?"
Amazon EMR Developer Guide
Amazon EMR Best practices