Nordic Perl Workshop 2013

Playing with Hadoop
Søren Lund (slu)
slu@369.dk
DISCLAIMER

- I have no experience with Hadoop in a real-world project
- The installation notes I present are not necessarily suitable for production
- The example scripts have not been used on real (big) data
- Hence the title: Playing with Hadoop
About Hadoop (and Big Data)
The Problem (it's not new)

- We have (access to) more and more data
- Processing this data takes longer and longer
- Not enough memory
- Running out of disk space
- Our trusty old server can't keep up
Scaling up

- Upgrade hardware: bigger and faster
- Redundancy: power supply, RAID, hot-swap
- Expensive to keep scaling up
- Our software will run without modifications
Scaling out

- Add more (commodity) servers
- Redundancy is replaced by replication
- You can keep on scaling out; it's cheap
- How do we enable our software to run across multiple servers?
Google solved this

- Google published two papers
  - Google File System (GFS), 2003
    http://research.google.com/archive/gfs.html
  - MapReduce, 2004
    http://research.google.com/archive/mapreduce.html
- GFS and MapReduce provided a platform for processing huge amounts of data in an efficient way
Hadoop was born

- Doug Cutting read the Google papers
- Based on those, he created Hadoop (named after his son's toy elephant)
- It is an implementation of GFS/MapReduce (Open Source / Apache License)
- Written in Java and deployed on Linux
- First part of Lucene, now an Apache project
- https://hadoop.apache.org/
Hadoop Components

- Hadoop Common – utilities that support the other modules
- HDFS – Hadoop Distributed File System
- YARN – Yet Another Resource Negotiator
- MapReduce – YARN-based parallel processing
- This enables us to write software that can handle Big Data by scaling out
Big Data isn't just big

- Huge amounts of data (volume)
- Unstructured data (form)
- Highly dynamic data (burst/change rate)
- Big Data is really any data that is hard to handle with traditional tools and methods
Examples of Big Data

- Log files, e.g.
  - web server access logs
  - application logs
- Internet feeds
  - Twitter, Facebook, etc.
  - RSS
- Images (face recognition, tagging)
Installing Hadoop
Needed to run Hadoop

- You need the following to run Hadoop:
  - Java JDK
  - Linux server
  - Hadoop tarball
- I'm using the following:
  - JDK 1.6.24, 64 bit
  - Ubuntu 12.04 LTS, 64 bit
  - Hadoop 1.0.4
- Could not get JDK 7 + Hadoop 2.2 to work
Install Java
Set up Java home and path
Add hadoop user
Create SSH key for hadoop user
Accept SSH key
Install Hadoop and add to path
Disable IPv6
Reboot and check installation
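
The steps above, condensed into a shell sketch for Ubuntu 12.04 and Hadoop 1.0.4 (package names, paths and flags are my assumptions; the talk itself used the Sun JDK 1.6):

    # install Java (openjdk-6 here is an assumption)
    sudo apt-get install openjdk-6-jdk

    # set Java home and path, e.g. in /etc/profile or ~/.bashrc
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
    export PATH=$JAVA_HOME/bin:$PATH

    # add a dedicated hadoop user
    sudo adduser hadoop

    # as the hadoop user: create an SSH key and accept it
    # (Hadoop starts its daemons over ssh, even on one node)
    su - hadoop
    ssh-keygen -t rsa -P ''
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    ssh localhost true   # answer "yes" once to accept the host key
    exit

    # install Hadoop and add it to the path
    sudo tar xzf hadoop-1.0.4.tar.gz -C /usr/local
    export HADOOP_HOME=/usr/local/hadoop-1.0.4
    export PATH=$PATH:$HADOOP_HOME/bin

    # disable IPv6 (Hadoop 1.x tends to bind to the wrong interface
    # with IPv6 enabled), then reboot and check the installation
    echo 'net.ipv6.conf.all.disable_ipv6 = 1' | sudo tee -a /etc/sysctl.conf
    sudo reboot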
Running an example job
Calculate Pi
Estimated value of Pi
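
The Pi job is run from the examples jar bundled with the tarball (the arguments here are an example: 10 maps with 100 samples each; more samples give a better estimate):

    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar pi 10 100

The last line of the job output reports the estimated value of Pi.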
Three modes of operation

- Pi was calculated in local standalone mode
  - it is the default mode (i.e. no configuration needed)
  - all components of Hadoop run in a single JVM
- Pseudo-distributed mode
  - components communicate using sockets
  - a separate JVM is spawned for each component
  - it is a mini-cluster on a single host
- Fully distributed mode
  - components are spread across multiple machines
Create base directory for HDFS
Set JAVA_HOME
Edit core-site.xml
Edit hdfs-site.xml
Edit mapred-site.xml
Log out and log on as hadoop
Format HDFS
Start HDFS
Start MapReduce
Create home directory & test data
Running Word Count
First let's try the example jar
Inspect the result
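
Running the bundled word count against the test data and inspecting the output looks roughly like this (directory names are my assumptions):

    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar wordcount input output

    hadoop fs -ls output                # _SUCCESS plus one part file per reducer
    hadoop fs -cat output/part-r-00000  # word<TAB>count, one word per line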
Compile and run our own jar
https://gist.github.com/soren/7213273
Inspect result
Run improved version
https://gist.github.com/soren/7213453
Inspect (improved) result
Hadoop MapReduce

- A reducer will get all values associated with a given key
- A precursor job can be used to normalize data
- Combiners can be used to perform early aggregation of map output before it is sent to the reducer
Perl MapReduce
Playing with MapReduce

- We don't need Hadoop to play with MapReduce
- Instead we can emulate Hadoop using two scripts:
  - wc_mapper.pl – a Word Count Mapper
  - wc_reducer.pl – a Word Count Reducer
- We connect them using a pipe (|), as sketched below
- Very Unix-like!
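
A minimal sketch of the two scripts (the real versions are in the gists on the next slide):

    #!/usr/bin/perl
    # wc_mapper.pl - emit "word<TAB>1" for every word on STDIN
    use strict;
    use warnings;

    while (my $line = <STDIN>) {
        print "$_\t1\n" for split ' ', $line;
    }

    #!/usr/bin/perl
    # wc_reducer.pl - sum the 1s per word; assumes input sorted by key,
    # which is what sort(1) gives us here and what Hadoop guarantees
    use strict;
    use warnings;

    my ($word, $count) = (undef, 0);
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        if (defined $word && $key ne $word) {
            print "$word\t$count\n";
            $count = 0;
        }
        $word = $key;
        $count += $value;
    }
    print "$word\t$count\n" if defined $word;

The sort(1) in the middle of the pipe plays the role of Hadoop's shuffle phase:

    ./wc_mapper.pl < input.txt | sort | ./wc_reducer.pl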
Run MapReduce without Hadoop
https://gist.github.com/soren/7596270
https://gist.github.com/soren/7596285
Hadoop's Streaming interface

- Enables you to write jobs in any programming language, e.g. Perl
- Input from STDIN
- Output to STDOUT
- Key/value pairs separated by TAB
- Reducers will get values one by one
- Not to be confused with Hadoop Pipes, which provides a native C++ interface to Hadoop
Run Perl Word Count
https://gist.github.com/soren/7596270

https://gist.github.com/soren/7596285
Inspect result
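
Roughly what these slides amount to with Hadoop 1.0.4 (the output path is my choice; -file ships the scripts to the task nodes):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
        -input input -output wc_out \
        -mapper wc_mapper.pl -reducer wc_reducer.pl \
        -file wc_mapper.pl -file wc_reducer.pl

    hadoop fs -cat wc_out/part-00000   # streaming jobs use the old part-NNNNN naming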
Hadoop::Streaming

- Perl interface to Hadoop's Streaming interface
- Implemented in Moose
- You can now implement your MapReduce as
  - a class with a map() and a reduce() method
  - a mapper script
  - a reducer script
Installing Hadoop::Streaming

- Btw, Perl was already installed on the server ;-)
- But we want to install Hadoop::Streaming
- I also had to install local::lib to make it work
- All you have to do is:
    sudo cpan local::lib Hadoop::Streaming
- Nice and easy
Run Hadoop::Streaming job
https://gist.github.com/soren/7596451
https://gist.github.com/soren/7600134
https://gist.github.com/soren/7600144
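
A word count in the shape Hadoop::Streaming expects, following the module's documented pattern (a sketch; the author's actual code is in the gists above):

    package WordCount::Mapper;
    use Moose;
    with 'Hadoop::Streaming::Mapper';

    # called once per input line; emit a (word, 1) pair per word
    sub map {
        my ($self, $line) = @_;
        $self->emit($_ => 1) for split ' ', $line;
    }

    package WordCount::Reducer;
    use Moose;
    with 'Hadoop::Streaming::Reducer';

    # called once per key, with an iterator over that key's values
    sub reduce {
        my ($self, $key, $values) = @_;
        my $count = 0;
        while ($values->has_next) {
            $count += $values->next;
        }
        $self->emit($key => $count);
    }

The mapper and reducer scripts then just call WordCount::Mapper->run and WordCount::Reducer->run, reading STDIN and writing STDOUT like any other streaming job.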
Inspect result
Some final notes and loose ends
The Web User Interface

- HDFS: http://localhost:8070/
- MapReduce: http://localhost:8030/
- File Browser: http://localhost:8075/browseDirectory.jsp?namenodeInfo
- Note: this is with port forwarding in VirtualBox
  - 50030 → 8030, 50070 → 8070, 50075 → 8075
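
That forwarding can be set up with VBoxManage (the VM name "hadoop-vm" is an assumption):

    VBoxManage modifyvm hadoop-vm --natpf1 "jobtracker,tcp,,8030,,50030"
    VBoxManage modifyvm hadoop-vm --natpf1 "namenode,tcp,,8070,,50070"
    VBoxManage modifyvm hadoop-vm --natpf1 "datanode,tcp,,8075,,50075"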
Joins in Hadoop

- It's possible to implement joins in MapReduce
  - Reduce-joins – simple
  - Map-joins – less data to transfer
- Do you need joins?
- Maybe your data has structure → SQL?
  - Try Hive (HiveQL)
  - Or Pig (Pig Latin)
Hadoop in the Cloud

- Elastic MapReduce (EMR)
  http://aws.amazon.com/elasticmapreduce/
- Essentially Hadoop in the Cloud
- Built on EC2 and S3
- You can upload JARs or scripts
There's more

- Distributions
  - Cloudera Distribution for Hadoop (CDH)
    http://www.cloudera.com/
  - Hortonworks Data Platform (HDP)
    http://hortonworks.com/
- HBase, Hive, Pig and other related projects
  https://hadoop.apache.org/
- But a basic Hadoop setup is a good start
  - and a nice place to just play with Hadoop
I like big data and I can not lie
Oh, my God, Becky, look at the data, it's so big
It looks like one of those Hadoop guys setups
Who understands those Hadoop guys
They only map/reduce it because it is on a
distributed file system
I mean the data, it's just so big
I can't believe it's so huge
It's just out there, I mean, it's gross
Look, it's just so blah
The End

Questions?

Slides will be available at http://www.slideshare.net/slu/
Find me on Twitter https://twitter.com/slu

Playing with Hadoop (NPW2013)