Nordic Perl Workshop 2013

Playing with Hadoop
Søren Lund (slu)
slu@369.dk
DISCLAIMER

- I have no experience with Hadoop in a real-world project
- The installation notes I present are not necessarily suitable for production
- The example scripts have not been used on real (big) data
- Hence the title: Playing with Hadoop
About Hadoop (and Big Data)
The Problem (it's not new)

- We have (access to) more and more data
- Processing this data takes longer and longer
- Not enough memory
- Running out of disk space
- Our trusty old server can't keep up
Scaling up

- Upgrade hardware: bigger and faster
- Redundancy: power supply, RAID, hot-swap
- Expensive to keep scaling up
- Our software will run without modifications
Scaling out

- Add more (commodity) servers
- Redundancy is replaced by replication
- You can keep on scaling out; it's cheap
- How do we enable our software to run across multiple servers?
Google solved this

- Google published two papers
  - Google File System (GFS), 2003
    http://research.google.com/archive/gfs.html
  - MapReduce, 2004
    http://research.google.com/archive/mapreduce.html
- GFS and MapReduce provided a platform for processing huge amounts of data in an efficient way
Hadoop was born

- Doug Cutting read the Google papers
- Based on those, he created Hadoop (named after his son's toy elephant)
- It is an implementation of GFS/MapReduce (Open Source / Apache License)
- Written in Java and deployed on Linux
- First part of Lucene, now an Apache project
- https://hadoop.apache.org/
Hadoop Components

- Hadoop Common – utilities that support the other modules
- HDFS – Hadoop Distributed File System
- YARN – Yet Another Resource Negotiator
- MapReduce – YARN-based parallel processing
- This enables us to write software that can handle Big Data by scaling out
Big Data isn't just big

- Huge amounts of data (volume)
- Unstructured data (form)
- Highly dynamic data (burst/change rate)
- Big Data is really any data that is hard to handle with traditional tools and methods
Examples of Big Data

- Log files, e.g.
  - web server access logs
  - application logs
- Internet feeds
  - Twitter, Facebook, etc.
  - RSS
- Images (face recognition, tagging)
Installing Hadoop
Needed to run Hadoop

- You need the following to run Hadoop:
  - Java JDK
  - Linux server
  - Hadoop tarball
- I'm using the following:
  - JDK 1.6.24, 64 bit
  - Ubuntu 12.04 LTS, 64 bit
  - Hadoop 1.0.4
- Could not get JDK 7 + Hadoop 2.2 to work
Install Java
Set up Java home and path
Add hadoop user
Create SSH key for hadoop user
Accept SSH key
Install Hadoop and add to path
Disable IPv6
Reboot and check installation
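
The steps above, condensed into a shell sketch for Ubuntu 12.04 and Hadoop 1.0.4 (package names, paths and flags are my assumptions; the talk itself used the Sun JDK 1.6):

    # install Java (openjdk-6 here is an assumption)
    sudo apt-get install openjdk-6-jdk

    # set Java home and path, e.g. in /etc/profile or ~/.bashrc
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
    export PATH=$JAVA_HOME/bin:$PATH

    # add a dedicated hadoop user
    sudo adduser hadoop

    # as the hadoop user: create an SSH key and accept it
    # (Hadoop starts its daemons over ssh, even on one node)
    su - hadoop
    ssh-keygen -t rsa -P ''
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    ssh localhost true   # answer "yes" once to accept the host key
    exit

    # install Hadoop and add it to the path
    sudo tar xzf hadoop-1.0.4.tar.gz -C /usr/local
    export HADOOP_HOME=/usr/local/hadoop-1.0.4
    export PATH=$PATH:$HADOOP_HOME/bin

    # disable IPv6 (Hadoop 1.x tends to bind to the wrong interface
    # with IPv6 enabled), then reboot and check the installation
    echo 'net.ipv6.conf.all.disable_ipv6 = 1' | sudo tee -a /etc/sysctl.conf
    sudo reboot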
Running an example job
Calculate Pi
Estimated value of Pi
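
The Pi job is run from the examples jar bundled with the tarball (the arguments here are an example: 10 maps with 100 samples each; more samples give a better estimate):

    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar pi 10 100

The last line of the job output reports the estimated value of Pi.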
Three modes of operation

- Pi was calculated in local standalone mode
  - it is the default mode (i.e. no configuration needed)
  - all components of Hadoop run in a single JVM
- Pseudo-distributed mode
  - components communicate using sockets
  - a separate JVM is spawned for each component
  - it is a mini-cluster on a single host
- Fully distributed mode
  - components are spread across multiple machines
Create base directory for HDFS
Set JAVA_HOME
Edit core-site.xml
Edit hdfs-site.xml
Edit mapred-site.xml
Log out and log on as hadoop
Format HDFS
Start HDFS
Start MapReduce
Create home directory & test data
Running Word Count
First let's try the example jar
Inspect the result
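
Running the bundled word count against the test data and inspecting the output looks roughly like this (directory names are my assumptions):

    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar wordcount input output

    hadoop fs -ls output                # _SUCCESS plus one part file per reducer
    hadoop fs -cat output/part-r-00000  # word<TAB>count, one word per line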
Compile and run our own jar
https://gist.github.com/soren/7213273
Inspect result
Run improved version
https://gist.github.com/soren/7213453
Inspect (improved) result
Hadoop MapReduce

- A reducer will get all values associated with a given key
- A precursor job can be used to normalize data
- Combiners can be used to perform early aggregation of map output before it is sent to the reducer
Perl MapReduce
Playing with MapReduce

- We don't need Hadoop to play with MapReduce
- Instead we can emulate Hadoop using two scripts:
  - wc_mapper.pl – a Word Count Mapper
  - wc_reducer.pl – a Word Count Reducer
- We connect them using a pipe (|), as sketched below
- Very Unix-like!
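
A minimal sketch of the two scripts (the real versions are in the gists on the next slide):

    #!/usr/bin/perl
    # wc_mapper.pl - emit "word<TAB>1" for every word on STDIN
    use strict;
    use warnings;

    while (my $line = <STDIN>) {
        print "$_\t1\n" for split ' ', $line;
    }

    #!/usr/bin/perl
    # wc_reducer.pl - sum the 1s per word; assumes input sorted by key,
    # which is what sort(1) gives us here and what Hadoop guarantees
    use strict;
    use warnings;

    my ($word, $count) = (undef, 0);
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line;
        if (defined $word && $key ne $word) {
            print "$word\t$count\n";
            $count = 0;
        }
        $word = $key;
        $count += $value;
    }
    print "$word\t$count\n" if defined $word;

The sort(1) in the middle of the pipe plays the role of Hadoop's shuffle phase:

    ./wc_mapper.pl < input.txt | sort | ./wc_reducer.pl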
Run MapReduce without Hadoop
https://gist.github.com/soren/7596270
https://gist.github.com/soren/7596285
Hadoop's Streaming interface

- Enables you to write jobs in any programming language, e.g. Perl
- Input from STDIN
- Output to STDOUT
- Key/value pairs separated by TAB
- Reducers will get values one by one
- Not to be confused with Hadoop Pipes, which provides a native C++ interface to Hadoop
Run Perl Word Count
https://gist.github.com/soren/7596270

https://gist.github.com/soren/7596285
Inspect result
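
Roughly what these slides amount to with Hadoop 1.0.4 (the output path is my choice; -file ships the scripts to the task nodes):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
        -input input -output wc_out \
        -mapper wc_mapper.pl -reducer wc_reducer.pl \
        -file wc_mapper.pl -file wc_reducer.pl

    hadoop fs -cat wc_out/part-00000   # streaming jobs use the old part-NNNNN naming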
Hadoop::Streaming

- Perl interface to Hadoop's Streaming interface
- Implemented in Moose
- You can now implement your MapReduce as
  - a class with a map() and a reduce() method
  - a mapper script
  - a reducer script
Installing Hadoop::Streaming

- Btw, Perl was already installed on the server ;-)
- But we want to install Hadoop::Streaming
- I also had to install local::lib to make it work
- All you have to do is:
    sudo cpan local::lib Hadoop::Streaming
- Nice and easy
Run Hadoop::Streaming job
https://gist.github.com/soren/7596451
https://gist.github.com/soren/7600134
https://gist.github.com/soren/7600144
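
A word count in the shape Hadoop::Streaming expects, following the module's documented pattern (a sketch; the author's actual code is in the gists above):

    package WordCount::Mapper;
    use Moose;
    with 'Hadoop::Streaming::Mapper';

    # called once per input line; emit a (word, 1) pair per word
    sub map {
        my ($self, $line) = @_;
        $self->emit($_ => 1) for split ' ', $line;
    }

    package WordCount::Reducer;
    use Moose;
    with 'Hadoop::Streaming::Reducer';

    # called once per key, with an iterator over that key's values
    sub reduce {
        my ($self, $key, $values) = @_;
        my $count = 0;
        while ($values->has_next) {
            $count += $values->next;
        }
        $self->emit($key => $count);
    }

The mapper and reducer scripts then just call WordCount::Mapper->run and WordCount::Reducer->run, reading STDIN and writing STDOUT like any other streaming job.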
Inspect result
Some final notes and loose ends
The Web User Interface

- HDFS: http://localhost:8070/
- MapReduce: http://localhost:8030/
- File Browser: http://localhost:8075/browseDirectory.jsp?namenodeInfo
- Note: this is with port forwarding in VirtualBox
  - 50030 → 8030, 50070 → 8070, 50075 → 8075
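
That forwarding can be set up with VBoxManage (the VM name "hadoop-vm" is an assumption):

    VBoxManage modifyvm hadoop-vm --natpf1 "jobtracker,tcp,,8030,,50030"
    VBoxManage modifyvm hadoop-vm --natpf1 "namenode,tcp,,8070,,50070"
    VBoxManage modifyvm hadoop-vm --natpf1 "datanode,tcp,,8075,,50075"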
Joins in Hadoop

- It's possible to implement joins in MapReduce
  - Reduce-joins – simple
  - Map-joins – less data to transfer
- Do you need joins?
- Maybe your data has structure → SQL?
  - Try Hive (HiveQL)
  - Or Pig (Pig Latin)
Hadoop in the Cloud

- Elastic MapReduce (EMR)
  http://aws.amazon.com/elasticmapreduce/
- Essentially Hadoop in the Cloud
- Built on EC2 and S3
- You can upload JARs or scripts
There's more

- Distributions
  - Cloudera Distribution for Hadoop (CDH)
    http://www.cloudera.com/
  - Hortonworks Data Platform (HDP)
    http://hortonworks.com/
- HBase, Hive, Pig and other related projects
  https://hadoop.apache.org/
- But a basic Hadoop setup is a good start
  - and a nice place to just play with Hadoop
I like big data and I can not lie
Oh, my God, Becky, look at the data, it's so big
It looks like one of those Hadoop guys setups
Who understands those Hadoop guys
They only map/reduce it because it is on a
distributed file system
I mean the data, it's just so big
I can't believe it's so huge
It's just out there, I mean, it's gross
Look, it's just so blah
The End

Questions?

Slides will be available at http://www.slideshare.net/slu/
Find me on Twitter https://twitter.com/slu

Playing with Hadoop (NPW2013)