Apache Hadoop
The elephant in the room
C. Aaron Cois, Ph.D.
Me
@aaroncois
www.codehenge.net
Love to chat!
The Problem
Large-Scale Computation
• Traditionally, large computation was focused on
– Complex, CPU-intensive calculations
– Relatively small data sets
• Examples:
– Solving complex differential equations
– Calculating digits of Pi
Parallel Processing
• Distributed systems allow scalable computation (more processors, working simultaneously)
[diagram: INPUT fanned out across parallel processors, recombined into OUTPUT]
Data Storage
• Data is often stored on a SAN
• Data is copied to each compute node at compute time
• This works well for small amounts of data, but requires significant copy time for large data sets
[diagram: data copied from the SAN to each compute node before calculation can begin]
You must first distribute data each time you run a computation…
How much data?
over 25 PB of data
over 100 PB of data
The internet
IDC estimates[2] the internet contains at least:
1 Zettabyte
or
1,000 Exabytes
or
1,000,000 Petabytes
2 http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)
How much time?
Disk Transfer Rates:
• Standard 7200 RPM drive: 128.75 MB/s
=> 7.7 secs/GB
=> 13 mins/100 GB
=> > 2 hours/TB
=> 90 days/PB
1 http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
How much time?
Fastest network transfer rates:
• iSCSI over 100 Gb Ethernet (theoretical)
– 12.5 GB/s => 80 sec/TB, 1333 min/PB
Ok, ignore the network bottleneck:
• HyperTransport bus
– 51.2 GB/s => 19 sec/TB, 325 min/PB
1 http://en.wikipedia.org/wiki/List_of_device_bit_rates
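As a back-of-the-envelope check of the figures on these two slides (decimal units, using the 128.75 MB/s disk rate and the 12.5 GB/s network rate above):

\[
\frac{1\ \mathrm{PB}}{128.75\ \mathrm{MB/s}} = \frac{10^{9}\ \mathrm{MB}}{128.75\ \mathrm{MB/s}} \approx 7.77\times 10^{6}\ \mathrm{s} \approx 90\ \text{days}
\]
\[
\frac{1\ \mathrm{TB}}{12.5\ \mathrm{GB/s}} = \frac{1000\ \mathrm{GB}}{12.5\ \mathrm{GB/s}} = 80\ \mathrm{s} \;\Rightarrow\; 8\times 10^{4}\ \mathrm{s/PB} \approx 1333\ \mathrm{min/PB}
\]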
We need a better plan
• Sending data to distributed processors is the bottleneck
• So what if we sent the processors to the data?
Core concept:
Pre-distribute and store the data. Assign compute nodes to operate on local data.
The Solution
Distributed Data Servers
[diagram: a pool of distributed data servers]
Distribute the Data
[diagram: data blocks distributed and replicated across the servers]
Send computation code to the servers containing the relevant data
[diagram: compute tasks dispatched to the servers that already hold the data]
Hadoop Origin
• Hadoop was modeled after innovative systems created by Google
• Designed to handle massive (web-scale) amounts of data
Fun Fact: Hadoop’s creator, Doug Cutting, named it after his son’s stuffed elephant
Hadoop Goals
• Store massive data sets
• Enable distributed computation
• Heavy focus on
– Fault tolerance
– Data integrity
– Commodity hardware
Hadoop System
Google          Hadoop
GFS         →   HDFS
MapReduce   →   Hadoop MapReduce
BigTable    →   HBase
Hadoop Components
HDFS
• “Hadoop Distributed File System”
• Sits on top of a native filesystem (ext3, etc.)
• Stores data in files, replicated and distributed across data nodes
• Files are “write once”
• Performs best with millions of ~100MB+ files
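To make this concrete, here is a minimal sketch of writing a file through HDFS’s Java FileSystem API and checking that it exists. The NameNode address and file path are made-up placeholders, and the config key shown is the Hadoop 2.x name (it was fs.default.name in 1.x):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; real clusters set this in core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // "Write once": the file cannot be modified in place after closing
        Path file = new Path("/data/example.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }

        // Behind the scenes, the NameNode records which DataNodes
        // hold each (replicated) block of this file
        System.out.println("exists: " + fs.exists(file));
        fs.close();
    }
}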
HDFS
Files are split into blocks for storage
Datanodes
– Data blocks are distributed/replicated across datanodes
Namenode
– The master node
– Keeps track of the locations of data blocks
HDFS
Multi-Node Cluster
[diagram: the master runs the NameNode; each slave runs a DataNode]
MapReduce
A programming model
– Designed to make programming parallel computation over large distributed data sets easy
– Each node processes data already residing on it (when possible)
– Inspired by functional programming’s map and reduce functions
MapReduce
JobTracker
– Runs on a master node
– Clients submit jobs to the JobTracker
– Assigns Map and Reduce tasks to slave nodes
TaskTracker
– Runs on every slave node
– Daemon that instantiates Map or Reduce tasks and reports results to the JobTracker
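To make job submission concrete, here is a minimal driver sketch in Java using the org.apache.hadoop.mapreduce API; under MRv1 the submitted job goes to the JobTracker, which hands Map and Reduce tasks to TaskTrackers. WordCountMapper and WordCountReducer are assumed classes, sketched after the MapReduce example later in the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);      // ship this jar to the cluster
        job.setMapperClass(WordCountMapper.class);     // assumed class, sketched later
        job.setReducerClass(WordCountReducer.class);   // assumed class, sketched later
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}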
MapReduce
Multi-Node Cluster
[diagram: the master runs the JobTracker; each slave runs a TaskTracker]
MapReduce Layer / HDFS Layer
Multi-Node Cluster
[diagram: the master runs the JobTracker (MapReduce layer) and the NameNode (HDFS layer); each slave runs a TaskTracker and a DataNode]
HBase
• Hadoop’s Database
• Sits on top of HDFS
• Provides random read/write access to Very Large™ tables
– Billions of rows, millions of columns
• Accessible via Java, Jython, Groovy, Scala, or a REST web service
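As a sketch of that random read/write access from Java, using the classic Hadoop-1-era HTable client API (newer HBase versions use ConnectionFactory/Table instead); the table name "webtable" and column family "content" are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable"); // hypothetical table

        // Random write: one cell at (row, family:qualifier)
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                Bytes.toBytes("hello hbase"));
        table.put(put);

        // Random read of the same cell
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("content"),
                                       Bytes.toBytes("html"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}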
A Typical Hadoop Cluster
• Consists entirely of commodity ~$5k servers
• 1 master, 1 to 1,000+ slaves
• Scales linearly as more processing nodes are added
How it works
Traditional MapReduce
[image credit: http://en.wikipedia.org/wiki/MapReduce]
Hadoop MapReduce
[image credit: http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854]
MapReduce Example
function map(String name, String document):
  for each word w in document:
    emit(w, 1)

function reduce(String word, Iterator partialCounts):
  sum = 0
  for each pc in partialCounts:
    sum += ParseInt(pc)
  emit(word, sum)
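The same word count, sketched against Hadoop’s Java API so the correspondence with the pseudocode is visible; class and variable names are my own, and this pairs with the driver sketch shown earlier:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "for each word w in document: emit(w, 1)"
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);
        }
    }
}

class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // "sum += pc for each partial count, then emit(word, sum)"
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(key, new IntWritable(sum));
    }
}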
What didn’t I worry about?
• Data distribution
• Node management
• Concurrency
• Error handling
• Node failure
• Load balancing
• Data replication/integrity
Demo
Try the demo yourself!
Go to:
https://github.com/cacois/vagrant-hadoop-cluster
Follow the instructions in the README


Editor's Notes

  • #14 Note: This study was from 2007. I don’t know if there’s a Moore’s Law of growth of data on the internet, but I expect this is a much larger number now.
  • #23 This is not a supercomputer, and it’s not intended to be. Google’s approach was always to use a lot of cheap, expendable commodity servers, rather than be beholden to expensive, custom hardware and vendors. What they knew was software, so they leaned on that expertise to produce a solution.