4. Large-Scale Computation
• Traditionally, large-scale computation
focused on
– Complex, CPU-intensive calculations
– Relatively small data sets
• Examples:
– Calculate complex differential equations
– Calculate digits of Pi
6. Data Storage
• Data is often stored on a SAN
• Data is copied to each compute node
at compute time
• This works well for small amounts of
data, but requires significant copy
time for large data sets
13. The internet
IDC estimates[2] the internet contains at
least:
1 Zettabyte
or
1,000 Exabytes
or
1,000,000 Petabytes
2 http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)
14. How much time?
Disk Transfer Rates:
• Standard 7200 RPM drive
128.75 MB/s
=> 7.7 secs/GB
=> 13 mins/100 GB
=> > 2 hours/TB
=> 90 days/PB
1 http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
15. How much time?
Fastest Network Xfer rate:
• iSCSI over 100 Gb Ethernet (theoretical)
– 12.5 GB/s => 80 sec/TB, 1,333 min/PB
Ok, ignore network bottleneck:
• HyperTransport bus
– 51.2 GB/s => 19 sec/TB, 325 min/PB
1 http://en.wikipedia.org/wiki/List_of_device_bit_rates
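All of these figures come from the same calculation: time = data size / transfer rate. A minimal sketch (not part of the original deck) that reproduces the numbers above, assuming decimal units (1 TB = 1,000,000 MB) and the rates quoted on the previous two slides:

public class TransferTime {
    // Prints seconds per terabyte and minutes per petabyte for a throughput in MB/s
    static void report(String label, double mbPerSec) {
        double secPerTB = 1_000_000 / mbPerSec;      // 1 TB = 1,000,000 MB
        double minPerPB = secPerTB * 1000 / 60;      // 1 PB = 1,000 TB
        System.out.printf("%s: %.0f sec/TB, %.0f min/PB%n", label, secPerTB, minPerPB);
    }
    public static void main(String[] args) {
        report("7200 RPM disk (128.75 MB/s)", 128.75);    // ~7,770 sec/TB (~2.2 h), ~90 days/PB
        report("100 Gb Ethernet (12,500 MB/s)", 12_500);  // ~80 sec/TB, ~1,333 min/PB
        report("HyperTransport (51,200 MB/s)", 51_200);   // ~20 sec/TB, ~325 min/PB
    }
}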
16. We need a better plan
• Sending data to distributed processors is
the bottleneck
• So what if we sent the processors to the
data?
Core concept:
Pre-distribute and store the data.
Assign compute nodes to operate on local
data.
21. Hadoop Origin
• Hadoop was modeled after innovative
systems created by Google
• Designed to handle massive (web-
scale) amounts of data
Fun Fact: Hadoop’s creator
named it after his son’s stuffed
elephant
22. Hadoop Goals
• Store massive data sets
• Enable distributed computation
• Heavy focus on
– Fault tolerance
– Data integrity
– Commodity hardware
26. HDFS
• “Hadoop Distributed File System”
• Sits on top of native filesystem
– ext3, etc
• Stores data in files, replicated and
distributed across data nodes
• Files are “write once”
• Performs best with millions of ~100MB+
files
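HDFS is reached from Java through the FileSystem API. A minimal sketch of writing one (write-once) file; the NameNode address and path below are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // hypothetical cluster address
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/example.txt");         // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {   // files are write-once
            out.writeUTF("hello hdfs");
        }
    }
}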
27. HDFS
Files are split into blocks for storage
Datanodes
– Data blocks are distributed/replicated
across datanodes
Namenode
– The master node
– Keeps track of location of data blocks
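The namenode's block-location metadata is visible through the same API. A small sketch (file path hypothetical) that asks which datanodes hold each block of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt")); // hypothetical path
        // One entry per block; each block lists the datanodes holding a replica
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(",", block.getHosts()));
        }
    }
}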
29. MapReduce
A programming model
– Designed to make programming parallel
computation over large distributed data
sets easy
– Each node processes data already
residing on it (when possible)
– Inspired by functional programming map
and reduce functions
30. MapReduce
JobTracker
– Runs on a master node
– Clients submit jobs to the JobTracker
– Assigns Map and Reduce tasks to slave
nodes
TaskTracker
– Runs on every slave node
– Daemon that instantiates Map or Reduce
tasks and reports results to JobTracker
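A sketch of the client side of this flow: a driver that configures a word-count job and submits it to the cluster, which then schedules the Map and Reduce tasks. This uses the newer org.apache.hadoop.mapreduce API; paths are hypothetical, and the Mapper/Reducer classes are defined with the example on slide 38:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // see the sketch on slide 38
        job.setReducerClass(WordCountReducer.class);   // see the sketch on slide 38
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // hypothetical HDFS paths
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);          // blocks until the job finishes
    }
}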
33. HBase
• Hadoop’s Database
• Sits on top of HDFS
• Provides random read/write access to
Very Large™ tables
– Billions of rows, billions of columns
• Access via
Java, Jython, Groovy, Scala, or REST
web service
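From Java, access goes through the HBase client API. A minimal sketch of one random write and one random read, using the HBase 1.x-style client; the table, column family, and ZooKeeper quorum below are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1");                       // hypothetical quorum
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);                                              // random write
            Result result = table.get(new Get(Bytes.toBytes("row-42"))); // random read
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}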
34. A Typical Hadoop Cluster
• Consists entirely of commodity ~$5k
servers
• 1 master, 1 -> 1000+ slaves
• Scales linearly as more processing
nodes are added
38. MapReduce Example
function map(Str name, Str document):
  // name: document name; document: document contents
  for each word w in document:
    emit(w, 1)
function reduce(Str word, Iter partialCounts):
  // word: a word; partialCounts: the counts emitted for that word
  sum = 0
  for each pc in partialCounts:
    sum += pc
  emit(word, sum)
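In Hadoop's Java API the same logic looks roughly like the sketch below (a word-count Mapper and Reducer against the org.apache.hadoop.mapreduce classes; shown together for brevity, class names match the hypothetical driver on slide 30):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String w : line.toString().split("\\s+")) {  // emit (word, 1) for each word
            if (!w.isEmpty()) {
                word.set(w);
                context.write(word, ONE);
            }
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> partialCounts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable pc : partialCounts) {            // sum the partial counts per word
            sum += pc.get();
        }
        context.write(word, new IntWritable(sum));
    }
}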
39. What didn’t I worry about?
• Data distribution
• Node management
• Concurrency
• Error handling
• Node failure
• Load balancing
• Data replication/integrity
41. Try the demo yourself!
Go to:
https://github.com/cacois/vagrant-hadoop-cluster
Follow the instructions in the README
Editor's Notes
Note: This study was from 2007. I don’t know if there’s a Moore’s Law of growth of data on the internet, but I expect this is a much larger number now.
This is not a supercomputer, and it's not intended to be. Google's approach was always to use a lot of cheap, expendable commodity servers, rather than be beholden to expensive, custom hardware and vendors. What they knew was software, so they leaned on that expertise to produce a solution.