Note: This study was from 2007. I don’t know if there’s a Moore’s Law of growth of data on the internet, but I expect this is a much larger number now.
This is not a supercomputer, and it’s not intended to be. Google’s approach was always to use a lot of cheap, expendable commodity servers rather than be beholden to expensive, custom hardware and vendors. What they knew was software, so they leaned on that expertise to produce a solution.
Hadoop: The elephant in the room
Apache Hadoop
The elephant in the room
C. Aaron Cois, Ph.D.
Large-Scale Computation
• Traditionally, large computation was focused on
  – Complex, CPU-intensive calculations
  – On relatively small data sets
• Examples:
  – Calculate complex differential equations
  – Calculate digits of Pi
Parallel Processing
• Distributed systems allow scalable computation (more processors, working simultaneously)
[Diagram labels: INPUT, OUTPUT]
Data Storage
• Data is often stored on a SAN
• Data is copied to each compute node at compute time
• This works well for small amounts of data, but requires significant copy time for large data sets
How much data?
• over 25 PB of data
• over 100 PB of data
The internet
IDC estimates the internet contains at least:
• 1 Zettabyte, or
• 1,000 Exabytes, or
• 1,000,000 Petabytes
Source: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)
How much time?
Disk Transfer Rates:
• Standard 7200 RPM drive: 128.75 MB/s
  => 7.7 secs/GB
  => 13 mins/100 GB
  => > 2 hours/TB
  => 90 days/PB
Source: http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
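The figures on this slide follow from dividing data size by transfer rate. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1,000 MB) and a single drive reading sequentially at the quoted rate; the name `transfer_seconds` is illustrative only.

```python
RATE_MB_PER_S = 128.75  # sustained transfer rate quoted on the slide

# Assumption: decimal units (1 GB = 1,000 MB, 1 TB = 1,000 GB, 1 PB = 1,000 TB).
SIZES_MB = {
    "1 GB": 1_000,
    "100 GB": 100_000,
    "1 TB": 1_000_000,
    "1 PB": 1_000_000_000,
}


def transfer_seconds(size_mb, rate_mb_per_s=RATE_MB_PER_S):
    """Seconds needed to read size_mb megabytes at a constant rate."""
    return size_mb / rate_mb_per_s


for label, size_mb in SIZES_MB.items():
    secs = transfer_seconds(size_mb)
    print(
        f"{label:>6}: {secs:,.1f} s | {secs / 60:,.1f} min | "
        f"{secs / 3600:,.1f} h | {secs / 86400:,.2f} days"
    )
```

Under these assumptions the results land at roughly 7.8 seconds per GB, 13 minutes per 100 GB, a little over 2 hours per TB, and about 90 days per PB, which agrees with the slide.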
We need a better plan
• Sending data to distributed processors is the bottleneck
• So what if we sent the processors to the data?
Core concept:
Pre-distribute and store the data.
Assign compute nodes to operate on local data.
[Diagram: Distribute the Data]
[Diagram: Send computation code to servers containing relevant data]
Hadoop Origin
• Hadoop was modeled after innovative systems created by Google
• Designed to handle massive (web-scale) amounts of data
Fun Fact: Hadoop’s creator named it after his son’s stuffed elephant
Hadoop Goals
• Store massive data sets
• Enable distributed computation
• Heavy focus on
  – Fault tolerance
  – Data integrity
  – Commodity hardware
HDFS
• “Hadoop Distributed File System”
• Sits on top of native filesystem
  – ext3, etc.
• Stores data in files, replicated and distributed across data nodes
• Files are “write once”
• Performs best with millions of ~100MB+ files
HDFS
Files are split into blocks for storage
Datanodes
  – Data blocks are distributed/replicated across datanodes
Namenode
  – The master node
  – Keeps track of location of data blocks
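The split between datanodes and the namenode can be pictured with a small in-memory toy. This is only a sketch under loud assumptions: `ToyNamenode`, the 64-byte block size, the replication factor of 3, and the round-robin placement are all invented for illustration and are not the HDFS API or its real placement policy.

```python
BLOCK_SIZE = 64   # bytes, tiny on purpose; real HDFS blocks are far larger
REPLICATION = 3   # assumed number of copies kept of each block


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNamenode:
    """Hypothetical master: remembers which datanodes hold which block."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.block_locations = {}  # (filename, block index) -> datanode names
        self._next = 0             # round-robin cursor (an assumption)

    def place_file(self, filename, data, storage):
        """Split data into blocks and store each block on several datanodes."""
        copies = min(REPLICATION, len(self.datanodes))
        for index, block in enumerate(split_into_blocks(data)):
            replicas = [
                self.datanodes[(self._next + r) % len(self.datanodes)]
                for r in range(copies)
            ]
            self._next = (self._next + 1) % len(self.datanodes)
            for node in replicas:
                storage[node][(filename, index)] = block
            self.block_locations[(filename, index)] = replicas

    def read_file(self, filename, storage):
        """Reassemble a file by asking the first replica holder for each block."""
        keys = sorted(k for k in self.block_locations if k[0] == filename)
        return b"".join(storage[self.block_locations[k][0]][k] for k in keys)


nodes = ["dn1", "dn2", "dn3", "dn4"]
storage = {name: {} for name in nodes}  # stands in for each datanode's disk
namenode = ToyNamenode(nodes)
namenode.place_file("log.txt", b"x" * 200, storage)
print(namenode.block_locations)
```

The point of the toy is the division of labor: datanodes hold the bytes, while the namenode holds only the map from blocks to locations.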
MapReduce
A programming model
  – Designed to make programming parallel computation over large distributed data sets easy
  – Each node processes data already residing on it (when possible)
  – Inspired by functional programming map and reduce functions
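For readers who have not met the functional-programming originals, here is a minimal single-machine sketch using Python's built-in `map` and `functools.reduce`. The sample strings are made up for illustration; nothing here is distributed.

```python
from functools import reduce

records = ["hadoop", "stores", "data"]  # made-up sample input

# map: apply the same function to every element independently.
lengths = list(map(len, records))

# reduce: fold the mapped values into a single result.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```

Because each `map` call touches only its own element, the calls could run on different machines; that independence is what the MapReduce model exploits.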
MapReduce
JobTracker
  – Runs on a master node
  – Clients submit jobs to the JobTracker
  – Assigns Map and Reduce tasks to slave nodes
TaskTracker
  – Runs on every slave node
  – Daemon that instantiates Map or Reduce tasks and reports results to JobTracker
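One way to picture task assignment is a scheduler that prefers nodes already holding the data and falls back to any free node otherwise. The sketch below is a hypothetical simplification, not Hadoop's actual scheduler; `assign_tasks`, the slot counts, and the block names are all assumptions made for illustration.

```python
def assign_tasks(block_locations, slots):
    """Assign one task per block, preferring nodes that store the block.

    block_locations: block id -> list of nodes holding a replica
    slots: node -> number of free task slots
    Returns: block id -> (chosen node, True if the data is local)
    """
    remaining = dict(slots)  # copy so the caller's dict is not modified
    assignments = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if remaining.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder is free: run remotely and copy the data over.
            free = [n for n, count in remaining.items() if count > 0]
            if not free:
                raise RuntimeError("no free task slots")
            node, is_local = free[0], False
        remaining[node] -= 1
        assignments[block] = (node, is_local)
    return assignments


blocks = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n1"]}
free_slots = {"n1": 1, "n2": 1, "n3": 1}
plan = assign_tasks(blocks, free_slots)
print(plan)
```

A production scheduler would queue work rather than raise when slots run out; the sketch only illustrates the "when possible" preference for local data.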
HBase
• Hadoop’s Database
• Sits on top of HDFS
• Provides random read/write access to Very Large™ tables
  – Billions of rows, billions of columns
• Access via Java, Jython, Groovy, Scala, or REST web service
A Typical Hadoop Cluster
• Consists entirely of commodity ~$5k servers
• 1 master, 1 to 1000+ slaves
• Scales linearly as more processing nodes are added
MapReduce Example
function map(Str name, Str document):
    for each word w in document:
        increment_count(w, 1)
function reduce(Str word, Iter partialCounts):
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    return (word, sum)
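The pseudocode above can be tried on a single machine. The following is a minimal sketch in plain Python, not Hadoop code: `map_words`, `reduce_counts`, and `run_word_count` are illustrative names, and the grouping step in the middle stands in for the shuffle that the framework would perform between the two phases.

```python
from collections import defaultdict


def map_words(name, document):
    """Map phase: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word.lower(), 1


def reduce_counts(word, partial_counts):
    """Reduce phase: add up all partial counts emitted for one word."""
    return word, sum(partial_counts)


def run_word_count(documents):
    """Run map, group by key (a stand-in for the shuffle), then reduce."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_words(name, text):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


sample = {"a.txt": "the elephant", "b.txt": "The room"}  # made-up input
counts = run_word_count(sample)
print(counts)
```

Only the two small functions are specific to word counting; the driver in `run_word_count` is the part a framework would replace with distributed machinery.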
What didn’t I worry about?
• Data distribution
• Node management
• Concurrency
• Error handling
• Node failure
• Load balancing
• Data replication/integrity