Hadoop: The elephant in the room


Published in: Technology

  • Note: This study is from 2007. I don’t know if there’s a Moore’s Law for the growth of data on the internet, but I expect the number is much larger now.
  • This is not a supercomputer, and it’s not intended to be. Google’s approach was always to use a lot of cheap, expendable commodity servers rather than be beholden to expensive, custom hardware and vendors. What they knew was software, so they leaned on that expertise to produce a solution.

    1. 1. Apache Hadoop: The elephant in the room. C. Aaron Cois, Ph.D.
    2. 2. Me: @aaroncois, www.codehenge.net. Love to chat!
    3. 3. The Problem
    4. 4. Large-Scale Computation
          • Traditionally, large computation was focused on
            – Complex, CPU-intensive calculations
            – On relatively small data sets
          • Examples:
            – Calculate complex differential equations
            – Calculate digits of Pi
    5. 5. Parallel Processing
          • Distributed systems allow scalable computation (more processors, working simultaneously)
          Diagram labels: INPUT, OUTPUT
    6. 6. Data Storage
          • Data is often stored on a SAN
          • Data is copied to each compute node at compute time
          • This works well for small amounts of data, but requires significant copy time for large data sets
    7. 7. Diagram labels: SAN, Compute Nodes, Data
    8. 8. Diagram labels: SAN, Calculating…
    9. 9. You must first distribute data each time you run a computation…
    10. 10. How much data?
    11. 11. How much data? Over 25 PB of data
    12. 12. How much data? Over 25 PB of data. Over 100 PB of data
    13. 13. The internet
          IDC estimates [2] the internet contains at least:
          1 Zettabyte, or 1,000 Exabytes, or 1,000,000 Petabytes
          [2] http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)
    14. 14. How much time?
          Disk transfer rates:
          • Standard 7200 RPM drive: 128.75 MB/s
            => 7.7 secs/GB
            => 13 mins/100 GB
            => > 2 hours/TB
            => 90 days/PB
          [1] http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
    15. 15. How much time?
          Fastest network transfer rate:
          • iSCSI over 100 Gb Ethernet (theoretical)
            – 12.5 GB/s => 80 sec/TB, 1333 min/PB
          OK, ignore the network bottleneck:
          • HyperTransport bus
            – 51.2 GB/s => 19 sec/TB, 325 min/PB
          [1] http://en.wikipedia.org/wiki/List_of_device_bit_rates
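The figures on the two "How much time?" slides can be reproduced with simple division. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and reads the two network rates as gigabytes per second (12.5 GB/s and 51.2 GB/s), which is what the per-TB times on the slide imply. The function and constant names are made up for illustration.

```python
# Decimal units, as the slide arithmetic implies.
MB_PER_TB = 1_000_000
MB_PER_PB = 1_000_000_000

# Peak rates in MB/s. The two network rates are an assumption:
# they are read as GB/s because that matches the slide's timings.
RATES_MB_PER_S = {
    "7200 RPM disk": 128.75,
    "100 Gb Ethernet": 12_500.0,
    "HyperTransport": 51_200.0,
}


def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time to move size_mb at a sustained rate, ignoring all overhead."""
    return size_mb / rate_mb_per_s


if __name__ == "__main__":
    for name, rate in RATES_MB_PER_S.items():
        tb_seconds = transfer_seconds(MB_PER_TB, rate)
        pb_days = transfer_seconds(MB_PER_PB, rate) / 86_400
        print(f"{name}: {tb_seconds:,.0f} s per TB, {pb_days:,.2f} days per PB")
```

These are best-case numbers: real transfers also pay for seeks, protocol overhead, and contention, so the copy step only gets slower.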
    16. 16. We need a better plan
          • Sending data to distributed processors is the bottleneck
          • So what if we sent the processors to the data?
          Core concept: Pre-distribute and store the data. Assign compute nodes to operate on local data.
    17. 17. The Solution
    18. 18. Distributed Data Servers
    19. 19. Distribute the Data
    20. 20. Send computation code to servers containing relevant data
    21. 21. Hadoop Origin
          • Hadoop was modeled after innovative systems created by Google
          • Designed to handle massive (web-scale) amounts of data
          Fun Fact: Hadoop’s creator named it after his son’s stuffed elephant
    22. 22. Hadoop Goals
          • Store massive data sets
          • Enable distributed computation
          • Heavy focus on
            – Fault tolerance
            – Data integrity
            – Commodity hardware
    23. 23. Hadoop System
          Diagram labels: GFS, MapReduce, BigTable; HDFS, Hadoop MapReduce, HBase
    24. 24. Hadoop System
          Diagram labels: GFS, MapReduce, BigTable; HDFS, Hadoop MapReduce, HBase; Hadoop
    25. 25. Components
    26. 26. HDFS
          • “Hadoop Distributed File System”
          • Sits on top of the native filesystem
            – ext3, etc.
          • Stores data in files, replicated and distributed across data nodes
          • Files are “write once”
          • Performs best with millions of ~100 MB+ files
    27. 27. HDFS
          Files are split into blocks for storage
          Datanodes
            – Data blocks are distributed/replicated across datanodes
          Namenode
            – The master node
            – Keeps track of location of data blocks
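To make the block idea concrete, here is a toy model of splitting a file into blocks and recording which datanodes hold each replica, which is roughly the bookkeeping the namenode does. It is a sketch only: the round-robin placement is an assumption made for brevity (it is not HDFS's actual placement policy), and every name in it is hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file's bytes into fixed-size blocks; the last one may be short."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, datanodes: list[str],
                 replication: int) -> dict[int, list[str]]:
    """Toy namenode table: block id -> datanodes holding a replica.

    Uses simple round-robin placement (an illustrative assumption).
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of datanodes")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    table = place_blocks(len(blocks), ["dn1", "dn2", "dn3"], replication=2)
    for block_id, holders in table.items():
        print(block_id, blocks[block_id], holders)
```

Because every block lives on more than one datanode, losing a single node does not lose data, and the table tells a scheduler where each block can be read locally.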
    28. 28. HDFS: Multi-Node Cluster
          Diagram labels: Master, Slave, NameNode, DataNode, DataNode
    29. 29. MapReduce
          A programming model
            – Designed to make programming parallel computation over large distributed data sets easy
            – Each node processes data already residing on it (when possible)
            – Inspired by functional programming map and reduce functions
    30. 30. MapReduce
          JobTracker
            – Runs on a master node
            – Clients submit jobs to the JobTracker
            – Assigns Map and Reduce tasks to slave nodes
          TaskTracker
            – Runs on every slave node
            – Daemon that instantiates Map or Reduce tasks and reports results to JobTracker
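The "each node processes data already residing on it (when possible)" rule can be sketched as a small scheduling function. This is a toy illustration, not how the JobTracker is implemented: the function name, the slot model, and the fallback rule are all assumptions.

```python
def assign_map_tasks(block_locations: dict[int, list[str]],
                     free_slots: dict[str, int]) -> dict[int, tuple[str, bool]]:
    """Assign one map task per block; returns block id -> (node, is_local).

    Prefers a node that already stores the block. If none of those has a
    free slot, falls back to any node with capacity (a remote read).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment: dict[int, tuple[str, bool]] = {}
    for block_id, holders in block_locations.items():
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            node = next((n for n, free in slots.items() if free > 0), None)
            if node is None:
                raise RuntimeError("no free task slots left")
        slots[node] -= 1
        assignment[block_id] = (node, is_local)
    return assignment


if __name__ == "__main__":
    locations = {0: ["a", "b"], 1: ["a"], 2: ["a"]}
    print(assign_map_tasks(locations, {"a": 2, "b": 1, "c": 1}))
```

In the example, node `a` runs out of slots after two tasks, so the third block is processed remotely: locality is a preference, not a guarantee.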
    31. 31. MapReduce: Multi-Node Cluster
          Diagram labels: Master, Slave, JobTracker, TaskTracker, TaskTracker
    32. 32. Multi-Node Cluster
          Diagram labels: MapReduce Layer, HDFS Layer, Master, Slave, NameNode, DataNode, DataNode, JobTracker, TaskTracker, TaskTracker
    33. 33. HBase
          • Hadoop’s Database
          • Sits on top of HDFS
          • Provides random read/write access to Very Large™ tables
            – Billions of rows, billions of columns
          • Access via Java, Jython, Groovy, Scala, or REST web service
    34. 34. A Typical Hadoop Cluster
          • Consists entirely of commodity ~$5k servers
          • 1 master, 1 to 1000+ slaves
          • Scales linearly as more processing nodes are added
    35. 35. How it works
    36. 36. Traditional MapReduce. http://en.wikipedia.org/wiki/MapReduce
    37. 37. Hadoop MapReduce. Image Credit: http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854
    38. 38. MapReduce Example
          function map(Str name, Str document):
              for each word w in document:
                  increment_count(w, 1)
          function reduce(Str word, Iter partialCounts):
              sum = 0
              for each pc in partialCounts:
                  sum += ParseInt(pc)
              return (word, sum)
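The word-count pseudocode on this slide can be run locally as plain Python. This is a single-process sketch of the programming model, not Hadoop code: the grouping step in the middle stands in for the shuffle a real cluster performs between map and reduce, and the function names are chosen for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(name: str, document: str) -> Iterator[tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one document."""
    for word in document.split():
        yield word.lower(), 1


def reduce_counts(word: str, partial_counts: Iterable[int]) -> tuple[str, int]:
    """Reduce step: add up all partial counts for a single word."""
    return word, sum(partial_counts)


def run_word_count(documents: dict[str, str]) -> dict[str, int]:
    """Run map, group the pairs by key (a stand-in for the shuffle), then reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_words(name, text):
            grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_word_count({"a.txt": "the cat", "b.txt": "The dog the"}))
```

Only `map_words` and `reduce_counts` are the programmer's job; everything in `run_word_count` is what the framework takes over on a real cluster.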
    39. 39. What didn’t I worry about?
          • Data distribution
          • Node management
          • Concurrency
          • Error handling
          • Node failure
          • Load balancing
          • Data replication/integrity
    40. 40. Demo
    41. 41. Try the demo yourself! Go to: https://github.com/cacois/vagrant-hadoop-cluster and follow the instructions in the README