Hadoop: The elephant in the room
  • Note: This study was from 2007. I don’t know if there’s a Moore’s Law of growth of data on the internet, but I expect this is a much larger number now.
  • This is not a supercomputer, and it's not intended to be. Google’s approach was always to use a lot of cheap, expendable commodity servers rather than be beholden to expensive, custom hardware and vendors. What they knew was software, so they leaned on that expertise to produce a solution.

Presentation Transcript

  • Apache Hadoop: The elephant in the room
    C. Aaron Cois, Ph.D.
  • Me
    @aaroncois
    www.codehenge.net
    Love to chat!
  • The Problem
  • Large-Scale Computation
    • Traditionally, large computation was focused on
      – Complex, CPU-intensive calculations
      – On relatively small data sets
    • Examples:
      – Calculate complex differential equations
      – Calculate digits of Pi
  • Parallel Processing
    • Distributed systems allow scalable computation (more processors, working simultaneously)
    INPUT, OUTPUT
  • Data Storage
    • Data is often stored on a SAN
    • Data is copied to each compute node at compute time
    • This works well for small amounts of data, but requires significant copy time for large data sets
  • SAN
    Compute Nodes
    Data
  • SAN
    Calculating…
  • You must first distribute data each time you run a computation…
  • How much data?
  • How much data?
    over 25 PB of data
  • How much data?
    over 25 PB of data
    over 100 PB of data
  • The internet
    IDC estimates [2] the internet contains at least:
    1 Zettabyte
    or 1,000 Exabytes
    or 1,000,000 Petabytes
    [2] http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf (2007)
  • How much time?
    Disk Transfer Rates:
    • Standard 7200 RPM drive: 128.75 MB/s
      => 7.7 secs/GB
      => 13 mins/100 GB
      => > 2 hours/TB
      => 90 days/PB
    [1] http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate
  • How much time?
    Fastest Network Xfer rate:
    • iSCSI over 100 Gb Ethernet (theor.)
      – 12.5 GB/s => 80 sec/TB, 1333 min/PB
    OK, ignore the network bottleneck:
    • HyperTransport Bus
      – 51.2 GB/s => 19 sec/TB, 325 min/PB
    [1] http://en.wikipedia.org/wiki/List_of_device_bit_rates
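The transfer-time figures on the two "How much time?" slides can be rechecked with a few lines of arithmetic. The sketch below is an illustration, not part of the talk: it assumes decimal units (1 GB = 10^9 bytes) and reads the quoted network and bus rates as bytes per second (12.5 GB/s corresponds to a 100 Gb/s link), since that is the reading under which the slide's times work out. The names `RATES`, `SIZES`, `transfer_seconds` and `humanize` are made up for this example.

```python
# Sustained transfer rates as quoted on the slides, in bytes per second.
# Assumption: decimal units, and the network/bus rates read as GB/s.
RATES = {
    "7200 RPM disk": 128.75e6,
    "100 Gb Ethernet (theoretical)": 12.5e9,
    "HyperTransport bus": 51.2e9,
}

SIZES = {"GB": 1e9, "TB": 1e12, "PB": 1e15}


def transfer_seconds(num_bytes, bytes_per_second):
    """Time to stream num_bytes at a constant rate, ignoring seeks and overhead."""
    return num_bytes / bytes_per_second


def humanize(seconds):
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, length in (("days", 86400), ("hours", 3600), ("min", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} sec"


if __name__ == "__main__":
    for name, rate in RATES.items():
        row = ", ".join(
            f"{humanize(transfer_seconds(size, rate))}/{label}"
            for label, size in SIZES.items()
        )
        print(f"{name}: {row}")
```

Even on the fastest bus listed, a petabyte takes hours to move, which is the motivation for moving the computation instead of the data.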
  • We need a better plan
    • Sending data to distributed processors is the bottleneck
    • So what if we sent the processors to the data?
    Core concept:
    Pre-distribute and store the data.
    Assign compute nodes to operate on local data.
  • The Solution
  • Distributed Data Servers
  • Distribute the Data
  • Send computation code to servers containing relevant data
  • Hadoop Origin
    • Hadoop was modeled after innovative systems created by Google
    • Designed to handle massive (web-scale) amounts of data
    Fun Fact: Hadoop’s creator named it after his son’s stuffed elephant
  • Hadoop Goals
    • Store massive data sets
    • Enable distributed computation
    • Heavy focus on
      – Fault tolerance
      – Data integrity
      – Commodity hardware
  • Hadoop System
    GFS | MapReduce | BigTable
    HDFS | Hadoop MapReduce | HBase
  • Hadoop System
    GFS | MapReduce | BigTable
    Hadoop: HDFS | Hadoop MapReduce | HBase
  • Components
  • HDFS
    • “Hadoop Distributed File System”
    • Sits on top of native filesystem
      – ext3, etc.
    • Stores data in files, replicated and distributed across data nodes
    • Files are “write once”
    • Performs best with millions of ~100 MB+ files
  • HDFS
    Files are split into blocks for storage
    Datanodes
      – Data blocks are distributed/replicated across datanodes
    Namenode
      – The master node
      – Keeps track of location of data blocks
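As a rough illustration of the bookkeeping described on this slide, the sketch below cuts a file into fixed-size blocks and builds the kind of block-to-datanode table a namenode keeps. It is a toy model under stated assumptions: the round-robin placement is invented for the example (real HDFS placement is more involved), and the block size, replication factor, and node names in the usage note are illustrative values, not figures from the talk.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, datanodes, replication):
    """Toy namenode table: block index -> datanodes holding a replica.

    Assumption: simple round-robin placement, chosen only so that each
    block's replicas land on distinct nodes. This is not HDFS's policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of datanodes")
    count = len(datanodes)
    return {
        block: [datanodes[(block + r) % count] for r in range(replication)]
        for block in range(num_blocks)
    }
```

For example, `split_into_blocks(300, 128)` yields three blocks, and `place_blocks(3, ["dn1", "dn2", "dn3", "dn4"], 2)` maps each of them to two different hypothetical datanodes, so losing any single node leaves every block readable.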
  • HDFS: Multi-Node Cluster
    Master, Slave
    Name Node, Data Node, Data Node
  • MapReduce
    A programming model
      – Designed to make programming parallel computation over large distributed data sets easy
      – Each node processes data already residing on it (when possible)
      – Inspired by functional programming map and reduce functions
  • MapReduce
    JobTracker
      – Runs on a master node
      – Clients submit jobs to the JobTracker
      – Assigns Map and Reduce tasks to slave nodes
    TaskTracker
      – Runs on every slave node
      – Daemon that instantiates Map or Reduce tasks and reports results to JobTracker
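The idea that each node processes data already residing on it "when possible" can be sketched as a small assignment routine. This is a simplified stand-in for what a JobTracker does, not Hadoop's actual scheduler: the function name `assign_map_tasks`, the slot model, and the greedy strategy are all assumptions made for illustration.

```python
def assign_map_tasks(block_locations, slots_per_node):
    """Greedily assign one map task per block, preferring data-local nodes.

    block_locations: block id -> list of nodes holding a replica.
    slots_per_node:  node -> number of free task slots.
    Returns block id -> (node, is_local).
    """
    free = dict(slots_per_node)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, holders in block_locations.items():
        # Prefer a node that already stores the block and has a free slot.
        local = [node for node in holders if free.get(node, 0) > 0]
        if local:
            node, is_local = max(local, key=lambda n: free[n]), True
        else:
            # Otherwise fall back to any node with capacity; the block
            # would then have to be read over the network.
            candidates = [n for n, slots in free.items() if slots > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            node, is_local = max(candidates, key=lambda n: free[n]), False
        free[node] -= 1
        assignment[block] = (node, is_local)
    return assignment
```

With blocks `{0: ["a", "b"], 1: ["a"], 2: ["a"]}` and slots `{"a": 2, "b": 1}`, the first two tasks run locally on node `a`, and the third falls back to node `b` because `a` has no slots left.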
  • MapReduce: Multi-Node Cluster
    Master, Slave
    JobTracker, TaskTracker, TaskTracker
  • Multi-Node Cluster
    Master, Slave
    MapReduce Layer: JobTracker, TaskTracker, TaskTracker
    HDFS Layer: NameNode, DataNode, DataNode
  • HBase
    • Hadoop’s Database
    • Sits on top of HDFS
    • Provides random read/write access to Very Large™ tables
      – Billions of rows, billions of columns
    • Access via Java, Jython, Groovy, Scala, or REST web service
  • A Typical Hadoop Cluster
    • Consists entirely of commodity ~$5k servers
    • 1 master, 1 -> 1000+ slaves
    • Scales linearly as more processing nodes are added
  • How it works
  • Traditional MapReduce
    http://en.wikipedia.org/wiki/MapReduce
  • Hadoop MapReduce
    Image Credit: http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854
  • MapReduce Example
    function map(Str name, Str document):
        for each word w in document:
            increment_count(w, 1)
    function reduce(Str word, Iter partialCounts):
        sum = 0
        for each pc in partialCounts:
            sum += ParseInt(pc)
        return (word, sum)
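The word-count pseudocode above can be tried out on a single machine. The sketch below is not Hadoop code: `run_job` is a hypothetical in-memory driver that stands in for the framework's shuffle step, grouping the pairs emitted by the map function by key before handing each group to the reduce function.

```python
from collections import defaultdict


def map_fn(name, document):
    """Emit (word, 1) for every whitespace-separated word in the document."""
    for word in document.split():
        yield word, 1


def reduce_fn(word, partial_counts):
    """Sum the partial counts collected for one word."""
    return word, sum(int(pc) for pc in partial_counts)


def run_job(documents):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)  # the "shuffle": collect values per key
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job({"a.txt": "the cat", "b.txt": "the dog the"}))
```

On a real cluster the map calls run on the nodes holding each document's blocks, and the grouping happens across the network; the two user-written functions stay the same.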
  • What didn’t I worry about?
    • Data distribution
    • Node management
    • Concurrency
    • Error handling
    • Node failure
    • Load balancing
    • Data replication/integrity
  • Demo
  • Try the demo yourself!
    Go to: https://github.com/cacois/vagrant-hadoop-cluster
    Follow the instructions in the README