TheEdge10 : Big Data is Here - Hadoop to the Rescue



A presentation from TheEdge10 about Hadoop and Big Data

Published in: Software


  1. 1. Big Data is Here – <br />Hadoop to the Rescue!<br />Shay Sofer,<br />AlphaCSP<br />
  2. 2. Today we will:<br />Understand what Big Data is<br />Get to know Hadoop<br />Experience some MapReduce magic<br />Persist very large files<br />Learn some nifty tricks<br />On Today's Menu...<br />
  3. 3. Data is Everywhere<br />
  4. 4. IDC : “Total data in the universe : 1.2 Zettabytes” (May, 2010)<br />1 ZB = 1 trillion gigabytes <br /> (or: 1,000,000,000,000,000,000,000 bytes = 10^21 bytes)<br />60% growth from 2009<br />By 2020 – we will reach 35 ZB<br />Facts and Numbers<br />Data is Everywhere<br />
  5. 5. Facts and Numbers<br />Data is Everywhere<br />Source:<br />
  6. 6. 234M Web sites<br />7M New sites in 2009<br />New York Stock Exchange – 1 TB of data per day<br />Web 2.0<br />147M Blogs (and counting…)<br />Twitter – ~12 TB of data per day<br />Facts and Numbers<br />Data is Everywhere<br />
  7. 7. 500M users<br />40M photos per day <br /> More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each month<br />Facts and Numbers - Facebook<br />Data is Everywhere<br />
  8. 8. Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools<br />Where and how do we store this information?<br />How do we perform analyses on such large datasets?<br />Why are you here?<br />Data is Everywhere<br />
  9. 9. Scale-up Vs. Scale-out<br />Data is Everywhere<br />
  10. 10. Scale-up : Adding resources to a single node in a system, typically involving the addition of CPUs or memory to a single computer<br />Scale-out : Adding more nodes to a system. E.g. Adding a new computer with commodity hardware to a distributed software application<br />Scale-up Vs. Scale-out<br />Data is Everywhere<br />
  11. 11. Introducing…Hadoop!<br />
  12. 12. A framework for writing and running distributed applications that process large amounts of data<br />Runs on large clusters of commodity hardware<br />A cluster with hundreds of machines is standard<br />Inspired by Google’s architecture : MapReduce and GFS<br />What is Hadoop?<br />Hadoop<br />
  13. 13. Robust - Handles failures of individual nodes<br />Scales linearly<br />Open source <br />A top-level Apache project<br />Why Hadoop?<br />Hadoop<br />
  14. 14. Hadoop<br />
  15. 15. Facebook holds the largest known Hadoop storage cluster in the world<br />2000 machines<br />12 TB per machine (some have 24 TB)<br />32 GB of RAM per machine<br />Total of more than 21 Petabytes <br />(1 Petabyte = 1024 Terabytes) <br />Facebook (Again…)<br />Hadoop<br />
  16. 16. History<br />Hadoop<br />2002: Apache Nutch – Open Source web search engine founded by Doug Cutting<br />2004: Google’s GFS & MapReduce papers published<br />2006: Cutting joins Yahoo!, forms Hadoop<br />2008: Hadoop hits web scale, being used by Yahoo! for web indexing<br />2008: Sorting 1 TB in 62 seconds<br />2010: Creating the longest Pi yet<br />
  17. 17. Hadoop<br />
  18. 18. IDE Plugin<br />Hadoop<br />
  19. 19. Hadoop and MapReduce<br />
  20. 20. A programming model for processing and generating large data sets<br />Introduced by Google <br />Parallel processing of the map/reduce operations<br />Definition<br />MapReduce<br />
  21. 21. Sam believed “An apple a day keeps a doctor away”<br />MapReduce – The Story of Sam<br />Mother<br />Sam<br />An Apple<br />Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs<br />
  22. 22. Sam thought of “drinking” the apple<br />MapReduce – The Story of Sam<br /><ul><li>He used a [knife] to cut the [apple] and a [blender] to make juice. </li></ul>Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs<br />
  23. 23. Sam applied his invention to all the fruits he could find in the fruit basket<br /><ul><li>(map ‘( )) </li></ul>MapReduce – The Story of Sam<br />A list of values mapped into another list of values, which gets reduced into a single value<br />( ) <br /><ul><li>(reduce ‘( )) </li></ul>Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs<br />
  24. 24. MapReduce – The Story of Sam<br />Sam got his first job for his talent in making juice<br />Fruits<br /><ul><li>Now, it’s not just one basket but a whole container of fruits</li></ul>Large data and list of values for output<br /><ul><li>Also, they produce a list of juice types separately
  25. 25. But Sam had just ONE [knife] and ONE [blender] </li></ul>Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs<br />
  26. 26. MapReduce – The Story of Sam<br />Sam implemented a parallel version of his innovation <br />Each map input: list of <key, value> pairs<br />Fruits<br />(<a, > , <o, > , <p ,> , …)<br />Map<br />Each map output: list of <key, value> pairs<br />(<a’ , > , <o’, v > , <p’ , > , …)<br />Grouped by key (shuffle)<br />Each reduce input: <key, value-list><br />e.g. <a’, ( …)><br />Reduce<br />Reduced into a list of values<br />Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs<br />
  27. 27. Mapper - Takes a series of key/value pairs, processes each one and generates output key/value pairs<br /> (k1, v1) → list(k2, v2)<br />Reducer - Iterates through the values that are associated with a specific key and generates output<br /> (k2, list(v2)) → list(k3, v3)<br />The Mapper takes the input data, then filters and transforms it into something the Reducer can aggregate over<br />First Map, Then Reduce<br />MapReduce<br />
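The map, shuffle and reduce signatures above can be tried out without a cluster. The following is a minimal single-JVM sketch in plain Java, not Hadoop's API: the class `MiniMapReduce` and its `run` and `wordCount` helpers are hypothetical names chosen for illustration, and the shuffle is just an in-memory grouping by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical in-memory model of (k1, v1) -> list(k2, v2) and (k2, list(v2)) -> result.
final class MiniMapReduce {

    // Runs map, then shuffle (group by key), then reduce, all inside one JVM.
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> inputs,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map + shuffle: every emitted value is appended to the list kept for its key.
        Map<K, List<V>> grouped = new TreeMap<>();
        for (I input : inputs) {
            for (Map.Entry<K, V> pair : mapper.apply(input)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce: one call per key, receiving the full list of values for that key.
        Map<K, R> result = new TreeMap<>();
        grouped.forEach((key, values) -> result.put(key, reducer.apply(key, values)));
        return result;
    }

    // Word count expressed with the helper above.
    static Map<String, Integer> wordCount(List<String> lines) {
        return MiniMapReduce.<String, String, Integer, Integer>run(
                lines,
                line -> {
                    List<Map.Entry<String, Integer>> out = new ArrayList<>();
                    for (String word : line.trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            out.add(Map.entry(word, 1)); // emit <word, 1>
                        }
                    }
                    return out;
                },
                (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("Hello World Bye World")));
    }
}
```

A real cluster spreads the map calls across machines and moves the grouped values over the network; the sketch keeps only the data flow.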
  28. 28. MapReduce<br />Shuffle<br />Input<br />
  29. 29. Hadoop comes with a number of predefined classes<br />BooleanWritable<br />ByteWritable<br />LongWritable<br />Text, etc…<br />Supports pluggable serialization frameworks<br />Apache Avro <br />Hadoop Data Types<br />MapReduce<br />
  30. 30. TextInputFormat / TextOutputFormat<br />KeyValueTextInputFormat<br />SequenceFile - A Hadoop-specific compressed binary file format. Optimized for passing data between two MapReduce jobs<br />Input / Output Formats<br />MapReduce<br />
  31. 31. public static class MapClass extends MapReduceBase<br /> implements Mapper<LongWritable, Text, Text, IntWritable> {<br />private Text word = new Text();<br />public void map(LongWritable key, Text value,<br /> OutputCollector<Text, IntWritable> output, …) {<br />String line = value.toString();<br />StringTokenizer itr = new StringTokenizer(line);<br />while (itr.hasMoreTokens()) {<br />word.set(itr.nextToken());<br />output.collect(word, new IntWritable(1));<br /> }<br /> }<br />}<br />Input: <K1, Hello World Bye World><br />Output: <Hello, 1> <World, 1> <Bye, 1> <World, 1><br />Word Count – The Mapper<br />
  32. 32. public static class ReduceClass extends MapReduceBase<br /> implements Reducer<Text, IntWritable, Text, IntWritable> {<br />public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, …) {<br />int sum = 0;<br />while (values.hasNext()) {<br />sum += values.next().get();<br />}<br />output.collect(key, new IntWritable(sum));<br />}<br />}<br />Input: <Hello, 1> <World, 1> <Bye, 1> <World, 1><br />Output: <Hello, 1> <World, 2> <Bye, 1><br />Word Count – The Reducer<br />
  33. 33. public static void main(String[] args) throws IOException {<br />JobConf job = new JobConf(WordCount.class);<br />job.setOutputKeyClass(Text.class);<br />job.setOutputValueClass(IntWritable.class);<br />job.setMapperClass(MapClass.class);<br />job.setReducerClass(ReduceClass.class);<br />FileInputFormat.addInputPath(job, new Path(args[0]));<br />FileOutputFormat.setOutputPath(job, new Path(args[1]));<br />//job.setInputFormat(KeyValueTextInputFormat.class);<br />JobClient.runJob(job);<br />}<br />Word Count – The Driver<br />
  34. 34. Music discovery website<br />Scrobbling / Streaming via radio<br />40M unique visitors per month<br />Over 40M scrobbles per day<br />Each scrobble creates a log line<br />Hadoop @ Last.FM<br />MapReduce<br />
  35. 35.
  36. 36. Goal : Create a “Unique listeners per track” chart<br />Sample listening data<br />MapReduce<br />
  37. 37. public void map(LongWritable position, Text rawLine, OutputCollector<IntWritable, IntWritable> output, <br /> Reporter reporter) throws IOException { <br />int scrobbles, radioListens; // assume these are initialized;<br />IntWritable trackId, userId; // initialization omitted for brevity<br /> // if a track is somehow marked with zero plays, ignore it<br />if (scrobbles <= 0 && radioListens <= 0) {<br />return; <br /> }<br />// output user id against track id<br />output.collect(trackId, userId);<br /> }<br />Unique Listens - Mapper<br />
  38. 38. public void reduce(IntWritable trackId, <br />Iterator<IntWritable> values, <br />OutputCollector<IntWritable, IntWritable> output, <br /> Reporter reporter) throws IOException {<br /> Set<Integer> usersSet = new HashSet<Integer>();<br />// add all userIds to the set, duplicates removed<br />while (values.hasNext()) {<br />IntWritable userId = values.next();<br />usersSet.add(userId.get());<br /> }<br />// output: trackId -> number of unique listeners per track<br />output.collect(trackId, new IntWritable(usersSet.size()));<br />}<br />Unique Listens - Reducer<br />
  39. 39. Complex tasks sometimes need to be broken down into subtasks<br />Output of the previous job goes as input to the next job<br />job-a | job-b | job-c<br />Simply launch the driver of the 2nd job after the 1st<br />Chaining<br />MapReduce<br />
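The `job-a | job-b | job-c` idea can be sketched as plain function composition, where each job's output list becomes the next job's input. This is only a model of the data flow, not Hadoop code: the three jobs below (normalize, de-duplicate, sort) are hypothetical stand-ins for real MapReduce jobs, and on a cluster the hand-off would be an output directory read by the next driver.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical model: each "job" maps a list of input records to a list of output records.
final class JobChainSketch {

    // job-a: normalize records to lower case.
    static final Function<List<String>, List<String>> JOB_A =
            in -> in.stream().map(String::toLowerCase).collect(Collectors.toList());

    // job-b: drop duplicate records, keeping first occurrences.
    static final Function<List<String>, List<String>> JOB_B =
            in -> in.stream().distinct().collect(Collectors.toList());

    // job-c: sort the remaining records.
    static final Function<List<String>, List<String>> JOB_C =
            in -> in.stream().sorted().collect(Collectors.toList());

    // job-a | job-b | job-c: the output of each job is the input of the next one.
    static List<String> runChain(List<String> input) {
        return JOB_A.andThen(JOB_B).andThen(JOB_C).apply(input);
    }

    public static void main(String[] args) {
        System.out.println(runChain(List.of("Pear", "apple", "PEAR", "Apple")));
    }
}
```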
  40. 40. Hadoop supports other languages via an API called Streaming<br />Use UNIX commands as mappers and reducers<br />Or use any script that processes a line-oriented data stream from STDIN and outputs to STDOUT<br />Python, Perl etc.<br />Hadoop Streaming<br />MapReduce<br />
  41. 41. $ hadoop jar hadoop-streaming.jar <br /> -input input/myFile.txt<br /> -output output.txt <br /> -mapper<br /> -reducer<br />Hadoop Streaming<br />MapReduce<br />
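The Streaming contract is simply "read lines from STDIN, write lines to STDOUT", so any executable can play the mapper role. As a rough illustration, here is a word-count mapper written as a plain Java filter. The class name `StreamingMapperSketch` is hypothetical, and the tab between key and value is an assumption based on Streaming's usual default separator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical line-oriented mapper: one "word<TAB>1" output line per input token.
final class StreamingMapperSketch {

    // Reads every line from 'in' and writes one key/value line per word to 'out'.
    static void map(Reader in, Writer out) {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        writer.print(word + "\t1\n"); // key, tab, value
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        writer.flush();
    }

    // As a standalone program it behaves like a UNIX filter: STDIN in, STDOUT out.
    public static void main(String[] args) {
        map(new InputStreamReader(System.in), new OutputStreamWriter(System.out));
    }
}
```

A matching reducer would read the sorted `word<TAB>1` lines the same way and sum the values for each run of identical keys.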
  42. 42. HDFS<br />Hadoop Distributed File System<br />
  43. 43. A large dataset can and will outgrow the storage capacity of a single physical machine<br />Partition it across separate machines – Distributed FileSystems<br />Network based - complex<br />What happens when a node fails?<br />Distributed FileSystem<br />HDFS<br />
  44. 44. Designed for storing very large files, running on clusters of commodity hardware<br />Highly fault-tolerant (via replication)<br />A typical file is gigabytes to terabytes in size<br />High throughput<br />HDFS - Hadoop Distributed FileSystem<br />HDFS<br />
  45. 45. Running Hadoop = Running a set of daemons on<br />different servers in your network<br />NameNode<br />DataNode<br />Secondary NameNode<br />JobTracker<br />TaskTracker<br />Hadoop’s Building Blocks<br />HDFS<br />
  46. 46. Topology of a Hadoop Cluster<br />Secondary NameNode<br />NameNode<br />JobTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />
  47. 47. HDFS has a master/slave architecture ; The NameNode acts as the master<br />Single NameNode per HDFS<br />Keeps track of :<br />How the files are broken into blocks<br />Which nodes store those blocks<br />The overall health of the filesystem<br />Memory and I/O intensive<br />The NameNode<br />HDFS<br />
  48. 48. Each slave machine will host a DataNode daemon<br />Serves read/write/delete requests from the NameNode<br />Manages the storage attached to the nodes <br />Sends a periodic Heartbeat to the NameNode<br />The DataNode<br />HDFS<br />
  49. 49. Failure is the norm rather than the exception<br />Detection of faults and quick, automatic recovery<br />Each file is stored as a sequence of blocks (default: 64MB each)<br />The blocks of a file are replicated for fault tolerance<br />Block size and replicas are configurable per file<br />Fault Tolerance - Replication<br />HDFS<br />
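To get a feel for what block size and replication mean in practice, the arithmetic can be written down directly. The sketch below is illustrative only: `BlockMathSketch` is a hypothetical helper, the 64 MB block size comes from the slide, and the replication factor of 3 used in `main` is an assumed, commonly used setting rather than something the slide states.

```java
// Hypothetical helper for back-of-the-envelope HDFS storage arithmetic.
final class BlockMathSketch {

    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: fileBytes / blockBytes, rounded up.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Raw bytes used across the cluster when every block is stored 'replicas' times.
    static long rawBytes(long fileBytes, int replicas) {
        return fileBytes * replicas;
    }

    public static void main(String[] args) {
        long fileBytes = 1024 * MB;  // a 1 GB file
        long blockBytes = 64 * MB;   // block size from the slide
        int replicas = 3;            // assumption: a commonly used replication factor
        System.out.println("blocks: " + blockCount(fileBytes, blockBytes));
        System.out.println("raw MB: " + rawBytes(fileBytes, replicas) / MB);
    }
}
```

Under these assumptions a 1 GB file occupies 16 blocks and about 3 GB of raw disk, which is the price paid for surviving the loss of individual nodes.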
  50. 50. HDFS<br />
  51. 51. Topology of a Hadoop Cluster<br />Secondary NameNode<br />NameNode<br />JobTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />
  52. 52. Assistant daemon that should be on a dedicated node<br />Takes snapshots of the HDFS metadata<br />Doesn’t receive real time changes<br />Helps minimize downtime in case the NameNode crashes<br />Secondary NameNode<br />HDFS<br />
  53. 53. Topology of a Hadoop Cluster<br />Secondary NameNode<br />NameNode<br />JobTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />DataNode<br />TaskTracker<br />
  54. 54. One per cluster - on the master node<br />Receives job requests submitted by the client<br />Schedules and monitors MapReduce jobs on TaskTrackers<br />JobTracker<br />HDFS<br />
  55. 55. Run map and reduce tasks<br />Send progress reports to the JobTracker<br />TaskTracker<br />HDFS<br />
  56. 56. Via file commands<br />$ hadoop fs -mkdir /user/chuck<br />$ hadoop fs -put hugeFile.txt<br />$ hadoop fs -get anotherHugeFile.txt<br />Programmatically (HDFS API)<br />FileSystem hdfs = FileSystem.get(new Configuration());<br />FSDataOutputStream out = hdfs.create(filePath);<br />while(...){<br /> out.write(buffer,0,bytesRead);<br />}<br />Working with HDFS<br />HDFS<br />
  57. 57. Tips & Tricks<br />
  58. 58. Tip #1: Hadoop Configuration Types<br />Tips & Tricks<br />
  59. 59. Monitoring events in the cluster can prove to be a bit more difficult<br />Web interface for our cluster<br />Shows a summary of the cluster<br />Details about the jobs that are currently running, completed and failed<br />Tip #2: JobTracker UI <br />Tips & Tricks<br />
  60. 60. WebTracker UI SS<br />Tips & Tricks<br />
  61. 61. Digging through logs or…<br /> Rerunning the exact same scenario with the same input on the same node?<br />IsolationRunner can rerun the failed task to reproduce the problem<br />Attach a debugger <br />keep.failed.task.files = true<br />Tip #3: IsolationRunner – Hadoop’s Time Machine<br />Tips & Tricks<br />
  62. 62. Output of the map phase (which will be shuffled across the network) can be quite large<br />Built in support for compression<br />Different codecs : gzip, bzip2 etc<br />Transparent to the developer<br />conf.setCompressMapOutput(true);<br />conf.setMapOutputCompressorClass(GzipCodec.class);<br />Tip #4: Compression<br />Tips & Tricks<br />
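Why compressing map output pays off can be seen with nothing more than the JDK's gzip streams. The sketch below does not use Hadoop's codec classes. `MapOutputCompressionSketch` and its sample data are hypothetical, chosen to mimic the repetitive `key<TAB>value` lines that a word-count mapper emits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo: gzip applied to repetitive, map-output-like text.
final class MapOutputCompressionSketch {

    // Compresses a byte array with gzip.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray(); // stream is closed, so the gzip trailer is written
    }

    // Restores the original bytes from gzip data.
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds 'records' identical "World<TAB>1" lines, similar to word-count map output.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("World\t1\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(1000);
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " bytes before, " + packed.length + " bytes after gzip");
    }
}
```

Real map output is less uniform than this sample, so actual savings depend on the data and on the codec chosen.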
  63. 63. A node can experience a slowdown, thus slowing down the entire job<br />If a task is identified as “slow”, it will be scheduled to run on another node in parallel<br />As soon as one finishes successfully, the others will be killed<br />An optimization – not a feature<br />Tip #5: Speculative Execution<br />Tips & Tricks<br />
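The "first successful attempt wins, the rest are killed" behaviour resembles what `ExecutorService.invokeAny` does for ordinary Java tasks. The sketch below is an analogy rather than Hadoop's scheduler: `SpeculativeExecutionSketch`, the node names and the sleep-based delays are all hypothetical, and unlike Hadoop it starts every attempt at once instead of launching a backup only after a task has been identified as slow.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical analogy for speculative execution using a thread pool.
final class SpeculativeExecutionSketch {

    // One attempt of the same task; delayMillis simulates how slow the node is.
    static Callable<String> attempt(String node, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis); // pretend to work
            return node;               // report which node produced the result
        };
    }

    // Runs all attempts in parallel and returns the result of the first one that
    // completes successfully; the remaining attempts are cancelled.
    static String runSpeculatively(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            return pool.invokeAny(attempts);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("every attempt failed", e);
        } finally {
            pool.shutdownNow(); // interrupt any attempt that is still running
        }
    }

    public static void main(String[] args) {
        String winner = runSpeculatively(List.of(
                attempt("slow-node", 2000),
                attempt("backup-node", 10)));
        System.out.println("result came from " + winner);
    }
}
```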
  64. 64. Input can come from 2 (or more) different sources<br />Hadoop has a contrib package called datajoin<br />Generic framework for performing reduce-side join<br />Tip #6: DataJoin Package<br />MapReduce<br />
  65. 65. Hadoop in the Cloud<br />Amazon Web Services<br />
  66. 66. Cloud computing - Shared resources and information are provided on demand<br />Rent a cluster rather than buy it<br />The best known infrastructure for cloud computing is Amazon Web Services (AWS)<br />Launched in July 2002<br />Cloud Computing and AWS<br />Hadoop in the Cloud<br />
  67. 67. Elastic Compute Cloud (EC2)<br />A large farm of VMs that users can rent to run their applications<br />Wide range of instance types to choose from (price varies)<br />Simple Storage Service (S3) – Online storage for persisting MapReduce data for future use<br />Hadoop comes with built-in support for EC2 and S3<br />$ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves> <br />Hadoop in the Cloud – Core Services<br />
  68. 68. EC2 Data Flow<br />HDFS<br />EC2<br />MapReduce Tasks<br />Our<br />Data<br />
  69. 69. EC2 & S3 Data Flow<br />S3<br />Our<br />Data<br />HDFS<br />EC2<br />MapReduce Tasks<br />
  70. 70. Hadoop-Related Projects<br />
  71. 71. Thinking at the level of Map, Reduce and job chaining instead of simple data flow operations is non-trivial<br />Pig simplifies Hadoop programming<br />Provides a high-level data processing language : Pig Latin<br />Being used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay etc.<br />Problem: Users file & Pages file. Find top 5 most visited pages by users aged 18-25<br />Pig<br />Hadoop-Related Projects<br />
  72. 72. Users = LOAD 'users.csv' AS (name, age);<br />Fltrd = FILTER Users BY age >= 18 AND age <= 25;<br />Pages = LOAD 'pages.csv' AS (user, url);<br />Jnd = JOIN Fltrd BY name, Pages BY user;<br />Grpd = GROUP Jnd BY url;<br />Smmd = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;<br />Srtd = ORDER Smmd BY clicks DESC;<br />Top5 = LIMIT Srtd 5;<br />STORE Top5 INTO 'top5sites.csv';<br />Pig Latin – Data Flow Language <br />
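For readers more at home in Java than in Pig Latin, the same data flow (filter, join, group, count, order, limit) can be approximated with the streams API on in-memory lists. This is a sketch under simplifying assumptions: the `User` and `Page` classes and `topPages` are hypothetical, user names are assumed to be unique, and ties in click count are broken alphabetically by URL, which the Pig script does not specify.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the "top pages for users aged 18-25" query.
final class TopPagesSketch {

    static final class User {
        final String name;
        final int age;
        User(String name, int age) { this.name = name; this.age = age; }
    }

    static final class Page {
        final String user;
        final String url;
        Page(String user, String url) { this.user = user; this.url = url; }
    }

    static List<String> topPages(List<User> users, List<Page> pages, int limit) {
        // FILTER Users BY age >= 18 AND age <= 25
        Set<String> eligible = users.stream()
                .filter(u -> u.age >= 18 && u.age <= 25)
                .map(u -> u.name)
                .collect(Collectors.toSet());

        // JOIN by name/user, then GROUP BY url and COUNT the visits
        Map<String, Long> clicks = pages.stream()
                .filter(p -> eligible.contains(p.user))
                .collect(Collectors.groupingBy(p -> p.url, Collectors.counting()));

        // ORDER BY clicks DESC (ties by url, an added assumption), then LIMIT
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = List.of(new User("alice", 20), new User("bob", 30), new User("carol", 25));
        List<Page> pages = List.of(
                new Page("alice", "a.com"), new Page("alice", "b.com"),
                new Page("carol", "a.com"), new Page("bob", "c.com"));
        System.out.println(topPages(users, pages, 5));
    }
}
```

The streams version only works while the data fits in one machine's memory; Pig compiles the same kind of pipeline into MapReduce jobs so that it can run across the cluster.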
  73. 73. A data warehousing package built on top of Hadoop<br />SQL-like queries on large datasets <br />Hive<br />Hadoop-Related Projects<br />
  74. 74. Hadoop database for random read/write access<br />Uses HDFS as the underlying file system<br />Supports billions of rows and millions of columns<br />Facebook chose HBase as a framework for their new version of “Messages”<br />HBase<br />Hadoop-Related Projects<br />
  75. 75. A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with patches and backports<br />Cloudera<br />Hadoop-Related Projects<br />
  76. 76. Machine learning algorithms for Hadoop<br />Coming up next.. (-:<br />Mahout<br />Hadoop-Related Projects<br />
  77. 77. Big Data can and will cause serious scalability problems to your application<br />MapReduce for analysis, Distributed filesystem for storage<br />Hadoop = MapReduce + HDFS and much more<br />AWS integration is easy<br />Lots of documentation<br />Last words<br />Summary<br />
  78. 78. Hadoop in Action / Chuck Lam<br />Hadoop: The Definitive Guide, 2nd Edition / Tom White (O’Reilly)<br />Apache Hadoop Documentation<br />Hadoop @ Last.FM Presentation <br />MapReduce in Simple Terms / Saliya Ekanayake<br />Amazon Web Services<br />References<br />
  79. 79. Thank you!<br />