Hadoop - Overview
What is it? What is Map/Reduce? Why does it exist? Can I see an example? What's the “architecture”? Who's using it? Is there more info?
Hadoop – What is It?
“Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.” [1]
An Apache project that implements Google's “MapReduce” paradigm for programming and processing large data sets.
[1] http://wiki.apache.org/lucene-hadoop/
[2] http://wiki.apache.org/lucene-hadoop/ProjectDescription
Hadoop – What is MapReduce?
“Simplified Data Processing on Large Clusters” [1]
MapReduce is code and infrastructure for parsing and building large data sets. A map function generates key/value pairs from the input data, and a reduce function then merges all values associated with equivalent keys. Programs are automatically parallelized and executed on a run-time system that manages partitioning of the input data, scheduling of execution, and communication, including recovery from machine failures.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system!
Google stats [2]
Learn more [3]
[1] http://labs.google.com/papers/mapreduce.html
[2] http://googlesystem.blogspot.com/2008/01/google-reveals-more-mapreduce-stats.html
[3] http://code.google.com/edu/content/submissions/mapreduce/listing.html
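To make the paradigm concrete, here is a minimal single-process sketch of the map / group-by-key / reduce pipeline in plain Python (no Hadoop involved); the function names and the word-count task are illustrative, not Hadoop's API:

  from itertools import groupby
  from operator import itemgetter

  def map_phase(lines):
      # The map function: emit a (key, value) pair for every word.
      for line in lines:
          for word in line.split():
              yield (word, 1)

  def reduce_phase(pairs):
      # Group the pairs by key, then merge all values for each key.
      for key, group in groupby(sorted(pairs), key=itemgetter(0)):
          yield (key, sum(count for _, count in group))

  lines = ["the quick brown fox", "the lazy dog", "the fox"]
  print(list(reduce_phase(map_phase(lines))))
  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]

In a real Hadoop job, the sort/group step (the “shuffle”) and the distribution of map and reduce tasks across nodes are what the framework supplies.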
Hadoop / MapReduce – Why?
Hadoop itself is specifically a reaction to the competitive advantage provided by Google's internal “MapReduce” system. Google has not released “MapReduce”, but has participated in and influenced Hadoop's development. IBM and Yahoo! are among Hadoop's backers.
Applies the practices of computational clusters to processing data.
Companies have an increasing amount of information to process; reasons include growth in customers and an increasing need for competitive advantage.
Promotes an “infrastructure approach” to application programming.
Hadoop – Give me an Example...
Python's built-in map [1] (a Python 2 interactive session):
  >>> seq = range(8)
  >>> def add(x, y): return x+y
  >>> map(add, seq, seq)
  [0, 2, 4, 6, 8, 10, 12, 14]
Python's built-in reduce [1]:
  >>> def add(x, y): return x+y
  >>> reduce(add, range(1, 11))
  55
Hadoop / MapReduce style – Evens & Odds:
  Input: [2, 3, 4, 5, 6, 7, 8]
  Map evens to key 1 and odds to key 0: evens → (1, 2), (1, 4), (1, 6), (1, 8); odds → (0, 3), (0, 5), (0, 7)
  Reduce merges the values for each key: evens → (1, [2, 4, 6, 8]), odds → (0, [3, 5, 7])
  (A runnable sketch of this job follows below.)
Hadoop ships its own example programs [2] – e.g. counting the words in a file.
[1] http://docs.python.org/tut/node7.html
[2] http://lucene.apache.org/hadoop/docs/current/api/org/apache/hadoop/examples/package-summary.html
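Here is the Evens & Odds job above as a runnable single-machine sketch in Python; the explicit shuffle function stands in for the grouping the framework performs between the two phases (all names here are illustrative):

  from collections import defaultdict

  def mapper(n):
      # Key 1 for even numbers, key 0 for odd numbers.
      return (1 if n % 2 == 0 else 0, n)

  def shuffle(pairs):
      # Collect every value under its key, as the framework does between phases.
      groups = defaultdict(list)
      for key, value in pairs:
          groups[key].append(value)
      return groups

  def reducer(key, values):
      # Here the "merge" is simply keeping the grouped list of values.
      return (key, values)

  data = [2, 3, 4, 5, 6, 7, 8]
  grouped = shuffle(mapper(n) for n in data)
  print([reducer(k, v) for k, v in sorted(grouped.items())])
  # [(0, [3, 5, 7]), (1, [2, 4, 6, 8])]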
Hadoop – What's the Architecture?
Cluster – code is written in Java; a cluster consists of compute nodes (“tasktrackers”) managed by a “jobtracker”, plus a distributed file system (HDFS) with a “namenode” and “datanodes”.
Infrastructure – typically built on “commodity” hardware [2].
Application – JNI and C/C++ bindings; there is also a “streaming” component for command-line interaction (sketched below) – http://lucene.apache.org/hadoop/docs/r0.15.2/streaming.html
Scale – “Hadoop has been demonstrated on clusters of up to 2000 nodes. Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 2.25 hours) and improving. Sort performances on 1400 nodes and 2000 nodes are pretty good too - sorting 14TB of data on a 1400-node cluster takes 2.2 hours; sorting 20TB on a 2000-node cluster takes 2.5 hours.” [1]
[1] http://wiki.apache.org/lucene-hadoop/FAQ#3
[2] http://wiki.apache.org/lucene-hadoop/MachineScaling
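A sketch of how the streaming component is typically used: any executable that reads lines on stdin and writes “key<TAB>value” lines on stdout can act as a mapper or reducer. The word-count scripts below and the invocation in the final comment are illustrative only – the script names and jar path are assumptions, so check the streaming docs linked above for the exact flags in your release.

  #!/usr/bin/env python
  # mapper.py (hypothetical name) - emits one "word<TAB>1" line per word.
  import sys
  for line in sys.stdin:
      for word in line.split():
          print("%s\t%s" % (word, 1))

  #!/usr/bin/env python
  # reducer.py (hypothetical name) - sums counts per word. Streaming hands the
  # reducer its input sorted by key, so equal keys arrive on consecutive lines.
  import sys
  current, total = None, 0
  for line in sys.stdin:
      word, count = line.rsplit("\t", 1)
      if word != current:
          if current is not None:
              print("%s\t%d" % (current, total))
          current, total = word, 0
      total += int(count)
  if current is not None:
      print("%s\t%d" % (current, total))

  # Illustrative invocation (jar path varies by release):
  #   hadoop jar hadoop-streaming.jar -input in -output out \
  #       -mapper mapper.py -reducer reducer.py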
Hadoop – Who's Using It [1]?
Hadoop Korean User Group – 50-node cluster in the university network, used for development projects: retrieving and analyzing biomedical knowledge, latent semantic analysis, collaborative filtering, HBase and HBase shell testing.
Last.fm – used for chart calculation and web log analysis; a 25-node cluster (dual Xeon LV 2GHz, 4GB RAM, 1TB storage per node) and a 10-node cluster (dual Xeon L5320 1.86GHz, 8GB RAM, 3TB storage per node).
Nutch – flexible web search engine software.
Powerset – natural language search; up to 400 instances on Amazon EC2, with data storage in Amazon S3.
Yahoo! – used to support research for ad systems and web search; more than 5000 nodes running Hadoop as of July 2007; biggest cluster: 2000 nodes (2x4-CPU boxes with 3TB of disk each); also used for scaling tests to support development of Hadoop on larger clusters.
Google and IBM [2]
[1] http://wiki.apache.org/lucene-hadoop/PoweredBy
[2] http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm?campaign_id=rss_tech
Hadoop – More Information
Important concepts – http://wiki.apache.org/lucene-hadoop/ImportantConcepts
Wiki – http://wiki.apache.org/lucene-hadoop/
Overview – http://wiki.apache.org/lucene-hadoop/HadoopMapReduce
Documentation – http://lucene.apache.org/hadoop/docs/r0.15.2/
Quickstart – http://lucene.apache.org/hadoop/docs/r0.15.2/quickstart.html
Tutorial – http://lucene.apache.org/hadoop/docs/r0.15.2/mapred_tutorial.html
Python tutorial – http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
Google Code for Educators, MapReduce in a Week – http://code.google.com/edu/content/submissions/mapreduce/listing.html
HBase: BigTable-like storage, built on Hadoop's DFS just as BigTable uses Google's GFS – http://wiki.apache.org/lucene-hadoop/Hbase
Bigtable paper – http://labs.google.com/papers/bigtable.html
FAQ – http://wiki.apache.org/lucene-hadoop/FAQ
IBM MapReduce Tools for Eclipse – http://www.alphaworks.ibm.com/tech/mapreducetools
