Apache Hadoop - Kumaresan Manickavelu
Problems With Scale
Failure is the defining difference between distributed and local programming.
If components fail, their workload must be picked up by still-functioning units.
Nodes that fail and restart must be able to rejoin the group activity without a full group restart.
Increased load should cause a graceful decline in performance, not failure.
Increasing resources should support a proportional increase in load capacity.
Data must be stored and shared with the processing units.
Hadoop Ecosystem
Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.
Hadoop Common: the common utilities that support the other Hadoop subprojects.
HDFS: a distributed file system that provides high-throughput access to application data.
MapReduce: a software framework for distributed processing of large data sets on compute clusters.
Pig: a high-level data-flow language and execution framework for parallel computation.
HBase: a scalable, distributed database that supports structured data storage for large tables.
HDFS
Based on Google’s GFS.
Redundant storage of massive amounts of data on cheap and unreliable computers.
Optimized for huge files that are mostly appended to and read.
Architecture
HDFS has a master/slave architecture: an HDFS cluster consists of a single NameNode and a number of DataNodes.
HDFS is built in Java; any machine that supports Java can run the NameNode or DataNode software.
The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients.
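To make the client-side view concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API. The file path is hypothetical, and the NameNode address is assumed to come from the standard Hadoop configuration files; this is an illustration, not part of the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the NameNode allocates blocks; the bytes stream to DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read: the NameNode returns block locations; the data is served by DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}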
Map Reduce
Provides a clean abstraction for programmers to write distributed applications.
Factors out many reliability concerns from application logic.
A batch data processing system.
Automatic parallelization & distribution.
Fault tolerance.
Status and monitoring tools.
Programming Model
The programmer has to implement an interface of two functions:
– map (in_key, in_value) -> (out_key, intermediate_value) list
– reduce (out_key, intermediate_value list) -> out_value list
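As a sketch of what this two-function interface looks like in Hadoop's Java MapReduce API, using word counting as a stand-in example (the word-count logic is only an illustration, not taken from the slides):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(in_key, in_value) -> list of (out_key, intermediate_value)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().trim().split("\\s+")) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

// reduce(out_key, intermediate_value list) -> out_value list
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}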
Map Reduce Flow
Mapper (indexing example)
Input is the line number and the actual line.
Input 1: ("100", "I Love India")  ->  Output 1: ("I","100"), ("Love","100"), ("India","100")
Input 2: ("101", "I Love eBay")  ->  Output 2: ("I","101"), ("Love","101"), ("eBay","101")
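A minimal sketch of such a mapper in Hadoop's Java API, assuming the input key already carries the line number as Text and the line is split on whitespace (the class name and input format assumption are illustrative, not from the slides):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For input ("100", "I Love India") this emits ("I","100"), ("Love","100"), ("India","100").
class IndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text lineNo, Text line, Context context)
            throws IOException, InterruptedException {
        for (String word : line.toString().trim().split("\\s+")) {
            context.write(new Text(word), lineNo);
        }
    }
}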
Reducer (indexing example)
Input is a word and the line numbers in which it appears.
Input 1: ("I", "100", "101")
Input 2: ("Love", "100", "101")
Input 3: ("India", "100")
Input 4: ("eBay", "101")
In the output, each word is stored along with its line numbers.
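The matching reducer sketch simply collects the line numbers it receives for each word; the comma-separated posting format is an assumption made for illustration.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For input ("Love", ["100", "101"]) this emits ("Love", "100,101").
class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> lineNos, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text lineNo : lineNos) {
            if (postings.length() > 0) postings.append(",");
            postings.append(lineNo.toString());
        }
        context.write(word, new Text(postings.toString()));
    }
}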
Google PageRank example
Mapper
Input is a link and the HTML content of that page.
Output is, for each outgoing link, that link and the pagerank contribution of this page.
Reducer
Input is a link and a list of pageranks of the pages linking to this page.
Output is the pagerank of this page, which is the weighted average of all input pageranks.
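A much-simplified sketch of one PageRank iteration as a MapReduce pass, following the slide's description. The input record format, types, and parsing are assumptions; real implementations also carry the link graph between iterations and apply a damping factor.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assumed input value per page: "<current rank>\t<outlink1> <outlink2> ..."
class PageRankMapper extends Mapper<Text, Text, Text, DoubleWritable> {
    @Override
    protected void map(Text page, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        double rank = Double.parseDouble(parts[0]);
        String[] outlinks = parts.length > 1 ? parts[1].split("\\s+") : new String[0];
        // Each outgoing link receives an equal share of this page's rank.
        for (String link : outlinks) {
            context.write(new Text(link), new DoubleWritable(rank / outlinks.length));
        }
    }
}

// The new rank of a page is computed from the contributions of the pages linking to it.
class PageRankReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text page, Iterable<DoubleWritable> contributions, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable c : contributions) sum += c.get();
        context.write(page, new DoubleWritable(sum));
    }
}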
Hadoop at Yahoo
World's largest Hadoop production application: the Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores.
Yahoo is the biggest contributor to Hadoop and is converting all its batch processing to Hadoop.
Hadoop at Amazon
Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework.
Thanks. Questions? kumaresan.manickavelu@gmail.com


Editor's Notes

  • #3 If a single node fails on average once a year, then in a cluster of 365 nodes roughly one node will fail every day. eBay pools example. Example of thread and Spring. Example of thumbs pool cache.