Apache Hadoop - Kumaresan Manickavelu
Problems With Scale <ul><li>Failure is the defining difference between distributed and local programming </li></ul><ul><ul...
Hadoop Echo System <ul><li>Apache Hadoop is a collection of open-source software for reliable, scalable, distributed compu...
HDFS <ul><li>Based on Google’s GFS </li></ul><ul><li>Redundant storage of massive amounts of data on cheap and unreliable ...
Map Reduce <ul><li>Provides a clean abstraction for programmers to write distributed application.  </li></ul><ul><li>Facto...
Programming Model <ul><li>Programmer has to implement interface of two functions: </li></ul><ul><li>–  map (in_key, in_val...
Map Reduce Flow
Mapper (indexing example) <ul><li>Input is the line no and the actual line. </li></ul><ul><li>Input  1 :  (“100”,“I Love I...
Reducer (indexing example) <ul><li>Input is word and the line nos.  </li></ul><ul><li>Input  1 : (“I”,“100”,”101”)  </li><...
Google Page Rank example <ul><li>Mapper </li></ul><ul><ul><li>Input is a link and the html content </li></ul></ul><ul><ul>...
Hadoop at Yahoo <ul><li>World's largest Hadoop production application.  </li></ul><ul><li>The Yahoo! Search Webmap is a Ha...
Hadoop at Amazon <ul><li>Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)  <...
Thanks <ul><ul><li>Questions? </li></ul></ul><ul><ul><li>kumaresan . manickavelu @ gmail.com </li></ul></ul>
Upcoming SlideShare
Loading in …5
×

Apache Hadoop

3,574 views

Published on

Beginners guide to Apache Hadoop and MapReduce for developers.

Published in: Technology
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
3,574
On SlideShare
0
From Embeds
0
Number of Embeds
15
Actions
Shares
0
Downloads
142
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide
  • One node failing every day. Then in a cluster of 365 nodes one node will fail every day. Ebay Pools example. Example of thread and spring. Example of thumbs pool cache.
  • Apache Hadoop

    1. 1. Apache Hadoop - Kumaresan Manickavelu
    2. 2. Problems With Scale <ul><li>Failure is the defining difference between distributed and local programming </li></ul><ul><ul><li>If components fail, their workload must be picked up by still-functioning units </li></ul></ul><ul><ul><li>Nodes that fail and restart must be able to rejoin the group activity without a full group restart </li></ul></ul><ul><li>Increased load should cause graceful decline </li></ul><ul><li>Increasing resources should support a proportional increase in load capacity </li></ul><ul><li>Storing and Sharing data with processing units. </li></ul>
    3. 3. Hadoop Echo System <ul><li>Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing. </li></ul><ul><li>Hadoop Common: The common utilities that support the other Hadoop subprojects. </li></ul><ul><li>HDFS: A distributed file system that provides high throughput access to application data. </li></ul><ul><li>MapReduce: A software framework for distributed processing of large data sets on compute clusters. </li></ul><ul><li>Pig: A high-level data-flow language and execution framework for parallel computation. </li></ul><ul><li>HBase: A scalable, distributed database that supports structured data storage for large tables. </li></ul>
    4. 4. HDFS <ul><li>Based on Google’s GFS </li></ul><ul><li>Redundant storage of massive amounts of data on cheap and unreliable computers </li></ul><ul><li>Optimized for huge files that are mostly appended and read </li></ul><ul><li>Architecture </li></ul><ul><ul><li>HDFS has a master/slave architecture </li></ul></ul><ul><ul><li>An HDFS cluster consists of a single NameNode and a number of DataNodes </li></ul></ul><ul><ul><li>HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software </li></ul></ul><ul><li>The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. </li></ul><ul><li>The DataNodes are responsible for serving read and write requests from the file system’s clients. </li></ul>
    5. 5. Map Reduce <ul><li>Provides a clean abstraction for programmers to write distributed application. </li></ul><ul><li>Factors out many reliability concerns from application logic </li></ul><ul><li>A batch data processing system </li></ul><ul><li>Automatic parallelization & distribution </li></ul><ul><li>Fault-tolerance </li></ul><ul><li>Status and monitoring tools </li></ul>
    6. 6. Programming Model <ul><li>Programmer has to implement interface of two functions: </li></ul><ul><li>– map (in_key, in_value) -> </li></ul><ul><li>(out_key, intermediate_value) list </li></ul><ul><li>– reduce (out_key, intermediate_value list) -> </li></ul><ul><li> out_value list </li></ul>
    7. 7. Map Reduce Flow
    8. 8. Mapper (indexing example) <ul><li>Input is the line no and the actual line. </li></ul><ul><li>Input 1 : (“100”,“I Love India ”) </li></ul><ul><li>Output 1 : (“I”,“100”), (“Love”,“100”), (“India”,“100”) </li></ul><ul><li>Input 2 : (“101”,“I Love eBay”) </li></ul><ul><li>Output 2 : (“I”,“101”), (“Love”,“101”), (“eBay”,“101”) </li></ul>
    9. 9. Reducer (indexing example) <ul><li>Input is word and the line nos. </li></ul><ul><li>Input 1 : (“I”,“100”,”101”) </li></ul><ul><li>Input 2 : (“Love”,“100”,”101”) </li></ul><ul><li>Input 3 : (“India”, “100”) </li></ul><ul><li>Input 4 : (“eBay”, “101”) </li></ul><ul><li>Output, the words are stored along with the line nos. </li></ul>
    10. 10. Google Page Rank example <ul><li>Mapper </li></ul><ul><ul><li>Input is a link and the html content </li></ul></ul><ul><ul><li>Output is a list of outgoing link and pagerank of this page </li></ul></ul><ul><li>Reducer </li></ul><ul><ul><li>Input is a link and a list of pagranks of pages linking to this page </li></ul></ul><ul><ul><li>Output is the pagerank of this page, which is the weighted average of all input pageranks </li></ul></ul>
    11. 11. Hadoop at Yahoo <ul><li>World's largest Hadoop production application. </li></ul><ul><li>The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster </li></ul><ul><li>Biggest contributor to Hadoop. </li></ul><ul><li>Converting All its batches to Hadoop. </li></ul>
    12. 12. Hadoop at Amazon <ul><li>Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3) </li></ul><ul><li>The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 </li></ul><ul><li>Amazon Elastic MapReduce is a new web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework. </li></ul>
    13. 13. Thanks <ul><ul><li>Questions? </li></ul></ul><ul><ul><li>kumaresan . manickavelu @ gmail.com </li></ul></ul>

    ×