An introduction to Hadoop for large scale data analysis


  1. Hadoop – Large scale data analysis
     Abhijit Sharma
     Page 1 | 9/8/2011
  2. Big Data Trends
     Unprecedented growth in:
     - Data set size – Facebook's 21+ PB data warehouse, growing 12+ TB/day
     - Un(semi)-structured data – logs, documents, graphs
     - Connected data – web, tags, graphs
     Relevant to enterprises – logs, social media, machine-generated data, breaking down of silos
  3. Putting Big Data to work
     - Data-driven organizations – decision support, new offerings
     - Analytics on large data sets (e.g. Facebook Insights – Page and App stats)
     - Data mining – clustering, e.g. grouping Google News articles
     - Search – Google
  4. Problem characteristics and examples
     - Embarrassingly data-parallel problems
     - Data chunked & distributed across the cluster
     - Parallel processing with data locality – tasks dispatched to where the data is
     - Horizontal/linear scaling using commodity hardware
     - Write once, read many
     Examples:
     - Distributed logs – grep, # of accesses per URL
     - Search – term vector generation, reverse links
  5. What is Hadoop?
     Open-source system for large-scale batch distributed computing on big data:
     - MapReduce programming paradigm & framework
     - MapReduce infrastructure
     - Distributed file system (HDFS)
     Endorsed/used extensively by web giants – Google, Facebook, Yahoo!
  6. MapReduce – Definition
     - MapReduce is a programming model and an implementation for parallel processing of large data sets
     - Map processes each logical record in an input split, generating a set of intermediate key/value pairs
     - Reduce merges all intermediate values associated with the same intermediate key
  7. MapReduce – Functional Programming Origins
     Map: apply a function to each list member – parallelizable
         [1, 2, 3].collect { it * it }
         // [1, 2, 3] -> Map (square) -> [1, 4, 9]
     Reduce: fold a function with an accumulator over the list
         [1, 2, 3].inject(0) { sum, item -> sum + item }
         // [1, 2, 3] -> Reduce (sum) -> 6
     Map & Reduce composed:
         [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
         // [1, 2, 3] -> Map (square) -> [1, 4, 9] -> Reduce (sum) -> 14
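The same square-then-sum pipeline can be sketched with Python's built-in `map` and `functools.reduce`, which mirror the Groovy `collect`/`inject` calls above:

```python
from functools import reduce

nums = [1, 2, 3]

# Map: apply a function to each element independently (parallelizable)
squares = list(map(lambda x: x * x, nums))          # [1, 4, 9]

# Reduce: fold an accumulator over the mapped values
total = reduce(lambda acc, x: acc + x, squares, 0)  # 14

print(squares, total)
```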
  8. Word Count – Shell
     cat *  | grep -oE '\w+' | sort           | uniq -c
     input  | map            | shuffle & sort | reduce
  9. Word Count – MapReduce
  10. Word Count – Pseudocode
      mapper(filename, file-contents):
          for each word in file-contents:
              emit(word, 1)   // one count per occurrence, e.g. ("the", 1) for each "the"

      reducer(word, Iterator values):   // values is the list of counts for a word, e.g. ("the", [1, 1, ...])
          sum = 0
          for each value in values:
              sum = sum + value
          emit(word, sum)
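As a runnable illustration, here is a minimal in-memory Python simulation of this pipeline (a local sketch, not the Hadoop API; `docs` is made-up sample input standing in for the input splits):

```python
from collections import defaultdict

# Hypothetical stand-in for the input splits.
docs = ["the quick brown fox", "the lazy dog", "the fox"]

def mapper(contents):
    # Emit (word, 1) for every occurrence of every word.
    for word in contents.split():
        yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, values):
    return word, sum(values)

pairs = [kv for doc in docs for kv in mapper(doc)]
counts = dict(reducer(w, vs) for w, vs in shuffle(pairs).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```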
  11. Examples – applying the MapReduce definition
      Word count / distributed log search for # of accesses to various URLs
      - Map – emits (word/URL, 1) for each doc/log split
      - Reduce – sums up the counts for a specific word/URL
      Term vector generation – term -> [doc-id]
      - Map – emits (term, doc-id) for each doc split
      - Reduce – identity reducer – accumulates (term, [doc-id, doc-id, ...])
      Reverse links – invert source -> target to target -> source
      - Map – emits (target, source) for each doc split
      - Reduce – identity reducer – accumulates (target, [source, source, ...])
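The reverse-links case can be sketched the same way (a local simulation with made-up crawl data, not the Hadoop API): the mapper inverts each edge and the identity reducer simply collects the grouped sources per target.

```python
from collections import defaultdict

# Hypothetical crawl data: each page and the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

def mapper(source, targets):
    # Invert each edge: emit (target, source).
    for target in targets:
        yield target, source

def shuffle_and_identity_reduce(pairs):
    # The identity reducer just accumulates all sources per target.
    inbound = defaultdict(list)
    for target, source in pairs:
        inbound[target].append(source)
    return dict(inbound)

pairs = [kv for src, tgts in links.items() for kv in mapper(src, tgts)]
reverse = shuffle_and_identity_reduce(pairs)
print(reverse)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```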
  12. MapReduce – Hadoop Implementation
      Hides the complexity of distributed computing:
      - Automatic parallelization of jobs
      - Automatic data chunking & distribution (via HDFS)
      - Data locality – MR tasks dispatched to where the data is
      - Fault tolerance against server, storage, and network failures
      - Network and disk transfer optimization
      - Load balancing
  13. Hadoop MapReduce Architecture
  14. HDFS Characteristics
      - Very large files – block size 64 MB / 128 MB
      - Data access pattern: write once, read many
      - Writes are large, create & append only
      - Reads are large & streaming
      - Commodity hardware
      - Tolerant of server, storage, and network failure
      - Highly available through transparent replication
      - Throughput is more important than latency
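As a rough illustration of the block model (64 MB is the default named above; the 200 MB file length is a made-up example), a file occupies a run of fixed-size blocks plus an optional partial tail block:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

# A hypothetical 200 MB file -> three 64 MB blocks and one 8 MB tail block.
sizes = split_into_blocks(200 * 1024 * 1024)
print(len(sizes), sizes[-1] // (1024 * 1024))  # 4 8
```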
  15. HDFS Architecture
  16. Thanks
  17. Backup Slides
  18. Map & Reduce Functions
  19. Job Configuration
  20. Hadoop MapReduce Components
      Job Tracker – tracks MR jobs; runs on the master node
      Task Tracker:
      - Runs on data nodes and tracks the mapper and reducer tasks assigned to that node
      - Sends heartbeats to the Job Tracker
      - Maintains and picks up tasks from a queue
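A toy sketch of the heartbeat-driven assignment loop (assumed class and method names, far simpler than the real Hadoop protocol): each heartbeat from a Task Tracker gives the Job Tracker a chance to hand out the next queued task.

```python
from collections import deque

class JobTracker:
    def __init__(self, tasks):
        self.queue = deque(tasks)  # pending map/reduce tasks
        self.assignments = {}      # tracker id -> tasks handed out

    def heartbeat(self, tracker_id):
        """Called by a Task Tracker; returns the next pending task, if any."""
        if not self.queue:
            return None
        task = self.queue.popleft()
        self.assignments.setdefault(tracker_id, []).append(task)
        return task

jt = JobTracker(["map-0", "map-1", "reduce-0"])
print(jt.heartbeat("tt-1"), jt.heartbeat("tt-2"), jt.heartbeat("tt-1"))
# map-0 map-1 reduce-0
```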
  21. HDFS
      Name Node:
      - Manages the file system namespace and regulates client access to files – stores the metadata
      - Maps blocks to Data Nodes and replicas; manages replication
      - Executes file system namespace operations such as opening, closing, and renaming files and directories
      Data Node:
      - One per node; manages the local storage attached to that node
      - Internally, a file is split into one or more blocks, and these blocks are stored on a set of Data Nodes
      - Serves read and write requests from the file system's clients; also performs block creation, deletion, and replication on instruction from the Name Node
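A toy model of the Name Node's block map (assumed names and a naive round-robin placement; real HDFS placement is rack-aware and far more involved): each block id maps to the Data Nodes holding its replicas.

```python
import itertools

class NameNode:
    def __init__(self, datanodes, replication=3):
        self.replication = replication
        self.datanodes = datanodes
        self.block_map = {}  # block id -> data nodes holding a replica
        self._rr = itertools.cycle(datanodes)  # naive round-robin placement

    def allocate_block(self, block_id):
        # Pick `replication` distinct nodes (assumes enough nodes exist).
        replicas = []
        while len(replicas) < min(self.replication, len(self.datanodes)):
            node = next(self._rr)
            if node not in replicas:
                replicas.append(node)
        self.block_map[block_id] = replicas
        return replicas

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
print(nn.allocate_block("blk_0001"))  # ['dn1', 'dn2', 'dn3']
```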