An introduction to Hadoop for large scale data analysis
Transcript

  • 1. Hadoop – Large scale data analysis
    Abhijit Sharma
    Page 1 | 9/8/2011
  • 2. Big Data Trends
    Unprecedented growth in:
    - Data set size – Facebook: 21+ PB data warehouse, 12+ TB/day
    - Un(semi)-structured data – logs, documents, graphs
    - Connected data – web, tags, graphs
    Relevant to enterprises – logs, social media, machine-generated data, breaking down of silos
  • 3. Putting Big Data to Work
    - Data-driven organizations – decision support, new offerings
    - Analytics on large data sets (e.g. Facebook Insights – Page and App stats)
    - Data mining – clustering, e.g. Google News articles
    - Search – Google
  • 4. Problem Characteristics and Examples
    - Embarrassingly data-parallel problems
    - Data is chunked and distributed across the cluster
    - Parallel processing with data locality – tasks are dispatched where the data is
    - Horizontal/linear scaling using commodity hardware
    - Write once, read many
    Examples:
    - Distributed logs – grep, number of accesses per URL
    - Search – term vector generation, reverse links
  • 5. What is Hadoop?
    Open-source system for large-scale batch distributed computing on big data:
    - Map Reduce programming paradigm and framework
    - Map Reduce infrastructure
    - Distributed file system (HDFS)
    Endorsed/used extensively by web giants – Google, Facebook, Yahoo!
  • 6. Map Reduce – Definition
    - MapReduce is a programming model and implementation for the parallel processing of large data sets
    - Map processes each logical record in an input split to generate a set of intermediate key/value pairs
    - Reduce merges all intermediate values associated with the same intermediate key
  • 7. Map Reduce – Functional Programming Origins
    Map: apply a function to each list member – parallelizable
      [1, 2, 3].collect { it * it }
      Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9]
    Reduce: apply a function with an accumulator to each list member
      [1, 2, 3].inject(0) { sum, item -> sum + item }
      Output: [1, 2, 3] -> Reduce (Sum) -> 6
    Map & Reduce:
      [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
      Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) -> 14
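The one-liners above are written in Groovy; the same map/reduce chain can be sketched in Python (a stand-in language here, not from the deck) to show that Map and Reduce are ordinary list operations:

```python
from functools import reduce

# Map: apply a function to each list member (independently parallelizable).
squares = [x * x for x in [1, 2, 3]]                  # Map (Square): [1, 4, 9]

# Reduce: fold the list into a single value with an accumulator.
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)  # Reduce (Sum): 6

# Map then Reduce: square every element, then sum the squares.
result = reduce(lambda acc, x: acc + x,
                (x * x for x in [1, 2, 3]), 0)        # 1 + 4 + 9 = 14
```

Because each Map application is independent and Reduce only needs the intermediate values, both stages distribute naturally across machines, which is exactly the structure MapReduce exploits.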
  • 8. Word Count – Shell
    cat * | grep | sort | uniq -c
    input | map | shuffle & sort | reduce
  • 9. Word Count – Map Reduce
  • 10. Word Count – Pseudocode
    mapper(filename, file-contents):
      for each word in file-contents:
        emit(word, 1)  // one count per occurrence, e.g. ("the", 1) for each occurrence of "the"

    reducer(word, Iterator values):  // iterator over the counts for a word, e.g. ("the", [1, 1, ...])
      sum = 0
      for each value in values:
        sum = sum + value
      emit(word, sum)
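The pseudocode above can be made runnable as a single-process sketch in Python (an illustration, not the Hadoop API); the explicit `shuffle` step groups intermediate pairs by key, which the Hadoop framework performs automatically between the map and reduce phases:

```python
from collections import defaultdict

def mapper(filename, contents):
    # Emit (word, 1) for every occurrence, e.g. ("the", 1).
    for word in contents.split():
        yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key, e.g. ("the", [1, 1, ...]).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, values):
    # Sum all counts emitted for this word.
    return (word, sum(values))

# Hypothetical input documents, standing in for HDFS input splits.
docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog"}
pairs = [kv for name, text in docs.items() for kv in mapper(name, text)]
counts = dict(reducer(w, vs) for w, vs in shuffle(pairs))
# counts["the"] == 2
```

In a real job, each mapper runs on one input split with data locality, and each reducer receives one key's value list after the distributed shuffle and sort.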
  • 11. Examples – Map Reduce Definitions
    Word count / distributed log search for the number of accesses per URL:
    - Map – emits (word/URL, 1) for each doc/log split
    - Reduce – sums the counts for a given word/URL
    Term vector generation – term -> [doc-id]:
    - Map – emits (term, doc-id) for each doc split
    - Reduce – identity reducer – accumulates (term, [doc-id, doc-id, ...])
    Reverse links – invert source -> target to target -> source:
    - Map – emits (target, source) for each doc split
    - Reduce – identity reducer – accumulates (target, [source, source, ...])
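The reverse-links pattern can be sketched the same way; the `links` data and function names below are hypothetical, chosen only to illustrate the map/identity-reduce shape described above:

```python
from collections import defaultdict

def mapper(source, targets):
    # Emit (target, source) for each outgoing link in a document.
    for target in targets:
        yield (target, source)

def reducer(target, sources):
    # Identity reducer: accumulate (target, [source, source, ...]).
    return (target, sources)

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"]}

# Simulate the shuffle: group emitted pairs by target.
groups = defaultdict(list)
for src, tgts in links.items():
    for tgt, s in mapper(src, tgts):
        groups[tgt].append(s)

inverted = dict(reducer(t, ss) for t, ss in groups.items())
# inverted["c"] == ["a", "b"]  (pages that link to "c")
```

Term vector generation has the identical structure with (term, doc-id) pairs in place of (target, source).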
  • 12. Map Reduce – Hadoop Implementation
    Hides the complexity of distributed computing:
    - Automatic parallelization of jobs
    - Automatic data chunking and distribution (via HDFS)
    - Data locality – MR tasks are dispatched where the data is
    - Fault tolerance against server, storage, and network failures
    - Network and disk transfer optimization
    - Load balancing
  • 13. Hadoop Map Reduce Architecture
  • 14. HDFS Characteristics
    - Very large files – block size 64 MB / 128 MB
    - Data access pattern: write once, read many
    - Writes are large, create and append only
    - Reads are large and streaming
    - Runs on commodity hardware
    - Tolerant of server, storage, and network failures
    - Highly available through transparent replication
    - Throughput is more important than latency
  • 15. HDFS Architecture
  • 16. Thanks
  • 17. Backup Slides
  • 18. Map & Reduce Functions
  • 19. Job Configuration
  • 20. Hadoop Map Reduce Components
    Job Tracker – tracks MR jobs; runs on the master node
    Task Tracker:
    - Runs on data nodes and tracks the Mapper and Reducer tasks assigned to the node
    - Sends heartbeats to the Job Tracker
    - Maintains a task queue and picks up tasks from it
  • 21. HDFS
    Name Node:
    - Manages the file system namespace and regulates client access to files – stores metadata
    - Maintains the mapping of blocks to Data Nodes and their replicas
    - Manages replication
    - Executes file system namespace operations such as opening, closing, and renaming files and directories
    Data Node:
    - One per node; manages the local storage attached to that node
    - Internally, a file is split into one or more blocks, and these blocks are stored across a set of Data Nodes
    - Serves read and write requests from the file system's clients; also performs block creation, deletion, and replication upon instruction from the Name Node
