Introduction to Hadoop and MapReduce

Slides of the workshop conducted in Model Engineering College, Ernakulam, and Sree Narayana Gurukulam College, Kadayiruppu, Kerala, India, in December 2010.

Transcript of "Introduction to Hadoop and MapReduce"

1. Overview of Hadoop and MapReduce
   Ganesh Neelakanta Iyer
   Research Scholar, National University of Singapore
2. About Me
   I have 3 years of industry work experience:
   - Sasken Communication Technologies Ltd, Bangalore
   - NXP Semiconductors Pvt Ltd (formerly Philips Semiconductors), Bangalore
   I finished my Master's in Electrical and Computer Engineering at NUS (National University of Singapore) in 2008.
   Currently a Research Scholar at NUS under the guidance of A/P Bharadwaj Veeravalli.
   Research interests: Cloud computing, game theory, resource allocation and pricing
   Personal interests: Kathakali, teaching, travelling, photography
3. Agenda
   • Introduction to Hadoop
   • Introduction to HDFS
   • MapReduce Paradigm
   • Some practical MapReduce examples
   • MapReduce in Hadoop
   • Concluding remarks
4. Introduction to Hadoop
5. Data!
   • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage
   • The New York Stock Exchange generates about one terabyte of new trade data per day
   • In the last week alone, I personally took 15 GB of photos while travelling. Imagine the storage needed for all the photos taken worldwide in a single day!
6. Hadoop
   • Open-source cloud platform supported by Apache
   • Reliable shared storage and analysis system
   • Uses a distributed file system (called HDFS), similar to Google's GFS
   • Can be used for a variety of applications
7. Typical Hadoop Cluster
   (Image from Pro Hadoop by Jason Venner)
8. Typical Hadoop Cluster
   • Aggregation switch and rack switches
   • 40 nodes/rack, 1000-4000 nodes in cluster
   • 1 Gbps bandwidth within rack, 8 Gbps out of rack
   • Node specs (Yahoo terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
   Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf
9. [Image slide] Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
10. Introduction to HDFS
11. HDFS – Hadoop Distributed File System
    Very large distributed file system
    – 10K nodes, 100 million files, 10 PB
    Assumes commodity hardware
    – Files are replicated to handle hardware failure
    – Detects failures and recovers from them
    Optimized for batch processing
    – Data locations exposed so that computations can move to where data resides
    – Provides very high aggregate bandwidth
    Runs in user space, on heterogeneous OS
    http://www.gartner.com/it/page.jsp?id=1447613
12. Distributed File System
    Data coherency
    – Write-once-read-many access model
    – Client can only append to existing files
    Files are broken up into blocks
    – Typically 128 MB block size
    – Each block replicated on multiple DataNodes
    Intelligent client
    – Client can find the location of blocks
    – Client accesses data directly from the DataNode
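    For a sense of scale under these defaults: assuming the 128 MB block size above and HDFS's default replication factor of 3, a 1 GB file is split into 8 blocks, each block is stored on 3 DataNodes, and the cluster therefore holds 24 block replicas, spending 3 GB of raw disk on 1 GB of logical data.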
13. MapReduce Paradigm
14. MapReduce
    Simple data-parallel programming model designed for scalability and fault tolerance
    Framework for distributed processing of large data sets
    Pluggable user code runs in a generic framework
    Pioneered by Google, where it was originally designed
    – Processes 20 petabytes of data per day
15. What is MapReduce used for?
    At Google:
    – Index construction for Google Search
    – Article clustering for Google News
    – Statistical machine translation
    At Yahoo!:
    – "Web map" powering Yahoo! Search
    – Spam detection for Yahoo! Mail
    At Facebook:
    – Data mining
    – Ad optimization
    – Spam detection
16. What is MapReduce used for?
    In research:
    – Astronomical image analysis (Washington)
    – Bioinformatics (Maryland)
    – Analyzing Wikipedia conflicts (PARC)
    – Natural language processing (CMU)
    – Particle physics (Nebraska)
    – Ocean climate simulation (Washington)
    – <Your application here>
17. MapReduce Programming Model
    Data type: key-value records

    Map function:
        (K_in, V_in) → list(K_inter, V_inter)

    Reduce function:
        (K_inter, list(V_inter)) → list(K_out, V_out)
18. Example: Word Count

    def mapper(line):
        foreach word in line.split():
            output(word, 1)

    def reducer(key, values):
        output(key, sum(values))
19. [Diagram: word count data flow — Input, Map, Shuffle & Sort, Reduce, Output]
    The input lines "the quick brown fox", "the fox ate the mouse", and "how now brown cow" pass through Map, Shuffle & Sort, and Reduce to yield the final counts: ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3.
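    To make this flow concrete outside a cluster, here is a self-contained Python sketch that simulates the three phases in memory (mapper, shuffle, and reducer are illustrative names, not part of any Hadoop API):

        from collections import defaultdict

        def mapper(line):
            # Map: emit (word, 1) for every word in the line.
            for word in line.split():
                yield (word, 1)

        def shuffle(pairs):
            # Shuffle & Sort: group all values by key, keys in sorted order.
            groups = defaultdict(list)
            for key, value in pairs:
                groups[key].append(value)
            return sorted(groups.items())

        def reducer(key, values):
            # Reduce: sum the counts for one word.
            yield (key, sum(values))

        lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
        mapped = [pair for line in lines for pair in mapper(line)]
        for key, values in shuffle(mapped):
            for word, count in reducer(key, values):
                print(word, count)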
20. MapReduce Execution Details
    Single master controls job execution on multiple slaves
    Mappers preferentially placed on same node or same rack as their input block
    – Minimizes network usage
    Mappers save outputs to local disk before serving them to reducers
    – Allows recovery if a reducer crashes
    – Allows having more reducers than nodes
21. Fault Tolerance in MapReduce
    1. If a task crashes:
    – Retry on another node
      • OK for a map because it has no dependencies
      • OK for a reduce because map outputs are on disk
    – If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
22. Fault Tolerance in MapReduce
    2. If a node crashes:
    – Re-launch its current tasks on other nodes
    – Re-run any maps the node previously ran
      • Necessary because their output files were lost along with the crashed node
23. Fault Tolerance in MapReduce
    3. If a task is going slowly (straggler):
    – Launch a second copy of the task on another node ("speculative execution")
    – Take the output of whichever copy finishes first, and kill the other
    Surprisingly important in large clusters
    – Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
    – A single straggler may noticeably slow down a job
24. Takeaways
    By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
    – Automatic division of job into tasks
    – Automatic placement of computation near data
    – Automatic load balancing
    – Recovery from failures & stragglers
    User focuses on application, not on complexities of distributed computing
25. Some practical MapReduce examples
26. 1. Search
    Input: (lineNumber, line) records
    Output: lines matching a given pattern

    Map:
        if (line matches pattern):
            output(line)

    Reduce: identity function
    – Alternative: no reducer (map-only job)
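    As a hedged sketch of how this search could be written as a Hadoop Streaming mapper in Python (the pattern is illustrative; with no reducer configured, the job is map-only and each matching line passes straight to the output):

        import re
        import sys

        PATTERN = re.compile(r"hadoop")  # illustrative search pattern

        for line in sys.stdin:
            # Emit matching lines unchanged; non-matching lines produce no output.
            if PATTERN.search(line):
                sys.stdout.write(line)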
27. 2. Sort
    Input: (key, value) records
    Output: same records, sorted by key

    Map: identity function
    Reduce: identity function
    Trick: pick the partitioning function h such that k1 < k2 => h(k1) < h(k2)

    [Diagram: map outputs such as (ant, bee), (cow), (sheep, yak), and (zebra) are routed to Reduce [A-M] or Reduce [N-Z] by the partitioning function, so the concatenated reducer outputs run from aardvark to zebra in sorted order.]
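    A minimal sketch of such an order-preserving partitioning function, assuming just two reducers split on the first letter as in the [A-M]/[N-Z] diagram (the boundary is illustrative):

        def partition(key):
            # Order-preserving split: keys starting A-M go to reducer 0, N-Z to
            # reducer 1, so concatenating the reducers' sorted outputs yields
            # one globally sorted list.
            return 0 if key[0].lower() <= "m" else 1

        for key in ["aardvark", "ant", "pig", "zebra"]:
            print(key, "-> reducer", partition(key))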
28. 3. Inverted Index
    Input: (filename, text) records
    Output: list of files containing each word

    Map:
        foreach word in text.split():
            output(word, filename)

    Combine: uniquify filenames for each word

    Reduce:
        def reduce(word, filenames):
            output(word, sort(filenames))
29. Inverted Index Example
    Input files:
    – hamlet.txt: "to be or not to be"
    – 12th.txt: "be not afraid of greatness"
    Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
    Reduce output:
    – afraid, (12th.txt)
    – be, (12th.txt, hamlet.txt)
    – greatness, (12th.txt)
    – not, (12th.txt, hamlet.txt)
    – of, (12th.txt)
    – or, (hamlet.txt)
    – to, (hamlet.txt)
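    A self-contained Python sketch of the same index built in memory (the file contents are the two example files above; using a set folds the combiner's uniquify step into the map phase):

        from collections import defaultdict

        files = {
            "hamlet.txt": "to be or not to be",
            "12th.txt": "be not afraid of greatness",
        }

        index = defaultdict(set)              # the set uniquifies filenames
        for filename, text in files.items():
            for word in text.split():         # map: emit (word, filename)
                index[word].add(filename)

        for word in sorted(index):            # reduce: sorted file list per word
            print(word, sorted(index[word]))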
30. 4. Most Popular Words
    Input: (filename, text) records
    Output: top 100 words occurring in the most files

    Two-stage solution:
    – Job 1: Create inverted index, giving (word, list(file)) records
    – Job 2: Map each (word, list(file)) to (count, word), then sort these records by count as in the sort job
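    A minimal in-memory sketch of Job 2 (the small index literal below stands in for Job 1's output, and top-3 stands in for the slide's top 100):

        # Job 2: turn (word, list(file)) records into (count, word), sort by count.
        index = {
            "be": ["12th.txt", "hamlet.txt"],
            "not": ["12th.txt", "hamlet.txt"],
            "to": ["hamlet.txt"],
            "afraid": ["12th.txt"],
        }

        pairs = [(len(files), word) for word, files in index.items()]
        for count, word in sorted(pairs, reverse=True)[:3]:  # keep the top N
            print(word, "appears in", count, "files")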
31. MapReduce in Hadoop
32. MapReduce in Hadoop
    Three ways to write jobs in Hadoop:
    – Java API
    – Hadoop Streaming (for Python, Perl, etc.)
    – Pipes API (C++)
33. Word Count in Python with Hadoop Streaming

    Mapper.py:
        import sys

        for line in sys.stdin:
            for word in line.split():
                print(word.lower() + "\t" + "1")

    Reducer.py:
        import sys

        counts = {}
        for line in sys.stdin:
            word, count = line.split("\t")
            counts[word] = counts.get(word, 0) + int(count)

        for word, count in counts.items():
            print(word + "\t" + str(count))
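    These two scripts can be smoke-tested locally before touching a cluster, since Streaming simply pipes text through stdin and stdout; a common check (file name illustrative) is: cat input.txt | python Mapper.py | sort | python Reducer.py. The local sort stands in for Hadoop's shuffle, which delivers mapper output to reducers grouped by key.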
34. Concluding remarks
35. Conclusions
    MapReduce programming model hides the complexity of work distribution and fault tolerance
    Principal design philosophies:
    – Make it scalable, so you can throw hardware at problems
    – Make it cheap, lowering hardware, programming and admin costs
    MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time
    Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale
36. What next?
    MapReduce has limitations – it suits only certain classes of applications
    Some developments:
    • Pig, started at Yahoo! Research
    • Hive, developed at Facebook
    • Amazon Elastic MapReduce
37. Resources
    Hadoop: http://hadoop.apache.org/core/
    Pig: http://hadoop.apache.org/pig
    Hive: http://hadoop.apache.org/hive
    Video tutorials: http://www.cloudera.com/hadoop-training
    Amazon Web Services: http://aws.amazon.com/
    Amazon Elastic MapReduce guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/
    Slides of the talk delivered by Matei Zaharia, EECS, University of California, Berkeley
38. Thank you!
    ganesh.iyer@nus.edu.sg
    http://ganeshniyer.com