MapReduce basics
        Harisankar H,
     PhD student, DOS lab,
     Dept. CSE, IIT Madras

          6-Feb-2013

http://harisankarh.wordpress.com
Distributed processing?
 • Processing distributed across multiple
   machines/servers




Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg
Why distributed processing?
– Reduce execution time of large jobs
   • E.g., extracting urls from terabytes of data
   • Ideally, 1000 machines could finish the job up to 1000 times faster
– Fault-tolerance
   • Other nodes will take over the jobs if some of the
     nodes fail
      – Typically, if you have 10,000 servers, on average one will
        fail per day
Issues in distributed processing
• Realized traditionally using special-purpose
  implementations
   – E.g., indexer, log processor
• Implementation really hard at socket programming level
   – Fault-tolerance
      • Keep track of failure, reassignment of tasks
   – Hand-coded parallelization
   – Scheduling across heterogeneous nodes
   – Locality
      • Minimise movement of data for computation
   – How to distribute data?
• Results in:
   – Complex, brittle, non-generic code
   – Reimplementation of common features like fault-tolerance,
     distribution
Need for a generic abstraction for distributed processing

App programmer | abstraction | systems developer

 • Separation of concerns
   – App programmer: express app logic
   – Systems developer: performance, fault handling etc.

 • Tradeoff between genericity and performance
   – More generic => usually less performance
 • MapReduce is probably a sweet spot where you
   have both to some extent
MapReduce abstraction (app programmer’s view)
  • Model input and output as <key,value> pairs
  • Provide map() and reduce() functions which
    act on <k,v> pairs
  • Input: set of <k,v> pairs: {k,v}
      – For each input <k,v>:
             map(k1,v1) -> list(k2,v2)
      – For each unique output key from map:
             reduce(k2, combined list(v2)) -> list(v3)

System will take care of distributing the tasks across thousands of machines,
handling locality, fault-tolerance etc.
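
To make the contract above concrete, here is a minimal single-process sketch in Java. It only illustrates the programming model (map, group by key, reduce); it does no distribution, locality handling, or fault tolerance. The class name MiniMapReduce and its methods are hypothetical and are not part of Hadoop or of the original slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical in-memory model of the MapReduce contract (not a Hadoop API).
final class MiniMapReduce {

    // Runs map() on every input pair, groups the emitted values by key,
    // then runs reduce() once per unique intermediate key.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Map phase plus grouping: collect list(v2) for each k2.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : map.apply(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce phase: one call per unique intermediate key.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Example job: word count over <document name, document contents> pairs.
    static Map<String, List<Integer>> wordCount(List<Map.Entry<String, String>> docs) {
        return MiniMapReduce.<String, String, String, Integer, Integer>run(
            docs,
            (name, contents) -> {
                List<Map.Entry<String, Integer>> out = new ArrayList<>();
                for (String w : contents.trim().split("\\s+")) {
                    if (!w.isEmpty()) {
                        out.add(Map.entry(w, 1)); // emit <word, 1>
                    }
                }
                return out;
            },
            (word, counts) -> {
                int sum = 0;
                for (int c : counts) {
                    sum += c;
                }
                return List.of(sum); // a single output value per word
            });
    }
}
```

Calling MiniMapReduce.wordCount on the two documents <"d1", "a b a"> and <"d2", "b c"> yields a=[2], b=[2], c=[1]. A real platform would run the map and reduce calls on different machines; only the two functions passed to run() would be written by the app programmer.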
Example: word count
• Problem:
   – Count the number of occurrences of each unique
     word in a big collection of documents
• Input <k,v> set:
   – <document name, document contents>
      • Organize the files in this format
• Output:
   – <word, count>
      • Get it in output files
• Next step:
   – Define the map() and reduce() functions
Word count
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, List values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
Program in Java

     // word (a Text) and one (an IntWritable holding 1) are assumed to be
     // fields of the enclosing Mapper class; the slide does not show them.
     public void map(LongWritable key, Text value, Context context) throws …
     {
       String line = value.toString();
       StringTokenizer tokenizer = new StringTokenizer(line);
       while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, one);      // emit <word, 1>
       }
     }

     public void reduce(Text key, Iterable<IntWritable> values, Context context) throws …
     {
       int sum = 0;
       for (IntWritable val : values) {
         sum += val.get();              // add up the counts for this word
       }
       context.write(key, new IntWritable(sum));
     }
Implementing MapReduce abstraction

App programmer | abstraction | systems developer


 • Looked at the application programmer’s view
 • Need a platform which implements the
   MapReduce abstraction
 • Hadoop is a popular open-source
   implementation of the MapReduce abstraction
 • Questions for the platform developer
   – How to
      • parallelize?
      • handle faults?
      • provide locality?
      • distribute the data?
Basics of platform implementation
• parallelize?
   – Each map can be executed independently in parallel
   – After all maps have finished execution, all reduces can be
     executed in parallel
• handle faults?
   – map() and reduce() have no internal state
      • Simply re-execute in case of a failure
• distribute the data?
   – Have a distributed file system (HDFS)
• provide locality?
   – Prefer to execute map() on the nodes holding the input <k,v>
     pairs
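
The first two answers can be sketched in a few lines of Java. This is an illustration under stated assumptions, not how Hadoop schedules tasks: the class StatelessTasks and its methods are hypothetical, a thrown RuntimeException stands in for a worker failure, and a parallel stream on one machine stands in for a cluster.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch: stateless tasks can be retried and run in parallel.
final class StatelessTasks {

    // Re-runs a task until it succeeds or the attempts run out. This is safe
    // only because the task is assumed to keep no internal state.
    static <I, O> O runWithRetry(Function<I, O> task, I input, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.apply(input);
            } catch (RuntimeException e) {
                last = e; // treat as a failed worker and simply re-execute
            }
        }
        throw new IllegalStateException("task failed after " + maxAttempts + " attempts", last);
    }

    // Map tasks do not depend on each other, so they may run in parallel.
    // Results come back in the same order as the inputs.
    static <I, O> List<O> runAll(List<I> inputs, Function<I, O> task, int maxAttempts) {
        return inputs.parallelStream()
                .map(in -> runWithRetry(task, in, maxAttempts))
                .collect(Collectors.toList());
    }
}
```

Re-execution is only this simple because a repeated call with the same input produces the same output; a task with side effects would need extra care.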
MapReduce implementation
• Distributed File System (DFS) +
  MapReduce (MR) Engine
  – Specifically, MR engine uses a DFS
• Distributed file system
  – Files split into large chunks and stored in the
    distributed file system (e.g., HDFS)
  – Large chunks: typically 64 MB per block
  – Can have a master-slave architecture
     • Master assigns and manages replicated blocks in the
       slaves
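
As a small worked example of the chunking described above, the following Java sketch computes the block boundaries for a file of a given size. The 64 MB figure comes from the slide; the class BlockSplitter is hypothetical, and replica placement by the master is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting a file into fixed-size blocks.
final class BlockSplitter {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, as on the slide

    // Returns one {offset, length} pair per block; the last block may be shorter.
    static List<long[]> split(long fileSize, long blockSize) {
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            blocks.add(new long[] {offset, Math.min(blockSize, fileSize - offset)});
        }
        return blocks;
    }
}
```

For a 200 MB file this gives four blocks: three of 64 MB and a final one of 8 MB.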
MapReduce engine
• Has a master-slave architecture
  – Master co-ordinates the task execution across
    workers
  – Workers perform the map() and reduce()
    functions
     • Read blocks from and write blocks to the DFS
  – Master keeps track of worker failures and
    reassigns tasks if necessary
     • Failure detection usually done through timeouts
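
Timeout-based failure detection can be sketched as follows. This is a simplified illustration, not Hadoop's actual mechanism: the class HeartbeatMonitor is hypothetical, workers are assumed to report in periodically, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a master tracking worker liveness by timeout.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new TreeMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by the master whenever a worker reports in.
    void heartbeat(String workerId, long nowMillis) {
        lastSeen.put(workerId, nowMillis);
    }

    // Workers silent for longer than the timeout are presumed failed;
    // the master would then reassign their tasks to other workers.
    List<String> failedWorkers(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}
```

A timeout cannot tell a crashed worker from a slow one, which is another reason why map() and reduce() being safe to re-execute matters.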
Some tips for designing MR jobs
• Reduce network traffic between map and reduce
  – Model map() and reduce() jobs appropriately
  – Use combine() functions
     • combine(<k,[v]>) -> <k,[v]>
     • combine() executes after all map()s finish in each block
         – map() -> [same node] -> combine() -> [network] -> reduce()

• Make map jobs of roughly equal expected
  execution times
• Try to make reduce() jobs less skewed
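
The effect of combine() on network traffic can be seen in a small Java sketch for word count. The class CombinerSketch is hypothetical and runs in one process; it only compares how many pairs would leave the mapper's node with and without local combining.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: local combining shrinks the map output before the network.
final class CombinerSketch {

    // map(): emit <word, 1> for every word in one block of text.
    static List<Map.Entry<String, Integer>> map(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : block.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(Map.entry(w, 1));
            }
        }
        return out;
    }

    // combine(): sum the counts per word on the mapper's own node, so that
    // at most one pair per distinct word is sent on to reduce().
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            partial.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return partial;
    }
}
```

For the block "to be or not to be", map() emits 6 pairs while combine() leaves 4 (be=2, not=1, or=1, to=2). This works for word count because addition can be applied to partial results in any grouping; not every reduce() can be reused as a combiner.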
Pros and cons of MapReduce
• Advantages
  – Simple, easy-to-use distributed processing system
  – Reasonably generic
  – Exploits locality for performance
  – Simple and less buggy implementation
• Issues
  – Not a magic bullet that fits all problems
       • Difficult to model iterative and recursive computations
           – E.g.: k-means clustering
           – Generate-Map-Reduce
       • Difficult to model streaming computations
       • Centralized entities like the master become bottlenecks
       • Most real-world problems require large chains of MR jobs
Summary
  • Today
       –   Distributed processing issues, MR programming model
       –   Sample MR job
       –   How MR can be implemented
       –   Pros and cons of MR, tips for better performance
  • Tomorrow
       – Details specific to Hadoop
       – Downloading and setting up of Hadoop on a cluster

Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data
processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
Hadoop components
• HDFS
  – Master: NameNode
  – Slave: DataNode
• MapReduce engine
  – Master: JobTracker
  – Slave: TaskTracker


  • 1. MapReduce basics
      Harisankar H, PhD student, DOS lab, Dept. CSE, IIT Madras
      6-Feb-2013
      http://harisankarh.wordpress.com
  • 2. Distributed processing?
      • Processing distributed across multiple machines/servers
      Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg
  • 3. Why distributed processing?
      – Reduce execution time of large jobs
          • E.g., extracting URLs from terabytes of data
          • Ideally, 1000 machines could finish a job up to 1000 times faster
      – Fault-tolerance
          • Other nodes take over the jobs if some of the nodes fail
              – Typically, if you have 10,000 servers, one will fail per day on average
  • 4. Issues in distributed processing
      • Traditionally realized using special-purpose implementations
          – E.g., indexer, log processor
      • Implementation is really hard at the socket programming level
          – Fault-tolerance
              • Keep track of failures, reassign tasks
          – Hand-coded parallelization
          – Scheduling across heterogeneous nodes
          – Locality
              • Minimise movement of data for computation
          – How to distribute data?
      • Results in:
          – Complex, brittle, non-generic code
          – Reimplementation of common features like fault-tolerance and distribution
  • 5. Need for a generic abstraction for distributed processing
      App programmer → abstraction ← systems developer
      Separation of concerns
          – App programmer: express app logic
          – Systems developer: performance, fault handling etc.
      • Tradeoff between genericity and performance
          – More generic => usually less performance
      • MapReduce is probably a sweet spot where you have both to some extent
  • 6. MapReduce abstraction (app programmer's view)
      • Model input and output as <key,value> pairs
      • Provide map() and reduce() functions which act on <k,v> pairs
      • Input: set of <k,v> pairs: {k,v}
          – For each input <k,v>: map(k1,v1) → list(k2,v2)
          – For each unique output key from map: reduce(k2, combined list(v2)) → list(v3)
      The system takes care of distributing the tasks across thousands of machines, handling locality, fault-tolerance etc.
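
The map → group by key → reduce flow on this slide can be sketched in a single process. This is only an illustration of the programming model, not how Hadoop is implemented: `MiniMapReduce`, `run` and `wordCount` are hypothetical names, everything runs sequentially, and the intermediate keys are assumed to be `Comparable` so that a `TreeMap` can group them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical single-process sketch of the MapReduce programming model.
final class MiniMapReduce {

    // map(k1,v1) -> list(k2,v2); reduce(k2, list(v2)) -> list(v3)
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Map phase: call map() on every input pair, grouping its output by key.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : map.apply(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                       .add(out.getValue());
            }
        }
        // Reduce phase: one reduce() call per unique intermediate key.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Word count expressed against the sketch: <doc name, contents> -> <word, count>.
    static Map<String, List<Integer>> wordCount(List<Map.Entry<String, String>> docs) {
        BiFunction<String, String, List<Map.Entry<String, Integer>>> map =
            (name, contents) -> {
                List<Map.Entry<String, Integer>> out = new ArrayList<>();
                for (String w : contents.trim().split("\\s+")) {
                    if (!w.isEmpty()) out.add(Map.entry(w, 1));
                }
                return out;
            };
        BiFunction<String, List<Integer>, List<Integer>> reduce =
            (word, counts) -> {
                int sum = 0;
                for (int c : counts) sum += c;
                return List.of(sum);
            };
        return run(docs, map, reduce);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of(
            Map.entry("doc1", "a b a"),
            Map.entry("doc2", "b c"))));
    }
}
```

The application code is confined to the two lambdas; everything inside `run` is what a real platform would replace with distributed scheduling, data movement and fault handling.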
  • 7. Example: word count
      • Problem:
          – Count the number of occurrences of each unique word in a big collection of documents
      • Input <k,v> set:
          – <document name, document contents>
              • Organize the files in this format
      • Output:
          – <word, count>
              • Get it in output files
      • Next step:
          – Define the map() and reduce() functions
  • 8. Word count
      map(String key, String value):
          // key: document name
          // value: document contents
          for each word w in value:
              EmitIntermediate(w, "1");

      reduce(String key, List values):
          // key: a word
          // values: a list of counts
          int result = 0;
          for each v in values:
              result += ParseInt(v);
          Emit(AsString(result));
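  • 9. Program in Java
      // word and one are fields of the mapper class that the slide does not show
      // (in the standard Hadoop WordCount example: a reusable Text and an IntWritable(1)).
      public void map(LongWritable key, Text value, Context context) throws … {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          // Emit <word, 1> for every whitespace-separated token in the line.
          while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
          }
      }

      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
          // Sum all the counts emitted for this word.
          int sum = 0;
          for (IntWritable val : values) {
              sum += val.get();
          }
          context.write(key, new IntWritable(sum));
      }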
  • 10. Implementing the MapReduce abstraction
      App programmer → abstraction ← systems developer
      • Looked at the application programmer's view
      • Need a platform which implements the MapReduce abstraction
      • Hadoop is a popular open-source implementation of the MapReduce abstraction
      • Questions for the platform developer
          – How to
              • parallelize?
              • handle faults?
              • provide locality?
              • distribute the data?
  • 11. Basics of platform implementation
      • parallelize?
          – Each map can be executed independently in parallel
          – After all maps have finished execution, all reduces can be executed in parallel
      • handle faults?
          – map() and reduce() have no internal state
              • Simply re-execute in case of a failure
      • distribute the data?
          – Have a distributed file system (HDFS)
      • provide locality?
          – Prefer to execute map() on the nodes holding the input <k,v> pairs
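
The first two answers above (run independent tasks in parallel, re-execute on failure) can be sketched with a thread pool standing in for a cluster. This is a minimal illustration under assumptions: `RetryingRunner`, `runAll`, `withRetry` and `demo` are hypothetical names, the pool size of 4 is arbitrary, and the "fault" is simulated by a task that throws on its first attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: threads play the role of worker nodes.
final class RetryingRunner {

    // Re-execute a task until it succeeds or the attempts run out.
    // This is safe only because the task keeps no internal state.
    static <T> T withRetry(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // stateless task: simply try again
            }
        }
        throw last;
    }

    // Run all tasks in parallel and wait for every one of them. Waiting for
    // all results is the barrier between the map phase and the reduce phase.
    static <T> List<T> runAll(List<Callable<T>> tasks, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : tasks) {
                Callable<T> retrying = () -> withRetry(task, maxAttempts);
                futures.add(pool.submit(retrying));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures) {
                results.add(f.get()); // results come back in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(
                "task failed after " + maxAttempts + " attempts", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Two "map" tasks that count words; the second one fails on its first attempt.
    static List<Integer> demo() {
        AtomicInteger attempts = new AtomicInteger();
        List<Callable<Integer>> tasks = new ArrayList<>();
        tasks.add(() -> "a b a".split(" ").length);
        tasks.add(() -> {
            if (attempts.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated worker crash");
            }
            return "b c".split(" ").length;
        });
        return runAll(tasks, 3);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A real engine would re-run the failed task on a different machine rather than on the same thread, but the reasoning is the same: with no internal state, a second execution produces the same output as the first would have.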
  • 12. MapReduce implementation
      • Distributed File System (DFS) + MapReduce (MR) engine
          – Specifically, the MR engine uses a DFS
      • Distributed file system
          – Files are split into large chunks and stored in the distributed file system (e.g., HDFS)
          – Large chunks: typically 64MB per block
          – Can have a master-slave architecture
              • Master assigns and manages replicated blocks in the slaves
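
As a rough illustration of the chunking and replication described above, the sketch below computes how many 64MB blocks a file needs and assigns each block's replicas to nodes. The round-robin placement rule, the 5-node cluster and the replication factor of 3 are assumptions made for the example; real HDFS uses its own placement policy, and `BlockPlacement` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of what a DFS master decides for one file.
final class BlockPlacement {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB, as on the slide

    // Number of blocks needed for a file; the last block may be partly filled.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Assumed rule: replica r of block b is stored on node (b + r) % nodes,
    // so the replicas of one block always land on distinct nodes.
    static List<List<Integer>> place(long blocks, int nodes, int replicas) {
        if (replicas > nodes) {
            throw new IllegalArgumentException("more replicas than nodes");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((int) ((b + r) % nodes));
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        long blocks = blockCount(200L * 1024 * 1024); // a 200MB file
        System.out.println(blocks + " blocks: " + place(blocks, 5, 3));
    }
}
```

A 200MB file needs 4 blocks (three full ones and one of 8MB). Because every block lives on several nodes, the MR engine has a choice of machines on which a map task can read its input locally.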
  • 13. MapReduce engine
      • Has a master-slave architecture
          – Master coordinates the task execution across workers
          – Workers perform the map() and reduce() functions
              • Read and write blocks from/to the DFS
          – Master keeps track of worker failures and reassigns tasks if necessary
              • Failure detection is usually done through timeouts
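
Timeout-based failure detection can be sketched as a master that records the last heartbeat from each worker and treats a long silence as a failure. The class and method names (`HeartbeatMaster`, `heartbeat`, `assign`, `reapDeadWorkers`), the 1000 ms timeout and the explicit `nowMillis` parameter (used instead of a real clock to keep the example deterministic) are all assumptions for illustration, not Hadoop's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the master's bookkeeping.
final class HeartbeatMaster {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, List<String>> assigned = new HashMap<>();

    HeartbeatMaster(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in; also registers new workers.
    void heartbeat(String worker, long nowMillis) {
        lastHeartbeat.put(worker, nowMillis);
        assigned.computeIfAbsent(worker, w -> new ArrayList<>());
    }

    // Record that a task is running on a worker that has already reported in.
    void assign(String worker, String task) {
        if (!lastHeartbeat.containsKey(worker)) {
            throw new IllegalArgumentException("unknown worker: " + worker);
        }
        assigned.get(worker).add(task);
    }

    // Declare workers dead if they have been silent for longer than the
    // timeout, forget them, and return their tasks so they can be reassigned.
    List<String> reapDeadWorkers(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        List<String> orphaned = new ArrayList<>();
        for (String worker : dead) {
            lastHeartbeat.remove(worker);
            orphaned.addAll(assigned.remove(worker));
        }
        return orphaned;
    }

    public static void main(String[] args) {
        HeartbeatMaster master = new HeartbeatMaster(1000);
        master.heartbeat("w1", 0);
        master.heartbeat("w2", 0);
        master.assign("w1", "map-0");
        master.assign("w2", "map-1");
        master.heartbeat("w2", 900);          // w1 stays silent
        for (String task : master.reapDeadWorkers(1500)) {
            master.assign("w2", task);        // re-execute on a live worker
        }
        System.out.println("w2 now runs: " + master.assigned.get("w2"));
    }
}
```

A timeout cannot distinguish a crashed worker from a slow or unreachable one, so a task may occasionally run twice; that is acceptable here because map() and reduce() can be re-executed without side effects.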
  • 15. Some tips for designing MR jobs
      • Reduce network traffic between map and reduce
          – Model map() and reduce() jobs appropriately
          – Use combine() functions
              • combine(<k,[v]>) → <k,[v]>
              • combine() executes after all map()s finish in each block
          – map() → [same node] combine() → [network] reduce()
      • Make map jobs of roughly equal expected execution times
      • Try to make reduce() jobs less skewed
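
To see why a combiner cuts network traffic, the sketch below counts the records that would leave a node with and without combine() for the word-count job. `CombinerSketch` and its methods are hypothetical names, and the input string stands in for one block of a document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: what one node emits for one input block.
final class CombinerSketch {

    // map(): emit <word, 1> for every word in the block.
    static List<Map.Entry<String, Integer>> map(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : block.trim().split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
        }
        return out;
    }

    // combine(): pre-sum the counts per word on the node that ran map(),
    // so that only one record per distinct word crosses the network.
    static List<Map.Entry<String, Integer>> combine(
            List<Map.Entry<String, Integer>> mapped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = map("to be or not to be to be");
        List<Map.Entry<String, Integer>> combined = combine(mapped);
        System.out.println("records without combine(): " + mapped.size());
        System.out.println("records with combine():    " + combined.size());
        System.out.println(combined);
    }
}
```

Here 8 map output records shrink to 4 before the shuffle. Pre-summing is valid for word count because addition is associative and commutative, so reduce() gets the same totals whether it receives raw 1s or partial sums; a job whose reduce() lacks that property cannot reuse its logic as a combiner this way.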
  • 16. Pros and cons of MapReduce
      • Advantages
          – Simple, easy-to-use distributed processing system
          – Reasonably generic
          – Exploits locality for performance
          – Simple and less buggy implementation
      • Issues
          – Not a magic bullet which fits all problems
              • Difficult to model iterative and recursive computations
                  – E.g.: k-means clustering
                  – Generate-Map-Reduce
              • Difficult to model streaming computations
              • Centralized entities like the master become bottlenecks
              • Most real-world problems require large chains of MR jobs
  • 17. Summary
      • Today
          – Distributed processing issues, MR programming model
          – Sample MR job
          – How MR can be implemented
          – Pros and cons of MR, tips for better performance
      • Tomorrow
          – Details specific to Hadoop
          – Downloading and setting up Hadoop on a cluster
      Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
  • 18. Hadoop components
      • HDFS
          – Master: NameNode
          – Slave: DataNode
      • MapReduce engine
          – Master: JobTracker
          – Slave: TaskTracker