Hadoop Map/Reduce

    Owen O’Malley
      July 2006
Map/Reduce Goals
– Distribution
   • The data is available where needed.
   • Application does not care how many computers
     are being used.
– Reliability
   • Application does not care that computers or
     networks may have temporary or permanent
     failures.


Application Perspective
• Define Mapper and Reducer classes and a
  “launching” program.
• Mapper
  – Is given a stream of key1,value1 pairs
  – Generates a stream of key2, value2 pairs
• Reducer
  – Is given a key2 and a stream of value2’s
  – Generates a stream of key3, value3 pairs
• Launching Program
  – Creates a JobConf to define a job.
  – Submits JobConf to JobTracker and waits for
    completion.
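
The mapper and reducer roles above can be sketched in plain Java without a cluster. This is a minimal single-process illustration, not Hadoop's API: the names SketchMapper, SketchReducer, and LocalMapReduceSketch are hypothetical, and the sample job (counting words by first letter) is chosen only to show the key1/value1, key2/value2, key3/value3 flow.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical stand-ins for the Mapper and Reducer roles; not Hadoop's interfaces.
interface SketchMapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, BiConsumer<K2, V2> output);
}

interface SketchReducer<K2, V2, K3, V3> {
    void reduce(K2 key, Iterable<V2> values, BiConsumer<K3, V3> output);
}

class LocalMapReduceSketch {
    // Runs map, groups values by key2 in sorted key order, then runs reduce.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            SketchMapper<K1, V1, K2, V2> mapper,
            SketchReducer<K2, V2, K3, V3> reducer) {
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            mapper.map(record.getKey(), record.getValue(),
                    (k, v) -> grouped.computeIfAbsent(k, unused -> new ArrayList<>()).add(v));
        }
        List<Map.Entry<K3, V3>> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            reducer.reduce(group.getKey(), group.getValue(),
                    (k, v) -> result.add(new AbstractMap.SimpleImmutableEntry<>(k, v)));
        }
        return result;
    }

    // Illustrative job: key1 = line number, value1 = line text,
    // key2 = first letter, value2 = word, key3 = first letter, value3 = word count.
    static List<Map.Entry<Character, Integer>> countWordsByInitial(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(new AbstractMap.SimpleImmutableEntry<>(i, lines.get(i)));
        }
        SketchMapper<Integer, String, Character, String> mapper = (lineNo, line, out) -> {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.accept(word.charAt(0), word);
                }
            }
        };
        SketchReducer<Character, String, Character, Integer> reducer = (initial, words, out) -> {
            int count = 0;
            for (String ignored : words) {
                count++;
            }
            out.accept(initial, count);
        };
        return run(input, mapper, reducer);
    }
}
```

In a real job the grouping step is done by the framework across machines; the application only supplies the two functions and the launching program.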
Application Dataflow (figure not reproduced)
Input & Output Formats
• The application also chooses input and output
  formats, which define how the persistent data
  is read and written. These are interfaces and
  can be defined by the application.
• InputFormat
  – Splits the input to determine the input to each map
    task.
  – Defines a RecordReader that reads key, value
    pairs that are passed to the map task
• OutputFormat
  – Given the key, value pairs and a filename, writes
    the reduce task output to persistent store.
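
One way to picture the splitting step is as cutting a file into byte ranges, one per map task. The sketch below assumes a fixed split size and a single file; the class InputSplitSketch and its Range type are hypothetical, and the sketch ignores records that straddle a boundary, which a real RecordReader has to handle.

```java
import java.util.ArrayList;
import java.util.List;

class InputSplitSketch {
    // One split: the byte range [start, start + length) of a single input file.
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return start + "+" + length;
        }
    }

    // Cuts a file of fileLength bytes into ranges of at most splitSize bytes.
    static List<Range> split(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            ranges.add(new Range(start, Math.min(splitSize, fileLength - start)));
        }
        return ranges;
    }
}
```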
Output Ordering
• The application can control the sort order and
  partitions of the output via
  OutputKeyComparator and Partitioner.
• OutputKeyComparator
   – Defines how to compare serialized keys.
   – Defaults to WritableComparable, but should be
     defined for any application defined key types.
      • key1.compareTo(key2)
• Partitioner
   – Given a map output key and the number of
     reduces, chooses a reduce.
   – Defaults to HashPartitioner
      • key.hashCode() % numReduces
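
A hash partitioner along these lines can be sketched in a few lines of Java. The class name HashPartitionSketch is hypothetical. One detail worth noting: Java's % operator returns a negative result for a negative left operand, so this sketch clears the sign bit before taking the remainder; that guard is an addition to the formula on the slide.

```java
class HashPartitionSketch {
    // Maps a key to a reduce index in the range [0, numReduces).
    static int partition(Object key, int numReduces) {
        // Clear the sign bit so a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }
}
```

Because the result depends only on the key's hash, every occurrence of a key is sent to the same reduce, whichever map task produced it.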
Combiners
• Combiners are an optimization for jobs with
  reducers that can merge multiple values into
  a single value.
• Typically, the combiner is the same as the
  reducer and runs on the map outputs before they
  are transferred to the reducer’s machine.
• For example, WordCount’s mapper generates
  (word, count) and the combiner and reducer
  generate the sum for each word.
  – Input: “hi Owen bye Owen”
  – Map output: (“hi”, 1), (“Owen”, 1), (“bye”,1), (“Owen”,1)
  – Combiner output: (“Owen”, 2), (“bye”, 1), (“hi”, 1)
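
The WordCount example above can be reproduced with a small in-memory sketch. The class WordCountCombinerSketch is hypothetical and uses a TreeMap in place of the framework's sort; String's natural ordering puts uppercase letters before lowercase ones, which is why “Owen” comes first.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class WordCountCombinerSketch {
    // Map step: emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleImmutableEntry<>(word, 1));
            }
        }
        return out;
    }

    // Combine step: sum the counts per word, keeping keys in sorted order.
    // The same function can serve as the reducer because addition is
    // associative and commutative.
    static SortedMap<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }
}
```

Here the combiner shrinks four map output records to three before anything crosses the network; on real text with many repeated words the saving is much larger.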
Process Communication
• Use a custom RPC implementation
  –   Easy to change/extend
  –   Defined as Java interfaces
  –   Server objects implement the interface
  –   Client proxy objects automatically created
• All messages originate at the client
  – Prevents cycles and therefore deadlocks
• Errors
  – Include timeouts and communication problems.
  – Are signaled to client via IOException.
  – Are NEVER signaled to the server.
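
The idea of protocols defined as Java interfaces with automatically created client proxies can be illustrated with java.lang.reflect.Proxy. This is only a sketch under stated assumptions: StatusProtocol and its getStatus method are hypothetical, the "server" is a local object rather than a remote process, and the networkUp flag stands in for a real transport failure. It is not Hadoop's RPC code.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;

// Hypothetical protocol interface; the method is illustrative only.
interface StatusProtocol {
    String getStatus(String jobId) throws IOException;
}

class RpcProxySketch {
    // Builds a client-side proxy for the protocol. A real transport would
    // serialize the call and send it over the network; here the call is
    // forwarded to a local server object.
    static StatusProtocol clientFor(StatusProtocol server, boolean networkUp) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (!networkUp) {
                // Timeouts and communication problems surface on the client only.
                throw new IOException("call to " + method.getName() + " timed out");
            }
            try {
                return method.invoke(server, args);
            } catch (InvocationTargetException e) {
                throw new IOException("remote call failed", e.getCause());
            }
        };
        return (StatusProtocol) Proxy.newProxyInstance(
                StatusProtocol.class.getClassLoader(),
                new Class<?>[] {StatusProtocol.class},
                handler);
    }
}
```

The server object never sees the failure: the handler raises the IOException on the calling side, which matches the rule that errors are signaled to the client and never to the server.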
Map/Reduce Processes
• Launching Application
  – User application code
  – Submits a specific kind of Map/Reduce job
• JobTracker
  – Handles all jobs
  – Makes all scheduling decisions
• TaskTracker
  – Manager for all tasks on a given node
• Task
  – Runs an individual map or reduce fragment for a
    given job
  – Forks from the TaskTracker
Process Diagram (figure not reproduced)
Job Control Flow
• Application launcher creates and submits job.
• JobTracker initializes job, creates FileSplits,
  and adds tasks to queue.
• TaskTrackers ask for a new map or reduce
  task every 10 seconds or when the previous
  task finishes.
• As tasks run, the TaskTracker reports status
  to the JobTracker every 10 seconds.
• When job completes, the JobTracker tells the
  TaskTrackers to delete temporary files.
• Application launcher notices job completion
  and stops waiting.
Application Launcher
• Application code to create JobConf and set
  the parameters.
  – Mapper, Reducer classes
  – InputFormat and OutputFormat classes
  – Combiner class, if desired
• Writes JobConf and the application jar to DFS
  and submits job to JobTracker.
• Can exit immediately or wait for the job to
  complete or fail.

JobTracker
• Takes JobConf and creates an instance of
  the InputFormat. Calls the getSplits method to
  generate map inputs.
• Creates a JobInProgress object and a bunch
  of TaskInProgress “TIP” and Task objects.
  – JobInProgress is the status of the job.
  – TaskInProgress is the status of a fragment of
    work.
  – Task is an attempt to do a TIP.
• As TaskTrackers request work, they are given
  Tasks to execute.
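
The relationship between a job, its fragments of work, and the attempts at each fragment can be modeled with two small classes. The names JobSketch and TipSketch and all of their methods are hypothetical; the sketch only captures the bookkeeping idea that a fragment is done once any one of its attempts succeeds, and is not taken from the JobTracker source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one fragment of work (a TIP) and its attempts (Tasks).
class TipSketch {
    enum AttemptState { RUNNING, FAILED, SUCCEEDED }

    private final List<AttemptState> attempts = new ArrayList<>();

    // Hands out a new attempt, as when a TaskTracker asks for work.
    int startAttempt() {
        attempts.add(AttemptState.RUNNING);
        return attempts.size() - 1;
    }

    // Records the outcome of a previously started attempt.
    void finishAttempt(int attemptId, boolean succeeded) {
        attempts.set(attemptId, succeeded ? AttemptState.SUCCEEDED : AttemptState.FAILED);
    }

    // The fragment is done once any attempt has succeeded.
    boolean isComplete() {
        return attempts.contains(AttemptState.SUCCEEDED);
    }

    int attemptCount() {
        return attempts.size();
    }
}

// Hypothetical model of the job-level status: complete when every TIP is.
class JobSketch {
    private final List<TipSketch> tips = new ArrayList<>();

    JobSketch(int numTips) {
        for (int i = 0; i < numTips; i++) {
            tips.add(new TipSketch());
        }
    }

    TipSketch tip(int index) {
        return tips.get(index);
    }

    boolean isComplete() {
        for (TipSketch tip : tips) {
            if (!tip.isComplete()) {
                return false;
            }
        }
        return true;
    }
}
```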
TaskTracker
• All Tasks
  –   Create the TaskRunner
  –   Copy the job.jar and job.xml from DFS.
  –   Localize the JobConf for this Task.
  –   Call task.prepare() (details later)
  –   Launch the Task in a new JVM under
      TaskTracker.Child.
  –   Catch output from Task and log it at the info level.
  –   Take Task status updates and send to JobTracker
      every 10 seconds.
  –   If job is killed, kill the task.
  –   If task dies or completes, tell the JobTracker.
TaskTracker for Reduces
• For Reduces, the task.prepare() fetches all of
  the relevant map outputs for this reduce.
• Files are fetched over HTTP from the Jetty server
  on the map’s TaskTracker.
• Files are fetched in parallel threads, but only
  1 to each host.
• When fetches fail, a backoff scheme is used
  to keep from overloading TaskTrackers.
• Fetching accounts for the first 33% of the
  reduce’s progress.
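
The slide does not say which backoff scheme is used. As one possibility, the sketch below assumes capped exponential backoff: the wait doubles after each failed fetch until it reaches a ceiling. The class FetchBackoffSketch and the base and maximum delays passed to it are hypothetical.

```java
class FetchBackoffSketch {
    // Returns the wait before the next fetch attempt, given how many
    // consecutive fetches from a host have failed so far.
    static long delayMillis(int failures, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        // Double once per failure, stopping early at the cap to avoid overflow.
        for (int i = 0; i < failures && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }
}
```

With a base of 1000 ms and a cap of 60000 ms, three consecutive failures give a wait of 8000 ms, and any long run of failures settles at the cap.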
Map Tasks
• Use the InputFormat object to create a
  RecordReader from the FileSplit.
• Loop through the keys and values in the
  FileSplit and feed each to the mapper.
• Without a combiner, a SequenceFile is written
  for the keys to each reduce.
• With a combiner, the framework buffers
  100,000 keys and values, sorts, combines,
  and writes them to SequenceFiles for each
  reduce.
Reduce Tasks: Sort
• Sort
  – 33% to 66% of reduce’s progress
  – Base
     • Read 100 MB (io.sort.mb) of keys and values into
       memory.
     • Sort the buffer in memory
     • Write to disk
  – Merge
     • Read 10 (io.sort.factor) files and do a merge into 1 file.
     • Repeat as many times as required (2 levels for 100 files,
       3 levels for 1000 files, etc.)
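
The number of merge levels follows from repeatedly dividing the file count by the merge factor, rounding up, until one file remains. A minimal sketch of that arithmetic, with the hypothetical class name MergeLevelsSketch:

```java
class MergeLevelsSketch {
    // Counts how many merge passes are needed to reduce `files` sorted
    // files to one when each pass merges up to `factor` files into one.
    static int mergeLevels(int files, int factor) {
        if (files < 1 || factor < 2) {
            throw new IllegalArgumentException("need files >= 1 and factor >= 2");
        }
        int levels = 0;
        while (files > 1) {
            files = (files + factor - 1) / factor; // ceiling division
            levels++;
        }
        return levels;
    }
}
```

With a factor of 10 this gives 2 levels for 100 files and 3 levels for 1000 files, matching the figures above; 101 files already need 3 levels.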

Reduce Tasks: Reduce
• Reduce
  – 66% to 100% of reduce’s progress
  – Use a SequenceFile.Reader to read sorted input
    and pass to reducer one key at a time along with
    the associated values.
  – Output keys and values are written to the
    OutputFormat object, which usually writes a file to
    DFS.
  – The output from the reduce is NOT re-sorted, so it
    is in the order and fragmentation of the map output
    keys.
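
Passing sorted input to the reducer one key at a time amounts to a single scan that cuts the input into runs of equal keys. The sketch below uses an in-memory list where the real task reads a sorted file; the class SortedGroupingSketch is hypothetical, and the reducer is simplified to a function that returns one result per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class SortedGroupingSketch {
    // Walks key-sorted pairs once, handing each run of equal keys to the
    // reducer. Results are emitted in input key order and never re-sorted.
    static <K, V, R> List<R> reduceSorted(List<Map.Entry<K, V>> sorted,
                                          BiFunction<K, List<V>, R> reducer) {
        List<R> output = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            K key = sorted.get(i).getKey();
            List<V> values = new ArrayList<>();
            // Collect every value that shares the current key.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                values.add(sorted.get(i).getValue());
                i++;
            }
            output.add(reducer.apply(key, values));
        }
        return output;
    }
}
```

The scan only works because the input is already sorted: equal keys are adjacent, so no lookup table of keys is needed.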
