SlideShare a Scribd company logo
Hadoop Map/Reduce

    Owen O’Malley
      July 2006
Map/Reduce Goals
– Distribution
   • The data is available where needed.
   • Application does not care how many computers
     are being used.
– Reliability
   • Application does not care that computers or
     networks may have temporary or permanent
     failures.


                                                   2
Application Perspective
• Define Mapper and Reducer classes and a
  “launching” program.
• Mapper
  – Is given a stream of key1,value1 pairs
  – Generates a stream of key2, value2 pairs
• Reducer
  – Is given a key2 and a stream of value2’s
  – Generates a stream of key3, value3 pairs
• Launching Program
  – Creates a JobConf to define a job.
  – Submits JobConf to JobTracker and waits for
    completion.                                   3
Application Dataflow




                       4
Input & Output Formats
• The application also chooses input and output
  formats, which define how the persistent data
  is read and written. These are interfaces and
  can be defined by the application.
• InputFormat
  – Splits the input to determine the input to each map
    task.
  – Defines a RecordReader that reads key, value
    pairs that are passed to the map task
• OutputFormat
  – Given the key, value pairs and a filename, writes
    the reduce task output to persistent store.
                                                        5
Output Ordering
• The application can control the sort order and
  partitions of the output via
  OutputKeyComparator and Partitioner.
• OutputKeyComparator
   – Defines how to compare serialized keys.
   – Defaults to WritableComparable, but should be
     defined for any application defined key types.
      • key1.compareTo(key2)
• Partitioner
   – Given a map output key and the number of
     reduces, chooses a reduce.
   – Defaults to HashPartitioner
                                                      6
      • key.hashCode % numReduces
Combiners
• Combiners are an optimization for jobs with
  reducers that can merge multiple values into
  a single value.
• Typically, the combiner is the same as the
  reducer and runs on the map outputs before it
  is transferred to the reducer’s machine.
• For example, WordCount’s mapper generates
  (word, count) and the combiner and reducer
  generate the sum for each word.
  – Input: “hi Owen bye Owen”
  – Map output: (“hi”, 1), (“Owen”, 1), (“bye”,1), (“Owen”,1)
  – Combiner output: (“Owen”, 2), (“bye”, 1), (“hi”, 1)         7
Process Communication
• Use a custom RPC implementation
  –   Easy to change/extend
  –   Defined as Java interfaces
  –   Server objects implement the interface
  –   Client proxy objects automatically created
• All messages originate at the client
  – Prevents cycles and therefore deadlocks
• Errors
  – Include timeouts and communication problems.
  – Are signaled to client via IOException.
  – Are NEVER signaled to the server.
                                                   8
Map/Reduce Processes
• Launching Application
  – User application code
  – Submits a specific kind of Map/Reduce job
• JobTracker
  – Handles all jobs
  – Makes all scheduling decisions
• TaskTracker
  – Manager for all tasks on a given node
• Task
  – Runs an individual map or reduce fragment for a
    given job
  – Forks from the TaskTracker
                                                      9
Process Diagram




                  10
Job Control Flow
• Application launcher creates and submits job.
• JobTracker initializes job, creates FileSplits,
  and adds tasks to queue.
• TaskTrackers ask for a new map or reduce
  task every 10 seconds or when the previous
  task finishes.
• As tasks run, the TaskTracker reports status
  to the JobTracker every 10 seconds.
• When job completes, the JobTracker tells the
  TaskTrackers to delete temporary files.
• Application launcher notices job completion
  and stops waiting.                              11
Application Launcher
• Application code to create JobConf and set
  the parameters.
  – Mapper, Reducer classes
  – InputFormat and OutputFormat classes
  – Combiner class, if desired
• Writes JobConf and the application jar to DFS
  and submits job to JobTracker.
• Can exit immediately or wait for the job to
  complete or fail.

                                               12
JobTracker
• Takes JobConf and creates an instance of
  the InputFormat. Calls the getSplits method to
  generate map inputs.
• Creates a JobInProgress object and a bunch
  of TaskInProgress “TIP” and Task objects.
  – JobInProgress is the status of the job.
  – TaskInProgress is the status of a fragment of
    work.
  – Task is an attempt to do a TIP.
• As TaskTrackers request work, they are given
  Tasks to execute.                          13
TaskTracker
• All Tasks
  –   Create the TaskRunner
  –   Copy the job.jar and job.xml from DFS.
  –   Localize the JobConf for this Task.
  –   Call task.prepare() (details later)
  –   Launch the Task in a new JVM under
      TaskTracker.Child.
  –   Catch output from Task and log it at the info level.
  –   Take Task status updates and send to JobTracker
      every 10 seconds.
  –   If job is killed, kill the task.
  –   If task dies or completes, tell the JobTracker.    14
TaskTracker for Reduces
• For Reduces, the task.prepare() fetches all of
  the relevant map outputs for this reduce.
• Files are fetched using http from the map’s
  TaskTracker’s Jetty.
• Files are fetched in parallel threads, but only
  1 to each host.
• When fetches fail, a backoff scheme is used
  to keep from overloading TaskTrackers.
• Fetching accounts for the first 33% of the
  reduce’s progress.
                                                15
Map Tasks
• Use the InputFormat object to create a
  RecordReader from the FileSplit.
• Loop through the keys and values in the
  FileSplit and feed each to the mapper.
• For no combiner, a SequenceFile is written
  for the keys to each reduce.
• With a combiner, the frameworks buffers
  100,000 keys and values, sorts, combines,
  and writes them to SequenceFile’s for each
  reduce.
                                               16
Reduce Tasks: Sort
• Sort
  – 33% to 66% of reduce’s progress
  – Base
     • Read 100 (io.sort.mb) meg of keys and values into
       memory.
     • Sort the memory
     • Write to disk
  – Merge
     • Read 10 (io.sort.factor) files and do a merge into 1 file.
     • Repeat as many times as required (2 levels for 100 files,
       3 levels for 1000 files, etc.)

                                                                17
Reduce Tasks: Reduce
• Reduce
  – 66% to 100% of reduce’s progress
  – Use a SequenceFile.Reader to read sorted input
    and pass to reducer one key at a time along with
    the associated values.
  – Output keys and values are written to the
    OutputFormat object, which usually writes a file to
    DFS.
  – The output from the reduce is NOT resorted, so it
    is in the order and fragmentation of the map output
    keys.
                                                     18

More Related Content

Similar to Hadoop Map Reduce Arch

MapReduce
MapReduceMapReduce
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
Farzad Nozarian
 
Map reduce
Map reduceMap reduce
Map reduce
Somesh Maliye
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
Subhas Kumar Ghosh
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
Romain Jacotin
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
Subhas Kumar Ghosh
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
ShimoFcis
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
Ahmad El Tawil
 
Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop
Rajesh Ananda Kumar
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
Indhujeni
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
Subhas Kumar Ghosh
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 
MapReduce
MapReduceMapReduce
MapReduce
ahmedelmorsy89
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Sri Prasanna
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
Flink Forward
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
ucelebi
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
Cleverence Kombe
 

Similar to Hadoop Map Reduce Arch (20)

MapReduce
MapReduceMapReduce
MapReduce
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Map reduce
Map reduceMap reduce
Map reduce
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by stepHadoop deconstructing map reduce job step by step
Hadoop deconstructing map reduce job step by step
 
The google MapReduce
The google MapReduceThe google MapReduce
The google MapReduce
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
mapreduce.pptx
mapreduce.pptxmapreduce.pptx
mapreduce.pptx
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
 
Hadoop map reduce in operation
Hadoop map reduce in operationHadoop map reduce in operation
Hadoop map reduce in operation
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
MapReduce
MapReduceMapReduce
MapReduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 

More from nextlib

Nio
NioNio
Nio
nextlib
 
D Rb Silicon Valley Ruby Conference
D Rb   Silicon Valley Ruby ConferenceD Rb   Silicon Valley Ruby Conference
D Rb Silicon Valley Ruby Conference
nextlib
 
Multi-core architectures
Multi-core architecturesMulti-core architectures
Multi-core architectures
nextlib
 
Aldous Huxley Brave New World
Aldous Huxley Brave New WorldAldous Huxley Brave New World
Aldous Huxley Brave New World
nextlib
 
Social Graph
Social GraphSocial Graph
Social Graph
nextlib
 
Ajax Prediction
Ajax PredictionAjax Prediction
Ajax Prediction
nextlib
 
Closures for Java
Closures for JavaClosures for Java
Closures for Java
nextlib
 
A Content-Driven Reputation System for the Wikipedia
A Content-Driven Reputation System for the WikipediaA Content-Driven Reputation System for the Wikipedia
A Content-Driven Reputation System for the Wikipedia
nextlib
 
SVD review
SVD reviewSVD review
SVD review
nextlib
 
Mongrel Handlers
Mongrel HandlersMongrel Handlers
Mongrel Handlers
nextlib
 
Blue Ocean Strategy
Blue Ocean StrategyBlue Ocean Strategy
Blue Ocean Strategynextlib
 
日本7-ELEVEN消費心理學
日本7-ELEVEN消費心理學日本7-ELEVEN消費心理學
日本7-ELEVEN消費心理學nextlib
 
Comparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering SystemsComparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering Systems
nextlib
 
Item Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation AlgorithmsItem Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation Algorithms
nextlib
 
Agile Adoption2007
Agile Adoption2007Agile Adoption2007
Agile Adoption2007
nextlib
 
Modern Compiler Design
Modern Compiler DesignModern Compiler Design
Modern Compiler Design
nextlib
 
透过众神的眼睛--鸟瞰非洲
透过众神的眼睛--鸟瞰非洲透过众神的眼睛--鸟瞰非洲
透过众神的眼睛--鸟瞰非洲nextlib
 
Improving Quality of Search Results Clustering with Approximate Matrix Factor...
Improving Quality of Search Results Clustering with Approximate Matrix Factor...Improving Quality of Search Results Clustering with Approximate Matrix Factor...
Improving Quality of Search Results Clustering with Approximate Matrix Factor...
nextlib
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
nextlib
 
Bigtable
BigtableBigtable
Bigtable
nextlib
 

More from nextlib (20)

Nio
NioNio
Nio
 
D Rb Silicon Valley Ruby Conference
D Rb   Silicon Valley Ruby ConferenceD Rb   Silicon Valley Ruby Conference
D Rb Silicon Valley Ruby Conference
 
Multi-core architectures
Multi-core architecturesMulti-core architectures
Multi-core architectures
 
Aldous Huxley Brave New World
Aldous Huxley Brave New WorldAldous Huxley Brave New World
Aldous Huxley Brave New World
 
Social Graph
Social GraphSocial Graph
Social Graph
 
Ajax Prediction
Ajax PredictionAjax Prediction
Ajax Prediction
 
Closures for Java
Closures for JavaClosures for Java
Closures for Java
 
A Content-Driven Reputation System for the Wikipedia
A Content-Driven Reputation System for the WikipediaA Content-Driven Reputation System for the Wikipedia
A Content-Driven Reputation System for the Wikipedia
 
SVD review
SVD reviewSVD review
SVD review
 
Mongrel Handlers
Mongrel HandlersMongrel Handlers
Mongrel Handlers
 
Blue Ocean Strategy
Blue Ocean StrategyBlue Ocean Strategy
Blue Ocean Strategy
 
日本7-ELEVEN消費心理學
日本7-ELEVEN消費心理學日本7-ELEVEN消費心理學
日本7-ELEVEN消費心理學
 
Comparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering SystemsComparing State-of-the-Art Collaborative Filtering Systems
Comparing State-of-the-Art Collaborative Filtering Systems
 
Item Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation AlgorithmsItem Based Collaborative Filtering Recommendation Algorithms
Item Based Collaborative Filtering Recommendation Algorithms
 
Agile Adoption2007
Agile Adoption2007Agile Adoption2007
Agile Adoption2007
 
Modern Compiler Design
Modern Compiler DesignModern Compiler Design
Modern Compiler Design
 
透过众神的眼睛--鸟瞰非洲
透过众神的眼睛--鸟瞰非洲透过众神的眼睛--鸟瞰非洲
透过众神的眼睛--鸟瞰非洲
 
Improving Quality of Search Results Clustering with Approximate Matrix Factor...
Improving Quality of Search Results Clustering with Approximate Matrix Factor...Improving Quality of Search Results Clustering with Approximate Matrix Factor...
Improving Quality of Search Results Clustering with Approximate Matrix Factor...
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Bigtable
BigtableBigtable
Bigtable
 

Recently uploaded

Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
Federico Razzoli
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 

Recently uploaded (20)

Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Webinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data WarehouseWebinar: Designing a schema for a Data Warehouse
Webinar: Designing a schema for a Data Warehouse
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 

Hadoop Map Reduce Arch

  • 1. Hadoop Map/Reduce Owen O’Malley July 2006
  • 2. Map/Reduce Goals – Distribution • The data is available where needed. • Application does not care how many computers are being used. – Reliability • Application does not care that computers or networks may have temporary or permanent failures. 2
  • 3. Application Perspective • Define Mapper and Reducer classes and a “launching” program. • Mapper – Is given a stream of key1,value1 pairs – Generates a stream of key2, value2 pairs • Reducer – Is given a key2 and a stream of value2’s – Generates a stream of key3, value3 pairs • Launching Program – Creates a JobConf to define a job. – Submits JobConf to JobTracker and waits for completion. 3
  • 5. Input & Output Formats • The application also chooses input and output formats, which define how the persistent data is read and written. These are interfaces and can be defined by the application. • InputFormat – Splits the input to determine the input to each map task. – Defines a RecordReader that reads key, value pairs that are passed to the map task • OutputFormat – Given the key, value pairs and a filename, writes the reduce task output to persistent store. 5
  • 6. Output Ordering • The application can control the sort order and partitions of the output via OutputKeyComparator and Partitioner. • OutputKeyComparator – Defines how to compare serialized keys. – Defaults to WritableComparable, but should be defined for any application defined key types. • key1.compareTo(key2) • Partitioner – Given a map output key and the number of reduces, chooses a reduce. – Defaults to HashPartitioner 6 • key.hashCode % numReduces
  • 7. Combiners • Combiners are an optimization for jobs with reducers that can merge multiple values into a single value. • Typically, the combiner is the same as the reducer and runs on the map outputs before it is transferred to the reducer’s machine. • For example, WordCount’s mapper generates (word, count) and the combiner and reducer generate the sum for each word. – Input: “hi Owen bye Owen” – Map output: (“hi”, 1), (“Owen”, 1), (“bye”,1), (“Owen”,1) – Combiner output: (“Owen”, 2), (“bye”, 1), (“hi”, 1) 7
  • 8. Process Communication • Use a custom RPC implementation – Easy to change/extend – Defined as Java interfaces – Server objects implement the interface – Client proxy objects automatically created • All messages originate at the client – Prevents cycles and therefore deadlocks • Errors – Include timeouts and communication problems. – Are signaled to client via IOException. – Are NEVER signaled to the server. 8
  • 9. Map/Reduce Processes • Launching Application – User application code – Submits a specific kind of Map/Reduce job • JobTracker – Handles all jobs – Makes all scheduling decisions • TaskTracker – Manager for all tasks on a given node • Task – Runs an individual map or reduce fragment for a given job – Forks from the TaskTracker 9
  • 11. Job Control Flow • Application launcher creates and submits job. • JobTracker initializes job, creates FileSplits, and adds tasks to queue. • TaskTrackers ask for a new map or reduce task every 10 seconds or when the previous task finishes. • As tasks run, the TaskTracker reports status to the JobTracker every 10 seconds. • When job completes, the JobTracker tells the TaskTrackers to delete temporary files. • Application launcher notices job completion and stops waiting. 11
  • 12. Application Launcher • Application code to create JobConf and set the parameters. – Mapper, Reducer classes – InputFormat and OutputFormat classes – Combiner class, if desired • Writes JobConf and the application jar to DFS and submits job to JobTracker. • Can exit immediately or wait for the job to complete or fail. 12
  • 13. JobTracker • Takes JobConf and creates an instance of the InputFormat. Calls the getSplits method to generate map inputs. • Creates a JobInProgress object and a bunch of TaskInProgress “TIP” and Task objects. – JobInProgress is the status of the job. – TaskInProgress is the status of a fragment of work. – Task is an attempt to do a TIP. • As TaskTrackers request work, they are given Tasks to execute. 13
  • 14. TaskTracker • All Tasks – Create the TaskRunner – Copy the job.jar and job.xml from DFS. – Localize the JobConf for this Task. – Call task.prepare() (details later) – Launch the Task in a new JVM under TaskTracker.Child. – Catch output from Task and log it at the info level. – Take Task status updates and send to JobTracker every 10 seconds. – If job is killed, kill the task. – If task dies or completes, tell the JobTracker. 14
  • 15. TaskTracker for Reduces • For Reduces, the task.prepare() fetches all of the relevant map outputs for this reduce. • Files are fetched using http from the map’s TaskTracker’s Jetty. • Files are fetched in parallel threads, but only 1 to each host. • When fetches fail, a backoff scheme is used to keep from overloading TaskTrackers. • Fetching accounts for the first 33% of the reduce’s progress. 15
  • 16. Map Tasks • Use the InputFormat object to create a RecordReader from the FileSplit. • Loop through the keys and values in the FileSplit and feed each to the mapper. • For no combiner, a SequenceFile is written for the keys to each reduce. • With a combiner, the frameworks buffers 100,000 keys and values, sorts, combines, and writes them to SequenceFile’s for each reduce. 16
  • 17. Reduce Tasks: Sort • Sort – 33% to 66% of reduce’s progress – Base • Read 100 (io.sort.mb) meg of keys and values into memory. • Sort the memory • Write to disk – Merge • Read 10 (io.sort.factor) files and do a merge into 1 file. • Repeat as many times as required (2 levels for 100 files, 3 levels for 1000 files, etc.) 17
  • 18. Reduce Tasks: Reduce • Reduce – 66% to 100% of reduce’s progress – Use a SequenceFile.Reader to read sorted input and pass to reducer one key at a time along with the associated values. – Output keys and values are written to the OutputFormat object, which usually writes a file to DFS. – The output from the reduce is NOT resorted, so it is in the order and fragmentation of the map output keys. 18