Introduction to MapReduce
Bhupesh Chawda
bhupesh@apache.org
DataTorrent
Why Hadoop?
● Data growth is mind-boggling. Forecast for 2020: 40 trillion GB
● Cost effective
● Scalable
● Fast
● Open source
Source: https://rapidminer.com/rapidminer-acquires-radoop/
Image: http://seikun.kambashi.com/images/blog/interning_at_placeiq/2.jpg
What is MapReduce?
● It is a powerful paradigm for parallel computation
● Hadoop uses MapReduce to execute jobs on files in HDFS
● Hadoop distributes the computation across the nodes of the cluster
● Take the computation to the data, rather than moving the data to the computation
Analogy: Counting Fans
● Given a cricket stadium, count the number of fans for each player / team
● Traditional way
● Smart way
● Smarter way?
Origin: Functional Programming
● Map - Returns a list constructed by applying a function (the first argument) to all
items in a list passed as the second argument
○ map f [a, b, c] = [f(a), f(b), f(c)]
○ map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1,4,9]
● Reduce - Returns a single value constructed by combining all items of a list (the
second argument) with a function (the first argument). In MapReduce, the reduce step
can also be the identity (do nothing).
○ reduce f [a, b, c] = f(a, f(b, c))
○ reduce sum [1, 4, 9] = sum(1, sum(4, 9)) = 14
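The two functional primitives above can be tried directly in Python. This is a minimal sketch using the built-in `map` and `functools.reduce`; the helper name `sq` is taken from the slide, everything else is illustrative. Note that Python's `reduce` folds from the left, which gives the same result for addition.

```python
from functools import reduce


def sq(x):
    """Square a single number."""
    return x * x


# map applies sq to every item of the list.
squares = list(map(sq, [1, 2, 3]))

# reduce combines the items pairwise into one value, starting from 0.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares, total)
```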
Sum of squares example
Sum of squares of even and odd numbers
Programming model - Key Value Pairs
● Format of input and output: (key, value)
● Map: (k1, v1) → list(k2, v2)
● Reduce: (k2, list(v2)) → list(k3, v3)
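The signatures above can be made concrete with a small in-memory sketch. It is not Hadoop code: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and the grouping step stands in for what the framework does between map and reduce. The job computes the sum of squares of even and odd numbers.

```python
from collections import defaultdict


def map_fn(key, value):
    """(k1, v1) -> list of (k2, v2): tag each square with its parity."""
    parity = "even" if value % 2 == 0 else "odd"
    return [(parity, value * value)]


def reduce_fn(key, values):
    """(k2, list of v2) -> list of (k3, v3): sum the squares for one key."""
    return [(key, sum(values))]


def run_job(records):
    # Group all intermediate values by key (the framework's job in Hadoop).
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Call reduce once per key, in sorted key order.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


# Input keys are just positions here; the map function ignores them.
records = list(enumerate([1, 2, 3, 4, 5]))
result = run_job(records)
print(result)
```

Map returns a list because one input record may produce several pairs. A number that is both odd and prime, for example, could be emitted once under each key.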
Sum of squares of odd, even and prime
MapReduce overview
MapReduce with combiner
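A combiner pre-aggregates the output of each map task before it is sent to the reducers. The sketch below is an in-memory illustration under assumptions: `run`, `combine` and the two input splits are hypothetical, and the job is again the sum of squares by parity. Using the reduce logic as a combiner is safe here because addition is associative and commutative.

```python
def map_fn(n):
    """Emit (parity, square) for one number."""
    return [("even" if n % 2 == 0 else "odd", n * n)]


def combine(pairs):
    """Sum values per key within a single map task's output."""
    local = {}
    for key, value in pairs:
        local[key] = local.get(key, 0) + value
    return sorted(local.items())


def reduce_fn(key, values):
    return key, sum(values)


def run(splits, use_combiner):
    shuffled = []
    for split in splits:  # one map task per split
        pairs = [pair for n in split for pair in map_fn(n)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    groups = {}
    for key, value in shuffled:
        groups.setdefault(key, []).append(value)
    result = dict(reduce_fn(k, vs) for k, vs in groups.items())
    return result, len(shuffled)


splits = [[1, 2, 3], [4, 5, 6]]
plain, pairs_plain = run(splits, use_combiner=False)
combined, pairs_combined = run(splits, use_combiner=True)
print(plain, pairs_plain, pairs_combined)
```

Both runs give the same totals, but the combiner run passes fewer pairs between map and reduce (4 instead of 6 in this toy input).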
The Big Picture
Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914
The Bigger Picture
Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914
MapReduce Code Example - Word Count
Image Source: http://arnon.me/2014/06/mapreduce/
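Word count can be sketched in a few lines of plain Python. This is not the Hadoop Java API used by the tutorial cited on the following slides; `mapper`, `reducer` and `driver` are hypothetical names that mirror the three roles, and the grouping inside `driver` stands in for the shuffle performed by the framework.

```python
from collections import defaultdict


def mapper(offset, line):
    """Emit (word, 1) for every word in one line of input."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Sum all the counts collected for one word."""
    yield word, sum(counts)


def driver(lines):
    """Wire mapper and reducer together and run them in memory."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, one in mapper(offset, line):
            groups[word].append(one)
    return dict(
        pair for word in sorted(groups) for pair in reducer(word, groups[word])
    )


counts = driver(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
print(counts)
```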
MapReduce - The Mapper
Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MapReduce - The Reducer
Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MapReduce - The Driver
Image Source: https://memegenerator.net/instance/56997204
Hadoop Distributions
Who is using Hadoop?
References
● https://hadoop.apache.org/
● www.slideshare.net/SandeepDeshmukh5/hadoopintroduction-46841859
● Hadoop - The Definitive Guide - 4th Edition
● Images shamelessly stolen from the internet - Have credited though!
Acknowledgements
● Sandeep Deshmukh, DataTorrent - For some of the slides
Thank You!!
Please send your questions to:
bhupesh@apache.org
Extra Slides
Anatomy of a MapReduce run
● In classic MapReduce (MRv1) context
○ The client, which submits the job
○ The JobTracker, which coordinates the run
○ The TaskTrackers, which run the map and reduce tasks
○ HDFS
● In YARN context - Will see later
○ The client, which submits the job
○ The YARN resource manager
○ The YARN node managers
○ The MapReduce application master
○ HDFS
MapReduce in YARN - Will see later
The Map Side - Details
● Each map task writes its output to a circular in-memory buffer
● Once the buffer fills to a threshold, the task starts to spill its contents to local disk
● Before the data is written to disk, it is partitioned according to the reducers it
will be sent to
● Within each partition, the data is sorted by key. If a combiner is configured, it is
run on the sorted output
● Multiple spill files may exist by the time the map task finishes. They are merged
into a single partitioned, sorted output file
● The partitions of the output file are made available to the reducers over HTTP
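The map-side steps above can be simulated in memory. This is a rough sketch under assumptions, not Hadoop's implementation: the threshold is counted in records rather than bytes, spills are Python lists rather than files, and `partition`, `spill` and `map_side` are hypothetical names. `zlib.crc32` is used as the partitioner only because it is deterministic across runs.

```python
import heapq
import zlib
from itertools import groupby

NUM_REDUCERS = 2
SPILL_THRESHOLD = 4  # assumption: counted in records, to keep the demo small


def partition(key):
    """Choose the reducer that will receive this key."""
    return zlib.crc32(key.encode()) % NUM_REDUCERS


def combine(sorted_records):
    """Sum values per (partition, key); input must already be sorted."""
    out = []
    for (part, key), group in groupby(sorted_records, key=lambda r: (r[0], r[1])):
        out.append((part, key, sum(value for _, _, value in group)))
    return out


def spill(buffer):
    """Sort the buffer by partition and key, then run the combiner."""
    return combine(sorted(buffer))


def map_side(pairs):
    buffer, spills = [], []
    for key, value in pairs:
        buffer.append((partition(key), key, value))
        if len(buffer) >= SPILL_THRESHOLD:
            spills.append(spill(buffer))
            buffer = []
    if buffer:
        spills.append(spill(buffer))
    # Merge the sorted spills into one partitioned, sorted output.
    merged = combine(heapq.merge(*spills))
    return merged, len(spills)


pairs = [(word, 1) for word in "a b a c b a d a b".split()]
merged, num_spills = map_side(pairs)
totals = {key: value for _, key, value in merged}
print(merged, num_spills)
```

Each reducer would then fetch only the records whose first field matches its own partition number.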
The Reduce Side - Details
● The map outputs sit on the local disks of the machines that ran the map tasks.
Reduce tasks need this data before they can proceed
● Each reduce task needs the map output for its particular partition from several map
tasks across the cluster
● The reduce task starts copying map outputs as soon as each map task completes.
This is the copy phase. Map outputs are fetched in parallel by multiple threads
● Map outputs are copied into the reduce task JVM's memory if they are small enough;
otherwise they are copied to disk. As copies accumulate, they are merged into larger
sorted files. Once all map outputs have been copied, they are merged in a way that
maintains their sort order
● The reduce function is invoked once for each key in the sorted output, and its
output is written directly to HDFS
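The merge and reduce steps on the reduce side can be sketched as follows. This is an in-memory illustration under assumptions: each map output is a key-sorted Python list for this reducer's partition, the copy phase is skipped, and `reduce_side` and `reduce_fn` are hypothetical names. Because every input is already sorted, a streaming merge keeps the overall order without re-sorting.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def reduce_fn(key, values):
    """Example reduce function: sum the values for one key."""
    return key, sum(values)


def reduce_side(map_outputs):
    """Merge key-sorted map outputs and call reduce_fn once per key."""
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    return [
        reduce_fn(key, (value for _, value in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# One sorted list per map task, all for the same partition.
map_outputs = [
    [("a", 3), ("b", 1)],
    [("a", 1), ("c", 2)],
    [("b", 2)],
]
reduced = reduce_side(map_outputs)
print(reduced)
```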
MapReduce as unix commands
Problem:
● Input
○ 1 TB file containing color names - Red, Blue, Green, Yellow, Purple, Maroon
● Output
○ Number of occurrences of colors Blue and Green
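On a single machine, one plausible pipeline for this problem is `grep -E '^(Blue|Green)$' colors.txt | sort | uniq -c`, where `colors.txt` is a hypothetical file name. The three stages line up with map (filter), shuffle (sort) and reduce (count). The Python sketch below mirrors the same three stages on a tiny in-memory sample; the function names are illustrative only.

```python
from itertools import groupby


def grep(lines, wanted):
    """Map stage: keep only the colors we care about."""
    return (line for line in lines if line in wanted)


def shuffle(lines):
    """Shuffle stage: sort so that equal keys sit next to each other."""
    return sorted(lines)


def uniq_c(sorted_lines):
    """Reduce stage: count each run of identical lines."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(sorted_lines)]


lines = ["Red", "Blue", "Green", "Blue", "Yellow", "Green", "Blue", "Maroon"]
color_counts = uniq_c(shuffle(grep(lines, {"Blue", "Green"})))
print(color_counts)
```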
