Introduction to MapReduce
Bhupesh Chawda
bhupesh@apache.org
DataTorrent
Why Hadoop?
Data growth is mind-boggling. Forecast for 2020: 40 trillion GB
Cost effective
Scalable
Fast
Open source
Source: https://rapidminer.com/rapidminer-acquires-radoop/
Image: http://seikun.kambashi.com/images/blog/interning_at_placeiq/2.jpg
What is MapReduce?
It is a powerful paradigm for parallel computation
Hadoop uses MapReduce to execute jobs on files in HDFS
Hadoop intelligently distributes the computation over the cluster
Take computation to data
Analogy: Counting Fans
Given a cricket stadium, count the number of fans for each player / team
Traditional way
Smart way
Smarter way?
Origin: Functional Programming
Map - Returns a list constructed by applying a function (the first argument) to all
items in a list passed as the second argument
map f [a, b, c] = [f(a), f(b), f(c)]
map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1,4,9]
Reduce - Returns a single value constructed by combining, with a function (the first
argument), the items of the list passed as the second argument. In a MapReduce job, the
reduce function can be the identity (pass values through unchanged).
reduce f [a, b, c] = f(a, f(b, c))
reduce sum [1, 4, 9] = sum(1, sum(4, 9)) = 14
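The two definitions above correspond closely to Python's built-in map and functools.reduce. A minimal sketch; the helper names sq and add are illustrative only:

```python
from functools import reduce

def sq(x):
    """Square a single number."""
    return x * x

def add(acc, x):
    """Combine the running total with the next item."""
    return acc + x

# map applies sq to every item and yields a new list of the same length
squares = list(map(sq, [1, 2, 3]))

# reduce folds the list into a single value, starting from 0
total = reduce(add, squares, 0)

print(squares, total)
```

Note that Python's reduce folds from the left, so the grouping differs from the right-nested form on the slide; for an associative operation such as addition the result is the same.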
Sum of squares example
Sum of squares of even and odd numbers
Programming model - Key Value Pairs
Format of input and output
(key, value)
Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(k3, v3)
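The key-value signatures above can be tried out in memory, using the sum of squares of even and odd numbers as the job. This is a sketch only: run_job, map_fn and reduce_fn are hypothetical names, and the grouping step stands in for what Hadoop does between the map and reduce phases.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: (k1, v1) -> list of (k2, v2). k1 (the record position) is ignored."""
    parity = "even" if value % 2 == 0 else "odd"
    return [(parity, value * value)]

def reduce_fn(key, values):
    """Reduce: (k2, list of v2) -> list of (k3, v3)."""
    return [(key, sum(values))]

def run_job(records):
    # Map phase: every input record produces zero or more intermediate pairs
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle: collect all values that share the same intermediate key
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: one call per key, keys visited in sorted order
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

result = run_job(list(enumerate([1, 2, 3, 4, 5, 6])))
print(result)
```

Adding a third category such as "prime" only requires the map function to emit an extra pair for qualifying values; the reduce function stays unchanged.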
Sum of squares of odd, even and prime
MapReduce overview
MapReduce with combiner
The Big Picture
Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914
The Bigger Picture
Image Source: http://blog.csdn.net/bingduanlbd/article/details/51933914
MapReduce Code Example - Word Count
Image Source: http://arnon.me/2014/06/mapreduce/
MapReduce - The Mapper
Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MapReduce - The Reducer
Source: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
MapReduce - The Driver
Image Source: https://memegenerator.net/instance/56997204
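The code shown on the Mapper, Reducer and Driver slides is not reproduced in this text. As a rough stand-in, the three roles of word count can be sketched in plain Python. This is not the Hadoop Java API; mapper, reducer and driver are illustrative names, and the driver here does the grouping that the framework would normally handle.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every whitespace-separated token in the line."""
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    """Sum all counts received for one word."""
    return (word, sum(counts))

def driver(lines):
    """Wire the phases together: map, group by key, reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    return dict(reducer(word, counts) for word, counts in grouped.items())

counts = driver(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
print(counts)
```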
Hadoop Distributions
Who is using Hadoop?
References
https://hadoop.apache.org/
www.slideshare.net/SandeepDeshmukh5/hadoopintroduction-46841859
Hadoop - The Definitive Guide - 4th Edition
Images shamelessly stolen from the internet - Have credited though!
Acknowledgements
Sandeep Deshmukh, DataTorrent - For some of the slides
Thank You!!
Please send your questions to:
bhupesh@apache.org / bhupesh@datatorrent.com
Extra Slides
Anatomy of a MapReduce run
In MapReduce context
The client which submits the job
JobTracker which coordinates the run
TaskTrackers which run the map and reduce tasks
HDFS
In YARN context - Will see later
The client which submits the job
YARN resource manager
MapReduce in YARN - Will see later
The Map Side - Details
Map task writes its output to a circular in-memory buffer
Once the buffer reaches a threshold, its contents start to spill to local disk
Before writing to disk, the data is partitioned according to the reducers that the
data will be sent to
Each partition is sorted by key and the combiner, if one is configured, is run on the
sorted output
Multiple spill files may be created by the time the map finishes. These spill files are
merged into a single partitioned, sorted output file
The output file partitions are made available to reducers over HTTP
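The steps above (buffer, spill, partition, sort, combine, merge) can be imitated in memory. This is a simplified sketch under stated assumptions: the threshold counts records rather than bytes, partition_for is a deterministic stand-in for Hadoop's hash partitioner, the combiner simply sums values, and "spill files" are Python lists instead of files on disk.

```python
import heapq
from collections import defaultdict

def partition_for(key, num_reducers):
    """Stand-in partitioner: byte sum of the key modulo the reducer count."""
    return sum(key.encode("utf-8")) % num_reducers

def combine(sorted_pairs):
    """Combiner: sum the values of adjacent equal keys in a sorted run."""
    out = []
    for key, value in sorted_pairs:
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + value)
        else:
            out.append((key, value))
    return out

def spill(buffer, num_reducers):
    """Partition the buffer, then sort and combine each partition."""
    parts = defaultdict(list)
    for key, value in buffer:
        parts[partition_for(key, num_reducers)].append((key, value))
    return {p: combine(sorted(pairs)) for p, pairs in parts.items()}

def map_side(pairs, num_reducers=2, threshold=4):
    """Buffer map output, spill at the threshold, then merge all spills."""
    buffer, spills = [], []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= threshold:
            spills.append(spill(buffer, num_reducers))
            buffer = []
    if buffer:
        spills.append(spill(buffer, num_reducers))

    # Merge the sorted runs of each partition into one sorted, combined run
    merged = {}
    for p in range(num_reducers):
        runs = [s[p] for s in spills if p in s]
        merged[p] = combine(list(heapq.merge(*runs)))
    return merged

map_output = map_side(
    [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1), ("a", 1)]
)
print(map_output)
```

Each entry of the result corresponds to one partition of the final map output file, that is, the portion one reducer would fetch.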
The Reduce Side - Details
The map outputs sit on the local disks of the nodes that ran the map tasks. Reduce
tasks need this data in order to proceed
Reduce task needs the map output for its particular partition from several maps
across the cluster
The reduce task starts copying the map outputs as soon as each map completes. This
is the copy phase. The map outputs are fetched in parallel by multiple threads.
Map outputs are copied to the reduce task JVM's memory if small enough, else copied
to disk. As copies accumulate, they are merged into larger sorted files. When all are
copied, they are merged maintaining their sort order
Reduce function is invoked for each key in the sorted output and its output is written
to the output filesystem (typically HDFS)
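The final merge and the per-key reduce calls can be sketched as follows. It assumes each fetched map output is already sorted by key, as described above; reduce_side and sum_reducer are hypothetical names, and the copy phase (parallel fetch over HTTP) is not modelled.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def sum_reducer(key, values):
    """Example reduce function: total of all values for one key."""
    return (key, sum(values))

def reduce_side(map_outputs, reduce_fn):
    """Merge sorted map outputs and call reduce_fn once per key."""
    # heapq.merge keeps the overall key order without re-sorting everything
    merged = heapq.merge(*map_outputs, key=itemgetter(0))
    results = []
    # groupby works here because equal keys are adjacent after the merge
    for key, group in groupby(merged, key=itemgetter(0)):
        results.append(reduce_fn(key, [value for _, value in group]))
    return results

fetched = [
    [("a", 2), ("c", 1)],   # partition copied from one map task
    [("a", 1), ("b", 4)],   # same partition copied from another map task
]
reduced = reduce_side(fetched, sum_reducer)
print(reduced)
```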
MapReduce as Unix commands
Problem:
Input: 1 TB file containing color names - Red, Blue, Green, Yellow, Purple, Maroon
Output: Number of occurrences of colors Blue and Green
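The command pipeline from the slide is not preserved in this text. As an assumption-laden sketch of the same idea in Python: a filter stage plays the role of the map (similar to grep), and a per-key count plays the role of shuffle plus reduce (similar to sort followed by uniq -c). The name count_colors is illustrative, and the input is assumed to hold one color name per line.

```python
from collections import Counter

WANTED = {"Blue", "Green"}

def count_colors(lines):
    """Filter lines to the wanted colors, then count occurrences per color."""
    # Map-like stage: strip the newline and keep only matching records
    matching = (line.strip() for line in lines)
    # Reduce-like stage: aggregate a count per distinct key
    return Counter(color for color in matching if color in WANTED)

sample = ["Red\n", "Blue\n", "Green\n", "Blue\n", "Maroon\n"]
color_counts = count_colors(sample)
print(color_counts)
```

Because count_colors consumes its input lazily, an open file handle can be passed in place of the sample list, so the file is streamed rather than loaded into memory.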
