2. Why Hadoop?
● Data growth is mind-boggling: forecasts projected 40 trillion GB (40 ZB) by 2020
● Cost effective
● Scalable
● Fast
● Open source
Source: https://rapidminer.com/rapidminer-acquires-radoop/
Image: http://seikun.kambashi.com/images/blog/interning_at_placeiq/2.jpg
3. What is MapReduce?
● It is a powerful paradigm for parallel computation
● Hadoop uses MapReduce to execute jobs on files in HDFS
● Hadoop intelligently distributes the computation over the cluster
● It takes the computation to the data
4. Analogy: Counting Fans
● Given a cricket stadium, count the number of fans for each player / team
● Traditional way
● Smart way
● Smarter way?
8. Origin: Functional Programming
● Map - Returns a list constructed by applying a function (the first argument) to all
items in a list passed as the second argument
○ map f [a, b, c] = [f(a), f(b), f(c)]
○ map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1,4,9]
● Reduce - Returns a value constructed by applying a function (the first
argument) cumulatively to the items of the list passed as the second
argument. Can be the identity (return the input unchanged).
○ reduce f [a, b, c] = f(a, f(b, c))
○ reduce sum [1, 4, 9] = sum(1, sum(4, 9)) = 14
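The two primitives above can be sketched in Python, using `functools.reduce` and a small squaring function:

```python
# Minimal sketch of the map and reduce primitives from the slide.
from functools import reduce

def sq(x):
    return x * x

mapped = list(map(sq, [1, 2, 3]))
# mapped == [1, 4, 9]

total = reduce(lambda a, b: a + b, mapped)
# total == 14, i.e. sum(1, sum(4, 9))
print(mapped, total)
```

Note that `map` is trivially parallel (each item is independent), which is exactly what Hadoop exploits across a cluster.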
29. Anatomy of a MapReduce Run
● In the classic MapReduce context
○ The client, which submits the job
○ The jobtracker, which coordinates the job run
○ The tasktrackers, which run the map and
reduce tasks
○ HDFS
● In the YARN context - covered later
○ The client which submits the job
○ YARN resource manager
○ YARN node managers
○ Map Reduce App Master
○ HDFS
31. The Map Side - Details
● Each map task writes its output to a circular in-memory buffer
● Once the buffer reaches a threshold (80% full by default), a background
thread starts to spill the contents to local disk
● Before writing to disk, the data is partitioned according to the reducers
that the data will be sent to
● Within each partition, the data is sorted by key, and the combiner (if one
is defined) is run on the sorted output
● Multiple spill files may be created by the time the map finishes; these
spill files are merged into a single partitioned, sorted output file
● The output file partitions are made available to reducers over HTTP
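The partitioning step can be sketched as a hash of the key modulo the number of reducers. This mirrors Hadoop's default HashPartitioner; Python's built-in `hash` stands in for Java's `hashCode` here:

```python
# Sketch of hash partitioning: every record with the same key is
# assigned to the same reducer partition.
def partition(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 3
spill = {}  # partition number -> list of (key, value) records
for record in [("Blue", 1), ("Green", 1), ("Blue", 1)]:
    p = partition(record[0], num_reducers)
    spill.setdefault(p, []).append(record)
# Both ("Blue", 1) records land in the same partition, so a single
# reducer sees every value for that key.
```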
32. The Reduce Side - Details
● The map outputs sit on the local disks of the machines that ran the maps.
Reduce tasks need this data in order to proceed
● Each reduce task needs the map output for its particular partition from
many maps across the cluster
● The reduce task starts copying map outputs as soon as each map completes.
This is the copy phase; the map outputs are fetched in parallel by multiple
threads
● Map outputs are copied into the reduce task JVM's memory if small enough,
else copied to disk. As copies accumulate, they are merged into larger
sorted files; when all have been copied, they are merged maintaining their
sort order
● The reduce function is invoked once for each key in the sorted output, and
the output is written directly to HDFS
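The merge-and-group behaviour on the reduce side can be sketched with two already-sorted map outputs (the colour data here is illustrative):

```python
# Sketch: merge several sorted map outputs while preserving order,
# then invoke the reduce function once per key.
import heapq
from itertools import groupby

map_output_1 = [("Blue", 1), ("Green", 1)]  # each already sorted by key
map_output_2 = [("Blue", 1), ("Red", 1)]

merged = heapq.merge(map_output_1, map_output_2)  # one sorted stream
for key, group in groupby(merged, key=lambda kv: kv[0]):
    print(key, sum(v for _, v in group))
# Blue 2
# Green 1
# Red 1
```

Because the inputs are already sorted, the merge is cheap and all values for a key arrive contiguously, so each key can be reduced in a single pass.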
33. MapReduce as Unix Commands
Problem:
● Input
○ A 1 TB file containing color names - Red, Blue, Green, Yellow,
Purple, Maroon
● Output
○ The number of occurrences of the colors Blue and Green
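The problem can be sketched end to end as map, shuffle, and reduce stages in Python; the small in-memory list stands in for the 1 TB input file:

```python
# Map / shuffle / reduce sketch for counting Blue and Green.
from collections import defaultdict

lines = ["Red", "Blue", "Green", "Blue", "Maroon", "Green"]  # toy input

# Map: emit (color, 1) only for the colors of interest
mapped = [(color, 1) for color in lines if color in ("Blue", "Green")]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: sum the counts for each color
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'Blue': 2, 'Green': 2}
```

Filtering in the map stage keeps the shuffle small: only records for Blue and Green ever cross the network.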