Here is how you can solve this problem using MapReduce-style Unix commands.
Map step:
grep -c Blue input.txt | sed 's/^/Blue /' > output/part-0000
grep -c Green input.txt | sed 's/^/Green /' > output/part-0001
Each command counts the lines matching one color (grep -c counts matching lines, so this assumes at most one color name per line), prefixes the count with the color name via sed, and writes it to its own part file.
Reduce step:
cat output/part-0000 output/part-0001 | sort -k2 -n > output
This concatenates the part files from the Map step, sorts them numerically by count, and writes a final output file, which will contain:
Blue <count>
Green <count>
So this solves the problem of counting occurrences.
Alternatively, the same problem can be solved with a single pipeline of Unix commands:
Map step:
grep -o 'Blue\|Green' input.txt | wc -l > output
This uses grep -o to search the input file for the strings "Blue" or "Green" and print only the matches, one per line. The matches are piped to wc -l, which counts the lines (and therefore the matches).
Reduce step:
cat output
This isn't really needed, as there is only one mapper; cat simply prints the contents of the output file, which holds the combined count of Blue and Green. (Note that this variant produces a single combined total rather than a separate count per color.)
So MapReduce has been simulated with Unix commands: grep extracts the relevant data (the map step), while wc -l aggregates it into a count and cat emits the result (the reduce step).
2. Why Hadoop?
Data growth is mind-boggling. Forecast for 2020: 40 trillion GB
Cost effective
Scalable
Fast
Open source
Source: https://rapidminer.com/rapidminer-acquires-radoop/
Image: http://seikun.kambashi.com/images/blog/interning_at_placeiq/2.jpg
3. What is MapReduce?
It is a powerful paradigm for parallel computation
Hadoop uses MapReduce to execute jobs on files in HDFS
Hadoop will intelligently distribute computation over the cluster
Take computation to data
4. Analogy: Counting Fans
Given a cricket stadium, count the number of fans for each player / team
Traditional way
Smart way
Smarter way?
5. (image slide)
6. (image slide)
7. (image slide)
8. Origin: Functional Programming
Map - Returns a list constructed by applying a function (the first argument) to all
items in a list passed as the second argument
map f [a, b, c] = [f(a), f(b), f(c)]
map sq [1, 2, 3] = [sq(1), sq(2), sq(3)] = [1,4,9]
Reduce - Returns a value constructed by combining all items of the list passed as the
second argument, using a function (the first argument). Can be the identity (do nothing).
reduce f [a, b, c] = f(a, b, c)
reduce sum [1, 4, 9] = sum(1, sum(4,sum(9,sum(NULL)))) = 14
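A minimal Python sketch of the same two steps, using the built-in map and functools.reduce (the sq and add helpers are defined only for this example):

from functools import reduce

def sq(x):
    return x * x          # the function applied by map

def add(a, b):
    return a + b          # the function applied by reduce

squares = list(map(sq, [1, 2, 3]))   # [1, 4, 9]
total = reduce(add, squares, 0)      # 1 + 4 + 9 = 14
print(squares, total)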
29. Anatomy of a MapReduce run
In the MapReduce context
The client which submits the job
Job tracker which coordinates the run
Task trackers which run the map and reduce
tasks
HDFS
In YARN context - Will see later
The client which submits the job
YARN resource manager
31. The Map Side - Details
The map task writes its output to a circular memory buffer
Once the buffer fills to a threshold, its contents start to spill to local disk
Before writing to disk, the data is partitioned corresponding to the reducers that the
data will be sent to
Each partition is sorted by key, and the combiner (if one is defined) is run on the sorted output
Multiple spill files may be created by the time map finishes. These spill files are
merged into a single partitioned, sorted output file
The output file partitions are made available to reducers over HTTP
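To make the map-side flow concrete, here is a minimal Python sketch of the partition, sort and combine steps described above. It is not Hadoop's implementation: the two-reducer count, the hash-based partition function and the summing combiner are assumptions chosen only for illustration.

from collections import defaultdict
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumed reducer count for this example

def partition(key):
    # Pick the reducer a key goes to (hash-based, in the spirit of Hadoop's default partitioner)
    return hash(key) % NUM_REDUCERS

def combine(key, values):
    # Combiner: pre-aggregate a key's values before they leave the map task
    return sum(values)

def map_side(records):
    # records: (key, value) pairs emitted by the map function
    buffer = defaultdict(list)            # stands in for the circular memory buffer
    for key, value in records:
        buffer[partition(key)].append((key, value))
    spill = {}
    for part, pairs in buffer.items():
        pairs.sort(key=itemgetter(0))     # sort each partition by key
        spill[part] = [(k, combine(k, [v for _, v in grp]))
                       for k, grp in groupby(pairs, key=itemgetter(0))]
    return spill                          # one sorted, combined run per partition

print(map_side([("Blue", 1), ("Green", 1), ("Blue", 1)]))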
32. The Reduce Side - Details
The map outputs are sitting on local disks. Reduce tasks need this data in order
to proceed with the reduce phase
Reduce task needs the map output for its particular partition from several maps
across the cluster
The reduce task starts copying the map outputs as soon as each map completes. This
is the copy phase. The map outputs are fetched in parallel by multiple threads.
Map outputs are copied to the reduce task JVM's memory if small enough, else copied to disk. As
copies accumulate, they are merged into larger sorted files. When all are copied,
they are merged, maintaining their sort order
The reduce function is invoked for each key in the sorted output, and the output is written
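The copy-and-merge behaviour on the reduce side can be sketched in the same spirit. Again this is only an illustration, not Hadoop's code: it assumes each map task's output arrives as a list already sorted by key, and uses a summing reduce function as the example.

from heapq import merge
from itertools import groupby
from operator import itemgetter

def reduce_fn(key, values):
    # Example reduce function: sum the values for a key
    return key, sum(values)

def reduce_side(map_outputs):
    # map_outputs: one sorted list of (key, value) pairs per map task
    merged = merge(*map_outputs, key=itemgetter(0))      # merging keeps the sort order
    for key, group in groupby(merged, key=itemgetter(0)):
        yield reduce_fn(key, [v for _, v in group])

runs = [[("Blue", 2), ("Green", 1)], [("Blue", 1), ("Green", 3)]]
print(list(reduce_side(runs)))   # [('Blue', 3), ('Green', 4)]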
33. MapReduce as Unix commands
Problem:
Input: a 1 TB file containing color names - Red, Blue, Green, Yellow, Purple, Maroon
Output: the number of occurrences of the colors Blue and Green
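Besides the Unix-command solutions given at the top of this transcript, the same job can be expressed in MapReduce terms as a short Python sketch. The file name input.txt, the whitespace-separated color names and the line-by-line streaming read are assumptions for the example; nothing beyond the per-color counts is held in memory, which matters for a 1 TB input.

from collections import Counter

WANTED = {"Blue", "Green"}

def map_fn(line):
    # Map: emit (color, 1) for each wanted color name on the line
    for word in line.split():
        if word in WANTED:
            yield word, 1

def run(path="input.txt"):        # "input.txt" is an assumed file name
    counts = Counter()            # plays the role of shuffle + reduce (summing per key)
    with open(path) as f:
        for line in f:
            for color, one in map_fn(line):
                counts[color] += one
    for color in sorted(WANTED):
        print(color, counts[color])

if __name__ == "__main__":
    run()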