MapReduce is a programming model for processing large datasets in a distributed computing environment. It works in three stages: map, shuffle, and reduce. In the map stage, the mapper processes input data and converts it into intermediate key-value pairs. These pairs are shuffled and sorted in the shuffle stage. Finally, in the reduce stage, the reducer processes the intermediate data to generate the final output. MapReduce provides an easy way to scale applications by distributing processing across large clusters of commodity servers, allowing parallel processing of large datasets in a reliable, fault-tolerant manner.
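The three stages above can be sketched in a few lines of Python. This is a hypothetical single-process illustration using word count, the classic MapReduce example; the function names (`map_stage`, `shuffle_stage`, `reduce_stage`) are chosen here for clarity and are not part of any real framework.

```python
from collections import defaultdict

def map_stage(records):
    # Mapper: each input line yields intermediate (word, 1) pairs.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_stage(pairs):
    # Shuffle: group all intermediate values that share a key, sorted by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_stage(groups):
    # Reducer: collapse each key's list of values into one final value.
    return {key: sum(values) for key, values in groups}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_stage(shuffle_stage(map_stage(lines)))
# counts == {"brown": 1, "dog": 1, "fox": 1, "lazy": 1, "quick": 1, "the": 2}
```

In a real cluster each stage runs on many machines in parallel, but the data flow is exactly this pipeline.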
2. What is MapReduce?
MapReduce is a massively parallel technique for processing large datasets.
Maps are the individual tasks that transform input records into
intermediate records.
A MapReduce program executes in three stages:
1. Map stage
2. Shuffle stage
3. Reduce stage
4. Cont..
Map − A user-defined function that takes a series of key-value pairs and
processes each one to generate zero or more intermediate key-value pairs.
Shuffle and Sort − The process of transferring the intermediate outputs from
the map tasks to the reducers that require them is known as shuffling.
Reduce − Reduces the set of intermediate values that share a key to a smaller
set of values. All of the values with the same key are presented to a single
reducer together.
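The shuffle-and-sort step can be shown in isolation. This is an assumed, simplified single-machine sketch: intermediate pairs from several map tasks are sorted by key so that each key's values arrive at one reducer together. The task names (`map_task_1`, `map_task_2`) are illustrative only.

```python
from itertools import groupby

# Intermediate (key, value) pairs emitted by two hypothetical map tasks.
map_task_1 = [("apple", 1), ("banana", 1)]
map_task_2 = [("apple", 1), ("cherry", 1)]

# Shuffle and sort: merge the outputs and order them by key,
# then group so each key carries the full list of its values.
intermediate = sorted(map_task_1 + map_task_2)
grouped = [(key, [v for _, v in group])
           for key, group in groupby(intermediate, key=lambda kv: kv[0])]
# grouped == [("apple", [1, 1]), ("banana", [1]), ("cherry", [1])]
```

Note that both map tasks emitted pairs for "apple", yet after the shuffle a single reducer receives all of "apple"'s values at once.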
5. Why MapReduce?
Large-scale data processing used to be difficult:
Managing hundreds of thousands of processes
Managing parallelization and distribution
Ensuring reliable execution with easy data access
MapReduce handles all of these concerns for you!
7. Why MapReduce?
Traditional enterprise systems normally use a centralized server to store and
process data. The following illustration depicts a schematic view of such a
system. This centralized model is not suitable for processing huge volumes of
data: the workload cannot be accommodated by standard database servers, and
the central server becomes a bottleneck when processing many files
simultaneously.
Google solved this bottleneck with an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. The
partial results are then collected in one place and integrated to form the
final result dataset.
10. DISADVANTAGES
It is not always easy to express every computation as a MapReduce
program.
Performance suffers when processing requires a lot of data to be shuffled
over the network.
MapReduce is a poor fit for streaming data; it is best suited to batch
processing of huge amounts of data that you already have.
11. CONCLUSION
MapReduce provides a simple way to scale your application.
Scale effortlessly from a single machine to thousands.
The MapReduce programming model has been used successfully
at Google for several completely different purposes.