ACADGILD
INTRODUCTION
Have you ever wondered how to process huge data sets residing on multiple systems? Here is a simple
solution: Hadoop's MapReduce feature.
MapReduce is a software framework for easily writing applications which process vast amounts of data
residing on multiple systems. Although it is a very powerful framework, it doesn't provide a solution
for all the big data problems.
Before discussing MapReduce, let us first understand what a framework is in general. A framework is a
set of rules that we follow to obtain the desired result. So whenever we write a MapReduce program, we
must fit our solution into the MapReduce framework.
MapReduce is very powerful, but it has its limitations. Some problems, such as graph algorithms and
algorithms that require iterative processing, are tricky to express in it, so implementing them in
MapReduce is difficult. To overcome such problems we can use MapReduce design patterns.
[Note: A Design pattern is a general repeatable solution to a commonly occurring problem in software
design. A design pattern isn't a finished design that can be transformed directly into code. It is a
description or template for how to solve a problem that can be used in many different situations.]
We generally use MapReduce for data analysis. An important part of data analysis is finding outliers.
An outlier is any value that is numerically distant from most of the other data points in a data set.
These records are often the most interesting and unique pieces of data in the set.
The point of this blog is to develop a MapReduce design pattern that finds the Top K records for a
specific criterion, so that we can take a look at them and perhaps figure out what made them special.
This can be achieved by defining a ranking function or comparison function between two records that
determines whether one is higher than the other. We can apply this pattern to use MapReduce to find
the records with the highest value across the entire data set.
Before discussing the MapReduce approach, let's understand the traditional approach of finding the
Top K records in a file located on a single machine.
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
Traditional Approach: If we are dealing with a file located on a single system or in an RDBMS, we can
follow the steps below to find the Top K records:
1. Sort the data
2. Pick the Top K records
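As a rough illustration of these two steps on a single machine, here is a minimal sketch in plain Java. The class name SingleMachineTopK and the method topK are made up for this example, and the salaries are the six values from the sample data used later in this blog.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of the MapReduce job.
class SingleMachineTopK {
    // Returns the k largest values, highest first.
    static List<Integer> topK(List<Integer> values, int k) {
        List<Integer> sorted = new ArrayList<>(values);
        // Step 1: sort the data (descending)
        sorted.sort(Collections.reverseOrder());
        // Step 2: pick the Top K records
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        List<Integer> salaries = Arrays.asList(870000, 550000, 545000, 633333, 625000, 800000);
        System.out.println(topK(salaries, 3)); // prints [870000, 800000, 633333]
    }
}
```

This works only because the whole data set fits in the memory of one machine, which is exactly the assumption that no longer holds in the MapReduce case.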
MapReduce approach: Solving the same problem using MapReduce is a bit more complicated because:
1. Data is not sorted
2. Data is processed across multiple nodes
To find the Top K records in a distributed environment like Hadoop using MapReduce, we should
follow the steps below:
1. Find the local Top K in each mapper and send it to the reducer
2. The reducer in turn finds the global Top K across the output of all the mappers
To achieve this we can follow the Top K MapReduce design pattern, which is explained below with the
help of an algorithm (here K = 10):
class mapper:
    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10 then
            truncate list to a length of 10
    cleanup():
        for record in top ten sorted list:
            emit null, record
class reducer:
    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit null, record
The above algorithm is shown pictorially in Figure 1.
As shown in Figure 1, we find the local Top K in each mapper, and each local result is in turn sent to
the reducer.
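The same two-stage idea can be tried out without a cluster. The sketch below is plain Java, not Hadoop code: each inner list stands in for one input split, localTopK plays the role of a mapper and globalTopK plays the role of the single reducer. All class and method names here are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java simulation of the pattern; names are hypothetical.
class TwoStageTopK {
    // Stand-in for one mapper: the K largest values of a single split.
    static List<Integer> localTopK(List<Integer> split, int k) {
        List<Integer> sorted = new ArrayList<>(split);
        sorted.sort(Collections.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // Stand-in for the reducer: merge every local Top K, then take the Top K again.
    static List<Integer> globalTopK(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(localTopK(split, k));
        }
        return localTopK(candidates, k);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(870000, 550000, 545000),
                Arrays.asList(633333, 625000, 800000));
        System.out.println(globalTopK(splits, 2)); // prints [870000, 800000]
    }
}
```

The reducer only ever sees at most K records per mapper, which is why the pattern stays cheap even when the splits themselves are large.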
Let’s consider the same with the help of sample data:
yearID,teamID,lgID,playerID,salary
1985,ATL,NL,barkele01,870000
1985,ATL,NL,bedrost01,550000
1985,ATL,NL,benedbr01,545000
1985,ATL,NL,campri01,633333
1985,ATL,NL,ceronri01,625000
1985,ATL,NL,chambch01,800000
The above data set contains five columns: yearID, teamID, lgID, playerID and salary. In this example
we find the Top K records based on salary.
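Before ranking, the salary has to be pulled out of each row. A minimal sketch of that step is shown below, assuming the rows are comma-separated exactly as in the sample above, so that salary is the fifth column. The class SalaryLineParser and the convention of returning -1 for the header row are choices made for this example only.

```java
// Hypothetical parser for illustration; assumes comma-separated rows with five columns.
class SalaryLineParser {
    // Returns the salary from one row, or -1 for the header or a malformed row.
    static int parseSalary(String line) {
        String[] tokens = line.split(",");
        if (tokens.length != 5) {
            return -1;
        }
        try {
            return Integer.parseInt(tokens[4].trim());
        } catch (NumberFormatException e) {
            return -1; // e.g. the header row, where the fifth column is the text "salary"
        }
    }

    public static void main(String[] args) {
        System.out.println(parseSalary("1985,ATL,NL,barkele01,870000"));        // prints 870000
        System.out.println(parseSalary("yearID,teamID,lgID,playerID,salary")); // prints -1
    }
}
```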
To keep the data sorted we can use java.util.TreeMap, which orders its entries by key automatically.
But by default a TreeMap keeps only one entry per key: inserting a second record with the same salary
replaces the first one, which would give incorrect results.
To overcome this we create the TreeMap with our own comparator, so that records with duplicate
salaries are kept and sorted.
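The difference is easy to see in a small experiment. The sketch below uses plain Integer keys instead of the Salary class, and the class name and player labels are invented for illustration. Note that a comparator that never returns 0 deliberately bends the Comparator contract, so lookups such as get() or containsKey() will not find keys in such a map; it is only suitable when the map is used as a sorted bag, as it is here.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical demo class; keys and values are made up for illustration.
class TreeMapDuplicateKeys {
    // With natural ordering, the second put replaces the first entry.
    static int sizeWithNaturalOrder() {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(550000, "playerA");
        map.put(550000, "playerB");
        return map.size();
    }

    // With a comparator that never reports equality, both entries are kept.
    static int sizeWithNeverEqualComparator() {
        Comparator<Integer> neverEqual = (a, b) -> a > b ? 1 : -1;
        TreeMap<Integer, String> map = new TreeMap<>(neverEqual);
        map.put(550000, "playerA");
        map.put(550000, "playerB");
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeWithNaturalOrder());         // prints 1
        System.out.println(sizeWithNeverEqualComparator()); // prints 2
    }
}
```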
Below is the implementation of the Comparator that sorts the records and keeps duplicate values:
Comparator code:
import java.util.Comparator;

public class Salary {
    private int sum;

    public Salary(int sum) {
        this.sum = sum;
    }

    public int getSum() {
        return sum;
    }

    public void setSum(int sum) {
        this.sum = sum;
    }
}

class MySalaryComp1 implements Comparator<Salary> {
    // Orders salaries in ascending order. It never returns 0, so two equal
    // salaries are treated as distinct keys and both stay in the TreeMap.
    @Override
    public int compare(Salary e1, Salary e2) {
        if (e1.getSum() > e2.getSum()) {
            return 1;
        } else {
            return -1;
        }
    }
}
Mapper Code:
import java.io.IOException;
import java.util.Iterator;
import java.util.Map.Entry;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Top20Mapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    // create the TreeMap with MySalaryComp1 so that equal salaries are kept
    public static TreeMap<Salary, Text> ToRecordMap = new TreeMap<Salary, Text>(new MySalaryComp1());

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // split the row and fetch the salary; this assumes comma-separated input as in
        // the sample data (salary is the fifth column) and no header row in the input
        String[] tokens = line.split(",");
        int salary = Integer.parseInt(tokens[4]);
        // insert the salary object as key and a copy of the entire row as value;
        // the TreeMap keeps the records sorted by salary
        ToRecordMap.put(new Salary(salary), new Text(value));
        // If we have more than ten records, remove the one with the lowest salary.
        // As this TreeMap is sorted in ascending order, the employee with
        // the lowest salary is the first key returned by the iterator.
        Iterator<Entry<Salary, Text>> iter = ToRecordMap.entrySet().iterator();
        while (ToRecordMap.size() > 10) {
            iter.next();
            iter.remove();
        }
    }

    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Output our ten records to the reducer with a null key
        for (Text t : ToRecordMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}
With TextInputFormat, the map method receives one line of input per call. To get the top 10 records
from each input split we need to see all the records of the split, so that we can compare them and find
the Top K records.
The first step in the Mapper is to extract the column on which we would like to rank the records, and
to insert that value as the key into the TreeMap with the entire row as the value.
If we have more than ten records, we remove the one with the lowest salary. Because the comparator
sorts the TreeMap in ascending order, the employee with the lowest salary is the first key.
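The trimming step can also be tried in isolation. The sketch below is plain Java rather than Hadoop code and swaps the iterator used in the Mapper for TreeMap.pollFirstEntry(), which removes the entry with the smallest key in one call. The class BoundedTopK, its method keepTopK and the parallel-array input are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-alone version of the "keep only K entries" step.
class BoundedTopK {
    // Returns the rows belonging to the k highest salaries, highest first.
    static List<String> keepTopK(int[] salaries, String[] rows, int k) {
        // Ascending order; never returns 0 so that equal salaries are all kept.
        Comparator<Integer> ascendingNeverEqual = (a, b) -> a > b ? 1 : -1;
        TreeMap<Integer, String> top = new TreeMap<>(ascendingNeverEqual);
        for (int i = 0; i < salaries.length; i++) {
            top.put(salaries[i], rows[i]);
            if (top.size() > k) {
                top.pollFirstEntry(); // evict the current lowest salary
            }
        }
        return new ArrayList<>(top.descendingMap().values());
    }

    public static void main(String[] args) {
        int[] salaries = {870000, 550000, 545000, 633333, 625000, 800000};
        String[] players = {"barkele01", "bedrost01", "benedbr01", "campri01", "ceronri01", "chambch01"};
        System.out.println(keepTopK(salaries, players, 3)); // prints [barkele01, chambch01, campri01]
    }
}
```

Because the map never holds more than K + 1 entries, memory use per mapper stays constant no matter how many rows the split contains.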
cleanup is a method of the Mapper class that we override. It is executed once at the end of each map
task, i.e. once per split. By the time cleanup runs, the TreeMap holds the Top K records for the split,
so this is where we send the local top 10 of each mapper to the reducer.
Reducer Code:
import java.io.IOException;
import java.util.Iterator;
import java.util.Map.Entry;
import java.util.TreeMap;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Top20Reducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    public static TreeMap<Salary, Text> ToRecordMap = new TreeMap<Salary, Text>(new MySalaryComp1());

    public void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            String line = value.toString();
            if (line.length() > 0) {
                // split the row and fetch the salary; this assumes comma-separated input
                // as in the sample data (salary is the fifth column)
                String[] tokens = line.split(",");
                int salary = Integer.parseInt(tokens[4]);
                // insert the salary as key and a copy of the entire row as value;
                // the TreeMap keeps the records sorted by salary
                ToRecordMap.put(new Salary(salary), new Text(value));
            }
        }
        // If we have more than ten records, remove the ones with the lowest salary.
        // As this TreeMap is sorted in ascending order, the record with
        // the lowest salary is the first key returned by the iterator.
        Iterator<Entry<Salary, Text>> iter = ToRecordMap.entrySet().iterator();
        while (ToRecordMap.size() > 10) {
            iter.next();
            iter.remove();
        }
        for (Text t : ToRecordMap.descendingMap().values()) {
            // Output our ten records, highest salary first, to the file system with a null key
            context.write(NullWritable.get(), t);
        }
    }
}
In the Reducer we receive the top 10 records from each Mapper and repeat the steps followed in the
Mapper, except for the cleanup phase: because every mapper emits the same key (NullWritable), all the
records arrive in a single reduce call.
The output of the job is the Top K records.
This way we can obtain the Top K records using MapReduce.
I hope this blog was helpful in giving you a better understanding of implementing a MapReduce design
pattern.

More Related Content

What's hot

Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
Julian Hyde
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
Shashwat Shriparv
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
Julian Hyde
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
Spark Summit
 
Unit 4 lecture2
Unit 4 lecture2Unit 4 lecture2
Unit 4 lecture2
vishal choudhary
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
AnandMHadoop
 
Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat Sheet
Hortonworks
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
Julian Hyde
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
Amund Tveit
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Import web resources using R Studio
Import web resources using R StudioImport web resources using R Studio
Import web resources using R Studio
Rupak Roy
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
Samir Bessalah
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Scalding
ScaldingScalding
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
Andrea Iacono
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
Julian Hyde
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
Stamatis Zampetakis
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
Venkat Datla
 

What's hot (20)

Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Hive query optimization infinity
Hive query optimization infinityHive query optimization infinity
Hive query optimization infinity
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Unit 4 lecture2
Unit 4 lecture2Unit 4 lecture2
Unit 4 lecture2
 
Session 19 - MapReduce
Session 19  - MapReduce Session 19  - MapReduce
Session 19 - MapReduce
 
Hive Functions Cheat Sheet
Hive Functions Cheat SheetHive Functions Cheat Sheet
Hive Functions Cheat Sheet
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Import web resources using R Studio
Import web resources using R StudioImport web resources using R Studio
Import web resources using R Studio
 
Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013Big Data Analytics with Scala at SCALA.IO 2013
Big Data Analytics with Scala at SCALA.IO 2013
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Scalding
ScaldingScalding
Scalding
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
 

Similar to ACADILD:: HADOOP LESSON

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Ran Silberman
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Big data hadoop distributed file system for data
Big data hadoop distributed file system for dataBig data hadoop distributed file system for data
Big data hadoop distributed file system for data
preetik9044
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Desing Pathshala
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
IndicThreads
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
Jazan University
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Vincent Poncet
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
AnilVijayagiri
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
Dilum Bandara
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
Avinash Pandu
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 

Similar to ACADILD:: HADOOP LESSON (20)

Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Big data hadoop distributed file system for data
Big data hadoop distributed file system for dataBig data hadoop distributed file system for data
Big data hadoop distributed file system for data
 
Unit 2
Unit 2Unit 2
Unit 2
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
SparkNotes
SparkNotesSparkNotes
SparkNotes
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 

More from Padma shree. T

ACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on rails
ACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on railsACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on rails
ACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on rails
Padma shree. T
 
ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...
ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...
ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...
Padma shree. T
 
ACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hiveACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hive
Padma shree. T
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
Padma shree. T
 
ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON
Padma shree. T
 
ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON
Padma shree. T
 

More from Padma shree. T (11)

ACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on rails
ACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on railsACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on rails
ACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on rails
 
ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...
ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...
ACADGILD:: ANDROID LESSON-How to analyze &amp; manage memory on android like ...
 
ACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hiveACADGILD:: HADOOP LESSON - File formats in apache hive
ACADGILD:: HADOOP LESSON - File formats in apache hive
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON ACADGILD:: HADOOP LESSON
ACADGILD:: HADOOP LESSON
 
ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON
 
ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON ACADGILD:: ANDROID LESSON
ACADGILD:: ANDROID LESSON
 

Recently uploaded

Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxStudents, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptx
EduSkills OECD
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
Col Mukteshwar Prasad
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...Cambridge International AS  A Level Biology Coursebook - EBook (MaryFosbery J...
Cambridge International AS A Level Biology Coursebook - EBook (MaryFosbery J...
AzmatAli747758
 
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptxMARUTI SUZUKI- A Successful Joint Venture in India.pptx
MARUTI SUZUKI- A Successful Joint Venture in India.pptx
bennyroshan06
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)Template Jadual Bertugas Kelas (Boleh Edit)
Template Jadual Bertugas Kelas (Boleh Edit)
rosedainty
 

Recently uploaded (20)

Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Thesis Statement for students diagnonsed withADHD.ppt

ACADGILD:: HADOOP LESSON

  • 1. ACADGILD INTRODUCTION
    Have you ever wondered how to process huge amounts of data residing on multiple systems? Hadoop's MapReduce feature is one solution. MapReduce is a software framework for writing applications that process vast amounts of data residing on multiple systems. Although it is a very powerful framework, it does not provide a solution for every big data problem.
    Before discussing MapReduce, let us first understand what a framework is in general. A framework is a set of rules that we follow to obtain the desired result, so whenever we write a MapReduce program we must fit our solution into the MapReduce framework.
    Although MapReduce is very powerful, it has its limitations. Some problems, such as graph algorithms and algorithms that require iterative processing, are difficult to implement in MapReduce. To make such problems tractable we can use MapReduce design patterns.
    [Note: A design pattern is a general, repeatable solution to a commonly occurring problem in software design. A design pattern isn't a finished design that can be transformed directly into code. It is a description or template for how to solve a problem that can be used in many different situations.]
    We generally use MapReduce for data analysis. An important part of data analysis is finding outliers. An outlier is any value that is numerically distant from most of the other data points in a data set. These records are often the most interesting and unique pieces of data in the set.
    The point of this blog is to develop a MapReduce design pattern that finds the Top K records for a specific criterion, so that we can take a look at them and perhaps figure out what makes them special. This can be achieved by defining a ranking function or comparison function between two records that determines whether one is higher than the other. We can apply this pattern with MapReduce to find the records with the highest value across the entire data set.
    Before discussing the MapReduce approach, let's understand the traditional approach to finding the Top K records in a file located on a single machine.
    https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
  • 2. Traditional approach:
    If we are dealing with a file located on a single system, or with an RDBMS, we can follow the steps below to find the Top K records:
      1. Sort the data
      2. Pick the Top K records
    MapReduce approach:
    Solving the same problem using MapReduce is a bit more complicated because:
      1. The data is not sorted
      2. The data is processed across multiple nodes
    To find the Top K records in a distributed file system like Hadoop using MapReduce, we follow the steps below:
      1. Find the local Top K in each mapper and send it to the reducer
      2. The reducer in turn finds the global Top K among the records sent by all the mappers
    To achieve this we can follow the Top-K MapReduce design pattern, which is explained below with the help of an algorithm (here K = 10):
    class mapper:
        map(key, record):
            insert record into top ten sorted list
            if length of list is greater than 10 then
                truncate list to a length of 10
        cleanup():
            for record in top ten sorted list:
                emit null, record
    class reducer:
        reduce(key, records):
            sort records
            truncate records to top 10
            for record in records:
                emit record
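The two approaches above can be compared in plain Java, without Hadoop. The following is only a minimal sketch: the class name `TopKSketch`, its method names, and the way the input is divided into two lists standing in for input splits are all assumptions made for illustration. It shows that taking a local Top K per split and then a global Top K over those candidates yields the same answer as sorting everything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class; not part of the original blog code.
final class TopKSketch {

    // Traditional approach: sort in descending order, then keep the first k values.
    static List<Integer> topKBySorting(List<Integer> values, int k) {
        List<Integer> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // MapReduce-style approach: each simulated "mapper" reduces its split to a
    // local top k, then one simulated "reducer" picks the global top k.
    static List<Integer> topKTwoPhase(List<List<Integer>> splits, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(topKBySorting(split, k)); // map side: local top k
        }
        return topKBySorting(candidates, k);            // reduce side: global top k
    }

    public static void main(String[] args) {
        // Salaries from the sample data, divided arbitrarily into two splits.
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(870000, 550000, 545000),
                Arrays.asList(633333, 625000, 800000));
        System.out.println(topKTwoPhase(splits, 2)); // prints [870000, 800000]
    }
}
```

The reducer only ever sees at most K values per split, which is why the pattern scales even when the full data set does not fit on one machine.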
  • 3. The above algorithm is shown pictorially in Figure 1: we find the local Top K in each mapper, which is in turn sent to the reducer.
    Let's work through the same idea with the help of sample data:
    yearID,teamID,lgID,playerID,salary
  • 4. 1985,ATL,NL,barkele01,870000
    1985,ATL,NL,bedrost01,550000
    1985,ATL,NL,benedbr01,545000
    1985,ATL,NL,campri01,633333
    1985,ATL,NL,ceronri01,625000
    1985,ATL,NL,chambch01,800000
    The above data set contains 5 columns: yearID, teamID, lgID, playerID, salary. In this example we are finding the Top K records based on salary.
    To sort the data easily we can use java.util.TreeMap, which keeps its keys sorted automatically. By default, however, a TreeMap holds only one entry per key, so records with the same salary would overwrite each other and the results would be incorrect. To overcome this we create the TreeMap with our own comparator, which keeps duplicate values and sorts them.
    Below is the implementation of the Comparator that sorts and keeps the duplicate values:
    Comparator code:
    import java.util.Comparator;

    // Wrapper around the salary value, used as the TreeMap key
    public class Salary {
        private int sum;

        public int getSum() {
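  • 5.         return sum;
        }

        public void setSum(int sum) {
            this.sum = sum;
        }

        public Salary(int sum) {
            super();
            this.sum = sum;
        }
    }

    // Orders salaries in ascending order and never returns 0,
    // so records with equal salaries are kept as separate keys.
    class MySalaryComp1 implements Comparator<Salary> {
        @Override
        public int compare(Salary e1, Salary e2) {
            if (e1.getSum() > e2.getSum()) {
                return 1;
            } else {
                return -1;
            }
        }
    }

    Mapper Code:
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map.Entry;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Top20Mapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        // create the TreeMap with MySalaryComp1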
  • 6.     private final TreeMap<Salary, Text> ToRecordMap = new TreeMap<Salary, Text>(new MySalaryComp1());

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // the sample data is comma-separated and salary is the fifth field
            String[] tokens = line.split(",");
            // split the data and fetch salary
            int salary = Integer.parseInt(tokens[4]);
            // insert salary object as key and entire row as value
            // the tree map sorts the records based on salary
            ToRecordMap.put(new Salary(salary), new Text(value));
            // If we have more than ten records, remove the one with the lowest salary.
            // As this tree map is sorted in ascending order, the record with
            // the lowest salary is the first key.
            Iterator<Entry<Salary, Text>> iter = ToRecordMap.entrySet().iterator();
            while (ToRecordMap.size() > 10) {
                iter.next();
                iter.remove();
  • 7.         }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Output our ten records to the reducers with a null key
            for (Text t : ToRecordMap.values()) {
                context.write(NullWritable.get(), t);
            }
        }
    }

    With the text input format, map() receives one line per call. To get the top 10 records from each input split we need to see all the records of the split, so that we can compare them and find the Top K records.
    The first step in the Mapper is to extract the column on which we would like to rank the records, and insert that value into the TreeMap as the key with the entire row as the value. If we have more than ten records, we remove the one with the lowest salary. Because this tree map is sorted in ascending order, the record with the lowest salary is the first key.
    cleanup() is an overridden method of the Mapper class. It executes at the end of the map task, i.e. once per split. By the time cleanup() runs, the TreeMap holds the Top K records for the split, and we send this local top 10 of each map to the reducer.
    Reducer Code:
    import java.io.IOException;
    import java.util.Iterator;
  • 8. import java.util.TreeMap;
    import java.util.Map.Entry;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Top20Reducer extends Reducer<NullWritable, Text, NullWritable, Text> {
        private final TreeMap<Salary, Text> ToRecordMap = new TreeMap<Salary, Text>(new MySalaryComp1());

        public void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                String line = value.toString();
                if (line.length() > 0) {
                    // the sample data is comma-separated and salary is the fifth field
                    String[] tokens = line.split(",");
                    // split the data and fetch salary
                    int salary = Integer.parseInt(tokens[4]);
                    // insert salary as key and entire row as value
  • 9.             // the tree map sorts the records based on salary
                    ToRecordMap.put(new Salary(salary), new Text(value));
                }
            }
            // If we have more than ten records, remove the ones with the lowest salary.
            // As this tree map is sorted in ascending order, the record with
            // the lowest salary is the first key.
            Iterator<Entry<Salary, Text>> iter = ToRecordMap.entrySet().iterator();
            while (ToRecordMap.size() > 10) {
                iter.next();
                iter.remove();
            }
            for (Text t : ToRecordMap.descendingMap().values()) {
                // Output our ten records to the file system with a null key,
                // highest salary first
                context.write(NullWritable.get(), t);
            }
        }
    }
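The reason both classes build their TreeMap with a custom comparator can be checked in isolation. The sketch below is an illustration only: the class `TreeMapDuplicatesSketch`, the `Integer` keys (used instead of the `Salary` wrapper to keep it short) and the player labels are hypothetical. It shows that natural ordering keeps one entry per salary, while a comparator that never returns 0 keeps duplicates in ascending order. One side effect worth knowing is that key lookups stop working on such a map, which is presumably why the code above trims entries through an iterator.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the original blog code.
final class TreeMapDuplicatesSketch {

    // Same idea as MySalaryComp1: ascending order, never 0, so equal keys coexist.
    static final Comparator<Integer> KEEP_DUPLICATES = (a, b) -> a > b ? 1 : -1;

    // With natural ordering the second put replaces the first entry.
    static int sizeWithNaturalOrdering() {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(550000, "playerA");
        map.put(550000, "playerB");
        return map.size();
    }

    // With the custom comparator both 550000 entries are kept.
    static TreeMap<Integer, String> withDuplicates() {
        TreeMap<Integer, String> map = new TreeMap<>(KEEP_DUPLICATES);
        map.put(550000, "playerA");
        map.put(550000, "playerB");
        map.put(870000, "playerC");
        return map;
    }

    public static void main(String[] args) {
        System.out.println(sizeWithNaturalOrdering());            // prints 1
        TreeMap<Integer, String> map = withDuplicates();
        System.out.println(map.size());                           // prints 3
        System.out.println(map.firstKey());                       // prints 550000 (lowest first)
        // Lookups by key fail because the comparator never reports equality.
        System.out.println(map.containsKey(550000));              // prints false
    }
}
```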
  • 10. In the Reducer we receive the top 10 records from each Mapper. The steps followed in the Mapper are repeated, except for the cleanup phase: every mapper emits its records with the same key (NullWritable), so a single reduce() call already sees all of the records. The output of the job is the Top K records.
    This way we can obtain the Top K records using MapReduce. I hope this blog was helpful in giving you a better understanding of implementing a MapReduce design pattern.
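The trimming logic shared by the mapper and the reducer can be tried on the sample records without a cluster. This is a minimal sketch under stated assumptions: the class `LocalTopKSketch` and its method are hypothetical, records are assumed to be comma-separated with the salary in the fifth field, and `pollFirstEntry()` is used in place of the iterator-based removal shown earlier (both remove the lowest entry of an ascending map).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the original blog code.
final class LocalTopKSketch {

    // Returns the k records with the highest salary, highest first.
    static List<String> topKBySalary(List<String> records, int k) {
        // Ascending comparator that never returns 0, so equal salaries are kept.
        TreeMap<Integer, String> top = new TreeMap<>((a, b) -> a > b ? 1 : -1);
        for (String record : records) {
            // Assumption: comma-separated input, salary is field index 4.
            int salary = Integer.parseInt(record.split(",")[4]);
            top.put(salary, record);
            if (top.size() > k) {
                top.pollFirstEntry(); // ascending order: first entry is the lowest salary
            }
        }
        return new ArrayList<>(top.descendingMap().values());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1985,ATL,NL,barkele01,870000",
                "1985,ATL,NL,bedrost01,550000",
                "1985,ATL,NL,benedbr01,545000",
                "1985,ATL,NL,campri01,633333",
                "1985,ATL,NL,ceronri01,625000",
                "1985,ATL,NL,chambch01,800000");
        for (String record : topKBySalary(sample, 3)) {
            System.out.println(record);
        }
    }
}
```

With K = 3 the sketch keeps the records for barkele01, chambch01 and campri01, in that order. The map never holds more than K + 1 entries at a time, which is the property that keeps each mapper's memory use small.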