Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyXPo0
This CloudxLab Writing MapReduce Programs tutorial helps you to understand how to write MapReduce Programs using Java in detail. Below are the topics covered in this tutorial:
1) Why MapReduce?
2) Write a MapReduce Job to Count Unique Words in a Text File
3) Create Mapper and Reducer in Java
4) Create Driver
5) MapReduce Input Splits, Secondary Sorting, and Partitioner
6) Combiner Functions in MapReduce
7) Job Chaining and Pipes in MapReduce
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
A talk given by Julian Hyde at ApacheCon NA 2018 in Montreal on September 26th, 2018.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.
Apache Pig is a high-level platform for creating programs that runs on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sewz2m
This CloudxLab Key-Value RDD tutorial helps you to understand Key-Value RDD in detail. Below are the topics covered in this tutorial:
1) Spark Key-Value RDD
2) Creating Key-Value Pair RDDs
3) Transformations on Pair RDDs - reduceByKey(func)
4) Count Word Frequency in a File using Spark
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2kyXPo0
This CloudxLab Writing MapReduce Programs tutorial helps you to understand how to write MapReduce Programs using Java in detail. Below are the topics covered in this tutorial:
1) Why MapReduce?
2) Write a MapReduce Job to Count Unique Words in a Text File
3) Create Mapper and Reducer in Java
4) Create Driver
5) MapReduce Input Splits, Secondary Sorting, and Partitioner
6) Combiner Functions in MapReduce
7) Job Chaining and Pipes in MapReduce
A talk given by Julian Hyde at DataCouncil SF on April 18, 2019
How do you organize your data so that your users get the right answers at the right time? That question is a pretty good definition of data engineering — but it is also describes the purpose of every DBMS (database management system). And it’s not a coincidence that these are so similar.
This talk looks at the patterns that reoccur throughout data management — such as caching, partitioning, sorting, and derived data sets. As the speaker is the author of Apache Calcite, we first look at these patterns through the lens of Relational Algebra and DBMS architecture. But then we apply these patterns to the modern data pipeline, ETL and analytics. As a case study, we look at how Looker’s “derived tables” blur the line between ETL and caching, and leverage the power of cloud databases.
A talk given by Julian Hyde at ApacheCon NA 2018 in Montreal on September 26th, 2018.
Spatial and GIS applications have traditionally required specialized databases, or at least specialized data structures like r-trees. Unfortunately this means that hybrid applications such as spatial analytics are not well served, and many people are unaware of the power of spatial queries because their favorite database does not support them.
In this talk, we describe how Apache Calcite enables efficient spatial queries using generic data structures such as HBase’s key-sorted tables, using techniques like Hilbert space-filling curves and materialized views. Calcite implements much of the OpenGIS function set and recognizes query patterns that can be rewritten to use particular spatial indexes. Calcite is bringing spatial query to the masses!
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at DataWorks Summit, San Jose, on June 14th 2017.
Apache Pig is a high-level platform for creating programs that runs on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sewz2m
This CloudxLab Key-Value RDD tutorial helps you to understand Key-Value RDD in detail. Below are the topics covered in this tutorial:
1) Spark Key-Value RDD
2) Creating Key-Value Pair RDDs
3) Transformations on Pair RDDs - reduceByKey(func)
4) Count Word Frequency in a File using Spark
Don't optimize my queries, organize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogenous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
In this session you will learn:
Hadoop Data Types
Hadoop MapReduce Paradigm
Map and Reduce Tasks
Map Phase
MapReduce: The Reducer
IOException & JobConf
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
Don’t optimize my queries, optimize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt.
We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them.
A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.
This presentation is about Scalding with focus on the programming model compared to Hadoop and Cascading. I did this presentation for the group http://www.meetup.com/riviera-scala-clojure
Mapreduce examples starting from the basic WordCount to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
No more struggles with Apache Spark workloads in productionChetan Khatri
Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production.
Apache Spark
Primary data structures (RDD, DataSet, Dataframe)
Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
Parallel read from JDBC: Challenges and best practices.
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Avoid unnecessary shuffle
Alternative to spark default sort
Why dropDuplicates() doesn’t result consistency, What is alternative
Optimize Spark stage generation plan
Predicate pushdown with partitioning and bucketing
Why not to use Scala Concurrent ‘Future’ explicitly!
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Don't optimize my queries, organize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we make it optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt. We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries.
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogenous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
In this session you will learn:
Hadoop Data Types
Hadoop MapReduce Paradigm
Map and Reduce Tasks
Map Phase
MapReduce: The Reducer
IOException & JobConf
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
A talk given by Julian Hyde at Apache: Big Data, Miami, on May 16th 2017.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
Don’t optimize my queries, optimize my data!Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt.
We talk about new algorithms we developed for gathering statistics on massive database, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them.
A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.
This presentation is about Scalding with focus on the programming model compared to Hadoop and Cascading. I did this presentation for the group http://www.meetup.com/riviera-scala-clojure
Mapreduce examples starting from the basic WordCount to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
No more struggles with Apache Spark workloads in productionChetan Khatri
Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production.
Apache Spark
Primary data structures (RDD, DataSet, Dataframe)
Pragmatic explanation - executors, cores, containers, stage, job, a task in Spark.
Parallel read from JDBC: Challenges and best practices.
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Avoid unnecessary shuffle
Alternative to spark default sort
Why dropDuplicates() doesn’t result consistency, What is alternative
Optimize Spark stage generation plan
Predicate pushdown with partitioning and bucketing
Why not to use Scala Concurrent ‘Future’ explicitly!
Apache Calcite is a dynamic data management framework. Think of it as a toolkit for building databases: it has an industry-standard SQL parser, validator, highly customizable optimizer (with pluggable transformation rules and cost functions, relational algebra, and an extensive library of rules), but it has no preferred storage primitives. In this tutorial, the attendees will use Apache Calcite to build a fully fledged query processor from scratch with very few lines of code. This processor is a full implementation of SQL over an Apache Lucene storage engine. (Lucene does not support SQL queries and lacks a declarative language for performing complex operations such as joins or aggregations.) Attendees will also learn how to use Calcite as an effective tool for research.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
Learn Hadoop and Bigdata Analytics, Join Design Pathshala training programs on Big data and analytics.
This slide covers the Advance Map reduce concepts of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...IndicThreads
Session presented at the 2nd IndicThreads.com Conference on Cloud Computing held in Pune, India on 3-4 June 2011.
http://CloudComputing.IndicThreads.com
Abstract: The processing of massive amount of data gives great insights into analysis for business. Many primary algorithms run over the data and gives information which can be used for business benefits and scientific research. Extraction and processing of large amount of data has become a primary concern in terms of time, processing power and cost. Map Reduce algorithm promises to address the above mentioned concerns. It makes computing of large sets of data considerably easy and flexible. The algorithm offers high scalability across many computing nodes. This session will introduce Map Reduce algorithm, followed by few variations of the same and also hands on example in Map Reduce using Apache Hadoop.
Speaker: Allahbaksh Asadullah is a Product Technology Lead from Infosys Labs, Bangalore. He has over 5 years of experience in software industry in various technologies. He has extensively worked on GWT, Eclipse Plugin development, Lucene, Solr, No SQL databases etc. He speaks at the developer events like ACM Compute, Indic Threads and Dev Camps.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
ACADGILD:: FRONTEND LESSON -Ruby on rails vs groovy on railsPadma shree. T
Have you ever wondered as to what these terms are – Ruby, Groovy, Rail and got confused as to what all this is about? Which one to choose? Which one is better and which is not? Well then here is a blog which I write with the intention of making things clear between Ruby on Rails and Groovy on Rails.
ACADGILD:: ANDROID LESSON-How to analyze & manage memory on android like ...Padma shree. T
This Blog is all about memory management in Android. It provides information about how you can analyze & reduce memory usage while developing an Android app.
Memory management is a complex field of computer science and there are many techniques being developed to make it more efficient. This guide is designed to introduce you to some of the basic memory management issues that programmers face.
ACADGILD:: HADOOP LESSON - File formats in apache hivePadma shree. T
This Blog aims at discussing the different file formats available in Apache Hive. After reading this Blog you will get a clear understanding of the different file formats that are available in Hive and how and where to use them appropriately. Before we move forward lets discuss about Apache Hive.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Operation “Blue Star” is the only event in the history of Independent India where the state went into war with its own people. Even after about 40 years it is not clear if it was culmination of states anger over people of the region, a political game of power or start of dictatorial chapter in the democratic setup.
The people of Punjab felt alienated from main stream due to denial of their just demands during a long democratic struggle since independence. As it happen all over the word, it led to militant struggle with great loss of lives of military, police and civilian personnel. Killing of Indira Gandhi and massacre of innocent Sikhs in Delhi and other India cities was also associated with this movement.
Ethnobotany and Ethnopharmacology:
Ethnobotany in herbal drug evaluation,
Impact of Ethnobotany in traditional medicine,
New development in herbals,
Bio-prospecting tools for drug discovery,
Role of Ethnopharmacology in drug evaluation,
Reverse Pharmacology.
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
Andreas Schleicher presents at the OECD webinar ‘Digital devices in schools: detrimental distraction or secret to success?’ on 27 May 2024. The presentation was based on findings from PISA 2022 results and the webinar helped launch the PISA in Focus ‘Managing screen time: How to protect and equip students against distraction’ https://www.oecd-ilibrary.org/education/managing-screen-time_7c225af4-en and the OECD Education Policy Perspective ‘Students, digital devices and success’ can be found here - https://oe.cd/il/5yV
Synthetic Fiber Construction in lab .pptxPavel ( NSTU)
Synthetic fiber production is a fascinating and complex field that blends chemistry, engineering, and environmental science. By understanding these aspects, students can gain a comprehensive view of synthetic fiber production, its impact on society and the environment, and the potential for future innovations. Synthetic fibers play a crucial role in modern society, impacting various aspects of daily life, industry, and the environment. ynthetic fibers are integral to modern life, offering a range of benefits from cost-effectiveness and versatility to innovative applications and performance characteristics. While they pose environmental challenges, ongoing research and development aim to create more sustainable and eco-friendly alternatives. Understanding the importance of synthetic fibers helps in appreciating their role in the economy, industry, and daily life, while also emphasizing the need for sustainable practices and innovation.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
We all have good and bad thoughts from time to time and situation to situation. We are bombarded daily with spiraling thoughts(both negative and positive) creating all-consuming feel , making us difficult to manage with associated suffering. Good thoughts are like our Mob Signal (Positive thought) amidst noise(negative thought) in the atmosphere. Negative thoughts like noise outweigh positive thoughts. These thoughts often create unwanted confusion, trouble, stress and frustration in our mind as well as chaos in our physical world. Negative thoughts are also known as “distorted thinking”.
Palestine last event orientationfvgnh .pptxRaedMohamed3
An EFL lesson about the current events in Palestine. It is intended to be for intermediate students who wish to increase their listening skills through a short lesson in power point.
1. ACADGILD
INTRODUCTION
Have you ever wondered how to process huge data residing on multiple systems? Well here is a simple
solution for the same – Hadoop’s MapReduce feature.
MapReduce is a software framework for easily writing applications which process vast amounts of data
residing on multiple systems. Although it is a very powerful framework, it doesn't provide a solution
for all the big data problems.
Before discussing about MapReduce let first understand framework in general. Framework is a set of
rules which we follow or should follow to obtain the desired result. So whenever we write a
MapReduce program we should fit our solution into the MapReduce framework.
Although MapReduce is very powerful it has its limitations. Some problems like processing graph
algorithms, algorithms which require iterative processing, etc. are tricky and challenging. So
implementing such problems in MapReduce is very difficult. To overcome such problems we can
use MapReduce design pattern.
[Note: A Design pattern is a general repeatable solution to a commonly occurring problem in software
design. A design pattern isn't a finished design that can be transformed directly into code. It is a
description or template for how to solve a problem that can be used in many different situations.]
We generally use MapReduce for data analysis The most important part of data analysis is to find
outlier. An outlier is any value that is numerically distant from most of the other data points in a set of
data. These records are most interesting and unique pieces of data in the set.
The point of this blog is to develop MapReduce design pattern which aims at finding the Top K records
for a specific criteria so that we could take a look at them and perhaps figure out the reason which
made them special.
This can be achived by defining a ranking function or comparison function between two records that
determines whether one is higher than the other. We can apply this pattern to use MapReduce to find
the records with the highest value across the entire data set.
Before discussing MapReduce approach let’s understans the traditional approach of finding Top K
records in a file located on a single machine.
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
2. ACADGILD
Traditional Approach: If we are dealing with the file located in the single system or RDBMS we can
follow below steps find to K records:
1.Sort the data
2.Pick Top K records
MapReduce approach: Solving the same using MapReduce is a bit complicated because:
1.Data is not sorted
2.Data is processed across multiple nodes
For finding the Top K records in distributed file system like Hadoop using MapReduce we should
follow the below steps:
1.In MapReduce find Top K for each mapper and send to reducer
2.Reducer will in turn find the global top 10 of all the mappers
To achieve this we can follow Top-K MapReduce design patterns which is explained below with the
help of an algorithm:
class mapper:
map(key, record):
insert record into top ten sorted list
if length of array is greater-than 10 then
truncate list to a length of 10
cleanup():
for record in top sorted ten list:
emit null,record
class reducer:
reduce(key, records):
sort records
truncate records to top 10
for record in records:
emit record;
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
3. ACADGILD
The above algorithm is shown pictorially below:
As shown in Figure 1 above we find the local Top K for each mapper which is in turn sent to the
reducer.
Let’s consider the same with the help of sample data:
yearID,teamID,lgID,playerID,salary
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
4. ACADGILD
1985,ATL,NL,barkele01,870000
1985,ATL,NL,bedrost01,550000
1985,ATL,NL,benedbr01,545000
1985,ATL,NL,campri01,633333
1985,ATL,NL,ceronri01,625000
1985,ATL,NL,chambch01,800000
Above data set contains 5 columns – yearID, teamID, lgID, playerID, salary. In this example we are
finding Top K records based on salary.
For sorting the data easily we can use java.lang.TreeMap. It will sort the keys automatically.
But in the default behavior Tree sort will ignore the duplicate values which will not give the correct
results.
To overcome this we should create a TreeMap with our own comparator to include the duplicate values
and sort them.
Below is the implementation of Comparator to sort and include the duplicate values :
Comparator code:
import java.util.Comparator;
public class Salary {
private int sum;
public int getSum() {
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
5. ACADGILD
return sum;
}
public void setSum(int sum) {
this.sum = sum;
}
public Salary(int sum) {
super();
this.sum = sum;
}
}
class MySalaryComp1 implements Comparator<Salary>{
@Override
public int compare(Salary e1, Salary e2) {
if(e1.getSum()>e2.getSum()){
return 1;
} else {
return -1;
}
}
}
Mapper Code:
public class Top20Mapper extends Mapper<LongWritable, Text, NullWritable, Text> {
// create the Tree Map with MySalaryComparator
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
6. ACADGILD
public static TreeMap<sala, Text> ToRecordMap = new TreeMap<Salary , Text>(new
MySalaryComp1());
public void map(LongWritable key, Text value, Context context)throws IOException,
InterruptedException {
String line=value.toString();
String[] tokens=line.split("t");
//split the data and fetch salary
int salary=Integer.parseInt(tokens[3]);
//insert salary object as key and entire row as value
//tree map sort the records based on salary
ToRecordMap.put(new Salary (salary), new Text(value));
// If we have more than ten records, remove the one with the lowest salary
// As this tree map is sorted in descending order, the employee with
// the lowest salary is the last key.
Iterator<Entry<Salary , Text>> iter = ToRecordMap.entrySet().iterator();
Entry<Salary , Text> entry = null;
while(ToRecordMap.size()>10){
entry = iter.next();
iter.remove();
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
7. ACADGILD
}
}
protected void cleanup(Context context) throws IOException, InterruptedException {
// Output our ten records to the reducers with a null key
for (Text t:ToRecordMap.values()) {
context.write(NullWritable.get(), t);
}
}
}
According to Text input format we receive 1 line for each iteration of Mapper. In order to get the top 10
records from each input split we need all the records of split so that we can compare them and find the
Top K records.
First step in Mapper is to extract the column based on which we would like to find Top K records and
insert that value as key into TreeMap and entire row as value.
If we have more than ten records, we remove the one with the least salary as this tree map is sorted in
descending order. The employee with the least salary is the last key.
Cleanup is the overridden method of Mapper class. This method will execute at the end of mapclass i.e.
once per split. In clean up method of map we will get the Top K records for each split. After this we
send the local top 10 of each map to reducer.
Reducer Code:
import java.io.IOException;
import java.util.Iterator;
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
8. ACADGILD
import java.util.TreeMap;
import java.util.Map.Entry;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
public class Top20Reducer extends Reducer<NullWritable, Text, NullWritable, Text> {
public static TreeMap<Salary , Text> ToRecordMap = new TreeMap<Salary , Text>(new
MySalaryComp1());
public void reduce(NullWritable key, Iterable<Text> values,Context context) throws
IOException, InterruptedException {
for (Text value : values) {
String line=value.toString();
if(line.length()>0){
String[] tokens=line.split("t");
//split the data and fetch salary
int salary=Integer.parseInt(tokens[3]);
//insert salary as key and entire row as value
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
9. ACADGILD
//tree map sort the records based on salary
ToRecordMap.put(new Salary (salary), new Text(value));
}
}
// If we have more than ten records, remove the one with the lowest sal
// As this tree map is sorted in descending order, the user with
// the lowest sal is the last key.
Iterator<Entry<Salary , Text>> iter = ToRecordMap.entrySet().iterator();
Entry<Salary , Text> entry = null;
while(ToRecordMap.size()>10){
entry = iter.next();
iter.remove();
}
for (Text t : ToRecordMap.descendingMap().values()) {
// Output our ten records to the file system with a null key
context.write(NullWritable.get(), t);
}
}
}
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/
10. ACADGILD
In Reducer we will get the Top 10 records from each Mapper and the steps followed in Mappers are
repeated except the clean up phase because we have all the records in Reducer since key is the same for
all the mappers i.e. NullWritable.
The Output of the Job is Top K records.
This way we can obtain the Top K records using MadReduce functionality.
I hope this blog was helpful in giving you a better understanding of Implementing MapReduce design
pattern.
https://acadgild.com/blog/mapreduce-design-pattern-finding-top-k-records/