MapReduce allows distributed processing of large datasets across clusters of computers. It works by splitting the input data into independent chunks, which are processed by the map function in parallel. The map function produces intermediate key-value pairs, which the framework groups by key and passes to the reduce function to form the output data. Fault tolerance is achieved through replication of data across nodes and re-execution of failed tasks. This makes MapReduce suitable for efficiently processing very large datasets in a distributed environment.
Big data analytics with Apache Hadoop
1. BIG DATA ANALYTICS
WITH APACHE HADOOP
"Big Data: A Revolution that Will Transform How We Live, Work, and Think"
-Viktor Mayer-Schönberger and Kenneth Cukier
2. Team Members
Abhishek Kumar : Y11UC010
Sachin Mittal : Y11UC189
Subodh Rawani : Y11UC230
Suman Saurabh : Y11UC231
3. Contents
1. What is Big Data?
• Definition
• Turning Data into Value: 5Vs
2. Big Data Analytics
3. Big Data and Hadoop
• History of Hadoop
• About Apache Hadoop
• Key Features of Hadoop
4. Hadoop and MapReduce
• About MapReduce
• MapReduce Architecture
• MapReduce Functionality
• MapReduce Examples
5. Definition
"Data is the oil of the 21st century, and analytics is the combustion engine."
-Peter Sondergaard, Senior Vice President, Gartner Research
"Big Data is high-volume, high-velocity and high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation."
"It is a subjective term; what it involves is analysis of data from multiple sources, joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide."
-Tom White in Hadoop: The Definitive Guide
Big Data is fuelled by two things:
• The increasing "datafication" of the world, which generates new data at frightening rates.
• Technological advancement to harness that large and complex data and perform analysis using improved techniques.
6. Big data describes the exponential growth and availability of data, both structured and unstructured. This data comes from everywhere: climate sensors, social media posts, digital files, buy/sell transaction records, cell phone GPS signals, and more.
7. Statistics of Data Generated
Big Data in Today's Business and Technology Environment
• 235 terabytes of data had been collected by the U.S. Library of Congress as of April 2011.
• Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.
• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
• In 2008, Google was processing 20,000 terabytes of data (20 petabytes) a day.
The Rapid Growth of Unstructured Data
• YouTube users upload 48 hours of new video every minute of the day.
• Brands and organizations on Facebook receive 34,722 Likes every minute of the day.
• Twitter sees roughly 175 million tweets every day, and has more than 465 million accounts.
• In late 2011, IDC Digital Universe published a report indicating that some 1.8 zettabytes of data would be created that year.
• In other words, the amount of data in the world today is equal to:
• Every person in the world having more than 215m high-resolution MRI scans a day.
• More than 200bn HD movies, which would take a person 47m years to watch.
9. Turning Big Data into Value: 5Vs
The digital era gives unprecedented amounts of data in terms of Volume, Velocity, Variety and Veracity; properly channelled, they yield Value.
Volume: Refers to the terabytes, petabytes and even zettabytes of data generated every second.
Velocity: The speed at which new data is generated every second. E.g. Google, Twitter, Facebook.
Variety: Data in different formats, such as text, images, video and so on, can be stored and processed, rather than only relational databases.
Veracity: The trustworthiness of the data. E.g. Twitter data with hashtags, abbreviations, typos and colloquial speech, as well as the reliability and accuracy of content. Even data that is not fully reliable can still be processed.
Value: Having access to big data is no good unless we can turn it into value.
10. The "Datafication" of our World
Sources: activities, conversations, words, voice, social media, browser logs, photos, videos, sensors, etc.
These feed the four Vs (Volume, Velocity, Variety, Veracity), and analysis turns them into Value.
Analysing Big Data: text analytics, sentiment analysis, face recognition, voice analytics, movement analytics, etc.
Copied from: © 2014 Advanced Performance Institute, BWMC Ltd.
New technologies in distributed systems and cloud computing, together with the latest software and analysis approaches, allow us to store and process data, and turn it into value, at a massive rate.
11. Some Big Data Use Cases by Industry
Telecommunications: network analytics, location-based services
Retail: merchandise optimization, supply-chain management
Banking: fraud detection, trade surveillance
Media: click-fraud prevention, social graph analysis
Energy: smart meter analytics, distribution load forecasting
Manufacturing: customer care call centers, customer relationships
Public sector: threat detection, cyber security
Healthcare: clinical trials data analysis, supply chain management
Insurance: catastrophe modelling, claims fraud
13. Challenges of Big Data
• How to store and protect big data?
• How to organize and catalog the data that you have backed up?
• How to keep costs low while ensuring that all the critical data is available when you need it?
• Analytical challenges
• Human resources and manpower
• Technical challenges
• Privacy and security
15. Why Big Data Analytics?
• Understand existing data resources.
• Process them and uncover patterns, correlations and other useful information that can be used to make better decisions.
• With big data analytics, data scientists and others can analyse huge volumes of data that conventional analytics and business intelligence solutions can't touch.
16. Traditional vs. Big Data Approaches
Traditional approach: structured & repeatable analysis. Business users determine what question to ask; IT structures the data to answer that question. Examples: monthly sales reports, profitability analysis, customer surveys.
Big data approach: iterative & exploratory analysis. IT delivers a platform to enable creative discovery; business explores what questions could be asked. Examples: brand sentiment, product strategy, maximum asset utilization.
18. Practical Examples of Data Analytics
Better understand and target customers
To better understand and target customers, companies expand their traditional data sets with social media data, browser data, text analytics or sensor data to get a more complete picture of their customers. The big objective, in many cases, is to create predictive models. Using big data, telecom companies can now better predict customer churn, retailers can predict what products will sell, and car insurance companies understand how well their customers actually drive.
Improving health
The computing power of big data analytics enables us to find new cures and better understand and predict disease patterns. We can use all the data from smart watches and wearable devices to better understand links between lifestyles and diseases. Big data analytics also allows us to monitor and predict epidemics and disease outbreaks, simply by listening to what people are saying, e.g. "Feeling rubbish today - in bed with a cold", or what they search for on the Internet.
Copied from: © 2014 Advanced Performance Institute, BWMC Ltd.
19. Practical Examples of Data Analytics
Improving security and law enforcement
Security services use big data analytics to foil terrorist plots and detect cyber attacks. Police forces use big data tools to catch criminals and even predict criminal activity, and credit card companies use big data analytics to detect fraudulent transactions.
Improving and optimizing cities and countries
Big data is used to improve many aspects of our cities and countries. For example, it allows cities to optimize traffic flows based on real-time traffic information as well as social media and weather data. A number of cities are currently using big data analytics with the aim of turning themselves into Smart Cities, where the transport infrastructure and utility processes are all joined up: where a bus would wait for a delayed train, and where traffic signals predict traffic volumes and operate to minimize jams.
Copied from: © 2014 Advanced Performance Institute, BWMC Ltd.
21. Brief History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its architecture wouldn't scale to the billions of pages on the Web. In 2003 Google published a paper on the Google File System (GFS), which was being used in production at Google. In 2004 the Nutch developers implemented the Nutch Distributed Filesystem (NDFS) following the GFS architecture, which solved their storage needs for the very large files generated as part of the web crawl and indexing process.
Also in 2004, Google published the paper that introduced MapReduce to the world. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop.
22. Apache Hadoop
• A framework for the distributed processing of large data sets across clusters of computers using simple programming models.
• Designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.
• Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
23. Key Features of Hadoop
1. Flexible
2. Scalable
3. Building more efficient data
economy
4. Cost Effective
5. Fault Tolerant
24. 1) Flexible
1. Hadoop is schema-less, and can absorb any type of data,
structured or not, from any number of sources.
2. Data from multiple sources can be joined and aggregated in arbitrary
ways enabling deeper analyses than any one system can provide.
3. We can develop MapReduce programs on Linux, Windows or OS X in any language, such as Python, R, C++, Perl, Ruby, etc.
25. 2) Scalable
Scalability is one of the primary forces driving the popularity and adoption of the Apache Hadoop project. A typical use case for Hadoop is an emerging Web site starting to run on a five-node cluster. New nodes can be added as needed, without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
1. Yahoo reportedly ran numerous clusters having 4000+ nodes with
four 1 TB drives per node, 15 PB of total storage capacity.
2. Facebookâs 2000-node warehouse cluster is provisioned for 21 PB of
total storage capacity. Extrapolating the announced growth rate, its
namespace should have close to 200 million objects by now.
3. eBay runs a 700-node cluster. Each node has 24 TB of local disk
storage, 72 GB of RAM, and a 12-core CPU. Total cluster size is 16
PB. It is configured to run 26,000 MapReduce tasks simultaneously.
26. 3) Building more efficient data economy
Data is the new currency of the modern world. Businesses that successfully maximize its value will have a decisive impact on their own value and on their customers' success.
Apache Hadoop allows businesses to create highly scalable and cost-
efficient data stores. It offers data value at unprecedented scale.
27. 4) Cost Effective
Hadoop brings massively parallel computing to commodity servers. The
result is a sizeable decrease in the cost per terabyte of storage, which
in turn makes it affordable to model all your data.
It's a cost-effective alternative to a conventional extract, transform, and load (ETL) process that extracts data from different systems, converts it into a structure suitable for analysis and reporting, and loads it into a database.
28. 5) Fault Tolerant
When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
Worker nodes send periodic heartbeats to the master. When a node becomes non-functional, the master marks its tasks as failed and reschedules them on healthy nodes, typically ones that hold another replica of the input data. Hadoop can also launch speculative backup copies of slow-running tasks, so a straggling or faulty node does not stall the whole job, and overall execution time is protected when any node fails.
30. HDFS Architecture
HDFS is a filesystem designed for storing
very large files with streaming data access
patterns, running on clusters of commodity
hardware. HDFS clusters consist of a
NameNode that manages the file system
metadata and DataNodes that store the
actual data.
Uses:
• Storage of large imported files from applications outside of the Hadoop ecosystem.
• Staging of imported files to be processed by Hadoop applications.
31. Hive, HBase and Pig
• Hive: Hive bridges the gap between SQL-based RDBMSs and NoSQL-based Hadoop. Datasets from HDFS and HBase can be mapped onto Hive, and queries can be written in an SQL-like language called HiveQL. Though Hive may not be the perfect panacea for complex operations, it reduces the difficulty of having to write MapReduce jobs if a programmer knows SQL.
• HBase: Inspired by Google's BigTable, HBase is a NoSQL distributed column-oriented database that runs on top of HDFS, on which random reads/writes can be performed. HBase enables you to store and retrieve random data in near real time. It can also be combined with MapReduce to ease bulk operations such as indexing or analysis.
• Pig: Apache Pig uses the data flow language Pig Latin. Pig supports relational operations such as join, group and aggregate, and it can be scaled across multiple servers simultaneously. Time-intensive ETL operations, analytics on sample data, and running complex tasks that collate multiple data sources are some of the use cases that can be handled using Pig.
32. Flume, Sqoop and Mahout
• Flume: Flume is a distributed system that aggregates streaming data from different sources and adds it to a centralized datastore for a Hadoop cluster, such as HDFS. Flume facilitates data aggregation, importing and processing data for computation in HDFS or storage in databases.
• Sqoop: Sqoop is the latest Hadoop framework to be listed in the Bossie awards for open source big data tools. Sqoop enables two-way import/export of bulk data between HDFS/Hive/HBase and relational or structured databases. Unlike Flume, Sqoop handles the transfer of structured datasets.
• Mahout: Mahout is a suite of scalable machine learning libraries implemented on top of MapReduce. Commercial use cases of machine learning include predictive analysis via collaborative filtering, clustering and classification. Product/service recommendations, investigative data mining and statistical analysis are some of its generic use cases.
34. MapReduce
• MapReduce is a programming paradigm for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• The framework is divided into two parts:
• Map, which parcels out work to different nodes in the distributed cluster.
• Reduce, which collates the work and resolves the results into a single value.
• The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks.
• Although the Hadoop framework is implemented in Java, MapReduce applications can be written in Python, Ruby, R, C++, etc., e.g. via Hadoop Streaming or Hadoop Pipes.
36. MapReduce Core Functionality (I)
Data flow beyond the two key pieces (map and reduce):
• Input reader: divides input into appropriately sized splits, which get assigned to a Map function.
• Map function: maps file data to smaller, intermediate <key, value> pairs.
• Compare function: input for Reduce is pulled from the Map intermediate output and sorted according to the compare function.
• Reduce function: takes intermediate values and reduces them to a smaller solution handed back to the framework.
• Output writer: writes the file output.
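The five pieces above can be lined up in a small, self-contained simulation (plain Python standing in for the framework; the function names are illustrative):

```python
from itertools import groupby

def input_reader(text, split_size=2):
    """Divide the input into fixed-size splits, one per (simulated) map task."""
    lines = text.splitlines()
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def map_fn(split):
    """Map file data to smaller, intermediate (key, value) pairs."""
    return [(word, 1) for line in split for word in line.split()]

def reduce_fn(key, values):
    """Reduce the intermediate values for one key to a single result."""
    return (key, sum(values))

def run_job(text):
    intermediate = []
    for split in input_reader(text):           # input reader assigns splits to maps
        intermediate.extend(map_fn(split))     # map phase
    intermediate.sort(key=lambda kv: kv[0])    # sorted per the compare function (key order)
    return [reduce_fn(key, [v for _, v in grp])                     # reduce phase
            for key, grp in groupby(intermediate, key=lambda kv: kv[0])]
    # an output writer would then persist the returned list
```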
37. How MapReduce Works
User to-do list:
• Indicate:
• input/output files
• M: number of map tasks
• R: number of reduce tasks
• W: number of machines
• Write the map and reduce functions
• Submit the job
Framework behaviour:
• Input files are split into M pieces on the distributed file system (typically ~64 MB blocks).
• Intermediate files created by map tasks are written to local disk.
• A sorted and shuffled output is sent to the reduce framework (a combiner is also used in most cases).
• Output files are written to the distributed file system.
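To make the "~64 MB splits" point concrete, here is a back-of-the-envelope sketch (64 MB was the classic default block size; it is configurable, and in practice split boundaries also respect record boundaries):

```python
import math

def num_map_tasks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Each block of the input typically becomes one split, hence one map task (M)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GB input file at 64 MB blocks yields M = 16 map tasks.
```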
39. MapReduce Examples
1. WordCount (reads text files and counts how often words occur).
2. TopN (finds the top-n most used words of a text file).
40. 1. WordCount
Reads text files and counts how often each word occurs.
The input and the output are text files.
We need three classes:
• WordCount.java: Driver class with main function
• WordMapper.java: Mapper class with map method
• SumReducer.java: Reducer class with reduce method
42. WordCount Example (Contd.)
WordMapper.java: Mapper class with map function
For the given sample input, assuming two map nodes, the input is distributed to the maps.
The first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
43. WordCount Example (Contd.)
SumReducer.java: Reducer class with reduce function
For the input from the two mappers, the reduce method simply sums up the values, which are the occurrence counts for each key.
Thus the output of the job is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
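The trace on slides 42-43 can be checked with a few lines of plain Python (a stand-in for the Java job, using the two-line sample input implied by the emitted pairs):

```python
from collections import Counter

def word_count(lines):
    """Simulate the WordCount job: map each line to (word, 1) pairs,
    then sum the counts per word, as SumReducer does."""
    counts = Counter(word for line in lines for word in line.split())
    return sorted(counts.items())  # reducer output comes back in key order

sample = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
# word_count(sample) reproduces the slide's output:
# <Bye,1> <Goodbye,1> <Hadoop,2> <Hello,2> <World,2>
```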
45. WordCount (Driver): check the input and output files
public class WordCount {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.println("usage: [input] [output]");
System.exit(-1);
}
Job job = Job.getInstance(new Configuration());
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setJarByClass(WordCount.class);
job.submit();
}
}
Slides 46-51 repeat the same driver listing, highlighting one configuration step at a time:
46. Set output (key, value) types: job.setOutputKeyClass(Text.class), job.setOutputValueClass(IntWritable.class)
47. Set Mapper/Reducer classes: job.setMapperClass(WordMapper.class), job.setReducerClass(SumReducer.class)
48. Set input/output format classes: job.setInputFormatClass(TextInputFormat.class), job.setOutputFormatClass(TextOutputFormat.class)
49. Set input/output paths: FileInputFormat.setInputPaths(job, new Path(args[0])), FileOutputFormat.setOutputPath(job, new Path(args[1]))
50. Set the driver class: job.setJarByClass(WordCount.class)
51. Submit the job to the master node: job.submit()
52. WordMapper (Mapper class)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordMapper extends Mapper<Object, Text, Text, IntWritable> {
private Text word = new Text();
private final static IntWritable one = new IntWritable(1);
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
// Break the line into words for processing
StringTokenizer wordList = new StringTokenizer(value.toString());
while (wordList.hasMoreTokens()) {
word.set(wordList.nextToken());
context.write(word, one);
}
}
}
Slides 53-57 repeat the same mapper listing, highlighting one aspect at a time:
53. Extends the Mapper class with input/output key and value types: Mapper<Object, Text, Text, IntWritable>
54. Output (key, value) types: Text and IntWritable
55. Input (key, value) types; output is written through the Context object
56. Read words from each line of the input file: a StringTokenizer over value.toString()
57. Count each word: context.write(word, one) emits <word, 1>
58. Shuffler/Sorter
Maps emit (key, value) pairs.
The shuffler/sorter of the Hadoop framework sorts the (key, value) pairs by key, then appends the values to make (key, list of values) pairs.
For example, the first and second maps emit:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
The shuffler produces the following, which becomes the input of the reducer:
<Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1,1>>, <World, <1,1>>
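The sort-then-group behaviour described above is exactly what sorting followed by itertools.groupby does, so the slide's example can be reproduced in a few lines (a simulation of the framework step, not Hadoop code):

```python
from itertools import groupby

def shuffle(map_output):
    """Sort (key, value) pairs by key, then gather values into (key, [values]) pairs."""
    ordered = sorted(map_output, key=lambda kv: kv[0])
    return [(key, [v for _, v in grp])
            for key, grp in groupby(ordered, key=lambda kv: kv[0])]

emitted = [("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1),
           ("Hello", 1), ("Hadoop", 1), ("Goodbye", 1), ("Hadoop", 1)]
# shuffle(emitted) yields <Bye,[1]>, <Goodbye,[1]>, <Hadoop,[1,1]>, <Hello,[1,1]>, <World,[1,1]>
```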
59. SumReducer (Reducer class)
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable totalWordCount = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int wordCount = 0;
Iterator<IntWritable> it=values.iterator();
while (it.hasNext()) {
wordCount += it.next().get();
}
totalWordCount.set(wordCount);
context.write(key, totalWordCount);
}
}
Slides 60-64 repeat the same reducer listing, highlighting one aspect at a time:
60. Extends the Reducer class with input/output key and value types: Reducer<Text, IntWritable, Text, IntWritable>
61. Output value type: the IntWritable totalWordCount
62. Input is a (key, list of values) pair; output is written through the Context class
63. For each word, count/sum the number of values
64. For each word, the total count becomes the output value: context.write(key, totalWordCount)
65. SumReducer
Input (produced by the shuffler):
<Bye, 1>, <Goodbye, 1>, <Hadoop, <1,1>>, <Hello, <1,1>>, <World, <1,1>>
Output:
<Bye, 1>, <Goodbye, 1>, <Hadoop, 2>, <Hello, 2>, <World, 2>
66. Map() and Reduce()
• Map()
The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of <<word>, 1>.
For a sample input, the first map emits:
<Deer, 1> <Beer, 1> <River, 1>
The second map emits:
<Car, 1> <River, 1> <Car, 1>
After local aggregation of each map's output (a combiner), the output of the first map is:
<Deer, 1> <Beer, 1> <River, 1>
and the output of the second map is:
<Car, 2> <River, 1>
67. Map() and Reduce() (Continued)
• Reduce()
The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for each key (i.e. words in this example).
68. 2. TopN
• We want to find the top-n most used words of a text file: "Flatland" by E. Abbott.
• The input and the output are text files.
• We need three classes:
• TopN.java: driver class with main function
• TopNMapper.java: mapper class with map method
• TopNReducer.java: reducer class with reduce method
71. TopNMapper
/**
* The mapper reads one line at a time, splits it into an array of single words and emits every
* word to the reducers with the value of 1.
*/
public static class TopNMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
// Regex character class of punctuation to strip; brackets and quotes must be escaped in Java
private String tokens = "[_|$#<>^=\\[\\]*/,;.\\-:()?!\"']";
@Override
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String cleanLine = value.toString().toLowerCase().replaceAll(tokens, " ");
StringTokenizer itr = new StringTokenizer(cleanLine);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken().trim());
context.write(word, one);
}
}
}
72. TopNReducer
/**
* The reducer retrieves every word and puts it into a Map: if the word already exists in the
* map, increments its value, otherwise sets it to 1.
*/
public static class TopNReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private Map<Text, IntWritable> countMap = new HashMap<>();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
//computes the number of occurrences of a single word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
// puts the number of occurrences of this word into the map.
// We need to create another Text object because the Text instance
// we receive is the same for all the words
countMap.put(new Text(key), new IntWritable(sum));
}
// Not shown on the slide: a cleanup(Context) method runs after all keys are
// reduced; it sorts countMap by count and writes out only the top-n entries.
}
74. TopN Results
The 2286
Of 1634
And 1098
That 499
You 429
Not 317
But 279
For 267
By 317
In the shuffle and sort phase, the partitioner will send every single word (the key) with the value "1" to the reducers.
All these network transmissions can be minimized if we locally reduce the data that the mapper will emit.
This is achieved with a combiner.
75. TopNCombiner
/**
* The combiner locally sums the occurrences of each word emitted by a single mapper
* and forwards one (word, partial sum) pair per distinct word to the reducers.
*/
public static class TopNCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
// computes the number of occurrences of a single word
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
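The effect quantified by the counters on the next slide can be shown in miniature with plain Python (a simulation; the sample line below is made up, not from the Flatland job): without a combiner, one record crosses the network per word occurrence; with one, it is one record per distinct word per mapper, and the reducer's final totals are unchanged.

```python
from collections import Counter

def map_output(line):
    """One (word, 1) record per occurrence, exactly as the mapper emits."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Locally pre-aggregate a single mapper's output, like TopNCombiner."""
    return sorted(Counter(word for word, _ in pairs).items())

line = "the quick fox and the lazy dog and the cat"
raw = map_output(line)    # 10 records would be shuffled without a combiner
combined = combine(raw)   # 7 records (one per distinct word) with a combiner
```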
76. Hadoop Output: With and Without Combiner
Without combiner:
• Map input records = 4239
• Map output records = 37817
• Map output bytes = 359621
• Input split bytes = 118
• Combine input records = 0
• Combine output records = 0
• Reduce input groups = 4987
• Reduce shuffle bytes = 435261
• Reduce input records = 37817
• Reduce output records = 20
With combiner:
• Map input records = 4239
• Map output records = 37817
• Map output bytes = 359621
• Input split bytes = 116
• Combine input records = 37817
• Combine output records = 20
• Reduce input groups = 20
• Reduce shuffle bytes = 194
• Reduce input records = 20
• Reduce output records = 20
77. Advantages and Disadvantages of Using a Combiner
Advantages:
• Network transmissions are minimized.
Disadvantages:
• Hadoop doesn't guarantee the execution of a combiner: it can be executed 0, 1 or multiple times on the same input.
• Key-value pairs emitted from the mapper are stored in the local file system, and execution of the combiner can cause extensive I/O operations.