Agenda: what to expect
● History.
● What is Hadoop.
● Hadoop vs SQL.
● MapReduce.
● Hadoop Building Blocks.
● Installing, Configuring and Running Hadoop.
● Anatomy of MapReduce program.
Hadoop Series Resources
How was Hadoop born?
Doug Cutting
Challenges of Distributed Processing of Large Data
● How to distribute the work?
● How to store and distribute the data itself?
● How to overcome failures?
● How to balance the load?
● How to deal with unstructured data?
● ...
Hadoop tackles these challenges!
So, what’s Hadoop?
What is Hadoop?
Hadoop is an open source framework for writing and
running distributed applications that process large
amounts of data.
Key distinctions of Hadoop:
● Accessible
● Robust
● Scalable
● Simple
Hadoop vs SQL
● Structured (SQL) and unstructured (Hadoop) data.
● Datastore and Data Analysis.
● Scale-out (Hadoop) and scale-up (SQL).
● Offline batch processing (Hadoop) and online transactions (SQL).
Hadoop Uses
MapReduce
What is MapReduce?...
● Parallel programming model for clusters of
commodity machines.
● MapReduce provides:
o Automatic parallelization & distribution.
o Fault tolerance.
o Locality of data.
MapReduce … Map then Reduce
Keys and Values
● Key/Value pairs.
● Keys divide the reduce space: all values that share a key go to the same reduce call.
         Input              Output
Map      <k1, v1>           list(<k2, v2>)
Reduce   <k2, list(v2)>     list(<k3, v3>)
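The step between the two rows of the table, turning list(<k2, v2>) into <k2, list(v2)>, can be sketched in plain Java. This is only an in-memory illustration under stated assumptions: `ShuffleSketch` and `groupByKey` are hypothetical names, not Hadoop classes, and a real cluster does this grouping across machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: groups map output by key.
class ShuffleSketch {
    // Turns list(<k2, v2>) into <k2, list(v2)>, with keys in sorted order.
    static <K extends Comparable<K>, V> Map<K, List<V>> groupByKey(
            List<Map.Entry<K, V>> mapOutput) {
        Map<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("this", 1));
        mapOutput.add(new SimpleEntry<>("is", 1));
        mapOutput.add(new SimpleEntry<>("this", 1));
        System.out.println(groupByKey(mapOutput)); // {is=[1], this=[1, 1]}
    }
}
```

Each entry of the returned map corresponds to one reduce call: the key and the full list of values emitted for it.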
WordCount in Action
Input:
foo.txt: “This is the foo file”
bar.txt: “And this is the bar one”
Map output:
foo.txt: <this, 1>, <is, 1>, <the, 1>, <foo, 1>, <file, 1>
bar.txt: <and, 1>, <this, 1>, <is, 1>, <the, 1>, <bar, 1>, <one, 1>
Reduce#1: Input: this, [1, 1] Output: this, 2
Reduce#2: Input: is, [1, 1] Output: is, 2
Reduce#3: Input: foo, [1] Output: foo, 1
...
Final output:
this 2
is 2
the 2
foo 1
file 1
and 1
bar 1
one 1
WordCount with MapReduce
map(String filename, String document) {
    List<String> T = tokenize(document);
    for each token in T {
        emit ((String)token, (Integer) 1);
    }
}
reduce(String token, List<Integer> values) {
    Integer sum = 0;
    for each value in values {
        sum = sum + value;
    }
    emit ((String)token, (Integer) sum);
}
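The pseudocode above can be tried out as a single-process simulation in plain Java. This is a sketch, not Hadoop code: `WordCountSketch` and its methods are hypothetical names, tokenizing on whitespace and lowercasing are assumptions made to match the final output shown earlier, and the grouping step stands in for the shuffle a real cluster performs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process simulation of the pseudocode; no Hadoop classes are used.
class WordCountSketch {
    // map: emit <token, 1> for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : document.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // reduce: sum all the counts emitted for one token.
    static int reduce(String token, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs map on every document, groups by key, then reduces each group.
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            for (Map.Entry<String, Integer> pair : map(document)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "This is the foo file", "And this is the bar one")));
        // {and=1, bar=1, file=1, foo=1, is=2, one=1, the=2, this=2}
    }
}
```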
Hadoop Building Blocks
How does Hadoop work?
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker
HDFS: NameNode and DataNodes
JobTracker and TaskTracker
Typical Hadoop Cluster
Running Hadoop
Three modes to run Hadoop:
1. Local (standalone) mode.
2. Pseudo-distributed mode ("cluster of one").
3. Fully distributed mode.
In Action
Running Hadoop on a Local Machine
Actions ...
1. Installing Hadoop.
2. Configuring Hadoop (Pseudo-distributed mode).
3. Running WordCount example.
4. Web-based cluster UI.
HDFS
1. HDFS is a filesystem designed for large-scale
distributed data processing.
2. HDFS isn’t a native Unix filesystem.
Basic File Commands:
$ hadoop fs -cmd <args>
$ hadoop fs -ls
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -copyFromLocal
Anatomy of a MapReduce program
MapReduce and beyond
Hadoop
1. Data Types
2. Mapper
3. Reducer
4. Partitioner
5. Combiner
6. Reading and Writing
a. InputFormat
b. OutputFormat
Hadoop Data Types
● Hadoop defines its own way of serializing key/value pairs.
● Values should implement the Writable interface.
● Keys should implement the WritableComparable interface.
● Some predefined classes:
o BooleanWritable.
o ByteWritable.
o IntWritable
o ...
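To make the serialization contract concrete, here is a sketch of a custom key in plain Java. `WordLengthKey` is a hypothetical class that only mirrors the shape of the contract: Hadoop's Writable interface declares `write(DataOutput)` and `readFields(DataInput)`, and WritableComparable adds `compareTo`. The class below does not implement the Hadoop interfaces so that it runs without Hadoop on the classpath, and `roundTrip` is an illustrative helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Hadoop key type (not a real Hadoop class).
class WordLengthKey implements Comparable<WordLengthKey> {
    private String word = "";
    private int length;

    WordLengthKey() {}

    WordLengthKey(String word) {
        this.word = word;
        this.length = word.length();
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(length);
    }

    // Deserialize the fields in the same order they were written.
    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        length = in.readInt();
    }

    // Keys must be comparable so they can be sorted before the reduce phase.
    @Override
    public int compareTo(WordLengthKey other) {
        int byLength = Integer.compare(length, other.length);
        return byLength != 0 ? byLength : word.compareTo(other.word);
    }

    String word() {
        return word;
    }

    // Writes the key to a byte buffer and reads it back into a new object.
    static WordLengthKey roundTrip(WordLengthKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            WordLengthKey copy = new WordLengthKey();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The no-argument constructor matters in this pattern: the object is created empty first and then filled in by `readFields`.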
Mapper
1. Extends Mapper<K1,V1,K2,V2>
2. Overrides method:
void map(K1 key, V1 value, Context context)
3. Use context.write(K2, V2) to emit key/value pairs.
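WordCount Mapper
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside the job's driver class, hence "static".
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // key: offset of the line in the file; value: the line itself.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // emit <word, 1>
        }
    }
}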
Predefined Mappers
Reducer
1. Extends Reducer<K2,V2,K3,V3>
2. Overrides method:
void reduce(K2 key, Iterable<V2> values, Context context)
3. Use context.write(K3, V3) to emit key/value pairs.
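WordCount Reducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Nested inside the job's driver class, hence "static".
public static class Reduce
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    // key: a word; values: every count emitted for that word.
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum)); // emit <word, total>
    }
}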
Predefined Reducers
Partitioner
The partitioner decides which reducer each key goes to.
class WordSizePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numOfPartitions) {
        return 0; // placeholder: every key goes to partition 0
    }
}
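Two partitioning rules can be compared in plain Java, outside of Hadoop. `PartitionSketch` is a hypothetical class. `hashPartition` uses the same formula as Hadoop's default HashPartitioner (mask off the sign bit, then take the remainder), while `wordSizePartition` is only a guess at what the WordSizePartitioner name suggests, since the slide's version returns 0 for every key.

```java
// Hypothetical helper, not part of Hadoop: plain-Java partitioning rules.
class PartitionSketch {
    // Hash partitioning: clear the sign bit so the result is never negative,
    // then map the hash onto the available partitions.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Assumed intent of a "word size" partitioner: route by word length.
    static int wordSizePartition(String word, int numPartitions) {
        return word.length() % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 2;
        for (String word : new String[] {"this", "is", "the", "foo", "file"}) {
            System.out.println(word
                    + " -> hash partition " + hashPartition(word, reducers)
                    + ", size partition " + wordSizePartition(word, reducers));
        }
    }
}
```

Whatever rule is chosen, it has to be deterministic: the same key must always land in the same partition, otherwise one key's values would be split across reducers.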
Combiner
A combiner is a local reduce step that runs on a mapper's output before it is sent to the reducers.
WordCount Mapper Output:
1. Without Combiner: <the, 1>, <file, 1>, <the, 1>, …
2. With Combiner: <the, 2>, <file, 2>, ...
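The effect of a combiner can be sketched as local aggregation over one mapper's tokens. `CombinerSketch` is a hypothetical plain-Java class, not a Hadoop API; in a real job the combiner is typically a Reducer class applied to map output on the mapper's node.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: pre-sums counts on the mapper side.
class CombinerSketch {
    // Instead of emitting <token, 1> once per occurrence, emit one
    // <token, localCount> pair per distinct token seen by this mapper.
    static Map<String, Integer> combine(List<String> mapperTokens) {
        Map<String, Integer> local = new LinkedHashMap<>();
        for (String token : mapperTokens) {
            local.merge(token, 1, Integer::sum);
        }
        return local;
    }

    public static void main(String[] args) {
        System.out.println(combine(Arrays.asList("the", "file", "the")));
        // {the=2, file=1}
    }
}
```

Because word counts are summed, combining early does not change the final totals; it only reduces the number of pairs sent over the network.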
Reading and Writing
1. Input data usually resides in large files.
2. MapReduce’s processing power comes from splitting the input data into chunks (InputSplits) that can be processed in parallel.
3. Hadoop’s FileSystem provides the class FSDataInputStream for file reading. It extends DataInputStream with random read access.
InputFormat Classes
● TextInputFormat
o <offset, line>
● KeyValueTextInputFormat
o key<TAB>value => <key, value>
● NLineInputFormat
o <offset, line>, with N lines per split
You can define your own InputFormat class ...
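The <offset, line> records listed for TextInputFormat can be illustrated in plain Java. `LineRecordSketch` is a hypothetical class that only imitates the record shape; it assumes UTF-8 text with "\n" line endings and does not use Hadoop's own record reader.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: produces <offset, line> records.
class LineRecordSketch {
    // Key: byte offset where the line starts. Value: the line without "\n".
    static List<Map.Entry<Long, String>> records(String text) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0;
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end < 0) {
                end = text.length(); // last line has no trailing newline
            }
            String line = text.substring(start, end);
            out.add(new SimpleEntry<>(Long.valueOf(offset), line));
            // Advance past the line's bytes plus the newline character.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> record
                : records("This is the foo file\nAnd this is the bar one\n")) {
            System.out.println(record.getKey() + "\t" + record.getValue());
        }
    }
}
```

This is why the WordCount mapper takes a LongWritable key and a Text value: the key is the offset and the value is the line.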
OutputFormat
1. The output has no splits.
2. Each reducer generates an output file named part-nnnnn, where nnnnn is the partition ID of the reducer.
Predefined OutputFormat classes:
● TextOutputFormat
o <k, v> => k<TAB>v
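The naming and line layout described for the output can be sketched as two small helpers. `OutputSketch` and its method names are hypothetical and not part of Hadoop; the sketch assumes the five-digit part-nnnnn pattern given above (some Hadoop versions add a prefix such as part-r-) and a tab between key and value.

```java
// Hypothetical helper, not part of Hadoop: mimics the output naming and layout.
class OutputSketch {
    // One output file per reducer, named after its zero-padded partition ID.
    static String partFileName(int partitionId) {
        return String.format("part-%05d", partitionId);
    }

    // Text-style record: the key, a tab, then the value.
    static String formatRecord(String key, int value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0));        // part-00000
        System.out.println(formatRecord("the", 2)); // the, a tab, then 2
    }
}
```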
Recap
END OF SESSION #1
Q

Introduction to MapReduce and Hadoop

  • 1.
  • 2. Expected … what to be said! ● History. ● What is Hadoop. ● Hadoop vs SQl. ● MapReduce. ● Hadoop Building Blocks. ● Installing, Configuring and Running Hadoop. ● Anatomy of MapReduce program.
  • 4.
  • 5. How hadoop was born? Doug Cutting
  • 6. Challenges of Distributed Processing of Large Data ● How to distribute the work? ● How to store and distribute the data itself? ● How to overcome failures? ● How to balance the load? ● How to deal with unstructured data? ● ...
  • 8. What is Hadoop? Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Key distinctions of Hadoop: ● Accessible ● Robust ● Scalable ● Simple
  • 9. Hadoop vs SQL ● Structured and Unstructured data. ● Datastore and Data Analysis. ● Scale-out and Scale-up. ● Offline batch processing and Online transactions.
  • 11. ● Parallel programming model for clusters of commodity machines. ● MapReduce provides: o Automatic parallelization & distribution. o Fault tolerance. o Locality of data. What is MapReduce?
  • 12. MapReduce … Map then Reduce
  • 13. Keys and Values ● Key/Value pairs. ● Keys divide Reduce Space. Input Output Map <k1, v1> list(<k2, v2>) Reduce <k2, list(v2)> list(<k3, v3>)
  • 14. WordCount in Action Input: foo.txt: “This is the foo file” bar.txt: “And this is the bar one” 1 is 1 the 1 foo 1 file 1 and 1 this 1 is 1 the 1 Reduce#2: Input: Output: is, [1, 1] is, 2 Reduce#1: Input: Output: this, [1, 1] this, 2 Reduce#3: Input: Output: foo, [1] foo, 1. . Final output: this 2 is 2 the 2 foo 1 file 1 and 1 bar 1 one 1
  • 15. WordCount with MapReduce map(String filename, String document) { List<String> T = tokenize(document); for each token in T { emit ((String)token, (Integer) 1); } } reduce(String token, List<Integer> values) { Integer sum = 0; for each value in values { sum = sum + value; } emit ((String)token, (Integer) sum); }
  • 16. Hadoop Building Blocks How does hadoop work?...
  • 17. Hadoop Building Blocks 1. NameNode 2. DataNode 3. Secondary NameNode 4. JobTracker 5. TaskTracker
  • 18. HDFS: NameNode and DataNodes
  • 21. Running Hadoop Three modes to run Hadoop: 1. Local (standalone) mode. 2. Pseudo-distributed mode “cluster of one” . 3. Fully distributed mode.
  • 22. An Action Running Hadoop on Local Machine
  • 23. Actions ... 1. Installing Hadoop. 2. Configuring Hadoop (Pseudo-distributed mode). 3. Running WordCount example. 4. Web-based cluster UI.
  • 24. HDFS 1. HDFS is a filesystem designed for large-scale distributed data processing. 2. HDFS isn’t a native Unix filesystem. Basic File Commands: $ hadoop fs -cmd <args> $ hadoop fs –ls $ hadoop fs –mkdir /user/chuck $ hadoop fs -copyFromLocal
  • 25. Anatomy of a MapReduce program MapReduce and beyond
  • 26. Hadoop 1. Data Types 2. Mapper 3. Reducer 4. Partitioner 5. Combiner 6. Reading and Writing a. InputFormat b. OutputFormat
  • 27. Anatomy of a MapReduce program
  • 28. Hadoop Data Types ● Hadoop defines its own way of serializing key/value pairs. ● Values should implement the Writable interface. ● Keys should implement the WritableComparable interface. ● Some predefined classes: o BooleanWritable. o ByteWritable. o IntWritable. o ...
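What a custom key type has to provide can be sketched without Hadoop on the classpath. Hadoop's `Writable` interface declares `write(DataOutput)` and `readFields(DataInput)`, and `WritableComparable` adds `compareTo`; the class below mirrors those three methods using only `java.io`. The class `WordLength`, its fields, and the `roundTrip` helper are hypothetical names for this illustration; in a real job the class would declare `implements WritableComparable<WordLength>`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical key type mirroring the Writable/WritableComparable contract in plain Java.
final class WordLength implements Comparable<WordLength> {
    private String word = "";
    private int length;

    WordLength() { }                      // no-arg constructor, needed before readFields

    WordLength(String word) {
        this.word = word;
        this.length = word.length();
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(length);
    }

    // Deserialize the fields in the same order they were written.
    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        length = in.readInt();
    }

    // Keys must be comparable so they can be sorted: shorter words first, then alphabetically.
    @Override
    public int compareTo(WordLength other) {
        int byLength = Integer.compare(length, other.length);
        return byLength != 0 ? byLength : word.compareTo(other.word);
    }

    String word() { return word; }
    int length() { return length; }

    // Writes the object to bytes and reads it back into a fresh instance.
    static WordLength roundTrip(WordLength original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WordLength copy = new WordLength();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordLength copy = roundTrip(new WordLength("hadoop"));
        System.out.println(copy.word() + " " + copy.length());
    }
}
```

The read order must match the write order exactly, since the serialized form carries no field names.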
  • 30. Mapper 1. Extends Mapper<K1,V1,K2,V2>. 2. Overrides method: void map(K1 key, V1 value, Context context). 3. Uses context.write(K2, V2) to emit key/value pairs.
  • 31. WordCount Mapper public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); @Override public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } }
  • 34. Reducer 1. Extends Reducer<K2,V2,K3,V3>. 2. Overrides method: void reduce(K2 key, Iterable<V2> values, Context context). 3. Uses context.write(K3, V3) to emit key/value pairs.
  • 35. WordCount Reducer public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { @Override public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } }
  • 38. Partitioner The partitioner decides which key goes to which reducer. class WordSizePartitioner extends Partitioner<Text, IntWritable> { @Override public int getPartition(Text word, IntWritable count, int numOfPartitions) { return 0; /* placeholder: sends every key to partition 0 */ } }
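Two possible partitioning rules can be sketched in plain Java. The first uses the same formula as Hadoop's default `HashPartitioner`, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The second is only a guess at what a class named `WordSizePartitioner` might be meant to do (partition by word length); the slide's own body just returns 0. The class `PartitionSketch` and both method names are hypothetical.

```java
// Hypothetical plain-Java sketch of partitioning rules; not a Hadoop Partitioner subclass.
final class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner. The bit mask clears the
    // sign bit so a negative hashCode can never produce a negative partition index.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Assumed rule for a "word size" partitioner: words of equal length
    // land in the same partition.
    static int wordSizePartition(String word, int numPartitions) {
        return word.length() % numPartitions;
    }

    public static void main(String[] args) {
        String[] words = { "this", "is", "the", "foo", "file" };
        for (String word : words) {
            System.out.println(word + " -> hash partition " + hashPartition(word, 3)
                    + ", size partition " + wordSizePartition(word, 3));
        }
    }
}
```

Whatever rule is used, it must be deterministic and return a value in the range 0 to numPartitions - 1, so that all values for a given key reach the same reducer.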
  • 40. Combiner It’s a local Reduce task at the Mapper. WordCount Mapper Output: 1. Without Combiner: <the, 1>, <file, 1>, <the, 1>, … 2. With Combiner: <the, 2>, <file, 2>, ...
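The effect of a combiner on one mapper's output can be sketched in plain Java. This is an illustration under assumptions, not Hadoop code: the class `CombinerSketch` and its `combine` method are hypothetical, and the input holds only the three pairs that are spelled out above (the slide's list is elided, so its real totals may be larger).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: local aggregation of one mapper's <word, 1> pairs.
final class CombinerSketch {

    // Each word in the list stands for one emitted <word, 1> pair.
    // The result holds one <word, localCount> pair per distinct word.
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : emittedWords) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // <the, 1>, <file, 1>, <the, 1> becomes <the, 2>, <file, 1>
        System.out.println(combine(Arrays.asList("the", "file", "the"))); // {the=2, file=1}
    }
}
```

Three pairs shrink to two before anything crosses the network. This works for WordCount because addition is associative and commutative, so summing partial sums gives the same final counts.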
  • 42. Reading and Writing 1. Input data usually resides in large files. 2. MapReduce’s processing power comes from splitting the input data into chunks (InputSplit) that can be processed in parallel. 3. Hadoop’s FileSystem provides the class FSDataInputStream for file reading. It extends DataInputStream with random read access.
  • 43. InputFormat Classes ● TextInputFormat o <offset, line> ● KeyValueTextInputFormat o key\tvalue => <key, value> ● NLineInputFormat o <offset, nLines> You can define your own InputFormat class ...
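The key/value split that KeyValueTextInputFormat performs on each line can be sketched in plain Java: the line is cut at the first separator (a tab by default), and a line without a separator becomes the key with an empty value. The class `KeyValueLineSketch` and its `split` method are hypothetical names for this illustration; Hadoop's own implementation works on bytes rather than on `String`.

```java
// Hypothetical sketch of per-line key/value splitting at the first tab.
final class KeyValueLineSketch {

    // Returns { key, value }. Only the first tab separates key from value;
    // any later tabs stay inside the value.
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" }; // no separator: whole line is the key
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] pair = split("foo\t1");
        System.out.println("<" + pair[0] + ", " + pair[1] + ">"); // <foo, 1>
    }
}
```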
  • 44. OutputFormat 1. The output has no splits. 2. Each reducer generates an output file named part-nnnnn, where nnnnn is the partition ID of the reducer. Predefined OutputFormat classes: ● TextOutputFormat o <k, v> => k\tv
  • 45. Recap
  • 47. Q

Editor's Notes

  1. https://sites.google.com/site/hadoopintroduction/home/what-is-hadoop
  2. Lucene is a full-featured text indexing and search library. Nutch set out to build a complete web search engine on top of Lucene, adding a web crawler, an HTML parser and so on. Problem: there are billions of web pages out there! What can poor Nutch do? Google published its GFS paper in 2003 and its MapReduce paper in 2004, saying that it used these techniques in its search engine. Doug Cutting and his team applied these techniques to Nutch, and Hadoop was born.
  3. Challenges in processing Large Data in a distributed way.
  4. Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2). Robust: because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures. Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster. Simple: Hadoop allows users to quickly write efficient parallel code. Source: Hadoop in Action, section 1.2.
  5. REF: https://sites.google.com/site/hadoopintroduction/home/comparing-sql-databases-and-hadoop
  6. REF: https://developer.yahoo.com/hadoop/tutorial/module4.html
  7. Table from “Hadoop In Action” Images source: https://developer.yahoo.com/hadoop/tutorial/module4.html
  8. Pseudo-code for map and reduce functions for word counting Source: Hadoop In Action
  9. We now have a general overview of MapReduce; let’s see how Hadoop works.
  10. Hadoop In Action Figure 2.1
  11. Local (standalone) mode: no HDFS and no Hadoop daemons; used for debugging and testing the logic of a MapReduce program. Pseudo-distributed mode: all daemons run on a single machine; used for debugging your code, allowing you to examine memory usage, HDFS input/output issues, and other daemon interactions. Fully distributed mode.
  12. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
  13. This slide is initially left blank.
  14. https://developer.yahoo.com/hadoop/tutorial/module4.html
  15. This slide is initially left blank.
  16. When the reducer task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key.