WELCOME TO BIG DATA TRAINING
Abhishek Mukherjee
Utkarsh Srivastava
12th September
"Not everything that can be counted counts, and not everything that counts can be counted."
What are we going to cover today?
- Uses of Big Data
- What is Hadoop?
- Short intro to the HDFS architecture
- What is MapReduce?
- The components of the MapReduce algorithm
- The "Hello, world" of MapReduce, i.e. the word count algorithm
- Tips and tricks for MapReduce
What is Big Data?
- Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
- Lots of data (terabytes, petabytes or zettabytes).
- Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
- An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
WHY BIG DATA?
Serial vs parallel processing
Walmart has exhaustive customer data on close to 145 million Americans, of which 60% is data on U.S. adults. Walmart tracks and targets every consumer individually.
Walmart observed a significant 10% to 15% increase in online sales, worth $1 billion in incremental revenue.
Differentiating Factors:
- Accessible
- Robust
- Scalable
- Simple
Father of Hadoop?
HADOOP ECOSYSTEM
HDFS ARCHITECTURE
HDFS ARCHITECTURE CONTD.
MapReduce Algorithm
Key points
- Map Phase
- Combiner Phase (Optional)
- Sort Phase
- Shuffle Phase
- Partition Phase (Optional)
- Reducer Phase
Map Phase
Imagine this as the input file:
- Hello my name is abhishek Hello my name is utsav
- Hello my passion is cricket
This file has 2 lines. Each line has its own byte offset in the file, which serves as the key passed to the mapper; the value passed to the mapper is the text of that line.
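The byte-offset keys described on this slide can be sketched in plain Python. This is only an illustration of the idea, not the Hadoop API: the names INPUT_FILE, read_records and word_count_mapper are made up for this example, and the file is assumed to be UTF-8 text with "\n" line endings.

```python
# Illustrative sketch of the map phase for word count (plain Python, not Hadoop).
INPUT_FILE = (
    "Hello my name is abhishek Hello my name is utsav\n"
    "Hello my passion is cricket\n"
)


def read_records(text):
    """Yield one (byte_offset, line) record per line of the input text."""
    offset = 0
    for raw in text.encode("utf-8").splitlines(keepends=True):
        yield offset, raw.decode("utf-8").rstrip("\n")
        offset += len(raw)  # the next line starts where this one ends


def word_count_mapper(offset, line):
    """Ignore the offset key and emit a (word, 1) pair for every word."""
    for word in line.split():
        yield word, 1


records = list(read_records(INPUT_FILE))
mapped = [
    pair
    for offset, line in records
    for pair in word_count_mapper(offset, line)
]

for offset, line in records:
    print(offset, line)
print(len(mapped), "pairs emitted by the mapper")
```

The first line is 48 characters plus a newline, so the two records carry the keys 0 and 49, and the mapper emits 15 (word, 1) pairs in total.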
Operation on output of map phase
Output of the map phase:
Hello 1
my 1
name 1
is 1
abhishek 1
Hello 1
my 1
name 1
is 1
utsav 1
Hello 1
my 1
passion 1
is 1
cricket 1
Key(tuple of values):
Hello(1,1,1)
my(1,1,1)
name(1,1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)
Explanation of sort and shuffle phase
The key points are as follows:
- Sort the key-value pairs by key.
- Shuffle the mapped output so that values with the same key are brought together into a tuple of values for that key.
- This output is fed to the reducer, which in turn reduces each tuple to a single value computed from the list of values in the tuple.
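As a rough illustration of sorting and shuffling, the sketch below groups the word count map output by key in plain Python. It is not how Hadoop implements the phase across machines; the function name sort_and_shuffle is an assumption made for this example.

```python
from itertools import groupby
from operator import itemgetter

LINES = [
    "Hello my name is abhishek Hello my name is utsav",
    "Hello my passion is cricket",
]

# Map output: one (word, 1) pair per word in the input file.
mapped = [(word, 1) for line in LINES for word in line.split()]


def sort_and_shuffle(pairs):
    """Sort pairs by key, then gather the values of equal keys into a tuple."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, tuple(value for _, value in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


grouped = sort_and_shuffle(mapped)
for key, values in grouped:
    print(key, values)
```

Python compares strings case-sensitively, so "Hello" sorts ahead of the lowercase words in this sketch.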
Reducer phase
Key(tuple of values):
Hello(1,1,1)
my(1,1,1)
name(1,1,1)
is(1,1,1)
abhishek(1)
utsav(1)
passion(1)
cricket(1)
Key(single value):
abhishek(1)
cricket(1)
Hello(3)
is(3)
my(3)
name(3)
passion(1)
utsav(1)
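The reduction from a tuple of values to a single value can be sketched as a sum per key. This is a plain Python illustration with a hypothetical word_count_reducer function; the case-insensitive ordering is an assumption chosen only so that the result matches the order shown on the slide.

```python
# Grouped input to the reducer, as key -> tuple of values.
grouped = {
    "Hello": (1, 1, 1),
    "my": (1, 1, 1),
    "name": (1, 1, 1),
    "is": (1, 1, 1),
    "abhishek": (1,),
    "utsav": (1,),
    "passion": (1,),
    "cricket": (1,),
}


def word_count_reducer(key, values):
    """Collapse the tuple of counts for one key into a single total."""
    return key, sum(values)


# Assumption: order keys case-insensitively to mirror the slide's output.
reduced = [
    word_count_reducer(key, values)
    for key, values in sorted(grouped.items(), key=lambda kv: kv[0].lower())
]

for key, total in reduced:
    print(f"{key}({total})")
```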
BASIC HADOOP COMMANDS
- sudo su : switches to a temporary superuser (root) shell.
- hadoop fs -ls /
- hadoop fs -mkdir /mycreatedfolderinhdfs
- hadoop fs -put /usr/directoryinlocal /user/root/directoryinhdfs
- hadoop fs -get /user/root/mycreatedfolderinhdfs /usr/folderinlocal
- hadoop fs -rm -r /mycreatedfolderinhdfs
- hadoop jar com.bigdata.session.hadoop.tool.jar {sourcepath} {destinationpath}
Types of splits (parallel processing in action):
- Two types of splitting of input files are possible:
- HDFS split: the file is split into blocks of a fixed size, e.g. blocks of 64 MB, to promote parallel processing.
- N line split: the file is split into chunks of a fixed number of lines to promote parallel processing.
- Let's see an example in the next slide.
N LINE SPLITTING:
Consider this as the input file:
Map reduce is a framework based on processing of
data in parallel. This algorithm consists of three phases
namely map, shuffle and sort, reduce. Here we will
observe the effect of n line splitter on the number of
map tasks i.e. the number of mappers created. This will
create a better understanding of how a file splits.
Can you guess what will happen?
N LINE SPLITTING contd.
Assume the value of n as 3.
Split 1:
Map reduce is a framework based on processing of
data in parallel. This algorithm consists of three phases
namely map, shuffle and sort, reduce. Here we will
Split 2:
observe the effect of n line splitter on the number of
map tasks i.e. the number of mappers created. This will
create a better understanding of how a file splits.
So these two splits of the file will be sent to two different mappers, whereas in the case of an HDFS split the amount of data sent to each mapper depends on the size of the respective split.
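The effect of n on the number of mappers can be checked with a small plain Python sketch. The helper n_line_splits is hypothetical and only mimics the grouping of lines; it assumes one mapper is created per split, as the slide describes.

```python
INPUT_LINES = [
    "Map reduce is a framework based on processing of",
    "data in parallel. This algorithm consists of three phases",
    "namely map, shuffle and sort, reduce. Here we will",
    "observe the effect of n line splitter on the number of",
    "map tasks i.e. the number of mappers created. This will",
    "create a better understanding of how a file splits.",
]


def n_line_splits(lines, n):
    """Group lines into consecutive splits of at most n lines each."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]


splits = n_line_splits(INPUT_LINES, 3)
print(len(splits), "splits, so", len(splits), "mappers")
for number, split in enumerate(splits, start=1):
    print(f"Split {number}: {len(split)} lines")
```

With the 6-line file and n = 3 this gives 2 splits; trying n = 1 or n = 4 shows how the mapper count changes with n.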
Data Types available in MapReduce
- Hadoop uses its own serialization format, Writables, which is compact and fast. Data needs to be serialized to be sent over the network.
Thus we see that these serialized data types are equivalents of Java data types.
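To give a feel for why a binary format is compact, the sketch below packs an integer into 4 big-endian bytes, which is the layout Java's DataOutput.writeInt produces. It is a plain Python analogy rather than Hadoop's Writable classes, and the helper names serialize_int and deserialize_int are made up for this example.

```python
import struct


def serialize_int(value):
    """Pack a 32-bit signed integer into 4 big-endian bytes."""
    return struct.pack(">i", value)


def deserialize_int(data):
    """Unpack 4 big-endian bytes back into a 32-bit signed integer."""
    return struct.unpack(">i", data)[0]


count = 1000000
binary = serialize_int(count)
as_text = str(count).encode("utf-8")

print(len(binary), "bytes in binary form")
print(len(as_text), "bytes as decimal text")
print(deserialize_int(binary) == count)
```

The binary form is always 4 bytes, whatever the value, while the text form grows with the number of digits.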
Tips for optimizing MapReduce code:
- Combiner optimization
- Partitioner optimization
- Custom Writables
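The first two tips can be illustrated with a plain Python sketch under stated assumptions. The combine function pre-aggregates one mapper's output so that fewer pairs travel over the network, and partition decides which reducer receives a key. Both names are hypothetical, and CRC32 is only a stand-in for whatever hash a real partitioner uses.

```python
import zlib
from collections import Counter


def combine(mapper_output):
    """Pre-aggregate one mapper's (word, 1) pairs into (word, partial_count)."""
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return list(totals.items())


def partition(key, num_reducers):
    """Pick a reducer index from a stable hash of the key (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


line = "Hello my name is abhishek Hello my name is utsav"
mapper_output = [(word, 1) for word in line.split()]
combined = combine(mapper_output)

print(len(mapper_output), "pairs before the combiner")
print(len(combined), "pairs after the combiner")
for word, partial in combined:
    print(word, partial, "-> reducer", partition(word, 2))
```

For this one line the combiner shrinks 10 pairs to 6, and every occurrence of a given key is routed to the same reducer because the partition depends only on the key.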
ANY QUERIES?
Abhishek Mukherjee Utkarsh Srivastava
scobbyabhi9@gmail.com utkarshsrivastava538@gmail.com
No. 9629341857 No. 9629341221
CONTACT DETAILS
