 What Hadoop is, what Hadoop is not, and Hadoop's assumptions
 What are Racks, Clusters, Nodes, and Commodity Hardware?
 HDFS - Hadoop Distributed File System
 Using HDFS commands
 MapReduce
 Higher-level languages over Hadoop: Pig and Hive
 HBase – Overview
 HCatalog
 What is Hadoop, and what are its components?
 What is commodity server hardware?
 Why HDFS?
 What is the responsibility of the NameNode in HDFS?
 What is fault tolerance?
 What is the default replication factor in HDFS?
 What is the heartbeat in HDFS?
 What are the JobTracker and TaskTracker?
 Why the MapReduce programming model?
 Where do we have data locality in MapReduce?
 Why do we need to use Pig and Hive?
 What is the difference between HBase and HCatalog?
• At Google:
• Index building for Google Search
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!:
• Index building for Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook:
• Data mining
• Ad optimization
• Spam detection
The MapReduce algorithm contains two important tasks, the Map task and the Reduce task:
• The Map task converts the input data set into intermediate key-value pairs. (The Map task always runs first.)
• The Reduce task combines those key-value pairs into a smaller, aggregated set.
Worked example:
Map input (set of data): "The quick", "Brown fox", "The fox ate"
Map output (key-value pairs, one per word): (The, 1) (quick, 1) (Brown, 1) (Fox, 1) (The, 1) (Fox, 1) (Ate, 1)
Reduce input (the same pairs, sorted by key): (Ate, 1) (Brown, 1) (Fox, 1) (Fox, 1) (quick, 1) (The, 1) (The, 1)
Reduce output (combined counts): (Ate, 1) (Brown, 1) (Fox, 2) (quick, 1) (The, 2)
• Data type: key-value records
• Map function: (K_in, V_in) → list of (K_inter, V_inter)
• Reduce function: (K_inter, list of V_inter) → list of (K_out, V_out)
def mapper(line):
    for word in line.split():
        output(word, 1)      # output() is provided by the framework (pseudocode)

def reducer(key, values):
    output(key, sum(values))
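To see the model end to end, here is a minimal local harness (plain Python, not Hadoop) that runs this mapper over a few lines, performs the shuffle-and-sort grouping by hand, and feeds each group to the reducer; the framework's output() is rewritten as yield so the functions are plain generators:

from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)          # emit one (word, 1) pair per word

def reducer(key, values):
    yield (key, sum(values))     # emit the total count for this key

lines = ["the quick brown fox", "the fox ate the mouse"]

# Stand-in for Hadoop's shuffle & sort: group all values by key.
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

for key in sorted(groups):
    for out_key, total in reducer(key, groups[key]):
        print(out_key, total)    # e.g. "ate 1", "brown 1", ..., "the 3"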
Input → Map → Shuffle & Sort → Reduce → Output
Input (one line per mapper):
"the quick brown fox"
"the fox ate the mouse"
"how now brown cow"
Map outputs:
Map 1: (the, 1) (quick, 1) (brown, 1) (fox, 1)
Map 2: (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)
Shuffle & Sort routes all pairs for a given key to the same reducer:
Reduce 1 output: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
Reduce 2 output: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
• Single master controls job execution on multiple slaves
• Mappers preferentially placed on same node or same rack as their
input block
• Minimizes network usage
• Mappers save outputs to local disk before serving them to
reducers
• Allows recovery if a reducer crashes
• Allows having more reducers than nodes
• A combiner is a local aggregation function for repeated keys
produced by the same map task
• Works for associative functions like sum, count, max
• Decreases size of intermediate data
• Example: map-side aggregation for Word Count:
def combiner(key, values):
    output(key, sum(values))
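A quick local sketch (plain Python, hypothetical data) of what the combiner does to a single mapper's output before it crosses the network:

from collections import defaultdict

# One mapper's raw output for the line "the quick the fox the":
map_output = [("the", 1), ("quick", 1), ("the", 1), ("fox", 1), ("the", 1)]

# The combiner pre-sums repeated keys locally; because sum is associative,
# the reducer still computes the same final totals.
combined = defaultdict(int)
for key, value in map_output:
    combined[key] += value

print(sorted(combined.items()))
# [('fox', 1), ('quick', 1), ('the', 3)] -- 3 pairs instead of 5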
Input → Map & Combine → Shuffle & Sort → Reduce → Output
Input (one line per mapper):
"the quick brown fox"
"the fox ate the mouse"
"how now brown cow"
Map & Combine outputs (Map 2 now emits (the, 2) instead of two separate (the, 1) pairs):
Map 1: (the, 1) (quick, 1) (brown, 1) (fox, 1)
Map 2: (the, 2) (fox, 1) (ate, 1) (mouse, 1)
Map 3: (how, 1) (now, 1) (brown, 1) (cow, 1)
Reduce outputs (unchanged):
Reduce 1: (brown, 2) (fox, 2) (how, 1) (now, 1) (the, 3)
Reduce 2: (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
Input Phase − Here we have a Record Reader that
translates each record in an input file and sends the
parsed data to the mapper in the form of key-value pairs.
Map Phase − Map is a user-defined function, which takes
a series of key-value pairs and processes each one of them
to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by
the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that
groups similar data from the map phase into identifiable
sets. It takes the intermediate keys from the mapper as
input and applies a user-defined code to aggregate the
values in a small scope of one mapper. It is not a part of
the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the
Shuffle and Sort step. It downloads the grouped key-value
pairs onto the local machine, where the Reducer is
running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent
keys together so that their values can be iterated easily in
the Reducer task.
Reducer − The Reducer takes the grouped key-value
paired data as input and runs a Reducer function on each
one of them. Here, the data can be aggregated, filtered,
and combined in a number of ways; this step can involve a
wide range of processing. Once the execution is over, it
gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output
formatter that translates the final key-value pairs from
the Reducer function and writes them onto a file using a
record writer.
Word Count in Java

// Uses the classic org.apache.hadoop.mapred API.
public class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable ONE = new IntWritable(1);

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.collect(new Text(itr.nextToken()), ONE);
    }
  }
}
public class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // add up all counts for this word
    }
    out.collect(key, new IntWritable(sum));
  }
}
Word Count in Java (driver)
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);    // the reducer doubles as a combiner
  conf.setReducerClass(ReduceClass.class);

  FileInputFormat.setInputPaths(conf, args[0]);
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  conf.setOutputKeyClass(Text.class);          // output keys are words (strings)
  conf.setOutputValueClass(IntWritable.class); // output values are counts

  JobClient.runJob(conf);
}
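Assuming the classes above are packaged into a jar (the name wordcount.jar and the paths below are illustrative), the job would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths.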
# mapper.py -- Hadoop Streaming mapper: emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.split():
        print(word.lower() + "\t" + "1")

# reducer.py -- Hadoop Streaming reducer: sum the counts for each word
import sys

counts = {}
for line in sys.stdin:
    word, count = line.split("\t")
    counts[word] = counts.get(word, 0) + int(count)

for word, count in counts.items():
    print(word + "\t" + str(count))
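Because Hadoop Streaming simply pipes data through stdin/stdout, the two scripts above can be tested locally without a cluster, with the Unix sort standing in for the shuffle-and-sort phase: cat input.txt | python mapper.py | sort | python reducer.py (file names are illustrative).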
A real-world example shows the power of MapReduce:
Twitter receives around 500 million tweets per day, which is
nearly 3,000 tweets per second, and it manages this stream
of tweets with the help of MapReduce.
 Many parallel algorithms can be expressed by a
series of MapReduce jobs
 But MapReduce is fairly low-level: you must think
about keys, values, partitioning, etc.
 Can we capture common “job building blocks”?
 Started at Yahoo! Research
 Runs about 30% of Yahoo!’s jobs
 Features:
• Expresses sequences of MapReduce jobs
• Data model: nested “bags” of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc)
• Easy to plug in Java functions
• Pig Pen development environment for Eclipse
• Suppose you have user data in one file and page-view data in another, and you need to find the top 5 most-visited pages by users aged 18-25.
The dataflow:
Load Users, Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5
In MapReduce, the same pipeline takes pages of low-level Java code.
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
    COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
Notice how naturally the components of the job translate into Pig Latin: each step of the dataflow (Load Users, Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5) corresponds directly to one statement (Users = load …, Pages = load …, Filtered = filter …, Joined = join …, Grouped = group …, Summed = … COUNT() …, Sorted = order …, Top5 = limit …), and Pig compiles the whole pipeline into a short series of MapReduce jobs (Job 1 through Job 3 in the original diagram).
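For intuition, here is a rough in-memory Python equivalent of the same dataflow, using a few hypothetical records; it mirrors the filter, join, group, count, order, and limit steps that Pig compiles into MapReduce jobs:

from collections import Counter

# Hypothetical sample data: 'users' as (name, age), 'pages' as (user, url).
users = [("alice", 21), ("bob", 17), ("carol", 25), ("dave", 30)]
pages = [("alice", "/home"), ("alice", "/news"),
         ("carol", "/home"), ("dave", "/home")]

filtered = {name for name, age in users if 18 <= age <= 25}  # filter by age
joined = [url for user, url in pages if user in filtered]    # join on name
clicks = Counter(joined)                                     # group on url + count
top5 = clicks.most_common(5)                                 # order by clicks, take top 5
print(top5)   # [('/home', 2), ('/news', 1)]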
 Developed at Facebook
 Used for majority of Facebook jobs
 “Relational database” built on Hadoop
 Maintains list of table schemas
 SQL-like query language (HQL)
 Can call Hadoop Streaming scripts from HQL
 Supports table partitioning, clustering, complex
data types, some optimizations
• Find the top 5 pages visited by users aged 18-25.
• Filter page views through a Python script (a sketch of such a script follows below).
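A minimal sketch of such a filter script, which Hive can invoke through its streaming/TRANSFORM mechanism; the tab-separated column layout userid, url, age is an assumption for illustration:

# filter_age.py -- hypothetical script Hive could stream page views through.
# Assumes each input row is: userid \t url \t age
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        continue                    # skip malformed rows
    userid, url, age = fields
    if 18 <= int(age) <= 25:
        print(userid + "\t" + url)  # keep only users aged 18-25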
Limitations of Hadoop
Hadoop can perform only batch processing, and data is accessed
only in a sequential manner. That means one has to scan the entire
dataset even for the simplest of jobs. A new solution is needed to
access any point of data in a single unit of time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop File System (HDFS).
Logical layout of an HBase table: each row, addressed by a Rowid (1, 2, 3, …), spans several column families, and each column family contains its own columns (col1, col2, col3).
Features of HBase
• HBase is linearly scalable.
• It has automatic failover support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
Where to Use HBase
• Apache HBase is used for random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable runs on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo!, and Adobe use HBase internally.
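As a sketch of this random read/write access from Python, using the third-party happybase client; the running Thrift gateway on localhost:9090 and the pre-created 'users' table with a 'cf' column family are assumptions:

import happybase   # third-party HBase client: pip install happybase

# Assumes HBase's Thrift gateway is running and the table already exists.
connection = happybase.Connection("localhost", port=9090)
table = connection.table("users")

# Random write: store one row under a chosen row key.
table.put(b"row1", {b"cf:name": b"alice", b"cf:age": b"21"})

# Random read: fetch that row directly by key -- no full-table scan.
row = table.row(b"row1")
print(row[b"cf:name"])   # b'alice'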
HBase vs. RDBMS
• HBase is schema-less; it has no fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
• HBase is built for wide tables and is horizontally scalable. An RDBMS is thin and built for small tables, and it is hard to scale.
• HBase has no transactions. An RDBMS is transactional.
• HBase holds de-normalized data. An RDBMS holds normalized data.
• HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
HCatalog provides a relational table
abstraction layer over HDFS. Using the
HCatalog abstraction layer allows query tools
such as Pig and Hive to treat the data in a
familiar relational architecture. It also permits
easier exchange of data between HDFS
storage and the client tools used to present the data
for analysis, using familiar data-exchange
application programming interfaces (APIs) such
as Java Database Connectivity (JDBC) and
Open Database Connectivity (ODBC).