SlideShare a Scribd company logo
Hadoop: A Software Framework for Data
Intensive Computing Applications
Rajinder Sandhu
Researcher in Cloud Computing and Virtualization
Technologies
errajindersandhu.blogspot.com
What is Hadoop?
 Software platform that lets one easily write and run applications that

process vast amounts of data. It includes:
– MapReduce – offline computing engine
– HDFS – Hadoop distributed file system
– HBase (pre-alpha) – online data access
 Yahoo! is the biggest contributor
 Here's what makes it especially useful:
 Scalable: It can reliably store and process petabytes.
 Economical: It distributes the data and processing across clusters of

commonly available computers (in thousands).
 Efficient: By distributing the data, it can process it in parallel on the
nodes where the data is located.
 Reliable: It automatically maintains multiple copies of data and
automatically redeploys computing tasks based on failures.
2
Hadoop: Assumptions
It is written with large clusters of computers in mind and is built around the
following assumptions:
 Hardware will fail.
 Processing will be run in batches. Thus there is an emphasis on high
throughput as opposed to low latency.
 Applications that run on HDFS have large data sets. A typical file in HDFS is
gigabytes to terabytes in size.
 It should provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster. It should support tens of millions of files in a single
instance.
 Applications need a write-once-read-many access model.
 Moving Computation is Cheaper than Moving Data.
 Portability is important.
3
Apache Hadoop Wins Terabyte Sort Benchmark (July
2013)
 One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 42 seconds, which

beat the previous record of 80 seconds in 2011 for the annual general purpose
(daytona) terabyte sort benchmark. The sort benchmark specifies the input data (10
billion 100 byte records), which must be completely sorted and written to disk.

4
Example Applications and Organizations using
Hadoop
 A9.com – Amazon: To build Amazon's product search indices; process millions of











5

sessions daily for analytics, using both the Java and streaming APIs; clusters vary
from 1 to 100 nodes.
Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest
cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for
Ad Systems and Web Search
AOL : Used for a variety of things ranging from statistics generation to running
advanced algorithms for doing behavioral analysis and targeting; cluster size is 50
machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB
hard-disk giving us a total of 37 TB HDFS capacity.
Facebook: To store copies of internal log and dimension data sources and use it as a
source for reporting/analytics and machine learning; 320 machine cluster with 2,560
cores and about 1.3 PB raw storage;
FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine
storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log
analysis, data mining and machine learning
University of Nebraska Lincoln: one medium-sized Hadoop cluster (200TB) to store
and serve physics data;
More Hadoop Applications
 Adknowledge - to build the recommender system for behavioral targeting, plus

other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
 Contextweb - to store ad serving log and use it as a source for Ad optimizations/

Analytics/reporting/machine learning; 23 machine cluster with 184 cores and about
35TB raw storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of
storage.
 Cornell University Web Lab: Generating web graphs on 100 nodes (dual 2.4GHz

Xeon Processor, 2 GB RAM, 72GB Hard Drive)
 NetSeer - Up to 1000 instances on Amazon EC2 ; Data storage in Amazon S3; Used

for crawling, processing, serving and log analysis
 The New York Times : Large scale image conversions ; EC2 to run hadoop on a

large virtual cluster
 Powerset / Microsoft - Natural Language Search; up to 400 instances on Amazon

EC2 ; data storage in Amazon S3

6
MapReduce Paradigm
 Programming model developed at Google
 Sort/merge based distributed computing
 Initially, it was intended for their internal search/indexing

application, but now used extensively by more organizations (e.g.,
Yahoo, Amazon.com, IBM, etc.)
 It is functional style programming that is naturally parallelizable across
a large cluster of workstations or PCS.
 The underlying system takes care of the partitioning of the
input data, scheduling the program’s execution across
several machines, handling machine failures, and managing
required inter-machine communication. (This is the key for
Hadoop’s success)
7
What does it do?
 Hadoop implements Google’s MapReduce, using HDFS
 MapReduce divides applications into many small blocks of work.
 HDFS creates multiple replicas of data blocks for reliability, placing them on

compute nodes around the cluster.
 MapReduce can then process the data where it is located.
 Hadoop ‘s target is to run on clusters of the order of 10,000-nodes.

8
How does MapReduce work?
 The run time partitions the input and provides it to different Map

instances;
 Map (key, value)  (key’, value’)
 The run time collects the (key’, value’) pairs and distributes them
to several Reduce functions so that each Reduce function gets the
pairs with the same key’.
 Each Reduce produces a single (or zero) file output.
 Map and Reduce are user written functions

9
Example MapReduce: To count the occurrences
of words in the given set of documents
map(String key, String value):
// key: document name; value: document contents; map (k1,v1)  list(k2,v2)
for each word w in value: EmitIntermediate(w, "1");
(Example: If input string is (“God is God. I am I”), Map produces {<“God”,1”>,
<“is”, 1>, <“God”, 1>, <“I”,1>, <“am”,1>,<“I”,1>}
reduce(String key, Iterator values):
// key: a word; values: a list of counts; reduce (k2,list(v2))  list(v2)
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
(Example: reduce(“I”, <1,1>)  2)
10
Word Count
see bob throw

see spot run

see
bob
throw
see
spot
run

1
1
1
1
1
1

bob
run
see
spot
throw

Can we do word count in parallel?

1
1
2
1
1
Automatic Parallel Execution in
MapReduce

Handles failures automatically, e.g., restarts tasks if a
node fails; runs multiples copies of the same task to
avoid a slow task slowing down the whole job
MapReduce in Hadoop
Data Flow in a MapReduce
Program in Hadoop
InputFormat
Map function
Partitioner
Sorting & Merging
Combiner
Shuffling
Merging
Reduce function
OutputFormat

 1:many
MapReduce-Fault tolerance
 Worker failure: The master pings every worker periodically. If no response

is received from a worker in a certain amount of time, the master marks the
worker as failed. Any map tasks completed by the worker are reset back to
their initial idle state, and therefore become eligible for scheduling on other workers.
Similarly, any map task or reduce task in progress on a failed worker is also
reset to idle and becomes eligible for rescheduling.
 Master Failure: It is easy to make the master write periodic
checkpoints of the master data structures described above. If the master
task dies, a new copy can be started from the last checkpointed state.
However, in most cases, the user restarts the job.

15
Mapping workers to Processors
 The input data (on HDFS) is stored on the local disks of the machines

in the cluster. HDFS divides each file into 64 MB blocks, and stores
several copies of each block (typically 3 copies) on different machines.
 The MapReduce master takes the location information of the input files

into account and attempts to schedule a map task on a machine that
contains a replica of the corresponding input data. Failing that, it
attempts to schedule a map task near a replica of that task's input data.
When running large MapReduce operations on a significant fraction of
the workers in a cluster, most input data is read locally and consumes
no network bandwidth.

16
Task Granularity
 The map phase has M pieces and the reduce phase has R pieces.
 M and R should be much larger than the number of worker machines.
 Having each worker perform many different tasks improves dynamic

load balancing, and also speeds up recovery when a worker fails.
 Larger the M and R, more the decisions the master must make
 R is often constrained by users because the output of each reduce task
ends up in a separate output file.
 Typically, (at Google), M = 200,000 and R = 5,000, using 2,000
worker machines.

17
HDFS
 The Hadoop Distributed File System (HDFS) is a distributed file system

designed to run on commodity hardware. It has many similarities with existing
distributed file systems. However, the differences from other distributed file
systems are significant.
 highly fault-tolerant and is designed to be deployed on low-cost hardware.
 provides high throughput access to application data and is suitable for
applications that have large data sets.
 relaxes a few POSIX requirements to enable streaming access to file system
data.
 part of the Apache Hadoop Core project. The project URL is
http://hadoop.apache.org/core/.

18
HDFS v1.0 Architecture

19
20
Hadoop v2.2 (YARN)

21
Execution overview
1. The MapReduce library in the user program first splits input files into M

pieces of typically 16 MB to 64 MB/piece. It then starts up many copies of
the program on a cluster of machines.
2. One of the copies of the program is the master. The rest are workers that are
assigned work by the master. There are M map tasks and R reduce tasks to
assign. The master picks idle workers and assigns each one a map task or a
reduce task.
3. A worker who is assigned a map task reads the contents of the assigned
input split. It parses key/value pairs out of the input data and passes each
pair to the user-defined Map function. The intermediate key/value pairs
produced by the Map function are buffered in memory.
4. The locations of these buffered pairs on the local disk are passed back to the
master, who forwards these locations to the reduce workers.

22
Execution overview (cont.)
5. When a reduce worker is notified by the master about these locations, it
uses RPC remote procedure calls to read the buffered data from the
local disks of the map workers. When a reduce worker has read all
intermediate data, it sorts it by the intermediate keys so that all
occurrences of the same key are grouped together.
6. The reduce worker iterates over the sorted intermediate data and for
each unique intermediate key encountered, it passes the key and the
corresponding set of intermediate values to the user's Reduce function.
The output of the Reduce function is appended to a final output file for
this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master
wakes up the user program---the MapReduce call in the user program
returns back to the user code. The output of the mapreduce execution is
available in the R output files (one per reduce task).
23
References
[1] J. Dean and S. Ghemawat, ``MapReduce: Simplied Data Processing
on Large Clusters,’’ OSDI 2004. (Google)
[2] D. Cutting and E. Baldeschwieler, ``Meet Hadoop,’’ OSCON,
Portland, OR, USA, 25 July 2007 (Yahoo!)
[3] R. E. Brayant, “Data Intensive Scalable computing: The case for
DISC,” Tech Report: CMU-CS-07-128,
http://www.cs.cmu.edu/~bryant

24

More Related Content

What's hot

Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
harithakannan
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
darugar
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
Varun Narang
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
Nagarjuna Kanamarlapudi
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Flavio Vit
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
Cloudera, Inc.
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
Atul Kushwaha
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
Romeo Kienzler
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
Nikita Sure
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
Shubham Parmar
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentals
its_skm
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
Rohit Agrawal
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
KrishnenduKrishh
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
sudhakara st
 
Hadoop
Hadoop Hadoop
Hadoop
Shamama Kamal
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 

What's hot (19)

Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Anju
AnjuAnju
Anju
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentals
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop
Hadoop Hadoop
Hadoop
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 

Viewers also liked

Scientific writing in Engineering and Technology
Scientific writing in Engineering and TechnologyScientific writing in Engineering and Technology
Scientific writing in Engineering and Technology
rajsandhu1989
 
Map, Reduce and Filter in Swift
Map, Reduce and Filter in SwiftMap, Reduce and Filter in Swift
Map, Reduce and Filter in Swift
Aleksandras Smirnovas
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Manuel Correa
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
Alexey Grigorev
 
Orienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsOrienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshots
Kalyan Hadoop
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
What is Comms Planning?
What is Comms Planning?What is Comms Planning?
What is Comms Planning?
Julian Cole
 

Viewers also liked (9)

Scientific writing in Engineering and Technology
Scientific writing in Engineering and TechnologyScientific writing in Engineering and Technology
Scientific writing in Engineering and Technology
 
Map, Reduce and Filter in Swift
Map, Reduce and Filter in SwiftMap, Reduce and Filter in Swift
Map, Reduce and Filter in Swift
 
Introduction to Jaql
Introduction to JaqlIntroduction to Jaql
Introduction to Jaql
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
Orienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshotsOrienit hadoop practical cluster setup screenshots
Orienit hadoop practical cluster setup screenshots
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
What is Comms Planning?
What is Comms Planning?What is Comms Planning?
What is Comms Planning?
 

Similar to Hadoop and Mapreduce Introduction

Hadoop online-training
Hadoop online-trainingHadoop online-training
Hadoop online-training
Geohedrick
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
Vibrant Technologies & Computers
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Cognizant
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
Aamir Ameen
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
Sathish24111
 
Big data
Big dataBig data
Big data
Abilash Mavila
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
David Tjahjono,MD,MBA(UK)
 
hadoop
hadoophadoop
hadoop
Deep Mehta
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
sreehari orienit
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
Xuan-Chao Huang
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
cscpconf
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
J S Jodha
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
NouhaElhaji1
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
Harika583
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
chunkypandey12
 

Similar to Hadoop and Mapreduce Introduction (20)

Hadoop online-training
Hadoop online-trainingHadoop online-training
Hadoop online-training
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Big data
Big dataBig data
Big data
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
hadoop
hadoophadoop
hadoop
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop
HadoopHadoop
Hadoop
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 

Recently uploaded

Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
CarlosHernanMontoyab2
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 

Recently uploaded (20)

Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 

Hadoop and Mapreduce Introduction

  • 1. Hadoop: A Software Framework for Data Intensive Computing Applications Rajinder Sandhu Researcher in Cloud Computing and Virtualization Technologies errajindersandhu.blogspot.com
  • 2. What is Hadoop?  Software platform that lets one easily write and run applications that process vast amounts of data. It includes: – MapReduce – offline computing engine – HDFS – Hadoop distributed file system – HBase (pre-alpha) – online data access  Yahoo! is the biggest contributor  Here's what makes it especially useful:  Scalable: It can reliably store and process petabytes.  Economical: It distributes the data and processing across clusters of commonly available computers (in thousands).  Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.  Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures. 2
  • 3. Hadoop: Assumptions It is written with large clusters of computers in mind and is built around the following assumptions:  Hardware will fail.  Processing will be run in batches. Thus there is an emphasis on high throughput as opposed to low latency.  Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.  It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.  Applications need a write-once-read-many access model.  Moving Computation is Cheaper than Moving Data.  Portability is important. 3
  • 4. Apache Hadoop Wins Terabyte Sort Benchmark (July 2013)  One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 42 seconds, which beat the previous record of 80 seconds in 2011 for the annual general purpose (daytona) terabyte sort benchmark. The sort benchmark specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk. 4
  • 5. Example Applications and Organizations using Hadoop  A9.com – Amazon: To build Amazon's product search indices; process millions of      5 sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes. Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search AOL : Used for a variety of things ranging from statistics generation to running advanced algorithms for doing behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk giving us a total of 37 TB HDFS capacity. Facebook: To store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning; 320 machine cluster with 2,560 cores and about 1.3 PB raw storage; FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log analysis, data mining and machine learning University of Nebraska Lincoln: one medium-sized Hadoop cluster (200TB) to store and serve physics data;
  • 6. More Hadoop Applications  Adknowledge - to build the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.  Contextweb - to store ad serving log and use it as a source for Ad optimizations/ Analytics/reporting/machine learning; 23 machine cluster with 184 cores and about 35TB raw storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of storage.  Cornell University Web Lab: Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)  NetSeer - Up to 1000 instances on Amazon EC2 ; Data storage in Amazon S3; Used for crawling, processing, serving and log analysis  The New York Times : Large scale image conversions ; EC2 to run hadoop on a large virtual cluster  Powerset / Microsoft - Natural Language Search; up to 400 instances on Amazon EC2 ; data storage in Amazon S3 6
  • 7. MapReduce Paradigm  Programming model developed at Google  Sort/merge based distributed computing  Initially, it was intended for their internal search/indexing application, but now used extensively by more organizations (e.g., Yahoo, Amazon.com, IBM, etc.)  It is functional style programming that is naturally parallelizable across a large cluster of workstations or PCS.  The underlying system takes care of the partitioning of the input data, scheduling the program’s execution across several machines, handling machine failures, and managing required inter-machine communication. (This is the key for Hadoop’s success) 7
  • 8. What does it do?  Hadoop implements Google’s MapReduce, using HDFS  MapReduce divides applications into many small blocks of work.  HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.  MapReduce can then process the data where it is located.  Hadoop ‘s target is to run on clusters of the order of 10,000-nodes. 8
  • 9. How does MapReduce work?  The run time partitions the input and provides it to different Map instances;  Map (key, value)  (key’, value’)  The run time collects the (key’, value’) pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key’.  Each Reduce produces a single (or zero) file output.  Map and Reduce are user written functions 9
  • 10. Example MapReduce: To count the occurrences of words in the given set of documents map(String key, String value): // key: document name; value: document contents; map (k1,v1)  list(k2,v2) for each word w in value: EmitIntermediate(w, "1"); (Example: If input string is (“God is God. I am I”), Map produces {<“God”,1”>, <“is”, 1>, <“God”, 1>, <“I”,1>, <“am”,1>,<“I”,1>} reduce(String key, Iterator values): // key: a word; values: a list of counts; reduce (k2,list(v2))  list(v2) int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); (Example: reduce(“I”, <1,1>)  2) 10
  • 11. Word Count see bob throw see spot run see bob throw see spot run 1 1 1 1 1 1 bob run see spot throw Can we do word count in parallel? 1 1 2 1 1
  • 12. Automatic Parallel Execution in MapReduce Handles failures automatically, e.g., restarts tasks if a node fails; runs multiples copies of the same task to avoid a slow task slowing down the whole job
  • 14. Data Flow in a MapReduce Program in Hadoop InputFormat Map function Partitioner Sorting & Merging Combiner Shuffling Merging Reduce function OutputFormat  1:many
  • 15. MapReduce-Fault tolerance  Worker failure: The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.  Master Failure: It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, in most cases, the user restarts the job. 15
  • 16. Mapping workers to Processors  The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks, and stores several copies of each block (typically 3 copies) on different machines.  The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data. When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth. 16
  • 17. Task Granularity  The map phase has M pieces and the reduce phase has R pieces.  M and R should be much larger than the number of worker machines.  Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails.  Larger the M and R, more the decisions the master must make  R is often constrained by users because the output of each reduce task ends up in a separate output file.  Typically, (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines. 17
  • 18. HDFS  The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.  highly fault-tolerant and is designed to be deployed on low-cost hardware.  provides high throughput access to application data and is suitable for applications that have large data sets.  relaxes a few POSIX requirements to enable streaming access to file system data.  part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/. 18
  • 20. 20
  • 22. Execution overview 1. The MapReduce library in the user program first splits input files into M pieces of typically 16 MB to 64 MB/piece. It then starts up many copies of the program on a cluster of machines. 2. One of the copies of the program is the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task. 3. A worker who is assigned a map task reads the contents of the assigned input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory. 4. The locations of these buffered pairs on the local disk are passed back to the master, who forwards these locations to the reduce workers. 22
  • 23. Execution overview (cont.) 5. When a reduce worker is notified by the master about these locations, it uses RPC remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. 6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition. 7. When all map tasks and reduce tasks have been completed, the master wakes up the user program---the MapReduce call in the user program returns back to the user code. The output of the mapreduce execution is available in the R output files (one per reduce task). 23
  • 24. References [1] J. Dean and S. Ghemawat, ``MapReduce: Simplied Data Processing on Large Clusters,’’ OSDI 2004. (Google) [2] D. Cutting and E. Baldeschwieler, ``Meet Hadoop,’’ OSCON, Portland, OR, USA, 25 July 2007 (Yahoo!) [3] R. E. Brayant, “Data Intensive Scalable computing: The case for DISC,” Tech Report: CMU-CS-07-128, http://www.cs.cmu.edu/~bryant 24