SlideShare a Scribd company logo
1 of 47
Presented By
info@kellytechno.com  +91 998 570
6789
 Open source software framework designed for
storage and processing of large scale data on
clusters of commodity hardware
 Created by Doug Cutting and Mike Carafella in
2005.
 Cutting named the program after his son’s toy
elephant.
www.kellytechnmo.com
 Data-intensive text processing
 Assembly of large genomes
 Graph mining
 Machine learning and data mining
 Large scale social network analysis
www.kellytechnmo.com
www.kellytechnmo.com
www.kellytechnmo.com
What considerations led to its design
www.kellytechnmo.com
 What were the limitations of earlier large-scale
computing?
 What requirements should an alternative
approach have?
 How does Hadoop address those
requirements?
www.kellytechnmo.com
 Historically computation was processor-bound
 Data volume has been relatively small
 Complicated computations are performed on that
data
 Advances in computer technology has
historically centered around improving the
power of a single machine
www.kellytechnmo.com
www.kellytechnmo.com
 Moore’s Law
 The number of transistors on a dense integrated
circuit doubles every two years
 Single-core computing can’t scale with
current computing needs
www.kellytechnmo.com
 Power consumption limits the speed
increase we get from transistor density
www.kellytechnmo.com
 Allows developers to
use multiple
machines for a
single task
www.kellytechnmo.com
 Programming on a distributed system is much
more complex
 Synchronizing data exchanges
 Managing a finite bandwidth
 Controlling computation timing is complicated
www.kellytechnmo.com
“You know you have a distributed system when
the crash of a computer you’ve never
heard of stops you from getting any work done.”
–Leslie Lamport
Distributed systems must be designed with the
expectation of failure
www.kellytechnmo.com
 Typically divided into Data Nodes and Compute
Nodes
 At compute time, data is copied to the Compute
Nodes
 Fine for relatively small amounts of data
 Modern systems deal with far more data than
was gathering in the past
www.kellytechnmo.com
 Facebook
 500 TB per day
 Yahoo
 Over 170 PB
 eBay
 Over 6 PB
 Getting the data to the processors becomes the
bottleneck
www.kellytechnmo.com
 Must support partial
failure
 Must be scalable
www.kellytechnmo.com
 Failure of a single component must not
cause the failure of the entire system only a
degradation of the application performance
 Failure should not
result in the loss of
any data
www.kellytechnmo.com
 If a component fails, it should be able to
recover without restarting the entire system
 Component failure or recovery during a job
must not affect the final output
www.kellytechnmo.com
 Increasing resources should increase load
capacity
 Increasing the load on the system should result
in a graceful decline in performance for all jobs
 Not system failure
www.kellytechnmo.com
 Based on work done by Google in the early
2000s
 “The Google File System” in 2003
 “MapReduce: Simplified Data Processing on Large
Clusters” in 2004
 The core idea was to distribute the data as it is
initially stored
 Each node can then perform computation on the
data it stores without moving the data for the initial
processing
www.kellytechnmo.com
 Applications are written in a high-level
programming language
 No network programming or temporal dependency
 Nodes should communicate as little as
possible
 A “shared nothing” architecture
 Data is spread among the machines in
advance
 Perform computation where the data is already
stored as often as possible
www.kellytechnmo.com
 When data is loaded onto the system it is divided
into blocks
 Typically 64MB or 128MB
 Tasks are divided into two phases
 Map tasks which are done on small portions of data
where the data is stored
 Reduce tasks which combine data to produce the final
output
 A master program allocates work to individual
nodes
www.kellytechnmo.com
 Failures are detected by the master program
which reassigns the work to a different node
 Restarting a task does not affect the nodes
working on other portions of the data
 If a failed node restarts, it is added back to the
system and assigned new tasks
 The master can redundantly execute the same
task to avoid slow running nodes
www.kellytechnmo.com
HDFS
www.kellytechnmo.com
 Responsible for storing data on the cluster
 Data files are split into blocks and distributed
across the nodes in the cluster
 Each block is replicated multiple times
www.kellytechnmo.com
 HDFS is a file system written in Java based on
the Google’s GFS
 Provides redundant storage for massive
amounts of data
www.kellytechnmo.com
 HDFS works best with a smaller number of
large files
 Millions as opposed to billions of files
 Typically 100MB or more per file
 Files in HDFS are write once
 Optimized for streaming reads of large files and
not random reads
www.kellytechnmo.com
 Files are split into blocks
 Blocks are split across many machines at load
time
 Different blocks from the same file will be stored on
different machines
 Blocks are replicated across multiple machines
 The NameNode keeps track of which blocks
make up a file and where they are stored
www.kellytechnmo.com
 Default replication is 3-fold
www.kellytechnmo.com
 When a client wants to retrieve data
 Communicates with the NameNode to determine
which blocks make up a file and on which data
nodes those blocks are stored
 Then communicated directly with the data nodes to
read the data
www.kellytechnmo.com
Distributing computation across nodes
www.kellytechnmo.com
 A method for distributing computation across
multiple nodes
 Each node processes the data that is stored at
that node
 Consists of two main phases
 Map
 Reduce
www.kellytechnmo.com
 Automatic parallelization and distribution
 Fault-Tolerance
 Provides a clean abstraction for programmers
to use
www.kellytechnmo.com
 Reads data as key/value pairs
 The key is often discarded
 Outputs zero or more key/value pairs
www.kellytechnmo.com
 Output from the mapper is sorted by key
 All values with the same key are guaranteed to
go to the same machine
www.kellytechnmo.com
 Called once for each unique key
 Gets a list of all values associated with a key as
input
 The reducer outputs zero or more final
key/value pairs
 Usually just one output per input key
www.kellytechnmo.com
www.kellytechnmo.com
What parts actually make up a Hadoop cluster
www.kellytechnmo.com
 NameNode
 Holds the metadata for the HDFS
 Secondary NameNode
 Performs housekeeping functions for the NameNode
 DataNode
 Stores the actual HDFS data blocks
 JobTracker
 Manages MapReduce jobs
 TaskTracker
 Monitors individual Map and Reduce tasks
www.kellytechnmo.com
 Stores the HDFS file system information in a
fsimage
 Updates to the file system (add/remove blocks) do
not change the fsimage file
 They are instead written to a log file
 When starting the NameNode loads the fsimage
file and then applies the changes in the log file
www.kellytechnmo.com
 NOT a backup for the NameNode
 Periodically reads the log file and applies the
changes to the fsimage file bringing it up to
date
 Allows the NameNode to restart faster when
required
www.kellytechnmo.com
 JobTracker
 Determines the execution plan for the job
 Assigns individual tasks
 TaskTracker
 Keeps track of the performance of an individual
mapper or reducer
www.kellytechnmo.com
Other available tools
www.kellytechnmo.com
 MapReduce is very powerful, but can be
awkward to master
 These tools allow programmers who are
familiar with other programming styles to take
advantage of the power of MapReduce
www.kellytechnmo.com
 Hive
 Hadoop processing with SQL
 Pig
 Hadoop processing with scripting
 Cascading
 Pipe and Filter processing model
 HBase
 Database model built on top of Hadoop
 Flume
 Designed for large scale data movement
www.kellytechnmo.com
Presented By
info@kellytechno.com  +91 998 570 6789

More Related Content

More from Kelly Technologies

Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabadKelly Technologies
 
Data science institutes in hyderabad
Data science institutes in hyderabadData science institutes in hyderabad
Data science institutes in hyderabadKelly Technologies
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabadKelly Technologies
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabadKelly Technologies
 
Hadoop institutes in hyderabad
Hadoop institutes in hyderabadHadoop institutes in hyderabad
Hadoop institutes in hyderabadKelly Technologies
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesKelly Technologies
 
Websphere mb training in hyderabad
Websphere mb training in hyderabadWebsphere mb training in hyderabad
Websphere mb training in hyderabadKelly Technologies
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreKelly Technologies
 
Oracle training-institutes-in-hyderabad
Oracle training-institutes-in-hyderabadOracle training-institutes-in-hyderabad
Oracle training-institutes-in-hyderabadKelly Technologies
 
Hadoop training institutes in bangalore
Hadoop training institutes in bangaloreHadoop training institutes in bangalore
Hadoop training institutes in bangaloreKelly Technologies
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangaloreKelly Technologies
 
Salesforce crm-training-in-bangalore
Salesforce crm-training-in-bangaloreSalesforce crm-training-in-bangalore
Salesforce crm-training-in-bangaloreKelly Technologies
 
Qlikview training in hyderabad
Qlikview training in hyderabadQlikview training in hyderabad
Qlikview training in hyderabadKelly Technologies
 
Project Management Planning training in hyderabad
Project Management Planning training in hyderabadProject Management Planning training in hyderabad
Project Management Planning training in hyderabadKelly Technologies
 

More from Kelly Technologies (20)

Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data science institutes in hyderabad
Data science institutes in hyderabadData science institutes in hyderabad
Data science institutes in hyderabad
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabad
 
Hadoop training institute in hyderabad
Hadoop training institute in hyderabadHadoop training institute in hyderabad
Hadoop training institute in hyderabad
 
Hadoop institutes in hyderabad
Hadoop institutes in hyderabadHadoop institutes in hyderabad
Hadoop institutes in hyderabad
 
Hadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologiesHadoop training in hyderabad-kellytechnologies
Hadoop training in hyderabad-kellytechnologies
 
Sas training in hyderabad
Sas training in hyderabadSas training in hyderabad
Sas training in hyderabad
 
Websphere mb training in hyderabad
Websphere mb training in hyderabadWebsphere mb training in hyderabad
Websphere mb training in hyderabad
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Oracle training-institutes-in-hyderabad
Oracle training-institutes-in-hyderabadOracle training-institutes-in-hyderabad
Oracle training-institutes-in-hyderabad
 
Hadoop training institutes in bangalore
Hadoop training institutes in bangaloreHadoop training institutes in bangalore
Hadoop training institutes in bangalore
 
Hadoop training institute in bangalore
Hadoop training institute in bangaloreHadoop training institute in bangalore
Hadoop training institute in bangalore
 
Tableau training in bangalore
Tableau training in bangaloreTableau training in bangalore
Tableau training in bangalore
 
Salesforce crm-training-in-bangalore
Salesforce crm-training-in-bangaloreSalesforce crm-training-in-bangalore
Salesforce crm-training-in-bangalore
 
Oracle training in hyderabad
Oracle training in hyderabadOracle training in hyderabad
Oracle training in hyderabad
 
Qlikview training in hyderabad
Qlikview training in hyderabadQlikview training in hyderabad
Qlikview training in hyderabad
 
Spark training-in-bangalore
Spark training-in-bangaloreSpark training-in-bangalore
Spark training-in-bangalore
 
Project Management Planning training in hyderabad
Project Management Planning training in hyderabadProject Management Planning training in hyderabad
Project Management Planning training in hyderabad
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 

Recently uploaded

How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17Celine George
 
demyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxdemyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxMohamed Rizk Khodair
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文中 央社
 
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Denish Jangid
 
UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024Borja Sotomayor
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project researchCaitlinCummins3
 
The Liver & Gallbladder (Anatomy & Physiology).pptx
The Liver &  Gallbladder (Anatomy & Physiology).pptxThe Liver &  Gallbladder (Anatomy & Physiology).pptx
The Liver & Gallbladder (Anatomy & Physiology).pptxVishal Singh
 
The Story of Village Palampur Class 9 Free Study Material PDF
The Story of Village Palampur Class 9 Free Study Material PDFThe Story of Village Palampur Class 9 Free Study Material PDF
The Story of Village Palampur Class 9 Free Study Material PDFVivekanand Anglo Vedic Academy
 
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...Nguyen Thanh Tu Collection
 
ANTI PARKISON DRUGS.pptx
ANTI         PARKISON          DRUGS.pptxANTI         PARKISON          DRUGS.pptx
ANTI PARKISON DRUGS.pptxPoojaSen20
 
An overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismAn overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismDabee Kamal
 
diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....Ritu480198
 
Championnat de France de Tennis de table/
Championnat de France de Tennis de table/Championnat de France de Tennis de table/
Championnat de France de Tennis de table/siemaillard
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...Nguyen Thanh Tu Collection
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...Nguyen Thanh Tu Collection
 
Graduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptxGraduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptxneillewis46
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽中 央社
 
Major project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesMajor project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesAmanpreetKaur157993
 

Recently uploaded (20)

How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17
 
demyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxdemyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptx
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
 
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
 
UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024UChicago CMSC 23320 - The Best Commit Messages of 2024
UChicago CMSC 23320 - The Best Commit Messages of 2024
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 
The Liver & Gallbladder (Anatomy & Physiology).pptx
The Liver &  Gallbladder (Anatomy & Physiology).pptxThe Liver &  Gallbladder (Anatomy & Physiology).pptx
The Liver & Gallbladder (Anatomy & Physiology).pptx
 
The Story of Village Palampur Class 9 Free Study Material PDF
The Story of Village Palampur Class 9 Free Study Material PDFThe Story of Village Palampur Class 9 Free Study Material PDF
The Story of Village Palampur Class 9 Free Study Material PDF
 
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
 
ANTI PARKISON DRUGS.pptx
ANTI         PARKISON          DRUGS.pptxANTI         PARKISON          DRUGS.pptx
ANTI PARKISON DRUGS.pptx
 
An overview of the various scriptures in Hinduism
An overview of the various scriptures in HinduismAn overview of the various scriptures in Hinduism
An overview of the various scriptures in Hinduism
 
diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....
 
IPL Online Quiz by Pragya; Question Set.
IPL Online Quiz by Pragya; Question Set.IPL Online Quiz by Pragya; Question Set.
IPL Online Quiz by Pragya; Question Set.
 
Championnat de France de Tennis de table/
Championnat de France de Tennis de table/Championnat de France de Tennis de table/
Championnat de France de Tennis de table/
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
 
Graduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptxGraduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptx
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
 
Major project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategiesMajor project report on Tata Motors and its marketing strategies
Major project report on Tata Motors and its marketing strategies
 

Hadoop training in hyderabad@kellytechnologies

  • 2.  Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and Mike Carafella in 2005.  Cutting named the program after his son’s toy elephant. www.kellytechnmo.com
  • 3.  Data-intensive text processing  Assembly of large genomes  Graph mining  Machine learning and data mining  Large scale social network analysis www.kellytechnmo.com
  • 6. What considerations led to its design www.kellytechnmo.com
  • 7.  What were the limitations of earlier large-scale computing?  What requirements should an alternative approach have?  How does Hadoop address those requirements? www.kellytechnmo.com
  • 8.  Historically computation was processor-bound  Data volume has been relatively small  Complicated computations are performed on that data  Advances in computer technology has historically centered around improving the power of a single machine www.kellytechnmo.com
  • 10.  Moore’s Law  The number of transistors on a dense integrated circuit doubles every two years  Single-core computing can’t scale with current computing needs www.kellytechnmo.com
  • 11.  Power consumption limits the speed increase we get from transistor density www.kellytechnmo.com
  • 12.  Allows developers to use multiple machines for a single task www.kellytechnmo.com
  • 13.  Programming on a distributed system is much more complex  Synchronizing data exchanges  Managing a finite bandwidth  Controlling computation timing is complicated www.kellytechnmo.com
  • 14. “You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” –Leslie Lamport Distributed systems must be designed with the expectation of failure www.kellytechnmo.com
  • 15.  Typically divided into Data Nodes and Compute Nodes  At compute time, data is copied to the Compute Nodes  Fine for relatively small amounts of data  Modern systems deal with far more data than was gathering in the past www.kellytechnmo.com
  • 16.  Facebook  500 TB per day  Yahoo  Over 170 PB  eBay  Over 6 PB  Getting the data to the processors becomes the bottleneck www.kellytechnmo.com
  • 17.  Must support partial failure  Must be scalable www.kellytechnmo.com
  • 18.  Failure of a single component must not cause the failure of the entire system only a degradation of the application performance  Failure should not result in the loss of any data www.kellytechnmo.com
  • 19.  If a component fails, it should be able to recover without restarting the entire system  Component failure or recovery during a job must not affect the final output www.kellytechnmo.com
  • 20.  Increasing resources should increase load capacity  Increasing the load on the system should result in a graceful decline in performance for all jobs  Not system failure www.kellytechnmo.com
  • 21.  Based on work done by Google in the early 2000s  “The Google File System” in 2003  “MapReduce: Simplified Data Processing on Large Clusters” in 2004  The core idea was to distribute the data as it is initially stored  Each node can then perform computation on the data it stores without moving the data for the initial processing www.kellytechnmo.com
  • 22.  Applications are written in a high-level programming language  No network programming or temporal dependency  Nodes should communicate as little as possible  A “shared nothing” architecture  Data is spread among the machines in advance  Perform computation where the data is already stored as often as possible www.kellytechnmo.com
  • 23.  When data is loaded onto the system it is divided into blocks  Typically 64MB or 128MB  Tasks are divided into two phases  Map tasks which are done on small portions of data where the data is stored  Reduce tasks which combine data to produce the final output  A master program allocates work to individual nodes www.kellytechnmo.com
  • 24.  Failures are detected by the master program which reassigns the work to a different node  Restarting a task does not affect the nodes working on other portions of the data  If a failed node restarts, it is added back to the system and assigned new tasks  The master can redundantly execute the same task to avoid slow running nodes www.kellytechnmo.com
  • 26.  Responsible for storing data on the cluster  Data files are split into blocks and distributed across the nodes in the cluster  Each block is replicated multiple times www.kellytechnmo.com
  • 27.  HDFS is a file system written in Java based on the Google’s GFS  Provides redundant storage for massive amounts of data www.kellytechnmo.com
  • 28.  HDFS works best with a smaller number of large files  Millions as opposed to billions of files  Typically 100MB or more per file  Files in HDFS are write once  Optimized for streaming reads of large files and not random reads www.kellytechnmo.com
  • 29.  Files are split into blocks  Blocks are split across many machines at load time  Different blocks from the same file will be stored on different machines  Blocks are replicated across multiple machines  The NameNode keeps track of which blocks make up a file and where they are stored www.kellytechnmo.com
  • 30.  Default replication is 3-fold www.kellytechnmo.com
  • 31.  When a client wants to retrieve data  Communicates with the NameNode to determine which blocks make up a file and on which data nodes those blocks are stored  Then communicated directly with the data nodes to read the data www.kellytechnmo.com
  • 32. Distributing computation across nodes www.kellytechnmo.com
  • 33.  A method for distributing computation across multiple nodes  Each node processes the data that is stored at that node  Consists of two main phases  Map  Reduce www.kellytechnmo.com
  • 34.  Automatic parallelization and distribution  Fault-Tolerance  Provides a clean abstraction for programmers to use www.kellytechnmo.com
  • 35.  Reads data as key/value pairs  The key is often discarded  Outputs zero or more key/value pairs www.kellytechnmo.com
  • 36.  Output from the mapper is sorted by key  All values with the same key are guaranteed to go to the same machine www.kellytechnmo.com
  • 37.  Called once for each unique key  Gets a list of all values associated with a key as input  The reducer outputs zero or more final key/value pairs  Usually just one output per input key www.kellytechnmo.com
  • 39. What parts actually make up a Hadoop cluster www.kellytechnmo.com
  • 40.  NameNode  Holds the metadata for the HDFS  Secondary NameNode  Performs housekeeping functions for the NameNode  DataNode  Stores the actual HDFS data blocks  JobTracker  Manages MapReduce jobs  TaskTracker  Monitors individual Map and Reduce tasks www.kellytechnmo.com
  • 41.  Stores the HDFS file system information in a fsimage  Updates to the file system (add/remove blocks) do not change the fsimage file  They are instead written to a log file  When starting the NameNode loads the fsimage file and then applies the changes in the log file www.kellytechnmo.com
  • 42.  NOT a backup for the NameNode  Periodically reads the log file and applies the changes to the fsimage file bringing it up to date  Allows the NameNode to restart faster when required www.kellytechnmo.com
  • 43.  JobTracker  Determines the execution plan for the job  Assigns individual tasks  TaskTracker  Keeps track of the performance of an individual mapper or reducer www.kellytechnmo.com
  • 45.  MapReduce is very powerful, but can be awkward to master  These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce www.kellytechnmo.com
  • 46.  Hive  Hadoop processing with SQL  Pig  Hadoop processing with scripting  Cascading  Pipe and Filter processing model  HBase  Database model built on top of Hadoop  Flume  Designed for large scale data movement www.kellytechnmo.com

Editor's Notes

  1. Example of failure issues. Linux lab is distributed file system, if the file server fails, what happens.
  2. Example of map and reduce
  3. Default replication is 3-fold