 Read newspapers, blogs, Twitter, RSS feeds.
 Wish: aggregate sources and track emerging topics.
 Search, collect, store, analyze, understand.
 Finally, answer queries!

Problem: this is not structured data.

 Too large to store on a single machine.
 Too long to process serially.

Solution: use multiple machines.
Single machines tend to fail:
 Hard disk
 Power supply

Requirements:
 Built-in backup
 Built-in failover

More machines mean increased failure probability.

The typical developer:
 Has never dealt with large (petabytes) amounts of data.
 Has no thorough understanding of parallel programming.
 Has no time to make software production ready.

Requirements:
● Built-in backup
● Built-in failover
● Easy to use
● Parallel on rails

HADOOP IS THE SOLUTION
 Built-in backup
 Built-in failover
 Easy to administrate
 Single system
 Easy to use
 Parallel on rails
 Active development
OK, so what exactly is Hadoop?
 Large-scale, open source software framework
 Dedicated to scalable, distributed, data-intensive computing
 Handles thousands of nodes and petabytes of data
 Supports applications under a free license

Hadoop main subprojects:
1. HDFS: the Hadoop Distributed File System, with high-throughput access to application data
2. MapReduce: a software framework for distributed processing of large data sets on computer clusters
 HDFS – a distributed file system
 NameNode – namespace and block management
 DataNodes – block replica container

 MapReduce – a framework for distributed computations
 JobTracker – job scheduling, resource management, lifecycle coordination
 TaskTracker – task execution module
Hadoop: Architecture Principles

 Linear scalability: more nodes can do more work within the same time
   1. Linear on data size
   2. Linear on compute resources
 Move computation to data
   1. Minimize expensive data transfers
   2. Data are large, programs are small
 Reliability and availability: failures are common
 Simple computational model
   1. Hides complexity in an efficient execution framework
   2. Sequential data processing (avoid random reads)
 A reliable, scalable, high-performance distributed computing system
 The Hadoop Distributed File System (HDFS)
   1. Reliable storage layer
   2. With more sophisticated layers on top
 MapReduce – distributed computation framework
 Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers.
 Divide-and-conquer using lots of commodity hardware
 HDFS – a distributed file system
 NameNode – namespace and block management
 DataNodes – block replica container

 MapReduce – a framework for distributed computations
 JobTracker – job scheduling, resource management, lifecycle coordination
 TaskTracker – task execution module
 The namespace is a hierarchy of files and directories
 Files are divided into blocks (typically 128 MB)
 Namespace (metadata) is decoupled from data
   1. Lots of fast namespace operations, not slowed down by
   2. data streaming
 A single NameNode keeps the entire namespace in RAM
 DataNodes store block replicas
 Blocks are replicated on 3 DataNodes for redundancy and availability (a configuration sketch follows this list)
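
A minimal client-side sketch of the defaults mentioned above. It assumes the Hadoop 2.x configuration key names (dfs.blocksize, dfs.replication); older releases used dfs.block.size, and a real cluster would normally set these in hdfs-site.xml rather than in code.

```java
// Illustrative only: overriding the HDFS block size and replication factor
// from a client Configuration before creating files.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
        conf.setInt("dfs.replication", 3);                 // 3 replicas per block
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size: " + fs.getDefaultBlockSize());
    }
}
```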
 The durability of the namespace is maintained by a write-ahead journal and checkpoints
   1. Journal transactions are persisted into the edits file before replying to the client
   2. Checkpoints are periodically written to the fsimage file, handled by the Checkpointer or SecondaryNameNode
   3. Block locations are discovered from DataNodes during startup via block reports; they are not persisted on the NameNode
 BackupNode
 Multiple storage directories: two on local drives and one on a remote server, typically an NFS server
 DataNodes register with the NameNode and provide periodic block reports that list the block replicas on hand
 A block report contains the block id, generation stamp, and length of each replica
 DataNodes send heartbeats to the NameNode every 3 seconds to confirm they are alive
 If no heartbeat is received during a 10-minute interval, the node is presumed lost and the replicas hosted by that node unavailable (a simplified liveness check is sketched after this list)
 The NameNode schedules re-replication of the lost replicas
 Heartbeat responses give instructions for managing replicas:
   1. Replicate blocks to other nodes
   2. Remove local block replicas
   3. Re-register or shut down the node
   4. Send an urgent block report
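
The following is an illustrative sketch, not NameNode internals: how a monitor might track the 3-second heartbeats and mark a DataNode dead after the ~10-minute expiry. Class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
    private static final long EXPIRY_MS = 10 * 60 * 1000;          // 10-minute expiry
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a DataNode heartbeat arrives (roughly every 3 seconds).
    public void onHeartbeat(String datanodeId) {
        lastHeartbeat.put(datanodeId, System.currentTimeMillis());
    }

    // Periodically scan for nodes whose heartbeat has expired.
    public void checkLiveness() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((node, ts) -> {
            if (now - ts > EXPIRY_MS) {
                // Presume the node lost; its replicas must be re-replicated elsewhere.
                System.out.println("DataNode " + node + " presumed dead; scheduling re-replication");
            }
        });
    }
}
```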
 Supports conventional file system operations:
   1. Create, read, write, delete files
   2. Create, delete directories
   3. Rename files and directories
   4. Permissions
 Modification and access times for files
 Quotas: 1) namespace, 2) disk space
 Set per file, when it is created (see the API sketch after this list):
   Replication factor – can be changed later
   Block size – cannot be reset after creation
 Block replica locations are exposed to external applications
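
A minimal sketch using the standard org.apache.hadoop.fs.FileSystem client API, showing that block size and replication are chosen at create time, and that only the replication factor can be changed afterwards. The path and values are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettings {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt");

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, 4096,
                (short) 3,              // replication factor, can be changed later
                64L * 1024 * 1024);     // block size, fixed for the file's lifetime
        out.writeUTF("to be or not to be");
        out.close();

        // The replication factor may be changed after creation...
        fs.setReplication(file, (short) 2);
        // ...but there is no corresponding call to change the block size.
    }
}
```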
Read path: open a socket stream to the chosen DataNode and read bytes from the stream. If the read fails, add that node to the dead DataNodes list, choose the next DataNode in the list, and retry (up to 2 times), as sketched below.
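An illustrative client-side retry loop, not the real DFSInputStream code: try the DataNodes holding a replica in order, skipping ones already marked dead, with a bounded number of retries. The helper readFromDataNode is a placeholder.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BlockReader {
    private final Set<String> deadDataNodes = new HashSet<>();

    public byte[] readBlock(List<String> replicaLocations) throws IOException {
        int retries = 2;
        for (String datanode : replicaLocations) {
            if (deadDataNodes.contains(datanode)) continue;   // skip known-bad nodes
            try {
                return readFromDataNode(datanode);            // open socket stream, read bytes
            } catch (IOException e) {
                deadDataNodes.add(datanode);                   // remember the failure
                if (retries-- == 0) break;
            }
        }
        throw new IOException("Could not read block from any replica");
    }

    private byte[] readFromDataNode(String datanode) throws IOException {
        // Placeholder for the actual socket read against the DataNode.
        throw new IOException("not implemented in this sketch");
    }
}
```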
 MapReduce schedules a task that processes block B on a DataNode serving a replica of B
 Local access to data
 HDFS implements a single-writer, multiple-reader model
 An HDFS client maintains a lease on files it has opened for write
 Only one client can hold a lease on a single file
 The client periodically renews the lease by sending heartbeats to the NameNode
 Lease expiration:
   Until the soft limit expires, the client has exclusive access to the file
   After the soft limit (10 min): any client can reclaim the lease
   After the hard limit (1 hour): the NameNode forcibly closes the file and revokes the lease
 A writer's lease does not prevent other clients from reading the file
 The original implementation supported write-once semantics: after the file is closed, the bytes written cannot be altered or removed
 Now files can be modified by reopening them for append
 Block modifications during appends use a copy-on-write technique
   The last block is copied into a temporary location and modified
   When “full” it is copied into its permanent location
 HDFS provides consistent visibility of data for readers before the file is closed
   The hflush operation provides the visibility guarantee
   On hflush the current packet is immediately pushed to the write pipeline
   hflush waits until all DataNodes successfully receive the packet
   hsync additionally guarantees the data is persisted to local disks on the DataNodes (see the sketch below)
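
A minimal sketch, assuming the Hadoop 2.x FSDataOutputStream API (hflush/hsync; older releases exposed sync() instead). The path is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/logs/events.log"))) {
            out.writeBytes("event-1\n");
            out.hflush();   // readers can now see event-1 even though the file is still open
            out.writeBytes("event-2\n");
            out.hsync();    // additionally forces the data to disk on the DataNodes
        }                   // close() makes the written bytes immutable (unless reopened for append)
    }
}
```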
 Cluster topology: hierarchical grouping of nodes according to network distance
 Default block placement policy – a tradeoff between minimizing the write cost and maximizing data reliability, availability, and aggregate read bandwidth:
   1. First replica on the node local to the writer
   2. Second and third replicas on two different nodes in a different rack
   3. The rest are placed on random nodes, with restrictions:
      no more than one replica of the same block is placed on one node, and
      no more than two replicas are placed in the same rack (if there are enough racks)
 HDFS provides a configurable block placement policy interface (experimental); a simplified placement sketch follows this list
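
A simplified illustration of the placement rules above, not the actual BlockPlacementPolicyDefault implementation; the Node type and helper logic are hypothetical, and it assumes a replication factor of at least 3.

```java
import java.util.ArrayList;
import java.util.List;

public class PlacementSketch {
    static class Node { String name; String rack; Node(String n, String r) { name = n; rack = r; } }

    // Choose target nodes for one block with the given replication factor (>= 3).
    static List<Node> chooseTargets(Node writer, List<Node> cluster, int replication) {
        List<Node> targets = new ArrayList<>();
        targets.add(writer);                                   // 1st replica: node local to the writer
        Node remote = null;
        for (Node n : cluster) {                               // 2nd replica: a node in a different rack
            if (!n.rack.equals(writer.rack)) { remote = n; targets.add(n); break; }
        }
        if (remote != null) {
            for (Node n : cluster) {                           // 3rd replica: another node in that remote rack
                if (n != remote && n.rack.equals(remote.rack)) { targets.add(n); break; }
            }
        }
        for (Node n : cluster) {                               // remaining replicas: spread with the restrictions
            if (targets.size() >= replication) break;
            final Node cand = n;
            long inRack = targets.stream().filter(t -> t.rack.equals(cand.rack)).count();
            if (!targets.contains(n) && inRack < 2) targets.add(n);  // <=1 per node, <=2 per rack
        }
        return targets;
    }
}
```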
 Namespace ID
   A unique cluster id common to the cluster components (NN and DNs)
   Assigned to the file system at format time
   Prevents DNs from other clusters from joining this cluster
 Storage ID
   A DataNode persistently stores its unique (per-cluster) Storage ID
   Makes the DN recognizable even if it is restarted with a different IP address or port
   Assigned to the DataNode when it registers with the NameNode for the first time
 Software Version (build version)
   Different software versions of NN and DN are incompatible
   Starting from Hadoop 2: version compatibility for rolling upgrades
 Data integrity via block checksums
 NameNode startup
   1. Read the image, replay the journal, write a new image and an empty journal
   2. Enter SafeMode
 DataNode startup
   1. Handshake: check Namespace ID and Software Version
   2. Registration: the NameNode records the Storage ID and address of the DN
   3. Send the initial block report
 SafeMode – read-only mode for the NameNode
   1. Disallows modifications of the namespace
   2. Disallows block replication or deletion
   3. Prevents unnecessary block replications until the majority of blocks (minimally replicated blocks) are reported
   4. SafeMode threshold, extension
   5. Manual SafeMode during unforeseen circumstances
 Ensure that each block always has the intended number of replicas
 Conform to the block placement policy (BPP)
 Replication is updated when a block report is received
 Over-replicated blocks: choose a replica to remove
   1. Balance storage utilization across nodes without reducing the block’s availability
   2. Try not to reduce the number of racks that host replicas
   3. Remove the replica from the DataNode with the oldest heartbeat or the least available space
 Under-replicated blocks: placed into the replication priority queue (a priority-queue sketch follows this list)
   1. Fewer replicas means higher in the queue
   2. Minimize the cost of replica creation, but conform to the BPP
 Mis-replicated blocks: do not conform to the BPP
 Missing block: nothing to do
 Corrupt block: try to replicate good replicas, keep corrupt replicas as is
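
An illustrative sketch with hypothetical class names, not NameNode internals: ordering under-replicated blocks so that the blocks with the fewest live replicas are re-replicated first.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class ReplicationQueueSketch {
    static class UnderReplicatedBlock {
        final long blockId;
        final int liveReplicas;
        final int expectedReplicas;
        UnderReplicatedBlock(long id, int live, int expected) {
            blockId = id; liveReplicas = live; expectedReplicas = expected;
        }
    }

    public static void main(String[] args) {
        // Fewest live replicas first: a block down to its last replica is most urgent.
        PriorityQueue<UnderReplicatedBlock> queue =
                new PriorityQueue<>(Comparator.comparingInt(b -> b.liveReplicas));
        queue.add(new UnderReplicatedBlock(1L, 2, 3));
        queue.add(new UnderReplicatedBlock(2L, 1, 3));   // re-replicated first
        while (!queue.isEmpty()) {
            UnderReplicatedBlock b = queue.poll();
            System.out.printf("replicate block %d: %d/%d replicas%n",
                    b.blockId, b.liveReplicas, b.expectedReplicas);
        }
    }
}
```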
 Snapshots protect against data corruption/loss during software upgrades and allow rollback if an upgrade goes bad
 NameNode image snapshot:
   1. Start hadoop namenode -upgrade
   2. Read the checkpoint image and journal based on the persisted Layout Version (new software can always read old layouts)
   3. Rename the current storage directory to previous
   4. Save the image into a new current
 • One master, many workers
 – Input data split into M map tasks (typically 64 MB in size)
 – Reduce phase partitioned into R reduce tasks
 – Tasks are assigned to workers dynamically
 – Often: M=200,000; R=4,000; workers=2,000
 • Master assigns each map task to a free worker
 – Considers locality of data to worker when assigning task
 – Worker reads task input (often from local disk!)
 – Worker produces R local files containing intermediate k/v pairs
 • Master assigns each reduce task to a free worker

 – Worker reads intermediate k/v pairs from map workers
 – Worker sorts & applies user’s Reduce op to produce the output
 On worker failure:
 • Detect failure via periodic heartbeats
 • Re-execute completed and in-progress map tasks
 • Re-execute in-progress reduce tasks
 • Task completion committed through the master
 On master failure:
 • State is checkpointed to GFS: a new master recovers and continues
 Very robust: lost 1600 of 1800 machines once, but finished fine
 HDFS – a distributed file system
 NameNode – namespace and block management
 DataNodes – block replica container

 MapReduce – a framework for distributed computations
 JobTracker – job scheduling, resource management, lifecycle coordination
 TaskTracker – task execution module
Example: Word Frequencies in Web Pages
A typical exercise for a new engineer in his or her first week.
• Input is files with one document per record
• Specify a map function that takes a key/value pair
  key = document URL (e.g. “document1”)
  value = document contents (e.g. “to be or not to be”)
• Output of the map function is (potentially many) key/value pairs; in our case, output (word, “1”) once per word in the document:
  (“to”,“1”), (“be”,“1”), (“or”,“1”), (“not”,“1”), (“to”,“1”), (“be”,“1”)
• The MapReduce library gathers together all pairs with the same key (shuffle/sort):
  (“be”, [“1”,“1”]), (“not”, [“1”]), (“or”, [“1”]), (“to”, [“1”,“1”])
• The reduce function combines the values for a key; in our case it computes the sum (a full implementation sketch follows):
  (“be”,“2”), (“not”,“1”), (“or”,“1”), (“to”,“2”)
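
A compact word count in the Hadoop MapReduce Java API, matching the example above. This is a sketch of the standard pattern rather than code from the slides; class and path names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: for every word in the document, emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts gathered for each word after the shuffle/sort.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```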
Typical applications:
 Recommendation systems
 Financial analysis
 Natural Language Processing (NLP)
 Image/video processing
 Log analysis
 Data warehousing
 Correlation engines
 Market research/forecasting

Industries using Hadoop:
 Finance
 Health & Life Sciences
 Academic research
 Government
 Social networking
 Telecommunications