Unit-1
Introduction to Big Data
❑Big Data
❑Hadoop
❑HDFS
❑MapReduce
Big Data
⮚ Big data refers to data that is so large, fast or complex that it’s
difficult or impossible to process using traditional methods.
⮚ The act of accessing and storing large amounts of information for
analytics has been around for a long time, but the concept of big data
gained momentum in the early 2000s.
⮚ Big Data is high-volume, high-velocity and/or high-variety information
asset that requires new forms of processing for enhanced decision
making, insight discovery and process optimization (Gartner 2012).
⮚ “Data of a very large size, typically to the extent that its manipulation
and management present significant logistical challenges”.
Types of Big Data
⮚ Big data is classified in three ways: Structured Data, Unstructured
Data and Semi-Structured Data.
⮚ Structured data is the easiest to work with. It is highly organized with
dimensions defined by set parameters. Structured data follows schemas:
essentially road maps to specific data points. These schemas outline
where each datum is and what it means. It’s all your quantitative data like
Age, Billing, Address etc.
⮚ Unstructured data is all your unorganized data. The hardest part of
analyzing unstructured data is teaching an application to understand the
information it’s extracting. More often than not, this means translating it
into some form of structured data.
⮚ Semi-structured data toes the line between structured and
unstructured. Most of the time, this translates to unstructured data with
metadata attached to it. Examples of this data are:- time, location, device
ID stamp or email address, or it can be a semantic tag attached to the data
later. Semi-structured data has no set schema.
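The distinction above can be made concrete with a small sketch. The records below are invented for illustration: a structured row with a fixed schema, a semi-structured JSON message carrying metadata, and raw unstructured text.

```python
import json

# Hypothetical example records illustrating the three types of big data.
# Structured: every record follows the same fixed schema.
structured_row = {"age": 34, "billing": 120.50, "address": "12 Main St"}

# Semi-structured: free-form content plus machine-readable metadata
# (device ID, timestamp) attached to it -- no enforced schema.
semi_structured = json.loads(
    '{"device_id": "cam-07", "timestamp": "2023-05-01T10:00:00",'
    ' "payload": "motion detected near gate"}'
)

# Unstructured: raw text with no attached structure at all.
unstructured = "Customer called, unhappy about the delivery delay..."

print(sorted(structured_row))        # fixed, known fields
print(semi_structured["device_id"])  # the metadata is directly queryable
```

Note how only the structured record guarantees which fields exist; the semi-structured one is queryable only through its attached metadata keys.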
3 V’s of Big Data
Hadoop
What is Hadoop?
⮚Hadoop is an Apache open-source framework written in Java that
allows distributed processing of large datasets across clusters of
computers using simple programming models.
⮚The Hadoop framework application works in an environment that
provides distributed storage and computation across clusters of
computers.
⮚Hadoop is a framework that uses distributed storage and parallel
processing to store and manage Big Data.
⮚Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
Hadoop Applications
Industry: Use Cases
Web: Social Network Analysis, Clickstream Sessionization
Media: Content Optimization, Clickstream Sessionization
Telco: Network Analytics, Mediation
Retail: Loyalty & Promotions Analysis, Data Factory
Financial: Fraud Analysis, Trade Reconciliation
Federal: Entity Analysis, SIGINT
Bioinformatics: Genome Mapping, Sequencing Analysis
Hadoop Core Principles
⮚Scale-Out rather than Scale-Up
⮚Bring code to data rather than data to code
⮚Deal with failures – they are common
⮚Abstract complexity of distributed and concurrent
applications
Scale-Out rather than Scale-Up
1) It is harder and more expensive to Scale-Up
i. Add additional resources to an existing node (CPU, RAM)
ii. Moore’s Law can’t keep up with data growth
iii. New units must be purchased if the required resources cannot be added
iv. Also known as scaling vertically
2) Scale-Out
i. Add more nodes/machines to an existing distributed application
ii. The software layer is designed for node addition or removal
iii. Hadoop takes this approach: a set of nodes is bonded together as a single
distributed system
iv. Very easy to scale down as well
Bring Code to Data rather than Data to Code
◆Hadoop co-locates processors and storage
◆Code is moved to data (size is tiny, usually in KBs)
◆Processors execute code and access underlying local storage
Hadoop is designed to cope with node failures
⮚If a node fails, the master will detect that failure and re-assign the work to
a different node on the system.
⮚Restarting a task does not require communication with nodes working on
other portions of the data.
⮚If a failed node restarts, it is automatically added back to the system and
assigned new tasks.
⮚If a node appears to be running slowly, the master can redundantly
execute another instance of the same task.
⮚ The results from the first copy to finish will be used.
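This "run it twice, keep the first result" idea (speculative execution) can be sketched in miniature. The node names, the toy task, and the use of threads in place of cluster nodes are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

# Toy sketch of speculative execution: the same task is launched on two
# simulated "nodes" and the result of whichever finishes first is used.
def task(node, work_steps):
    total = 0
    for _ in range(work_steps):  # the slow node simply does more busywork
        total += 1
    return sum(range(10))        # both copies compute the same answer

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(task, "node-A", 10),
               pool.submit(task, "node-B", 1_000_000)}
    done, not_done = wait(futures, return_when=FIRST_COMPLETED)
    result = next(iter(done)).result()

print(result)  # 45, taken from whichever copy finished first
```

Because both copies are deterministic here, the answer is the same regardless of which "node" wins the race, which is exactly why the master can safely discard the slower instance.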
HDFS
HDFS Replication
[Figure: five data blocks (1-5), each replicated three times across six DataNodes, with the replicas of each block stored on different nodes]
Block Size = 64MB
Replication Factor = 3
Cost is $400-$500/TB
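The replication layout in the figure can be sketched with a simple placement routine. The round-robin policy below is a simplification for illustration; real HDFS uses a rack-aware placement policy, and the node names are invented.

```python
# Sketch of block placement with replication: 5 blocks, replication
# factor 3, 6 DataNodes, replicas of a block always on distinct nodes.
NODES = ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"]
REPLICATION = 3

def place_blocks(num_blocks, nodes, replication):
    placement = {}
    for b in range(1, num_blocks + 1):
        # pick `replication` consecutive nodes, wrapping around the list
        start = (b - 1) * replication
        placement[b] = [nodes[(start + r) % len(nodes)]
                        for r in range(replication)]
    return placement

placement = place_blocks(5, NODES, REPLICATION)
for block, replicas in placement.items():
    assert len(set(replicas)) == REPLICATION  # replicas on distinct nodes
print(placement[1])  # ['dn1', 'dn2', 'dn3']
```

Losing any single node therefore removes at most one replica of each block, which is what makes the three-way replication tolerate node failure.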
Hadoop Core Components
What is File System (FS)?
⮚ A file management system is used by the operating system to access the
files and folders stored in a computer or on any external storage device.
⮚ A file system stores and organizes data and can be thought of as a type
of index for all the data contained in a storage device. These devices
can include hard drives, optical drives and flash drives.
⮚ Imagine a file management system as a big dictionary that contains
information about file names, locations and types.
⮚ File systems specify conventions for naming files, including the
maximum number of characters in a name, which characters can be
used, etc.
⮚ A file management system is capable of handling files within one
machine only.
What is Distributed File System
(DFS)?
⮚A Distributed File System (DFS), as the name suggests, is a file system
that is distributed across multiple file servers or multiple locations.
⮚It allows programs to access or store isolated files as they do with
local ones, allowing programmers to access files from any network or
computer.
⮚The main purpose of the Distributed File System (DFS) is to allow
users of physically distributed systems to share their data and resources
by using a common file system.
⮚A collection of workstations and mainframes connected by a Local
Area Network (LAN) is a typical configuration of a Distributed File System.
How Distributed file system (DFS)
works?
⮚A distributed file system works as follows:
a) Distribution: blocks of data sets are distributed across multiple nodes. Each
node has its own computing power, which gives DFS the ability to process
data blocks in parallel.
b) Replication: the distributed file system also replicates data blocks on
different clusters by copying the same pieces of information onto multiple
clusters on different racks. This helps to achieve the following:
c) Fault Tolerance: recover a data block in case of cluster failure or rack
failure.
d) High Concurrency: make the same piece of data available to multiple
clients at the same time, using the computation power of each node to
process data blocks in parallel.
DFS Advantages
a) Scalability: You can scale up your infrastructure by adding more racks or
clusters to your system.
b) Fault Tolerance: Data replication will help to achieve fault tolerance in
the following cases: Cluster is down, Rack is down, Rack is disconnected
from the network and Job failure or restart.
c) High Concurrency: utilize the compute power of each node to handle
multiple client requests (in a parallel way) at the same time.
DFS Disadvantages
a) In a Distributed File System, nodes and connections need to be secured;
therefore we can say that security is at stake.
b) There is a possibility of loss of messages and data in the network
while moving from one node to another.
c) Database connection in the case of a Distributed File System is complicated.
d) Also, handling of the database is not easy in a Distributed File System as
compared to a single-user system.
Hadoop
Distributed File
System (HDFS)
HDFS Basics
⮚ The Hadoop Distributed File System (HDFS) is based on the Google File
System (GFS)
⮚ Hadoop Distributed File System is responsible for storing data on the
cluster.
⮚ Data files are split into blocks and distributed across multiple nodes in the
cluster.
⮚ Each block is replicated multiple times
– Default is to replicate each block three times
– Replicas are stored on different nodes
– This ensures both reliability and availability
⮚ A distributed file system that provides high-throughput access to
application data.
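The block-splitting described above is simple arithmetic, and is worth making concrete. The sketch below uses the 64 MB default block size quoted in these slides (newer Hadoop releases default to 128 MB) and the default replication factor of 3.

```python
import math

# Back-of-envelope sketch: how many HDFS blocks a file is split into,
# and how many physical replicas end up stored on the cluster.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3

def storage_footprint(file_size_bytes):
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)  # last block may be partial
    return blocks, blocks * REPLICATION

blocks, replicas = storage_footprint(200 * 1024 * 1024)  # a 200 MB file
print(blocks, replicas)  # 4 blocks, 12 stored replicas
```

So a 200 MB file occupies 4 blocks logically but roughly 600 MB of raw disk across the cluster once replication is counted.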
HDFS Architecture
Hadoop Daemons
▪ Hadoop is comprised of five separate daemons
▪ NameNode: Holds the metadata for HDFS
▪ Secondary NameNode
– Performs housekeeping functions for the NameNode
– Is not a backup or hot standby for the NameNode!
▪ DataNode: Stores actual HDFS data blocks
▪ JobTracker: Manages MapReduce jobs, distributes individual tasks
▪ TaskTracker: Responsible for instantiating and monitoring individual Map and
Reduce tasks
Functions of Namenode
⮚ It is the master daemon that maintains and manages the DataNodes
(slave nodes)
⮚ It records the metadata of all the files stored in the cluster, e.g.
The location of blocks stored, the size of the files, permissions,
hierarchy, etc. There are two files associated with the metadata:
● FsImage: Complete state of the file system namespace since the start
of the NameNode.
● EditLogs: All the recent modifications made to the file system with
respect to the most recent FsImage.
⮚ It records each change that takes place to the file system metadata.
Functions of Namenode (Continued..)
⮚ It regularly receives a Heartbeat and a block report from all the
DataNodes in the cluster to ensure that the DataNodes are live.
⮚ It keeps a record of all the blocks in HDFS and in which nodes these
blocks are located.
⮚ The NameNode is also responsible for maintaining
the replication factor.
⮚ In case of a DataNode failure, the NameNode chooses new
DataNodes for new replicas, balances disk usage and manages the
communication traffic to the DataNodes.
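The re-replication step can be sketched as bookkeeping over a block-to-nodes map. The node names, the initial placement, and the deterministic "pick the first candidate" rule are all invented for illustration; the real NameNode uses rack-aware, load-aware selection.

```python
# Toy sketch of what the NameNode does when a DataNode dies: for every
# block that lost a replica, pick a replacement node that does not
# already hold a copy, until the replication factor is restored.
REPLICATION = 3
block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

def handle_datanode_failure(dead, block_map, live_nodes):
    live_nodes.discard(dead)                 # heartbeats stopped: node is gone
    for block, holders in block_map.items():
        holders.discard(dead)                # its replicas are now lost
        while len(holders) < REPLICATION:
            candidates = live_nodes - holders
            holders.add(sorted(candidates)[0])  # deterministic pick for the demo

handle_datanode_failure("dn2", block_map, live_nodes)
print(block_map["blk_1"])  # replica count restored to 3
```

After the failure of dn2, every block is back at three replicas without any data having been lost, since two healthy copies of each block survived.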
Functions of Datanode
⮚These are slave daemons or processes which run on each slave machine.
⮚The actual data is stored on the DataNodes.
⮚The DataNodes perform the low-level read and write requests from the
file system’s clients.
⮚They send heartbeats to the NameNode periodically to report the
overall health of HDFS; by default, this frequency is set to 3 seconds.
Functions of Secondary NameNode
⮚ The Secondary NameNode constantly reads the file system metadata
from the RAM of the NameNode and writes it to the hard disk or the
file system.
⮚ It is responsible for combining the EditLogs with the FsImage from the NameNode.
⮚ It downloads the EditLogs from the NameNode at regular intervals and applies them to
the FsImage.
⮚ The new FsImage is copied back to the NameNode, and is used whenever the
NameNode is started the next time.
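The checkpoint merge described above amounts to replaying a log of changes onto a snapshot. The operations, paths and the size-only metadata below are invented to keep the sketch minimal.

```python
# Minimal sketch of the Secondary NameNode checkpoint: the FsImage is a
# snapshot of the namespace, the EditLog is the list of changes made
# since that snapshot, and merging replays the log onto the snapshot.
fsimage = {"/user/a.txt": 64, "/user/b.txt": 128}   # path -> size metadata
editlog = [
    ("create", "/user/c.txt", 32),
    ("delete", "/user/a.txt", None),
]

def checkpoint(fsimage, editlog):
    merged = dict(fsimage)
    for op, path, size in editlog:
        if op == "create":
            merged[path] = size
        elif op == "delete":
            merged.pop(path, None)
    return merged   # this becomes the new FsImage; the EditLog is then truncated

new_fsimage = checkpoint(fsimage, editlog)
print(sorted(new_fsimage))  # ['/user/b.txt', '/user/c.txt']
```

Keeping the EditLog short this way is why a restarting NameNode can recover quickly: it loads the latest FsImage and replays only the few edits made since the last checkpoint.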
MapReduce(MR)
What is MapReduce?
■MapReduce is a processing technique and a programming model for distributed computing
based on Java.
■The MapReduce algorithm contains two important tasks, namely Map and Reduce.
■Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
■The Reduce task takes the output from a map as its input and combines those data
tuples into a smaller set of tuples.
■As the name MapReduce implies, the reduce task is always performed
after the map job.
■ MapReduce is the system used to process data in the Hadoop cluster.
■ Consists of two phases: Map, and then Reduce.
■ Each Map task operates on a discrete portion (one HDFS Block) of the overall
dataset.
■ MapReduce system distributes the intermediate data to nodes which perform the
Reduce phase.
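The two phases can be sketched end to end with the classic WordCount example. This is an in-memory simulation: real Hadoop runs map tasks per HDFS block and shuffles intermediate pairs over the network, whereas here the shuffle is just a dictionary grouping values by key.

```python
from collections import defaultdict

# In-memory sketch of the MapReduce phases for WordCount.
def map_phase(line):
    for word in line.split():
        yield (word.lower(), 1)       # emit one (key, value) pair per word

def reduce_phase(word, counts):
    return (word, sum(counts))        # combine all values for one key

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

shuffle = defaultdict(list)           # shuffle & sort: group values by key
for line in lines:
    for word, one in map_phase(line):
        shuffle[word].append(one)

result = dict(reduce_phase(w, c) for w, c in shuffle.items())
print(result)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

Each mapper only ever sees its own portion of the input, and each reducer only ever sees all the values for one key; that independence is what lets the real system spread both phases across a cluster.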
MapReduce WordCount Example
Hadoop MapReduce WordCount Example
Hadoop MapReduce WordCount Example
(Continued..)
Hadoop MapReduce WordCount Example
(Continued...)
Hadoop MapReduce WordCount Example
(Continued....)
Hadoop MapReduce Working