Big Data & Hadoop
Made by: Akshansh Agarwal
ICITSS, ICAI
2018
What is Big Data?
Big Data refers to very large data sets, generated at high velocity and in a wide variety of forms, often ranging into hundreds of terabytes.
This data is created by everything that is digitized or connected to electronic devices: it is generated by what you store on your mobile phone and desktop and by your activities on them.
Data is also produced when you visit places such as hospitals, shopping malls, retail stores, events and restaurants.
Big Data is generated by non-human sources as well, such as airplanes, satellites, electric poles and connected cars.
Characteristics of Big Data
Volume – Data sizes have grown to terabytes and beyond, in the
form of records or transactions.
Variety – Data comes in many forms: structured,
semi-structured or unstructured.
Velocity – Data arrives in huge volumes and must be
assimilated at near or real-time speed.
Sources of Big Data
• Social Media
• Banks
• Instruments
• Websites
• Stock Market
Uses of Big Data
1. Recommendation Engines
2. Analyzing call record details
3. Sentiment Analysis
4. Market-Basket Analysis
5. Pricing Optimization
6. Behavioral Analytics
7. Fraud Detection
8. Customer Satisfaction
What is Hadoop?
Hadoop is an open-source Big Data management tool from the ASF, the
Apache Software Foundation. Being open source means it is freely
available and its source code can be modified as needed: if some
functionality does not meet your requirements, you can change it to
suit your needs.
Hadoop is not just a storage system; it is a platform for storing and
processing huge volumes of data.
Why Hadoop?
Now, let us understand why Hadoop is so popular and is widely reported to
hold more than 90% of the Big Data market.
1. Open Source
2. Distributed Processing
3. Fault Tolerant
4. Reliability
5. High availability
6. Scalability
7. Economical
8. Easy to use
9. Data Locality
Hadoop Components (Core)
• HDFS
• Map Reduce
• YARN
HDFS
(Hadoop Distributed File System)
HDFS, the Hadoop Distributed File System, manages big data sets characterized
by high volume, velocity and variety. HDFS implements a master-slave
architecture: the master is the Name Node and the slaves are the Data Nodes.
Features:
• Scalable
• Reliable
• Runs on commodity hardware
HDFS is the component best known for Big Data storage.
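The master-slave split described above can be illustrated with a small Python sketch. All names here are hypothetical and the numbers are toy values (real HDFS defaults are 128 MB blocks and a replication factor of 3, with rack-aware placement):

```python
BLOCK_SIZE = 128   # toy block size in bytes (real HDFS default: 128 MB)
REPLICATION = 3    # real HDFS default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # The client splits a file into fixed-size blocks before writing
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class NameNode:
    """Master: holds only metadata -- which Data Nodes store each block."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}   # block id -> list of Data Node names

    def place_block(self, block_id):
        # Round-robin placement of each replica (real HDFS is rack-aware)
        nodes = [self.datanodes[(block_id + r) % len(self.datanodes)]
                 for r in range(REPLICATION)]
        self.block_map[block_id] = nodes
        return nodes

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = split_into_blocks("x" * 300)          # 300 bytes -> 3 toy blocks
placements = [nn.place_block(i) for i in range(len(blocks))]
print(len(blocks), placements[0])              # 3 ['dn1', 'dn2', 'dn3']
```

The point of the sketch is the division of labor: the Name Node keeps only the block map (metadata), while the actual bytes live on the Data Nodes.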
Map Reduce
Map Reduce is a programming model designed to process high-volume
distributed data. The platform is built in Java. Map Reduce includes two
daemons: the Job Tracker and the Task Tracker.
Features:
• Functional programming model.
• Works very well on Big Data.
• Can process very large data sets.
Map Reduce is the main component for processing big data.
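The map/shuffle/reduce flow behind this model can be sketched in plain Python with the classic word-count example. This is a conceptual illustration only; real Hadoop jobs are written against the Java Map Reduce API:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input record
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
```

Because each map call depends only on its own record, the framework can run map tasks on many machines in parallel, which is what makes the model work so well on big data.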
YARN
(Yet Another Resource Negotiator)
YARN stands for Yet Another Resource Negotiator. It is also called
MapReduce 2 (MRv2). The two major responsibilities of the Job Tracker in MRv1
(resource management and job scheduling/monitoring) are split into separate
daemons: the Resource Manager, the Node Manager and the Application Master.
Features:
• Better resource management.
• Scalability.
• Dynamic allocation of cluster resources.
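The master-slave split of resource management can be sketched in Python. Names and the first-fit policy are hypothetical simplifications; real YARN schedules containers over memory and vcores with much richer policies:

```python
class NodeManager:
    """Slave: reports capacity and runs containers on one machine."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Master: global scheduler that grants containers to applications."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, needed_mb):
        # First-fit allocation (real YARN schedulers are far richer)
        for node in self.nodes:
            if node.free_mb >= needed_mb:
                node.free_mb -= needed_mb
                return node.name
        return None   # cluster cannot satisfy the request right now

rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 4096)])
print(rm.allocate(1024))   # nm1
print(rm.allocate(2048))   # nm2 (nm1 has only 1024 MB left)
```

Separating this global scheduling (Resource Manager) from per-job bookkeeping (Application Master) is exactly the split that let MRv2 scale beyond the single Job Tracker of MRv1.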
Hadoop Daemons
Hadoop - 1
• Name Node
• Data Node
• Secondary Name Node
• Job Tracker
• Task Tracker
Hadoop - 2
• Name Node
• Data Node
• Secondary Name Node
• Resource Manager
• Node Manager
HDFS (Storage / Read-Write):
• Name Node
• Data Node
MR / YARN (Processing):
• Job Tracker / Resource Manager
• Task Tracker / Node Manager
Hadoop Master-Slave Architecture
HDFS:
• Master: Name Node
• Slave: Data Node
MR / YARN:
• Master: Resource Manager
• Slave: Node Manager
Hadoop Cluster (4 Nodes)
MASTER:
1. Name Node
2. Resource Manager
SLAVE:
1. Data Node
2. Node Manager
Secondary Name Node
• The Secondary Name Node is not a hot standby for the Name Node.
• It periodically (hourly by default) checkpoints the Name Node's metadata
by merging the edit log into the fsimage snapshot.
• Its checkpoint can be used to restart a crashed Hadoop cluster.
• The Secondary Name Node is an important daemon in Hadoop
and it runs on a master machine.
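The checkpoint idea described above can be sketched in Python. This is a conceptual illustration with hypothetical names; the real Secondary Name Node merges HDFS edit-log files into the on-disk fsimage:

```python
def checkpoint(fsimage, edit_log):
    # Merge the pending edit-log operations into a fresh metadata snapshot
    new_image = dict(fsimage)
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
    # Return the new snapshot and an emptied edit log
    return new_image, []

image = {"/data/a.txt": "block-1"}
edits = [("create", "/data/b.txt", "block-2"), ("delete", "/data/a.txt", None)]
image, edits = checkpoint(image, edits)
print(image)   # {'/data/b.txt': 'block-2'}
```

This is also why it is not a hot standby: it produces a periodic snapshot for recovery, but it never serves client requests in the Name Node's place.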
Conclusion
• To conclude: data is being created all the time and everywhere, and it
now plays an important role in the growth of businesses and in
people's lifestyles.
• Collecting and storing data is not enough; processing and managing it,
with platforms like Hadoop, is what makes the data useful to study
and exploit.
Finally, I would like to conclude my presentation with a quote by
JESS GREENWOOD:
“Data is the new oil. It’s only useful when it is refined.”