BigData Analytics
AMSTECH Incorporation Pvt. Ltd.
Presented By: Mayank Kumar Sharma
Internet = Ocean of information
What is BigData?
What makes data, “Big” Data?
Why BigData?
What is BigData?
“Extremely large data sets that may be analyzed
computationally to reveal patterns, trends, and
associations, especially relating to human behavior
and interactions, are known as BigData.”
OR
BigData is the term for a collection of data sets
so large and complex that it becomes difficult to
process using on-hand database management tools or
traditional data processing applications.
What is BigData?
Gartner definition (2012): “BigData is high
volume, high velocity, and/or high variety information
assets that require new forms of processing to enable
enhanced decision making, insight discovery and
process optimization.”
“No exact Definition, Only Experience.”
What makes data, “Big” Data?
Every day, we create 2.5 quintillion bytes of data, so
much that 90% of the data in the world today has been
created in the last two years alone.
An example of big data might be petabytes (1,024
terabytes) or exabytes (1,024 petabytes) of data consisting
of billions to trillions of records of millions of people.
Storage capacity increases 23% on average annually.
Exponential growth over the decade starting from 2010.
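The unit sizes quoted on this slide (1 petabyte = 1,024 terabytes, 1 exabyte = 1,024 petabytes) can be checked with a short Python sketch. The helper names `to_bytes` and `convert` and the unit table are illustrative assumptions, not part of the original slides. Binary (1,024-based) multiples are used because that is what the slide quotes.

```python
# Binary multiples, matching the slide's "1 petabyte = 1,024 terabytes".
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1,024-based)."""
    return value * 1024 ** UNITS.index(unit)

def convert(value, from_unit, to_unit):
    """Convert a value between two units from the UNITS table."""
    return to_bytes(value, from_unit) / to_bytes(1, to_unit)

print(convert(1, "PB", "TB"))  # 1024.0
print(convert(1, "EB", "PB"))  # 1024.0
```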
• Creates over 30 billion pieces of content per day.
• Stores 30 petabytes of data.
• 90 million tweets per day.
Why BigData?
To Manage Data Better.
[Abstraction has enabled numerous use cases for
data in a wide variety of formats.]
Benefit From Speed, Capacity and Scalability of Cloud
Storage.
[For those who utilize substantially large data sets, the cloud provides both the
storage and the computing power necessary to crunch data for a specific period.]
End Users Can Visualize Data.
[Data in easy-to-read charts, graphs and slideshows.]
Why BigData?
Find New Business Opportunities.
[Social media, Business Intelligence]
Data Analysis Methods, Capabilities Will Evolve
[For those who utilize substantially large data sets, the cloud provides both the
storage and the computing power necessary to crunch data for a specific period.]
Why BigData?
Who uses BigData?
1. Banking
2. Education
3. Government
4. Health Care
5. Manufacturing
6. Retail
“It’s important to remember that the primary value
from big data comes not from the data in its raw form, but
from the processing and analysis of it and the insights,
products, and services that emerge from analysis.”
BigData Challenges
Characteristics of Big Data:
Big data can be characterized by the 3Vs:
Volume, Velocity and Variety.
Volume : BigData 3Vs
Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially.
Exponential increase in collected/generated data.
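As a rough check on the volume figures above, the growth factor and the implied yearly growth can be computed directly. This is a minimal sketch that assumes steady compound growth across the 11 years from 2009 to 2020, which the slide does not state. The function names are illustrative.

```python
def growth_factor(start, end):
    """How many times larger the end value is than the start value."""
    return end / start

def annual_growth_rate(start, end, years):
    """Compound annual growth rate, assuming steady growth every year."""
    return (end / start) ** (1 / years) - 1

# Figures from the slide, in zettabytes.
factor = growth_factor(0.8, 35)
rate = annual_growth_rate(0.8, 35, 2020 - 2009)

print(f"growth factor: about {factor:.0f}x")
print(f"implied annual growth: {rate:.0%}")
```

Under that assumption, 0.8 to 35 zettabytes is a factor of about 44, which works out to roughly 41% growth per year.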
Variety : BigData 3Vs
Various formats, types, and structures.
Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc.
Static data vs. streaming data.
A single application can be generating/collecting many types
of data.
To extract knowledge, all these types of data need
to be linked together.
Velocity : BigData 3Vs
Data is being generated fast and needs to be processed fast.
Online Data Analytics
Late decisions → missing opportunities
Examples
• E-Promotions: based on your current location, your purchase history and
what you like → send promotions right now for the store next to you.
• Healthcare monitoring: sensors monitor your activities and body
→ any abnormal measurements require immediate reaction.
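The healthcare monitoring example can be sketched as a stream processor that reacts to each reading as it arrives instead of waiting for a batch. The heart-rate range, the sensor IDs and the function name `alerts` are hypothetical values chosen for illustration only.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical normal range for heart rate in beats per minute (illustration only).
NORMAL_RANGE = (50, 120)

def alerts(readings: Iterable[Tuple[str, int]]) -> Iterator[str]:
    """Yield an alert as soon as an out-of-range reading arrives."""
    low, high = NORMAL_RANGE
    for sensor_id, bpm in readings:
        if not low <= bpm <= high:
            yield f"ALERT {sensor_id}: heart rate {bpm} bpm"

# A small in-memory list stands in for a live sensor feed.
stream = [("s1", 72), ("s2", 135), ("s1", 48), ("s2", 90)]
for message in alerts(stream):
    print(message)
```

Because `alerts` is a generator, each reading is checked the moment it is consumed, which is the point of the velocity requirement: a late decision is a missed opportunity.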
Research Study
Shim, K. S., Lee, S. K. and Kim, M. S., “Application Traffic Classification
in Hadoop Distributed Computing Environment”, Asia-Pacific
Network Operations and Management Symposium (APNOMS), 2014.
1. This research work proposes an application traffic
classification method for a Hadoop distributed computing environment.
2. Traffic patterns in current networks have changed, and
conventional traffic analysis methods are no longer adequate.
3. In the proposed solution, the authors consider packet units of
traffic from a campus network. Collected packets are converted
into flow format by a flow generator. A flow is
defined by its 5-tuple.
Conclusion
4. The proposed method performs well in terms of processing
speed, as shown by a comparison between the Hadoop-based
system and a single-server system.
5. On the other hand, it has certain drawbacks:
1. Adoption of a classification technique rather than
clustering.
2. Low analysis rate.
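The packet-to-flow conversion described in the research study can be sketched as a grouping step. The slides do not list the 5-tuple fields. This sketch assumes the conventional ones (source IP, destination IP, source port, destination port, protocol), and the `Packet` type and `build_flows` function are illustrative, not the paper's flow generator.

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    size: int  # payload size in bytes

def build_flows(packets):
    """Group packets into flows keyed by the 5-tuple; count packets and bytes."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.protocol)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p.size
    return dict(flows)

packets = [
    Packet("10.0.0.1", "10.0.0.9", 50000, 80, "TCP", 600),
    Packet("10.0.0.1", "10.0.0.9", 50000, 80, "TCP", 400),
    Packet("10.0.0.2", "10.0.0.9", 50001, 53, "UDP", 80),
]
for key, stats in build_flows(packets).items():
    print(key, stats)
```

The first two packets share a 5-tuple and collapse into one flow; the third forms its own flow.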
Existing Solution for Traffic Classification
BigData Analytics
BigData Analytics Use Cases
Real Time Intelligence, Data Discovery, Business Intelligence
Data Scientist, Business User, Consumer
BigData: Hadoop
1. Hadoop is a free, Java-based programming framework
that supports the processing of large data sets in a
distributed computing environment.
2. The Hadoop Distributed File System (HDFS) is designed
to store very large data sets reliably, and to stream those
data sets at high bandwidth to user applications.
3. By distributing storage and computation across many
servers, the resource can grow with demand while
remaining economical at every size.
4. An important characteristic of Hadoop is the partitioning
of data and computation across many (thousands of)
hosts, and the execution of application computations in parallel
close to their data.
5. A Hadoop cluster scales computation capacity, storage
capacity and I/O bandwidth by simply adding commodity
servers.
6. In simple words, it is a scalable, fault-tolerant grid
operating system for data storage and processing with
high bandwidth and clustered storage.
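The idea of partitioning data and running the computation in parallel on each partition can be illustrated on a single machine. This is only a simulation: worker threads stand in for cluster hosts, a word count stands in for the application, and none of the names below come from the Hadoop API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_parts):
    """Split records into n_parts partitions (round-robin)."""
    return [records[i::n_parts] for i in range(n_parts)]

def count_words(lines):
    """Work done on one partition: local word counts for its own lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def word_count(records, n_parts=3):
    """Count words per partition in parallel, then merge the partial results."""
    parts = partition(records, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_words, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

lines = ["big data", "big clusters", "data nodes store data"]
print(word_count(lines))
```

Adding partitions and workers is the single-machine analogue of adding commodity servers: each worker only touches its own share of the data, and only the small partial counts are merged.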
Figure 2: HADOOP Components
Figure 3: HDFS Processing
NameNode
1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode stores only the metadata of HDFS: the
directory tree of all files in the file system. It also tracks the
files across the cluster.
4. NameNode does not store the actual data or the dataset.
The data itself is stored in the DataNodes.
5. NameNode knows the list of blocks and their locations
for any given file in HDFS. With this information,
NameNode knows how to construct the file from its blocks.
6. NameNode is so critical to HDFS that when the
NameNode is down, the HDFS/Hadoop cluster is inaccessible
and considered down.
7. NameNode is a single point of failure in a Hadoop cluster,
unless NameNode high availability (introduced in Hadoop 2) is configured.
8. NameNode is usually configured with a lot of memory
(RAM), because the block locations are held in main
memory.
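The split between NameNode metadata and DataNode storage can be modelled with plain dictionaries. This is a toy model for illustration, not how HDFS is implemented. The path, block IDs and node names are made up.

```python
# NameNode-style metadata: file path -> ordered block IDs, block ID -> DataNodes.
file_blocks = {"/logs/app.log": ["blk_1", "blk_2"]}
block_locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]}

# DataNode-style storage: each node holds the actual bytes of its blocks.
datanode_storage = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}

def read_file(path):
    """Reassemble a file by fetching each block from one DataNode that holds it."""
    data = b""
    for block_id in file_blocks[path]:
        node = block_locations[block_id][0]  # pick the first listed replica
        data += datanode_storage[node][block_id]
    return data

print(read_file("/logs/app.log"))  # b'hello world'
```

The two metadata dictionaries are small and are consulted on every read, which is why the NameNode keeps them in memory, while the bulk bytes live only on the DataNodes.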
DataNode
1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave.
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the
NameNode along with the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability
of data or the cluster. NameNode will arrange replication
of the blocks managed by the DataNode that is not
available.
6. DataNode is usually configured with a lot of hard disk space,
because the actual data is stored in the DataNode.
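The start-up announcement and the re-replication after a DataNode failure can be sketched as follows. This is a simplified model under stated assumptions: a target of two replicas per block, made-up node and block names, and a "copy" that only updates the location table. It does not reflect the actual HDFS protocol.

```python
REPLICATION = 2  # assumed target number of replicas per block

def block_report(locations, node, blocks):
    """Register a DataNode's block list, as announced on start-up."""
    for block_id in blocks:
        locations.setdefault(block_id, set()).add(node)

def handle_node_failure(locations, live_nodes, failed):
    """Drop a failed DataNode and re-replicate its under-replicated blocks."""
    live_nodes.discard(failed)
    for block_id, nodes in locations.items():
        nodes.discard(failed)
        if not nodes:
            continue  # no surviving replica left to copy from
        candidates = sorted(live_nodes - nodes)
        while len(nodes) < REPLICATION and candidates:
            nodes.add(candidates.pop(0))  # copy the block to another live node

locations = {}
live = {"dn1", "dn2", "dn3"}
block_report(locations, "dn1", ["blk_1"])
block_report(locations, "dn2", ["blk_1", "blk_2"])
block_report(locations, "dn3", ["blk_2"])

handle_node_failure(locations, live, "dn2")
print(locations)
```

After `dn2` fails, both blocks are back to two replicas on the remaining nodes, which is why losing a single DataNode does not affect data availability.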
Operation series when writing a file
Operation series when reading a file
Hadoop Configuration
Thanks A Lot
AMSTECH Incorporation Pvt. Ltd.
By: Mayank Kumar Sharma
