Big Data: Concept and Applications
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them with on-hand database management tools or traditional data processing applications.
When the amount of data is beyond the storage and processing capabilities of a single physical machine, it is called big data.
What is big data?
- Large volume of data
- Existing tools were not designed to handle such huge data
Gigabyte → Terabyte → Petabyte → Exabyte → Zettabyte
Amazon → collects social data, log data, and data of many different flavors.
Walmart → handles more than 1 million customer transactions every hour.
Twitter → 300,000 tweets per minute
Instagram → 250,000 new pictures uploaded per minute
Email → 5 million messages (Gmail)
WhatsApp → 400,000 pictures per minute
Google → 5 million search requests per minute
Facebook → 2.5 million content items per minute (about 500 TB per day)
Handling bigger data requires different approaches: techniques, tools, and architecture, with the aim of solving new problems, or old problems in a better way.
Big data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
Big data: the 3 Vs
• Variety: data coming from various sources
• Velocity: real-time, live streaming data
• Volume: on the order of terabytes and petabytes
Big data is everywhere: network analysis, social networks, web graphs.
Big data: Volume
- The volume of data is increasing every second; data is now measured in TB up to ZB.
- The amount of data doubles roughly every two years.
- About 100 terabytes of data are uploaded to Facebook daily.
- 100 hours of video are uploaded every minute.
- Research estimates 65% annual growth in digital content, mainly unstructured data.
Gigabyte → Terabyte → Petabyte → Exabyte → Zettabyte
Big data: Velocity
Data is created in real time. The Internet of Things (IoT) and social media are major contributors to the speed at which data is generated.
In every minute:
- 25 million queries on Google
- 20 million photos are viewed on Flickr
- over 200 million emails are sent
Big data: Variety
Data comes in all shapes: structured, semi-structured, unstructured, and even complex structures.
About 90% of the data generated is unstructured, ranging from text to audio, image, and video data.
Big Data Life Cycle
[Figure: storage capacity has grown from MB-scale around 2000 to PB-scale by 2018 and toward 2025, while processing speed has grown far more slowly.]
Solution: Big Data
Hadoop
Apache Hadoop is a framework for storing, processing, and analyzing big data. It is:
• Distributed
• Scalable
• Open source
Why Hadoop?
CASE 1: 1 TB of data is processed by a single computer. The computer has 4 I/O channels of 100 MB/s each. Total time required: about 44 minutes.
CASE 2: The same 1 TB is processed by 10 computers (same configuration) in parallel. Total time required: about 4.4 minutes.
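To make the comparison concrete, here is a back-of-the-envelope calculation of the two cases; it assumes 1 TB = 1,048,576 MB and 4 I/O channels of 100 MB/s per machine, matching the figures above.

```python
# Rough check of the CASE 1 / CASE 2 timings quoted above.
# Assumptions: 1 TB = 1024 * 1024 MB, and each machine reads through
# 4 I/O channels of 100 MB/s each (400 MB/s aggregate per machine).

DATA_MB = 1024 * 1024        # 1 TB expressed in MB
CHANNELS = 4                 # I/O channels per machine
CHANNEL_MB_PER_S = 100       # throughput of a single channel

def read_time_minutes(machines: int) -> float:
    """Time to scan the data when it is spread evenly over `machines`."""
    per_machine_mb = DATA_MB / machines
    throughput = CHANNELS * CHANNEL_MB_PER_S   # MB/s available per machine
    return per_machine_mb / throughput / 60

print(f"CASE 1 (1 machine):   {read_time_minutes(1):.1f} min")    # ~43.7, i.e. ~44 minutes
print(f"CASE 2 (10 machines): {read_time_minutes(10):.1f} min")   # ~4.4 minutes
```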
HDFS (Hadoop Distributed File System)
- Stores data on the cluster
- HDFS is a file system written in Java
- Provides storage for massive amounts of data
- Scalable
- Fault tolerant
- Supports efficient processing with MapReduce
Hadoop: How are files stored?
- Data files are split into blocks, which are distributed across the data nodes
- Each block is replicated on multiple nodes (default replication factor: 3)
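The sketch below is a simplified illustration (not Hadoop source code) of that idea: a file is cut into fixed-size blocks, and each block is assigned to several data nodes. The 128 MB block size and 3x replication are the usual HDFS defaults assumed here; real HDFS placement is also rack-aware.

```python
import itertools

BLOCK_SIZE_MB = 128   # typical HDFS default block size (assumption)
REPLICATION = 3       # default replication factor

def num_blocks(file_size_mb: int) -> int:
    """How many blocks a file of the given size occupies (ceiling division)."""
    return -(-file_size_mb // BLOCK_SIZE_MB)

def place_blocks(blocks: int, data_nodes: list[str]) -> dict[int, list[str]]:
    """Assign each block to REPLICATION data nodes in round-robin fashion."""
    cycle = itertools.cycle(data_nodes)
    return {b: [next(cycle) for _ in range(REPLICATION)] for b in range(blocks)}

# A hypothetical 500 MB file on a 5-node cluster: 4 blocks, 3 copies each.
print(place_blocks(num_blocks(500), ["dn1", "dn2", "dn3", "dn4", "dn5"]))
```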
HDFS (master/slave architecture)
- The master machine is the NameNode
- The slave machines are DataNodes
MAPREDUCE
MapReduce is a framework for executing highly parallelizable and distributable algorithms across huge datasets.
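For illustration, the classic word-count example below expresses the idea as a map function and a reduce function. This is a minimal, purely local Python sketch of the MapReduce model, not the Hadoop Java API; in a real cluster, mappers and reducers of exactly this shape would run in parallel on different nodes (for example via Hadoop Streaming).

```python
from collections import defaultdict

def map_phase(line: str):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    """Reducer: sum all counts emitted for one word."""
    return word, sum(counts)

def word_count(lines):
    # Shuffle/sort step: group intermediate (word, 1) pairs by key.
    grouped = defaultdict(list)
    for line in lines:                       # mappers could run in parallel
        for word, one in map_phase(line):
            grouped[word].append(one)
    # Reducers could also run in parallel, one per key group.
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data is big", "hadoop processes big data"]))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```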
MapReduce: mappers run in parallel
MapReduce: analyzing data
Basic Cluster Configuration
HADOOP ECOSYSTEM
HADOOP
Hadoop = HDFS + MapReduce
- Hadoop HDFS commands are similar to Unix commands (see the sketch below)
- MapReduce is the programming model
- Hive → data manipulation (SQL-like)
- Pig → data manipulation using scripts
- Sqoop → import and export of data to/from HDFS
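To illustrate how HDFS shell commands mirror familiar Unix commands, the sketch below calls a few of them from Python via subprocess. It assumes a working Hadoop installation with `hdfs` on the PATH; the file and directory names are hypothetical.

```python
import subprocess

def hdfs(*args: str) -> None:
    """Run an HDFS shell command, e.g. hdfs('-ls', '/')."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Unix-like commands, but operating on the distributed file system:
hdfs("-mkdir", "-p", "/user/demo")             # like: mkdir -p
hdfs("-put", "local_data.csv", "/user/demo")   # copy from the local FS into HDFS
hdfs("-ls", "/user/demo")                      # like: ls
hdfs("-cat", "/user/demo/local_data.csv")      # like: cat
```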
Import/Export using Sqoop and Flume
- Sqoop: transfers data between an RDBMS and HDFS
- Flume: a service to move large amounts of data in real time
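As a rough sketch of a Sqoop transfer, the snippet below launches a `sqoop import` from Python; the JDBC connection string, credentials, table name, and target directory are placeholders, and the exact options may vary with your Sqoop version.

```python
import subprocess

# Hypothetical example: copy the `orders` table from MySQL into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/shop",   # placeholder JDBC URL
    "--username", "etl_user",                          # placeholder credentials
    "--password", "secret",
    "--table", "orders",                               # source RDBMS table
    "--target-dir", "/user/demo/orders",               # destination in HDFS
    "--num-mappers", "4",                              # parallel map tasks
], check=True)
```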
Applications
E-commerce (Amazon)
- Recommendation engines
- User buying patterns
- Digital marketing analysis
Telecommunications
- Call-drop analysis
- Network problem optimization
Entertainment
- Content analytics (Netflix)
Sports
- Fitness management (Fitbit)
Health care
- Early disease detection (Pfizer)
Applications
Technology: Websites such as eBay, Amazon, Facebook, and Google make heavy use of big data.
Private sector: Applications of big data in the private sector include retail, retail banking, and real estate.
Government: Big data is also utilized by the Indian government.
International development: Advances in big data analysis provide cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, crime, security, and natural disasters. In this way, big data supports international development.
Questions?
Thank you.
