Introduction to Big Data
Prepared By: Marwan A. Al-Wajeeh
Outline
Big Data: An Overview
Big Data Sources
What Is Big Data?
Big Data Challenges
Big Data Analytics
Big Data: An Overview
More than 2.5 quintillion bytes of data are created EVERY DAY.
IBM: 90 percent of the world's data today was produced in the last two years.
80% of the world's data is unstructured.
Facebook processes 500 TB of data per day.
Lots and lots of web pages (Google indexes roughly 20 billion web pages).
More than a billion Facebook users.
Billions of Facebook pages.
Hundreds of millions of Twitter accounts.
Hundreds of millions of tweets per day.
Billions of Google queries per day.
Millions of servers, petabytes of data.
Big Data
Internet of Events: 4 sources of event data
Big Data Sources
What Is Big Data?
Big Data is a collection of data sets that are large and complex in nature.
Big Data is any data that is expensive to manage and hard to extract value from.
Big data sets contain both structured and unstructured data, and they grow so large, so fast, that they cannot be managed by traditional relational database systems or conventional statistical tools.
Big Data: Four Challenges (4 V's)
Volume: the size of data
 Google example:
 10 billion web pages
 Average size of a web page = 20KB
 10 billion * 20KB = 200 TB
 Disk read bandwidth = 50MB/sec
 Time to read = 4 million seconds = 46+ days (see the sketch after this slide)
 Airbus A380 example:
 Each of the A380's four engines generates about 1 PB of data on a single flight, for example from London (LHR) to Singapore (SIN).
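A quick back-of-envelope check of the Google numbers above, as a minimal Python sketch (the figures are the slide's illustrative values, not measurements):

```python
pages = 10_000_000_000          # 10 billion web pages
page_size = 20_000              # 20 KB per page (decimal units)
bandwidth = 50_000_000          # 50 MB/sec disk read bandwidth

total_bytes = pages * page_size        # 2e14 bytes = 200 TB
seconds = total_bytes / bandwidth      # 4,000,000 seconds
print(seconds, seconds / 86_400)       # 4000000.0 -> ~46.3 days
```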
Velocity (speed of change).
 We are not only generating a huge amount of data; the data is continuously being added to, and things change very rapidly.
Variety (different types of data sources).
 The diversity of sources, formats, quality, and structure.
Veracity (uncertainty of data).
 We can never be completely sure that the data has been recorded completely and accurately.
Traditional vs Big Data
Big Data Analytics
Big data analytics is the process of collecting, organizing, and analyzing large data sets ("big data") to discover patterns and other useful information (a toy sketch follows below).
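To make the collect → organize → analyze loop concrete, here is a minimal sketch assuming pandas and a made-up table of user events (the column names and data are hypothetical, not from the slides):

```python
import pandas as pd

# collect: hypothetical raw event records (real pipelines ingest these
# from logs, sensors, social feeds, and other disconnected systems)
events = pd.DataFrame({
    "user":   ["a", "b", "a", "c", "b", "a"],
    "action": ["view", "view", "buy", "view", "buy", "view"],
})

# organize: group the raw records by user and action
organized = events.groupby(["user", "action"]).size().rename("count")

# analyze: surface a simple pattern, e.g. which users buy most often
buys = organized.xs("buy", level="action").sort_values(ascending=False)
print(buys)
```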
Traditional vs Big Data Analytics
Traditional Analytics:
Analytics over known data that is well understood.
Built on relational data models.
Big Data Analytics:
Data formats are not well understood, since the data is largely unstructured or semi-structured.
Data comes in various forms and formats from multiple disconnected systems, and is mostly flat, with no relationships between records.
Analytical Challenges with Big Data
 Traditional RDBMSs fail to handle Big Data:
Big data (terabytes) cannot fit in the memory of a single computer.
Processing big data on a single computer takes a very long time.
Scaling a traditional RDBMS is expensive.
Single Node Architecture
[Diagram: a single node with CPU, memory, and disk]
In classical machine learning and statistics, the algorithm runs on the CPU and accesses data that is in memory; the data is brought from disk into memory first.
What happens if the data is so big that it can't all fit in memory at the same time? (One single-node workaround is sketched below.)
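When the data cannot all fit in memory at once, one common single-node workaround is to stream it from disk in pieces. A minimal sketch, assuming a line-oriented data file at the hypothetical path "data.txt":

```python
def stream_chunks(path, chunk_size=64 * 1024 * 1024):
    # Read the file 64 MB at a time instead of loading it whole.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

records = 0
for chunk in stream_chunks("data.txt"):
    records += chunk.count(b"\n")   # per-chunk work: count records
print(records)
```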
Google Example
 10 billion web pages
Average size of a web page = 20KB
10 billion * 20KB = 200TB
Disk read bandwidth = 50MB/sec
Time to read = 4 million seconds = 46+ days
Thus: this is unacceptable, and we need a better solution.
 Cluster computing emerged as the new solution.
The fundamental idea is to split the data into chunks: with 1,000 disks and CPUs, the job finishes in roughly an hour (simulated in the sketch below).
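A minimal single-machine simulation of the chunking idea, using Python's multiprocessing in place of 1,000 real disks and CPUs (the dataset and per-chunk work here are made up for illustration):

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real per-chunk work (parsing, filtering, counting...)
    return chunk.count(b"x")

if __name__ == "__main__":
    data = b"x" * 10_000_000                  # pretend 10 MB dataset
    workers = 8                               # the real cluster would use ~1,000
    size = len(data) // workers
    chunks = [data[i * size:(i + 1) * size] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(process_chunk, chunks)
    print(sum(partials))                      # combine the partial results
```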
Cluster Architecture
[Diagram: racks of commodity nodes, each with CPU, memory, and disk, connected by switches]
Each rack contains 16-64 nodes.
1 Gbps bandwidth between any pair of nodes in a rack.
2-10 Gbps backbone between racks.
Multiple racks together form a data center.
(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)
Even once we have this kind of cluster, the problem is not completely solved.
Cluster Computing Challenges
 Node failure
A single server stays up for about 3 years (1,000 days).
1,000 servers in a cluster => about 1 failure per day.
A million servers in a cluster => about 1,000 failures per day; Google has approximately a million servers (see the sketch below).
 How do we store data persistently and keep it available when nodes can fail?
 How do we deal with node failure during a long-running computation?
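The failure arithmetic above, as a quick sketch:

```python
mean_uptime_days = 1000            # one server stays up ~3 years on average
for servers in (1_000, 1_000_000):
    failures_per_day = servers / mean_uptime_days
    print(f"{servers} servers -> ~{failures_per_day:g} failures/day")
```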
 Network bottleneck
Network bandwidth = 1 Gbps.
Moving 10 TB takes approximately 1 day (checked in the sketch below).
A complex computation might need to move a lot of data, and that can slow the computation down.
We need a framework that doesn't move data around so much while it is doing computation.
 Distributed programming is hard!
It is hard to write distributed programs correctly.
We need a simple model that hides most of the complexity of distributed programming.
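And the transfer-time arithmetic for the network bottleneck:

```python
data_bits = 10e12 * 8        # 10 TB expressed in bits
bandwidth_bps = 1e9          # 1 Gbps network link
seconds = data_bits / bandwidth_bps
print(seconds / 86_400)      # ~0.93, i.e. roughly a day
```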
Map-Reduce
MapReduce addresses the challenges of cluster computing:
Store data redundantly on multiple nodes for persistence and availability.
Move computation close to the data to minimize data movement.
Provide a simple programming model that hides the complexity of all this magic (a toy simulation follows below).
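A toy, single-process simulation of the MapReduce model; real Hadoop-style systems distribute the map and reduce phases across the cluster and handle the shuffling, replication, and failures for you:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (word, 1) pair for every word in the document
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # shuffle + reduce: group the pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data needs big clusters", "move computation to the data"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
print(reduce_phase(pairs))   # {'big': 2, 'data': 2, 'needs': 1, ...}
```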
Big Data Analytics Tools and Technologies
Hadoop = MapReduce + HDFS
Ecosystem tools: Pig, Hive, HBase, Flume, RHadoop, Sqoop, Oozie, Avro, ZooKeeper.
4 Types of Analytics
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What is the best that can happen?
Analytics tools:
SAS
IBM SPSS
Stata
R
MATLAB
 The key aspects of a big data platform are: integration, analytics, visualization, development, workload optimization, security, and governance.
The 5 High-Value Big Data Use Cases
Thank You