This document discusses big data and its characteristics. It notes that 2.5 quintillion bytes of data are created every day, with 90% created in just the last two years. Big data comes from many sources and is defined as data that requires new techniques to manage and extract value from due to its scale, diversity and complexity. Examples are provided of the large amounts of data generated by companies like Google, Facebook and CERN. The types of data discussed include relational, text, semi-structured, graph and streaming data. The characteristics of big data outlined are its scale, complexity and speed of generation. The challenges in handling big data are the need for new architectures, algorithms and technical skills to manage and analyze the large and
2. “Every day, we create 2.5 quintillion bytes of data — so much that
90% of the data in the world today has been created in the last two
years alone. This data comes from everywhere: sensors used to
gather climate information, posts to social media sites, digital
pictures and videos, purchase transaction records, and cell phone
GPS signals to name a few.
This data is “big data.”
Big Data
3. Big Data
• No single standard definition…
“Big Data” is data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it…
4. Big Data Everywhere!
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• purchases at department/
grocery stores
• Bank/Credit Card
transactions
• Social Network
5. How much data?
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hydron Collider (LHC) generates 15 PB a
year
640K ought to be
enough for anybody.
6. Type of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can only scan the data once
7. What to do with these data?
• Aggregation and Statistics
• Data warehouse and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
8. Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 2020
• From 0.8 zettabytes to 35zb
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
9. Characteristics of Big Data:
2-Complexity (Varity)
• Various formats, types, and structures
• Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be
generating/collecting many types of
data
10. Characteristics of Big Data:
3-Speed (Velocity)
• Data is begin generated fast and need to be processed fast
• Online Data Analytics
• Late decisions missing opportunities
• Examples
• E-Promotions: Based on your current location, your purchase history, what
you like send promotions right now for store next to you
• Healthcare monitoring: sensors monitoring your activities and body any
abnormal measurements require immediate reaction
11. Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
• The progress and innovation is no longer hindered by the ability to collect data
• But, by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion
12. The Model Has Changed…
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
13. Value of Big Data Analytics
• Big data is more real-time in nature
than traditional DW applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not well-
suited for big data apps
• Shared nothing, massively parallel
processing, scale out architectures
are well-suited for big data apps
14. Challenges in Handling Big Data
• The Bottleneck is in technology
• New architecture, algorithms, techniques are needed
• Also in technical skills
• Experts in using the new technology and dealing with big data