7. Volume
• 2.7 zettabytes (ZB) of data exist in the digital universe today (2012)
• Ford’s modern hybrid Fusion model generates up to
25 GB of data per hour
• Google processes 20 PB a day (2008)
• Facebook has 30+ PB of user generated data
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
A petabyte (PB) is 10^15 bytes of data, 1,000 terabytes (TB) or 1,000,000 gigabytes (GB).
11. Velocity
• YouTube users upload 48 hours of new video every minute
• The LHC experiments represent about 150 million sensors
delivering data 40 million times per second.
• Twitter has 50 million tweets per day (2012)
• Prozone tracks 10 data points per second for every player,
or 1.4 million data points per game
13. Variety
• Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
• Static data vs. streaming data
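The static vs. streaming contrast above can be sketched in a few lines of Python: a batch computation sees the whole dataset at once, while a streaming one updates its answer one record at a time. The numbers are invented for illustration.

```python
# Toy contrast of batch (static) vs. streaming processing: the batch
# sum sees all data at once; the streaming sum updates per record.
data = [3, 1, 4, 1, 5]

batch_total = sum(data)          # static: whole dataset available up front

def stream(records):
    total = 0
    for r in records:            # streaming: one record at a time
        total += r
        yield total              # running result after each record

running = list(stream(data))
print(batch_total)   # 14
print(running)       # [3, 4, 8, 9, 14]
```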
15. Veracity (complexity)
A Ventana Research report (02/2014) indicated that, in every
analytic exercise, 40-60% of the time is spent on "data
preparation" processes: removing duplicates, fixing
partial entries, eliminating null/blank entries,
concatenating data, collapsing or splitting
columns, aggregating results into buckets, and
more.
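A minimal sketch of the kinds of "data preparation" steps the Ventana report lists, in plain Python. The record fields (`user`, `age`) are invented for illustration.

```python
# Minimal data-preparation sketch: deduplicate, drop null/blank
# entries, and aggregate results into buckets. Fields are invented.
from collections import Counter

raw = [
    {"user": "alice", "age": "34"},
    {"user": "alice", "age": "34"},   # duplicate
    {"user": "bob",   "age": ""},     # blank entry
    {"user": "carol", "age": "29"},
    {"user": None,    "age": "41"},   # null entry
]

# Remove duplicates while preserving order.
seen, deduped = set(), []
for row in raw:
    key = (row["user"], row["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Eliminate null/blank entries.
clean = [r for r in deduped if r["user"] and r["age"]]

# Aggregate results into buckets (here: age decades).
buckets = Counter(int(r["age"]) // 10 * 10 for r in clean)

print(clean)     # two rows survive
print(buckets)   # Counter({30: 1, 20: 1})
```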
17. NoSQL
A NoSQL database provides a mechanism for storage and
retrieval of data that is modeled in means other than the tabular
relations used in relational databases. Motivations for this
approach include simplicity of design, horizontal scaling, and
finer control over availability. The data structures used by
NoSQL databases (e.g. key-value, graph, or document) differ
from those used in relational databases, making some
operations faster in NoSQL and others faster in relational
databases.
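The "some operations faster in NoSQL, others in relational" point can be sketched with a toy comparison: a document model returns a whole aggregate with one key lookup, while the normalized tabular form needs a filter (conceptually, a join). Data and field names are invented.

```python
# Sketch contrasting a key-value/document model with tabular rows.
# Data and field names are invented for illustration.

# Document model: one lookup by key returns the whole aggregate.
docs = {
    "user:1": {"name": "alice", "orders": [{"sku": "A1", "qty": 2}]},
    "user:2": {"name": "bob",   "orders": []},
}
print(docs["user:1"]["orders"])          # O(1) fetch of the aggregate

# Tabular model: the same data normalized into two "tables";
# fetching one user's orders means filtering (a join, conceptually).
users  = [(1, "alice"), (2, "bob")]
orders = [(1, "A1", 2)]
alice_orders = [(sku, qty) for (uid, sku, qty) in orders if uid == 1]
print(alice_orders)
```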
18. NoSQL
• Large Volumes of data
• Dynamic Schemas
• Auto-Sharding
• Replication
• Horizontally Scalable
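Auto-sharding, the key to horizontal scaling above, can be sketched as hash-based routing: each key deterministically maps to one shard, so data spreads across nodes without manual placement. This is a toy model; real systems add rebalancing and replication.

```python
# Minimal hash-based sharding sketch: each key is routed to a shard
# by hashing, so data spreads across "nodes" automatically.
import hashlib

SHARDS = 4

def shard_for(key: str) -> int:
    # Stable hash (unlike Python's randomized hash()) so routing
    # is consistent across processes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % SHARDS

store = [dict() for _ in range(SHARDS)]  # one dict per "node"

def put(key, value):
    store[shard_for(key)][key] = value

def get(key):
    return store[shard_for(key)].get(key)

put("user:1", {"name": "alice"})
put("user:2", {"name": "bob"})
print(get("user:1"))   # routed to the same shard it was written to
```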
20. CAP Theorem
• Consistency - A read is guaranteed to return the
most recent write for a given client.
• Availability - The system will always respond to
a request (even if it's not the latest data or
consistent across the system).
• Partition Tolerance - The system continues to
operate if individual servers fail or can't be
reached.
23. CAP Theorem
AP - Availability/Partition Tolerance -
Return the most recent version of the data
you have, which could be stale. This
system state also accepts writes, which
can be processed later, once the partition
is resolved. Choose Availability over
Consistency when your business
requirements allow some flexibility
around when the data in the system
synchronizes, or when the system needs to
continue functioning in spite of external
errors (shopping carts, etc.)
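The AP behavior above can be sketched as a toy model: two replicas keep accepting reads and writes during a partition, then reconcile afterwards with last-write-wins timestamps. The shopping-cart data is invented; this is not a real replication protocol.

```python
# AP sketch: replicas keep accepting writes during a partition and
# reconcile afterwards with last-write-wins timestamps. Toy model.

def write(replica, key, value, ts):
    replica[key] = (value, ts)

def read(replica, key):
    # May return stale data -- that is the AP trade-off.
    return replica.get(key, (None, 0))[0]

def merge(a, b):
    # Heal the partition: keep the newest write for each key.
    merged = dict(a)
    for k, (v, ts) in b.items():
        if k not in merged or ts > merged[k][1]:
            merged[k] = (v, ts)
    return merged

r1, r2 = {}, {}                            # partitioned replicas
write(r1, "cart", ["book"], ts=1)          # each side keeps taking writes
write(r2, "cart", ["book", "lamp"], ts=2)
print(read(r1, "cart"))                    # ['book'] -- stale but available
healed = merge(r1, r2)
print(healed["cart"][0])                   # ['book', 'lamp'] after healing
```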
24. CAP Theorem
CP - Consistency/Partition Tolerance -
Wait for a response from the partitioned
node which could result in a timeout
error. The system can also choose to
return an error, depending on the
scenario you desire. Choose
Consistency over Availability when your
business requirements dictate atomic
reads and writes.
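The CP behavior can be sketched the same way: during a partition, the reachable node refuses writes (returns an error) rather than risk divergence, trading availability for consistency. A toy model with invented names, for illustration only.

```python
# CP sketch: during a partition the node rejects writes rather than
# accept one that its peers might never see. Toy model.

class PartitionError(Exception):
    pass

class CPNode:
    def __init__(self):
        self.data = {}
        self.partitioned = False

    def write(self, key, value):
        if self.partitioned:
            # Cannot reach a quorum: return an error instead of
            # accepting a potentially inconsistent write.
            raise PartitionError("cannot reach quorum; write rejected")
        self.data[key] = value

node = CPNode()
node.write("balance", 100)       # normal operation succeeds
node.partitioned = True
try:
    node.write("balance", 50)
except PartitionError as e:
    print("write rejected:", e)
print(node.data["balance"])      # still 100: no inconsistent write
```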
25. CAP Theorem
• CP can also have some Availability through Replication
26. Summary
• The solution depends on the problem!
• There’s more than one way.
• Don’t throw away RDBMS
• Big Data is FUN!
Everything has a sensor now, everything is mobile, everything has internet access
LHC = Large Hadron Collider
Big Data Veracity refers to the biases, noise and abnormality in data
NoSQL doesn’t mean "no SQL"; it is usually read as "Not Only SQL"
Classic relational databases are not good for big data!
Source: Wikipedia
CAP (by Eric Brewer, 1998) states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition Tolerance
The classic relational model is CA: it doesn’t have partition tolerance, so replicating data is not that easy.
Let’s just focus on CP and AP
Atomicity - Everything in a transaction must happen successfully or none of the changes are committed. This avoids a transaction that changes multiple pieces of data from failing halfway and only making a few changes.
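The all-or-nothing behavior described above can be sketched as a transfer that works on a scratch copy and commits only if every step succeeds. Account names and amounts are invented for illustration.

```python
# Atomicity sketch: apply changes to a scratch copy and commit only
# if every step succeeds -- a failure halfway leaves nothing changed.
import copy

def transfer(accounts, src, dst, amount):
    scratch = copy.deepcopy(accounts)    # work on a copy
    scratch[src] -= amount
    if scratch[src] < 0:
        raise ValueError("insufficient funds")  # abort: nothing committed
    scratch[dst] += amount
    accounts.clear()
    accounts.update(scratch)             # commit both changes together

accounts = {"a": 100, "b": 0}
transfer(accounts, "a", "b", 40)
print(accounts)                          # {'a': 60, 'b': 40}
try:
    transfer(accounts, "a", "b", 999)    # fails halfway...
except ValueError:
    pass
print(accounts)                          # ...and nothing changed
```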
If the data is replicated, the probability that it is available (and up to date) is high
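That note can be made concrete with a back-of-the-envelope calculation: if each replica is independently reachable with probability p, then at least one of n replicas is reachable with probability 1 - (1 - p)^n. The numbers below are illustrative.

```python
# Availability through replication: probability that at least one of
# n independent replicas (each reachable with probability p) is up.
def availability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(round(availability(0.9, 1), 4))   # 0.9   -- single copy
print(round(availability(0.9, 3), 4))   # 0.999 -- three replicas
```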