7. Volume
• 2.7 zettabytes (ZB) of data exist in the digital universe today (2012)
• Ford’s modern hybrid Fusion model generates up to
25 GB of data per hour
• Google processes 20 PB a day (2008)
• Facebook has 30+ PB of user generated data
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
A petabyte (PB) is 10^15 bytes of data, 1,000 terabytes (TB) or 1,000,000 gigabytes (GB).
11. Velocity
• YouTube users upload 48 hours of new video every minute
• The LHC experiments represent about 150 million sensors
delivering data 40 million times per second.
• Twitter has 50 million tweets per day (2012)
• Prozone tracks 10 data points per second for every player,
or 1.4 million data points per game
13. Variety
• Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
• Static data vs. streaming data
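The static vs. streaming contrast above can be sketched in a few lines of Python: a batch computation sees the whole dataset at once, while a streaming one updates its answer one record at a time. The numbers are invented for illustration.

```python
# Toy contrast of batch (static) vs. streaming processing: the batch
# sum sees all data at once; the streaming sum updates per record.
data = [3, 1, 4, 1, 5]

batch_total = sum(data)          # static: whole dataset available up front

def stream(records):
    total = 0
    for r in records:            # streaming: one record at a time
        total += r
        yield total              # running result after each record

running = list(stream(data))
print(batch_total)   # 14
print(running)       # [3, 4, 8, 9, 14]
```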
15. Veracity (complexity)
A Ventana Research report (02/2014) indicated that, in every
analytic exercise, 40-60% of the time is spent on "data
preparation" processes: removing duplicates, fixing
partial entries, eliminating null/blank entries,
concatenating data, collapsing or splitting
columns, aggregating results into buckets, and
more.
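A minimal sketch of the kinds of "data preparation" steps the Ventana report lists, in plain Python. The record fields (`user`, `age`) are invented for illustration.

```python
# Minimal data-preparation sketch: deduplicate, drop null/blank
# entries, and aggregate results into buckets. Fields are invented.
from collections import Counter

raw = [
    {"user": "alice", "age": "34"},
    {"user": "alice", "age": "34"},   # duplicate
    {"user": "bob",   "age": ""},     # blank entry
    {"user": "carol", "age": "29"},
    {"user": None,    "age": "41"},   # null entry
]

# Remove duplicates while preserving order.
seen, deduped = set(), []
for row in raw:
    key = (row["user"], row["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Eliminate null/blank entries.
clean = [r for r in deduped if r["user"] and r["age"]]

# Aggregate results into buckets (here: age decades).
buckets = Counter(int(r["age"]) // 10 * 10 for r in clean)

print(clean)     # two rows survive
print(buckets)   # Counter({30: 1, 20: 1})
```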
17. NoSQL
A NoSQL database provides a mechanism for storage and
retrieval of data that is modeled in means other than the tabular
relations used in relational databases. Motivations for this
approach include simplicity of design, horizontal scaling, and
finer control over availability. The data structures used by
NoSQL databases (e.g. key-value, graph, or document) differ
from those used in relational databases, making some
operations faster in NoSQL and others faster in relational
databases.
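The "some operations faster in NoSQL, others in relational" point can be sketched with a toy comparison: a document model returns a whole aggregate with one key lookup, while the normalized tabular form needs a filter (conceptually, a join). Data and field names are invented.

```python
# Sketch contrasting a key-value/document model with tabular rows.
# Data and field names are invented for illustration.

# Document model: one lookup by key returns the whole aggregate.
docs = {
    "user:1": {"name": "alice", "orders": [{"sku": "A1", "qty": 2}]},
    "user:2": {"name": "bob",   "orders": []},
}
print(docs["user:1"]["orders"])          # O(1) fetch of the aggregate

# Tabular model: the same data normalized into two "tables";
# fetching one user's orders means filtering (a join, conceptually).
users  = [(1, "alice"), (2, "bob")]
orders = [(1, "A1", 2)]
alice_orders = [(sku, qty) for (uid, sku, qty) in orders if uid == 1]
print(alice_orders)
```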
18. NoSQL
• Large Volumes of data
• Dynamic Schemas
• Auto-Sharding
• Replication
• Horizontally Scalable
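Auto-sharding, the key to horizontal scaling above, can be sketched as hash-based routing: each key deterministically maps to one shard, so data spreads across nodes without manual placement. This is a toy model; real systems add rebalancing and replication.

```python
# Minimal hash-based sharding sketch: each key is routed to a shard
# by hashing, so data spreads across "nodes" automatically.
import hashlib

SHARDS = 4

def shard_for(key: str) -> int:
    # Stable hash (unlike Python's randomized hash()) so routing
    # is consistent across processes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % SHARDS

store = [dict() for _ in range(SHARDS)]  # one dict per "node"

def put(key, value):
    store[shard_for(key)][key] = value

def get(key):
    return store[shard_for(key)].get(key)

put("user:1", {"name": "alice"})
put("user:2", {"name": "bob"})
print(get("user:1"))   # routed to the same shard it was written to
```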
20. CAP Theorem
• Consistency - A read is guaranteed to return the
most recent write for a given client.
• Availability - The system will always respond to
a request (even if it's not the latest data or
consistent across the system).
• Partition Tolerance - The system continues to
operate if individual servers fail or can't be
reached.
23. CAP Theorem
AP - Availability/Partition Tolerance -
Return the most recent version of the data
you have, which could be stale. This
system state also accepts writes, which
can be processed later, once the partition
is resolved. Choose Availability over
Consistency when your business
requirements allow some flexibility
around when the data in the system
synchronizes, or when the system needs to
continue functioning in spite of external
errors (shopping carts, etc.)
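The AP behavior above can be sketched as a toy model: two replicas keep accepting reads and writes during a partition, then reconcile afterwards with last-write-wins timestamps. The shopping-cart data is invented; this is not a real replication protocol.

```python
# AP sketch: replicas keep accepting writes during a partition and
# reconcile afterwards with last-write-wins timestamps. Toy model.

def write(replica, key, value, ts):
    replica[key] = (value, ts)

def read(replica, key):
    # May return stale data -- that is the AP trade-off.
    return replica.get(key, (None, 0))[0]

def merge(a, b):
    # Heal the partition: keep the newest write for each key.
    merged = dict(a)
    for k, (v, ts) in b.items():
        if k not in merged or ts > merged[k][1]:
            merged[k] = (v, ts)
    return merged

r1, r2 = {}, {}                            # partitioned replicas
write(r1, "cart", ["book"], ts=1)          # each side keeps taking writes
write(r2, "cart", ["book", "lamp"], ts=2)
print(read(r1, "cart"))                    # ['book'] -- stale but available
healed = merge(r1, r2)
print(healed["cart"][0])                   # ['book', 'lamp'] after healing
```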
24. CAP Theorem
CP - Consistency/Partition Tolerance -
Wait for a response from the partitioned
node which could result in a timeout
error. The system can also choose to
return an error, depending on the
scenario you desire. Choose
Consistency over Availability when your
business requirements dictate atomic
reads and writes.
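The CP behavior can be sketched the same way: during a partition, the reachable node refuses writes (returns an error) rather than risk divergence, trading availability for consistency. A toy model with invented names, for illustration only.

```python
# CP sketch: during a partition the node rejects writes rather than
# accept one that its peers might never see. Toy model.

class PartitionError(Exception):
    pass

class CPNode:
    def __init__(self):
        self.data = {}
        self.partitioned = False

    def write(self, key, value):
        if self.partitioned:
            # Cannot reach a quorum: return an error instead of
            # accepting a potentially inconsistent write.
            raise PartitionError("cannot reach quorum; write rejected")
        self.data[key] = value

node = CPNode()
node.write("balance", 100)       # normal operation succeeds
node.partitioned = True
try:
    node.write("balance", 50)
except PartitionError as e:
    print("write rejected:", e)
print(node.data["balance"])      # still 100: no inconsistent write
```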
25. CAP Theorem
• CP can also have some Availability through Replication
26. Summary
• The solution depends on the problem!
• There’s more than one way.
• Don’t throw away RDBMS
• Big Data is FUN!
Everything has a sensor now, everything is mobile, everything has internet access
LHC = Large Hadron Collider
Big Data Veracity refers to the biases, noise and abnormality in data
NoSQL doesn’t mean "no SQL"; it is usually read as "Not Only SQL"
Classic relational databases are not good for big data!
Source: Wikipedia
CAP (by Eric Brewer, 1998) states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition Tolerance
The classic relational model is CA: it doesn’t have partition tolerance, so replicating data is not that easy.
Let’s just focus on CP and AP
Atomicity - Everything in a transaction must happen successfully or none of the changes are committed. This avoids a transaction that changes multiple pieces of data from failing halfway and only making a few changes.
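The all-or-nothing behavior described above can be sketched as a transfer that works on a scratch copy and commits only if every step succeeds. Account names and amounts are invented for illustration.

```python
# Atomicity sketch: apply changes to a scratch copy and commit only
# if every step succeeds -- a failure halfway leaves nothing changed.
import copy

def transfer(accounts, src, dst, amount):
    scratch = copy.deepcopy(accounts)    # work on a copy
    scratch[src] -= amount
    if scratch[src] < 0:
        raise ValueError("insufficient funds")  # abort: nothing committed
    scratch[dst] += amount
    accounts.clear()
    accounts.update(scratch)             # commit both changes together

accounts = {"a": 100, "b": 0}
transfer(accounts, "a", "b", 40)
print(accounts)                          # {'a': 60, 'b': 40}
try:
    transfer(accounts, "a", "b", 999)    # fails halfway...
except ValueError:
    pass
print(accounts)                          # ...and nothing changed
```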
If the data is replicated, the probability that it is available (and up to date) is high
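That note can be made concrete with a back-of-the-envelope calculation: if each replica is independently reachable with probability p, then at least one of n replicas is reachable with probability 1 - (1 - p)^n. The numbers below are illustrative.

```python
# Availability through replication: probability that at least one of
# n independent replicas (each reachable with probability p) is up.
def availability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(round(availability(0.9, 1), 4))   # 0.9   -- single copy
print(round(availability(0.9, 3), 4))   # 0.999 -- three replicas
```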