Big Data 2107 for Ribbon

Copyright © 2011 LOGTEL
Big Data Definition
 No single standard definition…
“Big Data” is data whose scale, diversity, and
complexity require new architecture,
techniques, algorithms, and analytics to
manage it and extract value and hidden
knowledge from it…

Agenda
 Big Data Concept and main vocabulary
 How big is BIG ? 4 V’s
 CAP theorem
 Big Data - Applicative (what can we do with it)
 NoSQL – Big Data technology

Google trends

How big is BIG?
10 GB? 10 TB? 10 PB?

The 4 V’s

Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
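
As a quick arithmetic check of the figures above, the following sketch recomputes the overall growth factor and the compound annual growth rate it implies. The variable names are illustrative, and the only inputs are the two volumes and the two years quoted on the slide.

```python
# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB in 2020.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

# Overall multiplier between the two data points.
growth_factor = end_zb / start_zb

# Constant yearly growth rate that would produce that multiplier.
annual_rate = growth_factor ** (1 / years) - 1

print(f"overall growth: {growth_factor:.1f}x")
print(f"implied annual growth: {annual_rate:.0%}")
```

The ratio comes out just under 44, which matches the "44x" bullet, and corresponds to the total volume growing by roughly two fifths every year.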


2-Complexity (Variety)
• Various formats, types, and
structures
• Text, numerical, images, audio,
video, sequences, time series, social
media data, multi-dim arrays, etc…
• Static data vs. streaming data
• A single application can be
generating/collecting many types of
data


3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missed opportunities
• Examples
• E-Promotions: based on your current location, your purchase history and what you like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction


V’s can be added
Over time, more and more V’s have been added

Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and
networks
(measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

The Model Has
Changed…
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming
data

ACID
Atomicity: Either the whole transaction is done or none of it is.
Consistency: Database constraints (application-specific) are preserved.
Isolation: It appears to the user as if only one process executes at a time. (Two concurrent transactions will not see one another’s changes while “in flight”.)
Durability: The updates made to the database in a committed transaction will be visible to future transactions. (Effects of a process do not get lost if the system crashes.)
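
A minimal sketch of atomicity and consistency, using Python's built-in sqlite3 module as a stand-in for any relational database. The account table, the names in it and the transfer helper are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint plays the role of an application-specific rule.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates are applied, or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second transfer the credit to bob has already been applied when the debit violates the constraint, and the rollback undoes it, so no half-finished transfer is ever visible.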

CAP Theorem
 Three properties of a system
 Consistency (all copies have same value)
 Availability (system can run even if parts have failed)
 Partition tolerance (the network can break into two or more parts, each with active systems that can’t talk to the other parts)
 Brewer’s CAP “Theorem”: You can have at most two
of these three properties for any system
 Very large systems will partition at some point
 Choose one of consistency or availability
 Traditional databases choose consistency
 Most Web applications choose availability
 Except for specific parts such as order processing
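
The trade-off in the list above can be sketched with a toy store made of two replicas. Everything here (the TinyStore class, the "CP" and "AP" mode labels, the partitioned flag) is invented for illustration and does not model any particular product.

```python
class Replica:
    def __init__(self):
        self.data = {}


class TinyStore:
    """Two replicas; mode is "CP" or "AP" (labels chosen for this sketch)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True means the replicas cannot talk

    def write(self, replica_id, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Prefer consistency: refuse the write, lose availability.
                raise RuntimeError("unavailable: cannot reach the other replica")
            # Prefer availability: accept locally, the copies now diverge.
            self.replicas[replica_id].data[key] = value
            return
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, replica_id, key):
        return self.replicas[replica_id].data.get(key)


cp, ap = TinyStore("CP"), TinyStore("AP")
for store in (cp, ap):
    store.write(0, "x", 1)    # replicated while the network is healthy
    store.partitioned = True  # then the network splits

try:
    cp.write(0, "x", 2)
    cp_available = True
except RuntimeError:
    cp_available = False

ap.write(0, "x", 2)
ap_views = (ap.read(0, "x"), ap.read(1, "x"))
print(cp_available, ap_views)
```

During the partition the CP store rejects the write but both of its copies still agree, while the AP store keeps answering and its two replicas return different values.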

The reminder
Dial 1-800-remind ......
 Available, Consistent – not Partition-tolerant
 Not available ...
 Available, Partition-tolerant – not Consistent
 Consistent, Partition-tolerant – not Available

The proof…

WHAT CAN WE DO WITH
BIG DATA ?

What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

Value of Big Data Analytics
 Big data is more real-time in nature than traditional DW applications
 Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited
for big data apps
 Shared nothing, massively parallel processing, scale out architectures
are well-suited for big data apps

Big Data Applications Domains

Telecom’s main drivers for Big Data

Big Data as a Product:
ImmobilienScout (Deutsche Telekom)

Technology

Architecture

MapReduce

Scaling Up
 What if the dataset is huge and the number of transactions per second is very high?
 Use multiple servers to host database
 ‘scaling out’ or ‘horizontal scaling’
 Parallel databases have been around for a
while
 But expensive, and designed for decision support
not OLTP (Online Transaction Processing)

Scaling RDBMS – Master/Slave
 Master-Slave
 All writes are written to the master. All reads
performed against the replicated slave
databases
 Good for read-mostly applications with very few updates
 Critical reads may be incorrect as writes may
not have been propagated down
 Large data sets can pose problems as master
needs to duplicate data to slaves
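
The "critical reads may be incorrect" point can be sketched with a master that queues its writes and a slave that only sees them once replication has run. The Master and Slave classes and the replicate helper are hypothetical names for this sketch.

```python
from collections import deque


class Master:
    def __init__(self):
        self.data = {}
        self.log = deque()  # writes not yet shipped to the slave

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Slave:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


def replicate(master, slave):
    """Ship all pending writes from the master to the slave."""
    while master.log:
        key, value = master.log.popleft()
        slave.data[key] = value


master, slave = Master(), Slave()
master.write("stock:soap", 5)
replicate(master, slave)

master.write("stock:soap", 4)     # accepted by the master...
stale = slave.read("stock:soap")  # ...but the slave has not caught up yet

replicate(master, slave)
fresh = slave.read("stock:soap")
print(stale, fresh)
```

Between the second write and the next replication run, a read served by the slave returns the old value, which is exactly the window in which a critical read can be wrong.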

Scaling RDBMS - Partitioning
 Partitioning
 Divide the database across many machines
 E.g. hash or range partitioning
 Handled transparently by parallel databases
 but they are expensive
 “Sharding”
 Divide data amongst many cheap databases
(MySQL/PostgreSQL)
 Manage parallel access in the application
 Scales well for both reads and writes
 Not transparent, application needs to be partition-aware
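
A minimal sketch of the two partitioning schemes named above, as a partition-aware application might implement them. The shard names, the range boundaries and the choice of SHA-256 are assumptions made for this example.

```python
import bisect
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard names


def hash_shard(key: str) -> str:
    """Hash partitioning: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# Range partitioning by first letter: db0 = a-f, db1 = g-m, db2 = n-s, db3 = t-z.
RANGE_BOUNDS = ["g", "n", "t"]


def range_shard(key: str) -> str:
    """Range partitioning: the first letter of the key picks the shard."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, key[0].lower())]


for name in ("Samuel", "Katja", "Sylvia", "Mark", "Steve"):
    print(name, hash_shard(name), range_shard(name))
```

Hash partitioning tends to spread keys evenly but scatters neighbouring keys, while range partitioning keeps neighbouring keys together at the cost of possible hot spots. In both cases the application has to compute the shard itself, which is the "not transparent" cost listed above.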

What is NoSQL?
 Stands for Not Only SQL
 Class of non-relational data storage systems
 E.g. BigTable, Dynamo, PNUTS/Sherpa, ..
 Usually do not require a fixed table schema nor do
they use the concept of joins
 All NoSQL offerings relax one or more of the ACID
properties (will talk about the CAP theorem)
 Not a backlash/rebellion against RDBMS
 SQL is a rich query language that cannot be rivaled
by the current list of NoSQL offerings

NoSQL Data Storage: Classification
 NoSQL solutions fall into 4 major areas:
 Uninterpreted key/value or ‘the big hash
table’.
 Amazon Dynamo
 Voldemort
 Scalaris
 Column-based, with interpreted keys
 Cassandra, BigTable, HBase, Sherpa/PNuts
 Others
 CouchDB (document-based)
 Neo4J (graph-based)

NoSQL ecosystem

Big Data Landscape 2015

Availability
 Traditionally thought of as the server/process being available “five 9’s” (99.999%) of the time.
 However, for a system with a large number of nodes, at almost any point in time there’s a good chance that a node is down or that there is a network disruption among the nodes.
 Want a system that is resilient in the face of
network disruption
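
To put a number on "a good chance", the sketch below computes the probability that at least one node is down at a given moment. It assumes every node is independently available 99.999% of the time, which is a simplification, and the function name is made up for this example.

```python
def p_any_node_down(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes is down."""
    return 1.0 - node_availability ** nodes


for n in (1, 1_000, 100_000):
    print(n, f"{p_any_node_down(0.99999, n):.4f}")
```

Under that assumption a single five-9's server is almost never down, but in a cluster of 100,000 such nodes the chance that something is down right now is well over one half.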

Eventual Consistency
 When no updates occur for a long period of time,
eventually all updates will propagate through the system
and all the nodes will be consistent
 For a given accepted update and a given node, eventually
either the update reaches the node or the node is removed
from service
 Known as BASE (Basically Available, Soft state, Eventual
consistency), as opposed to ACID
 Soft state: copies of a data item may be inconsistent
 Eventually consistent – copies become consistent at some later time if there are no more updates to that data item

BASE in Cassandra
[Figure: read path in a Cassandra cluster. The client sends a query; the closest replica (Replica A) returns the result, while Replica B and Replica C each receive a digest query and return a digest response. Read repair is performed if the digests differ.]
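
The read path described above can be sketched as follows. This is a simplified model, not Cassandra's actual implementation: the Replica class, the read_with_repair function and the rule "the value with the newest timestamp wins" are assumptions made for this illustration.

```python
import hashlib


class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def read(self, key):
        return self.store.get(key)

    def digest(self, key):
        """Short fingerprint of the stored version, cheaper to ship than the data."""
        return hashlib.sha256(repr(self.store.get(key)).encode("utf-8")).hexdigest()


def read_with_repair(key, closest, others):
    """Read from the closest replica, compare digests, repair stale copies."""
    result = closest.read(key)
    expected = closest.digest(key)
    if any(replica.digest(key) != expected for replica in others):
        replicas = [closest, *others]
        versions = [v for v in (r.read(key) for r in replicas) if v is not None]
        newest = max(versions, key=lambda version: version[0])
        for replica in replicas:
            replica.store[key] = newest  # read repair
        result = newest
    return result[1] if result else None


a, b, c = Replica("A"), Replica("B"), Replica("C")
a.store["k"] = (1, "old")
b.store["k"] = (2, "new")  # B has already received the latest update
c.store["k"] = (1, "old")

value = read_with_repair("k", closest=a, others=[b, c])
print(value, a.store["k"], c.store["k"])
```

After the read, all three replicas hold the newest version, which is how reads themselves help the system converge toward consistency.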

Common Advantages
 Cheap, easy to implement (open source)
 Data are replicated to multiple nodes (therefore
identical and fault-tolerant) and can be partitioned
 When data is written, the latest version is on at least
one node and then replicated to other nodes
 Down nodes easily replaced
 No single point of failure
 Easy to distribute
 Don't require a schema

What am I giving up?
 joins
 group by
 order by
 ACID transactions
 SQL as a sometimes frustrating but still
powerful query language
 easy integration with other applications that
support SQL

Distributed Key-Value Data Stores
 Distributed key-value data storage systems allow
key-value pairs to be stored (and retrieved on
key) in a massively parallel system
 E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon
Dynamo, ..
 Partitioning, high availability etc. completely
transparent to application
 Sharding systems and key-value stores don’t
support many relational features
 No join operations (except within partition)
 No referential integrity constraints across
partitions
 etc.

Flexible Data Model
Rockets
Key  Name          Value
1    name          Rocket-Powered Roller Skates
     toon          Ready, Set, Zoom
     inventoryQty  5
     brakes        false
2    name          Little Giant Do-It-Yourself Rocket-Sled Kit
     toon          Beep Prepared
     inventoryQty  4
     brakes        false
3    name          Acme Jet Propelled Unicycle
     toon          Hot Rod and Reel
     inventoryQty  1
     wheels        1
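
The Rockets table above maps naturally onto a dictionary of dictionaries. The sketch below uses the values from the slide; the variable names are illustrative.

```python
# Each key maps to its own set of name/value pairs; no shared schema is declared.
rockets = {
    1: {
        "name": "Rocket-Powered Roller Skates",
        "toon": "Ready, Set, Zoom",
        "inventoryQty": 5,
        "brakes": False,
    },
    2: {
        "name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
        "toon": "Beep Prepared",
        "inventoryQty": 4,
        "brakes": False,
    },
    3: {
        "name": "Acme Jet Propelled Unicycle",
        "toon": "Hot Rod and Reel",
        "inventoryQty": 1,
        "wheels": 1,  # this item has "wheels" instead of "brakes"
    },
}

# Attributes are discovered from the data rather than from a table definition.
attribute_names = sorted({name for item in rockets.values() for name in item})
with_brakes_attr = [key for key, item in rockets.items() if "brakes" in item]
print(attribute_names, with_brakes_attr)
```

Item 3 carries an attribute the others lack and omits one they have, which a fixed relational schema would only allow through NULL columns or a schema change.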

HBase

Google
 Tables are sorted by row
 The table schema only defines its column families.
 Each family consists of any number of columns
 Each column consists of any number of versions
 Columns only exist when inserted; NULLs are free.
 Columns within a family are sorted and stored together
 Everything except table names is byte[]
 (Row, Family: Column, Timestamp) → Value
[Figure: a cell addressed by row key, column family and timestamp, holding a value]
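
The (Row, Family: Column, Timestamp) → Value mapping can be sketched as nested dictionaries. TinyTable and its methods are hypothetical names for this illustration and are not the HBase client API.

```python
from collections import defaultdict


class TinyTable:
    """Toy model: row -> (family, column) -> timestamp -> value, all as bytes."""

    def __init__(self, families):
        self.families = set(families)  # the only thing the schema declares
        self.rows = defaultdict(dict)

    def put(self, row, family, column, ts, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        # A column comes into existence only when something is inserted.
        self.rows[row].setdefault((family, column), {})[ts] = value

    def get(self, row, family, column, ts=None):
        versions = self.rows.get(row, {}).get((family, column), {})
        if not versions:
            return None  # absent columns cost nothing
        if ts is None:
            ts = max(versions)  # default to the newest version
        return versions.get(ts)

    def scan(self):
        """Yield rows in sorted row-key order, with their sorted columns."""
        for row in sorted(self.rows):
            yield row, sorted(self.rows[row])


table = TinyTable(families=[b"info"])
table.put(b"row2", b"info", b"name", 1, b"Sam")
table.put(b"row1", b"info", b"city", 1, b"Rome")
table.put(b"row1", b"info", b"city", 2, b"London")

latest = table.get(b"row1", b"info", b"city")
older = table.get(b"row1", b"info", b"city", ts=1)
missing = table.get(b"row2", b"info", b"city")
row_order = [row for row, _ in table.scan()]
print(latest, older, missing, row_order)
```

The schema fixes only the column family, each column keeps several timestamped versions, and a scan returns rows in key order regardless of insertion order.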

Splunk – document-based

Splunk – log analysis

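PNUTS Data Storage Architecture

Partitioning And Replication
[Figure: hash ring over the range 0 to 1 (midpoint 1/2) with nodes A to F; keys are placed at h(key1) and h(key2) and stored with replication factor N=3]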
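
One common way to realise a ring like the one in the figure is consistent hashing, sketched below. This is an assumption about what the figure depicts rather than a description of any specific system; the Ring class, the h function and the use of SHA-256 are invented for this example.

```python
import bisect
import hashlib


def h(key: str) -> float:
    """Map a key (or a node name) to a position on the ring, between 0 and 1."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        # Each node sits at the ring position given by the hash of its name.
        self.points = sorted((h(node), node) for node in nodes)

    def replicas_for(self, key):
        """Walk clockwise from h(key) and take the next N distinct nodes."""
        start = bisect.bisect_right(self.points, (h(key), ""))
        return [
            self.points[(start + i) % len(self.points)][1] for i in range(self.n)
        ]


ring = Ring(["A", "B", "C", "D", "E", "F"], n_replicas=3)
for key in ("key1", "key2"):
    print(key, round(h(key), 3), ring.replicas_for(key))
```

Because a key's position depends only on its hash, adding or removing a node changes the replica set only for keys in the neighbouring arc of the ring.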

Graphs DB

...how to think «graphically» with
one of the most common domains
in the enterprise world:
The old-classic CRM* domain
* today in 99% of the cases a RDBMS is used
Let’s take a real example - CRM

Back to school:
Graph Theory crash course

[Figure: example graph with the vertices Avital, Sam, Doron and “NoSQL lecture”, connected by edges labeled Likes, FriendOf and Joins]

Domain: minimal CRM
[Figure: the entities Customer, Address, Order and Stock, spread across a Registry system and an Order system]

How does a relational DBMS manage relationships?

Relational World: 1-1 Relationships
JOIN Customer.Address -> Address.Id
Customer
Id  Name    Address
10  Samuel  34
11  Katja   44
34  Sylvia  54
56  Mark    66
88  Steve   68
Address
Id  Location
34  Rome, London
44  Cologne
54  Rome
66  New Mexico
68  Palo Alto

Relational World: 1-N Relationships
Inverse JOIN Address.Customer -> Customer.Id
Customer
Id  Name
10  Samuel
11  Katja
34  Sylvia
56  Mark
88  Steve
Address
Id  Customer  Location
24  10        Rome
33  10        London
44  34        Rome
66  11        Cologne
68  88        Palo Alto

Relational World: N-M Relationships
Additional table with 2 JOINs
(1) CustomerAddress.Id -> Customer.Id and
(2) CustomerAddress.Address -> Address.Id
Customer
Id  Name
10  Samuel
11  Katja
34  Sylvia
56  Mark
88  Steve
Address
Id  Location
24  Rome
33  London
44  Rome
66  Cologne
68  Palo Alto
CustomerAddress
Id  Address
10  24
10  33
34  24

What’s wrong with the
Relational Model?

The JOIN is the evil!
Every row of CustomerAddress (10 -> 24, 10 -> 33, 34 -> 24) is a relationship between a Customer and an Address.
These are all JOINs executed every time you traverse a relationship

Why not JOIN
• A JOIN means searching for a key in another table
• The first rule to improve performance is
indexing all the keys
• Index speeds up searches but slows down
insert, updates and deletes
• So in the best case a JOIN is a lookup in an index
• This is done per single join!
• If you traverse hundreds of relationships
you’re executing hundreds of JOINs

Index Lookup
is it really that fast?

Index Lookup: how does it work?
Think of an address book where we have to find Samuel’s phone number.
Index algorithms are all similar and based on balanced trees:
A-Z
  A-L
    A-D
      A-B
      C-D
    E-L
      E-G
        E-F
        G
      H-L
        H-J
        K-L
  M-Z
    M-R
    S-Z

Found!
Each lookup takes X steps, where X grows with the index size!

An index lookup is executed for each JOIN
Querying more tables can easily produce millions of JOINs/lookups!
Here is the rule: more entries = more lookup steps = slower JOIN
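
The rule above can be made concrete with a binary search that counts its own steps. Real database indexes are usually B-trees with many keys per node, so the absolute numbers differ; this is only a sketch of the logarithmic growth, and lookup_steps is a name made up for this example.

```python
def lookup_steps(sorted_keys, target):
    """Binary search that also reports how many comparisons it needed."""
    lo, hi, steps = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


steps_by_size = {}
for size in (1_000, 1_000_000):
    keys = range(size)  # stands in for a sorted index of `size` entries
    _, steps = lookup_steps(keys, size - 1)
    steps_by_size[size] = steps
    print(size, steps)
```

A thousand times more entries adds only about ten steps to a single lookup, but that cost is paid again for every JOIN, so traversing many relationships multiplies it.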

OrientDB: traverse a relationship
Vertex Sam:   RID = #13:35    label : ‘Customer’   name : ‘Sam’     out : [#14:54]
Edge Lives:   RID = #14:54    label : ‘Lives’      out : [#13:35]   in : [#13:100]
Vertex Rome:  RID = #13:100   label : ‘Address’    name : ‘Rome’    in : [#14:54]
The Record ID (RID) is a physical position

A GraphDB handles a relationship as a physical LINK to the record, assigned when the edge is created.
An RDBMS, on the other side, computes the relationship every time you query the database.
Isn’t that crazy?!

This means jumping from an O(log N) algorithm to near O(1): the traversal cost is no longer affected by database size!
This is huge in the Big Data age
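
The difference can be sketched with the three records from the Sam/Rome example, stored by RID. In this toy model a Python dictionary stands in for "jump straight to the physical position", and the neighbours function is a hypothetical helper, not part of OrientDB.

```python
# RID -> record. Each record already holds the RIDs it is linked to.
records = {
    "#13:35": {"label": "Customer", "name": "Sam", "out": ["#14:54"], "in": []},
    "#14:54": {"label": "Lives", "out": ["#13:35"], "in": ["#13:100"]},
    "#13:100": {"label": "Address", "name": "Rome", "out": [], "in": ["#14:54"]},
}


def neighbours(rid, edge_label):
    """Follow outgoing edges with the given label by dereferencing stored RIDs."""
    result = []
    for edge_rid in records[rid]["out"]:
        edge = records[edge_rid]
        if edge["label"] == edge_label:
            # The edge's "in" side points at the target vertex.
            result.extend(records[vertex_rid]["name"] for vertex_rid in edge["in"])
    return result


where_sam_lives = neighbours("#13:35", "Lives")
print(where_sam_lives)
```

No search over all addresses takes place: each hop dereferences a link stored in the record itself, so the cost of a hop does not depend on how many other records exist.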

Create the graph in SQL
$luca> cd bin
$luca> ./console.sh
OrientDB console v.1.2.0-SNAPSHOT (www.orientdb.org)
Type 'help' to display all the commands supported.
orientdb> create vertex V set name = 'Sam', label = 'Customer'
Created vertex #13:35 in 0.03 secs
orientdb> create vertex V set name = 'Rome', label = 'Address'
Created vertex #13:100 in 0.02 secs
orientdb> create edge E from #13:35 to #13:100 set label = 'Lives'
Created edge #14:54 in 0.02 secs

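Create the graph in Java
import com.orientechnologies.orient.core.db.graph.OGraphDatabase;
import com.orientechnologies.orient.core.record.impl.ODocument;
// Handle to the graph database stored at /tmp/db/graph
OGraphDatabase graph = new OGraphDatabase("local:/tmp/db/graph");
// Create the two vertices
ODocument sam = graph.createVertex();
sam.field("name", "Sam");
sam.field("label", "Customer");
ODocument rome = graph.createVertex();
rome.field("name", "Rome");
rome.field("label", "Address");
// Connect the two vertices with a "Lives" edge and save it
ODocument edge = graph.createEdge(sam, rome).field("label", "Lives");
edge.save();
graph.close();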

Query the graph in SQL
orientdb> select in[label='Lives'].out from V where
label = 'Address' and name = 'Rome'
---+--------+--------------------+--------------------+--------------------+
  #| REC ID |label               |out                 |in                  |
---+--------+--------------------+--------------------+--------------------+
  0|   13:35|Sam                 |[#14:54]            |                    |
---+--------+--------------------+--------------------+--------------------+
1 item(s) found. Query executed in 0.007 sec(s).
orientdb> select * from V where label = 'Address' AND
in[label='Lives'].size() > 0
---+--------+--------------------+--------------------+--------------------+
  #| REC ID |label               |out                 |in                  |
---+--------+--------------------+--------------------+--------------------+
  0|  13:100|Rome                |                    |[#14:54]            |
---+--------+--------------------+--------------------+--------------------+
1 item(s) found. Query executed in 0.007 sec(s).

Query the graph in Java
import java.util.List;
import com.orientechnologies.orient.core.sql.OCommandSQL;
OGraphDatabase graph = new OGraphDatabase("local:/tmp/db/graph");
// Get all the customers from Rome; the ? placeholder is bound in execute()
List<ODocument> result = graph.command( new OCommandSQL(
    "select in[label='Lives'].out from V where label = 'Address' and name = ?")
).execute("Rome");
for (ODocument v : result) {
    System.out.println("Result: " + v.field("label"));
}
Output:
Result: Sam

Query vs. traversal
 Once you have a well-connected database in the form of a super graph, you can traverse records instead of querying them!
 All you need is some root vertices from which to start traversing

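Query vs. traversal
[Figure: a graph of customers (Sam, John, Sylvia), orders (Order 2332, Order 8834) and the product White Soap, together with the vertices Customers, Stocks and Special Customers; one vertex is marked “This is a root vertex”]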

Query the graph in SQL
Supposing that the root node Customers (#30:0) links all the Customer vertices.
Get all the customers:
orientdb> select out.in from #30:0
Get all the customers who bought at least one 'White Soap' product:
orientdb> select * from ( select out.in from #30:0 ) where
out.in.out[label='Bought'].in.name = 'White Soap'

Demo time!

Should I be using NoSQL Databases?
 For almost all of us, regular relational
databases are THE correct solution
 NoSQL data storage systems make sense for applications that need to deal with very large volumes of semi-structured data
 Log Analysis
 Social Networking Feeds

Key concept of Big Data
• Store everything
• Don’t delete anything
• Schema is a bottleneck
• Always think in parallel
• Remember the CAP theorem

Thank You!!!
…and please fill in the evaluation form
