1
That is the question
{
"_id": "555ae00a475a9b259281b21a",
"name": "Nicola Galgano",
"alias": "alikon",
"gender": "male",
"work": "DB consultant on banking systems",
"company": "looking for a new one",
"email": "info@alikonweb.it",
"twitter": "@alikon",
"address": "Roma, Italy, EU",
"current_hobby": "run away from dentist"
}
2
Henri Poincaré
3
Ipse dixit
4
What is Big Data?
Big data is an all-encompassing term for any collection of data sets so large or complex that it becomes difficult to process them using traditional data processing applications.
From Wikipedia
5
How big is Big Data?
DVD 4.7 GB
Human brain 2.5 PB
LHC 1 PB/s
Net traffic 1 ZB/year
6
Internet
of
Everything
• IPv6 = 2^128 ≈ 3.4e+38 addresses
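The size of the IPv6 address space can be checked with plain integer arithmetic. A minimal sketch; the variable names are only illustrative.

```python
# IPv6 addresses are 128 bits wide, so the address space holds 2**128 values.
ipv6_addresses = 2 ** 128
# For comparison, IPv4 addresses are 32 bits wide.
ipv4_addresses = 2 ** 32

print(ipv6_addresses)             # exact integer value
print(f"{ipv6_addresses:.1e}")    # same value in scientific notation
print(ipv6_addresses // ipv4_addresses == 2 ** 96)
```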
7
IPv6 can address
every quark
in the world
8
Structured / Unstructured
Volume
9
• Volume
• Velocity
• Variety
• Veracity
10
Availability          Downtime/year    Downtime/month   Downtime/week
90 % (1 nine)         36.5 days        72 hours         16.8 hours
99 % (2 nines)        3.65 days        7.20 hours       1.68 hours
99.9 % (3 nines)      8.76 hours       43.8 minutes     10.1 minutes
99.99 % (4 nines)     52.56 minutes    4.38 minutes     1.01 minutes
99.999 % (5 nines)    5.26 minutes     25.9 seconds     6.05 seconds
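The yearly figures follow directly from the availability percentage. A small sketch, assuming a 365-day year; the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year during which the system may be unavailable."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (90, 99, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct} % -> {hours:.2f} hours/year ({hours * 60:.2f} minutes)")
```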
11
12
Next-generation databases, mostly addressing some of these points:
• non-relational
• distributed
• horizontally scalable
• open-source
From www.nosql-database.org
13
• Key / value
• Column
• Document
• Graph
14
A data model is a representation that we use to perceive and manipulate data
15
•Logic model
•Normalization
• 1NF, 2NF, 3NF, ...
• E-R
• Schema (rigid)
• Algebra of sets
•Impedance mismatch
16
Schemaless
(dynamic/implicit)
Denormalization
Aggregate
Aggregates are the
basic element of data
storage
17
Simple data model
Blob/Opaque
Only 3 API functions
• Get(key)
• Set(key, value)
• Delete(key)
Key and value can be complex
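The three-function API can be sketched as a thin wrapper around an in-memory dictionary. This is only an illustration: the class name is hypothetical and a real key-value store adds persistence, distribution and more.

```python
from typing import Any, Dict, Hashable, Optional


class KeyValueStore:
    """Toy in-memory key-value store; the value is opaque to the store."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def set(self, key: Hashable, value: Any) -> None:
        self._data[key] = value

    def delete(self, key: Hashable) -> None:
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
# Keys and values can be complex: a tuple key and a nested value.
store.set(("user", "alikon"), {"name": "Nicola Galgano", "tags": ["db", "nosql"]})
print(store.get(("user", "alikon")))
store.delete(("user", "alikon"))
print(store.get(("user", "alikon")))
```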
More transparent
18
JSON
(JavaScript Object Notation)
A lightweight data
interchange format
Easy for humans
and machines to
read and write
Column
Sparse semi structured,
sorted map.
Flexible number of columns
Column keys can be grouped into families
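A sparse, sorted map with column families can be approximated with nested dictionaries. This is a rough sketch under simplifying assumptions: the class, the method names and the "family:qualifier" column naming are hypothetical and not taken from any specific product.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy column store: row key -> {"family:qualifier": value}."""

    def __init__(self):
        # Sparse: a row only stores the columns that were actually written.
        self._rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        self._rows[row_key][f"{family}:{qualifier}"] = value

    def get_family(self, row_key, family):
        """Return all columns of one family for a row, sorted by column key."""
        prefix = family + ":"
        row = self._rows.get(row_key, {})
        return {c: v for c, v in sorted(row.items()) if c.startswith(prefix)}

    def scan(self):
        """Yield rows in sorted row-key order."""
        for row_key in sorted(self._rows):
            yield row_key, dict(sorted(self._rows[row_key].items()))


table = ColumnFamilyTable()
table.put("maria", "profile", "name", "Maria")
table.put("alikon", "profile", "name", "Nicola Galgano")
table.put("alikon", "contact", "twitter", "@alikon")  # only this row has it
for key, columns in table.scan():
    print(key, columns)
```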
19
How it is stored
• Graph theory model G = ( V, E )
• Store, map and query relationships
20
•Nodes connected by edges
•Complex relationships
•Recommend products
•ACID
Queries = graph traversal
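A product recommendation can be expressed as a graph traversal. The sketch below uses a plain adjacency dictionary and a breadth-first search; the sample data and function names are made up for illustration and do not represent any graph database's query API.

```python
from collections import deque

# Hypothetical "bought" relationships between customers and products.
edges = [("nick", "camera"), ("nick", "tripod"),
         ("maria", "camera"), ("maria", "lens")]

graph = {}
for a, b in edges:
    # Store each edge in both directions (undirected graph).
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning {node: distance} up to max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def recommend(graph, customer):
    """Products exactly 3 hops away: customer -> product -> customer -> product."""
    reach = within_hops(graph, customer, 3)
    return sorted(node for node, dist in reach.items() if dist == 3)


print(recommend(graph, "maria"))
```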
The map job
takes a set of data and converts it
into another set of data, where
individual elements are broken down
into tuples (key/value pairs)
The reduce job
takes the output from a map as input
and combines those data tuples into
a smaller set of tuples
21
MapReduce refers to 2 separate and distinct tasks
Tasks run in parallel
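The two jobs can be illustrated with a word count. This is a single-machine sketch: the function names are hypothetical, and a thread pool stands in for the cluster nodes that would run map tasks in parallel in a real framework.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_job(line):
    """Break one input line down into (key, value) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_job(pairs):
    """Combine the mapped tuples into a smaller set of tuples."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def map_reduce(lines):
    with ThreadPoolExecutor() as pool:
        # Each line is handled by an independent map task.
        mapped = pool.map(map_job, lines)
        pairs = [pair for chunk in mapped for pair in chunk]
    return reduce_job(pairs)


print(map_reduce(["sql or nosql", "nosql or sql or both"]))
```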
22
• There are multiple ways to model data
• How the data is going to be accessed
• Read intensive or write intensive
• Complex queries
23
Schemaless Normalized
Model
Vertical (up)
Add more power (ram/cpu/disk)
Horizontal (out)
Add more commodity systems
24
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
25
• Split up data into multiple chunks
• Store each chunk in a separate data node
• Partitioning strategy: the "shard key"
• Multi-shard operations (join/aggregate)
• Load balancing
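One common partitioning strategy is to hash the shard key and take the result modulo the number of shards. The sketch below assumes that strategy (range-based partitioning is another option); the class and function names are hypothetical.

```python
import hashlib


def shard_for(shard_key: str, num_shards: int) -> int:
    """Map a shard key to a shard index with a stable hash."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy store where each dict stands in for a separate data node."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        # A single-key read touches exactly one shard.
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def count_all(self) -> int:
        # A multi-shard operation has to visit every shard.
        return sum(len(shard) for shard in self.shards)


store = ShardedStore(num_shards=4)
for user in ("nick", "maria", "alikon"):
    store.put(user, {"name": user})
print(store.get("maria"), store.count_all())
```

With this scheme, changing the number of shards moves most keys to a different shard, which is one reason the choice of partitioning strategy matters.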
26
• Master / slave
• Multi-master
• Synchronous
• Asynchronous
• Provide redundancy
• Increase availability
• Failover (automatic)
27
28
[Figure: timeline T0 to T3 in which Maria and Nick each Get(X) and then Put(X) on the same data]
Transaction
• A sequence of operations that form a single unit of work
• A transaction has 4 properties:
  • Atomic
  • Consistent
  • Isolated
  • Durable
29
ACID - Atomicity
Transfer 100€ from A to B
1. Read(A)
2. If A > 100
3. A=A-100
4. Write(A)
5. Read(B)
6. B=B+100
7. Write(B)
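Atomicity means the seven steps either all take effect or none do. A minimal sketch using Python's built-in sqlite3 module: the table layout, helper names and the `crash_midway` flag are assumptions made for the demonstration, and the funds check allows the transfer whenever the balance covers the amount.

```python
import sqlite3


def make_bank():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 500), ("B", 0)])
    conn.commit()
    return conn


def balance(conn, account):
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account,)
    ).fetchone()
    return row[0]


def transfer(conn, src, dst, amount, crash_midway=False):
    # "with conn" commits if the block succeeds and rolls back on an exception.
    with conn:
        if balance(conn, src) < amount:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        if crash_midway:
            # Hypothetical failure between Write(A) and Write(B).
            raise RuntimeError("simulated crash")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


conn = make_bank()
transfer(conn, "A", "B", 100)
try:
    transfer(conn, "A", "B", 100, crash_midway=True)
except RuntimeError:
    pass
# The failed transfer left no partial debit behind.
print(balance(conn, "A"), balance(conn, "B"))
```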
30
ACID - Consistency
Transfer 100€ from A to B
1. Read(A)
2. If A > 100
3. A=A-100
4. Write(A)
5. Read(B)
6. B=B+100
7. Write(B)
31
ACID - Isolation
Transfer 100€ from A to B
1. Read(A)
2. If A > 100
3. A=A-100
4. Write(A)
5. Read(B)
6. B=B+100
7. Write(B)
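Without isolation, two transfers that interleave their reads and writes of A can lose an update. The sketch below is a deterministic simulation of that interleaving, not real concurrent code; the function names are hypothetical.

```python
def interleaved_without_isolation(balance_a):
    """Two transfers of 100 from A whose steps interleave."""
    read_by_t1 = balance_a            # T1: Read(A)
    read_by_t2 = balance_a            # T2: Read(A), sees the same old value
    balance_a = read_by_t1 - 100      # T1: Write(A)
    balance_a = read_by_t2 - 100      # T2: Write(A), overwrites T1's update
    return balance_a


def serialized(balance_a):
    """The same two transfers executed one after the other."""
    for _ in range(2):
        current = balance_a           # Read(A)
        balance_a = current - 100     # Write(A)
    return balance_a


print(interleaved_without_isolation(500))  # one debit is lost
print(serialized(500))                     # both debits are applied
```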
32
ACID - Durability
Transfer 100€ from A to B
1. Read(A)
2. If A > 100
3. A=A-100
4. Write(A)
5. Read(B)
6. B=B+100
7. Write(B)
33
Basically Available:
• There will be a response to any request
• Fast response even if some replicas are slow or have crashed
Soft State:
• The state of the system could change over time
• Guaranteeing consistency is the application's task
Eventually Consistent:
• The system will eventually become consistent once it stops receiving input
• The data will eventually propagate everywhere
34
• Nick finds a cool photo and shares it with Maria by posting it on her Facebook wall
• Nick asks Maria to check it out
• Maria logs in to her account and checks her Facebook wall, but:
- Nothing is there! (x apart)
• Nick tells Maria to wait a bit and check again later
• Maria waits for a minute or so and checks back:
- She finds the photo Nick shared with her!
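The story above can be reproduced with two replicas and a replication queue. This is a toy model, assuming Nick and Maria are served by different replicas; every class and attribute name is hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.wall = []


class WallService:
    """Writes land on one replica and are copied to the other one later."""

    def __init__(self):
        self.replica_nick = Replica("replica serving Nick")
        self.replica_maria = Replica("replica serving Maria")
        self.pending = []  # replication queue: (target replica, item)

    def post(self, replica, item):
        replica.wall.append(item)
        other = (self.replica_maria if replica is self.replica_nick
                 else self.replica_nick)
        self.pending.append((other, item))

    def replicate(self):
        """Deliver queued writes; in a real system this happens asynchronously."""
        for target, item in self.pending:
            target.wall.append(item)
        self.pending.clear()


service = WallService()
service.post(service.replica_nick, "cool photo")
print(service.replica_maria.wall)   # Maria's replica has not seen the post yet
service.replicate()
print(service.replica_maria.wall)   # after propagation the photo is visible
```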
35
• It is impossible for a distributed computer system to simultaneously provide all three of these guarantees:
  • Consistency – all nodes see the same data at the same time
  • Availability – every client can always read and write
  • Partition tolerance – the system keeps working despite network failures*
• A distributed system can satisfy only 2 of them at the same time
36
37
Nick Maria
Who will take the next flight ?
EU US
38
• An ATM will allow you to withdraw money even if the machine is partitioned from the network
• Higher availability means higher revenue
• However, it puts a limit on the amount you can withdraw
• The bank might also charge you a fee when an overdraft happens
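The ATM trade-off can be sketched as follows: while partitioned, the machine stays available but caps withdrawals, and the bank reconciles (and possibly charges a fee) once the network is back. All names, the limit and the fee are hypothetical values for illustration.

```python
class ATM:
    def __init__(self, offline_limit):
        self.offline_limit = offline_limit
        self.online = True
        self.pending = []  # withdrawals not yet reported to the bank

    def withdraw(self, bank_balances, account, amount):
        if self.online:
            # Consistent path: check and update the bank's balance directly.
            if bank_balances[account] < amount:
                return False
            bank_balances[account] -= amount
            return True
        # Partitioned path: stay available, but cap the total withdrawn offline.
        already = sum(a for acc, a in self.pending if acc == account)
        if already + amount > self.offline_limit:
            return False
        self.pending.append((account, amount))
        return True

    def reconcile(self, bank_balances, overdraft_fee):
        """Apply offline withdrawals once the partition heals."""
        self.online = True
        for account, amount in self.pending:
            bank_balances[account] -= amount
            if bank_balances[account] < 0:
                bank_balances[account] -= overdraft_fee
        self.pending.clear()


balances = {"nick": 50}
atm = ATM(offline_limit=200)
atm.online = False                          # network partition
print(atm.withdraw(balances, "nick", 100))  # allowed despite the partition
atm.reconcile(balances, overdraft_fee=10)
print(balances["nick"])                     # overdraft plus fee
```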
In the absence of partitions
how does the system trade off
latency (L) and consistency (C)?
39
40
ACID (RDBMS)                          BASE (NoSQL)
• Strong consistency                  • Weak consistency (stale data)
• Isolation                           • Last write wins
• Transaction                         • Program managed
• Mature technology                   • New technology
• SQL                                 • No standard
• Available & consistent              • Available & partition tolerant
• Scale up (limited)                  • Scale out (unlimited*)
• Shared something (disk/ram/proc)    • Shared nothing (parallelizable)
41

Sql or NoSql: that is the question...
