1. Content
• Introduction
• Databases
– ACID
– Data structures, algorithms
– Scalability issues
– Scaling patterns
• Search engines
– Data structures, algorithms
– Pros & cons
• NoSQL Movement
– Why and What
2. Content
• NoSQL Families
– Key-value stores
– Column stores
– Document stores
– Graph DB
• Principles: CAP, Scaling patterns, High availability patterns, Elasticity
• How to choose?
• Conclusion
3. Introduction
• Who we are:
– Clément STENAC (Indexing and search techs)
– Jérémie BORDIER (360 team (a bit of everything))
• Exalead:
– Indexing technologies provider since 1998
– Online search engine: http://www.exalead.com
– Daily challenge: tackle information access problems for large companies.
4. Introduction
• Universal answer to data storage: RELATIONAL DATABASES
• Well known data representation: objects and relationships
• Powerful query language: SQL
• Open source implementations:
– MySQL
– PostgreSQL
– …
5. Introduction
• Database scalability problems?
• Used to be a telco and bank problem…
• …until the internet came along!
[Image: Twitter fail whale, 2008]
6. Introduction
• Thanks to the internet…
• …millions of rows are frequent…
• …real-time websites.
How to deal with massive amounts of structured data?
Are there alternatives?
What’s this NoSQL buzz?
8. Databases: ACID
ACID constraints
• Atomicity
• Transactions succeed or fail atomically
• Consistency
• Transactions leave the database in a consistent state
• Isolation
• Transactions do not see the effects of concurrent transactions
• Durability
• Once a transaction is committed, it can’t be lost
9. Database structures
Primary storage
CREATE TABLE author (
  id INTEGER PRIMARY KEY,
  nick VARCHAR(16),        -- fixed size
  age INTEGER,
  firstname VARCHAR(128),  -- heuristics change it to variable-size
  biography TEXT);         -- variable size

CREATE TABLE post (
  id INTEGER PRIMARY KEY,
  author_id INTEGER REFERENCES author(id),
  timestamp TIMESTAMP,
  title VARCHAR(256),
  text TEXT);

Each value or pointer can be retrieved at a known offset in the row:

        id       age      nick      firstname  biography
Row 1   4 bytes  4 bytes  16 bytes  pointer    pointer
Row 2   4 bytes  4 bytes  16 bytes  pointer    pointer
Table strings: len data | len data | len data | len data
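The fixed-offset layout above can be sketched in a few lines. This is a toy illustration (the format string and offsets are my own, not any real engine's): fixed-size fields live inline, so a single column can be decoded at row_start + offset without touching the rest of the row.

```python
import struct

# Hypothetical row layout: id (4 bytes), age (4 bytes), nick (16 bytes).
ROW_FMT = "<ii16s"
OFFSET = {"id": 0, "age": 4, "nick": 8}

row = struct.pack(ROW_FMT, 1, 24, b"jdoe".ljust(16, b"\x00"))

# Direct access to 'age' without decoding the whole row:
(age,) = struct.unpack_from("<i", row, OFFSET["age"])
print(age)  # 24
```

Variable-size columns (biography, long firstnames) cannot be addressed this way, which is why the row stores a pointer into a separate strings area instead.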
10. Searching in a database
SELECT * FROM author WHERE age=24;
The raw way: full scan
• Enumerate all records in the table
• For each record, fetch the condition value
• Inline value: direct access at row_address + offset(column)
• Outside value: fetch pointer, then fetch data
• Perform comparison
Analysis
• Need to analyse the full table
• Very CPU intensive
• If the table does not fit in memory? I/O on the whole table
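As a minimal sketch of the full scan above (an in-memory toy table, not a storage engine), the work is visibly O(rows) regardless of how few rows match:

```python
# Toy table: SELECT * FROM author WHERE age = 24 by full scan.
authors = [
    {"id": 1, "nick": "jdoe", "age": 24},
    {"id": 2, "nick": "msmith", "age": 31},
    {"id": 3, "nick": "kroe", "age": 24},
]

def full_scan(table, column, value):
    # Enumerate every record and test the condition on each one.
    return [row for row in table if row[column] == value]

print([r["id"] for r in full_scan(authors, "age", 24)])  # [1, 3]
```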
11. Database structures
Indexes
What is an index?
• Primary storage: forward mapping row_id -> row data
• Index: reverse mapping row data -> row_id(s)
• Updated together with the primary storage
Searching with an index
• Retrieve the row ids using the index
• Fetch the row data from primary storage
12. Database structures
Indexes – Hash index
How it works
• Stores hashes of column values in a hash table
• Retrieve through the hash table
Pros
• Very easy and fast to update
• Fast lookup – a single hashtable lookup
Cons
• Only provides equality matching
• Unable to answer inequality queries
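A hash index is essentially a hash table from column value to row ids. The sketch below (class and method names are illustrative, not from any real engine) shows both the fast equality lookup and why inequality queries are impossible: hashing destroys ordering.

```python
from collections import defaultdict

class HashIndex:
    """Toy hash index: column value -> list of matching row ids."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def insert(self, value, row_id):
        # Updating is a single append: cheap, as the slide notes.
        self.buckets[value].append(row_id)

    def lookup(self, value):
        # Equality only: there is no way to enumerate "age < 30"
        # because hashed keys carry no order.
        return self.buckets.get(value, [])

idx = HashIndex()
idx.insert(24, row_id=1)
idx.insert(24, row_id=7)
idx.insert(31, row_id=2)
print(idx.lookup(24))  # [1, 7]
```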
13. Database structures
Indexes – BTree index
[Diagrams: binary search tree vs. B-Tree]
Pros
• Provides range and inequality queries easily
• Quite fast (logarithmic) operations
Cons
• More complex and expensive to update
• B-Tree rebalancing
14. Choosing how to search
Is indexed search always better?
• SELECT * FROM author WHERE age < 300;
Analysis
• Fetches the whole table
• Index: random lookups
• Full scan: sequential fetch
Choosing wisely
• Identify the expensive queries
• Use the EXPLAIN statement
• Only add indexes where they are required
• Indexes are expensive to update
15. Joining
Goal
• Put together data from several tables
• For some values in table A, find matching values in table B
Example
• SELECT * FROM post
  INNER JOIN author ON author.id = post.author_id
  WHERE author.age = 42;
16. Join algorithms
Nested loops
• Foreach (author WHERE age = 42) {
    Foreach (post) {
      if (post.author_id == author.id) {
        append post to the result set;
      }
    }
  }
• Very naive algorithm: runs in P×A time
• Provides all predicates
Hash join
• Algorithm:
• Make a hashtable of author ids matching the « age = 42 » condition
• Scan the post table once
• For each post, look up in the hashtable to check if it matches a valid author
• Faster than nested loops (2 scans instead of A)
• Requires memory to store the hashtable
• Only provides equality predicates
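The hash join steps above can be sketched over toy in-memory rows (the data is invented for illustration; a real engine works on disk pages):

```python
authors = [
    {"id": 1, "age": 42},
    {"id": 2, "age": 27},
    {"id": 3, "age": 42},
]
posts = [
    {"id": 10, "author_id": 1, "title": "a"},
    {"id": 11, "author_id": 2, "title": "b"},
    {"id": 12, "author_id": 3, "title": "c"},
]

def hash_join(authors, posts, age):
    # Pass 1: hash table of author ids satisfying the predicate.
    matching = {a["id"] for a in authors if a["age"] == age}
    # Pass 2: single scan of posts, one O(1) membership probe per post.
    return [p for p in posts if p["author_id"] in matching]

print([p["id"] for p in hash_join(authors, posts, 42)])  # [10, 12]
```

Two scans total, versus one scan of posts per matching author in the nested-loop version; the cost is the memory holding `matching`.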
17. Join algorithms
Merge join
• Needs both tables sorted by the join key
• Post sorted by author_id
• Author sorted by id
• Perform a single parallel scan of the two tables and identify matches
• Fastest algorithm, but needs sorted data
• Disk-based sort for large data sets
Choice of join algorithm
• Performed automatically by the query optimizer (EXPLAIN)
• Main parameters:
• Relation cardinalities
• Data order (presence of an ORDER BY clause?)
• Available indexes
• JOINs are always expensive -> schema denormalization
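The single parallel scan of a merge join looks like the merge step of mergesort. A minimal sketch on pre-sorted toy tuples (it assumes unique author ids; real engines also handle duplicate keys on both sides):

```python
authors = [(1, "Smith"), (2, "Jones"), (4, "Roe")]   # sorted by id
posts = [(10, 1), (11, 1), (12, 2), (13, 3)]          # sorted by author_id

def merge_join(authors, posts):
    out, i, j = [], 0, 0
    while i < len(authors) and j < len(posts):
        a_id, p_aid = authors[i][0], posts[j][1]
        if a_id < p_aid:
            i += 1                    # author has no more posts, advance
        elif a_id > p_aid:
            j += 1                    # post's author not found, advance
        else:
            out.append((posts[j][0], authors[i][1]))  # (post id, author name)
            j += 1                    # same author may match further posts
    return out

print(merge_join(authors, posts))
# [(10, 'Smith'), (11, 'Smith'), (12, 'Jones')]
```

Each input is read exactly once, which is why this is the fastest option when the sort order is already available (e.g. from an index).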
18. Database scaling
Typical workloads
Mostly read workloads
• Example: Wikipedia
• First solution: high-level (frontend tier) caching
• Database scaling: 1 master – N slaves
• Replication of changes from master to slaves
• Does not solve the write bottleneck problem
High write workloads
• Examples: credit cards, Twitter (>1000 tweets/second, 1000s of deliveries)
• Performance limited by write I/O throughput
• Because of the « D » constraint
• Hard to have more than 1000-2000 writes/second
19. Database scaling
Scaling writes
Multiple master setups
• All masters have the same data and share the updates
• « Share-all » cluster architecture
• Extremely complex synchronization
• Bi-directional replication
• Conflict detection
• Bad performance
• Complex resilience
• Downtime of a master: needs a resync
• Complex, heavy and expensive architectures
[Diagram: clients 1 and 2 writing to masters 1 and 2, with a bi-directional replication flow between the masters]
20. Database scaling
Scaling writes
Sharding
• Split the data between the masters based on a criterion:
• Date
• User id
• hash(url), …
• Clients query the correct master for each piece of data
• No shared data between masters (« share-nothing »)
[Diagram: clients 1 and 2 each querying master 1 or master 2 directly]
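A distribution function like the ones above fits in a few lines of application code. This sketch routes each user id to a master with hash-modulo (the master names are hypothetical):

```python
import hashlib

MASTERS = ["db-master-1", "db-master-2", "db-master-3"]

def shard_for(user_id):
    # Stable hash, so every client picks the same master for a given user.
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return MASTERS[h % len(MASTERS)]

print(shard_for(42))
```

Note that adding a fourth master changes `h % len(MASTERS)` for most keys, which is exactly the "change the distribution function" evolution problem discussed on the next slide.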
21. Database scaling
Problems with SQL sharding
Complexity
• Not integrated in SQL
• Need to perform the sharding in application code
Resilience
• Several machines but no resilience
• Loss of one master = loss of data (compare to RAID-0)
Loss of features
• You can’t do cross-shard joins
Complex evolutions
• How do you keep scaling?
• To add another machine, you need to change the distribution function
22. Database scaling
Other SQL shortcomings
Strict schema
• It is good: it provides strong typing
• But: migration hell!
• Web applications change quickly
• Not « Agile »
24. A quick look at search engines
Differences from a traditional database
• Not designed for OLTP
• Updates by batches
• No transactions; updates become available to readers « later »
• Heavily read-optimized
Full text search
• It’s more complex than LIKE '%myword%';
• Needs specific data structures
30. NoSQL Movement: Issue
• RDBMS fail with huge amounts of data
– Facebook’s 70TB of inbox
– Digg’s 3TB
– eBay’s 2PB…
• High scale SQL systems are either:
– Very expensive to buy and quite hard to maintain
– Very expensive to maintain
31. NoSQL Movement
• We need new systems that:
– Scale horizontally (both read/write)
– Have no single point of failure
– Are fault tolerant
– Are elastic (adding nodes is easy)
– Have flexible data schemas
– Are more friendly to web applications
33. NoSQL: Key-Value stores
• Distributed hashtables
– Btrees
– Fixed sized tables
• Benefits:
– Very simple API (get/put/delete/range)
– Easily shardable
– Fast reads
• Drawbacks:
– No data schema (no joins, data flattening…)
– No query language
• Implementations: Redis, Amazon Dynamo, Voldemort
34. NoSQL: Column Stores
Id  Lastname  Firstname  Salary
1   Smith     Joe        40000
2   Jones     Mary       50000
3   Johnson   Cathy      44000
• Row based storage:
– 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
• Column based storage:
– 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
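The two serializations above can be produced from the same rows; this toy encoding just mirrors the `,`/`;`-separated strings on the slide:

```python
rows = [(1, "Smith", "Joe", 40000),
        (2, "Jones", "Mary", 50000),
        (3, "Johnson", "Cathy", 44000)]

# Row-based: one record after another.
row_based = ";".join(",".join(str(v) for v in r) for r in rows) + ";"

# Column-based: all values of column 0, then column 1, etc.
col_based = ";".join(
    ",".join(str(r[i]) for r in rows) for i in range(4)) + ";"

print(row_based)
# 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
print(col_based)
# 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
```

Reading one whole column (say, to sum salaries) touches one contiguous segment of `col_based` but scatters across every record of `row_based`; the trade-off reverses for reading one whole record.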
35. NoSQL: Column Stores
• Benefits:
– Reading all the values of a given column is faster (e.g. aggregates)
– Batch writes are faster
• Joins are faster
– Comparing two columns is sequential
– Many more L1 CPU cache hits
– L1 cache reference: 0.5 ns
– L2 cache reference: 7 ns
36. NoSQL: Column Stores
• Drawbacks:
– Reading a single object is slower (multiple I/Os)
– Writing a single object is slower (multiple I/Os)
– Doesn’t fit most applications
• Finally:
– Well suited for heavy write / read applications
• (e.g. Facebook inbox indexes)
37. NoSQL: Document Stores
• Can be seen as a schema-free, hierarchical database (usually represented as JSON)
SQL schema (Person 1 – N Animal):
Person: id, name, address, phone
Animal: id, person_id, name, address, phone
Document store (everything in one document):
Person:
- id
- name
- address
- phone
- animals:
  - name
  - address
  - phone
38. NoSQL: Document Stores
• Benefits:
– Data spatiality! Everything in one place
– Efficient writes and updates (in place)
– Efficient reads
– Highly flexible data schema
– Usually provides indexes over each object key, giving a powerful query language
• Drawbacks:
– Doesn’t encourage well designed data schemas
39. NoSQL: Graph Stores
• An entry is a node
• Nodes have properties
• Edges are links between nodes
40. NoSQL: Graph Stores
• Benefits:
– Faster to fetch an entry and its related entries (links are already resolved, no need to join)
– Flexible data schema
• Drawbacks:
– Complex APIs
– Slow for batch operations
– Open source implementations are not that good…
42. CAP Theorem
• CAP:
– Consistency: Operating fully or not at all.
– Availability: The service must be reachable at any time.
– Partition Tolerance: No set of failures less than total network failure is allowed to cause the system to respond incorrectly.
Any shared-data system can only achieve two of these three.
CAP Theorem, Dr. Eric Brewer, Berkeley (2000)
43. Consistent Hashing
• Ensuring data availability: replication!
• Reaching the right nodes? Hashing
• Consistent hashing: hash ring
– Objects are mapped into a range
– Nodes are mapped into that range
– We write the object to the nearest node, clockwise
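The ring above can be sketched with a sorted list of node positions and a binary search for the first node clockwise of a key (node names and the 32-bit range are illustrative; real systems like Dynamo also place virtual nodes):

```python
import bisect
import hashlib

def h(s):
    # Map any string into the ring's range [0, 2**32).
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % 2**32

class Ring:
    def __init__(self, nodes):
        # Each node is hashed onto the ring, kept sorted by position.
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        positions = [p for p, _ in self.points]
        # First node at or after the key's position, wrapping around.
        i = bisect.bisect_right(positions, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```

Unlike `hash(key) % N`, adding a node to the ring only remaps the keys between the new node and its predecessor, which is what makes elastic growth cheap.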
44. Data consistency
• Ensuring eventual data consistency: quorum writes
– W = number of writes to ensure before returning OK
– R = number of reads to ensure
– N = replication factor
• W < N == high write availability
– Data may be lost or outdated if read from another node
• R < N == high read availability
– Data may be outdated
• W + R > N == full consistency!
– But slower writes / reads
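The three conditions above are pure arithmetic, so they can be written as executable checks (the N/W/R values below are illustrative):

```python
def quorum_properties(n, w, r):
    """Classify an (N, W, R) quorum configuration."""
    return {
        "high_write_availability": w < n,
        "high_read_availability": r < n,
        # W + R > N: every read quorum overlaps every write quorum,
        # so at least one contacted replica has the latest write.
        "fully_consistent": w + r > n,
    }

# N=3 replicas, W=2, R=2: consistent while tolerating one slow node.
print(quorum_properties(3, 2, 2))

# N=3, W=1, R=1: fast, but a read may hit a replica that missed the write.
print(quorum_properties(3, 1, 1))
```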
45. Conflict resolution
• What happens when R > 1 and two different versions are found?
• Conflict resolution!
• Common algorithm: vector clocks
46. Vector clocks
• Assign each node a unique ID
• A node increments its own entry in the vector and keeps track of the old entries
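A minimal vector-clock sketch (function names are my own): each node increments its own entry; two versions conflict exactly when neither clock descends from the other.

```python
def increment(clock, node):
    # A node bumps its own entry when it writes.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    # Element-wise max: the clock of a reconciled version.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def descends(a, b):
    # True if 'a' has seen everything 'b' has: no conflict.
    return all(a.get(n, 0) >= v for n, v in b.items())

c1 = increment({}, "node1")   # {'node1': 1}
c2 = increment(c1, "node2")   # c1 then a write on node2
c3 = increment(c1, "node3")   # c1 then a concurrent write on node3

print(descends(c2, c1))                    # True: c2 supersedes c1
print(descends(c2, c3), descends(c3, c2))  # False False: conflict!
print(merge(c2, c3))                       # clock for the merged version
```

When a read with R > 1 returns c2 and c3, neither descends from the other, so the store must surface both versions (or apply an application-level merge) and tag the result with `merge(c2, c3)`.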
50. Choosing your storage system
• “Don’t optimize too early”
• MySQL is robust and works VERY well
– You’ll know where bugs come from (you)
• Key-Value stores are hype, and often badly implemented
• Anyway, the most mature “NoSQL” systems:
– MongoDB
– Cassandra