Lecture: The Google Bigtable
http://research.google.com/archive/bigtable.html
10/09/2014
Romain Jacotin
romain.jacotin@orange.fr
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
Is it Google's big table? ;-)
2
Introduction
Abstract
• Bigtable is a distributed storage system for managing structured data
  – Designed to scale to a very large size: petabytes of data across thousands of commodity servers
  – It is not a relational database; it is a sparse, distributed, persistent, multi-dimensional sorted map (key/value store)
• Many projects at Google store data in Bigtable
  – Web indexing, Google Earth, Google Finance, …
  – Different data sizes: URLs, web pages, satellite imagery, …
  – Different latency requirements: from backend bulk processing to real-time data serving
• → Bigtable is a flexible, high-performance solution
3
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
4
Data model
Figure: example web table. Row keys (com.cnn.www, com.google.www, com.lego.com, org.apache.hadoop, org.apache.hbase, org.golang) are kept in lexicographic order and the row range is partitioned into Tablet 1, Tablet 2 and Tablet 3. Column families contents and language each have a single column key (contents:, language:); column family anchor has a variable number of column keys (anchor:cnnsi.com, anchor:my.look.ca). Cells hold timestamped versions, e.g. contents: = <html>… ← t3, t5, t6; language: = EN ← t6; anchor:cnnsi.com = CNN ← t9; anchor:my.look.ca = CNN.com ← t8.
• Row key: a string of up to 64 KB; rows are kept in lexicographic order
• Column key: a string of the form family:qualifier
• A cell is an un-interpreted array of bytes
• Cell versions are custom garbage-collected by column family:
  – Keep the last N versions
  – Only keep values written in the last M days
• Timestamp management requires delete operations to also be recorded (= tombstone marker)
• The row range for a table is dynamically partitioned into tablets
(A toy sketch of this sorted-map model follows below.)
5
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
6
API
Bigtable API
• Tablet server: AddServer(tabletServer) / RemoveServer(tabletServer)
• Table: CreateTable(table) / DeleteTable(table)
• Column family: CreateColumnFamily(columnFamily) / DeleteColumnFamily(columnFamily)
• Table access control rights and metadata: SetTableFlag(table, flags) / ...
• Column family access control rights and metadata: SetColumnFamilyFlag(table, colfamily, flags) / ...
• Cell value: Put(rowkey, columnkey, value) / Get(rowkey, columnkey) / Delete(rowkey, columnkey)
• Look up a value from an individual row: Has(rowkey, columnfamily) / ...
• Look up values from a table (MapReduce-like RPC): Scan(rowFilter, columnFilter, timestampFilter)
  – Can iterate over multiple column families
  – Can limit rows/columns/timestamps
• Single-row transactions (atomic read-modify-write sequence)
• No support for general transactions across row keys
• Cells can be used as integer counters: Increment(rowkey, columnkey, increment)
• Execution of read-only client-supplied scripts in the address spaces of the servers: Sawzall
  – http://research.google.com/archive/sawzall.html
• Bigtable can be used with MapReduce (for input and/or output)
(A usage sketch follows below.)
7
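As a hedged illustration only, the sketch below shows what a single-row update, a counter increment and a scan might look like against the calls listed above. The `StubBigtableClient` object is a hypothetical stand-in; the real client library and its exact signatures are not given in the paper.

```python
class StubBigtableClient:
    """Hypothetical stand-in for a Bigtable client: only the call shapes
    match the slide; the behavior is a trivial in-memory dict."""

    def __init__(self):
        self.data = {}   # (rowkey, columnkey) -> value

    def Put(self, rowkey, columnkey, value):
        self.data[(rowkey, columnkey)] = value

    def Get(self, rowkey, columnkey):
        return self.data.get((rowkey, columnkey))

    def Delete(self, rowkey, columnkey):
        self.data.pop((rowkey, columnkey), None)

    def Increment(self, rowkey, columnkey, increment):
        # Cells used as integer counters (a single-row atomic operation in real Bigtable).
        self.data[(rowkey, columnkey)] = self.data.get((rowkey, columnkey), 0) + increment
        return self.data[(rowkey, columnkey)]

    def Scan(self, rowFilter, columnFilter=None, timestampFilter=None):
        # MapReduce-like scan: iterate rows matching a simple prefix filter.
        for (row, col), value in sorted(self.data.items()):
            if row.startswith(rowFilter) and (columnFilter is None or col.startswith(columnFilter)):
                yield row, col, value

bt = StubBigtableClient()
bt.Put("com.cnn.www", "anchor:cnnsi.com", "CNN")
bt.Put("com.cnn.www", "contents:", "<html>...")
bt.Increment("com.cnn.www", "stats:hits", 1)
print(bt.Get("com.cnn.www", "anchor:cnnsi.com"))
for row, col, value in bt.Scan("com.cnn", columnFilter="anchor:"):
    print(row, col, value)
```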
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
8
Building blocks
Google File System (GFS)
• Bigtable uses the fault-tolerant and scalable distributed GFS file system to store log and data files
  – metadata = METADATA tablets (store tablet locations)
  – data = the SSTable collection of each tablet
  – log = tablet commit logs
Google SSTable file format (Sorted String Table)
• Used to store table data in GFS
  – Persistent, ordered, immutable map from keys to values
Figure: inside GFS, the METADATA root tablet points to METADATA tablets (1-42 and 43-51), which point to the DATA tablets; each tablet (128 MB) is stored as SSTable files (64 KB blocks: Block 0-3 plus a block index) and log files (commit logs for Tablet 1, Tablet 2, Tablet 3).
9
Building blocks
Google SSTable file format (Sorted String Table)
• Contains a sequence of 64 KB blocks (size configurable)
• Optionally, blocks can be completely mapped into memory (= lookups and scans without touching disk)
• Block index stored at the end of the file
  – Used to locate blocks
  – The index is loaded in memory when the SSTable is opened
  – Lookup with a single seek (sketch below):
    1. Find the appropriate block by performing a binary search in the in-memory index
    2. Read the appropriate block from disk
Figure: an SSTable of Tablet 1 with Blocks 0-3; each block holds sorted entries (row key, column key, timestamp, value); the block index records, for each block, its first row, last row and file offset (e.g. Block 0: rowkey 1 - rowkey 3, Block 1: rowkey 4 - rowkey 9, Block 2: rowkey 10 - rowkey 14, Block 3: rowkey 15 - rowkey 18).
10
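A minimal sketch of the single-seek lookup described above, assuming an in-memory block index of (first row, last row, file offset) entries like the figure. The file layout, entry encoding and names are illustrative, not the real SSTable format.

```python
import bisect

# In-memory block index: (first_row, last_row, file_offset), sorted by first_row.
block_index = [
    ("rowkey01", "rowkey03", 0),
    ("rowkey04", "rowkey09", 65536),
    ("rowkey10", "rowkey14", 131072),
    ("rowkey15", "rowkey18", 196608),
]
first_rows = [entry[0] for entry in block_index]

def locate_block(rowkey):
    """Binary search in the in-memory index: returns the file offset of the
    single 64 KB block that may contain rowkey, or None."""
    i = bisect.bisect_right(first_rows, rowkey) - 1
    if i < 0:
        return None
    first_row, last_row, offset = block_index[i]
    if rowkey > last_row:
        return None        # falls between blocks: row not in this SSTable
    return offset          # one disk seek: read the 64 KB block at this offset

print(locate_block("rowkey06"))   # 65536 -> read Block 1 from disk
print(locate_block("rowkey20"))   # None  -> no disk access needed
```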
Building blocks
Google Chubby
• The Chubby service consists of 5 active replicas, with one master to serve requests (Paxos consensus)
• Chubby provides a namespace that contains directories and small files (< 256 KB)
  – Each directory or file can be used as a lock
  – Reads and writes to a file are atomic
  – The Chubby client library provides consistent caching of Chubby files
  – Each Chubby client maintains a session with the Chubby service
  – A client's session expires if it is unable to renew its session lease within the lease expiration time
  – When a client's session expires, it loses any locks and open handles
  – Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration (server push)
• Bigtable uses Chubby for a variety of tasks
  – To ensure there is at most one active master at any time
  – To store the bootstrap location of Bigtable data (Root tablet)
  – To discover tablet servers and finalize tablet server deaths
  – To store Bigtable schema information (column family information for each table)
  – To store access control lists (ACLs)
  – Chubby unavailable = Bigtable unavailable
  – Across 14 Bigtable clusters spanning 11 Chubby instances, the average percentage of server hours unavailable was 0.0047%
Figure: applications use the Chubby client library to send RPCs to the master of a Chubby cell of 5 replicas.
11
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
12
Implementation
Major components
• One master server
• Many tablet servers
• A library linked into every client
Master
  – Assigns tablets to tablet servers
  – Detects the addition and expiration of tablet servers
  – Balances tablet-server load
  – Garbage-collects files in GFS
  – Handles schema changes (table creation, column family creation/deletion)
Tablet server
  – Manages a set of tablets
  – Handles read and write requests to its tablets
  – Splits tablets that have grown too large (100-200 MB)
Client
  – Does not rely on the master for tablet location information
  – Communicates directly with tablet servers for reads and writes
Figure: the Client App (with the Bigtable library) issues read/write RPCs directly to tablet servers; the master scans METADATA, assigns tablets, monitors tablet servers and garbage-collects GFS; the master, tablet servers and clients all use the Chubby library to talk to the Chubby cell (5 replicas); tablets (SSTable blocks, block index, commit log) and the METADATA root tablet and METADATA tablets 1-42 / 43-51 live in GFS.
13
Tablet Location
A "DNS-like" tablet server location! ;-)
• 3-level hierarchy to store tablet location information (similar to a B+ tree)
  1. Chubby cell: a Chubby file stores the location of the Root tablet ("I know who manages the Root tablet")
  2. METADATA root tablet: knows which tablet server manages each METADATA tablet, based on a "table name + end row" entry (the Root tablet is never split!)
  3. METADATA tablets: for the table named "Users" and the given row key "Romain Jacotin", a METADATA tablet knows which DATA tablet server manages Tablet 1, whose row range could possibly contain this row key (and which GFS files are used for Tablet 1)
• In fact METADATA tablets are "normal" tablets (SSTables + commit log) with specific access control rights and column family schema!
• Empty client cache = 3 network round-trips:
  1. Request to Chubby for the Root tablet location
  2. Request to the Root tablet for the METADATA tablet location
  3. Request to the METADATA tablet for the tablet location
• Stale cache: worst case = 6 network round-trips (stale cache entries are only discovered upon misses)
(A lookup sketch with client-side caching follows below.)
14
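A hedged sketch of the client-side lookup and cache, using toy in-memory stand-ins for Chubby, the Root tablet and the METADATA tablets. All names, contents and the simple end-row encoding below are illustrative assumptions, not the real on-disk format.

```python
import bisect

# Toy stand-ins for the three levels.
CHUBBY_ROOT_LOCATION = "metadata-server-0"               # level 1: Chubby file
ROOT_TABLET = [("Users\xffzzz", "metadata-server-1")]     # level 2: end row -> METADATA tablet server
METADATA_TABLETS = {                                       # level 3: end row -> DATA tablet server
    "metadata-server-1": [("Users\xffMmm", "tablet-server-7"),
                          ("Users\xffzzz", "tablet-server-9")],
}

location_cache = {}   # client-library cache of resolved locations

def pick(entries, key):
    # Entries are sorted by end row; the first end row >= key owns the range.
    end_rows = [e[0] for e in entries]
    i = bisect.bisect_left(end_rows, key)
    return entries[min(i, len(entries) - 1)][1]

def lookup(table, rowkey):
    """Resolve which tablet server holds (table, rowkey): 3 round-trips on an
    empty cache, 0 on a hit; stale entries are only discovered on a miss."""
    key = table + "\xff" + rowkey
    if key in location_cache:
        return location_cache[key]
    root_server = CHUBBY_ROOT_LOCATION                       # round-trip 1: ask Chubby
    meta_server = pick(ROOT_TABLET, key)                     # round-trip 2: ask the Root tablet at root_server
    data_server = pick(METADATA_TABLETS[meta_server], key)   # round-trip 3: ask the METADATA tablet
    location_cache[key] = data_server
    return data_server

print(lookup("Users", "Romain Jacotin"))   # e.g. tablet-server-9
```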
Implementation
Tablet Assignment
• Each tablet is assigned to one tablet server at a time
• The master keeps track of
  – the set of live tablet servers (tracked via Chubby)
  – the current assignment of tablets to tablet servers
  – the currently unassigned tablets
• When a tablet is unassigned, the master assigns it to an available tablet server by sending a tablet load request to that tablet server
15
Implementation
Tablet Server discovery
• When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a specific Chubby directory (the "servers" directory)
• The master monitors this directory to discover tablet servers
• A tablet server stops serving its tablets if it loses its exclusive Chubby lock
• If the Chubby file no longer exists, the tablet server will never be able to serve again, so it kills itself
(A toy sketch of this lock-based protocol follows below.)
16
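As a hedged illustration of the lock-based liveness protocol, the sketch below uses a toy in-memory stand-in for the Chubby "servers" directory; the real system uses Chubby files, sessions and leases, and all class and method names here are assumptions.

```python
import uuid

class ToyChubbyDirectory:
    """Toy stand-in for the Chubby 'servers' directory: file name -> lock holder."""
    def __init__(self):
        self.files = {}

    def create_and_lock(self, name, owner):
        if name in self.files:
            return False
        self.files[name] = owner
        return True

    def holds_lock(self, name, owner):
        return self.files.get(name) == owner

    def delete(self, name):
        self.files.pop(name, None)

class TabletServer:
    def __init__(self, chubby):
        self.chubby = chubby
        self.name = "servers/" + uuid.uuid4().hex   # uniquely-named file
        self.serving = self.chubby.create_and_lock(self.name, owner=self)

    def check_lock(self):
        # Stop serving tablets if the exclusive lock is lost; if the file no
        # longer exists, the server can never serve again, so it kills itself.
        if not self.chubby.holds_lock(self.name, self):
            self.serving = False
        if self.name not in self.chubby.files:
            self.kill_self()

    def kill_self(self):
        self.serving = False
        print(self.name, "killed itself")

chubby = ToyChubbyDirectory()
ts = TabletServer(chubby)
print("serving:", ts.serving)    # True
chubby.delete(ts.name)           # e.g. the master deletes the server file
ts.check_lock()                  # the tablet server stops serving and kills itself
```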
Implementation
Tablet Server monitoring
• The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets
• The master periodically asks each tablet server for the status of its lock to detect when a tablet server is no longer serving its tablets
• If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last few attempts, the master attempts to acquire the lock on that server's Chubby file
  – If the master is able to acquire the lock, then Chubby is live and the tablet server is dead or isolated; the master deletes the server file to ensure that the tablet server can never serve again
  – The master can then move all the tablets that were previously assigned to this tablet server into the set of unassigned tablets
17
Implementation
Master isolated
• To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires (master failures do not change the assignment of tablets to tablet servers)
18
Implementation
Master startup
• When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them:
  1. The master grabs a unique master lock in Chubby to prevent concurrent master instantiations
  2. The master scans the "servers" directory in Chubby to find the live tablet servers
  3. The master communicates with every live tablet server to discover which tablets are already assigned to each server
  4. The master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not discovered in step 3
  5. The master scans the METADATA table to learn the set of tablets (and detect unassigned tablets)
19
Implementation
Tablet: merging / splitting
• The set of existing tablets only changes when a table is created or deleted, when two existing tablets are merged to form one larger tablet, or when an existing tablet is split into two smaller tablets
• The master initiates tablet merging
• A tablet server initiates tablet splitting
  – It commits the split by recording information for the new tablet in the METADATA table
  – After the split has been committed, the tablet server notifies the master
20
Implementation
Tablet Serving
• Write operation (sketch below)
  1. The server checks that the request is well-formed
  2. The server checks that the sender is authorized to write (list of permitted writers in a Chubby file)
  3. A valid mutation is written to the commit log that stores redo records (group commit to improve throughput)
  4. After the mutation has been committed, its contents are inserted into the memtable (= in-memory sorted buffer)
Figure: the tablet server performs steps 1-2, appends the mutation to the commit log file in GFS (step 3), then inserts it into the memtable (step 4); reads are served from the memtable and SSTable files 1-3 in GFS.
21
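A minimal sketch of this write path, assuming a toy append-only commit log file and an in-memory memtable; the real system writes redo records to GFS, uses group commit, and the names below are illustrative.

```python
import time

class ToyTablet:
    def __init__(self, log_path="commit.log"):
        self.log_path = log_path
        self.memtable = {}       # in-memory sorted buffer: (row, col) -> (ts, value)

    def write(self, rowkey, columnkey, value, authorized=True):
        # 1. check that the request is well-formed / 2. check that the sender is authorized
        if not authorized:
            raise PermissionError("writer not in the permitted list")
        ts = time.time()
        # 3. append a redo record to the commit log (group commit omitted here)
        with open(self.log_path, "a") as log:
            log.write(f"{ts}\t{rowkey}\t{columnkey}\t{value}\n")
        # 4. after the mutation is committed, insert its contents into the memtable
        self.memtable[(rowkey, columnkey)] = (ts, value)

t = ToyTablet()
t.write("com.cnn.www", "contents:", "<html>...")
print(sorted(t.memtable))
```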
Implementation
Tablet Serving
• Read operation (sketch below)
  1. The server checks that the request is well-formed
  2. The server checks that the sender is authorized to read (list of permitted readers from a Chubby file)
  3. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable
Figure: step 3 reads from the memtable and from SSTable files 1-3 in GFS to build the merged view.
22
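A hedged sketch of a merged read over the memtable and SSTables, assuming each source is just a dict of (row, col) -> (timestamp, value) and the newest timestamp wins; the real merged view also handles deletion tombstones and iterates the sources in sorted order.

```python
def merged_read(rowkey, columnkey, memtable, sstables):
    """Return the newest value for (rowkey, columnkey) across the memtable
    and the sequence of SSTables."""
    candidates = []
    for source in [memtable] + list(sstables):
        if (rowkey, columnkey) in source:
            candidates.append(source[(rowkey, columnkey)])
    if not candidates:
        return None
    ts, value = max(candidates)     # newest timestamp wins
    return value

memtable = {("com.cnn.www", "contents:"): (6, "<html>v6")}
sstable1 = {("com.cnn.www", "contents:"): (3, "<html>v3")}
sstable2 = {("com.cnn.www", "anchor:cnnsi.com"): (9, "CNN")}
print(merged_read("com.cnn.www", "contents:", memtable, [sstable1, sstable2]))   # <html>v6
```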
Implementation
Tablet Serving
• Tablet Recovery
  1. The tablet server reads the tablet's metadata from the METADATA table (the list of SSTables that comprise the tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet)
  2. The tablet server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points
Figure: the recovered tablet again consists of the memtable, the commit log and SSTable files 1-3 in GFS.
23
Implementation
Compactions
• Minor compaction (sketch below)
  – When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to a new SSTable and written to GFS
  – Two goals: shrink the memory usage of the tablet server, and reduce the amount of data that has to be read from the commit log during a recovery
Figure: before, the memtable is 80% full and the tablet has SSTable file 1; after, the memtable is empty (0%) and the frozen memtable has become new SSTable file 2 (memtable = new SSTable).
24
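A hedged sketch of a minor compaction, continuing the toy dict-based tablet used above: freeze the memtable, write it out as a new sorted "SSTable", start a fresh memtable. The threshold, file format and names are assumptions; the real system writes SSTables to GFS and records a new redo point.

```python
class ToyCompactingTablet:
    def __init__(self, threshold=4):
        self.memtable = {}        # (row, col) -> (ts, value)
        self.sstables = []        # immutable dicts, oldest first
        self.threshold = threshold
        self.next_sst_id = 1

    def write(self, rowkey, columnkey, ts, value):
        self.memtable[(rowkey, columnkey)] = (ts, value)
        if len(self.memtable) >= self.threshold:
            self.minor_compaction()

    def minor_compaction(self):
        frozen = self.memtable            # freeze the current memtable
        self.memtable = {}                # and create a new one
        path = f"sst_{self.next_sst_id}.txt"
        self.next_sst_id += 1
        # Convert the frozen memtable into a new, sorted, immutable SSTable on disk.
        with open(path, "w") as f:
            for (row, col) in sorted(frozen):
                ts, value = frozen[(row, col)]
                f.write(f"{row}\t{col}\t{ts}\t{value}\n")
        self.sstables.append(dict(frozen))
        # Shrinks memory usage and reduces commit-log replay needed at recovery.

t = ToyCompactingTablet(threshold=2)
t.write("com.cnn.www", "contents:", 6, "<html>...")
t.write("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")   # triggers a minor compaction
print(len(t.sstables), t.memtable)                     # 1 {}
```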
Implementation
Compactions
• Merging compaction (sketch below)
  – Problem: every minor compaction creates a new SSTable (→ an arbitrary number of SSTables!)
  – Solution: periodic merging of a few SSTables and the memtable
Figure: before, the memtable is 80% full and the tablet has SSTable files 1, 2 and 3; after, the memtable is empty (0%) and SSTable file 2 + SSTable file 3 + memtable have been merged into new SSTable file 4 (SSTable file 1 is untouched).
25
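A hedged sketch of a merging compaction over the toy dict-based SSTables used above: read a few SSTables plus the memtable, keep the newest version of each key, and produce one new SSTable. The real merge streams sorted SSTable blocks and keeps deletion tombstones until a major compaction.

```python
def merging_compaction(sstables, memtable):
    """Merge a few SSTables and the memtable into one new SSTable
    (dict of (row, col) -> (ts, value)), keeping the newest version of each key."""
    merged = {}
    for source in list(sstables) + [memtable]:
        for key, (ts, value) in source.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged

sst2 = {("com.cnn.www", "contents:"): (3, "<html>v3")}
sst3 = {("com.cnn.www", "contents:"): (5, "<html>v5")}
memtable = {("com.cnn.www", "anchor:cnnsi.com"): (9, "CNN")}
new_sst = merging_compaction([sst2, sst3], memtable)
print(sorted(new_sst.items()))   # newest contents: version plus the anchor cell
```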
Implementation
Compactions
• Major compaction
  – A merging compaction that rewrites all SSTables into exactly one SSTable that contains no deletion information or deleted data
  – Bigtable cycles through all of its tablets and regularly applies a major compaction to them (= reclaims resources used by deleted data in a timely fashion)
Figure: before, the tablet has SSTable files 1, 4 and 5; after, SSTable file 1 + SSTable file 4 + SSTable file 5 have been rewritten into a single new SSTable file 6.
26
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
27
Refinements
Locality groups
• Clients can group multiple column families together into a locality group → more efficient reads ☺
  – A separate SSTable is generated for each locality group in each tablet
In-memory
• A locality group can be declared to be in-memory → no need for disk access ☺
Compression
• Clients can control whether or not the SSTables for a locality group are compressed
  – The compression format is applied to each SSTable block (64 KB)
  – Two-pass compression: the first pass uses Bentley and McIlroy's scheme, the second pass uses a fast compression that looks for repetitions in a 16 KB window of the data (encode rate = 100-200 MB/s, decode rate = 400-1000 MB/s)
28
Refinements
Caching for read performance
• Tablet servers use two levels of caching
  – The Scan Cache: a high-level cache for the key-value pairs returned by the SSTable interface to the tablet server code
  – The Block Cache: a low-level cache for SSTable blocks read from GFS
Figure: to serve "Get K51 value!", the tablet server code first asks "is the K51 value in the Scan Cache?"; on a miss, the SSTable interface asks "is Block 1 in the Block Cache?" (the block index maps K50/K51/K52 to offsets inside Block 1); only on a second miss is Block 1 read from the SSTable file (64 KB blocks) in GFS.
(A sketch of the two cache levels follows below.)
29
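A hedged sketch of the two cache levels, assuming trivial unbounded dict-based caches and injected helper functions for the block index, GFS read and in-block search; the real caches have size limits and eviction, and all names here are illustrative.

```python
scan_cache = {}    # high level: (sstable, key) -> value returned by the SSTable interface
block_cache = {}   # low level: (sstable, block_no) -> raw block read from GFS

def read_value(sstable, key, block_no_for, read_block_from_gfs, find_in_block):
    """Serve a key, trying the Scan Cache, then the Block Cache, then GFS."""
    if (sstable, key) in scan_cache:                    # is the value in the Scan Cache?
        return scan_cache[(sstable, key)]
    block_no = block_no_for(sstable, key)               # block index lookup
    if (sstable, block_no) not in block_cache:          # is the block in the Block Cache?
        block_cache[(sstable, block_no)] = read_block_from_gfs(sstable, block_no)
    value = find_in_block(block_cache[(sstable, block_no)], key)
    scan_cache[(sstable, key)] = value
    return value

# Toy plumbing so the sketch runs end to end.
fake_blocks = {("sst1", 1): {"K51": "V51", "K52": "V52"}}
value = read_value(
    "sst1", "K51",
    block_no_for=lambda sst, key: 1,
    read_block_from_gfs=lambda sst, n: fake_blocks[(sst, n)],
    find_in_block=lambda block, key: block[key],
)
print(value)   # V51, fetched from GFS once, then cached at both levels
```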
Refinements
Bloom filters
• Problem: a read operation has to read from all SSTables that make up the state of a tablet
  – Lots of disk accesses
• Solution: use a Bloom filter for each SSTable in a particular locality group (sketch below)
  – A Bloom filter uses a small amount of memory and makes it possible to know that an SSTable does not contain a specified row/column pair
  – Most lookups for non-existent rows or columns do not need to touch disk ☺
30
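A hedged sketch of a per-SSTable Bloom filter: a small, fixed-size bit array with k hash functions, where a negative answer means the SSTable definitely does not contain the row/column pair, so the disk read can be skipped. The sizing and hashing scheme below are illustrative, not Bigtable's actual parameters.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                    # small amount of memory per SSTable

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, rowkey, columnkey):
        for pos in self._positions((rowkey, columnkey)):
            self.bits |= 1 << pos

    def might_contain(self, rowkey, columnkey):
        # False -> the SSTable definitely does not contain this row/column pair.
        return all(self.bits >> pos & 1 for pos in self._positions((rowkey, columnkey)))

bf = BloomFilter()
bf.add("com.cnn.www", "contents:")
print(bf.might_contain("com.cnn.www", "contents:"))   # True
print(bf.might_contain("org.golang", "contents:"))    # almost certainly False -> skip the disk read
```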
Refinements
Commit-log implementation
• Problem: if we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS = a large number of disk seeks to write to the different physical log files …
• Solution: append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file (sketch below)
  – Using one log provides significant performance benefits during normal operation ☺
  – Using one log complicates recovery: the commit log is sorted in a distributed way before the tablets are reassigned
Figure: before, Tablet 42 and Tablet 1789 each have their own commit log, each replicated 3 times in GFS; after, a single commit log per tablet server receives the mutations applied to both Tablet 42 and Tablet 1789 and is replicated 3 times in GFS.
31
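A hedged sketch of the single shared commit log: mutations for different tablets are co-mingled in one physical log per tablet server, and recovery sorts the entries by (tablet, log sequence number) so that each tablet's mutations become a contiguous, in-order run. In the real system this sort is performed in parallel across tablet servers on log segments; the record layout below is an assumption.

```python
commit_log = []          # one physical log per tablet server
sequence_number = 0

def append_mutation(tablet, rowkey, value):
    global sequence_number
    sequence_number += 1
    commit_log.append((tablet, sequence_number, rowkey, value))   # co-mingled tablets

append_mutation("tablet-42",   "com.cnn.www",  "<html>...")
append_mutation("tablet-1789", "org.golang",   "Go")
append_mutation("tablet-42",   "com.lego.com", "<html>...")

# Recovery after the tablet server dies: sort once by (tablet, sequence number)
# so the new server of each tablet reads only its own contiguous range.
sorted_log = sorted(commit_log, key=lambda entry: (entry[0], entry[1]))
for entry in sorted_log:
    print(entry)
```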
Refinements
Speeding up tablet recovery
1. If the master moves a tablet from one tablet server to another, the source tablet server first does a minor compaction on that tablet
2. After this compaction, the source tablet server stops serving the tablet
3. Before unloading the tablet, the source tablet server does another (very fast) minor compaction to eliminate any remaining un-compacted state in the tablet server's log that arrived while the first minor compaction was being performed
4. The tablet can now be loaded on another tablet server
32
Refinements
Exploiting immutability
• Various parts of the Bigtable system have been simplified by the fact that all of the SSTables generated are immutable
  – No need for synchronization when reading from SSTables → easy concurrency control over rows
  – The only mutable data structure accessed by both reads and writes is the memtable → each memtable row uses copy-on-write, allowing reads and writes to proceed in parallel
• Garbage collection of obsolete SSTables
  – The master removes obsolete SSTables (registered in the METADATA table) with a mark-and-sweep garbage collection over the set of SSTables
• Immutability of SSTables makes it possible to split tablets quickly
  – Child tablets share the SSTables of the parent tablet
33
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
34
Performance evaluation (2006)
Bigtable cluster
• 500 tablet servers
  – Configured to use 1 GB RAM
  – Dual-core Opteron 2 GHz, Gigabit Ethernet NIC
  – Write to a GFS cell (1786 machines with 2 x 400 GB IDE disks)
• 500 clients
• Network round-trip time between any pair of machines < 1 millisecond
Figure: 500 clients, 500 tablet servers and 1786 GFS chunkservers (each running a GFS server over its local GFS data).
35
Performance evaluation (2006)
Sequential writes
• Used R row keys, partitioned and assigned to N clients
• A single random (uncompressible) string is written under each row key
Random writes
• Similar to sequential writes, except the row key is hashed modulo R
Sequential reads
• Used R row keys, partitioned and assigned to N clients
Random reads
• Similar to sequential reads, except the row key is hashed modulo R
Random reads (memory)
• Similar to the random reads benchmark, except the locality group that contains the data is marked as in-memory
Scans
• Similar to random reads, but uses the support provided by the Bigtable API for scanning over all values in a row range (reduces RPCs)
36
Performance evaluation (2006)
• Reads
• Writes
• Scans
Figure: number of 1000-byte values read/written per second, per tablet server and as an aggregate rate.
37
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
38
Experience (2006)
Real applications
• Google Analytics
• Google Earth
• Personalized Search
• …
Figure: tables in production (table sizes measured before compression) and the distribution of the number of tablet servers in Bigtable clusters.
39
Experience (2006)
Lessons
• Large distributed systems are vulnerable to many types of failures
  – memory and network corruption
  – large clock skew
  – hung machines
  – extended and asymmetric network partitions
  – bugs
  – overflow
  – planned and unplanned hardware maintenance, …
• Importance of proper system-level monitoring
  – lock contention detection on tablet data structures
  – slow writes to GFS while committing Bigtable mutations
  – stuck accesses to METADATA when METADATA tablets are unavailable
  – tracking down all clusters via Chubby
• The value is in simple designs!!! ☺
40
Agenda
• Introduction
• Data model
• API
• Building blocks
• Implementation
• Refinements
• Performance evaluation
• Experience
• Conclusions
41
Conclusions (2006)
Bigtable
• 7 person-years of design and implementation, in production since April 2005
• More than 60 projects were using Bigtable (August 2006)
• Users like
  – performance and high availability
  – scaling capacity by simply adding more machines
• SQL users are sometimes uncertain of how to best use the Bigtable interface …
• New features to come (2006)
  – Secondary indices
  – Cross-data-center replicated Bigtable with multiple master replicas
42
Any questions?
Romain Jacotin
romain.jacotin@orange.fr
43

More Related Content

What's hot

Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan BlueDatabricks
 
Understanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache CassandraUnderstanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache CassandraDataStax
 
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...Amazon Web Services
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in CassandraShogo Hoshii
 
Facebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platformsFacebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platformsNitish Upreti
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compactionMIJIN AN
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Dreaming Infrastructure
Dreaming InfrastructureDreaming Infrastructure
Dreaming Infrastructurekyhpudding
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouseAltinity Ltd
 
Introduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStackIntroduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStackOpenStack_Online
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databasesArangoDB Database
 
Cassandra consistency
Cassandra consistencyCassandra consistency
Cassandra consistencyzqhxuyuan
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsShubham Tagra
 

What's hot (20)

Spark and S3 with Ryan Blue
Spark and S3 with Ryan BlueSpark and S3 with Ryan Blue
Spark and S3 with Ryan Blue
 
Big table
Big tableBig table
Big table
 
Mini training - Introduction to Microsoft Azure Storage
Mini training - Introduction to Microsoft Azure StorageMini training - Introduction to Microsoft Azure Storage
Mini training - Introduction to Microsoft Azure Storage
 
Understanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache CassandraUnderstanding Data Partitioning and Replication in Apache Cassandra
Understanding Data Partitioning and Replication in Apache Cassandra
 
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
Redshift at Lightspeed: How to continuously optimize and modify Redshift sche...
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Large partition in Cassandra
Large partition in CassandraLarge partition in Cassandra
Large partition in Cassandra
 
Facebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platformsFacebook's TAO & Unicorn data storage and search platforms
Facebook's TAO & Unicorn data storage and search platforms
 
RocksDB compaction
RocksDB compactionRocksDB compaction
RocksDB compaction
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Vertica-Database
Vertica-DatabaseVertica-Database
Vertica-Database
 
Dreaming Infrastructure
Dreaming InfrastructureDreaming Infrastructure
Dreaming Infrastructure
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
Introduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStackIntroduction into Ceph storage for OpenStack
Introduction into Ceph storage for OpenStack
 
Introduction to column oriented databases
Introduction to column oriented databasesIntroduction to column oriented databases
Introduction to column oriented databases
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
Cassandra consistency
Cassandra consistencyCassandra consistency
Cassandra consistency
 
Presto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analystsPresto best practices for Cluster admins, data engineers and analysts
Presto best practices for Cluster admins, data engineers and analysts
 

Similar to The Google Bigtable

8. column oriented databases
8. column oriented databases8. column oriented databases
8. column oriented databasesFabio Fumarola
 
ClustrixDB 7.5 Announcement
ClustrixDB 7.5 AnnouncementClustrixDB 7.5 Announcement
ClustrixDB 7.5 AnnouncementClustrix
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.
 
What's New in MySQL 5.7
What's New in MySQL 5.7What's New in MySQL 5.7
What's New in MySQL 5.7Olivier DASINI
 
Hypertable Berlin Buzzwords
Hypertable Berlin BuzzwordsHypertable Berlin Buzzwords
Hypertable Berlin Buzzwordshypertable
 
Google Bigtable paper presentation
Google Bigtable paper presentationGoogle Bigtable paper presentation
Google Bigtable paper presentationvanjakom
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
Software Engineering Advice from Google's Jeff Dean for Big, Distributed Systems
Software Engineering Advice from Google's Jeff Dean for Big, Distributed SystemsSoftware Engineering Advice from Google's Jeff Dean for Big, Distributed Systems
Software Engineering Advice from Google's Jeff Dean for Big, Distributed Systemsadrianionel
 
MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015Mario Beck
 
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDaehyeok Kim
 
MySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMark Swarbrick
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLAlexei Krasner
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331Fengchang Xie
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0EDB
 
Take your database source code and data under control
Take your database source code and data under controlTake your database source code and data under control
Take your database source code and data under controlMarcin Przepiórowski
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectureshypertable
 
storage-systems.pptx
storage-systems.pptxstorage-systems.pptx
storage-systems.pptxShimoFcis
 

Similar to The Google Bigtable (20)

8. column oriented databases
8. column oriented databases8. column oriented databases
8. column oriented databases
 
GOOGLE BIGTABLE
GOOGLE BIGTABLEGOOGLE BIGTABLE
GOOGLE BIGTABLE
 
Google File System
Google File SystemGoogle File System
Google File System
 
ClustrixDB 7.5 Announcement
ClustrixDB 7.5 AnnouncementClustrixDB 7.5 Announcement
ClustrixDB 7.5 Announcement
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
What's New in MySQL 5.7
What's New in MySQL 5.7What's New in MySQL 5.7
What's New in MySQL 5.7
 
Hypertable Berlin Buzzwords
Hypertable Berlin BuzzwordsHypertable Berlin Buzzwords
Hypertable Berlin Buzzwords
 
Google Bigtable paper presentation
Google Bigtable paper presentationGoogle Bigtable paper presentation
Google Bigtable paper presentation
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
Software Engineering Advice from Google's Jeff Dean for Big, Distributed Systems
Software Engineering Advice from Google's Jeff Dean for Big, Distributed SystemsSoftware Engineering Advice from Google's Jeff Dean for Big, Distributed Systems
Software Engineering Advice from Google's Jeff Dean for Big, Distributed Systems
 
MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015MySQL 5.7: What's New, Nov. 2015
MySQL 5.7: What's New, Nov. 2015
 
Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed Systems
 
MySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats newMySQL Tech Tour 2015 - 5.7 Whats new
MySQL Tech Tour 2015 - 5.7 Whats new
 
Fudcon talk.ppt
Fudcon talk.pptFudcon talk.ppt
Fudcon talk.ppt
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQL
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331
 
Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0Active/Active Database Solutions with Log Based Replication in xDB 6.0
Active/Active Database Solutions with Log Based Replication in xDB 6.0
 
Take your database source code and data under control
Take your database source code and data under controlTake your database source code and data under control
Take your database source code and data under control
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
 
storage-systems.pptx
storage-systems.pptxstorage-systems.pptx
storage-systems.pptx
 

Recently uploaded

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 

Recently uploaded (20)

SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 

The Google Bigtable

  • 1. Lecture: The Google Bigtable h#p://research.google.com/archive/bigtable.html 10/09/2014 Romain Jaco3n romain.jaco7n@orange.fr
  • 2. Agenda • Introduc3on • Data model • API • Building blocks • Implementa7on • Refinements • Performance evalua7on • Experience • Conclusions 2 is it the Google’s big table ? ;-­‐)
  • 3. Introduc7on Abstract • Bigtable is a distributed storage system for managing structured data – Designed to scale to a very large size: petabytes of data across thousands of commodity servers – It is not a rela7onal database, it is a sparse, distributed, persistent mul7-­‐dimensional sorted map (key/value store) • Many projects at Google store data in Bigtable – Web indexing, Google earth and Google finance, … – Different data sizes: URL, web pages, satellite imagery, … – Different latency requirements: backend bulk processing to real-­‐7me data serving • à Bigtable is a flexible, high performance solu<on 3
  • 4. Agenda • Introduc7on • Data model • API • Building blocks • Implementa7on • Refinements • Performance evalua7on • Experience • Conclusions 4
  • 5. Data model <html>… ß t3 <html>… ß t5 5 Column Family contents Column Family language Column Family anchor contents: language: anchor:cnnsi.com anchor:my.look.ca Column key string = family:qualifier com.cnn.www com.google.www com.lego.com org.apache.hadoop org.apache.hbase org.golang 64 KB string in lexicographic order Row key Column Family example with only one column key Column Family example with variable mul3ple column keys Column Family name Tablet 1 Tablet 2 Tablet 3 <html>… ß t6 EN ß t6 CNN ß t9 CNN.com ß t8 A cell is an un-­‐interpreted array of bytes Cell versions are custom garbage-­‐collected by column family: • Keep the last N versions • Only keep values wriQen in the last M days Timestamp management requires delete opera7ons to be also recorded (= tombstone marker) The row range for a table is dynamically par33oned in tablets
  • 6. Agenda • Introduc7on • Data model • API • Building blocks • Implementa7on • Refinements • Performance evalua7on • Experience • Conclusions 6
  • 7. API Bigtable API • Tablet server AddServer(tabletServer) / RemoveServer(tabletServer)! • Table CreateTable(table) / DeleteTable(table)! • Column family CreateColumnFamily(columnFamily) / DeleteColumnFamily(columnFamily)! • Table access control rights and metadata SetTableFlag(table, flags) / . . .! • Colum family access control rights and metadata SetColumnFamilyFlag(table, colfamily, flags) / . . .! • Cell value Put(rowkey, columnkey, value) / Get(rowkey, columnkey) / Delete(rowkey, columnkey)! • Look up value from individual row Has(rowkey, columnfamily) / . . . • Look up values from table (=MapReduce like RPC) Scan(rowFilter, columnFilter, timestampFilter)! – Can iterate over mul7ple column families – Can limi7ng rows/colums/7mestamps • Single-­‐row transac3ons (atomic read-­‐modify-­‐write sequence) • No support for general transac3ons across row keys • Cells can be used as integer counters Increment(rowkey, columnkey, increment) • Execu3on of read-­‐only client-­‐supplied scripts in the address spaces of the servers: Sawzall – h#p://research.google.com/archive/sawzall.html • Bigtable can be used with MapReduce (for input and/or output) 7
  • 8. Agenda • Introduc7on • Data model • API • Building blocks • Implementa7on • Refinements • Performance evalua7on • Experience • Conclusions 8
  • 9. Building blocks Google File System ( GFS ) • Bigtable uses the fault-­‐tolerant and scalable distributed GFS file system to store log and data files – metadata = METADATA tablets (store tablets loca7on) – data = SSTables collec7on by tablet – log = Tablet logs Google SSTable file format ( Sorted String Table ) • Used to store table data in GFS – Persistent, ordered immutable map from keys to values METADATA tablet 43-­‐51 9 Google METADATA tablets DATA tablets METADATA tablet 1-­‐42 = SSTable files (64 KB blocks) Log files Commit log Tablet 1 (128 MB) METADATA root tablet File System Block 0 Block 1 Block 2 Block 3 Block index Tablet 1 Commit log Tablet 2 Commit log Tablet 3
  • 10. Building blocks Google SSTable file format ( Sorted String Table ) • Contains a sequence of 64 KB Blocks (size configurable) • Op7onally, Blocks can be completely mapped into memory (= lookups and scans without touching disk) • Block index stored at the end of the file – Used to locate blocks – Index loaded in memory when the SSTable is opened – Lookup with a single seek 1. Find the appropriate block by performing a binary search in the in-­‐memory index 2. Reading the appropriate block from disk 10 Block 0 Block 1 Block 2 Block 3 Block 0 Block index Tablet 1 rowkey 1 col key 7mestamp value col key 7mestamp value rowkey 2 col key 7mestamp value col key 7mestamp value rowkey 3 col key 7mestamp value First row rowkey 1 Last row rowkey 3 File offset Block 1 First row rowkey 4 Last row rowkey 9 File offset Block 2 First row rowkey 10 Last row rowkey 14 File offset Block 3 First row rowkey 15 Last row rowkey 18 File offset
  • 11. Building blocks Google Chubby • Chubby service consists of 5 ac3ve replicas with one master to serve requests (PAXOS consensus) • Chubby provides a namespace that contains directories and small files (<256 KB) – Each directory or file can be used as a lock – Reads and writes to a file are atomic – Chubby client library provides consistent caching of Chubby files – Each Chubby client maintains a session with a Chubby service – Client’s session expires if is unable to renew its session lease within the lease expira7on 7me – When a client’s session expires, it loses any locks and open handles – Chubby clients can also register callbacks on Chubby files and directories for no7fica7on of changes or session expira7on (server push) • Bigtable uses Chubby for a variety of tasks – To ensure there is at most one ac7ve master at any 7me – To store the bootstrap loca7on of Bigtable data (Root tablet) – To discover tablet servers and finalize tablet server deaths – To store Bigtable schema informa7on (column family informa7on for each table) – To store access control lists (ACL) – Chubby unavailable = Bigtable unavailable – 14 Bigtable clusters spanning 11 Chubby instances: average % server hours unavailability = 0,0047% 11 Applica3on Chubby lib Applica3on Chubby lib Chubby cell Master RPCs replica replica replica replica
  • 12. Agenda • Introduc7on • Data model • API • Building blocks • Implementa3on • Refinements • Performance evalua7on • Experience • Conclusions 12
  • 13. Chubby lib Chubby lib Chubby lib Tablet server Chubby lib Tablet server Implementa7on Tablet server Major components • One master server • Many tablet servers • A library linked into every client Master – Assigning tablets to tablet servers – Detec7ng the addi7on and expira7on of tablet servers – Balancing tablet server load – Garbage collec7ng of files in GFS – Handling schema changes (table crea7on, column family crea7on/dele7on Tablet server – manages a set of tablets – Handles read and write request to the tablets – Splits tablets that have grown too large (100-­‐200 MB) Client – Do not rely on the master for tablet loca7on informa7on – Communicates directly with tablet servers for reads and writes Chubby lib 13 Chubby cell replica replica Master replica replica Chubby lib METADATA root tablet METADATA tablet 1-­‐42 METADATA tablet 43-­‐51 Block 0 Block 1 BigT lib Block 2 Block index Tablet 1 Commit log Tablet 1 RPCs Master Tablet server read/write Client App RPCs METADATA scan, Tablet assignment, Monitoring garbage collect GFS
  • 14. Tablet Loca3on “DNS like” Tablet server loca3on ! ;-­‐) • 3 level hierarchy to store tablet loca7on informa7on (B+ tree) 14 For the table named “Users” and given row key “Romain Jacotin” i know who manages Tablet 1 with a row range that could possibly contain this row key Chubby cell replica ( and i know what are the GFS files used for Tablet 1 ) I know who manages the Root tablet replica Master replica replica METADATA root tablet METADATA tablet 1-­‐42 METADATA tablet 43-­‐51 Who manages the Tablet for table “Users” and row “Romain Jacotin” ??? • Empty Block 0 Block 1 Block 2 Block index Tablet 1 Commit log Tablet 1 GFS . . . Block 0 Block 1 Block index Tablet 42 Commit log Tablet 42 . . . B+ tree ? Client App METADATA Tablet server METADATA Tablet server DATA Tablet server I know who manages each METADATA tablets based on “table name +end row” entry Read write Read write Read write cache = 3 network round-­‐trips: 1. Request to Chubby for Root tablet loca7on 2. Request to Root for METADATA tablet loca7on 3. Request to MEDATA tablet for Tablet loca7on • Stale cache: worst case = 6 network round-­‐trips stale cache entries only discovered upon misses 1 2 3 In fact METADATA tablets are “normal” Tablets (SST+Log) with specifics access control rights and Column Family schema ! Root tablet is never split !
  • 15. Implementa7on Tablet Assignment • Each tablet is assigned to one tablet server at a 7me • Master keeps tracks of – the set of live tablet servers (tracking via Chubby) – the current assignment of tablet to tablet servers – the current unassigned tablets • When a tablet is unassigned, the master assigns the tablet to an available tablet server by sending a tablet load request to that tablet server 15
  • 16. Implementa7on Tablet Server discovery • When a tablet server starts, it creates, and acquires an exclusive lock on a uniquely-­‐named file in a specific Chubby directory (“servers” directory) • Master monitors this directory to discover tablet servers • A tablet server stops serving its tablets if it loses its exclusive Chubby lock • If the Chubby file no longer exists, then the tablet server will never be able to serve again, so it kills itself 16
  • 17. Implementa7on Tablet Server monitoring • Master is responsible for detec7ng when a tablet server is no longer serving its tablets, and for reassigning those tablets • Master periodically asks each tablet server for the status of its lock to detect when a tablet server is no longer serving its tablets • If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last a#empts, the master a#empts to acquire acquire the lock for the Chubby file – If the master is able to acquire the lock then Chubby is live and the tablet server is dead or isolated, the master deletes its server file to ensure that the tablet server can never serve again – Then master can assigned all the tablets that were previously assigned to this tablet server into the set of unassigned tablets 17
• 18. Implementation
Master isolation
• To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and Chubby, the master kills itself if its Chubby session expires (master failures do not change the assignment of tablets to tablet servers)
• 19. Implementation
Master startup
• When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them:
  1. The master grabs a unique master lock in Chubby to prevent concurrent master instantiations
  2. The master scans the servers directory in Chubby to find the live tablet servers
  3. The master communicates with every live tablet server to discover which tablets are already assigned to each server
  4. The master adds the root tablet to the set of unassigned tablets if no assignment for the root tablet was discovered in step 3
  5. The master scans the METADATA table to learn the full set of tablets (and detect unassigned tablets)
(A minimal sketch of these steps follows.)
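The function below sketches the five startup steps in Python. It is a hedged sketch: `chubby`, `tablet_servers`, and `metadata_table` are placeholder objects with assumed methods (`acquire_exclusive`, `list_dir`, `list_tablets`, `scan_all_tablets`), not real Bigtable interfaces.

```python
def master_startup(chubby, tablet_servers, metadata_table):
    # 1. Grab a unique master lock in Chubby to prevent concurrent master instantiations.
    if not chubby.acquire_exclusive("/bigtable/master-lock", "master"):
        raise SystemExit("another master already holds the lock")

    # 2. Scan the servers directory in Chubby to find the live tablet servers.
    live = [tablet_servers[path] for path in chubby.list_dir("/bigtable/servers/")]

    # 3. Ask every live tablet server which tablets it already serves.
    assigned = set()
    for server in live:
        assigned.update(server.list_tablets())

    # 4. Add the root tablet to the unassigned set if nobody reported serving it.
    unassigned = set() if "root" in assigned else {"root"}

    # 5. Scan METADATA for the full set of tablets; anything unreported is unassigned.
    for tablet in metadata_table.scan_all_tablets():
        if tablet not in assigned:
            unassigned.add(tablet)
    return unassigned
```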
• 20. Implementation
Tablets: merging / splitting
• The set of existing tablets only changes when a table is created or deleted, when two existing tablets are merged to form one larger tablet, or when an existing tablet is split into two smaller ones
• The master initiates tablet merges
• A tablet server initiates tablet splits
  – It commits the split by recording the information for the new tablet in the METADATA table
  – After the split has committed, the tablet server notifies the master
• 21. Implementation
Tablet Serving
• Write operation
  1. The server checks that the request is well-formed
  2. The server checks that the sender is authorized to write (list of permitted writers in a Chubby file)
  3. A valid mutation is written to the commit log, which stores redo records (group commit is used to improve throughput)
  4. After the mutation has been committed, its contents are inserted into the memtable (an in-memory sorted buffer)
(Diagram: the tablet server appends to the commit log in GFS, then updates the memtable; the existing SSTable files are untouched)
(A minimal sketch of this write path follows.)
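A minimal Python sketch of steps 3 and 4 of the write path, assuming a line-oriented JSON log format and a dict-based memtable; both are illustrative assumptions, not Bigtable's actual formats (and the real memtable is kept sorted by key).

```python
import json
import time

class Tablet:
    def __init__(self, log_path):
        self.log = open(log_path, "a")     # commit log holding redo records
        self.memtable = {}                 # in-memory write buffer

    def write(self, row_key, column_key, value):
        # Steps 1-2 (well-formedness and authorization checks) are omitted here.
        # 3. Append a redo record to the commit log; a real server would group-commit.
        record = {"ts": time.time(), "row": row_key, "col": column_key, "val": value}
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        # 4. Once the mutation is committed, insert its contents into the memtable.
        self.memtable[(row_key, column_key)] = (record["ts"], value)
```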
• 22. Implementation
Tablet Serving
• Read operation
  1. The server checks that the request is well-formed
  2. The server checks that the sender is authorized to read (list of permitted readers in a Chubby file)
  3. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable
(Diagram: the read consults the memtable and the SSTable files in GFS as one merged view)
(A minimal sketch of this merged read follows.)
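The sketch below shows the merged view of step 3, newest data first: the memtable wins over SSTables, and newer SSTables win over older ones. It assumes the dict-based memtable and dict-like SSTable indices used in the surrounding sketches.

```python
def read(memtable, sstables, row_key, column_key):
    """Return the value for (row_key, column_key) from the merged view, or None."""
    key = (row_key, column_key)
    if key in memtable:                    # freshest data lives in the memtable
        return memtable[key]
    for sstable in sstables:               # sstables assumed ordered newest-to-oldest
        if key in sstable:
            return sstable[key]
    return None
```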
• 23. Implementation
Tablet Serving
• Tablet recovery
  1. The tablet server reads the tablet's metadata from the METADATA table (the list of SSTables that comprise the tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet)
  2. The tablet server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points
(A minimal sketch of the redo replay follows.)
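A minimal Python sketch of step 2's redo replay, using the hypothetical JSON log format from the write sketch above; the redo offset is assumed to point at a record boundary.

```python
import json

def recover_memtable(log_path, redo_offset=0):
    """Rebuild the memtable by replaying commit-log records after the redo point."""
    memtable = {}
    with open(log_path, "rb") as log:
        log.seek(redo_offset)              # redo point: skip records already captured in SSTables
        for line in log:
            rec = json.loads(line)
            memtable[(rec["row"], rec["col"])] = (rec["ts"], rec["val"])
    return memtable
```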
• 24. Implementation
Compactions
• Minor compaction
  – When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted into a new SSTable and written to GFS
  – Two goals: shrink the memory usage of the tablet server, and reduce the amount of data that has to be read from the commit log during a recovery
(Diagram: frozen memtable = new SSTable written to GFS; the new memtable starts empty)
(A minimal sketch of a minor compaction follows.)
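The sketch below freezes the memtable, writes it out as a new sorted, immutable SSTable, and returns a fresh memtable. The line-oriented JSON on-disk format is an illustrative assumption, not the real SSTable format.

```python
import json

def minor_compaction(memtable, sstable_path):
    """Convert the frozen memtable into a new SSTable file and start a new memtable."""
    frozen, new_memtable = memtable, {}            # freeze the current memtable
    with open(sstable_path, "w") as out:
        for key in sorted(frozen):                 # SSTables store keys in sorted order
            ts, value = frozen[key]
            out.write(json.dumps({"key": list(key), "ts": ts, "val": value}) + "\n")
    # Commit-log records before the new redo point no longer need replaying on recovery.
    return new_memtable
```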
• 25. Implementation
Compactions
• Merging compaction
  – Problem: every minor compaction creates a new SSTable (→ an arbitrary number of SSTables!)
  – Solution: periodically merge a few SSTables and the memtable into a single new SSTable
(Diagram: SSTable file 2 + SSTable file 3 + memtable = new SSTable file 4)
• 26. Implementation
Compactions
• Major compaction
  – A merging compaction that rewrites all SSTables into exactly one SSTable containing no deletion information or deleted data
  – Bigtable cycles through all of its tablets and regularly applies major compactions to them (to reclaim resources used by deleted data in a timely fashion)
(Diagram: SSTable file 1 + SSTable file 4 + SSTable file 5 = new SSTable file 6)
(A minimal sketch of a major compaction follows.)
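A minimal Python sketch of a major compaction over the dict-like SSTables used in the earlier sketches: all sources are merged, newer values shadow older ones, and tombstones are dropped so no deletion information survives. The `TOMBSTONE` marker is an illustrative assumption.

```python
TOMBSTONE = object()   # hypothetical deletion marker stored as a value

def major_compaction(sstables, memtable):
    """Merge all SSTables and the memtable into one table with no deleted data."""
    merged = {}
    # Apply oldest sources first so newer values and tombstones overwrite older ones
    # (sstables is assumed ordered newest-to-oldest, as in the read sketch).
    for source in list(reversed(sstables)) + [memtable]:
        for key, (ts, value) in source.items():
            merged[key] = (ts, value)
    # Drop deleted entries entirely: the single resulting SSTable carries no tombstones.
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}
```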
• 27. Agenda • Introduction • Data model • API • Building blocks • Implementation • Refinements • Performance evaluation • Experience • Conclusions
• 28. Refinements
Locality groups
• Clients can group multiple column families together into a locality group → more efficient reads ☺
  – A separate SSTable is generated for each locality group in each tablet
In-memory
• A locality group can be declared to be in-memory → no need for disk access ☺
Compression
• Clients can control whether or not the SSTables for a locality group are compressed
  – The compression format is applied to each SSTable block (64 KB)
  – Two-pass compression: the first pass uses Bentley and McIlroy's scheme, the second pass uses a fast compression that looks for repetitions in a 16 KB window of the data (encode rate = 100-200 MB/s, decode rate = 400-1000 MB/s)
• 29. Refinements
Caching for read performance
• Tablet servers use two levels of caching
  – The Scan Cache: a higher-level cache for the key/value pairs returned by the SSTable interface to the tablet server code
  – The Block Cache: a lower-level cache for SSTable blocks read from GFS
(Diagram: the tablet server code first checks the Scan Cache for a key's value; on a miss, the SSTable interface checks the Block Cache for the 64 KB block, fetching it from GFS if needed, then locates the key via the block index)
(A minimal sketch of the two cache levels follows.)
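The sketch below shows how a lookup falls through the Scan Cache to the Block Cache to GFS. `read_block_from_gfs` and `find_in_block` are hypothetical stand-ins for fetching a 64 KB block and searching within it.

```python
def read_block_from_gfs(sstable, block_id):
    # Stand-in for fetching a 64 KB SSTable block from GFS.
    return {f"k{block_id}": f"v{block_id}"}

def find_in_block(block, key):
    # Stand-in for searching a decoded SSTable block via its index.
    return block.get(key)

scan_cache = {}     # (sstable, key) -> value: helps repeated reads of the same data
block_cache = {}    # (sstable, block_id) -> block: helps reads near recently read data

def cached_get(sstable, key, block_id):
    if (sstable, key) in scan_cache:
        return scan_cache[(sstable, key)]
    if (sstable, block_id) not in block_cache:
        block_cache[(sstable, block_id)] = read_block_from_gfs(sstable, block_id)
    value = find_in_block(block_cache[(sstable, block_id)], key)
    scan_cache[(sstable, key)] = value
    return value
```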
• 30. Refinements
Bloom filters
• Problem: a read operation has to read from all SSTables that make up the state of a tablet → lots of disk accesses ☹
• Solution: use a Bloom filter for each SSTable in a particular locality group
  – A Bloom filter uses a small amount of memory and lets the server check whether an SSTable might contain a specified row/column pair
  – Most lookups for non-existent rows or columns do not need to touch disk ☺
(A minimal Bloom filter sketch follows.)
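A minimal, self-contained Python Bloom filter: a small in-memory bit array per SSTable that can answer "definitely not present" for a row/column pair without a disk read. The sizes and hash scheme are illustrative, not what Bigtable actually uses.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, row_key, column_key):
        for pos in self._positions((row_key, column_key)):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, row_key, column_key):
        # False => the SSTable definitely lacks this pair, so the disk read is skipped.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions((row_key, column_key)))
```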
• 31. Refinements
Commit-log implementation
• Problem: if we kept the commit log for each tablet in a separate log file, a very large number of files would be written concurrently in GFS = a large number of disk seeks to write to the different physical log files ☹
• Solution: append mutations to a single commit log per tablet server, co-mingling mutations for different tablets in the same physical log file
  – Using one log provides significant performance benefits during normal operation ☺
  – Using one log complicates recovery: the commit log is sorted in a distributed way before the tablets are reassigned, so that each tablet's mutations become contiguous
(Diagram: one tablet server applies mutations for Tablet 42 and Tablet 1789 to a single commit log replicated three times in GFS, instead of one replicated log per tablet)
(A minimal sketch of the shared log and its recovery sort follows.)
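The sketch below appends mutations for any tablet to one shared log, then sorts the entries for recovery so each tablet's mutations can be read sequentially. The JSON record format and the (table, row, sequence number) sort key are illustrative assumptions.

```python
import json

def append_mutation(log_file, table, row_key, seq, value):
    """Co-mingle mutations for different tablets in one physical log file."""
    log_file.write(json.dumps({"table": table, "row": row_key,
                               "seq": seq, "val": value}) + "\n")

def sort_log_for_recovery(log_path):
    """Sort log entries so each recovering tablet's mutations are contiguous."""
    with open(log_path) as f:
        entries = [json.loads(line) for line in f]
    return sorted(entries, key=lambda e: (e["table"], e["row"], e["seq"]))
```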
• 32. Refinements
Speeding up tablet recovery
1. If the master moves a tablet from one tablet server to another, the source tablet server first does a minor compaction on that tablet
2. After this compaction, the source tablet server stops serving the tablet
3. Before unloading the tablet, the source tablet server does another (very fast) minor compaction to eliminate any remaining uncompacted state in the tablet server's log that arrived while the first minor compaction was being performed
4. The tablet can now be loaded on another tablet server
• 33. Refinements
Exploiting immutability
• Various parts of the Bigtable system have been simplified by the fact that all of the SSTables generated are immutable
  – No need for synchronization when reading from SSTables → easy concurrency control over rows
  – The only mutable data structure accessed by both reads and writes is the memtable → each memtable row uses copy-on-write, allowing reads and writes to proceed in parallel
• Garbage collection of obsolete SSTables
  – The master removes obsolete SSTables (registered in the METADATA table) with a mark-and-sweep garbage collection over the set of SSTables
• Immutability of SSTables allows tablets to be split quickly
  – Child tablets share the SSTables of the parent tablet
• 34. Agenda • Introduction • Data model • API • Building blocks • Implementation • Refinements • Performance evaluation • Experience • Conclusions
• 35. Performance evaluation (2006)
Bigtable cluster
• 500 tablet servers
  – Configured to use 1 GB of RAM
  – Dual-core Opteron 2 GHz, Gigabit Ethernet NIC
  – Writing to a GFS cell of 1,786 machines with 2 x 400 GB IDE drives
• 500 clients
• Network round-trip time between any pair of machines < 1 millisecond
(Diagram: 500 clients, 500 tablet servers, and 1,786 GFS chunkservers)
• 36. Performance evaluation (2006)
Benchmarks
• Sequential writes: used R row keys partitioned and assigned to N clients; wrote a single unique random string per row key (uncompressible)
• Random writes: similar to sequential writes, except the row key is hashed modulo R
• Sequential reads: used R row keys partitioned and assigned to N clients
• Random reads: similar to sequential reads, except the row key is hashed modulo R
• Random reads (memory): similar to the random reads benchmark, except the locality group that contains the data is marked as in-memory
• Scans: similar to random reads, but uses the support provided by the Bigtable API for scanning over all values in a row range (reduces the number of RPCs)
• 37. Performance evaluation (2006)
• Reads, writes, and scans
(Figures: number of 1000-byte values read/written per second for each benchmark, shown per tablet server and as an aggregate rate)
• 38. Agenda • Introduction • Data model • API • Building blocks • Implementation • Refinements • Performance evaluation • Experience • Conclusions
• 39. Experience (2006)
Real applications
• Google Analytics
• Google Earth
• Personalized Search
• …
(Figures: tables in production, with table sizes measured before compression, and the distribution of the number of tablet servers across Bigtable clusters)
• 40. Experience (2006)
Lessons
• Large distributed systems are vulnerable to many types of failures
  – memory and network corruption
  – large clock skew
  – hung machines
  – extended and asymmetric network partitions
  – bugs
  – overflow
  – planned and unplanned hardware maintenance, …
• Importance of proper system-level monitoring
  – lock contention detection on tablet data structures
  – slow writes to GFS while committing Bigtable mutations
  – stuck accesses to METADATA when METADATA tablets are unavailable
  – tracking down all clusters via Chubby
• The value is in simple designs!!! ☺
• 41. Agenda • Introduction • Data model • API • Building blocks • Implementation • Refinements • Performance evaluation • Experience • Conclusions
• 42. Conclusions (2006)
Bigtable
• 7 person-years of design and implementation, in production since April 2005
• More than 16 projects are using Bigtable (August 2006)
• Users like
  – performance and high availability
  – scaling capacity by simply adding more machines
• SQL users are sometimes uncertain of how to best use the Bigtable interface …
• New features to come (2006)
  – Secondary indices
  – Cross-data-center replicated Bigtables with multiple master replicas
• 43. Any questions?
Romain Jacotin
romain.jacotin@orange.fr