Data Storage for Extreme Use Cases: The Lay of the Land and a Peek at ODC
Ben Stopford : RBS
How fast is a HashMap lookup?
That's how long it takes light to travel a room
How fast is a database lookup?
That's how long it takes light to go to Australia and back
Computers really are very fast!
The problem is we're quite good at writing software that slows them down
Question: Is it fair to compare the performance of a Database with a HashMap?
Of course not…
Mechanical Sympathy
[Latency chart, from ps to ms: L1 cache ref (~0.7 ns, about two clock cycles, the time it takes light to travel 20cm), L2 cache ref, main memory ref, 1MB from main memory, 1MB from disk/Ethernet, RDMA over InfiniBand, Ethernet ping, cross-continental round trip.]
Key Point #1: Simple computer programs, operating in a single address space, are extremely fast.
Why are there so many types of database these days? …because we need different architectures for different jobs
Times are changing
Traditional Database Architecture is Aging
The Traditional Architecture
[Diagram: the spectrum of architectures: Traditional, Shared Disk, Shared Nothing, In Memory, Distributed In Memory, moving towards a simpler contract.]
Key Point #2: Different architectural decisions about how we store and access data are needed in different environments. Our 'Context' has changed.
Simplifying the Contract
How big is the internet? 5 exabytes (which is 5,000 petabytes or 5,000,000 terabytes)
How big is an average enterprise database? 80% < 1TB (in 2009)
The context of our problem has changed
Simplifying the Contract
Databases have huge operational overheads (taken from "OLTP Through the Looking Glass, and What We Found There", Harizopoulos et al.)
Avoid that overhead with a simpler contract and by avoiding IO
Key Point #3: For the very top-end data volumes a simpler contract is mandatory. ACID is simply not possible.
Key Point #3 (addendum): But we should always retain ACID properties if our use case allows it.
Options for scaling out the traditional architecture
#1: The Shared Disk Architecture
#2: The Shared Nothing Architecture
Each machine is responsible for a subset of the records. Each record exists on only one machine.
[Diagram: a client routing requests to nodes that each hold a disjoint range of keys: 1, 2, 3…; 97, 98, 99…; 169, 170…; 244, 245…; 333, 334…; 765, 769…]
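The shared-nothing idea can be sketched in a few lines of Java. Everything here (the node names, the simple modulo scheme) is illustrative, not from the talk; a production grid would use a richer partitioning strategy, such as consistent hashing, so that adding a node moves fewer records:

```java
import java.util.List;

// Minimal sketch of shared-nothing routing: each record key maps to
// exactly one node, so every record lives on exactly one machine.
public class ShardRouter {
    private final List<String> nodes;

    public ShardRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // Deterministic key -> node mapping. Every client computes the same
    // answer, so no central directory is needed.
    public String nodeFor(long recordKey) {
        int idx = (int) Math.floorMod(recordKey, nodes.size());
        return nodes.get(idx);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("node-a", "node-b", "node-c"));
        // Keys spread across the nodes; each key has exactly one home.
        System.out.println(router.nodeFor(97)); // node-b
        System.out.println(router.nodeFor(98)); // node-c
    }
}
```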
#3: The In Memory Database (single address-space)
Databases must cache subsets of the data in memory
Not knowing what you don't know: 90% in cache, data on disk
If you can fit it ALL in memory you know everything!!
The architecture of an in memory database
Memory is at least 100x faster than disk
[Latency chart, from ps to ms: L1 cache ref (~0.7 ns, about two clock cycles, the time it takes light to travel 20cm), L2 cache ref, main memory ref, 1MB from main memory, 1MB from disk/network, cross-network round trip, cross-continental round trip.]
Random vs. Sequential Access
This makes them very fast!!
The proof is in the stats: TPC-H benchmarks on a 1TB data set
So why haven't in-memory databases taken off?
Address-spaces are relatively small and of a finite, fixed size
Durability
One solution is distribution
Distributed In Memory (Shared Nothing)
Again we spread our data, but this time using only RAM.
[Diagram: a client routing requests to nodes that each hold a disjoint range of keys in memory.]
Distribution solves our two         problems
We get massive amounts of   parallel processing
But at the cost of losing the single address space
[Diagram: the spectrum of architectures: Traditional, Shared Disk, Shared Nothing, In Memory, Distributed In Memory, moving towards a simpler contract.]
Key Point #4: There are three key forces. Distribution: gain scalability through a distributed architecture. No disk: all data is held in RAM. Simplify the contract: improve scalability by picking appropriate ACID properties.
These three non-functional themes lie behind the design of ODC, RBS's in-memory data warehouse
ODC
ODC represents a balance between throughput and latency
What is Latency?
What is Throughput?
Which is best for latency? [Diagram: Shared Nothing (Distributed) In-Memory Database vs. Traditional Database.]
Which is best for throughput? [Diagram: Shared Nothing (Distributed) In-Memory Database vs. Traditional Database.]
So why do we use distributed in-memory? In memory gives us latency; plentiful hardware gives us throughput.
ODC: Distributed, Shared Nothing, In Memory, Semi-Normalised, Realtime Graph DB. 450 processes. 2TB of RAM. Messaging (topic based) as a system of record (persistence).
The Layers: Access Layer (Java client APIs), Query Layer, Data Layer (Transactions, MTMs, Cashflows), Persistence Layer
Three Tools of Distributed Data Architecture: Indexing, Partitioning, Replication
How should we use these tools?
Replication puts data everywhere. But your storage is limited by the memory on a node.
Partitioning scales: scalable storage, bandwidth and processing. But associating data in different partitions implies moving it.
So we have some data. Our data is bound together in a model. [Diagram: the domain model linking Desk, Name, Trader, Sub-Party and Trade.]
Which we save… [Diagram: Trader, Party and Trade entities saved across the grid.]
Binding them back together involves a "distributed join" => lots of network hops
The hops have to be spread over time. [Diagram: network hops plotted against time.]
Lots of network hops make it slow
OK, what if we held it all together?? "Denormalised"
Hence denormalisation is FAST! (for reads)
Denormalisation implies the duplication of some sub-entities
…and that means managing consistency over lots of copies
…and all the duplication means you run out of space really quickly
Space issues are exacerbated further when data is versioned. [Diagram: versions 1 to 4 of the Trader/Party/Trade graph.] …and you need versioning to do MVCC
And reconstituting a previous time slice becomes very difficult. [Diagram: assembling a time slice from entity versions scattered across the graph.]
So we want to hold entities separately (normalised) to alleviate concerns around consistency and space usage
Remember this means the object graph will be split across multiple machines, and the data is independently versioned. [Diagram: a singleton Trader and Party referenced by Trades on different machines.]
Binding them back together involves a "distributed join" => lots of network hops
Whereas in the denormalised model the join is already done
So what we want is the advantages of a normalised store at the speed of a denormalised one! This is what using Snowflake Schemas and the Connected Replication pattern is all about!
Looking more closely: why does normalisation mean we have to spread data around the cluster? Why can't we hold it all together?
It's all about the keys
We can collocate data with common keys, but if the keys crosscut, the only way to collocate is to replicate. [Diagram: crosscutting keys vs. common keys.]
We tackle this problem with a hybrid model: Trader and Party are replicated; Trade is partitioned.
We adapt the concept of a Snowflake Schema.
Taking the concept of Facts and Dimensions
Everything starts from a Core Fact (Trades for us)
Facts are big, dimensions are small
Facts have one key that relates them all (used to partition)
Dimensions have many keys (which crosscut the partitioning key)
Looking at the data: Facts => big, with common keys. Dimensions => small, with crosscutting keys.
We remember we are a grid. We should avoid the distributed join.
…so we only want to 'join' data that is in the same process. Use a key assignment policy (e.g. KeyAssociation in Coherence) so that Trades and MTMs share a common key.
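As a sketch of what a key assignment policy looks like, the hypothetical MtmKey below exposes its parent Trade's id as an "associated key", so any partitioning scheme that hashes the associated key rather than the full key lands an MTM in the same partition (and process) as its Trade. Coherence's KeyAssociation works on this principle; the class and method names here are invented:

```java
import java.util.Objects;

// Sketch of key affinity: the MTM's key carries its parent Trade's id,
// and the grid partitions on that id, not on the whole key.
public class MtmKey {
    final long mtmId;
    final long tradeId;   // the common key used to partition

    MtmKey(long mtmId, long tradeId) {
        this.mtmId = mtmId;
        this.tradeId = tradeId;
    }

    // The grid hashes this value when choosing a partition.
    public Object associatedKey() {
        return tradeId;
    }

    // Stand-in for the grid's key -> partition function.
    static int partitionOf(Object key, int partitionCount) {
        return Math.floorMod(Objects.hashCode(key), partitionCount);
    }

    public static void main(String[] args) {
        MtmKey a = new MtmKey(101, 42);
        MtmKey b = new MtmKey(102, 42);
        // Both MTMs of trade 42 land in the same partition, so the
        // Trade/MTM join never crosses a process boundary.
        System.out.println(partitionOf(a.associatedKey(), 13)
                == partitionOf(b.associatedKey(), 13)); // true
    }
}
```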
So we prescribe different physical storage for Facts and Dimensions: Trader and Party are replicated; Trade is partitioned.
Facts are partitioned, dimensions are replicated. [Diagram: the Query Layer holds Trader, Party and Trade; the Data Layer holds Transactions, MTMs and Cashflows in partitioned fact storage.]
Facts are partitioned, dimensions are replicated. [Diagram: Dimensions (replicated); Facts (distributed/partitioned): Transactions, MTMs, Cashflows.]
The data volumes back this up as a sensible hypothesis: Facts => big => distribute. Dimensions => small => replicate.
Key Point: We use a variant on a Snowflake Schema to partition big entities that can be related via a partitioning key, and replicate small stuff whose keys can't map to our partitioning key.
Replicate / Distribute
So how do they help us to run queries without distributed joins? Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = 'CC1'
What would this look like without this pattern? Get Cost Centers, Get Ledger Books, Get Source Books, Get Transactions, Get MTMs, Get Legs, Get Cost Centers: hop after hop, spread over the network and over time.
But by balancing Replication and Partitioning we don't need all those hops.
Stage 1: Focus on the where clause: Where Cost Centre = 'CC1'
Stage 1: Get the right keys to query the Facts. Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = 'CC1'. Join the Dimensions in the Query Layer; Transactions, MTMs and Cashflows stay partitioned.
Stage 2: Cluster join to get the Facts: join the Facts across the cluster.
Stage 2: Join the facts together efficiently, as we know they are collocated
Stage 3: Augment the raw Facts with the relevant Dimensions.
Stage 3: Bind the relevant dimensions to the result
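The three stages can be sketched with plain in-process maps standing in for the replicated dimension caches and the partitioned fact store. All the data and names here are invented for illustration:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the staged query: dimensions are replicated (in-process),
// facts are partitioned and fetched by their partitioning key.
public class StagedQuery {
    // Replicated dimension: cost centre -> trade ids, present on every query node.
    static final Map<String, List<Long>> tradeIdsByCostCentre =
            Map.of("CC1", List.of(42L, 43L));
    // Partitioned facts, keyed by the common partitioning key (trade id).
    static final Map<Long, String> transactionsByTradeId =
            Map.of(42L, "txn-42", 43L, "txn-43", 99L, "txn-99");
    // Another replicated dimension used to augment the result.
    static final Map<String, String> counterpartyByCostCentre =
            Map.of("CC1", "ACME");

    static List<String> query(String costCentre) {
        // Stage 1: resolve the where-clause against replicated
        // dimensions to get partitioning keys. No network hops.
        List<Long> keys = tradeIdsByCostCentre.getOrDefault(costCentre, List.of());
        // Stage 2: fetch the collocated facts by partitioning key.
        List<String> facts = keys.stream()
                .map(transactionsByTradeId::get)
                .collect(Collectors.toList());
        // Stage 3: bind dimension data to each fact, again in-process.
        String dim = counterpartyByCostCentre.get(costCentre);
        return facts.stream().map(f -> f + "/" + dim).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(query("CC1")); // [txn-42/ACME, txn-43/ACME]
    }
}
```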
Bringing it together: replicated Dimensions, partitioned Facts, Java client APIs. We never have to do a distributed join!
So all the big stuff is held partitioned, and we can join without shipping keys around and having intermediate results
We get to do this… [Diagram: the normalised Trader/Party/Trade graph.]
…and this… [Diagram: versions 1 to 4 of the graph.]
…and this… [Diagram: reconstituting a time slice.]
…without the problems of this…
…or this…
…all at the speed of this… well, almost!
But there is a fly in the ointment…
I lied earlier. These aren't all Facts. This one is a dimension: it has a different key to the Facts, and it's BIG.
We can't replicate really big stuff… we'll run out of space => Big Dimensions are a problem.
Fortunately there is a simple solution!
Whilst there are lots of these big dimensions, a large majority are never used. They are not all "connected".
If there are no Trades for Goldmans in the data store, then a Trade query will never need the Goldmans Counterparty
Looking at the Dimension data, some are quite large
But Connected Dimension Data is tiny by comparison
One recent independent study from the database community showed that 80% of data remains unused
So we only replicate 'Connected' or 'Used' dimensions
As data is written to the data store we keep our 'Connected Caches' up to date. [Diagram: as new Facts are added, the relevant Dimensions they reference are moved to the replicated dimension caches in the processing layer; fact storage remains partitioned.]
The Replicated Layer is updated by recursing through the arcs on the domain model when facts change
Saving a Trade causes all its 1st-level references to be triggered. [Diagram: a saved Trade triggers its Party Alias, Source Book and Ccy into the connected caches.]
This updates the connected caches
The process recurses through the object graph: the Party Alias pulls in its Party, which pulls in its Ledger Book.
'Connected Replication': a simple pattern which recurses through the foreign keys in the domain model, ensuring only 'Connected' dimensions are replicated
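A minimal sketch of that recursion, assuming the domain model's foreign-key arcs are available as a graph. The entity names are taken from the slides; the code itself is illustrative, not the ODC implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of Connected Replication: when a fact is saved, recurse through
// its foreign keys and collect every reachable dimension; only those
// "connected" dimensions are pushed to the replicated caches.
public class ConnectedReplication {
    // Foreign-key arcs of the domain model: entity -> referenced entities.
    static final Map<String, List<String>> refs = Map.of(
            "Trade", List.of("Party", "SourceBook", "Ccy"),
            "Party", List.of("PartyAlias", "LedgerBook"),
            "SourceBook", List.of(),
            "Ccy", List.of(),
            "PartyAlias", List.of(),
            "LedgerBook", List.of());

    static Set<String> connectedDimensions(String fact) {
        Set<String> connected = new LinkedHashSet<>();
        // Saving the fact triggers its 1st-level references first.
        Deque<String> toVisit = new ArrayDeque<>(refs.getOrDefault(fact, List.of()));
        while (!toVisit.isEmpty()) {
            String dim = toVisit.pop();
            if (connected.add(dim)) {            // visit each dimension once
                toVisit.addAll(refs.getOrDefault(dim, List.of()));
            }
        }
        return connected;                        // these get replicated
    }

    public static void main(String[] args) {
        System.out.println(connectedDimensions("Trade"));
    }
}
```

Dimensions unreachable from any fact (a counterparty with no trades, say) never enter the set, which is why only connected data is replicated.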
With 'Connected Replication' only 1/10th of the data needs to be replicated (on average).
Limitations of this approach
Conclusion
[A build-up of diagrams recapping the architecture, including Partitioned Storage.]
The End
  • I started a project back in 2004. It was a trading system back at Barcap. When it came to persisting our data there were three choices: Oracle, Sybase or SQL Server. A lot has changed in that time. Today we are far more likely to look at one of a variety of technologies to satisfy our need to store and re-retrieve our data. So how many of you use a traditional database? What about a distributed database like Oracle RAC? NoSQL? Do you use it with a database or stand-alone? What about an in-memory database, in production? Finally, what about distributed in-memory? This talk is about an in-memory database. It's not really a distributed cache, despite being implemented in Coherence, although you could call it one if you preferred. In truth it has a variety of elements that make it closer to what you might perceive to be a database. It is normalised: that is to say that it holds entities independently from one another and versions them as such. It has some basic guarantees of atomicity when writing certain groups of objects that are collocated. Most importantly, it is both fast and scalable regardless of the join criteria you impose on it, this being something fairly elusive in the world of distributed data storage. I have a few aims for today: I hope you will leave with a broader view of what stores are available to you and what is coming in the future. I hope you'll see the benefits that niche storage solutions can provide through simpler contracts between client and data store. I'd like you to understand the benefits of memory over disk.
  • A better example is Amazon: partition by user so orders and basket are held together; products will be shared by multiple users.
  • Big data sets are held distributed and only joined on the grid to collocated objects. Small data sets are held in replicated caches so they can be joined in process (only 'active' data is held).