The Return of Big Iron?
Ben Stopford
Distinguished Engineer
RBS Markets
Much diversity
What does this mean?
• A change in what customers (we) value
• The mainstream is not serving customers
(us) sufficiently
The Database field has problems
We Lose: Joe Hellerstein (Berkeley) 2001
“Databases are commoditised and cornered
to slow-moving, evolving, structure-intensive
applications that require schema
evolution.” …
“The internet companies are lost and we will
remain in the doldrums of the enterprise
space.” …
“As databases are black boxes which
require a lot of coaxing to get maximum
performance”
His question was how to win
them back?
These new technologies also
caused frustration
Backlash (2009)
Not novel (dates back to the 80’s)
Physical level not the logical level (messy?)
Incompatible with tooling
Lack of integrity (referential) & ACID
MR is brute force: it ignores indexing and suffers from skew
All points are reasonable
And they proved it too!
“A Comparison of Approaches to Large-Scale
Data Analysis” – SIGMOD 2009
• Vertica vs. DBMS-X vs.
Hadoop
• Vertica up to 7x faster than
Hadoop across the benchmarks
Databases faster
than Hadoop
But possibly missed the point?
Databases were traditionally
designed to keep data safe
NoSQL grew from a need to scale
It’s more than just scale; they
facilitate different practices
A Better Fit
They better match the way software is
engineered today.
– Iterative development
– Fast feedback
– Frequent releases
Is NoSQL a Disruptive Technology?
Christensen’s observation:
Market leaders are displaced when markets
shift in ways that the incumbent leaders are
not prepared for.
Aside: MongoDB
• Impressive trajectory
• Slightly crappy product (from a traditional
database standpoint)
• Most closely related to relational DB (of
the NoSQLs)
• Plays to the agile mindset
Yet the NoSQL market is relatively
small
• Currently worth around $600 million, but projected
to grow strongly
• The database and systems management
market is worth around $34 billion
Key Point

There is more to NoSQL than just
scale; it sits better with the way we
build software today
We have new building blocks to
play with!
My Problem
• Sprawling application space, built over
many years, grouped into both vertical and
horizontal silos
• Duplication of effort
• Data corruption & preventative measures
• Consolidation is costly, time consuming
and technically challenging.
Traditional solutions
(in chronological order)
– Messaging
– SOA
– Enterprise Data Warehouse
– Data virtualisation
Bringing data, applications, people
together is hard
A popular choice is an EDW
EDW pattern is workable, but tough
– As soon as you take a ‘view’ on what the
shape of the data is, it becomes harder to
change.
• Leave ‘taking a view’ to the last responsible
moment

– Multifaceted: Shape, diversity of source,
diversity of population, temporal change
Harder to do iteratively
Is this the only way?
The Google Approach
MapReduce
Google Filesystem
BigTable
Tenzing
Megastore
F1
Dremel
Spanner
And just one code base!
So no enterprise schema secret
society!
The eBay Approach
The Partial-Schematic
Approach
Often termed CLOBs & Cracking
Problems with solidifying a
schematic representation
• Risk of throwing information away, keeping
only what you think you need.
– OK if you create data
– Bad if you got data from elsewhere

• Data tends to be poly-structured in
programs and on the wire
• Early-binding slows down development
But schemas are good
• They guarantee a contract
• That contract spans the whole dataset
– Similar to static typing in programming
languages.
Compromise positions
• Query schema can be a subset of data
schema.
• Use schemaless databases to capture
diversity early and evolve it as you build.
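A minimal Python sketch of that compromise, with illustrative field names (trade_id, book, notional are assumptions, not from the talk): documents are stored exactly as they arrive (schemaless), and only a small query schema, a subset of the fields actually queried on, is declared and enforced.

# Minimal sketch: store documents as-is, enforce only a small "query schema"
# that is a subset of whatever fields the documents happen to carry.

QUERY_SCHEMA = {"trade_id": str, "book": str}   # the subset we query on

store = {}                                      # schemaless key-value store

def save(doc):
    # Only the queryable subset is validated; everything else is kept untouched.
    for field, typ in QUERY_SCHEMA.items():
        if not isinstance(doc.get(field), typ):
            raise ValueError(f"{field} missing or not a {typ.__name__}")
    store[doc["trade_id"]] = doc                # the raw document is what we keep

def find_by_book(book):
    return [d for d in store.values() if d.get("book") == book]

save({"trade_id": "T1", "book": "rates", "notional": 10_000_000,
      "extra_leg": {"ccy": "GBP"}})             # unmodelled fields are preserved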
Common solutions today use
multiple technologies
[Diagram: MapReduce · Data Warehouse · Key-Value Store · In-Memory / OLTP Database · ?]
We use a late-bound schema,
sitting over a schemaless store
[Diagram: a Structured Standardisation Layer sits over Raw Data, exposed through a Late-Bound Schema]
Evolutionary Approach
• Late-binding makes consolidation
incremental
– Schematic representation delivered at the ‘last
responsible moment’ (schema on demand)
– A trade in this model has 4 mandatory nodes. A
fully modeled trade has around 800.

• The system of record is raw data, not our
‘view’ of it
• No schema migration! But this comes at a
price.
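A minimal Python sketch of schema on demand: the raw document remains the system of record, and the schematic view is a projection applied at read time that can grow later without any migration. The four "mandatory" field names used here are illustrative assumptions.

# Sketch of a late-bound ("on demand") schema over a schemaless store.
# The raw document stays the system of record; the projection below is only
# a view and can grow over time without any data migration.

raw_store = []                                  # raw, schemaless system of record

def record(doc):
    raw_store.append(doc)                       # never reshape incoming data

MANDATORY = ("trade_id", "counterparty", "product", "trade_date")

def project(doc, extra_fields=()):
    """Late-bound view: the mandatory nodes now, more fields whenever needed."""
    view = {f: doc[f] for f in MANDATORY}
    view.update({f: doc[f] for f in extra_fields if f in doc})
    return view

record({"trade_id": "T1", "counterparty": "ACME", "product": "IRS",
        "trade_date": "2013-06-01", "notional": 5_000_000, "leg_count": 2})

print(project(raw_store[0]))                              # minimal initial view
print(project(raw_store[0], extra_fields=("notional",)))  # the view evolves, the data does not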
Scaling
Key-based access always scales
But queries (without the sharding key)
always broadcast
As query complexity increases, so does
the overhead
Coarse-grained shards
Data replicas provide hardware isolation
Scaling
• Key-based sharding is only sufficient for very
simple workloads
• Coarse-grained shards help (but suffer
from skew)
• Replication provides useful, if expensive,
hardware isolation
• Workload management is less useful in
my experience
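A minimal Python sketch of the asymmetry described above: an access by the sharding key is routed to a single shard, while a query without the key must be broadcast to every shard and the results gathered. The hash-based shard assignment and the field names (trade_id, book) are assumptions for illustration.

# Key-based access touches one shard; keyless queries must scatter-gather.

N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]

def shard_for(key):
    return shards[hash(key) % N_SHARDS]

def put(key, doc):
    shard_for(key)[key] = doc                   # routed: one shard touched

def get(key):
    return shard_for(key).get(key)              # routed: one shard touched

def query(predicate):
    # No sharding key available: broadcast to every shard, gather the results.
    results = []
    for shard in shards:                        # cost grows with the shard count
        results.extend(d for d in shard.values() if predicate(d))
    return results

put("T1", {"trade_id": "T1", "book": "rates"})
put("T2", {"trade_id": "T2", "book": "credit"})
print(get("T1"))
print(query(lambda d: d["book"] == "rates"))    # broadcast query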
Weak consistency forces the
problem onto the developer
Particularly bad for banks!
Scaling two-phase commit is hard to
do efficiently
• Requires distributed lock/clock/counter
• Requires synchronisation of all readers &
writers
Alternatives to traditional 2PC
• MVCC over explicit locking
• Timestamp based strong consistency
– E.g. Granola

• Optimistic concurrency control
– Leverage short running transactions (avoid
cross-network transactions)
– Tolerate different temporal viewpoints to
reduce synchronization costs.
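As a concrete illustration of the last alternative listed, optimistic concurrency control, here is a minimal single-process Python sketch: no lock is held while the caller works, and a write commits only if the version it originally read is still current. This is not the Granola protocol; the account record and version scheme are illustrative assumptions.

# Optimistic concurrency control: work without locks, commit only if the
# version read is still the current one, otherwise retry.

class VersionConflict(Exception):
    pass

table = {"acct:1": (1, {"balance": 100})}       # key -> (version, value)

def read(key):
    version, value = table[key]
    return version, dict(value)

def write(key, expected_version, new_value):
    current_version, _ = table[key]
    if current_version != expected_version:     # someone else won the race
        raise VersionConflict(key)
    table[key] = (current_version + 1, new_value)

v, acct = read("acct:1")
acct["balance"] -= 30                           # do the work with no lock held
write("acct:1", v, acct)                        # commit iff nothing changed underneath

try:
    write("acct:1", v, {"balance": 0})          # stale version: rejected, caller retries
except VersionConflict:
    pass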
Immutable Data
• Safety
• ‘As was’ view
• Sits well with MVCC
• Efficiency problems
• Gaining popularity (e.g. Datomic)
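A small Python sketch of the append-only idea, assuming a simple (timestamp, value) history per key: updates never overwrite, so the ‘as was’ view is just a filter on time, and both the MVCC fit and the storage cost follow directly. Entity names and timestamps are illustrative.

# Immutable, append-only data: updates add versions, never overwrite.

history = {}                                    # key -> list of (ts, value), append-only

def put(key, ts, value):
    history.setdefault(key, []).append((ts, value))   # append, never mutate

def as_of(key, ts):
    """Return the value as it was at time ts (the 'as was' view)."""
    value = None
    for version_ts, version_value in history.get(key, []):
        if version_ts <= ts:
            value = version_value               # latest version up to ts wins
        else:
            break
    return value

put("trade:T1", 1, {"status": "NEW"})
put("trade:T1", 5, {"status": "AMENDED"})
print(as_of("trade:T1", 3))                     # {'status': 'NEW'}
print(as_of("trade:T1", 9))                     # {'status': 'AMENDED'}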
Use joins to avoid ‘over aggregating’

Joins are ok, so long as they are
– Local
– via a unique key

[Diagram: Trade joined locally to Trader and Party via unique keys]
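A minimal Python sketch of the kind of join the slide allows: trades resolved to their party through a unique key, with both datasets assumed to be co-located on the same shard so nothing crosses the network. Field names are illustrative assumptions.

# An "acceptable" join: local (both sides on the same shard) and via a unique key.

parties = {                                     # unique key -> party (co-located)
    "P1": {"party_id": "P1", "name": "ACME Corp"},
}
trades = [
    {"trade_id": "T1", "party_id": "P1", "notional": 1_000_000},
    {"trade_id": "T2", "party_id": "P1", "notional": 2_500_000},
]

def trades_with_party():
    # Lookup join on the unique key; no scatter-gather, no cross-shard hops.
    for trade in trades:
        yield {**trade, "party": parties[trade["party_id"]]}

for row in trades_with_party():
    print(row["trade_id"], row["party"]["name"])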
Memory/Disk Tradeoff
• Memory only (possibly overplayed)
• Pinned indexes (generally a good idea if you
can afford the RAM)
• Disk resident (best general purpose
solution and for very large datasets)
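A small sketch of the pinned-index option, assuming a simple file of one JSON record per line: the records stay on disk, while a key-to-offset index is held in RAM so a lookup costs a single seek. The file layout and field names are assumptions for illustration.

# Pinned index: data on disk, key -> byte-offset index kept in memory.

import json

DATA_FILE = "records.jsonl"

def build(records):
    index = {}
    with open(DATA_FILE, "w") as f:
        for rec in records:
            index[rec["id"]] = f.tell()         # pin key -> offset in memory
            f.write(json.dumps(rec) + "\n")
    return index

def lookup(index, key):
    with open(DATA_FILE) as f:
        f.seek(index[key])                      # one seek, then read a single record
        return json.loads(f.readline())

idx = build([{"id": "T1", "notional": 1_000_000},
             {"id": "T2", "notional": 2_500_000}])
print(lookup(idx, "T2"))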
Balance flexibility and complexity
[Diagram: Raw Data beneath a Standardisation layer, exposed as Object/SQL for Operational access (real time / MR) and as Relational Analytics]
Supple at the front, more rigid at the back

[Diagram: access styles from looser to tighter coupling:
• Raw Access: untyped, broad data coverage, narrow query capability
• Operational Access: Object/SQL
• Analytic Access: reporting, narrow data coverage, comprehensive query capability]
Principles
• Record everything
• Grow a schema, don’t do it upfront
• Avoid using a ‘view’ as your system of record.
• Differentiate between sourced data (out of
your control) and generated data (in your
control).
• Use automated replication (for isolation) as
well as sharding (for scale)
• Leverage asynchronicity to reduce
transaction overheads
Consolidation
means more trust,
fewer impedance
mismatches and
managing tighter
couplings
Target architectures are starting to
look more like large applications
built from cloud-enabled services
than heterogeneous application
conglomerates
Are we going back to the mainframe?
Thanks

http://www.benstopford.com

Editor's Notes

• (Much diversity) Think about the systems you built five or ten years ago. Who was involved in the building of a new system in the early 2000s? Who used a relational DB? Who seriously considered using anything else?
• (His question was how to win them back?) Retrospective
• (All points are reasonable) No schema or high-level languages
• (And just one code base!) Companies that grew up around technology.