The Return of Big Iron?
Ben Stopford
Distinguished Engineer
RBS Markets
Much diversity
What does this mean?
• A change in what customers (we) value
• The mainstream is not serving customers
(us) sufficiently
The Database field has problems
We Lose: Joe Hellerstein (Berkeley) 2001
“Databases are commoditised and cornered
to slow-moving, evolving, structure-intensive
applications that require schema
evolution.” …
“The internet companies are lost and we will
remain in the doldrums of the enterprise
space.” …
“As databases are black boxes which
require a lot of coaxing to get maximum
performance”
His question was how to win
them back?
These new technologies also
caused frustration
Backlash (2009)
Not novel (dates back to the 80’s)
Physical level not the logical level (messy?)
Incompatible with tooling
Lack of integrity (referential) & ACID
MR is brute force: it ignores indexing and suffers from skew
All points are reasonable
And they proved it too!
“A Comparison of Approaches to Large-Scale
Data Analysis” – SIGMOD 2009
• Vertica vs. DBMS-X vs.
Hadoop
• Vertica up to 7x faster than
Hadoop across the benchmarks
Databases faster
than Hadoop
But possibly missed the point?
Databases were traditionally
designed to keep data safe
NoSQL grew from a need to scale
It’s more than just scale; they
facilitate different practices
A Better Fit
They better match the way software is
engineered today.
– Iterative development
– Fast feedback
– Frequent releases
Is NoSQL a Disruptive Technology?
Christensen’s observation:
Market leaders are displaced when markets
shift in ways that the incumbent leaders are
not prepared for.
Aside: MongoDB
• Impressive trajectory
• Slightly crappy product (from a traditional
database standpoint)
• Most closely related to relational DB (of
the NoSQLs)
• Plays to the agile mindset
Yet the NoSQL market is relatively
small
• Currently worth around $600 million, but projected
to grow strongly
• The database and systems management
market is worth around $34 billion
Key Point

There is more to NoSQL than just
scale; it sits better with the way we
build software today
We have new building blocks to
play with!
My Problem
• Sprawling application space, built over
many years, grouped into both vertical and
horizontal silos
• Duplication of effort
• Data corruption & preventative measures
• Consolidation is costly, time consuming
and technically challenging.
Traditional solutions
(in chronological order)
– Messaging
– SOA
– Enterprise Data Warehouse
– Data virtualisation
Bringing data, applications, people
together is hard
A popular choice is an EDW
EDW pattern is workable, but tough
– As soon as you take a ‘view’ on what the
shape of the data is, it becomes harder to
change.
• Leave ‘taking a view’ to the last responsible
moment

– Multifaceted: Shape, diversity of source,
diversity of population, temporal change
Harder to do iteratively
Is this the only way?
The Google Approach
MapReduce
Google Filesystem
BigTable
Tenzing
Megastore
F1
Dremel
Spanner
And just one code base!
So no enterprise schema secret
society!
The eBay Approach
The Partial-Schematic
Approach
Often termed CLOBs & Cracking
Problems with solidifying a
schematic representation
• Risk of throwing information away, keeping
only what you think you need.
– OK if you create data
– Bad if you got data from elsewhere

• Data tends to be poly-structured in
programs and on the wire
• Early-binding slows down development
But schemas are good
• They guarantee a contract
• That contract spans the whole dataset
– Similar to static typing in programming
languages.
Compromise positions
• Query schema can be a subset of data
schema.
• Use schemaless databases to capture
diversity early and evolve it as you build.
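A minimal Python sketch of that compromise, with illustrative field names (trade_id, book, notional are assumptions, not from the talk): documents are stored exactly as they arrive (schemaless), and only a small query schema, a subset of the fields actually queried on, is declared and enforced.

# Minimal sketch: store documents as-is, enforce only a small "query schema"
# that is a subset of whatever fields the documents happen to carry.

QUERY_SCHEMA = {"trade_id": str, "book": str}   # the subset we query on

store = {}                                      # schemaless key-value store

def save(doc):
    # Only the queryable subset is validated; everything else is kept untouched.
    for field, typ in QUERY_SCHEMA.items():
        if not isinstance(doc.get(field), typ):
            raise ValueError(f"{field} missing or not a {typ.__name__}")
    store[doc["trade_id"]] = doc                # the raw document is what we keep

def find_by_book(book):
    return [d for d in store.values() if d.get("book") == book]

save({"trade_id": "T1", "book": "rates", "notional": 10_000_000,
      "extra_leg": {"ccy": "GBP"}})             # unmodelled fields are preserved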
Common solutions today use
multiple technologies
[Diagram: MapReduce · Data Warehouse · Key-Value Store · In-Memory / OLTP Database · ?]
We use a late-bound schema,
sitting over a schemaless store
[Diagram: a Structured Standardisation Layer sits over Raw Data, exposed through a Late-Bound Schema]
Evolutionary Approach
• Late-binding makes consolidation
incremental
– Schematic representation delivered at the ‘last
responsible moment’ (schema on demand)
– A trade in this model has 4 mandatory nodes. A
fully modeled trade has around 800.

• The system of record is raw data, not our
‘view’ of it
• No schema migration! But this comes at a
price.
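A minimal Python sketch of schema on demand: the raw document remains the system of record, and the schematic view is a projection applied at read time that can grow later without any migration. The four "mandatory" field names used here are illustrative assumptions.

# Sketch of a late-bound ("on demand") schema over a schemaless store.
# The raw document stays the system of record; the projection below is only
# a view and can grow over time without any data migration.

raw_store = []                                  # raw, schemaless system of record

def record(doc):
    raw_store.append(doc)                       # never reshape incoming data

MANDATORY = ("trade_id", "counterparty", "product", "trade_date")

def project(doc, extra_fields=()):
    """Late-bound view: the mandatory nodes now, more fields whenever needed."""
    view = {f: doc[f] for f in MANDATORY}
    view.update({f: doc[f] for f in extra_fields if f in doc})
    return view

record({"trade_id": "T1", "counterparty": "ACME", "product": "IRS",
        "trade_date": "2013-06-01", "notional": 5_000_000, "leg_count": 2})

print(project(raw_store[0]))                              # minimal initial view
print(project(raw_store[0], extra_fields=("notional",)))  # the view evolves, the data does not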
Scaling
Key-based access always scales
But queries (without the sharding key)
always broadcast
As query complexity increases, so does
the overhead
Coarse-grained shards
Data replicas provide hardware isolation
Scaling
• Key-based sharding is only sufficient for very
simple workloads
• Coarse-grained shards help (but suffer
from skew)
• Replication provides useful, if expensive,
hardware isolation
• Workload management is less useful in
my experience
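A minimal Python sketch of the asymmetry described above: an access by the sharding key is routed to a single shard, while a query without the key must be broadcast to every shard and the results gathered. The hash-based shard assignment and the field names (trade_id, book) are assumptions for illustration.

# Key-based access touches one shard; keyless queries must scatter-gather.

N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]

def shard_for(key):
    return shards[hash(key) % N_SHARDS]

def put(key, doc):
    shard_for(key)[key] = doc                   # routed: one shard touched

def get(key):
    return shard_for(key).get(key)              # routed: one shard touched

def query(predicate):
    # No sharding key available: broadcast to every shard, gather the results.
    results = []
    for shard in shards:                        # cost grows with the shard count
        results.extend(d for d in shard.values() if predicate(d))
    return results

put("T1", {"trade_id": "T1", "book": "rates"})
put("T2", {"trade_id": "T2", "book": "credit"})
print(get("T1"))
print(query(lambda d: d["book"] == "rates"))    # broadcast query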
Weak consistency forces the
problem onto the developer
Particularly bad for banks!
Scaling two-phase commit is hard to
do efficiently
• Requires distributed lock/clock/counter
• Requires synchronisation of all readers &
writers
Alternatives to traditional 2PC
• MVCC over explicit locking
• Timestamp based strong consistency
– E.g. Granola

• Optimistic concurrency control
– Leverage short running transactions (avoid
cross-network transactions)
– Tolerate different temporal viewpoints to
reduce synchronization costs.
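As a concrete illustration of the last alternative listed, optimistic concurrency control, here is a minimal single-process Python sketch: no lock is held while the caller works, and a write commits only if the version it originally read is still current. This is not the Granola protocol; the account record and version scheme are illustrative assumptions.

# Optimistic concurrency control: work without locks, commit only if the
# version read is still the current one, otherwise retry.

class VersionConflict(Exception):
    pass

table = {"acct:1": (1, {"balance": 100})}       # key -> (version, value)

def read(key):
    version, value = table[key]
    return version, dict(value)

def write(key, expected_version, new_value):
    current_version, _ = table[key]
    if current_version != expected_version:     # someone else won the race
        raise VersionConflict(key)
    table[key] = (current_version + 1, new_value)

v, acct = read("acct:1")
acct["balance"] -= 30                           # do the work with no lock held
write("acct:1", v, acct)                        # commit iff nothing changed underneath

try:
    write("acct:1", v, {"balance": 0})          # stale version: rejected, caller retries
except VersionConflict:
    pass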
Immutable Data
• Safety
• ‘As was’ view
• Sits well with MVCC
• Efficiency problems
• Gaining popularity (e.g. Datomic)
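A small Python sketch of the append-only idea, assuming a simple (timestamp, value) history per key: updates never overwrite, so the ‘as was’ view is just a filter on time, and both the MVCC fit and the storage cost follow directly. Entity names and timestamps are illustrative.

# Immutable, append-only data: updates add versions, never overwrite.

history = {}                                    # key -> list of (ts, value), append-only

def put(key, ts, value):
    history.setdefault(key, []).append((ts, value))   # append, never mutate

def as_of(key, ts):
    """Return the value as it was at time ts (the 'as was' view)."""
    value = None
    for version_ts, version_value in history.get(key, []):
        if version_ts <= ts:
            value = version_value               # latest version up to ts wins
        else:
            break
    return value

put("trade:T1", 1, {"status": "NEW"})
put("trade:T1", 5, {"status": "AMENDED"})
print(as_of("trade:T1", 3))                     # {'status': 'NEW'}
print(as_of("trade:T1", 9))                     # {'status': 'AMENDED'}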
Use joins to avoid ‘over aggregating’

Joins are ok, so long as they are
– Local
– via a unique key

[Diagram: Trade joined locally to Trader and Party via unique keys]
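A minimal Python sketch of the kind of join the slide allows: trades resolved to their party through a unique key, with both datasets assumed to be co-located on the same shard so nothing crosses the network. Field names are illustrative assumptions.

# An "acceptable" join: local (both sides on the same shard) and via a unique key.

parties = {                                     # unique key -> party (co-located)
    "P1": {"party_id": "P1", "name": "ACME Corp"},
}
trades = [
    {"trade_id": "T1", "party_id": "P1", "notional": 1_000_000},
    {"trade_id": "T2", "party_id": "P1", "notional": 2_500_000},
]

def trades_with_party():
    # Lookup join on the unique key; no scatter-gather, no cross-shard hops.
    for trade in trades:
        yield {**trade, "party": parties[trade["party_id"]]}

for row in trades_with_party():
    print(row["trade_id"], row["party"]["name"])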
Memory/Disk Tradeoff
• Memory only (possibly overplayed)
• Pinned indexes (generally a good idea if you
can afford the RAM)
• Disk resident (best general purpose
solution and for very large datasets)
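A small sketch of the pinned-index option, assuming a simple file of one JSON record per line: the records stay on disk, while a key-to-offset index is held in RAM so a lookup costs a single seek. The file layout and field names are assumptions for illustration.

# Pinned index: data on disk, key -> byte-offset index kept in memory.

import json

DATA_FILE = "records.jsonl"

def build(records):
    index = {}
    with open(DATA_FILE, "w") as f:
        for rec in records:
            index[rec["id"]] = f.tell()         # pin key -> offset in memory
            f.write(json.dumps(rec) + "\n")
    return index

def lookup(index, key):
    with open(DATA_FILE) as f:
        f.seek(index[key])                      # one seek, then read a single record
        return json.loads(f.readline())

idx = build([{"id": "T1", "notional": 1_000_000},
             {"id": "T2", "notional": 2_500_000}])
print(lookup(idx, "T2"))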
Balance flexibility and complexity
[Diagram: Raw Data beneath a Standardisation layer, exposed as Object/SQL for Operational access (real time / MR) and as Relational Analytics]
Supple at the front, more rigid at the back

[Diagram: access styles from looser to tighter coupling:
• Raw Access: untyped, broad data coverage, narrow query capability
• Operational Access: Object/SQL
• Analytic Access: reporting, narrow data coverage, comprehensive query capability]
Principles
• Record everything
• Grow a schema, don’t do it upfront
• Avoid using a ‘view’ as your system of record.
• Differentiate between sourced data (out of
your control) and generated data (in your
control).
• Use automated replication (for isolation) as
well as sharding (for scale)
• Leverage asynchronicity to reduce
transaction overheads
Consolidation
means more trust,
fewer impedance
mismatches and
managing tighter
couplings
Target architectures are starting to
look more like large applications
built from cloud-enabled services
than heterogeneous application
conglomerates
Are we going back to the mainframe?
Thanks

http://www.benstopford.com

Editor's Notes

• (Much diversity) Think about the systems you built five or ten years ago. Who was involved in the building of a new system in the early 2000s? Who used a relational DB? Who seriously considered using anything else?
• (His question was how to win them back?) Retrospective
• (All points are reasonable) No schema or high-level languages
• (And just one code base!) Companies that grew up around technology.