Reinventing the Database
Max Schireson
President, 10gen
My background

At Oracle from 1994 to 2003

At MarkLogic from 2003 to Feb 2011

Joined 10gen Feb 2011
The world has changed

                        1970                      2011
Main memory             Intel 1103, 1k bits       4GB of RAM costs $25.99
Mass storage            IBM 3330 Model 1,         3TB SuperSpeed USB
                        100 MB                    for $129
Microprocessor          Nearly – 4004 being       Westmere EX has 10
                        developed; 4 bits and     cores, 30MB L3 cache,
                        92,000 instructions per   runs at 2.4GHz
                        second
Motor Trend Car of      Ford Torino               Chevy Volt
the Year
President               Richard Nixon             Barack Obama
Ted Codd                In his 40s                Dead
Me                      In diapers                In my 40s
More recent changes

                        A decade ago            Now
Faster                  Buy a bigger server     Buy more servers
Faster storage          A SAN with more         SSD
                        spindles
More reliable storage   More expensive SAN      More copies of local
                                                storage
Deployed in             Your data center        The cloud – private or
                                                public
Large user base         Thousands - employees   Millions - consumers
Tracking                Business transactions   Every click and more
Assumptions behind today’s DBMS
Relational data model
Third normal form
ACID
SQL
Multi-statement transactions
Database is hardware agnostic
RAM is small and disks are slow
If it’s too slow you can buy a faster computer
Yesterday’s assumptions in today’s world

Scaleout is hard
  Distributed joins are hard
  Making two-phase commits fast is hard

Custom solutions proliferate

Too slow? Just add a cache

ORM tools everywhere

More computers and disk are nearly free but SAN
and faster computers are expensive
Challenging some assumptions
Do you need a database at all

How does it scale out

What type of queries does it need to be able to do

How should it model data

How do you query it

How does it handle transactions and consistency

Is it enterprise software, open source, an appliance, or a cloud service

Does the data fit in memory?

What if your disks are SSD?
My opinions

Different use cases will produce different answers

Existing RDBMS solutions will continue to solve a
broad set of problems well but many applications
will work better on top of alternative technologies

Many new technologies will find niches but only
one or two will become mainstream
Do you need a database at all
Can you better solve your problem with a batch
processing framework

Can you better solve your problem with an
in-memory object store/cache
How does it scale out

Scale-out for working set size

Scale-out for total data size

Scale-out for write volume

Scale-out for read volume

Scale-out for redundancy

How do you incrementally add nodes or change configuration

How do you trade off query performance (which wants fewer
index segments) for elasticity (which wants more index
segments)
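
The segment trade-off above can be made concrete with a small sketch. It assumes a hypothetical scheme, not taken from the talk or from any particular product: keys hash to a fixed number of segments, and whole segments are assigned to nodes. The names `segment_for`, `add_node`, and `segments_moved` are illustrative only.

```python
import hashlib
from collections import Counter


def segment_for(key: str, num_segments: int) -> int:
    """Map a key to a segment with a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_segments


def add_node(assignment: dict[int, str], new_node: str) -> dict[int, str]:
    """Return a new segment-to-node map in which new_node takes a fair share.

    Only whole segments can move. Assumes at least one existing node.
    """
    result = dict(assignment)
    node_count = len(set(assignment.values())) + 1
    target = len(assignment) // node_count  # segments the new node should own
    for _ in range(target):
        load = Counter(result.values())
        # Take one segment from the currently most loaded existing node.
        donor = max((n for n in load if n != new_node), key=lambda n: load[n])
        segment = next(s for s, n in sorted(result.items()) if n == donor)
        result[segment] = new_node
    return result


def segments_moved(before: dict[int, str], after: dict[int, str]) -> int:
    """Count segments whose owning node changed."""
    return sum(1 for s in before if before[s] != after[s])


# Twelve segments on two nodes: a third node can take exactly one third.
many = {s: ["n1", "n2"][s % 2] for s in range(12)}
rebalanced = add_node(many, "n3")

# One segment per node: there is nothing small enough to hand over.
few = {0: "n1", 1: "n2"}
unchanged = add_node(few, "n3")
```

With many small segments the cluster rebalances by moving a third of the data; with one large segment per node the new node stays idle until a segment is split. The cost of the finer layout is that a query which cannot be routed by key has to consult more segments.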
What type of queries does it need to be able to do

Is a key/value store enough

Will you be retrieving your data by one key or by
many

Is there a primary way you’ll be viewing your data

Do you need specialized queries (e.g., time series,
geospatial)
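
The difference between retrieving by one key and by many can be sketched with a toy in-memory store. `TinyStore` and its methods are hypothetical names invented for this illustration; the point is only that a lookup on anything other than the primary key needs either a secondary index or a full scan.

```python
from collections import defaultdict


class TinyStore:
    """Toy store: primary-key lookups plus optional secondary indexes."""

    def __init__(self, indexed_fields=()):
        self._rows = {}
        self._indexes = {field: defaultdict(set) for field in indexed_fields}

    def put(self, key, record):
        # Remove stale index entries if the key is being overwritten.
        old = self._rows.get(key)
        if old is not None:
            for field, index in self._indexes.items():
                if field in old:
                    index[old[field]].discard(key)
        self._rows[key] = record
        for field, index in self._indexes.items():
            if field in record:
                index[record[field]].add(key)

    def get(self, key):
        """Key/value access: one dictionary lookup."""
        return self._rows.get(key)

    def find(self, field, value):
        """Lookup by another field: use an index if present, else scan."""
        if field in self._indexes:
            keys = self._indexes[field].get(value, set())
            return [self._rows[k] for k in sorted(keys)]
        return [r for _, r in sorted(self._rows.items()) if r.get(field) == value]


store = TinyStore(indexed_fields=("city",))
store.put("u1", {"name": "Ann", "city": "Oslo"})
store.put("u2", {"name": "Bob", "city": "Oslo"})
store.put("u3", {"name": "Cy", "city": "Rome"})

by_key = store.get("u2")                 # primary-key lookup
by_city = store.find("city", "Oslo")     # served by the secondary index
by_name = store.find("name", "Cy")       # no index: scans every record
```

If every access looks like `get`, a key/value store may be enough. Each additional access path adds either index maintenance on every write or a scan on every read.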
Imagine a garage…
You hand your valet the keys to your car

Before they park your car, they completely disassemble it

The pistons are stored in piston storage, brake pads with brake pads, steering
wheels with steering wheels

Over time, they have storage areas for catalytic converters, DVD-based nav
systems, headlight washers, and traction control systems

When you ask for your car back, the valet is incredibly fast at reassembly

One minor issue: you have to provide the disassembly and reassembly instructions
and they will be followed literally, even if you say the spare tire should be used as
a steering wheel and forgot to specify re-insertion of spark plugs



A technological marvel

Might be a good way to store your car if you don’t know whether you’ll be asking
for a car back or lots of brake pads or pistons – for a salvage yard?
How should it model data

Relational
  Row oriented or column oriented

Key value

Document oriented

Graph oriented
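
To make the contrast between the relational and document-oriented options concrete, here is a minimal sketch in the spirit of the garage analogy. The car record, its field names, and the `disassemble` and `reassemble` helpers are all hypothetical, chosen only for illustration.

```python
# Document style: the whole car is stored and fetched as one unit.
car_document = {
    "vin": "VIN123",
    "model": "Torino",
    "pistons": [{"position": 1}, {"position": 2}],
    "brake_pads": [{"wheel": "front-left"}, {"wheel": "front-right"}],
}


def disassemble(doc):
    """Relational style: split the car into one list of rows per 'table'."""
    return {
        "cars": [{"vin": doc["vin"], "model": doc["model"]}],
        "pistons": [dict(p, vin=doc["vin"]) for p in doc["pistons"]],
        "brake_pads": [dict(b, vin=doc["vin"]) for b in doc["brake_pads"]],
    }


def reassemble(tables, vin):
    """Rebuild the document by joining every parts table on the car's vin."""
    doc = dict(next(c for c in tables["cars"] if c["vin"] == vin))
    for part in ("pistons", "brake_pads"):
        doc[part] = [
            {k: v for k, v in row.items() if k != "vin"}
            for row in tables[part]
            if row["vin"] == vin
        ]
    return doc


tables = disassemble(car_document)
rebuilt = reassemble(tables, "VIN123")
```

The row form makes questions such as "list all brake pads" easy, because every pad sits in one place. The document form makes "give me my car" a single fetch with no reassembly instructions to get wrong.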
How do you query it

Do you want an API, a language, or a map-reduce
style interface?

Will most of your queries be hand-typed, embedded
in code or dynamically generated
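
As a rough illustration of what a map-reduce style interface feels like next to a direct API call, the sketch below counts clicks per page both ways. The `map_phase` and `reduce_phase` helpers and the click records are hypothetical and run in a single process; a real framework would distribute both phases.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply mapper to each record and group the emitted values by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's values down to a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


clicks = [
    {"page": "/home", "user": "a"},
    {"page": "/pricing", "user": "b"},
    {"page": "/home", "user": "c"},
]


def emit_page(click):
    # Emit one (page, 1) pair per click.
    yield click["page"], 1


# Map-reduce style: the caller supplies two small functions.
counts = reduce_phase(map_phase(clicks, emit_page), lambda x, y: x + y)

# API style: the caller writes the filter and aggregate directly.
home_clicks = sum(1 for c in clicks if c["page"] == "/home")
```

The map-reduce form is more verbose for a simple count, but the same two functions can be applied to data spread over many machines, which is one reason the choice of interface interacts with the scale-out questions.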
How do you handle
transactions and consistency
Do you need transactions at all
  Be careful; web services, for example, need to be able to
  assign userIDs

Do you need multi-master updates
  If so, how do you resolve conflicts

Do you need immediate consistency?
  For some queries or all?

How do you handle failures
  Are you optimizing for read availability or write
  availability
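
One possible answer to the conflict-resolution question is last-write-wins, sketched below. This is a swapped-in illustration, not a strategy the slides prescribe. It assumes each write carries a timestamp from reasonably synchronized clocks plus a node id used as a tie-breaker; `Versioned`, `resolve`, and `merge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumed to come from reasonably synchronized clocks
    node_id: str      # tie-breaker when two timestamps are equal


def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the later write, break ties by node id."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


def merge(replica_a: dict, replica_b: dict) -> dict:
    """Merge two replicas key by key using resolve()."""
    merged = dict(replica_a)
    for key, version in replica_b.items():
        merged[key] = resolve(merged[key], version) if key in merged else version
    return merged


master_a = {
    "email": Versioned("ann@old.example", 1.0, "A"),
    "name": Versioned("Ann", 5.0, "A"),
}
master_b = {
    "email": Versioned("ann@new.example", 2.0, "B"),
    "name": Versioned("Anne", 5.0, "B"),
    "city": Versioned("Oslo", 1.0, "B"),
}

merged = merge(master_a, master_b)
```

Because `resolve` picks the same winner regardless of argument order, both masters converge on the same state. The price is that the losing write is silently discarded, which is acceptable for a profile field and not for something like assigning unique userIDs.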
What is it

Enterprise software
Open source
  With commercial support?

Appliance
  Packaged with commodity hardware
  Specialized hardware

Cloud service
  Available for on-premise deployment?
  Integrated in another PaaS offering?
  Where on the net?
Does the data fit in memory
Transactions can be very, very fast

Do you trust enough copies in memory (perhaps
across multiple data centers) or do you require
some sort of sync to persistent storage

How big will the data be and how much do you
care about costs
What if your disks are SSD

Alleviate hotspots

Random accesses are measured in microseconds not
milliseconds

Degradation from in-memory to on-disk can be
more graceful
  But data representations on disk vs in memory may be
  very different, which may create significant overhead
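
The microseconds-versus-milliseconds point is easy to quantify. The latencies below are assumed, order-of-magnitude figures chosen for illustration, not measurements from the talk, and the helper names are hypothetical.

```python
# Assumed order-of-magnitude latencies per random access (illustrative only).
DISK_SEEK_SECONDS = 10e-3    # about 10 milliseconds on a spinning disk
SSD_ACCESS_SECONDS = 100e-6  # about 100 microseconds on an SSD


def random_reads_per_second(latency_seconds: float) -> float:
    """Upper bound on serial random reads per second for a single device."""
    return 1.0 / latency_seconds


def time_for_reads(num_reads: int, latency_seconds: float) -> float:
    """Seconds needed to perform num_reads random reads one after another."""
    return num_reads * latency_seconds


disk_rate = random_reads_per_second(DISK_SEEK_SECONDS)
ssd_rate = random_reads_per_second(SSD_ACCESS_SECONDS)
disk_time = time_for_reads(1_000_000, DISK_SEEK_SECONDS)
ssd_time = time_for_reads(1_000_000, SSD_ACCESS_SECONDS)
```

Under these assumptions a single disk serves on the order of 100 random reads per second and an SSD on the order of 10,000, so a million cache misses cost hours on disk and under two minutes on SSD. That gap is why falling out of memory hurts far less when the backing store is SSD.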
In choosing a solution

Examine your requirements
  They will dictate certain choices

Once you have narrowed the field
  Prefer solutions that may become mainstream
  Consider TCO:
    Purchase cost
    Learning curve
    Productivity
    Viability
Which solution sets will become mainstream
High confidence
  Horizontally scalable: to take advantage of hardware trends
  Non-relational: to enable scalability
  Highly functional: for usage beyond mega-scale
  Developer-friendly: because decision making has shifted
  Freely available: for rapid adoption


My predictions
  Document oriented: enables scalability, functionality,
  developer friendliness, and agility
  Open source: with multiple PaaS providers

Re-inventing the Database: What to Keep and What to Throw Away
