NoSql at guardian.co.uk
        Matthew Wall
        Simon Willison
Not only SQL!
Guardian journalism online: 1995
Guardian journalism online: 1999
Guardian journalism online: 2000
Guardian journalism online: 2010
Read all about it!
I bring you NEWS!!!

Web server          Web server          Web server

App server          App server          App server

                 Memcached (20Gb)

                     Oracle

         CMS                     Data feeds
            Why RDBMS?

      5 years ago, fewer alternatives

      Understand operations procedures

      Can easily recruit DBAs / devs

      Developer/ops tools

      Business critical system: a safe choice
Related content from search engine   Big traffic spike




                Introduction of memcached
Distributed memcached

 Protects database from peak load

    Entities explicitly decached

        Queries given TTL

memcached = database supercharger
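
The two caching rules on this slide (entities explicitly decached, queries given a TTL) can be sketched roughly as below. This is only an illustration under assumptions, not the Guardian's code: `TTLCache` is a small in-memory stand-in for a memcached client, and the helper names, key prefixes and the 60 second TTL are hypothetical.

```python
import time


class TTLCache:
    """In-memory stand-in for a memcached client (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expires_at or None)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def delete(self, key):
        self._items.pop(key, None)


def get_entity(cache, db, entity_id):
    """Entities are cached without expiry and explicitly decached on write."""
    key = f"entity:{entity_id}"
    value = cache.get(key)
    if value is None:
        value = db[entity_id]  # cache miss: fall through to the database
        cache.set(key, value)
    return value


def save_entity(cache, db, entity_id, value):
    db[entity_id] = value
    cache.delete(f"entity:{entity_id}")  # explicit decache


def get_query(cache, run_query, query_key, ttl=60):
    """Query results are cached with a TTL instead of being invalidated."""
    key = f"query:{query_key}"
    result = cache.get(key)
    if result is None:
        result = run_query()
        cache.set(key, result, ttl=ttl)
    return result
```

Under this split, a single entity is never stale after a write, while a query result may be out of date for at most its TTL, so the database only sees each query once per TTL window even during a spike.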
Now we have a stable “broadcast” platform

        We know how to scale it

     SQL running effectively at core

          We’ve finished, right?
Digital journalism is changing

            We can’t cover everything

        We can’t compete with everyone

Need to be “part of the web” not just “on the web”
Mutualise the news!
Mutualisation of journalism

   Mutualised news!

    No longer only broadcasting content

      User engagement & contribution:
                journalism
                   data
                 software

         Data curation / linked data

Support engaged developers with data and APIs
Be a part of the data fabric of the internet
              Platform strategy

   Out: Release our data to the world via APIs

In: Rapidly build new functionality outside the core

  Write: Ingest, store & present arbitrary data
         Data Out

        Content API
Content API

             Delivered using Apache Solr
          Document oriented search engine

                   Loose schema:
                records, fields, facets

               Fields can be multi-value

          Supports dynamic field generation

Can apply multiple facets in queries faster than RDBMS
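
A faceted Solr request of the kind described here might be built as in the sketch below. The parameters `q`, `fq`, `facet`, `facet.field`, `rows` and `wt` are standard Solr query parameters; the host, the core name `content` and the field names `section`, `tag` and `contributor` are assumptions made up for illustration, not the Content API's real schema.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, text, filters, facet_fields, rows=10):
    """Build a Solr /select URL with filter queries and facet counts."""
    params = [("q", text), ("rows", rows), ("wt", "json"), ("facet", "true")]
    # Each fq narrows the result set, much like a WHERE clause.
    params += [("fq", f"{field}:{value}") for field, value in filters]
    # Each facet.field asks Solr for per-value counts over the matches.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


url = build_solr_query(
    "http://localhost:8983/solr/content",  # hypothetical host and core
    "election",
    filters=[("section", "politics")],     # hypothetical field names
    facet_fields=["tag", "contributor"],
)
print(url)
```

Several `facet.field` parameters can be sent in one request, which is what lets a single query return counts for many facets at once.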
       Is Solr a database?
Can perform complex queries, including full text search

    Can filter results with facets (WHERE clause)

      ANYTHING can be a facet. Very powerful.

  On our dataset most queries are of a similar cost

             Scales very well horizontally

            Handles millions of documents
         No transactions

  Excellent for certain types of queries

       Not truly general purpose

     Schema design very important

   Search index not really persistence
Core                                  Api

Web servers                           Solr
App server                            Solr
Memcached (20Gb)                      Solr
rdbms              Solr               Solr
M/Q                                   Solr
CMS                                   Cloud, EC2
API
    Currently powering iPad app

         Site components

       External applications

           Editors tools

          More to follow
           Data In

    Application framework
Application framework

   Simple REST/HTTP framework allows lightweight
                   development

         Applications proxied for performance

Apps generally hosted in the cloud, hot deployment into
                      production

           No RDBMS provided for storage

             Can develop in news timeline
Apps                                  Core

App                                   Web servers
App                Proxy              App server
App                                   Memcached (20Gb)
App                                   rdbms
App                                   M/Q
App                                   CMS
external hosting
app engine etc
NoSQL for journalism
Some useful characteristics
• Scale down as well as up
• Support rapid production-ready prototyping:
  turn projects around in hours or days
• Handle massive traffic spikes
Desktop analysis
• Leaked BNP membership list
• Load postcodes to constituencies mapping into Redis
• Generate heatmaps by looking up all 12,000 postcodes
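
The lookup step above can be sketched as follows. A plain Python dict stands in for Redis here (each entry would be a string key written with SET and read with GET), and the postcodes and constituency names are invented placeholders, not data from any real list.

```python
from collections import Counter

# Stand-in for Redis: a dict plays the role of SET/GET on string keys.
postcode_to_constituency = {}


def normalise(postcode):
    """Strip spaces and upper-case so lookups are format-insensitive."""
    return postcode.replace(" ", "").upper()


def load_mapping(rows):
    """Load (postcode, constituency) pairs into the key-value store."""
    for postcode, constituency in rows:
        postcode_to_constituency[normalise(postcode)] = constituency


def count_by_constituency(postcodes):
    """Look up every postcode and count entries per constituency."""
    counts = Counter()
    for postcode in postcodes:
        constituency = postcode_to_constituency.get(normalise(postcode))
        if constituency is not None:  # skip postcodes missing from the mapping
            counts[constituency] += 1
    return counts


# Invented sample data for illustration only.
load_mapping([("ZZ1 1AA", "Example North"), ("ZZ2 2BB", "Example South")])
counts = count_by_constituency(["zz1 1aa", "ZZ11AA", "ZZ2 2BB", "XX9 9XX"])
```

The per-constituency counts are the input a heatmap needs; with a key-value store each lookup is a single key read, so 12,000 of them is a desktop-sized job.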
MP’s expenses

     SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()
v2 used Redis

     Set difference: labour MP pages - reviewed pages

     SRANDMEMBER
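
The idea of "set difference, then pick a random member" can be sketched as below. Python sets stand in for Redis sets, with the corresponding Redis commands named in the comments; the page IDs and the function name are invented for illustration.

```python
import random

# Python sets stand in for Redis sets; the page IDs are invented.
labour_mp_pages = {"page:1", "page:2", "page:3", "page:4"}
reviewed_pages = {"page:2", "page:4"}


def pick_unreviewed(candidates, reviewed, rng=random):
    """Return a random page from candidates that is not yet reviewed."""
    remaining = candidates - reviewed  # Redis: SDIFF (or SDIFFSTORE)
    if not remaining:
        return None  # nothing left to review
    return rng.choice(sorted(remaining))  # Redis: SRANDMEMBER on the result


page = pick_unreviewed(labour_mp_pages, reviewed_pages)
```

Compared with the `ORDER BY RAND()` query on the previous slide, this picks one element from a precomputed set rather than ordering every unreviewed row on each request.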
BigTable: Zeitgeist
Zeitgeist stores pre-calculated results in BigTable
• Data comes in from stats system, comments system and OneRiot real-time search API
• AppEngine cron tasks populate task queues
• Task queues recalculate hotness levels
• “Live” BigTable queries are simple SELECT / SORT
Live debate poll




• Over a million votes cast in an hour
• Stretched limits of BigTable / AppEngine
• Sharded counter pattern to handle writes
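
The sharded counter pattern mentioned above can be sketched like this. On App Engine each shard would be a separate datastore entity so that concurrent writes do not all contend on one record; here a plain list stands in for those entities, and the class name and shard count are assumptions for illustration.

```python
import random


class ShardedCounter:
    """Spread increments over several shards to reduce write contention."""

    def __init__(self, num_shards=20):
        # Each list slot stands in for one independently writable entity.
        self._shards = [0] * num_shards

    def increment(self, amount=1):
        index = random.randrange(len(self._shards))  # pick a shard at random
        self._shards[index] += amount

    def total(self):
        return sum(self._shards)  # a read sums across all shards


votes = ShardedCounter(num_shards=20)
for _ in range(1000):
    votes.increment()
```

Writes become cheap because any shard will do, while reads cost one fetch per shard, which suits a poll where votes far outnumber reads of the total.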
Spreadsheets are NoSQL too...
Google Docs powered infographics
The Datablog
• Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets
• Retrieve data as CSV, XLS, JSON, Atom...
• “Make a copy” and run your own analysis
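
As a rough idea of what "run your own analysis" can look like once a spreadsheet has been retrieved as CSV, here is a small sketch. The CSV text, column names and function are invented for illustration; a real export would be downloaded from the spreadsheet itself.

```python
import csv
import io

# Invented sample standing in for a spreadsheet exported as CSV.
SAMPLE_CSV = """region,value
North,10
South,25
North,5
"""


def totals_by_column(csv_text, key_column, value_column):
    """Sum value_column grouped by key_column."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column]
        totals[key] = totals.get(key, 0) + float(row[value_column])
    return totals


totals = totals_by_column(SAMPLE_CSV, "region", "value")
```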
            Write

         Arbitrary data
Create schema-free database alongside RDBMS

               Index in Solr

           Provide access in API

           Investigating: CouchDB
In                          Core                                   Out

App                         Web servers                            Solr
App           Proxy         App server                             Solr
App                         Memcached (20Gb)                       Solr
App                         CMS     Data feeds       Solr          Solr
App                         M/Q                                    Solr
App                         rdbms      CouchDB?
external hosting                                                   Cloud, EC2
app engine etc
