No SQL at The Guardian

Presentation given at No:SQL EU conference describing architectures past, present & future for guardian.co.uk

Presentation Transcript

  • NoSQL at guardian.co.uk • Matthew Wall • Simon Willison
  • !
  • Not only SQL
  • Guardian journalism online: 1995
  • Guardian journalism online: 1999
  • Guardian journalism online: 2000
  • Guardian journalism online: 2010
  • Read all about it!
  • “I bring you NEWS!!!” • Architecture: web servers (×3) • App servers (×3) • Memcached (20 GB) • Oracle • CMS • Data feeds
  • Why RDBMS? • 5 years ago, fewer alternatives • Understand operations procedures • Can easily recruit DBAs / devs • Developer/ops tools • Business critical system: a safe choice • Architecture: web servers (×3) • App servers (×3) • Memcached • Oracle • CMS • Data feeds
  • Related content from search engine
  • Related content from search engine • Introduction of memcached
  • Related content from search engine • Big traffic spike • Introduction of memcached
  • Distributed memcached • Protects database from peak load • Entities explicitly decached • Queries given TTL • memcached = database supercharger
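The two invalidation rules on this slide (entities explicitly decached, queries given a TTL) can be sketched as follows. This is a minimal illustration, not the Guardian's code: `Cache` is an in-process stand-in for a memcached client, and `database`, `load_article`, `update_article` and `latest_headlines` are hypothetical names.

```python
import time


class Cache:
    """In-process stand-in for a memcached client (assumption, not a real client)."""

    def __init__(self):
        self._items = {}

    def set(self, key, value, ttl=None):
        # ttl is in seconds; None means "keep until explicitly deleted".
        expires = (time.monotonic() + ttl) if ttl is not None else None
        self._items[key] = (value, expires)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() >= expires:
            del self._items[key]
            return None
        return value

    def delete(self, key):
        self._items.pop(key, None)


cache = Cache()
database = {"article:1": "Original headline"}  # hypothetical stand-in for the RDBMS
stats = {"db_reads": 0}


def load_article(key):
    """Entity lookup: cached without a TTL, removed explicitly when it changes."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    stats["db_reads"] += 1
    value = database[key]
    cache.set(key, value)
    return value


def update_article(key, value):
    """Write to the database, then explicitly decache the entity."""
    database[key] = value
    cache.delete(key)


def latest_headlines(ttl=60):
    """Query result: cached with a TTL, so it may be stale until it expires."""
    key = "query:latest"
    cached = cache.get(key)
    if cached is not None:
        return cached
    stats["db_reads"] += 1
    result = sorted(database.values())
    cache.set(key, result, ttl=ttl)
    return result
```

With this split, an edited article is visible on its own page immediately, while list-style queries catch up once their TTL runs out.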
  • Now we have a stable “broadcast” platform • We know how to scale it • SQL running effectively at core • We’ve finished, right?
  • Digital journalism is changing • We can’t cover everything • We can’t compete with everyone • Need to be “part of the web” not just “on the web”
  • Mutualise the news!
  • Mutualised news! Mutualisation of journalism content • No longer only broadcasting • User engagement & contribution: journalism, data, software • Data curation / linked data • Support engaged developers with data and APIs
  • Be a part of the data fabric of the internet
  • Platform strategy • Out: Release our data to the world via APIs • In: Rapidly build new functionality outside the core • Write: Ingest, store & present arbitrary data
  • Data Out: Content API
  • Content API • Delivered using Apache Solr • Document-oriented search engine • Loose schema: records, fields, facets • Fields can be multi-value • Supports dynamic field generation • Can apply multiple facets in queries faster than RDBMS
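To make "records, fields, facets" and multi-value fields concrete, here is a small in-memory sketch of what a facet count does. It does not call Solr; the documents, field names (`section`, `tags`, `byline_s`) and helper functions are invented for illustration. The `_s` suffix mimics the common Solr convention for a dynamic string field.

```python
from collections import Counter

# Hypothetical records: loose schema, one multi-valued field ("tags"),
# and one field ("byline_s") present on only some records.
documents = [
    {"id": "a1", "section": "politics", "tags": ["election", "labour"]},
    {"id": "a2", "section": "politics", "tags": ["election"], "byline_s": "Staff"},
    {"id": "a3", "section": "sport", "tags": ["football"]},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # loose schema: a record may simply lack the field
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return dict(counts)


def filter_by(docs, field, wanted):
    """Keep documents whose `field` equals or contains `wanted`."""
    kept = []
    for doc in docs:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        if wanted in values:
            kept.append(doc)
    return kept
```

Filtering on one facet and counting another, for example `facet_counts(filter_by(documents, "section", "politics"), "tags")`, is the kind of drill-down a faceted search engine answers from its index.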
  • Is Solr a database?
  • Can perform complex queries, including full text search • Can filter results with facets (WHERE clause) • ANYTHING can be a facet. Very powerful. • On our dataset most queries are of a similar cost • Scales very well horizontally • Handles millions of documents
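As a rough sketch of "filter results with facets (WHERE clause)", the snippet below builds a Solr select URL using the standard `q`, `fq`, `facet` and `facet.field` request parameters. The base URL, field names and the `build_select_url` helper are assumptions for illustration; nothing here reflects the actual Content API schema.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, text, filters, facet_fields, rows=10):
    """Build a Solr /select URL: full-text query, filter queries, facet fields."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    for field, value in filters.items():
        # Each fq narrows the result set, much like an ANDed WHERE condition.
        params.append(("fq", f'{field}:"{value}"'))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and field names.
example_url = build_select_url(
    "http://localhost:8983/solr",
    "election",
    {"section": "politics"},
    ["tag", "section"],
)
example_params = parse_qs(urlsplit(example_url).query)
```

Repeating `facet.field` asks for counts on several fields in one request, which is the "multiple facets in queries" behaviour the slides describe.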
  • No transactions • Excellent for certain types of queries • Not truly general purpose • Schema design very important • Search index not really persistence
  • Architecture • Core: web servers, app server, Memcached (20 GB), RDBMS, M/Q, CMS • API: multiple Solr instances (cloud, EC2)
  • API • Currently powering: iPad app • Site components • External applications • Editors’ tools • More to follow
  • Data In: Application framework
  • Application framework • Simple REST/HTTP framework allows lightweight development • Applications proxied for performance • Apps generally hosted in the cloud, hot deployment into production • No RDBMS provided for storage • Can develop in news timeline
  • Architecture • Core: web servers, proxy, app server, Memcached (20 GB), RDBMS, M/Q, CMS • Apps: multiple apps (external hosting, App Engine etc.)
  • NoSQL for journalism
  • Some useful characteristics • Scale down as well as up • Support rapid production-ready prototyping: turn projects around in hours or days • Handle massive traffic spikes
  • Desktop analysis • Leaked BNP membership list • Load postcodes to constituencies mapping into Redis • Generate heatmaps by looking up all 12,000 postcodes
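The lookup described here amounts to one key-value read per postcode, followed by a tally per constituency. A minimal sketch, assuming a plain dict as a stand-in for Redis SET/GET; the postcodes, constituency names and function names are invented for illustration.

```python
from collections import Counter

# Stand-in for the Redis keyspace: normalised postcode -> constituency.
postcode_to_constituency = {}


def normalise(postcode):
    """Uppercase and strip spaces so 'zz1 1aa' and 'ZZ11AA' hit the same key."""
    return postcode.replace(" ", "").upper()


def load_mapping(rows):
    """Equivalent of one SET per (postcode, constituency) pair."""
    for postcode, constituency in rows:
        postcode_to_constituency[normalise(postcode)] = constituency


def constituency_counts(postcodes):
    """Equivalent of one GET per postcode; unknown postcodes are skipped."""
    counts = Counter()
    for postcode in postcodes:
        constituency = postcode_to_constituency.get(normalise(postcode))
        if constituency is not None:
            counts[constituency] += 1
    return dict(counts)


# Invented sample data, not real postcodes.
load_mapping([
    ("ZZ1 1AA", "Constituency A"),
    ("ZZ1 1AB", "Constituency A"),
    ("ZZ2 2AA", "Constituency B"),
])
```

The per-constituency counts are the input a heatmap needs; at 12,000 lookups the whole dataset fits comfortably on a desktop machine.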
  • MPs’ expenses
  • MPs’ expenses • SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()
  • v2 used Redis
  • v2 used Redis • Set difference: labour MP pages - reviewed pages • SRANDMEMBER
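The v2 approach can be sketched with Python sets standing in for Redis sets: the subtraction corresponds to a Redis set difference (SDIFF / SDIFFSTORE) and the random pick to SRANDMEMBER. The page identifiers and function names below are hypothetical.

```python
import random

# Stand-ins for two Redis sets (hypothetical page identifiers).
all_pages = {"page:1", "page:2", "page:3", "page:4"}
reviewed_pages = set()


def next_unreviewed(rng=random):
    """Pick a random page that nobody has reviewed yet, or None when done."""
    remaining = all_pages - reviewed_pages  # set difference
    if not remaining:
        return None
    # Sorting first only makes the choice reproducible with a seeded rng.
    return rng.choice(sorted(remaining))  # random member of the difference


def mark_reviewed(page):
    """Equivalent of SADD on the reviewed set."""
    reviewed_pages.add(page)
```

Compared with the SQL version on the previous slide, no per-request sort of the whole table is needed: the pool of candidates shrinks as pages are reviewed.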
  • BigTable: Zeitgeist
  • Zeitgeist stores pre-calculated results in BigTable • Data comes in from stats system, comments system and OneRiot real-time search API • AppEngine cron tasks populate task queues • Task queues recalculate hotness levels • “Live” BigTable queries are simple SELECT / SORT
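The pipeline on this slide (cron fills a queue, queue workers recalculate, live reads only sort stored values) can be sketched in plain Python. Nothing below uses App Engine APIs; the data, the `hotness` formula and all names are assumptions made for illustration.

```python
from collections import deque

# Hypothetical raw signals, as they might arrive from stats and comments systems.
raw_signals = {
    "article-1": {"views": 120, "comments": 4},
    "article-2": {"views": 80, "comments": 12},
    "article-3": {"views": 300, "comments": 0},
}

task_queue = deque()   # stand-in for a task queue
hotness_table = {}     # stand-in for pre-calculated rows in the datastore


def enqueue_recalculations():
    """What a cron job would do: schedule one recalculation per article."""
    for article_id in raw_signals:
        task_queue.append(article_id)


def hotness(signals):
    """Invented scoring formula, purely for illustration."""
    return signals["views"] + 10 * signals["comments"]


def run_tasks():
    """What queue workers would do: recompute and store each score."""
    while task_queue:
        article_id = task_queue.popleft()
        hotness_table[article_id] = hotness(raw_signals[article_id])


def hottest(limit):
    """The 'live' query: read stored scores and sort, with no calculation."""
    return sorted(hotness_table, key=hotness_table.get, reverse=True)[:limit]
```

Because all the expensive work happens off the request path, the page-serving query stays a simple sorted read.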
  • Live debate poll • Over a million votes cast in an hour • Stretched limits of BigTable / AppEngine • Sharded counter pattern to handle writes
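A sharded counter spreads writes across several independent counters and sums them on read, so no single record absorbs every vote. The sketch below keeps the shards in a Python list rather than datastore entities; the class name and shard count are assumptions for illustration.

```python
import random


class ShardedCounter:
    """Counter split into shards; each write touches one randomly chosen shard."""

    def __init__(self, num_shards=20, rng=None):
        self.shards = [0] * num_shards
        self._rng = rng or random.Random()

    def increment(self):
        # In a datastore, each shard would be its own entity, so concurrent
        # writers rarely contend on the same one.
        index = self._rng.randrange(len(self.shards))
        self.shards[index] += 1

    def total(self):
        # Reads pay the cost instead: sum every shard.
        return sum(self.shards)
```

The trade-off is cheaper, less contended writes in exchange for a slightly more expensive read, which suits a poll where votes vastly outnumber result lookups.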
  • Spreadsheets are NoSQL too...
  • Google Docs powered infographics
  • The Datablog
  • Datablog was launched with no development involvement at all: it’s a blog, and a bunch of Google Docs Spreadsheets • Retrieve data as CSV, XLS, JSON, Atom... • “Make a copy” and run your own analysis
  • Write: Arbitrary data
  • Create schema-free database alongside RDBMS • Index in Solr • Provide access in API • Investigating: CouchDB
  • Architecture • Core: web servers, proxy, app server, Memcached (20 GB), CMS, data feeds, M/Q, RDBMS, CouchDB? • Out: multiple Solr instances (cloud, EC2) • In: multiple apps (external hosting, App Engine etc.)