NoSQL at guardian.co.uk
Matthew Wall
Simon Willison
Not only SQL
Guardian journalism online: 1995
Guardian journalism online: 1999
Guardian journalism online: 2000
Guardian journalism online: 2010
Read all about it!
I bring you NEWS!!!
[Architecture diagram: three web servers, three app servers, Memcached (20Gb), Oracle, CMS, data feeds]
Why RDBMS?
5 years ago, fewer alternatives
Understand operations procedures
Can easily recruit DBAs / devs
Developer/ops tools
Business critical system: a safe choice
Related content from search engine
Big traffic spike
Introduction of memcached
Distributed memcached
Protects database from peak load
Entities explicitly decached
Queries given TTL
memcached = database supercharger
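The two caching rules on this slide (entities explicitly decached on write, query results expiring after a TTL) can be sketched roughly as below. This is only an illustration, not the Guardian's implementation: `TTLCache` is an in-process stand-in for a memcached client, and `DATABASE`, `STATS`, `load_entity`, `save_entity` and `run_query` are hypothetical names invented for the example.

```python
import time


class TTLCache:
    """In-process stand-in for a memcached client (get/set/delete only)."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._store[key]  # expired: behave as a cache miss
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._store[key] = (value, expires_at)

    def delete(self, key):
        self._store.pop(key, None)


# Hypothetical "database" rows and a counter of how often we hit them.
DATABASE = {"article:1": {"headline": "Read all about it!"}}
STATS = {"db_reads": 0}


def load_entity(cache, key):
    """Read-through: serve from cache, fall back to the database on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    STATS["db_reads"] += 1
    value = dict(DATABASE[key])
    cache.set(key, value)  # no TTL: entities live until explicitly decached
    return value


def save_entity(cache, key, value):
    """Write to the database, then explicitly decache the entity."""
    DATABASE[key] = dict(value)
    cache.delete(key)


def run_query(cache, name, compute, ttl=60):
    """Query results are hard to invalidate precisely, so they just expire."""
    key = "query:" + name
    cached = cache.get(key)
    if cached is not None:
        return cached
    result = compute()
    cache.set(key, result, ttl=ttl)
    return result
```

The split reflects a common trade-off: a single entity has an obvious key to delete when it changes, whereas a query result may depend on many rows, so letting it go stale for a bounded time is simpler than tracking every dependency.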
Now we have a stable “broadcast” platform
We know how to scale it
SQL running effectively at core
We’ve finished, right?
Digital journalism is changing
We can’t cover everything
We can’t compete with everyone
Need to be “part of the web” not just “on the web”
Mutualise the news!
Mutualisation of journalism
Mutualised news! content
No longer only broadcasting
User engagement & contribution:
journalism
data
software
Data curation / linked data
Support engaged developers with data and APIs
Be a part of the data fabric of the internet
Platform strategy
Out: Release our data to the world via APIs
In: Rapidly build new functionality outside the core
Write: Ingest, store & present arbitrary data
Data Out
Content API
Delivered using Apache Solr
Document oriented search engine
Loose schema:
records, fields, facets
Fields can be multi-value
Supports dynamic field generation
Can apply multiple facets in queries faster than RDBMS
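To make "facets in queries" concrete, here is a small sketch that builds a faceted Solr request URL. The parameters `q`, `fq`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters; the base URL, the field names (`section`, `tag`, `contributor`) and the helper `build_facet_query` are assumptions made for illustration and are not taken from the Content API.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, text, filters=None, facet_fields=(), rows=10):
    """Build a Solr select URL with filter queries and facet counts."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    # Each filter narrows the result set, much like a WHERE clause.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    # Ask Solr to return value counts for each requested facet field.
    if facet_fields:
        params.append(("facet", "true"))
        for field in facet_fields:
            params.append(("facet.field", field))
    return base_url + "?" + urlencode(params)


# Hypothetical Solr endpoint and field names.
url = build_facet_query(
    "http://localhost:8983/solr/select",
    "election",
    filters={"section": "politics"},
    facet_fields=["tag", "contributor"],
)
print(url)
```

Because `facet.field` can be repeated, one request can return counts for several facets at once, which is the kind of query that tends to need multiple GROUP BY passes in an RDBMS.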
Is Solr a database?
Can perform complex queries, including full text search
Can filter results with facets (WHERE clause)
ANYTHING can be a facet. Very powerful.
On our dataset most queries are of a similar cost
Scales very well horizontally
Handles millions of documents
No transactions
Excellent for certain types of queries
Not truly general purpose
Schema design very important
Search index not really persistence
[Architecture diagram. Core: web servers, app server, Memcached (20Gb), RDBMS, CMS. M/Q. API: Solr instances hosted in the cloud (EC2).]
API
Currently powering iPad app
Site components
External applications
Editors' tools
More to follow
Data In
Application framework
Simple REST/HTTP framework allows lightweight development
Applications proxied for performance
Apps generally hosted in the cloud, hot deployment into production
No RDBMS provided for storage
Can develop in news timeline
Some useful characteristics
• Scale down as well as up
• Support rapid production-ready prototyping: turn projects around in hours or days
• Handle massive traffic spikes
Desktop analysis
• Leaked BNP membership list
• Load postcodes to constituencies mapping into Redis
• Generate heatmaps by looking up all 12,000 postcodes
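The lookup step above amounts to a key-value read per postcode followed by a tally per constituency. A minimal sketch, assuming a plain dictionary in place of Redis: the postcodes and constituency names below are made up, and `normalise` and `count_by_constituency` are hypothetical helpers. Against a real Redis instance, `lookup` would be replaced by a GET per postcode.

```python
from collections import Counter

# Made-up sample of a postcode -> constituency mapping (the real mapping
# would be loaded into Redis as one key per postcode).
POSTCODE_TO_CONSTITUENCY = {
    "ZZ1 1AA": "Constituency A",
    "ZZ1 1AB": "Constituency A",
    "ZZ2 9XY": "Constituency B",
}


def normalise(postcode):
    """Upper-case and collapse whitespace so lookups are consistent."""
    return " ".join(postcode.upper().split())


def count_by_constituency(postcodes, lookup=POSTCODE_TO_CONSTITUENCY.get):
    """Tally postcodes per constituency; collect any that do not match."""
    counts = Counter()
    unmatched = []
    for raw in postcodes:
        constituency = lookup(normalise(raw))
        if constituency is None:
            unmatched.append(raw)
        else:
            counts[constituency] += 1
    return counts, unmatched


counts, unmatched = count_by_constituency(
    ["zz1 1aa", "ZZ1  1AB", "ZZ2 9XY", "QQ9 9QQ"]
)
```

The per-constituency counts are what a heatmap would then colour by.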
MPs' expenses
SELECT * FROM pages WHERE
is_reviewed = 0 ORDER BY RAND()
v2 used Redis
Set difference: labour MP pages - reviewed pages
SRANDMEMBER
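The idea of replacing ORDER BY RAND() with a set difference plus a random pick can be sketched with ordinary Python sets. This is a stand-in only: in Redis the difference would come from SDIFF or SDIFFSTORE and the pick from SRANDMEMBER. The page identifiers and the function `pick_unreviewed_page` are hypothetical.

```python
import random


def pick_unreviewed_page(candidate_pages, reviewed_pages, rng=random):
    """Return a random page that has not been reviewed yet, or None."""
    # Equivalent in spirit to SDIFF candidates reviewed in Redis.
    remaining = candidate_pages - reviewed_pages
    if not remaining:
        return None
    # Equivalent in spirit to SRANDMEMBER on the resulting set.
    # Sorting first gives random.choice a sequence to index into.
    return rng.choice(sorted(remaining))


# Hypothetical page identifiers.
labour_mp_pages = {"page:101", "page:102", "page:103"}
reviewed = {"page:101", "page:103"}
next_page = pick_unreviewed_page(labour_mp_pages, reviewed)
```

Unlike the SQL version, this never sorts the whole table: the cost depends on the size of the sets involved rather than on shuffling every row for each request.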
BigTable: Zeitgeist
Zeitgeist stores pre-calculated results in BigTable
• Data comes in from stats system, comments system and OneRiot real-time search API
• AppEngine cron tasks populate task queues
• Task queues recalculate hotness levels
• “Live” BigTable queries are simple SELECT / SORT
Live debate poll
• Over a million votes cast in an hour
• Stretched limits of BigTable / AppEngine
• Sharded counter pattern to handle writes
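The sharded counter pattern spreads writes for one logical counter across several shards and sums them on read. The sketch below keeps the shards in a Python list purely to show the shape of the technique; on AppEngine each shard would be a separate datastore entity so that concurrent writers rarely contend on the same one. The class name and the default of 20 shards are assumptions for illustration.

```python
import random


class ShardedCounter:
    """One logical counter split across several independently written shards."""

    def __init__(self, num_shards=20, rng=random):
        self._shards = [0] * num_shards
        self._rng = rng

    def increment(self, amount=1):
        # Pick a shard at random so writes are spread evenly.
        index = self._rng.randrange(len(self._shards))
        self._shards[index] += amount

    def total(self):
        # Reads are more expensive: every shard has to be summed.
        return sum(self._shards)


# Hypothetical poll with two options, one sharded counter per option.
votes = {"yes": ShardedCounter(), "no": ShardedCounter()}
for _ in range(1000):
    votes["yes"].increment()
for _ in range(250):
    votes["no"].increment()
```

The pattern trades cheaper, less contended writes for a slightly more expensive read, which suits a poll where votes arrive far more often than totals are displayed.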
Spreadsheets are NoSQL too...
Google Docs powered infographics
The Datablog
• Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets
• Retrieve data as CSV, XLS, JSON, Atom...
• “Make a copy” and run your own analysis
Write
Arbitrary data
Create schema free database alongside RDBMS
Index in Solr
Provide access in API
Investigating: CouchDB
[Architecture diagram. Core: web servers, app server, Memcached (20Gb), RDBMS (CouchDB? alongside), CMS, data feeds. Out: M/Q feeding Solr instances in the cloud (EC2). In: proxy in front of apps on external hosting (app engine etc).]