NoSql presentation

Presentation given at NoSql EU conference describing architectures past, present & future for guardian.co.uk

  1. NoSql at guardian.co.uk • Matthew Wall • Simon Willison
  2. !
  3. SQL
  4. not only
  5. Guardian journalism online: 1995
  6. Guardian journalism online: 1999
  7. Guardian journalism online: 2000
  8. Guardian journalism online: 2010
  9. Read all about it!
  10. Architecture diagram • Web server ×3 • App server ×3 • Memcached (20Gb) • Oracle • CMS • Data feeds • Caption: “I bring you NEWS!!!”
  11. Why RDBMS? • 5 years ago, fewer alternatives • Understand operations procedures • Can easily recruit DBAs / devs • Developer/ops tools • Business critical system: a safe choice • (Shown beside the same architecture diagram: web servers, app servers, Memcached, Oracle, CMS, data feeds)
  12. Related content from search engine
  13. Related content from search engine Introduction of memcached
  14. Related content from search engine Big traffic spike Introduction of memcached
  15. Distributed memcached • Protects database from peak load • Entities explicitly decached • Queries given TTL • memcached = database supercharger
  16. Now we have a stable “broadcast” platform • We know how to scale it • SQL running effectively at core • We’ve finished, right?
  17. Digital journalism is changing • We can’t cover everything • We can’t compete with everyone • Need to be “part of the web” not just “on the web”
  18. Mutualise the news!
  19. Mutualised news! Mutualisation of journalism • No longer only broadcasting • User engagement & contribution: content, journalism, data, software • Data curation / linked data • Support engaged developers with data and APIs
  20. Mutualised news! Be a part of the data fabric of the internet
  21. Mutualised news! Platform strategy • Out: Release our data to the world via APIs • In: Rapidly build new functionality outside the core • Write: Ingest, store & present arbitrary data
  22. Mutualised news! Data Out Content API
  23. Mutualised news! Content API • Delivered using Apache Solr • Document oriented search engine • Loose schema: records, fields, facets • Fields can be multi-value • Supports dynamic field generation • Can apply multiple facets in queries faster than RDBMS
  24. Mutualised news!
  25. Mutualised news!
  26. Mutualised news!
  27. Mutualised news! Is Solr a database?
  28. Mutualised news! Can perform complex queries, including full text search • Can filter results with facets (WHERE clause) • ANYTHING can be a facet. Very powerful. • On our dataset most queries are of a similar cost • Scales very well horizontally • Handles millions of documents
  29. Mutualised news! No transactions • Excellent for certain types of queries • Not truly general purpose • Schema design very important • Search index not really persistence
  30. Architecture diagram • Core: Web servers, App server, Memcached (20Gb), rdbms, M/Q, CMS • Api: multiple Solr instances (Cloud, EC2)
  31. Mutualised news! API • Currently powering iPad app • Site components • External applications • Editors’ tools • More to follow
  32. Mutualised news! Data In Application framework
  33. Mutualised news! Application framework • Simple REST/HTTP allows lightweight framework development • Applications proxied for performance • Apps generally hosted in the cloud, hot deployment into production • No RDBMS provided for storage • Can develop in news timeline
  34. Architecture diagram • Core: Web servers, App server, Memcached (20Gb), rdbms, M/Q, CMS • Proxy • Apps: multiple App instances (external hosting, app engine etc)
  35. NoSQL for journalism
  36. Some useful characteristics • Scale down as well as up • Support rapid production-ready prototyping: turn projects around in hours or days • Handle massive traffic spikes
  37. Desktop analysis • Leaked BNP membership list • Load postcode-to-constituency mapping into Redis • Generate heatmaps by looking up all 12,000 postcodes
  38. MP’s expenses
  39. MP’s expenses SELECT * FROM pages WHERE is_reviewed = 0 ORDER BY RAND()
  40. v2 used Redis
  41. v2 used Redis • Set difference: Labour MP pages - reviewed pages • SRANDMEMBER
  42. BigTable: Zeitgeist
  43. Zeitgeist stores pre-calculated results in BigTable • Data comes in from stats system, comments system and OneRiot real-time search API • AppEngine cron tasks populate task queues • Task queues recalculate hotness levels • “Live” BigTable queries are simple SELECT / SORT
  44. Live debate poll • Over a million votes cast in an hour • Stretched limits of BigTable / AppEngine • Sharded counter pattern to handle writes
  45. Spreadsheets are NoSQL too...
  46. Google Docs powered infographics
  47. The Datablog
  48. • Datablog was launched with no development involvement at all - it’s a blog, and a bunch of Google Docs Spreadsheets • Retrieve data as CSV, XLS, JSON, Atom... • “Make a copy” and run your own analysis
  49. Mutualised news! Write Arbitrary data
  50. Mutualised news! Create schema-free database alongside RDBMS • Index in Solr • Provide access in API • Investigating: CouchDB
  51. Architecture diagram • Core: Web servers, App server, Memcached (20Gb), CMS, Data feeds, M/Q, rdbms, CouchDB? • Out: multiple Solr instances (Cloud, EC2) • In: Proxy, multiple App instances (external hosting, app engine etc)
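
Slide 15 says entities are explicitly decached while queries are given a TTL. The slides do not show how this was implemented, so the following is only a minimal sketch of that idea. `TinyCache` is a hypothetical in-memory stand-in for a memcached client, `ArticleStore` and its method names are invented for illustration, and a plain dict plays the role of the database.

```python
import time


class TinyCache:
    """Hypothetical in-memory stand-in for a memcached client."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._store[key]  # entry has outlived its TTL
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._store[key] = (value, expires_at)

    def delete(self, key):
        self._store.pop(key, None)


class ArticleStore:
    """Reads go through the cache; the dict `database` stands in for the RDBMS."""

    def __init__(self, database, cache, query_ttl=60):
        self.database = database
        self.cache = cache
        self.query_ttl = query_ttl
        self.db_reads = 0  # counts how often the "database" is actually hit

    def get_article(self, article_id):
        key = f"article:{article_id}"
        article = self.cache.get(key)
        if article is None:
            self.db_reads += 1
            article = dict(self.database[article_id])
            self.cache.set(key, article)  # no TTL: removed explicitly on update
        return article

    def update_article(self, article_id, **fields):
        self.database[article_id].update(fields)
        self.cache.delete(f"article:{article_id}")  # explicit decache

    def ids_by_section(self, section):
        key = f"query:section:{section}"
        ids = self.cache.get(key)
        if ids is None:
            self.db_reads += 1
            ids = sorted(i for i, a in self.database.items()
                         if a["section"] == section)
            self.cache.set(key, ids, ttl=self.query_ttl)  # query results expire
        return ids
```

The split reflects a common trade-off: a single entity has an obvious cache key to delete when it changes, whereas it is hard to know which cached query results an update affects, so those are simply allowed to expire.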

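Slides 39 to 41 contrast `ORDER BY RAND()` with a Redis set difference followed by SRANDMEMBER. As a rough illustration of that approach, the sketch below uses Python sets in place of Redis sets; the class name, method names and page ids are assumptions, and the comments only indicate which Redis command each step loosely corresponds to.

```python
import random


class ReviewQueue:
    """Picks a random unreviewed page using set operations.

    Python sets stand in for Redis sets here; nothing talks to a real server.
    """

    def __init__(self, page_ids, rng=None):
        self.pages = set(page_ids)   # all page ids
        self.reviewed = set()        # ids that have already been reviewed
        self.rng = rng if rng is not None else random.Random()

    def mark_reviewed(self, page_id):
        self.reviewed.add(page_id)   # roughly SADD on the reviewed set

    def unreviewed(self):
        return self.pages - self.reviewed   # roughly SDIFF pages reviewed

    def random_unreviewed(self):
        candidates = self.unreviewed()
        if not candidates:
            return None
        # Roughly SRANDMEMBER on the difference; sorted only so that a
        # seeded rng gives repeatable picks.
        return self.rng.choice(sorted(candidates))
```

A possible use: build the queue from the page ids of one group of MPs, call `random_unreviewed()` to hand a page to each visitor, and call `mark_reviewed()` when a review comes back.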
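
Slide 44 mentions the sharded counter pattern for handling writes during the live debate poll. The slides give no code, so this is just a sketch of the general pattern under stated assumptions: a Python list stands in for the separate datastore entities that would hold each shard, and the class name and the shard count of 20 are arbitrary choices for illustration.

```python
import random


class ShardedCounter:
    """Spreads increments over several shards and sums them on read."""

    def __init__(self, num_shards=20, rng=None):
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        # In a datastore each shard would be its own entity, so concurrent
        # writers rarely contend on the same one. Here it is just a list.
        self.shards = [0] * num_shards
        self.rng = rng if rng is not None else random.Random()

    def increment(self, amount=1):
        index = self.rng.randrange(len(self.shards))  # pick a shard at random
        self.shards[index] += amount

    def total(self):
        return sum(self.shards)  # a read has to visit every shard
```

The design trades cheaper, less contended writes for a more expensive read, which suits a poll where votes arrive far more often than the total needs to be displayed exactly.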