The Guardian Open Platform Content API: Implementation


The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API - a rich interface to all The Guardian's content and metadata back to 1991 - over 1 million documents. This talk starts with a brief overview of the latest iteration of the content API. It will then cover how we implemented this in Scala using Solr, addressing real-world problems in creating an index of content:

how we represented a complex relational database model in Solr

how we keep the index up to date, meeting a sub-5 minute end-to-end update requirement

how we update the schema as the API evolves, with zero downtime

how we scale in response to unpredictable demand, using cloud services

Published in: Technology

  • As Stephen said:
    Very basic links to interesting content
  • Note the registration paywall
  • Broadcast, stories, basic community
    Rebuild started in 2005
  • “Web 2.0”, community, (full fat) RSS, discoverability, tagging.
    Where do we go from here? Other newspaper sites - looking to restrict access to content via paywalls etc - we’re looking to open up

  • We’ve spent the last 12 months experimenting around open distribution and open partnerships - 4 initiatives make up the open platform (right now)
    (As Stephen said)
  • This talk focuses on the content API - provides a way for others to re-present our content in their applications
  • http://content.guardianapis.com
  • http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance
    (most users want most recent content, so default ordering is newest)
    This is just a dismax search
  • Can also retrieve extra metadata, including tags
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
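
The search URLs above are plain query strings, so a client can assemble them from a dictionary of options. A minimal sketch in Python (the talk's own code is Scala and is not shown here); the helper name `search_url` is hypothetical, and the URL shape is simply the one quoted in these notes, which may have changed since:

```python
from urllib.parse import urlencode, quote

# Endpoint as quoted in the notes above.
BASE_URL = "http://content.guardianapis.com/search.json"


def search_url(query, **options):
    """Build a search URL. Keyword names use underscores, the API uses hyphens."""
    params = {"q": query}
    for name, value in options.items():
        params[name.replace("_", "-")] = value
    # quote_via=quote encodes spaces as %20, matching the URLs in the notes.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = search_url("prague beer", order_by="relevance",
                 show_fields="all", show_tags="all")
print(url)
```

Adding `api_key="..."` to the call would append the `api-key` parameter in the same way.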

  • If you have an API key, you can get full content. (You need to apply for this and agree to some T&Cs - mostly to ensure that we can take down content for legal reasons.) This example key is only valid for this conference and will be disabled afterwards :)
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010

  • Refinements give you the ability to narrow down your result set (of course, these are just Solr facets)
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
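
The notes say that a search is "just a dismax search", that the default ordering is newest first, and that refinements are Solr facets. A rough Python sketch of how API parameters might be translated into Solr request parameters follows. `q`, `defType`, `sort`, `fq`, `facet` and `facet.field` are standard Solr parameters; the date field name, the choice of facet field, and the `record-type` filter are assumptions made for illustration, not the Guardian's actual mapping:

```python
DATE_FIELD = "web-publication-date"  # hypothetical field name
FACET_FIELDS = ["tag-ids"]           # field seen on the slides; faceting on it is an assumption


def to_solr_params(api_params):
    """Translate Content API style parameters into Solr request parameters."""
    solr = {
        "q": api_params["q"],
        "defType": "dismax",          # "this is just a dismax search"
        "fq": "record-type:content",  # assumed filter on the record-type field
    }
    if api_params.get("order-by") == "relevance":
        solr["sort"] = "score desc"
    else:
        # Default ordering is newest first.
        solr["sort"] = DATE_FIELD + " desc"
    if api_params.get("show-refinements") == "all":
        # Refinements are just Solr facets.
        solr["facet"] = "true"
        solr["facet.field"] = list(FACET_FIELDS)
    return solr


print(to_solr_params({"q": "prague beer", "order-by": "relevance",
                      "show-refinements": "all"}))
```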

  • Our current architecture - perhaps we could feed the content api off the database?

  • time to developer understanding: about 2 hours

  • Currently we rebuild every night, with incremental updates during the day
    [next] Expose the Solr master to EC2, create hosts in EC2 that replicate using Solr replication - works fantastically. 6GB index size. Load-balancer config.
    We use solr.war from the 1.4 distribution totally unchanged - and run the API webapp in the same Jetty container

  • Lots of talk nowadays on “no sql” solutions
  • No.
    Designed a new logo that better reflects where we currently are

  • disclaimer: the next slides describe how *we* did it; not necessarily best practice!
    We took the opportunity to simplify our domain model....
  • Content fields are just fields
    But also need to map tags, media, and factboxes




  • Here’s how we model tags & content
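
As a rough illustration of the tag and content model, here are the two record types from the slides written as Python dictionaries. The field names and values are copied from the slides (lists are cut at the slides' own ellipses); the in-memory `index` and the `tags_for` helper are hypothetical stand-ins for a second lookup against Solr, not the API's actual code:

```python
# A content record carries its tag ids and tag names inline (denormalised).
content = {
    "record-type": "content",
    "id": "world/picture/2010/may/14/formula-one-monaco",
    "tag-ids": ["world/series/eyewitness", "sport/formulaone", "world/monaco"],
    "tag-external-names": ["Eyewitness", "Formula One", "Monaco"],
}

# Each tag is also stored as its own record, distinguished by record-type.
tag = {
    "record-type": "tag",
    "id": "world/series/eyewitness",
    "section-name": "World news",
    "web-title": "Eyewitness",
    "type": "series",
}

# Stand-in for the single index holding both record types.
index = {(doc["record-type"], doc["id"]): doc for doc in (content, tag)}


def tags_for(content_doc, index):
    """Return the full tag records referenced by a content record, where present."""
    keys = [("tag", tag_id) for tag_id in content_doc["tag-ids"]]
    return [index[key] for key in keys if key in index]


print([t["web-title"] for t in tags_for(content, index)])
```

Keeping the tag ids and names on the content record means a search can match and return tags without a join, while the separate tag records hold the fuller tag metadata.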






  • Fact boxes associate arbitrary information with content
    We need to search them, but they have a 1-to-1 relationship with content
    So no separate record
  • show-media allows access to the non-text assets of an item of content
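
Because fact boxes have no record of their own, the slides show them packed into multi-valued string fields on the content record, with parts joined by `~|~`. A minimal Python sketch of that kind of packing; reading the three parts as (factbox id, fact name, fact value) is an assumption based on the slide example, and the function names are hypothetical:

```python
SEP = "~|~"  # delimiter visible in the fact-data values on the slides


def pack_fact(factbox_id, name, value):
    """Join the parts of one fact into a single string field value."""
    return SEP.join([str(factbox_id), name, value])


def unpack_fact(packed):
    """Split a packed fact into (factbox_id, name, value)."""
    # maxsplit=2 keeps any later delimiter inside the value intact.
    factbox_id, name, value = packed.split(SEP, 2)
    return int(factbox_id), name, value


packed = pack_fact(197544, "pro-tip", "The photographer has framed the cars")
print(packed)
print(unpack_fact(packed))
```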





  • The code mostly just takes input params, converts them to a Solr query, and transforms the result to JSON or XML
    I’m not here to talk about Scala, but here’s a quick couple of snippets
  • RichSolrDocument makes SolrDocument more Scala-ish
  • Scala can make writing understandable code much easier
  • Supporting auto scaling in EC2 - our base images all have empty index
    (The EC2 load balancer is configured to check this URL and add the server to its list on a 200 response)
  • Thanks to Grant Ingersoll from Lucid Imagination for guiding us down this route (we were planning to do something much more complicated). Also thanks to Francis Rhys-Jones for actually implementing this
    This is game changing - suddenly we’re prepared to change the index -- and NoSQL solutions seem a whole lot less scary: we migrate our entire database every night!
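
One reading of the auto-scaling note above is that a new EC2 host starts with an empty index and should only receive traffic once replication has filled it. A small Python sketch of such a health check as a WSGI application; the `count_docs` callable, the response bodies, and the choice of 503 for "not ready" are assumptions (the notes only say that a 200 response adds the server to the load balancer):

```python
def make_health_app(count_docs):
    """Return a WSGI app that answers 200 only when the local index has documents.

    count_docs is a hypothetical callable returning the number of documents
    in the local Solr index.
    """
    def app(environ, start_response):
        ready = count_docs() > 0
        status = "200 OK" if ready else "503 Service Unavailable"
        body = b"ready\n" if ready else b"index empty\n"
        start_response(status, [("Content-Type", "text/plain"),
                                ("Content-Length", str(len(body)))])
        return [body]
    return app


# A host started from a base image has an empty index, so it reports not ready.
fresh_host = make_health_app(lambda: 0)
```

The app can be served with any WSGI server, for example `wsgiref.simple_server.make_server("", 8000, fresh_host)`.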











  • Effectively the batch divisions become a work queue fed to a set of actors
    (Actually, we found that 8 worked best with our hardware)
    Each actor reads the data from the database, creates a SolrInputDocument, and submits it to Solr
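
The batch-and-workers idea can be sketched in Python, with a thread pool standing in for the Scala actors used in the talk. Every 10,000th id becomes a batch boundary (as in the SQL on the slides), and each worker reads its id range, builds documents, and submits them. `load_rows` and `submit` are hypothetical callables standing in for the database read and the Solr submission:

```python
from concurrent.futures import ThreadPoolExecutor


def batch_boundaries(sorted_ids, batch_size):
    """Every batch_size-th id, like MOD(rownumber, batch_size) = 0 with 1-based rows."""
    return [sorted_ids[i] for i in range(batch_size - 1, len(sorted_ids), batch_size)]


def batch_ranges(boundaries):
    """Turn boundary ids into (low_exclusive, high_inclusive) pairs; None is open-ended."""
    edges = [None] + list(boundaries) + [None]
    return list(zip(edges[:-1], edges[1:]))


def index_batch(id_range, load_rows, submit):
    """One unit of work: read the rows in the range, build documents, submit them."""
    low, high = id_range
    docs = [{"id": row_id, "record-type": "content"} for row_id in load_rows(low, high)]
    submit(docs)
    return len(docs)


def build_index(sorted_ids, load_rows, submit, batch_size=10000, workers=8):
    """Feed the batch ranges to a pool of workers and return the number of documents."""
    ranges = batch_ranges(batch_boundaries(sorted_ids, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: index_batch(r, load_rows, submit), ranges))
```

The default of 8 workers mirrors the figure in the notes; the right number depends on the hardware and on how much time each worker spends waiting on the database.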
  • All we wanted was a search engine... but actually we got an easy to work with, fast, scalable NoSQL solution!

    1. Solr in the Wild: The Guardian’s Open Platform Content API. Graham Tackley, guardian.co.uk
    2. Guardian journalism online: 1995
    3. Guardian journalism online: 1999
    4. Guardian journalism online: 2000
    5. Guardian journalism online: 2010
    6. • Content API • MicroApp Framework • Politics API • Data Store http://www.guardian.co.uk/open-platform
    7. • Content API • MicroApp Framework • Politics API • Data Store http://www.guardian.co.uk/open-platform
    8. • Content API • MicroApp Framework • Politics API • Data Store http://www.guardian.co.uk/open-platform (overlaid: http://content.guardianapis.com)
    9. http://content.guardianapis.com
    10. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
    11. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
    12. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
    13. Implementation • Traffic patterns much less predictable than a web site • Need to easily scale on demand... • ... and never take down guardian.co.uk due to API traffic
    14. Architecture diagram (Core): Web servers, App server, Memcached (20Gb), rdbms, CMS
    15. Architecture diagram (Core): Web servers, App server, Memcached (20Gb), rdbms, Content API, CMS
    16. Architecture diagram (Core): Web servers, App server, Memcached (20Gb), rdbms, Content API, CMS
    17. Why Solr? • Database could not cope... • ... and far too expensive to scale • Solr ... • ... was easy for developers to understand • ... has a great replication model • ... is simple to install
    18. Architecture diagram (Core): Web servers, App server, Memcached (20Gb), CMS
    19. Architecture diagram (Core): Web servers, App server, Memcached (20Gb), Solr Master, Indexer, CMS
    20. Architecture diagram. Core: Web servers, App server, Memcached (20Gb), Solr Master, Indexer, CMS. Solr Replication. Cloud, EC2: Api, five "Solr & Api" hosts
    21. [Logo image; slide text not recoverable]
    22. Solr Schema • 350+ tables in database schema
    23. Content fields are just fields...
    24. Tags
    25. Tags, Factbox
    26. Tags, Factbox, Media
    27. Diagram. Tags: Keyword, Contributor, Series, Publication, Tone. Content: Article, Video, Audio, Gallery, Cartoon
    28. ... tags ...
        record-type: content
        id: world/picture/2010/may/14/formula-one-monaco
        tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...]
        tag-external-names: [ Eyewitness, Formula One, Monaco, ...]
    29. ... tags ...
        record-type: content
        id: world/picture/2010/may/14/formula-one-monaco
        tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...]
        tag-external-names: [ Eyewitness, Formula One, Monaco, ...]

        record-type: tag
        id: world/series/eyewitness
        section-name: World news
        web-title: Eyewitness
        type: series
        internal-name: Eyewitness (centespread photo series)
    30. ... tags ...
        record-type: content
        id: world/picture/2010/may/14/formula-one-monaco
        tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...]
        tag-external-names: [ Eyewitness, Formula One, Monaco, ...]

        record-type: tag
        id: world/series/eyewitness
        section-name: World news
        web-title: Eyewitness
        type: series
        internal-name: Eyewitness (centespread photo series)
        Callouts on the tag record: "Included in search", "stored=false"
    31. ... factboxes ...
    32. ... factboxes ...
        record-type: content
        id: world/picture/2010/may/14/formula-one-monaco
        factbox-data: [ 197544~|~~|~photography-tip~|~ ]
        fact-data: [ 197544~|~pro-tip~|~The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
        fact-value: [ The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
    33. ... media ...
    34. ... media
        record-type: content
        id: world/picture/2010/may/14/formula-one-monaco
        media-asset-ids: [ PICTURE|362634152|IMAGE|362629791, ...]
    35. ... media
        record-type: content
        id: world/picture/2010/may/14/formula-one-monaco
        media-asset-ids: [ PICTURE|362634152|IMAGE|362629791, ...]

        record-type: media
        id: PICTURE|362634152|IMAGE|362629791
        credit: Mark Thompson/Getty Images
        width: 1024
        height: 768
        path: /sys-images/Guardian/About/General/2010/5/14/1273823813621/66-lap-Monaco-grand-prix-002.jpg
    36. The Code • Written in Scala • Uses SolrJ • Plan to open source in the next few months
    37. The Code
    38. The Code
    39. Creating the Index • Existing search index takes 20 hours to build • Solr index takes 1 hour • Here’s how...
    40. 1.1 million+ items of content in the database
    41. 1.1 million+ items of content in the database. Split into Batches:
        SELECT id FROM ( SELECT id, ROWNUM rownumber FROM content_live ORDER BY id ) WHERE MOD(rownumber, 10000) = 0
    42. 1.1 million+ items of content in the database. Split into Batches:
        SELECT id FROM ( SELECT id, ROWNUM rownumber FROM content_live ORDER BY id ) WHERE MOD(rownumber, 10000) = 0
    43. 1.1 million+ items of content in the database. Actor 1, Actor 2, Actor 3, Actor 4. Each actor: 1. reads data from database 2. builds solr input document 3. submits to solr
    44. 1.1 million+ items of content in the database. Actor 1, Actor 2, Actor 3, Actor 4. Each actor: 1. reads data from database 2. builds solr input document 3. submits to solr
    45. Summary • Solr made free access to our content API possible • Replication rocks for scaling • Solr just works for us (thank you!) • NoSQL really isn’t that scary
    46. • http://guardian.co.uk/open-platform • http://content.guardianapis.com graham.tackley@guardian.co.uk · @tackers
