The Guardian Open Platform Content API: Implementation

The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API, a rich interface to all of The Guardian's content and metadata back to 1991 (over 1 million documents). This talk starts with a brief overview of the latest iteration of the Content API. It then covers how we implemented it in Scala using Solr, addressing real-world problems in creating an index of content:

• how we represented a complex relational database model in Solr
• how we keep the index up to date, meeting a sub-5-minute end-to-end update requirement
• how we update the schema as the API evolves, with zero downtime
• how we scale in response to unpredictable demand, using cloud services

  1. Solr in the Wild: The Guardian’s Open Platform Content API. Graham Tackley, guardian.co.uk
  2. Guardian journalism online: 1995
  3. Guardian journalism online: 1999
  4. Guardian journalism online: 2000
  5. Guardian journalism online: 2010
  6. • Content API
     • MicroApp Framework
     • Politics API
     • Data Store
     http://www.guardian.co.uk/open-platform
  7. • Content API
     • MicroApp Framework
     • Politics API
     • Data Store
     http://www.guardian.co.uk/open-platform
  8. • Content API (http://content.guardianapis.com)
     • MicroApp Framework
     • Politics API
     • Data Store
     http://www.guardian.co.uk/open-platform
  9. http://content.guardianapis.com
  10. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
  11. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
  12. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
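
The query URLs on slides 10 to 12 can be assembled programmatically. The following is a minimal sketch in Python (the talk's own code is Scala; Python is used here only for brevity). It assumes nothing beyond the endpoint and parameter names visible on the slides, the helper name `search_url` is hypothetical, and it only builds the URL without sending a request.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://content.guardianapis.com/search.json"  # endpoint shown on the slides


def search_url(query, **options):
    """Build a search URL; keyword names use '_' where the API uses '-'."""
    params = {"q": query}
    for name, value in options.items():
        params[name.replace("_", "-")] = value
    # quote_via=quote encodes spaces as %20, matching the slide URLs
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = search_url("prague beer", order_by="relevance", show_fields="all", show_tags="all")
print(url)
```

Passing `api_key="..."` or `show_refinements="all"` as extra keyword arguments would produce the variants on slides 11 and 12.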
  13. Implementation
      • Traffic patterns much less predictable than a web site
      • Need to easily scale on demand...
      • ... and never take down guardian.co.uk due to API traffic
  14. Core: Web servers, App server, Memcached (20Gb), rdbms, CMS
  15. Core: Web servers, App server, Memcached (20Gb), rdbms, CMS. Content API alongside the rdbms
  16. Core: Web servers, App server, Memcached (20Gb), rdbms, CMS. Content API alongside the rdbms
  17. Why Solr?
      • Database could not cope...
      • ... and far too expensive to scale
      • Solr ...
      • ... was easy for developers to understand
      • ... has a great replication model
      • ... is simple to install
  18. Core: Web servers, App server, Memcached (20Gb), CMS
  19. Core: Web servers, App server, Memcached (20Gb), CMS, Indexer, Solr Master
  20. Core: Web servers, App server, Memcached (20Gb), CMS, Indexer, Solr Master
      Api (Cloud, EC2): Solr & Api (×5)
      Replication: Solr Master to the Solr & Api nodes
  21. [image slide; text not recoverable]
  22. Solr Schema
      • 350+ tables in database schema
  23. Content fields are just fields...
  24. Tags
  25. Tags, Factbox
  26. Tags, Factbox, Media
  27. Tags: Keyword, Contributor, Series, Publication, Tone
      Content: Article, Video, Audio, Gallery, Cartoon
  28. ... tags ...
      record-type: content
      id: world/picture/2010/may/14/formula-one-monaco
      tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...]
      tag-external-names: [ Eyewitness, Formula One, Monaco, ...]
  29. ... tags ...
      record-type: content
      id: world/picture/2010/may/14/formula-one-monaco
      tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...]
      tag-external-names: [ Eyewitness, Formula One, Monaco, ...]
      record-type: tag
      id: world/series/eyewitness
      section-name: World news
      web-title: Eyewitness
      type: series
      internal-name: Eyewitness (centespread photo series)
  30. ... tags ...
      record-type: content
      id: world/picture/2010/may/14/formula-one-monaco
      tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...]
      tag-external-names: [ Eyewitness, Formula One, Monaco, ...]
      record-type: tag
      id: world/series/eyewitness
      section-name: World news
      web-title: Eyewitness
      type: series
      internal-name: Eyewitness (centespread photo series)
      Slide callouts: "Included in search", "stored=false"
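
The tag slides show a relational join flattened into two kinds of Solr record: a content record that carries its tag ids and names, and a separate tag record per tag. The sketch below illustrates that shape in Python with plain dictionaries. It is not the talk's Scala code; the function names and the in-memory `index` are hypothetical, and treating a tag's external name as its `web-title` is an assumption.

```python
# Tag rows keyed by id; values are the ones visible on the slides.
TAGS = {
    "world/series/eyewitness": {
        "section-name": "World news",
        "web-title": "Eyewitness",
        "type": "series",
    },
    "sport/formulaone": {"web-title": "Formula One"},
    "world/monaco": {"web-title": "Monaco"},
}


def content_document(content_id, tag_ids, tags):
    """Flatten a content item and its tag links into one document."""
    return {
        "record-type": "content",
        "id": content_id,
        "tag-ids": list(tag_ids),
        # Assumption: the external name is the tag's web-title.
        "tag-external-names": [tags[tag_id]["web-title"] for tag_id in tag_ids],
    }


def tag_document(tag_id, tag):
    """One standalone record per tag, stored in the same index."""
    return {"record-type": "tag", "id": tag_id, **tag}


def tags_for(doc, index):
    """Second lookup: resolve a content document's tag-ids to tag records."""
    return [index[("tag", tag_id)] for tag_id in doc["tag-ids"]]


content = content_document(
    "world/picture/2010/may/14/formula-one-monaco", list(TAGS), TAGS
)
# Hypothetical stand-in for the Solr index, keyed by (record-type, id).
index = {("content", content["id"]): content}
for tag_id, tag in TAGS.items():
    index[("tag", tag_id)] = tag_document(tag_id, tag)

print(content["tag-external-names"])
print([tag["id"] for tag in tags_for(content, index)])
```

Copying the tag names onto the content record lets a search on a tag name match the content directly, while the full tag details come from a second lookup by id.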
  31. ... factboxes ...
  32. ... factboxes ...
      record-type: content
      id: world/picture/2010/may/14/formula-one-monaco
      factbox-data: [ 197544~|~~|~photography-tip~|~ ]
      fact-data: [ 197544~|~pro-tip~|~The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
      fact-value: [ The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
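
The factbox values on slide 32 pack several parts into one string field, separated by `~|~`. A minimal Python sketch of packing and unpacking such values follows. The helper names are hypothetical and the meaning of each position is not stated on the slide, so the parts are left unnamed.

```python
DELIMITER = "~|~"  # separator visible in the slide's field values


def pack(*parts):
    """Join parts into one multi-part field value."""
    return DELIMITER.join(parts)


def unpack(value):
    """Split a multi-part field value back into its parts."""
    return value.split(DELIMITER)


factbox_parts = unpack("197544~|~~|~photography-tip~|~")
fact_parts = unpack("197544~|~pro-tip~|~The photographer has framed the cars")
print(factbox_parts)
print(fact_parts)
```

This round-trips only while no part contains the delimiter itself, which is presumably why an unusual character sequence was chosen.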
  33. ... media ...
  34. ... media
      record-type: content
      id: world/picture/2010/may/14/formula-one-monaco
      media-asset-ids: [ PICTURE|362634152|IMAGE|362629791, ...]
  35. ... media
      record-type: content
      id: world/picture/2010/may/14/formula-one-monaco
      media-asset-ids: [ PICTURE|362634152|IMAGE|362629791, ...]
      record-type: media
      id: PICTURE|362634152|IMAGE|362629791
      credit: Mark Thompson/Getty Images
      width: 1024
      height: 768
      path: /sys-images/Guardian/About/General/2010/5/14/1273823813621/66-lap-Monaco-grand-prix-002.jpg
  36. The Code
      • Written in Scala
      • Uses SolrJ
      • Plan to open source in the next few months
  37. The Code
  38. The Code
  39. Creating the Index
      • Existing search index takes 20 hours to build
      • Solr index takes 1 hour
      • Here’s how...
  40. 1.1 million+ items of content in the database
  41. 1.1 million+ items of content in the database
      Split into Batches
        SELECT id FROM (
          SELECT id, ROWNUM rownumber FROM content_live ORDER BY id
        ) WHERE MOD(rownumber, 10000) = 0
  42. 1.1 million+ items of content in the database
      Split into Batches
        SELECT id FROM (
          SELECT id, ROWNUM rownumber FROM content_live ORDER BY id
        ) WHERE MOD(rownumber, 10000) = 0
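
The batching query returns every 10,000th id in id order, which gives a list of boundary ids. The Python sketch below mirrors that selection on an in-memory list and then turns the boundaries into ranges. The function names are hypothetical, and the step of converting boundaries into `(lower, upper]` ranges is an assumption about how the boundaries are used, since the slides do not show it.

```python
def batch_boundaries(sorted_ids, batch_size):
    """Ids whose 1-based row number is a multiple of batch_size (as in the SQL)."""
    return [
        id_
        for row, id_ in enumerate(sorted_ids, start=1)
        if row % batch_size == 0
    ]


def batch_ranges(boundaries):
    """Turn boundary ids into (lower, upper] ranges; None marks an open end."""
    edges = [None, *boundaries, None]
    return list(zip(edges, edges[1:]))


# Ten made-up ids and a batch size of 4 keep the example small.
ids = [f"item-{n:02d}" for n in range(1, 11)]
boundaries = batch_boundaries(ids, 4)
ranges = batch_ranges(boundaries)
print(boundaries)
print(ranges)
```

Each range can then be read independently, for example with a query of the form `id > lower AND id <= upper`, so batches can be processed in parallel.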
  43. 1.1 million+ items of content in the database
      Actor 1, Actor 2, Actor 3, Actor 4
      Each actor:
        1. reads data from database
        2. builds solr input document
        3. submits to solr
  44. 1.1 million+ items of content in the database
      Actor 1, Actor 2, Actor 3, Actor 4
      Each actor:
        1. reads data from database
        2. builds solr input document
        3. submits to solr
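
The three steps each actor performs can be sketched as a small parallel pipeline. This Python version swaps in a thread pool for the Scala actors used in the talk, and everything it touches is a stand-in: `read_batch` fakes the database read, `FakeSolr` is an in-memory placeholder rather than a real Solr client, and all names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeSolr:
    """In-memory stand-in for a Solr client; not a real Solr API."""

    def __init__(self):
        self.docs = []
        self._lock = threading.Lock()

    def add(self, docs):
        with self._lock:
            self.docs.extend(docs)


def read_batch(batch):
    """Step 1 stand-in: return one row per id in [start, stop)."""
    start, stop = batch
    return [{"id": n} for n in range(start, stop)]


def build_document(row):
    """Step 2: turn a database row into an input document."""
    return {"record-type": "content", "id": f"content/{row['id']}"}


def index_batch(batch, solr):
    """Run all three steps for one batch and report how many docs were sent."""
    docs = [build_document(row) for row in read_batch(batch)]
    solr.add(docs)  # Step 3: submit to the (fake) Solr
    return len(docs)


def index_all(batches, solr, workers=4):
    """Process batches concurrently with a fixed number of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda batch: index_batch(batch, solr), batches))


solr = FakeSolr()
total = index_all([(0, 3), (3, 6), (6, 10)], solr)
print(total, "documents indexed")
```

Because each batch is independent, the number of workers can be tuned to what the database and Solr can absorb; four matches the number of actors drawn on the slide.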
  45. Summary
      • Solr made free access to our content API possible
      • Replication rocks for scaling
      • Solr just works for us (thank you!)
      • NoSQL really isn’t that scary
  46. • http://guardian.co.uk/open-platform
      • http://content.guardianapis.com
      graham.tackley@guardian.co.uk · @tackers
