The Guardian Open Platform Content API: Implementation

The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API - a rich interface to all The Guardian's content and metadata back to 1991 - over 1 million documents. This talk starts with a brief overview of the latest iteration of the content API. It will then cover how we implemented this in Scala using Solr, addressing real-world problems in creating an index of content:

how we represented a complex relational database model in Solr

how we keep the index up to date, meeting a sub-5 minute end-to-end update requirement

how we update the schema as the API evolves, with zero downtime

how we scale in response to unpredictable demand, using cloud services

Published in: Technology

  • As Stephen said:
    Very basic links to interesting content
  • Note the registration paywall
  • Broadcast, stories, basic community
    Rebuild started in 2005
  • “Web 2.0”, community, (full fat) RSS, discoverability, tagging.
    Where do we go from here? Other newspaper sites - looking to restrict access to content via paywalls etc - we’re looking to open up

  • We’ve spent the last 12 months experimenting around open distribution and open partnerships - 4 initiatives make up the open platform (right now)
    (As Stephen said)
  • This talk focuses on the content API - provides a way for others to re-present our content in their applications
  • http://content.guardianapis.com
  • http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance
    (most users want most recent content, so default ordering is newest)
    This is just a dismax search
  • Can also retrieve extra metadata, including tags
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all

  • If you have an API key, you can get full content. (You need to apply for this and agree to some T&Cs - mostly to ensure that we can take down content for legal reasons.) This example key is only valid for this conference and will be disabled afterwards :)
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010

  • Refinements give the ability to narrow down your result set (of course these are just Solr facets)
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
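
The example requests above all share one shape: a base URL plus a handful of query parameters. A minimal Python sketch of assembling them is shown here. The parameter names (`q`, `order-by`, `show-fields`, `show-tags`, `show-refinements`, `api-key`) come from the URLs in these notes; the helper name `search_url` is purely illustrative, and the endpoint is written as it appeared at the time of the talk, so it may well have changed since.

```python
from urllib.parse import urlencode, quote

# Endpoint as shown on the slides (2010); treat it as historical.
BASE_URL = "http://content.guardianapis.com/search.json"

def search_url(query, order_by=None, api_key=None, **show):
    """Build a search URL. `show` maps e.g. fields="all" to show-fields=all."""
    params = [("q", query)]
    if order_by:
        params.append(("order-by", order_by))
    # Sort the show-* options so the generated URL is deterministic.
    for name, value in sorted(show.items()):
        params.append(("show-" + name, value))
    if api_key:
        params.append(("api-key", api_key))
    # quote_via=quote encodes spaces as %20 rather than '+', matching the slides.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

print(search_url("prague beer", order_by="relevance", refinements="all"))
```

Leaving out `order_by` relies on the default described above (newest first).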

  • Our current architecture - perhaps we could feed the content api off the database?

  • time to developer understanding: about 2 hours

  • currently rebuild every night, incrementals during the day
    [next] expose solr master to EC2, create hosts in EC2 that replicate using solr replication - works fantastically. 6GB index size. Load-balancer config.
    We use solr.war from 1.4 dist totally unchanged - run api webapp in same jetty container

  • Lots of talk nowadays on “no sql” solutions
  • No.
    Designed a new logo that better reflects where we currently are

  • disclaimer: the next slides describe how *we* did it; not necessarily best practice!
    We took the opportunity to simplify our domain model....
  • Content fields are just fields
    But also need to map tags, media, and factboxes




  • Here’s how we model tags & content
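
The tag slides show two kinds of record living in the same index, told apart by a `record-type` field: content records carry their tags as parallel multi-valued fields, and each tag also exists as a record of its own. A rough Python sketch of that flattening, using plain dicts rather than a Solr client, might look like this. The field names are copied from the slides; the function names `content_record` and `tag_record` are hypothetical.

```python
def content_record(content_id, tags):
    """Flatten a content item and its tags into one index record.

    `tags` is a list of (tag_id, external_name) pairs; the many-to-many
    relationship becomes two parallel multi-valued fields.
    """
    return {
        "record-type": "content",
        "id": content_id,
        "tag-ids": [tag_id for tag_id, _ in tags],
        "tag-external-names": [name for _, name in tags],
    }

def tag_record(tag_id, section_name, web_title, tag_type, internal_name):
    """A tag is stored as its own record alongside the content records."""
    return {
        "record-type": "tag",
        "id": tag_id,
        "section-name": section_name,
        "web-title": web_title,
        "type": tag_type,
        "internal-name": internal_name,
    }

doc = content_record(
    "world/picture/2010/may/14/formula-one-monaco",
    [("world/series/eyewitness", "Eyewitness"),
     ("sport/formulaone", "Formula One"),
     ("world/monaco", "Monaco")],
)
print(doc["tag-ids"])
```

Queries for one kind of record would then filter on `record-type`.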






  • Fact boxes associate arbitrary information with content
    We need to search them, but 1-to-1 relationship with content
    So no separate record
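
Because a factbox belongs to exactly one item of content, it can be folded into the content record itself. The factbox slide shows values packed with a `~|~` delimiter, for example `197544~|~pro-tip~|~The photographer has framed...`. The sketch below assumes that layout is (factbox id, fact name, fact value), and that `fact-data` holds the packed form while `fact-value` holds the plain text for searching; both readings are inferences from the slide, and all function names are hypothetical.

```python
SEP = "~|~"  # delimiter as it appears on the factbox slide

def encode_fact(factbox_id, name, value):
    """Pack one fact into a single string for a multi-valued field."""
    return SEP.join([str(factbox_id), name, value])

def decode_fact(packed):
    """Reverse encode_fact; the id and name must not contain SEP."""
    factbox_id, name, value = packed.split(SEP, 2)
    return int(factbox_id), name, value

def factbox_fields(facts):
    """Return the extra fields to add to the owning content record."""
    return {
        "fact-data": [encode_fact(*fact) for fact in facts],  # assumed: packed form
        "fact-value": [value for _, _, value in facts],       # assumed: searchable text
    }

packed = encode_fact(197544, "pro-tip", "Framed between the boats")
print(packed)
print(decode_fact(packed))
```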
  • show-media allows access to the non-text assets of an item of content





  • Code mostly just takes input params, converts to solr query, and transforms result to json or xml
    I’m not here to talk about scala, but here’s a quick couple of snippets
  • RichSolrDocument makes SolrDocument more “Scala”-ish
  • Scala can make writing understandable code much easier
  • Supporting auto scaling in EC2 - our base images all have empty index
    (EC2 load balance is configured to check this url & add server to list on 200 response)
  • Thanks to Grant Ingersoll from Lucid Imagination for guiding us down this route (we were planning to do something much more complicated). Also thanks to Francis Rhys-Jones for actually implementing this
    This is game changing - suddenly we’re prepared to change the index -- and NoSQL solutions seem a whole lot less scary: we migrate our entire database every night!
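
The auto-scaling note above says that new EC2 instances boot with an empty index and that the load balancer only adds a server once a check URL returns 200. One way to sketch such a check in Python is shown here. It assumes the check should fail until replication has delivered some documents; `count_documents` is a hypothetical stand-in for asking the local Solr, and the 503 status and the port number are assumptions rather than details from the talk.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def count_documents():
    """Hypothetical stand-in: ask the local Solr how many documents it holds."""
    return 0  # a freshly booted instance starts with an empty index

def health_status(doc_count):
    """200 once the index has content; 503 keeps the node out of rotation."""
    return 200 if doc_count > 0 else 503

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = health_status(count_documents())
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok\n" if status == 200 else b"index not ready\n")

def serve(port=8080):
    """Run the check endpoint; the port is arbitrary."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the load balancer's health check would then be pointed at it.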











  • Effectively the batch divisions become a work queue fed to a set of actors
    (Actually, we found that 8 worked best with our hardware)
    Each actor reads the data from the database; creates a solrinputdocument; submits
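
The batching idea can be sketched in a few lines of Python. This is not the Scala actor code from the talk: it swaps in a standard-library thread pool with 8 workers, and `index_batch` is a hypothetical placeholder for the real work (read from the database, build a Solr input document, submit). The boundary selection mirrors the `MOD(rownumber, 10000) = 0` query on the slides.

```python
from concurrent.futures import ThreadPoolExecutor

def batch_boundaries(sorted_ids, batch_size=10000):
    """Every batch_size-th id, like MOD(rownumber, 10000) = 0 in the SQL."""
    return [sorted_ids[i] for i in range(batch_size - 1, len(sorted_ids), batch_size)]

def batch_ranges(boundaries):
    """Turn boundary ids into (low_exclusive, high_inclusive) ranges.

    None means unbounded, so the first and last ranges cover everything.
    """
    edges = [None] + list(boundaries) + [None]
    return list(zip(edges[:-1], edges[1:]))

def index_batch(id_range):
    """Hypothetical worker: read rows in range, build documents, submit to Solr."""
    low, high = id_range
    return f"indexed ({low}, {high}]"

def rebuild(sorted_ids, workers=8, batch_size=10000):
    ranges = batch_ranges(batch_boundaries(sorted_ids, batch_size))
    # The pool's internal queue plays the role of the actors' work queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index_batch, ranges))

print(rebuild(list(range(1, 26)), workers=2, batch_size=10))
```

Each range is independent of the others, which is what lets the batches run in parallel.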
  • All we wanted was a search engine... but actually we got an easy to work with, fast, scalable NoSQL solution!
  • Transcript

    • 1. Solr in the Wild: The Guardian’s Open Platform Content API Graham Tackley guardian.co.uk
    • 2. Guardian journalism online: 1995
    • 3. Guardian journalism online: 1999
    • 4. Guardian journalism online: 2000
    • 5. Guardian journalism online: 2010
    • 6. • Content API • MicroApp Framework • Politics API • Data Store http://www.guardian.co.uk/open-platform
    • 7. • Content API • MicroApp Framework • Politics API • Data Store http://www.guardian.co.uk/open-platform
    • 8. • Content API (http://content.guardianapis.com) • MicroApp Framework • Politics API • Data Store http://www.guardian.co.uk/open-platform
    • 9. http://content.guardianapis.com
    • 10. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
    • 11. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
    • 12. http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
    • 13. Implementation • Traffic patterns much less predictable than a web site • Need to easily scale on demand... • ... and never take down guardian.co.uk due to API traffic
    • 14. Core Web servers App server Memcached (20Gb) rdbms CMS
    • 15. Core Web servers App server Memcached (20Gb) rdbms Content API CMS
    • 16. Core Web servers App server Memcached (20Gb) rdbms Content API CMS
    • 17. Why Solr? • Database could not cope... • ... and far too expensive to scale • Solr ... • ... was easy for developers to understand • ... has a great replication model • ... is simple to install
    • 18. Core Web servers App server Memcached (20Gb) CMS
    • 19. Core Web servers App server Memcached (20Gb) Solr Master Indexer CMS
    • 20. Core Api Web servers Solr & Api App server Solr & Api Memcached (20Gb) Replication Solr & Api Solr Solr Master Solr & Api Indexer Solr & Api CMS Cloud, EC2
    • 21. n otl y
    • 22. Solr Schema • 350+ tables in database schema
    • 23. Content fields are just fields...
    • 24. Tags
    • 25. Tags Factbox
    • 26. Tags Factbox Media
    • 27. Tags: Keyword, Contributor, Series, Publication, Tone. Content: Article, Video, Audio, Gallery, Cartoon
    • 28. ... tags ... record-type: content id: world/picture/2010/may/14/formula-one-monaco tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...] tag-external-names: [ Eyewitness, Formula One, Monaco, ...]
    • 29. ... tags ... record-type: content id: world/picture/2010/may/14/formula-one-monaco tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...] tag-external-names: [ Eyewitness, Formula One, Monaco, ...] record-type: tag id: world/series/eyewitness section-name: World news web-title: Eyewitness type: series internal-name: Eyewitness (centespread photo series)
    • 30. ... tags ... record-type: content id: world/picture/2010/may/14/formula-one-monaco tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...] tag-external-names: [ Eyewitness, Formula One, Monaco, ...] record-type: tag id: world/series/eyewitness section-name: World news web-title: Eyewitness type: series internal-name: Eyewitness (centespread photo series) [Included in search, stored=false]
    • 31. ... factboxes ...
    • 32. ... factboxes ... record-type: content id: world/picture/2010/may/14/formula-one-monaco factbox-data: [ 197544~|~~|~photography-tip~|~ ] fact-data: [ 197544~|~pro-tip~|~The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ] fact-value: [ The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
    • 33. ... media ...
    • 34. ... media record-type: content id: world/picture/2010/may/14/formula-one-monaco media-asset-ids: [ PICTURE|362634152|IMAGE|362629791, ...]
    • 35. ... media record-type: content id: world/picture/2010/may/14/formula-one-monaco media-asset-ids: [ PICTURE|362634152|IMAGE|362629791, ...] record-type: media id: PICTURE|362634152|IMAGE|362629791 credit: Mark Thompson/Getty Images width: 1024 height: 768 path: /sys-images/Guardian/About/General/2010/5/14/1273823813621/66-lap- Monaco-grand-prix-002.jpg
    • 36. The Code • Written in Scala • Uses SolrJ • Plan to open source in the next few months
    • 37. The Code
    • 38. The Code
    • 39. Creating the Index • Existing search index takes 20 hours to build • Solr index takes 1 hour • Here’s how...
    • 40. 1.1 million+ items of content in the database
    • 41. 1.1 million+ items of content in the database Split into Batches SELECT id FROM ( SELECT id, ROWNUM rownumber FROM content_live ORDER BY id ) WHERE MOD(rownumber, 10000) = 0
    • 42. 1.1 million+ items of content in the database Split into Batches SELECT id FROM ( SELECT id, ROWNUM rownumber FROM content_live ORDER BY id ) WHERE MOD(rownumber, 10000) = 0
    • 43. 1.1 million+ items of content in the database Actor 1 Actor 2 Each actor: Actor 3 1. reads data from database 2. builds solr input document Actor 4 3. submits to solr
    • 44. 1.1 million+ items of content in the database Actor 1 Actor 2 Each actor: Actor 3 1. reads data from database 2. builds solr input document Actor 4 3. submits to solr
    • 45. Summary • Solr made free access to our content API possible • Replication rocks for scaling • Solr just works for us (thank you!) • NoSQL really isn’t that scary
    • 46. • http://guardian.co.uk/open-platform • http://content.guardianapis.com graham.tackley@guardian.co.uk · @tackers
