The Guardian Open Platform Content API: Implementation
 


The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API - a rich interface to all The Guardian's content and metadata back to 1991 - over 1 million documents. This talk starts with a brief overview of the latest iteration of the content API. It will then cover how we implemented this in Scala using Solr, addressing real-world problems in creating an index of content:

how we represented a complex relational database model in Solr

how we keep the index up to date, meeting a sub-5 minute end-to-end update requirement

how we update the schema as the API evolves, with zero downtime

how we scale in response to unpredictable demand, using cloud services

  • As Stephen said: very basic links to interesting content.
  • Note the registration paywall.
  • Broadcast, stories, basic community. Rebuild started in 2005.
  • “Web 2.0”, community, (full fat) RSS, discoverability, tagging. Where do we go from here? Other newspaper sites are looking to restrict access to content via paywalls etc.; we’re looking to open up.
  • We’ve spent the last 12 months experimenting around open distribution and open partnerships. Four initiatives make up the Open Platform (right now), as Stephen said.
  • This talk focuses on the Content API, which provides a way for others to re-present our content in their applications.
  • http://content.guardianapis.com
  • http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance
    Most users want the most recent content, so the default ordering is newest. This is just a dismax search.
  • You can also retrieve extra metadata, including tags:
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
  • If you have an API key you can get full content. (You need to apply for this and agree to some T&Cs, mostly to ensure that we can take down content for legal reasons.) This example key is only valid for this conference and will be disabled afterwards :)
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
  • Refinements let you narrow down your result set (of course, these are just Solr facets):
    http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
  • Our current architecture: perhaps we could feed the Content API off the database?
  • Time to developer understanding: about 2 hours.
  • Currently we rebuild every night, with incrementals during the day. [next] Expose the Solr master to EC2 and create hosts in EC2 that replicate using Solr replication; this works fantastically. 6GB index size. Load-balancer config. We use solr.war from the 1.4 dist totally unchanged, and run the API webapp in the same Jetty container.
  • Lots of talk nowadays on “no sql” solutions.
  • No. Designed a new logo that better reflects where we currently are.
  • Disclaimer: the next slides describe how *we* did it; not necessarily best practice! We took the opportunity to simplify our domain model...
  • Content fields are just fields. But we also need to map tags, media, and factboxes.
  • Here’s how we model tags & content.
  • Factboxes associate arbitrary information with content. We need to search them, but they have a 1-to-1 relationship with content, so there is no separate record.
  • show-media allows access to the non-text assets of an item of content.
  • The code mostly just takes input params, converts them to a Solr query, and transforms the result to JSON or XML. I’m not here to talk about Scala, but here’s a quick couple of snippets.
  • RichSolrDocument makes SolrDocument more “Scala”-ish.
  • Scala can make writing understandable code much easier.
  • Supporting auto scaling in EC2: our base images all have an empty index. (The EC2 load balancer is configured to check this URL and add the server to its list on a 200 response.)
  • Thanks to Grant Ingersoll from Lucid Imagination for guiding us down this route (we were planning to do something much more complicated). Also thanks to Francis Rhys-Jones for actually implementing this. This is game changing: suddenly we’re prepared to change the index, and NoSQL solutions seem a whole lot less scary, because we migrate our entire database every night!
  • Effectively the batch divisions become a work queue fed to a set of actors. (Actually, we found that 8 worked best with our hardware.) Each actor reads the data from the database, creates a SolrInputDocument, and submits it.
  • All we wanted was a search engine... but actually we got an easy to work with, fast, scalable NoSQL solution!

Presentation Transcript

  • Solr in the Wild: The Guardian’s Open Platform Content API. Graham Tackley, guardian.co.uk
  • Guardian journalism online: 1995
  • Guardian journalism online: 1999
  • Guardian journalism online: 2000
  • Guardian journalism online: 2010
  • • Content API • MicroApp Framework • Politics API • Data Store http://www.guardian.co.uk/open-platform
  • http://content.guardianapis.com
  • http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
  • http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
  • http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
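The example URLs above differ only in their query parameters, so a client can assemble them from name/value pairs. The following is a minimal sketch of that idea, not the API's own client code: the `ContentApiUrl` object and its `search` helper are hypothetical names, while the parameter names (`q`, `order-by`, `show-fields`, `show-tags`, `show-refinements`, `api-key`) are the ones shown on the slides.

```scala
import java.net.URLEncoder

// Hypothetical helper for illustration; only the base URL and the
// parameter names come from the slides.
object ContentApiUrl {
  val Base = "http://content.guardianapis.com"

  // URLEncoder targets form encoding and emits '+' for a space;
  // the slides use %20, so convert after encoding.
  private def enc(s: String): String =
    URLEncoder.encode(s, "UTF-8").replace("+", "%20")

  /** Builds a search.json URL from (name, value) pairs, keeping their order. */
  def search(params: (String, String)*): String =
    params
      .map { case (k, v) => s"${enc(k)}=${enc(v)}" }
      .mkString(s"$Base/search.json?", "&", "")
}
```

For instance, `ContentApiUrl.search("q" -> "prague beer", "order-by" -> "relevance", "show-refinements" -> "all")` reproduces the refinements URL on the last slide above.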
  • Implementation
    • Traffic patterns much less predictable than a web site
    • Need to easily scale on demand...
    • ... and never take down guardian.co.uk due to API traffic
  • Core: web servers, app server, Memcached (20Gb), RDBMS, CMS
  • Core: web servers, app server, Memcached (20Gb), RDBMS, CMS, with the Content API added
  • Why Solr?
    • Database could not cope...
    • ... and far too expensive to scale
    • Solr ...
      • ... was easy for developers to understand
      • ... has a great replication model
      • ... is simple to install
  • Core: web servers, app server, Memcached (20Gb), CMS
  • Core: web servers, app server, Memcached (20Gb), CMS, plus Indexer and Solr Master
  • Core: web servers, app server, Memcached (20Gb), CMS, Indexer, Solr Master. Cloud (EC2): Api tier of five “Solr & Api” hosts, populated from the Solr Master by Solr replication
  • n otl y
  • Solr Schema • 350+ tables in database schema
  • Content fields are just fields...
  • Tags, Factbox, Media
  • Tags: Keyword, Contributor, Series, Publication, Tone. Content: Article, Video, Audio, Gallery, Cartoon
  • ... tags ...
    record-type: content
    id: world/picture/2010/may/14/formula-one-monaco
    tag-ids: [ world/series/eyewitness, sport/formulaone, world/monaco ...]
    tag-external-names: [ Eyewitness, Formula One, Monaco, ...]

    record-type: tag
    id: world/series/eyewitness
    section-name: World news
    web-title: Eyewitness
    type: series
    internal-name: Eyewitness (centespread photo series)
    (slide annotation: Included in search, stored=false)
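The two record shapes above can be read as a simple denormalisation: tags get their own records, and each content record carries both the tag ids and a copy of the tag names so that a text search can match them without a join. A minimal sketch of that mapping follows. It uses a plain `Map` as a stand-in for a Solr input document, the `Tag` and `Content` case classes are hypothetical, and it assumes that `tag-external-names` is filled from each tag's `web-title`, which the slide suggests but does not state.

```scala
object TagIndexing {
  // Hypothetical domain classes; the field names mirror the slide.
  final case class Tag(id: String, sectionName: String, webTitle: String,
                       tagType: String, internalName: String)
  final case class Content(id: String, tagIds: Seq[String])

  // Stand-in for a Solr input document: field name -> values.
  type Record = Map[String, Seq[String]]

  def tagRecord(t: Tag): Record = Map(
    "record-type"   -> Seq("tag"),
    "id"            -> Seq(t.id),
    "section-name"  -> Seq(t.sectionName),
    "web-title"     -> Seq(t.webTitle),
    "type"          -> Seq(t.tagType),
    "internal-name" -> Seq(t.internalName)
  )

  def contentRecord(c: Content, tagsById: Map[String, Tag]): Record = Map(
    "record-type" -> Seq("content"),
    "id"          -> Seq(c.id),
    "tag-ids"     -> c.tagIds,
    // Assumption: external name == web-title. Unknown tag ids are skipped.
    "tag-external-names" -> c.tagIds.flatMap(id => tagsById.get(id)).map(_.webTitle)
  )
}
```

Both kinds of record live in the same index and are told apart by the `record-type` field.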
  • ... factboxes ...
    record-type: content
    id: world/picture/2010/may/14/formula-one-monaco
    factbox-data: [ 197544~|~~|~photography-tip~|~ ]
    fact-data: [ 197544~|~pro-tip~|~The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
    fact-value: [ The photographer has framed the cars between the boats and spectators and played with the scales of the components of the scene ]
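The `fact-data` values above pack several fields into one stored string using `~|~` as a separator. A minimal sketch of encoding and decoding such a value is shown below. The `Fact` case class and its field names are assumptions: the slide shows three `~|~`-separated parts but does not name them, and the first part is treated here as the factbox id because the same number appears in `factbox-data`.

```scala
import java.util.regex.Pattern

object FactCodec {
  val Sep = "~|~"

  // Hypothetical field names; the slide does not label the three parts.
  final case class Fact(factboxId: String, name: String, value: String)

  def encode(f: Fact): String =
    Seq(f.factboxId, f.name, f.value).mkString(Sep)

  def decode(s: String): Option[Fact] =
    // '|' is a regex metacharacter, so quote the separator. A limit of 3
    // keeps any later separators inside the free-text value.
    s.split(Pattern.quote(Sep), 3) match {
      case Array(id, name, value) => Some(Fact(id, name, value))
      case _                      => None
    }
}
```

Storing the packed string alongside a plain `fact-value` field, as the slide does, lets the text be searched directly while the packed form is kept for rebuilding the response.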
  • ... media ...
    record-type: content
    id: world/picture/2010/may/14/formula-one-monaco
    media-asset-ids: [ PICTURE|362634152|IMAGE|362629791, ...]

    record-type: media
    id: PICTURE|362634152|IMAGE|362629791
    credit: Mark Thompson/Getty Images
    width: 1024
    height: 768
    path: /sys-images/Guardian/About/General/2010/5/14/1273823813621/66-lap-Monaco-grand-prix-002.jpg
  • The Code
    • Written in Scala
    • Uses SolrJ
    • Plan to open source in the next few months
  • The Code
  • Creating the Index
    • Existing search index takes 20 hours to build
    • Solr index takes 1 hour
    • Here’s how...
  • 1.1 million+ items of content in the database
    Split into Batches
    SELECT id FROM (
      SELECT id, ROWNUM rownumber
      FROM content_live
      ORDER BY id
    )
    WHERE MOD(rownumber, 10000) = 0
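The query above is meant to return every 10,000th id in id order, which gives a list of batch boundaries. The sketch below reproduces that selection in memory and then turns the boundaries into id ranges. The `Batches` object is hypothetical, and the step from boundaries to ranges is an assumption, since the slides do not show how the boundaries were consumed.

```scala
object Batches {
  /** Every `size`-th id in sorted order; ROWNUM is 1-based, hence `i + 1`. */
  def boundaries(ids: Seq[Long], size: Int): Seq[Long] =
    ids.sorted.zipWithIndex.collect { case (id, i) if (i + 1) % size == 0 => id }

  /**
   * Turns boundaries into (after, upTo] ranges. None means unbounded on
   * that side, so the first and last ranges catch everything outside the
   * boundaries. (Assumed scheme, not taken from the slides.)
   */
  def ranges(bounds: Seq[Long]): Seq[(Option[Long], Option[Long])] = {
    val edges: Seq[Option[Long]] = (None +: bounds.map(b => Option(b))) :+ None
    edges.zip(edges.tail)
  }
}
```

With 1.1 million ids and a batch size of 10,000 this yields a little over a hundred ranges, each of which can be indexed independently.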
  • 1.1 million+ items of content in the database
    Actor 1, Actor 2, Actor 3, Actor 4
    Each actor:
    1. reads data from database
    2. builds solr input document
    3. submits to solr
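The three steps each actor performs amount to a pool of workers draining a shared queue of batches. The sketch below shows that pattern with a fixed thread pool and `Future`s rather than the Scala actors used in the talk, so it is a substitute technique and not the original code. `IndexWorkers`, `read`, and `submit` are hypothetical names; `read` stands for "load from the database and build documents" and `submit` for "send to Solr".

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object IndexWorkers {
  /** Drains `batches` with `workers` concurrent workers; returns the number of documents submitted. */
  def run[B, D](batches: Seq[B], workers: Int)(read: B => Seq[D])(submit: Seq[D] => Unit): Int = {
    val queue = new ConcurrentLinkedQueue[B]()
    batches.foreach(b => queue.add(b))
    val pool = Executors.newFixedThreadPool(workers)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    // Each worker keeps taking batches until the queue is empty.
    @annotation.tailrec
    def drain(done: Int): Int = Option(queue.poll()) match {
      case None => done
      case Some(batch) =>
        val docs = read(batch) // steps 1 and 2: read rows, build documents
        submit(docs)           // step 3: submit to the index
        drain(done + docs.size)
    }

    try Await.result(Future.sequence(Seq.fill(workers)(Future(drain(0)))), 1.hour).sum
    finally pool.shutdown()
  }
}
```

Because workers pull from the queue instead of being handed a fixed share, a slow batch does not hold up the others, and the worker count can be tuned to the hardware.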
  • Summary
    • Solr made free access to our content API possible
    • Replication rocks for scaling
    • Solr just works for us (thank you!)
    • NoSQL really isn’t that scary
  • http://guardian.co.uk/open-platform
    http://content.guardianapis.com
    graham.tackley@guardian.co.uk · @tackers