The Guardian Open Platform Content API: Implementation
by The Guardian Open Platform
The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API - a rich interface to all of The Guardian's content and metadata back to 1991 - over 1 million documents. This talk starts with a brief overview of the latest iteration of the Content API. It then covers how we implemented it in Scala using Solr, addressing real-world problems in creating an index of content:
- how we represented a complex relational database model in Solr
- how we keep the index up to date, meeting a sub-5 minute end-to-end update requirement
- how we update the schema as the API evolves, with zero downtime
- how we scale in response to unpredictable demand, using cloud services
Very basic links to interesting content
Rebuild started in 2005
Where do we go from here? Other newspaper sites are looking to restrict access to content via paywalls etc. - we’re looking to open up
(As Stephen said)
(Most users want the most recent content, so the default ordering is newest first)
This is just a Solr DisMax search
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-fields=all&show-tags=all&api-key=eurocon2010
http://content.guardianapis.com/search.json?q=prague%20beer&order-by=relevance&show-refinements=all
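The URLs above all follow one pattern: a free-text q plus optional parameters. A minimal sketch of building such a URL from Scala follows. The object and method names (ContentApiQuery, searchUrl) are hypothetical and not part of the API; only the endpoint and parameter names come from the examples above.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper: builds search URLs shaped like the examples above.
object ContentApiQuery {
  private val Base = "http://content.guardianapis.com/search.json"

  // URLEncoder writes spaces as '+'; the example URLs use %20, so swap them.
  // A literal '+' in the input is already encoded as %2B, so the swap is safe.
  private def encode(value: String): String =
    URLEncoder.encode(value, StandardCharsets.UTF_8.name()).replace("+", "%20")

  // The query text always goes first, followed by any extra parameters in order.
  def searchUrl(query: String, params: Seq[(String, String)] = Seq.empty): String = {
    val all = ("q" -> query) +: params
    all.map { case (key, value) => s"$key=${encode(value)}" }.mkString(Base + "?", "&", "")
  }
}
```

Calling ContentApiQuery.searchUrl("prague beer", Seq("order-by" -> "relevance", "show-refinements" -> "all")) should reproduce the third URL above.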
[next] Expose the Solr master to EC2 and create hosts in EC2 that replicate from it using Solr replication - works fantastically. 6GB index size. Load-balancer config.
We use solr.war from the 1.4 dist totally unchanged, and run the API webapp in the same Jetty container
Designed a new logo that better reflects where we currently are
We took the opportunity to simplify our domain model....
But we also need to map tags, media, and factboxes
We need to search them, but they have a 1-to-1 relationship with content
So they get no separate record
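One way to picture "no separate record" is to fold the related data into the content item's own record as extra fields. The sketch below is an assumption about the general shape, not the schema used in the talk: the case classes and the field names (tagId, tagTitle, factbox) are hypothetical, and a plain Map stands in for a Solr document to keep the example dependency-free.

```scala
// Hypothetical domain types and field names, for illustration only.
object FlattenContent {
  final case class Tag(id: String, webTitle: String)
  final case class Content(
    id: String,
    headline: String,
    body: String,
    tags: Seq[Tag],
    factboxText: Option[String]
  )

  // One flat record per content item: related rows become extra fields on it,
  // with tags as multi-valued fields and the factbox added only when present.
  def toSolrFields(c: Content): Map[String, Any] = {
    val base: Map[String, Any] = Map(
      "id"       -> c.id,
      "headline" -> c.headline,
      "body"     -> c.body,
      "tagId"    -> c.tags.map(_.id),
      "tagTitle" -> c.tags.map(_.webTitle)
    )
    c.factboxText.fold(base)(text => base + ("factbox" -> text))
  }
}
```

The trade-off is that shared data such as tag titles is copied into every record that uses it, so a change to a tag means reindexing the content it is attached to.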
I’m not here to talk about Scala, but here’s a quick couple of snippets
(The EC2 load balancer is configured to check this URL and add the server to its list on a 200 response)
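A minimal sketch of what such a health-check endpoint could look like, using the HTTP server bundled with the JDK. The /healthcheck path, the indexReady flag, and the 503 for "not ready" are all assumptions for illustration; the notes only say that the load balancer adds a server on a 200 response.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress

// Hypothetical health check: the path and readiness flag are illustrative.
object HealthCheck {
  // 200 tells the load balancer this node can take traffic; anything else keeps it out.
  def statusFor(indexReady: Boolean): Int = if (indexReady) 200 else 503

  // Starts a small HTTP server that answers the health-check path with no body.
  def start(port: Int, indexReady: () => Boolean): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    server.createContext("/healthcheck", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        exchange.sendResponseHeaders(statusFor(indexReady()), -1) // -1 means no response body
        exchange.close()
      }
    })
    server.start()
    server
  }
}
```

With this shape, a freshly launched host can stay out of rotation until its replicated index is in place, simply by returning a non-200 status until then.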
This is game changing - suddenly we’re prepared to change the index -- and NoSQL solutions seem a whole lot less scary: we migrate our entire database every night!
(Actually, we found that 8 actors worked best with our hardware)
Each actor reads the data from the database, creates a SolrInputDocument, and submits it
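The read / build / submit loop can be sketched as below. This is not the talk's code: it swaps actors for Futures on a fixed pool of 8 threads, and Doc, loadFromDatabase, and the submit callback are hypothetical stand-ins for the database read, the SolrInputDocument, and the Solr submit.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical sketch: Futures on a fixed thread pool stand in for the actors.
object ParallelIndexer {
  // Stand-in for a SolrInputDocument.
  final case class Doc(id: Int, fields: Map[String, String])

  // Stand-in for reading one content item from the database.
  def loadFromDatabase(id: Int): Doc = Doc(id, Map("headline" -> s"Article $id"))

  // Loads and submits every id using `workers` threads; returns how many were submitted.
  def reindex(ids: Seq[Int], workers: Int = 8)(submit: Doc => Unit): Int = {
    val pool = Executors.newFixedThreadPool(workers)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val jobs = ids.map(id => Future { submit(loadFromDatabase(id)); 1 })
      Await.result(Future.sequence(jobs), 1.minute).sum
    } finally pool.shutdown()
  }
}
```

The worker count is a parameter because the best value depends on the hardware, as the note above says; 8 is only the default here.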