Drew Engelson and Edgar Roman present how PBS uses Python, Django, Celery, Solr, and autoscaling on Amazon EC2 to power the highly trafficked http://www.pbs.org/ and related sites (such as http://video.pbs.org/).
2. PBS is…
• PBS is a national federation of independently owned and
operated public television stations and producers
– Each with their own management and development resources
• 1500+ highly trafficked websites:
– http://www.pbs.org/
– http://www.pbs.org/nova/
– http://pbskids.org/
– http://pbskids.org/sesame/
– http://video.pbs.org/
• Enterprise services/APIs
3. PBS is not!
• We do television dammit!
• Or any of the other ~200 local stations.
4. What we do
• Technology leadership within public
broadcasting community
• Distribution of national programming content
• Services to local stations
• Core application development. Yeah!!!
6. History of PBS.org
Early 1990s: Hand-rolled static HTML
Late 1990s: Hand-crafted static HTML + CGI!
Most of 2000s: Zope/Plone CMS-generated static HTML
2008-10: Django-generated static HTML
Launched Oct 2010: Django all the way
7. COVE API
• Contains the metadata for all PBS videos online
including pointers to streaming video
• Needed to be:
– Secure
– Fast
– Scalable
8. COVE API – Technology Stack
• Amazon Elastic Compute Cloud (EC2)
• Amazon Relational Database Service (RDS)
• Linux
• Python
• Django
• Piston for REST API
9. COVE API - Architecture
[Architecture diagram; components shown:]
• Internet
• Elastic Load Balancer
• Auto Scale Array: App Server 1 … App Server N
• HAProxy
• RDS Master and three RDS Slaves
• App Sync Server
• S3 backups
10. COVE API – Management Tools
• Amazon Web Service Console
• RightScale
• Splunk
11. COVE API – Interesting Stuff
• Easy to load test
– Duplicate environment for several days
• Easy to scale
– Autoscale array grows automatically
• Easy to upgrade
– Each server built from vanilla base
12. COVE API – Lessons learned
• Use normalized data for administration and
denormalized data for API
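A minimal sketch of that split, assuming a made-up schema: the record layout, field names, and example URL below are hypothetical and do not come from the COVE API. Normalized records are what an admin tool would edit; the flattened documents are what an API could serve without joins.

```python
# Hypothetical normalized records, as an admin database might store them.
PROGRAMS = {1: {"title": "NOVA"}}
VIDEOS = [
    {
        "id": 10,
        "program_id": 1,
        "title": "Example Episode",
        "stream_url": "http://example.org/stream/10",
    },
]


def denormalize(video, programs):
    """Flatten one video and its parent program into a single API document."""
    program = programs[video["program_id"]]
    return {
        "id": video["id"],
        "title": video["title"],
        "program_title": program["title"],  # copied in, so reads need no join
        "stream_url": video["stream_url"],
    }


# Precomputed read-side documents, keyed by video id.
API_DOCS = {v["id"]: denormalize(v, PROGRAMS) for v in VIDEOS}
```

The cost of this design is that the flattened documents must be rebuilt whenever the normalized source changes.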
13. COVE API – Lessons learned
• Piston is fine, but lacks flexibility without
significant customization
– TastyPie?
• JSON is probably good enough
• Don’t get fancy with your endpoints
• Stick to REST principles
• Don’t get fancy with your authentication
– Use OAuth2 or simple token
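One way the "simple token" option could look, as a sketch only: the `Authorization: Token <key>` header format, the `API_TOKENS` store, and the `is_authorized` helper are assumptions for illustration, not the scheme the COVE API uses.

```python
import hmac

# Hypothetical token store; a real service would keep these in a database.
API_TOKENS = {"station-demo": "s3cr3t-token"}


def is_authorized(headers):
    """Return True if headers carry a known 'Authorization: Token <key>' value."""
    value = headers.get("Authorization", "")
    scheme, _, token = value.partition(" ")
    if scheme != "Token" or not token:
        return False
    candidate = token.encode("utf-8")
    # compare_digest does a constant-time comparison to limit timing leaks.
    return any(
        hmac.compare_digest(candidate, known.encode("utf-8"))
        for known in API_TOKENS.values()
    )
```

For example, `is_authorized({"Authorization": "Token s3cr3t-token"})` is accepted, while a missing header or a different scheme is rejected.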
14. PBS.org and Merlin API
• PBS.org
– Slim, fast layer
– Pulls data from Merlin API
– Uses memcache extensively
– Currently Django, but could be anything (Flask?)
• Merlin API
– Aggregate content from distributed CMSes
– Expose via standardized API
– Power PBS.org and more
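The "slim layer that pulls from an API and caches heavily" idea is essentially the cache-aside pattern. The sketch below uses an in-process `TTLCache` class as a stand-in for memcache and a fake `fetch_program` function in place of a real Merlin API call; both names, the cache key, and the TTL are hypothetical.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcache, with per-key expiry."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)


def cached_fetch(cache, key, fetch, ttl=60):
    """Return the cached value for key, calling fetch() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value, ttl)
    return value


calls = []


def fetch_program():
    """Pretend upstream API call; records each invocation."""
    calls.append(1)
    return {"title": "NOVA"}


cache = TTLCache()
first = cached_fetch(cache, "program:nova", fetch_program)
second = cached_fetch(cache, "program:nova", fetch_program)  # served from cache
```

Because the front end only depends on `cached_fetch` and an HTTP API, the framework behind it is interchangeable, which fits the "could be anything" remark above.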
17. Merlin API architecture
• API Endpoint – Django Piston
• Search service – Django-haystack
• Indexing service – Solr
• Data layer – MySQL (RDS)
• Administration – Django admin
• Feed ingestion – Celery
18. Merlin API server topology
[Topology diagram; components shown:]
• Internet
• Elastic Load Balancer
• Autoscaling array of app servers (App #N)
• Celery
• Master DB (RDS)
• Solr Index
• S3 backups
19. Merlin API – Management Tools
• Amazon Web Service Console
• RightScale
• Splunk
20. API - Piston/Haystack/Solr
from haystack.query import SearchQuerySet
from piston.handler import BaseHandler

class WebObjectIndexHandler(BaseHandler):
    ...
    def get_queryset(self):
        ...
        # Query the Solr index through Haystack rather than the database
        return PistonSearchQuerySet().models(*models)

class PistonSearchQuerySet(SearchQuerySet):
    ...
    def __getitem__(self, k):
        ...
        # Serialize each search result in the requested slice
        return [IndexSerializer(i) for i in
                super(PistonSearchQuerySet, self).__getitem__(k)]
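The slide elides most of the implementation, so the following is only a self-contained illustration of the same idea: override `__getitem__` so that whatever is read from a result set comes back already serialized. `FakeSearchResults`, `serialize`, and `SerializingResults` are hypothetical stand-ins, not Haystack or Piston classes.

```python
class FakeSearchResults:
    """Stand-in for a search result set; supports indexing and slicing."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, k):
        return self._items[k]


def serialize(result):
    """Hypothetical serializer: turn a raw result into an API-ready dict."""
    return {"title": result}


class SerializingResults(FakeSearchResults):
    """Serializes results on access, for both slices and single indexes."""

    def __getitem__(self, k):
        raw = super().__getitem__(k)
        if isinstance(k, slice):
            return [serialize(r) for r in raw]
        return serialize(raw)


results = SerializingResults(["NOVA", "Frontline", "Nature"])
page = results[0:2]   # a page of results, as a paginator would request
single = results[2]   # a single result by index
```

Serializing at slice time means only the requested page is converted, not the whole result set.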
22. Merlin API - Lessons learned
• Memcached was not necessary
• Denormalized search data via Solr index is much faster
than querying database
• Asynchronous task delegation is awesome
• Celery prone to memory leaks
• App server array for easy horizontal scaling
– Even if not autoscaling, increase min servers
• Never trust data you don’t control (validate!)
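A sketch of what validating ingested feed items might involve, assuming a made-up schema: the required fields in `REQUIRED_FIELDS` and the `validate_item` helper are hypothetical and not taken from the Merlin API.

```python
from urllib.parse import urlparse

REQUIRED_FIELDS = ("title", "url")  # hypothetical schema


def validate_item(item):
    """Return a list of problems; an empty list means the item is acceptable."""
    if not isinstance(item, dict):
        return ["item is not a mapping"]
    errors = []
    for field in REQUIRED_FIELDS:
        value = item.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append("missing or empty field: " + field)
    url = item.get("url")
    if isinstance(url, str) and url.strip():
        parsed = urlparse(url)
        # Accept only absolute http(s) URLs with a host part.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("url is not an absolute http(s) URL")
    return errors
```

Returning a list of problems rather than raising on the first one lets an ingestion task log everything wrong with an item and skip it without aborting the rest of the feed.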
24. PBS Developer Community
• Dedicated to making open.PBS the industry
standard in open development communities.
http://open.pbs.org/
https://github.com/pbs
open@pbs.org