DCPython: Architecture at PBS (Jun 7, 2011)

Architecture of PBS.org
DCPython - June 7, 2011

PBS is…
• PBS is a national federation of independently owned and
operated public television stations and producers
– Each with their own management and development resources
• 1500+ highly trafficked websites:
– http://www.pbs.org/
– http://www.pbs.org/nova/
– http://pbskids.org/
– http://pbskids.org/sesame/
– http://video.pbs.org/
• Enterprise services/APIs

PBS is not!
• We do television dammit!
• Or any of the other ~200 local stations.

What we do
• Technology leadership within public
broadcasting community
• Distribution of national programming content
• Services to local stations
• Core application development. Yeah!!!

History of PBS.org
Early 1990s: Hand-rolled static HTML
Late 1990s: Hand-crafted static HTML + CGI!
Most of 2000s: Zope/Plone CMS-generated static HTML
2008-10: Django-generated static HTML
Launched Oct 2010: Django all the way

COVE API
• Contains the metadata for all PBS videos online
including pointers to streaming video
• Needed to be:
– Secure
– Fast
– Scalable

COVE API – Technology Stack
• Amazon Elastic Compute Cloud (EC2)
• Amazon Relational Database Service (RDS)
• Linux
• Python
• Django
• Piston for REST API

COVE API - Architecture
[Diagram; components in the order shown]
• Internet
• Elastic Load Balancer
• Auto Scale Array: App Server 1 … App Server N
• HA Proxy
• RDS Master plus three RDS Slaves
• App Sync Server
• S3 Backups

COVE API – Management Tools
• Amazon Web Services Console
• RightScale
• Splunk

COVE API – Interesting Stuff
• Easy to load test
– Duplicate environment for several days
• Easy to scale
– Autoscale array grows automatically
• Easy to upgrade
– Each server built from vanilla base

COVE API – Lessons learned
• Use normalized data for administration and
denormalized data for the API
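
A minimal sketch of this lesson, assuming a hypothetical two-table layout (programs and videos) that is not taken from the COVE schema. The admin side keeps the records normalized, and a small function flattens them into the single document an API response would serve:

```python
# Hypothetical normalized records, as an admin database might store them.
programs = {1: {"title": "NOVA"}}
videos = [
    {"id": 10, "program_id": 1, "title": "Episode A",
     "stream_url": "http://example.org/a"},
]


def denormalize(video, programs):
    """Flatten one video and its parent program into one API-ready dict."""
    program = programs[video["program_id"]]
    return {
        "id": video["id"],
        "title": video["title"],
        "program_title": program["title"],  # copied in, so no join at read time
        "stream_url": video["stream_url"],
    }


# Built ahead of time (e.g. on save), so API reads need no joins.
api_documents = [denormalize(v, programs) for v in videos]
```

The trade-off is that the flattened copy must be rebuilt whenever the normalized source changes.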

COVE API – Lessons learned
• Piston is fine, but lacks flexibility without
significant customization
– TastyPie?
• JSON is probably good enough
• Don’t get fancy with your endpoints
• Stick to REST principles
• Don’t get fancy with your authentication
– Use OAuth2 or simple token
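
As an illustration of the "simple token" option, here is a sketch that uses only the Python standard library. The `Authorization: Token <value>` header format, the in-memory token store, and the function names are all assumptions for this example, not the COVE implementation:

```python
import hmac
import secrets

# Hypothetical token store: token -> client name.
API_TOKENS = {}


def issue_token(client_name):
    """Create a random token for a client and remember it."""
    token = secrets.token_hex(16)
    API_TOKENS[token] = client_name
    return token


def authenticate(headers):
    """Return the client name for a valid token header, else None."""
    value = headers.get("Authorization", "")
    scheme, _, presented = value.partition(" ")
    if scheme != "Token" or not presented:
        return None
    for token, client in API_TOKENS.items():
        # Constant-time comparison avoids leaking token contents via timing.
        if hmac.compare_digest(token.encode(), presented.encode()):
            return client
    return None
```

A real deployment would keep tokens in a database and send them only over HTTPS.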

PBS.org and Merlin API
• PBS.org
– Slim, fast layer
– Pulls data from Merlin API
– Uses memcache extensively
– Currently Django, but could be anything (Flask?)
• Merlin API
– Aggregate content from distributed CMSes
– Expose via standardized API
– Power PBS.org and more
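
The "slim layer that pulls from the API and caches heavily" pattern can be sketched as a cache-aside lookup. `SimpleCache` is an in-process stand-in for a memcached client, and `get_program`, the key format, and the `fetch` callable are hypothetical names for this example:

```python
import time


class SimpleCache:
    """In-process stand-in for a memcached client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # expired: behave like a cache miss
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.time() + ttl)


def get_program(slug, cache, fetch, ttl=300):
    """Return cached data if present; otherwise call the API and cache it."""
    key = "program:" + slug
    cached = cache.get(key)
    if cached is not None:
        return cached
    data = fetch(slug)  # e.g. an HTTP call to the backing API
    cache.set(key, data, ttl)
    return data
```

Because the front end only depends on `fetch` and the cache, the web framework around it is interchangeable, which matches the "could be anything" remark above.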

Merlin API – Technology stack
• Python
• Django
• MySQL
• Piston
• Solr
• Celery
• RabbitMQ
• Amazon Web Services (“cloud”)
– EC2
– RDS - Relational Database Service
– ELB - Elastic Load Balancing
– Cloudfront CDN
– S3 Storage

Data flow
RSS Feed → Ingestor → Standardized API
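
The data flow above can be sketched as a small ingestor that maps RSS items onto one standardized record shape. The output field names (`title`, `url`) and the sample feed are assumptions for illustration and are not the Merlin schema:

```python
import xml.etree.ElementTree as ET


def ingest_rss(xml_text):
    """Parse an RSS document and return a list of standardized records."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if not title or not link:
            continue  # skip incomplete items rather than store partial data
        records.append({"title": title, "url": link})
    return records


SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Episode A</title><link>http://example.org/a</link></item>
<item><title></title><link>http://example.org/b</link></item>
</channel></rss>"""

records = ingest_rss(SAMPLE_FEED)
```

Each source CMS can publish its own feed, while API consumers only ever see the one record shape.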

Merlin API architecture
• API endpoint – Django Piston
• Search service – Django-haystack
• Indexing service – Solr
• Data layer – MySQL (RDS)
• Administration – Django admin
• Feed ingestion – Celery

Merlin API server topology
[Diagram; components shown]
• Internet
• Elastic Load Balancer
• Autoscaling array of app servers (App #1 … App #N)
• Celery
• Master DB (RDS)
• Solr index
• S3 backups

Merlin API – Management Tools
• Amazon Web Services Console
• RightScale
• Splunk

API - Piston/Haystack/Solr
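from piston.handler import BaseHandler

# Piston handler whose results come from the search index
class WebObjectIndexHandler(BaseHandler):
    ...
    def get_queryset(self):
        ...
        return PistonSearchQuerySet().models(*models)

from haystack.query import SearchQuerySet

# Haystack queryset that serializes each search result for Piston
class PistonSearchQuerySet(SearchQuerySet):
    ...
    def __getitem__(self, k):
        ...
        return [IndexSerializer(i) for i in
                super(PistonSearchQuerySet, self).__getitem__(k)]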

Feed ingestor - Celery
from datetime import timedelta

from celery.decorators import task, periodic_task

# Runs every 5 minutes (300 seconds)
@periodic_task(run_every=timedelta(seconds=300))
def update_webobject_states():
    ...
    solr_visible = WebObject.children.filter(visible=True)
    solr_visible = solr_visible.exclude(
        flag__api_visible=True, available__isnull=True)
    ...
    # QuerySet.update() skips per-object save signals,
    # so a bulk signal is sent explicitly afterwards
    updated = solr_visible.update(visible=False,
                                  is_indexed=False)
    ...
    signals.bulk_update.send('tasks.update_webobject_states')

Merlin API - Lessons learned
• Memcached was not necessary
• Denormalized search data via Solr index is much faster
than querying database
• Asynchronous task delegation is awesome
• Celery prone to memory leaks
• App server array for easy horizontal scaling
– Even if not autoscaling, increase min servers
• Never trust data you don’t control (validate!)
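
For the last point, a sketch of what validating untrusted feed data might look like before it is stored or indexed. The record shape and the specific rules are assumptions for illustration, and `validate_record` is a hypothetical helper:

```python
from urllib.parse import urlparse


def validate_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    errors = []
    title = record.get("title")
    if not isinstance(title, str) or not title.strip():
        errors.append("title is missing or empty")
    url = record.get("url")
    parsed = urlparse(url) if isinstance(url, str) else None
    if (parsed is None
            or parsed.scheme not in ("http", "https")
            or not parsed.netloc):
        errors.append("url is not an absolute http(s) URL")
    return errors
```

Returning a list of problems instead of raising on the first one makes it easier to log everything wrong with a bad feed entry in one pass.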

Resources
• http://lucene.apache.org/solr/
• http://haystacksearch.org/
• http://celeryproject.org/
• http://celeryproject.org/docs/django-celery/
• http://aws.amazon.com/

PBS Developer Community
• Dedicated to making open.PBS the industry
standard in open development communities.
http://open.pbs.org/
https://github.com/pbs
open@pbs.org

Questions?
Drew Engelson
drew@engelson.net
http://tomatohater.com
Edgar Roman
emroman@pbs.org
