Architecture at PBS

Architecture of PBS.org
DCPython - June 7, 2011

PBS is…

• PBS is a national federation of independently owned and
operated public television stations and producers
– Each with their own management and development resources

• 1500+ highly trafficked websites:
– http://www.pbs.org/
– http://www.pbs.org/nova/
– http://pbskids.org/
– http://pbskids.org/sesame/
– http://video.pbs.org/

• Enterprise services/APIs

PBS is not!

• Radio is easy… We do television!

• Or any of the other ~200 local stations.

What we do

• Technology leadership within public
broadcasting community
• Distribution of national programming content
• Services to local stations
• Core application development. Yeah!!!

History of PBS.org

Early 1990s: Hand-rolled static HTML
Late 1990s: Hand-crafted static HTML + CGI!
Most of the 2000s: Zope/Plone CMS-generated static HTML
2008-10: Django-generated static HTML
Launched Oct 2010: Django all the way

COVE API

• Contains the metadata for all PBS videos online
including pointers to streaming video
• Needed to be:
– Secure
– Fast
– Scalable

COVE API – Technology Stack

• Amazon Elastic Compute Cloud (EC2)
• Amazon Relational Database Service (RDS)
• Linux
• Python
• Django
• Piston for REST API

COVE API - Architecture

Diagram, top to bottom:
• Internet
• Elastic Load Balancer
• Auto Scale Array: App Server 1 … App Server N
• HA Proxy
• RDS Master, RDS Slaves, S3 backups
• App Sync Server

COVE API – Management Tools

• Amazon Web Service Console
• RightScale
• Splunk

COVE API – Interesting Stuff

• Easy to load test
– Duplicate environment for several days
• Easy to scale
– Autoscale array grows automatically
• Easy to upgrade
– Each server built from vanilla base

COVE API – Lessons learned
• Use normalized data for administration and denormalized data for the API
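
As a rough illustration of the normalized/denormalized split, the sketch below flattens a video record and its parent program into one API-ready document. The record shapes, field names, and example URL are assumptions made for illustration; they are not taken from the COVE schema.

```python
# Hypothetical normalized records, roughly as an admin database might hold them.
programs = {1: {"title": "NOVA"}}
videos = [
    {"id": 10, "program_id": 1, "title": "Episode A",
     "stream_url": "http://example.org/a"},
]


def denormalize(video, programs):
    """Flatten one video and its program into a single API-ready document."""
    program = programs[video["program_id"]]
    return {
        "id": video["id"],
        "title": video["title"],
        "program_title": program["title"],  # copied in, so the API needs no join
        "stream_url": video["stream_url"],
    }


# Built once at write time; the API then serves these documents directly.
api_documents = [denormalize(v, programs) for v in videos]
```

The trade-off is that the flattened copies must be rebuilt whenever the normalized source changes.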

COVE API – Lessons learned
• Piston is fine, but lacks flexibility without
significant customization
– TastyPie?
• JSON is probably good enough
• Don’t get fancy with your endpoints
• Stick to REST principles
• Don’t get fancy with your authentication
– Use OAuth2 or simple token
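
A "simple token" check can be very small. The sketch below is one possible shape, assuming a per-client shared token; the token store and the client name are hypothetical, and a real service would keep tokens in a database rather than in code.

```python
import hmac

# Hypothetical token store, for illustration only.
API_TOKENS = {"demo-client": "s3cr3t-token"}


def is_authorized(client_id, presented_token):
    """Return True only if the client exists and the token matches."""
    expected = API_TOKENS.get(client_id)
    if expected is None:
        return False
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected.encode(), presented_token.encode())
```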

PBS.org and Merlin API

• PBS.org
– Slim, fast layer
– Pulls data from Merlin API
– Uses memcache extensively
– Currently Django, but could be anything (Flask?)

• Merlin API
– Aggregate content from distributed CMSes
– Expose via standardized API
– Power PBS.org and more
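
The "slim layer that pulls from an API and caches heavily" pattern might look like the sketch below. It uses an in-process dictionary in place of memcache so that it runs standalone; `TTLCache` and `fetch_from_merlin` are hypothetical names, not part of PBS.org or the Merlin API.

```python
import time


class TTLCache:
    """Tiny cache-aside helper: return a fresh cached value or fetch and store it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expiry_time, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, still fresh
        value = fetch(key)  # cache miss or expired: go to the backing API
        self._store[key] = (now + self.ttl, value)
        return value


def fetch_from_merlin(path):
    """Hypothetical stand-in for an HTTP call to the content API."""
    return {"path": path, "items": []}


cache = TTLCache(ttl_seconds=60)
page_data = cache.get_or_fetch("/nova/", fetch_from_merlin)
```

With memcache the dictionary would be replaced by the cache client's get and set calls; the control flow stays the same.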

Merlin API – Technology stack

• Python
• Django
• MySQL
• Piston
• Solr
• Celery
• RabbitMQ
• Amazon Web Services (“cloud”)
– EC2
– RDS - Relational Database Service
– ELB - Elastic Load Balancing
– Cloudfront CDN
– S3 Storage

Data flow

RSS Feed Ingestor → Standardized API
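
The feed-to-API flow can be sketched as a function that reads RSS items and emits records in one fixed shape. This is a minimal illustration using only the standard library; the sample feed and the output field names (`uid`, `title`, `url`) are assumptions, not the Merlin schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed from a station CMS.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example station feed</title>
  <item>
    <title> Episode A </title>
    <link>http://example.org/a</link>
    <guid>a-1</guid>
  </item>
</channel></rss>"""


def ingest(feed_xml):
    """Parse an RSS 2.0 document into a list of standardized records."""
    root = ET.fromstring(feed_xml)
    records = []
    for item in root.iter("item"):
        records.append({
            "uid": item.findtext("guid", default="").strip(),
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
        })
    return records


records = ingest(SAMPLE_FEED)
```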

Merlin API architecture

• API Endpoint – Django Piston
• Search service – Django-haystack
• Indexing service – Solr
• Data layer – MySQL (RDS)
• Administration – Django admin
• Feed ingestion – Celery

Merlin API server topology

Diagram, top to bottom:
• Internet
• Elastic Load Balancer
• Autoscaling array: App #1 … App #N
• Solr Master, Celery, Index, DB (RDS)
• S3 backups

Merlin API – Management Tools

• Amazon Web Service Console
• RightScale
• Splunk

API - Piston/Haystack/Solr
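from haystack.query import SearchQuerySet
from piston.handler import BaseHandler

class WebObjectIndexHandler(BaseHandler):
    ...
    def get_queryset(self):
        ...
        # Serve results from the search index rather than the database.
        return PistonSearchQuerySet().models(*models)

class PistonSearchQuerySet(SearchQuerySet):
    ...
    def __getitem__(self, k):
        ...
        # Wrap each search hit so Piston can serialize it.
        # IndexSerializer is not shown on the slide.
        return [IndexSerializer(i) for i in
                super(PistonSearchQuerySet, self).__getitem__(k)]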

Feed ingestor - Celery
from datetime import timedelta

from celery.decorators import task, periodic_task

# Scheduled to run every five minutes.
@periodic_task(run_every=timedelta(seconds=300))
def update_webobject_states():
    ...
    solr_visible = WebObject.children.filter(visible=True)
    solr_visible = solr_visible.exclude(
        flag__api_visible=True, available__isnull=True)
    ...
    # QuerySet.update() runs one bulk UPDATE and returns the row count.
    updated = solr_visible.update(visible=False,
                                  is_indexed=False)
    ...
    signals.bulk_update.send('tasks.update_webobject_states')

Merlin API - Lessons learned

• Memcached was not necessary
• Denormalized search data via Solr index is much faster
than querying database
• Asynchronous task delegation is awesome
• Celery prone to memory leaks
• App server array for easy horizontal scaling
– Even if not autoscaling, increase min servers
• Never trust data you don’t control (validate!)
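
For the "validate!" point, one possible shape is a function that returns a list of problems for each incoming record, so that bad feed data is rejected before it is stored or indexed. The required fields and rules below are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical required fields for an ingested record.
REQUIRED_FIELDS = ("uid", "title", "url")


def validate(record):
    """Return a list of problems; an empty list means the record is accepted."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append("missing or empty field: " + field)
    url = record.get("url")
    if isinstance(url, str) and url.strip():
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("url is not an absolute http(s) URL")
    return problems
```

Returning all problems at once, rather than raising on the first, makes it easier to report back to whoever owns the upstream feed.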

Resources

• http://lucene.apache.org/solr/
• http://haystacksearch.org/
• http://celeryproject.org/
• http://celeryproject.org/docs/django-celery/
• http://aws.amazon.com/

PBS Developer Community

• Dedicated to making open.PBS the industry
standard in open development communities.

http://open.pbs.org/
https://github.com/pbs

open@pbs.org

Questions?

Drew Engelson
drew@engelson.net
http://tomatohater.com

Edgar Roman
emroman@pbs.org
