Architecture of PBS.org
DCPython - June 7, 2011
DCPython: Architecture at PBS (Jun 7, 2011)

Drew Engelson and Edgar Roman present on how PBS uses Python, Django, Celery, Solr and autoscales Amazon EC2 to power the highly trafficked http://www.pbs.org/ and related sites (such as http://video.pbs.org/).

Published in: Technology
Transcript of "DCPython: Architecture at PBS (Jun 7, 2011)"

  1. Architecture of PBS.org
      DCPython - June 7, 2011
  2. PBS is…
      • PBS is a national federation of independently owned and operated public television stations and producers
        – Each with their own management and development resources
      • 1500+ highly trafficked websites:
        – http://www.pbs.org/
        – http://www.pbs.org/nova/
        – http://pbskids.org/
        – http://pbskids.org/sesame/
        – http://video.pbs.org/
      • Enterprise services/APIs
  3. PBS is not!
      • We do television dammit!
      • Or any of the other ~200 local stations.
  4. What we do
      • Technology leadership within public broadcasting community
      • Distribution of national programming content
      • Services to local stations
      • Core application development. Yeah!!!
  5. A few of our sites
  6. History of PBS.org
      • Early 1990s: Hand rolled static HTML
      • Late 1990s: Hand crafted static HTML + CGI!
      • Most of 2000s: Zope/Plone CMS generated static HTML
      • 2008-10: Django generated static HTML
      • Launched Oct 2010: Django all the way
  7. COVE API
      • Contains the metadata for all PBS videos online including pointers to streaming video
      • Needed to be:
        – Secure
        – Fast
        – Scalable
  8. COVE API – Technology Stack
      • Amazon Elastic Compute Cloud (EC2)
      • Amazon Relational Database Service (RDS)
      • Linux
      • Python
      • Django
      • Piston for REST API
  9. COVE API - Architecture
      • Internet
      • Elastic Load Balancer
      • Auto Scale Array: App Server 1, App Server N…
      • HA Proxy
      • RDS Master, RDS Slave 1, RDS Slave 1, RDS Slave 1
      • App Sync Server
      • S3 Backups
  10. COVE API – Management Tools
      • Amazon Web Service Console
      • RightScale
      • Splunk
  11. COVE API – Interesting Stuff
      • Easy to load test
        – Duplicate environment for several days
      • Easy to scale
        – Autoscale array grows automatically
      • Easy to upgrade
        – Each server built from vanilla base
  12. COVE API – Lessons learned
      • Use normalized data for administration and denormalized data for API
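The normalized/denormalized split in this lesson can be illustrated with a minimal sketch. Everything in it is hypothetical: the table layout, the field names (`program_id`, `video_url`) and the `denormalize` helper are assumptions made for illustration, not the COVE schema. The idea shown is only that editors work on small related tables while the API serves flat, pre-joined records.

```python
# Hypothetical normalized tables, as an admin interface might edit them.
programs = {1: {"title": "NOVA"}}
videos = [
    {"id": 10, "program_id": 1, "title": "Episode A",
     "video_url": "http://example.org/a"},
    {"id": 11, "program_id": 1, "title": "Episode B",
     "video_url": "http://example.org/b"},
]


def denormalize(videos, programs):
    """Flatten each video and its parent program into one self-contained record."""
    flat = {}
    for video in videos:
        program = programs[video["program_id"]]
        flat[video["id"]] = {
            "id": video["id"],
            "title": video["title"],
            # The join is done once here, so API reads never need it.
            "program_title": program["title"],
            "video_url": video["video_url"],
        }
    return flat


# Records an API endpoint could return without any further lookups.
api_records = denormalize(videos, programs)
```

The cost of this design is that the flat records have to be rebuilt whenever the normalized data changes.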
  13. COVE API – Lessons learned
      • Piston is fine, but lacks flexibility without significant customization
        – TastyPie?
      • JSON is probably good enough
      • Don’t get fancy with your endpoints
      • Stick to REST principles
      • Don’t get fancy with your authentication
        – Use OAuth2 or simple token
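As one possible reading of the "simple token" option, here is a minimal sketch using only the Python standard library. The `ISSUED_TOKENS` store, the consumer name and the `is_authorized` helper are hypothetical; the slides do not describe how COVE issues or checks tokens.

```python
import hmac
import secrets

# Hypothetical token store: API consumer name -> issued token.
ISSUED_TOKENS = {"station-site": secrets.token_hex(16)}


def is_authorized(consumer, presented_token):
    """Return True only if the presented token matches the one issued."""
    expected = ISSUED_TOKENS.get(consumer)
    if expected is None:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(
        expected.encode("utf-8"), presented_token.encode("utf-8")
    )
```

A real deployment would also need token issuance, rotation and transport over HTTPS, none of which is sketched here.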
  14. PBS.org and Merlin API
      • PBS.org
        – Slim, fast layer
        – Pulls data from Merlin API
        – Uses memcache extensively
        – Currently Django, but could be anything (Flask?)
      • Merlin API
        – Aggregate content from distributed CMSes
        – Expose via standardized API
        – Power PBS.org and more
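The "pulls data from Merlin API" and "uses memcache extensively" points suggest a cache-aside pattern, sketched below under stated assumptions. `TTLCache` is an in-process stand-in for a memcached client, and `fetch_from_api` is a hypothetical placeholder for an HTTP call; neither name comes from the PBS.org code.

```python
import time


class TTLCache:
    """In-process stand-in for a memcached client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # Expired: treat as a miss.
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)


def fetch_from_api(path):
    """Hypothetical placeholder for an HTTP request to the content API."""
    fetch_from_api.calls += 1
    return {"path": path, "title": "Example"}


fetch_from_api.calls = 0  # Counts upstream requests, for illustration only.


def get_content(cache, path, ttl=300):
    """Cache-aside read: try the cache first, fall back to the API on a miss."""
    cached = cache.get(path)
    if cached is not None:
        return cached
    fresh = fetch_from_api(path)
    cache.set(path, fresh, ttl)
    return fresh
```

With this shape the front end stays thin: repeated requests for the same path within the TTL never reach the API.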
  15. Merlin API – Technology stack
      • Python
      • Django
      • MySQL
      • Piston
      • Solr
      • Celery
      • RabbitMQ
      • Amazon Web Services (“cloud”)
        – EC2
        – RDS - Relational Database Service
        – ELB - Elastic Load Balancing
        – Cloudfront CDN
        – S3 Storage
  16. Data flow
      • RSS Feed
      • Ingestor
      • Standardized API
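Assuming the diagram reads as RSS feed, then ingestor, then standardized API, the ingest step can be sketched with the standard library alone. The sample feed, the `ingest` function and the output field names (`title`, `url`) are hypothetical and stand in for whatever record format Merlin actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, standing in for one producer's RSS output.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example producer feed</title>
  <item><title>First story</title><link>http://example.org/1</link></item>
  <item><title>Second story</title><link>http://example.org/2</link></item>
</channel></rss>"""


def ingest(rss_text):
    """Parse RSS items into uniform records an API layer could serve."""
    root = ET.fromstring(rss_text)
    records = []
    for item in root.iter("item"):
        records.append({
            # Missing elements become empty strings rather than None.
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
        })
    return records


records = ingest(SAMPLE_RSS)
```

Each source CMS can keep its own feed quirks; only the ingestor needs to know about them.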
  17. Merlin API architecture
      • API Endpoint – Django Piston
      • Search service – Django-haystack
      • Indexing service – Solr
      • Data layer – MySQL (RDS)
      • Administration – Django admin
      • Feed ingestion – Celery
  18. Merlin API server topology
      • Elastic Load Balancer
      • Internet
      • S3 backups
      • Celery
      • Master
      • DB RDS
      • Solr Index
      • Autoscaling array: App #N, App #N, App #N, App #n
  19. Merlin API – Management Tools
      • Amazon Web Service Console
      • RightScale
      • Splunk
  20. API - Piston/Haystack/Solr

      from haystack.query import SearchQuerySet
      from piston.handler import BaseHandler

      # Piston handler whose queryset comes from the search index.
      class WebObjectIndexHandler(BaseHandler):
          ...
          def get_queryset(self):
              ...
              return PistonSearchQuerySet().models(*models)

      # SearchQuerySet that serializes each result as it is indexed or sliced.
      class PistonSearchQuerySet(SearchQuerySet):
          ...
          def __getitem__(self, k):
              ...
              return [IndexSerializer(i) for i in
                      super(PistonSearchQuerySet, self).__getitem__(k)]
  21. Feed ingestor - Celery

      from datetime import timedelta

      from celery.decorators import task, periodic_task

      # Runs every 300 seconds (five minutes).
      @periodic_task(run_every=timedelta(seconds=300))
      def update_webobject_states():
          ...
          solr_visible = WebObject.children.filter(visible=True)
          solr_visible = solr_visible.exclude(
              flag__api_visible=True, available__isnull=True)
          ...
          # Hide the remaining objects and mark them for re-indexing.
          updated = solr_visible.update(visible=False, is_indexed=False)
          ...
          signals.bulk_update.send('tasks.update_webobject_states')
  22. Merlin API - Lessons learned
      • Memcached was not necessary
      • Denormalized search data via Solr index is much faster than querying database
      • Asynchronous task delegation is awesome
      • Celery prone to memory leaks
      • App server array for easy horizontal scaling
        – Even if not autoscaling, increase min servers
      • Never trust data you don’t control (validate!)
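The last lesson, validating data you do not control, might look like the following minimal sketch for a single ingested record. The `validate_record` helper and the required fields (`title`, `url`) are assumptions for illustration; the slides do not say which checks Merlin performs.

```python
from urllib.parse import urlparse


def validate_record(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []

    title = record.get("title")
    if not isinstance(title, str) or not title.strip():
        problems.append("missing title")

    url = record.get("url")
    parsed = urlparse(url) if isinstance(url, str) else None
    # Accept only absolute http(s) URLs that name a host.
    if (parsed is None
            or parsed.scheme not in ("http", "https")
            or not parsed.netloc):
        problems.append("invalid url")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything wrong with a bad feed item in one pass.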
  23. Resources
      • http://lucene.apache.org/solr/
      • http://haystacksearch.org/
      • http://celeryproject.org/
      • http://celeryproject.org/docs/django-celery/
      • http://aws.amazon.com/
  24. PBS Developer Community
      • Dedicated to making open.PBS the industry standard in open development communities.
      http://open.pbs.org/
      https://github.com/pbs
      open@pbs.org
  25. Questions?
      Drew Engelson
      drew@engelson.net
      http://tomatohater.com
      Edgar Roman
      emroman@pbs.org
