Architecture of PBS.org DCPython - June 7, 2011

Architecture at PBS

Edgar and I had the pleasure of presenting at the DCPython meetup last night about how PBS uses Python, Django, Celery, Solr and Amazon Web Services (autoscaling EC2, RDS) to power many of our sites and services. We focused primarily on the COVE (video) and Merlin (content) APIs since those probably have the most interesting architectures.

We had a blast and received many smart questions from the crowd about Solr, Amazon Web Services, Celery, and the recent Tupac incident, in about that order. Thanks for having us, DCPython!

Check out DCPython at http://dcpython.org or follow @DCPython.

Published in: Technology

  1. Architecture of PBS.org
     DCPython - June 7, 2011
  2. PBS is…
     • PBS is a national federation of independently owned and operated public television stations and producers
       – Each with their own management and development resources
     • 1500+ highly trafficked websites:
       – http://www.pbs.org/
       – http://www.pbs.org/nova/
       – http://pbskids.org/
       – http://pbskids.org/sesame/
       – http://video.pbs.org/
     • Enterprise services/APIs
  3. PBS is not!
     • Radio is easy… We do television!
     • Or any of the other ~200 local stations.
  4. What we do
     • Technology leadership within public broadcasting community
     • Distribution of national programming content
     • Services to local stations
     • Core application development. Yeah!!!
  5. A few of our sites
  6. History of PBS.org
     • Early 1990s: Hand rolled static HTML
     • Late 1990s: Hand crafted static HTML + CGI!
     • Most of 2000s: Zope/Plone CMS generated static HTML
     • 2008-10: Django generated static HTML
     • Launched Oct 2010: Django all the way
  7. COVE API
     • Contains the metadata for all PBS videos online including pointers to streaming video
     • Needed to be:
       – Secure
       – Fast
       – Scalable
  8. COVE API – Technology Stack
     • Amazon Elastic Compute Cloud (EC2)
     • Amazon Relational Database Service (RDS)
     • Linux
     • Python
     • Django
     • Piston for REST API
  9. COVE API - Architecture
     [Diagram; labels only]
     • Internet
     • Elastic Load Balancer
     • Auto Scale Array: App Server 1 … App Server N
     • HA Proxy
     • RDS Master
     • RDS Slaves
     • S3
     • Backups
     • App Sync Server
  10. COVE API – Management Tools
      • Amazon Web Service Console
      • RightScale
      • Splunk
  11. COVE API – Interesting Stuff
      • Easy to load test
        – Duplicate environment for several days
      • Easy to scale
        – Autoscale array grows automatically
      • Easy to upgrade
        – Each server built from vanilla base
  12. COVE API – Lessons learned
      • Use normalized data for administration and denormalized data for API
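
The normalized/denormalized split on this slide can be illustrated with a minimal sketch. The record shapes and the names `PROGRAMS`, `VIDEOS`, `denormalize` and `get_video` are hypothetical and are not taken from the COVE code base; the idea is only that the admin side keeps related records separate, while the API side serves pre-joined copies.

```python
# Hypothetical normalized records, roughly as an admin database might store them.
PROGRAMS = {1: {"title": "NOVA"}}
VIDEOS = [
    {"id": 10, "program_id": 1, "title": "Episode A",
     "stream_url": "http://example.org/a"},
    {"id": 11, "program_id": 1, "title": "Episode B",
     "stream_url": "http://example.org/b"},
]


def denormalize(videos, programs):
    """Build one flat, self-contained document per video for API reads."""
    docs = {}
    for video in videos:
        program = programs[video["program_id"]]
        docs[video["id"]] = {
            "id": video["id"],
            "title": video["title"],
            # Copied in from the program record, so no join at read time.
            "program_title": program["title"],
            "stream_url": video["stream_url"],
        }
    return docs


# Rebuilt whenever the normalized data changes (assumed to be infrequent).
API_DOCS = denormalize(VIDEOS, PROGRAMS)


def get_video(video_id):
    """Serve an API read from the denormalized copy with a single lookup."""
    return API_DOCS[video_id]
```

The trade-off is the usual one: edits stay simple on the normalized side, and reads stay cheap on the denormalized side, at the cost of rebuilding the copies after each change.
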
  13. COVE API – Lessons learned
      • Piston is fine, but lacks flexibility without significant customization
        – TastyPie?
      • JSON is probably good enough
      • Don’t get fancy with your endpoints
      • Stick to REST principles
      • Don’t get fancy with your authentication
        – Use OAuth2 or simple token
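
As a rough illustration of the "simple token" option, here is a minimal sketch using only the Python standard library. The `TokenStore` class and the client names are hypothetical; the slides do not describe how COVE actually issues or checks tokens.

```python
import hmac
import secrets


class TokenStore:
    """Hypothetical in-memory registry of one API token per client."""

    def __init__(self):
        self._tokens = {}  # client name -> issued token

    def register(self, client):
        """Issue a random token for a client and remember it."""
        token = secrets.token_hex(16)
        self._tokens[client] = token
        return token

    def authenticate(self, client, presented):
        """Return True only if the presented token matches the issued one."""
        expected = self._tokens.get(client)
        if expected is None:
            return False
        # Constant-time comparison avoids leaking how many characters matched.
        return hmac.compare_digest(expected.encode(), presented.encode())
```

A request handler would call `authenticate` with the client identifier and token taken from the request, and reject the request when it returns `False`.
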
  14. PBS.org and Merlin API
      • PBS.org
        – Slim, fast layer
        – Pulls data from Merlin API
        – Uses memcache extensively
        – Currently Django, but could be anything (Flask?)
      • Merlin API
        – Aggregate content from distributed CMSes
        – Expose via standardized API
        – Power PBS.org and more
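
One common way a thin front end combines "pulls data from an API" with "uses memcache extensively" is the cache-aside pattern, sketched below. This is an assumption about the general approach, not PBS.org's code: `TTLCache` is a hypothetical in-process stand-in for a memcached client, and `fetch` stands for whatever call retrieves data from the Merlin API.

```python
import time


class TTLCache:
    """Hypothetical in-process stand-in for a memcached client."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry time)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired entries count as misses
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)


def cached_fetch(cache, key, fetch, ttl=60):
    """Cache-aside: return the cached value, or fetch and store it on a miss."""
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value, ttl)
    return value
```

With this shape, repeated page views within the TTL cost one cache lookup each, and the backing API is called only on a miss or after expiry.
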
  15. Merlin API – Technology stack
      • Python
      • Django
      • MySQL
      • Piston
      • Solr
      • Celery
      • RabbitMQ
      • Amazon Web Services (“cloud”)
        – EC2
        – RDS - Relational Database Service
        – ELB - Elastic Load Balancing
        – CloudFront CDN
        – S3 Storage
  16. Data flow
      [Diagram; labels only: RSS Feed Ingestor, Standardized API]
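
The data flow slide suggests feed content being ingested and re-exposed in a standardized shape. A minimal sketch of that idea follows, assuming plain RSS 2.0 input; the sample feed, the `ingest` function and the output field names (`title`, `url`) are hypothetical and are not Merlin's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, standing in for one of the distributed CMS feeds.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>First story</title><link>http://example.org/1</link></item>
<item><title>Second story</title><link>http://example.org/2</link></item>
</channel></rss>"""


def ingest(rss_text):
    """Parse an RSS document into a list of uniform records."""
    root = ET.fromstring(rss_text)
    records = []
    for item in root.iter("item"):
        records.append({
            # findtext returns None for a missing element, hence the fallback.
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
        })
    return records
```

Each source feed may differ in detail, so the point of the ingest step is that everything downstream sees the same record shape regardless of where the item came from.
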
  17. Merlin API architecture
      [Diagram; labels only]
      • API Endpoint – Django Piston
      • Search service – Django-haystack
      • Indexing service – Solr
      • Data layer – MySQL (RDS)
      • Administration – Django admin
      • Feed ingestion – Celery
  18. Merlin API server topology
      [Diagram; labels only]
      • Internet
      • Elastic Load Balancer
      • Autoscaling array: App #N instances
      • Solr Master Index
      • Celery
      • DB (RDS)
      • S3 backups
  19. Merlin API – Management Tools
      • Amazon Web Service Console
      • RightScale
      • Splunk
  20. API - Piston/Haystack/Solr

      class WebObjectIndexHandler(BaseHandler):
          ...
          def get_queryset(self):
              ...
              return PistonSearchQuerySet().models(*models)

      from haystack.query import SearchQuerySet

      class PistonSearchQuerySet(SearchQuerySet):
          ...
          def __getitem__(self, k):
              ...
              return [IndexSerializer(i) for i in
                      super(PistonSearchQuerySet, self).__getitem__(k)]
  21. Feed ingestor - Celery

      from datetime import timedelta

      from celery.decorators import task, periodic_task

      # Runs every 300 seconds (5 minutes).
      @periodic_task(run_every=timedelta(seconds=300))
      def update_webobject_states():
          ...
          solr_visible = WebObject.children.filter(visible=True)
          solr_visible = solr_visible.exclude(flag__api_visible=True,
                                              available__isnull=True)
          ...
          updated = solr_visible.update(visible=False, is_indexed=False)
          ...
          signals.bulk_update.send(tasks.update_webobject_states)
  22. Merlin API - Lessons learned
      • Memcached was not necessary
      • Denormalized search data via Solr index is much faster than querying database
      • Asynchronous task delegation is awesome
      • Celery prone to memory leaks
      • App server array for easy horizontal scaling
        – Even if not autoscaling, increase min servers
      • Never trust data you don’t control (validate!)
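
The last lesson, validating data you do not control, might look something like the sketch below for an ingested feed record. The field names, the 200-character title limit and the `validate_item` function are illustrative assumptions, not rules taken from Merlin.

```python
from urllib.parse import urlparse

MAX_TITLE_LENGTH = 200  # hypothetical limit, chosen only for illustration


def validate_item(item):
    """Return a cleaned copy of an ingested record, or raise ValueError."""
    if not isinstance(item, dict):
        raise ValueError("item must be a mapping")

    title = item.get("title")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title is missing or empty")

    url = item.get("url")
    if not isinstance(url, str):
        raise ValueError("url is missing")
    url = url.strip()
    parsed = urlparse(url)
    # Accept only absolute http(s) URLs; reject relative paths and other schemes.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("url must be an absolute http(s) URL")

    return {"title": title.strip()[:MAX_TITLE_LENGTH], "url": url}
```

Rejecting or cleaning records at the ingest boundary keeps malformed upstream data out of the database and the search index, where it would be harder to track down later.
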
  23. Resources
      • http://lucene.apache.org/solr/
      • http://haystacksearch.org/
      • http://celeryproject.org/
      • http://celeryproject.org/docs/django-celery/
      • http://aws.amazon.com/
  24. PBS Developer Community
      • Dedicated to making open.PBS the industry standard in open development communities.
      http://open.pbs.org/
      https://github.com/pbs
      open@pbs.org
  25. Questions?
      Drew Engelson
      drew@engelson.net
      http://tomatohater.com
      Edgar Roman
      emroman@pbs.org
