DCPython: Architecture at PBS (Jun 7, 2011)

Drew Engelson and Edgar Roman present how PBS uses Python, Django, Celery, Solr, and autoscaling on Amazon EC2 to power the highly trafficked http://www.pbs.org/ and related sites (such as http://video.pbs.org/).

Published in: Technology

  1. Architecture of PBS.org: DCPython, June 7, 2011
  2. PBS is…
     • PBS is a national federation of independently owned and operated public television stations and producers
       – Each with their own management and development resources
     • 1500+ highly trafficked websites:
       – http://www.pbs.org/
       – http://www.pbs.org/nova/
       – http://pbskids.org/
       – http://pbskids.org/sesame/
       – http://video.pbs.org/
     • Enterprise services/APIs
  3. PBS is not!
     • We do television, dammit!
     • Or any of the other ~200 local stations.
  4. What we do
     • Technology leadership within public broadcasting community
     • Distribution of national programming content
     • Services to local stations
     • Core application development. Yeah!!!
  5. A few of our sites
  6. History of PBS.org
     • Early 1990s: Hand-rolled static HTML
     • Late 1990s: Hand-crafted static HTML + CGI!
     • Most of the 2000s: Zope/Plone CMS-generated static HTML
     • 2008-10: Django-generated static HTML
     • Launched Oct 2010: Django all the way
  7. COVE API
     • Contains the metadata for all PBS videos online, including pointers to streaming video
     • Needed to be:
       – Secure
       – Fast
       – Scalable
  8. COVE API – Technology Stack
     • Amazon Elastic Compute Cloud (EC2)
     • Amazon Relational Database Service (RDS)
     • Linux
     • Python
     • Django
     • Piston for REST API
  9. COVE API - Architecture: Internet; Elastic Load Balancer; Auto Scale Array (App Server 1 … App Server N); HA Proxy; RDS Master; RDS Slave 1; RDS Slave 1; RDS Slave 1; App Sync Server; S3 Backups
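The slide shows one RDS master and several slaves behind HA Proxy, but not how the application chooses between them. One common way to split the traffic in Django is a database router; the sketch below is an assumption, not the configuration PBS used. The aliases `default` (master) and `replica` (for example, the HA Proxy endpoint in front of the slaves) are hypothetical names that would have to exist in the `DATABASES` setting.

```python
class ReadReplicaRouter:
    """Route reads to a replica alias and writes to the master alias.

    Both aliases are hypothetical: "default" stands for the RDS master and
    "replica" for a read endpoint such as a proxy in front of the slaves.
    """

    def db_for_read(self, model, **hints):
        # All read queries go to the replica endpoint.
        return "replica"

    def db_for_write(self, model, **hints):
        # All writes go to the master.
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Master and replicas hold the same data, so relations are always fine.
        return True


router = ReadReplicaRouter()
```

Django would pick the router up through the `DATABASE_ROUTERS` setting. Replication lag means a read issued right after a write may not see that write, which is one reason to keep admin traffic on the master.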
  10. COVE API – Management Tools
      • Amazon Web Service Console
      • RightScale
      • Splunk
  11. COVE API – Interesting Stuff
      • Easy to load test
        – Duplicate environment for several days
      • Easy to scale
        – Autoscale array grows automatically
      • Easy to upgrade
        – Each server built from vanilla base
  12. COVE API – Lessons learned
      • Use normalized data for administration and denormalized data for API
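To illustrate the lesson on slide 12, here is a minimal sketch of flattening normalized records into one denormalized API document. The field names (`program_id`, `stream_urls`, and so on) and the sample data are invented for illustration and are not the COVE schema.

```python
def denormalize_video(video, programs, files):
    """Flatten one normalized video row and its related rows into one API document."""
    program = programs[video["program_id"]]
    return {
        "id": video["id"],
        "title": video["title"],
        # Copy the related program's title into the document so the API
        # never has to join at request time.
        "program_title": program["title"],
        "stream_urls": [f["url"] for f in files if f["video_id"] == video["id"]],
    }


# Hypothetical normalized data, as an admin interface might store it.
programs = {10: {"title": "NOVA"}}
files = [
    {"video_id": 1, "url": "http://example.org/a.mp4"},
    {"video_id": 2, "url": "http://example.org/b.mp4"},
]
video = {"id": 1, "title": "Episode 1", "program_id": 10}

doc = denormalize_video(video, programs, files)
```

The trade-off is that the flattened document must be rebuilt whenever any of its source rows change.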
  13. COVE API – Lessons learned
      • Piston is fine, but lacks flexibility without significant customization
        – TastyPie?
      • JSON is probably good enough
      • Don’t get fancy with your endpoints
      • Stick to REST principles
      • Don’t get fancy with your authentication
        – Use OAuth2 or simple token
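The "simple token" option on slide 13 could look roughly like the sketch below. The token store, client id, and function name are hypothetical, and this is not the COVE API's actual scheme. A real deployment would keep tokens out of source code and send them over HTTPS.

```python
import hmac

# Hypothetical token store; a real service would load this from a database
# or a secrets manager rather than hard-coding it.
API_TOKENS = {"demo-client": "s3cr3t-token"}


def is_authorized(client_id, presented_token):
    """Return True if the presented token matches the one issued to the client."""
    expected = API_TOKENS.get(client_id)
    if expected is None:
        return False
    # compare_digest takes time independent of where the strings differ,
    # which avoids leaking token contents through timing.
    return hmac.compare_digest(expected.encode(), presented_token.encode())
```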
  14. PBS.org and Merlin API
      • PBS.org
        – Slim, fast layer
        – Pulls data from Merlin API
        – Uses memcache extensively
        – Currently Django, but could be anything (Flask?)
      • Merlin API
        – Aggregate content from distributed CMSes
        – Expose via standardized API
        – Power PBS.org and more
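Slide 14 describes PBS.org as a thin layer that pulls data from the Merlin API and caches heavily. A minimal cache-aside sketch of that pattern follows. The in-memory `TTLCache` stands in for memcached, and the key and `fetch_program` function are made up for illustration.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcached, with per-key expiry."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)


def get_or_fetch(cache, key, fetch, ttl=300):
    """Return the cached value, calling fetch() only on a cache miss."""
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value, ttl)
    return value


calls = []


def fetch_program():
    # Hypothetical stand-in for an HTTP request to the backing API.
    calls.append(1)
    return {"title": "NOVA"}


cache = TTLCache()
first = get_or_fetch(cache, "program:nova", fetch_program)
second = get_or_fetch(cache, "program:nova", fetch_program)  # served from cache
```

With Django, the same shape is available through `django.core.cache.cache.get` and `cache.set` against a memcached backend.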
  15. Merlin API – Technology stack
      • Python
      • Django
      • MySQL
      • Piston
      • Solr
      • Celery
      • RabbitMQ
      • Amazon Web Services (“cloud”)
        – EC2
        – RDS - Relational Database Service
        – ELB - Elastic Load Balancing
        – CloudFront CDN
        – S3 Storage
  16. Data flow: RSS Feed Ingestor Standardized API
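The data flow on slide 16 (feeds in, standardized records out) can be sketched with the standard library alone. The feed content, the output field names, and the `ingest` function are assumptions for illustration; the real ingestor's record format is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed from a station CMS.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Station</title>
  <item><title>First story</title><link>http://example.org/1</link></item>
  <item><title>Second story</title><link>http://example.org/2</link></item>
</channel></rss>"""


def ingest(feed_xml):
    """Parse an RSS document and return one standardized dict per item."""
    root = ET.fromstring(feed_xml)
    source = root.findtext("channel/title", default="")
    return [
        {
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "source": source,
        }
        for item in root.iter("item")
    ]


records = ingest(SAMPLE_FEED)
```

Each source CMS can have its own parser as long as every parser emits the same record shape, which is what lets a single API serve them all.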
  17. Merlin API architecture
      • API endpoint: Django Piston
      • Search service: Django-haystack
      • Indexing service: Solr
      • Data layer: MySQL (RDS)
      • Administration: Django admin
      • Feed ingestion: Celery
  18. Merlin API server topology: Elastic Load Balancer Internet S3 backups Celery Master DB RDS Solr Index App #N App #N App #N App #n Autoscaling array
  19. Merlin API – Management Tools
      • Amazon Web Service Console
      • RightScale
      • Splunk
  20. API - Piston/Haystack/Solr

      from piston.handler import BaseHandler

      # Piston handler that serves results from the search index.
      class WebObjectIndexHandler(BaseHandler):
          ...
          def get_queryset(self):
              ...
              return PistonSearchQuerySet().models(*models)

      from haystack.query import SearchQuerySet

      # SearchQuerySet whose results are wrapped for Piston serialization.
      class PistonSearchQuerySet(SearchQuerySet):
          ...
          def __getitem__(self, k):
              ...
              return [IndexSerializer(i) for i in
                      super(PistonSearchQuerySet, self).__getitem__(k)]
  21. Feed ingestor - Celery

      from datetime import timedelta
      from celery.decorators import task, periodic_task

      # Runs every five minutes.
      @periodic_task(run_every=timedelta(seconds=300))
      def update_webobject_states():
          ...
          solr_visible = WebObject.children.filter(visible=True)
          solr_visible = solr_visible.exclude(
              flag__api_visible=True, available__isnull=True)
          ...
          # Hide the remaining objects and mark them for reindexing.
          updated = solr_visible.update(visible=False, is_indexed=False)
          ...
          signals.bulk_update.send('tasks.update_webobject_states')
  22. Merlin API - Lessons learned
      • Memcached was not necessary
      • Denormalized search data via Solr index is much faster than querying database
      • Asynchronous task delegation is awesome
      • Celery prone to memory leaks
      • App server array for easy horizontal scaling
        – Even if not autoscaling, increase min servers
      • Never trust data you don’t control (validate!)
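The last lesson on slide 22, validating data you do not control, might look like the following sketch for ingested feed items. The record shape (`title`, `url`) and the specific checks are assumptions chosen for illustration, not the rules Merlin applies.

```python
from urllib.parse import urlparse


def validate_item(item):
    """Return a list of problems; an empty list means the item is acceptable."""
    problems = []

    title = item.get("title")
    if not isinstance(title, str) or not title.strip():
        problems.append("missing title")

    url = item.get("url")
    parsed = urlparse(url) if isinstance(url, str) else None
    # Accept only absolute http(s) URLs with a host part.
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("invalid url")

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything wrong with a bad feed item in a single pass.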
  23. Resources
      • http://lucene.apache.org/solr/
      • http://haystacksearch.org/
      • http://celeryproject.org/
      • http://celeryproject.org/docs/django-celery/
      • http://aws.amazon.com/
  24. PBS Developer Community
      • Dedicated to making open.PBS the industry standard in open development communities.
      • http://open.pbs.org/
      • https://github.com/pbs
      • open@pbs.org
  25. Questions?
      • Drew Engelson, drew@engelson.net, http://tomatohater.com
      • Edgar Roman, emroman@pbs.org
