DCPython: Architecture at PBS (Jun 7, 2011)

Drew Engelson and Edgar Roman present on how PBS uses Python, Django, Celery, and Solr, with autoscaling on Amazon EC2, to power the highly trafficked http://www.pbs.org/ and related sites (such as http://video.pbs.org/).

Presentation Transcript

  • Architecture of PBS.org
    DCPython - June 7, 2011
  • PBS is…
    PBS is a national federation of independently owned and operated public television stations and producers
    Each with their own management and development resources
    1500+ highly trafficked websites:
    http://www.pbs.org/
    http://www.pbs.org/nova/
    http://pbskids.org/
    http://pbskids.org/sesame/
    http://video.pbs.org/
    Enterprise services/APIs
  • PBS is not!
    We do television dammit!
    Or any of the other ~200 local stations.
  • What we do
    Technology leadership within public broadcasting community
    Distribution of national programming content
    Services to local stations
    Core application development. Yeah!!!
  • A few of our sites
  • History of PBS.org
  • COVE API
    Contains the metadata for all PBS videos online including pointers to streaming video
    Needed to be:
    Secure
    Fast
    Scalable
  • COVE API – Technology Stack
    Amazon Elastic Compute Cloud (EC2)
    Amazon Relational Database Service (RDS)
    Linux
    Python
    Django
    Piston for REST API
  • COVE API - Architecture
    Internet
    Elastic Load Balancer
    Auto Scale Array

    App Server 1
    App Server N
    HA Proxy
    S3 Backups
    RDS Master
    RDS Slave 1
    RDS Slave 1
    RDS Slave 1
    App Sync Server
  • COVE API – Management Tools
    Amazon Web Service Console
    RightScale
    Splunk
  • COVE API – Interesting Stuff
    Easy to load test
    Duplicate environment for several days
    Easy to scale
    Autoscale array grows automatically
    Easy to upgrade
    Each server built from vanilla base
  • COVE API – Lessons learned
    Use normalized data for administration and de-normalized data for API
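One way to read this lesson is sketched below in plain Python: records stay normalized where editors maintain them, and a flattened copy is precomputed for API reads. The field names and the `PROGRAMS` / `VIDEOS` sample data are hypothetical and are not taken from the COVE schema.

```python
import json

# Hypothetical normalized records, as an admin tool might store them:
# each video points at its program by id instead of repeating program fields.
PROGRAMS = {1: {"title": "NOVA"}}
VIDEOS = [
    {"id": 10, "program_id": 1, "title": "Episode A",
     "stream_url": "http://example.org/a"},
]


def denormalize(video, programs):
    """Flatten one video and its program into a single API-ready document."""
    program = programs[video["program_id"]]
    return {
        "id": video["id"],
        "title": video["title"],
        "program_title": program["title"],
        "stream_url": video["stream_url"],
    }


def build_api_documents(videos, programs):
    """Precompute the JSON served by the API, keyed by video id."""
    return {
        v["id"]: json.dumps(denormalize(v, programs), sort_keys=True)
        for v in videos
    }
```

With this split, an API read becomes a single lookup with no joins, at the cost of rebuilding the flattened documents whenever the normalized records change.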
  • COVE API – Lessons learned
    Piston is fine, but lacks flexibility without significant customization
    TastyPie?
    JSON is probably good enough
    Don’t get fancy with your endpoints
    Stick to REST principles
    Don’t get fancy with your authentication
    Use OAuth2 or simple token
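As an illustration of the "simple token" option, here is a minimal sketch that uses only the standard library. The `Authorization: Token <value>` header format, the in-memory `API_TOKENS` store, and the function names are assumptions made for the example; they do not describe the COVE API's actual scheme.

```python
import hmac
import secrets

# Hypothetical in-memory token store; a real service would persist tokens.
API_TOKENS = {}


def issue_token(client_name):
    """Create a random token and remember which client it belongs to."""
    token = secrets.token_hex(16)
    API_TOKENS[token] = client_name
    return token


def authenticate(headers):
    """Return the client name for a valid 'Authorization: Token <value>'
    header, or None if the header is missing or the token is unknown."""
    scheme, _, presented = headers.get("Authorization", "").partition(" ")
    if scheme != "Token" or not presented:
        return None
    for known, client in API_TOKENS.items():
        # compare_digest avoids leaking match position through timing.
        if hmac.compare_digest(known.encode(), presented.encode()):
            return client
    return None
```

A request handler would call `authenticate(request_headers)` first and reject the request when it returns `None`.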
  • PBS.org and Merlin API
    PBS.org
    Slim, fast layer
    Pulls data from Merlin API
    Uses memcache extensively
    Currently Django, but could be anything (Flask?)
    Merlin API
    Aggregate content from distributed CMSes
    Expose via standardized API
    Power PBS.org and more
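The "slim layer over an API, cached heavily" idea can be sketched as a cache-aside lookup. The example below is an in-process stand-in for memcache, written only to show the pattern; `TTLCache`, `get_or_fetch`, and the `fetch` callback are hypothetical names, not part of PBS.org or the Merlin API.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a memcache-style cache with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) on a miss
        or after the cached entry has expired."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch(key)
        self._store[key] = (now + self.ttl, value)
        return value
```

In a front-end view, `fetch` would be the function that calls the backing API, so repeated page loads within the TTL never reach it.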
  • Merlin API – Technology stack
    • Python
    • Django
    • MySQL
    • Piston
    • Solr
    • Celery
    • RabbitMQ
    • Amazon Web Services (“cloud”)
    • EC2
    • RDS - Relational Database Service
    • ELB - Elastic Load Balancing
    • Cloudfront CDN
    • S3 Storage
  • Data flow
    RSS Feed
    Ingestor
    Standardized API
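The flow above (RSS feed, ingestor, standardized API) could look roughly like the following sketch, which maps RSS `<item>` elements to uniform records using the standard library. The record fields, the `ingest` function, and `SAMPLE_RSS` are illustrative assumptions; the real Merlin ingestor is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed used only for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>http://example.org/1</link><guid>a1</guid></item>
<item><title> Second </title><link>http://example.org/2</link></item>
</channel></rss>"""


def ingest(rss_text, source):
    """Convert every <item> in an RSS document into a uniform record."""
    root = ET.fromstring(rss_text)
    records = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        records.append({
            "source": source,
            # Fall back to the link when the feed provides no <guid>.
            "id": (item.findtext("guid") or link).strip(),
            "title": (item.findtext("title") or "").strip(),
            "url": link,
        })
    return records
```

Each CMS can then publish whatever feed it likes, while API consumers only ever see the one record shape.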
  • Merlin API architecture
    API Endpoint – Django Piston
    Search service
    Django-haystack
    Indexing service
    Solr
    Data layer – MySQL (RDS)
    Administration
    Django admin
    Feed ingestion
    Celery
  • Merlin API server topology
    Internet
    Elastic Load Balancer
    App #N
    Master
    DB RDS
    Solr Index
    Autoscaling array
    Celery
    App #N
    App #N
    App #n
    S3 backups
  • Merlin API – Management Tools
    Amazon Web Service Console
    RightScale
    Splunk
  • API - Piston/Haystack/Solr
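    from haystack.query import SearchQuerySet
    from piston.handler import BaseHandler

    class WebObjectIndexHandler(BaseHandler):
        ...
        def get_queryset(self):
            ...
            # Query the Solr index through Haystack instead of the database.
            return PistonSearchQuerySet().models(*models)

    class PistonSearchQuerySet(SearchQuerySet):
        ...
        def __getitem__(self, k):
            ...
            # Wrap each search result so that Piston can serialize it.
            return [IndexSerializer(i) for i in
                    super(PistonSearchQuerySet, self).__getitem__(k)]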
  • Feed ingestor - Celery
    from datetime import timedelta

    from celery.decorators import task, periodic_task

    # Runs every five minutes.
    @periodic_task(run_every=timedelta(seconds=300))
    def update_webobject_states():
        ...
        solr_visible = WebObject.children.filter(visible=True)
        solr_visible = solr_visible.exclude(
            flag__api_visible=True, available__isnull=True)
        ...
        updated = solr_visible.update(visible=False,
                                      is_indexed=False)
        ...
        signals.bulk_update.send('tasks.update_webobject_states')
  • Merlin API - Lessons learned
    Memcached was not necessary
    Denormalized search data via Solr index is much faster than querying database
    Asynchronous task delegation is awesome
    Celery prone to memory leaks
    App server array for easy horizontal scaling
    Even if not autoscaling, increase min servers
    Never trust data you don’t control (validate!)
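To make the last point concrete, the sketch below validates an ingested record before it is stored. The required fields, the 200-character title limit, and the function names are assumptions chosen for illustration, not rules from the Merlin API.

```python
from urllib.parse import urlparse


def _is_http_url(value):
    """True only for absolute http or https URLs."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value)
    except ValueError:  # raised for some malformed inputs
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def validate_record(record):
    """Return a list of problems found in an externally supplied record.
    An empty list means the record passed every check."""
    errors = []
    title = record.get("title")
    if not isinstance(title, str) or not title.strip():
        errors.append("title is required")
    elif len(title) > 200:  # hypothetical limit
        errors.append("title is too long")
    if not _is_http_url(record.get("url")):
        errors.append("url must be an absolute http(s) URL")
    return errors
```

Records that return a non-empty list can be skipped or logged by the ingestor instead of being written to the database or the search index.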
  • Resources
    http://lucene.apache.org/solr/
    http://haystacksearch.org/
    http://celeryproject.org/
    http://celeryproject.org/docs/django-celery/
    http://aws.amazon.com/
  • PBS Developer Community
    Dedicated to making open.PBS the industry standard in open development communities.
    http://open.pbs.org/
    https://github.com/pbs
    open@pbs.org
  • Questions?
    Drew Engelson
    drew@engelson.net
    http://tomatohater.com
    Edgar Roman
    emroman@pbs.org