Building Lanyrd

  1. 1. Building Lanyrd Simon Willison BrightonPy, 9th August 2011
  2. 2. Definitive database of professional events and speakers
  3. 3. Definitive database of professional events and speakers • Social event recommendation • Comprehensive speaker profiles • Archive of slides, notes and video
  4. 4. A brief history
  5. 5. Casablanca! August 2010
  6. 6. • Aug 31st, 11:22: Launch! (1 linode) • Aug 31st, 12:41: Unlaunch • Aug 31st, 12:54: Read only mode • Aug 31st, 14:15: DB server (2 linodes) • Sep 1st: Limit 50 on dashboard • Sep 1st: disable-dashboard setting
  7. 7. • Sep 3rd: dConstruct (and Twitter bot) • Sep 4th: TechCrunched (read only :( ) • Sep 5th: 3 large EC2 + 1 RDS • Sep 6th: Downgrade to 3 small EC2
  8. 8. December photo: @niqui
  9. 9. • Dec 8: Calacanis + Scoble at the same time! • Upgrade to next size of RDS • (Sometimes scaling vertically does the job)
  10. 10. • Jan 26th: Solr powered dashboard • Replicated to 2, then 3 servers
  11. 11. • Load balancer (nginx) • HTTP cache (varnish) • 3 app servers (django/mod_wsgi) • Database (MySQL RDS) • Search: 1 master + 2 slaves (solr) • Redis (data structures + message queue) • 2 workers (celery) • Logging (MongoDB)
  12. 12. Solr + Haystack
  apache > lucene > solr
  More Like This
Faceting
Stored (non-indexed) fields
Highlighting
Spelling Suggestions
Boost

Haystack is BSD licensed, plays nicely with third-party apps without needing to modify their source, and supports Solr, Whoosh and Xapian.

1. Get the most recent source.
2. Add haystack to your INSTALLED_APPS.
3. Create files for your models.
4. Set up the main SearchIndex via autodiscover.
5. Add haystack.urls to your URLconf.
6. Search!
  15. 15. Model-oriented search • Define (like for your application • Hook up default haystack search views • Write a quick search.html template • Run ./ rebuild_index
      from haystack import indexes

      class BookIndex(indexes.SearchIndex):
          # Main document field, rendered from a template
          text = indexes.CharField(document=True, use_template=True)
          speakers = indexes.MultiValueField()
          topics = indexes.MultiValueField()

          def prepare_speakers(self, obj):
              # Twitter IDs of the authors that are linked to a user
              return [a.user.t_id for a in obj.authors.exclude(
                  user = None
              ).select_related('user')]

          def prepare_topics(self, obj):
              # Primary keys of the attached topics
              return list(obj.topics.values_list('pk', flat=True))
  18. 18. search/indexes/books/book_text.txt
      {{ object.title }}
      {{ object.tagline }}
      {% for author in object.authors.all %}
        {{ author.display_name }}
        {{ author.user.t_screen_name }}
      {% endfor %}
      {% for topic in object.topics.all %}
        {{ topic.name_en }}
      {% endfor %}
  19. 19. Staying fresh • Search engines usually don’t like accepting writes too frequently • RealTimeSearchIndex for low traffic sites • ./ update_index --age=6 (hours) • Uses index.get_updated_field() • Roll your own (message queue or similar...)
  20. 20. Replication Solr Master Solr Slave Solr Slave Solr Slave
  21. 21. Smarter indexing
      from django.db import models

      class Article(models.Model):
          # Flag set on every save; the indexing job clears it
          needs_indexing = models.BooleanField(
              default = True, db_index = True
          )
          ...

          def save(self, *args, **kwargs):
              self.needs_indexing = True
              super(Article, self).save(*args, **kwargs)
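  22. 22. Smarter indexing (continued)
      # Body of the indexing job (runs inside a function, hence the bare return)
      index = site.get_index(model)
      updated_pks = []
      # Take at most 100 flagged objects per run
      objects = index.load_all_queryset().filter(
          needs_indexing=True
      )[:100]
      if not objects:
          return
      for obj in objects:
          index.update_object(obj)
          updated_pks.append(obj.pk)
      # Clear the flag only for the objects that were just indexed
      index.load_all_queryset().filter(
          pk__in = updated_pks
      ).update(needs_indexing = False)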
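The flag-and-batch idea on the two slides above can be sketched without Django or Solr. This is only an illustration under assumptions: the dict-based `records`, the `index_one` callback and the name `index_pending` are hypothetical stand-ins for the model rows, `index.update_object` and the real job.

```python
def index_pending(records, index_one, batch_size=100):
    """Index up to batch_size flagged records, then clear their flags.

    records   -- list of dicts with "pk" and "needs_indexing" keys
                 (hypothetical stand-in for database rows)
    index_one -- callable that sends one record to the search engine
    Returns the list of primary keys that were indexed.
    """
    batch = [r for r in records if r["needs_indexing"]][:batch_size]
    updated_pks = []
    for record in batch:
        index_one(record)
        updated_pks.append(record["pk"])
    # Clear the flag only for the records that were actually indexed
    done = set(updated_pks)
    for record in records:
        if record["pk"] in done:
            record["needs_indexing"] = False
    return updated_pks


# Usage: five flagged records, processed two at a time
records = [{"pk": pk, "needs_indexing": True} for pk in range(1, 6)]
indexed = []
first_batch = index_pending(records, indexed.append, batch_size=2)
```

Running the job repeatedly (from cron or a task queue, for example) drains the backlog in small batches, so the search engine never receives one huge write.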
  23. 23. nginx + Solr replication trick
      # Server addresses are omitted in the source
      upstream solrmaster {
          server;
      }
      upstream solrslaves {
          server;
          server;
          server;
      }
      server {
          listen 8983;
          # Writes go to the master
          location /solr/update {
              proxy_pass http://solrmaster;
          }
          # Reads are spread across the slaves
          location /solr/select {
              proxy_pass http://solrslaves;
          }
      }
  Your contacts' calendar
yours 24 contacts 182

We've found 182 conferences your Twitter contacts are interested in.

Café Scientifique: Exploring the dark side of star formation with the Herschel Space Observatory
United Kingdom / Brighton
21st June 2011
Astronomy Science
4 contacts tracking

Usability Professionals' Association – International Conference
United States / Atlanta
21st–24th June 2011
Usability User Experience
1 contact speaking and 3 contacts tracking
  25. 25. # Original implementation
      twitter_ids = [11134, 223455, 33221, ...]  # fetch from Twitter
      attendees = Attendee.objects.filter(
          user__t_id__in = twitter_ids
      ).filter(
          conference__start_date__gte =   # value missing in the source
      )
  26. 26. # Current implementation
      twitter_ids = [11134, 223455, 33221, ...]  # fetch from Twitter
      sqs = SearchQuerySet()
      sqs = sqs.models(Conference)
      # join() needs strings, so convert the integer IDs first
      or_string = ' OR '.join(str(t_id) for t_id in twitter_ids)
      sqs = sqs.narrow('attendees:(%s)' % or_string)
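The string passed to `narrow()` above can be built and checked in isolation. A minimal sketch, assuming integer Twitter IDs; the helper name `attendees_narrow_query` is hypothetical and not part of Haystack.

```python
def attendees_narrow_query(twitter_ids):
    """Build a Solr filter matching any of the given attendee IDs."""
    # IDs are integers, so convert them before joining
    or_string = " OR ".join(str(t_id) for t_id in twitter_ids)
    return "attendees:(%s)" % or_string


query = attendees_narrow_query([11134, 223455, 33221])
# query == "attendees:(11134 OR 223455 OR 33221)"
```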
  27. 27. Redis
Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

strings hashes lists sets
  29. 29. Set intersection
      simonw-follows:{144,21345,12328...}
      europython-attendees:{344,21345,787...}

      contact_ids = redis.sinter(
          'simonw-follows', 'europython-attendees'
      )
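What `SINTER` returns for the two keys above can be mimicked with plain Python sets. This is a sketch of the semantics only: the in-memory `store` dict and the local `sinter` function are stand-ins, not a Redis client, and the sets hold just the IDs visible on the slide.

```python
store = {
    "simonw-follows": {144, 21345, 12328},
    "europython-attendees": {344, 21345, 787},
}


def sinter(store, *keys):
    """Return the members common to every named set.

    A missing key is treated as an empty set, as Redis does.
    """
    sets = [store.get(key, set()) for key in keys]
    return set.intersection(*sets) if sets else set()


contact_ids = sinter(store, "simonw-follows", "europython-attendees")
# contact_ids == {21345}: the one followed account that is also attending
```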
  EuroPython 2011
The European Python Conference
19 –26 JUNE 2011
Florence in Italy

97 attending
80 tracking
119 speakers

Topics
Django
Plone
Pyramid
Python
Twisted
  31. 31. Celery
  Distributed Task Queue

Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well.

The execution units, called tasks, are executed concurrently on one or more worker servers using multiprocessing, Eventlet, or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).

Celery is used in production systems to process millions of tasks a day.

Celery is written in Python, but the protocol can be implemented in any language. It can also operate with other languages using webhooks.

The recommended message broker is RabbitMQ, but limited support for Redis, Beanstalk, MongoDB, CouchDB, and databases (using SQLAlchemy or the Django ORM) is also available.

Celery is easy to integrate with Django, Pylons and Flask, using the django-celery, celery-pylons and Flask-Celery add-on packages.

Example

This is a simple task adding two numbers:
  33. 33. Tasks? • Anything that takes more than about 200ms • Updating a search index • Resizing images • Hitting external APIs • Generating reports
  34. 34. Trivial example • Fetch the content of a web page
      import urllib
      from celery.task import task

      @task
      def fetch_url(url):
          return urllib.urlopen(url).read()

      >>> result = fetch_url.delay('')
      >>> html = result.wait()
  Add coverage

Python and MongoDB tutorial
Python mongo db-training-europython-2011
EuroPython 2011
Italy / Florence
19th–26th June 2011

Type of coverage
Link Audio Liveblog Write-up Sketch notes Photos Slides Transcript Notes Video Handout
  Coverage preview
From SlideShare:
  37. 37. The task itself... • Tries using to find a preview • Fetches the HTTP headers and first 2048 bytes • If HTML, attempts to extract the <title> • If other, gets the file type and size from headers
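The last three bullets above can be sketched as a pure function over data that has already been fetched. Everything here is an assumption for illustration: the name `preview_from_head`, its arguments and the returned dict shape are hypothetical, and a regex is a crude substitute for a real HTML parser.

```python
import re

# Matches the first <title> element in a chunk of HTML bytes
TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def preview_from_head(content_type, head, content_length=None):
    """Describe a link from its headers and the first bytes of its body."""
    head = head[:2048]  # only ever look at the first 2048 bytes
    mime = content_type.split(";")[0].strip().lower()
    if mime == "text/html":
        match = TITLE_RE.search(head)
        title = None
        if match:
            text = match.group(1).decode("utf-8", "replace")
            title = " ".join(text.split())  # collapse whitespace
        return {"kind": "html", "title": title}
    # Anything else: report type and size straight from the headers
    return {"kind": "file", "type": mime, "size": content_length}


html_preview = preview_from_head(
    "text/html; charset=utf-8",
    b"<html><head><title> Building\n Lanyrd </title></head>",
)
pdf_preview = preview_from_head("application/pdf", b"%PDF-1.4", 1234)
```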
  38. 38. Behind the scenes...
      ar = enhance_link.delay(url)
      poll_url = '/working/%s/' % signed.dumps({
          'task_id': ar.task_id,
          'on_done_url': on_done_url,
      })
      if 'ajax' in request.POST:
          return render_json(request, {
              'ok': True,
              'poll_url': poll_url,
          })
      else:
          return HttpResponseRedirect(poll_url)
  39. 39. And when it’s done...
      from celery.backends import default_backend
      ...
      task_id = request.REQUEST.get('id', '')
      result = default_backend.get_result(task_id)
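The start-then-poll flow on the two slides above can be tried without a broker. In this sketch a thread pool stands in for the Celery workers and a dict for the result backend; `start_task`, `poll` and `wait_for` are hypothetical names, not Celery APIs.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)  # stand-in for the workers
pending = {}  # task_id -> Future, stand-in for the result backend


def start_task(func, *args):
    """Queue func(*args) and return an ID the client can poll with."""
    task_id = uuid.uuid4().hex
    pending[task_id] = executor.submit(func, *args)
    return task_id


def poll(task_id):
    """Report whether the task has finished, and its result if so."""
    future = pending.get(task_id)
    if future is None:
        return {"ok": False, "error": "unknown task"}
    if not future.done():
        return {"ok": True, "done": False}
    return {"ok": True, "done": True, "result": future.result()}


def wait_for(task_id, timeout=5.0, interval=0.01):
    """Poll until the task is done or the timeout expires."""
    deadline = time.time() + timeout
    status = poll(task_id)
    while status.get("done") is False and time.time() < deadline:
        time.sleep(interval)
        status = poll(task_id)
    return status


task_id = start_task(str.upper, "enhance me")
final_status = wait_for(task_id)
```

The request that starts the work returns immediately with only an ID, and the slow part never blocks a web worker.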
  40. 40. Configuration
      # Carrot / Celery: queue uses Redis
      CARROT_BACKEND = "ghettoq.taproot.Redis"
      BROKER_HOST = ""  # redis server
      BROKER_PORT = 6379
      BROKER_VHOST = "6"

      # Task results stored in memcached, so they can
      # expire automatically
      CELERY_RESULT_BACKEND = "cache"
      CELERY_CACHE_BACKEND = "memcached://;..."
  41. 41. Tricks
  42. 42. Phantom load testing • Deploy a new architecture on a brand new EC2 cluster • Leave your existing site on the old cluster • Invisibly link to the new stack from an <img width=1 height=1> element on your live site (not for very long though) • (sensible alternative: find a way to replay log files)
  43. 43. cache_version
  Django conferences

Django events looking for participants
1 Django event is looking for participants

Django coverage
52 videos Most recent added 3 weeks ago
52 slide decks Most recent added 4 hours ago
3 audio clips Most recent added 1 week ago
27 write-ups Most recent added 1 week ago
11 handouts Most recent added 18 hours ago
3 notes Most recent added 10 hours ago

EuroPython 2011
Italy / Florence
19th–26th June 2011
Django Plone Pyramid Python Twisted

DjangoCon US 2011
United States / Portland
6th–8th September 2011
Django Open Source Python

PyCON FR 2011
France / Rennes
17th–18th September 2011
Django Python

PyCon DE 2011
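  45. 45. cache_version on the model
      from django.db import models
      from django.db.models import F

      class Conference(models.Model):
          ...
          cache_version = models.IntegerField(default = 0)

          def save(self, *args, **kwargs):
              # Every save invalidates cached fragments for this conference
              self.cache_version += 1
              super(Conference, self).save(*args, **kwargs)

          def touch(self):
              # Bump the version in the database without a full save()
              Conference.objects.filter(pk = self.pk).update(
                  cache_version = F('cache_version') + 1
              )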
  46. 46. Using it in a template
      {% cache 36000 conf-topics conference.cache_version %}
      <ul class="tags inline-tags meta">
        {% for topic in conference.topics.all %}
          <li><a href="{{ topic.get_absolute_url }}">{{ topic }}</a></li>
        {% endfor %}
      </ul>
      {% endcache %}
  47. 47. Bulk invalidation
      from django.db.models import F

      topic.conferences.all().update(
          cache_version = F('cache_version') + 1
      )
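Why bumping a counter invalidates a cached fragment can be shown with a dict in place of memcached. A minimal sketch under assumptions: `fragment_key` and `cached_fragment` are hypothetical helpers, and the key format is not the one Django's `{% cache %}` tag produces.

```python
cache = {}  # stand-in for memcached


def fragment_key(name, version):
    """The version is part of the key, so a new version is a new entry."""
    return "%s:%d" % (name, version)


def cached_fragment(name, version, render):
    """Return the cached fragment, rendering it only on a cache miss."""
    key = fragment_key(name, version)
    if key not in cache:
        cache[key] = render()
    return cache[key]


render_calls = []


def render_topics():
    render_calls.append(1)
    return "<ul>topics, render #%d</ul>" % len(render_calls)


first = cached_fragment("conf-topics", 0, render_topics)   # miss: renders
second = cached_fragment("conf-topics", 0, render_topics)  # hit: reuses
third = cached_fragment("conf-topics", 1, render_topics)   # new version: renders
```

Nothing is ever deleted: stale entries are simply never requested again and fall out of the cache when they expire or get evicted.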
  48. 48. Signing
  49. 49. Pass data through an untrusted source with confidence that it hasn't been tampered with
  50. 50. Signing uses • "Unsubscribe" links in emails • ?redirect_to=URL protection • Signed cookies • "You are logged in as simonw" without hitting the database
  51. 51. Signing in Django 1.4
      from django.core import signing
      signing.dumps({"foo": "bar"})
      signing.loads(signed_string)
      response.set_signed_cookie(key, value...)
      request.get_signed_cookie(key)
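The idea behind signing fits in a few lines of standard-library Python. This is a sketch of the principle only, not Django's implementation or token format: the `dumps`/`loads` functions here are local, and `SECRET_KEY` is a placeholder value.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"not-a-real-secret"  # placeholder; keep the real one private


def dumps(obj, key=SECRET_KEY):
    """Serialize obj and append an HMAC so tampering can be detected."""
    payload = base64.urlsafe_b64encode(
        json.dumps(obj, sort_keys=True).encode("utf-8")
    )
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    return (payload + b":" + sig).decode("ascii")


def loads(signed, key=SECRET_KEY):
    """Verify the signature, then decode; raise ValueError on tampering."""
    payload, sep, sig = signed.encode("ascii").rpartition(b":")
    if not sep:
        raise ValueError("missing signature")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    # Constant-time comparison avoids leaking how many characters matched
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    return json.loads(base64.urlsafe_b64decode(payload).decode("utf-8"))


token = dumps({"foo": "bar"})
round_trip = loads(token)
```

The data is readable by anyone who holds the token; signing only guarantees it has not been changed.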
  52. 52. Hashed static asset filenames in S3/CloudFront
  53. 53. global.js global.ed81d119.js
  54. 54. Benefits • Far-future expiry headers • Cache-Control: max-age=315360000 • Expires: Fri, 18 Jun 2021 06:45:00 -0000 GMT • Guaranteed updated CSS in IE • Deploy new assets in advance of application • Old versions stick around for rollbacks
  55. 55. ./ push_static • Minifies JavaScript and CSS • Renames files to include sha1(contents)[:6] • Pushes all assets to S3
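The renaming step above can be sketched as follows. The function name `hashed_name` is hypothetical; the default of six characters follows the bullet above, although the example filename on slide 53 shows eight, so the length is left as a parameter.

```python
import hashlib
import os


def hashed_name(filename, contents, length=6):
    """Insert a prefix of sha1(contents) before the file extension."""
    digest = hashlib.sha1(contents).hexdigest()[:length]
    root, ext = os.path.splitext(filename)
    return "%s.%s%s" % (root, digest, ext)


old = hashed_name("global.js", b"var a = 1;")
new = hashed_name("global.js", b"var a = 2;")
# old and new differ, so browsers never reuse a stale cached copy
```

Because the name depends only on the contents, unchanged files keep their URL across deploys and stay cached.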
  56. 56. Profiling and debugging production systems
  57. 57. UserBasedExceptionMiddleware
      from django.views.debug import technical_500_response
      import sys

      class UserBasedExceptionMiddleware(object):
          def process_exception(self, request, exception):
              # Superusers get the full debug page, even in production
              if request.user.is_superuser:
                  return technical_500_response(request, *sys.exc_info())
  58. 58. mysql-proxy • Very handy Lua-customisable proxy for all of your MySQL traffic • Worst documented software ever • log.lua - logs out ALL queries
  59. 59. django_instrumented • (Unreleased) code I wrote for Lanyrd • Collects various runtime stats about the current request, stashes a profile JSON in memcached • Writes out the profile UUID as part of the HTML • A bookmarklet to view the profile
  60. 60. mongodb logging • Super-fast inserts, log everything! • Capped collections • Structured queries • Ask me about it in a few months
  61. 61. For the future... • Much better profiling, monitoring and alerts • Varnish in front of everything • Replicated MySQL for analytics + upgrades
  62. 62. Questions?
  63. 63. Thank you!