Crowdsourcing with Django

A talk presented at EuroPython on 30th June 2009.

Transcript

  • 1. Crowdsourcing with Django EuroPython, 30th June 2009 Simon Willison · http://simonwillison.net/ · @simonw
  • 2. “Web development on journalism deadlines”
  • 3. The back story...
  • 4. November 2000 The Freedom of Information Act
  • 5. Heather Brooke
      • http://www.guardian.co.uk/politics/2009/may/08/mps-expenses-telegraph-checquebook-journalism
      • http://www.guardian.co.uk/politics/2009/may/15/mps-expenses-heather-brooke-foi
  • 6. 2004 The request
  • 7. January 2005 The FOI request
  • 8. July 2006 The FOI commissioner
  • 9. May 2007 The FOI (Amendment) Bill
  • 10. February 2008 The Information Tribunal
  • 11. “Transparency will damage democracy”
  • 12. May 2008 The high court
  • 13. January 2009 The exemption law
  • 14. March 2009 The mole
  • 15. “All of the receipts of 650-odd MPs, redacted and unredacted, are for sale at a price of £300,000, so I am told. The price is going up because of the interest in the subject.” Sir Stuart Bell, MP Newsnight, 30th March
  • 16. 8th May, 2009 The Daily Telegraph
  • 17. At the Guardian...
  • 18. April: “Expenses are due out in a couple of months, is there anything we can do?”
  • 19. June: “Expenses have been bumped forward, they’re out next week!”
  • 20. Thursday 11th June The proof-of-concept
  • 21. Monday 15th June The tentative go-ahead
  • 22. Tuesday 16th June Designer + client-side engineer
  • 23. Wednesday 17th June Operations engineer
  • 24. Thursday 18th June Launch day!
  • 25. How we built it
  • 26. $ convert Frank_Comm.pdf pages.png
  • 27. Models
  • 28.
        class Party(models.Model):
            name = models.CharField(max_length=100)

        class Constituency(models.Model):
            name = models.CharField(max_length=100)

        class MP(models.Model):
            name = models.CharField(max_length=100)
            party = models.ForeignKey(Party)
            constituency = models.ForeignKey(Constituency)
            guardian_url = models.CharField(max_length=255, blank=True)
            guardian_image_url = models.CharField(max_length=255, blank=True)
  • 29.
        class FinancialYear(models.Model):
            name = models.CharField(max_length=10)

        class Document(models.Model):
            title = models.CharField(max_length=100, blank=True)
            filename = models.CharField(max_length=100)
            mp = models.ForeignKey(MP)
            financial_year = models.ForeignKey(FinancialYear)

        class Page(models.Model):
            document = models.ForeignKey(Document)
            page_number = models.IntegerField()
  • 30.
        class User(models.Model):
            created = models.DateTimeField(auto_now_add = True)
            username = models.TextField(max_length = 100)
            password_hash = models.CharField(max_length = 128, blank=True)

        class LineItemCategory(models.Model):
            order = models.IntegerField(default = 0)
            name = models.CharField(max_length = 32)

        class LineItem(models.Model):
            user = models.ForeignKey(User)
            page = models.ForeignKey(Page)
            type = models.CharField(max_length = 16, choices = (
                ('claim', 'claim'),
                ('proof', 'proof'),
            ), db_index = True)
            date = models.DateField(null = True, blank = True)
            amount = models.DecimalField(max_digits=20, decimal_places=2)
            description = models.CharField(max_length = 255, blank = True)
            created = models.DateTimeField(auto_now_add = True, db_index = True)
            categories = models.ManyToManyField(LineItemCategory, blank=True)
  • 31.
        class Vote(models.Model):
            user = models.ForeignKey(User, related_name = 'votes')
            page = models.ForeignKey(Page, related_name = 'votes')
            obsolete = models.BooleanField(default = False)
            vote_type = models.CharField(max_length = 32, blank = True)
            ip_address = models.CharField(max_length = 32)
            created = models.DateTimeField(auto_now_add = True)

        class TypeVote(Vote):
            type = models.CharField(max_length = 10, choices = (
                ('claim', 'Claim'),
                ('proof', 'Proof'),
                ('blank', 'Blank'),
                ('other', 'Other')
            ))

        class InterestingVote(Vote):
            status = models.CharField(max_length = 10, choices = (
                ('no', 'Not interesting'),
                ('yes', 'Interesting'),
                ('known', 'Interesting but known'),
                ('very', 'Investigate this!'),
            ))
  • 32. Frictionless registration
  • 33. Page filters
  • 34.
        page_filters = (
            # Maps name of filter to dictionary of kwargs to doc.pages.filter()
            ('reviewed', {
                'votes__isnull': False
            }),
            ('unreviewed', {
                'votes__isnull': True
            }),
            ('with line items', {
                'line_items__isnull': False
            }),
            ('interesting', {
                'votes__interestingvote__status': 'yes'
            }),
            ('interesting but known', {
                'votes__interestingvote__status': 'known'
            ...
        )
        page_filters_lookup = dict(page_filters)
  • 35.
        pages = doc.pages.all()
        if page_filter:
            kwargs = page_filters_lookup.get(page_filter)
            if kwargs is None:
                raise Http404('Invalid page filter: %s' % page_filter)
            pages = pages.filter(**kwargs).distinct()

        # Build the filters
        filters = []
        for name, kwargs in page_filters:
            filters.append({
                'name': name,
                'count': doc.pages.filter(**kwargs).distinct().count(),
            })
  • 36. Matching names
  • 37. http://github.com/simonw/datamatcher
  • 38. On the day
  • 39.
        def get_mp_pages():
            "Returns list of (mp-name, mp-page-url) tuples"
            soup = Soup(urllib.urlopen(INDEX_URL))
            mp_links = []
            for link in soup.findAll('a'):
                if link.get('title', '').endswith("'s allowances"):
                    mp_links.append(
                        (link['title'].replace("'s allowances", ''), link['href'])
                    )
            return mp_links
  • 40.
        def get_pdfs(mp_url):
            "Returns list of (description, years, pdf-url, size) tuples"
            soup = Soup(urllib.urlopen(mp_url))
            pdfs = []
            trs = soup.findAll('tr')[1:]  # Skip the first, it's the table header
            for tr in trs:
                name_td, year_td, pdf_td = tr.findAll('td')
                name = name_td.string
                year = year_td.string
                pdf_url = pdf_td.find('a')['href']
                size = pdf_td.find('a').contents[-1].replace('(', '').replace(')', '')
                pdfs.append(
                    (name, year, pdf_url, size)
                )
            return pdfs
  • 41. “Drop Everything”
  • 42. Photoshop + AppleScript vs. Java + IntelliJ
  • 43. Images on our docroot (S3 upload was taking too long)
  • 44. Blitz QA
  • 45. Launch! (on EC2)
  • 46. Crash #1: more Apache children than MySQL connections
  • 47.
        unreviewed_count = Page.objects.filter(
            votes__isnull = True
        ).distinct().count()
  • 48.
        SELECT COUNT(DISTINCT `expenses_page`.`id`)
        FROM `expenses_page`
        LEFT OUTER JOIN `expenses_vote` ON (
            `expenses_page`.`id` = `expenses_vote`.`page_id`
        )
        WHERE `expenses_vote`.`id` IS NULL
  • 49.
        unreviewed_count = cache.get('homepage:unreviewed_count')
        if unreviewed_count is None:
            unreviewed_count = Page.objects.filter(
                votes__isnull = True
            ).distinct().count()
            cache.set('homepage:unreviewed_count', unreviewed_count, 60)
  • 50.
      • With 70,000 pages and a LOT of votes...
      • DB takes up 135% of CPU
      • Cache the count in memcached...
      • DB drops to 35% of CPU
  • 51.
        unreviewed_count = Page.objects.filter(
            votes__isnull = True
        ).distinct().count()
        reviewed_count = Page.objects.filter(
            votes__isnull = False
        ).distinct().count()
  • 52.
        unreviewed_count = Page.objects.filter(
            is_reviewed = False
        ).count()
  • 53. Migrating to InnoDB on a separate server
  • 54.
        ssh mps-live "mysqldump mp_expenses" |
            sed 's/ENGINE=MyISAM/ENGINE=InnoDB/g' |
            sed 's/CHARSET=latin1/CHARSET=utf8/g' |
            ssh mysql-big "mysql -u root mp_expenses"
  • 55. “next” button
  • 56.
        def next_global(request):
            # Next unreviewed page from the whole site
            all_unreviewed_pages = Page.objects.filter(
                is_reviewed = False
            ).order_by('?')
            if all_unreviewed_pages:
                return Redirect(
                    all_unreviewed_pages[0].get_absolute_url()
                )
            else:
                return HttpResponse(
                    'All pages have been reviewed!'
                )
  • 57.
        import random

        def next_global_from_cache(request):
            page_ids = cache.get('unreviewed_page_ids')
            if page_ids:
                return Redirect(
                    '/page/%s/' % random.choice(page_ids)
                )
            else:
                return next_global(request)
  • 58.
        from django.core.management.base import BaseCommand
        from mp_expenses.expenses.models import Page
        from django.core.cache import cache

        class Command(BaseCommand):
            help = """
            populate unreviewed_page_ids in memcached
            """
            requires_model_validation = True
            can_import_settings = True

            def handle(self, *args, **options):
                ids = list(Page.objects.exclude(
                    is_reviewed = True
                ).values_list('pk', flat=True)[:1000])
                cache.set('unreviewed_page_ids', ids)
  • 59. The numbers
  • 60. Final thoughts
      • High score tables help
      • MP photographs really help
      • Keeping up the interest is hard
      • Next step: start releasing the data
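
The two caching ideas from the slides (a homepage count cached for 60 seconds, and a "next" button that picks at random from a pre-populated list of unreviewed page IDs) can be tried out without Django or memcached. What follows is only a rough sketch under stated assumptions, not the code that ran on the site: `TTLCache`, `PAGES`, `stats`, `unreviewed_ids_from_db` and `next_page_url` are hypothetical stand-ins invented for illustration, and only the cache keys and the `/page/%s/` URL pattern come from the slides.

```python
import random
import time


class TTLCache:
    """Tiny in-memory stand-in for memcached (hypothetical, illustration only)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout=None):
        expires_at = None if timeout is None else time.monotonic() + timeout
        self._store[key] = (value, expires_at)


cache = TTLCache()

# Hypothetical stand-in for the Page table: page id -> is_reviewed flag.
PAGES = {1: False, 2: True, 3: False, 4: True, 5: False}

# Counts how often the pretend database is queried.
stats = {'db_queries': 0}


def unreviewed_ids_from_db():
    """Pretend database query returning the ids of unreviewed pages."""
    stats['db_queries'] += 1
    return [pk for pk, reviewed in PAGES.items() if not reviewed]


def get_unreviewed_count():
    """Read-through cache: only query the database on a cache miss."""
    count = cache.get('homepage:unreviewed_count')
    if count is None:
        count = len(unreviewed_ids_from_db())
        cache.set('homepage:unreviewed_count', count, 60)
    return count


def populate_unreviewed_page_ids(limit=1000):
    """What a periodic job would do: store a capped list of ids in the cache."""
    cache.set('unreviewed_page_ids', unreviewed_ids_from_db()[:limit])


def next_page_url():
    """Pick a random unreviewed page, preferring the cached id list."""
    page_ids = cache.get('unreviewed_page_ids')
    if not page_ids:
        # Cache is empty or missing: fall back to the database.
        page_ids = unreviewed_ids_from_db()
    if not page_ids:
        return None  # every page has been reviewed
    return '/page/%s/' % random.choice(page_ids)
```

Calling `get_unreviewed_count()` twice in a row hits the pretend database only once, and after `populate_unreviewed_page_ids()` has run, `next_page_url()` does not touch it at all. The trade-off is staleness: a page reviewed after the list was populated can still be handed out until the list is refreshed.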