Crowdsourcing with Django

A talk presented at EuroPython on 30th June 2009.

Published in: Technology

Transcript

  • 1. Crowdsourcing with Django EuroPython, 30th June 2009 Simon Willison · http://simonwillison.net/ · @simonw
  • 2. “Web development on journalism deadlines”
  • 3. The back story...
  • 4. November 2000: The Freedom of Information Act
  • 5. Heather Brooke
      • http://www.guardian.co.uk/politics/2009/may/08/mps-expenses-telegraph-checquebook-journalism
      • http://www.guardian.co.uk/politics/2009/may/15/mps-expenses-heather-brooke-foi
  • 6. 2004: The request
  • 7. January 2005: The FOI request
  • 8. July 2006: The FOI commissioner
  • 9. May 2007: The FOI (Amendment) Bill
  • 10. February 2008: The Information Tribunal
  • 11. “Transparency will damage democracy”
  • 12. May 2008: The high court
  • 13. January 2009: The exemption law
  • 14. March 2009: The mole
  • 15. “All of the receipts of 650-odd MPs, redacted and unredacted, are for sale at a price of £300,000, so I am told. The price is going up because of the interest in the subject.” (Sir Stuart Bell, MP, on Newsnight, 30th March)
  • 16. 8th May, 2009: The Daily Telegraph
  • 17. At the Guardian...
  • 18. April: “Expenses are due out in a couple of months, is there anything we can do?”
  • 19. June: “Expenses have been bumped forward, they’re out next week!”
  • 20. Thursday 11th June: The proof-of-concept
  • 21. Monday 15th June: The tentative go-ahead
  • 22. Tuesday 16th June: Designer + client-side engineer
  • 23. Wednesday 17th June: Operations engineer
  • 24. Thursday 18th June: Launch day!
  • 25. How we built it
  • 26. $ convert Frank_Comm.pdf pages.png
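The one-line conversion on slide 26 could be scripted across a whole directory of PDFs. The following is a minimal sketch, assuming ImageMagick's `convert` is on the PATH; the function names and the one-folder-per-document layout are hypothetical and not taken from the talk.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir):
    """Return the argv for `convert <pdf> <out_dir>/<pdf-stem>/pages.png`."""
    pdf_path = Path(pdf_path)
    target = Path(out_dir) / pdf_path.stem / "pages.png"
    return ["convert", str(pdf_path), str(target)]


def convert_all(pdf_dir, out_dir, run=subprocess.check_call):
    """Convert every PDF in pdf_dir; `run` is injectable so it can be stubbed."""
    commands = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        command = build_convert_command(pdf_path, out_dir)
        # Each document gets its own folder so page images do not collide.
        Path(command[-1]).parent.mkdir(parents=True, exist_ok=True)
        run(command)
        commands.append(command)
    return commands
```

For a multi-page PDF, ImageMagick typically writes one numbered image per page (for example `pages-0.png`, `pages-1.png`), which is why a separate output folder per document is assumed here.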
  • 27. Models
  • 28.
      class Party(models.Model):
          name = models.CharField(max_length=100)

      class Constituency(models.Model):
          name = models.CharField(max_length=100)

      class MP(models.Model):
          name = models.CharField(max_length=100)
          party = models.ForeignKey(Party)
          constituency = models.ForeignKey(Constituency)
          guardian_url = models.CharField(max_length=255, blank=True)
          guardian_image_url = models.CharField(max_length=255, blank=True)
  • 29.
      class FinancialYear(models.Model):
          name = models.CharField(max_length=10)

      class Document(models.Model):
          title = models.CharField(max_length=100, blank=True)
          filename = models.CharField(max_length=100)
          mp = models.ForeignKey(MP)
          financial_year = models.ForeignKey(FinancialYear)

      class Page(models.Model):
          document = models.ForeignKey(Document)
          page_number = models.IntegerField()
  • 30.
      class User(models.Model):
          created = models.DateTimeField(auto_now_add=True)
          username = models.TextField(max_length=100)
          password_hash = models.CharField(max_length=128, blank=True)

      class LineItemCategory(models.Model):
          order = models.IntegerField(default=0)
          name = models.CharField(max_length=32)

      class LineItem(models.Model):
          user = models.ForeignKey(User)
          page = models.ForeignKey(Page)
          type = models.CharField(max_length=16, choices=(
              ('claim', 'claim'),
              ('proof', 'proof'),
          ), db_index=True)
          date = models.DateField(null=True, blank=True)
          amount = models.DecimalField(max_digits=20, decimal_places=2)
          description = models.CharField(max_length=255, blank=True)
          created = models.DateTimeField(auto_now_add=True, db_index=True)
          categories = models.ManyToManyField(LineItemCategory, blank=True)
  • 31.
      class Vote(models.Model):
          user = models.ForeignKey(User, related_name='votes')
          page = models.ForeignKey(Page, related_name='votes')
          obsolete = models.BooleanField(default=False)
          vote_type = models.CharField(max_length=32, blank=True)
          ip_address = models.CharField(max_length=32)
          created = models.DateTimeField(auto_now_add=True)

      class TypeVote(Vote):
          type = models.CharField(max_length=10, choices=(
              ('claim', 'Claim'),
              ('proof', 'Proof'),
              ('blank', 'Blank'),
              ('other', 'Other')
          ))

      class InterestingVote(Vote):
          status = models.CharField(max_length=10, choices=(
              ('no', 'Not interesting'),
              ('yes', 'Interesting'),
              ('known', 'Interesting but known'),
              ('very', 'Investigate this!'),
          ))
  • 32. Frictionless registration
  • 33. Page filters
  • 34.
      page_filters = (
          # Maps name of filter to dictionary of kwargs to doc.pages.filter()
          ('reviewed', {
              'votes__isnull': False
          }),
          ('unreviewed', {
              'votes__isnull': True
          }),
          ('with line items', {
              'line_items__isnull': False
          }),
          ('interesting', {
              'votes__interestingvote__status': 'yes'
          }),
          ('interesting but known', {
              'votes__interestingvote__status': 'known'
          ...
      )
      page_filters_lookup = dict(page_filters)
  • 35.
      pages = doc.pages.all()
      if page_filter:
          kwargs = page_filters_lookup.get(page_filter)
          if kwargs is None:
              raise Http404('Invalid page filter: %s' % page_filter)
          pages = pages.filter(**kwargs).distinct()

      # Build the filters
      filters = []
      for name, kwargs in page_filters:
          filters.append({
              'name': name,
              'count': doc.pages.filter(**kwargs).distinct().count(),
          })
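The pattern on slides 34 and 35 (an ordered tuple of named filters, a dict built from it for lookup, and a per-filter count for the navigation) can be tried without a database. This is a rough sketch only: `PAGES` is a hypothetical in-memory stand-in for the `Page` rows, and plain predicates replace the ORM keyword arguments used in the talk.

```python
# Hypothetical in-memory stand-in for Page rows; the talk uses the Django ORM.
PAGES = [
    {"id": 1, "votes": 2, "line_items": 1},
    {"id": 2, "votes": 0, "line_items": 0},
    {"id": 3, "votes": 1, "line_items": 0},
]

# Ordered (name, predicate) pairs, mirroring the (name, kwargs) tuples.
PAGE_FILTERS = (
    ("reviewed", lambda page: page["votes"] > 0),
    ("unreviewed", lambda page: page["votes"] == 0),
    ("with line items", lambda page: page["line_items"] > 0),
)
PAGE_FILTERS_LOOKUP = dict(PAGE_FILTERS)


def apply_filter(pages, name=None):
    """Return the pages matching the named filter, or all pages if no name."""
    if not name:
        return list(pages)
    predicate = PAGE_FILTERS_LOOKUP.get(name)
    if predicate is None:
        # The view on slide 35 answers an unknown filter name with a 404.
        raise KeyError("Invalid page filter: %s" % name)
    return [page for page in pages if predicate(page)]


def filter_counts(pages):
    """Build the list of {'name', 'count'} dicts shown next to each filter."""
    return [
        {"name": name, "count": sum(1 for page in pages if predicate(page))}
        for name, predicate in PAGE_FILTERS
    ]
```

Keeping the filters in a tuple preserves their display order, while the derived dict gives constant-time lookup by name.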
  • 36. Matching names
  • 37. http://github.com/simonw/datamatcher
  • 38. On the day
  • 39.
      def get_mp_pages():
          "Returns list of (mp-name, mp-page-url) tuples"
          soup = Soup(urllib.urlopen(INDEX_URL))
          mp_links = []
          for link in soup.findAll('a'):
              if link.get('title', '').endswith("'s allowances"):
                  mp_links.append(
                      (link['title'].replace("'s allowances", ''), link['href'])
                  )
          return mp_links
  • 40.
      def get_pdfs(mp_url):
          "Returns list of (description, years, pdf-url, size) tuples"
          soup = Soup(urllib.urlopen(mp_url))
          pdfs = []
          trs = soup.findAll('tr')[1:]  # Skip the first, it's the table header
          for tr in trs:
              name_td, year_td, pdf_td = tr.findAll('td')
              name = name_td.string
              year = year_td.string
              pdf_url = pdf_td.find('a')['href']
              size = pdf_td.find('a').contents[-1].replace('(', '').replace(')', '')
              pdfs.append(
                  (name, year, pdf_url, size)
              )
          return pdfs
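The scrapers on slides 39 and 40 rely on `Soup` and the Python 2 `urllib.urlopen`. As a rough illustration of the same link-matching idea using only the Python 3 standard library, the sketch below parses an HTML string that has already been fetched. The class and function names are hypothetical, and this is not the code used in the talk.

```python
from html.parser import HTMLParser

SUFFIX = "'s allowances"


class AllowanceLinkParser(HTMLParser):
    """Collects (mp-name, href) for <a> tags whose title ends with SUFFIX."""

    def __init__(self):
        super().__init__()
        self.mp_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        title = attributes.get("title") or ""
        href = attributes.get("href")
        if title.endswith(SUFFIX) and href:
            # Strip the trailing suffix to leave just the MP's name.
            self.mp_links.append((title[: -len(SUFFIX)], href))


def get_mp_links(html):
    """Return a list of (mp-name, mp-page-url) tuples found in `html`."""
    parser = AllowanceLinkParser()
    parser.feed(html)
    parser.close()
    return parser.mp_links
```

Separating parsing from fetching means the matching logic can be checked against a saved copy of the index page before the real documents are published.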
  • 41. “Drop Everything”
  • 42. Photoshop + AppleScript vs. Java + IntelliJ
  • 43. Images on our docroot (S3 upload was taking too long)
  • 44. Blitz QA
  • 45. Launch! (on EC2)
  • 46. Crash #1: more Apache children than MySQL connections
  • 47.
      unreviewed_count = Page.objects.filter(
          votes__isnull=True
      ).distinct().count()
  • 48.
      SELECT
          COUNT(DISTINCT `expenses_page`.`id`)
      FROM
          `expenses_page` LEFT OUTER JOIN `expenses_vote` ON (
              `expenses_page`.`id` = `expenses_vote`.`page_id`
          )
      WHERE
          `expenses_vote`.`id` IS NULL
  • 49.
      unreviewed_count = cache.get('homepage:unreviewed_count')
      if unreviewed_count is None:
          unreviewed_count = Page.objects.filter(
              votes__isnull=True
          ).distinct().count()
          cache.set('homepage:unreviewed_count', unreviewed_count, 60)
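Slide 49 is an instance of the cache-aside pattern: read from the cache, compute on a miss, and store the result with a 60 second expiry. A minimal sketch of that pattern follows, assuming a tiny in-process `TTLCache` as a hypothetical stand-in for memcached; `cached_count` is likewise an illustrative name.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcached's get/set with expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries behave like a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (value, self._clock() + timeout)


def cached_count(cache, key, compute, timeout=60):
    """Return the cached value for `key`, computing and storing it on a miss."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, timeout)
    return value
```

With this shape the expensive query runs at most once per timeout window per key, at the cost of showing a count that may be up to `timeout` seconds stale.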
  • 50.
      • With 70,000 pages and a LOT of votes...
      • DB takes up 135% of CPU
      • Cache the count in memcached...
      • DB drops to 35% of CPU
  • 51.
      unreviewed_count = Page.objects.filter(
          votes__isnull=True
      ).distinct().count()
      reviewed_count = Page.objects.filter(
          votes__isnull=False
      ).distinct().count()
  • 52.
      unreviewed_count = Page.objects.filter(
          is_reviewed=False
      ).count()
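Slides 51 and 52 replace a join on the votes table with a denormalized `is_reviewed` flag on the page. The slides do not show how that flag is kept up to date; one plausible approach, sketched here with plain dataclasses standing in for the Django models, is to set it whenever a vote is recorded. The `record_vote` helper is an assumption for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    id: int
    is_reviewed: bool = False  # denormalized: True once any vote exists
    votes: list = field(default_factory=list)


def record_vote(page, vote):
    """Store the vote and update the denormalized flag in the same step."""
    page.votes.append(vote)
    page.is_reviewed = True


def unreviewed_count(pages):
    """Count pages by the flag alone, without inspecting their votes."""
    return sum(1 for page in pages if not page.is_reviewed)
```

The trade-off is a small amount of extra work on every write in exchange for a count that no longer needs a join or a `DISTINCT`.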
  • 53. Migrating to InnoDB on a separate server
  • 54.
      ssh mps-live "mysqldump mp_expenses" |
          sed 's/ENGINE=MyISAM/ENGINE=InnoDB/g' |
          sed 's/CHARSET=latin1/CHARSET=utf8/g' |
          ssh mysql-big "mysql -u root mp_expenses"
  • 55. “next” button
  • 56.
      def next_global(request):
          # Next unreviewed page from the whole site
          all_unreviewed_pages = Page.objects.filter(
              is_reviewed=False
          ).order_by('?')
          if all_unreviewed_pages:
              return Redirect(
                  all_unreviewed_pages[0].get_absolute_url()
              )
          else:
              return HttpResponse(
                  'All pages have been reviewed!'
              )
  • 57.
      import random

      def next_global_from_cache(request):
          page_ids = cache.get('unreviewed_page_ids')
          if page_ids:
              return Redirect(
                  '/page/%s/' % random.choice(page_ids)
              )
          else:
              return next_global(request)
  • 58.
      from django.core.management.base import BaseCommand
      from mp_expenses.expenses.models import Page
      from django.core.cache import cache

      class Command(BaseCommand):
          help = """
          populate unreviewed_page_ids in memcached
          """
          requires_model_validation = True
          can_import_settings = True

          def handle(self, *args, **options):
              ids = list(Page.objects.exclude(
                  is_reviewed=True
              ).values_list('pk', flat=True)[:1000])
              cache.set('unreviewed_page_ids', ids)
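Slides 56 to 58 move the random choice out of the database: a periodic job caches up to 1,000 unreviewed page IDs, and the "next" view picks one of them at random, falling back to the database query when the cache is empty. The sketch below shows that flow in plain Python. The function names and the `(page_id, is_reviewed)` tuples are hypothetical stand-ins for the models and cache in the talk.

```python
import random


def sample_unreviewed_ids(pages, limit=1000):
    """Return up to `limit` ids of pages not yet reviewed (the cached sample)."""
    ids = [page_id for page_id, is_reviewed in pages if not is_reviewed]
    return ids[:limit]


def next_page_url(cached_ids, fallback, rng=random):
    """Pick a random cached id, or defer to `fallback` when the cache is empty."""
    if cached_ids:
        return "/page/%s/" % rng.choice(cached_ids)
    return fallback()
```

Because the cached sample is refreshed separately, a user may occasionally be sent to a page that was reviewed since the last refresh; the approach accepts that in return for avoiding a randomly ordered query on every click.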
  • 59. The numbers
  • 60. Final thoughts
      • High score tables help
      • MP photographs really help
      • Keeping up the interest is hard
      • Next step: start releasing the data
