What's up?
About an RSS reader that uses statistics to show the news most interesting to you. Based on Python and web.py.

Published in: Technology
  1. What's up? Bouvet BigOne, 2011-10-27. Lars Marius Garshol, <larsga@bouvet.no>, http://twitter.com/larsga
  2. The problem with RSS readers
     • Many feeds (like newspapers') are too busy, and so unread stories pile up
     • In most feeds you are only interested in a small subset of the posts
     • Staying on top of the flow of news and digging up the interesting stuff is hard work
  3. What's up?
     • A newsreader that tries to solve this for you
     • It uses statistics to figure out which news items are the most interesting to you
       – the statistics are based on feedback from you
     • Everything is collected into a single list, sorted by relevance and freshness
     • Stories sink slowly as they age, so if you don't read them they gradually fade away
  4. [Screenshot of the UI: Like, Not used, Dislike, Mark as read]
  5. [Screenshot] Very interesting post about beer, but the word "beer" doesn't actually appear anywhere. Probabilities combined with Bayes's theorem.
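The point of slide 5, that a post about beer can score high even though the word "beer" never appears, is what Bayesian combination of per-word probabilities buys you: related words carry the evidence. A minimal sketch of how such probabilities can be combined; the `combine()` helper and the example probabilities are illustrative, not taken from whazzup's code:

```python
# Sketch of combining independent per-word interest probabilities
# with Bayes' theorem, as in naive Bayesian filtering.

def combine(probs):
    """P(interesting | words) = prod(p) / (prod(p) + prod(1 - p))."""
    num = 1.0   # product of p_i
    den = 1.0   # product of (1 - p_i)
    for p in probs:
        num *= p
        den *= (1.0 - p)
    return num / (num + den)

# Three strongly "liked" words (say "brewery", "hops", "stout")
# push the combined probability close to 1 even without "beer":
score = combine([0.9, 0.8, 0.7])
print(score)
```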
  6. [Screenshot] Utterly irrelevant post about sports
  7. Adding feeds: probably the usability Achilles' heel of the system right now
  8. Three implementations
     • In-memory single-user version
       – worked well for me for several years
       – wanted to try it out with more users
     • Google AppEngine version
       – easy to build and deploy
       – used way too much CPU
     • "Traditional" version
       – PostgreSQL backend, ordinary web hosting
       – seems to scale much better
  9. The goal
     • Make the site pay for its own hosting
       – currently solved by running it on my personal web server
       – expect the system to outgrow that server soon
     • Move to cloud hosting
       – candidates: Amazon EC2, Heroku, Google AppEngine w/ MySQL
     • Income from Google Ads
       – income per user likely to be very low
       – scaling challenge: support enough users to pay for the computing resources
  10. Data structure: Feed, Post, Subscription, RatedPost, User
      • Good
        – fully normalized, no redundancy
        – simple and natural
      • Bad
        – showing the main page requires many joins
        – limited possibilities for caching
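The fully normalized schema above can be sketched concretely. This uses sqlite3 purely for illustration (the real backend is PostgreSQL), and the table and column names are my assumptions, not whazzup's actual schema; the point is that rendering the main page needs joins across nearly every table:

```python
# Sketch of the normalized data structure: Feed, Post, Subscription,
# RatedPost, User. Names and columns are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feeds (id INTEGER PRIMARY KEY, title TEXT, url TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY,
                    feed INTEGER REFERENCES feeds,
                    title TEXT, pubdate TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE subscriptions (user_ INTEGER REFERENCES users,
                            feed INTEGER REFERENCES feeds);
CREATE TABLE rated_posts (user_ INTEGER REFERENCES users,
                          post INTEGER REFERENCES posts,
                          points REAL);
""")

# Showing one user's main page means joining most of the schema,
# which is exactly the "many joins" cost noted on the slide:
MAIN_PAGE_QUERY = """
SELECT p.title, f.title, r.points
  FROM rated_posts r
  JOIN posts p ON r.post = p.id
  JOIN feeds f ON p.feed = f.id
  JOIN subscriptions s ON s.feed = f.id AND s.user_ = r.user_
 WHERE r.user_ = ?
 ORDER BY r.points DESC
"""
rows = conn.execute(MAIN_PAGE_QUERY, (1,)).fetchall()
```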
  11. Queueing
      • The original version would respond to clicks in real time
        – meant recomputing all stories on each up/down vote, before showing the page again
        – not really a very pleasant user experience
      • Changed over to a queue approach
        – user clicks are added to a queue
        – the queue worker retrieves tasks, processes them, and may add more
        – scheduled tasks are injected into the queue
        – an admin command-line tool 1) injects tasks when needed
        – works beautifully
      1) http://code.google.com/p/whazzup/source/browse/send.py
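The queue approach can be sketched as a toy in-process model (the real system uses a SysV IPC message queue, and the task names and recompute step here are illustrative): a vote becomes a cheap task, and the expensive recomputation becomes a follow-up task the worker enqueues, instead of work done while the user waits for the page.

```python
# Toy model of the queue approach: tasks go in, the worker
# processes them and may enqueue more tasks of its own.
from collections import deque

task_queue = deque()

def handle(task):
    kind, payload = task
    if kind == "vote":
        # an up/down vote means this user's scores must be
        # recomputed; enqueue that instead of doing it inline
        task_queue.append(("recompute-user", payload["user"]))
    elif kind == "recompute-user":
        pass  # recompute all scores for one user (omitted)

def run_worker():
    processed = []
    while task_queue:
        task = task_queue.popleft()
        handle(task)
        processed.append(task[0])
    return processed

task_queue.append(("vote", {"user": 42, "post": 7, "up": True}))
order = run_worker()   # the vote task spawns a recompute task
```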
  12. Google AppEngine experience
      • Easy to build, painless to deploy
        – web.py and Python well supported
        – good queue and scheduled-task APIs
      • Datastore and GQL too primitive
        – high latency registers as high CPU usage (costly)
        – very, very limited support for letting the database do the work, which leads to poor performance
      • AppEngine apps require heavy caching to work
        – not really possible with this application
      • Would have hit the limit of free usage at 4 users
        – not a realistic proposition
  13. Example problem
      • How to implement aging of posts?
        – that is, reducing the score as posts get older
      • Could compute the score when loading the story list
        – not possible in GQL (no expression language)
      • Could run a scheduled task once an hour
        – in GQL this requires loading all RatedPost objects into the main process
        – way too resource-intensive
      • Just didn't scale at all

      (probability * 1000.0) / math.log(ageinsecs)
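The scoring formula shown on the slide can be wrapped in a function to see the aging behaviour; the probabilities and ages below are just example values:

```python
# The aging formula from the slide: interest probability scaled up,
# then divided by the log of the post's age in seconds, so a post's
# score sinks slowly as it gets older.
import math

def score(probability, ageinsecs):
    # ageinsecs must be > 1 second for the log to be positive
    return (probability * 1000.0) / math.log(ageinsecs)

fresh = score(0.9, 3600)      # one hour old
day_old = score(0.9, 86400)   # one day old, same probability
# the older post ranks lower, so unread stories gradually fade
```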
  14. Current architecture
      [Architecture diagram] 100% Python, based on web.py. Apache w/ mod_python, single server so far. cron and an IPC message queue 1) feed a queue worker; download threads fetch feeds and write to DBM files; PostgreSQL holds the database.
      1) http://semanchuk.com/philip/sysv_ipc/
  15. Aging posts with Postgres
      • First attempt
        – load posts, compute in Python, save to DB
        – took 1.1 seconds per subscription
        – with ~50 subscriptions per user, that's much too slow
      • Second attempt
        – do the calculation in the SQL update statement 1)
        – takes 0.5 seconds per user: more than 100 times faster
        – may still be too slow: with 7200 users it would take an hour
      1) http://code.google.com/p/whazzup/source/browse/dbqueue.py?r=4e
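The second attempt, pushing the aging calculation into a single UPDATE statement, can be sketched like this. sqlite3 stands in for Postgres here (Postgres has `ln()` built in; for SQLite it is registered as a custom function), and the table and column names are assumptions:

```python
# One SQL statement recomputes every row's score in the database,
# instead of loading posts, computing in Python, and saving back.
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("ln", 1, math.log)   # emulate Postgres ln()
conn.execute("CREATE TABLE rated_posts (post INTEGER, "
             "probability REAL, pubtime INTEGER, points REAL)")

now = 1_000_000
conn.execute("INSERT INTO rated_posts VALUES (1, 0.9, ?, 0)",
             (now - 3600,))    # one hour old
conn.execute("INSERT INTO rated_posts VALUES (2, 0.9, ?, 0)",
             (now - 86400,))   # one day old

# the aging formula from slide 13, run entirely inside the database;
# no rows cross the wire into the Python process
conn.execute("UPDATE rated_posts SET points = "
             "(probability * 1000.0) / ln(? - pubtime)", (now,))

points = dict(conn.execute("SELECT post, points FROM rated_posts"))
```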
  16. More performance tricks
      • Loading story pages is a bit expensive
        – because of the SQL joins required
        – now handling votes with AJAX, so the page doesn't have to be reloaded for every vote
        – next step: caching feed titles and story titles?
      • Separate worker threads for feed downloading
        – because feeds may be slow to respond
        – threads save the feed XML to disk, then queue a task to process the feed
        – the ParseFeed task doesn't incur any network latency
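The download-thread idea can be sketched as below. `fetch()` is a placeholder for the real HTTP download, and writing into a dict stands in for saving the XML to disk; only the fetch step is network-bound, so the queued ParseFeed task never has to wait on the network:

```python
# Sketch: slow feed downloads run in separate threads, which store
# the XML and then enqueue a ParseFeed task for offline processing.
import queue
import threading

tasks = queue.Queue()
results = {}

def fetch(url):
    return "<rss>...</rss>"   # placeholder for a real HTTP GET

def download(url):
    xml = fetch(url)               # the only slow, network-bound step
    results[url] = xml             # stands in for saving XML to disk
    tasks.put(("ParseFeed", url))  # parsing happens later, offline

urls = ["http://example.com/a.rss", "http://example.com/b.rss"]
threads = [threading.Thread(target=download, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

# every download produced exactly one queued ParseFeed task
queued = sorted(tasks.get()[1] for _ in urls)
```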
  17. Statistics
      • No perceptible server load
      • Bottlenecks:
        – loading story list pages
        – parsing feeds, calculating points
  18. Future architecture
      [Architecture diagram] Multiple web frontends, possibly backed by memcached; cron and a message queue (Gearman?) feeding multiple queue workers; a DB cluster (PostgreSQL?); DBM files?
  19. More information
      • Blog post: http://www.garshol.priv.no/blog/216.html
      • Source code: http://code.google.com/p/whazzup/
      • Pre-alpha trial
        – open to anyone; sign up if you're interested
        – no guarantees about anything
        – http://whazzup.garshol.priv.no/
        – currently limited to 100 users (87 accounts available)