Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is a system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of Gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.

  • Can record many more than 4k events per second
    • 4k events per second is about 345M events per day (single-threaded, VM on a laptop); we get a lot of traffic, but not that much
    • MR (MapReduce) makes this much lower if calculated continuously: still hundreds of events even with MR locking

Presentation Transcript

  • Realtime Analytics using MongoDB, Python, Gevent, and ZeroMQ Rick Copeland @rick446 [email_address]
  • SourceForge ♥ MongoDB
    • Tried CouchDB – liked the dev model, not so much the performance
    • Migrated consumer-facing pages (summary, browse, download) to MongoDB and it worked great (on MongoDB 0.8 no less!)
    • Built an entirely new tool platform around MongoDB (Allura)
  • The Problem We’re Trying to Solve
    • We have lots of users (good)
    • We have lots of projects (good)
    • We don’t know what those users and projects are doing (not so good)
    • We have tons of code in PHP, Perl, and Python (not so good)
  • Introducing Zarkov 0.0.1
    • Asynchronous TCP server for event logging with gevent
    • Turn OFF “safe” writes, turn OFF Ming validation (or do it in the client)
    • Incrementally calculate aggregate stats based on event log using mapreduce with {'out': 'reduce'}
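The incremental aggregation in the last bullet can be illustrated without a database. The sketch below is an in-memory imitation, under stated assumptions, of what MongoDB's "reduce" output mode does: results for a new batch of events are merged into the existing output by running the same reduce function over the old and new values. The event fields (`project`, `day`) and all function names are hypothetical and are not taken from Zarkov.

```python
from collections import defaultdict

def map_event(event):
    # Hypothetical event shape: emit one hit per (project, day).
    yield (event["project"], event["day"]), 1

def reduce_counts(key, values):
    # Summation is associative, so re-reducing partial results is safe.
    return sum(values)

def aggregate_batch(events, output):
    """Fold one batch of new events into `output`, a dict standing in
    for the output collection."""
    grouped = defaultdict(list)
    for event in events:
        for key, value in map_event(event):
            grouped[key].append(value)
    for key, values in grouped.items():
        new_value = reduce_counts(key, values)
        if key in output:
            # "reduce" output mode: combine the stored and the new value.
            new_value = reduce_counts(key, [output[key], new_value])
        output[key] = new_value
    return output

stats = {}
aggregate_batch([{"project": "allura", "day": "2011-07-01"},
                 {"project": "allura", "day": "2011-07-01"}], stats)
aggregate_batch([{"project": "allura", "day": "2011-07-01"},
                 {"project": "zarkov", "day": "2011-07-01"}], stats)
```

Because only the new batch is mapped on each run, the cost of keeping the aggregates current depends on the batch size rather than on the size of the whole event log.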
  • Zarkov Architecture [diagram; labels: BSON over ZeroMQ, Journal Greenlet, Write-ahead log (two instances), Commit Greenlet, Aggregation Greenlet, MongoDB]
  • Technologies
    • MongoDB
      • Fast (10k+ inserts/s single-threaded)
    • ZeroMQ
      • Built-in buffering
      • PUSH/PULL sockets (push never blocks, easy to distribute work)
    • BSON
      • Fast Python/C implementation
      • More types than JSON
    • Gevent
      • “green threads” for Python
  • “Wow, it’s really fast; can it replace…”
    • Download statistics?
    • Google Analytics?
    • Project realtime statistics?
    “Probably, but it’ll take some work….”
  • Moving towards production....
    • MongoDB MapReduce: convenient, but not so fast
      • Global JS Interpreter Lock per mongod
      • Lots of writing to temp collections (high lock %)
      • Javascript without libraries (ick!)
    • Hadoop? Painful to configure, high latency, non-seamless integration with MongoDB
  • Zarkov’s already doing a lot…
    • So we added a lightweight map/reduce framework
    • Write your map/reduce jobs in Python
    • Input/Output is MongoDB
    • Intermediate files are local .bson files
    • Use ZeroMQ for job distribution
  • Quick Map/reduce Refresher
      import itertools
      import operator

      def map_reduce(input_collection, query, output_collection, map, reduce):
          # Query: select the input documents
          objects = input_collection.find(query)
          # Map: the user-supplied map() yields (key, value) pairs
          map_results = list(map(objects))
          # Sort by key, because groupby only groups adjacent items
          map_results.sort(key=operator.itemgetter(0))
          for key, kv_pairs in itertools.groupby(
                  map_results, operator.itemgetter(0)):
              # Reduce: collapse all values for one key into a single value
              value = reduce(key, [v for k, v in kv_pairs])
              output_collection.save({"_id": key, "value": value})
  • Quick Map/reduce Refresher (annotated)
    • Same map_reduce code as on the previous slide, with the annotation: Parallel
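A minimal in-memory sketch of the refresher above, assuming plain Python lists and dicts in place of MongoDB collections. Unlike the slide, this variant calls the map function once per document; the sample data and the names `map_downloads` and `reduce_total` are hypothetical.

```python
import itertools
import operator

def map_reduce(documents, map_fn, reduce_fn):
    # Map: every document may emit any number of (key, value) pairs.
    pairs = [pair for doc in documents for pair in map_fn(doc)]
    # Sort: groupby only groups adjacent items, so order by key first.
    pairs.sort(key=operator.itemgetter(0))
    results = {}
    for key, group in itertools.groupby(pairs, key=operator.itemgetter(0)):
        # Reduce: collapse all values for one key into a single value.
        results[key] = reduce_fn(key, [value for _, value in group])
    return results

def map_downloads(doc):
    yield doc["project"], doc["bytes"]

def reduce_total(key, values):
    return sum(values)

downloads = [
    {"project": "zarkov", "bytes": 10},
    {"project": "allura", "bytes": 5},
    {"project": "zarkov", "bytes": 7},
]
totals = map_reduce(downloads, map_downloads, reduce_total)
```

The per-document map calls are independent of one another, as are the per-key reduce calls, which is what makes those two phases candidates for parallel execution.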
  • Zarkov Map/Reduce Architecture [diagram; phases: Query, Map, Sort, Reduce, Commit; files: map_in_#.bson, map_out_#.bson, reduce_in.bson; coordinated by the Job Mgr]
  • Zarkov Map/Reduce
    • Phases managed by greenlets
    • Map and reduce jobs parceled out to remote workers via zmq PUSH/PULL
    • Adaptive timeout/retry to support dead workers
    • Sort phase is local (big mergesort) but still done in worker processes
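The "big mergesort" in the sort phase can be sketched with the standard library. This is an illustration under assumptions, not Zarkov's code: in-memory lists stand in for the intermediate .bson files, and the function names are hypothetical. Each chunk is sorted on its own, then the sorted chunks are merged lazily into a single key-ordered stream for the reduce phase.

```python
import heapq
import operator

def sort_chunk(pairs):
    # Each chunk of (key, value) pairs is sorted by key on its own.
    return sorted(pairs, key=operator.itemgetter(0))

def merge_sorted_chunks(chunks):
    # heapq.merge lazily merges already-sorted inputs, so only one
    # pending item per chunk is held by the merge at any time.
    return heapq.merge(*chunks, key=operator.itemgetter(0))

map_outputs = [
    [("zarkov", 1), ("allura", 1)],
    [("ming", 1), ("allura", 1), ("zarkov", 1)],
]
sorted_chunks = [sort_chunk(chunk) for chunk in map_outputs]
reduce_input = list(merge_sorted_chunks(sorted_chunks))
```

Sorting the chunks separately is the part that can be handed to worker processes; the final merge is a single sequential pass.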
  • Zarkov Web Service
    • We’ve got the data in, now how do we get it out?
    • Zarkov includes a tiny HTTP server
      • $ curl -d foo='{"c":"sfweb", "b":"date/2011-07-01/", "e":"date/2011-07-04"}' http://localhost:8081/q
      • {"foo": {"sflogo": [[1309579200000.0, 12774], [1309665600000.0, 13458], [1309752000000.0, 13967]], "hits": [[1309579200000.0, 69357], [1309665600000.0, 68514], [1309752000000.0, 68494]]}}
    • Values come out tweaked for use in flot
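A small sketch of consuming the response shown above. It assumes the first element of each pair is a JavaScript-style timestamp in milliseconds since the epoch, which is the form flot's time mode expects; the helper name `parse_series` is hypothetical. The response body is copied from the slide rather than fetched, so the example runs without a Zarkov server.

```python
import json
from datetime import datetime, timezone

# Response body copied from the slide above.
response_body = (
    '{"foo": {"sflogo": [[1309579200000.0, 12774], [1309665600000.0, 13458], '
    '[1309752000000.0, 13967]], "hits": [[1309579200000.0, 69357], '
    '[1309665600000.0, 68514], [1309752000000.0, 68494]]}}'
)

def parse_series(body, name):
    """Return {series: [(utc_datetime, count), ...]} for one named query."""
    result = json.loads(body)[name]
    return {
        series: [
            # Convert milliseconds to seconds before building the datetime.
            (datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc), count)
            for ms, count in points
        ]
        for series, points in result.items()
    }

series = parse_series(response_body, "foo")
first_day, first_hits = series["hits"][0]
```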
  • Zarkov Deployment at SF.net
  • Lessons learned at
  • MongoDB Tricks
    • Autoincrement integers are harder than in MySQL but not impossible
    • Unsafe writes, insert > update
      class IdGen(object):

          @classmethod
          def get_ids(cls, inc=1):
              obj = cls.query.find_and_modify(
                  query={'_id': 0},
                  update={'$inc': dict(inc=inc)},
                  upsert=True,
                  new=True)
              return range(obj.inc - inc, obj.inc)
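The range arithmetic in IdGen can be checked without a database. The sketch below is an assumption-heavy stand-in: `FakeCounterCollection` is a hypothetical in-memory class that imitates only the parts of find_and_modify the slide relies on ($inc, upsert, and returning the updated document). It is not thread-safe, whereas the real operation is atomic on the server.

```python
class FakeCounterCollection:
    """Hypothetical in-memory stand-in for the collection behind IdGen."""

    def __init__(self):
        self.docs = {}

    def find_and_modify(self, query, update, upsert=False, new=False):
        _id = query["_id"]
        if _id not in self.docs:
            if not upsert:
                return None
            # upsert=True: create the counter document on first use.
            self.docs[_id] = {"_id": _id}
        doc = self.docs[_id]
        before = dict(doc)
        for field, amount in update["$inc"].items():
            # $inc treats a missing field as 0.
            doc[field] = doc.get(field, 0) + amount
        # new=True returns the document as it looks after the update.
        return dict(doc) if new else before

def get_ids(collection, inc=1):
    # Reserve `inc` consecutive integers with a single counter bump.
    doc = collection.find_and_modify(
        query={"_id": 0}, update={"$inc": {"inc": inc}},
        upsert=True, new=True)
    return range(doc["inc"] - inc, doc["inc"])

counters = FakeCounterCollection()
first = list(get_ids(counters, 3))
second = list(get_ids(counters, 2))
```

Reserving a block of ids per call means a batch of inserts costs one round trip to the counter document instead of one per inserted document.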
  • MongoDB Pitfalls
    • $addToSet is nice but nothing beats an integer range query
    • Avoid Javascript like the plague (mapreduce, group, $where)
    • Indexing is nice, but slows things down; use _id when you can
    • mongorestore is fast, but locks a lot
  • Open Source
    • Ming: http://sf.net/projects/merciless/ (MIT License)
    • Allura: http://sf.net/p/allura/ (Apache License)
    • Zarkov: http://sf.net/p/zarkov/ (Apache License)
  • Future Work
    • Remove SPoF
    • Better way of expressing aggregates
      • Suggestions?
    • Better web integration
      • WebSockets/Socket.io
    • Maybe trigger aggs based on event activity?
  • Rick Copeland @rick446 [email_address]
  • Credits
    • http://www.flickr.com/photos/jprovost/5733297977/in/photostream/