With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is a system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of Gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.
Can record many more than 4k events per second, or 345M events per day (single thread, VM on a laptop). We get a lot of traffic, but not that much.
MapReduce (MR) makes this much lower if calculated continuously; still hundreds of events even with MR locking.
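The per-day figure follows directly from the per-second rate. A quick check of the arithmetic, where the function name is only illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def events_per_day(events_per_second: int) -> int:
    """Scale a sustained per-second event rate to a full day."""
    return events_per_second * SECONDS_PER_DAY


daily = events_per_day(4_000)
print(f"{daily:,} events/day")  # 345,600,000, i.e. roughly 345M
```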
$addToSet is nice but nothing beats an integer range query
Avoid JavaScript like the plague (mapreduce, group, $where)
Indexing is nice, but slows things down; use _id when you can
mongorestore is fast, but locks a lot
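One way to read the points about integer range queries and `_id` together is to key aggregate documents by an integer, so a time window becomes a range filter on the default `_id` index. The sketch below is a guess at that idea, not Zarkov's actual schema: the `day_key` encoding and both function names are hypothetical. It only builds the filter document, using the standard `$gte` and `$lt` operators.

```python
from datetime import date


def day_key(d: date) -> int:
    """Encode a date as an integer such as 20110501 (hypothetical scheme)."""
    return d.year * 10000 + d.month * 100 + d.day


def range_filter(start: date, end: date) -> dict:
    """Build a MongoDB filter matching integer _id values in [start, end)."""
    return {"_id": {"$gte": day_key(start), "$lt": day_key(end)}}


flt = range_filter(date(2011, 5, 1), date(2011, 6, 1))
print(flt)  # {'_id': {'$gte': 20110501, '$lt': 20110601}}
```

With pymongo, a dictionary like this could be passed to `collection.find(flt)`. Because `_id` is always indexed, no secondary index would be needed for such a query.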
Open Source
Ming: http://sf.net/projects/merciless/ (MIT License)
Allura: http://sf.net/p/allura/ (Apache License)
Zarkov: http://sf.net/p/zarkov/ (Apache License)