How SourceForge is Using MongoDB
  Rick Copeland @rick446 [email_address]
SF.net “BlackOps”: FossFor.us
  • User Editable!
  • Web 2.0! (ish)
  • Not Ugly!
Moving to NoSQL
  • FossFor.us used CouchDB (NoSQL)
  • “Just adding new fields was trivial, and was happening all the time” – Mark Ramm
  • Scaling up to the level of SF.net needs research
    • CouchDB
    • MongoDB
    • Tokyo Cabinet/Tyrant
    • Cassandra
    • ... and others
Rewriting “Consume”
  • Most traffic on SF.net hits 3 types of pages:
    • Project Summary
    • File Browser
    • Download
  • Pages are read-mostly, with infrequent updates from the “Develop” side of sf.net
  • Original goal is 1 MongoDB document per project
    • Later split release data because some projects have lots of releases
  • Periodic updates via RSS and AMQP from “Develop”
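The move from one document per project to separate release documents can be pictured with plain Python dicts. This is only a sketch: the field names (`summary`, `releases`, `project_id`) and the helper `split_project` are hypothetical, since the slides do not show the real Consume schema.

```python
# Hypothetical field names; the slides do not show the real Consume schema.

def split_project(doc):
    """Split an all-in-one project document into a project document plus
    one document per release, each pointing back at its project."""
    project = {k: v for k, v in doc.items() if k != "releases"}
    releases = [
        dict(release, project_id=doc["_id"])  # copy, then add a back-reference
        for release in doc.get("releases", [])
    ]
    return project, releases


all_in_one = {
    "_id": "exampleproj",
    "name": "Example Project",
    "summary": "A sample project",
    "releases": [
        {"version": "1.0", "files": ["example-1.0.tar.gz"]},
        {"version": "1.1", "files": ["example-1.1.tar.gz"]},
    ],
}

project, releases = split_project(all_in_one)
```

With this shape, a summary page reads one small project document, and only the file browser needs to load the (possibly very many) release documents.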
Deployment Architecture
  • Load Balancer / Proxy
  • Gobble Server
  • Develop
  • Master DB Server: MongoDB Master
  • Web servers (×4), each: Apache mod_wsgi / TG 2.0 + MongoDB Slave
Deployment Architecture (revised)
  • Load Balancer / Proxy
  • Gobble Server
  • Develop
  • Master DB Server: MongoDB Master
  • Web servers (×4), each: Apache mod_wsgi / TG 2.0
  • Scalability is good
  • Single-node performance is good, too
SF.net Downloads
  • Allow non-sf.net projects to use SourceForge mirror network
  • Stats calculated in Hadoop and stored/served from MongoDB
  • Same deployment architecture as Consume (4 web, 1 db)
Allura (SF.net “beta” devtools)
  • Rewrite developer tools with new architecture
  • Wiki, Tracker, Discussions, Git, Hg, SVN, with more to come
  • Single MongoDB replica set manually sharded by project
  • Release early & often
What We Liked
  • Performance, performance, performance – Easily handle 90% of SF.net traffic from 1 DB server, 4 web servers
  • Schemaless server allows fast schema evolution in development, making many migrations unnecessary
  • Replication is easy, making scalability and backups easy
    • Keep a “backup slave” running
    • Kill backup slave, copy off database, bring back up the slave
    • Automatic re-sync with master
  • Query Language
    • You mean I can have performance without map-reduce?
  • GridFS
Pitfalls
  • Too-large documents
    • Store less per document
    • Return only a few fields
  • Ignoring indexing
    • Watch your server log; bad queries show up there
  • Ignoring your data’s schema
  • Using many databases when one will do
  • Using too many queries
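Two of these pitfalls, returning whole documents and ignoring indexes, can be sketched in a few lines. The functions below assume `collection` behaves like a pymongo `Collection`, whose `find_one(filter, projection)` and `create_index(keys)` methods are used here. The field names are hypothetical, and `FakeCollection` is only an in-memory stand-in so the sketch runs without a server.

```python
def ensure_indexes(collection):
    """Index the field the summary page queries on (hypothetical field name)."""
    collection.create_index("shortname")


def project_summary(collection, shortname):
    """Fetch only the fields a summary page needs, not the whole document."""
    fields = {"name": 1, "summary": 1, "_id": 0}
    return collection.find_one({"shortname": shortname}, fields)


class FakeCollection:
    """Tiny in-memory stand-in used only to exercise the sketch."""

    def __init__(self, docs):
        self.docs = docs
        self.indexes = []

    def create_index(self, key):
        self.indexes.append(key)

    def find_one(self, filter, projection=None):
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in filter.items()):
                if projection is None:
                    return dict(doc)
                # Keep only fields marked for inclusion.
                return {k: doc[k] for k, v in projection.items() if v and k in doc}
        return None


coll = FakeCollection([
    {"_id": 1, "shortname": "exampleproj", "name": "Example",
     "summary": "A sample project", "releases": ["1.0", "1.1"]},
])
ensure_indexes(coll)
page = project_summary(coll, "exampleproj")
```

Against a real server, the same two calls would be made on a pymongo collection object instead of the stand-in.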
Ming – an “Object-Document Mapper?”
  • Your data has a schema
    • Your database can define and enforce it
    • It can live in your application (as with MongoDB)
    • Nice to have the schema defined in one place in the code
  • Sometimes you need a “migration”
    • Changing the structure/meaning of fields
    • Adding indexes
    • Sometimes lazy, sometimes eager
  • Queuing up all your updates can be handy
  • Python dicts are nice; objects are nicer
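The difference between a lazy and an eager migration can be shown with plain dicts. This is a minimal sketch and not Ming's migration API: the `schema_version` field, the `author` to `authors` change, and all function names are assumptions made for illustration.

```python
def migrate_v1_to_v2(doc):
    """Hypothetical change: a single 'author' string becomes an 'authors' list."""
    new = dict(doc)  # leave the stored document untouched
    new["authors"] = [new.pop("author")] if "author" in new else []
    new["schema_version"] = 2
    return new


def load(doc):
    """Lazy migration: upgrade an old document at read time."""
    if doc.get("schema_version", 1) < 2:
        doc = migrate_v1_to_v2(doc)
    return doc


def migrate_all(docs):
    """Eager migration: upgrade every stored document in one pass."""
    return [load(doc) for doc in docs]


old = {"_id": 1, "title": "Home", "author": "rick"}
new = load(old)
```

Lazy migration spreads the cost across reads and tolerates mixed versions in the collection; eager migration pays the cost once so that later code sees only the new shape.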
Ming Concepts
  • Inspired by SQLAlchemy
  • Group of classes to which you map your collections
  • Each class defines its schema, including indexes
  • Convenience methods for loading/saving objects and ensuring indexes are created
  • Migrations
  • Unit of Work – great for web applications
  • MIM – “Mongo in Memory” nice for unit tests
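The Unit of Work idea, queuing changes during a request and writing them in one flush, can be sketched without any library. The class below is a hypothetical illustration of the pattern, not Ming's session implementation; `save` is any callable that persists one document.

```python
class UnitOfWork:
    """Collects changed documents and writes them in one flush."""

    def __init__(self, save):
        self._save = save   # callable that persists one document
        self._dirty = {}    # _id -> document, pending write

    def mark_dirty(self, doc):
        """Queue a document for writing; repeated marks collapse into one."""
        self._dirty[doc["_id"]] = doc

    def flush(self):
        """Write every pending document once, then empty the queue."""
        count = len(self._dirty)
        for doc in self._dirty.values():
            self._save(doc)
        self._dirty.clear()
        return count

    def clear(self):
        """Discard pending changes, e.g. when a request fails."""
        self._dirty.clear()


saved = []                      # stands in for the database
uow = UnitOfWork(saved.append)
page = {"_id": 1, "title": "Home"}
uow.mark_dirty(page)
page["title"] = "Welcome"
uow.mark_dirty(page)            # same document: still one pending write
flushed = uow.flush()
```

In a web application, the flush would typically run once at the end of a successful request and the clear on an error, so a failed request leaves no partial writes queued.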
Ming Example

    from ming import schema
    from ming.orm import MappedClass
    from ming.orm import (FieldProperty, ForeignIdProperty,
                          RelationProperty)

    # `session` is an ORM session created elsewhere (not shown on the slide)

    class WikiPage(MappedClass):
        class __mongometa__:
            session = session
            name = 'wiki_page'  # MongoDB collection name

        _id = FieldProperty(schema.ObjectId)
        title = FieldProperty(str)
        text = FieldProperty(str)
        comments = RelationProperty('WikiComment')

    MappedClass.compile_all()  # Lets ming know about the mapping
Open Source
  • Ming
    • http://sf.net/projects/merciless/
    • MIT License
  • Allura
    • http://sf.net/p/allura/
    • Apache License
Future Work
  • mongos
  • New Allura Tools
  • Migrating legacy SF.net projects to Allura
  • Stats all in MongoDB rather than Hadoop?
  • Better APIs to access your project data
Questions?

MongoATL: How SourceForge is Using MongoDB
