Allura - an Open Source MongoDB Based Document Oriented SourceForge


MongoSF 2011 talk on Allura, the new platform for SourceForge that we released under an Apache license

Published in: Technology

  1. Allura – an Open Source MongoDB Based Document Oriented SourceForge. Rick Copeland, @rick446
  2. I am not Mark Ramm (sorry)
  3. Allura (“beta” devtools)
     - Rewrite developer tools with new architecture
     - Wiki, Tracker, Discussions, Git, Hg, SVN, with more to come
     - Single MongoDB replica set
     - Release early & often
  4. Allura Scaling
     - currently handles ~4M pageviews per day
     - Allura will eventually handle 10% (with lots of writing)
     - “Consume” currently handles 3M+ pageviews/day on one shard (read-mostly)
     - Allura can handle ~48k pageviews / day / shard
     - Add shards & optimize queries as we migrate projects to
     - Most data is project-specific; sharding by project is straightforward
  5. System Architecture (diagram)
     - Web-facing App Server
     - Task Daemon
     - SMTP Server
     - FUSE Filesystem (repository hosting)
  6. Ming – an “Object-Document Mapper?”
     - Your data has a schema
       - Your database can define and enforce it
       - It can live in your application (as with MongoDB)
       - Nice to have the schema defined in one place in the code
     - Sometimes you need a “migration”
       - Changing the structure/meaning of fields
       - Adding indexes, particularly unique indexes
       - Sometimes lazy, sometimes eager
     - “Unit of work”: queuing up all your updates can be handy
     - Python dicts are nice; objects are nicer
  7. Ming Concepts
     - Inspired by SQLAlchemy
     - Group of collection objects with schemas defined
     - Group of classes to which you map your collections
     - Use collection-level operations for performance
     - Use class-level operations for abstraction
     - Convenience methods for loading/saving objects and ensuring indexes are created
     - Migrations
     - Unit of Work: great for web applications
     - MIM (“Mongo in Memory”): nice for unit tests
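The "unit of work" idea from the slides above can be illustrated without Ming at all. The following is a minimal sketch in plain Python, not Ming's implementation: the `UnitOfWork` class, its method names, and the dict used as a stand-in for a collection are all hypothetical.

```python
class UnitOfWork:
    """Queues saves and deletes, then applies them to the store in one flush."""

    def __init__(self, store):
        self.store = store      # hypothetical backing store: dict of _id -> document
        self.dirty = {}         # _id -> document queued for saving
        self.deleted = set()    # _ids queued for removal

    def save(self, doc):
        self.deleted.discard(doc["_id"])
        self.dirty[doc["_id"]] = doc

    def delete(self, doc):
        self.dirty.pop(doc["_id"], None)
        self.deleted.add(doc["_id"])

    def flush(self):
        # Write queued saves, then apply queued deletes, then reset the queues.
        for _id, doc in self.dirty.items():
            self.store[_id] = dict(doc)
        for _id in self.deleted:
            self.store.pop(_id, None)
        self.dirty.clear()
        self.deleted.clear()


store = {}
uow = UnitOfWork(store)
page = {"_id": 1, "title": "Home", "text": "draft"}
uow.save(page)
page["text"] = "final"          # edits made before flush() are still picked up
before_flush = dict(store)      # nothing has reached the store yet
uow.flush()
```

In a web application the flush would typically happen once at the end of a request, which is one reason the pattern fits that setting.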
  8. Ming Example

         from ming import collection, schema, Field
         from ming.orm import (mapper, Mapper, RelationProperty,
                               ForeignIdProperty)

         # Document-level schemas; `session` is defined elsewhere
         WikiDoc = collection('wiki_page', session,
             Field('_id', schema.ObjectId()),
             Field('title', str, index=True),
             Field('text', str))
         CommentDoc = collection('comment', session,
             Field('_id', schema.ObjectId()),
             Field('page_id', schema.ObjectId(), index=True),
             Field('text', str))

         class WikiPage(object): pass
         class Comment(object): pass

         # Map classes onto collections; `ormsession` is defined elsewhere
         ormsession.mapper(WikiPage, WikiDoc, properties=dict(
             comments=RelationProperty('Comment')))
         ormsession.mapper(Comment, CommentDoc, properties=dict(
             page_id=ForeignIdProperty('WikiPage'),
             page=RelationProperty('WikiPage')))

         Mapper.compile_all()
  9. Allura Artifacts
     - Artifacts include tickets, wiki pages, discussions, comments, merge requests, etc.
     - On artifact change, a session extension:
       - Queues a Solr index operation (for full text search support)
       - Scans the artifact text for references to other artifacts
       - Updates statistics on objects created/modified/deleted
     - Class diagram: Artifact, VersionedArtifact, Snapshot, Message
  10. Allura Threaded Discussions

          MessageDoc = collection(
              'message', project_doc_session,
              Field('_id', str, if_missing=h.gen_message_id),
              Field('slug', str, if_missing=h.nonce),
              Field('full_slug', str),
              Field('parent_id', str), …)

      - _id: use an email Message-ID compatible key
      - slug: threaded path of random 4-digit hex numbers prefixed by parent (e.g. dead/beef/f00d → dead/beef → dead)
      - full_slug: slug interspersed with ISO-formatted message datetime
      - Easy queries for hierarchical data
        - Find all descendants of a message: slug prefix search “dead/.*”
        - Sort messages by thread, then by date: full_slug sort
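The slug scheme can be tried out in plain Python. This is a sketch under assumptions: the helper names `make_message` and `descendants_query` are hypothetical, and the exact layout of `full_slug` (an ISO timestamp, a colon, then the slug part, joined with slashes) is a guess at what "interspersed" means, not Allura's confirmed format.

```python
import re
import secrets
from datetime import datetime


def nonce():
    """Random 4-digit hex string such as 'beef'."""
    return secrets.token_hex(2)


def make_message(parent, posted_at, own_slug=None):
    """Build slug and full_slug for a message under `parent` (None for a thread root)."""
    own_slug = own_slug or nonce()
    part = f"{posted_at.isoformat()}:{own_slug}"   # assumed full_slug layout
    if parent is None:
        return {"slug": own_slug, "full_slug": part}
    return {"slug": f"{parent['slug']}/{own_slug}",
            "full_slug": f"{parent['full_slug']}/{part}"}


def descendants_query(message):
    """MongoDB-style query document: anchored prefix match on slug."""
    return {"slug": {"$regex": "^" + re.escape(message["slug"]) + "/"}}


root = make_message(None, datetime(2011, 5, 24, 9, 0), "dead")
reply = make_message(root, datetime(2011, 5, 24, 9, 30), "beef")
late_reply = make_message(root, datetime(2011, 5, 24, 11, 0), "0bad")
nested = make_message(reply, datetime(2011, 5, 24, 10, 0), "f00d")
messages = [late_reply, nested, root, reply]

# Sorting on full_slug orders by thread position first, then by date.
threaded = [m["slug"] for m in sorted(messages, key=lambda m: m["full_slug"])]

# Apply the same prefix pattern in memory to find all descendants of the root.
pattern = re.compile(descendants_query(root)["slug"]["$regex"])
descendant_slugs = sorted(m["slug"] for m in messages if pattern.match(m["slug"]))
```

Because the pattern is anchored at the start of the string, a MongoDB index on `slug` can serve the descendant lookup as a range scan.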
  11. MonQ: Async Queueing in MongoDB

          states = ('ready', 'busy', 'error', 'complete')
          result_types = ('keep', 'forget')
          MonQTaskDoc = collection(
              'monq_task', main_doc_session,
              Field('_id', schema.ObjectId()),
              Field('state', schema.OneOf(*states)),
              Field('result_type', schema.OneOf(*result_types)),
              Field('time_queue', datetime),
              Field('time_start', datetime),
              Field('time_stop', datetime),
              Field('task_name', str),  # dotted path to function
              Field('process', str),  # worker process name: “locks” the task
              Field('context', dict(
                  project_id=schema.ObjectId(),
                  app_config_id=schema.ObjectId(),
                  user_id=schema.ObjectId())),
              Field('args', list),
              Field('kwargs', {None: None}),
              Field('result', None, if_missing=None))
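One way a worker could claim a task from a collection shaped like this is a single find-and-modify that flips `state` from `ready` to `busy` and records the worker name in `process`. The sketch below only builds the query documents and simulates the claim on a list of dicts; `claim_spec` and `claim_in_memory` are hypothetical names, and whether Allura claims tasks exactly this way is an assumption. With PyMongo, the three values would be passed to `Collection.find_one_and_update(query, update, sort=sort)`, which performs the match and update atomically on the server.

```python
from datetime import datetime, timezone


def claim_spec(process_name, now=None):
    """Return (query, update, sort) for claiming the oldest ready task."""
    now = now or datetime.now(timezone.utc)
    query = {"state": "ready"}
    update = {"$set": {"state": "busy",
                       "process": process_name,   # the worker name "locks" the task
                       "time_start": now}}
    sort = [("time_queue", 1)]                    # oldest queued task first
    return query, update, sort


def claim_in_memory(tasks, process_name, now=None):
    """Apply the same claim to a list of task dicts (stand-in for the collection)."""
    query, update, _sort = claim_spec(process_name, now)
    ready = sorted((t for t in tasks if t["state"] == query["state"]),
                   key=lambda t: t["time_queue"])
    if not ready:
        return None
    task = ready[0]
    task.update(update["$set"])
    return task


tasks = [
    {"_id": 1, "state": "busy", "time_queue": datetime(2011, 5, 24, 8, 0), "process": "w0"},
    {"_id": 2, "state": "ready", "time_queue": datetime(2011, 5, 24, 9, 0), "process": None},
    {"_id": 3, "state": "ready", "time_queue": datetime(2011, 5, 24, 8, 30), "process": None},
]
claimed = claim_in_memory(tasks, "worker-1", now=datetime(2011, 5, 24, 10, 0))
```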
  12. Repository Cache Objects
      - On commit to a repo (Hg, SVN, or Git)
      - Build commit graph in MongoDB for new commits
      - Build auxiliary structures
        - tree structure, including all trees in a commit & last commit to modify
        - linear commit runs (useful for generating history)
        - commit difference summary (must be computed in Hg and Git)
      - Note references to other artifacts and commits
      - Repo browser uses cached structure to serve pages
      - Diagram: Commit, Tree, Trees, CommitRun, LastCommit, DiffInfo
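"Linear commit runs" can be read as maximal chains of commits in which each commit has exactly one parent and that parent has exactly one child. The sketch below computes such runs from a parent map. It is an illustration of that reading, not Allura's `CommitRun` code, and the function name and input shape are hypothetical.

```python
from collections import defaultdict


def linear_runs(parents):
    """parents: dict of commit id -> list of parent ids.

    Returns a list of runs, each a list of commit ids ordered oldest first.
    """
    children = defaultdict(list)
    for commit, commit_parents in parents.items():
        for parent in commit_parents:
            children[parent].append(commit)

    def extends_parent(commit):
        # True when this commit simply continues its single parent's run.
        ps = parents[commit]
        return len(ps) == 1 and ps[0] in parents and len(children[ps[0]]) == 1

    runs = []
    for commit in parents:
        if extends_parent(commit):
            continue                      # not the head of a run
        run = [commit]
        while True:
            kids = children[run[-1]]
            if len(kids) == 1 and extends_parent(kids[0]):
                run.append(kids[0])
            else:
                break
        runs.append(run)
    return runs


# a <- b <- c, then c branches into d and e, which merge in f, followed by g.
graph = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"],
         "e": ["c"], "f": ["d", "e"], "g": ["f"]}
runs = linear_runs(graph)
```

Storing one document per run rather than per commit means a history page can be generated from a handful of documents instead of one query per commit.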
  13. Repository Cache Lessons Learned
      - Using MongoDB to represent graph structures (commit graph, commit trees) requires careful query planning. Pointer-chasing is no fun!
      - Sometimes Ming validation and ORM overhead can be prohibitively expensive: time to drop down a layer.
      - Benchmarking and profiling are your friends, as are queries like {'_id': {'$in': […]}} for returning multiple objects
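The `$in` advice amounts to replacing one query per object with a few batched queries. A minimal sketch, assuming a `find` callable that takes a query document and returns matching documents (a PyMongo collection's `find` method has that shape); the helper names and the batch size are hypothetical.

```python
def batched_id_queries(ids, batch_size=500):
    """Yield {'_id': {'$in': [...]}} query documents covering all ids."""
    unique = list(dict.fromkeys(ids))      # de-duplicate, preserving order
    for start in range(0, len(unique), batch_size):
        yield {"_id": {"$in": unique[start:start + batch_size]}}


def load_many(find, ids, batch_size=500):
    """Fetch documents for `ids` in batches; results follow the order of `ids`."""
    by_id = {}
    for query in batched_id_queries(ids, batch_size):
        for doc in find(query):
            by_id[doc["_id"]] = doc
    return [by_id[i] for i in ids if i in by_id]


# In-memory stand-in for a collection, recording how many queries were issued.
calls = []
db = {i: {"_id": i} for i in range(5)}


def fake_find(query):
    calls.append(query)
    return [db[i] for i in query["_id"]["$in"] if i in db]


docs = load_many(fake_find, [3, 1, 3, 9, 0], batch_size=2)
```

Here five requested ids (one repeated, one missing) cost two queries instead of five.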
  14. Authorization: ProjectRole Objects

          ProjectRoleDoc = collection(
              'project_role', main_doc_session,
              Field('_id', schema.ObjectId()),
              Field('user_id', schema.ObjectId(), index=True),
              Field('project_id', schema.ObjectId(), index=True),
              Field('name', str),
              Field('roles', [schema.ObjectId()]),
              Index('user_id', 'project_id', 'name', unique=True)
          )

          class ProjectRole(object): pass

          main_orm_session.mapper(ProjectRole, ProjectRoleDoc, properties=dict(
              user_id=ForeignIdProperty('User'),
              project_id=ForeignIdProperty('Project'),
              user=RelationProperty('User'),
              project=RelationProperty('Project')))
  15. Authorization: ProjectRole Objects
      - Roles can be named roles (“Groups”) or user proxies.
      - Roles inherit all permissions of the roles they can “act as”.
      - User membership in a group is stored on the user proxy object (the list of roles for which the user has permission).
      - Authorization checks all roles transitively for a user. If any role has the required permission, access is granted.
      - Hierarchical role structures are supported, but not exposed in the UI.
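The transitive check described above is a graph walk over the `roles` lists. The following sketch assumes role documents held in a dict keyed by id and a separate `acl` mapping from permission name to the role ids granted it; both shapes, and the function names, are hypothetical rather than Allura's actual data model.

```python
def reachable_roles(start_role_ids, roles_by_id):
    """All role ids reachable by following 'roles' links, including the start roles."""
    seen = set()
    stack = list(start_role_ids)
    while stack:
        role_id = stack.pop()
        if role_id in seen:
            continue                       # guards against cycles
        seen.add(role_id)
        stack.extend(roles_by_id[role_id].get("roles", []))
    return seen


def has_access(user_role_id, permission, roles_by_id, acl):
    """Grant access if any role the user can act as holds the permission."""
    granted_to = acl.get(permission, set())
    return bool(reachable_roles([user_role_id], roles_by_id) & granted_to)


roles = {
    "proxy:alice": {"name": None, "roles": ["developer"]},   # user proxy role
    "developer": {"name": "Developer", "roles": ["member"]},
    "member": {"name": "Member", "roles": []},
    "admin": {"name": "Admin", "roles": ["developer"]},
}
acl = {"read": {"member"}, "write": {"developer"}, "configure": {"admin"}}

can_read = has_access("proxy:alice", "read", roles, acl)
can_configure = has_access("proxy:alice", "configure", roles, acl)
```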
  16. Flyway Migrations
      - Ming supports “lazy migrations” from one schema version to another automatically
      - Sometimes you want to explicitly version your DB
      - Flyway allows you to define various versions of your schema with pre- and post-conditions for running an “up” migration and a “down” migration
      - With multiple tools with interdependencies and a platform under it all, we thought we needed it
      - We didn’t, but it’s there and it works…
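A "lazy migration" upgrades a document when it is loaded rather than rewriting the whole collection up front. The sketch below shows the general idea in plain Python; it does not use Ming's or Flyway's API, and the `version` field, the upgrade functions, and the field names are all hypothetical.

```python
def v1_to_v2(doc):
    """Version 2 renames 'body' to 'text'."""
    doc = dict(doc)
    doc["text"] = doc.pop("body", "")
    doc["version"] = 2
    return doc


def v2_to_v3(doc):
    """Version 3 adds a 'labels' list."""
    doc = dict(doc)
    doc.setdefault("labels", [])
    doc["version"] = 3
    return doc


UPGRADES = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3


def load(doc):
    """Upgrade a stored document step by step until it matches the current schema."""
    while doc.get("version", 1) < CURRENT_VERSION:
        doc = UPGRADES[doc.get("version", 1)](doc)
    return doc


old = {"_id": 1, "body": "hi"}            # stored before versions were tracked
upgraded = load(old)
already_current = load({"_id": 2, "text": "x", "labels": ["a"], "version": 3})
```

An eager migration would run the same upgrade functions over every document in one pass instead of waiting for each to be read.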
  17. What We Liked
      - Performance, performance, performance: easily handle 90% of traffic from 1 DB server, 4 web servers
      - Schemaless server allows fast schema evolution in development, making many migrations unnecessary
      - Replication is easy, making scalability and backups easy
        - Keep a “backup slave” running
        - Kill backup slave, copy off database, bring the slave back up
        - Automatic re-sync with master
      - Query Language
        - You mean I can have performance without map-reduce?
      - GridFS
  18. Pitfalls
      - Too-large documents
        - Store less per document
        - Return only a few fields
      - Ignoring indexing
        - Watch your server log; bad queries show up there
      - Too much denormalization
        - Try to use an index if all you need is a backref
      - Ignoring your data’s schema
      - Using many databases when one will do
      - Using too many queries
  19. Open Source
      - Ming: MIT License
      - Allura: Apache License
  20. Future Work
      - mongos
      - New Allura Tools
      - Migrating legacy projects to Allura
      - Stats all in MongoDB rather than Hadoop?
      - Better APIs to access your project data
  21. Rick Copeland, @rick446