Rapid, Scalable Web Development with MongoDB, Ming, and Python
 


In 2009, SourceForge embarked on a quest to modernize our websites, converting a site written for a hodge-podge of relational databases in PHP to a MongoDB and Python-powered site, with a small development team and a tight deadline. We have now completely rewritten both the consumer and producer parts of the site with better usability, more functionality and better performance. This talk focuses on how we're using MongoDB, the pymongo driver, and Ming, an ORM-like library implemented at SourceForge, to continually improve and expand our offerings, with a special focus on how anyone can quickly become productive with Ming and pymongo without having to apologize for poor performance.


Speaker Notes

  • Used to be PHP + MySQL + Postgres + … How did we get started down the NoSQL path?
  • New team hired, which had some new ideas. Crazy short timeline.
  • FossFor.us was a technical success, if a market failure. Can we bring the same technical success to sf.net?
  • Content begins life in the “Develop” world (the old PHP/relational codebase). Gobble runs off existing APIs and new AMQP feeds to populate the master MongoDB server. Webheads run off local MongoDB slaves.
  • Questionable architectural decision, but it made sense at the time. All the while, our data model evolved. Factored out a lot of our custom MongoDB code into…
  • index=True (also has a unique=True arg). ForeignIdProperty is like a foreign key; it hints to the mapper how to infer RelationProperties. Empty classes can have properties added by their mappers.
  • Several interacting apps, with MongoDB and SOLR at the center. Used to have RabbitMQ, but we wanted more visibility into our queues. Now I’ll show some of the model classes we use.
  • Efficient, threaded messages. Short “slugs” for linking to a particular message (and for URLs). Easy sorting. Note the ‘if_missing’ arguments – basically default values that are created by Ming (in this case, Python functions to generate the values).
  • Async queuing. Some perf instrumentation. Can use findAndModify to grab and lock tasks for processing. Uses both slow polling and AMQP notification (but only to say POLL NOW).
  • UI-centered design. Make sure it doesn’t take a lot of queries to build a page.
  • Still alpha-quality (OSS because that’s the way we roll)
  • Can record many more than 4k events per second → 345M events per day (single-thread, VM on a laptop); we get a lot of traffic, but not that much. Map-reduce makes this much lower if calculated continuously, still hundreds of events per second even with MR locking. Probably use a slave for MR processing; may end up using Hadoop or Disco if we need to.

Rapid, Scalable Web Development with MongoDB, Ming, and Python: Presentation Transcript

  • Rapid, Scalable Web Development with MongoDB, Ming, and Python Rick Copeland @rick446 [email_address]
  • ?
    • NoSQL at SourceForge
    • Rewriting Consume
    • Introducing Ming
    • Allura – Open-Sourcing Open Source
    • Zarkov – MongoDB-based (near) real-time analytics
  • SF.net “BlackOps”: FossFor.us. User Editable! Web 2.0! (ish) Not Ugly!
    • FossFor.us used CouchDB (NoSQL)
    • “Just adding new fields was trivial, and was happening all the time” – Mark Ramm
    • Scaling up to the level of SF.net needs research
      • CouchDB
      • MongoDB
      • Tokyo Cabinet/Tyrant
      • Cassandra... and others
    Moving to NoSQL
  • What we were looking for
    • Performance – how does a single node perform?
    • Scalability – needs to support simple replication
    • Ability to handle complex data and queries
    • Ease of development
    • NoSQL at SourceForge
    • Rewriting Consume
    • Introducing Ming
    • Allura – Open-Sourcing Open Source
    • Zarkov – MongoDB-based (near) real-time analytics
  • Rewriting “Consume”
    • Most traffic on SF.net hits 3 types of pages:
      • Project Summary
      • File Browser
      • Download
    • Pages are read-mostly, with infrequent updates from the “Develop” side of sf.net
      • Original goal is 1 MongoDB document per project (sketched after this list)
      • Later split release data because some projects have lots of releases
    • Periodic updates via RSS and AMQP from “Develop”
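    A hypothetical sketch of the one-document-per-project idea, with releases split into their own collection. Field names, collection names, and the database handle are illustrative only, not SF.net's actual schema:

        from pymongo import MongoClient

        db = MongoClient().sfnet              # assumed database handle

        # One read-mostly document renders the project summary page.
        project = {
            '_id': 'example-project',
            'name': 'Example Project',
            'summary': 'Short description shown on the summary page',
            'latest_release': {'filename': 'example-1.0.tar.gz',
                               'date': '2011-06-01'},
        }
        db.projects.save(project)

        # Release data was later moved to its own collection because some
        # projects have lots of releases.
        releases = db.releases.find({'project_id': 'example-project'}).sort('date', -1)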
  • Deployment Architecture [diagram]: a Load Balancer / Proxy in front of several webheads (Apache mod_wsgi / TG 2.0), each paired with a local MongoDB slave; a Master DB Server running the MongoDB master; and a Gobble Server fed from Develop.
  • Deployment Architecture (revised) [diagram]: the same Load Balancer / Proxy, webheads (Apache mod_wsgi / TG 2.0), Master DB Server with the MongoDB master, and Gobble Server fed from Develop, but without the per-webhead slaves. Scalability is good; single-node performance is good, too.
    • NoSQL at SourceForge
    • Rewriting Consume
    • Introducing Ming
    • Allura – Open-Sourcing Open Source
    • Zarkov – MongoDB-based (near) real-time analytics
  • Ming – an “Object-Document Mapper?”
    • Your data has a schema
      • Your database can define and enforce it
      • It can live in your application (as with MongoDB)
      • Nice to have the schema defined in one place in the code
    • Sometimes you need a “migration”
      • Changing the structure/meaning of fields
      • Adding indexes, particularly unique indexes
      • Sometimes lazy, sometimes eager
    • “Unit of work”: queuing up all your updates can be handy
    • Python dicts are nice; objects are nicer
  • Ming Concepts
    • Inspired by SQLAlchemy
    • Group of collection objects with schemas defined
    • Group of classes to which you map your collections
    • Use collection-level operations for performance
    • Use class-level operations for abstraction
    • Convenience methods for loading/saving objects and ensuring indexes are created
    • Migrations
    • Unit of Work – great for web applications
    • MIM – “Mongo in Memory” nice for unit tests
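    For the MIM point, a hedged sketch of how a unit test might bind Ming to the in-memory backend (the exact setup call may differ by Ming version):

        from ming import Session, create_datastore

        # 'mim://' selects Ming's Mongo-in-Memory implementation, so unit tests
        # can run without a mongod process; the resulting session is then handed
        # to collection() / mapper() definitions like the example below.
        session = Session(create_datastore('mim://'))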
  • Ming Example

    from ming import schema, Field, collection
    from ming.orm import (mapper, Mapper, RelationProperty,
                          ForeignIdProperty)

    # 'session' and 'ormsession' are Ming sessions configured elsewhere

    WikiDoc = collection('wiki_page', session,
        Field('_id', schema.ObjectId()),
        Field('title', str, index=True),
        Field('text', str))

    CommentDoc = collection('comment', session,
        Field('_id', schema.ObjectId()),
        Field('page_id', schema.ObjectId(), index=True),
        Field('text', str))

    class WikiPage(object): pass
    class Comment(object): pass

    ormsession.mapper(WikiPage, WikiDoc, properties=dict(
        comments=RelationProperty('Comment')))
    ormsession.mapper(Comment, CommentDoc, properties=dict(
        page_id=ForeignIdProperty('WikiPage'),
        page=RelationProperty('WikiPage')))

    Mapper.compile_all()
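    Roughly how these mapped classes and the underlying collections might be used (a hedged sketch; it assumes the sessions above are bound to a real or MIM datastore, and exact method names may vary slightly by Ming version):

        # Class-level (ORM) operations: objects plus unit-of-work semantics
        page = WikiPage(title='MyPage', text='Hello, MongoDB')
        ormsession.flush()                 # queued inserts/updates are written here

        page = WikiPage.query.get(title='MyPage')
        comments = Comment.query.find(dict(page_id=page._id)).all()

        # Collection-level operations: closer to raw pymongo, less overhead
        raw = WikiDoc.m.find({'title': 'MyPage'}).first()

    Class-level operations give abstraction; collection-level operations trade some of that for performance, which matters later for the repository cache.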
    • NoSQL at SourceForge
    • Rewriting Consume
    • Introducing Ming
    • Allura – Open-Sourcing Open Source
    • Zarkov – MongoDB-based (near) real-time analytics
  • Python / MongoDB Taking Over… Allura [diagram]: the same revised deployment (Load Balancer / Proxy, Apache mod_wsgi / TG 2.0 webheads, Master DB Server with the MongoDB master, Gobble Server fed from Develop), now running Allura. Scalability is good; single-node performance is good, too.
  • Allura Architecture [diagram]: Web-facing App Server, Task Daemon, SMTP Server, and a FUSE Filesystem (repository hosting).
  • Allura Threaded Discussions
    MessageDoc = collection(
        'message', project_doc_session,
        Field('_id', str, if_missing=h.gen_message_id),
        Field('slug', str, if_missing=h.nonce),
        Field('full_slug', str),
        Field('parent_id', str), …)

    _id – an email Message-ID compatible key
    slug – threaded path of random 4-digit hex numbers prefixed by the parent's slug (e.g. dead/beef/f00d → dead/beef → dead)
    full_slug – slug interspersed with the ISO-formatted message datetime (20110627…dead/20110627…beef….)
    Easy queries for hierarchical data:
      • Find all descendants of a message – slug prefix search (“dead/.*”)
      • Sort messages by thread, then by date – full_slug sort
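    A rough pymongo sketch of those two query patterns (collection and database names are assumptions; the real code goes through Ming):

        import re
        from pymongo import MongoClient

        db = MongoClient().forge            # assumed database handle

        # All descendants of the message with slug 'dead/beef': prefix search
        descendants = db.message.find({'slug': re.compile(r'^dead/beef/')})

        # A whole subtree in threaded, then chronological order: sort on full_slug,
        # which interleaves ISO datetimes with the slug path
        thread = db.message.find({'slug': re.compile(r'^dead/')}).sort('full_slug', 1)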
  • MonQ: Async Queueing in MongoDB
    states = ('ready', 'busy', 'error', 'complete')
    result_types = ('keep', 'forget')

    MonQTaskDoc = collection(
        'monq_task', main_doc_session,
        Field('_id', schema.ObjectId()),
        Field('state', schema.OneOf(*states)),
        Field('result_type', schema.OneOf(*result_types)),
        Field('time_queue', datetime),
        Field('time_start', datetime),
        Field('time_stop', datetime),
        # dotted path to function
        Field('task_name', str),
        Field('process', str),              # worker process name: "locks" the task
        Field('context', dict(
            project_id=schema.ObjectId(),
            app_config_id=schema.ObjectId(),
            user_id=schema.ObjectId())),
        Field('args', list),
        Field('kwargs', {None: None}),      # dict with arbitrary keys
        Field('result', None, if_missing=None))
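    The speaker notes mention using findAndModify to grab and lock tasks. A minimal pymongo sketch of that idea, using the driver's find_and_modify wrapper (newer drivers spell this find_one_and_update); the worker name, database handle, and run_task helper are assumptions, not MonQ's actual code:

        from datetime import datetime
        from pymongo import MongoClient

        db = MongoClient().forge            # assumed database handle

        # Atomically claim the oldest 'ready' task for this worker process;
        # new=True returns the updated document, or None if nothing was ready.
        task = db.monq_task.find_and_modify(
            query={'state': 'ready'},
            sort=[('time_queue', 1)],
            update={'$set': {'state': 'busy',
                             'process': 'worker-01',
                             'time_start': datetime.utcnow()}},
            new=True)
        if task is not None:
            run_task(task)                  # hypothetical: resolve and call task['task_name']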
  • Repository Cache Objects
    • On commit to a repo (Hg, SVN, or Git)
    • Build commit graph in MongoDB for new commits
    • Build auxiliary structures
      • tree structure, including all trees in a commit & last commit to modify
      • linear commit runs (useful for generating history)
      • commit difference summary (must be computed in Hg and Git)
    • Note references to other artifacts and commits
    • Repo browser uses cached structure to serve pages
    Cache object classes [diagram]: Commit, CommitRun, Tree, Trees, LastCommit, DiffInfo
  • Repository Cache Lessons Learned
    • Using MongoDB to represent graph structures (commit graph, commit trees) requires careful query planning; pointer-chasing is no fun!
    • Sometimes Ming validation and ORM overhead can be prohibitively expensive – time to drop down a layer.
    • Benchmarking and profiling are your friends, as are queries like {'_id': {'$in': […]}} for returning multiple objects.
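    For instance, instead of chasing parent pointers one query at a time, a page can batch-fetch the commits it needs with $in (a sketch; the cached id list and collection names are assumptions):

        from pymongo import MongoClient

        db = MongoClient().forge                          # assumed database handle
        commit_ids = ['deadbeef01', 'deadbeef02']         # e.g. taken from a CommitRun document

        # One round trip for the whole batch instead of one query per pointer
        commits = {c['_id']: c
                   for c in db.repo_commit.find({'_id': {'$in': commit_ids}})}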
    • NoSQL at SourceForge
    • Rewriting Consume
    • Introducing Ming
    • Allura – Open-Sourcing Open Source
    • Zarkov – MongoDB-based (near) real-time analytics
  • And now, for something completely different…
    • Business: we need more visibility into what users are doing
    • Low overhead
    • Near real-time
    • Unified view of lots of systems
      • Python
      • PHP
      • Perl
  • Introducing Zarkov
    • Asynchronous TCP server for event logging with gevent
    • Turn OFF “safe” writes, turn OFF Ming validation (or do it in the client)
    • Incrementally calculate aggregate stats based on the event log using map-reduce with {'out': 'reduce'}
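    A hedged sketch of that incremental pattern with pymongo's map_reduce (the event fields, collection names, database handle, and time window are assumptions, not Zarkov's actual code):

        from datetime import datetime, timedelta
        from bson.code import Code
        from bson.son import SON
        from pymongo import MongoClient

        db = MongoClient().zarkov                         # assumed database handle
        this_run = datetime.utcnow()
        last_run = this_run - timedelta(minutes=5)        # assumed aggregation interval

        map_js = Code("function () { emit(this.type, 1); }")
        reduce_js = Code("function (key, values) { return Array.sum(values); }")

        # Map-reduce only the events that arrived since the last run, and fold the
        # results into the existing aggregate collection via the 'reduce' output mode.
        db.event.map_reduce(
            map_js, reduce_js,
            out=SON([('reduce', 'event_counts')]),
            query={'timestamp': {'$gte': last_run, '$lt': this_run}})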
  • Lessons learned at SourceForge
  • What We Liked
    • Performance, performance, performance – Easily handle 90% of SF.net traffic from 1 DB server, 4 web servers
    • Dynamic schema allows fast schema evolution in development, making many migrations unnecessary
    • Replication is easy, making scalability and backups easy
    • Query Language
      • You mean I can have performance without map-reduce?
    • GridFS
  • Pitfalls
    • Too-large documents
      • Store less per document
      • Return only a few fields (see the sketch after this list)
    • Ignoring indexing
      • Watch your server log; bad queries show up there
    • Too much denormalization
      • Try to use an index if all you need is a backref
      • Stale data is a tricky problem
    • Using many databases when one will do
    • Using too many queries
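    On the "return only a few fields" and indexing points above, a small pymongo sketch (collection and field names are illustrative, not SF.net's schema):

        from pymongo import MongoClient

        db = MongoClient().sfnet                          # assumed database handle

        # Index the common lookup key; unindexed queries show up as slow
        # operations in the server log.
        db.projects.create_index('shortname')

        # Projection: return only the fields the page needs rather than a
        # too-large document with every release embedded.
        summary = db.projects.find_one(
            {'shortname': 'example-project'},
            {'name': 1, 'summary': 1, 'latest_release': 1})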
  • Open Source
    • Ming – http://sf.net/projects/merciless/ (MIT License)
    • Allura – http://sf.net/p/allura/ (Apache License)
    • Zarkov – http://sf.net/p/zarkov/ (Apache License)
  • Future Work
    • mongos
    • New Allura Tools
    • Migrating legacy SF.net projects to Allura
    • Continue to optimize stats & analytics (Zarkov and others)
    • Better APIs to access your project data
  • Rick Copeland @rick446 [email_address]