Building a Directed Graph with MongoDB


Details of how Wordnik built a directed graph on top of MongoDB. This is the presentation given during MongoSF 2011 by Tony Tam.

Published in: Technology

  1. Building a Directed Graph with MongoDB
     MongoSF 5/24/2011
     By Tony Tam @fehguy
  2. Who is Wordnik
     Word + Meaning Discovery Engine
     Clustered Application built with:
     Scala/Java/Jetty
     Only way in is via REST
     19M API calls/day @ 7ms/query average
     Physical servers
     72GB RAM, 8 core
     4.3TB DAS
     We’re MongoDB users for ~1.5 yrs
     Used in master/slave
     14B documents in MongoDB
  3. Why a graph for words
     Technique to model network relationships
     Properties are dynamic
     Links are “arbitrary”
     Runtime performance
     Answers in < 5ms/request
     Routing functions based on goals
     “find most likely word for X”
     “find more common form of Y”
  4. Why a graph for words
     Misspellings, abbreviations, texting, Twitter
  5. More about graphs
     Different types of Graphs
     Decisions have huge impact on design + implementation
     Nodes (vertices)
     String and numeric properties
     Edges (links)
     Finite set of labeled edge types (~30)
     Multiple target nodes per edge
     Each potentially different weight
     Directed, non-symmetrical
  6. Why build on MongoDB?
     Word Graph is core to Wordnik
     Many ways to build a graph
     Dedicated graph DBs
     Relational DBs
     MongoDB Document Storage
     Uber-flexible
     Successfully routes in < 5ms
     Long runway for scale-out
     Limit storage infrastructure components
     Easy to implement
  7. Wordnik graph data model
     Nodes
     _id field holds name, object type
     Index at no extra cost
     Arbitrary number of properties
     Only two datatypes for us: String, Double
     Node type info in node ID (_id)
     na_corpusCount => Double
     sa_source => String
  8. Wordnik graph data model
     Edges
     Destination(s)
     Weight
     Link Properties
     Stored in Mongo Arrays
     Array size is app limited
     Use $push, $pop
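
The node and edge layout from the two data-model slides can be sketched as plain documents. This is a minimal illustration, not Wordnik's schema: the slides name only `_id`, `na_corpusCount` and `sa_source`, so the `targets`, `to` and `weight` fields, the `synonym` edge type, the edge `_id` layout, the drop-oldest policy and all sample values are assumptions.

```javascript
// Sketch only: `targets`, `to`, `weight` and the "synonym" edge type are
// assumed names; the slides name only _id, na_corpusCount and sa_source.
const SEP = "|";

// Node _id carries the name and the object type, e.g. "cat|word".
function nodeId(name, type) {
  return name + SEP + type;
}

// Edge _id extends the node _id with the edge type (assumed layout).
function edgeId(name, type, edgeType) {
  return nodeId(name, type) + SEP + edgeType;
}

function makeNode(name, type, props) {
  return Object.assign({ _id: nodeId(name, type) }, props);
}

function makeEdge(name, type, edgeType) {
  return { _id: edgeId(name, type, edgeType), targets: [] };
}

// In-memory stand-in for a $push followed by removing the oldest entry
// whenever the application-level size limit is exceeded (assumed policy).
function addTarget(edge, to, weight, maxTargets) {
  edge.targets.push({ to: to, weight: weight });
  while (edge.targets.length > maxTargets) {
    edge.targets.shift();
  }
  return edge;
}

// Illustrative values only, not real Wordnik data.
const cat = makeNode("cat", "word", { na_corpusCount: 12345.0, sa_source: "corpus" });
const synonyms = makeEdge("cat", "word", "synonym");
addTarget(synonyms, nodeId("feline", "word"), 0.9, 2);
addTarget(synonyms, nodeId("kitty", "word"), 0.7, 2);
addTarget(synonyms, nodeId("moggy", "word"), 0.4, 2);
console.log(cat, synonyms);
```

Keeping the type in `_id` means the default `_id` index covers lookups by name and type, which is the "index at no extra cost" point on the slide.
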
  9. Access to Mongo
     Mongo Access via DAO layer
     Limit queries to ones that work “well”
     ALL queries use index
     Find Node “cat” of type “word”:
     db.node.findOne({_id:"cat|word"})
     Find Edge types for above:
     db.edge.find({_id:/^cat\|word\|/},{_id:1})
     Serialization/deserialization
     Done “the old-fashioned way”
     BasicDBObject, BasicDBList faster than mappers for our use case
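
Because `|` is the regex alternation operator, a starts-with pattern built from an id prefix such as `cat|word|` needs the separator escaped. The sketch below builds such a pattern in plain JavaScript and checks it against a few in-memory ids; the helper names and the sample edge types are hypothetical, and no database is involved.

```javascript
// Escape every regex metacharacter so the text is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Anchored, case-sensitive "starts with" pattern for all edges of one node.
function edgePrefixPattern(name, type) {
  return new RegExp("^" + escapeRegex(name + "|" + type + "|"));
}

const pattern = edgePrefixPattern("cat", "word");

// Hypothetical edge ids, for illustration only.
const ids = [
  "cat|word|synonym",
  "cat|word|hypernym",
  "catalog|word|synonym",
  "dog|word|synonym",
];
const matches = ids.filter((id) => pattern.test(id));
console.log(matches);

// Without escaping, the trailing "|" adds an empty alternative,
// so the pattern matches any string at all.
const unescaped = new RegExp("^cat|word|");
console.log(unescaped.test("zebra"));
```

An anchored, case-sensitive prefix like this is the form of regex that MongoDB can serve from an index, which fits the "ALL queries use index" rule on the slide.
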
  10. Query efficiency
      Max execution time is f (ahops)
  11. Routing, traversals, functions
      Typically find path from A to B
      Routes have costs
      Low cost or high probability
      Our use case is atypical
      LinkedIn vs. Maps
      Not from A to B
      More like “from A with 3 hops”
      This matters!
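
A "from A with 3 hops" query is a hop-bounded expansion rather than a shortest-path search. The sketch below shows one way to do it, a breadth-first walk over a hypothetical in-memory adjacency map; in a MongoDB-backed version each level would presumably be one batch of edge lookups, but that mapping is an assumption, as are the function name and the sample graph.

```javascript
// Hypothetical adjacency map: node id -> array of { to, weight }.
// Returns a Map of every node reachable from `start` in at most
// `maxHops` hops, with the minimum hop count at which it was reached.
function withinHops(edges, start, maxHops) {
  const hopsTo = new Map([[start, 0]]);
  let frontier = [start];
  for (let hop = 1; hop <= maxHops && frontier.length > 0; hop++) {
    const next = [];
    for (const id of frontier) {
      for (const target of edges[id] || []) {
        if (!hopsTo.has(target.to)) {
          hopsTo.set(target.to, hop);
          next.push(target.to);
        }
      }
    }
    frontier = next;
  }
  hopsTo.delete(start); // report only the nodes reached, not the origin
  return hopsTo;
}

// Illustrative directed chain with one shortcut: a -> b -> c -> d -> e, a -> c.
const sampleEdges = {
  a: [{ to: "b", weight: 1.0 }, { to: "c", weight: 0.5 }],
  b: [{ to: "c", weight: 1.0 }],
  c: [{ to: "d", weight: 1.0 }],
  d: [{ to: "e", weight: 1.0 }],
};
const reached = withinHops(sampleEdges, "a", 2);
console.log([...reached.entries()]);
```

The number of levels is fixed by `maxHops`, so the work is bounded by the hop count and the fan-out per node rather than by the size of the whole graph.
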
  12. Performance + scaling
  13. Performance + scaling
      Query by index only
      Use regex syntax in restricted fashion
      Starts with only
      No look behind
      Case sensitive
      Boring? Fast?
      Sharding is a no-brainer
      What about ObjectId()?
  14. Performance + scaling
      Horizontal? Vertical? Both? And when?
      Separate collections by edge type/object type
      Increases storage needs
      Collections all have padding, 30 collections => ~30x padding
      Sharding
      Use slick, built-in Mongo sharding
      Roll your own based on your data
      What does Wordnik do?
      Neither! (yet)
      30M Nodes, 50M Edges
      One collection for nodes
      One collection for edges
  15. Performance + scaling
      Selecting a shard key
      Done in application logic based on OUR data
      Depends on what you need
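
The slides do not say how a shard would be chosen in application logic. As one possible illustration, the sketch below hashes the `name|type` prefix of an `_id` so that a node and its edge documents route to the same shard. The hash function, the routing rule, the shard count and the function names are all assumptions, not Wordnik's scheme.

```javascript
// Sketch only: hash, routing rule and shard count are assumptions.
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function hashString(text) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept as unsigned 32-bit
  }
  return hash >>> 0;
}

// Use only the "name|type" prefix as the key, so "cat|word" and
// "cat|word|synonym" produce the same key.
function shardKey(id) {
  return id.split("|").slice(0, 2).join("|");
}

function shardFor(id, shardCount) {
  return hashString(shardKey(id)) % shardCount;
}

const nodeShard = shardFor("cat|word", 4);
const edgeShard = shardFor("cat|word|synonym", 4);
console.log(nodeShard, edgeShard);
```

Routing on the shared prefix keeps a node's edge fetch on a single shard; a different workload could reasonably key on something else, which is the "depends on what you need" point.
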
  16. End result
      Solves Wordnik Graph infrastructure needs
      Store Word nodes with UGC, corpus, structured, analytical data
      Batch fetch Edges @ > 50k/second
      Find Edge + endpoints in 80 ms
      Powers our…
      Word Selection
      Canonicalization
      Misspelling
      “Did you mean” logic
      Classification + Matching Engine
  17. Examples
      Misspellings
      Abbreviations
      Lemmatization
  18. Examples
      Term normalization
      Find similar words
      Meaning normalization
      Find “more common” form
  19. Examples
      Applied Word Graph
      Recall:
      “Computers are stupid”
      English is complex
      Clustering + classification algorithms:
      Stink without consistent data
      “The” => “the” (duh)
      “geese” => “goose” (ok)
      Stink when they’re slow
      Graph + Clustering/Classification
      Just add data
  20. MongoDB makes a great graph back-end
      See more about Wordnik APIs:
      Further Reading
      Migrating from MySQL to MongoDB
      Maintaining your MongoDB Installation
      Source Code
      Mapping Benchmark
      Wordnik OSS Tools
  21. MongoDB makes a great graph back-end
      Questions?