Building a Directed Graph with MongoDB

Details of how Wordnik built a directed graph on top of MongoDB. This is the presentation given during MongoSF 2011 by Tony Tam.

Transcript

  • 1. Building a Directed Graph with MongoDB
    MongoSF 5/24/2011
    By Tony Tam @fehguy
  • 2. Who is Wordnik
    Word + Meaning Discovery Engine
    Clustered Application built with:
    Scala/Java/Jetty
    Only way in is via REST
    19M API calls/day @ 7ms/query average
    Physical servers
    72GB RAM, 8 core
    4.3TB DAS
    MongoDB users for ~1.5 years
    Deployed in master/slave configuration
    14B documents in MongoDB
  • 3. Why a graph for words
    Technique to model network relationships
    Properties are dynamic
    Links are “arbitrary”
    Runtime performance
    Answers in < 5ms/request
    Routing functions based on goals
    “find most likely word for X”
    “find more common form of Y”
  • 4. Why a graph for words
    Misspellings, abbreviations, texting, Twitter
  • 5. More about graphs
    Different types of Graphs
    Decisions have huge impact on design + implementation
    Nodes (vertices)
    String and numeric properties
    Edges (links)
    Finite set of labeled edge types (~30)
    Multiple target nodes per edge
    Each potentially different weight
    Directed, non-symmetrical
  • 6. Why build on MongoDB?
    Word Graph is core to Wordnik
    Many ways to build a graph
    Dedicated graph DBs
    Relational DBs
    MongoDB Document Storage
    Uber-flexible
    Successfully routes in < 5ms
    Long runway for scale-out
    Limit storage infrastructure components
    Easy to implement
  • 7. Wordnik graph data model
    Nodes
    _id field holds name, object type
    Index at no extra cost
    Arbitrary number of properties
    Only two data types for us: String and Double
    Node type info in node ID (_id)
    na_corpusCount => Double
    sa_source => String
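The node layout above can be sketched as a plain document builder. This is a minimal sketch, not Wordnik's code: `makeNode` is a hypothetical helper, the property values are made up, and the only parts taken from the slides are the `name|type` form of `_id` and the `na_` (Double) and `sa_` (String) prefixes.

```javascript
// Hypothetical helper: builds a node document in the shape slide 7 describes.
// Only the "name|type" _id and the na_/sa_ prefixes come from the slides;
// the validation rule and the sample values are assumptions.
function makeNode(name, type, props) {
  const doc = { _id: `${name}|${type}` };
  for (const [key, value] of Object.entries(props)) {
    const numericOk = key.startsWith("na_") && typeof value === "number";
    const stringOk = key.startsWith("sa_") && typeof value === "string";
    if (!numericOk && !stringOk) {
      throw new TypeError(`property ${key} does not match its prefix type`);
    }
    doc[key] = value;
  }
  return doc;
}

// Illustrative values only, not real corpus data.
const cat = makeNode("cat", "word", { na_corpusCount: 12345, sa_source: "corpus" });
console.log(cat);
```

Because the name and type live in `_id`, a lookup such as "cat" of type "word" is served by the `_id` index that every collection already has.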
  • 8. Wordnik graph data model
    Edges
    Destination(s)
    Weight
    Link Properties
    Stored in Mongo Arrays
    Array size is app limited
    Use $push, $pop
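A rough sketch of how the edge arrays might be maintained. `$push` and `$pop` are real MongoDB update operators; everything else here is assumed for illustration: the `targets` field name, the one-document-per-(node, edge type) layout, the policy of refusing inserts once the array is full, and the in-memory `applyUpdate` stand-in that mimics what the server would do.

```javascript
// Assumed edge shape: one document per (source node, edge type), with the
// destinations stored in an array. The field name "targets" is hypothetical.
function edgeId(nodeId, edgeType) {
  return `${nodeId}|${edgeType}`;
}

// Update document that appends one destination using $push.
function pushTarget(dest, weight) {
  return { $push: { targets: { dest, weight } } };
}

// Update document that removes the last destination using $pop
// (1 removes the last element, -1 removes the first).
function popLastTarget() {
  return { $pop: { targets: 1 } };
}

// In-memory stand-in for how the server applies these two operators.
function applyUpdate(doc, update) {
  const targets = [...(doc.targets || [])];
  if (update.$push) targets.push(update.$push.targets);
  if (update.$pop) {
    if (update.$pop.targets === 1) targets.pop();
    else targets.shift();
  }
  return { ...doc, targets };
}

// App-side size limit (assumed policy): refuse to grow a full array.
function addTarget(doc, dest, weight, maxTargets) {
  if (doc.targets.length >= maxTargets) return doc;
  return applyUpdate(doc, pushTarget(dest, weight));
}

let edge = { _id: edgeId("cat|word", "synonym"), targets: [] };
edge = addTarget(edge, "feline|word", 0.9, 2);
edge = addTarget(edge, "kitty|word", 0.7, 2);
edge = addTarget(edge, "moggy|word", 0.2, 2); // ignored: array is full
console.log(edge.targets.length);
edge = applyUpdate(edge, popLastTarget());
console.log(edge.targets);
```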
  • 9. Access to mongo
    Mongo Access via DAO layer
    Limit queries to ones that work “well”
    ALL queries use index
    Find Node “cat” of type “word”:
    db.node.findOne({_id:"cat|word"})
    Find Edge types for above:
    db.edge.find({_id:/^cat\|word\|/},{_id:1})
    Serialization/deserialization
    Done “the old-fashioned way”
    BasicDBObject, BasicDBList faster than mappers for our use case
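One detail of the edge lookup is worth spelling out: `|` is the alternation operator in a regular expression, so a pattern built from an id such as `cat|word` needs the separators escaped to behave as a "starts with" match. The sketch below builds such a pattern in plain JavaScript. `escapeRegex` and `edgePrefixPattern` are hypothetical helpers and the sample ids are made up.

```javascript
// Escape regex metacharacters so "|" in an id is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Anchored, case-sensitive "starts with" pattern covering all edges of one node.
function edgePrefixPattern(nodeId) {
  return new RegExp("^" + escapeRegex(nodeId + "|"));
}

const pattern = edgePrefixPattern("cat|word");

// Filter and projection shaped like the find() call on the slide.
const filter = { _id: pattern };
const projection = { _id: 1 };

// Made-up edge ids to show which ones the prefix selects.
const ids = [
  "cat|word|synonym",
  "cat|word|form",
  "catalog|word|synonym",
  "dog|word|synonym",
];
const matched = ids.filter((id) => pattern.test(id));
console.log(matched);

// Without escaping, the empty alternative after the last "|" matches any string.
const unescaped = /^cat|word|/;
console.log(unescaped.test("dog|word|synonym"));
```

The `filter` and `projection` objects are what would be handed to `find()`; the array filtering only demonstrates the matching behavior locally.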
  • 10. Query efficiency
    Max execution time is f(hops)
  • 11. Routing, traversals, functions
    Typically find path from A to B
    Routes have costs
    Low cost or high probability
    Our use case is atypical
    LinkedIn vs. Maps
    Not from A to B
    More like “from A with 3 hops”
    This matters!
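The "from A with 3 hops" style of query can be illustrated with a hop-limited breadth-first expansion. This is a generic sketch over an in-memory adjacency map with made-up data, not the routing code used at Wordnik; `reachableWithin` is a hypothetical name.

```javascript
// In-memory adjacency list: source node id -> array of { dest, weight }.
// The graph is made up for illustration.
const graph = new Map([
  ["a", [{ dest: "b", weight: 1 }, { dest: "c", weight: 1 }]],
  ["b", [{ dest: "d", weight: 1 }]],
  ["c", [{ dest: "d", weight: 1 }]],
  ["d", [{ dest: "e", weight: 1 }]],
  ["e", [{ dest: "f", weight: 1 }]],
]);

// Breadth-first expansion that stops after maxHops levels.
// Returns a Map of node id -> number of hops from the start node.
function reachableWithin(edges, start, maxHops) {
  const hopsTo = new Map([[start, 0]]);
  let frontier = [start];
  for (let hop = 1; hop <= maxHops && frontier.length > 0; hop++) {
    const next = [];
    for (const node of frontier) {
      for (const { dest } of edges.get(node) || []) {
        if (!hopsTo.has(dest)) {
          hopsTo.set(dest, hop);
          next.push(dest);
        }
      }
    }
    frontier = next;
  }
  return hopsTo;
}

const result = reachableWithin(graph, "a", 3);
console.log([...result.entries()]);
```

With a store behind it, each level of the frontier could be fetched in one batch, so the number of round trips would grow with the hop limit instead of with the size of the graph.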
  • 12. Performance + Scaling
  • 13. Performance + scaling
    Query by index only
    Use regex syntax in restricted fashion
    Starts with only
    No look behind
    Case sensitive
    Boring? Fast?
    Sharding is a no-brainer
    What about ObjectId()?
  • 14. Performance + scaling
    Horizontal? Vertical? Both? And when?
    Separate collections by edge type/object type
    Increases storage needs
    Collections all have padding, 30 collections => ~30x padding
    Sharding
    Use slick, built-in Mongo sharding
    Roll your own based on your data
    What does Wordnik do?
    Neither! (yet)
    30M Nodes, 50M Edges
    One collection for nodes
    One collection for edges
  • 15. Performance + scaling
    Selecting a shard key
    Done in application logic based on OUR data
    Depends on what you need
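The slides do not say how the shard key is chosen, so the following is only one possible way to pick a shard in application logic. It assumes ids of the form `name|type` for nodes and `name|type|edgeType` for edges, and hashes the `name|type` part with FNV-1a so that a node and its edges land on the same shard. `shardFor` and the shard count are hypothetical.

```javascript
// 32-bit FNV-1a hash over UTF-16 code units (good enough for a sketch).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Route by the "name|type" prefix so a node and all of its edges co-locate.
// Assumes names and types never contain "|" themselves.
function shardFor(id, shardCount) {
  const [name, type] = id.split("|");
  return fnv1a(`${name}|${type}`) % shardCount;
}

const SHARD_COUNT = 4; // illustrative value
console.log(shardFor("cat|word", SHARD_COUNT));
console.log(shardFor("cat|word|synonym", SHARD_COUNT));
```

Co-locating a node with its edges keeps each hop on a single shard, at the cost of uneven load if a few nodes carry most of the edges.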
  • 16. End result
    Solves Wordnik Graph infrastructure needs
    Store Word nodes with UGC, corpus, structured, analytical data
    Batch fetch Edges @ > 50k/second
    Find Edge + endpoints in 80mS
    Powers our…
    Word Selection
    Canonicalization
    Misspelling
    “Did you mean” logic
    Classification + Matching Engine
  • 17. Examples
    Misspellings
    Abbreviations
    Lemmatization
  • 18. Examples
    Term normalization
    Find similar words
    Meaning normalization
    Find “more common” form
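As an illustration of a "more common form" lookup, the sketch below follows one edge type from a word and picks the destination with the highest `na_corpusCount`. The edge type name `variant`, the counts, and the selection rule are all assumptions made for the example, not details from the talk.

```javascript
// Made-up nodes; the counts are illustrative, not real corpus data.
const nodes = new Map([
  ["coulor|word", { _id: "coulor|word", na_corpusCount: 1 }],
  ["colour|word", { _id: "colour|word", na_corpusCount: 40 }],
  ["color|word", { _id: "color|word", na_corpusCount: 100 }],
]);

// "variant" is a hypothetical edge type name.
const variantEdges = new Map([
  ["coulor|word|variant", [
    { dest: "colour|word", weight: 0.5 },
    { dest: "color|word", weight: 0.5 },
  ]],
]);

// Follow one edge type and return the id of the most frequent form,
// or the starting id if no neighbor is more common. Returns null for unknown nodes.
function moreCommonForm(nodeId, nodeMap, edgeMap, edgeType) {
  const start = nodeMap.get(nodeId);
  if (!start) return null;
  let best = start;
  for (const { dest } of edgeMap.get(`${nodeId}|${edgeType}`) || []) {
    const candidate = nodeMap.get(dest);
    if (candidate && candidate.na_corpusCount > best.na_corpusCount) {
      best = candidate;
    }
  }
  return best._id;
}

console.log(moreCommonForm("coulor|word", nodes, variantEdges, "variant"));
```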
  • 19. Examples
    Applied Word Graph
    Recall:
    “Computers are stupid”
    English is complex
    Clustering + classification algorithms:
    Stink without consistent data
    “The” => “the” (duh)
    “geese” => “goose” (ok)
    Stink when they’re slow
    Graph + Clustering/Classification
    Just add data
  • 20. MongoDB makes a great graph back-end
    See more about Wordnik APIs:
    http://developer.wordnik.com
    Further Reading
    Migrating from MySQL to MongoDB
    http://www.slideshare.net/fehguy/migrating-from-mysql-to-mongodb-at-wordnik
    Maintaining your MongoDB Installation
    http://www.slideshare.net/fehguy/mongo-sv-tony-tam
    Source Code
    Mapping Benchmark
    https://github.com/fehguy/mongodb-benchmark-tools
    Wordnik OSS Tools
    https://github.com/wordnik/wordnik-oss
  • 21. MongoDB makes a great graph back-end
    Questions?