Neo4j - Tales from the Trenches

Lessons learned from over a year with Neo4j on a social network / recommendation engine. Presented at Neo4j user group in London, UK in 2012.

Published in: Technology

  • TODO Neo logo
  • Nicki: A complete online profile of your interests, tastes and opinions. Designed to be useful to you and to the rest of the world. http://labs.yougov.co.uk
  • Michal
  • Example: find all the companies that work on Opigram
  • Nicki
  • Don’t spend too much time on this. First 2: general and applicable to all. Next 2: specific tips; there is a chance you’ll need them. Last one: performance.
  • Describe the problem and how it evolved.
    Can express:
      – A user’s descriptors for a given review
      – Top 5 descriptors for a given thing
      – All things that are cool
    Problems with the resulting schema:
      – A change in a review means the votes need updating
      – Need to make sure “described as” is deleted when votes = 0
      – Finding “all people that described something as cool” is too complicated
      – Not future-proof: what if we now want to review 2 things together (like Nicki and I)?
  • No need to keep track of votes.
    Can still do all the traversals I need:
      – A user’s descriptors for a given review
      – Top 5 descriptors for a given thing
      – All things that are cool
      – PLUS: all people that used a descriptor
    Can review multiple things now.
  • As Michal explained, a node is -
    A Neo Node ID is:
      – A Neo4j-generated unique ID
      – A long (generally auto-incrementing, like MySQL auto-incrementing primary keys or Oracle sequences)
      – Easily accessible and exposed via the Neo4J APIs
    You may hear that and think: great, I need a unique identifier, this sounds like it does what I need, so I shall just use that rather than manage it myself.
    ENTER LESSON 2: Don’t use Neo Node IDs as your primary keys
  • Benefits of this approach:
      – No code for you to worry about
      – If you have multiple clients writing to the database (legacy system), this will be taken care of for you under the covers
    generateUniqueID() needs to be unique across HA
  • Different versions handle IDs differently:
      – 1.4.2: mostly recycling of old IDs
      – 1.5+: possible changing of IDs between server restarts
    TODO: Don’t expose! + Index is your friend
  • Problem
      – Trying to pick a random number of nodes out of the graph
      – Not Neo4J’s sweet spot
      – Especially hard when dealing with sub graphs
    Examples
      – Pick some random nodes out of the graph to display to people to ask for recommendations
      – Use as part of statistical algorithms to make statements like “People who tend to like … tend to also …”
    Solutions
      – If the size is small enough and the traversal path is known: load into Collections and shuffle
      – If the size is large: custom Relationship Expander
      – If the whole graph is in play: Scattergun, or an indexer with the Reservoir Sampling algorithm
    Lessons learned
      – Random-access type work is not Neo4J’s sweet spot
      – Can get around it with indexes and random(ish) selection algorithms, but this may not be ideal
  • How random does random need to be?
    Load, Shuffle, Pick
      – If hitting a known, small subset of the graph
      – Load all nodes, then Collections.shuffle(..)
    “Hit and Miss”
      – All nodes form part of the “population”; not good when you want subsets of the graph
      – Generate random IDs, deal with cases of misses
    Custom Relationship Expander/Evaluator
      – Randomly discard relationships as you go along
      – Iterables returned by a traverser are generally not random; they give more precedence to nodes earlier on
    Reservoir Sampling
      – Designed for use with Iterables
      – Randomly build up and replace the ultimate subset to return
    Use an index
    Front with a cache if need be
  • How random does random need to be?
    Load, Shuffle, Pick
      – If hitting a known, small subset of the graph
      – Load all nodes, then Collections.shuffle(..)
    “Scattergun”
      – All nodes form part of the “population”
      – Generate random IDs, deal with cases of misses
    Custom Relationship Expander
      – Randomly discard relationships as you go along
    Reservoir Sampling
      – Great for use with Iterables
    Use an index
    Front with a cache if need be
  • Mac OS 10.7, 8GB RAM
      – Left over: +-4.5GB
      – JVM heap max: 1.5GB
      – Neo4J mapped memory settings (2.0GB):
          neostore.nodestore.db.mapped_memory=256M
          neostore.relationshipstore.db.mapped_memory=768M
          neostore.propertystore.db.mapped_memory=512M
          neostore.propertystore.db.strings.mapped_memory=256M
          neostore.propertystore.db.arrays.mapped_memory=256M
    Store sizes post upgrade from 1.4.2 -> 1.5 -> 1.6.2:
          2.2M    neostore.nodestore.db
          395.0M  neostore.propertystore.db
          2.8M    neostore.propertystore.db.strings
          581.0M  neostore.relationshipstore.db
          7.6M    neostore.propertystore.db.arrays
    Store sizes pre upgrade from 1.4.2 -> 1.5 -> 1.6.2:
          2.2M    neostore.nodestore.db
          1000.0M neostore.propertystore.db
          54.0M   neostore.propertystore.db.arrays
          8000.0M neostore.propertystore.db.strings
          581.0M  neostore.relationshipstore.db
  • TODO: Mention disk access
  • Neo4j - Tales from the Trenches
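The Reservoir Sampling option described in the notes ("designed for use with Iterables; randomly build up and replace the ultimate subset to return") can be sketched in plain Java. This is a minimal sketch and not the code shown in the talk: the class name `ReservoirSampler` and the `sample` method are made up for illustration, and the input is assumed to be any `Iterable`, such as the result of a traversal or an index query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: single-pass uniform sampling ("Algorithm R"). */
final class ReservoirSampler {
    private ReservoirSampler() {}

    /**
     * Returns up to k items picked uniformly at random from one pass over source.
     * Assumes the source yields fewer than Integer.MAX_VALUE items.
     */
    static <T> List<T> sample(Iterable<? extends T> source, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be >= 0");
        }
        List<T> reservoir = new ArrayList<T>(k);
        int seen = 0;
        for (T item : source) {
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k items always go in.
                reservoir.add(item);
            } else {
                // Replace phase: the item survives with probability k / seen.
                int slot = random.nextInt(seen);
                if (slot < k) {
                    reservoir.set(slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

Calling `ReservoirSampler.sample(nodes, 25, new Random())` would return 25 items without loading the whole source into memory first, which is the main difference from "Load, Shuffle, Pick". The full source is still iterated once, so cold parts of the graph are still touched.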

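The "Hit and Miss" strategy from the notes (called "Scattergun" in some of them) generates random IDs and deals with the misses. The following is a rough sketch under stated assumptions: `HitAndMissSampler` is a made-up name, a `Map<Long, T>` stands in for whatever lookup-by-ID the real store offers, and `maxAttempts` is an invented safety limit so the loop ends even when most IDs are unused.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

/** Hypothetical helper: guess random IDs and keep the ones that exist. */
final class HitAndMissSampler {
    private HitAndMissSampler() {}

    /**
     * Tries random IDs in [0, maxId] until `wanted` distinct hits are found
     * or `maxAttempts` guesses have been made, whichever comes first.
     */
    static <T> List<T> pick(Map<Long, T> nodesById, int maxId, int wanted,
                            int maxAttempts, Random random) {
        if (maxId < 0 || maxId == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("maxId out of range");
        }
        Set<Long> hits = new LinkedHashSet<Long>();
        for (int attempt = 0; attempt < maxAttempts && hits.size() < wanted; attempt++) {
            long candidate = random.nextInt(maxId + 1);
            if (nodesById.containsKey(candidate)) {
                hits.add(candidate); // a hit; repeated hits are ignored by the set
            }
            // a miss (unused or deleted ID) is simply skipped
        }
        List<T> result = new ArrayList<T>(hits.size());
        for (Long id : hits) {
            result.add(nodesById.get(id));
        }
        return result;
    }
}
```

As the notes point out, every node is part of the "population" here, so this approach does not help when only a subset of the graph should be sampled. The result may also contain fewer than `wanted` items if the attempt limit is reached.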
    1. 1. Neo4J – Tales from the Trenches A RECOMMENDATION ENGINE CASE STUDY Michal Bachman & Nicki Watt @bachmanm & @techiewatt
    2. 2. Who we are … [Graph diagram. Nodes: Nicki Watt (role = “consultant”), Michal Bachman (role = “consultant”), OpenCredo, Opigram, Neo4J. Relationship labels: works for, works on, colleague of, uses, partner of.]
    3. 3. Opigram • http://labs.yougov.co.uk • Opinion Profile • Social Network (TBD) • Recommendation Engine • CMS
    4. 4. Opigram: Recommendations / Interesting Insights [Diagram. Nodes: Panelists, Opinions, Things. Labels: provides, about, (themselves), generates.] Example insights: “People who like … also tend to like …”, “People who like … tend to support …”, “People who like … describe themselves as …”
    5. 5. Opigram • Started Feb 2011 • Nov 2011 – OpenCredo – Many lessons learned • Stats • ~ 150k panelists (a.k.a. users) • ~ 100k “things” (movies, books, …) • ~ 8M relationships
    6. 6. Neo4J • Graph Database • Schema-less (NoSQL) • Vertices and Edges • a.k.a. Nodes and Relationships • Traversals • Version 1.7 just released!
    7. 7. Neo4J [Same graph diagram as slide 2. Nodes: Nicki Watt (role = “consultant”), Michal Bachman (role = “consultant”), OpenCredo, Opigram, Neo4J. Relationship labels: works for, works on, colleague of, uses, partner of.]
    8. 8. Opigram + Neo4J • Taxonomy of “things” • Opinions on “things” • Recommendations • Offline “Crunching”
    9. 9. Opigram + MySQL • CMS Functionality • Crunching Results • Configuration / Metadata
    10. 10. Lessons Learned • Everyone loves Neo4J! Find praise online • “Trenches Talk” - Aiming to provide insight into some real problems encountered and approaches to solutions • We have 5 practical lessons for you – Tips – Tricks – Troubles
    11. 11. Lessons Learned • Lesson 1: Graph “Schema” • Lesson 2: Neo Node IDs • Lesson 3: Graph-wide Operations • Lesson 4: Extracting Randomised Data • Lesson 5: Multi-threading
    12. 12. Lesson 1: Graph “Schema”
    13. 13. Schema-less ≠ [image; credit: Greencolander]
    14. 14. [Schema diagram, first version. Michal -review (text = “…”, descriptors = Cool, Funny)-> Pulp Fiction (type: Movie). Two “described as” relationships, each with votes = 1. Cool, Boring, Funny and Romantic have type Descriptor.]
    15. 15. [Schema diagram, revised version. Michal -created-> Review (text = “…") -review of-> Pulp Fiction (type: Movie). Two “described as” relationships from the Review. Cool, Boring, Funny and Romantic have type Descriptor.]
    16. 16. Lesson 2: Neo4J Node IDs
    17. 17. Neo Node IDs • What are they? • Can I use them to represent my keys? – No! • Why not? – Not stable – IDs are garbage collected over time, thus only guaranteed to be unique during a specific time span
    18. 18. Example: “User Transformation”
    19. 19. [Diagram. MySQL table with columns USER_ID, NEO_NODE_ID, ACTIVE and rows (101, 1, Y), (102, 2, Y), (103, 3, N Y). Graph with nodes numbered 1 to 8: Michal, Nicki and Jim have type Panelist; Cool, Boring, Funny and Romantic have type Descriptor. Caption: Jim is now Cool!]
    20. 20. Alternate ID Strategies • Client provided IDs – Add as a standard property on the node – Add to index (or use auto indexer) • Natural vs. Synthetic IDs • Auto generate your own IDs – Hook into Neo4J Transaction Kernel – Use auto indexer
    21. 21. Auto generate your own IDs: 1) Implement TransactionEventHandler 2) Register TransactionEventHandler with graphDatabaseService 3) Turn auto indexing on for seamless generation
    22. 22. Lesson 2: Conclusion. Don’t use Neo Node IDs as your keys!!! It’s a losing battle; ultimately the force will not be with you! [image credit: http://uk.xbox.gamespy.com]
    23. 23. Lesson 3: Graph-wide Operations
    24. 24. Motivations • Fixes – Bugs – Re-indexing • “Schema” Migrations • Data Export • Data Analysis • Count Caching
    25. 25. Lesson 3: Graph-wide Operations • Batch Updates • Delete relationships only from one side • GlobalGraphOperations since 1.6 • No need for TX when reading
    26. 26. Example: Deleting “soft-deleted” relationships
    27. 27. Lesson 3: Graph-wide Operations • Batch Updates • Delete only from 1 side • GlobalGraphOperations since 1.6 • No need for TX when reading
    31. 31. Example: Computing statistics
    32. 32. Lesson 3: Graph-wide Operations • Batch Updates • Delete only from 1 side • GlobalGraphOperations since 1.6 • No need for TX when reading
    33. 33. Lesson 4: Extracting Randomised Data
    34. 34. Extracting Randomised Data • Use Cases – Provide Random Suggestions to users – Use for statistical analysis aka “Random Sampling” • Problem – No built in Neo4J support – Not Neo4J’s sweet spot – May result in very bad performance
    35. 35. Options • Randomisation Strategies – “Load, Shuffle, Pick” – “Hit and Miss” – Custom Relationship Expander/Evaluator – Reservoir Sampling • Performance Helpers – Indexes – Front with a cache if need be
    36. 36. Custom Relationship Evaluator
    37. 37. Reservoir Sampling Algorithm
    38. 38. Traversals vs. Index: 25 random nodes extracted from [Sample Size] using “Reservoir Sampling” algorithm. [Chart. X-Axis: Sample Size (5000 to 160000). Y-Axis: Time (milliseconds, 0 to 45000). Series: 1.4.2 traversal pass 1 (cold), 1.4.2 traversal pass 2 (warmish), 1.5 traversal pass 1 (cold), 1.5 traversal pass 2 (warmish), 1.5 index, 1.6.2 traversal pass 1 (cold), 1.6.2 traversal pass 2 (warmish).] Use of lucene indexes can reduce time to +- 300 - 1000ms from cold
    39. 39. Conclusion • Most options are not “truly random”, more “randomish” • Primarily has bad performance when hitting cold parts of graph • Caching helps – If an option, serve stale data until next random sample can be selected
    40. 40. Lesson 5: Multi-threading
    41. 41. Lesson 5: Multi-threading • Shortcoming in Neo4J • Fixed in version 1.7 • Avoid relationship properties in multi-threaded pre-1.7 apps
    42. 42. Questions?
    43. 43. Beer Time! • @bachmanm • michal.bachman@opencredo.com • @techiewatt • nicki.watt@opencredo.com
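The “Jim is now Cool!” problem from Lesson 2 can be reproduced with a toy model. This is only an illustration under loud assumptions: `RecyclingStore` is hypothetical and is not Neo4j's actual ID allocator. It simply hands freed IDs out again, mimicking the "recycling of old IDs" the notes attribute to 1.4.2, and the keys such as `user-103` are invented. It also keeps an index from a client-provided key to the node, in the spirit of the "Alternate ID Strategies" slide.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy store that reuses freed internal IDs (an assumption, not Neo4j code). */
final class RecyclingStore {
    private final Map<Long, String> namesById = new HashMap<Long, String>();
    private final Map<String, Long> idsByKey = new HashMap<String, Long>();
    private final Deque<Long> freedIds = new ArrayDeque<Long>();
    private long nextId = 1;

    /** Creates a node, preferring a recycled internal ID over a fresh one. */
    long create(String key, String name) {
        long id = freedIds.isEmpty() ? nextId++ : freedIds.pop();
        namesById.put(id, name);
        idsByKey.put(key, id); // index on the client-provided key
        return id;
    }

    /** Deletes by client key and makes the internal ID available for reuse. */
    void delete(String key) {
        Long id = idsByKey.remove(key);
        if (id != null) {
            namesById.remove(id);
            freedIds.push(id);
        }
    }

    /** Lookup by internal ID: what an external table holding node IDs would do. */
    String getById(long id) {
        return namesById.get(id);
    }

    /** Lookup by client key via the index: stays correct after recycling. */
    String getByKey(String key) {
        Long id = idsByKey.get(key);
        return id == null ? null : namesById.get(id);
    }
}
```

If "Jim" is created as the third node, deleted, and a "Cool" descriptor is created next, `getById` with Jim's old ID returns "Cool", while `getByKey("user-103")` returns `null`. An external system that stored the internal ID would silently point at the wrong node; one that stored its own key would not.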

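The "Batch Updates" tip from Lesson 3 is not accompanied by code in this transcript, so the following is an assumption about what batching a graph-wide operation might look like: commit after every N items instead of holding one huge transaction. `BatchRunner`, `Tx` and `TxFactory` are hypothetical stand-ins for illustration and are not Neo4j types.

```java
import java.util.function.Consumer;

/** Hypothetical helper: applies an operation to many items in small transactions. */
final class BatchRunner {
    private BatchRunner() {}

    /** Stand-in for a transaction; only the commit boundary matters here. */
    interface Tx {
        void commit();
    }

    /** Stand-in for whatever opens a transaction in the real store. */
    interface TxFactory {
        Tx begin();
    }

    /**
     * Applies operation to every item, committing after each batchSize items
     * and once more for a trailing partial batch. Returns the commit count.
     * Error handling and rollback are deliberately left out of this sketch.
     */
    static <T> int forEachInBatches(Iterable<? extends T> items, int batchSize,
                                    TxFactory txFactory, Consumer<? super T> operation) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be > 0");
        }
        int commits = 0;
        int inBatch = 0;
        Tx tx = null;
        for (T item : items) {
            if (tx == null) {
                tx = txFactory.begin(); // open lazily so empty input opens nothing
            }
            operation.accept(item);
            if (++inBatch == batchSize) {
                tx.commit();
                commits++;
                tx = null;
                inBatch = 0;
            }
        }
        if (tx != null) {
            tx.commit(); // flush the last, partially filled batch
            commits++;
        }
        return commits;
    }
}
```

With 10 items and a batch size of 4, this opens and commits three transactions (4 + 4 + 2 items). A suitable batch size depends on available heap and on how much each operation changes, so it is best treated as a tuning parameter.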