A Little Graph Theory for the Busy Developer - Jim Webber @ GraphConnect Chicago 2013

In this talk we'll explore powerful analytic techniques for graph data. Firstly we'll discover some of the innate properties of (social) graphs from fields like anthropology and sociology. By understanding the forces and tensions within the graph structure and applying some graph theory, we'll be able to predict how the graph will evolve over time. To test just how powerful and accurate graph theory is, we'll also be able to (retrospectively) predict World War 1 based on a social graph and a few simple mechanical rules.
Then we'll see how graph matching can be used to extract online business intelligence (for powerful retail recommendations). In turn we'll apply these powerful techniques to modelling domains in Neo4j (a graph database) and show how Neo4j can be used to drive business intelligence.
Don't worry, there won't be much maths :-)

  • Vanity slide - Work for Neo Technology, the commercial backer of the Neo4j open source graph database
  • We’ve all been there – trudging through 3 intermediate tables just to get the data you want.
  • We have key-value stores, typically very highly available and scalable for simple key-value data
  • Column stores: naturally indexed value stores. Contrary to common belief, Google’s Bigtable isn’t the world’s most famous column store. The British Museum in London is: it’s got columns, and it’s where we stored all the stuff we nicked from the British Empire!
  • Aggregates: no more impedance mismatch, just decompose to documents / k-v / columns. The data model is less expressive than an RDBMS, but that’s OK because RDBMS constraints don’t really match application concerns. The expressiveness is in your domain logic, right?
  • People talk about Codd’s relational model being mature because it was proposed in 1969, making it 42 years old. Euler’s graph theory was proposed in 1736, making it 275 years old.
  • Graphs are the most natural way to model most domains. You already know this because you draw graphs on a whiteboard, but you’ve never had the opportunity to take that down into the database before. Nodes are a bit like documents, but they’re flat at present in Neo4j. You pour data into your nodes and then connect them: easy peasy. This enables high-fidelity domain modeling because this is how your domains work. And you don’t have to do this stuff in your application code; it’s right there in the database. Let’s prove it by exploring a fun domain…
  • Start off by sketching the domain. That’s your model done: we see this when we revisit databases months after they’ve been designed and put into production. There is no decomposition, ER design, or normalisation/denormalisation as you need with an RDBMS.
  • Predictive analytics
  • Euler reduced the problem to an abstract form, going from geography through a diagram to a graph. It’s the first documented case of representing the real world as a graph.
  • First we need to talk about some local properties
  • A triadic closure is a local property of (social) graphs whereby if two nodes are connected via a path involving a third node, there is an increased likelihood that the two nodes will become directly connected in future. This is a familiar enough situation in a social setting: if we happen to be friends with two people, there’s an increased chance that those people will become direct friends too, since being our friend in the first place is an indication of social similarity and suitability. It’s called triadic closure because we try to close the triangle.
  • We see this all the time: it’s likely that if we have two friends, they will also become at least acquaintances and potentially friends themselves! In general, if a node A has relationships to B and C, then a relationship between B and C is likely to form, especially if the existing relationships are both strong. This is an incredibly strong assertion and will not typically be upheld by all subgraphs in a graph. Nonetheless it is sufficiently commonplace (particularly in social networks) to be trusted as a predictive aid.
  • Sentiment plays a role in how closures form too – there is a notion of balance.
  • From a triadic closure perspective this is OK, but intuitively it seems odd. Cartman’s friends shouldn’t be friends with his enemies. Nor should Cartman’s enemies be friends with his friends.
  • This makes sense: Cartman’s friend Craig is also an enemy of Cartman’s enemy Tweek. Two negative sentiments and one positive sentiment make a balanced structure, and it makes sense too, since we gang up with our friends on our poor beleaguered enemy.
  • Another balanced – and more pleasant – arrangement is for three positive sentiments, in this case mutual friends.
  • A starting point for a network of friends and enemies. Red links indicate enemy-of relationships. Black links indicate friend-of relationships. (The Three Emperors’ League)
  • Italy forms the Triple Alliance with Austria and Germany, a balanced +++ triadic closure. If Italy had made only a single alliance (or enemy) it would have been unstable, and another relationship would be likely to form anyway!
  • Russia becomes hostile to Austria and Germany, forming a balanced --+ triadic closure, and becomes agnostic towards France. (German-Russian Lapse)
  • The French and Russians ally, forming a balanced --+ triadic closure with the UK. (French-Russian Alliance)
  • The UK and France enter into the famous Entente Cordiale. This produces an unbalanced ++- triadic closure with Russia, and the graph doesn’t like it.
  • The British and Russians form an alliance, thereby changing their previously unbalanced triadic closure into a balanced one. Other local pressures on the graph make other closures form. Italy becomes hostile to Russia, forming a balanced --+ closure with France and another balanced --+ closure with the UK. Germany and the UK become hostile, forming a balanced --+ closure with Austria and another balanced --+ closure with Italy. (British-Russian Alliance)
  • That WWI can be predicted without domain knowledge by iterating a graph and applying local structural constraints is nothing short of astonishing to me. Note how the network slides into a balanced labeling, and into World War I.
  • In this case the strong triadic closure property still holds, though it is a weak link that characterises the relationship between Stan and Cartman. Given a starting graph, we can apply this simple local principle to see how it would evolve.
  • A local bridge acts as a link, perhaps the only realistic link, between two otherwise distant (or separate) subgraphs. Local bridges are semantically rich: they provide conduits for information flow between otherwise independent groups. In this case DATING is a local bridge, and it must also be a weak relationship according to our definition of a local bridge. Intuitively this makes sense: your girl/boyfriend is rather less important at age 8 than your regular friends, IIRC.
  • How do we identify local bridges? Look for any weak link whose removal would cause a component of the graph to become disconnected. Being able to identify local bridges is important: in this case it’s the only known conduit that allows the girls and boys to communicate. In real life, local bridges are apparent in your organisation as experts (or managers), and they appear as a nexus in fraud cases.
  • Zachary, in the Journal of Anthropological Research, 1977. Intuitively we can see “clumps” in this graph. But how do we separate them out? It’s called minimum cut.
  • What’s interesting is that it’s mechanical: no domain knowledge is necessary. There’s only one failure with the method Zachary chose to partition the graph: node 9 was predicted to go with the original president of the club (node 34) but instead went to the instructor’s club. Why? Because the student was three weeks away from completing a four-year quest to obtain a black belt, which he could only do with the instructor (node 1). Other minimum cut approaches might deliver slightly different results, but on the whole it’s amazing you get such insight from an algorithm!
  • But is there enough information in the graph itself to predict the schism?
  • We can use graph matching to look for behavioural patterns in the graph too!
  • The insight here is that we have a typical young father who buys beer, nappies and a game console. Simply by reducing the subgraph, we have a pattern to search for.
  • Now we look for young fathers – implied by beer and nappies purchases – who haven’t bought a game console.
  • START n=node(*), r=rel(*) DELETE r,n
    CREATE daddy1 = { name: 'Mickey Smith', dob: 19781006 }
    CREATE beer = { category: 'beer' }
    CREATE alcohol = { category: 'alcoholicdrinks' }
    CREATE peeweePilsner = { sku: '2555f258', product: 'PeeweePilsner' }
    CREATE badgersNadgers = { sku: '5e175641', product: 'BadgersNadgers Ale' }
    CREATE peeweePilsner-[:MEMBER_OF]->beer
    CREATE badgersNadgers-[:MEMBER_OF]->beer
    CREATE beer-[:MEMBER_OF]->alcohol
    CREATE daddy1-[:BOUGHT]->peeweePilsner
    CREATE daddy1-[:BOUGHT]->badgersNadgers
    CREATE baby = { category: 'baby' }
    CREATE nappies = { category: 'nappies' }
    CREATE nappies-[:MEMBER_OF]->baby
    CREATE babyDryNights = { sku: '49d102bc', product: 'Baby DryNights' }
    CREATE babyDryNights-[:MEMBER_OF]->nappies
    CREATE daddy1-[:BOUGHT]->babyDryNights
    CREATE consumerElectronics = { category: 'consumerelectronics' }
    CREATE console = { category: 'console' }
    CREATE xbox = { sku: '49d102bc', product: 'XBox 360' }
    CREATE xbox-[:MEMBER_OF]->(console)-[:MEMBER_OF]->consumerElectronics
    CREATE daddy1-[:BOUGHT]->xbox
    CREATE mummy1 = { name: 'Rose Tyler', dob: 19800317 }
    CREATE wine = { sku: '3a3f22bc', product: 'Shiraz' }
    CREATE mummy1-[:BOUGHT]->wine
    CREATE mummy1-[:BOUGHT]->babyDryNights
    CREATE daddy2 = { name: 'Rory Williams', dob: 19880121 }
    CREATE daddy2-[:BOUGHT]->peeweePilsner
    CREATE daddy2-[:BOUGHT]->babyDryNights
    START beer=node(2), nappies=node(7), xbox=node(11)
    MATCH (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(beer),
          (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(nappies),
          (daddy)-[b?:BOUGHT]->(xbox)
    WHERE b is null
    RETURN distinct daddy
  • Note that Max De Marzi reimplemented a functionally better Graph Search with Neo4j and some Ruby gems for language processing in a weekend!
  • Image: real-time. Being able to look for all young fathers who might be tempted to buy a new game console is helpful, but not dramatically different from what we have now. It’s much faster to process in a graph, but still a latent business activity (e.g. a mailshot). But you can take the same idea and run the query in real time for an individual customer as they go through the checkout! And you can do more too: you can add more or fewer dimensions to the search. Does the shop have stock? Does the young father live in the right target area (r-tree, postcode)? We can vary the number of dimensions we include to tailor the search for performance/accuracy very easily in a graph, because query latency is proportional to the amount of graph searched, not data set size.
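
The structural balance rule from the notes above (+++ and --+ triangles are balanced, ++- is not) is mechanical enough to sketch in a few lines of Python. This is only an illustration, not code from the talk: the function names are hypothetical, and the end-state graph assumes that every pair inside the Triple Alliance or inside the UK/France/Russia group is friendly and every cross-group pair is hostile.

```python
from itertools import combinations

FRIEND, ENEMY = +1, -1


def is_balanced(signs):
    """A triangle is balanced when the product of its three signs is positive."""
    a, b, c = signs
    return a * b * c > 0


def unbalanced_triangles(nodes, sign):
    """Return every unbalanced triangle in a complete signed graph.

    `sign` maps frozenset({x, y}) to FRIEND or ENEMY for every pair of nodes.
    """
    bad = []
    for x, y, z in combinations(sorted(nodes), 3):
        signs = (
            sign[frozenset((x, y))],
            sign[frozenset((y, z))],
            sign[frozenset((x, z))],
        )
        if not is_balanced(signs):
            bad.append((x, y, z))
    return bad


# Assumed end state: two blocs, friendly inside, hostile across.
alliance = {"Germany", "Austria", "Italy"}
entente = {"UK", "France", "Russia"}
nodes = alliance | entente

sign = {}
for a, b in combinations(sorted(nodes), 2):
    same_side = {a, b} <= alliance or {a, b} <= entente
    sign[frozenset((a, b))] = FRIEND if same_side else ENEMY

# The Entente Cordiale step: UK-France and France-Russia friendly, UK-Russia hostile.
print(is_balanced((FRIEND, FRIEND, ENEMY)))  # False
# The assumed end state contains no unbalanced triangle.
print(unbalanced_triangles(nodes, sign))  # []
```

Using the sign product keeps the check independent of what the nodes represent, which is the sense in which the technique is domain-agnostic.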

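Triadic closure and local bridges, as described in the notes, both come down to comparing the neighbour sets of two nodes. The sketch below is a hedged illustration rather than the talk's own code. The edge list is an assumption (the slides name the characters but not every link), the function names are hypothetical, and `local_bridges` uses the Easley and Kleinberg definition (an edge whose endpoints share no neighbour), which is slightly broader than the disconnection test mentioned in the notes.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def closure_candidates(graph):
    """Pairs that share a neighbour but are not yet connected (open triads)."""
    candidates = set()
    for neighbours in graph.values():
        for a, b in combinations(sorted(neighbours), 2):
            if b not in graph[a]:
                candidates.add((a, b))
    return sorted(candidates)


def local_bridges(graph):
    """Edges whose two endpoints have no neighbour in common."""
    bridges = set()
    for a, neighbours in graph.items():
        for b in neighbours:
            if a < b and not (graph[a] & graph[b]):
                bridges.add((a, b))
    return sorted(bridges)


# Assumed layout: two friend triangles joined by a single DATING link.
edges = [
    ("Kenny", "Stan"), ("Kenny", "Kyle"), ("Stan", "Kyle"),
    ("Bebe", "Sally"), ("Bebe", "Wendy"), ("Sally", "Wendy"),
    ("Stan", "Wendy"),  # the DATING relationship
]
graph = build_graph(edges)

print(local_bridges(graph))  # [('Stan', 'Wendy')]
print(closure_candidates(graph))
# [('Bebe', 'Stan'), ('Kenny', 'Wendy'), ('Kyle', 'Wendy'), ('Sally', 'Stan')]
```

The closure candidates are exactly the pairs that the DATING link brings within two hops of each other, which is where triadic closure predicts new relationships are most likely to form.
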
    1. A Little Graph Theory for the Busy Developer. Jim Webber, Chief Scientist, Neo Technology, @jimwebber
    2. Roadmap
         • Imprisoned data
         • Graph models
         • Graph theory
             – Local properties, global behaviors
             – Predictive techniques
         • Graph matching
             – Real-time analytics for fun and profit
         • Fin
    3. http://www.flickr.com/photos/crazyneighborlady/355232758/
    4. http://gallery.nen.gov.uk/image82582-.html
    5. Aggregate-Oriented Data. http://martinfowler.com/bliki/AggregateOrientedDatabase.html “There is a significant downside - the whole approach works really well when data access is aligned with the aggregates, but what if you want to look at the data in a different way? Order entry naturally stores orders as aggregates, but analyzing product sales cuts across the aggregate structure. The advantage of not using an aggregate structure in the database is that it allows you to slice and dice your data different ways for different audiences. This is why aggregate-oriented stores talk so much about map-reduce.”
    6. complexity = f(size, connectedness, uniformity)
    7. http://www.bbc.co.uk/london/travel/downloads/tube_map.html
    8. Property graphs
         • Property graph model:
             – Nodes with properties
             – Named, directed relationships with properties
             – Relationships have exactly one start and end node
                 • Which may be the same node
    9. [Diagram: example property graph. Relationship labels: stole from, loves, enemy, companion, appeared in. Episode nodes: A Good Man Goes to War, Victory of the Daleks]
    10. Property graphs are very whiteboard-friendly
    11. http://blogs.adobe.com/digitalmarketing/analytics/predictive-analytics/predictive-analytics-and-the-digital-marketer/
    12. http://en.wikipedia.org/wiki/File:Leonhard_Euler_2.jpg Meet Leonhard Euler
         • Swiss mathematician
         • Inventor of Graph Theory (1736)
    13. Königsberg (Prussia) - 1736
    14. [Diagram: regions labelled A, B, D, C]
    15. [Diagram: regions labelled A, B, D, C with links numbered 1 to 7]
    16. http://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg
    17. Triadic Closure [Diagram: nodes name: Kyle, name: Stan, name: Kenny]
    18. Triadic Closure [Diagram: nodes name: Kyle, name: Stan, name: Kenny, shown twice; the second copy adds a FRIEND relationship]
    19. Structural Balance [Diagram: nodes name: Cartman, name: Craig, name: Tweek]
    20. Structural Balance [Diagram: nodes name: Cartman, name: Craig, name: Tweek, shown twice; the second copy adds a FRIEND relationship]
    21. Structural Balance [Diagram: nodes name: Cartman, name: Craig, name: Tweek, shown twice; the second copy adds an ENEMY relationship]
    22. Structural Balance [Diagram: nodes name: Kyle, name: Stan, name: Kenny, shown twice; the second copy adds a FRIEND relationship]
    23. Structural Balance is a key predictive technique. And it’s domain-agnostic.
    24. Allies and Enemies [Diagram: UK, Germany, France, Russia, Italy, Austria]
    25. Allies and Enemies [Diagram: UK, Germany, France, Russia, Italy, Austria]
    26. Allies and Enemies [Diagram: UK, Germany, France, Russia, Italy, Austria]
    27. Allies and Enemies [Diagram: UK, Germany, France, Russia, Italy, Austria]
    28. Allies and Enemies [Diagram: UK, Germany, France, Russia, Italy, Austria]
    29. Allies and Enemies [Diagram: UK, Germany, France, Russia, Italy, Austria]
    30. Predicting WWI [Easley and Kleinberg]
    31. Strong Triadic Closure: If a node has strong relationships to two neighbours, then these neighbours must have at least a weak relationship between them. [Wikipedia]
    32. Triadic Closure (weak relationship) [Diagram: nodes name: Kenny, name: Stan, name: Cartman]
    33. Triadic Closure (weak relationship) [Diagram: nodes name: Kenny, name: Stan, name: Cartman, shown twice; the second copy adds a FRIEND 50% relationship]
    34. Weak relationships
         • Relationships can have “strength” as well as intent
             – Think: weighting on a relationship in a property graph
         • Weak links play another super-important structural role in graph theory
             – They bridge neighbourhoods
    35. Local Bridges [Diagram: nodes name: Kenny, name: Stan, name: Kyle, name: Sally, name: Bebe, name: Wendy, name: Cartman; relationship labels: FRIEND, FRIEND 50%, ENEMY]
    36. Local Bridge Property: “If a node A in a network satisfies the Strong Triadic Closure Property and is involved in at least two strong relationships, then any local bridge it is involved in must be a weak relationship.” [Easley and Kleinberg]
    37. University Karate Club
    38. Graph Partitioning
         • (NP) Hard problem
             – Recursively remove the spanning links between dense regions
             – Or recursively merge nodes into ever larger “subgraph” nodes
             – Choose your algorithm carefully: some are better than others for a given domain
         • Can use to (almost exactly) predict the break up of the karate club!
    39. University Karate Clubs (predicted by Graph Theory) [Diagram: node 9 is labelled]
    40. University Karate Clubs (what actually happened!)
    41. Cypher
         • Declarative graph pattern matching language
             – “SQL for graphs”
             – Columnar results
         • Supports graph matching commands and queries
             – Find me stuff like this…
             – Aggregation, ordering and limit, etc.
    42. [Diagram: purchase graph. Person: Firstname: Mickey, Surname: Smith, DoB: 19781006. Products: SKU: 5e175641, Product: BadgersNadgers Ale; SKU: 2555f258, Product: Peewee Pilsner; SKU: 49d102bc, Product: Baby Dry Nights; SKU: 49d102bc, Product: XBox 360. Categories: beer, alcoholic drinks, nappies, baby, console, consumer electronics. Relationship labels: BOUGHT, MEMBER_OF]
    43. [Diagram: search pattern. Person: Firstname: *, Surname: *, DoB: 1996 > x > 1972. Categories: beer, nappies, game console. Relationship label: BOUGHT]
    44. [Diagram: search pattern. Person: Firstname: *, Surname: *, DoB: 1996 > x > 1972. Categories: beer, nappies, game console. Relationship labels: BOUGHT, !BOUGHT]
    45. (beer) (nappies) (console) (daddy) () () ()
    46. Flatten the graph
         (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(nappies)
         (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(beer)
         (daddy)-[b:BOUGHT]->()-[:MEMBER_OF]->(console)
    47. Wrap in a Cypher MATCH clause
         MATCH (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(nappies),
               (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(beer),
               (daddy)-[b:BOUGHT]->()-[:MEMBER_OF]->(console)
    48. Cypher WHERE clause
         MATCH (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(nappies),
               (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(beer),
               (daddy)-[b:BOUGHT]->()-[:MEMBER_OF]->(console)
         WHERE b is null
    49. Full Cypher query
         START beer=node:categories(category='beer'),
               nappies=node:categories(category='nappies'),
               xbox=node:products(product='xbox 360')
         MATCH (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(beer),
               (daddy)-[:BOUGHT]->()-[:MEMBER_OF]->(nappies),
               (daddy)-[b?:BOUGHT]->(xbox)
         WHERE b is null
         RETURN distinct daddy
    50. Results
         ==> +---------------------------------------------+
         ==> | daddy                                       |
         ==> +---------------------------------------------+
         ==> | Node[15]{name:"Rory Williams",dob:19880121} |
         ==> +---------------------------------------------+
         ==> 1 row
         ==> 0 ms
         ==>
         neo4j-sh (0)$
    51. Facebook Graph Search: Which sushi restaurants in NYC do my friends like?
    52. Graph Structure
    53. Cypher query
         START me=node:person(name = "Jim"),
               location=node:location(location = "New York"),
               cuisine=node:cuisine(cuisine = "Sushi")
         MATCH (me)-[:IS_FRIEND_OF]->(friend)-[:LIKES]->(restaurant)-[:LOCATED_IN]->(location),
               (restaurant)-[:SERVES]->(cuisine)
         RETURN restaurant
    54. Search structure
    55. What are graphs good for?
         • Recommendations
         • Pharmacology
         • Business intelligence
         • Social computing
         • Geospatial
         • MDM
         • Data center management
         • Web of things
         • Genealogy
         • Time series data
         • Product catalogue
         • Web analytics
         • Scientific computing
         • Indexing your slow RDBMS
         • And much more!
    56. Free O’Reilly book for everyone! http://graphdatabases.com
    57. Thanks for listening. Neo4j: http://neo4j.org Me: @jimwebber
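
The "beer and nappies but no console" pattern from slides 43 to 50 can also be expressed without a database, which may help when reading the Cypher. The sketch below is an assumption-laden illustration in plain Python, not how Neo4j evaluates the query: the dictionaries mirror the setup script in the speaker notes, the helper names are hypothetical, and the console test is by category (as in the slide 44 pattern) rather than against the single XBox node used in the full query.

```python
# Purchases and product categories, copied from the setup script in the notes.
bought = {
    "Mickey Smith": {"PeeweePilsner", "BadgersNadgers Ale", "Baby DryNights", "XBox 360"},
    "Rose Tyler": {"Shiraz", "Baby DryNights"},
    "Rory Williams": {"PeeweePilsner", "Baby DryNights"},
}
member_of = {
    "PeeweePilsner": "beer",
    "BadgersNadgers Ale": "beer",
    "Baby DryNights": "nappies",
    "XBox 360": "console",
}


def bought_in_category(person, category):
    """True if the person bought at least one product in the given category."""
    return any(member_of.get(product) == category for product in bought[person])


def console_prospects():
    """People who bought beer and nappies but no console."""
    return sorted(
        person
        for person in bought
        if bought_in_category(person, "beer")
        and bought_in_category(person, "nappies")
        and not bought_in_category(person, "console")
    )


print(console_prospects())  # ['Rory Williams']
```

The result agrees with the single row shown on slide 50. Unlike this sketch, which scans every customer, a graph query can start from the anchored category nodes and only touch the customers connected to them.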
