Everything Goes Better With Bacon: Revisiting the Six Degrees Problem with a Graph Database

2,262 views
1,999 views

Published on

The six degrees problem is a classic party game and typical use-case for the power and efficiency of graph databases. But even with a powerful graph database, a complex ecosystem of data like IMDB (Internet Movie Database) can return a dizzying amount of data within six degrees of separation from the source. With this amount of data, how do you draw business value from large sets of highly connected data? In this session, we will discuss some powerful strategies for using a distributed graph database, to perform analysis to derive business value from highly connected, complex data sets using navigational queries and visualization.

Published in: Technology

Everything Goes Better With Bacon: Revisiting the Six Degrees Problem with a Graph Database

  1. 1. www.Objectivity.comNick QuinnLead Developer - InfiniteGraph
  2. 2. What are we talking about today?Not that Bacon This Bacon!• Intro to the Six Degrees Problem• What is a Graph Database?• Why Bacon in Graph Database?• How we solved the problemImages Courtesy of IMDB (www.imdb.com)
  3. 3. Six Degrees of Bacon“…any individual involved in the Hollywood, California film industrycan be linked through his or her film roles to actor Kevin Baconwithin six steps”[http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon]Gina MenzaImages Courtesy of IMDB (www.imdb.com)A Tale of Two Kevins
  4. 4. Why Six Degrees of Bacon?Actor Age # of ProjectsKevin Bacon 54 76Harrison Ford 70 70Tom Cruise 50 40Julia Roberts 45 50Tom Hanks 56 73Denzel Washington 58 53Michael Caine 80 157Kiefer Sutherland 46 82Kevin BaconImages Courtesy of IMDB (www.imdb.com)
  5. 5. Bacon Numbers in GoogleIn the summer of 2012, Google started to allow users to find thebacon number of any actor simply by following his or her namewith “bacon number”.MorganFreemanThe DarkNightRisesappeared inGaryOldmanappeared inKevinBaconCriminalLawappeared inappeared inwww.google.com Graphical Representation
  6. 6. What is a Graph Database ?
  7. 7. The Physical Data Model• Difference between relational & graph databasesMeetingsP1 Place TimeP2Alice Denver 5-27-10BobCallsFrom Time DurationToBob 13:20 25CarlosBob 17:10 15CharliePaymentsFrom Date AmountToCarlos 5-12-10 100000CharlieMet5-27-10AliceCalled13:20BobPaid100000CarlosCharlieCalled17:10Rows/Columns/Tables Relationship/Graph Optimized
  8. 8. Connecting DataPerson Building?Work Live RR Visit Eat Shop
  9. 9. Who is Gina Menza?• How do we get meaning from highlyconnected data?Gina MenzaJury Forewoman Miss JeffriesImages Courtesy of IMDB (www.imdb.com)
  10. 10. Strength of Connections Matter!• Why 6 degrees of separation and not 3.74?• We need analysis tools in order to– identify and filter out “unimportant” data and– infer what needs to be filtered as we investigate it.“When considering anotherperson in the world, a friendof your friend knows afriend of their friend”- facebook
  11. 11. Why Bacon in a Graph Database ?
  12. 12. Graph Analysis• Why use Graph Databases for graph analysis?– Dynamic on Live Data– Feedback/Inference– Optimized for concurrent user access– Handles big data problems– Native Graph Traversal API– Manage memory efficiently
  13. 13. Paths to BaconBacon Number(Degree of Separation / 2)# of People1 28232 3236773 10885604 2729055 225336 2300Using the IMDB (www.imdb.com) data set, we can study how many pathscan be found by degrees of separation from Kevin Bacon. Out of 5,067,124nodes and 11,505,797 edges, we get the following:0200000400000600000800000100000012000001 2 3 4 5 6# ofPeople
  14. 14. Big Data + Graph = Big Graph Data4 Degrees of Kevin Bacon(Breadth First up to 20K connections)Images generated using the IG Visualizer
  15. 15. Analyzing Bacon• To be able to perform meaningful analysis,these are things that you will need:– Ingest IMDB Dataset – About 50 Formattedcompressed files (Largest > 200 MB)– Custom algorithm support to perform meaningfulanalysis• Optimize queries to get results back in reasonable time– Visualization tool to test and view the results ofthe navigation (optional)
  16. 16. How IG Sizzles Your BaconIngestUpdateNavigateMassive graph data require efficient and intelligent toolsto analyze and understand it.
  17. 17. Super Simple Java APIActor bacon = new Actor(“Kevin Bacon”);imdbGraphDB.addVertex( bacon );Movie apollo= new Movie(“Apollo 13”, 1995);imdbGraphDB.addVertex( apollo );ActedIn bacon2apollo = new ActedIn(“Jack Swigert”);imdbGraphDB.addEdge(bacon2apollo, bacon, apollo,EdgeKind.BIDIRECTIONAL, 1 /**weight**/);Ingest
  18. 18. Scaling Writes• Big/Fast data demands write performance• Most NoSQL solutions allow you to scalewrites by…– Partitioning the data– Understanding your consistency requirements– Allowing you to defer conflictsIngest
  19. 19. App-2(Ingest V2)App-2(E23{ V2V3})Scaling Graph WritesACID TransactionsInfiniteGraphObjectivity/DB Persistence LayerApp-1(Ingest V1)App-3(Ingest V3)V1 V2 V3App-1(E1 2{ V1V2})App-3E12 E23Ingest
  20. 20. High Performance Edge IngestIG Core/APIC1C2C3E12E23TargetContainersPipelineContainersE(1->2)E(3->1)E(2->3)E(2->1)E(2->3)E(3->1)E(1->2)E(3->2)E(1->2)E(2->3)E(3->1)E(2->1)E(2->3)E(3->1)E(3->2)E(1->2)PipelineAgentIngest
  21. 21. Trade offs• Excellent for efficient use of page cache• Able to maintain full database consistency• Achieves highest ingest rate in distributedenvironments• Almost always has highest “perceived” rate• Trading Off :• Eventual consistency in graph (connections)• Updates are still atomic, isolated and durable but phased• External agent performs graph buildingIngest
  22. 22. Result…1 client2 clients4 clients8 clients050000100000150000200000250000300000350000400000450000500000124NodesandEdgespersecond1 client2 clients4 clients8 clientsIngest
  23. 23. Scaling Reads and QueryDistributed APIApplication(s)Partition 1 Partition 3Partition 2 Partition ...nProcessor Processor Processor ProcessorPartitioning and Read Replicas… easy right !
  24. 24. Why are Graphs Different ?Distributed APIApplication(s)Partition 1 Partition 3Partition 2 Partition ...nProcessor Processor Processor ProcessorNavigate
  25. 25. Distributed Navigation• Detect local hops and perform in memorytraversal• Send the partial path to the distributedprocessing to continue the navigation.• Intelligently cache remote data when accessedfrequently• Route tasks to other hosts when it is optimalNavigate
  26. 26. Distributed Navigation ServerProcessorDistributed APIPartition 1 Partition 2ProcessorApplicationAXYBCDEP(A,B,C,D)FGNavigate
  27. 27. GraphViewsLeveraging Schema in the GraphPatient PrescriptionDrugIngredientOutcomeComplaintVisitAllergyPhysicianNavigate
  28. 28. Schema Enables Views• GraphViews are extremely powerful• Allow Big Data to appear small !• Connection inference can lead to exponentialgains in query performance• Views are reusable between queries• Views can be persisted• Built into the native kernelNavigate
  29. 29. Problem of SupernodesIn Graph Theory, a “supernode” is a vertex with adisproportionally high number of connected edges.Supernodes make it difficult to do a navigational query inreal-time due to the amount of effort it may be to pursuepaths through it that may be unfruitful.NavigateImages generated using the IG Visualizer
  30. 30. Supernodes in BaconNavigateIn the IMDB data set, some examples of supernodes may be talkshows, awards shows, compilations or variety shows.Images generated using the IG Visualizer
  31. 31. How to avoid supernodes1. Setting policies on the navigation like theNoRevisitPolicy , MaximumResultCountPolicy andMaximumPathDepthPolicy can be used to customize theoverall behavior of the navigation.PolicyChain policies = new PolicyChain();// Only traverse the same vertex oncepolicies.addPolicy(new NoRevisitPolicy());// limits the number of paths that will be returned to 10Kpolicies.addPolicy(new MaximumResultCountPolicy(10000));// limits the path depth to 6policies.addPolicy(new MaximumPathDepthPolicy(6));Navigate
  32. 32. How to avoid supernodes2. Graph View to exclude or limit typesGraphView view = new GraphView();//Excludes all instances of TvShow from navigationview.excludeClass(myDb.getTypeId(TvShow.class.getName()));//Excludes all movies made for TV/Videoview.excludeClass(myDb.getTypeId(Movie.class.getName()),“details.madeForTv || details.madeForVideo”);//Include ActedIn w/ characterName not containing “Himself”view.excludeClass(myDb.getTypeId(WorkedOn.class.getName()));view.includeClass(myDb.getTypeId(ActedIn.class.getName()),“!CONTAINS(characterName, “Himself”)”);NavigateKevin BaconActorTheFollowingTV ShowBehind theScenesMovieApollo 13MovieHimselfRyan HardyJack Swigert
  33. 33. How to avoid supernodes3. Using these policies and graph view, we canfilter the size of the result set in our navigation:Navigator navigator = bacon.navigate(view,Guide.SIMPLE_BREADTH_FIRST, Qualifier.ANY,new VertexPredicate(Person.class, ""),policies, myResultHandler);navigator.start();Navigate
  34. 34. Filtered Views in BaconThe results of this navigation would look something like this…NavigateImages generated using the IG Visualizer
  35. 35. Why InfiniteGraph™?• Objectivity/DB is a proven foundation– Building distributed databases since 1993– A complete database management system• Concurrency, transactions, cache, schema, query, indexing• It’s a Graph Specialist !– Simple but powerful API tailored for data navigation.– Easy to configure distribution model
  36. 36. Advanced Configured Placement• Physically co-locate “closely related” data• Driven through a declarative placement model• Dramatically speeds “local” readsFacility Data Page(s)Patient Data Page(s)MrCitizenVisit VisitDrJonesSanJoseFacilityDrSmithPrimaryPhysicianHasHas WithAtLocated LocatedFacility Data Page(s)DrBlakeSunny-valeDrQuinnLocated LocatedWithAt
  37. 37. Fully Distributed Data ModelZone 2Zone 1HostAIG Core/APIDistributed Object and Relationship Persistence LayerCustomizable PlacementHostB HostC HostXAddVertex()
  38. 38. Polyglot NoSQL ArchitecturesDistributed DataProcessingPlatform DocumentGraphDatabaseRDBMSPartitioned Distributed DB (often Document / KV)UsersApplicationsExternal/LegacyDataTransformationMDMBusiness
  39. 39. What else!• Distributed update.Update… we are working on it.
  40. 40. ConclusionI hope that you enjoyed the bacon.My apologies to my kosher friends for any offense.Look out for new features coming soon!
  41. 41. QUESTIONS?

×