# Revisiting the Six Degrees Problem with a Graph Database - Nick Quinn

The six degrees problem is a classic party game and a typical use case for the power and efficiency of graph databases. But even with a powerful graph database, a complex ecosystem of data like IMDB (Internet Movie Database) can return a dizzying amount of data within six degrees of separation from the source. With this amount of data, how do you draw business value from large sets of highly connected data? In this session, we will discuss some powerful strategies for using a distributed graph database to analyze highly connected, complex data sets and derive business value from them, using navigational queries and visualization.

Published in: Technology

• Kevin Norwood Bacon, born 1958 in Pennsylvania.
• The “Six Degrees of Kevin Bacon” game was created in early 1994 by three Albright College students after Kevin Bacon commented in Premiere magazine, while promoting “The River Wild”, that he had worked with everyone in Hollywood. Initially he disliked the game because he thought he was being ridiculed, but he has since used it in commercials and, because of its popularity, started his own social charity website, sixdegrees.org.
• First project: Animal House (1978), as “Chip Diller”. According to IMDB, he has acted in 76 projects (movies and TV shows), more than is typical at his age because of his willingness to take small parts…
• Bacon Number: How many movies (degrees) separate any Hollywood film actor from Kevin Bacon?
• A graph database is not just a graph-style interface on top of a relational database. It stores the data in a way that is optimized for graph use cases. For any element, you can find the adjacent objects without looking them up in an index, because the element holds direct pointers to them. A graph database also offers some kind of traversal query API, and traversing relationships is fast because there are no lookups. Explaining the slide: as the number of people goes up, the number of meetings, calls and payments goes up and these tables grow. If you are interested in graph-type questions, such as how many calls Bob made or how many times Bob paid Charlie, it is much easier to answer them when the data is stored in a form suited to those questions.
• So what is a graph database? It is a database that represents and stores data in its connected form, as a graph structure. So instead of rows and columns, you have vertex and edge objects. It is a data structure for representing complicated relationships between entities. Some obvious use cases lie in health care, bioinformatics, social networks, network management, crime detection, fraud analysis, etc.
• Gina Menza had a non-speaking part in both Sleepers and Outbreak. She was in “Wag the Dog”, but never really had any further parts. She is not a strong connector between the Kevins. In November of 2011, Facebook and the University of Milan released an announcement that “there are on average 3.74 degrees of separation between any one Facebook user and another within the US”. Does this kill the six degrees problem? No, because we are not asking for any “Gina Menza” connection; we are looking for “Gary Oldman” connections. In other words, we are looking for meaningful connections between people. For example, the Facebook pronouncement breaks down if you are not satisfied with two people who both liked “Pepsi” and instead want two people connected through people they are tagged in photos with. The strength of the connections matters!
• Alternate ways to handle graph analysis. Batch processing: MapReduce with Hadoop with some layer on top (Hama, GraphLab, Faunus); used for global analytics, not real-time. All in memory: Jung, NetworkX, iGraph; rich ecosystem of visualization and algorithmic packages. The data set size would be limited by the amount of memory that can be used. If ~ millions of edges, the amount of memory is usually too small.
• Kevin Bacon: 4 degrees, up to 20,000 connections. Edge line color = black (which is why it seems blacked out). Green background = movies. Pink background = TV shows. Purple background = actors/actresses. White background = distribution companies.
• Some data is only actionable momentarily: intelligence, IT security, site/page visits, financial/trading behavior. This presents a different type of challenge: the latency of batch data processing becomes problematic.
• 3 nodes, 2 edges in the middle. If both edges come in at the same time, one writer has to wait for the lock on the vertex in the middle. If this type of ingest is happening at a much larger scale, you will constantly be waiting for locks on shared vertex objects.
• This is our answer to how to scale the ingest. Updates to the vertex objects are staged in target containers and then sorted and moved to pipeline containers where they are picked up and processed by pipeline agents which apply the updates asynchronously.
• This is an eventually consistent solution. We still support fully ACID transactions, but you have the choice. As you are doing the navigation, you may not necessarily see the update right away. You can set the consistency preference per transaction, so some can be fully ACID and some eventually consistent.
• Ingest rates on arrays of machines with pretty modest hardware on racks
• In a lot of cases, especially with KV stores and even RDBMSs, it is relatively simple to scale the reads, especially if you have partitioned the data.
• In a graph, it looks like more of a mess. It is not as easy to do this where the data is connected and we are performing a navigational query.
• Pregel-like: wraps up a message and sends it over to the next host. Distributed cache: pieces of the graph stored in memory; caches remote data that is accessed frequently. InfiniteGraph makes smart decisions about when to send a message vs. bringing the remote data into the in-memory cache.
• InfiniteGraph supports schema, but does not restrict connection types between vertices. We take advantage of schema when applying filters to the graph using graph views. Working on hybrid schema support.
• 1: All movies and shows that are associated with Kevin Bacon. 2: All projects with the word “Show” are highlighted. 3: All projects with the word “Award” are highlighted. 4: Just shows the actors associated with the Jay Leno show, which is clearly a supernode.
• Also filtered on billing position on the edge types to only get the top 5 actors/actresses for each project. Path to Kevin Spacey through “Tremors”, Fred Ward and “Henry and June”. Path to Morgan Freeman through “White Water Summer”, Sean Astin, and “The Long Way Home”.
• Mixing multiple databases together and using the strengths of each one. Use case: a web-scale application on the frontend with a relevant database to store user profiles (read and write); Hadoop/MapReduce on the backend to crunch user data and target relevant advertising to them; store the output of that processing in a form that gives good query performance. Use the graph database when the connected data can be modeled in such a way that it makes sense to retrieve it using graph traversals.
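
The Bacon number described in the notes above can be sketched as a breadth-first search over a graph in which actors and projects are both vertices. This is a minimal in-memory illustration in plain Java, not the InfiniteGraph API: the class name `BaconGraph` and its methods are assumptions made for this example, and actor and project names are assumed not to collide. Each vertex keeps a direct list of its neighbors, so a hop needs no index lookup across tables, and the hop count is halved because every actor-to-actor step passes through one project.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy in-memory actor/project graph (hypothetical; not the InfiniteGraph API). */
class BaconGraph {
    // Each vertex (actor or project name) maps straight to its neighbors.
    private final Map<String, List<String>> adjacency = new HashMap<>();

    /** Adds an undirected "appeared in" edge between an actor and a project. */
    void addAppearance(String actor, String project) {
        adjacency.computeIfAbsent(actor, k -> new ArrayList<>()).add(project);
        adjacency.computeIfAbsent(project, k -> new ArrayList<>()).add(actor);
    }

    /** Returns the Bacon number (hops / 2) of target, or -1 if no path exists. */
    int baconNumber(String source, String target) {
        if (!adjacency.containsKey(source) || !adjacency.containsKey(target)) {
            return -1;
        }
        Map<String, Integer> hops = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        hops.put(source, 0);
        frontier.add(source);
        while (!frontier.isEmpty()) {
            String current = frontier.remove();
            if (current.equals(target)) {
                // Actor -> project -> actor is two hops but one degree.
                return hops.get(current) / 2;
            }
            for (String next : adjacency.get(current)) {
                if (!hops.containsKey(next)) {
                    hops.put(next, hops.get(current) + 1);
                    frontier.add(next);
                }
            }
        }
        return -1;
    }
}
```

Because breadth-first search visits vertices in order of distance, the first time the target is dequeued its hop count is the shortest one. Note that this counts any connection equally; it does not capture the "strength of connections" concern raised in the notes.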
### Revisiting the Six Degrees Problem with a Graph Database - Nick Quinn

1. Nick Quinn, Lead Developer - InfiniteGraph (www.Objectivity.com)
2. What are we talking about today? Not that Bacon. This Bacon!
   • Intro to the Six Degrees Problem
   • What is a Graph Database?
   • Why Bacon in Graph Database?
   • How we solved the problem

   Images Courtesy of IMDB (www.imdb.com)
3. Six Degrees of Bacon: “…any individual involved in the Hollywood, California film industry can be linked through his or her film roles to actor Kevin Bacon within six steps” [http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon]. Picture labels: Gina Menza; A Tale of Two Kevins.
4. Why Six Degrees of Bacon?

   | Actor | Age | # of Projects |
   | --- | --- | --- |
   | Kevin Bacon | 54 | 76 |
   | Harrison Ford | 70 | 70 |
   | Tom Cruise | 50 | 40 |
   | Julia Roberts | 45 | 50 |
   | Tom Hanks | 56 | 73 |
   | Denzel Washington | 58 | 53 |
   | Michael Caine | 80 | 157 |
   | Kiefer Sutherland | 46 | 82 |

5. Bacon Numbers in Google: In the summer of 2012, Google started to allow users to find the Bacon number of any actor simply by following his or her name with “bacon number”. Graphical representation (www.google.com): Morgan Freeman appeared in The Dark Knight Rises; Gary Oldman appeared in The Dark Knight Rises and in Criminal Law; Kevin Bacon appeared in Criminal Law.
6. What is a Graph Database?
7. The Physical Data Model
   • Difference between relational & graph databases
   • Rows/Columns/Tables:
     - Meetings (P1, P2, Place, Time): Alice, Bob, Denver, 5-27-10
     - Calls (From, To, Time, Duration): Bob, Carlos, 13:20, 25; Bob, Charlie, 17:10, 15
     - Payments (From, To, Date, Amount): Carlos, Charlie, 5-12-10, 100000
   • Relationship/Graph Optimized: Alice Met (5-27-10) Bob; Bob Called (13:20) Carlos; Bob Called (17:10) Charlie; Carlos Paid (100000) Charlie
8. Connecting Data: Person ? Building. Candidate relationships: Work, Live, RR, Visit, Eat, Shop
9. Who is Gina Menza?
   • How do we get meaning from highly connected data?

   Picture labels: Gina Menza; Jury Forewoman; Miss Jeffries
10. Strength of Connections Matters!
    • Why 6 degrees of separation and not 3.74?
    • We need analysis tools in order to
      - identify and filter out “unimportant” data and
      - infer what needs to be filtered as we investigate it.

    “When considering another person in the world, a friend of your friend knows a friend of their friend” (Facebook)
11. Why Bacon in a Graph Database?
12. Graph Analysis
    • Why use Graph Databases for graph analysis?
      - Dynamic on Live Data
      - Feedback/Inference
      - Optimized for concurrent user access
      - Handles big data problems
      - Native Graph Traversal API
      - Manage memory efficiently
13. Paths to Bacon: Using the IMDB (www.imdb.com) data set, we can study how many paths can be found by degrees of separation from Kevin Bacon. Out of 5,067,124 nodes and 11,505,797 edges, we get the following:

    | Bacon Number (Degree of Separation / 2) | # of People |
    | --- | --- |
    | 1 | 2823 |
    | 2 | 323677 |
    | 3 | 1088560 |
    | 4 | 272905 |
    | 5 | 22533 |
    | 6 | 2300 |

    [Bar chart: # of People by Bacon number, 1 through 6]
14. Big Data + Graph = Big Graph Data: 4 Degrees of Kevin Bacon (Breadth First up to 20K connections). Images generated using the IG Visualizer
15. Analyzing Bacon
    • To be able to perform meaningful analysis, these are things that you will need:
      - Ingest IMDB Dataset: about 50 formatted compressed files (largest > 200 MB)
      - Custom algorithm support to perform meaningful analysis
        • Optimize queries to get results back in reasonable time
      - Visualization tool to test and view the results of the navigation (optional)
16. How IG Sizzles Your Bacon: Ingest, Update, Navigate. Massive graph data require efficient and intelligent tools to analyze and understand it.
17. Super Simple Java API (Ingest)

        // Create the actor and movie vertices
        Actor bacon = new Actor("Kevin Bacon");
        imdbGraphDB.addVertex(bacon);
        Movie apollo = new Movie("Apollo 13", 1995);
        imdbGraphDB.addVertex(apollo);
        // Connect them with an ActedIn edge carrying the character name
        ActedIn bacon2apollo = new ActedIn("Jack Swigert");
        imdbGraphDB.addEdge(bacon2apollo, bacon, apollo, EdgeKind.BIDIRECTIONAL, 1 /* weight */);
18. Scaling Writes (Ingest)
    • Big/Fast data demands write performance
    • Most NoSQL solutions allow you to scale writes by…
      - Partitioning the data
      - Understanding your consistency requirements
      - Allowing you to defer conflicts
19. Scaling Graph Writes (Ingest). [Diagram: App-1, App-2 and App-3 ingest vertices V1, V2 and V3 in ACID transactions through InfiniteGraph and the Objectivity/DB persistence layer; App-1 adds edge E12 {V1, V2} and App-2 adds edge E23 {V2, V3}.]
20. High Performance Edge Ingest (Ingest). [Diagram: the IG Core/API writes edges such as E(1->2), E(3->1), E(2->3), E(2->1) and E(3->2) for containers C1, C2 and C3 into target containers; the edges are then sorted into pipeline containers and applied by a pipeline agent.]
21. Trade offs (Ingest)
    • Excellent for efficient use of page cache
    • Able to maintain full database consistency
    • Achieves highest ingest rate in distributed environments
    • Almost always has highest “perceived” rate
    • Trading off:
      - Eventual consistency in graph (connections)
      - Updates are still atomic, isolated and durable but phased
      - External agent performs graph building
22. Result… (Ingest). [Chart: nodes and edges per second, axis from 0 to 500,000, at x-axis values 1, 2 and 4, with series for 1, 2, 4 and 8 clients.]
23. Scaling Reads and Query: Partitioning and Read Replicas… easy right! [Diagram: application(s) call a distributed API over Partition 1, Partition 2, Partition 3, … Partition n, each with its own processor.]
24. Why are Graphs Different? (Navigate). [Diagram: the same application(s), distributed API, partitions and processors as the previous slide.]
25. Distributed Navigation (Navigate)
    • Detect local hops and perform in-memory traversal
    • Send the partial path to the distributed processing to continue the navigation.
    • Intelligently cache remote data when accessed frequently
    • Route tasks to other hosts when it is optimal
26. Distributed Navigation Server (Navigate). [Diagram: an application uses the distributed API over Partition 1 and Partition 2, each with a processor; vertices A, B, C, D, E, F, G, X and Y are spread across the partitions, and the partial path P(A,B,C,D) is passed between processors.]
27. GraphViews: Leveraging Schema in the Graph (Navigate). [Diagram of schema types: Patient, Prescription, Drug, Ingredient, Outcome, Complaint, Visit, Allergy, Physician.]
28. Schema Enables Views (Navigate)
    • GraphViews are extremely powerful
    • Allow Big Data to appear small!
    • Connection inference can lead to exponential gains in query performance
    • Views are reusable between queries
    • Views can be persisted
    • Built into the native kernel
29. Problem of Supernodes (Navigate): In Graph Theory, a “supernode” is a vertex with a disproportionally high number of connected edges. Supernodes make it difficult to do a navigational query in real-time due to the amount of effort it may be to pursue paths through it that may be unfruitful.
30. Supernodes in Bacon (Navigate): In the IMDB data set, some examples of supernodes may be talk shows, awards shows, compilations or variety shows.
31. How to avoid supernodes (Navigate): 1. Setting policies on the navigation, like the NoRevisitPolicy, MaximumResultCountPolicy and MaximumPathDepthPolicy, can be used to customize the overall behavior of the navigation.

        PolicyChain policies = new PolicyChain();
        // Only traverse the same vertex once
        policies.addPolicy(new NoRevisitPolicy());
        // Limit the number of paths that will be returned to 10K
        policies.addPolicy(new MaximumResultCountPolicy(10000));
        // Limit the path depth to 6
        policies.addPolicy(new MaximumPathDepthPolicy(6));
32. How to avoid supernodes (Navigate): 2. Graph View to exclude or limit types

        GraphView view = new GraphView();
        // Exclude all instances of TvShow from navigation
        view.excludeClass(myDb.getTypeId(TvShow.class.getName()));
        // Exclude all movies made for TV/video
        view.excludeClass(myDb.getTypeId(Movie.class.getName()), "details.madeForTv || details.madeForVideo");
        // Include ActedIn with characterName not containing "Himself"
        view.excludeClass(myDb.getTypeId(WorkedOn.class.getName()));
        view.includeClass(myDb.getTypeId(ActedIn.class.getName()), "!CONTAINS(characterName, \"Himself\")");

    [Diagram: Kevin Bacon (Actor) connected to The Following (TV Show) as Ryan Hardy, to Behind the Scenes (Movie) as Himself, and to Apollo 13 (Movie) as Jack Swigert.]
33. How to avoid supernodes (Navigate): 3. Using these policies and graph view, we can filter the size of the result set in our navigation:

        Navigator navigator = bacon.navigate(view, Guide.SIMPLE_BREADTH_FIRST, Qualifier.ANY,
                new VertexPredicate(Person.class, ""), policies, myResultHandler);
        navigator.start();
34. Filtered Views in Bacon (Navigate): The results of this navigation would look something like this…
35. Why InfiniteGraph™?
    • Objectivity/DB is a proven foundation
      - Building distributed databases since 1993
      - A complete database management system
        • Concurrency, transactions, cache, schema, query, indexing
    • It’s a Graph Specialist!
      - Simple but powerful API tailored for data navigation.
      - Easy to configure distribution model
36. Advanced Configured Placement
    • Physically co-locate “closely related” data
    • Driven through a declarative placement model
    • Dramatically speeds “local” reads

    [Diagram: patient data page(s) hold Mr Citizen, who Has two Visits and a Primary Physician; facility data page(s) hold the San Jose and Sunnyvale facilities with the doctors Located at them (Dr Jones, Dr Smith, Dr Blake, Dr Quinn); each Visit is With a doctor and At a facility.]
37. Fully Distributed Data Model. [Diagram: AddVertex() enters the IG Core/API, passes through Customizable Placement and the Distributed Object and Relationship Persistence Layer, and lands on HostA, HostB, HostC … HostX across Zone 1 and Zone 2.]
38. Polyglot NoSQL Architectures. [Diagram labels: Users, Applications, Business, External/Legacy Data, Transformation, MDM, Distributed Data Processing Platform, Document, Graph Database, RDBMS, Partitioned Distributed DB (often Document / KV).]
39. What else!
    • Distributed update. Update… we are working on it.
40. Conclusion: I hope that you enjoyed the bacon. My apologies to my kosher friends for any offense. Look out for new features coming soon!
41. QUESTIONS?
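
The supernode-avoidance steps from slides 31 to 33 (visit each vertex once, cap the path depth, cap the result count, and exclude whole vertex types through a view) can be approximated in plain Java. The sketch below is an assumption-laden illustration and does not use the InfiniteGraph classes shown on the slides: `FilteredNavigation`, its methods and the string type tags are hypothetical names, and the filter is applied to vertex types only, not to edge predicates such as the character name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java stand-in for a filtered navigation (hypothetical; not the InfiniteGraph API). */
class FilteredNavigation {
    private final Map<String, List<String>> adjacency = new HashMap<>();
    private final Map<String, String> typeOf = new HashMap<>();

    void addVertex(String name, String type) {
        typeOf.put(name, type);
        adjacency.putIfAbsent(name, new ArrayList<>());
    }

    /** Adds an undirected edge; both vertices must already exist. */
    void addEdge(String a, String b) {
        adjacency.get(a).add(b);
        adjacency.get(b).add(a);
    }

    /**
     * Breadth-first navigation from start. Returns paths that end at a vertex
     * of resultType, never enters a vertex whose type is excluded, never
     * revisits a vertex, and stops at maxDepth edges or maxResults paths.
     */
    List<List<String>> navigate(String start, String resultType,
                                Set<String> excludedTypes, int maxDepth, int maxResults) {
        List<List<String>> results = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<List<String>> frontier = new ArrayDeque<>();
        List<String> first = new ArrayList<>();
        first.add(start);
        visited.add(start);
        frontier.add(first);
        while (!frontier.isEmpty() && results.size() < maxResults) {
            List<String> path = frontier.remove();
            String last = path.get(path.size() - 1);
            if (path.size() > 1 && resultType.equals(typeOf.get(last))) {
                results.add(path);
            }
            if (path.size() - 1 >= maxDepth) {
                continue; // depth cap: do not extend this path further
            }
            for (String next : adjacency.get(last)) {
                if (excludedTypes.contains(typeOf.get(next))) {
                    continue; // view filter: skip excluded types such as talk shows
                }
                if (!visited.add(next)) {
                    continue; // no-revisit rule: each vertex is expanded once
                }
                List<String> extended = new ArrayList<>(path);
                extended.add(next);
                frontier.add(extended);
            }
        }
        return results;
    }
}
```

Excluding a type before a vertex is enqueued means a supernode of that type is never expanded at all, which is where the savings come from: its many edges are not read, instead of being read and then discarded.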