GraphTO Meetup: Intro to Graph Database and Pacer 1.0


Slides from the GraphTO meetup, covering an introduction to graph database concepts, the ecosystem and use cases, followed by a detailed look at the basics of Pacer, a JRuby graph traversal and data processing library, through Darrick Wiebe's "Pacer in the REPL" talk.

Published in: Technology
  • There are four trends underpinning the NoSQL, and specifically the GraphDB, movements:
    1) The size of the data we manage is more than doubling every two years, with around 2.4 zettabytes expected by the end of this year (or 250 million years of the TV show “24”).
    2) Data is more highly connected than ever before, e.g. FOAF on social networks, or configuration management for a datacenter.
    3) Schema-less data persistence: add a field to just one record, no problem. Sparkes on Toyota.
    4) Application architecture has changed from flat files and batch processing, to shared RDBMS, to SOA + web services.
  • *This is a somewhat contrived example, as “person” & “friend” would normally be one table with a self join.
  • A borrowed slide from Neo Technology.
  • A few options exist for graph query languages, some of which you may have heard of.
    SPARQL is a recursive acronym for “SPARQL Protocol and RDF Query Language”, where RDF is the Resource Description Framework.
    Cypher and Gremlin are modern graph query languages with strong ties to the Neo4j community.
    Pacer is a Ruby gem that you can include in your projects to get jamming on embedded graph databases straight away.
  • Chris compared traffic-based and content-based message ranking approaches to discover ego networks. We don’t need to worry about the details here though. Chris has left us with a nice property graph which identifies official reporting relationships by an edge labelled “Directly_Reported_To”.
  • Add organizational groups.
    Cluster messages together into X (new vertex for X).
    Naughty emails.
    NONE came *from* Enron email addresses.
  • Here we show how to:
    - Start IRB with 3GB of heap space and require the enron examples lib
    - Load the sample GraphML data into an in-memory TinkerGraph
    - Use a helper method to get a high-level summary of the data
  • A few quick query types that are best suited to the graph are things like calculating the heaviest emailers with fast edge counting, and discovering communication paths through graph algorithms like Dijkstra’s shortest-path.
    Adding meta-data vertices to a dataset like this enables even more power-of-the-graph analysis. For example, coupling the ‘influencer’ type of analysis with sentiment analysis on the email message bodies would allow you to determine which groups of staff were being negatively (or positively) affected by a given influencer.
  • Go here, cool stuff.
  • GraphTO Meetup: Intro to Graph Database and Pacer 1.0

    1. GraphTO October 2012, Mozilla Toronto
       David Colebatch & Darrick Wiebe
    2. Agenda
       • Intros and Welcome
       • Pacer 1.0
         • pacer-examples-*
         • Enron
         • Graph in the REPL
       • Questions
       • Discussion...
       Sponsored By:
    3. ¿por qué?
       • Data Set Size
       • Connectivity of Data
       • Semi-structure
       • Evolution of SOA and REST
    4. The Zone of SQL Adequacy [chart: Performance vs. Data complexity; curves: SQL database, Requirement of application]
    5. The Zone of SQL Adequacy [chart: Performance vs. Data complexity; curves: SQL database, Requirement of application]
    6. The Zone of SQL Adequacy [chart: Performance vs. Data complexity; curves: SQL database, Requirement of application; labelled applications: Salary List, ERP, CRM]
    7. The Zone of SQL Adequacy [chart: Performance vs. Data complexity; curves: SQL database, Requirement of application; labelled applications: Salary List, ERP, CRM, MDM, Network / Cloud Management, Geo, Social]
    8. Graph Space New Vendors
       The last 12 months have seen a number of new graph technologies emerge
       • AffinityDB (VMW), YarcData uRiKA (Cray), Giraph (YHOO), Cassovary (Twitter), StigDB (Tagged), NuvolaBase, Pegasus, Titan (Aurelius), etc
    9. How?
       • Nodes / Vertices
       • Relationships / Edges
    10. Relational Model vs. Graph
        Each of these models expresses the same thing
        [figure: relational tables Person*, Person-Friend and Friend* shown next to a graph of people (Emil, Johan, Allison, Tobias, Lars, Andreas, Andrés, Peter, Ian, Delia, Michael, among others) connected by "knows" edges]
    11. Graph db performance
        • a sample social graph with ~1,000 persons
        • average 50 friends per person
        • pathExists(a,b) limited to depth 4
        • caches warmed up to eliminate disk I/O

        Database   # persons   query time
        MySQL      1,000       2,000 ms
        Neo4j      1,000       2 ms
        Neo4j      1,000,000   2 ms
    12. Query Languages
        • SPARQL - if you grok RDF already
        • Cypher
        • Gremlin
        • Pacer - gem install pacer*
        * requires JRuby 1.7.0
    13. Pacer::Examples::Enron
        • Sample data provided by Chris Diehl
        • A selection of the Enron email database, in GraphML format
    14. $ git clone
        $ cd pacer-examples-enron
        $ bundle
        $ ./script/
        $ ./script/
        > enron = Pacer::Examples::Enron.load_data
        > enron.summary_of_data_types
        => { "Message"=>255636, "Email Address"=>87474, "Person"=>156 }
    15. So What?
        • Heavy Emailers
        • Communication paths
        • ...add meta-data vertices to
    16. Clustering Email by Topic
        • Analyze each email body
        • Detect 150 topics
        • Associate each email to its top-5 topics
        • Create meta-data vertices representing this data
        • Go nuts!
          • Trading emails with high BCC rates.
    17. 001> Pacer.in_the_REPL
        REPL driven development is what I do!
    18. Too hard to work with query languages and traversal algorithms in an interactive way.
        - Simple things like discovering labels of edges for the current vertex were not easy.
    19. At the same time I was feeling that pain, I discovered this interesting XPath-inspired language called Gremlin
        gremlin> outE/inV/inE/outV/back(3)/outV/etc
        ==> V[1]
        ==> V[2]
    20. Why is this its own language?
        - Solid underpinnings, great architecture
        - Ruby syntax in a couple of hours
        - Pacer was born by the end of the weekend
        - 2 years ago... Ecosystem has evolved a lot since then
    21. Time for some Pacer!
        g = Pacer.neo4j 'db/enron.graph'
        => #<PacerGraph neo4jgraph[db/enron.graph]>
    22. Discovering your data
        >> g.v.frequencies :type
        => { "Message"=>255636, "Email Address"=>87474, "Person"=>156 }
    23. Keeping Pacer friendly in the REPL: reuse routes and extend them
        >> emails = g.v(Email, type: 'email')
        => #<V-Lucene(type:email) ~ 87474>
    24. Intelligent output
        >> emails.limit 5
        #<V[191310]> #<V[252457]> #<V[210184]> #<V[237290]> #<V[252460]>
        => #<V-Lucene(type:email) ~ 87474 -> V -> V-Range(-1...5)>
    25. We can do better
        g.vertex_name = proc do |v|
          if v[:type] == 'email' ; v[:address]
          elsif v[:type] == 'Message' ; v[:subject] ; end
        end
        >> emails.limit 5
        #<V[191310]> #<V[252457]> #<V[210184]> #<V[237290]> #<V[252460]>
        Total: 5
    26. Beyond the REPL
        >>
        Works much like Ruby's built-in map method but has some extra options and, like all routes, does not evaluate immediately (see the :routes help topic).
        Example: ... ...
        >> :options
        ... ...
    27. So how can we use it?
        In the Enron data, we can tell when people were BCC'd:
        >> g.e(RECEIVED_BY).frequencies :type
        => { "to"=>1159970, "bcc"=>243627, "cc"=>243627 }
    28. Apparently it was a common occurrence
        >> few_bccs = emails.lookahead(max: 10) do |e|
             e.sent.received_by(type: 'bcc')
           end
        => #<V-Lucene(type:email) ~ 87474 -> V -> V-Future(#<V -> out(:SENT) -> V -> outE(:RECEIVED_BY) -> E-Property(type=="bcc") -> inV -> V -> HasCount(<= 10) -> is(true)>)>
        >> rare_bccs = few_bccs.lookahead(min: 1000) do |e|
             e.sent.received_by(type: 'to')
           end
        ...
        => #<V-Lucene(type:email) ~ 87474 -> V -> V-Future(#<V -> out(:SENT) -> V ->...>
    29. Takes a couple of seconds. Let's speed it up
        >> interesting = rare_bccs.result
        => #<Obj 32 ids -> lookup -> is_not(nil)>
    30. Let's see if anyone had any concerns
        >> interesting.sent.
             lookahead(&:bcc).
             filter { |m| m[:body] =~ /concerns/i }
        #<V[131323] Enron Mentions - 01/30/2001>
        #<V[194186] Western Storage Initiatives>
        Total: 2
        => #<Obj 32 ids -> ...>
    31. Resources
    32. GraphTO October 2012, Mozilla Toronto
        David Colebatch & Darrick Wiebe
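
The "pathExists(a,b) limited to depth 4" benchmark on slide 11 and the "Heavy Emailers" idea on slide 15 can be sketched in plain Ruby without a graph database. This is only an illustration under stated assumptions: the edge list, the names and the helper methods (`adjacency`, `path_exists?`, `heaviest_senders`) are all hypothetical, and none of it uses Pacer, Neo4j or the Enron data set.

```ruby
require "set"

# Hypothetical toy data: directed "sent an email to" edges between made-up people.
EDGES = [
  %w[alice bob],
  %w[alice carol],
  %w[alice dave],
  %w[bob carol],
  %w[carol dave],
  %w[dave erin],
  %w[erin frank]
].freeze

# Build an adjacency list: sender => array of recipients.
def adjacency(edges)
  edges.each_with_object({}) do |(from, to), adj|
    (adj[from] ||= []) << to
  end
end

# Breadth-first search that gives up after max_depth hops,
# in the spirit of the depth-limited pathExists(a, b) on slide 11.
def path_exists?(adj, from, to, max_depth: 4)
  return true if from == to

  visited = Set[from]
  frontier = [from]
  max_depth.times do
    next_frontier = []
    frontier.each do |node|
      adj.fetch(node, []).each do |neighbour|
        return true if neighbour == to
        # Set#add? returns nil when the element was already present.
        next_frontier << neighbour if visited.add?(neighbour)
      end
    end
    return false if next_frontier.empty?
    frontier = next_frontier
  end
  false
end

# Rank senders by their number of outgoing edges ("heavy emailers").
# Ties are broken alphabetically so the result is deterministic.
def heaviest_senders(edges, top: 3)
  edges.map(&:first).tally.sort_by { |name, count| [-count, name] }.first(top)
end

ADJ = adjacency(EDGES)
puts path_exists?(ADJ, "alice", "erin")                 # reachable in 2 hops
puts path_exists?(ADJ, "alice", "frank", max_depth: 2)  # needs 3 hops
p heaviest_senders(EDGES, top: 2)
```

The depth cap is what keeps the search cost bounded by the local neighbourhood rather than by the total graph size, which is the point the slide 11 table makes. `Enumerable#tally` needs Ruby 2.7 or a JRuby release with equivalent compatibility.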