Finding Insights In Connected Data: Using Graph Databases In Journalism

When dealing with datasets, journalists have many options to choose from when moving beyond Excel. Usually the first step is using a relational (or SQL) database. While a relational database can be a good choice for some datasets, data analysts today turn to new tools to gain deeper insight. This talk will show how we can use a graph database to analyze highly connected data using examples from U.S. Congressional data and political email archives. Using the U.S. Congress data, we’ll show you how to explore the dataset using Cypher, the Neo4j query language, to discover legislator activity including bill sponsorship and voting activity. Building up our knowledge of Cypher as we progress, we’ll show how you can use principles from social network analysis to find influential legislators and discover what topics legislators have influence over. Finally, we will examine how to draw insights from the Hillary Clinton email dataset, released as part of a FOIA request earlier this year. We will explore this dataset as a graph of interactions among users, answering questions like: Who is communicating with Hillary the most? What are the topics of these emails? You’ll learn how to visualize these using the Neo4j browser to quickly make sense of the data as we are exploring.

The goal of this talk is to provide a demonstration of database tools that any journalist can use to explore datasets and draw insights from connected datasets.
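The question "who is communicating with Hillary the most?" comes down to counting relationships per node. The following is a minimal Python sketch of that idea, not the Cypher used in the talk. The `EMAILS` list, the placeholder names, and the `top_correspondents` helper are all hypothetical and are not drawn from the real email dataset.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs; placeholders, not real email data.
EMAILS = [
    ("aide_a", "hillary"),
    ("aide_b", "hillary"),
    ("hillary", "aide_a"),
    ("aide_a", "hillary"),
    ("aide_c", "aide_b"),
]


def top_correspondents(emails, person, n=3):
    """Count messages exchanged with `person`, in either direction."""
    counts = Counter()
    for sender, recipient in emails:
        if sender == person and recipient != person:
            counts[recipient] += 1
        elif recipient == person and sender != person:
            counts[sender] += 1
    # most_common sorts by count, highest first.
    return counts.most_common(n)


print(top_correspondents(EMAILS, "hillary"))
```

In a graph database the same count is a pattern match over the relationships attached to one node, so no join table is needed.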

Published in: Software

  1. 1. Finding Insights in Connected Data Graph Databases in Journalism NICAR 2016 Denver William Lyon @lyonwj
  2. 2. About Software Developer @Neo4j will@neo4j.com @lyonwj lyonwj.com William Lyon
  3. 3. Agenda • What is a graph database? • Why graphs in journalism? • Demo1: Graphing US Congress • Demo2: Hillary email dataset
  4. 4. What is a graph?
  5. 5. Chart
  6. 6. Chart Graph
  7. 7. VIEWED VIEWED BOUGHT VIEWED BOUGHT BOUGHT BOUGHT BOUGHT
  8. 8. MANAGE MANAGE LEADS REGION MANAGE MANAGE REGION LEADS LEADS COLLABORAT
  9. 9. ACCOUNT HOLDER 2 ACCOUNT ACCOUNT CREDIT CARD BANK ACCOUNT BANK ACCOUNT BANK ADDRESS PHONE PHONE NUMBER SSN 2 LOAN SSN 2 UNSECURE LOAN CREDIT CARD
  10. 10. Graph Databases in Journalism
  11. 11. Graph Databases Software that stores & queries data as a graph.
  12. 12. Graph Database • Property graph data model • Nodes and relationships • Native graph processing • Cypher query language neo4j.com
  13. 13. Why graph databases in journalism?
  14. 14. Why graph databases in journalism?
  15. 15. bills.csv committees.csv votes.csv https://www.govtrack.us/developers
  16. 16. bills.csv committees.csv votes.csv https://www.govtrack.us/developers
  17. 17. SELECT l.name, c.jurisdiction FROM legislators l LEFT JOIN committee c ON c.member_ID = l.thomasID WHERE c.thomasID = 'HSAP'
  18. 18. SQL ER Diagrams
  19. 19. Relational Versus Graph Models Relational Model Graph Model KNOWS KNOWS KNOWS ANDREAS TOBIAS MICA DELIA Person Friend Person-Friend ANDREAS DELIA TOBIAS MICA
  20. 20. Graph Database Relational Database A way of representing data
  21. 21. Property Graph Model
  22. 22. The Whiteboard Model Is the Physical Model
  23. 23. Property Graph Model Components Nodes • The objects in the graph • Can have name-value properties • Can be labeled Relationships • Relate nodes by type and direction • Can have name-value properties CAR DRIVES name: “Dan” born: May 29, 1970 twitter: “@dan” name: “Ann” born: Dec 5, 1975 since: Jan 10, 2011 brand: “Volvo” model: “V70” LOVES LOVES LIVES WITH OWNS PERSON PERSON
  24. 24. Cypher Query Language SQL for graphs
  25. 25. Cypher: Powerful and Expressive Query Language CREATE (:Person {name: "Dan"})-[:LOVES]->(:Person {name: "Ann"}) LOVES Dan Ann LABEL PROPERTY NODE NODE LABEL PROPERTY
  26. 26. Express Complex Queries Easily with Cypher Find all direct reports and how many people they manage, up to 3 levels down Cypher Query MATCH (boss)-[:MANAGES*0..3]->(sub), (sub)-[:MANAGES*1..3]->(report) WHERE boss.name = "John Doe" RETURN sub.name AS Subordinate, count(report) AS Total SQL Query
  27. 27. Graphing US Congress Demo
  28. 28. https://github.com/legis-graph/legis-graph
  29. 29. https://github.com/legis-graph/legis-graph LOAD CSV WITH HEADERS FROM "file:///legislators.csv" AS line MERGE (l:Legislator {thomasID: line.thomasID}) SET l = line MERGE (s:State {code: line.state})<-[:REPRESENTS]-(l) … US Congress
  30. 30. https://github.com/legis-graph/legis-graph
  31. 31. http://legis-graph.github.io/legis-graph-spatial/
  32. 32. contributions committees candidates
  33. 33. https://gist.github.com/johnymontana/02ae47fc0a29719db045
  34. 34. +
  35. 35. https://gist.github.com/johnymontana/02ae47fc0a29719db045
  36. 36. Graph data models are easy to evolve! Takeaway
  37. 37. Hillary Clinton Emails Demo
  38. 38. Clinton email graph model
  39. 39. Data munging http://graphics.wsj.com/hillary-clinton-email-documents/
  40. 40. Data munging https://github.com/OpenRefine/OpenRefine/wiki/Faceting
  41. 41. LOAD CSV - Cypher http://www.developeradvocate.com/2015/11/graphing-hillary-clinton-email/
  42. 42. Clinton email graph model bit.ly/1R1ybyu
  43. 43. Content mining “Networks give structure to the conversation while content mining gives meaning.” http://breakthroughanalysis.com/2015/10/08/ltapreriitsouda/ - Preriit Souda
  44. 44. Extracting topics from email text
  45. 45. Extracting topics from email text http://www.markhneedham.com/blog/2015/02/13/neo4j-building-a-topic-graph-with-prismatic-interest-graph-api/
  46. 46. Clinton email graph model
  47. 47. Clinton email graph model
  48. 48. http://bit.ly/1R1ybyu
  49. 49. Resources
  50. 50. Visualization https://linkurio.us/ http://visjs.org/ http://neo4j.com/developer/guide-data-visualization/
  51. 51. Data analysis with Neo4j Py2neo http://py2neo.org/2.0/ IPython Notebook https://github.com/versae/ipython-cypher R-lang http://neo4j.com/developer/r/
  52. 52. ICIJ Case Study Swiss Leaks https://youtu.be/4__ni4aC8gI http://neo4j.com/case-studies/icij/
  53. 53. graphdatabases.com
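
The variable-length pattern on slide 26 (`MANAGES*0..3` and `MANAGES*1..3`) can be approximated in plain Python to show what the traversal computes. This is a rough sketch under stated assumptions: the `MANAGES` dictionary, every name in it except "John Doe", and both helper functions are hypothetical, and the org chart is assumed to be a tree, so counting reachable people matches counting the paths Cypher would return.

```python
from collections import deque

# Hypothetical MANAGES relationships: manager -> direct reports.
# Assumed to be a tree (one manager per person, no cycles).
MANAGES = {
    "John Doe": ["Ann", "Bob"],
    "Ann": ["Cid"],
    "Cid": ["Dee"],
    "Bob": [],
    "Dee": [],
}


def within(start, min_hops, max_hops):
    """People reachable from `start` in min_hops..max_hops MANAGES steps."""
    found = []
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth >= min_hops:
            found.append(person)
        if depth < max_hops:
            for report in MANAGES.get(person, []):
                queue.append((report, depth + 1))
    return found


def report_counts(boss, levels=3):
    """Map each subordinate (0..levels hops from boss) to a report count."""
    result = {}
    for sub in within(boss, 0, levels):
        total = len(within(sub, 1, levels))
        if total:  # like the Cypher pattern, skip people who manage no one
            result[sub] = total
    return result


print(report_counts("John Doe"))
```

The Cypher version expresses both hop ranges directly in the pattern, while this sketch needs an explicit breadth-first search for each one.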