
Frontiers of Computational Journalism week 8 - Visualization and Network Analysis



  1. 1. Frontiers of Computational Journalism Columbia Journalism School Week 8: Visualization and Network Analysis November 7, 2018
  2. 2. This class • Visualization as perception • Visualization design • Social network theory • Network analysis in journalism
  3. 3. Visualization as Perception
  4. 4. Topic links in Gödel, Escher, Bach
  5. 5. “Visualization allows people to offload cognition to the perceptual system, using carefully designed images as a form of external memory. The human visual system is a very high-bandwidth channel to the brain, with a significant amount of processing occurring in parallel and at the pre-conscious level.” - Tamara Munzner
  6. 6. Pop-Out Effects
  7. 7. Visual Comparisons: length, orientation, size, color, number, shape, relative motion, and much more
  8. 8. Basic idea of visualization: Turn something you want to find into something you can see without thinking about it
  9. 9. Correlations, clusters, extents, outliers
  10. 10. Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al, 2012
  11. 11. Visualization Design
  12. 12. Inward and Outward Grand Challenges for Visualization, Tamara Munzner
  13. 13. Sequential Narrative What’s Really Warming The World?, Bloomberg
  14. 14. Visualization isn’t “objective,” but that doesn’t mean you can’t mislead. (Is this graph misleading?)
  15. 15. Social Network Theory
  16. 16. Network A set of people and a set of connections between pairs of them
  17. 17. Types of connections Social network analysis: only one type of connection between individuals (e.g. "friend"). Link analysis: multiple types of connections, e.g. "friend," "brother," "employer," "went to university with," "sold a car to," "owns 51% of." Link analysis is much more relevant to journalism, because it allows representation of much more detail and context.
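One way to picture the difference: in link analysis each edge carries a relationship type and optional context, rather than being a bare connection. The sketch below is only an illustration; the `LinkGraph` class, its method names, and the people and companies in it are hypothetical and do not come from the slides.

```python
from collections import defaultdict


class LinkGraph:
    """Directed graph whose edges carry a relationship type and free-form context."""

    def __init__(self):
        # source -> list of (relation, target, context) tuples
        self._links = defaultdict(list)

    def add(self, source, relation, target, **context):
        """Record one typed connection, with optional details such as a year or share."""
        self._links[source].append((relation, target, context))

    def links_from(self, source, relation=None):
        """Return the links leaving `source`, optionally filtered by relation type."""
        return [link for link in self._links[source]
                if relation is None or link[0] == relation]


# Hypothetical example data.
graph = LinkGraph()
graph.add("Person A", "friend", "Person B")
graph.add("Person A", "sold a car to", "Person C", year=2015)
graph.add("Person A", "owns", "Company X", share=0.51)

print(graph.links_from("Person A", "owns"))
```

A plain social network would keep only the first kind of edge; the extra fields are what let a reporter ask "who owns what, and since when?"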
  18. 18. People Act in Groups Family and friendships: I am most closely connected to a small set of people, who are usually closely connected to each other. Business: I am much more likely to do business with people I already know. Influence: I listen to people I know more than I listen to strangers. Norms: what is right depends on what the people around me think. People tend to marry, do business with, spend time with, etc. people from similar backgrounds... and people who have social ties tend to be similar.
  19. 19. Two major analysis methods …after you have the network data, which may be a very manual process. • Look at a visualization • Apply algorithm In both cases, the results are not interpretable without context.
  20. 20. A “sociogram” of a fraternity from Moreno’s Who Shall Survive? (1934). Arrows show one-way “attraction” and lines with a cross bar show “mutual attraction.”
  21. 21. Force-Directed Layout Each edge is a "spring" with a fixed preferred length. Plus global repulsive force that pushes all nodes apart.
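A minimal sketch of the idea on this slide, assuming a linear spring on each edge and an inverse-square repulsion between every pair of nodes. The function name, the constants, and the step clamp (`max_step`) are assumptions chosen to keep the toy simulation stable; real layout libraries use more refined force models and cooling schedules.

```python
import math
import random


def force_directed_layout(nodes, edges, spring_length=1.0, spring_k=0.1,
                          repulsion=0.1, max_step=0.1, iterations=500, seed=0):
    """Toy 2D layout: edges act as springs, every pair of nodes repels."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}

    def direction(a, b):
        """Unit vector pointing from b to a, plus the distance between them."""
        dx = pos[a][0] - pos[b][0]
        dy = pos[a][1] - pos[b][1]
        dist = math.hypot(dx, dy) or 1e-9
        return dx / dist, dy / dist, dist

    def apply(force, a, b, ux, uy, f):
        """Push a along (ux, uy) with magnitude f, and b the opposite way."""
        force[a][0] += f * ux
        force[a][1] += f * uy
        force[b][0] -= f * ux
        force[b][1] -= f * uy

    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Global repulsion between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                ux, uy, dist = direction(a, b)
                apply(force, a, b, ux, uy, repulsion / dist ** 2)
        # Spring along each edge: pulls together when stretched, pushes when compressed.
        for a, b in edges:
            ux, uy, dist = direction(a, b)
            apply(force, a, b, ux, uy, spring_k * (spring_length - dist))
        # Move each node along its net force, with a clamped step size.
        for n in nodes:
            fx, fy = force[n]
            size = math.hypot(fx, fy)
            scale = min(1.0, max_step / size) if size > 0 else 0.0
            pos[n][0] += fx * scale
            pos[n][1] += fy * scale
    return pos


def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


pair = force_directed_layout(["a", "b"], [("a", "b")])
path = force_directed_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(distance(pair, "a", "b"), distance(path, "a", "c"))
```

Because of the repulsion term, connected nodes settle slightly farther apart than the preferred spring length, and unconnected nodes end up farther apart still.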
  22. 22. The Effect of Graph Layout on Inference from Social Network Data, Blythe et al.
  23. 23. The Effect of Graph Layout on Inference from Social Network Data, Blythe et al. We asked respondents three questions about the same five focal nodes in each sociogram: 1) how many subgroups were in the sociogram 2) how “prominent” was each player in the sociogram 3) how important a “bridging” role did each player occupy in the sociogram
  24. 24. Centrality Often identified with "influence" or "power." Often important in journalism. We can visualize the graph and use our eyes, or we can compute centrality values algorithmically.
  25. 25. Degree centrality: number of edges Models: cases where the number of connections is important. Example: which celebrity can reach the most people at once?
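Degree centrality is simple enough to compute by hand. A short sketch, assuming an undirected edge list; the node names are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Count the edges touching each node of an undirected graph."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


# Hypothetical network: one celebrity followed by four fans, two of whom know each other.
edges = [("celebrity", fan) for fan in ("fan1", "fan2", "fan3", "fan4")]
edges.append(("fan1", "fan2"))

degree = degree_centrality(edges)
print(max(degree, key=degree.get), degree)
```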
  26. 26. Closeness centrality: average distance to all other nodes Models: cases where time taken to reach a node is important. Example: who finds out about gossip first?
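Following the definition on this slide (average distance to all other nodes, so lower means more central), a breadth-first search from each node is enough for an unweighted graph. This sketch assumes the graph is connected; the helper names are illustrative. Note that many libraries report the reciprocal of this value instead, so that higher means more central.

```python
from collections import deque


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbors mapping."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def average_distance(adj, source):
    """Mean shortest-path length from `source` to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others)


# A chain of five people: gossip starting anywhere reaches "c" soonest on average.
chain = build_adjacency([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
print({node: average_distance(chain, node) for node in sorted(chain)})
```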
  27. 27. Betweenness centrality: number of shortest paths that pass through node Models: cases where control over transmission is important. Example: who has the most power to make introductions?
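For an unweighted, undirected graph, betweenness can be computed with Brandes' algorithm: one breadth-first search per node that counts shortest paths, followed by a backward pass that accumulates each node's share. The sketch below is a compact version under those assumptions; when several shortest paths tie, each contributes a fraction. The example graph is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph given as node -> neighbors."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        order = []                       # nodes in order of discovery
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing path credit to predecessors.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: total / 2 for v, total in score.items()}


# Two pairs of friends who can only reach each other through "c".
adj = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
result = betweenness(adj)
print(result)
```

Here "c" sits on the shortest path for all four cross-group pairs, which matches the "power to make introductions" reading on the slide.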
  28. 28. Eigenvector centrality: how likely you are to end up at a node on a random walk (same idea as PageRank) Models: cases where importance of neighbors is important. Example: the private adviser to the president
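The random-walk reading can be sketched as a PageRank-style power iteration on a directed graph. This is the PageRank variant the slide mentions rather than textbook eigenvector centrality (the leading eigenvector of the adjacency matrix); the damping factor of 0.85, the handling of nodes with no outgoing links, and all node names are assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration for the stationary distribution of a damped random walk."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random node.
        new_rank = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            # A node with no outgoing links is treated as linking to everyone.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical network: three staffers point to the president, who points only to
# a private adviser. A clerk has more incoming links than the adviser, but from
# unimportant nodes.
out_links = {
    "staff1": ["president"], "staff2": ["president"], "staff3": ["president"],
    "president": ["adviser"], "adviser": ["president"],
    "intern1": ["clerk"], "intern2": ["clerk"], "clerk": [],
}
rank = pagerank(out_links)
print(sorted(rank.items(), key=lambda item: -item[1]))
```

The adviser has a single incoming link yet outranks the clerk, because that one link comes from the most important node.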
  29. 29. Journalism centrality: how important is this person to this story?
  30. 30. Finding Communities No one definition of "community." Could mean a town, or a club, or an industry network. But for our purposes, a community is "a group of people with pre-existing patterns of association." In social network analysis, that translates into clusters in the graph.
  31. 31. Friends/followers
  32. 32. Co-consumption – Network of political book sales,
  33. 33. Communications network – Exploring Enron, Jeffrey Heer
  34. 34. Web link structure – Map of Iranian Blogosphere, Berkman Center
  35. 35. Individual time/location trails – CitySense, Sense Networks
  36. 36. Mathematical definitions of "cluster" You've already seen several. If you can compute distance between any two items, you can cluster. But in social networks, not everyone is connected to everyone else...
  37. 37. Modularity Are there more intra-group edges than we would expect randomly?
  38. 38. Modularity n = number of vertices; $k_i$ = degree of vertex i; $A_{ij}$ = 1 if edge between i,j, 0 otherwise; $g_{ij}$ = 1 if i,j in same group, 0 otherwise. There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph. If they go between random vertices then the number of edges between i,j is $k_i k_j / 2m$.
  39. 39. Modularity n = number of vertices; $k_i$ = degree of vertex i; $A_{ij}$ = 1 if edge between i,j, 0 otherwise; $g_{ij}$ = 1 if i,j in same group, 0 otherwise. Modularity: $Q = \sum_{ij} \left( A_{ij} - k_i k_j / 2m \right) g_{ij}$. If Q>0 then there are "excess" edges inside the groups (and fewer edges between them).
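As a worked check of this definition, the sketch below sums A_ij - k_i k_j / 2m over all ordered pairs (i, j) that share a group, exactly as the slide writes it. Many references additionally divide Q by 2m so that it falls at or below 1; that normalization is left out here to match the slide. The function name and the two-triangle example graph are illustrative assumptions.

```python
from itertools import product


def modularity(edges, group_of):
    """Unnormalized modularity Q for an undirected graph and a node -> group mapping."""
    nodes = list(group_of)
    degree = {v: 0 for v in nodes}
    adjacent = set()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
        adjacent.add((a, b))
        adjacent.add((b, a))
    m = len(edges)
    q = 0.0
    # Sum over ordered pairs (i, j), including i == j, that share a group.
    for i, j in product(nodes, repeat=2):
        if group_of[i] == group_of[j]:
            a_ij = 1 if (i, j) in adjacent else 0
            q += a_ij - degree[i] * degree[j] / (2 * m)
    return q


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
by_triangle = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
one_group = {v: "all" for v in range(6)}

q_split = modularity(edges, by_triangle)
q_single = modularity(edges, one_group)
print(q_split, q_single)
```

Splitting along the bridge gives Q = 5 (each triangle contributes 6 - 49/14 = 2.5), while putting everything in one group always gives Q = 0.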
  40. 40. Modularity algorithm • Look for a division of nodes into two groups that maximizes Q • Can find this through eigenvector technique • Possible that no division has Q>0, in which case the graph is a single community • If a division with Q>0 found, split • Recursively split sub-graphs
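The recursive procedure above can be sketched for very small graphs by replacing the eigenvector technique with a brute-force search over every two-way split. That substitution, and the simplification of scoring each sub-graph on its own edges only (published versions keep degrees from the full graph when refining), are assumptions made to keep the example short; function names are hypothetical, and the search is exponential in the number of nodes.

```python
from itertools import combinations


def split_modularity(nodes, edges, group_a):
    """Unnormalized Q for dividing `nodes` into group_a and the rest."""
    degree = {v: 0 for v in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    m = len(edges)
    q = 0.0
    for group in (group_a, set(nodes) - group_a):
        inside = sum(1 for a, b in edges if a in group and b in group)
        degree_sum = sum(degree[v] for v in group)
        # Sum of A_ij over the group is 2 * inside; sum of k_i k_j is degree_sum ** 2.
        q += 2 * inside - degree_sum ** 2 / (2 * m)
    return q


def best_bisection(nodes, edges):
    """Try every two-way split; return (best Q, one side) or (0.0, None) if none has Q > 0."""
    best_q, best_group = 0.0, None
    first, rest = nodes[0], nodes[1:]
    # Fixing `first` on one side avoids testing each split twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            group_a = {first, *extra}
            q = split_modularity(nodes, edges, group_a)
            if q > best_q + 1e-9:
                best_q, best_group = q, group_a
    return best_q, best_group


def communities(nodes, edges):
    """Recursively split while some division has Q > 0."""
    nodes = sorted(nodes)
    inside = [(a, b) for a, b in edges if a in nodes and b in nodes]
    if len(nodes) < 2 or not inside:
        return [nodes]
    _, group_a = best_bisection(nodes, inside)
    if group_a is None:
        return [nodes]  # no split improves on a single community
    group_b = [v for v in nodes if v not in group_a]
    return communities(group_a, inside) + communities(group_b, inside)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
found = communities(range(6), edges)
print(found)
```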
  41. 41. Network Analysis in Journalism
  42. 42. Case Study: Seattle Art World In Seattle Art World, Women Run the Show, Seattle Times Network obtained from dozens of in-person interviews. Interactive visualization in story.
  43. 43. Case Study: Hot Wheels Hot Wheels, Tampa Bay Times Network obtained from juvenile arrest records concerning stolen cars. Unpublished visualization and centrality measures used to direct reporting to most interesting people.
  44. 44. Coded 34 Stories for Sources and Uses • Story visualization: published story contains a visualization • Reporting visualization: used to guide reporters, unpublished • Scraping: network extracted from source documents • Algorithm: centrality, community, etc. used • Graph DB: network loaded into graph database
  45. 45. Results [Bar chart, axis from 0 to 40; bars: Total, Story Vis, Scraping, Reporting Vis, Algorithm, Graph DB]
  46. 46. Why not algorithms? Heterogeneous networks. Multiple entity/relationship types. “Link analysis” like criminal investigations. Incomplete data. Building out the network is often an interactive process of data gathering. Contextual interpretation: What does it mean for someone to be “central”? Depends on the nature of the network and story.
  47. 47. Correlation of different types of info Suppose you have a record of phone numbers called, a database of political campaign donations, and a list of government appointees. Put them together, and you have this story: WASHINGTON—Time and again, Texas Gov. Rick Perry picked up his office phone in the months before he would announce his bid for the presidency. He dialed wealthy friends who were his big fundraisers and state officials who owed him for their jobs. Perry also met with a Texas executive who would later co-found an independent political committee that has promised to raise millions to support Perry but is prohibited from coordinating its activities with the governor. - Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
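The kind of cross-referencing described here can be sketched as a join across three datasets. Everything below is hypothetical placeholder data with invented field names, and it assumes the records share a clean phone-number key; real records would first need cleaning and record linkage.

```python
# Hypothetical placeholder records; none of this is real data.
calls = [
    {"date": "2011-05-02", "number": "555-0101"},
    {"date": "2011-05-03", "number": "555-0102"},
    {"date": "2011-05-03", "number": "555-0199"},
]
donors = {"555-0101": {"name": "Donor A", "total_donated": 50000}}
appointees = {"555-0102": {"name": "Official B", "post": "Board X"}}


def annotate_calls(calls, donors, appointees):
    """Tag each call with what the other datasets say about the number dialed."""
    rows = []
    for call in calls:
        number = call["number"]
        roles = []
        if number in donors:
            roles.append("donor")
        if number in appointees:
            roles.append("appointee")
        person = donors.get(number) or appointees.get(number)
        rows.append({**call,
                     "name": person["name"] if person else None,
                     "roles": roles})
    return rows


annotated = annotate_calls(calls, donors, appointees)
for row in annotated:
    print(row)
```

Calls that match nothing are kept with an empty role list, since an unexplained number can itself be a reporting lead.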
  48. 48. The state of the art: Panama Papers
  49. 49. Graph Databases in Theory Load everything into the database, then analyze using a graph query language and interactive visualization. “Magic bullet” for large, complex, cross border investigations.
  50. 50. Panama Papers networks derived from structured data only
  51. 51. Entity recognition is not solved! Incredibly dirty source data. Current methods have low recall (~70%). [Chart: entities found out of 150]
  52. 52. “Soft” record linkage Unlinked records
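One common way to approximate "soft" linkage is to normalize names and accept the best fuzzy match above a similarity threshold. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the suffix list, the 0.85 threshold, and the company names are assumptions for illustration, not the method used in the slides.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore when comparing names.
SUFFIXES = {"ltd", "limited", "inc", "corp", "sa"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(token for token in cleaned.split() if token not in SUFFIXES)


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_records(left, right, threshold=0.85):
    """Pair each left-hand name with its best right-hand match, if good enough."""
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        score = similarity(a, best)
        if score >= threshold:
            links.append((a, best, round(score, 2)))
    return links


# Hypothetical names from two different databases.
left = ["Acme Holdings Ltd.", "Blue Sky Trading"]
right = ["ACME HOLDINGS LIMITED", "Bluesky Trading Inc", "Orange Grove SA"]

links = link_records(left, right)
print(links)
```

A threshold this simple will produce both false matches and misses on dirty data, so in practice the candidate links would still be reviewed by a person.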
  53. 53. Graph Databases in Practice Incomplete data. Building a network often requires scraping from documents. Bulk data is often unavailable or impractical, and some records need to be purchased one at a time. Instead, reporting involves interactive data enrichment. Record linkage: With N databases, there could be N copies of each entity. Graph queries are not that helpful. Cypher was available to Panama Papers (PP) investigators but no one outside the core team learned it. Moreover, it’s not clear how often reporting problems can be expressed as a graph query. Even “find path between” did not produce any (documented) leads on PP. Networks need to be narratives. The most useful networks are hand-built, for a particular line of reporting.
  54. 54. Maps, not data visualizations
  55. 55. Query results vs. hand-built graphs [Panel labels: Search for node to add; Graph query results]
  56. 56. Proposed System