Frontiers of Computational Journalism week 8 - Visualization and Network Analysis

Taught at Columbia Journalism School, Fall 2018
Full syllabus and lecture videos at http://www.compjournalism.com/?p=218

Published in: Education

  1. Frontiers of Computational Journalism
     Columbia Journalism School
     Week 8: Visualization and Network Analysis
     November 7, 2018
  2. This class
     • Visualization as perception
     • Visualization design
     • Social network theory
     • Network analysis in journalism
  3. Visualization as Perception
  4. Topic links in Gödel, Escher, Bach
  5. “Visualization allows people to offload cognition to the perceptual system, using carefully designed images as a form of external memory. The human visual system is a very high-bandwidth channel to the brain, with a significant amount of processing occurring in parallel and at the pre-conscious level.” - Tamara Munzner
  6. Pop-Out Effects
  7. Visual Comparisons
     length, orientation, size, color
     ...plus number, shape, relative motion, and much more
  8. Basic idea of visualization: Turn something you want to find into something you can see without thinking about it
  9. correlations, clusters, extents, outliers
  10. Design Study Methodology: Reflections from the Trenches and the Stacks, Sedlmair et al, 2012
  11. Visualization Design
  12. Inward and Outward Grand Challenges for Visualization, Tamara Munzner
  13. Sequential Narrative
      What’s Really Warming The World?, Bloomberg
  14. Visualization isn’t “objective,” but that doesn’t mean you can’t mislead. (Is this graph misleading?)
  15. Social Network Theory
  16. Network
      A set of people and a set of connections between pairs of them
  17. Types of connections
      Social network analysis: only one type of connection between individuals (e.g. "friend")
      Link analysis: multiple types of connections
      • friend
      • brother
      • employer
      • went to university with
      • sold a car to
      • owns 51% of
      Link analysis is much more relevant to journalism, because it allows representation of much more detail and context.
  18. People Act in Groups
      Family and friendships: I am most closely connected to a small set of people, who are usually closely connected to each other.
      Business: I am much more likely to do business with people I already know.
      Influence: I listen to people I know more than I listen to strangers.
      Norms: what is right depends on what the people around me think.
      People tend to marry, do business with, spend time with, etc. people from similar backgrounds... and people who have social ties tend to be similar.
  19. Two major analysis methods
      …after you have the network data, which may be a very manual process.
      • Look at a visualization
      • Apply algorithm
      In both cases, the results are not interpretable without context.
  20. A “sociogram” of a fraternity from Moreno’s Who Shall Survive? (1934). Arrows show one way “attraction” and lines with a cross bar show “mutual attraction.”
  21. Force-Directed Layout
      Each edge is a "spring" with a fixed preferred length. Plus global repulsive force that pushes all nodes apart.
  22. The Effect of Graph Layout on Inference from Social Network Data, Blythe et al.
  23. The Effect of Graph Layout on Inference from Social Network Data, Blythe et al.
      We asked respondents three questions about the same five focal nodes in each sociogram:
      1) how many subgroups were in the sociogram
      2) how “prominent” was each player in the sociogram
      3) how important a “bridging” role did each player occupy in the sociogram
  24. Centrality
      Often identified with "influence" or "power." Often important in journalism.
      We can visualize the graph and use our eyes, or we can compute centrality values algorithmically.
  25. Degree centrality: number of edges
      Models: cases where the number of connections is important.
      Example: which celebrity can reach the most people at once?
  26. Closeness centrality: average distance to all other nodes
      Models: cases where time taken to reach a node is important.
      Example: who finds out about gossip first?
  27. Betweenness centrality: number of shortest paths that pass through node
      Models: cases where control over transmission is important.
      Example: who has the most power to make introductions?
  28. Eigenvector centrality: how likely you are to end up at a node on a random walk (same idea as PageRank)
      Models: cases where importance of neighbors is important.
      Example: the private adviser to the president
  29. Journalism centrality: how important is this person to this story?
  30. Finding Communities
      No one definition of "community." Could mean a town, or a club, or an industry network.
      But for our purposes, a community is "a group of people with pre-existing patterns of association."
      In social network analysis, that translates into clusters in the graph.
  31. Friends/followers
  32. Co-consumption – Network of political book sales, Orgnet.com
  33. Communications network – Exploring Enron, Jeffrey Heer
  34. Web link structure – Map of Iranian Blogosphere, Berkman Center
  35. Individual time/location trails – CitySense, Sense Networks
  36. Mathematical definitions of "cluster"
      You've already seen several. If you can compute distance between any two items, you can cluster.
      But in social networks, not everyone is connected to everyone else...
  37. Modularity
      Are there more intra-group edges than we would expect randomly?
  38. Modularity
      $n$ = number of vertices
      $k_i$ = degree of vertex $i$
      $A_{ij}$ = 1 if edge between $i,j$, 0 otherwise
      $g_{ij}$ = 1 if $i,j$ in same group, 0 otherwise
      There are $m = \frac{1}{2} \sum_i k_i$ total edges in the graph. If they go between random vertices then the number of edges between $i,j$ is $k_i k_j / 2m$.
  39. Modularity
      $n$ = number of vertices
      $k_i$ = degree of vertex $i$
      $A_{ij}$ = 1 if edge between $i,j$, 0 otherwise
      $g_{ij}$ = 1 if $i,j$ in same group, 0 otherwise
      Modularity: $Q = \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) g_{ij}$
      If $Q > 0$ then there are "excess" edges inside the groups (and fewer edges between them.)
  40. Modularity algorithm
      • Look for a division of nodes into two groups that maximizes Q
      • Can find this through eigenvector technique
      • Possible that no division has Q>0, in which case the graph is a single community
      • If a division with Q>0 found, split
      • Recursively split sub-graphs
  41. Network Analysis in Journalism
  42. Case Study: Seattle Art World
      In Seattle Art World, Women Run the Show, Seattle Times
      Network obtained from dozens of in-person interviews. Interactive visualization in story.
  43. Case Study: Hot Wheels
      Hot Wheels, Tampa Bay Times
      Network obtained from juvenile arrest records concerning stolen cars. Unpublished visualization and centrality measures used to direct reporting to most interesting people.
  44. Coded 34 Stories for Sources and Uses
      Story visualization: published story contains a visualization
      Reporting visualization: used to guide reporters, unpublished.
      Scraping: network extracted from source documents
      Algorithm: centrality, community, etc. used
      Graph DB: network loaded into graph database
  45. Results
      [Bar chart, vertical axis 0 to 40. Bars: Total, Story Vis, Scraping, Reporting Vis, Algorithm, Graph DB]
  46. Why not algorithms?
      Heterogeneous networks. Multiple entity/relationship types. “Link analysis” like criminal investigations.
      Incomplete data. Building out the network is often an interactive process of data gathering.
      Contextual interpretation: What does it mean for someone to be “central”? Depends on the nature of the network and story.
  47. Correlation of different types of info
      Suppose you have a record of phone numbers called, a database of political campaign donations, and a list of government appointees. Put them together, and you have this story:
      “WASHINGTON—Time and again, Texas Gov. Rick Perry picked up his office phone in the months before he would announce his bid for the presidency. He dialed wealthy friends who were his big fundraisers and state officials who owed him for their jobs. Perry also met with a Texas executive who would later co-found an independent political committee that has promised to raise millions to support Perry but is prohibited from coordinating its activities with the governor.”
      - Jack Gillum, Perry called top donors from work phones, AP, 6 Dec 2011
  48. The state of the art: Panama Papers
  49. Graph Databases in Theory
      Load everything into the database, then analyze using a graph query language and interactive visualization.
      “Magic bullet” for large, complex, cross border investigations.
  50. Panama Papers networks derived from structured data only
  51. Entity recognition is not solved!
      Incredibly dirty source data. Current methods have low recall (~70%)
      [Chart: entities found, out of 150]
  52. [Figure labels: “Soft” record linkage; Unlinked records]
  53. Graph Databases in Practice
      Incomplete data. Building a network often requires scraping from documents. Bulk data is often unavailable or impractical, and some records need to be purchased one at a time. Instead, reporting involves interactive data enrichment.
      Record linkage: With N databases, there could be N copies of each entity.
      Graph queries are not that helpful. Cypher was available to Panama Papers (PP) investigators but no one outside the core team learned it. Moreover, it’s not clear how often reporting problems can be expressed as a graph query. Even “find path between” did not produce any (documented) leads on PP.
      Networks need to be narratives. The most useful networks are hand-built, for a particular line of reporting.
  54. Maps, not data visualizations
  55. Query results vs. hand-built graphs
      [Figure labels: Search for node to add; Graph query results]
  56. Proposed System
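
Slide 21 describes force-directed layout as springs on the edges plus a global repulsive force. The following is a minimal sketch of that idea in plain Python, not the algorithm of any particular layout library. The function name, the force constants, the step cap, and the random starting positions are all assumptions chosen for illustration.

```python
import math
import random

def force_layout(nodes, edges, rest_length=1.0, spring=0.1,
                 repulsion=0.1, steps=300, max_step=0.1, seed=0):
    """Toy force-directed layout: springs on edges, repulsion between all pairs."""
    nodes = list(nodes)
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Global repulsion: every pair of nodes pushes apart (inverse-square).
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-6
                push = repulsion / (dist * dist)
                fx, fy = push * dx / dist, push * dy / dist
                force[a][0] += fx
                force[a][1] += fy
                force[b][0] -= fx
                force[b][1] -= fy
        # Springs: each edge pulls (or pushes) its endpoints toward rest_length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-6
            pull = spring * (dist - rest_length)
            fx, fy = pull * dx / dist, pull * dy / dist
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy
        # Move each node along its net force, capping the step for stability.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy)
            if mag > max_step:
                fx, fy = fx * max_step / mag, fy * max_step / mag
            pos[n][0] += fx
            pos[n][1] += fy
    return pos

# Hypothetical example: two triangles joined by one edge.
layout = force_layout("abcdef", [("a", "b"), ("b", "c"), ("a", "c"),
                                 ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")])
for name, (x, y) in sorted(layout.items()):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Because the starting positions are random, a different seed gives a different picture of the same network, which is one reason layout can affect what readers infer from a sociogram.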
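
Slides 25 to 27 define degree, closeness, and betweenness centrality. A minimal sketch in plain Python, assuming an undirected, unweighted network, might look like the following. The toy edge list and all function names are hypothetical. Closeness is reported as average distance, as on slide 26, so lower means more central. Betweenness uses Brandes' algorithm, which gives each node a fractional share when several shortest paths tie rather than a raw path count.

```python
from collections import deque

# Hypothetical toy network: two triangles joined through "dana".
EDGES = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
         ("cat", "dana"), ("dana", "eve"),
         ("eve", "fay"), ("eve", "gus"), ("fay", "gus")]

def build_graph(edges):
    """Adjacency sets for an undirected graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degree(graph):
    """Degree centrality: number of edges at each node."""
    return {node: len(nbrs) for node, nbrs in graph.items()}

def average_distance(graph):
    """Average hop count from each node to every node it can reach."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        others = [d for n, d in dist.items() if n != source]
        result[source] = sum(others) / len(others) if others else 0.0
    return result

def betweenness(graph):
    """Brandes' algorithm for unweighted, undirected graphs."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                         # nodes in order of discovery
        preds = {n: [] for n in graph}     # predecessors on shortest paths
        sigma = dict.fromkeys(graph, 0)    # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):          # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each end.
    return {n: s / 2 for n, s in score.items()}

graph = build_graph(EDGES)
deg = degree(graph)
avg = average_distance(graph)
btw = betweenness(graph)
for name in sorted(graph):
    print(f"{name:5s} degree={deg[name]} "
          f"avg_distance={avg[name]:.2f} betweenness={btw[name]:.1f}")
```

In this toy network "dana" has only two connections but the highest betweenness, because every path between the two triangles runs through her. That is the "power to make introductions" case from slide 27, and it shows how the measures can disagree about who is central.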
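
Slide 28 describes eigenvector centrality as the chance of ending up at a node on a random walk, "same idea as PageRank." The sketch below implements PageRank specifically, by power iteration on a directed graph. The damping factor of 0.85, the iteration count, and the toy network of who listens to whom are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """PageRank by power iteration. links maps each node to the nodes it points at."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random node.
        new = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Otherwise it follows one outgoing link, chosen uniformly.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A node with no outgoing links sends the walker anywhere.
                for target in nodes:
                    new[target] += damping * rank[node] / n
        rank = new
    return rank

# Hypothetical network: an edge A -> B means "A listens to B".
LISTENS_TO = {
    "aide1": ["president"],
    "aide2": ["president"],
    "aide3": ["president", "clerk"],
    "president": ["adviser"],
    "adviser": ["president"],
    "clerk": [],
}

rank = pagerank(LISTENS_TO)
for name, value in sorted(rank.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {value:.3f}")
```

Here "adviser" and "clerk" each have exactly one incoming link, so degree cannot tell them apart. The adviser ranks far higher because that one link comes from the president, which matches the slide's example of the private adviser.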
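
The modularity score on slide 39 can be computed directly from an edge list and a proposed grouping. This sketch follows the slide's formula as written, summing over all ordered pairs including i = j. It only evaluates Q for a given division and does not implement the eigenvector search from slide 40. The function name and toy graph are hypothetical. Many references also divide Q by 2m to normalize it, which the slide's version omits.

```python
def modularity(edges, group_of):
    """Q = sum over i,j in the same group of (A_ij - k_i * k_j / 2m)."""
    degree = {}
    adjacent = set()
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        adjacent.add((a, b))
        adjacent.add((b, a))
    two_m = sum(degree.values())  # 2m: every edge contributes to two degrees
    q = 0.0
    for i in degree:
        for j in degree:
            if group_of[i] != group_of[j]:
                continue  # g_ij = 0, so this pair contributes nothing
            a_ij = 1 if (i, j) in adjacent else 0
            q += a_ij - degree[i] * degree[j] / two_m
    return q

# Hypothetical graph: two triangles joined by the single edge c-d.
TRIANGLES = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {name: 0 for name in "abcdef"}

q_split = modularity(TRIANGLES, split)
q_one = modularity(TRIANGLES, one_group)
print(f"two triangles as separate groups: Q = {q_split:.2f}")
print(f"everything in one group:          Q = {q_one:.2f}")
```

Splitting along the bridge gives Q = 5, meaning more edges fall inside the groups than random placement would predict. Putting every node in one group gives Q = 0, since the observed and expected edge counts cancel.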
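
Slides 52 and 53 raise record linkage: the same entity can appear under different spellings in each database. The following is one simple way to sketch "soft" matching using Python's standard `difflib`. It is not the method used on the Panama Papers. The normalization rule, the 0.85 threshold, and the sample names are assumptions, and real source data would need far more careful handling.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop punctuation, and sort the words so word order is ignored."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in name)
    return " ".join(sorted(cleaned.split()))

def similarity(a, b):
    """Similarity between 0 and 1 based on matching character runs."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link_records(left, right, threshold=0.85):
    """For each name in left, keep its best match in right if it is close enough."""
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b), default=None)
        if best is None:
            continue
        score = similarity(a, best)
        if score >= threshold:
            links.append((a, best, score))
    return links

# Hypothetical records from two sources.
donors = ["Doe, Jane", "Zed Quux"]
appointees = ["Jane Doe", "John Smith"]

for a, b, score in link_records(donors, appointees):
    print(f"{a!r} matches {b!r} (similarity {score:.2f})")
```

The threshold trades recall against false matches. Any link found this way is a lead to check by hand, not a confirmed identity.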
