
Frontiers of Computational Journalism week 1 - Introduction and High Dimensional Data


Taught at Columbia Journalism School, Fall 2018
Full syllabus and lecture videos at http://www.compjournalism.com/?p=218


  1. 1. Frontiers of Computational Journalism Columbia Journalism School Week 1: High Dimensional Data September 12, 2018
  2. 2. “Stories will emerge from stacks of financial disclosure forms, court records, legislative hearings, officials' calendars or meeting notes, and regulators' email messages that no one today has time or money to mine. With a suite of reporting tools, a journalist will be able to scan, transcribe, analyze, and visualize the patterns in these documents.” - Cohen, Hamilton, Turner, Computational Journalism, 2011 Computational Journalism: Definitions
  3. 3. Surgeon Scorecard, ProPublica 2015
  4. 4. Computational Journalism: Definitions “Broadly defined, it can involve changing how stories are discovered, presented, aggregated, monetized, and archived. Computation can advance journalism by drawing on innovations in topic detection, video analysis, personalization, aggregation, visualization, and sensemaking.” - Cohen, Hamilton, Turner, Computational Journalism, 2011
  5. 5. Journalism & Technology: Big Data, Personalization & Automation Shailesh Prakash, The Washington Post
  6. 6. Kony 2012 early network, Gilad Lotan
  7. 7. We are now living in a world where algorithms, and the data that feed them, adjudicate a large array of decisions in our lives: not just search engines and personalized online news systems, but educational evaluations, the operation of markets and political campaigns, the design of urban public spaces, and even how social services like welfare and public safety are managed. … Journalists are beginning to adapt their traditional watchdogging and accountability functions to this new wellspring of power in society. They are investigating algorithms in order to characterize their power and delineate their mistakes and biases. - Nick Diakopoulos, Algorithmic Accountability, 2015 Computational Journalism: Definitions
  8. 8. Websites Vary Prices, Deals Based on Users' Information Valentino-Devries, Singer-Vine and Soltani, WSJ, 2012
  9. 9. Message Machine Jeff Larson, Al Shaw, ProPublica, 2012
  10. 10. “Computational Journalism” in this course Reporting on society, using computation and Reporting on computation in society
  11. 11. Natural Language Processing Visualization Sociology Artificial Intelligence Cognitive Science Statistics Graph Theory Text Analysis Filter Design Inference Algorithmic Accountability Network Analysis Disinformation Privacy and Security Information Retrieval Epistemology Course Topics
  12. 12. Administration Assignments Some assignments require programming, but your writing counts for more than your code! Final project Code, story, or research Course blog http://compjournalism.com Grading 40% assignments 40% final project 20% class participation
  13. 13. This class • Introduction • High dimensional data • Text analysis in journalism • The Document Vector Space model • The Overview document mining platform
  14. 14. High dimensional data
  15. 15. Vector representation of objects: each object is a column vector x = (x1, x2, x3, …, xN). Fundamental representation for (almost) all data mining, clustering, machine learning, visualization, NLP, etc. algorithms.
  16. 16. Interpreting High Dimensional Data UK House of Lords voting record, 2000-2012. N = 1043 lords by M = 1630 votes. Coding: 2 = aye, 4 = nay, -9 = didn't vote.
  17. 17. Dimensionality reduction Problem: vector space is high-dimensional. Up to thousands of dimensions. The screen is two-dimensional. We have to go from x ∈ R^N to much lower-dimensional points y ∈ R^K with K << N. Probably K = 2 or K = 3.
  18. 18. Projection Projection from 3 to 2 dimensions
  19. 19. Which direction should we look from? Principal components analysis: find a linear projection that preserves greatest variance Take first K eigenvectors of covariance matrix corresponding to largest eigenvalues. This gives a K-dimensional sub-space for projection.
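A minimal sketch of this PCA projection in Python/numpy (illustrative only; the course notebook may use a library routine instead):

```python
# Sketch: project an M x N data matrix onto its top-K principal components.
import numpy as np

def pca_project(X, K=2):
    """Return the K-dimensional PCA projection of the rows of X."""
    Xc = X - X.mean(axis=0)                  # center each dimension
    cov = np.cov(Xc, rowvar=False)           # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    top = np.argsort(eigvals)[::-1][:K]      # indices of the K largest eigenvalues
    P = eigvecs[:, top]                      # N x K projection matrix
    return Xc @ P                            # M x K low-dimensional points
```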
  20. 20. PCA on House of Lords data
  21. 21. UK House of Lords PCA notebook
  22. 22. Classification and Clustering Classification is arguably one of the most central and generic of all our conceptual exercises. It is the foundation not only for conceptualization, language, and speech, but also for mathematics, statistics, and data analysis in general. Kenneth D. Bailey, Typologies and Taxonomies: An Introduction to Classification Techniques
  23. 23. Distance metric d(x, y) ≥ 0 - distance is never negative d(x, x) = 0 - “reflexivity”: zero distance to self d(x, y) = d(y, x) - “symmetry”: x to y same as y to x d(x, z) ≤ d(x, y) + d(y, z) - “triangle inequality”: going direct is shorter
  24. 24. Distance matrix Data matrix for M objects of N dimensions: X is the M x N matrix whose row i is the object vector x_i = (x_i1, x_i2, …, x_iN). Distance matrix: D_ij = D_ji = d(x_i, x_j), the M x M matrix of all pairwise distances from d_1,1 to d_M,M.
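As a sketch (assuming Euclidean distance and a tiny made-up data matrix), the distance matrix can be built like this:

```python
# Sketch: build the M x M distance matrix D from an M x N data matrix X.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[2.0, 4.0, 2.0],     # toy data matrix: M = 3 objects, N = 3 dimensions
              [4.0, 4.0, -9.0],
              [2.0, 2.0, 2.0]])

D = squareform(pdist(X, metric="euclidean"))   # D[i, j] = d(x_i, x_j)
assert np.allclose(D, D.T)                     # symmetry
assert np.allclose(np.diag(D), 0)              # zero distance to self
```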
  25. 25. Different clustering algorithms • Partitioning o keep adjusting clusters until convergence o e.g. K-means o Also LDA and many Bayesian models, from a certain perspective • Agglomerative hierarchical o start with leaves, repeatedly merge clusters o e.g. MIN and MAX approaches • Divisive hierarchical o start with root, repeatedly split clusters o e.g. binary split
  26. 26. K-means demo https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
  27. 27. UK House of Lords voting clusters Algorithm instructed to separate the members into five clusters. Output: 1 2 2 1 3 2 2 2 1 4 1 1 1 1 1 1 5 2 1 1 2 2 1 2 3 2 2 4 2 1 2 3 2 1 3 1 1 2 1 2 1 5 2 1 4 2 2 1 2 1 1 4 1 1 4 1 2 2 1 5 1 1 1 2 3 3 2 2 2 5 2 3 1 2 1 4 1 1 4 4 1 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 3 2 1 1 2 2 1 2 3 4 2 2 2 …
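A hedged sketch of how labels like these might be produced with scikit-learn's KMeans (the class may use a different implementation or distance metric; the `votes` matrix below is a random placeholder, not the real voting record):

```python
# Sketch: cluster members of the Lords into 5 groups by voting record.
import numpy as np
from sklearn.cluster import KMeans

votes = np.random.choice([2, 4, -9], size=(1043, 1630)).astype(float)  # placeholder data

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(votes)    # one cluster id (0..4) per member
print(labels[:20])                    # compare against party affiliation
```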
  28. 28. Voting clusters with parties
      LDem XB Lab LDem XB Lab XB Lab Con XB
      1 2 2 1 3 2 2 2 1 4
      Con Con LDem Con Con Con LDem Lab Con LDem
      1 1 1 1 1 1 5 2 1 1
      Lab Lab Con Lab XB XB Lab XB Lab Con
      2 2 1 2 3 2 2 4 2 1
      Lab XB Lab Con XB XB LDem Lab XB Lab
      2 3 2 1 3 1 1 2 1 2
      Con Con Lab Con XB Lab Lab Con XB XB
      1 5 2 1 4 2 2 1 2 1
      Con XB Con Con XB Con Lab XB LDem Con
      1 4 1 1 4 1 2 2 1 5
      Con Con Con Lab Bp XB Lab Lab Lab LDem
      1 1 1 2 3 3 2 2 2 5
      Lab XB Con Lab Con XB Con Con XB XB
      2 3 1 2 1 4 1 1 4 4
      Con Con Lab Con Con XB Lab Lab Lab Con
      1 1 2 1 1 2 2 2 2 1
      Lab LDem Lab Con Lab Lab Con XB Lab Con
      2 1 2 1 2 2 1 3 2 1
      Con Lab XB Con XB XB XB Lab Lab Lab
      1 2 2 1 2 3 4 2 2 2
  29. 29. No unique “right” clustering Different distance metrics and clustering algorithms give different results. Should we sort incident reports by location, time, actor, event type, author, cost, casualties…? There is only context-specific categorization. And the computer doesn’t understand your context.
  30. 30. Different libraries, different categories
  31. 31. Clustering Algorithm Input: data points (feature vectors). Output: a set of clusters, each of which is a set of points. Visualization Input: data points (feature vectors). Output: a picture of the points.
  32. 32. Linear projections (like PCA) Projects in a straight line to closest point on "screen.” y = Px where P is a K by N matrix. Projection from 2 to 1 dimensions
  33. 33. Nonlinear projections Still going from high-dimensional x to low-dimensional y, but now y = f(x) for some function f(), not linear. So, may not preserve relative distances, angles, etc. Fish-eye projection from 3 to 2 dimensions
  34. 34. Multidimensional scaling Idea: try to preserve distances between points "as much as possible." If we have the distances between all points in a distance matrix, D_ij = |x_i – x_j| for all i, j, we can recover the original {x_i} coordinates exactly (up to rigid transformations). Like working out a country map if you know how far away each city is from every other. But notice that the original dimension is not encoded in the matrix… we can re-project to any number of dimensions.
  35. 35. Multidimensional scaling Torgerson's "classical MDS" algorithm (1952)
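A sketch of Torgerson's classical MDS in numpy, written from the standard textbook description rather than taken from the slide itself: double-center the squared distance matrix, then take the top-K eigenvectors.

```python
# Sketch: classical MDS recovers K-dimensional coordinates from a distance matrix D.
import numpy as np

def classical_mds(D, K=2):
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:K]       # K largest eigenvalues
    L = np.sqrt(np.maximum(eigvals[top], 0))  # clip tiny negatives from numerical noise
    return eigvecs[:, top] * L                # M x K coordinates
```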
  36. 36. MDS Stress minimization The formula actually minimizes "stress": stress(x) = Σ_{i,j} ( |x_i − x_j| − d_ij )². Think of "springs" between every pair of points. Spring between x_i, x_j has rest length d_ij. Stress is zero if all high-dimensional distances are matched exactly in low dimension.
  37. 37. Multi-dimensional Scaling Like "flattening" a stretchy structure into 2D, so that distances between points are preserved (as much as possible).
  38. 38. House of Lords MDS plot
  39. 39. Robustness of results Regarding these analyses of legislative voting, we could still ask: • Are we modeling the right thing? (What about other legislative work, e.g. in committee?) • Are our underlying assumptions correct? (do representatives really have “ideal points” in a preference space?) • What are we trying to argue? What will be the effect of pointing out this result?
  40. 40. Text Analysis in Journalism
  41. 41. Count incident types by date. For Level 14, ProPublica, 2015
  42. 42. The Child Exchange, Reuters, 2014
  43. 43. USA Today/Twitter Political Issues Index
  44. 44. Politico analysis of GOP primary, 2012
  45. 45. The Post obtained draft versions of 12 audits by the inspector general’s office, covering projects from the Caribbean to Pakistan to the Republic of Georgia between 2011 and 2013. The drafts are confidential and rarely become public. The Post compared the drafts with the final reports published by the inspector general’s office and interviewed former and current employees. E-mails and other internal records also were reviewed. The Post tracked changes in the language that auditors used to describe USAID and its mission offices. The analysis found that more than 400 negative references were removed from the audits between the draft and final versions. Sentiment analysis used by Washington Post, 2014
  46. 46. LAPD Underreported Serious Assaults, Skewing Crime Stats for 8 Years Los Angeles Times
  47. 47. The Times analyzed Los Angeles Police Department violent crime data from 2005 to 2012. Our analysis found that the Los Angeles Police Department misclassified an estimated 14,000 serious assaults as minor offenses, artificially lowering the city’s crime levels. To conduct the analysis, The Times used an algorithm that combined two machine learning classifiers. Each classifier read in a brief description of the crime, which it used to determine if it was a minor or serious assault. An example of a minor assault reads: "VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE.”
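The Times combined two machine-learning classifiers; as a rough illustration of the general approach only (not the Times' actual pipeline), a single TF-IDF + logistic-regression classifier over crime descriptions might look like this, with hypothetical placeholder training data:

```python
# Illustrative stand-in, NOT the LA Times' actual method: label a crime
# description as "serious" vs. "minor" from its text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP THEN BEGAN HITTING VICTS IN THE FACE",
    "SUSP STABBED VICT WITH KNIFE CAUSING SERIOUS INJURY",   # hypothetical example
]
labels = ["minor", "serious"]                                # hypothetical training labels

clf = make_pipeline(TfidfVectorizer(lowercase=True), LogisticRegression())
clf.fit(descriptions, labels)
print(clf.predict(["SUSP PUNCHED VICT DURING ARGUMENT"]))    # predicted label
```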
  48. 48. We used a machine-learning method known as latent Dirichlet allocation to identify the topics in all 14,400 petitions and to then categorize the briefs. This enabled us to identify which lawyers did which kind of work for which sorts of petitioners. For example, in cases where workers sue their employers, the lawyers most successful getting cases before the court were far more likely to represent the employers rather than the employees. The Echo Chamber, Reuters
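A minimal sketch of latent Dirichlet allocation with scikit-learn (not Reuters' actual code; the `petitions` list here is a tiny placeholder, whereas the real analysis covered 14,400 petitions and many more topics):

```python
# Sketch: LDA topic modeling over a hypothetical list of petition texts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

petitions = ["worker sues employer over unpaid wages",
             "patent dispute between technology firms"]      # placeholder documents

counts = CountVectorizer(stop_words="english").fit_transform(petitions)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # more topics in real use
doc_topics = lda.fit_transform(counts)    # each row: topic mixture for one petition
```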
  49. 49. Document Vector Space Model
  50. 50. Message Machine clusters emails Using TF-IDF document vectors
  51. 51. Document vectors in journalism - Text clustering for stories, e.g. Message Machine - Find “key words” or “most important words” - Topic analysis, e.g. ProPublica’s legislative tracker - Key component of filtering algorithms, e.g. Google News - Standard representation for document classification. - Basis of all text search engines. A text analysis building block.
  52. 52. What is this document "about"? Most commonly occurring words a pretty good indicator. 30 the 23 to 19 and 19 a 18 animal 17 cruelty 15 of 15 crimes 14 in 14 for 11 that 8 crime 7 we
  53. 53. Features = words works fine Encode each document as the list of words it contains. Dimensions = vocabulary of document set. Value on each dimension = # of times word appears in document
  54. 54. Example D1 = “I like databases” D2 = “I hate hate databases” Each row = document vector All rows = term-document matrix Individual entry = tf(t,d) = “term frequency”
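A sketch of building this term-document matrix with scikit-learn's CountVectorizer (a custom token pattern is assumed here so the one-letter word "I" is kept; the default pattern would drop it):

```python
# Sketch: term-document matrix for the two example documents.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like databases", "I hate hate databases"]
vec = CountVectorizer(lowercase=True, token_pattern=r"[^\s]+")  # keep single-letter tokens
tf = vec.fit_transform(docs)

print(vec.get_feature_names_out())   # vocabulary = dimensions (requires sklearn >= 1.0)
print(tf.toarray())                  # rows = document vectors, entries = tf(t, d)
```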
  55. 55. Aka “Bag of words” model Throws out word order. e.g. “soldiers shot civilians” and “civilians shot soldiers” encoded identically.
  56. 56. Tokenization The documents come to us as long strings, not individual words. Tokenization is the process of converting the string into individual words, or "tokens." For this course, we will assume a very simple strategy: o convert all letters to lowercase o remove all punctuation characters o separate words based on spaces Note that this won't work at all for Chinese. It will fail in many ways even for English. How?
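One way to write this simple strategy as a Python function (a sketch of the three steps above, nothing more):

```python
# Sketch of the slide's simple tokenization strategy.
import string

def tokenize(text):
    text = text.lower()                                                  # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))    # strip punctuation
    return text.split()                                                  # split on spaces

tokenize("Victims and suspects became involved in a verbal argument.")
# -> ['victims', 'and', 'suspects', 'became', 'involved', 'in', 'a', 'verbal', 'argument']
```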
  57. 57. Distance metric for text Useful for: • clustering documents • finding docs similar to example • matching a search query Basic idea: look for overlapping terms
  58. 58. Cosine similarity Given document vectors a, b define similarity(a,b) ≡ a·b. If each word occurs exactly once in each document, equivalent to counting overlapping words. Note: not a distance function, as similarity increases when documents are… similar. (What part of the definition of a distance function is violated here?)
  59. 59. Problem: long documents always win Let a = "This car runs fast." Let b = "My car is old. I want a new car, a shiny car" Let query = "fast car"
         this car runs fast my is old I want a new shiny
      a   1    1    1    1   0  0   0  0   0  0  0    0
      b   0    3    0    0   1  1   1  1   1  1  1    1
      q   0    1    0    1   0  0   0  0   0  0  0    0
  60. 60. similarity(a,q) = 1*1 [car] + 1*1 [fast] = 2 similarity(b,q) = 3*1 [car] + 0*1 [fast] = 3 Longer document more “similar”, by virtue of repeating words. Problem: long documents always win
  61. 61. Normalize document vectors similarity(a,b) ≡ (a·b) / (|a| |b|) = cos(Θ). Returns result in [0,1].
  62. 62. Normalized query example (same a, b, q as above): similarity(a,q) = 2 / (√4 · √2) = 1/√2 ≈ 0.707; similarity(b,q) = 3 / (√17 · √2) ≈ 0.514.
  63. 63. Cosine similarity cos θ = similarity(a,b) ≡ (a·b) / (|a| |b|)
  64. 64. Cosine distance (finally) dist(a,b) ≡ 1 − (a·b) / (|a| |b|)
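A short sketch that reproduces the query example above with normalized cosine similarity and the corresponding cosine distance:

```python
# Sketch: cosine similarity / distance on the "fast car" query example.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)

#             this car runs fast my is old I want a new shiny
a = np.array([  1,   1,   1,   1,  0,  0,  0, 0,  0,  0,  0,   0])
b = np.array([  0,   3,   0,   0,  1,  1,  1, 1,  1,  1,  1,   1])
q = np.array([  0,   1,   0,   1,  0,  0,  0, 0,  0,  0,  0,   0])

print(cosine_similarity(a, q))   # ~0.707: the short document now beats the long one
print(cosine_similarity(b, q))   # ~0.514
print(cosine_distance(a, q))     # ~0.293
```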
  65. 65. Problem: common words We want to look at words that “discriminate” among documents. Stopwords: if all documents contain “the,” are all documents similar? Common words: if most documents contain “car” then car doesn’t tell us much about (contextual) similarity.
  66. 66. Context matters [Diagram: two document sets, General News vs. Car Reviews; markers show which documents contain "car" and which do not.]
  67. 67. Document Frequency Idea: de-weight common words. Common = appears in many documents. "Document frequency" = fraction of docs containing term: df(t,D) = |{d ∈ D : t ∈ d}| / |D|
  68. 68. Inverse Document Frequency Invert (so more common = smaller weight) and take log: idf(t,D) = log( |D| / |{d ∈ D : t ∈ d}| )
  69. 69. TF-IDF Multiply term frequency by inverse document frequency. n(t,d) = number of times term t appears in doc d; n(t,D) = number of docs in D containing t. tfidf(t,d,D) = tf(t,d) × idf(t,D) = n(t,d) × log( |D| / n(t,D) )
  70. 70. TF-IDF depends on entire corpus The TF-IDF vector for a document changes if we add another document to the corpus. TF-IDF is sensitive to context, and the context is all other documents: in tfidf(t,d,D) = tf(t,d) × idf(t,D), if we add a document, D changes!
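A from-scratch sketch of exactly this formula, tfidf(t,d,D) = n(t,d) × log(|D| / n(t,D)), on a tiny made-up corpus; adding or removing a document changes |D| and the document frequencies, so every vector changes with it:

```python
# Sketch: compute tf-idf vectors for a list of already-tokenized documents.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf-idf score} dict per doc."""
    N = len(docs)                                       # |D|
    df = Counter(t for d in docs for t in set(d))       # n(t, D): docs containing t
    return [{t: n * math.log(N / df[t]) for t, n in Counter(d).items()}
            for d in docs]

docs = [["animal", "cruelty", "crimes"],                # toy corpus
        ["crimes", "reporting"],
        ["animal", "trends"]]
for vec in tfidf_vectors(docs):
    print(sorted(vec.items(), key=lambda kv: -kv[1]))   # highest-scoring words first
```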
  71. 71. What is this document "about"? Each document is now a vector of TF-IDF scores for every word in the document. We can look at which words have the top scores. crimes 0.0675591652263963 cruelty 0.0585772393867342 crime 0.0257614113616027 reporting 0.0208838148975406 animals 0.0179258756717422 michael 0.0156575858658684 category 0.0154564813388897 commit 0.0137447439653709 criminal 0.0134312894429112 societal 0.0124164973052386 trends 0.0119505837811614 conviction 0.0115699047136248 patterns 0.011248045148093
  72. 72. Salton’s description of tf-idf - from Salton et al, A Vector Space Model for Automatic Indexing, 1975
  73. 73. TF vs. TF-IDF clusterings of the nj-sentator-menendez corpus (Overview sample files); color = human tags generated from TF-IDF clusters
  74. 74. Cluster Hypothesis “documents in the same cluster behave similarly with respect to relevance to information needs” - Manning, Raghavan, Schütze, Introduction to Information Retrieval Not really a precise statement – but the crucial link between human semantics and mathematical properties. Articulated as early as 1971, has been shown to hold at web scale, widely assumed.
  75. 75. Bag of words + TF-IDF widely used Practical win: good precision-recall metrics in tests with human-tagged document sets. Still the dominant text indexing scheme used today. (Lucene, FAST, Google…) Many variants and extensions. Some, but not much, theory to explain why this works. (E.g. why that particular IDF formula? why doesn’t indexing bigrams improve performance?) Collectively: the vector space document model
