2. This class
• The Overview document mining platform
• Newsblaster
• Topic models
• Word Embeddings
• Word counting again
• Hard NLP problems in Journalism
9. Used Overview’s “topic tree” (TF-IDF clustering) to find a group of key
emails from a listserv. Summer 2013.
10. Technical troubles with a new system meant that almost 70,000 North
Carolina residents received their food stamps late this summer. That’s 8.5
percent of the clients the state serves every month. The
problem was eventually traced to web browser compatibility issues. WRAL
reporter Tyler Dukes obtained 4,500 pages of emails — on paper — from
various government departments and used DocumentCloud and Overview to
piece together this story.
https://blog.overviewdocs.com/completed-stories/
17. Scrape
Handcrafted list of source URLs (news front pages), with links followed
to depth 4
“For each page examined, if the amount of text in the largest cell of the
page (after stripping tags and links) is greater than some particular
constant (currently 512 characters), it is assumed to be a news article,
and this text is extracted.”
(At least it’s simple. This was 2002. How often does this work now?)
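A minimal sketch of that heuristic in Python, assuming the requests and beautifulsoup4 packages; the 512-character constant comes from the quote above, and the table-cell assumption reflects 2002-era page layout:

# Sketch of the extraction heuristic quoted above.
import requests
from bs4 import BeautifulSoup

MIN_ARTICLE_CHARS = 512   # the constant from the quote

def extract_article(url):
    """Return the text of the page's largest cell if it looks like an article."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for a in soup.find_all("a"):                 # strip links, per the quote
        a.decompose()
    cells = soup.find_all("td") or [soup.body or soup]   # 2002-era table layout
    biggest = max(cells, key=lambda c: len(c.get_text(strip=True)))
    text = biggest.get_text(" ", strip=True)
    return text if len(text) >= MIN_ARTICLE_CHARS else None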
21. But news is an on-line problem...
Articles arrive one at a time, and must be clustered immediately.
Can’t look forward in time, can’t go back and reassign.
Greedy algorithm.
22. Single pass clustering
put first story in its own cluster
repeat:
    get next story S
    look for a cluster C with distance(S, C) < T
    if found:
        put S in C
    else:
        put S in new cluster
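A minimal runnable sketch of this loop, representing each cluster by its TF-IDF centroid. The threshold T is illustrative, and fitting the vectorizer on the whole batch is a simplification: a true online system would have to fix its vocabulary and IDF weights in advance.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def single_pass_cluster(stories, T=0.7):
    X = TfidfVectorizer(stop_words="english").fit_transform(stories).toarray()
    centroids, clusters = [], []               # parallel lists
    for i, x in enumerate(X):
        if centroids:
            d = cosine_distances(x.reshape(1, -1), np.vstack(centroids))[0]
            best = int(d.argmin())
            if d[best] < T:                    # close enough: join cluster
                clusters[best].append(i)
                centroids[best] = X[clusters[best]].mean(axis=0)
                continue
        centroids.append(x)                    # otherwise start a new cluster
        clusters.append([i])
    return clusters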
23. Now sort events into categories
Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.
Primitive operation: what topic is this story in?
24. TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate. Story category =
closest point.
(Diagram: “finance” and “world” category points in TF-IDF space, with the
latest story plotted nearest one of them.)
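A small sketch of the primitive, assuming the category centroids have already been computed (say, averaged from hand-labeled stories); the names and dict layout here are illustrative:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def categorize(story_vec, category_centroids):
    """category_centroids: dict of name -> 1-D TF-IDF vector."""
    names = list(category_centroids)
    M = np.vstack([category_centroids[n] for n in names])
    sims = cosine_similarity(story_vec.reshape(1, -1), M)[0]
    return names[int(sims.argmax())]           # e.g. "finance" or "world"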
25. Cluster summarization
Problem: given a set of documents, write a sentence summarizing
them.
An active research area in AI, with recent progress from recurrent neural
network techniques. But the initial algorithms go back to the 1950s.
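The 1950s reference is to frequency-based extraction (Luhn's auto-abstracts). A minimal sketch in that spirit: score each sentence by how many of the document set's frequent content words it contains, and return the best one. The stopword list, keyword cutoff, and regex tokenizer are all illustrative.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on"}

def summarize(documents, n_keywords=20):
    text = " ".join(documents)
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    keywords = {w for w, _ in Counter(words).most_common(n_keywords)}
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return max(sentences,
               key=lambda s: sum(w in keywords
                                 for w in re.findall(r"[a-z']+", s.lower())))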
28. We used a machine-learning method
known as latent Dirichlet allocation to
identify the topics in all 14,400 petitions
and to then categorize the briefs. This
enabled us to identify which lawyers did
which kind of work for which sorts of
petitioners. For example, in cases where
workers sue their employers, the lawyers
most successful getting cases before the
court were far more likely to represent the
employers rather than the employees.
The Echo Chamber, Reuters
29. Problem Statement
Can the computer tell us the “topics” in a document set? Can the computer
organize the documents by “topic”?
Note: TF-IDF tells us the topics of a single document, but here we want topics
of an entire document set.
30. One simple technique
Sum TF-IDF scores for each word across the entire document set, then choose
the top-ranking words.
Cluster descriptions in the Overview prototype were generated this way.
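A minimal sketch of this with scikit-learn (parameter choices are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(documents, n=10):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(documents)       # docs x terms, sparse
    totals = X.sum(axis=0).A1              # summed TF-IDF per term
    terms = vec.get_feature_names_out()
    return sorted(zip(terms, totals), key=lambda t: -t[1])[:n]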
31. Topic Modeling Algorithms
Basic idea: reduce the dimensionality of the document vector space, so that
each dimension is a topic.
Each document is then a vector of topic weights. We want to figure out which
dimensions and weights give a good approximation of the full set of words in
each document.
Many variants: LSI, PLSI, LDA, NMF
34. Non-negative Matrix Factorization
Factors the document-term matrix V into V ≈ WH.
All elements of the document coordinate matrix W and the topic matrix H
must be ≥ 0.
Simple iterative algorithm to compute.
Still have to choose the number of topics r.
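A sketch using scikit-learn's NMF on the TF-IDF matrix, reading the top words for each topic off the rows of H; r and the init choice are illustrative:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def nmf_topics(documents, r=10, top_n=8):
    vec = TfidfVectorizer(stop_words="english")
    V = vec.fit_transform(documents)       # all elements >= 0 already
    model = NMF(n_components=r, init="nndsvd", random_state=0)
    W = model.fit_transform(V)             # docs x r: document coordinates
    H = model.components_                  # r x vocabulary: topic-word weights
    terms = vec.get_feature_names_out()
    return [[terms[i] for i in row.argsort()[::-1][:top_n]] for row in H]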
36. Latent Dirichlet Allocation
Imagine that each document is written by someone going through the
following process:
1. For each doc d, choose mixture of topics p(z|d)
2. For each word w in d, choose a topic z from p(z|d)
3. Then choose word from p(w|z)
A document is a distribution over topics.
Each topic is a distribution over words.
LDA tries to find these two sets of distributions.
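A small numpy simulation of this generative story; the number of topics, vocabulary size, and Dirichlet concentrations are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
k, vocab, alpha, beta = 5, 1000, 0.1, 0.01
topic_words = rng.dirichlet([beta] * vocab, size=k)        # p(w|z) for each topic

def generate_doc(length=100):
    doc_topics = rng.dirichlet([alpha] * k)                # 1. choose p(z|d)
    words = []
    for _ in range(length):
        z = rng.choice(k, p=doc_topics)                    # 2. topic for this word
        words.append(rng.choice(vocab, p=topic_words[z]))  # 3. word from p(w|z)
    return words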
38. "Topics"
LDA models a topic as a distribution over all the words in the corpus. In
each topic, some words are more likely, some are less likely.
41. LDA Plate Notation
(Plate diagram: D docs, N words per doc, K topics. Variables shown: the
topic mixture of each doc, the topic assigned to each word, the word
distribution of each topic, and the topic and word concentration
parameters.)
42. Computing LDA
Inputs:
    word[d][i]   document words
    k            # topics
    a            doc topic concentration
    b            topic word concentration
Also:
    n            # docs
    len[d]       # words in document d
    v            vocabulary size
46. Update topics
// for each word in each document, sample a new topic
for d = 1..n
    for i = 1..len[d]
        w = word[d][i]
        for t = 1..k
            p[t] = doc_topics[d][t] * topic_words[t][w]
        topics[d][i] = sample from p
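The same update in numpy, with the inner loop over topics vectorized. As above, this is just the per-word resampling step given the current distributions, not a complete sampler. doc_topics is n x k, topic_words is k x v.

import numpy as np

rng = np.random.default_rng()

def update_topics(word, doc_topics, topic_words, topics):
    for d in range(len(word)):
        for i, w in enumerate(word[d]):
            p = doc_topics[d] * topic_words[:, w]   # p[t] for all k topics at once
            p /= p.sum()                            # normalize before sampling
            topics[d][i] = rng.choice(len(p), p=p)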
47. Dimensionality reduction
Output of NMF and LDA is a vector of much lower dimension for each
document. ("Document coordinates in topic space.")
Dimensions are “concepts” or “topics” instead of words.
Can measure cosine distance, cluster, etc. in this new space.
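For example, a sketch of clustering in the reduced space, using a W matrix like the one from the NMF sketch above; the cluster count is illustrative:

from sklearn.cluster import KMeans

def cluster_docs(W, n_clusters=6):
    """W: docs x topics coordinates from NMF or LDA."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(W)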
50. More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked, Jeff
Kao
51. Word vectors with semantics
A word embedding is a function f(w) that maps a string (a word) to a vector
with ~200 dimensions.
We hope that words with similar meanings have similar vectors. Then we
can do things like:
- Compare the meaning of two words
- Compare the meaning of two sentences
- Classify words/sentences by topic
- Implement semantic search
…basically everything we use TF-IDF for.
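A sketch of the first two uses with gensim's pretrained vectors. The model name is one of several sets gensim can download (100 dimensions rather than ~200), and averaging word vectors is a common but crude sentence baseline:

import numpy as np
import gensim.downloader

wv = gensim.downloader.load("glove-wiki-gigaword-100")

print(wv.similarity("senate", "congress"))    # cosine similarity of two words

def sentence_vector(sentence):
    # crude sentence meaning: average the word vectors
    words = [w for w in sentence.lower().split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0)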
52. Distributional Semantics
The distributional hypothesis in linguistics is derived from the semantic
theory of language usage, i.e. words that are used and occur in the same
contexts tend to have similar meanings. The underlying idea that "a word
is characterized by the company it keeps" was popularized by Firth.
- Wikipedia
54. Train network to predict word from context
Word2Vec Tutorial - The Skip-Gram Model, Chris McCormick
55. Captures word use semantics
king – man + woman = queen
paris – france + poland = warsaw
Capturing semantic meanings using deep learning, Lior Shkiller
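The same analogies as vector arithmetic, via gensim's most_similar and the wv model loaded in the previous sketch; exact answers depend on the pretrained embedding, but typical models return the results above:

print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ...)]
print(wv.most_similar(positive=["paris", "poland"], negative=["france"], topn=1))
# typically [('warsaw', ...)]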
56. Man is to Computer Programmer as Woman is to Homemaker? Debiasing
Word Embeddings, Bolukbasi et al.
57. More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked, Jeff
Kao
62. Comparing two document sets
Let me talk about Downton Abbey for a minute. The show's popularity has
led many nitpickers to draft up lists of mistakes. ... But all of these have
relied, so far as I can tell, on finding a phrase or two that sounds a bit off,
and checking the online sources for earliest use.
I lack such social graces. So I thought: why not just check every single line
in the show for historical accuracy? ... So I found some copies of the
Downton Abbey scripts online, and fed every single two-word phrase
through the Google Ngram database to see how characteristic of the
English Language, c. 1917, Downton Abbey really is.
- Ben Schmidt, Making Downton more traditional
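A sketch of the approach: extract every two-word phrase from the scripts and look each one up in the Google Books Ngram Viewer. The /ngrams/json endpoint is unofficial and undocumented, so the corpus name and parameter values here are assumptions:

import re
import requests

def bigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(pair) for pair in zip(words, words[1:])}

def ngram_frequency(phrase, year_start=1912, year_end=1921):
    r = requests.get("https://books.google.com/ngrams/json",
                     params={"content": phrase, "year_start": year_start,
                             "year_end": year_end, "corpus": "en-2019",
                             "smoothing": 0}, timeout=10)
    data = r.json()
    return data[0]["timeseries"] if data else None   # empty = not in the corpus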
63. Bigrams that do not appear in English books between 1912 and
1921.
64. Bigrams that are at least 100 times more common today than
they were in 1912-1921
67. What is this investigative journalist doing with documents?
68. A number of previous tools aim to help the user “explore” a
document collection (such as [6, 9, 10, 12]), though few of
these tools have been evaluated with users from a specific
target domain who bring their own data, making us suspect
that this imprecise term often masks a lack of understanding
of actual user tasks.
Overview: The Design, Adoption, and Analysis of a Visual Document
Mining Tool for Investigative Journalists, Brehmer et al., 2014
73. Data is insanely contextual
VICTS AND SUSPS BECAME INV IN VERBA ARGUMENT SUSP
THEN BEGAN HITTING VICTS IN THE FACE
Typical incident description processed by LA Times crime classifier
74. If your algorithm isn’t robust to input noise, don’t even bother
Data is insanely dirty – OCR and more
75. Most NER implementations have low recall (~70%)
Investigations would prefer higher recall, lower precision
(Chart: entities found, out of 150, per NER implementation.)
Entity recognition is not solved!
77. Suffolk County public safety committee transcript;
reference to a body left on the street due to a union dispute
(via Adam Playford, Newsday, 2014)
Text search may not find the target.
79. A document describing a modification to one of the loans used to finance the Trump Soho
hotel (New York City ACRIS document 2006083000784001)
80. Excerpt of the hand-built chronological list of New York City real estate public records
concerning the Trump Soho hotel. Color coding indicates documents on the same date
(Giannina Segnini / Columbia Journalism School)
Editor's Notes
To open:
Media cloud
https://mediacloud.org/
Google ngram
https://books.google.com/ngrams