An Introduction to Text Mining
Bettina Berendt
Department of Computer Science
KU Leuven, Belgium
http://people.cs.kuleuven.be/~bettina.berendt/
Vienna Summer School on Digital Humanities
July 7th, 2015, Vienna, Austria
Starting questions for today
• What is a text?
• What questions can we ask of a text?
• What kind of answers "make us happy"?
Some answers that would make you happy, and
how (semi-)automatic text analysis could help
• Author
▫ Usually a metadatum that is extracted from the metadata set
▫ Can also be an inference: “Can you find out who is the author of this text?”
 This is a text-mining task that has been studied for example in online texts (one subtype of the de-anonymization problem)
• Genre
▫ A text-mining classification task (given a text, classify it into one from a list of genres)
• Style
▫ Same (stylometry classification)
• Statement / summary
▫ Text mining task “summarization“ (e.g. of news texts)
• Content
▫ The most typical text mining task: identify topics, classify into a content class, ...
• Function, intention
▫ (I‘m still not quite sure what this ... So this is for a future summer school ;-) )
• Sequence of signs
1. Sequential analysis of texts is common (e.g. in co-occurrence and collocation analysis)
2. Signs (in the sense of arbitrary words standing for concepts): a key element of theories of semantics,
e.g. in the Semantic Web and Linked Open Data (e.g. DBpedia: a concept network version of Wikipedia)
– are used in text mining, for example for improving topic modelling and classification
July 9th 2015: I'll still add references to these, but wanted to get the slides out to you already!
Motivation (1)
Motivation (2)
Goals and non-goals
• Goals
▫ Understand the basic ideas of data mining
▫ Understand how computer-scientist text miners approach texts
▫ Compare it with your own approaches
▫ Learn about some pitfalls and encourage a critical view
▫ Get your hands on some tools and real data
▫ Have an overview of other necessary steps (such as pre-processing) that take
too much time to be included in this course
▫ Have pointers for inquiring and going further
• Non-goals (selection)
▫ the statistical background of methods
▫ a comprehensive overview of the state-of-the-art of text mining methods
▫ a comprehensive overview of the state-of-the-art of text mining applications in
the digital humanities or social or behavioural sciences
▫ an introduction to big data computing or big DH infrastructures
A more modest goal than revolutionising
knowledge as such?!
“As long as there have been books there
have been more books than you could
read. … Knowing how to "not-read" is
just as important as knowing how to
read”
(Mueller, 2007).
“data mining and machine learning are
best understood in terms of
“provocation”—the potential for outlier
results to surprise a reader into
attending to some aspect of a text not
previously deemed significant—as well as
“not-reading” or “distant reading,” the
automated search for patterns across a
much wider corpus than could be read
and assimilated via traditional
humanistic methods of “close reading.””
(Kirschenbaum, 2007)
and/or its
older brother:
Information
retrieval
Origins of text mining. Or: What is a
text for information retrieval?
Let‘s do some reverse engineering ...
Words, source relevance, and
personalization
Words and knowledge bases (1)
Metadata
as output
Knowledge-based text processing (2)
Metadata as
input?
Requires
different search
interfaces!
PS: the ranking includes network
analytics (→ Thursday)
PS: the ranking also includes adaptation;
here: relevance feedback
Trending topics: a form of summarization
Finding “similar“ texts: Clustering
(example Google News)
Going further: What topics exist in a collection of
texts, and how do they evolve?
News texts, scientific
publications, …
Mei & Zhai (2005)
Guiding questions
• Information retrieval:
▫ Given the current user‘s information need, which are the most
relevant documents?
• Text mining:
▫ What do the documents tell us? What‘s in the texts? What can
we learn about the texts, their authors, ...
▫ Many different subquestions
▫ Summarization (of one text, of many texts) is just one of them
• Cf.
▫ “Distant reading“ (Moretti)
 understanding literature not by studying particular texts, but by
aggregating and analyzing massive amounts of data.
▫ “Machine reading“ (UCL Machine Reading Group)
 machines that can read and "understand" this textual
information, converting it into interpretable structured
knowledge to be leveraged by humans and other machines alike
Speed-reading
(Woody Allen)
I took a course in speed reading and was able to
read War and Peace in twenty minutes.
It's about Russia.
... also quoted differently:
I took a speed reading course and read 'War and
Peace' in twenty minutes. It involves Russia.
A personal “experiment“
- deliberately a bit silly, more a gentle introduction to a great tool
and to some pitfalls of “distant reading“
(I haven‘t read War and Peace yet.)
Speed-reading with word clouds:
The Voyant tool (single-digit number of seconds)
Note about „said“:
Compare Joyce‘s
Dubliners
Word frequencies vs.
Woody Allen
Can we find out more about the 3?
Double-check in Wikipedia
(method: string search)
• Count Pyotr Kirillovich (Pierre) Bezukhov: The large-bodied, ungainly, and
socially awkward illegitimate son of an old Russian grandee. Pierre,
educated abroad, returns to Russia as a misfit. His unexpected inheritance
of a large fortune makes him socially desirable. Pierre is the central
character and often a voice for Tolstoy's own beliefs or struggles.
• Prince Andrey Nikolayevich Bolkonsky: A strong but skeptical, thoughtful
and philosophical aide-de-camp in the Napoleonic Wars.
▫ Some searching needed ... Andrew ... Andrei ... Andrey
• Countess Natalya Ilyinichna (Natasha) Rostova: A central character,
introduced as "not pretty but full of life" and a romantic young girl,
although impulsive and highly strung, she evolves through trials and
suffering and eventually finds happiness. She is an accomplished singer
and dancer.
• ...
• Prince Anatole Vasilyevich Kuragin: Hélène's brother and a very handsome
and amoral pleasure seeker who is secretly married yet tries to elope with
Natasha Rostova.
• Vasily Dmitrich Denisov: Nikolai Rostov's friend and brother officer, who
proposes to Natasha.
From Wikipedia‘s plot summary
(method: string search)
• ...
• Natasha is convinced that she loves Anatole and writes to Princess Maria, Andrei's
sister, breaking off her engagement [with Andrei]. At the last moment, Sonya
discovers her plans to elope and foils them. Pierre is initially horrified by Natasha's
behavior, but realizes he has fallen in love with her. During the time when the
Great Comet of 1811–2 streaks the sky, life appears to begin anew for Pierre.
• Prince Andrei coldly accepts Natasha's breaking of the engagement. He tells Pierre
that his pride will not allow him to renew his proposal. Ashamed, Natasha makes a
suicide attempt and is left seriously ill.
• ...
• Having lost all will to live, [Andrei] forgives Natasha in a last act before dying.
• Pierre's wife Hélène dies from an overdose of abortion medication (Tolstoy does not
state it explicitly but the euphemism he uses is unambiguous). Pierre is reunited
with Natasha, while the victorious Russians rebuild Moscow. Natasha speaks of
Prince Andrei's death and Pierre of Karataev's. Both are aware of a growing bond
between them in their bereavement. With the help of Princess Maria, Pierre finds
love at last and, revealing his love after being released by his former wife's death,
marries Natasha.
Total time: 29 mins since creation of word cloud,
17 mins since creation of Pierre-Natasha-Andrew chart
(includes making these slides for you)
Questions
• How much of this was “really automatic“?
• What existing knowledge (in my head and in
others‘) went into this analysis,
• and how?
• Can you think of another reason why this
(deliberately) turned out silly?
More interesting / serious examples (1)
(from the summer school participants)
• Analysis of first-person shooter (“ego-shooter“) missions (thanks to
Kathrin Trattner)
Comment B. Berendt – compare this with an earlier text-mining
analysis of reporting on the same events by CNN in comparison
with Al Jazeera
• See next slide
Unsupervised learning of bias
Nearest neighbour / best reciprocal hit
for document matching;
Kernel Canonical Correlation Analysis
and vector operations
for finding topics and characteristic keywords
[Fortuna, Galleguillos, & Cristianini, 2009]
What characterizes different news sources?
Additional information from Sentistrength analysis
More interesting / serious examples (2)
(from the summer school participants)
• Analysis of ideological documents:
• Charter of Hamas (thanks to Alexandra
Preitschopf):
• analyses word usage and – interestingly – also the
absence of specific words
▫ Note: this shows clearly why we need domain
knowledge to interpret frequencies!
• It also shows the difficulties of using sentiment
analysis when the real object of analysis is
opinion/bias.
• For more details, see her presentation (also linked
on my Summer School Web page)
More interesting / serious examples (3)
(from the summer school participants)
• Joseph Goebbels‘ Sportpalast
speech (a famous propaganda
speech from 1943: “Do you
want the total war?“)
• frequencies of negatively
connoted words
(“bolshevism“, “judaism / the
Jews“) vs. positively
connoted words (“Germans“)
suggest:
▫ The speech starts with a threat
scenario and ends with a
positive vision of the future
• Remark B. Berendt: This is
borne out by reading the full
text, and it is also a classical
rhetorical structure.
Text from:
http://www.1000dokumente.de/index.html?c=dokument_de&dokument=0200_goe&object=translation&st=&l=de
More interesting / serious examples (4)
(from others)
Examples of Voyant in Research:
http://docs.voyant-tools.org/about/examples-gallery/
Some formalism: the vector-space model of text (basic
model used in information retrieval and text mining)
▫ Basic idea:
 Keywords are extracted from texts.
 These keywords describe the (usually) topical content
of Web pages and other text contributions.
▫ Based on the vector space model of document
collections:
 Each unique word in a corpus of Web pages = one
dimension
 Each page(view) is a vector with non-zero weight for
each word in that page(view), zero weight for other
words
 Words become “features” (in a data-mining sense)
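The vector-space idea on this slide can be sketched in a few lines of Python. This is only an illustration: the toy corpus and the function names (tokenize, build_vocabulary, to_vector) are invented here, and the whitespace tokenizer is deliberately crude.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

def build_vocabulary(documents):
    """Each unique word in the corpus becomes one dimension."""
    return sorted({word for doc in documents for word in tokenize(doc)})

def to_vector(text, vocabulary):
    """Raw term-frequency vector: zero weight for words not in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "nova galaxy heat",
    "film role actor film",
    "galaxy film",
]
vocabulary = build_vocabulary(corpus)
vectors = [to_vector(doc, vocabulary) for doc in corpus]
```

Each entry of vectors lines up with vocabulary, so the words have become "features" in the data-mining sense.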
• Starting point is the raw term frequency as term weights
• Other weighting schemes can generally be obtained by applying various
transformations to the document vectors
nova galaxy heat actor film role diet
A 1.0 0.5 0.3
B 0.5 1.0
C 0.4 1.0 0.8 0.7
D 0.9 1.0 0.5
E 0.5 0.7 0.9
F 0.6 1.0 0.3 0.2 0.8
Document Ids
a document
vector
Features
Document Representation as Vectors
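As one possible example of a "transformation" of raw term-frequency vectors, the sketch below applies TF-IDF weighting. The slide does not name a specific scheme, so the choice of TF-IDF, the idf variant log(N / df), and the toy counts are all assumptions made for illustration.

```python
import math

def tf_idf(term_frequencies):
    """Reweight raw term-frequency vectors by inverse document frequency.

    term_frequencies: list of equal-length lists, one row per document.
    Uses idf = log(N / df), one of several common variants.
    """
    n_docs = len(term_frequencies)
    n_terms = len(term_frequencies[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in term_frequencies if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [[tf * idf[j] for j, tf in enumerate(row)] for row in term_frequencies]

# Hypothetical counts: 3 documents (rows), 3 terms (columns).
raw = [
    [2, 1, 0],
    [0, 1, 0],
    [0, 1, 3],
]
weighted = tf_idf(raw)
```

The middle term occurs in every document, so its weight drops to zero: it does not help to tell the documents apart.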
Other features (usually metadata of
different sorts) can be added
• Tags or other categories
• Special content (e.g. URLs, images, Twitter
mentions)
• Source
• Number of followers of source
• ...
https://aeshin.org/textmining/
The idea of text mining ...
• ... is to go beyond frequency-counting
• ... is to go beyond the search-for-documents
framework
• ... is to find patterns (of meaning) within and
especially across documents
• (but boundaries are not fixed)
Data mining
(aka Knowledge Discovery)
The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data
(Fayyad, Piatetsky-Shapiro, Smyth, 1996)
CRISP-DM: Cross-Industry Standard
Process for Data Mining
The steps of text mining
1. Application understanding
2. Corpus generation
3. Data understanding
4. Text preprocessing
5. Search for patterns / modelling
 Topical analysis
 Sentiment analysis / opinion mining
6. Evaluation
7. Deployment
Application understanding; Corpus
generation
▫ What is the question?
▫ What is the context?
▫ What could be interesting sources, and where
can they be found?
▫ Use an existing corpus
▫ Crawl
▫ Use a search engine and/or archive and/or API
▫ Get help!
Preprocessing (1)
• Data cleaning
▫ Goal: get clean ASCII text
▫ Remove HTML markup*, pictures,
advertisements, ...
▫ Automate this: wrapper induction
* Note: HTML markup may carry information too (e.g., <b> or <h1> marks
something important), which can be extracted! (Depends on the
application)
Preprocessing (2)
• Goal: get processable lexical / syntactical units
• Tokenize (find word boundaries)
• Lemmatize / stem
▫ e.g. buyers, buyer → buyer / buyer, buying, ... → buy
• Remove stopwords
• Find Named Entities (people, places, companies, ...); filtering
• Resolve polysemy and homonymy: word sense disambiguation;
“synonym unification“
• Part-of-speech tagging; filtering of nouns, verbs, adjectives, ...
• ...
• Most steps are optional and application-dependent!
• Many steps are language-dependent; coverage of non-English
varies
• Free and/or open-source tools or Web APIs exist for most steps
Do you see a
problem
here for DH?
What implicit
assumptions
are made?
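A minimal sketch of three of these steps (tokenizing, stopword removal, stemming) in plain Python. The stopword list and the suffix-stripping rule are hypothetical stand-ins invented here; a real project would use an established stemmer or lemmatizer for the language at hand.

```python
import re

# A tiny, hypothetical stopword list; real lists are language-specific and much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "in"}

def tokenize(text):
    """Find word boundaries: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def crude_stem(token):
    """Strip a few English suffixes. NOT a real stemmer; illustration only."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

result = preprocess("The buyers are buying and the buyer buys.")
```

Note how much is hard-wired for modern English here (letters a-z, the suffix list, the stopwords), which is one way to approach the question on this slide.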
Preprocessing (3)
• Creation of text representation
▫ Goal: a representation that the modelling
algorithm can work on
▫ Most common forms: A text as
 a set or (more usually) bag of words / vector-space
representation: term-document matrix with
weights reflecting occurrence, importance, ...
 a sequence of words
 a tree (parse trees)
An important part of preprocessing:
Named-entity recognition (1)
This 2009 OpenCalais screenshot
visualizes nicely what today is mostly
markup. E.g. in the tool
http://www.alchemyapi.com/api/entity-extraction
An important part of preprocessing:
Named-entity recognition (2)
• Technique: Lexica, heuristic rules, syntax
parsing
• Re-use lexica and/or develop your own
▫ configurable tools such as GATE
• An example challenge: multi-document named-
entity recognition
▫ Several solution proposals
• A more difficult problem: Anaphora resolution
Styles of statistics-based analysis
• Statistics: descriptive – inferential
• Data mining: descriptive – predictive (D – P)
• Machine learning, data mining: unsupervised –
supervised
• Typical tasks in text analysis:
▫ D: Frequency analysis, collocation analysis,
association rules
▫ D: Cluster analysis
▫ P: Classification
▫ Interactive knowledge discovery: combines various
forms and involves “the human in the loop“
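To make the descriptive tasks concrete, here is a sketch of the simplest form of collocation analysis: counting adjacent word pairs. The example sentence is invented, and real collocation measures usually go beyond raw counts (e.g. by comparing against chance co-occurrence).

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs, the simplest form of collocation analysis."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical sentence, invented for illustration.
tokens = "prince andrew met pierre and prince andrew left".split()
counts = bigram_counts(tokens)
top_pair, top_count = counts.most_common(1)[0]
```

Here top_pair is the pair that occurs most often, which in this toy sentence is a name.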
“It involves
Russia.“
“It‘s about
Russia.“
Tools we will see (you‘ll have to choose,
based on your prior knowledge)
• Frequency analysis, collocation analysis
▫ Voyant
▫ (also offers many other forms, see
http://docs.voyant-tools.org/tools/)
• More visualization (based on clustering)
▫ DocumentAtlas
• Classification
▫ Weka (can also do lots of other data mining tasks, such as
association rules, and it is not made specifically for texts)
• Interactive knowledge discovery
▫ Ontogen: Ontology learning based on clustering and
manual post-processing; includes DocumentAtlas
Basic process of
classification/prediction
Given a set of documents and their classes, e.g.
▫ Spam, no-spam
▫ Topic categories in news: current affairs,
business, sports, entertainment, ...
▫ Any other classification
1. Learn which document features characterise
the classes = learn a classifier
2. Predict, from document features, the classes
▫ For old documents with known classes
▫ For new documents with unknown classes
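The two steps (learn a classifier, then predict) can be sketched with a small multinomial Naive Bayes classifier written from scratch. This is an illustrative sketch, not the Weka implementation; the training documents and the function names train and predict are invented here.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Step 1: learn per-class word counts from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def predict(tokens, model):
    """Step 2: return the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for w in tokens:
            if w in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, invented for illustration.
training = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting agenda attached".split(), "no-spam"),
    ("lunch meeting tomorrow".split(), "no-spam"),
]
model = train(training)
label_new = predict("money offer".split(), model)              # new document
label_old = predict("meeting agenda attached".split(), model)  # old document
```

Predicting for "old" documents with known classes is what makes evaluation possible; predicting for new ones is the actual application.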
What makes people happy?
Happiness in blogosphere
• Well kids, I had an awesome birthday
thanks to you. =D Just wanted to so
thank you for coming and thanks for the
gifts and junk. =) I have many pictures
and I will post them later. hearts
current
mood:
Home alone for too many hours, all
week long ... screaming child,
headache, tears that just won’t let
themselves loose.... and now I’ve
lost my wedding band. I hate this.
current
mood:
What are the
characteristic words
of these two moods?
[Mihalcea, R. & Liu, H. (2006).
In Proc. AAAI Spring Symposium CAAW.]
Slides based on Rada Mihalcea‘s presentation.
Data, data preparation and learning
• LiveJournal.com – optional mood annotation
• 10,000 blogs:
▫ 5,000 happy entries / 5,000 sad entries
▫ average size 175 words / entry
▫ pre-processing – remove SGML tags, tokenization,
part-of-speech tagging
Results: Corpus-derived happiness factors
yay 86.67
shopping 79.56
awesome 79.71
birthday 78.37
lovely 77.39
concert 74.85
cool 73.72
cute 73.20
lunch 73.02
books 73.02
goodbye 18.81
hurt 17.39
tears 14.35
cried 11.39
upset 11.12
sad 11.11
cry 10.56
died 10.07
lonely 9.50
crying 5.50
happiness factor of a word =
the number of occurrences in the happy blogposts / the total frequency in the corpus
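The definition above translates directly into code. The sketch below assumes the listed values are percentages; the toy posts and the min_count parameter are invented for illustration and are not taken from the slides.

```python
from collections import Counter

def happiness_factors(happy_posts, sad_posts, min_count=1):
    """Percentage of a word's occurrences that fall in happy posts.

    Each post is a list of tokens. min_count is a hypothetical frequency
    threshold added here to skip rare words.
    """
    happy = Counter(w for post in happy_posts for w in post)
    sad = Counter(w for post in sad_posts for w in post)
    factors = {}
    for word in happy.keys() | sad.keys():
        total = happy[word] + sad[word]
        if total >= min_count:
            factors[word] = 100.0 * happy[word] / total
    return factors

# Hypothetical toy posts, invented for illustration.
happy_posts = [["awesome", "birthday", "thanks"], ["awesome", "gifts"]]
sad_posts = [["tears", "headache"], ["awesome", "tears"]]
factors = happiness_factors(happy_posts, sad_posts)
```

A word used only in happy posts scores 100, a word used only in sad posts scores 0, and a word spread evenly across both scores 50.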
Weka – classification with Naive Bayes
Using classifier learning for literature analysis –
here: a (Weka) decision tree (early example: MONK)
Sara Steger (2012).
Patterns of Sentimentality in
Victorian Novels.
Digital Studies 3(2).
Many other tasks (ex. news/blogs
mining)
Tasks in news / (micro-)blogs mining can be grouped by
different criteria:
• Basic task and type of result: description, classification and
prediction (supervised or unsupervised, includes for example
topic identification, tracking, and/or novelty detection;
spam detection); search (ad hoc or filtering);
recommendation (of blogs, blog posts, or (hash-)tags);
summarization
• Higher-order characterization to be extracted: especially
topic or event; opinion or sentiment
• Time dimension: nontemporal; temporal (stream mining);
multiple streams (e.g., in different languages, see cross-
lingual text mining)
• User adaptation: none (no explicit mention of user issues
and/or general audience); customizable; personalized
Berendt (Encyclopedia of Machine Learning and Data Mining, in press).
Real-world applications of news/blogs
mining
Real-world applications increasingly employ selections or, more often, combinations of
these tasks by their intended users and use cases, in particular:
• News aggregators allow laypeople and professional users (e.g. journalists) to see “what’s
in the news” and to compare different sources’ texts on one story. Reflecting the
presumption that news (especially mainstream news – sources for news aggregators are
usually whitelisted) are mostly objective/neutral, these aggregators focus on topics and
events. News aggregators are now provided by all major search engines.
• Social-media monitoring tools allow laypeople and professional users to track not only
topical mentions of a keyword or named entity (e.g. person, brand), but also aggregate
sentiment towards it. The focus on sentiment reflects the perceptions that even when
news-related, social media content tends to be subjective and that studying the
blogosphere is therefore an inexpensive way of doing market research or public-opinion
research. The whitelist here is usually the platforms (e.g. Twitter, Tumblr, LiveJournal,
Facebook) rather than the sources themselves, reflecting the huge size and dynamic
structure of the blogosphere / the Social Web. The landscape of commercial and free
social-media monitoring tools is wide and changes frequently; up-to-date overviews and
comparisons can easily be found on the Web.
• Emerging application types include text mining not of, but for journalistic texts, in
particular natural language generation in domains with highly schematized event
structures and reporting, such as sports and finance reporting (e.g. Allen et al., 2010;
narrativescience.com) and social-media monitoring tools for helping journalists find
sources (Diakopoulos et al., 2012).
Berendt (Encyclopedia of Machine Learning and Data Mining, in press).
Evaluation of unsupervised learning: e.g.
clustering
• Do the clusters make sense?
• Are the instances within one cluster similar to
one another?
• Are the instances in different clusters
dissimilar to one another?
• (There are quantitative metrics of #2 and #3)
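One simple way to quantify the second and third questions is to compare the average similarity of same-cluster pairs with that of different-cluster pairs. The sketch below uses cosine similarity; the choice of measure and the toy vectors are assumptions made here, and established indices (e.g. silhouette scores) refine the same idea.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def mean_similarities(vectors, labels):
    """Average cosine similarity for same-cluster and different-cluster pairs."""
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(sim)
    return mean(within), mean(between)

# Hypothetical document vectors and cluster assignments.
vectors = [[1, 0], [2, 0], [0, 1], [0, 3]]
labels = [0, 0, 1, 1]
within, between = mean_similarities(vectors, labels)
```

A good clustering should give a high within value and a low between value; whether the clusters "make sense" remains a question for the human reader.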
Quality of automatic “mood separation”
• Naïve Bayes text classifier
▫ five-fold cross validation
• Accuracy: 79.13% (>> 50% baseline)
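Five-fold cross-validation means: split the data into five parts, train on four, test on the fifth, rotate, and average the accuracies. A minimal sketch follows; the function names are invented here, and a majority-class "classifier" stands in for Naive Bayes to show where the 50% baseline on balanced data comes from.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds.

    In practice the data would be shuffled first; omitted here to keep the
    result deterministic.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

def cross_validated_accuracy(items, labels, train_fn, predict_fn, k=5):
    """Average accuracy over k folds; train_fn and predict_fn come from the caller."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Hypothetical majority-class "classifier", standing in for Naive Bayes.
def train_majority(items, labels):
    return max(set(labels), key=labels.count)

def predict_majority(model, item):
    return model

# Hypothetical balanced data: 5 happy and 5 sad entries.
labels = ["happy", "sad"] * 5
items = list(range(10))
baseline = cross_validated_accuracy(items, labels, train_majority, predict_majority)
```

Any real classifier can be plugged in through train_fn and predict_fn and then compared against this baseline.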
Who defines which class a document
belongs to?
• The researcher?
• The author?
• The reader?
• Someone paid to do exactly this (e.g. a worker
on mTurk)?
• Several of them?
• Someone else?
The importance of consensus
Illustration: ESP game (“Games with a purpose“)
von Ahn (2005, 2006)
Measuring inter-rater reliability
• Popular measure of inter-rater agreement from content analysis
• Non-trivial formula (see references), but software exists.
How good is good: Magic numbers?
• (Kappa is a related measure; the boundaries are the same)
• Boundaries are disputed and tend to get higher
• Inter-rater agreement often systematically low, e.g. in text
summarization: slightly over 50% (Berendt et al., 2014)
• Recent approaches attempt to accept this ambiguity and work with
it: e.g. Poesio et al. (2013)
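Kappa, mentioned above as a related measure, has a simple form for two raters (Cohen's kappa): observed agreement corrected for the agreement expected by chance. The sketch below is a plain-Python illustration with invented ratings, not the software referred to on the slides.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance from each rater's label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of six blog entries by two raters.
rater_1 = ["happy", "happy", "sad", "sad", "happy", "sad"]
rater_2 = ["happy", "sad", "sad", "sad", "happy", "happy"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy case the raters agree on four of six items, but half of that agreement is expected by chance, so kappa is only about 0.33.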
In what sense is this an alternative?
• “Given that there is no ground truth in a discipline like
literary criticism, it is difficult to know how influential
these results will prove.
• A scholar would have to write them up in traditional
article or monograph form, wait for the article or
monograph to move through the peer-review process
(this can take months or years) and then other scholars
in the field will have to read it, be influenced by its
arguments, and adjust their own interpretations of
Dickinson—in turn publishing these in their own articles
and monographs.
• Nonetheless, we believe that the Nora system has
suggested that classification and prediction can be
useful agents of provocation in humanistic study.”
(Kirschenbaum, 2007)
#gamergate
“GamerGate is a grassroots
movement with the goal of
supporting ethics in game
journalism. Some feminists
have claimed it is a
hateful, misogynistic
movement, but they
haven't been able to meet
the burden of proof on
that.”
http://drunken-peasants-podcast.wikia.com/wiki/GamerGate
(Only) one reason this is interesting for
text analysis
“Ethics aren't the only thing #Gamergate is concerned
with. As the movement made the shift from ad hominem
attacks to insisting that its only interest in Quinn was as
an example of nepotism and corruption in the gaming
industry, it also began co-opting the language of social
justice movements and of journalism to legitimize its
complaints.
Although their movement targets women specifically,
#Gamergaters insist they speak for a victimized
"demographic," and that anyone who opposes misogyny
while making generalizations about gamers must be a
hypocrite.”
http://gawker.com/what-is-gamergate-and-why-an-explainer-for-non-geeks-1642909080
Gamergate tweets
• Based on the work of Budac, A., Chartier, R., Suomela, T., Gouglas,
S., & Rockwell, G. (see sources at the end of this slideset)
• I received the data for the purposes of this summer school (i.e. also
for you)
▫ Condition: we all respect the associated ethics code
▫ This is an interesting document in itself, and we will use it for part 3
• Data post-processed for you: “most retweeted tweets“ Oct‘14 –
Mar’15, in 4 versions (each version assembled into one ZIP file)
▫ 1 document per month, tweet texts ordered by count of retweets
(desc.) → Voyant
▫ 1 document per tweet, sorted into 1 folder per month →
DocumentAtlas/Ontogen
▫ 1 document overall (→ Weka), with fields
 anonymized user ID
 Month
 Count in that month‘s dataset
 Tweet text
- The same, but with some post-processing that will make your analysis easier
The post-processing applied
(& user removed,
& 1000 highest-ranking attributes selected)
I suggest you run trees → J48
with settings such as these,
and Test Options: Use Training Set
Thank you!
I‘ll be more than happy to hear your
s
References
A good textbook on Text Mining:
• Feldman, R. & Sanger, J. (2007). The Text Mining Handbook. Advanced Approaches in Analyzing Unstructured Data.
Cambridge University Press.
An introduction similar to this one, but also covering unsupervised learning in some detail, and with lots of pointers to books,
materials, etc.:
• Shaw, R. (2012). Text-mining as a Research Tool in the Humanities and Social Sciences. Presentation at the Duke
Libraries, September 20, 2012. https://aeshin.org/textmining/
An overview of news and (micro-)blogs mining:
• Berendt, B. (in press). Text mining for news and blogs analysis. To appear in C. Sammut & G.I. Webb (Eds.), Encyclopedia
of Machine Learning and Data Mining. Berlin etc.: Springer.
http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_encyclopedia_2015_with_publication_info.pdf
See http://wiki.esi.ac.uk/Current_Approaches_to_Data_Mining_Blogs for more articles on the subject.
Individual sources cited on the slides
• Fortuna, B., Galleguillos, C., & Cristianini, N. (2009). Detecting the bias in media with statistical learning methods. In
Text Mining: Classification, Clustering, and Applications, Chapman & Hall/CRC, 2009.
• Qiaozhu Mei, ChengXiang Zhai: Discovering evolutionary theme patterns from text: an exploration of temporal text
mining. KDD 2005: 198-207
• Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness, In Proc. AAAI Spring Symposium on
Computational Approaches to Analyzing Weblogs. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.6759
• Kirschenbaum, M. "The Remaking of Reading: Data Mining and the Digital Humanities." In NGDM 07: National Science
Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation.
http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf
• Mueller, M. “Notes towards a user manual of MONK.”
https://apps.lis.uiuc.edu/wiki/display/MONK/Notes+towards+a+user+manual+of+Monk, 2007.
• Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo and Luca Ducceschi, 2013. Phrase Detectives: Utilizing
Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Intelligent Interactive
Systems, 3(1). http://csee.essex.ac.uk/poesio/publications/poesio_et_al_ACM_TIIS_13.pdf
• Luis von Ahn (2005). Human Computation. PhD Dissertation. Computer Science Department, Carnegie Mellon University.
http://reports-archive.adm.cs.cmu.edu/anon/usr0/ftp/usr/ftp/2005/abstracts/05-193.html
• Luis von Ahn: Games with a Purpose. IEEE Computer 39(6): 92-94 (2006)
More DH-specific tools
Overviews of 71 tools for Digital Humanists
• Simpson, J., Rockwell, G., Chartier, R., Sinclair,
S., Brown, S., Dyrbye, A., & Uszkalo, K. (2013).
Text Mining Tools in the Humanities: An Analysis
Framework. Journal of Digital Humanities, 2(3).
http://journalofdigitalhumanities.org/2-3/text-mining-tools-in-the-humanities-an-analysis-framework/
• See also the link collection on the Voyant
documentation Web page
Tools (powerful, but require some computing
experience)
• LingPipe
▫ linguistic processing of text including entity extraction, clustering and classification, etc.
▫ http://alias-i.com/lingpipe/
• OpenNLP
▫ the most common NLP tasks, such as POS tagging, named entity extraction, chunking and
coreference resolution.
▫ http://opennlp.apache.org/
• Stanford Parser and Part-of-Speech (POS) Tagger
▫ http://nlp.stanford.edu/software/tagger.shtml
• NLTK
▫ Toolkit for teaching and researching classification, clustering and parsing
▫ http://www.nltk.org/
• OpinionFinder
▫ identifies subjective sentences, the source (holder) of the subjectivity, and words that are included in
phrases expressing positive or negative sentiments.
▫ http://code.google.com/p/opinionfinder/
• Basic sentiment tokenizer plus some tools, by Christopher Potts
▫ http://sentiment.christopherpotts.net
• Twitter NLP and Part-of-speech tagging
▫ http://www.ark.cs.cmu.edu/TweetNLP/
Further tools (thanks for your
suggestions!)
• ATLAS.ti: “Qualitative data analysis“
▫ http://atlasti.com/
▫ Commercial product, has free trial version
Gamergate sources
• Budac, A., Chartier, R., Suomela, T., Gouglas, S., &
Rockwell, G. (2015) #GamerGate: Distant Reading
Games Discourse. Paper presented at the CGSA
2015 conference at the HSSFC Congress at
University of Ottawa, Ottawa, Ontario, June 2015.
• Rockwell, G. (2015). Appendix 1: Ethics of Twitter
Gamergate Research.
• Rockwell, Geoffrey; Suomela, Todd, 2015,
"Gamergate Reactions",
http://dx.doi.org/10.7939/DVN/10253 V5
[Version].
More sources
• Please find the URLs of pictures and
screenshots in the Powerpoint “comment“ box
• Thanks to the Internet for them!

More Related Content

Similar to For Voyant_berendt_VSSDH15_lecture1.pptx

The New Past, and a Speculative Future, of Literature: A Brief Discussion of ...
The New Past, and a Speculative Future, of Literature: A Brief Discussion of ...The New Past, and a Speculative Future, of Literature: A Brief Discussion of ...
The New Past, and a Speculative Future, of Literature: A Brief Discussion of ...NatGustafsonSundell
 
Nicholas Carr was born in 1959 and first gained widespread recogni.docx
Nicholas Carr was born in 1959 and first gained widespread recogni.docxNicholas Carr was born in 1959 and first gained widespread recogni.docx
Nicholas Carr was born in 1959 and first gained widespread recogni.docxhenrymartin15260
 
UVA MDST 3703 The Stack of Scholarship 2012-09-24
UVA MDST 3703 The Stack of Scholarship 2012-09-24UVA MDST 3703 The Stack of Scholarship 2012-09-24
UVA MDST 3703 The Stack of Scholarship 2012-09-24Rafael Alvarado
 
South asian history.
South asian history.South asian history.
South asian history.EHSAN KHAN
 
UVA MDST 3703 Thematic Research Collections 2012-09-18
UVA MDST 3703 Thematic Research Collections 2012-09-18UVA MDST 3703 Thematic Research Collections 2012-09-18
UVA MDST 3703 Thematic Research Collections 2012-09-18Rafael Alvarado
 
Assignment InstructionsThis final project will require you to ga.docx
Assignment InstructionsThis final project will require you to ga.docxAssignment InstructionsThis final project will require you to ga.docx
Assignment InstructionsThis final project will require you to ga.docxhoward4little59962
 
History in Your Hands - Revision Class - February 2024.pptx
History in Your Hands - Revision Class - February 2024.pptxHistory in Your Hands - Revision Class - February 2024.pptx
History in Your Hands - Revision Class - February 2024.pptxEilsONeill
 
Semantic engagement
Semantic engagementSemantic engagement
Semantic engagementSTIinnsbruck
 
Interactive visualization and exploration of network data with gephi
Interactive visualization and exploration of network data with gephiInteractive visualization and exploration of network data with gephi
Interactive visualization and exploration of network data with gephiBernhard Rieder
 
Writing good case studies
Writing good case studiesWriting good case studies
Writing good case studiesMiia Kosonen
 
LCC CTS 2 Option.docx
LCC CTS 2 Option.docxLCC CTS 2 Option.docx
LCC CTS 2 Option.docxwrite4
 
NHDStudentIntroNevadaPPT
NHDStudentIntroNevadaPPTNHDStudentIntroNevadaPPT
NHDStudentIntroNevadaPPTKarlye Mull
 
Background research
Background researchBackground research
Background researchSamiulhaq32
 
Stair i do Lámha - Rang athbhreithnithe Feabhra 2024.pptx
Stair i do Lámha - Rang athbhreithnithe Feabhra 2024.pptxStair i do Lámha - Rang athbhreithnithe Feabhra 2024.pptx
Stair i do Lámha - Rang athbhreithnithe Feabhra 2024.pptxEilsONeill
 
Introduction to Social Reading Technologies
Introduction to Social Reading TechnologiesIntroduction to Social Reading Technologies
Introduction to Social Reading TechnologiesFrederic Kaplan
 

For Voyant_berendt_VSSDH15_lecture1.pptx

  • 1. An Introduction to Text Mining. Bettina Berendt, Department of Computer Science, KU Leuven, Belgium. http://people.cs.kuleuven.be/~bettina.berendt/ Vienna Summer School on Digital Humanities, July 7th, 2015, Vienna, Austria
  • 2. Starting questions for today • What is a text? • What questions can we ask of a text? • What kind of answers "make us happy"?
  • 3. Some answers that would make you happy, and how (semi-)automatic text analysis could help • Author ▫ Usually a metadatum that is extracted from the metadata set ▫ Can also be an inference: “Can you find out who is the author of this text?” ▪ This is a text-mining task that has been studied, for example, in online texts (one subtype of the de-anonymization problem) • Genre ▫ A text-mining classification task (given a text, classify it into one of a list of genres) • Style ▫ Same (stylometry classification) • Statement / summary ▫ The text-mining task “summarization” (e.g. of news texts) • Content ▫ The most typical text-mining task: identify topics, classify into a content class, ... • Function, intention ▫ (I’m still not quite sure what this ... So this is for a future summer school ;-) ) • Sequence of signs 1. Sequential analysis of texts is common (e.g. in co-occurrence and collocation analysis) 2. Signs (in the sense of arbitrary words standing for concepts): a key element of theories of semantics, e.g. in the Semantic Web and Linked Open Data (e.g. DBpedia: a concept network version of Wikipedia); these are used in text mining, for example for improving topic modelling and classification. July 9th 2015: I’ll still add references to these, but wanted to get the slides out to you already!
  • 6. Goals and non-goals • Goals ▫ Understand the basic ideas of data mining ▫ Understand how computer-scientist text miners approach texts ▫ Compare it with your own approaches ▫ Learn about some pitfalls and encourage a critical view ▫ Get your hands on some tools and real data ▫ Have an overview of other necessary steps (such as pre-processing) that take too much time to be included in this course ▫ Have pointers for inquiring and going further • Non-goals (selection) ▫ The statistical background of methods ▫ A comprehensive overview of the state of the art of text mining methods ▫ A comprehensive overview of the state of the art of text mining applications in the digital humanities or social or behavioural sciences ▫ An introduction to big data computing or big DH infrastructures
  • 7. 7 A more modest goal than revolutionising knowledge as such?! “As long as there have been books there have been more books than you could read. … Knowing how to "not-read" is just as important as knowing how to read” (Mueller, 2007). “data mining and machine learning are best understood in terms of “provocation”—the potential for outlier results to surprise a reader into attending to some aspect of a text not previously deemed significant—as well as “not-reading” or “distant reading,” the automated search for patterns across a much wider corpus than could be read and assimilated via traditional humanistic methods of “close reading.”” (Kirschenbaum, 2007) 7
  • 10. Origins of text mining. Or: What is a text for information retrieval? Let’s do some reverse engineering ...
  • 11. Words, source relevance, and personalization
  • 12. Words and knowledge bases (1): Metadata as output
  • 13. Knowledge-based text processing (2): Metadata as input? Requires different search interfaces!
  • 14. PS: the ranking includes network analytics (see Thursday)
  • 15. PS: the ranking also includes adaptation; here: relevance feedback
  • 16. Trending topics: a form of summarization
  • 17. Finding “similar” texts: Clustering (example: Google News)
  • 18. Going further: What topics exist in a collection of texts, and how do they evolve? News texts, scientific publications, ... Mei & Zhai (2005)
  • 19. Guiding questions • Information retrieval: ▫ Given the current user’s information need, which are the most relevant documents? • Text mining: ▫ What do the documents tell us? What’s in the texts? What can we learn about the texts, their authors, ... ▫ Many different subquestions ▫ Summarization (of one text, of many texts) is just one of them • Cf. ▫ “Distant reading” (Moretti): understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data. ▫ “Machine reading” (UCL Machine Reading Group): machines that can read and "understand" this textual information, converting it into interpretable structured knowledge to be leveraged by humans and other machines alike
  • 21. Speed-reading (Woody Allen): “I took a course in speed reading and was able to read War and Peace in twenty minutes. It's about Russia.” ... also quoted differently: “I took a speed reading course and read 'War and Peace' in twenty minutes. It involves Russia.”
  • 22. A personal “experiment”: deliberately a bit silly, more a gentle introduction to a great tool and to some pitfalls of “distant reading” (I haven’t read War and Peace yet.)
  • 23. Speed-reading with word clouds: The Voyant tool (single-digit number of seconds)
  • 26. Can we find out more about the three?
  • 27. 27 Double-check in Wikipedia (method: string search) • Count Pyotr Kirillovich (Pierre) Bezukhov: The large-bodied, ungainly, and socially awkward illegitimate son of an old Russian grandee. Pierre, educated abroad, returns to Russia as a misfit. His unexpected inheritance of a large fortune makes him socially desirable. Pierre is the central character and often a voice for Tolstoy's own beliefs or struggles. • Prince Andrey Nikolayevich Bolkonsky: A strong but skeptical, thoughtful and philosophical aide-de-camp in the Napoleonic Wars. ▫ Some searching needed ... Andrew ... Andrei ... Andrey • Countess Natalya Ilyinichna (Natasha) Rostova: A central character, introduced as "not pretty but full of life" and a romantic young girl, although impulsive and highly strung, she evolves through trials and suffering and eventually finds happiness. She is an accomplished singer and dancer. • ... • Prince Anatole Vasilyevich Kuragin: Hélène's brother and a very handsome and amoral pleasure seeker who is secretly married yet tries to elope with Natasha Rostova. • Vasily Dmitrich Denisov: Nikolai Rostov's friend and brother officer, who proposes to Natasha. 27
  • 28. From Wikipedia’s plot summary (method: string search) • ... • Natasha is convinced that she loves Anatole and writes to Princess Maria, Andrei's sister, breaking off her engagement [with Andrei]. At the last moment, Sonya discovers her plans to elope and foils them. Pierre is initially horrified by Natasha's behavior, but realizes he has fallen in love with her. During the time when the Great Comet of 1811–2 streaks the sky, life appears to begin anew for Pierre. • Prince Andrei coldly accepts Natasha's breaking of the engagement. He tells Pierre that his pride will not allow him to renew his proposal. Ashamed, Natasha makes a suicide attempt and is left seriously ill. • ... • Having lost all will to live, [Andrei] forgives Natasha in a last act before dying. • Pierre's wife Hélène dies from an overdose of abortion medication (Tolstoy does not state it explicitly but the euphemism he uses is unambiguous). Pierre is reunited with Natasha, while the victorious Russians rebuild Moscow. Natasha speaks of Prince Andrei's death and Pierre of Karataev's. Both are aware of a growing bond between them in their bereavement. With the help of Princess Maria, Pierre finds love at last and, revealing his love after being released by his former wife's death, marries Natasha. • Total time: 29 mins since creation of word cloud, 17 mins since creation of Pierre-Natasha-Andrew chart (includes making these slides for you)
  • 29. Questions • How much of this was “really automatic”? • What existing knowledge (in my head and in others’) went into this analysis, and how? • Can you think of another reason why this (deliberately) turned out silly?
  • 30. More interesting / serious examples (1) (from the summer school participants) • Analysis of first-person-shooter (“ego-shooter”) missions (thanks to Kathrin Trattner)
  • 31. Comment (B. Berendt): compare this with an earlier text-mining analysis of reporting on the same events by CNN in comparison with Al Jazeera • See next slide
  • 32. Unsupervised learning of bias: What characterizes different news sources? Nearest neighbour / best reciprocal hit for document matching; Kernel Canonical Correlation Analysis and vector operations for finding topics and characteristic keywords [Fortuna, Galleguillos, & Cristianini, 2009]
  • 33. Additional information from SentiStrength analysis
  • 34. More interesting / serious examples (2) (from the summer school participants) • Analysis of ideological documents: Charter of Hamas (thanks to Alexandra Preitschopf) • Analyses word usage and, interestingly, also the absence of specific words ▫ Note: this shows clearly why we need domain knowledge to interpret frequencies! • It also shows the difficulties of using sentiment analysis when the real object of analysis is opinion/bias. • For more details, see her presentation (also linked on my Summer School Web page)
  • 35. More interesting / serious examples (3) (from the summer school participants) • Joseph Goebbels’ Sportpalast speech (a famous propaganda speech from 1943: “Do you want total war?”) • Frequencies of negatively connoted words (“bolshevism”, “judaism / the Jews”) vs. positively connoted words (“Germans”) suggest: ▫ The speech starts with a threat scenario and ends with a positive vision of the future • Remark (B. Berendt): This is borne out by reading the full text, and it is also a classical rhetorical structure. Text from: http://www.1000dokumente.de/index.html?c=dokument_de&dokument=0200_goe&object=translation&st=&l=de
  • 36. More interesting / serious examples (4) (from others) Examples of Voyant in Research: http://docs.voyant-tools.org/about/examples-gallery/
  • 37. Some formalism: the vector-space model of text (basic model used in information retrieval and text mining) ▫ Basic idea: ▪ Keywords are extracted from texts. ▪ These keywords describe the (usually) topical content of Web pages and other text contributions. ▫ Based on the vector space model of document collections: ▪ Each unique word in a corpus of Web pages = one dimension ▪ Each page(view) is a vector with non-zero weight for each word in that page(view), zero weight for other words ▪ Words become “features” (in a data-mining sense)
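A minimal Python sketch of the vector-space idea on this slide, assuming whitespace tokenization and raw counts as weights. The function name `build_vectors` and the three toy documents are hypothetical and only serve as illustration.

```python
from collections import Counter

def build_vectors(docs):
    """Map each document to a term-count vector over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    # Each unique word in the corpus becomes one dimension (feature).
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Words absent from a document get weight zero.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical toy corpus.
docs = ["nova galaxy heat", "film role film", "galaxy film"]
vocabulary, vectors = build_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, "->", vector)
```

Each row of `vectors` is one document vector; real tools usually store such vectors sparsely, since most weights are zero.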
  • 38. Document Representation as Vectors • Starting point is the raw term frequency as term weights • Other weighting schemes can generally be obtained by applying various transformations to the document vectors • Features (columns): nova, galaxy, heat, actor, film, role, diet • Document IDs (rows; each row is a document vector, blank cells not shown): A: 1.0, 0.5, 0.3; B: 0.5, 1.0; C: 0.4, 1.0, 0.8, 0.7; D: 0.9, 1.0, 0.5; E: 0.5, 0.7, 0.9; F: 0.6, 1.0, 0.3, 0.2, 0.8
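As one illustration of "applying a transformation to the document vectors", the sketch below scales raw term counts by the largest count in the document. This particular scheme (maximum-frequency normalization) is an assumption chosen for simplicity; the slide does not say which transformation was used, and the raw counts are invented.

```python
def normalize_by_max(vector):
    """Scale raw term counts so the most frequent term gets weight 1.0."""
    peak = max(vector)
    if peak == 0:
        # An empty document stays an all-zero vector.
        return [0.0 for _ in vector]
    return [count / peak for count in vector]

# Hypothetical raw counts for one document over seven features.
raw_counts = [10, 5, 3, 0, 0, 0, 0]
weights = normalize_by_max(raw_counts)
print(weights)
```

Other common choices, such as tf-idf, follow the same pattern: a function that takes raw counts and returns re-weighted vectors.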
  • 39. Other features (usually metadata of different sorts) can be added • Tags or other categories • Special content (e.g. URLs, images, Twitter mentions) • Source • Number of followers of source • ...
  • 44. The idea of text mining ... • ... is to go beyond frequency-counting • ... is to go beyond the search-for-documents framework • ... is to find patterns (of meaning) within and especially across documents • (but boundaries are not fixed)
  • 45. Data mining (aka Knowledge Discovery): The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
  • 47. The steps of text mining 1. Application understanding 2. Corpus generation 3. Data understanding 4. Text preprocessing 5. Search for patterns / modelling ▪ Topical analysis ▪ Sentiment analysis / opinion mining 6. Evaluation 7. Deployment
  • 48. Application understanding; Corpus generation ▫ What is the question? ▫ What is the context? ▫ What could be interesting sources, and where can they be found? ▫ Use an existing corpus ▫ Crawl ▫ Use a search engine and/or archive and/or API ▫ Get help!
  • 49. Preprocessing (1) • Data cleaning ▫ Goal: get clean ASCII text ▫ Remove HTML markup*, pictures, advertisements, ... ▫ Automate this: wrapper induction * Note: HTML markup may carry information too (e.g., <b> or <h1> marks something important), which can be extracted! (Depends on the application)
  • 50. Preprocessing (2) • Goal: get processable lexical / syntactical units • Tokenize (find word boundaries) • Lemmatize / stem ▫ e.g. buyers, buyer → buyer; buyer, buying, ... → buy • Remove stopwords • Find Named Entities (people, places, companies, ...); filtering • Resolve polysemy and homonymy: word sense disambiguation; “synonym unification” • Part-of-speech tagging; filtering of nouns, verbs, adjectives, ... • ... • Most steps are optional and application-dependent! • Many steps are language-dependent; coverage of non-English varies • Free and/or open-source tools or Web APIs exist for most steps • Do you see a problem here for DH? What implicit assumptions are made?
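The first three steps on this slide (tokenize, remove stopwords, stem) can be sketched in a few lines of Python. The stopword list and the suffix-stripping rule below are deliberately tiny and crude assumptions for illustration; real pipelines would use a full stopword list and an established stemmer or lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def tokenize(text):
    """Find word boundaries: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude suffix stripping; real stemmers apply many more rules."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]

tokens = preprocess("The buyers are buying books in the shop")
print(tokens)
```

Note that the regular expression only recognizes unaccented Latin letters, which is exactly the kind of implicit, English-centred assumption the slide asks about.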
  • 51. Preprocessing (3) • Creation of text representation ▫ Goal: a representation that the modelling algorithm can work on ▫ Most common forms: A text as ▪ a set or (more usually) bag of words / vector-space representation: term-document matrix with weights reflecting occurrence, importance, ... ▪ a sequence of words ▪ a tree (parse trees)
  • 52. An important part of preprocessing: Named-entity recognition (1) This 2009 OpenCalais screenshot visualizes nicely what today is mostly markup, e.g. in the tool http://www.alchemyapi.com/api/entity-extraction
  • 53. An important part of preprocessing: Named-entity recognition (2) • Technique: Lexica, heuristic rules, syntax parsing • Re-use lexica and/or develop your own ▫ configurable tools such as GATE • An example challenge: multi-document named-entity recognition ▫ Several solution proposals • A more difficult problem: Anaphora resolution
  • 54. Styles of statistics-based analysis • Statistics: descriptive – inferential • Data mining: descriptive – predictive (D – P) • Machine learning, data mining: unsupervised – supervised • Typical tasks in text analysis: ▫ D: Frequency analysis, collocation analysis, association rules ▫ D: Cluster analysis ▫ P: Classification ▫ Interactive knowledge discovery: combines various forms and involves “the human in the loop” • “It involves Russia.” “It’s about Russia.”
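The two simplest descriptive tasks named here, frequency analysis and collocation analysis, can be sketched with Python's standard library. Counting adjacent word pairs is only a starting point for collocation analysis (an assumption of this sketch); dedicated tools add statistical association measures on top. The sample sentence is invented.

```python
from collections import Counter

def word_frequencies(tokens):
    """Frequency analysis: how often does each word occur?"""
    return Counter(tokens)

def bigram_counts(tokens):
    """Count adjacent word pairs, a simple basis for collocation analysis."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical token sequence.
tokens = "prince andrew met pierre and prince andrew left".split()
freq = word_frequencies(tokens)
bigrams = bigram_counts(tokens)
print(freq.most_common(2))
print(bigrams.most_common(1))
```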
  • 56. Tools we will see (you’ll have to choose, based on your prior knowledge) • Frequency analysis, collocation analysis ▫ Voyant ▫ (also offers many other forms, see http://docs.voyant-tools.org/tools/) • More visualization (based on clustering) ▫ DocumentAtlas • Classification ▫ Weka (can also do lots of other data mining tasks, such as association rules, and it is not made specifically for texts) • Interactive knowledge discovery ▫ Ontogen: Ontology learning based on clustering and manual post-processing; includes DocumentAtlas
  • 57. Basic process of classification/prediction. Given a set of documents and their classes, e.g. ▫ Spam, no-spam ▫ Topic categories in news: current affairs, business, sports, entertainment, ... ▫ Any other classification 1. Learn which document features characterise the classes = learn a classifier 2. Predict, from document features, the classes ▫ For old documents with known classes ▫ For new documents with unknown classes
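The two-step process (learn a classifier, then predict) can be sketched with a small multinomial Naive Bayes classifier using add-one smoothing. Naive Bayes is chosen here only as one familiar example of a classifier; the training documents and the names `train` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Step 1: learn per-class word counts from (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(model, tokens):
    """Step 2: return the class with the highest log-probability."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training documents with known classes.
training = [
    ("win money now".split(), "spam"),
    ("cheap money offer".split(), "spam"),
    ("meeting agenda today".split(), "no-spam"),
    ("lunch meeting tomorrow".split(), "no-spam"),
]
model = train(training)
label_new = predict(model, "cheap money".split())
print(label_new)
```

Predicting for the training documents themselves shows how well the classifier fits known data; predicting for new documents is the actual use case.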
  • 60. • Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts current mood: • Home alone for too many hours, all week long ... screaming child, headache, tears that just won’t let themselves loose.... and now I’ve lost my wedding band. I hate this. current mood: • What are the characteristic words of these two moods? [Mihalcea, R. & Liu, H. (2006). In Proc. AAAI Spring Symposium CAAW.] Slides based on Rada Mihalcea’s presentation.
  • 61. Data, data preparation and learning • LiveJournal.com: optional mood annotation • 10,000 blogs: ▫ 5,000 happy entries / 5,000 sad entries ▫ average size 175 words / entry ▫ pre-processing: remove SGML tags, tokenization, part-of-speech tagging
  • 62. Results: Corpus-derived happiness factors • yay 86.67, shopping 79.56, awesome 79.71, birthday 78.37, lovely 77.39, concert 74.85, cool 73.72, cute 73.20, lunch 73.02, books 73.02 • goodbye 18.81, hurt 17.39, tears 14.35, cried 11.39, upset 11.12, sad 11.11, cry 10.56, died 10.07, lonely 9.50, crying 5.50 • Happiness factor of a word = the number of occurrences in the happy blogposts / the total frequency in the corpus
  • 64. Using classifier learning for literature analysis: here, a (Weka) decision tree (early example: MONK). Sara Steger (2012). Patterns of Sentimentality in Victorian Novels. Digital Studies 3(2).
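The happiness-factor definition on this slide can be computed directly from word counts. In the sketch below the ratio is multiplied by 100, an assumption made because the listed values look like percentages; the toy posts are invented and much smaller than the LiveJournal corpus.

```python
from collections import Counter

def happiness_factors(happy_posts, sad_posts):
    """Occurrences in happy posts / total occurrences, as a percentage."""
    happy = Counter(word for post in happy_posts for word in post)
    sad = Counter(word for post in sad_posts for word in post)
    total = happy + sad
    return {word: 100.0 * happy[word] / total[word] for word in total}

# Hypothetical tokenized blog posts.
happy_posts = [["yay", "birthday"], ["yay", "lunch"], ["yay", "tears"]]
sad_posts = [["tears", "cried"], ["tears", "yay"]]
factors = happiness_factors(happy_posts, sad_posts)
for word in sorted(factors, key=factors.get, reverse=True):
    print(word, round(factors[word], 2))
```

A word that occurs only in happy posts scores 100, one that occurs only in sad posts scores 0, which is why rare words would normally be filtered out before ranking.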
  • 65. Many other tasks (e.g. news/blogs mining) Tasks in news / (micro-)blogs mining can be grouped by different criteria: • Basic task and type of result: description, classification and prediction (supervised or unsupervised, includes for example topic identification, tracking, and/or novelty detection; spam detection); search (ad hoc or filtering); recommendation (of blogs, blog posts, or (hash-)tags); summarization • Higher-order characterization to be extracted: especially topic or event; opinion or sentiment • Time dimension: nontemporal; temporal (stream mining); multiple streams (e.g., in different languages, see cross-lingual text mining) • User adaptation: none (no explicit mention of user issues and/or general audience); customizable; personalized. Berendt (Encyclopedia of Machine Learning and Data Mining, in press).
  • 66. 66 Real-world applications of news/blogs mining Real-world applications increasingly employ selections or, more often, combinations of these tasks by their intended users and use cases, in particular: • News aggregators allow laypeople and professional users (e.g. journalists) to see “what’s in the news” and to compare different sources’ texts on one story. Reflecting the presumption that news (especially mainstream news – sources for news aggregators are usually whitelisted) are mostly objective/neutral, these aggregators focus on topics and events. News aggregators are now provided by all major search engines. • Social-media monitoring tools allow laypeople and professional users to track not only topical mentions of a keyword or named entity (e.g. person, brand), but also aggregate sentiment towards it. The focus on sentiment reflects the perceptions that even when news-related, social media content tends to be subjective and that studying the blogosphere is therefore an inexpensive way of doing market research or public-opinion research. The whitelist here is usually the platforms (e.g. Twitter, Tumblr, LiveJournal, Facebook) rather than the sources themselves, reflecting the huge size and dynamic structure of the blogosphere / the Social Web. The landscape of commercial and free social-media monitoring tools is wide and changes frequently; up-to-date overviews and comparisons can easily be found on the Web. • Emerging application types include text mining not of, but for journalistic texts, in particular natural language generation in domains with highly schematized event structures and reporting, such as sports and finance reporting (e.g. Allen et al., 2010; narrativescience.com) and social-media monitoring tools for helping journalists find sources (Diakopoulos et al., 2012). 66 Berendt (Encyclopedia of Machine Learning and Data Mining, in press).
  • 68. Evaluation of unsupervised learning: e.g. clustering • Do the clusters make sense? • Are the instances within one cluster similar to one another? • Are the instances in different clusters dissimilar to one another? • (There are quantitative metrics of #2 and #3)
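One simple way to quantify the second and third questions is to compare the average cosine similarity of document pairs within the same cluster against that of pairs from different clusters. This is a sketch of one possible measure, not necessarily the metrics the slide has in mind; the vectors and cluster labels are made up.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean(values):
    return sum(values) / len(values) if values else 0.0

def within_and_between_similarity(vectors, labels):
    """Average similarity of same-cluster pairs and of cross-cluster pairs."""
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        similarity = cosine(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            within.append(similarity)
        else:
            between.append(similarity)
    return mean(within), mean(between)

# Hypothetical document vectors and cluster assignments.
vectors = [[1, 0], [2, 0], [0, 1], [0, 3]]
labels = [0, 0, 1, 1]
within, between = within_and_between_similarity(vectors, labels)
print(within, between)
```

A good clustering shows high within-cluster and low between-cluster similarity; whether the clusters "make sense" still needs a human reader.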
  • 69. Quality of automatic “mood separation” • Naïve Bayes text classifier ▫ five-fold cross validation • Accuracy: 79.13% (>> 50% baseline)
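Five-fold cross validation and the 50% baseline can be illustrated with a short sketch. The fold assignment (every fifth item) and the majority-class baseline classifier are assumptions for illustration; with equally many happy and sad entries, always guessing one class is right half of the time.

```python
from collections import Counter

def k_fold_indices(n_items, k):
    """Split indices 0..n_items-1 into k folds of near-equal size."""
    return [list(range(fold, n_items, k)) for fold in range(k)]

def cross_validated_accuracy(items, labels, train_fn, predict_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for test_idx in k_fold_indices(len(items), k):
        held_out = set(test_idx)
        training = [(items[i], labels[i])
                    for i in range(len(items)) if i not in held_out]
        model = train_fn(training)
        correct += sum(predict_fn(model, items[i]) == labels[i]
                       for i in test_idx)
    return correct / len(items)

def train_majority(labelled_items):
    """Baseline 'classifier': remember the most frequent training label."""
    return Counter(label for _, label in labelled_items).most_common(1)[0][0]

def predict_majority(model, item):
    return model

# Hypothetical balanced data set: 5 happy and 5 sad entries.
entries = [f"entry {i}" for i in range(10)]
moods = ["happy", "sad"] * 5
baseline_accuracy = cross_validated_accuracy(
    entries, moods, train_majority, predict_majority)
print(baseline_accuracy)
```

Any real classifier can be plugged in through `train_fn` and `predict_fn`; its cross-validated accuracy is then compared against this baseline.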
  • 76. 78 Who defines which class a document belongs to? • The researcher? • The author? • The reader? • Someone paid to do exactly this (e.g. a worker on mTurk)? • Several of them? • Someone else? 78
  • 77. 79 The importance of consensus Illustration: ESP game (“Games with a purpose“) 79 von Ahn (2005, 2006)
  • 78. 80 Measuring inter-rater reliability • Popular measure of inter-rater agreement from content analysis • Non-trivial formula (see references), but software exists.
  • 79. 81 How good is good: Magic numbers? • (Kappa is a related measure; the boundaries are the same) • Boundaries are disputed and tend to get higher • Inter-rater agreement is often systematically low, e.g. in text summarization: slightly over 50% (Berendt et al., 2014) • Recent approaches attempt to accept this ambiguity and work with it: e.g. Poesio et al. (2013)
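To make the idea of chance-corrected agreement concrete, here is a minimal sketch of Cohen's kappa for two raters. The slides do not say which kappa variant is meant, so the choice of Cohen's kappa is an assumption, and the ratings are invented. Kappa compares the observed share of items on which the raters agree with the share expected if both assigned labels independently at their own label frequencies.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(rater_a)
    # Share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented ratings of ten documents by two hypothetical raters.
rater_a = ["happy", "happy", "sad", "sad", "happy",
           "sad", "happy", "sad", "happy", "sad"]
rater_b = ["happy", "sad", "sad", "sad", "happy",
           "sad", "happy", "happy", "happy", "sad"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this toy example the raters agree on 8 of 10 items, but with two equally frequent labels half of that agreement is expected by chance, so kappa is noticeably lower than the raw 80%. This is why raw percentage agreement and kappa-style measures should not be compared against the same thresholds.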
  • 80. 82 In what sense is this an alternative? • “Given that there is no ground truth in a discipline like literary criticism, it is difficult to know how influential these results will prove. • A scholar would have to write them up in traditional article or monograph form, wait for the article or monograph to move through the peer-review process (this can take months or years) and then other scholars in the field will have to read it, be influenced by its arguments, and adjust their own interpretations of Dickinson—in turn publishing these in their own articles and monographs. • Nonetheless, we believe that the Nora system has suggested that classification and prediction can be useful agents of provocation in humanistic study.” (Kirschenbaum, 2007)
  • 82. 84 #gamergate “GamerGate is a grassroots movement with the goal of supporting ethics in game journalism. Some feminists have claimed it is a hateful, misogynistic movement, but they haven't been able to meet the burden of proof on that.” http://drunken-peasants-podcast.wikia.com/wiki/GamerGate
  • 83. 85 (Only) one reason this is interesting for text analysis “Ethics aren't the only thing #Gamergate is concerned with. As the movement made the shift from ad hominem attacks to insisting that its only interest in Quinn was as an example of nepotism and corruption in the gaming industry, it also began co-opting the language of social justice movements and of journalism to legitimize its complaints. Although their movement targets women specifically, #Gamergaters insist they speak for a victimized "demographic," and that anyone who opposes misogyny while making generalizations about gamers must be a hypocrite.” http://gawker.com/what-is-gamergate-and-why-an-explainer-for-non-geeks-1642909080
  • 84. 86 Gamergate tweets • Based on the work of Budac, A., Chartier, R., Suomela, T., Gouglas, S., & Rockwell, G. (see sources at the end of this slideset) • I received the data for the purposes of this summer school (i.e. also for you) ▫ Condition: we all respect the associated ethics code ▫ This is an interesting document in itself, and we will use it for part 3 • Data post-processed for you: “most retweeted tweets“ Oct ‘14 – Mar ‘15, in 4 versions (each version assembled into one ZIP file) ▫ 1 document per month, tweet texts ordered by count of retweets (desc.) → Voyant ▫ 1 document per tweet, sorted into 1 folder per month → DocumentAtlas/Ontogen ▫ 1 document overall (→ Weka), with fields  anonymized user ID  Month  Count in that month‘s dataset  Tweet text ▫ The same, but with some post-processing that will make your analysis easier
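The first data version described above (one document per month, tweet texts ordered by retweet count, descending) can be derived from the per-tweet records. The following is a sketch under stated assumptions: the tuple layout, the month format and the example tweets are hypothetical and only mirror the four fields listed on the slide; this is not the script that produced the actual ZIP files.

```python
from collections import defaultdict

# Hypothetical records: (anonymized user ID, month, retweet count, tweet text).
tweets = [
    ("u1", "2014-10", 120, "first example tweet"),
    ("u2", "2014-10", 340, "second example tweet"),
    ("u3", "2014-11", 75, "third example tweet"),
]

def documents_per_month(records):
    """Build one text per month, tweets ordered by retweet count (descending)."""
    by_month = defaultdict(list)
    for _user, month, count, text in records:
        by_month[month].append((count, text))
    documents = {}
    for month in sorted(by_month):
        ranked = sorted(by_month[month], key=lambda item: item[0], reverse=True)
        documents[month] = "\n".join(text for _count, text in ranked)
    return documents

docs_by_month = documents_per_month(tweets)
for month, document in docs_by_month.items():
    print(month)
    print(document)
```

Each resulting text could then be saved as one file and loaded into a corpus tool such as Voyant, which treats every file as one document.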
  • 85. 87 The post-processing applied (& user removed, & 1000 highest-ranking attributes selected)
  • 86. 88 I suggest you run trees → J48 with settings such as these, and Test Options: Use Training Set
  • 87. 89 89
  • 88. 90 Thank you! I‘ll be more than happy to hear your s
  • 89. 91 References A good textbook on Text Mining: • Feldman, R. & Sanger, J. (2007). The Text Mining Handbook. Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press. An introduction similar to this one, but also covering unsupervised learning in some detail, and with lots of pointers to books, materials, etc.: • Shaw, R. (2012). Text-mining as a Research Tool in the Humanities and Social Sciences. Presentation at the Duke Libraries, September 20, 2012. https://aeshin.org/textmining/ An overview of news and (micro-)blogs mining: • Berendt, B. (in press). Text mining for news and blogs analysis. To appear in C. Sammut & G.I. Webb (Eds.), Encyclopedia of Machine Learning and Data Mining. Berlin etc.: Springer. http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_encyclopedia_2015_with_publication_info.pdf See http://wiki.esi.ac.uk/Current_Approaches_to_Data_Mining_Blogs for more articles on the subject. Individual sources cited on the slides • Fortuna, B., Galleguillos, C., & Cristianini, N. (2009). Detecting the bias in media with statistical learning methods. In Text Mining: Classification, Clustering, and Applications, Chapman & Hall/CRC, 2009. • Qiaozhu Mei, ChengXiang Zhai: Discovering evolutionary theme patterns from text: an exploration of temporal text mining. KDD 2005: 198-207 • Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness, In Proc. AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.6759 • Kirschenbaum, M. "The Remaking of Reading: Data Mining and the Digital Humanities." In NGDM 07: National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation. http://www.cs.umbc.edu/~hillol/NGDM07/abstracts/talks/MKirschenbaum.pdf • Mueller, M. “Notes towards a user manual of MONK.” https://apps.lis.uiuc.edu/wiki/display/MONK/Notes+towards+a+user+manual+of+Monk, 2007. 
• Massimo Poesio, Jon Chamberlain, Udo Kruschwitz, Livio Robaldo and Luca Ducceschi, 2013. Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. ACM Transactions on Interactive Intelligent Systems, 3(1). http://csee.essex.ac.uk/poesio/publications/poesio_et_al_ACM_TIIS_13.pdf • Luis von Ahn (2005). Human Computation. PhD Dissertation. Computer Science Department, Carnegie Mellon University. http://reports-archive.adm.cs.cmu.edu/anon/usr0/ftp/usr/ftp/2005/abstracts/05-193.html • Luis von Ahn: Games with a Purpose. IEEE Computer 39(6): 92-94 (2006)
  • 90. 92 More DH-specific tools Overviews of 71 tools for Digital Humanists • Simpson, J., Rockwell, G., Chartier, R., Sinclair, S., Brown, S., Dyrbye, A., & Uszkalo, K. (2013). Text Mining Tools in the Humanities: An Analysis Framework. Journal of Digital Humanities, 2 (3), http://journalofdigitalhumanities.org/2-3/text-mining-tools-in-the-humanities-an-analysis-framework/ • See also the link collection on the Voyant documentation Web page
  • 91. 93 Tools (powerful, but require some computing experience) • LingPipe ▫ linguistic processing of text including entity extraction, clustering and classification, etc. ▫ http://alias-i.com/lingpipe/ • OpenNLP ▫ the most common NLP tasks, such as POS tagging, named entity extraction, chunking and coreference resolution. ▫ http://opennlp.apache.org/ • Stanford Parser and Part-of-Speech (POS) Tagger ▫ http://nlp.stanford.edu/software/tagger.shtml • NLTK ▫ Toolkit for teaching and researching classification, clustering and parsing ▫ http://www.nltk.org/ • OpinionFinder ▫ identifies subjective sentences, the source (holder) of the subjectivity, and words that are included in phrases expressing positive or negative sentiments. ▫ http://code.google.com/p/opinionfinder/ • Basic sentiment tokenizer plus some tools, by Christopher Potts ▫ http://sentiment.christopherpotts.net • Twitter NLP and Part-of-speech tagging ▫ http://www.ark.cs.cmu.edu/TweetNLP/
  • 92. 94 Further tools (thanks for your suggestions!) • ATLAS.ti: “Qualitative data analysis“ ▫ http://atlasti.com/ ▫ Commercial product, has free trial version
  • 93. 95 Gamergate sources • Budac, A., Chartier, R., Suomela, T., Gouglas, S., & Rockwell, G. (2015). #GamerGate: Distant Reading Games Discourse. Paper presented at the CGSA 2015 conference at the HSSFC Congress at University of Ottawa, Ottawa, Ontario, June 2015. • Rockwell, G. (2015). Appendix 1: Ethics of Twitter Gamergate Research. • Rockwell, Geoffrey; Suomela, Todd, 2015, "Gamergate Reactions", http://dx.doi.org/10.7939/DVN/10253 V5 [Version].
  • 94. 96 More sources • Please find the URLs of pictures and screenshots in the Powerpoint “comment“ box • Thanks to the Internet for them!