This document describes a study that uses text mining techniques such as topic modeling to analyze trends in the scientific literature on work and organizational (W&O) psychology. Abstracts from four journals published between 1975 and 2014 were analyzed. Topic modeling identified the major topics in each journal and how they changed over time, revealing emerging and declining areas of research. The study demonstrates how text mining can provide insight into trends in a field and support systematic literature reviews. Future work could analyze additional parts of the documents and develop a topic hierarchy.
2. Contents
• Background – Why this study? Hasn’t this been done before?
• Objectives – What are we really trying to do here?
• Materials – The ingredients
• Methods – The tools
• Results – Show me the outcome!
• Conclusion and Future Work – What has been achieved? How
to proceed?
Kobayashi, Mol, & Kismihók - University of Amsterdam 2
4. Background
• The psychological literature is huge (PsycINFO covers 3.7 million
documents and PubPsych has 900,000 searchable records)
• Text Mining applications
• Mining biomedical literature
• Web textual data
• Opinion and Sentiment mining from product reviews, microblogging, users’ posts and
comments.
• Text Mining opportunities for gaining insight into trends in the scientific
literature
• Key term extraction to support efficient document search and retrieval
• Identifying topics to group documents with similar themes
• So far, little text mining effort has been made in the W&O psychology
literature
6. Objectives
• Apply text mining, specifically topic modeling techniques, to the
W&O literature
• Pair topics and publication dates to reveal topical trends in this
field
Contributions
• Efficient search and retrieval of W&O psychology literature
• Supporting systematic literature review and automatic
knowledge discovery
• Identifying topics (or themes) and topic trends
8. Terminology
• Document – a file that contains a sequence of characters or text
• Corpus – collection of documents
• Term – smallest unit in a document (e.g. word, phrase,
sentence, or even a single character)
• Vocabulary or lexicon – set of all unique terms
10. SOURCE
• Abstracts from 4 journals, one line per journal:
• 1975-2014: 1096 abstracts
• 2008-2014: 89 abstracts
• 1977-2014: 1115 abstracts
• 1991-2014: 602 abstracts
• Total number of abstracts: 2902
11. For this study…
• DOCUMENT
• A single abstract
• CORPUS
• Collection of abstracts
• TERMS
• Words
• VOCABULARY
• Set of all unique words (after preprocessing) in the corpus
12. Why Abstracts only?
• The abstract contains the gist of the whole article
• Commonly, articles are indexed based on titles, keywords and
abstracts.
14. Techniques
• String Processing
• Natural Language Processing
• Topic Modeling
• Latent Dirichlet Allocation Model
• Assumes that each document is a mixture of topics
• Each word is generated from a specific topic
• An algorithm for topic discovery
• Topical Trend Analysis
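The two Latent Dirichlet Allocation assumptions listed above (each document is a mixture of topics, and each word is generated from a specific topic) are commonly written as the generative process below. The symbols are standard textbook notation and are assumptions here, not taken from the slides: K topics, topic-word distributions \varphi_k, document-topic proportions \theta_d, topic assignment z_{d,n} and observed word w_{d,n} for the n-th word of document d, and Dirichlet hyperparameters \alpha and \beta.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, N \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
\end{align*}
```

Topic discovery then amounts to inferring \varphi and \theta from the observed words alone.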
15. Analysis done separately for each journal
16. From original abstract to preprocessed abstract
• Lower case transformation
• Stopword removal
• Punctuation removal
• Stemming
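The four preprocessing steps can be sketched roughly as follows. This is a minimal illustration and not the study's actual pipeline: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's, which the slides do not name.

```python
import string

# Tiny illustrative stopword list (an assumption; the slides do not say which list was used).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to",
             "is", "are", "was", "were", "for", "with", "that", "this"}

# Suffixes removed by the stand-in stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def crude_stem(word):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(abstract):
    """Lower-case, drop stopwords, delete punctuation, then stem."""
    tokens = abstract.lower().split()
    # Compare against the stopword list ignoring punctuation stuck to the token.
    tokens = [t for t in tokens if t.strip(string.punctuation) not in STOPWORDS]
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    return [crude_stem(t) for t in tokens]


if __name__ == "__main__":
    print(preprocess("The teams were working on shared goals."))
```

Stems such as "shar" are not dictionary words, which is typical of stemming output in general.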
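17. Abstracts → the document-by-term matrix
\[
\begin{pmatrix}
a_{11} & \cdots & a_{1N} \\
\vdots & \ddots & \vdots \\
a_{V1} & \cdots & a_{VN}
\end{pmatrix}
\]
Rows: Vocabulary (terms 1 to V). Columns: Documents (1 to N).
The entries (the a’s) are the tf-idf
weights of the terms in each
document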
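The layout of this matrix (one row per vocabulary term, one column per document) can be sketched as below. As a simplifying assumption, the entries here are raw term counts rather than tf-idf weights, and the two tokenized documents are made up for illustration.

```python
from collections import Counter


def term_document_counts(docs):
    """Return (vocabulary, matrix): one row per term, one column per document."""
    vocabulary = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    # matrix[i][j] = number of times vocabulary[i] occurs in docs[j]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical, already preprocessed documents.
docs = [["team", "goal", "team"], ["leader", "goal"]]
vocabulary, matrix = term_document_counts(docs)

if __name__ == "__main__":
    for term, row in zip(vocabulary, matrix):
        print(term, row)
```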
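18. tf-idf
• There are many ways to assign weights to terms in the
documents
• The most popular is the tf-idf, computed by
\[
\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t
\]
where \(\text{tf}_{t,d}\) is the frequency of term t in document d and \(\text{idf}_t\) is the inverse document frequency of term t:
\[
\text{idf}_t = \log \frac{N}{\text{df}_t}
\]
Here \(\text{df}_t\) is the number of documents in the corpus where t
occurs, and N is the number of documents in the corpus.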
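As a worked illustration of the tf-idf weighting, the sketch below computes the weights for a toy corpus. It assumes raw counts for the term frequency and the natural logarithm for the idf; the slides do not specify the log base or any normalization, and the two documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # df[t]: number of documents in which term t occurs at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical, already preprocessed documents.
docs = [["team", "goal", "team"], ["leader", "goal"]]
weights = tf_idf(docs)

if __name__ == "__main__":
    for i, w in enumerate(weights, start=1):
        print(f"Document {i}: {w}")
```

A term that occurs in every document, such as "goal" here, gets an idf of log(1) = 0 and therefore a weight of zero, which is how tf-idf downplays terms that do not discriminate between documents.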
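19. Applying the Latent Dirichlet Allocation model
Input: the document-by-term matrix (rows: Vocabulary, columns: Documents)
\[
\begin{pmatrix}
a_{11} & \cdots & a_{1N} \\
\vdots & \ddots & \vdots \\
a_{V1} & \cdots & a_{VN}
\end{pmatrix}
\]
Apply the Latent Dirichlet Allocation model to obtain:
1. List of Topics
2. Topic classification of
documents
Apply separately for each journal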
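One common way to turn the model's output into a topic classification of documents is to assign each document to its highest-weight topic. The slides do not state which rule was used, so this is an assumption, and the document-topic proportions below are made-up numbers, not results from the study.

```python
def dominant_topic(doc_topic_proportions):
    """Return, per document, the 1-based index of its highest-weight topic."""
    return [
        max(range(len(row)), key=row.__getitem__) + 1
        for row in doc_topic_proportions
    ]


# Hypothetical document-topic proportions: one row per document, one column per topic.
theta = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]
assignments = dominant_topic(theta)

if __name__ == "__main__":
    for doc_id, topic in enumerate(assignments, start=1):
        print(f"Document {doc_id} -> Topic {topic}")
```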
20. Topical Trends
• Topic for each document
• Publication dates of documents
• Create a chart depicting the evolution of topics from the
publication dates and topics of the documents
21. Topic and publication date of each document
Document      Topic       Publication Date
Document 1    Topic 3     1990
Document 2    Topic 5     1993
…             …           …
Document N    Topic 12    1998

Number of publications per publication date and topic
Publication Date    Topic 1                   …    Topic T
1975                Number of publications    …    Number of publications
1976                Number of publications    …    Number of publications
…                   …                         …    …
2014                Number of publications    …    Number of publications
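Going from the first table (one topic and publication date per document) to the second (number of publications per year and topic) is a simple cross-tabulation. The sketch below assumes each record is a (topic, year) pair; three records mirror the example rows above and the fourth is a hypothetical extra added to show a count above one.

```python
from collections import Counter


def topic_counts_by_year(records, n_topics):
    """records: list of (topic, year) pairs.

    Returns {year: [number of publications for topic 1..n_topics]},
    with years in ascending order. Years without any document are omitted.
    """
    counts = Counter((year, topic) for topic, year in records)
    years = sorted({year for _, year in records})
    return {
        year: [counts[(year, topic)] for topic in range(1, n_topics + 1)]
        for year in years
    }


# (topic, year) pairs; the last one is a made-up extra record.
records = [(3, 1990), (5, 1993), (12, 1998), (3, 1990)]
table = topic_counts_by_year(records, n_topics=12)

if __name__ == "__main__":
    for year, row in table.items():
        print(year, row)
```

Plotting each topic's column against the year then gives the chart of topical trends.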
27. Conclusion
• Demonstrated the use of text mining for this type of application
• Gives an idea of what is keeping W&O psychology researchers
busy
• Offers a view of how W&O psychology topics evolve and gain
attention (which might reflect the development and maturation
of the field)
• Can be an alternative to traditional content analysis
• Could facilitate the peer review process by suggesting to researchers the
outlet that will most likely accept their work
28. Future Work
• Aside from extracting topics one can also extract concepts,
techniques, and key issues
• Create a hierarchy of topics
• Consider other parts of the document and not just the abstract.
30. ACKNOWLEDGEMENT
• We would like to thank our colleague Ms Sofija Pajic for helping
us out in interpreting the topics.
Editor's Notes
Please (and always) write out all first names in full
Although these are the usual suspects for a contents section, it may be good to replace these with titles that provide some more detail as to the exact content within each section (I used to have content sections like these, and presenting them is somewhat boring)
I would probably not dedicate a full slide to this.
Replace “Immense number of psychological literature ” with “The psychological literature is huge”
Replace “Text Mining opportunities for scientific literature” with “Text Mining opportunities for gaining insight into trends in the scientific literature”
Replace “Identify” with “Identifying”
Replace “has been done to” with “has been made in the”
I would probably not dedicate a full slide to this
Replace “on the” with “to the”
Replace “indentify” with “identifying”
I would probably not dedicate a full slide to this
Replace “tekst” with “text”
I would probably not dedicate a full slide to this
Replace “abstract” with “abstracts”
I would probably not dedicate a full slide to this
I would probably not dedicate a full slide to this
So after Lower case transformation, Stopwords removal, Delete punctuations, and Stemming, how are these topics regenerated? Here I see punctuation for instance.
To empower has little to do with team effectiveness
Maybe we should quickly discuss this slide (the topic interpretation part). In my understanding it would be akin to the naming of factors in an exploratory factor analysis?
Maybe use journal logos here instead of the yellow acronym?
Organizational behavior is at a different level of abstraction than bullying and harassment (in fact you may say that the former contains the latter)
Insert journal logo? It is not directly clear to me what impact assessment is.
I would probably not dedicate a full slide to this
I would drop the word successfully. Leave this to the audience to judge.
The numbers imply an order that is not really there. I would suggest dropping these.
For the hierarchy of topics you may use the organizational behavior example