The document summarizes the results of analyzing metadata and conducting topic modeling on articles published in the MIS Quarterly journal over the last 20 years. Some key findings include:
- The number of articles published per year and the average number of keywords per article have both doubled over 20 years. Most articles fall into the research article or special issue categories.
- Average article length and abstract length have increased, while average title length has remained consistent. The average numbers of tables, figures, and references per article have also seen small increases.
- Topic modeling of abstracts identified 8 dominant topics discussed in articles, including user-centric approaches and product attributes. Some topics like ethics have decreased while others like firm investments have increased.
Admixture of Poisson MRFs: A New Topic Model with Word Dependencies – David Inouye
Given a large collection of uncategorized text documents such as blogs, news articles, research papers or historical documents, how can we automatically discover major subject areas or topics in the collection? In addition, how should the abstract notion of "topic" be mathematically represented and presented to an end-user? For example, a document describing UTCS intuitively might be a combination of the topic "computer science" and the topic "University of Texas". Most topic models--and in particular the most common model, Latent Dirichlet Allocation (LDA)--attempt to answer these questions by proposing that each topic can be represented as a simple frequency distribution over possible words (i.e., a Multinomial distribution). With this representation of a topic, the ubiquitous presentation of the topic to an end-user is a list of the top 10 or 20 words. While LDA has been useful in many applications, we suggest that a simple frequency distribution is an oversimplified notion of topic and hinders both interpretation and further analysis of these topics. Thus, we propose the new topic model Admixture of Poisson MRFs (APM). Unlike in previous models, the topic representations in APM allow dependencies between words. For example, if a computer science paper contains the word "programming", it is more likely to contain the word "languages" than a random computer science paper. This talk describes the APM model, the optimization algorithm for fitting APM, some preliminary results, and some future directions.
Introduction to Latent Dirichlet Allocation (LDA). We cover the basic ideas necessary to understand LDA, then construct the model from its generative process. Intuitions are emphasized, but little guidance is given for fitting the model, since the fitting details are not especially insightful.
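The generative process that LDA is built from can be sketched in a few lines of Python; the corpus sizes and Dirichlet hyperparameters below are illustrative choices, not tied to any particular dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, vocab_size, doc_len = 3, 20, 50
alpha, beta = 0.1, 0.01            # illustrative Dirichlet hyperparameters

# Each topic is a Multinomial (frequency) distribution over the vocabulary.
topics = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

# Generate one document: draw its topic mixture, then draw each word.
theta = rng.dirichlet(np.full(n_topics, alpha))   # per-document topic weights
words = []
for _ in range(doc_len):
    z = rng.choice(n_topics, p=theta)             # pick a topic for this word
    w = rng.choice(vocab_size, p=topics[z])       # pick a word from that topic
    words.append(w)

# The ubiquitous presentation of a topic: its top-10 most probable words.
top10 = np.argsort(topics[0])[::-1][:10]
```

Fitting LDA inverts this story: given only the observed words, inference recovers plausible topic distributions and per-document mixtures.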
Topic Modeling for Learning Analytics Researchers LAK15 Tutorial – Vitomir Kovanovic
Slides from the introductory tutorial on topic modeling with R and the LSA, pLSA, and LDA algorithms, organized at the LAK15 conference in Poughkeepsie, NY, on March 17, 2015
Towards Automated Classification of Discussion Transcripts: A Cognitive Prese... – Vitomir Kovanovic
LAK'16 Conference paper presentation:
abstract:
In this paper, we present the results of an exploratory study that examined the problem of automating content analysis of student online discussion transcripts. We looked at the problem of coding discussion transcripts for the levels of cognitive presence, one of the three main constructs in the Community of Inquiry (CoI) model of distance education. Using Coh-Metrix and LIWC features, together with a set of custom features developed to capture discussion context, we developed a random forest classification system that achieved 70.3% classification accuracy and 0.63 Cohen’s kappa, which is significantly higher than the values reported in previous studies. Besides the improvement in classification accuracy, the developed system is also less sensitive to overfitting, as it uses only 205 classification features, around 100 times fewer than in similar systems based on bag-of-words features. We also provide an overview of the classification features most indicative of the different phases of cognitive presence, which gives additional insight into the nature of the cognitive presence learning cycle. Overall, our results show the great potential of the proposed approach, with the added benefit of providing further characterization of the cognitive presence coding scheme.
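Cohen's kappa, reported above alongside raw accuracy, corrects agreement for chance, which is why it is useful for imbalanced coding tasks. A minimal self-contained sketch with made-up labels (not the paper's data) might look like:

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Agreement between two label sequences, corrected for chance."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    idx = {label: i for i, label in enumerate(labels)}
    n = len(y_true)
    # Build the confusion matrix: rows = true labels, columns = predicted.
    cm = np.zeros((len(labels), len(labels)))
    for t, p in zip(y_true, y_pred):
        cm[idx[t], idx[p]] += 1
    po = np.trace(cm) / n                      # observed agreement (= accuracy)
    pe = (cm.sum(0) * cm.sum(1)).sum() / n**2  # agreement expected by chance
    return (po - pe) / (1 - pe)

# Toy example with made-up coding labels:
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 0, 1, 2, 2, 2, 1, 0])
kappa = cohens_kappa(y_true, y_pred)
```

Here accuracy is 0.75 while kappa is lower, since some of that agreement would occur by chance alone.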
I will try to explain what QA is, how we can get answers to questions posed in natural language, and how successful we have been in that domain.
I have gained all of my knowledge from the three proposed papers and related reading around them.
Keynote at the Insight@DCU Deep Learning Workshop (https://www.eventbrite.ie/e/insightdcu-deep-learning-workshop-tickets-45474212594) on successes and frontiers of Deep Learning, particularly unsupervised learning and transfer learning.
A summary of my personal expertise and knowledge, complemented by a description of some of the most relevant research and development engagements carried out so far, illustrated with specific examples
Transfer Learning for Natural Language Processing – Sebastian Ruder
Slides on Transfer Learning for Natural Language Processing by Sebastian Ruder. Talk given at Natural Language Processing Copenhagen Meetup on 31 May 2017.
Apply Chinese radicals into neural machine translation: deeper than character... – Lifeng (Aaron) Han
LPRC 2018: Limerick Postgraduate Research Conference
Lifeng Han and Shaohui Kuang. 2018. Apply Chinese radicals into neural machine translation: Deeper than character level. ArXiv pre-print https://arxiv.org/abs/1805.01565v1
Contributions to the multidisciplinarity of computer science and IS – Saïd Assar
Slides of my presentation for my Habilitation (HDR) defense in computer science (CNU section 27, Informatique) at University Paris 1 Panthéon-Sorbonne on Friday, 20 January 2017.
A video recording of the defense is available at https://www.youtube.com/watch?v=1ro_iaI-roA
Tutorial given at LAK13 conference, Leuven, April, 9th, 2013. The presentation is informed by WP2 of the LinkedUp-project.eu that develops an Evaluation Framework for Open Web Data (Linked Data) Applications for Education purposes.
Providing Cognitive Scaffolding within Computer-Supported Adaptive Learning E... – Olga Maksimenkova
Presentation at ICL-2018 by Fedor Dudyrev, Olga Maksimenkova, and Alexey Neznanov. The paper deals with an adaptive learning system that provides cognitive scaffolding in Vygotsky's sense, built on Formal Concept Analysis (FCA) and description logic foundations.
To download this, please follow this link: http://sdrv.ms/SkymHg
A simpler version of my previous seminar slides, but one that provides a clearer explanation of LDA.
The European Industrial Minerals Association issues Recognition Awards every 2 years.
The IMA-Europe Recognition Awards are granted to outstanding projects that make significant contributions to Safety, Innovation, Biodiversity, and Public Awareness, Acceptance & Trust.
Website: http://www.ima-europe.eu/award/
There are four main components to our Active Communication Experience: Listen, Collaborate, Perform, and Report. These pieces combine to form a Continuous Loop Process; once we engage, the process does not end; it only builds momentum, creating scalable historical knowledge that ties your current event to all upcoming events.
Just how far are companies willing to go for buzz? – Xavier Chefneux
More and more companies are investing in the Internet. To support their content, these companies often deploy substantial financial or human resources to provoke "buzz", whether through advertising proper or through communication strategies akin to public relations. In the current context, where social media are growing in importance, any action, good or bad, can become the subject of discussion, hijacking, or accusation. The strategies of some companies can then quickly be labeled "good buzz" or "bad buzz". Through a case study of Gamma and the "Kluspoezen", this analysis aims to answer the following question: just how far are companies willing to go for "buzz"?
Manuscript editing | Research data analyst | Data analysis – Pubrica
Big data analytics is significant because it allows businesses to use their data to uncover areas for improvement and optimization. Increased efficiency leads to smarter operations, higher earnings, and satisfied customers across all company sectors.
Read more @ https://pubrica.com/academy/manuscript-writing/how-to-prepare-a-manuscript-on-big-data-analytics/
School of Accounting Trimester 3A 2013 Information Sheet – kenjordan97598
School of Accounting Trimester 3A 2013 Information Sheet
Test 2 (15%) -Essay
Due Week 9 (5 pm on Friday 3rd January 2014 uploaded through Turnitin on Blackboard)
In the March 2001 edition of Australian CPA there was an article by Ian Nash and Adam Awty entitled “Just clowning around?”. The following is a quote from the article:
Basically, environmental and social reporting is when the accounting profession eases into its Birkenstock sandals and becomes green, fluffy and friendly. It’s the type of reporting that nobody in the market could possibly take seriously, and even if it’s on the horizon, it’s a long way from becoming a regulatory and legal issue. True or false?
With reference to accounting theory critically evaluate the above quotation and provide an opinion on the ‘true or false’ question
As outlined in the unit outline page 5, students are required to write an essay and address the following
The essay should be no shorter than 1000 words and no longer than 1500 words. (Use the word count in Microsoft Office and write the number of words at the end of the essay.)
Required Format: Students are required to upload their document through “Turnitin” in Blackboard by no later than 5pm on Friday 3rd January 2014. Essays should be typed using Microsoft Word with a minimum size 11 font and minimum 1.5 line spacing (no single-spaced submissions please). Left and right page margins should be at least 3 cm. Chicago referencing style is required for in-text and end-text referencing.
A completed assignment cover sheet should be included with the assignment, and the declaration signed by the student indicating that the work submitted is his/her own work. University policies and procedures for academic misconduct and plagiarism will be applied. Further information is available at academicintegrity.curtin.edu.au. Unsigned declarations will not be accepted.
Originality reports can be viewed by students to ensure they have referenced where appropriate and not plagiarised. (In summary, plagiarism is not giving due reference to work that is not your own, whether copied or paraphrased.) Students can reload their edited documents multiple times prior to the submission time. Assignment cover sheets may increase the percentage of similarity, but this can be ignored along with percentages related to end-text references. Other similarity matches will all be examined closely to ensure that students submit their own work. As a guide, try to keep similarity below 20%.
IMPORTANT - The file name of the Word document submitted (i.e., the submission title) needs to reflect your location and student ID. For example, if you are from Sydney your file name should be (SYD_12345678), or Hong Kong (HK_12345678), or Singapore (SING_12345678). The file name should not include your name or the title "test 2"; you can include these in your actual document, NOT THE FILE NAME.
Failure to comply with labelling and formatting instructions will result in loss of u.
IEEE Workshop at USP – How to increase the impact of your research and publications – SIBiUSP
The workshop "IEEE at USP – How to increase the impact of your research and publications" was held on June 5, 2018 in the Electrical Engineering Auditorium of the Escola Politécnica da USP. The event was organized by the Integrated Library System of USP (SIBiUSP), the Library Division of the Escola Politécnica da USP, and the IME USP Library, in partnership with EBSCO, and aimed to present tips on publishing with the IEEE to increase researchers' visibility, research activity, and international reputation. Presenter: Paul Canning.
Individual Research Paper Topics – oswald1horne84988
Individual Research Paper Topics
Discussion Topic
I'm Done
Research the speculations on where the state-of-the-art will be in the near future for one of the following technologies. Your paper should include a description of the state-of-the-art in your technology, a discussion of where the sources that you read believe the technology is heading in the near future, and a discussion of how this technology will affect the choices you would make if you were making purchase recommendations for a client. Although there is room for personal opinion in your paper, you must justify your conclusions.
Firewall policies and methodologies
Intrusion Detection
Routing protocols
Wireless network quality of services
Compare layer 2 wireless network with layer 2 wired-line network
Comparing transport layer protocols – more than TCP and UDP
Service Oriented Architecture (SOA)
Network virtualization
Video and Voice over Internet (VVoIP) or Voice over Internet (VoIP)
Cellular network infrastructure
Big Data
Fog Computing
Cloud Computing
The Internet of Everything (IoE)
Network management
Disaster Recovery
Quality of Services (QoS) at different layers
Cyber security
Note: Most of the listed topics are very broad, so you should narrow your research to some specific technical aspects related to the subject.
· Research Paper Guidelines
Discussion Topic
I'm Done
The different types of research can be classified as Theoretical, Empirical, and Evaluation. Theoretical research is focused on explaining phenomena through the logical analysis and synthesis of theories, principles, and the results of other forms of research such as empirical studies. Empirical research is focused on testing conclusions related to theories. Evaluation research is focused on a particular program, product or method, usually in an applied setting, for the purpose of describing, improving, or estimating its effectiveness and worth.
Research methods are broadly classified as Quantitative and Qualitative.
· Quantitative research includes experimental, quasi-experimental, correlational, and other methods that primarily involve collection of quantitative data and its analysis using inferential statistics such as t-tests, ANOVA, correlation, and regression analysis.
· Qualitative research includes observation, case studies, diaries, interviews, and other methods that primarily involve the collection of qualitative data and its analysis using grounded theory and ethnographic approaches. The Case Study method provides a way of studying human events and actions in their natural surroundings. It captures people and events as they appear in their daily circumstance. It can offer a researcher empirical and theoretical gains in understanding phenomena.
You, as an adult learner, bring a wealth of expertise to your studies. This knowledge and these skills should be used to formulate a research paper that raises new questions and new possibilities, and regards existing problems from a new angle. Effecti.
Ringgold Webinar Series: 3. Lean and Mean - Publication Metadata to Enhance D... – Ringgold Inc
The third session took place on Wednesday 15 February and covered making content easily discoverable. Well-structured and complete metadata about your published works are the key to ensuring content can be easily found, purchased, and used - particularly within the emerging Demand Driven Acquisition Model. The discussion explored:
- The changing landscape of discovery and collection development
- Current industry initiatives surrounding publication metadata
- Review of discovery platforms and discovery layers
- Ringgold's ProtoView service - supporting publishers with the creation and targeted dissemination of quality metadata
If You Tag it, Will They Come? Metadata Quality and Repository Management – Sarah Currier
Presentation to Metadata Perspectives 2009, a conference held in Vienna, Austria in November 2009.
When we build collections of scholarly works, learning materials, or other educational "stuff", we want people to be able to find it. This raises a number of problems, including ensuring that resources are tagged with adequate metadata. In 2004 a pioneering paper on this issue noted:
"At its best, “accurate, consistent, sufficient, and thus reliable” (Greenberg & Robertson, 2002) metadata is a powerful tool that enables the user to discover and retrieve relevant materials quickly and easily and to assess whether they may be suitable for reuse. At worst, poor quality metadata can mean that a resource is essentially invisible within the repository and remains unused." (Currier et al, 2004).
Have the five years since the above-quoted paper was published borne out its prediction: that simply expecting resource authors to create their own metadata at upload would lead to metadata of insufficient quality? Have repository managers been able to persuade funders that including professional metadata augmentation is worth the money? What has been the impact of recent Web developments allowing easier exposure, searching and sharing of resources? How is metadata being treated within the emerging domain of open educational resources? And what does all this mean for repository managers wanting to increase the discoverability of their resources, and to implement workflows for creation of good quality metadata?
Currier, S. et al (2004) Quality assurance for digital learning object repositories: issues for the metadata creation process, ALT-J, Research in Learning Technology, Vol. 12, No. 1, March 2004
http://repository.alt.ac.uk/616/1/ALT_J_Vol12_No1_2004_Quality%20assurance%20for%20digital%20.pdf
Greenberg, J. & Robertson, W. (2003) Semantic web construction: an inquiry of authors’ views on collaborative metadata generation, Proceedings of the International Conference on Dublin Core and Metadata for e-Communities 2002, 45–52.
http://dcpapers.dublincore.org/ojs/pubs/article/viewArticle/693
Metrics for continual improvements – Nolwenn Kerzreho, Lavacon Dublin 2016 – IXIASOFT
The switch to DITA is often justified using a business plan based on the expected Return On Investment (ROI). However, DITA metrics aren't just about cost savings. They are also extremely valuable in evaluating and optimizing your production process as they can help you answer the following questions:
• Is your content effectively serving your audiences?
• Is reuse optimal?
• What are the ongoing content costs?
In this session, you will learn:
• How to set the right metrics for your organization
• How to use DITA metrics beyond cost savings
• How DITA metrics can contribute to a continual improvement process
Smart Marketing for Engineers - IEEE GlobalSpec and TREW Marketing - 2017 Res... – Jenn Corcoran
Industrial marketers need to understand what content engineers consume, why they look for it, and how they find it.
Therefore, IEEE GlobalSpec and TREW Marketing partnered to conduct a survey in major regions of the world to learn critical information from technical professionals.
This research report details our findings, including the types of content engineers prefer, what portion of their buying process happens online, how they use content during the buying process, and more.
The content starts with why Text Analytics needs a special session on convincing the boss, followed by a role play summarizing current mistakes, a sample elevator pitch for your boss, and a proposed execution plan. The content is tailored for mid- to senior-level managers trying to convince leaders/executives/heads. It doesn’t provide any technical details – methodologies, tools, vendors, or hardware investments.
This was presented at the Text Analytics West Summit 2014 in San Francisco. Questions? Reach out to Ramkumar Ravichandran on LinkedIn.
Assignment 2: LASA: Research Proposal – BenitoSumpter862
Assignment 2: LASA: Research Proposal
Submit your final research paper to the Submissions Area by the due date assigned.
It should include a cover page, an abstract, an introduction, a literature review, a methodology, and a reference page.
Your final paper should be double-spaced, 8–10 pages in length, and properly edited.
Please use the following outline:
Introduction (2–3 pages)
Introduction (including the statement of the problem)
Purpose of the study
Research question and hypotheses
Theoretical framework
Operational definitions
Literature review (3 pages)
Introduction
Review of research topic (as covered by the literature)
Conclusion
Methodology (3–4 pages)
Introduction
Research design
Participants
Instruments
Procedures
Data analysis
Limitations of the study (i.e., threats to validity)
Ethical issues
Dissemination strategy
Summary
Reference page
All written assignments and responses should follow APA rules for attributing sources.
Submission Details:
By the due date assigned, save your document as M5_A2_Lastname_Firstname.doc and submit it to the Submissions Area.
This LASA is worth 300 points and will be graded according to the following rubric.
Assignment Component | Proficient | Maximum Points Possible
Articulate the problem to be researched, purpose of the study, the research question and hypotheses in operational terms aligned with the theoretical framework of the research. | States the research question in operational terms that make the question measurable, but neglects to articulate the primary hypothesis and the null hypothesis in operational terms or the relationship between them. Addresses the importance of the research with limited examples of appropriate scholarly support. Mentions the theoretical framework but only superficially developed. | 40
Presents a comprehensive literature review in support of the proposed research question. Presents and defines the research design. | Presents limited scholarly research to support the selected research design. | 40
Identify and define all relevant variables (e.g., participants). Present procedures for obtaining informed consent. | States most appropriate variables with the appropriate statistical research questions for each variable. Provides a general description of informed consent. | 40
Present a systematic description of the methodology to be used in the proposed research. | States the type of data being collected. Partially defines how that data would be collected. Addresses some limitations, but neglected others. | 40
Identify and discuss the assessment instruments to be administered and rationale. Present the empirical support for the assessments you have suggested. | Stated tests or assessment procedures proposed to address forensic issues are accurate based on the information provided in the vignette and empirically supported, but underdeveloped. Accurate but incomplete description of how these tests woul ...
ASSIGNMENT 2 - Research Proposal – sherni1
ASSIGNMENT 2 - Research Proposal
Weighting: 30% towards final grade
Word limit: 3000 (-/+10%) – text only, excluding tables, appendices, references, cover page, contents.
This is an individual piece of work
Apply the requirements of the Harvard Referencing System throughout the report.
Use the structure appearing below:
Research Proposal Specifics
You are about to commence a new research project in a field of your choice.
You are expected to write a report that constitutes a research proposal.
1. Working individually, you will:
- Have chosen a clear and specific research question/ aim/ hypothesis for your research;
- Have contextualised your research question/ aim within the academic literature;
- Understand the philosophical and methodological bases for your research;
- Have a sound method to address the research question/ aim/ hypothesis.
2. Use Harvard style in-text citation and referencing.
3. Do not copy any materials you use word for word unless you identify these sections clearly as quotations.
4. If you paraphrase any materials, you must identify sources through in-text referencing.
5. This is an individual assignment please do not work closely with anyone else.
6. Write 3000 words (+ or – 10%) excluding the header sheet, cover page, contents page, reference list, footnotes and appendices.
Marks for criteria:
10% Focus and Completion – Does the proposal address the set tasks in a meaningful manner?
20% Research Objective – Does the proposal clearly articulate ...
20% Synthesis and Soundness – Does the proposal place the research objective in the context of the relevant academic literature and any relevant past studies? Does the discussion demonstrate a comprehensive understanding of that literature?
30% Research Methods and Methodology – Does the proposal sensibly outline methods for accessing sources of data that will address or answer the research objective? Is the method consistent with the methodology?
10% Clarity of Approach – Is the proposal well organised, logically constructed and attentive to the needs of the reader? Does the timeline include a Gantt chart or key milestones for research?
10% Mechanical Soundness – Is the portfolio clearly written, spell ...
Structuring the research proposal
1. Introduction (~200 words)
Explain the issue you are examining and why it is significant.
Describe the general area to be studied.
Explain why this area is important to the general area under study (e.g., psychology of language, second language acquisition, teaching methods).
2. Background/Review of the Literature (~1000 words)
A description of what is already known about this area and a short discussion of why the background studies are not sufficient.
Summarise what is already known about the field. Include a summary of the basic background information on the topic gleaned from your literature re ...
Successful Single-Source Content Development Xyleme
This presentation looks at why single-source content development is rapidly becoming a strategic initiative within organizations. Content management experts Dawn Stevens of Comtech and Stuart Grossman of Xyleme show you how to design granular content for reusability across products, functions, and delivery modalities, and how to assess your organization's readiness for the move to single source. To view the webinar, please visit: http://www.xyleme.com/download-form?type_of_download=Webinar&nid=218
1. Analysis of Metadata and Topic Modeling for Academic Articles - MIS Quarterly Journal
Under the Supervision of Dr. Arun Rai
By Jigar Mehta
May 12th, 2016
GRA Work Report Submission
Spring 2016
2. Results and Insights – MIS Quarterly Journal – Descriptive Stats
• #Articles published per year has increased twofold over 20 years
• Avg. #Keywords per article has doubled over 20 years
• 82% of articles belong to two dominant categories: Research Article/Note (60%) and Special Issue (20%)
• Avg. length of articles (number of pages) per year has seen a threefold increase over the last 20 years
• Avg. abstract length per article was higher in '05–'10 but has been consistent since then (~1,500 characters)
• No significant trend in avg. title length (~100 characters) per article apart from small year-to-year variations
• For the last 5 years: avg. #Tables per article ranges from 7 to 8, while avg. #Figures per article is around 4
• For the last 5 years: avg. #References per article per year has seen a small increase (avg. ~85 references)
• On average, there are two authors per article
3. Results and Insights – MIS Quarterly Journal – Content Analysis
• Based on topic modeling of abstracts alone over the last 20 years, these 8 topics are the most widely discussed by authors:
User/customer-centric approach and attributes
Product/service attributes
Ethics and legal issues
Project outsourcing, teams and offshoring
Scientific studies, analysis methods and models
Firms investments, working and capabilities
Decision support systems and framework
Organizational process development and framework
Topics and their weightage: Ethics and legal issues (6%); Product/service attributes (11%); Project outsourcing, teams and offshoring (11%); Scientific studies, analysis methods and models (12%); Decision support systems and framework (13%); Firms investments, working and capabilities (14%); User-centric (16%); Organizational process development (18%)
Increasing trend of topics:
1. Product/service attributes
2. User-centric focused approach
3. Firms investment & capability alignment
Decreasing trend of topics:
1. Ethics and legal issues
2. Project outsourcing
Consistent trend of topics:
1. Organizational process development
2. Scientific studies and models
3. Decision support systems
4. Project Objective and Framework Discussion
Workflow: MISQ Journal data fetch → Python script to create metadata and other tables → Python script for base table preparation for analysis → R code for word clouds and keyword trend analysis of academic papers → descriptive analysis (code and results) → topic modeling (R code, results and presentation) → topic modeling trend analysis and presentation → topic modeling multiple iterations & Tableau final results
Visualizing Work Progress – milestones: Jan 12th, Jan 19th, Feb 2nd, Feb 16th, Mar 1st, Mar 8th, Mar 22nd, Apr 5th, Apr 19th, May 4th
7. Textual Analysis – Topic Modeling on Abstracts of Papers
Documents and words can be directly observed; topics are latent.
8. Assumptions
Documents
• A document is a mix of topics
• A single document can consist of many topics, but in different proportions
• A topic is a mix of words
• Two documents with the same topics will have overlapping words
• Use statistics to find latent topics represented by groups of words
Topics
• Find topics that are as distinct from each other as possible
• Highlight the most heavily discussed topic(s) in each paper
• Keeping α low leads to a sparse topic distribution per document
• Keeping β low leads to topics sharing fewer common words
10. Understanding the Alpha and Beta Parameters
α
• A high alpha value means that each document is likely to contain a mixture of most of the topics, and not any single topic specifically.
• A low alpha value puts fewer such constraints on documents: a document is more likely to contain a mixture of just a few, or even only one, of the topics.
β
• A high beta value means that each topic is likely to contain a mixture of most of the words, and not any word specifically.
• A low beta value means that a topic may contain a mixture of just a few of the words.
Impact on Content
• In practice, a high alpha value leads to documents being more similar in terms of the topics they contain.
• A high beta value similarly leads to topics being more similar in terms of the words they contain.
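The effect of α described above can be sketched numerically. The snippet below is an illustration (not part of the original analysis): it samples per-document topic proportions from a Dirichlet prior with K = 8 topics and compares how concentrated the mixtures are for a low versus a high α.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8  # number of topics, matching the models in this report

def avg_dominant_weight(alpha, n_docs=2000):
    """Sample theta ~ Dirichlet(alpha) for n_docs documents and return
    the mean proportion held by each document's single dominant topic."""
    theta = rng.dirichlet([alpha] * K, size=n_docs)
    return theta.max(axis=1).mean()

sparse = avg_dominant_weight(0.1)   # low alpha: one or two topics dominate
dense = avg_dominant_weight(10.0)   # high alpha: near-even mixture of topics

print(f"alpha = 0.1 -> dominant topic holds ~{sparse:.0%} of a document")
print(f"alpha = 10  -> dominant topic holds ~{dense:.0%} of a document")
```

With a low α the dominant topic carries most of a document's mass; with a high α every document looks like an even blend of all topics, which is exactly the "documents become more similar" effect noted above.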
11. Multiple Iterations – Tuning α, β, K and N – 60 Topic Models
N iterations (run 1) | N iterations (run 2) | α    | β
700                  | 1500                 | 0.02 | 0.02
1000                 | 1500                 | 0.1  | 0.08
2000                 | 1500                 | 0.3  | 0.1
5000                 | 1500                 | 0.6  | 0.4
8000                 | 1500                 | 0.8  | 0.6
10000                | 1500                 | 1    | 0.8
K ∈ {5, 8, 12, 16, 20}
Insights
• As α increases, topics are more evenly distributed in terms of the proportion of documents they hold. Low values cause a sparse topic distribution; high values cause topics to share common themes and hence overlap.
• As β increases, topics become more similar in terms of the words they are made up of. Low values yield unique topics; high values cause topics to be similar and overlap.
• As K increases, more topics are discovered. Low values cause significant topics to be missed, while higher values can produce overlapping, similar topics.
• As N increases, topic discovery stabilizes as the model approaches convergence. Low values yield unstable, unreliable topic discovery.
12. Topic Model Result 1
(Topics= 8, Iterations = 1800, alpha = 0.61, beta = 0.4)
13. Topic Trend over years and Top words for each Topic
Top words per topic:
• User-centric behavior: user, influence, adoption, users, perceived, usage, factors, intention, security, behaviors, behavior, training, individual, acceptance, relationship, affect, support, efficacy, implementation, computer, beliefs
• Product/Service attributes: product, service, quality, trust, privacy, price, consumer, electronic, markets, perceived, products, impact, content, market, effects, uncertainty, consumers, internet, sales, find, feedback
• Epistemological perspectives in IS: work, theories, managers, professionals, quandaries, deception, ethical, term, increase, stakeholder, normative, challenges, managerial, explored, resolve, law, conflict, turnover, reported, ethics, violating
• IS development/Project management (outsourcing/offshoring): project, task, time, communication, projects, groups, group, media, team, teams, members, differences, control, client, tasks, development, cultural, offshore, offshoring, learning, support
• Research Design and Methods: studies, field, analysis, modeling, researchers, interpretive, constructs, methods, models, evaluation, case, science, measurement, construct, approach, validity, statistical, principles, structural, issues, techniques
• IT Strategy/Business Value: firm, firms, strategic, strategy, risk, alignment, resource, capability, resources, investments, capabilities, level, significant, investment, outsourcing, benefits, industry, findings, network, governance, agility
• Changing nature of computing: decision, support, making, virtual, effectiveness, complexity, problem, users, tools, effects, search, user, approach, world, develop, explanations, present, framework, existing, important, interface
• Organizational processes: development, innovation, organizations, practice, technologies, analysis, develop, context, work, change, understanding, action, practices, theoretical, framework, case, processes, concept, developing, role, mechanisms
[Chart: Topic Trend over the years, 1997–2015 — yearly proportions for the eight topics: user-centric; product/service attributes; ethics and legal issues; project outsourcing, teams and offshoring; scientific studies, analysis methods and models; firms investments, working and capabilities; decision support systems and framework; organizational process development]
14. Pearson Correlation (Linear) amongst the Topics
Topics: T1 User-centric behavior; T2 Product/Service attributes; T3 Epistemological perspectives in IS; T4 IS development/Project management (outsourcing/offshoring); T5 Research Design and Methods; T6 IT Strategy/Business Value; T7 Changing nature of computing; T8 Organizational processes

       T1     T2     T3     T4     T5     T6     T7     T8
T1   1.00  -0.45   0.08   0.13  -0.12  -0.47  -0.49   0.12
T2  -0.45   1.00  -0.54  -0.27   0.22   0.21   0.04  -0.23
T3   0.08  -0.54   1.00   0.20  -0.24  -0.27   0.47  -0.20
T4   0.13  -0.27   0.20   1.00  -0.17  -0.48  -0.06  -0.17
T5  -0.12   0.22  -0.24  -0.17   1.00  -0.04  -0.10  -0.38
T6  -0.47   0.21  -0.27  -0.48  -0.04   1.00   0.15  -0.17
T7  -0.49   0.04   0.47  -0.06  -0.10   0.15   1.00  -0.49
T8   0.12  -0.23  -0.20  -0.17  -0.38  -0.17  -0.49   1.00
15. Topic Model Result 2
(Topics = 8, Iterations =1500, alpha = 0.02, beta = 0.02)
20. Semantic Relatedness and TF-IDF
Dimensionality Reduction
• Reduce the high-dimensional term vector space to a low-dimensional 'latent' topic space
Semantic Analysis
• Two words co-occurring in a text signal that they are related
• Document frequency determines the strength of the signal (co-occurrence index)
TF-IDF
• TF (Term Frequency): terms appearing more frequently in a document are more important
• IDF (Inverse Document Frequency): terms appearing in fewer documents are more specific
• TF * IDF indicates the importance of a term relative to the document
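The TF * IDF weighting above can be computed by hand. A minimal sketch over three toy "abstracts" (invented for illustration), using the common log(N/df) form of IDF:

```python
import math

# Three toy "abstracts" (illustrative data only)
docs = [
    "user adoption of information systems",
    "user trust in electronic markets",
    "topic models for text analysis",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)  # term frequency in this doc
    df = sum(term in d for d in tokenized)         # document frequency
    if df == 0:
        return 0.0
    idf = math.log(N / df)                         # inverse document frequency
    return tf * idf

# "user" appears in two documents, "topic" in only one,
# so "topic" is more specific and gets a higher weight.
print(tf_idf("user", tokenized[0]), tf_idf("topic", tokenized[2]))
```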
21. Topic Modeling Process – LDA Implementation Steps (Part 1)
• Clean the abstracts of as much noise as possible and lowercase all abstract text
• Replace all special characters and perform n-gram tokenization
• Lemmatize – reduce words to their root form, e.g., "reviews" and "reviewing" to "review"
• Remove numbers (e.g., "2014") and strip HTML tags and symbols
• Create dictionaries and a bag-of-words corpus
• Pass through the LDA algorithm and evaluate
Pipeline: Preprocessing (tokenization → lemmatization → stopwords removal) → Dictionaries and Bag-of-Words (vector space model) → LDA → Topics and their words → Tuning parameters
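The preprocessing steps can be sketched as one small function. This is an illustration with a toy stopword list and lemma table, not the author's actual script:

```python
import re

STOPWORDS = {"the", "a", "of", "and", "in", "to"}        # toy stopword list
LEMMAS = {"reviews": "review", "reviewing": "review"}    # toy lemma table

def preprocess(abstract):
    """Lowercase, strip HTML tags and numbers, tokenize,
    remove stopwords, and lemmatize."""
    text = abstract.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags
    text = re.sub(r"\d+", " ", text)       # drop numbers like "2014"
    tokens = re.findall(r"[a-z]+", text)
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]

print(preprocess("Reviewing <b>2014</b> reviews of the systems"))
# -> ['review', 'review', 'systems']
```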
22. Topic Modeling Generative Process – LDA Implementation Steps (Part 2)
For LDA, the generative model consists of the following three steps:
Step 1: Select β
• The term distribution β is determined for each topic by β ∼ Dirichlet(δ).
Step 2: Select θ
• The proportions θ of the topic distribution for the document w are determined by θ ∼ Dirichlet(α).
Step 3: Iterate
• For each of the N words wi:
• (a) Choose a topic zi ∼ Multinomial(θ).
• (b) Choose a word wi from a multinomial probability distribution conditioned on the topic zi: p(wi | zi, β).
* β is the term distribution of the topics and contains the probability of a word occurring in a given topic.
* The process is purely based on frequency and co-occurrence of words.
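The three-step generative process can be simulated directly. The dimensions below are toy values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
V, K, N = 6, 3, 20       # vocabulary size, topics, words per document
delta, alpha = 0.1, 0.5  # Dirichlet hyperparameters

# Step 1: term distribution for each topic, beta_k ~ Dirichlet(delta)
beta = rng.dirichlet([delta] * V, size=K)

# Step 2: topic proportions for one document, theta ~ Dirichlet(alpha)
theta = rng.dirichlet([alpha] * K)

# Step 3: for each word, draw a topic z_i ~ Multinomial(theta),
# then a word w_i from the chosen topic's term distribution beta[z_i]
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)
    w = rng.choice(V, p=beta[z])
    doc.append(w)

print(doc)  # N word indices drawn from the document's topic mixture
```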
25. Number of Articles Published by the Year of Publication (1977 – 2015)
Total Papers = 1081
26. Number of Articles Published by Category of Paper (2000–2015)
# Articles by Category: Research Article (281), Special Issue (111), Research Note (69), Issues and Opinions (41), Research Essay (25), Theory and Review (21), MISQ Review (7), SIM Paper Competition (3)
Total Papers = 551
27. Trend of Average # Keywords Per Article by Year (1996 – 2015)
Avg. #Keywords per article has doubled over 20 years
Total Papers = 584