Capitalizing on
Machine Reading to
Engage “Big(ger) Data”
FOUR USE CASES FROM HIGHER ED
Summer Institute on Distance Learning and Instructional Technology (SIDLIT 2016)
August 4 - 5, 2016
Overview
What are some ways to select, say, 200 research articles to “close read” from a set of 2,000 PDF
articles* gleaned from library databases and Google Scholar? How can a researcher make sense
of a trending issue in the flood of Tweets and retweets (RTs) based on a particular hashtag (#) or keyword
search or an especially lively Tweetstream based on a particular social media account? People
are dealing with ever more prodigious amounts of information—from a number of sources.
Those who are savvy to the uses of computers to aid their reading (through “distant reading” or
“not-reading”) may find that they are able to cover much more ground. This presentation
introduces the use of NVivo 11 Plus (matrix queries, word frequency counts, text searches and
dendrograms, cluster analyses, topic modeling, and others) for multiple cases of distant reading
to aid in academic and research work.
*Note: In NVivo 11 Plus, it seems easier to process .docx and .txt textual data than .PDF, so
some file transcoding may be necessary for smoother processing.
2
Presentation Topic Headings
What is Big(ger) Data?
What is Machine Reading?
Some Machine Reading Sequences
Four “Use Cases” from Higher Education
◦ Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document sets)
◦ Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data
◦ Use Case #3: Exploring Manual and Machine Coding
◦ Use Case #4: Machine Reading for Sentiment Analysis
3
What is Big(ger) Data?
4
What is Big(ger) Data?
STRUCTURED DATA
Thousands of individual records extracted
from a microblogging or social networking
platform
…and others
UNSTRUCTURED DATA
Hundreds to thousands of scraped images
from a content-sharing platform
Hundreds of videos
Complex mixed media (heterogeneous) sets of
data
…and others
5
Where Does the “Bigger Data” Come
From?
Social media platforms (crowd-sourced encyclopedias, social networking sites, microblogging
sites, blogging sites, image sharing sites, video sharing sites, email sites, collaborative work sites,
and others)
Databases of published contents
World Wide Web (WWW) and Internet
Synthetic data from various software tools, sites, or other processes
…and others
6
What is Machine
Reading?
7
What is Machine Reading?
Use of computers to decode text (which generally was created from natural language processes)
◦ Also sometimes called “distant reading” (Moretti, 2000) or “not-reading”
Capturing of quantitative measures of a text
◦ Reliable and consistent, reproducible / repeatable
◦ Patterned
◦ May be generalizable and transferable to other contexts
8
What is Machine Reading? (cont.)
Some common types of machine-based text analytics include the following:
◦ Linguistic analysis (based on word counts of function words, punctuation, and other parts of written
and spoken language)
◦ Style / stylometry
◦ Sentiment analysis
◦ Emotion analysis
◦ Cognitive analysis
◦ Social standing analysis
◦ Deception analysis, and others
9
What is Machine Reading? (cont.)
◦ Word frequency counts (and resulting word clouds)
◦ Word searches (and resulting word trees)
◦ Word network analysis (and resulting network graphs)
◦ Word similarity clustering (and resulting 2D and 3D word clusters)
◦ Word proximity clustering (and resulting 2D and 3D word clusters)
◦ Topic modeling or theme/subtheme extraction through unsupervised machine learning
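The first two items above (word frequency counts feeding word clouds, and word searches feeding word trees) can be sketched in a few lines. This is an illustrative, stdlib-only Python sketch with an invented sample text and a deliberately tiny stopwords list, not the counting logic of any particular tool:

```python
import re
from collections import Counter

# Minimal stopwords list for illustration; real tools ship much larger lists.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Tokenize, drop stopwords, and return descending (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common()

sample = ("Machine reading uses computers to read text. "
          "Machine reading scales reading to large text sets.")
print(word_frequencies(sample)[:3])
```

The resulting (word, count) list is exactly the data table behind a word cloud: word size is mapped from count.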
10
A Brief History of Machine Reading
Technologies originated in 1960s
◦ Assistive technology track (Haskins Laboratories, 1970s; Kurzweil Computer Products, 1975; and others)
◦ Natural language processing track (Bobrow, 1964; Weizenbaum, 1965; Schank, 1969; Woods, 1970;
Winograd, 1971; Hendrix, 1982; and others)
Recently
◦ applied to Web and Internet scale texts
◦ popularizing to individual academic researcher-level applications
11
Supervised and Unsupervised Machine
Learning from Text / Text Corpora
SUPERVISED OR SEMI-SUPERVISED (WITH
DIRECT HUMAN INPUTS)
Coding by existing pattern (computer emulates
human coding over part of a text set and
codes the rest of the text set, based on human
coding examples)
XML coding and analysis of coded segments of
text
◦ Often manual coding
◦ Sometimes machine-enhanced XML coding
UNSUPERVISED (WHOLLY BASED ON
COMPUTER ALGORITHM)
◦ Sentiment, emotion, cognitive, and other types
of analysis of text data
◦ Word frequency counts with stopwords lists
(and resulting word clouds and treemaps)
◦ Word searches (and resulting word trees)
◦ Word network analysis (and resulting network
graphs)
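The “coding by existing pattern” idea in the left column, where the computer emulates human coding from examples and codes the rest of the set, can be sketched as a tiny overlap classifier. This is an illustrative stdlib sketch, not NVivo’s actual algorithm; the example texts and code labels are invented:

```python
from collections import Counter, defaultdict

def train(coded_examples):
    """Build per-code word counts from human-coded example texts."""
    model = defaultdict(Counter)
    for text, code in coded_examples:
        model[code].update(text.lower().split())
    return model

def auto_code(model, text):
    """Assign the code whose example vocabulary best overlaps the new text."""
    words = text.lower().split()
    scores = {code: sum(counts[w] for w in words)
              for code, counts in model.items()}
    return max(scores, key=scores.get)

examples = [("tuition and student fees rose", "finance"),
            ("course design and online pedagogy", "teaching")]
model = train(examples)
print(auto_code(model, "new student fees announced"))  # → finance
```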
12
Supervised and Unsupervised Machine
Learning from Text / Text Corpora (cont.)
SUPERVISED OR SEMI-SUPERVISED (WITH
DIRECT HUMAN INPUTS)
Uses of human-labeled data
UNSUPERVISED (WHOLLY BASED ON
COMPUTER ALGORITHM)
◦ Word similarity clustering (and resulting
dendrogram visualizations, 2D and 3D cluster
diagrams, ring lattice graphs)
◦ Word proximity clustering (and resulting
dendrogram visualizations, 2D and 3D word
cluster diagrams, ring lattice graphs)
◦ Clustering from factor analysis
◦ Topic modeling or theme/subtheme extraction
through unsupervised machine learning (with
intensity matrices, bar charts, hierarchy
diagrams like treemaps and sunbursts)
13
Supervised and Unsupervised Machine
Learning from Text / Text Corpora (cont.)
All with extracted data tables (from which data visualizations are created)
All with extracted word lists (from which data visualizations are created)
14
Features and Affordances of Machine
Reading
FEATURES
Still some level of human interpretation
needed
Clusters are extracted, but the human has to
interpret what those clusters represent (factor
analysis, principal components analysis, and
others)
AFFORDANCES
Speed
Scale
Reproducibility
15
Under the Hood…
Bag of words (parsing: removal of punctuation, breaking writing apart into words) vs. structure
and context-preserving methods
Sliding “windows” for co-occurrence and proximity captures
Languages are fairly predictably structured, so statistical methods and counting can be applied
based on those patterns
Matrices are a common tool to capture relationships between words and parts of a text and
documents; all relational matrices may be re-visualized as network graphs and other types of
data visualizations
Pre-coded word sets or dictionaries are often used for sentiment analysis, emotion analysis, and
the extraction of psychometric features
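The sliding-“window” idea above can be made concrete: slide a fixed-width window across the token stream and count each word pair that falls inside it, producing the relational matrix that can later be re-visualized as a network graph. A minimal stdlib sketch with an invented token list:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(tokens, window=3):
    """Count word pairs that appear together within a sliding window."""
    pairs = Counter()
    for i in range(len(tokens)):
        span = tokens[i:i + window]
        for a, b in combinations(span, 2):
            pairs[tuple(sorted((a, b)))] += 1  # undirected pair
    return pairs

tokens = "machine reading aids distant reading".split()
pairs = cooccurrence(tokens, window=2)
print(pairs.most_common(3))
```

Each nonzero pair count is one weighted edge in the corresponding co-occurrence network.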
16
Under the Hood… (cont.)
Often a sequence of algorithmic procedures (and often a black box, except for the few software
makers who document their methods thoroughly and accurately)
17
Some Software and Capabilities
Linguistic Inquiry and Word Count (LIWC2015): application of psychometrics and other
constructs across over 100 dimensions; two-plus decades of empirical, lab, and other research;
dictionaries available in multiple languages; customized dictionaries may be applied
AutoMap and NetScenes: extraction of content networks, and others
(http://www.casos.cs.cmu.edu/projects/automap/)
Coh-Metrix (http://cohmetrix.com/)
DICTION (http://www.dictionsoftware.com/)
Latent Semantic Analysis @ CU Boulder (http://lsa.colorado.edu/)
(The presenter has only tried the top two. MS Word has a basic readability feature that reports
statistics based on the Flesch Reading Ease test and the Flesch-Kincaid Grade Level test.)
18
NVivo 11 Plus
Primary focus: Qualitative research data analysis suite
◦ Digital data curation
◦ Manual coding
◦ Data queries
◦ Auto coding
◦ Data visualizations, and others
Some built-in data analytics tools that enable machine reading of texts
This presentation will only focus on this particular tool for the use cases.
19
Some Machine Reading
Sequences
20
A Conceptualization of a Recursive and
Semi-Linear Sequence
Review of the Literature*
Research Design / Exploration and Discovery*
Multimedia Collection / Text Collection*
Close Human Reading
Data Cleaning
Text and Multimedia Data Curation
Data Runs with Machine Reading Software
Extractions of Data Tables and Text Sets
Data Visualizations
Write-up Presentation
*all potential start points
21
22
Power in Combinations and in Sequences
Many points-of-entry to machine reading
Ranges of tactics and strategies and capabilities based on available texts, various software tools,
and researcher capabilities
Text processing is sensitive to sequential time, so it is important to be very clear about what is
happening at each processing phase (and how the data changes)
◦ Need clear documentation of what changes happened in each step
23
Why Text?
May be human-collected data sets, sometimes computer program-scraped data sets, or a mix
Text of various types based on conventions
Text as a lowest common denominator in terms of multimedia objects (images, audio, video, and
others)
24
Text Curation and Data Cleaning
Selection of texts to include and those to exclude
Consistent and informative file naming protocols
Rendering of multimedia files to searchable text ones
Ensuring texts are machine readable
◦ Rendering of texts to searchability
De-duplication of files
Normalizing textual data, and others
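Two of the steps above, normalizing textual data and de-duplicating files, can be sketched together: normalize each document, hash the result, and keep only the first copy of each hash. An illustrative stdlib sketch (a real pipeline would hash file contents on disk; the sample strings are invented):

```python
import hashlib
import unicodedata

def normalize(text):
    """Normalize Unicode forms, collapse whitespace, and lowercase."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.lower().split())

def deduplicate(docs):
    """Keep the first copy of each document, comparing normalized hashes."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Machine  Reading", "machine reading", "Distant reading"]
print(len(deduplicate(docs)))  # → 2
```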
25
26
dendrogram hierarchy showing similarity clustering
Note: Start with the leaves at the far right and read “up”
the tree to branches and then the trunk,
in order to understand clustering relationships…
from the most granular to the most broad or coarse.
27
treemap diagram showing theme and subtheme extraction
28
treemap diagram showing theme and subtheme extraction without color overlay
29
word tree based on a text search for “data”
30
2D cluster visualization of source ties based on word similarity
Setup of Machine Reading Processes
There are a half-dozen widely-available software programs that may be used for machine
reading of texts. Those who use high-level computer languages also have packages that enable
various types of text analysis.
It helps to know what the capabilities are for the various software tools.
It helps to know what to set the parameters at for the various processes.
It helps to know how the respective data visualizations are to be interpreted and to mitigate for
the fact that data visualizations are summary data. It is important to mitigate negative learning.
31
Assertability
It helps to know what may be asserted from the respective machine reading processes, so the
findings may be accurately represented.
Also, many researchers use multiple processes and outside-understandings in order to
contextualize the data from machine reading. Those insights should be properly couched as
well.
32
Four “Use Cases” from
Higher Education
• APPLICATION OF MACHINE READING FOR:
• Learning
• Awareness
• Research
33
Overview of Use Cases
Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and
document sets)
Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data
Use Case #3: Exploring Manual and Machine Coding
Use Case #4: Machine Reading for Sentiment Analysis
34
Use Case #1: Topic Modeling and Article
Histograms (latent structure in documents and document / text sets)
Using a computer to read a large number of articles or other contents to extract topics
◦ Can be a “knowledge poor” approach in which no prior information about the domain is applied to the
topic extraction (as in the case of NVivo 11 Plus)
Topic modeling may be used to identify which works should be read using human “close
reading”
◦ Article histograms through theme / subtheme extraction (topic modeling)
◦ Understanding of topics as a finite feature set
◦ Classification of articles by their main named contents
Using a computer to extract themes and sub-themes to understand the general gist of an article
or a text set
Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the
structure of the data)
35
Use Case #1 Work Sequence
Collecting the documents as separate items in a text corpus
Cleaning the text sets
Running the theme and sub-theme extraction against the set (NVivo 11 Plus)
Selecting the appropriate themes and subthemes
Auto-coded at the more granular sentence level (vs. paragraph level)
Exporting the text-document intensity matrix
Mapping the entire set as a line graph (Excel)
Mapping separate articles as article histograms
36
Assumptions of the Topic Modeling
Approach
Word counts in respective documents are usually normalized to account for varying document
lengths (so as not to overweight the words appearing in longer documents)
Each article is represented as a list of identified “important” words
◦ Important words are likely semantic meaning-bearing terms
◦ Important words are sometimes “rare” ones
◦ Important words are not functional words (articles, pronouns, etc.) usually on a “stopwords” list
◦ If the TF-IDF (term frequency-inverse document frequency) calculation is used,
◦ words are upweighted if they occur a fair amount in a local document but occur rarely globally in the set, and
◦ words are downweighted if they occur a lot across documents (globally) … in order to lower the influence of frequently appearing
words over rarer-occurring ones
◦ Words directly linked to the identified themes are placed as sub-themes (in bigrams, three-grams, and
so on)
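The TF-IDF calculation described above can be computed directly: term frequency (normalized by document length) times the log of the inverse document frequency, so words common to every document score zero while locally frequent but globally rare words are upweighted. A minimal plain-Python sketch with invented toy documents, not NVivo’s implementation:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each word per document: term frequency x inverse document frequency."""
    n = len(docs)
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

docs = [["data", "mining", "text"],
        ["data", "reading", "text"],
        ["data", "corpus", "topic"]]
scores = tf_idf(docs)
# "data" appears in every document, so its idf (log 3/3) is zero;
# "mining" appears only in the first document, so it is upweighted there.
```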
37
Visualizations
A dataset of articles about “machine reading” from Springer, IEEEXplore Digital Library, ACM
and Google Scholar
Kept at document size and stored in one folder
38
39
autocoded nodes based on extracted themes and subthemes
40
an intensity matrix of extracted themes from an article set
41
a combined bar chart of articles and extracted top-level themes
42
treemap diagram showing theme and subtheme extraction
43
sunburst hierarchy diagram showing theme and subtheme extraction
44
an article histogram created in Excel
45
interactive word tree based on “machine reading”
An Integrated File of All Articles
1,751 pp., more than a million words from articles on “machine reading” from the academic
literature
Binder1 file
63 MB
Saved out as a .txt file from .pdf
Saved out as an .rtf file from .pdf
46
combining of article set of “machine reading” articles
for corpus-based summaries
47
synthesized set theme and subtheme extraction from text corpus
48
frequency word cloud from synthesized article set
49
treemap diagram from synthesized article set
50
interactive word tree from “machine” seeded text search from synthesized article set
51
treemap diagram showing theme and subtheme extraction from synthesized article set
52
sunburst diagram showing theme and subtheme extraction from synthesized article set
53
sub-sunburst reflecting “system” theme and linked subthemes
from synthesized article set
54
treemap diagram showing auto-extracted sentiments from
synthesized article set
55
bar chart showing autocoded sentiment in four classifications
from synthesized article set
intensity matrix
Use Case #2: Engaging with Social Media
(Microblogging and Social Network) Data
Using NCapture web browser add-on (on MS’s IE or Google Chrome) to extract a Twitter
Tweetstream
Using NCapture web browser add-on to extract a Facebook wall of postings
Ability to capture text messages, URLs, geographical coordinates, thumbnail images, and other
data
Ability to run sentiment analysis over the collected data
Ability to run theme and sub-theme extraction over the collected data
Ability to map social networks (sociograms) based on interaction data
56
Use Case #2 Work Sequence
Capture a microblogging text set (Tweetstream, #hashtag conversation, keyword conversations,
or other data extractions) from Twitter’s API
Run a text frequency count to capture a gist of the focus of the target text set (NVivo 11 Plus)
Run a word or phrase or name search to capture a sense of the gist of the targeted word use
(NVivo 11 Plus)
Process the data and extract social networks to identify main communicators in the #hashtag
network or the keyword network
57
Use Case #2 Work Sequence (cont.)
Map the microblogging data to a geographical map
Capture the URLs from the data set for more analysis
Capture related imagery from the data set for more analysis
Look at the postings over time for more analysis
Conduct a theme and sub-theme analysis of the text set to capture meaning
Map the text set for sentiment to capture the general sentiment of the text set
◦ Examine the extracted textual data for deeper insights
58
59
extracted Tweetstream dataset from Twitter using NCapture of
NVivo 11 Plus
60
dendrogram showing user names clustered by word similarity
(suggestive of intercommunications and topical interests)
61
3D cluster chart of Twitter users clustered by word similarity (zoomable,
pannable, interactive)
62
ring lattice graph / circle graph showing Twitter user accounts and clustering by word
similarity from exchanged microblogged messaging
63
mapped locations of accounts of Twitter communicators
based on shared geolocational data
64
interactive word tree from a text search for “asia”
65
sociogram depicting a target Twitter user account’s out-degree
ego neighborhood (1 deg.)
66
data table showing descending word count from a word frequency count
67
frequency word cloud from a Tweetstream with hashtag networks and (semantic)
keywords
68
treemap from frequency word count of the Tweetstream
69
treemap from autocoded theme and subtheme extraction from the Tweetstream
70
sunburst hierarchy chart showing extracted themes and subthemes from the
Tweetstream
71
sunburst hierarchy chart showing the “educator” theme and its related
subthemes from Twitter Tweetstream
72
bar chart of extracted themes (in alphabetical order) from the Twitter account
Tweetstream; subthemes not depicted here
73
least frequent word count to capture the “long tail” from the Tweetstream
(a power law curve frequency distribution; the long tail often contains
alphanumeric garble, misspellings, and a few nuggets of insight)
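Splitting a frequency table into the head and the long tail is a one-pass filter on the counts. An illustrative stdlib sketch, using an invented mini-Tweetstream and a somewhat arbitrary threshold (as noted above, the start of the long tail is a judgment call):

```python
from collections import Counter

def split_head_tail(counts, threshold=2):
    """Split a frequency table into the head and the long tail (rare words)."""
    head = {w: c for w, c in counts.items() if c >= threshold}
    tail = {w: c for w, c in counts.items() if c < threshold}
    return head, tail

tweets = "flu shot flu vaccine flu shot a88da9idef".split()
head, tail = split_head_tail(Counter(tweets))
print(sorted(tail))  # one-off terms and alphanumeric garble land in the tail
```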
74
treemap from Excel 2016
75
Words and String Data in the Long Tail
An Example of the Long Tail (from @CDCGov Tweetstream)
frequency bar chart from Excel 2016 (y-axis: frequency count, 0 to 30; x-axis: long-tail words,
hashtags, @mentions, and alphanumeric strings)
In the Long Tail…
(with a somewhat arbitrary raw count to indicate the start of the long tail)
NOT USEFUL
Misspelled words
Alphanumeric garble
POSSIBLY USEFUL
URLs (uniform resource locators) or web
addresses
Rare topics
Concepts
Unusual terms
Names, proper nouns, named entities
76
77
data table starting with frequency counts of one
78
treemap of auto-extracted sentiment: neutral, positive, negative, and mixed
79
bar chart of auto-extracted sentiments in four categories or a binary split:
• very negative, moderately negative, moderately positive, and very positive (and “neutral”);
• also as negative-positive polarity (and “neutral”)
80
text set from one of the four sentiment categories
Use Case #3: Exploring Manual and
Machine Coding
Exploring human-coded (and / or auto-coded or machine-coded) nodes for pattern
identification: interrelationships, similarity clustering, and others
Data queries to enable coding exploration: word frequency count, text search, and others
Matrix coding query to explore interrelationships between codes (nodes)
81
Use Case #3 Work Sequence
Raw source data ingested
Manual (and / or automated) coding of that data
Data queries of the coding to observe patterns in coding of that data
◦ Ability to set up text frequency counts based on various parameters: exact matches, stemmed words,
synonyms, specializations, and generalizations
Text search with special character capabilities
Proximity searches for words occurring near a certain term
82
Special Features for Text Searches
◦ Wildcard searches (?): any one character
◦ Wildcard searches (*): any run of characters
◦ AND
◦ OR
◦ NOT
◦ Required
◦ Prohibit
◦ Fuzzy
◦ Near… (proximity searches) / an extension of memory
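The wildcard and proximity operators above can be illustrated outside NVivo. This hedged sketch uses Python’s `fnmatch` for `?`/`*` matching plus a hypothetical `near()` helper for proximity search; it is not NVivo’s search engine:

```python
import fnmatch

words = ["read", "reads", "reading", "reader", "ready"]

# "?" matches exactly one character; "*" matches any run of characters.
print(fnmatch.filter(words, "read?"))  # → ['reads', 'ready']
print(fnmatch.filter(words, "read*"))  # → all five words

def near(text, a, b, window=5):
    """Proximity search: True if a and b occur within `window` words."""
    toks = text.lower().split()
    pos_a = [i for i, w in enumerate(toks) if w == a]
    pos_b = [i for i, w in enumerate(toks) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

print(near("machine reading aids the distant reader", "machine", "distant"))
```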
83
Grouping
Enabling text searches, word frequency
counts, and other queries from most specific
to increasing gradations of generality (based
on placement of the slider)
Exact matches
With stemmed words
With synonyms
With specializations
With generalizations
84
85
interactive word tree based on “player” seeding term from coded data
(whether manual-coded or auto/machine-coded or combined)
86
color-coded intensity matrix of auto-coded sentiment analysis of the selected code
87
treemap diagram of theme and sub-theme extraction from text code
88
sunburst diagram of auto-extracted theme and subtheme extraction of coding
references
89
scatterplot of auto-extracted sentiment in coding as expressed in Excel 2016
Use Case #4: Machine Reading for
Sentiment Analysis
Ability to conduct a fast extraction of sentiment
◦ either as a polarity (positive-negative), with uncoded neutral text, or as
◦ a four-category set of sentiment (very negative, moderately negative, moderately positive, and very
positive) and one of neutrality
Ability to explore the sentiment-coded text sets for textual contents
◦ Ability to query the coded text sets for additional word relationships and patterns
Ability to uncode or re-code autocoded text for sentiment to increase accuracy (through human
oversight)
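Dictionary-based sentiment scoring of the kind described above can be sketched simply: sum per-word weights, then bin the total into the four categories plus neutral. The mini-dictionary below is invented for illustration (NVivo’s actual sentiment dictionary and thresholds are not public):

```python
# Hypothetical mini sentiment dictionary; real dictionaries hold thousands of terms.
SENTIMENT = {"great": 2, "good": 1, "bad": -1, "awful": -2}

def score(text):
    """Sum dictionary weights over the words of a text."""
    return sum(SENTIMENT.get(w, 0) for w in text.lower().split())

def categorize(s):
    """Bin a score into the four sentiment categories (plus neutral)."""
    if s <= -2: return "very negative"
    if s == -1: return "moderately negative"
    if s == 0:  return "neutral"
    if s == 1:  return "moderately positive"
    return "very positive"

print(categorize(score("a great course with good support")))  # → very positive
```

The human-oversight step above corresponds to reviewing and re-coding texts whose dictionary score misses context (negation, sarcasm, domain terms).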
90
Use Case #4: Machine Reading for
Sentiment Analysis (cont.)
Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the
structure of the data)
◦ Sentences (granular) and paragraphs (coarser) are common in documents
◦ Cells are common in data tables, and may contain structured data but also phrases, URLs, and
thumbnail imagery (from social media); data tables from online surveys may contain whole sentences
and paragraphs
91
Use Case #4 Work Sequence
Run sentiment analysis on the text set
Auto-coded at paragraph level (coarser than at the granular sentence level)
Review the data visualizations (intensity matrix, bar chart)
Explore the text in each sentiment sub-category (very negative, moderately negative,
moderately positive, very positive)
◦ Run word frequency count
◦ Run selected text searches
◦ Run theme and sub-theme extraction
◦ Create additional related data visualizations
92
93
intensity matrix of auto-extracted sentiment from raw article sources
94
bar chart of sentiment from comparative articles in a text set
95
auto-coded nodes based on sentiment dictionary in either a four-category mode (top)
or a binary two-category mode (bottom)
96
access to exportable coded text sets in each of the sentiment categories
97
access to exportable coded text sets in each of the sentiment categories, with
ability to re-code and un-code
98
treemap of comparative sentiment coding per document
99
sunburst graph showing auto-coded sentiment extraction across comparative
documents
Value-added Aspects to Machine
Reading
Machine reading augments human capabilities for knowing. It decodes (“reads”) beyond a
surface level of understanding of words in various contexts.
◦ It enables efficient access to latent insights.
◦ There are ways to chain processes and compare / contrast information in a way that illuminates new
insights.
NVivo 11 Plus enables ways to save “macros” of the various data queries and autocoding
sequences to enhance re-runs of the macro sequences. This may be helpful with continuing
data collection, in order to update query results after the acquisition of new data.
There is value in comparing the outcomes from one software program to another…and to see
what happens with different settings for the different machine reading approaches.
◦ Researchers may export data tables for additional analytics and data visualizations outside of the
software.
100
Demos
(if time allows)
101
Questions? Comments?
102
Conclusion and Contact
Dr. Shalin Hai-Jew
◦ iTAC, Kansas State University
◦ 212 Hale / Farrell Library
◦ shalin@k-state.edu
◦ 785-532-5262
No ties: The presenter has no formal tie to QSR International, the maker of NVivo 11 Plus, nor to
Microsoft, the maker of Excel.
About the data visualizations: Some of the data extractions and processes used only a few dozen
source items, in part because of the need for coherent data visualizations and also to enable
processing of the data on a local Windows machine.
◦ However, in a server-hosted context, text sets closer to big(ger) data may be run using NVivo 11 Plus.
A sampling: The four “use cases” presented here are by no means comprehensive. These offer a
taste of some of the possibilities.
103

Capitalizing on Machine Reading to Engage Bigger Data

  • 1.
    Capitalizing on Machine Readingto Engage “Big(ger) Data” FOUR USE CASES FROM HIGHER ED Summer Institute on Distance Learning and Instructional Technology (SIDLIT 2016) August 4 - 5, 2016
  • 2.
    Overview What are someways to select, say, 200 research articles to “close read” from a set of 2,000 PDF articles* gleaned from library databases and Google Scholar? How can a researcher make sense of a trending issue in the flood of Tweets and RT based on a particular hashtag (#) or keyword search or an especially lively Tweetstream based on a particular social media account? People are dealing with ever more prodigious amounts of information—from a number of sources. Those who are savvy to the uses of computers to aid their reading (through “distant reading” or “not-reading”) may find that they are able to cover much more ground. This presentation introduces the use of NVivo 11 Plus (matrix queries, word frequency counts, text searches and dendrograms, cluster analyses, topic modeling, and others) for multiple cases of distant reading to aid in academic and research work. *Note: In NVivo 11 Plus, it seems easier to process .docx and .txt textual data than .PDF, so some file transcoding may be necessary for smoother processing. 2
  • 3.
    Presentation Topic Headings Whatis Big(ger) Data? What is Machine Reading? Some Machine Reading Sequences Four “Use Cases” from Higher Education ◦ Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document sets) ◦ Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data ◦ Use Case #3: Exploring Manual and Machine Coding ◦ Use Case #4: Machine Reading for Sentiment Analysis 3
  • 4.
  • 5.
    What is Big(ger)Data? STRUCTURED DATA Thousands of individual records extracted from a microblogging or social networking platform …and others UNSTRUCTURED DATA Hundreds to thousands of scraped images from a content-sharing platform Hundreds of videos Complex mixed media (heterogeneous) sets of data …and others 5
  • 6.
    Where Does the“Bigger Data” Come From? Social media platforms (crowd-sourced encyclopedias, social networking sites, microblogging sites, blogging sites, image sharing sites, video sharing sites, email sites, collaborative work sites, and others) Databases of published contents World Wide Web (WWW) and Internet Synthetic data from various software tools or sites or other processes, …and others 6
  • 7.
  • 8.
    What is MachineReading? Use of computers to decode text (which generally was created from natural language processes) ◦ Also sometimes called “distant reading” (Moretti, 2000) or “not-reading” Capturing of quantitative measures of a text ◦ Reliable and consistent, reproducible / repeatable ◦ Patterned ◦ May be generalizable and transferable to other contexts 8
  • 9.
    What is MachineReading? (cont.) Some common types of machine-based text analytics include the following: ◦ Linguistic analysis (based on word counts of function words, punctuation, and other parts of written and spoken language) ◦ Style / stylometry ◦ Sentiment analysis ◦ Emotion analysis ◦ Cognitive analysis ◦ Social standing analysis ◦ Deception analysis, and others 9
  • 10.
    What is MachineReading? (cont.) ◦ Word frequency counts (and resulting word clouds) ◦ Word searches (and resulting word trees) ◦ Word network analysis (and resulting network graphs) ◦ Word similarity clustering (and resulting 2D and 3D word clusters) ◦ Word proximity clustering (and resulting 2D and 3D word clusters) ◦ Topic modeling or theme/subtheme extraction through unsupervised machine learning 10
  • 11.
    A Brief Historyof Machine Reading Technologies originated in 1960s ◦ Assistive technology track (Haskins Laboratories, 1970s; Kurzweil Computer Products, 1975; and others) ◦ Natural language processing track (Bobrow, 1964; Weizenbaum, 1965; Schank, 1969; Woods, 1970; Winograd, 1971; Hendrix, 1982; and others) Recently ◦ applied to Web and Internet scale texts ◦ popularizing to individual academic researcher-level applications 11
  • 12.
    Supervised and UnsupervisedMachine Learning from Text / Text Corpora SUPERVISED OR SEMI-SUPERVISED (WITH DIRECT HUMAN INPUTS) Coding by existing pattern (computer emulates human coding over part of a text set and codes the rest of the text set, based on human coding examples) XML coding and analysis of coded segments of text ◦ Often manual coding ◦ Sometimes machine-enhanced XML coding UNSUPERVISED (WHOLLY BASED ON COMPUTER ALGORITHM) ◦ Sentiment, emotion, cognitive, and other types of analysis of text data ◦ Word frequency counts with stopwords lists (and resulting word clouds, treemaps,) ◦ Word searches (and resulting word trees) ◦ Word network analysis (and resulting network graphs) 12
  • 13.
    Supervised and UnsupervisedMachine Learning from Text / Text Corpora (cont.) SUPERVISED OR SEMI-SUPERVISED (WITH DIRECT HUMAN INPUTS) Uses of human-labeled data UNSUPERVISED (WHOLLY BASED ON COMPUTER ALGORITHM) ◦ Word similarity clustering (and resulting 2D and 3D word clusters) (and resulting dendrogram visualizations, 2D and 3D cluster diagrams, ring lattice graphs) ◦ Word proximity clustering (and resulting dendrogram visualizations, 2D and 3D word clusters diagrams, ring lattice graphs) ◦ Clustering from factor analysis ◦ Topic modeling or theme/subtheme extraction through unsupervised machine learning (with intensity matrices, bar charts, hierarchy diagrams like treemaps and sunbursts) 13
  • 14.
    Supervised and UnsupervisedMachine Learning from Text / Text Corpora (cont.) All with extracted data tables (from which data visualizations are created) All with extracted word lists (from which data visualizations are created) 14
  • 15.
    Features and Affordancesof Machine Reading FEATURES Still some level of human interpretation needed Extracted clusters, but the human has to apply the interpretation of what those clusters represent (factor analysis, principal components analysis, and others) AFFORDANCES Speed Scale Reproducibility 15
  • 16.
    Under the Hood… Bagof words (parsing: removal of punctuation, breaking writing apart into words) vs. structure and context-preserving methods Sliding “windows” for co-occurrence and proximity captures Languages are fairly predictably structured, so statistical methods and counting can be applied based on those patterns Matrices are a common tool to capture relationships between words and parts of a text and documents; all relational matrices may be re-visualized as network graphs and other types of data visualizations Pre-coded word sets or dictionaries are often used for sentiment analysis, emotion analysis, and the extraction of psychometric features 16
  • 17.
    Under the Hood…(cont.) Often a sequence of algorithmic procedures (and often black box except for the few software makers who are highly focused on documenting accurately) 17
  • 18.
    Some Software and Capabilities Linguistic Inquiry and Word Count (LIWC2015): application of psychometrics and other constructs across over 100 dimensions; two-plus decades of empirical, lab, and other research; dictionaries available in multiple languages; customized dictionaries may be applied AutoMap and NetScenes: extraction of content networks, and others (http://www.casos.cs.cmu.edu/projects/automap/) Coh-Metrix (http://cohmetrix.com/) DICTION (http://www.dictionsoftware.com/) Latent Semantic Analysis @ CU Boulder (http://lsa.colorado.edu/) (The presenter has only tried the top two. MS Word has a brief lexical element which enables the extraction of readability statistics based on the Flesch Reading Ease test and the Flesch-Kincaid Grade Level test.) 18
    NVivo 11 Plus Primary focus: Qualitative research data analysis suite ◦ Digital data curation ◦ Manual coding ◦ Data queries ◦ Auto coding ◦ Data visualizations, and others Some built-in data analytics tools that enable machine reading of texts This presentation will only focus on this particular tool for the use cases. 19
    A Conceptualization of a Recursive and Semi-Linear Sequence Review of the Literature* Research Design / Exploration and Discovery* Multimedia Collection / Text Collection* Close Human Reading Data Cleaning Text and Multimedia Data Curation Data Runs with Machine Reading Software Extractions of Data Tables and Text Sets Data Visualizations Write-up Presentation *all potential start points 21
    Power in Combinations and in Sequences Many points-of-entry to machine reading Ranges of tactics, strategies, and capabilities based on available texts, various software tools, and researcher capabilities Text processing is order-sensitive, so it is important to be very clear about what is happening at each processing phase (and how the data changes) ◦ Need clear documentation of what changes happened in each step 23
    Why Text? May be human-collected data sets, sometimes computer program-scraped data sets, or a mix Text of various types based on conventions Text as a lowest common denominator in terms of multimedia objects (images, audio, video, and others) 24
    Text Curation and Data Cleaning Selection of texts to include and those to exclude Consistent and informative file naming protocols Rendering of multimedia files to searchable text ones Ensuring texts are machine readable ◦ Rendering of texts to searchability De-duplication of files Normalizing textual data, and others 25
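The de-duplication and normalization steps above can be sketched as a small curation pass. The helper names and toy file contents are hypothetical; this is not how NVivo implements curation.

```python
# Minimal sketch: normalize whitespace, then de-duplicate files
# by content hash, keeping the first copy of each identical text.
import hashlib
import re

def normalize(text):
    """Collapse runs of whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip()

def dedupe(docs):
    """docs: list of (filename, text). Drop later exact duplicates."""
    seen, kept = set(), []
    for name, text in docs:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((name, text))
    return kept

docs = [("a.txt", "machine  reading"),
        ("b.txt", "machine reading"),   # duplicate of a.txt after normalizing
        ("c.txt", "close reading")]
unique = dedupe(docs)
```

Hash-based comparison scales to large folders because each file is read once and compared by digest rather than pairwise.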
    26 dendrogram hierarchy showing similarity clustering Note: Start with the leaves at the far right and read “up” the tree to branches and then the trunk, in order to understand clustering relationships… from the most granular to the most broad or coarse.
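Outside NVivo, a comparable word-similarity dendrogram can be built with scipy, assuming the documents have already been reduced to word-count vectors. The toy count matrix below is illustrative only.

```python
# Minimal sketch: hierarchical clustering of documents by word
# similarity, the idea behind the dendrogram view above.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# rows = documents, columns = counts over a shared vocabulary
counts = np.array([[4, 0, 1],
                   [3, 1, 0],
                   [0, 5, 2]], dtype=float)

# cosine distance between documents, then average-linkage clustering;
# feeding Z to scipy.cluster.hierarchy.dendrogram draws the tree
Z = linkage(pdist(counts, metric="cosine"), method="average")
# Z has one row per merge (n_docs - 1 rows):
# [cluster_a, cluster_b, merge_distance, new_cluster_size]
```

Reading the linkage matrix bottom-up mirrors the slide's advice: leaves merge first at small distances, coarser branches at larger ones.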
    27 treemap diagram showing theme and subtheme extraction
    28 treemap diagram showing theme and subtheme extraction without color overlay
    29 word tree based on a text search for “data”
    30 2D cluster visualization of source ties based on word similarity
    Setup of Machine Reading Processes There are a half-dozen widely available software programs that may be used for machine reading of texts. Those who use high-level computer languages also have packages that enable various types of text analysis. It helps to know the capabilities of the various software tools. It helps to know how to set the parameters for the various processes. It helps to know how the respective data visualizations are to be interpreted and to mitigate for the fact that data visualizations are summary data. It is important to mitigate negative learning. 31
    Assertability It helps to know what may be asserted from the respective machine reading processes, so the findings may be accurately represented. Also, many researchers use multiple processes and outside understandings in order to contextualize the data from machine reading. Those insights should be properly couched as well. 32
    Four “Use Cases” from Higher Education • APPLICATION OF MACHINE READING FOR: • Learning • Awareness • Research 33
    Overview of Use Cases Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document sets) Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data Use Case #3: Exploring Manual and Machine Coding Use Case #4: Machine Reading for Sentiment Analysis 34
    Use Case #1: Topic Modeling and Article Histograms (latent structure in documents and document / text sets) Using a computer to read a large number of articles or contents to extract topics ◦ Can be a “knowledge poor” approach in which no prior information about the domain is applied to the topic extraction (as in the case of NVivo 11 Plus) Topic modeling may be used to identify which works should be read using human “close reading” ◦ Article histograms through theme / subtheme extraction (topic modeling) ◦ Understanding of topics as a finite feature set ◦ Classification of articles by their main named contents Using a computer to extract themes and sub-themes to understand the general gist of an article or a text set Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the structure of the data) 35
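NVivo's theme extraction is proprietary, but the "knowledge poor" idea can be illustrated with LDA topic modeling in scikit-learn. The toy corpus, the number of topics, and the library choice are all assumptions for the sketch, not a reconstruction of NVivo's method.

```python
# Minimal sketch: extract latent topics from documents with no
# prior domain knowledge, using LDA over word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "students learn online courses with video lectures",
    "online learning platforms host video courses",
    "twitter hashtags trend in social media networks",
    "social networks spread media through hashtags",
]

# bag-of-words counts, with English stopwords removed
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# per-document topic mixtures: one row per article, one column per
# topic; each row is a probability distribution over the topics
doc_topics = lda.transform(counts)
```

The per-document mixture rows are what an "article histogram" summarizes: how strongly each extracted topic is present in each article.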
    Use Case #1 Work Sequence Collecting the documents as separate items in a text corpus Cleaning the text sets Running the theme and sub-theme extraction against the set (NVivo 11 Plus) Selecting the appropriate themes and subthemes Auto-coding at the more granular sentence level (vs. paragraph level) Exporting the text-document intensity matrix Mapping the entire set as a line graph (Excel) Mapping separate articles as article histograms 36
    Assumptions of the Topic Modeling Approach Word counts in respective documents are usually normalized to account for varying document lengths (so as not to overweight the words appearing in longer documents) Each article is represented as a list of identified “important” words ◦ Important words are likely semantic meaning-bearing terms ◦ Important words are sometimes “rare” ones ◦ Important words are not functional words (articles, pronouns, etc.) usually on a “stopwords” list ◦ If the TF-IDF (term frequency-inverse document frequency) calculation is used, ◦ words are upweighted if they occur a fair amount in a local document but occur rarely globally in the set, and ◦ words are downweighted if they occur a lot across documents (globally) … in order to lower the influence of frequently appearing words over rarer-occurring ones ◦ Words directly linked to the identified themes are placed as sub-themes (in bigrams, trigrams, and so on) 37
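The TF-IDF up- and down-weighting described above can be computed directly. This is one common formulation (length-normalized term frequency times log inverse document frequency); implementations vary in the exact smoothing used.

```python
# Minimal sketch of TF-IDF: a word's weight rises with its local
# frequency and falls with the number of documents it appears in.
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} per doc."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)  # normalize by document length
        weighted.append({w: (c / total) * math.log(n / df[w])
                         for w, c in tf.items()})
    return weighted

docs = [["machine", "reading", "machine"],
        ["machine", "writing"],
        ["distant", "reading"]]
weights = tf_idf(docs)
# "writing" appears in only one document, so it is upweighted
# relative to "machine", which appears in two of the three
```

Note that a word present in every document gets log(n/n) = 0, the extreme case of the downweighting the slide describes.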
    Visualizations A dataset of articles about “machine reading” from Springer, IEEE Xplore Digital Library, ACM, and Google Scholar Kept at document size and stored in one folder 38
    39 autocoded nodes based on extracted themes and subthemes
    40 an intensity matrix of extracted themes from an article set
    41 a combined bar chart of articles and extracted top-level themes
    42 treemap diagram showing theme and subtheme extraction
    43 sunburst hierarchy diagram showing theme and subtheme extraction
    44 an article histogram created in Excel
    45 interactive word tree based on “machine reading”
    An Integrated File of All Articles 1,751 pp., more than a million words from articles based on “machine reading” from the academic literature Binder1 file 63 MB Saved out as a .txt file from .pdf Saved out as an .rtf file from .pdf 46 combining of article set of “machine reading” articles for corpus-based summaries
    47 synthesized set theme and subtheme extraction from text corpus
    48 frequency word cloud from synthesized article set
    49 treemap diagram from synthesized article set
    50 interactive word tree from “machine” seeded text search from synthesized article set
    51 treemap diagram showing theme and subtheme extraction from synthesized article set
    52 sunburst diagram showing theme and subtheme extraction from synthesized article set
    53 sub-sunburst reflecting “system” theme and linked subthemes from synthesized article set
    54 treemap diagram showing auto-extracted sentiments from synthesized article set
    55 bar chart showing autocoded sentiment in four classifications from synthesized article set intensity matrix
    Use Case #2: Engaging with Social Media (Microblogging and Social Network) Data Using the NCapture web browser add-on (for Microsoft Internet Explorer or Google Chrome) to extract a Twitter Tweetstream Using the NCapture web browser add-on to extract a Facebook wall of postings Ability to capture text messages, URLs, geographical coordinates, thumbnail images, and other data Ability to run sentiment analysis over the collected data Ability to run theme and sub-theme extraction over the collected data Ability to map social networks (sociograms) based on interaction data 56
    Use Case #2 Work Sequence Capture a microblogging text set (Tweetstream, #hashtag conversation, keyword conversations, or other data extractions) from Twitter’s API Run a text frequency count to capture a gist of the focus of the target text set (NVivo 11 Plus) Run a word or phrase or name search to capture a sense of the gist of the targeted word use (NVivo 11 Plus) Process the data and extract social networks to identify main communicators in the #hashtag network or the keyword network 57
    Use Case #2 Work Sequence (cont.) Map the microblogging data to a geographical map Capture the URLs from the data set for more analysis Capture related imagery from the data set for more analysis Look at the postings over time for more analysis Conduct a theme and sub-theme analysis of the text set to capture meaning Map the text set for sentiment to capture the general sentiment of the text set ◦ Examine the extracted textual data for deeper insights 58
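The word frequency step in the sequence above can be sketched over a captured Tweetstream. The tweets, the stopword list, and the tokenizer are illustrative assumptions.

```python
# Minimal sketch: word frequency count over microblog messages,
# keeping #hashtags and @mentions as tokens, skipping stopwords.
import re
from collections import Counter

STOPWORDS = {"the", "a", "to", "in", "of", "and", "is", "for"}

def word_frequencies(tweets):
    counts = Counter()
    for tweet in tweets:
        for word in re.findall(r"[#@]?\w+", tweet.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

tweets = ["Distant reading for the win #digitalhumanities",
          "RT @prof: distant reading scales to big data"]
freq = word_frequencies(tweets)
top = freq.most_common(3)  # a quick gist of the stream's focus
```

The descending count table (as in the data table slide later in this section) is just `freq.most_common()` rendered as rows.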
    59 extracted Tweetstream dataset from Twitter using NCapture of NVivo 11 Plus
    60 dendrogram showing usernames clustered by word similarity (suggestive of intercommunications and topical interests)
    61 3D cluster chart of Twitter users clustered by word similarity (zoomable, pannable, interactive)
    62 ring lattice graph / circle graph showing Twitter user accounts and clustering by word similarity from exchanged microblogged messaging
    63 mapped locations of accounts of Twitter communicators based on shared geolocational data
    64 interactive word tree from a text search for “asia”
    65 sociogram depicting a target Twitter user account’s out-degree ego neighborhood (1 deg.)
    66 data table showing descending word count from a word frequency count
    67 frequency word cloud from a Tweetstream with hashtag networks and (semantic) keywords
    68 treemap from frequency word count of the Tweetstream
    69 treemap from autocoded theme and subtheme extraction from the Tweetstream
    70 sunburst hierarchy chart showing extracted themes and subthemes from the Tweetstream
    71 sunburst hierarchy chart showing the “educator” theme and its related subthemes from the Twitter Tweetstream
    72 bar chart of extracted themes (in alphabetical order) from the Twitter account Tweetstream; subthemes not depicted here
    73 least frequent word count to capture the “long tail” from the Tweetstream; often alphanumeric garble, misspellings, and a few nuggets of insight in the long tail of the power law frequency distribution
    In the area marked by ellipses… 74 treemap from Excel 2016
    In the Long Tail… (with a somewhat arbitrary raw count to indicate the start of the long tail) NOT USEFUL Misspelled words Alphanumeric garble POSSIBLY USEFUL URLs (uniform resource locators) or web addresses Rare topics Concepts Unusual terms Names, proper nouns, named entities 76
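The triage of the long tail described above can be sketched as a filter over low-frequency words. The cutoff of 1 is as arbitrary as the slide notes, and the URL heuristic and sample counts are assumptions.

```python
# Minimal sketch: pull the "long tail" (frequency <= cutoff) from a
# word count and separate URL-like strings from other rare terms
# (rare topics, unusual terms, names... and garble) for human review.
from collections import Counter

def long_tail(counts, cutoff=1):
    """Split rare words into URL-like strings and other candidates."""
    rare = [w for w, c in counts.items() if c <= cutoff]
    urls = [w for w in rare
            if w.startswith(("http://", "https://", "www."))]
    other = [w for w in rare if w not in urls]
    return urls, other

counts = Counter({"data": 40, "reading": 25,
                  "https://ex.ample/x1": 1, "sociogram": 1, "zx9q": 1})
urls, other = long_tail(counts)
```

Only a human pass over `other` can sort the nuggets (e.g. "sociogram") from the garble (e.g. "zx9q"); the code just narrows the pile.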
    77 data table starting with frequency counts of one
    78 treemap of auto-extracted sentiment: neutral, positive, negative, and mixed
    79 bar chart of auto-extracted sentiments in four categories or a binary split: • very negative, moderately negative, moderately positive, and very positive (and “neutral”); • also as negative-positive polarity (and “neutral”)
    80 text set from one of the four sentiment categories
    Use Case #3: Exploring Manual and Machine Coding Exploring human-coded (and / or auto-coded or machine-coded) nodes for pattern identification: interrelationships, similarity clustering, and others Data queries to enable coding exploration: word frequency count, text search, and others Matrix coding query to explore interrelationships between codes (nodes) 81
    Use Case #3 Work Sequence Raw source data ingested Manual (and / or automated) coding of that data Data queries of the coding to observe patterns in coding of that data ◦ Ability to set up text frequency counts based on various parameters: exact matches, stemmed words, synonyms, specializations, and generalizations Text search with special character capabilities Proximity searches for words occurring near a certain term 82
    Special Features for Text Searches ◦ Wildcard searches (?) any one character ◦ Wildcard searches (*) any characters ◦ AND ◦ OR ◦ NOT ◦ Required ◦ Prohibit ◦ Fuzzy ◦ Near… (proximity searches) / an extension of memory 83
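The ? and * wildcards above translate naturally to regular expressions, as sketched below. This is illustrative only; NVivo's search also layers on the Boolean, fuzzy, and proximity operators listed.

```python
# Minimal sketch: wildcard word search, where ? matches any one
# character and * matches any run of characters.
import re

def wildcard_search(pattern, words):
    """Return the words matching the wildcard pattern exactly."""
    # escape regex metacharacters, then re-enable the two wildcards
    regex = re.escape(pattern).replace(r"\?", ".").replace(r"\*", ".*")
    rx = re.compile(f"^{regex}$")
    return [w for w in words if rx.match(w)]

words = ["code", "coded", "coding", "cube", "corpus"]
hits_q = wildcard_search("c?de", words)  # -> ["code"]
hits_s = wildcard_search("cod*", words)  # -> ["code", "coded", "coding"]
```

Escaping first means a literal pattern like "u.s." cannot accidentally behave as a regex; only ? and * stay special.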
    Grouping Enabling text searches, word frequency counts, and other queries from most specific to increasing gradations of generality (based on placement of the slider) Exact matches With stemmed words With synonyms With specializations With generalizations 84
    85 interactive word tree based on “player” seeding term from coded data (whether manual-coded or auto/machine-coded or combined)
    86 color-coded intensity matrix of auto-coded sentiment analysis of the selected code
    87 treemap diagram of theme and sub-theme extraction from text code
    88 sunburst diagram of auto-extracted themes and subthemes from coding references
    89 scatterplot of auto-extracted sentiment in coding as expressed in Excel 2016
    Use Case #4: Machine Reading for Sentiment Analysis Ability to conduct a fast extraction of sentiment ◦ either as a polarity (positive-negative), with uncoded neutral text, or as ◦ a four-category set of sentiment (very negative, moderately negative, moderately positive, and very positive) and one of neutrality Ability to explore the sentiment-coded text sets for textual contents ◦ Ability to query the coded text sets for additional word relationships and patterns Ability to uncode or re-code autocoded text for sentiment to increase accuracy (through human oversight) 90
    Use Case #4: Machine Reading for Sentiment Analysis (cont.) Can auto-code at three levels of granularity: sentence, paragraph, and cell (depending on the structure of the data) ◦ Sentences (granular) and paragraphs (coarser) are common in documents ◦ Cells are common in data tables, and many contain structured data but also phrases, URLs, and thumbnail imagery (from social media); data tables from online surveys may contain whole sentences and paragraphs 91
    Use Case #4 Work Sequence Run sentiment analysis on the text set Auto-code at the paragraph level (coarser than at the granular sentence level) Review the data visualizations (intensity matrix, bar chart) Explore the text in each sentiment sub-category (very negative, moderately negative, moderately positive, very positive) ◦ Run word frequency count ◦ Run selected text searches ◦ Run theme and sub-theme extraction ◦ Create additional related data visualizations 92
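Dictionary-based sentiment scoring of the kind described in this use case can be sketched as below. The tiny lexicon, the score thresholds, and the sample texts are illustrative assumptions, not NVivo's sentiment dictionary.

```python
# Minimal sketch: score text against a signed word lexicon and bin
# the total into the four sentiment categories (plus neutral).
LEXICON = {"great": 2, "good": 1, "useful": 1,
           "poor": -1, "bad": -1, "terrible": -2}

def classify(text):
    """Sum per-word scores, then map the total to a category."""
    score = sum(LEXICON.get(w, 0) for w in text.lower().split())
    if score <= -2:
        return "very negative"
    if score < 0:
        return "moderately negative"
    if score == 0:
        return "neutral"
    if score < 2:
        return "moderately positive"
    return "very positive"

labels = [classify(s) for s in
          ["A great and useful tool", "Terrible documentation",
           "The interface is bad", "It runs"]]
```

The misclassifications such a word-level approach makes (negation, sarcasm, domain terms) are exactly why the human re-coding and un-coding pass in this use case matters.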
    93 intensity matrix of auto-extracted sentiment from raw article sources
    94 bar chart of sentiment from comparative articles in a text set
    95 auto-coded nodes based on sentiment dictionary in either a four-category mode (top) or a binary two-category mode (bottom)
    96 access to exportable coded text sets in each of the sentiment categories
    97 access to exportable coded text sets in each of the sentiment categories, with ability to re-code and un-code
    98 treemap of comparative sentiment coding per document
    99 sunburst graph showing auto-coded sentiment extraction across comparative documents
    Value-added Aspects to Machine Reading Machine reading augments human capabilities for knowing. It decodes (“reads”) beyond a surface-level understanding of words in various contexts. ◦ It enables efficient access to latent insights. ◦ There are ways to chain processes and compare / contrast information in a way that illuminates new insights. NVivo 11 Plus enables ways to save “macros” of the various data queries and autocoding sequences to enhance re-runs of the macro sequences. This may be helpful with continuing data collection, in order to update query results after the acquisition of new data. There is value in comparing the outcomes from one software program to another… and in seeing what happens with different settings for the different machine reading approaches. ◦ Researchers may export data tables for additional analytics and data visualizations outside of the software. 100
    Conclusion and Contact Dr. Shalin Hai-Jew ◦ iTAC, Kansas State University ◦ 212 Hale / Farrell Library ◦ shalin@k-state.edu ◦ 785-532-5262 No ties: The presenter has no formal tie to QSR International, the maker of NVivo 11 Plus, nor to Microsoft, the maker of Excel. About the data visualizations: Some of the data extractions and processes used only a few dozen source items, in part because of the need for coherent data visualizations and also to enable processing of the data on a local Windows machine. ◦ However, in a server-hosted context, text sets closer to big(ger) data may be run using NVivo 11 Plus. A sampling: The four “use cases” presented here are by no means comprehensive. These offer a taste of some of the possibilities. 103