Distant reading refers to the use of computers to “read” texts by counting words, identifying themes and subthemes (through topic modeling), extracting sentiment, applying psychological analysis to the author(s), and otherwise finding latent or hidden insights. This work draws on research on “mass surveillance” across five text sets: academic writing, mainstream journalism, microblogging, Wikipedia articles, and leaked government data. The purpose was to capture, indirectly, some insights about the collective social discussions occurring around this issue. This presentation uses a variety of data visualizations (article network graphs, word trees, dendrograms, treemaps, cluster diagrams, line graphs, bar charts, pie charts, and others) to show how machines read and the types of summary data they enable (at computational speeds, at machine scale, and in a reproducible way). Also, some computational linguistic analysis tools enable the creation of custom dictionaries for unique types of applied research. The tools used in this presentation include NVivo 11 Plus and LIWC2015.
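As a concrete illustration of the word counting this kind of distant reading begins with, here is a minimal sketch; the toy corpus and stopword list are invented, and real tools like NVivo use far larger dictionaries:

```python
from collections import Counter
import re

def top_terms(docs, stopwords=frozenset({"the", "a", "of", "to", "and", "in"}), n=5):
    """Count word frequencies across a small corpus, skipping stopwords."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in re.findall(r"[a-z']+", doc.lower())
                      if w not in stopwords)
    return counts.most_common(n)

# Invented two-document corpus for demonstration
docs = [
    "Mass surveillance draws academic and journalistic attention.",
    "Leaked data renewed the surveillance debate in journalism.",
]
print(top_terms(docs, n=3))  # "surveillance" appears in both documents
```

At machine scale, the same counting runs unchanged over thousands of documents, which is what makes the summaries reproducible.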
Using “Distant Reading” to Explore Discussion Threads in Online Courses
Shalin Hai-Jew
In this age of mass data, “distant reading” has come to the fore as a way to deal with large amounts of text data—including from student discussion threads in online courses. Kansas State University has a site license for NVivo 11 Plus, software that enables multimedia data curation and qualitative and mixed methods data analysis. Two new features in NVivo—sentiment analysis and theme extraction (topic modeling)—enable users to “distant read” large amounts of text to extract some early insights.
What are the expressed sentiments of learners when discussing a particular issue? Do these trend positive or negative?
What topics or themes or concepts are brought up by students given a certain discussion thread prompt?
What do the sentiment and topic insights suggest about where students stand on a particular issue? Are there latent (hidden) insights?
These new features, in combination with text frequency counts (with related text clustering), text searches, and other text data query capabilities (and related data visualization capabilities) in NVivo, enable distant reading for use in online courses. This digital slideshow will introduce NVivo 11 Plus (a local software tool with both Windows and Mac versions) and walk through how it may be applied to textual data extracted from an online course.
Understanding what students are thinking is a critical part of transformational teaching and learning. Using computational means to listen and to hear is important to this end.
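The sentiment questions above can be made concrete with a toy lexicon-based scorer. The word lists and posts below are invented and far simpler than NVivo's sentiment dictionary; the sketch only shows the general mechanism of scoring posts and aggregating a thread-level trend:

```python
# Illustrative word lists, not NVivo's actual sentiment dictionary.
POSITIVE = {"helpful", "clear", "enjoyed", "great"}
NEGATIVE = {"confusing", "frustrated", "unclear", "boring"}

def post_sentiment(post):
    """Score one discussion post: positive minus negative word hits."""
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def thread_trend(posts):
    """Label the overall thread as positive, negative, or neutral."""
    total = sum(post_sentiment(p) for p in posts)
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

# Invented discussion posts
posts = ["the readings were clear and helpful", "i felt frustrated by the quiz"]
print(thread_trend(posts))  # +2 from the first post, -1 from the second
```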
Capitalizing on Machine Reading to Engage Bigger Data
Shalin Hai-Jew
What are some ways to select, say, 200 research articles to “close read” from a set of 2,000 PDF articles gleaned from library databases and Google Scholar? How can a researcher make sense of a trending issue in the flood of Tweets and retweets (RTs) based on a particular hashtag (#) or keyword search, or an especially lively Tweetstream from a particular social media account? People are dealing with ever more prodigious amounts of information—from a number of sources. Those who are savvy about using computers to aid their reading (through “distant reading” or “not-reading”) may find that they are able to cover much more ground. This presentation introduces the use of NVivo 11 Plus (matrix queries, word frequency counts, text searches and dendrograms, cluster analyses, topic modeling, and others) for multiple cases of distant reading to aid in academic and research work.
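One crude, illustrative way to shortlist articles computationally is to rank them by the TF-IDF weight of a few query terms. This is not NVivo's method, and the mini-corpus is invented; it only sketches the shortlisting idea:

```python
import math
from collections import Counter

def tfidf_rank(docs, query_terms, k=2):
    """Rank documents by summed TF-IDF weight of the query terms --
    one crude way to shortlist articles for close reading."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)

    def idf(term):
        df = sum(term in toks for toks in tokenized)  # document frequency
        return math.log((n + 1) / (df + 1)) + 1       # smoothed IDF

    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        scores.append((sum(tf[t] * idf(t) for t in query_terms), i))
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# Invented three-"article" corpus
docs = [
    "privacy surveillance policy debate",
    "surveillance surveillance leak privacy",
    "course design multimedia learning",
]
print(tfidf_rank(docs, ["surveillance", "privacy"], k=2))  # most relevant first
```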
Writing and Publishing about Applied Technologies in Tech Journals and Books
Shalin Hai-Jew
This slideshow provides insights on how to write and publish about applied technologies in tech journals and books, including the following:
Getting started in tech publishing
Cost-benefit calculations
Parts to an article; parts to a chapter
Writing process
Collaborating
Publishing process
Acquiring readers (and citations)
Post-publishing
Next works
Building a Digital Learning Object w/ Articulate Storyline 2
Shalin Hai-Jew
The digital learning object (DLO) is still a common staple in online learning. One of the more sophisticated authoring tools for building DLOs is Articulate Storyline 2, which enables the integration of multimedia (including screen captures with Articulate Replay), the building of animations, branching, and other features. Its packaging allows a full range of SCORM and Tin Can API outputs and versioning in HTML5. This presentation will introduce the software tool and some of its capabilities to provide a sense of where digital learning objects may be headed.
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Grace Hui Yang
This is the introductory talk for the TREC Dynamic Domain Track. The Track ran from 2015 to 2017, aiming to evaluate and advance research in dynamic search and domain-specific search. This talk was prepared to introduce the ideas and setups in the upcoming Track to the research community.
Influence of Timeline and Named-entity Components on User Engagement
Roi Blanco
Nowadays, successful applications are those with features that captivate and engage users. Using an interactive news retrieval system as a use case, in this paper we study the effect of timeline and named-entity components on user engagement. This is in contrast with previous studies, where the importance of these components was studied from a retrieval-effectiveness point of view. Our experimental results show significant improvements in user engagement when named-entity and timeline components were installed. Further, we investigate whether we can predict user-centred metrics through users' interaction with the system. Results show that we can successfully learn a model that predicts all dimensions of user engagement and whether users will like the system or not. These findings might steer systems toward a more personalised user experience, tailored to users' preferences.
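The paper's actual models are not reproduced here, but the idea of predicting whether a user will like the system from interaction signals can be sketched with a toy nearest-centroid classifier. The features and data are entirely invented:

```python
def train_centroids(X, y):
    """Nearest-centroid classifier: average the feature vectors per class."""
    cents = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, x):
    """Assign the class whose centroid is nearest (squared Euclidean)."""
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    return min(cents, key=lambda lab: dist(cents[lab], x))

# Invented interaction features: [timeline_clicks, entity_hovers, dwell_minutes]
X = [[9, 4, 12.0], [7, 5, 10.5], [1, 0, 2.0], [0, 1, 1.5]]
y = ["liked", "liked", "disliked", "disliked"]
model = train_centroids(X, y)
print(predict(model, [8, 3, 11.0]))  # near the "liked" centroid
```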
Broad introduction to information retrieval and web search, used for teaching at the Yahoo Bangalore Summer School 2013. Slides are a mash-up from my own and other people's presentations.
An Introduction to Information Retrieval and Applications
sathish sak
The score you get depends on the functions, difficulty, and quality of your project.
For system development:
System functions and correctness
For academic paper presentation:
Quality and your presentation of the paper
Major methods/experimental results *must* be presented
Papers from top conferences are strongly suggested
E.g. SIGIR, WWW, CIKM, WSDM, JCDL, ICMR, …
Proposals are *required* for each team, and will be counted in the score
This 2-hour lecture was held at Amsterdam University of Applied Sciences (HvA) on October 16, 2013. It gives a basic overview of core technologies used by ICT companies such as Google, Twitter, or Facebook. The lecture does not require a strong technical background and stays at a conceptual level.
MLIS Course Code 5501: Information Retrieval and Dissemination Workshop, AIOU, 2013. Keywords: information management, information retrieval, information dissemination, information science, computer science, information technology, hardware, software, computer basics.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, astrophysics and solar physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high-performance computing. In this talk, we will mainly focus on the other challenges from the perspective of collaborative sharing and reuse of the broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration, and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work supporting physicists (including astrophysicists) [1], the life sciences [2], and the material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
Keystone Summer School 2015: Ontologies For Information Retrieval
Mauro Dragoni
The presentation provides an overview of what an ontology is and how it can be used for representing information and retrieving data, with a particular focus on the linguistic resources available for supporting this kind of task. It also surveys semantic-based retrieval approaches, highlighting the pros and cons of semantic approaches with respect to classic ones. Use cases are presented and discussed.
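As a minimal illustration of how an ontology can support retrieval, here is a toy query-expansion sketch: the ontology is reduced to a synonym/broader-term map, and every query term is expanded with its neighbours. The terms and relations are invented for illustration:

```python
# Toy "ontology" as a synonym / broader-term map (invented entries).
ONTOLOGY = {
    "car": ["automobile", "vehicle"],
    "flu": ["influenza", "illness"],
}

def expand_query(terms, ontology):
    """Add each query term's ontology neighbours, so documents that use a
    synonym of the query term can still be retrieved."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(ontology.get(term, []))
    return expanded

print(expand_query(["car", "rental"], ONTOLOGY))
```

A document mentioning only "automobile" would now match the expanded query, which is the basic pro of semantic retrieval; the con, also discussed in the presentation, is that expansion can drift away from the user's intent.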
See Ya! Creating a Custom Spatial-Based Linguistic Analysis Dictionary from ...
Shalin Hai-Jew
American Renunciation of Citizenship (by the numbers)
LIWC2015 and Custom Dictionaries
Tapping Twitter, Facebook, Flickr, Wikipedia, and Reddit
The “See Ya!” Dictionary
Lessons about Custom Spatial-Based Dictionary-Making
Space, Place, and the Renunciation of U.S. Citizenship (from social media datasets)
Some Future Research Directions
Formations & Deformations of Social Network Graphs
Shalin Hai-Jew
Social network graphs are node-link (vertex-edge; entity-relationship) diagrams that show relationships between people and groups. Open-source tools like NodeXL Basic (available on Microsoft’s CodePlex) enable the capture of network data from select social media platforms through third-party add-ons and social media APIs. From social groups, relational clusters are extracted with clustering algorithms which identify intensities of connections. Visually, structural relational data is conveyed with layout algorithms in two-dimensional space. Using these various layout options and built-in visual design features, it is possible to aesthetically “deform” the network graph data for visual effects. This presentation introduces novel datasets and novel data visualizations.
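The clustering step described above can be illustrated in miniature: the simplest "cluster" a graph tool can extract is a connected component of the node-link graph. This sketch uses invented reply edges and plain breadth-style traversal, not NodeXL's actual clustering algorithms:

```python
from collections import defaultdict

def components(edges):
    """Group accounts into connected components of an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(adj[cur] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Hypothetical reply edges between accounts
edges = [("ann", "bob"), ("bob", "cat"), ("dan", "eve")]
print(components(edges))  # two clusters
```

Real clustering algorithms (e.g. modularity-based ones) go further and split dense subgroups within one component, but the input, a list of edges, is the same.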
Designing Online Learning to Actual Human Capabilities
Shalin Hai-Jew
In instructional design work, instructional designers (IDs) often focus on changing technological capabilities (of authoring tools, of learning management systems, and so on)—namely, on enablements / affordances and constraints. What is less often discussed are human capabilities and their affordances and constraints. Human enablements may be broadly conceptualized as the following: (1) perception (the five senses and proprioception), (2) cognition, (3) learning, (4) memory, (5) decision-making, and (6) action-taking. This presentation summarizes some of the latest research on these areas of human capability and some design mitigations that account for these particular aspects of people.
LIWC-ing at Texts for Insights from Linguistic Patterns
Shalin Hai-Jew
Since the mid-1990s, researchers have been using the Linguistic Inquiry and Word Count (LIWC, pronounced “luke”) software tool to explore various text corpora for hidden insights from linguistic patterns. The LIWC tool has evolved over the years. Simultaneously, research using computational text analysis has evolved and shed light on areas of deception, threat assessment, personality, predictive analytics, and others. This presentation will highlight some of the applications of LIWC in the research literature and showcase the tool on some original text sets.
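A rough sketch of how a LIWC-style tool scores a text against a dictionary follows. The categories and word lists are invented, not LIWC2015's, though the prefix-wildcard convention (`certain*`) does mirror how LIWC dictionary entries work:

```python
def category_rates(text, dictionary):
    """LIWC-style scoring sketch: percent of words matching each category.
    Entries ending in '*' match as prefixes; others match exactly."""
    words = text.lower().split()

    def hit(word, entry):
        return word.startswith(entry[:-1]) if entry.endswith("*") else word == entry

    rates = {}
    for cat, entries in dictionary.items():
        n = sum(any(hit(w, e) for e in entries) for w in words)
        rates[cat] = 100.0 * n / len(words)
    return rates

# Invented custom dictionary with two categories
custom = {"certainty": ["always", "never", "certain*"],
          "tentative": ["maybe", "perhaps", "might"]}
print(category_rates("we might never know perhaps", custom))
```

Custom dictionaries of exactly this shape (category name plus word/stem list) are what make LIWC adaptable to unique applied-research questions.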
What is NodeXL (Network Overview, Discovery and Exploration for Excel)?
Graph aesthetics in NodeXL
Visual pleasure
Cognitive pleasure
Bridging to NodeXL for research and analysis
Researchers have long known that the words of a text contain more information than appears on the surface. As such, texts have been studied for subtexts and other latent or hidden information. One approach has involved the machine-enabled analysis of human sentiment, usually mapped out on a positive-negative polarity. NVivo 11 Plus (a qualitative research tool released in late 2015) enables the automated sentiment analysis of texts (coded research, formal articles, text corpora, Tweetstream datasets, Facebook wall posts, websites, and other sources) based on four categories: very positive, moderately positive, moderately negative, and very negative. The tool feature compares the target text set against a sentiment dictionary and enables coding at different units of analysis: sentence, paragraph, or cell. Further, the sentiment capability extracts the coded text into respective text sets which may be further analyzed using text frequency counts, text searches, automated theme and sub-theme extractions (topic modeling), and data visualizations.
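The four-category coding described above can be sketched with a toy weighted lexicon and thresholds; both are invented for illustration and are not NVivo's actual dictionary or cutoffs:

```python
import re

# Invented weighted lexicon: +/-2 for strong words, +/-1 for mild ones.
LEXICON = {"love": 2, "excellent": 2, "good": 1, "poor": -1, "bad": -1, "awful": -2}

def code_sentence(sentence):
    """Bucket a sentence into the four sentiment categories (or neutral)."""
    score = sum(LEXICON.get(w, 0) for w in re.findall(r"[a-z]+", sentence.lower()))
    if score >= 2:
        return "very positive"
    if score == 1:
        return "moderately positive"
    if score == -1:
        return "moderately negative"
    if score <= -2:
        return "very negative"
    return "neutral"

print(code_sentence("the service was good"))      # one mild positive hit
print(code_sentence("awful support, poor docs"))  # strong + mild negative hits
```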
Eavesdropping on the Twitter Microblogging Site
Shalin Hai-Jew
Research analysts go to Twitter to capture the general trends of public conversations, identify and profile influential accounts, and extract subgroups within larger collectives and larger discourses; they also go to eavesdrop on individual self-talk and individual-to-individual conversations. “So what is technically in your tweets?” asked Dave Rosenberg in a CNET article (2010). The answer: a whole lot more than 140 characters. How are the most influential social media accounts identified through #hashtag graphs? How are themes extracted? How are sentiments understood? How can users be profiled through their Tweetstreams? How can locations be mapped in terms of the Twitter conversations occurring in particular physical areas? How can live and trending issues be identified and categorized in terms of sentiment (positive, negative, and neutral)? This presentation will summarize some of the free and open-source tools, as well as commercial and proprietary ones, that enable increased knowability.
Fully Exploiting Qualitative and Mixed Methods Data from Online Surveys
Shalin Hai-Jew
A wide range of contemporary research uses online surveys. This presentation provides an overview of ways to exploit survey-captured data for analysis. There will be a summary of the basic survey and item analysis that may be achieved with survey data results. There will also be a range of tips for extracting, cleaning, structuring, and presenting both quantitative and qualitative data for data-consumer sense-making. The Qualtrics survey platform will serve as the exemplar, with Excel 2013 and NVivo 10 as supporting analysis tools. Real-world projects are used to demo these approaches—with principal investigator (PI) permission.
Letting the Machine Code Qualitative and Mixed Methods Data in NVivo 10
Shalin Hai-Jew
An experimental feature in NVivo 10 (circa 2013), Autocoding by Existing Pattern, enables the application of semi-supervised machine learning to ingested research data. This results in the extraction of themes and other relevant insights from data—at machine speeds, based on the classification algorithm. This presentation will introduce this feature in NVivo 10 (on both Windows and Mac platforms). It will show how the machine can achieve high inter-rater reliability (a Cohen’s kappa of 1.0 in many cases) on the one hand, but still not achieve full human sensibility from “close reading” coding on the other. This presentation will suggest a complementary balance between machine and human coding of qualitative and mixed methods data for the most efficient application of researcher time and expertise.
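Cohen's kappa, the inter-rater reliability statistic mentioned above, can be computed directly: it is the observed agreement between two coders, corrected for the agreement expected by chance. The machine and human codes below are hypothetical:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' labels on the same items."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: product of each label's marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical codes assigned to the same four text segments
machine = ["theme1", "theme2", "theme1", "theme2"]
human   = ["theme1", "theme2", "theme1", "theme1"]
print(round(cohens_kappa(machine, human), 3))
```

When machine and human codes agree on every item, the function returns 1.0, the perfect-agreement case the abstract mentions.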
Understanding Public Sentiment: Conducting a Related-Tags Content Network Ext...
Shalin Hai-Jew
This presentation focuses on how to understand public sentiment through a related-tags content network analysis of public Flickr photos and videos. NodeXL is used to conduct data extractions and visualizations of user-tagged Flickr contents and the resulting “noisy” folksonomies. What mental connections may be made about particular issues based on analysis of text-annotated graphs?
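The co-occurrence network behind a related-tags analysis can be sketched simply: tags become nodes, and two tags get an edge weighted by how many items they annotate together. The photo tag sets below are invented, and NodeXL's actual extraction works against the Flickr API:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(tagged_items, min_weight=2):
    """Build a related-tags network from per-item tag sets, keeping only
    edges whose co-occurrence count reaches min_weight."""
    weights = Counter()
    for tags in tagged_items:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_weight}

# Hypothetical Flickr-style tag sets on public photos
photos = [["protest", "privacy", "city"],
          ["privacy", "protest"],
          ["city", "skyline"]]
print(cooccurrence_edges(photos, min_weight=2))
```

Thresholding by weight is one simple way to tame the "noisy" folksonomies the abstract mentions: one-off tag pairings drop out, and the recurring associations remain.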
Using Qualtrics to Create Automated Online Trainings
Shalin Hai-Jew
When thinking about “transformational teaching and learning,” training would not be the first thing to come to mind.
The Qualtrics® research suite offers a number of design tools and features that enable the building of automated online trainings. There are the baseline features such as the ability to integrate multimedia, apply various question designs, enable accessibility features (like alt-texting), deliver a mobile experience, reach learners across distances, and provide basic security and data integrity features.
Other features make this tool phenomenally powerful. One is the ability to richly customize learning sequences—by learner profile, by performance (behavior), by selection, or by a mix of factors. Another feature enables the scoring of learner responses and the setting of a threshold for passing. The tool has rich data analytics capabilities, including a light item analysis, online analytics, and even cross-tabulation analysis. A Qualtrics® API enables the automated recording of online assessment scores and learner behaviors to faculty / staff / student information systems.
Trainings are critical for effective workplace functioning and professional development. The same features in Qualtrics® that enable the effective building of automated trainings also enable the effective building of pre-learning modules or sequences for learners who need to refresh their skills for a new course. This digital slideshow introduces the use of Qualtrics® as a customizable training and pre-learning module tool.
Hashtag Conversations, Eventgraphs, and User Ego Neighborhoods: Extracting So...
Shalin Hai-Jew
This presentation introduces methods for extracting and analyzing social network data from Twitter for hashtag conversations (and emergent events), event graphs, search networks, and user ego neighborhoods (using NodeXL). There will be direct demonstrations and discussions of how to analyze social network graphs. This information may be extended with human- and / or machine-based sentiment analysis.
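The extraction step can be illustrated with a toy parser that turns tweets into directed author-to-mention edges, the raw material of a hashtag conversation graph. The tweets are invented, and NodeXL's own extraction (via the Twitter API) captures far more metadata:

```python
import re

def mention_edges(tweets):
    """Extract directed author -> @mention edges from (author, text) pairs."""
    edges = []
    for author, text in tweets:
        for mention in re.findall(r"@(\w+)", text):
            edges.append((author, mention))
    return edges

# Invented tweets on a shared hashtag
tweets = [("kim", "agree with @lee on #privacy"),
          ("lee", "thanks @kim! cc @ana #privacy")]
print(mention_edges(tweets))
```

From edges like these, an ego neighborhood is just the subgraph of edges touching one chosen account.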
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Taha Kass-Hout, MD, MS
A hybrid (event-based and indicator-based) platform designed to streamline collaboration between domain experts and machine learning algorithms for the detection of, prediction of, and response to health-related events (such as disease outbreaks or pandemics). The platform helps synthesize health-related event indicators from a wide variety of information sources (structured and unstructured) into a consolidated picture for analysis, maintenance of “community-wide coherence,” and collaboration processes. The platform offers features to detect anomalies, visualize clusters of potential events, predict the rate and spread of a disease outbreak, and provide decision makers with tools, methodologies, and processes to investigate the event.
Smart Wireless Surveillance Monitoring Using Raspberry Pi
Krishna Kumar
This slide deck describes a smart surveillance monitoring system built with a Raspberry Pi. It includes full details of the procedure, component descriptions, and screenshots.
How to Build Your Inbound Marketing Game Plan
Learn how to generate leads and build loyalty by outthinking, not outspending, the competition. Inbound marketing is powered by content and community. In order to grow smarter and faster, organizations must maintain powerful Websites, participate in social media and continually publish great content. The Inbound Marketing GamePlan follows a proven eight-step system that concentrates on shifting resources to more effective and measurable inbound marketing strategies.
* Paul Roetzer, President, PR 20/20
Native Emigration from the U.S. and Renunciation of U.S. Citizenship
Shalin Hai-Jew
This presentation summarizes some initial research on the phenomena of renunciation of U.S. citizenship and green card status. It highlights some of the basic literature and then uses social media to tap an indirect sense of public attitudes towards this and peripherally related issues.
1. Research Histories of news big data analytics
- distribution of news semantic network
- multilevel semantic network analysis of news
- news big data analysis system <newssource> and <bigkinds>
- webzine <news>
http://story.visualdive.co.kr/2016/04/newspaper26y/ (in Korean)
- 14 articles and 3 books
2. Research Plans
- expert systems using multi-modal data
- opinion dynamics, Bayesian statistics, machine learning
Research Interests: Their Dynamics, Structures and Applications in Personali...
Yi Zeng
About how user interests (more specifically, the research interests of scientists) can be quantitatively analyzed and used in personalized Web search (invited talk at Microsoft Research Asia NLC Group).
Data Science & Analytics (light overview)
Shalin Hai-Jew
This draft slideshow is a print version of an Adobe Spark presentation planned for the TILTed Event at Fort Hays State University (FHSU) in March 2019. The URL to the Spark presentation is https://spark.adobe.com/page/jaOglkNI9Jjp1/.
Slides from my lecture for the Information Retrieval and Data Mining course at University College London. The slides cover introductory concepts on topic models, vector semantics, and basic end applications.
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
Stefan Dietze
Inaugural lecture at Heinrich-Heine-University Düsseldorf on 28 May 2019.
Abstract:
When searching the Web for information, human knowledge and artificial intelligence are in constant interplay. On the one hand, human online interactions such as click streams, crowd-sourced knowledge graphs, semi-structured web markup or distributional semantic models built from billions of Web documents are informing machine learning and information retrieval models, for instance, as part of the Google search engine. On the other hand, the very same search engines help users in finding relevant documents, facts, or data for particular information needs, thereby helping users to gain knowledge. This talk will give an overview of recent work in both of the aforementioned areas. This includes 1) research on mining structured knowledge graphs of factual knowledge, claims and opinions from heterogeneous Web documents as well as 2) recent work in the field of interactive information retrieval, where supervised models are trained to predict the knowledge (gain) of users during Web search sessions in order to personalise rankings. Both streams of research are converging as part of online platforms and applications to facilitate access to data(sets), information and knowledge.
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...IJCSEA Journal
In this digital era, social media is an important tool for information dissemination. Twitter is a popular social media platform. Social media analytics helps make informed decisions based on people's needs and opinions. This information, when properly perceived provides valuable insights into different domains, such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised algorithm to discover a hidden pattern in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to corona virus related discussions. This study compares regular LDA and LDA based on collapsed Gibbs sampling (LDAMallet) algorithms. The experiments use different data processing steps including trigrams, without trigrams, hashtags, and without hashtags. This study provides a comprehensive analysis of LDA for short text messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags helps improve the topic inference results with a better coherence score.
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...IJCSEA Journal
In this digital era, social media is an important tool for information dissemination. Twitter is a popular
social media platform. Social media analytics helps make informed decisions based on people's needs and
opinions. This information, when properly perceived provides valuable insights into different domains,
such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised
algorithm to discover a hidden pattern in text documents. In this study, we explore the Latent Dirichlet
Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to corona virus related
discussions. This study compares regular LDA and LDA based on collapsed Gibbs sampling (LDAMallet)
algorithms. The experiments use different data processing steps including trigrams, without trigrams,
hashtags, and without hashtags. This study provides a comprehensive analysis of LDA for short text
messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags
helps improve the topic inference results with a better coherence score.
International Journal of Computer Science, Engineering and Applications (IJCSEA)IJCSEA Journal
International Journal of Computer Science, Engineering and Applications (IJCSEA) is an open access peer-reviewed journal that publishes articles which contribute new results in all areas of the computer science, Engineering and Applications. The journal is devoted to the publication of high quality papers on theoretical and practical aspects of computer science, Engineering and Applications.
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF...IJCSEA Journal
In this digital era, social media is an important tool for information dissemination. Twitter is a popular
social media platform. Social media analytics helps make informed decisions based on people's needs and
opinions. This information, when properly perceived provides valuable insights into different domains,
such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised
algorithm to discover a hidden pattern in text documents. In this study, we explore the Latent Dirichlet
Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to corona virus related
discussions. This study compares regular LDA and LDA based on collapsed Gibbs sampling (LDAMallet)
algorithms. The experiments use different data processing steps including trigrams, without trigrams,
hashtags, and without hashtags. This study provides a comprehensive analysis of LDA for short text
messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags
helps improve the topic inference results with a better coherence score.
AI in between online and offline discourse - and what has ChatGPT to do with ...Stefan Dietze
Talk at Bonn University on general AI and NLP challenges in the context of online discourse analysis. Specific focus on challenges arising from the widespread adoption of neural large language models.
Social media sites (by some referred to as the web 2.0) allow their users to interact with each other, for example in collecting and sharing so-called user-generated content - these can be just bookmarks, but also blogs, images, and videos. Social media support co-creation: processes where customers (or users, if you prefer) do not just consume but play an active role in defining and shaping the end product. Famous examples include Six Degrees, LiveJournal, Digg, Epinions, Myspace, Flickr, YouTube, Linked-in, and Pinterest. Of course, today's internet giants Facebook and Twitter are key new developments. Finally, Wikipedia should not be overlooked - a major resource in many language technologies including information retrieval!
The second part of the lecture looks into the opportunities for information retrieval research. Social media platforms tend to provide access to user profiles, connections between users, the content these users publish or share, and how they react to each other's content through commenting and rating. Also, the large majority of social media platforms allow their users to categorize content by means of tags (or, in direct communication, through hash-tags), resulting in collaborative ways of information organization known as folksonomies. However, these social media also form a challenge for information retrieval research: the many platforms vary in functionalities, and we have only very little understanding of clearly desirable features like combining tag usage and ratings in content recommendation! A unifying approach based on random walks will be discussed to illustrate how we can answer some of these questions [1], but clearly the area has ample opportunity to leave your own marks.
In the final part of the lecture I will briefly touch upon an even wider range of opportunities, where data derived from social media form a key component to enable new research and insights. I will review a few important results from research centered on Wikipedia, facebook and twitter data, as well as a diverse range of new information sources including the geo- and temporal information derived from images and tweets, product reviews and comments on youtube videos, and how url shorteners may give a view on what is popular on the web.
[1] Maarten Clements, Arjen P. De Vries, and Marcel J. T. Reinders. 2010. The task-dependent effect of tags and ratings on social media access. ACM Trans. Inf. Syst. 28, 4, Article 21 (November 2010), 42 pages. http://doi.acm.org/10.1145/1852102.1852107
Similar to "Mass Surveillance" through Distant Reading (20)
Long nonfiction chapters are not in-style and may never have been. Where average chapter lengths of nonfiction book chapters are about 4,000 – 7,000 words in length, some may be several times that max range number. The explanation is that there is some irreducible complexity that that chapter addresses that cannot be addressed in shorter form. This slideshow explores some methods for writing longer chapters while still maintaining coherence, focus, and reader interest…and while using some technological tools to write and edit more efficiently.
Overcoming Reluctance to Pursuing Grant Funds in AcademiaShalin Hai-Jew
Starting as an organization’s new grant writer can be a challenge, especially in a case where there has been a time lapse since the last one left. People get out of the habit of pursuing grant funds. This slideshow addresses some of the reasons for such reluctance and proposes some ways to mitigate these.
Writing grants is one common way that those in institutions of higher education may acquire some funds—small and big, one-off and continuing—to conduct research, hire faculty and researchers and learners and others, update equipment, update or build up new buildings, and achieve other work. This slideshow explores some aspects of the work of grant writing in the present moment in higher education.
Contrasting My Beginner Folk Art vs. Machine Co-Created Folk Art with an Art-...Shalin Hai-Jew
The SARS-CoV-2 pandemic inspired several years of experimentation with common or folk art, involving mixed media, alcohol ink painting, and other explorations. Then, with the emergence of art-making generative AIs, there were further experiments, particularly with one that enables generation of visuals from scanned art and photos, text prompts, style overlays, and text-based visual modifiers. While both types of artmaking are emotionally satisfying and helpful for stress management, there are some contrasting differences. This exploratory slideshow explores some of these differences in order to partially shed light on the informal usage of an art-making generative AI (artificial intelligence).
Creating Seeding Visuals to Prompt Art-Making Generative AIsShalin Hai-Jew
Art-making generative AIs have come to the fore. A basic work pipeline typically involves starting with text prompts -> generated images. That image may be used to seed further iterations. Deep Dream Generator (DDG) enables the application of “modifiers” of various types (artist styles, visual adjectives, others) to be applied in addition to the text prompt.
Another approach involves beginning with a “seeding image,” a born-digital or digitized (born-analog) visual on which AI-generated art may be based for a multi-channel and multi-modal prompt. This slideshow provides some observations of how to think about seeding images, particularly in terms of how the DDG handles them, with its “algorithmic pareidolia” (“Deep Dream,” Wikipedia, July 3, 2023).
Human art-making is often about throwing mass-scale conversations. Artists are thought to help bridge humanity into the future. Whether generative AI art enables this or not is still not clear.
Common Neophyte Academic Book Manuscript Reviewer MistakesShalin Hai-Jew
The work of academic book reviewing, as a volunteer (most often), is a common academic practice. The presenter has served as a neophyte one for some years before settling into this invited volunteer work for several decades. There have been lessons learned over time about avoidable mistakes…from both experience and observation.
Fashioning Text (and Image) Prompts for the CrAIyon Art-Making Generative AIShalin Hai-Jew
CrAIyon (formerly DALL-E after Salvador “Dali”) is a web-facing art-making generative AI tool online (https://www.craiyon.com/) that enables the uses of text (and image) prompts for the creation of watermarked, lightweight visuals. Counterintuitively, the rough visuals are much more usable for recombinations and remixes and recreations into usable digital visuals for various digital learning objects. The textual prompts are not particularly intuitive because of how the generative AI program was trained on mass-scale visuals). There is an art and occasional indirection to working prompts after each try, with the resulting nine-image proof sheets that CrAIyon outputs. The tool can be used iteratively for different outputs.
The tool sometimes turns out serendipitous surprises, including an occasional work so refined that it can be used / shared almost unedited. One challenge in using CrAIyon comes from their request for credit (for all non-subscribers to their service). Another comes from the visual watermarking (orange crayon at the bottom right of the image). However, this tool is quite useful for practical applications if one is willing to engage deep digital image editing (Adobe Photoshop, Adobe Illustrator).
Augmented Reality in Multi-Dimensionality: Design for Space, Motion, Multiple...Shalin Hai-Jew
Augmented reality (AR)—the use of digital overlays over physical space—manifests in a wide range of spaces (indoor, outdoor; virtual) and ways (in real space (with unaided human vision); in head gear; in smart glasses; on mobile devices, and others). There are various authoring technologies that enable the making of AR experiences for various users. This work uses a particular tool (Adobe Aero®) to explore ways to build AR for multiple dimensions, including the fourth dimension (motion, changes over time).
Based on the respective purposes of the AR experience, some basic heuristics are captured for
space design (1),
motion design (2),
multiple perception design (sight, smell, taste, sound, touch) (3),
and virtual- and tangible- interactivity (4).
Some Ways to Conduct SoTL Research in Augmented Reality (AR) for Teaching and...Shalin Hai-Jew
One of the extant questions about augmented reality (AR) is how (in)effective it is for the teaching and learning in various formal, nonformal, and informal contexts. The research literature shows mixed findings, which are often highly context-based (and not generalizable). There are some non-trivial costs to the design/development/deployment of AR for teaching and learning. For the users, there is cognitive load on the working memory [(1) extraneous/poor design, (2) intrinsic/inherent difficulty in topic, and (3) germane/forming schemas]. For teachers, there are additional knowledge, skills, and abilities / attitudes (KSAs) that need to be brought to bear.
Exploring the Deep Dream Generator (an Art-Making Generative AI) Shalin Hai-Jew
The Deep Dream Generator was created by Google engineer Alexander Mordvintsev in 2014. It has a public facing instance at https://deepdreamgenerator.com/, which enables people to use text prompts and image prompts (individually or in combination) to inspire the art-generating generative AI to output images. This work highlights some process-based walk-throughs of the tool, some practical uses, some lightweight art learning, some aspects of the online social community on this platform, and other insights. Some works by the AI prompted by the presenter may be seen here: https://deepdreamgenerator.com/u/sjjalinn.
(This is the first draft of a slideshow that will be used in a conference later in the year.)
Augmented Reality for Learning and AccessibilityShalin Hai-Jew
Recently, the presenter conducted a systematic review of the academic literature and an environmental scan to learn how to set up an augmented reality (AR) shop at an institution of higher education. The ambition was to not only set up AR in an accessible and legal way but also be able to test for potential +/- effects of AR on teaching and learning. The research did not go past the review stage, because of a lack of funding, but some insights about accessibility in AR were acquired.
(The visuals are from Deep Dream Generator and CrAIyon.)
Engaging Pixabay as an open-source contributor to hone digital image editing,...Shalin Hai-Jew
This slideshow describes the author's early experiences with creating two accounts on Pixabay in order to advance digital editing skills in multimedia. The two accounts are located at https://pixabay.com/users/sjjalinn-28605710/ and https://pixabay.com/users/wavegenerics-29440244/ ...
This work explores four main spaces where researchers publish about educational technology: academic-commercial, open-access, open-source, and self-publishing.
Human-Machine Collaboration: Using art-making AI (CrAIyon) as cited work, o...Shalin Hai-Jew
It is early days for generative art AIs. What are some ways to use these to complement one's work while staying legal (legal-ish)?
Correction: .webp is a raster format
Getting Started with Augmented Reality (AR) in Online Teaching and Learning i...Shalin Hai-Jew
University creative shops are exploring whether they can get into the game of producing AR-enhanced experiences: campus tours, interactive gaming, virtual laboratories, exploratory art spaces, simulations, design labs, online / offline / blended teaching and learning modules, and other AR applications.
This work offers a basic environmental scan of the AR space for online teaching and learning, and it includes pedagogical design leads from the current research, technological knowhow, hands-on design / development / deployment of learning objects, and online teaching and learning methods.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
"Mass Surveillance" through Distant Reading
1. “MASS SURVEILLANCE” THROUGH DISTANT READING
Shalin Hai-Jew
• Aesthesia
• March 2, 2017
• Marianna Kistler Beach Museum of Art
• Kansas State University
2. OVERVIEW
Distant reading refers to the uses of computers to “read” texts by counting words, identifying themes and subthemes (through topic modeling), extracting sentiment, applying psychological analysis to the author(s), and otherwise finding latent or hidden insights. This work is based on research on “mass surveillance” across five text sets: academic, mainstream journalism, microblogging, Wikipedia articles, and leaked government data. The purpose was to capture some insights about the collective social discussions occurring around this issue in an indirect way. This presentation uses a variety of data visualizations (article network graphs, word trees, dendrograms, treemaps, cluster diagrams, line graphs, bar charts, pie charts, and others) to show how machines read and the types of summary data they enable (at computational speeds, at machine scale, and in a reproducible way). Also, some computational linguistic analysis tools enable the creation of custom dictionaries for unique types of applied research. The tools used in this presentation include NVivo 11 Plus and LIWC2015.
3. SOME COMMON TYPES OF “DISTANT READING” AND APPLICATIONS
Linguistic analysis
Topic modeling
Theme and subtheme extraction
Sentiment analysis
• Positive and negative
Text networks
Word relationships
Authorship analysis (based on latent features)
• Stylometry “fingerprinting”
• Author gender identification
Psychological analysis
Cultural analysis, culturomics
History-based applications
Literary analysis
Dialogue analysis
Geographical referencing and patterning
Character analysis
Predictive analytics
• Classification
• Trend
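One of the techniques listed above, stylometric “fingerprinting” of authorship, can be sketched in a few lines: texts are compared by their relative frequencies of common function words, which authors use largely unconsciously. This is a minimal illustrative sketch with a hypothetical word list and made-up snippets, not the method or tooling used in this presentation.

```python
from collections import Counter
import math
import re

# A tiny set of English function words; real stylometry uses hundreds.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "was", "it"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(p, q):
    """Cosine of the angle between two frequency profiles (0..1)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Hypothetical snippets: two in one prose style, one in a very different style.
a1 = "The report said that the data was sent to the agency in the fall."
a2 = "The memo noted that the file was moved to the archive in a hurry."
b = "Surveillance everywhere! Cameras watching. Drones overhead. No privacy."

sim_same = cosine_similarity(profile(a1), profile(a2))
sim_diff = cosine_similarity(profile(a1), profile(b))
print(sim_same, sim_diff)  # similar styles score higher than dissimilar ones
```

The same profile-and-compare pattern underlies author gender identification and other latent-feature authorship analyses, just with richer feature sets.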
4. STUDIED PHENOMENA IN THE COMPUTATIONAL LINGUISTIC ANALYSIS RESEARCH LITERATURE
Political science, leader speech analysis (for profiling)
State-of-a-field research
Authorship identification
Plagiarism detection
Suicidality
Movie popularity, song popularity
Language studies
Law enforcement
Fraud detection
Threat detection, and others
5. WHY DISTANT READING?
Textual interpretation
At computational speeds
At computational scale
Reproducible, repeatable
Measures various analytical constructs in quantized ways
Surfaces latent (hidden) ideas and data patterns not seeable otherwise (such as by human “close reading”)
Results are comparable against large textual datasets of particular types of text (such as comparing a Tweetstream against other social media texts or even microblogging texts)
Complementary to and augmentative of human “close reading”
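The most basic distant reading operation, counting words, illustrates the speed and reproducibility points above: the same input always yields exactly the same counts. A minimal sketch, with a hypothetical snippet standing in for a curated text set:

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Tokenize on letter runs, lowercase, and count the most common words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

# Hypothetical stand-in for a curated text set.
corpus = ("Mass surveillance raises questions about privacy. "
          "Surveillance technologies scale, and privacy debates scale with them.")
print(word_frequencies(corpus))
```

Rerunning this on the same corpus is guaranteed to reproduce the same summary, which is part of what distinguishes machine reading from subjective human reading.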
6. COMMON ANALYTICAL TRAJECTORIES
Curation of text sets (corpora) -> distant reading data summaries -> zoomed-in analysis (of concepts, names, dates, locations, symbols, and numbers, etc.) -> human close reading
General-to-specific trajectory
Baseline text set statistics based on curated text collections and text corpora
Comparisons across text sets
Relative data
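The zoomed-in step of this general-to-specific trajectory (from corpus summaries down to specific concepts and names) is commonly done with a concordance, or keyword-in-context (KWIC) view, which is also what tools like word trees are built on. A minimal sketch with a hypothetical snippet:

```python
import re

def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context per side."""
    words = re.findall(r"\w+", text.lower())
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, w, right))
    return hits

# Hypothetical snippet standing in for a full text set.
sample = ("The leaked documents describe surveillance programs. "
          "Critics argue surveillance erodes privacy rights.")
for left, kw, right in kwic(sample, "surveillance"):
    print(f"{left:>30} [{kw}] {right}")
```

Each hit can then be handed off to human close reading, completing the trajectory.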
8. WHY “MASS SURVEILLANCE”?
A timely construct
A point of global discussion
A mixed group of competing stakeholders re: the issue
Wide public availability of five (somewhat) disparate text sets:
Academic
Mainstream journalism
Microblogging
Wikipedia articles
Leaked government data
30. READABILITY SCORES BY TEXT SET (data table)

Text set                                              Gunning Fog   Coleman-Liau   Flesch-Kincaid Grade   ARI     SMOG    Flesch Reading Ease (/100)
Set 1: Academic article text set (partial)            13.20         11.71          10.71                  9.29    12.80   43.26
Set 2: Mainstream journalistic text set               14.28         13.88          12.12                  12.40   13.75   39.25
Set 3: Twitter microblogging hashtag discourse        28.88         32.36          24.40                  29.73   21.75   -38.46
Set 4: Wikipedia article network text set (partial)   11.09         12.25          9.46                   8.31    11.07   44.39
Set 5: Leaked U.S. government text set (partial)      14.65         12.45          12.29                  10.89   13.97   36.44
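All of the readability indices in the table above are computed from simple surface counts (words, sentences, syllables, characters). As an illustration, here are the standard published formulas for three of them, with a naive vowel-group syllable counter; real tools use more careful syllable and sentence detection, so exact scores will differ from the table. The example sentences are hypothetical.

```python
import re

def surface_counts(text):
    """Words, sentences, syllables (naive vowel-group count), letter characters."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    chars = sum(len(w) for w in words)
    return len(words), len(sentences), syllables, chars

def flesch_reading_ease(text):
    # Higher = easier; scaled to roughly 0-100 for ordinary prose.
    w, s, syl, _ = surface_counts(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)

def flesch_kincaid_grade(text):
    # Approximate U.S. school grade level.
    w, s, syl, _ = surface_counts(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59

def automated_readability_index(text):
    # Character-based grade-level estimate (ARI).
    w, s, _, c = surface_counts(text)
    return 4.71 * (c / w) + 0.5 * (w / s) - 21.43

simple = "The cat sat on the mat."
dense = ("Computational linguistic analysis enables reproducible quantitative "
         "interpretation of voluminous textual corpora.")
print(flesch_reading_ease(simple), flesch_reading_ease(dense))
```

Note that, as with Set 3 in the table, Flesch Reading Ease can go negative for sufficiently dense or run-on text, since the formula is unbounded below.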
31. Final Full Set Academic Themes and Subthemes Treemap (treemap diagram)
32. Final Full Set Mainstream Journalist Themes and Subthemes Treemap (treemap diagram)
33. Final Full Set #surveillance Microblogging Themes and Subthemes Treemap (treemap diagram)
61. AN ARTICLE HISTOGRAM OF A LEAKED GOVERNMENT DOCUMENT
[Bar chart: number of mentions (0 to 8) of auto-extracted top-level themes from a government document. Themes: content, dissemination, front door, hidden service, information, jflftflvjff dissemination, node, onion, r dissemination.]
article histogram w/ main theme extractions
62. A THEME HISTOGRAM FROM A GOVERNMENT DOCUMENT
[Bar chart: counts of mentions (0 to 3.5) of auto-extracted top-level themes. Themes: event, facebook, msn, notification, sources, target.]
article histogram w/ main theme extractions
64. CONTRIBUTIONS TO THE “MASS SURVEILLANCE” TOPIC
Academic writing: legal, philosophical, technological, and practical implications
Mainstream journalistic articles: domestic and foreign government engagement with the issue (executive, legislative, judicial, and others)
Microblogging messages: global surveillance challenges, changing technologies (drones)
Wikipedia (open-source and crowdsourced encyclopedia): summary details, highlighted events, personages, URLs, and timely observations
Government documents: bureaucratese, technical capabilities
65. ABOUT THE RELATED TEXT SETS…FROM DISTANT READING
Different genres of writing, based on a particular topic, manifest differently on different textual dimensions.
Some textual features seem to co-vary, perhaps because they are general features of prose writing, or for other reasons.
Analysis of different features of the text sets may be helpful in identifying source types that may be most useful for certain types of research or questions.
Social media “netspeak” has not yet fully been captured in the two commercial tools used for this analysis.
Average word counts per unit differed: academic (7,624 – 8,073 words per unit), mainstream journalistic articles (1,460 – 1,488 words per unit), microblogging hashtag discourse (44 – 61 per user account), Wikipedia articles (6,710 – 7,216 words per article), and leaked government documents (1,711 – 1,800 words).
Variance in word counts was based on the uses of differing software programs to do the counts…and natural ambiguity in word identification.
66. ABOUT THE RELATED TEXT SETS…FROM DISTANT READING (CONT.)
Computational analysis of the five text sets showed a spike in terms of human drives across all sets…in terms of “power.” Because this applied across all five text sets, it may be that “power” is a driving issue of concern regarding “mass surveillance.”
Sentiment was most present in the following (in descending order), according to analysis in NVivo 11 Plus: Wikipedia articles, academic articles, leaked government documents, mainstream journalism, and hashtag discourse. A different order was found using LIWC2015 (in descending order): mainstream journalism, Wikipedia articles, academic articles, leaked government documents, and hashtag discourse.
The only rank position of agreement was hashtag discourse in last place, with the least sentiment, which can partially be explained by the brevity of Tweets and the expression of emotion in emoticons and punctuation marks.
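The divergent rankings follow partly from the fact that most sentiment tools are lexicon-based: each tool matches the text against its own dictionary of positive and negative words, so different dictionaries can produce different scores (and rankings) for the same text. A toy sketch of the approach; the mini-lexicons here are hypothetical and are not the NVivo or LIWC dictionaries:

```python
import re

# Two hypothetical mini-lexicons standing in for two tools' dictionaries.
LEXICON_A = {"positive": {"protect", "secure", "lawful"},
             "negative": {"intrusion", "abuse", "leak"}}
LEXICON_B = {"positive": {"secure"},
             "negative": {"intrusion", "abuse", "leak", "surveillance"}}

def sentiment_score(text, lexicon):
    """(positive hits - negative hits) per word; range roughly -1 to 1."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in lexicon["positive"] for w in words)
    neg = sum(w in lexicon["negative"] for w in words)
    return (pos - neg) / (len(words) or 1)

text = ("Mass surveillance is framed as lawful and secure, "
        "yet critics call it an intrusion.")
score_a = sentiment_score(text, LEXICON_A)
score_b = sentiment_score(text, LEXICON_B)
print(score_a, score_b)  # the same text scores differently under each lexicon
```

The same mechanism explains why emoticon- and punctuation-borne emotion in Tweets goes unscored: those tokens simply are not in the word lexicons.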
67. ABOUT THE RELATED TEXT SETS…BASED IN PART ON SELECTED CLOSE READING
All five text sets (academic, mainstream journalistic, microblogging messages, Wikipedia articles, and the government documents) were informed by the source government documents.
The journalistic articles, with a rights narrative of deep intrusions into privacy, seem to have captured the readership’s attention, while academic and government documents were not consumed as broadly.
Journalistic articles ranked high in sociality measures, which may indicate why people see them as connecting with their lives.
Twitter was used to advertise writings from academia and mainstream journalism.
Some academic publications cited mainstream journalistic pieces, but fewer journalistic pieces cited academic works.
68. ABOUT THE RELATED TEXT SETS…BASED IN PART ON SELECTED CLOSE READING (CONT.)
Academia did not have a lot of pieces on this issue in the subscription databases and other sources that were checked.
It may be that more time has to pass for researchers to study the issues.
The technological complexity of the government documents required technology and legal and policy experts to interpret.
These documents were generally handled in a non-consumptive way for computational linguistic analysis. Non-consumptiveness refers to the extraction of statistical features of a text set without direct access to the underlying texts. For this analysis, the focus was on computational reading of the related documents, not a human interpretation of the text set or the related capabilities.
69. ABOUT USING COMPUTATIONAL LINGUISTIC ANALYSIS TO “READ” UP ON AN ISSUE
Selected text sets should be as comprehensive as possible in order to represent the topic. The text sets should be cleaned, so irrelevant elements may be eliminated. There should be clear documentation about how data was collected, processed, and handled.
How the text sets are handled affects the results.
The bundling of particular text sets will affect results as well.
Because social media only attracts some to participate, there can be some large gaps in informational coverage.
Social media platform APIs are often rate- and data-limited, so it’s important to review the terms of access to such data.
Using multiple software tools to conduct analysis makes sense because there are differences between tool designs which will affect what is observed or not. The “validity” and “reliability” of software tools vary…
70. ABOUT USING COMPUTATIONAL LINGUISTIC ANALYSIS TO “READ” UP ON AN ISSUE (CONT.)
How the researcher asks questions and wields the technology will affect what is seeable and seen. There is not an “objective” reading machine… Subjectivity and judgment play a role.
External validation may be an important piece of research using computational reading.
The data visualizations here are mostly interactive, and it is possible to link to the original underlying data. All the data visualizations are informed by underlying data, and these should be accessed for deeper understandings.
These interactive features and underlying data should be engaged to fully benefit from the computational analyses. (Data visualizations are not used independent of the underlying data.)
“Non-consumptive” text analysis can sometimes be helpful even without the benefit of close reading and examination of the underlying text corpora used for the computational analysis.
71. ABOUT USING COMPUTATIONAL LINGUISTIC ANALYSIS TO “READ” UP ON AN ISSUE (CONT.)
Close reading is always a part of the work, even when distant reading is brought to bear. Each enhances the other, and there are many rich processing sequences to explore.
What a human reader “sees” vs. what a computer “sees” differs.
72. SOME POSSIBLE EFFECTS OF THE RESEARCH
Different genres of texts may reach different parts of a population. Those who limit themselves to particular genres will only capture some aspects of information about a topic.
Those engaged in strategic communications would benefit from gaining a sense of which communications modes to engage in order to reach their target audience.
It helps to know what issues are trending at any particular time…and the collective emotions which are being expressed.
It helps to strategically target limited human close reading attention based on observations from distant reading.
73. WHY “MASS SURVEILLANCE” AND “DISTANT READING”?
There is an elision of mass surveillance and distant reading in this slideshow, in part because the same technological capabilities enable “mass surveillance” and dataveillance (data + surveillance, in a portmanteau term).
Practically speaking, human close reading would be wholly insufficient to interact with mass data. There are not enough human years to plough through the masses of structured and unstructured data being created today.
For complex data, human close reading requires close and slow attention (roughly 200 words per minute).
Human close reading is not known for great objective accuracy. Rather, human reading is informed by a trained and subjective lens. Human reading is known for a unique perspective and voice.
74. WHY “MASS SURVEILLANCE” AND “DISTANT READING”? (CONT.)
Together, “distant” and “close” reading expand human power to read, interpret, and learn. Sometimes, these complementary efforts help solve very human challenges.
Computational distant reading does not “displace” people or what they can bring to research and analysis. Oftentimes, the findings from each diverge, resulting in different insights attained in different ways.
76. ABOUT NVIVO 11 PLUS
Enables the curation of unstructured, semi-structured, and structured data (using SQL as the understructure on Windows)
Enables analysis of any data represented by UTF-8 (Unicode character set) but requires a main base language
Enables exact matches, stemmed words, synonyms, specializations, and generalizations
Enables the application of special characters and Boolean terms
Enables the building of an exportable code dictionary
Enables topic modeling, sentiment analysis, and “coding by existing pattern”
Enables “distant reading” and interactive data visualizations including word trees, dendrograms, treemaps, cluster diagrams, and others
78. ABOUT LIWC2015
Has a built-in linguistic analysis dictionary which has been built up over decades of refinement and empirical research
Summarizes datasets on four scores: Analytic, Clout, Authentic, and Tone
Includes psychological and socio-psychological elements
Includes sentiment and emotional analysis features
Includes gender reference counts
Includes human drives counts
Includes generic linguistic analysis counts (including for function words)
79. ABOUT LIWC2015 (CONT.)
Is backstopped by decades of solid research
Is a very well and smartly documented tool
Is set up as a processor and a dictionary
Enables the building of custom dictionaries to run against textual datasets to surface more unique insights
80. ABOUT LIWC2015 (CONT.)
Requires some in-depth reading of the related documentation
The Development and Psychometric Properties of LIWC2015
Linguistic Inquiry and Word Count: LIWC2015
Requires reading of years of research for the smoothest research applications
Requires experience in Excel, since data dump out into .xls or .xlsx files
There is no proprietary file format to save an analysis using LIWC2015
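Because LIWC2015 writes its per-document scores out as a spreadsheet rather than a saved analysis, any downstream summarizing happens outside the tool. A hedged, standard-library-only sketch of that step: the rows below are made up and stand in for an export (in practice they would come from the exported .xls/.xlsx via a spreadsheet reader); only the four summary-score column names (Analytic, Clout, Authentic, Tone) are taken from LIWC2015 itself.

```python
import csv
import io
from statistics import mean

# Hypothetical rows standing in for a LIWC2015 export (scores run 0-100).
export = io.StringIO(
    "Filename,Analytic,Clout,Authentic,Tone\n"
    "doc1.txt,88.2,61.5,32.1,40.7\n"
    "doc2.txt,91.0,55.3,28.4,35.2\n"
    "doc3.txt,79.5,70.2,44.8,52.9\n"
)

rows = list(csv.DictReader(export))
# Mean of each LIWC summary score across the text set.
summary = {col: mean(float(r[col]) for r in rows)
           for col in ("Analytic", "Clout", "Authentic", "Tone")}
print(summary)
```

Per-set averages like these are what allow the cross-text-set comparisons shown earlier (e.g., sentiment and drives rankings across the five corpora).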
81. CONTACT AND CONCLUSION
Dr. Shalin Hai-Jew
Instructional Designer
Kansas State University
785-532-5262
shalin@k-state.edu
“Distant reading” is a term originated by Franco Moretti (founder of the Stanford Literary Lab) in 2000.
This slideshow is based on a research-based chapter forthcoming in 2017.