Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Visual Analytics for Linguistics - Day 4
Olga Scrivner
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
What You Will Learn
DAY 1 Introduction to Visual Analytics
DAY 2 Visualization Methods, Design, and Tools
DAY 3 Working with Unstructured Data
DAY 4 Working with Structured Data
DAY 5 Advanced Analytics
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Our Materials - Web Site
http:
//obscrivn.wixsite.com/visualization
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
What We Need
Interactive Text Mining Suite
Language Variation Suite
Download R code
Download files from Day 3 and Day 4
R and Rstudio
R libraries: tm, party, partykit
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
What We Need
Interactive Text Mining Suite
Language Variation Suite
Download R code
Download files from Day 3 and Day 4
R and Rstudio
R libraries: tm, party, partykit
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Text Mining Review
Intro to ITMS - slides Day 3
R coding
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Text Mining Review
1. Open RStudio
2. Open R file textmining.R:
3. Set up working directory
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Text Mining Package - tm
Main structure - corpus
Corpus is constructed via DirSource, VectorSource,
DataframeSource
mycorpus <- Corpus(VectorSource(file.txt))
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Preprocessing with tm
lower case
mycorpus <- tm_map(mycorpus, tolower)
remove punctuation
mycorpus <- tm_map(mycorpus,
removePunctuation)
remove numbers
mycorpus <- tm_map(mycorpus, removeNumbers)
remove stopwords
mycorpus <- tm_map(mycorpus, removeWords,
stopwords(’english’))
mycorpus <- tm_map(mycorpus, stripWhitespace)
mycorpus <- tm_map(mycorpus,
PlainTextDocument)
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
What is LVS?
Language Variation Suite
It is a Shiny web application originally designed for data
analysis in sociolinguistic research.
It can be used for:
Processing spreadsheet data
Visualizing data
Analyzing means, regression, conditional trees ...
(and much more)
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Background
LVS is built in R using Shiny package:
1. R - a free programming language for statistical
computing and graphics
2. Shiny App - a web application framework for R
Computational power of R + Web interactivity
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Background
http://littleactuary.github.io/blog/Web-application-framework-with-Shiny/
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Workspace
Browser
Chrome, Firefox, Safari - recommendable
Explorer may cause instability issues
Accessibility
PC, Mac, Linux
Data files will be uploaded from any location on your
computer
Smart Phone
Data files must be on a cloud platform connected to
your phone account (e.g. dropbox)
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Server
Since LVS is hosted on a server, Shiny idle time-out settings
may stop application when it is left inactive (it will grey out).
Solution: Click reload and re-upload your csv file
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Data Preparation
Important things to consider before data entry:
File format:
Comma separated value (CSV) - faster processing
Excel format will slow processing
Column names should not contain spaces
Permitted: non-accented characters, numbers,
underscore, hyphen, and period
One column must contain your dependent variable
The rest of the columns contain independent variables
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Terminology Review
a. Categorical - non-numerical data with two values
yes - no; male - female
b. Continuous - numerical data
duration, age, year
c. Multinomial - non-numerical data with three or more
values
regions, nationalities
d. Ordinal - scale: currently not supported
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Terminology Review
a. Categorical - non-numerical data with two values
yes - no; male - female
b. Continuous - numerical data
duration, age, year
c. Multinomial - non-numerical data with three or more
values
regions, nationalities
d. Ordinal - scale: currently not supported
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Workshop Files
1. http://obscrivn.wixsite.com/visualization
Download file Day 4: movie_metadata.csv
Simplified set from https://www.kaggle.com/
deepmatrix/imdb-5000-movie-dataset
2. LVS web site: http://languagevariationsuite.com/
http://cl.indiana.edu/~obscrivn/docs/movie_metadata.csv
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Movie Data
Budget
Director
Actor 1
Director facebook likes
Actor 1 facebook likes
Genre
Year
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Language Variation Suite - Structure
1. Data
Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
Plotting, cluster classification
3. Inferential Statistics
Modeling, regression, conditional trees, random forest
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Language Variation Suite - Structure
1. Data
Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
Plotting, cluster classification
3. Inferential Statistics
Modeling, regression, conditional trees, random forest
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Upload File
Upload movie_metadata.csv
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Uploaded Dataset
The data content is imported as a table and allows for
sorting columns.
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Summary
Summary provides a quantitative summary for each variable,
e.g. frequency count, mean, median.
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Data Structure
1. Total number of observations (rows)
2. Number of variables (columns)
3. Variable types
Factor - categorical values
Num - numeric values (0.95, 1.05)
Int - integer values (1, 2, 3)
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Cross Tabulation
Cross-tabulation examines the relationship between variables.
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Cross Tabulation Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Cross Tabulation Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Adjusting Browser - Layout
Shiny pages are fluid and reactive.
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Adjusting Browser - Layout
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Language Variation Suite - Structure
1. Data
Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
Plotting, cluster classification
3. Inferential statistics
Modeling, regression, varbrul analysis, conditional trees,
random forest
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
One Variable Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
One Variable Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Customizing Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Customizing Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Saving Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Cluster Plot
Classification of data into sub-groups is based on
pairwise similarities
Groups are clustered in the form of a tree-like
dendrogram
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Cluster Plot
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Cluster Plot
Group 1 Animation, Biography and Group 2 Action, Drama,
Comedy
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Inferential Statistics
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Language Variation Suite - Structure
1. Data
Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
Plotting, cluster classification
3. Inferential statistics
Modeling, regression, varbrul analysis, conditional trees,
random forest
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
How to Create a Regression Model
Step 1 Modeling - create a model with dependent
and independent variables
Step 2 Regression - specify the type of regression
(fixed, mixed) and type of dependent variable
(binary, continuous, multinomial)
Step 3 Stepwise Regression - compare models
(Log-likelihood, AIC, BIC)
Step 4 Conditional Trees - apply non-parametric
tests to the model
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Modeling
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Regression Types
Model
a.) Fixed effect
b.) Mixed effect - individual speaker/token variation (within
group)
Type of Dependent Variable
a.) Binary/categorical (only two values)
b.) Continuous (numeric)
c.) Multinomial - categorical with more than two values
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Regression
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Model Output
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Model Output
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Interpretation: Budget and Genre
Genre Action is the reference value
Positive coefficient - positive effect
Negative coefficient - negative effect
http://www.free-online-calculator-use.com/scientific-notation-converter.html
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Interpretation: Budget and Genre
Genre Action is the reference value
Positive coefficient - positive effect
Negative coefficient - negative effect
http://www.free-online-calculator-use.com/scientific-notation-converter.html
exponential notation:
1.46e-7
0.000000146
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Conditional Tree
Conditional tree: a simple non-parametric regression analysis,
commonly used in social and psychological studies
Linear regression: all information is combined linearly
Conditional tree regression: visual splitting to capture
interaction between variables
Recursive splitting (tree branches)
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Conditional Tree
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Conditional Tree
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Conditional Tree
1. Genre is the significant factor for budget
2. Budget distribution is split in two groups:
Action and Animation
Biography, Comedy and Drama
3. Budget is significantly higher for Animation and Action
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Random Forest
1. Variable importance for predictors
2. Robust technique with small n large p data
3. All predictors considered jointly (allows for inclusion of
correlated factors)
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Random Forest
Let’s add more factors!
Return to Modeling
Add independent factors: director facebook likes,
actor 1 facebook likes, title year
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Random Forest
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Random Forest
Genre is the most important predictor for this model.
Close to zero or red-dotted line (cut off values) -
irrelevant for this model
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Assignment
Select LVS or ITMS
Upload your own file (csv for LVS or txt/pdf for ITMS)
Explore your data
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
References I
Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics.
Cambridge: Cambridge University Press
Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas
Biber Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics.
Cambridge: Cambridge University Press
Schnapp, Jeffrey, and Peter Presner. 2009. Digital Humanities Manifesto 2.0.
http://gifsanimados.espaciolatino.com/x_bob_esponja_8.gif
https://daniellestolt.files.wordpress.com/2013/01/are-you-ready1.gif
http://www.martijnwieling.nl/R/sheets.pdf
Visual Analytics
for Linguistics -
Day 4
Olga Scrivner
Course Info
tm Package
Statistical
Visualization
Data Preparation
LVS
Working with
Data
Visual Analytics
Inferential
Analysis
Resources
http://www.rdatamining.com/examples/text-mining
https:
//en.wikibooks.org/wiki/R_Programming/Text_Processing
http://data.library.virginia.edu/
reading-pdf-files-into-r-for-text-mining/
http://www.katrinerk.com/courses/
words-in-a-haystack-an-introductory-statistics-course/
schedule-words-in-a-haystack/
r-code-the-text-mining-package
tm package

Visual Analytics for Linguistics - Day 4 ESSLLI - structured data

  • 1.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Visual Analytics for Linguistics - Day 4 Olga Scrivner
  • 2.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis What You Will Learn DAY 1 Introduction to Visual Analytics DAY 2 Visualization Methods, Design, and Tools DAY 3 Working with Unstructured Data DAY 4 Working with Structured Data DAY 5 Advanced Analytics
  • 3.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Our Materials - Web Site http: //obscrivn.wixsite.com/visualization
  • 4.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis What We Need Interactive Text Mining Suite Language Variation Suite Download R code Download files from Day 3 and Day 4 R and Rstudio R libraries: tm, party, partykit
  • 5.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis What We Need Interactive Text Mining Suite Language Variation Suite Download R code Download files from Day 3 and Day 4 R and Rstudio R libraries: tm, party, partykit
  • 6.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Text Mining Review Intro to ITMS - slides Day 3 R coding
  • 7.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Text Mining Review 1. Open RStudio 2. Open R file textmining.R: 3. Set up working directory
  • 8.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Text Mining Package - tm Main structure - corpus Corpus is constructed via DirSource, VectorSource, DataframeSource mycorpus <- Corpus(VectorSource(file.txt))
  • 9.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Preprocessing with tm lower case mycorpus <- tm_map(mycorpus, tolower) remove punctuation mycorpus <- tm_map(mycorpus, removePunctuation) remove numbers mycorpus <- tm_map(mycorpus, removeNumbers) remove stopwords mycorpus <- tm_map(mycorpus, removeWords, stopwords(’english’)) mycorpus <- tm_map(mycorpus, stripWhitespace) mycorpus <- tm_map(mycorpus, PlainTextDocument)
  • 10.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis What is LVS? Language Variation Suite It is a Shiny web application originally designed for data analysis in sociolinguistic research. It can be used for: Processing spreadsheet data Visualizing data Analyzing means, regression, conditional trees ... (and much more)
  • 11.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Background LVS is built in R using Shiny package: 1. R - a free programming language for statistical computing and graphics 2. Shiny App - a web application framework for R Computational power of R + Web interactivity
  • 12.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Background http://littleactuary.github.io/blog/Web-application-framework-with-Shiny/
  • 13.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Workspace Browser Chrome, Firefox, Safari - recommendable Explorer may cause instability issues Accessibility PC, Mac, Linux Data files will be uploaded from any location on your computer Smart Phone Data files must be on a cloud platform connected to your phone account (e.g. dropbox)
  • 14.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Server Since LVS is hosted on a server, Shiny idle time-out settings may stop application when it is left inactive (it will grey out). Solution: Click reload and re-upload your csv file
  • 15.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Data Preparation Important things to consider before data entry: File format: Comma separated value (CSV) - faster processing Excel format will slow processing Column names should not contain spaces Permitted: non-accented characters, numbers, underscore, hyphen, and period One column must contain your dependent variable The rest of the columns contain independent variables
  • 16.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Terminology Review a. Categorical - non-numerical data with two values yes - no; male - female b. Continuous - numerical data duration, age, year c. Multinomial - non-numerical data with three or more values regions, nationalities d. Ordinal - scale: currently not supported
  • 17.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Terminology Review a. Categorical - non-numerical data with two values yes - no; male - female b. Continuous - numerical data duration, age, year c. Multinomial - non-numerical data with three or more values regions, nationalities d. Ordinal - scale: currently not supported
  • 18.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Workshop Files 1. http://obscrivn.wixsite.com/visualization Download file Day 4: movie_metadata.csv Simplified set from https://www.kaggle.com/ deepmatrix/imdb-5000-movie-dataset 2. LVS web site: http://languagevariationsuite.com/ http://cl.indiana.edu/~obscrivn/docs/movie_metadata.csv
  • 19.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Movie Data Budget Director Actor 1 Director facebook likes Actor 1 facebook likes Genre Year
  • 20.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Language Variation Suite - Structure 1. Data Upload file, data summary, adjust data, cross tabulation 2. Visual Analysis Plotting, cluster classification 3. Inferential Statistics Modeling, regression, conditional trees, random forest
  • 21.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Language Variation Suite - Structure 1. Data Upload file, data summary, adjust data, cross tabulation 2. Visual Analysis Plotting, cluster classification 3. Inferential Statistics Modeling, regression, conditional trees, random forest
  • 22.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Upload File Upload movie_metadata.csv
  • 23.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Uploaded Dataset The data content is imported as a table and allows for sorting columns.
  • 24.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Summary Summary provides a quantitative summary for each variable, e.g. frequency count, mean, median.
  • 25.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Data Structure 1. Total number of observations (rows) 2. Number of variables (columns) 3. Variable types Factor - categorical values Num - numeric values (0.95, 1.05) Int - integer values (1, 2, 3)
  • 26.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Cross Tabulation Cross-tabulation examines the relationship between variables.
  • 27.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Cross Tabulation Plot
  • 28.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Cross Tabulation Plot
  • 29.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Adjusting Browser - Layout Shiny pages are fluid and reactive.
  • 30.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Adjusting Browser - Layout
  • 31.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Language Variation Suite - Structure 1. Data Upload file, data summary, adjust data, cross tabulation 2. Visual Analysis Plotting, cluster classification 3. Inferential statistics Modeling, regression, varbrul analysis, conditional trees, random forest
  • 32.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis One Variable Plot
  • 33.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis One Variable Plot
  • 34.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Customizing Plot
  • 35.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Customizing Plot
  • 36.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Saving Plot
  • 37.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Cluster Plot Classification of data into sub-groups is based on pairwise similarities Groups are clustered in the form of a tree-like dendrogram
  • 38.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Cluster Plot
  • 39.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Cluster Plot Group 1 Animation, Biography and Group 2 Action, Drama, Comedy
  • 40.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Inferential Statistics
  • 41.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Language Variation Suite - Structure 1. Data Upload file, data summary, adjust data, cross tabulation 2. Visual Analysis Plotting, cluster classification 3. Inferential statistics Modeling, regression, varbrul analysis, conditional trees, random forest
  • 42.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis How to Create a Regression Model Step 1 Modeling - create a model with dependent and independent variables Step 2 Regression - specify the type of regression (fixed, mixed) and type of dependent variable (binary, continuous, multinomial) Step 3 Stepwise Regression - compare models (Log-likelihood, AIC, BIC) Step 4 Conditional Trees - apply non-parametric tests to the model
  • 43.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Modeling
  • 44.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Regression Types Model a.) Fixed effect b.) Mixed effect - individual speaker/token variation (within group) Type of Dependent Variable a.) Binary/categorical (only two values) b.) Continuous (numeric) c.) Multinomial - categorical with more than two values
  • 45.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Regression
  • 46.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Model Output
  • 47.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Model Output
  • 48.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Interpretation: Budget and Genre Genre Action is the reference value Positive coefficient - positive effect Negative coefficient - negative effect http://www.free-online-calculator-use.com/scientific-notation-converter.html
  • 49.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Interpretation: Budget and Genre Genre Action is the reference value Positive coefficient - positive effect Negative coefficient - negative effect http://www.free-online-calculator-use.com/scientific-notation-converter.html exponential notation: 1.46e-7 0.000000146
  • 50.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Conditional Tree Conditional tree: a simple non-parametric regression analysis, commonly used in social and psychological studies Linear regression: all information is combined linearly Conditional tree regression: visual splitting to capture interaction between variables Recursive splitting (tree branches)
  • 51.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Conditional Tree
  • 52.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Conditional Tree
  • 53.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Conditional Tree 1. Genre is the significant factor for budget 2. Budget distribution is split in two groups: Action and Animation Biography, Comedy and Drama 3. Budget is significantly higher for Animation and Action
  • 54.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Random Forest 1. Variable importance for predictors 2. Robust technique with small n large p data 3. All predictors considered jointly (allows for inclusion of correlated factors)
  • 55.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Random Forest Let’s add more factors! Return to Modeling Add independent factors: director facebook likes, actor 1 facebook likes, title year
  • 56.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Random Forest
  • 57.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Random Forest Genre is the most important predictor for this model. Close to zero or red-dotted line (cut off values) - irrelevant for this model
  • 58.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Assignment Select LVS or ITMS Upload your own file (csv for LVS or txt/pdf for ITMS) Explore your data
  • 59.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis References I Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press Schnapp, Jeffrey, and Peter Presner. 2009. Digital Humanities Manifesto 2.0. http://gifsanimados.espaciolatino.com/x_bob_esponja_8.gif https://daniellestolt.files.wordpress.com/2013/01/are-you-ready1.gif http://www.martijnwieling.nl/R/sheets.pdf
  • 60.
    Visual Analytics for Linguistics- Day 4 Olga Scrivner Course Info tm Package Statistical Visualization Data Preparation LVS Working with Data Visual Analytics Inferential Analysis Resources http://www.rdatamining.com/examples/text-mining https: //en.wikibooks.org/wiki/R_Programming/Text_Processing http://data.library.virginia.edu/ reading-pdf-files-into-r-for-text-mining/ http://www.katrinerk.com/courses/ words-in-a-haystack-an-introductory-statistics-course/ schedule-words-in-a-haystack/ r-code-the-text-mining-package tm package