This document provides an overview of the Language Variation Suite (LVS) toolkit. The LVS is a web application designed for sociolinguistic data analysis. It allows users to upload spreadsheet data, perform data cleaning and preprocessing, generate summary statistics and cross tabulations, create data visualizations, and conduct various statistical analyses including regression modeling, clustering, and random forests. The workshop will cover the structure and functionality of the LVS through practical examples and exercises using sample sociolinguistic datasets.
Workshop NWAV 47 - LVS - Tool for Quantitative Data Analysis
1. Language Variation Suite Toolkit
Olga Scrivner, Rafael Orozco, Manuel Díaz-Campos
NWAV47, October 18, 2018
Indiana University and Louisiana State University
3. What You Will Learn
Part 1 Cross-disciplinary Methods in Sociolinguistics
Part 2 Web Interface - a novel approach for interactive socio-analysis
Part 3 Working with Structured Data: LVS
Part 4 Working with Unstructured Data: ITMS
4. Our Materials - Web Site
https://languagevariationsuite.wordpress.com/
5. What We Need
- Computer
- WiFi
- Smartphone (optional)
- R and RStudio (not really)
8. Our Objectives
Goal: Optimize sociolinguistic analysis by using a variety of cross-disciplinary methods
1. Introduce a novel (socio-)linguistic toolkit for research and teaching
2. Develop practical skills via an interactive and visual interface
3. Understand and interpret advanced statistical and data mining models
11. Cross-disciplinary Methods: A Closer Look
Random Forest - a more stable alternative to stepwise variable selection (Tagliamonte and Baayen, 2012: 25)
Conditional Tree - a hierarchical tree of significant factors and their relationships with other predictors
12. Cross-disciplinary Methods: A Closer Look
Clustering - a grouping of variables, individuals, or documents (Gries and Hilpert, 2008: 69)
Topic Modeling - a discovery of language patterns and word usage over time and across genres (Blei, 2012: 78)
13. How To Unify All Methods: Data Science Workflow
1. Import data into R: read_csv(), read_lines() [even Praat files!]
2. Tidy data - pre-processing
3. Transform with dplyr [select, subset, recode ...]
4. Visualize with ggplot and plotly
5. Model with glm, lda, randomForest ...
Tidyverse - a system of R packages for data manipulation, exploration, and visualization that share a common design philosophy (Wickham and Grolemund, 2017)
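The five steps above can be sketched in base R. This is a minimal, hypothetical illustration using R's built-in mtcars data in place of a sociolinguistic spreadsheet; the slide's tidyverse functions (read_csv(), dplyr verbs, ggplot) are replaced here by their base-R analogues so the sketch runs without extra packages.

```r
# 1. Import: read.csv() is the base-R analogue of readr::read_csv()
#    (here we fake a CSV file from the built-in mtcars data)
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)
d <- read.csv(path)

# 2. Tidy: inspect structure and check for missing values
str(d)
stopifnot(!anyNA(d))

# 3. Transform: subset and recode (base-R analogues of dplyr verbs)
d <- subset(d, select = c(mpg, wt, am))
d$am <- factor(d$am, labels = c("automatic", "manual"))

# 4. Visualize: base plot() standing in for ggplot2
plot(mpg ~ wt, data = d)

# 5. Model: a fixed-effects linear regression
m <- lm(mpg ~ wt + am, data = d)
summary(m)
```

In LVS the same pipeline is driven through menus rather than typed commands.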
14. Sociolinguistic Data - Data Analytics
“Analytics is the critical technology needed to bring value out of data” (Anonymous)
15. Sociolinguistic Data - Visual Analytics
“The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas and Cook, 2005)
“Visual analytics integrates new computational and theory-based tools with innovative interactive techniques and visual representations to enable human-information discourse” (Thomas and Cook, 2005)
[Conditional inference tree for VO/OV order: the first split is on PositionSentence (p < 0.001), followed by Heaviness (p = 0.003), Period (p < 0.001), Focus (p < 0.001), and Main_Verb_Structure (p < 0.001); terminal nodes (n = 43 to 265) show the proportion of VO vs. OV.]
17. What is LVS?
Language Variation Suite is a Shiny web application designed for data analysis in sociolinguistic research.
It can be used for:
Processing spreadsheet data
Reporting in tables and graphs
Analyzing means, regression, conditional trees ... (and much more)
18. Background
LVS is built in R using the Shiny package:
1. R - a free programming language for statistical computing and graphics
2. Shiny - a web application framework for R
Computational power of R + web interactivity
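To illustrate how R and Shiny combine, here is a toy app (a hypothetical sketch, not LVS's actual source; it assumes the shiny package is installed) that uploads a CSV and prints its summary, mimicking the first step of the LVS workflow:

```r
library(shiny)

# UI: a file-upload control and a text area for output
ui <- fluidPage(
  fileInput("file", "Upload a CSV file"),
  verbatimTextOutput("summary")
)

# Server: read the uploaded spreadsheet and summarize each column
server <- function(input, output) {
  output$summary <- renderPrint({
    req(input$file)                      # wait until a file is uploaded
    d <- read.csv(input$file$datapath)   # read the uploaded CSV
    summary(d)                           # quantitative summary per column
  })
}

app <- shinyApp(ui, server)  # launch with runApp(app)
```

All of the statistical power of R is available inside the server function; Shiny supplies the interactivity.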
20. Workspace
Browser
Chrome, Firefox, Safari - recommended
Internet Explorer may cause instability issues
Accessibility
PC, Mac, Linux
◦ Data files can be uploaded from any location on your computer
Smartphone
◦ Data files must be on a cloud platform connected to your phone account (e.g. Dropbox)
21. Server
Since LVS is hosted on a server, Shiny's idle time-out settings may stop the application when it is left inactive (the screen will grey out).
Solution: click Reload and re-upload your CSV file.
23. Data Preparation
Important things to consider before data entry:
File format:
◦ Comma-separated values (CSV) - faster processing
◦ Excel format will slow processing
Column names should not contain spaces
◦ Permitted: non-accented characters, numbers, underscores, hyphens, and periods
One column must contain your dependent variable; the remaining columns contain independent variables.
24. Terminology Review
a. Categorical/Nominal - non-numerical data with two values
◦ yes - no; deletion - retention; perfective - imperfective
b. Continuous - numerical data
◦ duration, age, chronological period
c. Multinomial - non-numerical data with three or more values
◦ deletion - aspiration - retention
d. Ordinal - scale data
25. Test Types

Dependent                           Independent                         Test
Continuous                          Continuous                          t-test
Categorical                         Categorical                         Chi-square
Continuous                          Multiple categorical/continuous     Linear regression
Categorical (binary/multinomial)    Multiple categorical/continuous     Logistic regression

(Dickinson, 2014)
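Each row of the table maps onto a one-line call in R. The sketch below uses R's built-in datasets (sleep, mtcars) rather than the workshop's own data, purely to show the shape of each call:

```r
# Continuous dependent, two-group independent: t-test
t.test(extra ~ group, data = sleep)

# Categorical x categorical: chi-square test of independence
chisq.test(table(mtcars$am, mtcars$cyl))

# Continuous dependent, multiple predictors: linear regression
lm_fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Binary categorical dependent: logistic regression
glm_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_fit)
```

LVS runs the regression variants of these tests behind its menu options.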
37. Summary
Summary provides a quantitative summary for each variable, e.g. frequency count, mean, median.
38. Data Structure
1. Total number of observations (rows)
2. Number of variables (columns)
3. Variable types
◦ Factor - categorical values
◦ num - numeric values (0.95, 1.05)
◦ int - integer values (1, 2, 3)
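In R itself, this information comes from nrow(), ncol(), and str(). A quick illustration on a toy data frame (the column names variant, duration, and period are invented for the example):

```r
d <- data.frame(
  variant  = factor(c("deletion", "retention", "deletion")),  # Factor
  duration = c(0.95, 1.05, 0.87),                             # num
  period   = c(1L, 2L, 3L)                                    # int
)

nrow(d)  # total number of observations (rows): 3
ncol(d)  # number of variables (columns): 3
str(d)   # prints each variable's type: Factor, num, int
```

LVS displays the same str()-style breakdown after a file is uploaded.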
56. Language Variation Suite - Structure
1. Data
◦ Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
◦ Plotting, cluster classification
3. Inferential Statistics
◦ Modeling, regression, Varbrul analysis, conditional trees, random forest
57. How to Create a Regression Model
Step 1 Modeling - create a model with dependent and independent variables
Step 2 Regression - specify the type of regression (fixed, mixed) and the type of dependent variable (binary, continuous, multinomial)
Step 3 Stepwise Regression - compare models (log-likelihood, AIC, BIC)
Step 4 Conditional Trees - apply non-parametric tests to the model
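Steps 1 through 3 can be sketched in base R (a hypothetical example on mtcars, with the binary variable am standing in for a sociolinguistic dependent variable; LVS exposes the same operations through its menus):

```r
# Steps 1-2: fixed-effects logistic regressions (binary dependent variable)
m1 <- glm(am ~ wt,      data = mtcars, family = binomial)
m2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Step 3: compare the nested models via log-likelihood, AIC, and BIC
logLik(m1); logLik(m2)
AIC(m1, m2)
BIC(m1, m2)
anova(m1, m2, test = "Chisq")  # likelihood-ratio test of the added predictor
```

A lower AIC/BIC favors a model; the chi-square test asks whether the extra predictor significantly improves fit.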
60. Regression Types
Model
a.) Fixed effect
b.) Mixed effect - individual speaker/token variation (within group)
Type of Dependent Variable
a.) Binary/categorical (only two values)
b.) Continuous (numeric)
c.) Multinomial - categorical with more than two values
64. Interpretation
Deletion is the reference value.
Positive coefficient - positive effect
Negative coefficient - negative effect
65. Interpretation - RETENTION
The lexical item Fourth has a negative effect on retention compared to Floor, and it is significant.
Normal style has a slightly negative effect on retention, but its coefficient is not significant.
Macy’s and Saks have a positive and significant effect on retention. Saks (upper-middle-class store) is more significant than Macy’s (middle-class store).
http://www.free-online-calculator-use.com/scientific-notation-converter.html
Exponential notation: 1.48e-8 = 0.0000000148
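The conversion that the linked calculator performs can also be done directly in R:

```r
# Expand scientific notation into plain decimal notation
format(1.48e-8, scientific = FALSE)  # "0.0000000148"
```

This is handy when reading p-values and coefficients from regression output, which R prints in exponential notation by default.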
68. Conditional Tree
Conditional tree: a simple non-parametric regression analysis, commonly used in social and psychological studies
Linear regression: all information is combined linearly
Conditional tree regression: visual splitting to capture interaction between variables
Recursive splitting (tree branches)
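The recursive splitting idea can be sketched in a few lines. Real conditional inference trees (e.g. ctree in R's party package, which LVS builds on) choose splits by permutation tests; the simplified sketch below uses Gini impurity reduction instead, on a made-up R-use-style dataset:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Pick the (feature, value) split that most reduces impurity."""
    base = gini(labels)
    best = None
    for feature in rows[0]:
        for value in {r[feature] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[feature] == value]
            right = [y for r, y in zip(rows, labels) if r[feature] != value]
            n = len(labels)
            gain = base - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
            if best is None or gain > best[0]:
                best = (gain, feature, value)
    return best

# Invented toy data: the outcome lines up with store, not with style
rows = [{"store": "Kleins", "style": "casual"},
        {"store": "Kleins", "style": "formal"},
        {"store": "Saks",   "style": "casual"},
        {"store": "Saks",   "style": "formal"}]
labels = ["deletion", "deletion", "retention", "retention"]
print(best_split(rows, labels))  # store gives the cleanest split
```

A full tree would call best_split recursively on each branch (the "recursive splitting" above) until no worthwhile split remains.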
69. Conditional Tree - Tagliamonte and Baayen (2012)
1. The distribution of was/were is split into two groups by individuals.
2. The variant were occurs significantly more frequently with the first group.
70. Conditional Tree - Tagliamonte and Baayen (2012)
1. Polarity is relevant to the second group of individuals.
2. The variant were occurs significantly more often with negative polarity.
71. Conditional Tree - Tagliamonte and Baayen (2012)
1. Affirmative polarity is conditioned by Age.
2. The variant was is produced significantly more often by individuals aged 46 and younger.
73. Conditional Tree
1. Store is the most significant factor for R-use
◦ Kleins (working class store) - more R-deletion
2. R-use in Macy’s and Saks is conditioned by lexical item:
◦ Floor shows more R-retention than Fourth
3. Style is not significant
75. Random Forest
1. Variable importance for predictors
2. Robust with small-n, large-p data (few observations, many predictors)
3. All predictors are considered jointly (allows inclusion of correlated factors)
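Variable importance in a random forest is often measured by permutation: shuffle one predictor's values and see how much predictive accuracy drops. A toy sketch of that idea (the data and the one-rule "model" are invented for illustration; LVS relies on R's random forest implementation):

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, feature, seed=0):
    """Accuracy drop after shuffling one feature's column."""
    shuffled_vals = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled_vals)
    shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled_vals)]
    return accuracy(predict, rows, labels) - accuracy(predict, shuffled, labels)

# Invented toy data: the outcome depends on store only, never on style
rows = [{"store": s, "style": st}
        for s in ("Kleins", "Saks") for st in ("casual", "formal")] * 5
labels = ["retention" if r["store"] == "Saks" else "deletion" for r in rows]
predict = lambda r: "retention" if r["store"] == "Saks" else "deletion"

print(permutation_importance(predict, rows, labels, "store"))  # sizeable drop
print(permutation_importance(predict, rows, labels, "style"))  # exactly 0.0
```

A predictor the model never consults loses nothing when shuffled, which is why an importance near zero (the red dotted cut-off in the LVS plot) marks an irrelevant factor.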
77. Random Forest
1. Store is the most important predictor
2. Lexical Item is the second most important predictor
3. Style is irrelevant: its importance is close to zero and below the red dotted line (cut-off value)
79. Fixed and Mixed Models
Fixed Effects Model: all predictors are treated as independent.
Underlying assumption - no group-internal variation between speakers or tokens
Mixed Effects Model: allows for evaluation of individual- and group-level variation
80. Fixed and Mixed Models
Fixed Regression Model - ignoring individual variation (speakers or words) may lead to a Type I Error: “a chance effect is mistaken for a real difference between the populations”
Mixed Regression Model - prone to Type II Error: “if speaker variation is at a high level, we cannot discern small population effects without a large number of speakers” (Johnson 2009, 2015)
81. Mixed Effect Regression
Mixed Model = fixed effects + random effects
Fixed-effect factor - “repeatable and a small number of levels”: e.g. verb class (event verb, stative verb), gender (male, female)
Random-effect factor - “a non-repeatable random sample from a larger population” (Wieling 2012): e.g. individual verbs (walk, sleep, study, finish, eat, etc.), individual speakers (speaker1, speaker2, speaker3, etc.)
83. Preparing for Mixed Model
1. Download continuousdata.csv
2. Upload the file to LVS
89. Interpretation - Random Effects
1. Standard Deviation: a measure of the variability for each random effect (speakers and tokens)
2. Residual: random variation that is not due to speakers or tokens (residual error)
90. Interpretation - Fixed Effects
1. Estimate/coefficient: reported in log-odds (negative or positive)
2. P-value: indicates whether the level is significant
95. Publications
- Scrivner, Olga and Manuel Díaz-Campos. 2016. Language Variation Suite: A theoretical and methodological contribution for linguistic data analysis. Proceedings of the Linguistic Society of America 1, 29–1
- Estrada, Monica. 2016. El tú no es de nosotros, es de otros países: Usos del voseo y actitudes hacia él en el castellano hondureño. Master’s thesis. Louisiana State University
- Orozco, Rafael. 2018. El castellano del Caribe colombiano en la ciudad de Nueva York: El uso variable de sujetos pronominales. Studies in Hispanic and Lusophone Linguistics 11(1): 89–129
96. Students’ Perspective: Feedback
“This program makes analysis so much more accessible to someone like myself who is new to statistical analysis”
“I realize it is actually a tool of analysis every linguist will appreciate a lot for quantitative analysis”
98. References I
Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press
Bentivoglio, Paola and Mercedes Sedano. 1993. Investigación sociolingüística: sus métodos aplicados a una experiencia venezolana. Boletín de Lingüística 8: 3–35
Díaz-Campos, Manuel and Stephanie Dickinson. In press. Using Statistics as a Tool in the Analysis of Sociolinguistic Variation: A comparison of current and traditional methods. In Fernando Tejedo-Herrero (ed.), Lusophone, Galician, and Hispanic Linguistics: Bridging Frames and Traditions. Amsterdam: John Benjamins
Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber and Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press
Labov, William. 1966. The Social Stratification of English in New York City. Washington: Center for Applied Linguistics
Scrivner, Olga and Manuel Díaz-Campos. 2016. Language Variation Suite: A theoretical and methodological contribution for linguistic data analysis. Proceedings of the Linguistic Society of America 1, 29–1
Strobl, Carolin, James Malley, and Gerhard Tutz. 2009. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14: 323–348
Tagliamonte, Sali and R. Harald Baayen. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24: 135–178
Tagliamonte, Sali. 2016. Quantitative analysis in language variation and change. In Fernando Tejedo-Herrero and Sandro Sessarego (eds.), Spanish Language and Sociolinguistic Analysis. Amsterdam: John Benjamins
102. Histogram
Density: a non-parametric model of the distribution of points based on a smooth density estimate
http://scikit-learn.org/stable/modules/density.html
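The "smooth density estimate" is typically a kernel density estimate: each observation contributes a small Gaussian bump, and the bumps are averaged. A minimal sketch with a hand-picked bandwidth and invented data (scikit-learn's KernelDensity, linked above, implements the same idea with efficient tree structures):

```python
import math

def kde(points, x, bandwidth=0.5):
    """Average of Gaussian kernels centred on the data points."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

data = [1.0, 1.1, 1.2, 3.0, 3.1]  # two clusters of observations
print(kde(data, 1.1))  # high density inside the lower cluster
print(kde(data, 5.0))  # near-zero density far from any observation
```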
104. Adjust Data
Retain: Select data subset
Exclude: Exclude variables from a factor group
Recode: Combine and rename variables
Change class: Numeric → factor; factor → numeric
Transform: Apply log transformation to a specific column
ADJUSTED DATASET:
◦ Run - to apply all above changes
◦ Reset - to reset to the original dataset
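These adjustments correspond to standard data-frame operations. A sketch of Recode and Transform in plain Python (the column names and recoding levels are invented for illustration; LVS applies the equivalent operations to the uploaded spreadsheet):

```python
import math

tokens = [{"store": "Kleins", "duration": 120.0},
          {"store": "Macys",  "duration": 95.0}]

# Recode: combine and rename the levels of a factor group
recode = {"Kleins": "working class", "Macys": "middle class", "Saks": "upper middle class"}
for t in tokens:
    t["store"] = recode[t["store"]]

# Transform: apply a log transformation to a numeric column
for t in tokens:
    t["log_duration"] = math.log(t["duration"])

print(tokens)
```

Run applies the whole chain of adjustments at once; Reset simply restores the original, untouched dataset.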