Language Variation Suite Toolkit
Olga Scrivner, Rafael Orozco, Manuel Díaz-Campos
NWAV47, October 18, 2018
Indiana University and Louisiana State University
Workshop Information
What You Will Learn
Part 1 Cross-disciplinary Methods in Sociolinguistics
Part 2 Web Interface - a novel approach for interactive
socio-analysis
Part 3 Working with Structured Data: LVS
Part 4 Working with Unstructured Data: ITMS
Our Materials - Web Site
https://languagevariationsuite.wordpress.com/
What We Need
- Computer
- WiFi
- Smartphone (optional)
- R and RStudio (not really)
Cross-disciplinary Methods
Our Objectives
Goal: Optimize sociolinguistic analysis by using a variety of
cross-disciplinary methods
1. Introduce a novel (socio-)linguistic toolkit for research and
as a teaching tool
2. Develop practical skills via an interactive, visual interface
3. Understand and interpret advanced statistical and data
mining models
Cross-disciplinary Research
[Diagram: Corpus Linguistics, Computational Linguistics, Sociolinguistics, Data Science, Statistics]
Cross-disciplinary Research: Methods
[Diagram: Corpus Linguistics, Computational Linguistics, Sociolinguistics, Data Science, and Statistics, annotated with methods: cleaning, pre-processing, creating corpora, parametric and non-parametric models, clustering, topic modeling]
Cross-disciplinary Methods: A Closer Look
Random Forest - a more stable alternative to stepwise variable
selection (Tagliamonte and Baayen, 2012, 25)
Conditional Tree - a hierarchical tree of significant factors and
their relationship with other predictors
Cross-disciplinary Methods: A Closer Look
Clustering - a grouping of variables or individuals or
documents (Gries and Hilpert, 2008, 69)
Topic Modeling - a discovery of language patterns and word usage
over time and across genres (Blei, 2012, 78)
How To Unify All Methods: Data Science Workflow
1. Import data into R: read_csv(), read_lines() [even Praat
files!]
2. Tidy data - pre-processing
3. Transform with dplyr [select, subset, recode...]
4. Visualize with ggplot2 and plotly
5. Model with glm, lda, randomForest...
Tidyverse - a system of R packages for data manipulation,
exploration, and visualization that share a common design
philosophy (Wickham and Grolemund, 2017)
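The workflow steps above can be sketched end to end. The example below is a minimal illustration in Python using only the standard library (the workshop itself relies on R and the tidyverse). The inline dataset, its column names, and all helper names are hypothetical and exist only to mirror the import, tidy, transform, and summarize steps.

```python
import csv
import io
from collections import Counter

# Hypothetical inline data standing in for an uploaded CSV file.
RAW = """store, r_use ,style
Saks,retention,normal
Macys,Retention,emphatic
Kleins,deletion,normal
Saks,retention ,emphatic
Kleins,deletion,emphatic
"""

def import_rows(text):
    """Step 1: import the CSV text as a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def tidy(rows):
    """Step 2: strip stray spaces and normalize case in names and values."""
    return [{k.strip(): v.strip().lower() for k, v in row.items()} for row in rows]

def transform(rows, keep_columns):
    """Step 3: keep only the columns of interest."""
    return [{k: row[k] for k in keep_columns} for row in rows]

def summarize(rows, column):
    """A frequency count standing in for the visualize and model steps."""
    return Counter(row[column] for row in rows)

rows = transform(tidy(import_rows(RAW)), ["store", "r_use"])
counts = summarize(rows, "r_use")
print(counts)
```

Each step returns a new list, so any intermediate result can be inspected before moving on to the next step.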
Sociolinguistic Data - Data Analytics
“Analytics is the critical technology
needed to bring value out of data”
(Anonymous)
Sociolinguistic Data - Visual Analytics
“The science of analytical reasoning
facilitated by visual interactive interfaces”
(Thomas and Cook, 2005)
“Visual analytics integrates new computational and
theory-based tools with innovative interactive techniques
and visual representations to enable human-information
discourse” (Thomas and Cook, 2005)
[Figure: conditional tree with VO/OV as the response. Splits: node 1 PositionSentence (ind, pre vs. post; p < 0.001); node 2 Heaviness (≤ 1 vs. > 1; p = 0.003); node 3 Period (≤ 1 vs. > 1; p < 0.001); node 7 Period (≤ 2 vs. > 2; p < 0.001); node 9 Focus (cf vs. nf; p < 0.001); node 11 Main_Verb_Structure (ACI vs. Other, Restructuring; p < 0.001). Terminal nodes: 4 (n = 81), 5 (n = 119), 6 (n = 181), 8 (n = 221), 10 (n = 66), 12 (n = 43), 13 (n = 265).]
Introduction to LVS
What is LVS?
Language Variation Suite
It is a Shiny web application designed for data analysis in
sociolinguistic research.
It can be used for:
Processing spreadsheet data
Reporting in tables and graphs
Analyzing means, regression, conditional trees ...
(and much more)
Background
LVS is built in R using the Shiny package:
1. R - a free programming language for statistical computing
and graphics
2. Shiny - a web application framework for R
Computational power of R + Web interactivity
Background
http://littleactuary.github.io/blog/Web-application-framework-with-Shiny/
Workspace
Browser
Chrome, Firefox, Safari - recommended
Internet Explorer may cause instability issues
Accessibility
PC, Mac, Linux
◦ Data files can be uploaded from any location on your
computer
Smartphone
◦ Data files must be on a cloud platform connected to your
phone account (e.g. Dropbox)
Server
Since LVS is hosted on a server, Shiny idle time-out settings may
stop the application when it is left inactive (the page will grey out).
Solution: Click reload and re-upload your CSV file
Data Preparation
Data Preparation
Important things to consider before data entry:
File format:
◦ Comma-separated values (CSV) - faster processing
◦ Excel format will slow processing
Column names should not contain spaces
◦ Permitted: non-accented characters, numbers, underscore,
hyphen, and period
One column must contain your dependent variable
The rest of the columns contain independent variables
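The column-name rule can be checked before upload. The following is a small Python sketch, not part of LVS itself: the pattern encodes the permitted characters listed above, and the header names are hypothetical.

```python
import re

# Permitted: non-accented letters, digits, underscore, hyphen, and period.
VALID_NAME = re.compile(r"[A-Za-z0-9_.-]+")

def invalid_column_names(header):
    """Return the column names that break the naming rules."""
    return [name for name in header if not VALID_NAME.fullmatch(name)]

# Hypothetical header row: one name has a space, one has an accented letter.
header = ["Store", "r_use", "lexical item", "duración", "age.group"]
print(invalid_column_names(header))
```

Renaming the flagged columns (for example to lexical_item and duracion) brings the file in line with the rule.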
Terminology Review
a. Categorical/Nominal - non-numerical data with two
values
◦ yes - no; deletion - retention; perfective - imperfective
b. Continuous - numerical data
◦ duration, age, chronological period
c. Multinomial - non-numerical data with three or more
values
◦ deletion - aspiration - retention
d. Ordinal - values on an ordered scale
Test Types
Dependent     Independent                       Test
Continuous    Continuous                        t-test
Categorical   Categorical                       Chi-square
Continuous    Multiple Categorical/Continuous   Linear Regression
Categorical   Multiple Categorical/Continuous   Logistic Regression (Binary or Multinomial)
(Dickinson, 2014)
Questions?
Comments?
Workshop Files
https://languagevariationsuite.wordpress.com/
1. categoricaldata.csv: categorical dependent variable - Labov's
1966 New York City study
2. continuousdata.csv: continuous dependent variable - intervocalic
/d/ in the Caracas corpus (Díaz-Campos and Gradoville,
2011; Scrivner and Díaz-Campos, 2016)
3. LVS web site:
www.languagevariationsuite.com
LVS Structure
Language Variation Suite - Structure
1. Data
◦ Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
◦ Plotting, cluster classification
3. Inferential Statistics
◦ Modeling, regression, conditional trees, random forest
Upload File
Upload movie_metadata.csv
Excel Format
1. Slow processing
2. Requires the name of your Excel sheet
Tip: Save Your Excel File as CSV
To optimize speed, save the file as CSV prior to upload
Upload File
Upload categoricaldata.csv
Uploaded Dataset
The data are imported as a table whose columns can be sorted.
Summary
Summary provides a quantitative summary for each variable,
e.g. frequency count, mean, median.
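As an illustration of what such a summary contains, the Python sketch below reports a mean and median for numeric columns and frequency counts for everything else. The column names and values are hypothetical, and this is not the code LVS runs.

```python
from collections import Counter
from statistics import mean, median

def summarize_column(values):
    """Mean and median for numeric columns, frequency counts otherwise."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return {"type": "categorical", "counts": dict(Counter(values))}
    return {"type": "numeric", "mean": mean(numbers), "median": median(numbers)}

# Hypothetical columns, read as text the way a CSV reader delivers them.
data = {
    "store": ["Saks", "Macys", "Saks", "Kleins"],
    "duration": ["0.5", "1.5", "1.0", "3.0"],
}
summary = {name: summarize_column(column) for name, column in data.items()}
print(summary)
```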
Data Structure
1. Total number of observations (rows)
2. Number of variables (columns)
3. Variable types
◦ Factor - categorical values
◦ Num - numeric values (0.95, 1.05)
◦ Int - integer values (1, 2, 3)
Cross Tabulation
Cross-tabulation examines the relationship between variables.
Variables
Cross-Tabulation Output
Raw frequency / Proportion by column / Proportion across row
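The three cross-tabulation outputs differ only in what each cell is divided by. A minimal Python sketch, assuming a hypothetical two-store dataset (the counts are invented for illustration):

```python
from collections import Counter

def cross_tabulate(rows, row_var, col_var):
    """Return raw counts, proportions across each row, and proportions by column."""
    counts = Counter((r[row_var], r[col_var]) for r in rows)
    row_levels = sorted({r[row_var] for r in rows})
    col_levels = sorted({r[col_var] for r in rows})
    raw = {rl: {cl: counts[(rl, cl)] for cl in col_levels} for rl in row_levels}
    row_totals = {rl: sum(raw[rl].values()) for rl in row_levels}
    col_totals = {cl: sum(raw[rl][cl] for rl in row_levels) for cl in col_levels}
    by_row = {rl: {cl: raw[rl][cl] / row_totals[rl] for cl in col_levels}
              for rl in row_levels}
    by_col = {rl: {cl: raw[rl][cl] / col_totals[cl] for cl in col_levels}
              for rl in row_levels}
    return raw, by_row, by_col

# Hypothetical tokens: store by r_use.
rows = (
    [{"store": "Saks", "r_use": "retention"}] * 3
    + [{"store": "Saks", "r_use": "deletion"}] * 1
    + [{"store": "Kleins", "r_use": "retention"}] * 1
    + [{"store": "Kleins", "r_use": "deletion"}] * 3
)
raw, by_row, by_col = cross_tabulate(rows, "store", "r_use")
print(raw)
print(by_row)
print(by_col)
```

Row proportions answer "within this store, how often is /r/ retained?", while column proportions answer "of all retained tokens, how many come from this store?".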
Questions?
Comments?
Visual Analysis
Language Variation Suite - Structure
1. Data
◦ Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
◦ Plotting, cluster classification
3. Inferential Statistics
◦ Modeling, regression, Varbrul analysis, conditional trees,
random forest
Adjusting Browser - Layout
Shiny pages are fluid and reactive.
One Variable Plot
Two Variables Plot
Saving Plot
Three Variables Plot
Cluster Plot
Classification of data into sub-groups is based on pairwise
similarities
Groups are clustered in the form of a tree-like dendrogram
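The idea of building a dendrogram from pairwise similarities can be sketched with single-linkage agglomerative clustering. This Python example is only an illustration: the retention rates are hypothetical, and the linkage method LVS uses is not stated in the slides.

```python
def single_linkage(points):
    """Agglomerative clustering; returns the merges in the order they happen.

    points maps a label to a number. Each merge is recorded as
    (cluster_a, cluster_b, distance), with clusters as sorted tuples of labels.
    """
    clusters = [(label,) for label in sorted(points)]
    merges = []

    def distance(a, b):
        # Single linkage: distance between the two closest members.
        return min(abs(points[x] - points[y]) for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(distance(a, b), a, b)
                 for i, a in enumerate(clusters)
                 for b in clusters[i + 1:]]
        d, a, b = min(pairs)  # the most similar pair merges first
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges

# Hypothetical retention rates per store.
rates = {"Saks": 0.62, "Macys": 0.51, "Kleins": 0.21}
merges = single_linkage(rates)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 2))
```

Reading the merges from first to last corresponds to reading a dendrogram from its leaves up to its root.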
Cluster Plot
Saks (upper middle-class store), Macy’s (middle-class store), Kleins
(working-class store)
Questions?
Comments?
Inferential Analysis
Inferential Statistics
How to Create a Regression Model
Step 1 Modeling - create a model with dependent and
independent variables
Step 2 Regression - specify the type of regression (fixed,
mixed) and type of dependent variable (binary,
continuous, multinomial)
Step 3 Stepwise Regression - compare models
(Log-likelihood, AIC, BIC)
Step 4 Conditional Trees - apply non-parametric tests to
the model
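For the model comparison in Step 3, AIC and BIC are both computed from the log-likelihood: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of parameters and n the number of observations. Lower values indicate a better trade-off between fit and complexity. The Python sketch below uses hypothetical log-likelihoods and parameter counts, not results from the workshop data.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fits: model name -> (log-likelihood, number of parameters).
models = {
    "store": (-420.0, 3),
    "store + lexical item": (-400.0, 4),
    "store + lexical item + style": (-399.5, 5),
}
n_obs = 500  # hypothetical number of tokens

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)
```

In this invented example, adding style raises the log-likelihood only slightly, so the extra parameter is not worth its penalty under either criterion.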
Modeling
We are interested in RETENTION (= Application)
Regression Types
Model
a.) Fixed effect
b.) Mixed effect - individual speaker/token variation (within
group)
Type of Dependent Variable
a.) Binary/categorical (only two values)
b.) Continuous (numeric)
c.) Multinomial - categorical with more than two values
Regression
Model Output
Interpretation
Deletion is the reference value
Positive coefficient - positive effect
Negative coefficient - negative effect
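In a binary logistic regression the coefficients are on the log-odds scale, so their sign is easier to interpret once converted to probabilities. The Python sketch below assumes hypothetical coefficient values (they are not taken from the model output) and uses the inverse logit, p = 1 / (1 + exp(-x)).

```python
import math

def inverse_logit(log_odds):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical log-odds of retention; deletion is the reference value.
intercept = -1.0
coefficients = {"store=Saks": 1.5, "word=fourth": -0.5}

baseline = inverse_logit(intercept)
with_saks = inverse_logit(intercept + coefficients["store=Saks"])
with_fourth = inverse_logit(intercept + coefficients["word=fourth"])

print(round(baseline, 3), round(with_saks, 3), round(with_fourth, 3))
```

A positive coefficient moves the predicted probability of retention above the baseline, and a negative coefficient moves it below.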
Interpretation - RETENTION
Lexical item Fourth has a negative effect on retention compared
to Floor and is significant
Normal style has a slightly negative effect on retention but its
coefficient is not significant
Macy’s and Saks have a positive and significant effect on
retention. Saks (upper middle class store) is more significant
than Macy’s (middle class store)
http://www.free-online-calculator-use.com/scientific-notation-converter.html
Exponential notation: 1.48e-8 = 0.0000000148
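The same conversion can be done without an online converter. A short Python sketch, using the value shown above:

```python
# Parse the exponential notation and print it with ten decimal places.
p_value = float("1.48e-8")
as_decimal = format(p_value, ".10f")
print(as_decimal)

# The usual check against a 0.05 significance threshold.
significant = p_value < 0.05
print(significant)
```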
Questions?
Comments?
Conditional Tree
Conditional tree: a simple non-parametric regression analysis,
commonly used in social and psychological studies
Linear regression: all information is combined linearly
Conditional tree regression: visual splitting to capture
interaction between variables
Recursive splitting (tree branches)
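Recursive splitting can be sketched in a few lines. The Python example below is a deliberate simplification: real conditional trees select splits with permutation-based significance tests, whereas this sketch swaps in a plain Pearson chi-square statistic with a fixed threshold, and it assumes binary predictors and a binary response (one degree of freedom). The data are hypothetical.

```python
from collections import Counter

CHI2_CRITICAL_DF1 = 3.841  # 5% critical value for one degree of freedom

def chi_square(rows, predictor, response):
    """Pearson chi-square statistic for a predictor-by-response table."""
    n = len(rows)
    cells = Counter((r[predictor], r[response]) for r in rows)
    p_totals = Counter(r[predictor] for r in rows)
    y_totals = Counter(r[response] for r in rows)
    stat = 0.0
    for p, p_n in p_totals.items():
        for y, y_n in y_totals.items():
            expected = p_n * y_n / n
            stat += (cells[(p, y)] - expected) ** 2 / expected
    return stat

def grow_tree(rows, predictors, response):
    """Split on the strongest predictor while it passes the threshold."""
    scores = {p: chi_square(rows, p, response) for p in predictors}
    best = max(scores, key=scores.get) if scores else None
    if best is None or scores[best] < CHI2_CRITICAL_DF1:
        return dict(Counter(r[response] for r in rows))  # terminal node
    rest = [p for p in predictors if p != best]
    return {
        (best, level): grow_tree([r for r in rows if r[best] == level], rest, response)
        for level in sorted({r[best] for r in rows})
    }

def make_rows(store, style, retained, deleted):
    """Build hypothetical tokens for one store-by-style cell."""
    base = {"store": store, "style": style}
    return ([dict(base, r="retention")] * retained
            + [dict(base, r="deletion")] * deleted)

rows = (make_rows("Saks", "normal", 8, 2) + make_rows("Saks", "emphatic", 8, 2)
        + make_rows("Kleins", "normal", 2, 8) + make_rows("Kleins", "emphatic", 2, 8))
tree = grow_tree(rows, ["store", "style"], "r")
print(tree)
```

With these invented counts the tree splits once on store and then stops, because style adds no further separation within either store.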
Conditional Tree - Tagliamonte and Baayen (2012)
1. The distribution of was/were is split in two groups by
individuals.
2. The variant were occurs significantly more frequently with the
first group.
Conditional Tree - Tagliamonte and Baayen (2012)
1. Polarity is relevant to the second group of individuals.
2. The variant were occurs significantly more often with negative
polarity
Conditional Tree - Tagliamonte and Baayen (2012)
1. Affirmative Polarity is conditioned by Age.
2. The variant was is produced significantly more often by
individuals aged 46 and younger.
Conditional Tree
1. Store is the most significant factor for R-use
◦ Kleins (working class store) - more R-deletion
2. R-use in Macy’s and Saks is conditioned by lexical item:
◦ Floor shows more R-retention than Fourth
3. Style is not significant
Model Output
Random Forest
1. Variable importance for predictors
2. Robust technique with small-n, large-p data
3. All predictors considered jointly (allows for inclusion of
correlated factors)
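One common way to measure variable importance is by permutation: shuffle one predictor and see how much prediction accuracy drops. The Python sketch below illustrates only that idea. It does not build a random forest; a simple lookup-table model stands in for the forest, and the data, function names, and seed are hypothetical.

```python
import random
from collections import Counter, defaultdict

def fit_lookup(rows, predictors, response):
    """Toy model: the majority response for each combination of predictor values."""
    votes = defaultdict(Counter)
    for r in rows:
        votes[tuple(r[p] for p in predictors)][r[response]] += 1
    overall = Counter(r[response] for r in rows).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda r: table.get(tuple(r[p] for p in predictors), overall)

def accuracy(model, rows, response):
    return sum(model(r) == r[response] for r in rows) / len(rows)

def permutation_importance(model, rows, predictors, response, repeats=20, seed=1):
    """Mean drop in accuracy after shuffling each predictor in turn."""
    rng = random.Random(seed)
    base = accuracy(model, rows, response)
    importance = {}
    for p in predictors:
        drops = []
        for _ in range(repeats):
            shuffled = [r[p] for r in rows]
            rng.shuffle(shuffled)
            permuted = [dict(r, **{p: v}) for r, v in zip(rows, shuffled)]
            drops.append(base - accuracy(model, permuted, response))
        importance[p] = sum(drops) / repeats
    return importance

# Hypothetical data: store fully determines the response, style is unrelated.
rows = [
    {"store": store, "style": style, "r": outcome}
    for store, outcome in (("Saks", "retention"), ("Kleins", "deletion"))
    for style in ("normal", "emphatic")
    for _ in range(10)
]
predictors = ["store", "style"]
model = fit_lookup(rows, predictors, "r")
importance = permutation_importance(model, rows, predictors, "r")
print(importance)
```

A predictor whose shuffling leaves accuracy unchanged, like style here, receives an importance of zero.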
Random Forest
1. Store is the most important predictor
2. Lexical Item is the second most important predictor
3. Style is irrelevant: its importance is close to zero and to the
red dotted line (cut-off value).
Mixed Effects
Fixed and Mixed Models
Fixed Effects Model: All predictors are treated as independent.
Underlying assumption - no group-internal
variation between speakers or tokens
Mixed Effects Model: Allows for evaluation of individual- and
group-level variation
Fixed and Mixed Models
Fixed Regression Model - ignoring individual variations
(speakers or words) may lead to Type I Error:
“a chance effect is mistaken for a real difference
between the populations”
Mixed Regression Model - prone to Type II Error:
“if speaker variation is at a high level, we cannot
discern small population effects without a large
number of speakers” (Johnson 2009, 2015)
Mixed Effect Regression
Mixed Model = fixed effects + random effects
Fixed-effect factor - “repeatable and a small number of levels”
Random-effect factor - “a non-repeatable random sample from
a larger population” (Wieling 2012)
Fixed-effect examples: event verb, stative verb; male, female
Random-effect examples: walk, sleep, study, finish, eat, etc.; speaker1, speaker2, speaker3, etc.
Preparing for Mixed Model
1. Download continuousdata.csv
2. Upload this file to LVS
Data - Uploaded Dataset
Mixed Effect Modeling
NULL when the dependent variable is continuous
Fixed Effects - independent variables
Mixed Effects - group-internal variation
Regression Results
Interpretation - Random Effects
1. Standard Deviation: a measure of the variability for each
random effect (speakers and tokens)
2. Residual: random variation that is not due to speakers or
tokens (residual error)
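The distinction between speaker-level variation and residual variation can be illustrated descriptively. The Python sketch below is not how a mixed model estimates these quantities (that is done jointly with the fixed effects, typically by maximum likelihood); it only separates the spread of speaker means from the spread of tokens around their own speaker's mean. The speakers and values are hypothetical.

```python
import math
from statistics import mean, stdev

def describe_variation(tokens):
    """Return (between-speaker SD, residual SD) for (speaker, value) pairs."""
    by_speaker = {}
    for speaker, value in tokens:
        by_speaker.setdefault(speaker, []).append(value)
    speaker_means = {s: mean(v) for s, v in by_speaker.items()}
    # Spread of the speaker means around their own mean.
    between_sd = stdev(list(speaker_means.values()))
    # Spread of each token around its own speaker's mean.
    squared = [(value - speaker_means[s]) ** 2 for s, value in tokens]
    residual_sd = math.sqrt(sum(squared) / (len(tokens) - len(by_speaker)))
    return between_sd, residual_sd

# Hypothetical measurements for three speakers.
tokens = [
    ("speaker1", 10.0), ("speaker1", 12.0), ("speaker1", 14.0),
    ("speaker2", 20.0), ("speaker2", 22.0), ("speaker2", 24.0),
    ("speaker3", 30.0), ("speaker3", 32.0), ("speaker3", 34.0),
]
between_sd, residual_sd = describe_variation(tokens)
print(between_sd, residual_sd)
```

Here the speakers differ from one another far more than tokens differ within a speaker, which is the situation in which ignoring the speaker grouping is most misleading.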
Interpretation - Fixed Effects
1. Estimate/coefficient: reported in log-odds (negative or
positive)
2. P-value: indicates whether a level is statistically significant
Bonus - Word Clouds
Frequency Plot
Applied Projects with LVS
Publications
- Scrivner, Olga, and Manuel Díaz-Campos. 2016. Language
Variation Suite: A theoretical and methodological
contribution for linguistic data analysis. Proceedings of the
Linguistic Society of America, 1, 29–1
- Estrada, Monica. 2016. El tú no es de nosotros, es de otros
países: Usos del voseo y actitudes hacia él en el castellano
hondureño. Master's thesis. Louisiana State University
- Orozco, Rafael. 2018. El castellano del Caribe colombiano
en la ciudad de Nueva York: El uso variable de sujetos
pronominales. Studies in Hispanic and Lusophone
Linguistics 11(1): 89–129
Students' Perspective: Feedback
“This program makes analysis so much more accessible to
someone like myself who is new to statistical analysis”
“I realize it is actually a tool of analysis every linguist will
appreciate a lot for quantitative analysis”
Teachers' Perspective
References I
Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University
Press.
Bentivoglio, Paola, and Mercedes Sedano. 1993. Investigación sociolingüística: sus métodos aplicados a una
experiencia venezolana. Boletín de Lingüística 8: 3–35.
Díaz-Campos, Manuel, and Stephanie Dickinson. In press. Using statistics as a tool in the analysis of sociolinguistic
variation: A comparison of current and traditional methods. In Fernando Tejedo-Herrero (ed.), Lusophone,
Galician, and Hispanic Linguistics: Bridging Frames and Traditions. Amsterdam: John Benjamins.
Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber and Randi Reppen (eds.), The
Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press.
Labov, W. 1966. The Social Stratification of English in New York City. Washington: Center for Applied Linguistics.
Scrivner, Olga, and Manuel Díaz-Campos. 2016. Language Variation Suite: A theoretical and methodological
contribution for linguistic data analysis. Proceedings of the Linguistic Society of America, 1, 29–1
Strobl, Carolin, James Malley, and Gerhard Tutz. 2009. An introduction to recursive partitioning: Rationale, application,
and characteristics of classification and regression trees, bagging, and random forests. Psychological
Methods 14: 323–348.
Tagliamonte, Sali, and R. Harald Baayen. 2012. Models, forests and trees of York English: Was/were variation as a
case study for statistical practice. Language Variation and Change 24: 135–178.
Tagliamonte, Sali. 2016. Quantitative analysis in language variation and change. In Fernando Tejedo-Herrero and
Sandro Sessarego (eds.), Spanish Language and Sociolinguistic Analysis. Amsterdam: John Benjamins.
Thank you!
Appendix
Appendix 1: Density
Histogram: the number of bins can have a disproportionate effect on visualization
Density: a non-parametric model of the distribution of points based
on a smooth density estimate
http://scikit-learn.org/stable/modules/density.html
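The contrast between the two displays can be sketched in Python. The example below bins the same hypothetical values with two different bin counts, then builds a Gaussian kernel density estimate by hand. The bandwidth of 0.5 is an arbitrary assumption, and the function names are illustrative rather than taken from any library.

```python
import math

def histogram(values, n_bins):
    """Count values in n_bins equal-width bins between the minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - lo) / width), n_bins - 1)  # maximum goes in the last bin
        counts[index] += 1
    return counts

def kernel_density(values, bandwidth):
    """Return a function giving a smooth Gaussian density estimate at x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

# Hypothetical measurements.
values = [1.0, 1.2, 1.9, 2.1, 2.2, 3.8, 4.0, 5.0]
two_bins = histogram(values, 2)
four_bins = histogram(values, 4)
density = kernel_density(values, bandwidth=0.5)
print(two_bins, four_bins)
print(round(density(2.0), 3), round(density(3.0), 3))
```

The two histograms suggest different shapes for the same data, while the density estimate does not depend on bin edges (it does depend on the chosen bandwidth).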
Appendix 2 - Data Modification
Adjust Data
Retain: Select data subset
Exclude: Exclude variables from a factor group
Recode: Combine and rename variables
Change class: Numeric → factor; factor → numeric
Transform: Apply log transformation to a specific column
ADJUSTED DATASET:
◦ Run - to apply all above changes
◦ Reset - to reset to the original dataset
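The adjustments listed above correspond to simple table operations. The Python sketch below mirrors retain, exclude, recode, and log transformation on a hypothetical dataset. The function names and values are illustrative and are not LVS internals. Each function returns a new table, so the original stays available for a reset.

```python
import math

def retain(rows, column, keep):
    """Keep only rows whose value in column is in keep."""
    return [r for r in rows if r[column] in keep]

def exclude(rows, column, drop):
    """Drop rows whose value in column is in drop."""
    return [r for r in rows if r[column] not in drop]

def recode(rows, column, mapping):
    """Rename or combine values; unmapped values are left unchanged."""
    return [dict(r, **{column: mapping.get(r[column], r[column])}) for r in rows]

def log_transform(rows, column):
    """Change the column to numeric and apply the natural logarithm."""
    return [dict(r, **{column: math.log(float(r[column]))}) for r in rows]

# Hypothetical dataset.
original = [
    {"store": "Saks", "style": "normal", "duration": "1.0"},
    {"store": "Macys", "style": "emphatic", "duration": "2.0"},
    {"store": "Kleins", "style": "normal", "duration": "4.0"},
]

adjusted = exclude(original, "style", {"emphatic"})
adjusted = recode(adjusted, "store", {"Saks": "upper", "Kleins": "lower"})
adjusted = log_transform(adjusted, "duration")
print(adjusted)
```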
Exclude: Emphatic Style
Adjusted Dataset
Adjusting Dataset
To revert to the original data, select RESET.
Appendix 3 - Model Comparison
