This document provides an overview of the Language Variation Suite (LVS) toolkit. The LVS is a web application designed for sociolinguistic data analysis. It allows users to upload spreadsheet data, perform data cleaning and preprocessing, generate summary statistics and cross tabulations, create data visualizations, and conduct various statistical analyses including regression modeling, clustering, and random forests. The workshop will cover the structure and functionality of the LVS through practical examples and exercises using sample sociolinguistic datasets.
Workshop NWAV 47 - LVS - Tool for Quantitative Data Analysis
1. Language Variation Suite Toolkit
Olga Scrivner, Rafael Orozco, Manuel Díaz-Campos
NWAV47, October 18, 2018
Indiana University and Louisiana State University
3. What You Will Learn
Part 1 Cross-disciplinary Methods in Sociolinguistics
Part 2 Web Interface - a novel approach for interactive socio-analysis
Part 3 Working with Structured Data: LVS
Part 4 Working with Unstructured Data: ITMS
4. Our Materials - Web Site
https://languagevariationsuite.wordpress.com/
5. What We Need
- Computer
- WiFi
- Smartphone (optional)
- R and RStudio (not really)
8. Our Objectives
Goal: Optimize sociolinguistic analysis by using a variety of cross-disciplinary methods
1. Introduce a novel (socio-)linguistic toolkit for research and teaching
2. Develop practical skills via an interactive and visual interface
3. Understand and interpret advanced statistical and data mining models
11. Cross-disciplinary Methods: A Closer Look
Random Forest - a more stable alternative to stepwise variable selection (Tagliamonte and Baayen, 2012: 25)
Conditional Tree - a hierarchical tree of significant factors and their relationships with other predictors
12. Cross-disciplinary Methods: A Closer Look
Clustering - a grouping of variables, individuals, or documents (Gries and Hilpert, 2008: 69)
Topic Modeling - a discovery of language patterns and word usage over time and across genres (Blei, 2012: 78)
13. How To Unify All Methods: Data Science Workflow
1. Import data into R: read_csv(), read_lines() [even Praat files!]
2. Tidy data - pre-processing
3. Transform with dplyr [select, subset, recode ...]
4. Visualize with ggplot and plotly
5. Model with glm, lda, randomForest ...
Tidyverse - a system of R packages for data manipulation, exploration, and visualization that share a common design philosophy (Wickham and Grolemund, 2017)
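The five steps above can be sketched in base R. This is a minimal, hypothetical illustration using R's built-in mtcars data in place of a sociolinguistic spreadsheet; the slide's tidyverse functions (read_csv(), dplyr verbs, ggplot) are replaced here by their base-R analogues so the sketch runs without extra packages.

```r
# 1. Import: read.csv() is the base-R analogue of readr::read_csv()
#    (here we fake a CSV file from the built-in mtcars data)
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)
d <- read.csv(path)

# 2. Tidy: inspect structure and check for missing values
str(d)
stopifnot(!anyNA(d))

# 3. Transform: subset and recode (base-R analogues of dplyr verbs)
d <- subset(d, select = c(mpg, wt, am))
d$am <- factor(d$am, labels = c("automatic", "manual"))

# 4. Visualize: base plot() standing in for ggplot2
plot(mpg ~ wt, data = d)

# 5. Model: a fixed-effects linear regression
m <- lm(mpg ~ wt + am, data = d)
summary(m)
```

In LVS the same pipeline is driven through menus rather than typed commands.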
14. Sociolinguistic Data - Data Analytics
“Analytics is the critical technology needed to bring value out of data” (Anonymous)
15. Sociolinguistic Data - Visual Analytics
“The science of analytical reasoning facilitated by visual interactive interfaces” (Thomas and Cook, 2005)
“Visual analytics integrates new computational and theory-based tools with innovative interactive techniques and visual representations to enable human-information discourse” (Thomas and Cook, 2005)
[Conditional inference tree for VO/OV order: the first split is on PositionSentence (p < 0.001), followed by Heaviness (p = 0.003), Period (p < 0.001), Focus (p < 0.001), and Main_Verb_Structure (p < 0.001); terminal nodes (n = 43 to 265) show the proportion of VO vs. OV.]
17. What is LVS?
Language Variation Suite is a Shiny web application designed for data analysis in sociolinguistic research.
It can be used for:
Processing spreadsheet data
Reporting in tables and graphs
Analyzing means, regression, conditional trees ... (and much more)
18. Background
LVS is built in R using the Shiny package:
1. R - a free programming language for statistical computing and graphics
2. Shiny - a web application framework for R
Computational power of R + web interactivity
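To illustrate how R and Shiny combine, here is a toy app (a hypothetical sketch, not LVS's actual source; it assumes the shiny package is installed) that uploads a CSV and prints its summary, mimicking the first step of the LVS workflow:

```r
library(shiny)

# UI: a file-upload control and a text area for output
ui <- fluidPage(
  fileInput("file", "Upload a CSV file"),
  verbatimTextOutput("summary")
)

# Server: read the uploaded spreadsheet and summarize each column
server <- function(input, output) {
  output$summary <- renderPrint({
    req(input$file)                      # wait until a file is uploaded
    d <- read.csv(input$file$datapath)   # read the uploaded CSV
    summary(d)                           # quantitative summary per column
  })
}

app <- shinyApp(ui, server)  # launch with runApp(app)
```

All of the statistical power of R is available inside the server function; Shiny supplies the interactivity.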
20. Workspace
Browser
Chrome, Firefox, Safari - recommended
Internet Explorer may cause instability issues
Accessibility
PC, Mac, Linux
◦ Data files can be uploaded from any location on your computer
Smartphone
◦ Data files must be on a cloud platform connected to your phone account (e.g. Dropbox)
21. Server
Since LVS is hosted on a server, Shiny's idle time-out settings may stop the application when it is left inactive (the screen will grey out).
Solution: click Reload and re-upload your CSV file.
23. Data Preparation
Important things to consider before data entry:
File format:
◦ Comma-separated values (CSV) - faster processing
◦ Excel format will slow processing
Column names should not contain spaces
◦ Permitted: non-accented characters, numbers, underscores, hyphens, and periods
One column must contain your dependent variable; the remaining columns contain independent variables.
24. Terminology Review
a. Categorical/Nominal - non-numerical data with two values
◦ yes - no; deletion - retention; perfective - imperfective
b. Continuous - numerical data
◦ duration, age, chronological period
c. Multinomial - non-numerical data with three or more values
◦ deletion - aspiration - retention
d. Ordinal - scale data
25. Test Types

Dependent                           Independent                         Test
Continuous                          Continuous                          t-test
Categorical                         Categorical                         Chi-square
Continuous                          Multiple categorical/continuous     Linear regression
Categorical (binary/multinomial)    Multiple categorical/continuous     Logistic regression

(Dickinson, 2014)
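Each row of the table maps onto a one-line call in R. The sketch below uses R's built-in datasets (sleep, mtcars) rather than the workshop's own data, purely to show the shape of each call:

```r
# Continuous dependent, two-group independent: t-test
t.test(extra ~ group, data = sleep)

# Categorical x categorical: chi-square test of independence
chisq.test(table(mtcars$am, mtcars$cyl))

# Continuous dependent, multiple predictors: linear regression
lm_fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Binary categorical dependent: logistic regression
glm_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_fit)
```

LVS runs the regression variants of these tests behind its menu options.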
37. Summary
Summary provides a quantitative summary for each variable, e.g. frequency count, mean, median.
38. Data Structure
1. Total number of observations (rows)
2. Number of variables (columns)
3. Variable types
◦ Factor - categorical values
◦ num - numeric values (0.95, 1.05)
◦ int - integer values (1, 2, 3)
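In R itself, this information comes from nrow(), ncol(), and str(). A quick illustration on a toy data frame (the column names variant, duration, and period are invented for the example):

```r
d <- data.frame(
  variant  = factor(c("deletion", "retention", "deletion")),  # Factor
  duration = c(0.95, 1.05, 0.87),                             # num
  period   = c(1L, 2L, 3L)                                    # int
)

nrow(d)  # total number of observations (rows): 3
ncol(d)  # number of variables (columns): 3
str(d)   # prints each variable's type: Factor, num, int
```

LVS displays the same str()-style breakdown after a file is uploaded.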
56. Language Variation Suite - Structure
1. Data
◦ Upload file, data summary, adjust data, cross tabulation
2. Visual Analysis
◦ Plotting, cluster classification
3. Inferential Statistics
◦ Modeling, regression, Varbrul analysis, conditional trees, random forest
57. How to Create a Regression Model
Step 1 Modeling - create a model with dependent and independent variables
Step 2 Regression - specify the type of regression (fixed, mixed) and the type of dependent variable (binary, continuous, multinomial)
Step 3 Stepwise Regression - compare models (log-likelihood, AIC, BIC)
Step 4 Conditional Trees - apply non-parametric tests to the model
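Steps 1 through 3 can be sketched in base R (a hypothetical example on mtcars, with the binary variable am standing in for a sociolinguistic dependent variable; LVS exposes the same operations through its menus):

```r
# Steps 1-2: fixed-effects logistic regressions (binary dependent variable)
m1 <- glm(am ~ wt,      data = mtcars, family = binomial)
m2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Step 3: compare the nested models via log-likelihood, AIC, and BIC
logLik(m1); logLik(m2)
AIC(m1, m2)
BIC(m1, m2)
anova(m1, m2, test = "Chisq")  # likelihood-ratio test of the added predictor
```

A lower AIC/BIC favors a model; the chi-square test asks whether the extra predictor significantly improves fit.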
60. Regression Types
Model
a.) Fixed effect
b.) Mixed effect - individual speaker/token variation (within group)
Type of Dependent Variable
a.) Binary/categorical (only two values)
b.) Continuous (numeric)
c.) Multinomial - categorical with more than two values
64. Interpretation
Deletion is the reference value.
Positive coefficient - positive effect
Negative coefficient - negative effect
65. Interpretation - RETENTION
The lexical item Fourth has a negative effect on retention compared to Floor, and it is significant.
Normal style has a slightly negative effect on retention, but its coefficient is not significant.
Macy’s and Saks have a positive and significant effect on retention. Saks (upper-middle-class store) is more significant than Macy’s (middle-class store).
http://www.free-online-calculator-use.com/scientific-notation-converter.html
Exponential notation: 1.48e-8 = 0.0000000148
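The conversion that the linked calculator performs can also be done directly in R:

```r
# Expand scientific notation into plain decimal notation
format(1.48e-8, scientific = FALSE)  # "0.0000000148"
```

This is handy when reading p-values and coefficients from regression output, which R prints in exponential notation by default.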
68. Conditional Tree
Conditional tree: a simple non-parametric regression analysis, commonly used in social and psychological studies
Linear regression: all information is combined linearly
Conditional tree regression: visual splitting to capture interaction between variables
Recursive splitting (tree branches)
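The recursive splitting idea can be sketched in a few lines. Real conditional inference trees (e.g. ctree in R's party package, which LVS builds on) choose splits by permutation tests; the simplified sketch below uses Gini impurity reduction instead, on a made-up R-use-style dataset:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, labels):
    """Pick the (feature, value) split that most reduces impurity."""
    base = gini(labels)
    best = None
    for feature in rows[0]:
        for value in {r[feature] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[feature] == value]
            right = [y for r, y in zip(rows, labels) if r[feature] != value]
            n = len(labels)
            gain = base - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
            if best is None or gain > best[0]:
                best = (gain, feature, value)
    return best

# Invented toy data: the outcome lines up with store, not with style
rows = [{"store": "Kleins", "style": "casual"},
        {"store": "Kleins", "style": "formal"},
        {"store": "Saks",   "style": "casual"},
        {"store": "Saks",   "style": "formal"}]
labels = ["deletion", "deletion", "retention", "retention"]
print(best_split(rows, labels))  # store gives the cleanest split
```

A full tree would call best_split recursively on each branch (the "recursive splitting" above) until no worthwhile split remains.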
69. Conditional Tree - Tagliamonte and Baayen (2012)
1. The distribution of was/were is split into two groups by individuals.
2. The variant were occurs significantly more frequently with the first group.
70. Conditional Tree - Tagliamonte and Baayen (2012)
1. Polarity is relevant to the second group of individuals.
2. The variant were occurs significantly more often with negative polarity.
71. Conditional Tree - Tagliamonte and Baayen (2012)
1. Affirmative polarity is conditioned by Age.
2. The variant was is produced significantly more often by individuals aged 46 and younger.
73. Conditional Tree
1. Store is the most significant factor for R-use
◦ Kleins (working class store) - more R-deletion
2. R-use in Macy’s and Saks is conditioned by lexical item:
◦ Floor shows more R-retention than Fourth
3. Style is not significant
75. Random Forest
1. Variable importance for predictors
2. Robust with small-n, large-p data (few observations, many predictors)
3. All predictors are considered jointly (allows inclusion of correlated factors)
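Variable importance in a random forest is often measured by permutation: shuffle one predictor's values and see how much predictive accuracy drops. A toy sketch of that idea (the data and the one-rule "model" are invented for illustration; LVS relies on R's random forest implementation):

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows the model classifies correctly."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, feature, seed=0):
    """Accuracy drop after shuffling one feature's column."""
    shuffled_vals = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled_vals)
    shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled_vals)]
    return accuracy(predict, rows, labels) - accuracy(predict, shuffled, labels)

# Invented toy data: the outcome depends on store only, never on style
rows = [{"store": s, "style": st}
        for s in ("Kleins", "Saks") for st in ("casual", "formal")] * 5
labels = ["retention" if r["store"] == "Saks" else "deletion" for r in rows]
predict = lambda r: "retention" if r["store"] == "Saks" else "deletion"

print(permutation_importance(predict, rows, labels, "store"))  # sizeable drop
print(permutation_importance(predict, rows, labels, "style"))  # exactly 0.0
```

A predictor the model never consults loses nothing when shuffled, which is why an importance near zero (the red dotted cut-off in the LVS plot) marks an irrelevant factor.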
77. Random Forest
1. Store is the most important predictor
2. Lexical Item is the second most important predictor
3. Style is irrelevant: its importance is close to zero and below the red dotted line (cut-off value)
79. Fixed and Mixed Models
Fixed Effects Model: all predictors are treated as independent.
Underlying assumption - no group-internal variation between speakers or tokens
Mixed Effects Model: allows for evaluation of individual- and group-level variation
80. Fixed and Mixed Models
Fixed Regression Model - ignoring individual variation (speakers or words) may lead to a Type I Error: “a chance effect is mistaken for a real difference between the populations”
Mixed Regression Model - prone to Type II Error: “if speaker variation is at a high level, we cannot discern small population effects without a large number of speakers” (Johnson 2009, 2015)
81. Mixed Effect Regression
Mixed Model = fixed effects + random effects
Fixed-effect factor - “repeatable and a small number of levels”: e.g. verb class (event verb, stative verb), gender (male, female)
Random-effect factor - “a non-repeatable random sample from a larger population” (Wieling 2012): e.g. individual verbs (walk, sleep, study, finish, eat, etc.), individual speakers (speaker1, speaker2, speaker3, etc.)
83. Preparing for Mixed Model
1. Download continuousdata.csv
2. Upload the file to LVS
89. Interpretation - Random Effects
1. Standard Deviation: a measure of the variability for each random effect (speakers and tokens)
2. Residual: random variation that is not due to speakers or tokens (residual error)
90. Interpretation - Fixed Effects
1. Estimate/coefficient: reported in log-odds (negative or positive)
2. P-value: indicates whether the level is significant
95. Publications
- Scrivner, Olga and Manuel Díaz-Campos. 2016. Language Variation Suite: A theoretical and methodological contribution for linguistic data analysis. Proceedings of the Linguistic Society of America 1, 29–1
- Estrada, Monica. 2016. El tú no es de nosotros, es de otros países: Usos del voseo y actitudes hacia él en el castellano hondureño. Master’s thesis. Louisiana State University
- Orozco, Rafael. 2018. El castellano del Caribe colombiano en la ciudad de Nueva York: El uso variable de sujetos pronominales. Studies in Hispanic and Lusophone Linguistics 11(1): 89–129
96. Students’ Perspective: Feedback
“This program makes analysis so much more accessible to someone like myself who is new to statistical analysis”
“I realize it is actually a tool of analysis every linguist will appreciate a lot for quantitative analysis”
98. References I
Baayen, Harald. 2008. Analyzing linguistic data: A practical introduction to statistics. Cambridge: Cambridge University Press
Bentivoglio, Paola and Mercedes Sedano. 1993. Investigación sociolingüística: sus métodos aplicados a una experiencia venezolana. Boletín de Lingüística 8: 3–35
Díaz-Campos, Manuel and Stephanie Dickinson. In press. Using Statistics as a Tool in the Analysis of Sociolinguistic Variation: A comparison of current and traditional methods. In Fernando Tejedo-Herrero (ed.), Lusophone, Galician, and Hispanic Linguistics: Bridging Frames and Traditions. Amsterdam: John Benjamins
Gries, Stefan Th. 2015. Quantitative designs and statistical techniques. In Douglas Biber and Randi Reppen (eds.), The Cambridge Handbook of English Corpus Linguistics. Cambridge: Cambridge University Press
Labov, William. 1966. The Social Stratification of English in New York City. Washington: Center for Applied Linguistics
Scrivner, Olga and Manuel Díaz-Campos. 2016. Language Variation Suite: A theoretical and methodological contribution for linguistic data analysis. Proceedings of the Linguistic Society of America 1, 29–1
Strobl, Carolin, James Malley, and Gerhard Tutz. 2009. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods 14: 323–348
Tagliamonte, Sali and R. Harald Baayen. 2012. Models, forests and trees of York English: Was/were variation as a case study for statistical practice. Language Variation and Change 24: 135–178
Tagliamonte, Sali. 2016. Quantitative analysis in language variation and change. In Fernando Tejedo-Herrero and Sandro Sessarego (eds.), Spanish Language and Sociolinguistic Analysis. Amsterdam: John Benjamins
102. Histogram
Density: a non-parametric model of the distribution of points based on a smooth density estimate
http://scikit-learn.org/stable/modules/density.html
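The "smooth density estimate" is typically a kernel density estimate: each observation contributes a small Gaussian bump, and the bumps are averaged. A minimal sketch with a hand-picked bandwidth and invented data (scikit-learn's KernelDensity, linked above, implements the same idea with efficient tree structures):

```python
import math

def kde(points, x, bandwidth=0.5):
    """Average of Gaussian kernels centred on the data points."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

data = [1.0, 1.1, 1.2, 3.0, 3.1]  # two clusters of observations
print(kde(data, 1.1))  # high density inside the lower cluster
print(kde(data, 5.0))  # near-zero density far from any observation
```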
104. Adjust Data
Retain: Select data subset
Exclude: Exclude variables from a factor group
Recode: Combine and rename variables
Change class: Numeric → factor; factor → numeric
Transform: Apply log transformation to a specific column
ADJUSTED DATASET:
◦ Run - to apply all above changes
◦ Reset - to reset to the original dataset
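These adjustments correspond to standard data-frame operations. A sketch of Recode and Transform in plain Python (the column names and recoding levels are invented for illustration; LVS applies the equivalent operations to the uploaded spreadsheet):

```python
import math

tokens = [{"store": "Kleins", "duration": 120.0},
          {"store": "Macys",  "duration": 95.0}]

# Recode: combine and rename the levels of a factor group
recode = {"Kleins": "working class", "Macys": "middle class", "Saks": "upper middle class"}
for t in tokens:
    t["store"] = recode[t["store"]]

# Transform: apply a log transformation to a numeric column
for t in tokens:
    t["log_duration"] = math.log(t["duration"])

print(tokens)
```

Run applies the whole chain of adjustments at once; Reset simply restores the original, untouched dataset.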