2. Idiolect, sociolect, intertextuality
What?
- Idiolect: an individual’s distinctive and unique use of language
- Sociolect: a variety of language associated with a social group (socioeconomic, ethnic, age)
- Intertextuality: the shaping of a text’s meaning by another text
3. Forensic Linguistics
“Forensic linguistics, legal linguistics, or language and the law, is the application of
linguistic knowledge, methods and insights to the forensic context of law,
language, crime investigation, trial, and judicial procedure. It is a branch of applied
linguistics.” [Wikipedia]
- Authorship Attribution
- Authorship Identification
- Gender/age classification, etc.
4. Dataset
- 8M tweets between 18/06/2015 and 06/08/2015
- 92M words (whitespace-tokenized)
- 190K users
- Key events during this period
- Referendum Announcement
- Capital Controls
- Referendum voting
6. Basic Data Exploration - Counting
Check for trends:
- Lowercase vs Uppercase ratios
- Relative frequencies of important (propaganda) words
- Average text length (per day)
- Average word length (per day)
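The counting checks listed above could be computed in one pass over the tweets. The sketch below is only an illustration: the function name `daily_stats`, the `(day, text)` input shape, and the choice to measure text length in whitespace tokens are assumptions, not details taken from the slides.

```python
from collections import defaultdict

def daily_stats(tweets, keywords):
    """tweets: iterable of (day, text) pairs; keywords: set of lowercase words.
    Returns {day: dict of simple counting statistics}."""
    acc = defaultdict(lambda: {"tweets": 0, "words": 0, "chars": 0,
                               "upper": 0, "lower": 0, "keyword_hits": 0})
    for day, text in tweets:
        a = acc[day]
        tokens = text.split()  # whitespace tokenization, as in the dataset slide
        a["tweets"] += 1
        a["words"] += len(tokens)
        a["chars"] += sum(len(t) for t in tokens)
        a["upper"] += sum(ch.isupper() for ch in text)
        a["lower"] += sum(ch.islower() for ch in text)
        a["keyword_hits"] += sum(t.lower() in keywords for t in tokens)

    stats = {}
    for day, a in acc.items():
        stats[day] = {
            # uppercase letters per lowercase letter
            "upper_lower_ratio": a["upper"] / a["lower"] if a["lower"] else 0.0,
            # share of tokens that are one of the tracked keywords
            "keyword_rel_freq": a["keyword_hits"] / a["words"] if a["words"] else 0.0,
            # average number of tokens per tweet (assumed unit of "text length")
            "avg_text_length": a["words"] / a["tweets"],
            # average number of characters per token
            "avg_word_length": a["chars"] / a["words"] if a["words"] else 0.0,
        }
    return stats
```

Plotting each statistic against the day would then show whether it moves around the key events.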
11. Similarities & interactions graph [Gephi]
Gephi: modularity analysis, 9 communities detected
Communities:
- “Yes”, black
- “No”, magenta
- media, red
- celebrities, dark green
- “Romantic twitter”, orange
- ....
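Modularity analysis looks for a partition of the graph whose communities have more internal edges than chance would predict. As a rough illustration of the score being optimized (not of Gephi's own search procedure), the sketch below computes Newman's modularity Q for a given partition of an undirected, unweighted graph. The function name and input shapes are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition.
    edges: list of (u, v) pairs; community_of: dict mapping node -> community label."""
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    total_degree = defaultdict(int)  # sum of node degrees per community
    for node, d in degree.items():
        total_degree[community_of[node]] += d
    # Q = sum over communities of (internal edge share - expected share)
    return sum(internal[c] / m - (total_degree[c] / (2 * m)) ** 2
               for c in total_degree)
```

Two triangles joined by a single edge, split into one community per triangle, score Q = 5/14, while putting every node into one community scores 0.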
12. Idiolect: Style signatures
- Choose top N most frequent words [1]
- Build frequency vectors for all users
- Compare user signatures [e.g. cosine similarity]
- Identified a double-account user among 180K candidates (so much for anonymity)
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694980/
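The signature steps (top N words, per-user frequency vectors, cosine similarity) might be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `{user: [texts]}` input shape, and the use of relative rather than raw frequencies are choices made here, not details confirmed by the slides.

```python
import math
from collections import Counter

def top_words(user_texts, n):
    """Return the n most frequent whitespace tokens across all users."""
    counts = Counter()
    for texts in user_texts.values():
        for text in texts:
            counts.update(text.lower().split())
    return [w for w, _ in counts.most_common(n)]

def signature(texts, vocab):
    """Relative frequency of each vocab word in one user's texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_pairs(user_texts, n=100):
    """All user pairs sorted by signature similarity, most similar first."""
    vocab = top_words(user_texts, n)
    sigs = {u: signature(t, vocab) for u, t in user_texts.items()}
    users = sorted(sigs)
    pairs = [(cosine(sigs[a], sigs[b]), a, b)
             for i, a in enumerate(users) for b in users[i + 1:]]
    return sorted(pairs, reverse=True)
```

Pairs near the top of the list are candidates for accounts operated by the same person. The all-pairs loop is quadratic in the number of users, so as written it is only suitable for small samples.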
14. Sociolect: Clustering
- Apply clustering on signature vectors
- KMeans on signatures
- KMeans on word2vec vectors:
- Transform words to vectors, sum and average
- Also works very well for metaphor detection
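The "transform words to vectors, sum and average" step followed by KMeans could look roughly like the sketch below. It is an illustration only: `embeddings` stands in for a trained word2vec model as a plain dict of word to vector, and `kmeans` is a bare-bones Lloyd's iteration written out for self-containment, whereas a real pipeline would more likely use an existing library implementation.

```python
import random

def average_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens found in embeddings (zeros if none)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Minimal Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: nearest centroid for every point
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Each user is represented by the average vector of their tokens, and users whose labels coincide fall in the same cluster.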
15. Intertextuality: LDA + signatures
- Each user generates texts by sampling from a number of topics
- “Similar” users will tend to have similar topic distributions
- Given a subset of similar users, identify the most influential one, e.g. the user who enforces the writing style. [But that’s another presentation :)]
Challenges
- Noise
- “Random events”
- Opinion shifting: people change their opinions, and their writing styles change accordingly. Social media tends to amplify this behaviour [one more presentation :)]
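The claim that similar users have similar topic distributions needs a way to compare two distributions. One possibility, not named in the slides and shown here purely as an assumption, is the Jensen-Shannon divergence. The sketch takes per-user topic distributions as already inferred (for example by LDA) and given as lists of probabilities; `closest_users` is a hypothetical helper.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions, 1 for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def closest_users(target, topic_dists):
    """Other users sorted by divergence from the target user, most similar first."""
    return sorted((u for u in topic_dists if u != target),
                  key=lambda u: jensen_shannon(topic_dists[target], topic_dists[u]))
```

Because the mixture `m` is positive wherever `p` or `q` is, the divergence stays finite even when one user never writes about a topic.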
16. Next steps
- User-topic classification
- Gender classification
- Age
- Personality, stress, anxiety, etc.
- Try deep learning approaches