Forensic Linguistics with
Apache Spark
Kostas Perifanos
@k_perifanos
Idiolect, sociolect, intertextuality
What?
- Idiolect: an individual’s distinctive and unique use of language
- Sociolect: a variety of language associated with a social group (socioeconomic, ethnic, age)
- Intertextuality: the shaping of a text’s meaning by another text
Forensic Linguistics
“Forensic linguistics, legal linguistics, or language and the law, is the application of linguistic knowledge, methods and insights to the forensic context of law, language, crime investigation, trial, and judicial procedure. It is a branch of applied linguistics.” [Wikipedia]
- Authorship Attribution
- Authorship Identification
- Gender/age classification, etc.
Dataset
- 8M tweets posted between 18/06/2015 and 06/08/2015
- 92M words (whitespace-tokenized)
- 190K users
- Key events during this period:
  - Referendum announcement
  - Capital controls
  - Referendum voting
Toolset
- Apache Spark 1.6.1
- RDD
- DataFrames / Spark SQL
- Word2vec, KMeans
- Apache Zeppelin
- Gephi
Basic Data Exploration - Counting
Check for trends:
- Lowercase vs Uppercase ratios
- Relative frequencies of important (propaganda) words
- Average text length (per day)
- Average word length (per day)
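The counting checks above can be sketched in a few lines. This is a minimal, single-machine illustration in plain Python, not the Spark job used for the talk; the `(day, text)` input shape, the function name `daily_stats` and the `watch_words` parameter are assumptions made for the example.

```python
from collections import defaultdict


def daily_stats(tweets, watch_words=()):
    """Aggregate simple per-day counters over (day, text) pairs.

    `watch_words` is a hypothetical list of words whose relative
    frequency we want to track (e.g. propaganda terms).
    """
    watch = {w.lower() for w in watch_words}
    acc = defaultdict(lambda: defaultdict(int))
    for day, text in tweets:
        a = acc[day]
        words = text.split()  # whitespace tokenization
        a["tweets"] += 1
        a["chars"] += len(text)
        a["words"] += len(words)
        a["word_chars"] += sum(len(w) for w in words)
        a["upper"] += sum(c.isupper() for c in text)
        a["lower"] += sum(c.islower() for c in text)
        a["watched"] += sum(w.lower() in watch for w in words)

    stats = {}
    for day, a in acc.items():
        stats[day] = {
            "avg_text_len": a["chars"] / a["tweets"],
            "avg_word_len": a["word_chars"] / a["words"] if a["words"] else 0.0,
            "lower_upper_ratio": a["lower"] / a["upper"] if a["upper"] else float("inf"),
            "watched_rel_freq": a["watched"] / a["words"] if a["words"] else 0.0,
        }
    return stats


if __name__ == "__main__":
    sample = [("2015-06-27", "OXI now"), ("2015-06-27", "yes")]
    print(daily_stats(sample, watch_words=["oxi"]))
```

Plotting each of these values per day is then enough to spot shifts around the key events.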
Counting - lowercase / uppercase ratio
Counting - Propaganda
Similarities & user interactions
- Build a word2vec model, treating @mentions as vocabulary words
- Find the top-N “synonyms” of seed accounts, keeping only those that start with “@”
  - @handle1: @handle2, @handle3, ...
  - @handle32: @handle5, @handle3, ...
- Visualize the graph
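A rough sketch of the neighbour lookup behind this graph, assuming the word vectors are already trained and available as a plain dictionary. In Spark the equivalent lookup would go through the trained model (for instance `Word2VecModel.findSynonyms`); here the vectors, handles and function names are toy values invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mention_edges(vectors, seeds, top_n=2):
    """For each seed handle, keep its top_n nearest tokens that start with '@'."""
    handles = [t for t in vectors if t.startswith("@")]
    edges = []
    for seed in seeds:
        if seed not in vectors:
            continue
        scored = [(cosine(vectors[seed], vectors[h]), h) for h in handles if h != seed]
        scored.sort(reverse=True)
        edges.extend((seed, h, sim) for sim, h in scored[:top_n])
    return edges


def to_edge_csv(edges):
    """Serialize edges as a Source/Target/Weight CSV edge list for a graph tool."""
    lines = ["Source,Target,Weight"]
    lines += [f"{s},{t},{w:.4f}" for s, t, w in edges]
    return "\n".join(lines)


# Toy 2-d embeddings (hypothetical): two "camps" of handles plus a plain word.
toy_vectors = {
    "@yes_a": [1.0, 0.1],
    "@yes_b": [0.9, 0.2],
    "@no_a": [0.1, 1.0],
    "@no_b": [0.2, 0.9],
    "referendum": [0.7, 0.7],
}

if __name__ == "__main__":
    print(to_edge_csv(mention_edges(toy_vectors, ["@yes_a", "@no_a"], top_n=1)))
```

Because plain words are filtered out before ranking, every edge connects two accounts, which is what makes the result usable as an interaction graph.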
Similarities & interactions graph [Gephi]
Gephi: modularity analysis, 9 communities detected
Communities:
- “Yes”, black
- “No”, magenta
- media, red
- celebrities, dark green
- “Romantic twitter”, orange
- ....
2. Idiolect: Style signatures
- Choose the top N most frequent words [1]
- Build frequency vectors for all users
- Compare user signatures (e.g. cosine similarity)
- Identified a double-account user among 180K candidates (so much for anonymity)
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3694980/
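The signature steps can be illustrated as follows. This is a small in-memory sketch under stated assumptions: the user names, texts and helper names are made up, relative frequencies are used as vector entries, and the all-pairs comparison is quadratic, so it would need blocking or a distributed join to reach the scale mentioned on the slide.

```python
import math
from collections import Counter


def top_words(user_texts, n):
    """The n most frequent whitespace tokens over all users."""
    counts = Counter()
    for texts in user_texts.values():
        for t in texts:
            counts.update(t.lower().split())
    return [w for w, _ in counts.most_common(n)]


def signature(texts, vocab):
    """Relative frequency of each vocabulary word in one user's texts."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def most_similar_pairs(user_texts, n_words=100, top_k=5):
    """Rank user pairs by cosine similarity of their signatures."""
    vocab = top_words(user_texts, n_words)
    sigs = {u: signature(t, vocab) for u, t in user_texts.items()}
    users = sorted(sigs)
    pairs = [
        (cosine(sigs[a], sigs[b]), a, b)
        for i, a in enumerate(users)
        for b in users[i + 1:]
    ]
    pairs.sort(reverse=True)
    return pairs[:top_k]


# Hypothetical users: "alice_alt" mimics the function-word habits of "alice".
toy_users = {
    "alice": ["well i think the vote is on", "well i think so"],
    "alice_alt": ["well i think the banks are closed"],
    "bob": ["the vote the banks the news"],
}

if __name__ == "__main__":
    for sim, a, b in most_similar_pairs(toy_users, n_words=5, top_k=3):
        print(f"{a} ~ {b}: {sim:.3f}")
```

A pair of accounts that ranks far above the rest is a candidate for shared authorship and still needs manual inspection.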
Sociolect: Clustering
- Apply clustering on signature vectors
- KMeans on signatures
- KMeans on word2vec vectors:
  - Transform words to vectors, sum and average
- Also works very well for metaphor detection
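As an illustration of the "transform words to vectors, sum and average" step followed by KMeans, here is a minimal plain-Python sketch. The toy vectors and user texts are invented, and the centroid initialisation (first `k` points) is a simplification chosen to keep the example deterministic; a library implementation such as Spark's KMeans would normally pick initial centres differently.

```python
def average_vector(words, vectors):
    """Sum and average the vectors of the known words; None if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(col) / len(known) for col in zip(*known)]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations with a naive 'first k points' initialisation."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-d word vectors and per-user texts.
toy_word_vectors = {
    "bank": [1.0, 0.0],
    "euro": [0.9, 0.1],
    "love": [0.0, 1.0],
    "heart": [0.1, 0.9],
}
toy_user_words = {
    "u1": "bank euro".split(),
    "u2": "love heart".split(),
    "u3": "euro bank bank".split(),
    "u4": "heart".split(),
}

if __name__ == "__main__":
    names = sorted(toy_user_words)
    points = [average_vector(toy_user_words[u], toy_word_vectors) for u in names]
    labels, _ = kmeans(points, k=2)
    print(dict(zip(names, labels)))
```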
Intertextuality: LDA + signatures
- Each user generates texts by sampling from a number of topics
- “Similar” users will tend to have similar topic distributions
- Given a subset of similar users, identify the most influential one, e.g. the user who enforces the writing style. [But that’s another presentation :)]
Challenges
- Noise
- “Random events”
- Opinion shifting: people change their opinions and their writing styles accordingly. Social media tends to amplify this behaviour [one more presentation :)]
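To make the "similar users have similar topic distributions" idea concrete, the sketch below compares per-user topic distributions with the Jensen-Shannon divergence. It assumes the distributions were already inferred by an LDA model; the divergence measure, the user names and the numbers are choices made for this example and are not taken from the talk.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def closest_users(topic_dists, user, top_k=3):
    """Users whose topic distribution is closest to that of `user`."""
    target = topic_dists[user]
    scored = [
        (js_divergence(target, dist), other)
        for other, dist in topic_dists.items()
        if other != user
    ]
    return sorted(scored)[:top_k]


# Hypothetical per-user distributions over three topics.
toy_topic_dists = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.7, 0.2, 0.1],
    "c": [0.1, 0.1, 0.8],
}

if __name__ == "__main__":
    print(closest_users(toy_topic_dists, "a"))
```

Users that end up close under this measure form the candidate groups within which a most influential writer could then be sought.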
Next steps
- User-topic classification
- Gender classification
- Age
- Personality, stress, anxiety, etc.
- Try deep learning approaches
Thank you!
Questions?
@k_perifanos - http://github.com/kperi
