Innovating and Being Creative with R
1st Riyadh UseR Meetup
15th December 2016
Allure Hub,
King Fahd Road
: https://goo.gl/O4rnv3
: https://goo.gl/gx0iWX
Ali Kazmi
http://goo.gl/IcwGiB
https://goo.gl/eBGEKd
@scac1041
https://goo.gl/tiWAMm
• Meeting
• UseR Group
• Meetup Group
• 1st things 1st…
Objectives
Promote Usage of R
• Statistical Data Analysis Tool
• General Purpose Programming Tool
Promote Computational Thinking
Promote Creativity with R
Enable Riyadh useRs to become ‘Data Analytic Citizens’
Objectives
Promote Usage of R
• Statistical Data Analysis Tool
• General Purpose Programming Tool
Promote Computational Thinking
Promote Creativity with R
Enable Riyadh useRs to become ‘Data Analytic Residents’
Content Coverage
• Commercial Settings
– Use cases for Commercial work
• Personal Settings
– Use cases for possibly non-commercial/private work
Structure of UseR Meetup Team
1. Ali Kazmi (Organiser)
2. _________
3. _________
Not a one-man show, please.
Today’s Presentations
• Personal Setting
– Data Journalism with R and Stylometry: Identifying the number of writers for a Prime Minister's speeches
• Commercial Setting
– Data de-duplication: Analysing misspelled names to identify which refer to the same person
Using Stylometry to Identify Authorship of Texts
A series of events prompts the Pakistani Prime Minister to address the nation…
A speech is delivered...
And, thereafter, an audio clip is leaked, showing the PM taking advice on writing style
Journalists wondered if the PM takes advice on writing style for important speeches only…
…Are some other speeches also a product of such brainstorming sessions?
How can we answer this?
Stylometry is Linguistics + Statistics applied to detect stylistic changes in text
Assumption of Stylometry: Each writer has a distinct style of writing that is unconsciously learnt and used.
What characterises a person’s writing style?
Various aspects of text can capture stylistic variation:
• Punctuation Markers: ; , . !
• Length of a sentence: "Actually I don’t think that it is good because of the fact that this is not the…"
• Vocabulary Richness: "It behoves me to accomplish this work."
• Parts of Speech: Verb, Noun, Adjective, Adverb, Conjunction, etc.
• Function Words: that, but, therefore, and, etc.
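The stylistic markers listed on this slide can each be turned into a number. The sketch below is a minimal illustration in Python, not the code used in the study (the talk itself worked in R). The function name `style_features`, the short function-word list and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list; real studies use much longer lists.
FUNCTION_WORDS = ("that", "but", "therefore", "and", "the", "of", "to", "is")
PUNCTUATION = ";,.!"

def style_features(text):
    """Return simple stylistic measurements for one non-empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    counts = Counter(words)
    n_words = len(words)
    features = {
        # Length of a sentence: average number of words per sentence.
        "mean_sentence_length": n_words / len(sentences),
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(counts) / n_words,
    }
    # Punctuation markers, per word of text.
    for mark in PUNCTUATION:
        features["punct_" + mark] = text.count(mark) / n_words
    # Function words, as relative frequencies.
    for word in FUNCTION_WORDS:
        features["fw_" + word] = counts[word] / n_words
    return features

sample = "It behoves me to accomplish this work. That is good, but it is hard!"
feats = style_features(sample)
print(feats["mean_sentence_length"], round(feats["type_token_ratio"], 3))
```

Parts of speech are left out because tagging them needs a trained tagger, which is beyond a standard-library sketch.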
Applications
• J. K. Rowling & Galbraith
• Writing Style in Novels
Roadmap
• Extract
• Quantify
• Analyse
• Visualise
Multi-Dimensional Scaling, PCA,
Bootstrap Consensus Trees
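The Quantify and Analyse steps of the roadmap can be sketched as follows: each text becomes a vector of function-word frequencies, and the pairwise distances between those vectors form the matrix that a method such as Multi-Dimensional Scaling would take as input. This is a Python illustration under stated assumptions, not the study's pipeline. The three toy texts, the word list and the choice of Euclidean distance are invented for the example, and `math.dist` needs Python 3.8 or later.

```python
import math
import re
from collections import Counter

FUNCTION_WORDS = ("that", "but", "therefore", "and", "the", "of")  # hypothetical list

def quantify(text):
    """Quantify: relative frequency of each function word in a non-empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [counts[w] / len(words) for w in FUNCTION_WORDS]

def distance_matrix(texts):
    """Analyse: Euclidean distance between every pair of frequency vectors."""
    vectors = [quantify(t) for t in texts]
    return [[math.dist(a, b) for b in vectors] for a in vectors]

# Invented toy texts standing in for extracted speeches.
texts = [
    "the people and the nation and the future",
    "the nation and the people and the hope",
    "but that is therefore that which but that",
]
d = distance_matrix(texts)
for row in d:
    print([round(x, 3) for x in row])
```

The first two texts use function words in the same proportions, so their distance is zero; the third sits far from both, which is the kind of separation a plot would then make visible.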
Traditional Journalism vs. Data Journalism
• Traditional Journalism
• Data Journalism
Considerations in Stylometry
• Size of dataset/corpus
• Open World Problem
• Relatively new field
Questions?
Data De-duplication: Analysing misspelled names to identify which refer to the same person
1. Client approaches us to analyse transactional data with reference to contact names
2. Typos, variation in names…
Hamza Sheikh vs. Humza Shaikh vs. Hamza Sheik vs. Hazma Shiekh
- Hundreds of thousands of records
- 5 days
What to do?
Problem and Solution Elicitation
• Pattern of ‘errors’
– Typing Mistakes
– Minor Displacement of letters
• Solution
– Pattern Matching ~ Risky, Time-consuming
– String Matching Algorithms
String Matching Algorithms
• stringdist package in R
• Edit-based distance measures
– Edit operations include:
• Deletion
• Insertion
• Substitution
• Transposition
– Generally:
• Edit one string until it matches the other
• Count the number of edits
• Fewer edits = smaller distance = more similar names!
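The "count the edits" idea can be sketched with the Levenshtein distance, which counts insertions, deletions and substitutions. The talk used the stringdist package in R; the Python function below is an independent, minimal illustration of the same measure, applied to the name variants from the earlier slide.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]

reference = "Hamza Sheikh"
for variant in ("Humza Shaikh", "Hamza Sheik", "Hazma Shiekh"):
    print(variant, levenshtein(reference, variant))
```

"Hazma Shiekh" scores 4 here because plain Levenshtein has no transposition operation, so each swapped pair of letters costs two edits.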
Examples of Edit-Based Measures
• How many insertions to obtain a particular text? Duba ➜ Dubai
• How many substitutions to obtain a particular text? Tony ➜ Rony
• How many deletions to obtain a particular text? Swisss ➜ Swiss
• How many transpositions to obtain a particular text? Toyn ➜ Tony
The greater the number of edits, the greater the dissimilarity between two text strings.
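Each of the four examples above is exactly one edit away from its target once transposition is allowed as an operation. The sketch below implements the optimal string alignment variant of the Damerau-Levenshtein distance in Python. It is an illustrative stand-in for what stringdist provides in R, not the code from the talk.

```python
def osa_distance(a, b):
    """Edit distance counting insertion, deletion, substitution and
    transposition of two adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

examples = [("Duba", "Dubai"), ("Tony", "Rony"), ("Swisss", "Swiss"), ("Toyn", "Tony")]
results = [osa_distance(src, dst) for src, dst in examples]
print(results)  # [1, 1, 1, 1]
```

Without the transposition step, Toyn ➜ Tony would cost two substitutions instead of one.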
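String Similarity Metrics
[Table: edit operations supported by each similarity metric. Columns: Substitution, Deletion, Insertion, Transposition. Rows: Longest Common Substring, Levenshtein, Damerau-Levenshtein, Jaro-Winkler, Soundex (NA in all four columns). The tick and cross marks for the other rows did not survive extraction.]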
Jaro-Winkler is a heuristic measure designed for typos. It applies a penalty when the differing characters sit at remote positions, since such changes are probably not typos; typos tend to arise from transpositions at nearby positions in a string.
Talha vs. Tahla (letters swapped at adjacent positions)
Talha vs. Lahaat (letters displaced further apart)
Soundex checks phonetic similarity for English words.
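Both measures on this slide can be sketched in a few lines. The Python below is a minimal illustration, not the stringdist implementation used in the talk: `jaro_winkler` follows the usual textbook definition with a prefix scale of 0.1 over at most four leading characters, and `soundex` follows the common American Soundex rules. Lower-casing the inputs is a choice made for the example.

```python
def jaro(s1, s2):
    """Jaro similarity between 0 (nothing in common) and 1 (identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)  # how far apart a match may sit
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(x != y for x, y in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)

CODES = {letter: str(digit)
         for digit, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1)
         for letter in group}

def soundex(name):
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != last:
            result += code
        last = code  # a vowel resets this, so a repeat after a vowel is coded again
    return (result + "000")[:4]

near = jaro_winkler("talha", "tahla")   # about 0.947
far = jaro_winkler("talha", "lahaat")   # about 0.739
print(round(near, 3), round(far, 3), soundex("Sheikh"), soundex("Shaikh"))
```

The adjacent swap in Tahla keeps the score close to 1, while Lahaat scores clearly lower. Soundex maps Sheikh, Shaikh, Sheik and Shiekh to the same code (S200), but it separates Hamza (H520) from Hazma (H250), so it does not absorb transposed consonants.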
Application & Results
• Similarity measures applied to relevant columns
• Using each similarity measure, records with the highest similarity identified as duplicates and merged
• 4,243 unique donors found!
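The identify-and-merge step can be sketched as: compare every pair of names, link pairs whose distance falls under a threshold, and merge linked names into one group. The talk does not describe its exact merging rule or thresholds, so the Python below is an assumption-laden illustration. The threshold of 2, the transitive merging via union-find, and the extra name "Sara Ahmed" are all invented for the example, and the 4,243 figure from the slide is not reproduced here.

```python
from itertools import combinations

def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def group_duplicates(names, max_distance=2):
    """Group names connected by pairs within max_distance (hypothetical threshold)."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group representative, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if osa_distance(names[i].lower(), names[j].lower()) <= max_distance:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

names = ["Hamza Sheikh", "Humza Shaikh", "Hamza Sheik", "Hazma Shiekh", "Sara Ahmed"]
groups = group_duplicates(names)
print(groups)
```

Transitive merging is convenient but can chain unrelated names together through intermediate spellings, so the threshold needs checking against real data.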
Consideration
• Can be quite expensive!
– Insufficient memory (with R)
– Computationally time-consuming
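A quick calculation shows why all-pairs comparison gets expensive. The sketch below assumes 200,000 records, a hypothetical figure standing in for the "hundreds of thousands" mentioned earlier, and assumes a full distance matrix stored as 8-byte floating-point numbers.

```python
def pairwise_cost(n_records, bytes_per_cell=8):
    """Number of distinct record pairs and size in bytes of a full n x n matrix."""
    pairs = n_records * (n_records - 1) // 2
    full_matrix_bytes = n_records * n_records * bytes_per_cell
    return pairs, full_matrix_bytes

pairs, matrix_bytes = pairwise_cost(200_000)  # 200,000 is an assumed record count
print(f"{pairs:,} comparisons")               # 19,999,900,000 comparisons
print(f"{matrix_bytes / 1e9:,.0f} GB")        # 320 GB
```

Both numbers grow with the square of the record count, so doubling the data roughly quadruples the time and memory needed.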
Questions?
Links to Presented Work
• Stylometry for Data Journalism
– Actual Study
– Short Presentation
• Names’ De-duplication
– Confidential
Networking & Conclusion
Should you like to network now: go ahead!
Otherwise: thanks for joining this session!

Riyadh UseR Group - 1st Meeting (Dec 2016)


Editor's Notes

  1. We R a community, and this is a community-run project, for the community, by the community.
  2. There are different ways of measuring the similarity of character data (heuristic approaches, q-grams, edit-based measures). We chose edit-based measures plus a heuristic approach (emphasise that this is drawn from intuition). Explain the slide…
  3. Explain the slide. Explain Jaro-Winkler: a heuristic specifically formulated for typos and incorrect data entry; it measures similarity by accounting for character mismatches, taking into account the finding that fewer typos typically occur at the beginning of words than at the end. Soundex: checks the phonetic structure of (English) words; a similar phonetic structure increases the possibility of the records being duplicates.
  4. Each similarity measure was applied to the data. Separating the wheat from the chaff: only high-similarity records were identified as duplicates.