1. Innovating and Being Creative with R
1st Riyadh UseR Meetup
15th December 2016
Allure Hub,
King Fahd Road
: https://goo.gl/O4rnv3
: https://goo.gl/gx0iWX
8. Objectives
Promote Usage of R
• Statistical Data Analysis
Tool
• General Purpose
Programming Tool
Promote Computational
Thinking
Promote Creativity with R
Enable Riyadh useRs to become
‘Data Analytic Citizens’
9. Objectives
Promote Usage of R
• Statistical Data Analysis
Tool
• General Purpose
Programming Tool
Promote Computational
Thinking
Promote Creativity with R
Enable Riyadh useRs to become
‘Data Analytic Residents’
11. Content Coverage
• Commercial Settings
– Use cases for Commercial work
• Personal Settings
– Use cases for possibly non-Commercial/Private
work
12. Structure of UseR Meetup Team
1. Ali Kazmi (Organiser)
2. _________
3. _________
Not a one-man show, please.
13. Today’s Presentations
• Personal Setting
– Data Journalism with R and Stylometry: Identifying
number of writers for a Prime Minister's speeches
• Commercial Setting
– Data de-duplication: Analysing misspelled names
to identify which refer to the same person
15. A series of events prompt the Pakistani
Prime Minister to address the nation…
16. A speech is delivered...
And, thereafter, an Audio clip is leaked,
showing the PM taking advice on
writing style
17. Journalists wondered if the PM takes
advice on writing style for important
speeches only…
…Are some other speeches also a
product of such brainstorming
sessions?
18. Media wondered if the PM takes
advice on writing style for important
speeches only…
…Are some other speeches also a
product of such brainstorming
sessions?
How can we answer this?
21. Stylometry is Linguistics + Statistics
applied to detect stylistic changes
in text
Assumption of Stylometry: Each writer has a distinct
style of writing that is unconsciously learnt and used.
22. What characterises a person’s writing style?
Various aspects of text can capture stylistic variation:
• Punctuation Markers: ; , . !
• Length of a sentence: “Actually I don’t think that it is good because of the fact that this is not the…”
• Vocabulary Richness: “It behoves me to accomplish this work.”
• Parts of Speech: Verb, Noun, Adjective, Adverb, Conjunction, etc.
• Function Words: that, but, therefore, and, etc.
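A minimal sketch of how some of these aspects could be turned into numbers, written in Python purely for illustration. The function name `style_features` and the tiny `FUNCTION_WORDS` list are hypothetical, and parts of speech are left out because they need a tagger. It assumes a non-empty text.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list; a real study would use a fuller one.
FUNCTION_WORDS = {"that", "but", "therefore", "and", "the", "of", "to", "it", "is"}

def style_features(text):
    """Return a few simple stylometric features for one non-empty text."""
    # Sentences: split on runs of terminal punctuation, drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words: lowercase alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[a-z']+", text.lower())
    # Punctuation markers: how often each of ; , . ! occurs.
    punctuation = Counter(ch for ch in text if ch in ";,.!")
    function_hits = sum(1 for w in words if w in FUNCTION_WORDS)
    return {
        "punctuation": dict(punctuation),
        "mean_sentence_length": len(words) / len(sentences),   # words per sentence
        "type_token_ratio": len(set(words)) / len(words),      # vocabulary richness
        "function_word_share": function_hits / len(words),
    }

sample = "It behoves me to accomplish this work. But that is hard, and it is late!"
features = style_features(sample)
print(features)
```

Comparing such feature vectors across speeches is one way to look for shifts in style; which features and which comparison the actual study used is not stated on the slides.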
31. (1) Client approaches us to analyse transactional data with reference to contact names
32. (1) Client approaches us to analyse transactional data with reference to contact names
(2) Typos, variation in names…
Hamza Sheikh vs. Humza Shaikh vs. Hamza Sheik vs. Hazma Shiekh
33. (1) Client approaches us to analyse transactional data with ref. to contacts
(2) Typos, variation in names…
- Hundreds of thousands of records
- 5 days
What to do?
34. Problem and Solution Elicitation
• Pattern of ‘errors’
– Typing Mistakes
– Minor Displacement of letters
• Solution
– Pattern Matching ~ Risky, Time-consuming
– String Matching Algorithms
35. String Matching Algorithms
• stringdist package in R
• Edit-based distance measures
– Includes:
• Deletion
• Addition
• Substitution
• Transposition
– Generally:
• Edit one string until it matches the other
• Count the number of edits
• Fewer edits = smaller distance = more similar names!
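The "count the edits" idea can be sketched as follows. This is an illustrative Python implementation of the plain Levenshtein distance (insertion, deletion, substitution only), not the stringdist package's own code; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]

# Name variants taken from the slides, compared against one spelling.
names = ["Humza Shaikh", "Hamza Sheik", "Hazma Shiekh"]
distances = {name: levenshtein("Hamza Sheikh", name) for name in names}
print(distances)  # {'Humza Shaikh': 2, 'Hamza Sheik': 1, 'Hazma Shiekh': 4}
```

The last variant costs 4 because each swapped pair of letters counts as two substitutions when transposition is not an allowed edit.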
36. Examples of Edit-Based Measures
• How many Insertions to obtain a particular text? Duba ➜ Dubai
• How many Substitutions to obtain a particular text? Tony ➜ Rony
• How many Deletions to obtain a particular text? Swisss ➜ Swiss
• How many Transpositions to obtain a particular text? Toyn ➜ Tony
The more edits a text needs, the greater the dissimilarity of the two text strings.
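The four examples above can be checked with a small sketch. It uses the optimal string alignment variant of the Damerau-Levenshtein distance, which counts a swap of two adjacent characters as a single edit. It is written in Python for illustration only, and `osa_distance` is a name made up for this example.

```python
def osa_distance(a, b):
    """Edit distance counting insertion, deletion, substitution and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Adjacent transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

examples = [("Duba", "Dubai"), ("Tony", "Rony"), ("Swisss", "Swiss"), ("Toyn", "Tony")]
results = [osa_distance(source, target) for source, target in examples]
print(results)  # [1, 1, 1, 1]
```

Each pair needs exactly one edit, so all four strings sit at distance 1 from their targets.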
37. String Similarity Metrics
Similarity Metric           Substitution  Deletion  Insertion  Transposition
Longest Common Substring
Levenshtein
Damerau-Levenshtein
Jaro-Winkler
Soundex                     NA            NA        NA         NA
Jaro-Winkler is a heuristic measure for typos. It is designed to penalise changes between characters at
remote positions, as these are probably not typos; typos usually arise from
transpositions at nearby positions in a string.
Talha vs. Tahla
Talha vs. Lahaat
Soundex checks phonetic similarity for English words.
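A rough sketch of both heuristics, in Python and for illustration only. It follows the textbook definitions of Jaro-Winkler (prefix weight 0.1, prefix capped at 4 characters) and American Soundex; the function names are chosen here, and lowercasing the names before comparison is an assumption of this example.

```python
def jaro(a, b):
    """Jaro similarity in [0, 1]; 1 means identical."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)   # how far apart a match may sit
    matched_a, matched_b = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq_a = [a[i] for i in range(len(a)) if matched_a[i]]
    seq_b = [b[j] for j in range(len(b)) if matched_b[j]]
    transpositions = sum(x != y for x, y in zip(seq_a, seq_b)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

SOUNDEX_CODES = {letter: str(digit)
                 for digit, letters in enumerate(
                     ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
                 for letter in letters}

def soundex(name):
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":               # h and w do not separate equal codes; vowels do
            previous = digit
    return (code + "000")[:4]

close = jaro_winkler("talha", "tahla")    # nearby transposition: high similarity
far = jaro_winkler("talha", "lahaat")     # characters shifted further: lower similarity
print(round(close, 3), round(far, 3))
print(soundex("Hamza"), soundex("Humza"), soundex("Hazma"))
```

In this sketch "Hamza" and "Humza" share a Soundex code while "Hazma" does not, which is one reason to combine a phonetic check with edit-based measures rather than rely on it alone.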
38. Application & Results
• Similarity measures applied to relevant columns
• Using each similarity measure, records with the highest similarity identified as duplicates and merged
• 4,243 unique donors found!
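One way the merge step could look, as a sketch under stated assumptions rather than the procedure actually used for the client data. It is written in Python, uses plain Levenshtein distance as the only measure, treats a distance of at most 2 as "highly similar" (a hypothetical threshold), and merges transitively with a union-find structure.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]

def group_duplicates(names, max_distance=2):
    """Group names whose pairwise edit distance is at most max_distance (hypothetical cut-off)."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if levenshtein(names[i].lower(), names[j].lower()) <= max_distance:
                parent[find(j)] = find(i)        # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())

records = ["Hamza Sheikh", "Humza Shaikh", "Hamza Sheik", "Talha", "Tahla"]
groups = group_duplicates(records)
print(groups)
print(len(groups), "unique names")
```

Transitive merging means two spellings can end up in one group through a shared neighbour even if they are further apart than the threshold themselves, so the cut-off needs checking against real data.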
39. Consideration
• Can be quite expensive!
– Insufficient memory (with R)
– Computationally time-consuming
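A back-of-the-envelope sketch of why all-pairs comparison gets expensive, in Python for illustration. The record count of 100,000 is a hypothetical stand-in for "hundreds of thousands", and the memory figure assumes a full n × n matrix of 8-byte doubles.

```python
def pair_count(n):
    """Number of distinct record pairs that an all-against-all comparison must score."""
    return n * (n - 1) // 2

def full_matrix_gib(n, bytes_per_value=8):
    """Memory for a dense n x n distance matrix, in GiB (8 bytes per double assumed)."""
    return n * n * bytes_per_value / 2**30

n = 100_000                      # hypothetical record count
pairs = pair_count(n)
gib = full_matrix_gib(n)
print(f"{pairs:,} pairs, about {gib:.1f} GiB for a dense matrix")
# 4,999,950,000 pairs, about 74.5 GiB for a dense matrix
```

Both numbers grow with the square of the record count, so doubling the data roughly quadruples the work and the memory.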
41. Links to Presented Work
• Stylometry for Data Journalism
– Actual Study
– Short Presentation
• Names’ De-duplication
– Confidential
42. Networking & Conclusion
Should you like to Network now: Go ahead!
Otherwise: Thanks for joining this session!
Editor's Notes
We R a community, and this is a community-run project, for the community, by the community
There are different ways of measuring similarity of character data (heuristic approaches, q-grams, edit-based measures).
We chose edit-based measures + a heuristic approach (emphasise that this is borrowed from intuition)
Explain the slide…
Explain the slide.
Explain Jaro-Winkler: a heuristic formulated specifically for typos and incorrect data entry. It measures similarity by accounting for character mismatches, and it builds in the finding that fewer typos typically occur at the beginning of words than at the end.
Soundex: checks the phonetic structure of (English) words. A similar phonetic structure increases the possibility that the records are duplicates.
Each similarity measure was applied to the data.
Separating the wheat from the chaff: only high-similarity records were identified as duplicates