SlideShare a Scribd company logo
1 of 23
Download to read offline
Problem and the Dataset Method Results References
Detecting Gender by Full Name:
Experiments with the Russian Language
Alexander Panchenko
a.panchenko@digsolab.com
Digital Society Laboratory LLC, Russia
April 11, 2014
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Problem
Detect gender of a user
to profile a user;
user segmentation is helpful in search, advertisement, etc.
By text written by a user:
Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009],
Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al.
[2010], Rangel and Rosso Rangel and Rosso [2013], Al Zamal
et al.Al Zamal et al. [2012] and Lui et al. Liu et al. [2012].
Russian language: Daniel [2012]
By full name: Burger et al. [2011]
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Online demo
http://research.digsolab.com/gender
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Dataset
100,000 full names of Facebook users
full name – first and last name of a user
a full name has a gender label: male or female
names written in both Cyrillic and Latin alphabets
“Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Dataset
Figure : Name-surname co-occurrences: rows and columns are sorted byAlexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Dataset
72% of first names have typical male/female ending
68% of surnames have typical male/female ending
a typical male/female ending splits males from females with an
error less than 5%
gender of ≥ 50% first names recognized with 8 endings
gender of ≥ 50% second names recognized with 5 endings
Conclusion
Simple symbolic ending-based method cannot robustly classify
about 30% of names. This motivates the need for a more
sophisticated statistical approach.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Dataset: character endings of Russian names
Type Ending Gender Error, % Example
first name na (на) female 0.27 Ekaterina
first name iya (ия) female 0.32 Anastasiya
first name ei (ей) male 0.16 Sergei
first name dr (др) male 0.00 Alexandr
first name ga (га) male 4.94 Serega
first name an (ан) male 4.99 Ivan
first name la (ла) female 4.23 Luidmila
first name ii (ий) male 0.34 Yurii
second name va (ва) female 0.28 Morozova
second name ov (ов) male 0.21 Objedkov
second name na (на) female 2.22 Matyushina
second name ev (ев) male 0.44 Sergeev
second name in (ин) male 1.94 Teterin
Table : Most discriminative and frequent two character endings of
Russian names.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Gender Detection Method
input: a string representing a name of a person
output: gender (male or female)
binary classification task
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Features
endings
character n-grams
dictionary of male/female names and surnames
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Word endings
males: Alexander Yaroskavski, Oleg Arbuzov
females: Alexandra Yaroskavskaya, Nayaliya Arbuzova
BUT: “Sidorenko”, “Moroz” or “Bondar”!
two most common one-character endings: “a” and “ya” (“я”)
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Character n-grams
unigrams, bigrams or trigrams
k most frequent n-grams
Most frequent trigrams
a__, va_, v__, na , ova, __A, ov_, ina, kov, nov, _Al, n__,
a__, o__, __V, ndr, Ale, iya , ei , lek, eks, ko_, nko, rin, _An,
enk, __C, na_, “ii ”, __E, eva, __N, __M, san, ksa , __I ,
ev_ , in_, and, len, __O, va_, rov, _Na
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Dictionaries of first and last names
first/last name and a probability that it belongs to the male
gender:
P(c = male|w) =
nw
male
c∈{male,female} nw
c
,
nw
male – number of male profiles with the first/last name w in
the dictionary
probability that first name is of male gender
P(c = male|firstname)
probability that the last name is of male gender
P(c = male|lastname)
3,427 firs names, 11,411 last names
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Model
L2-regularized Logistic Regression
yields reasonable results for the NLP-related problems
model w, feature vector x
classification:
P(y = 1|x) =
1
1 + e−wT x
.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Results
Model Accuracy Precision Recall F-measure
rule-based baseline 0,638 0,995 0,633 0,774
endings 0,850 ± 0,002 0,921 ± 0,003 0,784 ± 0,004 0,847 ± 0,002
3-grams 0,944 ± 0,003 0,948 ± 0,003 0,946 ± 0,003 0,947 ± 0,003
dicts 0,956 ± 0,002 0,992 ± 0,001 0,925 ± 0,003 0,957 ± 0,002
endings+3-grams 0,946 ± 0,003 0,950 ± 0,002 0,947 ± 0,004 0,949 ± 0,003
3-grams+dicts 0,956 ± 0,003 0,960 ± 0,003 0,957 ± 0,004 0,959 ± 0,003
endings+3-grams+dicts 0,957 ± 0,003 0,961 ± 0,003 0,959 ± 0,004 0,960 ± 0,002
Table : Results of the experiments on the training set of 10,000 names.
Here endings – 4 Russian female endings, trigrams – 1000 most frequent
3-grams, dictionary – name/surname dictionary with γ = 80% top
entries. This table presents precision, recall and F-measure of the female
class.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Results
Figure : Learning curves of single and combined models. Accuracy was
estimated on separate sample of 10,000 names.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Results
Figure : Accuracy of the model 3-grams function of the number of
features used k.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Results
Figure : Accuracy of the dicts model function of the fraction of
dictionaries used γ.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Error Analysis
There are several types of errors:
Inconsistent annotation, such as “Anna Kryukova (male)” or
“Boris Krolchansky (female)”.
Name string is neither male nor female, but rather a name of a
group, e.g. “Wikom Tools”, “Kazakh University of Humanities”
or “Privat Bank”.
Name string represents a foreign name, e.g. “Abdulloh Ibn
Abdulloh”, “Brooke Alisson”, “Ulpetay Niyetbay” or “Yola
Dolson”. Our model was not trained to deal with such names.
Meaningless or partially anonymized names names, e.g.
“Crazzy Ma”, “Un Petit Diable”, “Vv Tt”, “Vio La Tor” or “Muu
Muu”. Additional information is required to derive gender of
such users.
People with rare names or surnames, e.g. “Guldjan Reyzova”,
“Yagun Zumpelich” or “Akob Saakan”. These are people with
common names and surnames in other countries, such asAlexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Error Analysis
Train Set Errors Test Set Errors
name true class name true class
1 Lea Shraiber female Ilya Nadorshin male
2 Profanum Vulgus female Rustem Saledinov male
3 Anna Kryukova male Erkin Bahlamet male
4 Gin Amaya male Gocha Lapachi male
5 Gertrud Gallet female Muttaqiyyah Abdulvahhab female
6 Dolores Laughter female Yola Dolson female
7 Di Nolik male Heiran Gasanova female
8 Jlija Hotieca female Hadji Murad male
9 Gic Globmedic female Jenya Chekulenko female
10 Ulpetay Niyetbay female Tury.Ru Domodedovskaya Metro Office male
11 Olga Shoff male Elmira Nabizade female
12 Phil Golosoun male Niko Liparteliani male
13 Tsitsino Shurgaya female Oleg Grin’ male
14 Anna Grobov female Santi Zarovneva female
15 Linguini Incident female Misha Badali male
16 Toma Oganesyan female Che Serega male
17 Swon Swetik female Petr Kiyashko male
18 Adel Simon female Sandugash Botabaeva female
19 Ant Kam- male Jenya Sergienko female
20 Xristi Xitrozver female Abdulloh Ibn Abdulloh female
21 Anii Reznookova female Naikaita Laitvainenko male
22 Aurelia Grishko male Fil Kalnitskiy male
23 Alex Bu female Helen Hovel’ female
24 Karen Karine female Valery Kotelnikov male
25 Russian Spain female Max Od male
26 Lucy Walter male Jean Kvartshelia male
27 Aysah Ahmed female Adjedo Trupachuli female
28 Kiti Iz female Ainur Serikova femaleAlexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Thank you! Questions?
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent
attribute inference: Inferring latent attributes of twitter users
from neighbors. In ICWSM.
Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011).
Discriminating gender on twitter. In Proceedings of the
Conference on Empirical Methods in Natural Language
Processing, pages 1301–1309. Association for Computational
Linguistics.
Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference
of twitter users in non-english contexts. In Proceedings of the
2013 Conference on Empirical Methods in Natural Language
Processing, Seattle, Wash, pages 18–21.
Daniel, M. A. Zelenkov, Y. (2012). Russian national corpus as a
playground for sociolinguistic research. episode iv. gender and
length of the utterance. In Proceedings of Dialog-2012, pages
51–62.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric
analysis of bloggers’ age and gender. In Third International AAAI
Conference on Weblogs and Social Media.
Koppel, M., Argamon, S., and Shimoni, A. R. (2002).
Automatically categorizing written texts by author gender.
Literary and Linguistic Computing, 17(4):401–412.
Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to
infer gender composition of commuter populations. Proceedings
of the When the City Meets the Citizen Worksop.
Mukherjee, A. and Liu, B. (2010). Improving gender classification
of blog authors. In Proceedings of the 2010 Conference on
Empirical Methods in Natural Language Processing, pages
207–217. Association for Computational Linguistics.
Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011).
Predicting age and gender in online social networks. In
Proceedings of the 3rd international workshop on Search and
mining user-generated contents, pages 37–44. ACM.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
Problem and the Dataset Method Results References
Rangel, F. and Rosso, P. (2013). Use of language and author
profiling: Identification of gender and age. Natural Language
Processing and Cognitive Science, page 177.
Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010).
Classifying latent user attributes in twitter. In Proceedings of the
2nd international workshop on Search and mining user-generated
contents, pages 37–44. ACM.
Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru

More Related Content

Viewers also liked

Similarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation ExtractionSimilarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation ExtractionAlexander Panchenko
 
Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...Alexander Panchenko
 
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Alexander Panchenko
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionAlexander Panchenko
 
Неологизмы в социальной сети Фейсбук
Неологизмы в социальной сети ФейсбукНеологизмы в социальной сети Фейсбук
Неологизмы в социальной сети ФейсбукAlexander Panchenko
 

Viewers also liked (7)

Similarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation ExtractionSimilarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation Extraction
 
Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...
 
Document
DocumentDocument
Document
 
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
 
Making Sense of Word Embeddings
Making Sense of Word EmbeddingsMaking Sense of Word Embeddings
Making Sense of Word Embeddings
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation Extraction
 
Неологизмы в социальной сети Фейсбук
Неологизмы в социальной сети ФейсбукНеологизмы в социальной сети Фейсбук
Неологизмы в социальной сети Фейсбук
 

More from Alexander Panchenko

Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...Alexander Panchenko
 
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common CrawlBuilding a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common CrawlAlexander Panchenko
 
Improving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic ClassesImproving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic ClassesAlexander Panchenko
 
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical ResourcesInducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical ResourcesAlexander Panchenko
 
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...Alexander Panchenko
 
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...Alexander Panchenko
 
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...Alexander Panchenko
 
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationUsing Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationAlexander Panchenko
 
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...Alexander Panchenko
 
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...Alexander Panchenko
 
Getting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part IIGetting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part IIAlexander Panchenko
 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...Alexander Panchenko
 

More from Alexander Panchenko (12)

Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...
 
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common CrawlBuilding a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
 
Improving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic ClassesImproving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic Classes
 
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical ResourcesInducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
 
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
 
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
 
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
 
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationUsing Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
 
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
 
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
 
Getting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part IIGetting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part II
 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
 

Recently uploaded

STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 

Recently uploaded (20)

STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 

Detecting Gender by Full Name: Experiments with the Russian Language

  • 1. Problem and the Dataset Method Results References Detecting Gender by Full Name: Experiments with the Russian Language Alexander Panchenko a.panchenko@digsolab.com Digital Society Laboratory LLC, Russia April 11, 2014 Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 2. Problem and the Dataset Method Results References Problem Detect gender of a user to profile a user; user segmentation is helpful in search, advertisement, etc. By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso Rangel and Rosso [2013], Al Zamal et al.Al Zamal et al. [2012] and Lui et al. Liu et al. [2012]. Russian language: Daniel [2012] By full name: Burger et al. [2011] Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 3. Problem and the Dataset Method Results References Online demo http://research.digsolab.com/gender Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 4. Problem and the Dataset Method Results References Dataset 100,000 full names of Facebook users full name – first and last name of a user a full name has a gender label: male or female names written in both Cyrillic and Latin alphabets “Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 5. Problem and the Dataset Method Results References Dataset Figure : Name-surname co-occurrences: rows and columns are sorted byAlexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 6. Problem and the Dataset Method Results References Dataset 72% of first names have typical male/female ending 68% of surnames have typical male/female ending a typical male/female ending splits males from females with an error less than 5% gender of ≥ 50% first names recognized with 8 endings gender of ≥ 50% second names recognized with 5 endings Conclusion Simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 7. Problem and the Dataset Method Results References Dataset: character endings of Russian names Type Ending Gender Error, % Example first name na (на) female 0.27 Ekaterina first name iya (ия) female 0.32 Anastasiya first name ei (ей) male 0.16 Sergei first name dr (др) male 0.00 Alexandr first name ga (га) male 4.94 Serega first name an (ан) male 4.99 Ivan first name la (ла) female 4.23 Luidmila first name ii (ий) male 0.34 Yurii second name va (ва) female 0.28 Morozova second name ov (ов) male 0.21 Objedkov second name na (на) female 2.22 Matyushina second name ev (ев) male 0.44 Sergeev second name in (ин) male 1.94 Teterin Table : Most discriminative and frequent two character endings of Russian names. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 8. Problem and the Dataset Method Results References Gender Detection Method input: a string representing a name of a person output: gender (male or female) binary classification task Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 9. Problem and the Dataset Method Results References Features endings character n-grams dictionary of male/female names and surnames Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 10. Problem and the Dataset Method Results References Word endings males: Alexander Yaroskavski, Oleg Arbuzov females: Alexandra Yaroskavskaya, Nayaliya Arbuzova BUT: “Sidorenko”, “Moroz” or “Bondar”! two most common one-character endings: “a” and “ya” (“я”) Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 11. Problem and the Dataset Method Results References Character n-grams unigrams, bigrams or trigrams k most frequent n-grams Most frequent trigrams a__, va_, v__, na , ova, __A, ov_, ina, kov, nov, _Al, n__, a__, o__, __V, ndr, Ale, iya , ei , lek, eks, ko_, nko, rin, _An, enk, __C, na_, “ii ”, __E, eva, __N, __M, san, ksa , __I , ev_ , in_, and, len, __O, va_, rov, _Na Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 12. Problem and the Dataset Method Results References Dictionaries of first and last names first/last name and a probability that it belongs to the male gender: P(c = male|w) = nw male c∈{male,female} nw c , nw male – number of male profiles with the first/last name w in the dictionary probability that first name is of male gender P(c = male|firstname) probability that the last name is of male gender P(c = male|lastname) 3,427 firs names, 11,411 last names Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 13. Problem and the Dataset Method Results References Model L2-regularized Logistic Regression yields reasonable results for the NLP-related problems model w, feature vector x classification: P(y = 1|x) = 1 1 + e−wT x . Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 14. Problem and the Dataset Method Results References Results Model Accuracy Precision Recall F-measure rule-based baseline 0,638 0,995 0,633 0,774 endings 0,850 ± 0,002 0,921 ± 0,003 0,784 ± 0,004 0,847 ± 0,002 3-grams 0,944 ± 0,003 0,948 ± 0,003 0,946 ± 0,003 0,947 ± 0,003 dicts 0,956 ± 0,002 0,992 ± 0,001 0,925 ± 0,003 0,957 ± 0,002 endings+3-grams 0,946 ± 0,003 0,950 ± 0,002 0,947 ± 0,004 0,949 ± 0,003 3-grams+dicts 0,956 ± 0,003 0,960 ± 0,003 0,957 ± 0,004 0,959 ± 0,003 endings+3-grams+dicts 0,957 ± 0,003 0,961 ± 0,003 0,959 ± 0,004 0,960 ± 0,002 Table : Results of the experiments on the training set of 10,000 names. Here endings – 4 Russian female endings, trigrams – 1000 most frequent 3-grams, dictionary – name/surname dictionary with γ = 80% top entries. This table presents precision, recall and F-measure of the female class. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 15. Problem and the Dataset Method Results References Results Figure : Learning curves of single and combined models. Accuracy was estimated on separate sample of 10,000 names. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 16. Problem and the Dataset Method Results References Results Figure : Accuracy of the model 3-grams function of the number of features used k. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 17. Problem and the Dataset Method Results References Results Figure : Accuracy of the dicts model function of the fraction of dictionaries used γ. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 18. Problem and the Dataset Method Results References Error Analysis There are several types of errors: Inconsistent annotation, such as “Anna Kryukova (male)” or “Boris Krolchansky (female)”. Name string is neither male nor female, but rather a name of a group, e.g. “Wikom Tools”, “Kazakh University of Humanities” or “Privat Bank”. Name string represents a foreign name, e.g. “Abdulloh Ibn Abdulloh”, “Brooke Alisson”, “Ulpetay Niyetbay” or “Yola Dolson”. Our model was not trained to deal with such names. Meaningless or partially anonymized names names, e.g. “Crazzy Ma”, “Un Petit Diable”, “Vv Tt”, “Vio La Tor” or “Muu Muu”. Additional information is required to derive gender of such users. People with rare names or surnames, e.g. “Guldjan Reyzova”, “Yagun Zumpelich” or “Akob Saakan”. These are people with common names and surnames in other countries, such asAlexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 19. Problem and the Dataset Method Results References Error Analysis Train Set Errors Test Set Errors name true class name true class 1 Lea Shraiber female Ilya Nadorshin male 2 Profanum Vulgus female Rustem Saledinov male 3 Anna Kryukova male Erkin Bahlamet male 4 Gin Amaya male Gocha Lapachi male 5 Gertrud Gallet female Muttaqiyyah Abdulvahhab female 6 Dolores Laughter female Yola Dolson female 7 Di Nolik male Heiran Gasanova female 8 Jlija Hotieca female Hadji Murad male 9 Gic Globmedic female Jenya Chekulenko female 10 Ulpetay Niyetbay female Tury.Ru Domodedovskaya Metro Office male 11 Olga Shoff male Elmira Nabizade female 12 Phil Golosoun male Niko Liparteliani male 13 Tsitsino Shurgaya female Oleg Grin’ male 14 Anna Grobov female Santi Zarovneva female 15 Linguini Incident female Misha Badali male 16 Toma Oganesyan female Che Serega male 17 Swon Swetik female Petr Kiyashko male 18 Adel Simon female Sandugash Botabaeva female 19 Ant Kam- male Jenya Sergienko female 20 Xristi Xitrozver female Abdulloh Ibn Abdulloh female 21 Anii Reznookova female Naikaita Laitvainenko male 22 Aurelia Grishko male Fil Kalnitskiy male 23 Alex Bu female Helen Hovel’ female 24 Karen Karine female Valery Kotelnikov male 25 Russian Spain female Max Od male 26 Lucy Walter male Jean Kvartshelia male 27 Aysah Ahmed female Adjedo Trupachuli female 28 Kiti Iz female Ainur Serikova femaleAlexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 20. Problem and the Dataset Method Results References Thank you! Questions? Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 21. Problem and the Dataset Method Results References Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. In ICWSM. Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics. Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of twitter users in non-english contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash, pages 18–21. Daniel, M. A. Zelenkov, Y. (2012). Russian national corpus as a playground for sociolinguistic research. episode iv. gender and length of the utterance. In Proceedings of Dialog-2012, pages 51–62. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 22. Problem and the Dataset Method Results References Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers’ age and gender. In Third International AAAI Conference on Weblogs and Social Media. Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412. Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Worksop. Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics. Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd international workshop on Search and mining user-generated contents, pages 37–44. ACM. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru
  • 23. Problem and the Dataset Method Results References Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177. Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pages 37–44. ACM. Alexander Panchenko a.panchenko@digsolab.comDetecting Gender by Full Name: Experiments with the Ru