Can we track the geographic origin of
surnames based on bibliographic data?

  1. Can we track the geographic origin of surnames based on bibliographic data?
     Nicolas Robinson-Garcia, Ed Noyons & Rodrigo Costas
     15th INTERNATIONAL CONFERENCE ON SCIENTOMETRICS & INFORMETRICS
     29 June – 3 July, 2015, Bogazici University, Istanbul, Turkey
     EC3metrics spin off
     CWTS Leiden University
  2. Agenda
     o Background
     o Bibliographic data
     o Method 1. Kullback-Leibler divergence
     o Method 2. Concentration Index
     o The ‘golden list’
     o Next or previous steps
  3. Background
     “the use of surnames in human population biology dates back to 1875, when George Darwin used frequency of occurrences of the same surname in married couples to study in-breeding” (Kissin, 2011)
     WHAT IS IN A SURNAME?
     o Proxy for genetic/ethnic origin -> Epidemiology, Biomedical research
     o Proxy for country origin -> Demographic studies, migratory movements
  4. Background
     o The representation of Jewish surnames in biomedical journals and US-patents (Kissin, 2011; Kissin & Bradley, 2013)
     o Relation between ethnic mix collaboration and citation impact (Freeman & Huang, 2014)
     … in the field of bibliometrics
  5. Background
     HOW CAN WE DETERMINE THE GEOGRAPHIC ORIGIN OF SURNAMES?
     METHODS
     o Manually curated lists
     o Probability and Bayesian methods
     o Clustering techniques
     DATA SOURCES
     o National census
     o Dispersion of sources
     o Lack of international coverage
  6. Bibliographic data
     o Scientific databases as international surnames data sources
       › Regional restrictions
       › Temporal restrictions
     o Establishing ‘trusted’ linkages between surnames and countries
       › Reprint address
       › First author-First address
       › One country publications
       › Author-address linkages (2008)
  7. Bibliographic data
     o Scientific databases as international surnames data sources
       › Regional restrictions
       › Temporal restrictions
     o Establishing ‘trusted’ linkages between surnames and countries
     Some figures:
     -> 1,568,052 distinct surnames assigned to 119 countries
     -> France 8.8%; Germany 8.0%; Russia 7.1%; Spain 4.9%
  8. Assumptions
     HYPOTHESIS 1
     A surname should be assigned to the country where that surname has the highest frequency.
     HYPOTHESIS 2
     A surname should be assigned to the country where that surname has the greatest concentration.
  9. Method 1. Kullback-Leibler
     OPERATIONALIZATION
     A surname will be assigned to a country if 1) it has the highest frequency there, and 2) there are “certain levels of assurance”.
     METHOD 1
     Kullback-Leibler divergence indicates the (dis)similarity between the global distribution of a surname and its distribution in each country.
  10. Method 2. Gini Index
      OPERATIONALIZATION
      A surname will be assigned to the country with the highest concentration of that surname.
      METHOD 2
      The Gini Index is an inequality indicator already employed for other purposes in bibliometrics. It rates the concentration of a surname in a country on a scale from 0 to 1.
  11. Kullback-Leibler vs. Gini index
      Top 10 countries with the highest number of surnames assigned

      Method 1. Kullback-Leibler       Method 2. Gini index
      Country       No. surnames       Country    No. surnames
      FRANCE        138349             USA        310739
      GERMANY       112445             FRANCE     117938
      RUSSIA        111716             GERMANY    111375
      SPAIN          83529             RUSSIA      94369
      USA            76219             ITALY       65699
      ITALY          69637             JAPAN       52399
      ENGLAND        63885             ENGLAND     47521
      JAPAN          56345             CANADA      46146
      CANADA         49775             POLAND      44087
      NETHERLANDS    41306             INDIA       42897
  12. Kullback-Leibler vs. Gini index
      Example surnames and the country each method assigns them to

      Surname     Method 1. Kullback-Leibler   Method 2. Gini index
      CLINTON     USA                          USA
      EGGHE       BELGIUM                      BELGIUM
      GARFIELD    USA                          USA
      HERRERA     SPAIN                        CUBA
      GARCIA      SPAIN                        CUBA
      EINSTEIN    USA                          ISRAEL
      NOYONS      NETHERLANDS                  NETHERLANDS
      PEREIRA     BRAZIL                       PORTUGAL
  13. The ‘golden list’
      Validating the methods proposed
      SEARCHING A ‘GOLDEN LIST’ TO VALIDATE THE RESULTS
      o Coverage
      o Criteria
        › Language
        › Ethnicity
        › Historical origin
      o Reliance and double assignments
  15. The ‘golden list’
      Validating the methods proposed
      In search of a ‘golden list’ of surnames assigned to countries/languages/ethnicities
      http://en.wikipedia.org/wiki/Category:Surnames_by_language

      Unified country   Languages
      Denmark           Danish
      England           Celtic; Anglo-Cornish; English; Scottish; Irish
      Finland           Finnish
      France            Breton; French
      Germany           German
      Greece            Greek
      Iceland           Icelandic
      Italy             Italian
      Japan             Japanese
      Netherlands       Afrikaans; Dutch
      Portugal          Portuguese
      Spain             Basque; Catalan; Galician;
  16. The ‘golden list’

                    METHOD 1                  METHOD 2
      Countries     % coverage   % correct    % coverage   % correct
      DENMARK        91.1%        68.75%      100%          60.16%
      ENGLAND        28.8%        80.97%      100%          58.56%
      FINLAND        99.11%       94.62%      100%          91.96%
      FRANCE         88.08%       68.28%      100%          50.54%
      GERMANY        52.24%       69.00%      100%          43.78%
      GREECE         84.12%       78.32%      100%          78.57%
      ICELAND       100.00%       65.52%      100%         100.00%
      ITALY          87.65%       86.97%      100%          64.77%
      JAPAN          98.74%       98.95%      100%          91.39%
      NETHERLANDS    88.11%       60.96%      100%          41.67%
      PORTUGAL       98.54%       92.59%      100%          91.91%
      SPAIN          93.18%       48.74%      100%          54.74%
      Total          73.22%       79.03%      100%          61.29%
  17. Next or previous steps
      o Is the Web of Science a good sample of the world population?
        › Country census crossed with the WoS
      o Time frames and migratory movements
        › Apply methods to different periods
      o Validation and comparison with other techniques
        › Bayesian, probability, clustering
      o Multiple assignments of countries (e.g., Lee, Santos)
  18. Thank you!
      elrobin@ugr.es
      Nicolas Robinson-Garcia, Ed Noyons & Rodrigo Costas
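
The two assignment rules (slides 8 to 10) and the coverage / correctness check (slide 16) can be illustrated with a minimal Python sketch. The slides do not give the exact formulas, so everything here is an assumption made for illustration: Method 1 is read as "country with the highest raw count, accepted only if the Kullback-Leibler divergence between the surname's country distribution and the overall country distribution passes a threshold", Method 2 as "country with the highest surname share relative to country size, with a Gini index summarising concentration", and "% correct" as the share of covered surnames that match the golden list. The counts, the country totals, the `min_divergence` threshold, the golden list, and all function names are hypothetical.

```python
import math

# Hypothetical toy data: counts[surname][country] = number of author records.
counts = {
    "GARCIA": {"SPAIN": 900, "USA": 400, "CUBA": 60},
    "NOYONS": {"NETHERLANDS": 12},
    "LEE": {"SPAIN": 50, "USA": 300, "NETHERLANDS": 30, "CUBA": 2},
}
# Hypothetical total number of author records per country.
country_totals = {"SPAIN": 50000, "USA": 300000, "CUBA": 1500, "NETHERLANDS": 30000}


def kl_divergence(p, q):
    """D(p || q) in bits; countries with p == 0 contribute nothing."""
    return sum(pc * math.log2(pc / q[c]) for c, pc in p.items() if pc > 0)


def assign_by_frequency(surname_counts, totals, min_divergence=0.5):
    """Method 1 (assumed reading): highest raw count, gated by KL divergence."""
    n = sum(surname_counts.values())
    p = {c: k / n for c, k in surname_counts.items()}   # surname over countries
    grand = sum(totals.values())
    q = {c: t / grand for c, t in totals.items()}       # all records over countries
    if kl_divergence(p, q) < min_divergence:            # assumed "level of assurance"
        return None
    return max(surname_counts, key=surname_counts.get)


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


def assign_by_concentration(surname_counts, totals):
    """Method 2 (assumed reading): highest share of a country's records."""
    shares = {c: surname_counts.get(c, 0) / t for c, t in totals.items()}
    return max(shares, key=shares.get), gini(shares.values())


def validate(assignments, golden):
    """Return (coverage, correct): the share of golden-list surnames that got
    a country, and the share of those covered surnames that match the list."""
    covered = {s: c for s, c in assignments.items() if s in golden and c is not None}
    coverage = len(covered) / len(golden)
    correct = sum(covered[s] == golden[s] for s in covered) / len(covered) if covered else 0.0
    return coverage, correct


# Hypothetical golden list, for illustration only.
golden = {"GARCIA": "SPAIN", "NOYONS": "NETHERLANDS", "LEE": "USA"}

m1 = {s: assign_by_frequency(c, country_totals) for s, c in counts.items()}
m2 = {s: assign_by_concentration(c, country_totals)[0] for s, c in counts.items()}

print("Method 1:", m1, validate(m1, golden))
print("Method 2:", m2, validate(m2, golden))
```

On this toy data the sketch reproduces the qualitative pattern of the slides: the frequency rule leaves the evenly spread surname unassigned and sends GARCIA to SPAIN, while the concentration rule assigns every surname and sends GARCIA to CUBA, because a small country total inflates the share. Whether the authors normalised in this way is not stated in the slides.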
