TEXT MINING NAMES IN ‘BIG DATA’ TO
RECOGNIZE TURKISH MIGRATION TRENDS
NamSor Applied Onomastics
2014-05-30
Names Data Mining is just a Tool
Zeynep Değirmencioğlu
Şükrü Kaya
Şükrü Saracoğlu
Elian Carsenat
Hüseyin Yıldız
Mahmut Yıldırım
Fatih Öztürk
Mehmet Bölükbaşı
Mehmet Yılmaz
Elif Yıldırım
Ahmet Yıldırım
Mustafa Yücedağ
Mustafa Uzunyılmaz
Fatih Kılıç
Fatih Yılmaz
Murat Yıldırım
Hüseyin Kılıç
Oğuzhan Yıldız
Mevlüt Çavuşoğlu
… (Source: Freebase)
What’s in a name? What’s a name?
• Elian Carsenat
• @ElianCarsenat (Twitter)
• elian.carsenat@namsor.com
• elian.carsenat@sfr.fr
• tioulpanov (Skype)
• NamSor.com
• Onomastics = the science of proper names
Onoma != Residence != Nationality
Source: OECD
NamSor sorts names: functions, use cases
Functions:
1. Name Linguistic Classification
2. Name Transliteration & Matching
3. Named Entity Extraction, Parsing
Use cases:
• Multilingual Text Mining
• Control Watch Lists
• Social Networks Analytics
• Geo demographics
NamSor supervised learning
FN        LN
Mette     Andersen
Lene      Andersson
Eva       Arndt-Riise
Heidi     Astrup
Mie       Augustesen
Margot    Bærentzen
Louise    Bager Nørgaard
Marie     Bagger Rasmussen
Yutta     Barding
Ulla      Barding-Poulsen
FN        LN
Xian      Dongmei
Zheng     Dongmei
Jin       Dongxiang
Xu        Dongxiang
Li        Dongxiao
Qin       Dongya
Li        Dongying
Han       Duan
Li        Duihong
Jiang     Fan
Training set : Athletes
Step 1 – Learn stereotypes
bitao gong
biwang jiang
birgitta agerberth
birgitte l. eriksen
bitao gong
bitten thorengaard
biwang Jiang
birgitta agerberth
birgitte l. eriksen
bitten thorengaard
Data set : Inventors
Step 2 – Classify
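The two steps above can be sketched as a toy character-trigram classifier. This only illustrates the learn-then-classify idea; it is not NamSor's actual algorithm, which the slides do not describe. The labels `Denmark` and `China`, and the names `trigrams`, `train` and `classify`, are assumptions introduced for this sketch.

```python
from collections import Counter

# Training set: athlete names from the slide. The country labels are an
# assumption made for this sketch; the slide does not state them.
ATHLETES = {
    "Denmark": ["Mette Andersen", "Lene Andersson", "Eva Arndt-Riise",
                "Heidi Astrup", "Mie Augustesen", "Margot Bærentzen",
                "Louise Bager Nørgaard", "Marie Bagger Rasmussen",
                "Yutta Barding", "Ulla Barding-Poulsen"],
    "China": ["Xian Dongmei", "Zheng Dongmei", "Jin Dongxiang",
              "Xu Dongxiang", "Li Dongxiao", "Qin Dongya", "Li Dongying",
              "Han Duan", "Li Duihong", "Jiang Fan"],
}


def trigrams(name):
    """Return the character trigrams of a lowercased, space-padded name."""
    padded = f" {name.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(labelled_names):
    """Step 1: learn one relative-frequency trigram profile per label."""
    profiles = {}
    for label, names in labelled_names.items():
        counts = Counter(g for name in names for g in trigrams(name))
        total = sum(counts.values())
        profiles[label] = {g: c / total for g, c in counts.items()}
    return profiles


def classify(name, profiles):
    """Step 2: pick the label whose profile best covers the name's trigrams."""
    scores = {label: sum(profile.get(g, 0.0) for g in trigrams(name))
              for label, profile in profiles.items()}
    return max(scores, key=scores.get)


PROFILES = train(ATHLETES)
# Data set: inventor names from the slide.
INVENTORS = ["bitao gong", "biwang jiang", "birgitta agerberth",
             "birgitte l. eriksen", "bitten thorengaard"]
for inventor in INVENTORS:
    print(inventor, "->", classify(inventor, PROFILES))
```

With only twenty training names the profiles are tiny, so a sketch like this is useful for seeing the mechanics and not for judging accuracy.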
Accuracy is measurable: ~80%
The very first backtest, on the onomastics of 150,000 Olympic Games athletes
Row Labels     TOTAL   PERF
Japan          3794    97%
Mongolia       260     93%
Greece         1576    92%
Lithuania      262     89%
Italy          4150    89%
Poland         2818    88%
South Korea    2180    87%
Top five onoma classes per row:
Japan: Japan 3686, Indonesia 4, Sri Lanka 3, Nigeria 3, Congo (B) 3
Mongolia: Mongolia 243, Iraq 2, Japan 1, Mali 1, Kazakhstan 1
Greece: Greece 1444, Italy 14, Georgia 6, Romania 5, Great Britain 5
Lithuania: Lithuania 234, Namibia 3, Greece 3, Latvia 3, Russia 2
Italy: Italy 3675, Spain 81, Portugal 80, France 29, Austria 26
Poland: Poland 2486, Czechoslovakia 46, Czech Republic 38, Slovakia 34, Austria 22
South Korea: South Korea 1901, North Korea 209, Chinese Taipei 10, Equatorial Guinea 6, China 5
European athletes (excl. Anglo & Latin): breakdown accuracy 84%
Ex-Yugoslavia athletes: breakdown accuracy 75%
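The PERF figures for the seven countries listed above appear to be the share of each country's athletes whose names were classified into that same country. That reading is an assumption, but it reproduces the slide's percentages, as this short check shows. The dictionary and function names are hypothetical.

```python
# Counts copied from the slide: total athletes per country, and how many of
# them were assigned the matching onoma class.
TOTALS = {"Japan": 3794, "Mongolia": 260, "Greece": 1576, "Lithuania": 262,
          "Italy": 4150, "Poland": 2818, "South Korea": 2180}
CORRECT = {"Japan": 3686, "Mongolia": 243, "Greece": 1444, "Lithuania": 234,
           "Italy": 3675, "Poland": 2486, "South Korea": 1901}


def accuracy_percent(country):
    """Share of a country's athletes classified into that country's onoma."""
    return round(100 * CORRECT[country] / TOTALS[country])


for country in TOTALS:
    print(f"{country}: {accuracy_percent(country)}%")
```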
Decrypting identity across space/time:
India Geodemographics (1914)
Source: Commonwealth WWI Casualties
Unsupervised learning is
fine-grained: Country/Region,…
• Ex. Russian Federation
In progress :
Syrian names (backtesting)
Onoma Count
Syria 201
Saudi Arabia 20
Iraq 8
Kuwait 4
United Arab Emirates 3
Egypt 3
Qatar 2
Bahrain 2
Sudan 2
Lebanon 2
Algeria 1
Oman 1
Grand Total 249
طاهر الحريري
عبدالغفار العيدة سليمان
عبدالغفار شحادة
قاسم الأسعد
مؤمن حموده
مفلح محمد الجراد
نزار الحروب
نزار العيدة سليمان
أسامة الحراكي
أنس الصغير
خالد الهبول
وفيق الواحد عبد
إسراء يونس
رشا نزهة
زكريا محمد وهبة
كمال بركات
عيد محمد اللو
[…]
Syrian names recognized at ~80% (201 of 249).
The other names may in fact be non-Syrian, or generic to the Arab world.
What can you dig with this tool?
Mining 5M names to recognize Gender,
breakdown by nationality/likely origin
Mining 1M names to map Diasporas
Source: Twitter
Mining 3M Geo-Tweets
Population flows on Twitter
Source          Target  Type      Id   Onoma          Weight
United Kingdom  France  Directed  16   Great Britain  37
Spain           France  Directed  55   Spain          14
United States   France  Directed  75   Great Britain  12
Turkey          France  Directed  79   Turkey         11
Brazil          France  Directed  87   Portugal       10
United Kingdom  France  Directed  112  Ireland        9
Italy           France  Directed  152  Italy          7
Switzerland     France  Directed  226  France         5
Belgium         France  Directed  247  France         5
United Kingdom  France  Directed  258  France         5
Mexico          France  Directed  287  Spain          4
Ireland         France  Directed  317  Great Britain  4
United Kingdom  France  Directed  333  Italy          4
United States   France  Directed  375  France         4
Source: Twitter
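The rows above form a weighted, directed edge list, so a natural first step is to total the Weight column by Source country or by Onoma. The sketch below does that for the fourteen rows shown. The tuple layout and the helper name `total_weight_by` are assumptions made for illustration, and the slide does not say how Weight was computed.

```python
from collections import defaultdict

# (Source, Target, Onoma, Weight) copied from the table; Type and Id omitted.
EDGES = [
    ("United Kingdom", "France", "Great Britain", 37),
    ("Spain", "France", "Spain", 14),
    ("United States", "France", "Great Britain", 12),
    ("Turkey", "France", "Turkey", 11),
    ("Brazil", "France", "Portugal", 10),
    ("United Kingdom", "France", "Ireland", 9),
    ("Italy", "France", "Italy", 7),
    ("Switzerland", "France", "France", 5),
    ("Belgium", "France", "France", 5),
    ("United Kingdom", "France", "France", 5),
    ("Mexico", "France", "Spain", 4),
    ("Ireland", "France", "Great Britain", 4),
    ("United Kingdom", "France", "Italy", 4),
    ("United States", "France", "France", 4),
]


def total_weight_by(edges, column):
    """Sum Weight grouped by tuple position (0 = Source, 2 = Onoma)."""
    totals = defaultdict(int)
    for edge in edges:
        totals[edge[column]] += edge[3]
    return dict(totals)


by_source = total_weight_by(EDGES, 0)
by_onoma = total_weight_by(EDGES, 2)
for onoma, weight in sorted(by_onoma.items(), key=lambda kv: -kv[1]):
    print(onoma, weight)
```

Grouping by Onoma instead of Source is what separates the name-based view from the purely geographic one: for instance, the Brazil row carries a Portugal onoma.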
Mining 150k names in Patents to see
where the Turkish ‘brain juice’ flows
Mining names : a word of caution
Can ‘Big Data’ answer any question?
• Trash in, gold out? Yes, to some extent
• Beware of biases induced by the data source itself
• Data access limitations / privacy issues
• Open Data vs. Free APIs vs. Commercial Databases
Still, tools make the impossible possible
Originating FDI leads
• NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.
• The idea behind it: “As recently as 1986 Ireland was one of the poorest countries in the European Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have attracted huge amounts of money from America – due largely to a century of personal and familial ties – and they have used this money to build factories”.
• It is an approach that Milda Darguzaite, the Managing Director of Invest Lithuania, considers relevant for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian origin living abroad, there are a good many personal and familial ties to be leveraged to attract new investment projects to the country. NamSor name recognition software helped discover those ties.
• Recognizing names and their origin in global professional databases allows Investment Promotion Agencies to identify potentially interesting high-profile contacts in different countries and industrial sectors, and to reach out to them. Another way to accelerate the origination of new leads is to better understand and leverage the existing network of foreign businessmen in the country itself.
• NamSor™ filters millions of meaningless data elements down to a few dozen actionable names.
• Domas Girtavicius, a senior consultant at Invest Lithuania, said "we were impressed by the accuracy of the name recognition software: it reliably predicts the country of origin and the number of false positives is fully manageable". Elian Carsenat, the founder of NamSor™, said "searching for names in the Big Data is like seeking a gold needle in a haystack: doable once the right tool exists".
Conclusions
• We recognize names in any language, any place, any database; we can classify and we can sort
• An onomastic class is not a ‘hard fact’ like a place of birth, a nationality, etc., but it is accurate and fine-grained
• As a statistics tool, it might be debatable. But as a data-mining tool, it is sharp, simple and efficient: it can help find research directions and discover trends
• We see use cases in Migration research; Education & Skills; Labour & Social Affairs; Territorial Development/FDI; Science & Innovation
Thank you!
• http://fdimagnet.com/
• http://namsor.com/
July 2013, Embassy of Lithuania in Paris
• elian.carsenat@namsor.com
• +33 6 52 77 99 07
• Twitter @NamsSor_com
