SALES RELAUNCH F&Q SESSION
Multi-lingual data processing
The CIS and Georgia
Olga Rink, director general
Content
Interfax - Dun & Bradstreet, Innovations in Multi-lingual context
• Business environment
• Main stages of processing multi-lingual business data
o Naming convention
o Transliteration
o Matching
• Seeding and verifying objects in media coverage
Official languages, population (mn) and Russian as a
second language (est.)
Multi-lingual environment
Country | Official language (group) | Population, mn | Alphabet | Second language | Russian, % of population, est.
Russia | Russian | 150 | Cyrillic | 35+* official and over 100 used | 100%
Armenia | Armenian (Indo-European language) | 3 | Own script | Russian, English | 100%
Azerbaijan | Azeri Turkish | 9.8 | Latin in Azerbaijan, Cyrillic in Russia (Dagestan) | | 90%
Belarus | Bielaruskaja mova, Russian | 9.5 | Cyrillic | Russian | 100%
Georgia | Georgian (Kartvelian language) | 3.7 | Georgian script | Russian, English, Azeri, Armenian | 100%
Kazakhstan | Kazakh (Turkic language), Russian | 17.7 | Kazakh alphabets (Cyrillic, Latin, Perso-Arabic, Kazakh Braille) | Russian | 100%
Kyrgyzstan | Kyrgyz (Turkic language), Russian | 6 | Cyrillic | Kyrgyz | 100%
Moldova | Romanian | 3.6 | Latin | Russian is widely used | 90%
Tajikistan | Tajik (Persian dialect) | 8 | Cyrillic | Russian | 90%
Turkmenistan | Turkmen (Turkic language) | 5.2 | Cyrillic, Latin | Russian is used | 100%
Ukraine | Ukrainian (Ukrayins'ka mova) | 42.5 | Cyrillic | Russian is widely used along with a number of other languages | 100%
Uzbekistan | Uzbek, in fact Russian | 31.6 | Cyrillic, Latin | Russian is widely used | 100%
• The Constitution of Dagestan defines "Russian and the languages of
the peoples of Dagestan" as the state languages
• The bulk of newly registered businesses are available in Cyrillic or Latin script
• For Slavic languages we use the ISO 9:1995 standard with one exception: Latin characters with diacritics are replaced by combinations of plain Latin characters. Example: Ч is rendered as Ch (without diacritic) instead of Č (with diacritic)
• ISO 9985 is used for Armenian
• ISO 9984 is used for Georgian
• Edge case: ООО «Ъ» (Trade style: OOO TVERDY ZNAK). The transliterated name is OOO “”, so there is no way to find the company by its original name
• Minor changes in transliteration, such as 3DNYUS, OOO > 3DNEWS, LLC, are accepted and are now filtered during updates
• Matching rules are defined in our “Naming Convention”: the transliterated, «normalized» brief company name from the Charter is used as the primary name, and the legal form indicated in the name (required by law) is put at the end, after a comma.
• The secondary name is the transliterated full legal name.
• The Trade style contains the official name in English/Latin script, or trade marks
• We use rule-based and machine learning approaches in areas including collecting data, identifying objects, developing credit scorings and digesting media coverage
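The transliteration exception and the primary-name rule above can be sketched roughly as follows. This is a minimal illustration, not the production rule set: only the Ч to CH mapping comes from the slide, the SH and ZH digraphs, the letter subset, the list of legal forms and the company name «Чернозем» are assumptions made for the example, and upper-casing simply follows the style of the examples above.

```python
import re

# Only Ч -> CH comes from the slide; Ш and Ж are assumed to follow the same digraph pattern.
DIGRAPHS = {"Ч": "CH", "Ш": "SH", "Ж": "ZH"}

# Illustrative subset of letters that ISO 9:1995 already renders without diacritics.
SIMPLE = dict(zip("АБВГДЕЗИКЛМНОПРСТУФ", "ABVGDEZIKLMNOPRSTUF"))

TABLE = {**SIMPLE, **DIGRAPHS}

# Hypothetical list of legal forms that may precede the quoted company name.
LEGAL_FORM_RE = re.compile(r'^(ООО|ЗАО|ОАО|ПАО|АО)\s+[«"](.+)[»"]$')


def transliterate(text: str) -> str:
    """Upper-case the text and map each Cyrillic letter; unmapped characters pass through."""
    return "".join(TABLE.get(ch, ch) for ch in text.upper())


def primary_name(charter_brief_name: str) -> str:
    """Return the transliterated name with the legal form moved to the end after a comma."""
    cleaned = charter_brief_name.strip()
    match = LEGAL_FORM_RE.match(cleaned)
    if match is None:
        # No recognised legal form: transliterate the name as it stands.
        return transliterate(cleaned)
    legal_form, name = match.groups()
    return f"{transliterate(name)}, {transliterate(legal_form)}"


print(primary_name("ООО «Чернозем»"))  # CHERNOZEM, OOO
```

Characters missing from the table, such as Ъ, pass through unchanged in this sketch, which is one reason a separate Trade style is needed for names like ООО «Ъ».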
Natural Language Processing and Machine Learning
The SCAN engine leverages vast amounts of text data to enable the next generation of Interfax data products
Interfax builds a scalable machine learning infrastructure that enables data scientists and engineers to explore, train,
and deploy credit and reputation risk models with minimal effort
• Tagging documents and classifying them by text type (media release, forecast, feature, etc.)
Detecting and Disambiguating Named Entities
A Support Vector Machine (SVM) or a Bayes classifier is used, depending on configuration
• SVM represents a text as a vector and compares it with a pattern (prototype); the closeness defines the type
• Bayes' rule applies when probabilities are calculated from predetermined assumptions (a range of known “symptoms”)
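As a rough illustration of the Bayes option, the sketch below classifies a text by type with a small multinomial Naive Bayes model and add-one smoothing. It is not the SCAN implementation: the function names and the four training sentences are hypothetical, and here the known “symptoms” are simply the words seen for each text type.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count documents and words per label; samples is a list of (label, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in samples:
        words = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest posterior log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Likelihood with add-one (Laplace) smoothing for unseen words.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, for illustration only.
model = train([
    ("media-release", "company announces launch of new product"),
    ("media-release", "company announces appointment of new director"),
    ("forecast", "analysts expect revenue to grow next year"),
    ("forecast", "analysts expect prices to fall next quarter"),
])

print(classify("analysts expect profit to grow", model))  # forecast
```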
Rule-based fact extraction and sentiment analysis
At the initial phase, for seeding named persons
• Mostly a rule-based approach
• Context analysis and statistics for entity disambiguation
Refining Named Entity Detection by learning from a semi-automatically labelled corpus
• Support Vector Machine (SVM)
• A neural network built on the existing rule-based structure is being considered for the future
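One simple reading of “context analysis and statistics for entity disambiguation” is to pick the candidate entity whose known context words overlap most with the words around the mention. The sketch below assumes exactly that; the function, the candidate names and the keyword sets are hypothetical, and the real SCAN rules are not described in the slides.

```python
def disambiguate(context, candidates):
    """Pick the candidate whose keyword set shares the most words with the context.

    candidates maps an entity name to a set of lower-case context keywords.
    Returns None when no candidate shares any word with the context.
    """
    words = set(context.lower().split())
    scores = {name: len(words & keywords) for name, keywords in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Two hypothetical companies sharing the same brief name.
candidates = {
    "ROMASHKA, OOO (Moscow)": {"moscow", "bank", "loan"},
    "ROMASHKA, OOO (Kazan)": {"kazan", "bakery", "bread"},
}

print(disambiguate("Romashka opened a new bakery in Kazan", candidates))
```

A mention with no overlapping context words returns None here, which would leave the object seeded but not yet identified.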
An intellectual WOW effect, or what only SCAN can do: moving forward to “verifying” media coverage
Of the 3 mn companies automatically generated by the SCAN linguistic kernel over the past year, 22 thousand have been verified and 0.5 mn have been identified with Spark
2 mn persons were generated (seeded), of which 75 thousand have been verified
300 thousand geographic locations: all Russian ones are identified by the OKATO classifier, and many global locations were obtained by parsing Wikipedia
13 thousand trade marks (“Trade style”)
24 thousand sources in Russian
Thank You
Interfax – Dun & Bradstreet
www.dnb.ru

Processing multi-lingual business data
