
Processing multi-lingual business data


Interfax - Dun & Bradstreet review of the approaches to processing multi-lingual information

Published in: Technology
Processing multi-lingual business data

  2. Multi-lingual data processing: the CIS and Georgia. Olga Rink, director general
  3. Content
     • Business environment
     • Main stages of processing multi-lingual business data
       o Naming convention
       o Transliteration
       o Matching
     • Seeding and verifying objects in a media coverage
     Interfax - Dun & Bradstreet, Innovations in Multi-lingual context
  4. Official languages, population (mn) and Russian as a second language (est.)
  5. Multi-lingual environment

     | Country | Official language (group) | Population, mn | Alphabet | Second language | Russian, % of population, est. |
     |---|---|---|---|---|---|
     | Russia | Russian | 150 | Cyrillic | 35+* official and over 100 used | 100% |
     | Armenia | Armenian (Indo-European language) | 3 | Own script | Russian, English | 100% |
     | Azerbaijan | Azeri Turkish | 9.8 | Latin in Azerbaijan, Cyrillic in Russia (Dagestan) | | 90% |
     | Belarus | Bielaruskaja mova, Russian | 9.5 | Cyrillic | Russian | 100% |
     | Georgia | Georgian (Kartvelian language) | 3.7 | Georgian script | Russian, English, Azeri, Armenian | 100% |
     | Kazakhstan | Kazakh (Turkic language), Russian | 17.7 | Kazakh alphabets (Cyrillic, Latin, Perso-Arabic, Kazakh Braille) | Russian | 100% |
     | Kyrgyzstan | Kyrgyz (Turkic language), Russian | 6 | Cyrillic | Kyrgyz | 100% |
     | Moldova | Romanian | 3.6 | Latin | Russian is widely used | 90% |
     | Tajikistan | Tajik (Persian dialect) | 8 | Cyrillic | Russian | 90% |
     | Turkmenistan | Turkmen (Turkic language) | 5.2 | Cyrillic, Latin | Russian is used | 100% |
     | Ukraine | Ukrainian (Ukrayins'ka mova) | 42.5 | Cyrillic | Russian is widely used along with a number of other languages | 100% |
     | Uzbekistan | Uzbek, in fact Russian | 31.6 | Cyrillic, Latin | Russian is widely used | 100% |

     * The Constitution of Dagestan defines "Russian and the languages of the peoples of Dagestan" as the state languages.
     • The bulk of newly registered businesses is available in Cyrillic or Latin.
  6. • For Slavic languages we use the ISO 9:1995 standard with one exception: Latin characters with diacritics are replaced by a combination of plain Latin characters. Example: Ч becomes Ch (without diacritic) instead of Č (with diacritic).
     • ISO 9985 is used for Armenian.
     • ISO 9984 is used for Georgian.
     • A hard case: ООО «Ъ» (Trade style: OOO TVERDY ZNAK). The transliterated name is OOO “”, so there is no way to find the company by its original name.
     • Minor changes in transliteration, such as 3DNYUS, OOO > 3DNEWS, LLC, are accepted and are now filtered during updates.
     • Matching rules are defined in our “Naming Convention”: the transliterated, «normalized» brief company name from the Charter is used as the primary name, and the indication of the legal form (required by law in the name) is put at the end after a comma.
     • The second name is the transliterated full legal name.
     • The trade style contains the official name in English/Latin or trade marks.
     • We use rule-based and machine learning approaches in collecting data, identifying objects, developing credit scorings and digesting media coverage.
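
The transliteration and naming rules on this slide can be sketched roughly as follows. This is a minimal illustration, not the actual Interfax - Dun & Bradstreet implementation. The `ISO9` table covers the Russian alphabet only. In `ASCII_FALLBACK`, only the Č to Ch replacement comes from the slide; the other digraphs, the `LEGAL_FORMS` list and the function names are assumptions made for the example.

```python
# ISO 9:1995 transliteration of the Russian alphabet (lowercase letters).
ISO9 = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "ë",
    "ж": "ž", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "ш": "š", "щ": "ŝ", "ъ": "ʺ",
    "ы": "y", "ь": "ʹ", "э": "è", "ю": "û", "я": "â",
}

# Plain-Latin stand-ins for diacritic characters.
# Only "č" -> "ch" is stated on the slide; the rest are assumptions.
ASCII_FALLBACK = {
    "č": "ch", "ž": "zh", "š": "sh", "ŝ": "shch",
    "ë": "yo", "è": "e", "û": "yu", "â": "ya",
}

# Hypothetical list of Russian legal-form abbreviations.
LEGAL_FORMS = ("ООО", "ЗАО", "ОАО", "ПАО", "АО")


def transliterate(text):
    """Transliterate Cyrillic to Latin, replacing diacritics with digraphs."""
    out = []
    for ch in text:
        latin = ISO9.get(ch.lower())
        if latin is None:
            out.append(ch)  # digits, spaces, Latin letters pass through
            continue
        latin = ASCII_FALLBACK.get(latin, latin)
        out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)


def normalize_name(charter_name):
    """Build the primary name: transliterated brief name, legal form last."""
    cleaned = charter_name.replace("«", "").replace("»", "").replace('"', "")
    parts = cleaned.strip().split(maxsplit=1)
    if len(parts) == 2 and parts[0].upper() in LEGAL_FORMS:
        form, name = parts
        return f"{transliterate(name).upper()}, {transliterate(form).upper()}"
    return transliterate(cleaned.strip()).upper()


print(transliterate("Чехов"))
print(normalize_name("ООО «3ДНЮС»"))
```

The hard and soft signs are left as the ISO 9 prime characters in this sketch, which is one way a name consisting only of «Ъ» ends up with no searchable Latin letters.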
  7. Natural Language Processing and Machine Learning
     The SCAN engine leverages vast amounts of text data to enable the next generation of Interfax data products.
     Interfax builds a scalable machine learning infrastructure that enables data scientists and engineers to explore, train, and deploy credit and reputation risk models with minimal effort.
     • Tagging documents and classifying by text type (media release, forecast, feature, etc.)
     • Detecting and disambiguating named entities
     Support Vector Machine (SVM) or Bayes are used, depending on configuration:
     • SVM represents a text as a vector to compare with a pattern (prototype); the closeness defines the type
     • Bayes rule is applicable when you rely on pre-determined assumptions (a range of known “symptoms”) while calculating probabilities
     • Rule-based fact extraction and sentiment analysis
     At an initial phase, for seeding named persons:
     • Rule-based approach mostly
     • Context analysis and statistics for entity disambiguation
     Clarification of named entity detection by learning on a semi-automatically labelled corpus:
     • Support Vector Machine (SVM)
     • A neural network on the basis of the existing rule-based structure is considered for the future
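
To make the Bayes option for classifying by text type more concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. It is not the SCAN engine's code: the training sentences, the `train` and `classify` function names and the whitespace tokenizer are all assumptions for illustration, and a production system would use far more data and proper linguistic preprocessing.

```python
import math
from collections import Counter, defaultdict

# Toy labelled corpus (invented for this example).
SAMPLES = [
    ("media-release", "company announces new product launch"),
    ("media-release", "company announces quarterly results"),
    ("forecast", "analysts expect growth next year"),
    ("forecast", "analysts forecast inflation next quarter"),
]


def train(samples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in samples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def classify(text, model):
    """Return the class with the highest posterior log-probability."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Prior: share of training documents with this label.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training carry no evidence
            # Likelihood with add-one (Laplace) smoothing.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(SAMPLES)
print(classify("company announces merger", model))
print(classify("analysts expect inflation", model))
```

The known words play the role of the “symptoms” mentioned on the slide: each one shifts the probability toward the text types in which it was seen during training.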
  8. An intellectual WOW-effect, or what only SCAN can do: moving forward to “verifying” media coverage
     • Out of 3 mn companies automatically generated by the SCAN linguistic kernel over the past year, 22 thousand have been verified and 0.5 mn are identified with Spark
     • 2 mn persons were generated (seeded), of which 75 thousand have been verified
     • 300 thousand geographic locations: all Russian ones are identified by the OKATO classifier, and many global locations were obtained by parsing Wikipedia
     • 13 thousand trade marks (“Trade style”)
     • 24 thousand sources in Russian
  9. Thank you. Interfax – Dun & Bradstreet