
II-SDV 2017: Localizing International Content for Search, Data Mining and Analytics Applications


Advances in text mining, analytics and machine learning are enabling ever more powerful applications, yet most applications and platforms are designed to handle a single (normalized) language. As these applications and platforms are increasingly required to ingest international content, the challenge is to normalize content to a single language without compromising quality. A related question is how to define quality in this context and what by-products, if any, a localization effort can produce that enhance the usefulness of the application.

Using patent searching as an example use case, this talk reviews the challenges and possible solution approaches for handling localization effectively. It shows what current and emerging technology offers, what to expect and what not to expect, and provides an introductory practical guide to handling localization in the context of data mining and analytics.

Published in: Internet


  1. 1. Localizing International Content for Search, Data Mining and Analytics Applications Andrew Rufener E: andrew.rufener@omniscien.com Copyright © 2017 Omniscien Technologies. All Rights Reserved.
  2. 2. Agenda • Who we are and what we do • Setting the scene: an architecture for our discussion and the key challenges • The localization workflow and why content localization and search are intertwined • Illustrating using a practical example • Summary & Recommendations
  3. 3. COMPANY OVERVIEW • Founded in 2007 as Asia Online, changed company name in 2016 to Omniscien Technologies • Award-winning, leading global supplier of specialized and highly scalable language processing, machine translation and machine learning solutions offering in excess of 540 global language pairs • HQ in Singapore, European operation in The Hague, The Netherlands, Asian operation in Bangkok, Thailand • Global customer base in North America, Europe and Asia Pacific
  4. 4. MARKETS AND SOLUTIONS • eCommerce and Online Travel: Automated, high-volume localization of complex product catalogue information as well as user-generated content and reviews • Online Research System and Digital Publishing: Automated, high-volume tagging, language processing, translation and transliteration of legal, intellectual property, scientific, financial and business information content as well as generation of relevant meta data • Government & Intelligence: Automated, high-volume language identification, entity and entity relationship recognition, sentiment analysis, linking, translation and transliteration of various information sources • Technology & Enterprise: Complex language processing, tagging, enriching and localization • Localization Industry: Support of complex and high-volume localization • Media and Subtitling: Subtitle extraction and manufacturing from different sources, support of re-writing source for subtitling, localization and post-editing, automated placement in frames and improvement • eDiscovery: Automated, high-volume content tagging, localization and discovery for litigation data gathering, analysis and support
  5. 5. Setting the scene and why content localization and search are intertwined • 31 March 2017
  6. 6. SIMPLIFIED REFERENCE ARCHITECTURE FOR OUR DISCUSSION Unstructured Data Structured Data Search “Engine”
  7. 7. HOW DO I KNOW WHAT TO “ASK” FOR? Unstructured Data Structured Data Search “Engine” • How do I construct the right query / search? • How do I know what keywords to use? • Semantic or Concept Search • Keyword lists • Domain classifications • Keyword based domain classification (AI) • …
  8. 8. HOW DO WE DEAL WITH MULTI-LINGUAL CONTENT? Unstructured Data Structured Data Search “Engine” • Option 1: Normalize to a single language • Option 2: Cross-lingual search • What domain, how do we maintain quality, what is quality, what language do we normalize to? • What kind of data, is normalization or transliteration needed, how do we deal with variants?
  9. 9. THE GENERIC LOCALIZATION WORKFLOW • 1. Extraction: Extract from source format to text or XML • 2. Enrichment: Identifying entities, entity relationships, adding meta data, sentiment analysis, etc. • 3. Translation: Translation and/or transliteration, normalizing terminology, maintaining meta-data • 4. Enrichment: Post-translation corrections, additional enrichment and classification, etc. • 5. Delivery: Delivery to user / application with or without enrichments
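The five workflow stages can be sketched as a chain of small functions. This is only an illustration under stated assumptions, not the speaker's implementation: every name here (`Document`, `extract`, `enrich_source`, `translate`, `enrich_target`, `deliver`, `run_pipeline`, `GAZETTEER`, `TOY_PHRASES`) is hypothetical, the phrase table stands in for a real MT engine, and the entity and domain rules are deliberately trivial.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    lang: str
    metadata: dict = field(default_factory=dict)


# Hypothetical lookup data standing in for real entity and MT components.
GAZETTEER = ("WIPO",)
TOY_PHRASES = {"octet de poids fort": "high byte"}


def extract(raw_xml: str, lang: str) -> Document:
    # Stage 1: strip markup to plain text (toy regex, not a real XML parser).
    text = re.sub(r"<[^>]+>", " ", raw_xml)
    return Document(text=" ".join(text.split()), lang=lang)


def enrich_source(doc: Document) -> Document:
    # Stage 2: tag known entities before translation.
    doc.metadata["entities"] = [e for e in GAZETTEER if e in doc.text]
    return doc


def translate(doc: Document) -> Document:
    # Stage 3: "translate" via the toy phrase table, keeping meta data attached.
    text = doc.text
    for source_phrase, target_phrase in TOY_PHRASES.items():
        text = text.replace(source_phrase, target_phrase)
    return Document(text=text, lang="en",
                    metadata=dict(doc.metadata, source_lang=doc.lang))


def enrich_target(doc: Document) -> Document:
    # Stage 4: post-translation classification (toy keyword rule).
    doc.metadata["domain"] = "computing" if "byte" in doc.text else "unknown"
    return doc


def deliver(doc: Document, with_enrichments: bool = True) -> dict:
    # Stage 5: hand over the text, with or without the enrichments.
    out = {"text": doc.text, "lang": doc.lang}
    if with_enrichments:
        out["metadata"] = doc.metadata
    return out


def run_pipeline(raw_xml: str, source_lang: str, with_enrichments: bool = True) -> dict:
    doc = enrich_target(translate(enrich_source(extract(raw_xml, source_lang))))
    return deliver(doc, with_enrichments)
```

The design point the sketch tries to make is that meta data created before translation travels through to delivery, so the consuming search application can decide whether to use it.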
  11. 11. THE GENERIC LOCALIZATION WORKFLOW • Translation naturally provides the translated source, using either Statistical or Neural Machine Translation • However, by-products and translation capabilities that are interesting in this context are: • Ability to normalize terminology • Pre-processing and enriching content prior to translation (tagging, conversion, ...) • Using the term analysis generated during the engine build • Example term variants (Slovak, as shown on the slide): Extrémne problémy / extrémne problémy / extrémne problémy / extrémnej problémy • refraktérnym mnohopočetným myelómom / refraktérnym mnohopočetným myelómom / refraktérnym mnohopočetným myelómom / žiaruvzdorné myelómom • je mladších • veľkosti nádoru / veľkosť nádoru / veľkosti nádoru / veľkosti nádoru
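Terminology normalization, the first by-product listed above, can be illustrated with a simple variant-to-preferred-term substitution. This is a minimal sketch, not how any particular MT engine does it: the `PREFERRED` map and the `normalize_terms` function are hypothetical, and the choice of the majority form from the slide as the preferred term is an assumption.

```python
import re

# Hypothetical map from observed variant (lowercase) to preferred term.
# Assumption: the form seen most often on the slide is the preferred one.
PREFERRED = {
    "žiaruvzdorné myelómom": "refraktérnym mnohopočetným myelómom",
    "extrémnej problémy": "extrémne problémy",
}


def normalize_terms(text: str, preferred: dict = PREFERRED) -> str:
    """Replace every known variant in `text` with its preferred term."""
    # Try longer variants first so a long phrase wins over a shorter overlap.
    variants = sorted(preferred, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in variants), re.IGNORECASE)
    # Look the match up in lowercase because the map keys are lowercase.
    return pattern.sub(lambda m: preferred[m.group(0).lower()], text)
```

For example, `normalize_terms("žiaruvzdorné myelómom")` returns `"refraktérnym mnohopočetným myelómom"`, while text containing no known variant is returned unchanged.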
  12. 12. JA-EN Sample Patent Translations; one is machine, one human • Pair 1: The coagulation time was determined as described above. / The setting time was determined as described above. • Pair 2: The lighting device also typically includes a light source disposed at the end of the light conductor. / The light device typically also includes a light source arranged at an end of the light guide. • Pair 3: Such communication between components is but one example of a unidirectional communication system. / Such communication between components is only one example of a one-way communication system. • Pair 4: The use of a hearing aid by a healthcare provider is routine. / The use of a stethoscope by health care providers is routine. • Pair 5: This can further enhance the electrical and long-term performance of the backsheet. / This may further increase the electrical properties and long-term performance of the backsheets. • Pair 6: Initial Binding measurements were performed as described above for Plaque Initial Binding measurements. / Initial bonding measurements were carried out as described above for Plaque Initial Bonding Measurements. • Pair 7: The subtractive color mixture selected may depend on the metalized surface area and the resistance material used. / The subtractive process selected can depend upon the metallized structured surface region and the resist material utilized.
  14. 14. A REAL-LIFE EXAMPLE APPLICATION • Example term (n-gram) extraction, taken from actual human translations. The n-gram variants show the suggested n-gram (in green on the slide), chosen by frequency, together with the other candidates that were found. “Distance” is an available parameter. • This process provides term variants, distance but also term relationships • The results can be used for different purposes, amongst others • Term normalization • Term suggestions for search • In conjunction with other meta data, domain identification • … • Extracted examples (English term: French variants as listed on the slide) • actual swirl speed: Vitesse de rotation réelle; la vitesse de turbulence réelle; vitesse réelle tourbillon; vitesse réelle de remous • high byte: octet haut; octet de poids fort; octet haut; byte élevé • non-freezing fluid: fluide antigel; fluide incongelable; sans gel fluide; fluide de non-congélation • dental spray: jet dentaire; pulvérisation dentaire; jet dentaire; jet dentaire
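The frequency-based suggestion and the "distance" parameter described above can be sketched as follows. The slide does not define the distance measure, so character-level Levenshtein edit distance is used here purely as an assumption; treating the four listed variants as four observed occurrences is also an assumption. The function names `levenshtein` and `rank_variants` are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rank_variants(observed: list) -> tuple:
    """Return the most frequent variant plus (variant, count, distance) rows."""
    counts = Counter(v.lower() for v in observed)
    suggested, _ = counts.most_common(1)[0]
    ranked = [(v, n, levenshtein(suggested, v)) for v, n in counts.most_common()]
    return suggested, ranked


# Variants listed on the slide for "dental spray".
suggested, ranked = rank_variants(
    ["jet dentaire", "pulvérisation dentaire", "jet dentaire", "jet dentaire"]
)
```

Here `suggested` is `"jet dentaire"` because it occurs three times, and each row in `ranked` carries the distance from that suggestion, which a search application could use to decide which variants to offer as query suggestions.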
  15. 15. A REAL-LIFE EXAMPLE APPLICATION (2) • WIPO Patentscope (Patent Research) uses this data extensively • WIPO Pearl is an example application • Many other examples exist in • eCommerce (Products, Brands, etc.) • Business Information (Names, Locations, etc.) • Scientific Research Platforms (Medical Terms, Chemical Compounds, Domain Identification, etc.) • … Source: http://www.wipo.int/wipopearl/search/linguisticSearch.html
  16. 16. A FEW KEY RECOMMENDATIONS 1. Take a holistic view of your workflow end to end 2. Work from the desired application result backwards 3. Review the data production and localization process, both the engine build and the production workflow. Ensure valuable meta data is not discarded: the localization team will have a very different view of the “value” of certain data elements than the team handling search or even the application 4. Keep in mind the enrichment capabilities of the localization workflow, ranging from entities and sentiment to the ability to manipulate data on the fly, call external data sources and subsequently “lock” the data in for localization
  17. 17. SUMMARY • The Machine Translation and associated Language Processing workflow provides a wealth of information that can support search • Understanding the interaction between content localization and search is critical to good search results and allows balancing precision and recall • With Machine Learning entering translation through Neural Machine Translation, a number of Machine Learning applications are enabled • Use the localization workflow to your advantage in a multi-lingual environment
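The precision and recall trade-off mentioned in the summary can be made concrete with a small calculation. The document identifiers and result sets below are invented for illustration only; the scenario assumed is a narrow query using one term versus a query expanded with term variants from the localization workflow.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical relevance judgements for one search task.
relevant_docs = {"d1", "d2", "d3"}

# Narrow query: a single term, finds only one of the relevant documents.
narrow = precision_recall({"d1"}, relevant_docs)

# Expanded query: term variants added, finds all relevant documents plus one miss.
expanded = precision_recall({"d1", "d2", "d3", "d4"}, relevant_docs)
```

In this toy case the narrow query has perfect precision but finds one third of the relevant documents, while the expanded query finds all of them at a precision of 0.75.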
  18. 18. Q & A
