Essential Elements of Excellent Multilingual Search


Boosting Search Quality with the Rosette Linguistics Platform

Published in: Technology

  1. 1. February 22, 2012
     Essential Elements of Excellent Multilingual Search
     Boosting Search Quality with the Rosette Linguistics Platform
     We put the World in the World Wide Web®
  2. 2. ABOUT BASIS TECHNOLOGY

     Basis Technology provides software solutions for text analytics, information retrieval, digital forensics, and identity resolution in over forty languages. Our Rosette® linguistics platform is a widely used suite of interoperable components that power search, business intelligence, e-discovery, social media monitoring, financial compliance, and other enterprise applications. Our linguistics team is at the forefront of applied natural language processing using a combination of statistical modeling, expert rules, and corpus-derived data. Our forensics team pioneers better, faster, and cheaper techniques to extract forensic evidence, keeping government and law enforcement ahead of exponential growth of data storage volumes.

     Software vendors, content providers, financial institutions, and government agencies worldwide rely on Basis Technology’s solutions for Unicode compliance, language identification, multilingual search, entity extraction, name indexing, and name translation. Our products and services are used by over 250 major firms, including Cisco, EMC, Exalead/Dassault Systems, Hewlett-Packard, Microsoft, Oracle, and Symantec. Our text analysis products are widely used in the U.S. defense and intelligence industry by such firms as CACI, Lockheed Martin, Northrop Grumman, SAIC, and SRI. We are the top provider of multilingual technology to web and e-commerce search engines, including Bing, Google, and Yahoo!.

     Company headquarters are in Cambridge, Massachusetts, with branch offices in San Francisco, Washington, London, and Tokyo. For more information,

     © 2012 Basis Technology Corporation. “Basis Technology”, “Geoscope”, “Odyssey Digital Forensics”, “Rosette”, and “We put the World in the World Wide Web” are registered trademarks of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2012-08-30)
  3. 3. ABSTRACT

     If search is vital to the competitive advantage or value proposition of your business, then this whitepaper is a must read. In the survival of the fittest, companies like Netflix and many businesses today thrive or die depending on whether they can:
     a) Provide “excellent search”: comprehensive, accurate, and relevant to the user
     b) Serve each user equally well in his or her preferred language

     The three essential elements of a business-dependable search solution on which to build applications or workflows to generate revenue are: (1) search customizability and the speed-scalability-cost trio, (2) high search quality, with emphasis on “recall” or maximizing the number of good results found, and (3) reliability and support. This paper will examine what these three elements mean, and examine Basis Technology’s Rosette linguistic platform as one possible solution.

     THE THREE ELEMENTS OF “EXCELLENT SEARCH”

     More and more, revenues and business longevity rely on cost-effective, accurate, and scalable search in English and other languages. E-commerce, job or real estate hunting sites, e-discovery/e-disclosure solutions, financial compliance solutions, online information providers, and media and entertainment portal sites all begin with users finding what they are looking for. Choose the right search and see revenues go up, satisfied customers increase, and a business well-positioned for the future. Choose the wrong search and costs constrict growth and poor relevancy chases away customers. Savvy businesses recognize that their customers, partners, suppliers, and employees increasingly create and consume content in their native language.

     For most enterprises, improving search quality means improving (1) recall, i.e., maximizing the number of relevant results found, and to a slightly lesser degree, (2) precision, i.e., maximizing the ratio of good to bad results returned.

     We’ll now examine each of these three elements of search in detail:
     • Customizability of search is critical to fit it to your particular use case or create the value add of your business. Speed, Scale, and Cost can quickly strangle the success of your business if search is struggling to maintain speed and scale, and choking profits with escalating per-query or per-document costs.
     • Maximizing Search Recall and Precision is the 50,000 pound gorilla. Speed is important, but if your users aren’t finding the results they want to see, it doesn’t matter how fast they get delivered. For enterprise applications and search engines, maximizing recall (without diminishing precision) is usually the greater concern. Good language support is an intrinsic ingredient for improved recall.
     • Reliability and Technical Support ensures that expert resources are available for solving any serious issues in the service.
  4. 4. ELEMENT 1: CUSTOMIZABILITY, SPEED, SCALE, COST

     The basic question a company has to ask is: Do we want/like/need/know how to bend technology to our business processes and business model in order to achieve competitive advantage? And most importantly: Is search part of that competitive advantage? If the answer is “yes,” then search is not a plug-in-a-solution problem, but a software development problem.

     Because search is such a broad-scope problem, there are all kinds of edge cases that are difficult to support using a proprietary, commercial search solution. The technology is not easily changed without expensive consultation and modifications to the core software that are dependent on the vendor’s product release cycle.

     For companies where better search means more money earned or saved, the frustration can increase geometrically because their use case may be unique, but the software isn’t extensible enough to deal with it, and they have to spend more and more for less and less.

     Like the three sides of a triangle, speed, scalability, and cost are inextricably linked. Assuming that the load on the search engine will only increase over time, search performance must be scalable to keep the query response times low and to ensure index updating times do not affect searches. At the same time, pricing models of commercial search engines based on number of queries or number of documents indexed will eventually constrict growth by sheer cost.

     ELEMENT 2: MAXIMIZING SEARCH RECALL AND PRECISION DEPENDS ON LANGUAGE SUPPORT

     Search is really all about finding, whether it’s a shopper on an e-commerce site, a banker checking a wire transfer request against a list of money launderers, or home buyers looking through real estate listings.

     The concepts of precision and recall measure how “good” search is. More precise results with greater recall mean better quality search, but frequently increasing one means a decrease in the other, so the trick is how to maximize each metric while minimizing the impact on the other. (See sidebar.)

     Let’s now look at the language-specific processing required to increase recall and/or precision for three categories of notoriously difficult languages to process: Chinese/Japanese/Korean, Germanic and Scandinavian languages, and Arabic. We’ll first look at one aspect of linguistic processing that improves recall in most languages, including English and European languages: lemmas and stems.

     Improving Recall via Lemmas and Stems

     Lemmatization and stemming are two methods of increasing the recall of search results. Stemming improves recall, but can diminish precision. Lemmatization is the preferred method since it increases recall while maintaining precision.

     Lemmatization, finding the dictionary form of a word, broadens search to add relevant search results in ways that stemming, which is more commonly available, cannot. A user searching for “President Obama speaking on healthcare” would likely also want results containing “President Obama spoke on healthcare” and “President Obama was speaking on healthcare.” By lemmatizing “spoke” and “was speaking” and storing the lemma “speak” in the search index, the latter two results will (correctly!) get picked up by a search for “President Obama speaking on healthcare.”
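The lemma-indexing idea above can be sketched in a few lines of Python. This is a minimal illustration and not Rosette's implementation: the `LEMMAS` dictionary, the documents, and all function names are hypothetical, and the lookup ignores part-of-speech, so it would wrongly map the noun "spoke" to "speak" as well.

```python
from collections import defaultdict

# Hypothetical toy lemma dictionary (assumption for illustration only).
# A real lemmatizer uses large dictionary data plus sentence context.
LEMMAS = {"speaking": "speak", "spoke": "speak", "was": "be"}


def lemmatize(token):
    """Return the dictionary form of a token, or the token itself if unknown."""
    return LEMMAS.get(token, token)


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    words = (w.strip('.,!?"').lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map each lemma to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[lemmatize(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every lemma in the query."""
    postings = [index.get(lemmatize(t), set()) for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()


DOCS = {
    1: "President Obama spoke on healthcare",
    2: "President Obama was speaking on healthcare",
    3: "President Obama visited Ohio",
}
INDEX = build_index(DOCS)
RESULTS = search(INDEX, "President Obama speaking on healthcare")
```

Because both the index and the query are reduced to lemmas, the query form "speaking" matches documents 1 and 2, while document 3 is left out.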
  5. 5. Stemming uses some basic rules to look for common stems in words. For example, “decod” is the stem of “decoding,” “decoder,” and “decodes.” However, no matter how many letters a stemmer chops off, “spoke” will never become “speak.” Lemmas are found using dictionary data and recognizing the context of a word. The lemma is very different when “spoke” is used as a noun or a verb. Stemming, on the other hand, applies a set of rules regardless of a word’s part-of-speech or context in a sentence.

     Since lemmas go back to the intrinsic meaning of a word, irrelevant results are kept to a minimum. On the other hand, stemming may sometimes produce unpredictable results. Stemming turns “several” into “sever,” and “arsenic” and “arsenal” share the same stem “arsen.” And, whereas “apples” to “apple” is reasonable, “Los Angeles” to “Los Angele” is not.

     Although stemming is more readily accessible as a set of language-specific rules than the dictionary-requiring lemmatization, the greater recall and precision from lemmatization is the difference between good and excellent search.

     Stemming vs. Lemmatization in a Nutshell

     Search Query | Traditional Stemming | Lemmatization using Rosette | Comparison
     animals | anim | animal | Two unrelated words may share a stem, in this case “anim”
     animated | anim | animate |
     arsenal | arsen | arsenal |
     arsenic | arsen | arsenic |
     several | sever | several | Stemming may have unintended consequences
     organization | organ | organization |
     children | children | child | Irregular verbs and nouns stump the stemmer
     spoke | spoke | speak (when used as past tense verb), spoke (when used as noun) |

     Improving Precision in Chinese, Japanese, and Korean

     Asian languages like Chinese, Japanese, and Korean are fundamentally more difficult to process than European languages because their words are not consistently separated by spaces. Chinese and Japanese use no spaces between words, and Korean uses some spaces, but not between every word, and inconsistently depending on the writer.

     For a search engine to build an index, it has to have words. N-gram and tokenization are methods used to produce indexable units in these languages, each resulting in different degrees of search precision.

     N-gram vs. Tokenization

     Some engines may use the n-gram technique to break up streams of text into overlapping units of 2 to 4 characters, but though recall will be high, precision will be low and indexes will be very large, impacting speed.

     Here is a Japanese example: 東京都の観光地 (translation: sightseeing spots in Tokyo)
  6. 6. Bigram | English translation
     東京 | Tokyo
     京都 | Kyoto
     都の | capital’s
     の観 | not a word
     観光 | sightseeing
     光地 | not a word

     A seven-character phrase produces six items to index via the n-gram technique, but also introduces a false word, “Kyoto.” Since many of the bigrams are not actual words in the original text, many other unrelated strings might accidentally match them, too, and a search becomes a statistical process: Do enough of these two-character segments appear in a given document to flag it as a “match,” while not accidentally matching the wrong strings?

     This approach also produces many more entries in the index than words, unnecessarily increasing its size. Processing a 100-character buffer (about 13-17 words) adds 99 bigram entries to the index.

     A better choice is true tokenization based on morphological analysis [1], which breaks up text into real words. With tokenization of the example above, there are at most three items to index, and possibly only two if the possessive marker is treated as a “stop word” [2].

     Words | English translation
     東京都 | Tokyo
     の | (possessive marker)
     観光地 | Sightseeing spots

     Character Normalization to Improve Recall

     Chinese, Japanese, and Korean have additional character normalization needs beyond making sure ASCII characters are all uppercase or lowercase. In digital files, these languages also use a full-width variant of ASCII letters and punctuation, which needs to be normalized to the half-width ASCII form.

     ＡＳＣＩＩｗｏｒｄｓ％＄＆ → ASCIIwords%$&

     In the case of Japanese, a half-width version of Japanese katakana characters also needs to be normalized to the usual full-width katakana. The half-width version was invented in the early days of computer processing to reduce data size, but it is still found in documents and webpages.

     ｶﾀｶﾅ → カタカナ

     [1] Morphological analysis in linguistics examines the smallest units of meaning within a word, called “morphemes,” such as “un-” in “unremarkable.”
     [2] A “stop word” is one which occurs so frequently within a language as to not help in finding search results. Some search engines may choose to ignore them.
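The bigram and width-normalization steps on this slide can be reproduced with the Python standard library. This is a sketch under stated assumptions: the function names are made up for illustration, and Unicode NFKC normalization is swapped in as a generic stand-in for a linguistics platform's own normalizer. NFKC folds full-width ASCII and half-width katakana as described, but it also rewrites other compatibility characters, which may or may not be wanted in a given index.

```python
import unicodedata


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of `text` (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def normalize_width(text):
    """Fold full-width ASCII and half-width katakana to their usual forms.

    Assumption: NFKC is used here as a generic substitute for a
    product-specific normalizer.
    """
    return unicodedata.normalize("NFKC", text)


PHRASE = "東京都の観光地"
BIGRAMS = char_ngrams(PHRASE)  # includes the false word 京都 (Kyoto)

FULLWIDTH_ASCII = "ＡＳＣＩＩｗｏｒｄｓ％＄＆"
HALFWIDTH_KATAKANA = "ｶﾀｶﾅ"
NORMALIZED_ASCII = normalize_width(FULLWIDTH_ASCII)
NORMALIZED_KATAKANA = normalize_width(HALFWIDTH_KATAKANA)
```

Applying the same normalization to documents at index time and to queries at search time is what lets the two width variants match each other.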
  7. 7. Improving Recall with Pan-Chinese Search

     Chinese search gains an enormous boost in recall if both the simplified and traditional Chinese scripts are searched. Mainland China and Singapore use simplified Chinese, whereas Taiwan and Hong Kong use traditional Chinese. Chinese speakers expect one query in one script to find results from both scripts.

     Pan-Chinese search (searching documents in both Chinese scripts) is accomplished by converting all the text of one script into the other at query and index time.

     The differences between the two scripts fall into three categories:

     Category | Simplified Chinese | Traditional Chinese | English Translation
     Same character used in both scripts; this is true where the character was simple to start with | 大 | 大 | big
     Two characters with same pronunciation (but different base meaning), collapsed to one character in simplified Chinese | 头发 | 頭髮 | hair
     (same category, second example) | 出发 | 出發 | emit
     Different vocabulary (analogous to American and British English differences such as “truck” vs. “lorry”); occurs in mostly modern words | 计算机 | 電腦 | computer

     In the first case, the change is trivial, and amounts to making sure all the text is in the same encoding; however, the second and third categories require dictionary data to ensure that the conversion is context-sensitive, so that the correct traditional character is chosen.

     Improving Recall in Germanic and Scandinavian Languages

     Linguistic processing required by German, Dutch, Scandinavian languages (Danish, Norwegian, Swedish, and Finnish), and Korean is very similar in that these languages freely use compound words, which frequently need to be broken up to increase recall.

     Take these two German compound words:

     German compound | German decompounded | English translation
     Jugendarbeitslosigkeit | Jugend + arbeitslosigkeit | Youth unemployment (Youth + unemployment)
     Samstagmorgen | Samstag + morgen | Saturday morning (Saturday + morning)

     It’s reasonable to imagine that a person looking for information about youth unemployment in German would also welcome search results which included “youth” and “unemployment” as separate words. Similarly, the concepts of “Saturday” and “morning” are very likely to appear as a compound word or as separate words in related documents.

     Decompounding increases search recall with minimal impact on precision, but again, requires dictionary data to do properly.

     Additionally, to maximize recall, character normalization for plurals is needed, so that “Garten” (garden, singular form) and its plural form “Gärten” match each other.
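Dictionary-driven decompounding, as described for the German examples above, can be sketched as a recursive longest-prefix split. The four-word `LEXICON` and the function name are hypothetical, and the sketch deliberately ignores the linking elements that many German compounds insert between parts; it is meant only to show why dictionary data is needed, not how any particular product does it.

```python
# Hypothetical toy lexicon (assumption); real decompounders rely on
# full dictionaries and handle linking elements between parts.
LEXICON = {"jugend", "arbeitslosigkeit", "samstag", "morgen"}


def decompound(word, lexicon=LEXICON, min_len=3):
    """Split `word` into lexicon entries, or return [word] if no full split exists."""
    def split(rest):
        if not rest:
            return []  # consumed the whole word: successful split
        # Try the longest known prefix first, then shorter ones.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no way to cover the remaining characters

    parts = split(word.lower())
    return parts if parts else [word]


YOUTH_UNEMPLOYMENT = decompound("Jugendarbeitslosigkeit")
SATURDAY_MORNING = decompound("Samstagmorgen")
UNKNOWN = decompound("Haus")  # not in the toy lexicon, returned unchanged
```

An indexer following this approach would typically store both the original compound and its parts, so that queries for either form find the document.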
  8. 8. Improving Recall in Arabic

     Arabic is one language which especially suffers from low recall if a search engine does not perform Arabic-specific linguistic processing, starting from the basic character normalization up to stemming of proper nouns and lemmatization of common nouns. Arabic attaches so many affixes (to the beginning, end, and middle of words) that without lemmatizing before searching, many relevant results are never found. One root could form the basis for up to fifteen different verbs!

     Character Normalization to Improve Precision and Recall

     In English, search engines perform some basic normalization such as lowercasing all the words. In Arabic, normalization is much more complex and will improve both precision and recall. Types of character normalization required include:
     • Words with additional vocalization marks such as: ‫ﺳﻲـ‬ vs. ‫ﻲـﺳﺎﯾـﺳ‬ ‫ﺳـ ﯾﺎ‬
     • Words containing certain letters with dots added or removed such as: ‫ رـ ـﻗﯾﻪ ـ‬vs. ‫رـ ـﻗﯾﺔ ـ‬
     • Words, including ambiguous cases, containing certain letters with symbols added or removed such as: ‫ٱدﺎـﻣﺗﻋ‬ vs. ‫ادﺎـﻣﺗﻋ‬ ‫ادؤود‬ vs. ‫ادوود‬ ‫اﻣـﺛ‬ vs. (‫ آﻣـﺛ‬or ‫إﻣـﺛ‬ or ‫)أﻣـﺛ‬

     Improving Recall with Lemmatization and Stemming

     In Arabic, as in other languages, lemmatization increases the recall of search results (cf. “Improving Recall via Lemmas and Stems”), but with a twist. In Arabic, lemmatization is only applicable to verbs and common nouns (e.g., apple, book, table) and is not applicable to proper nouns (e.g., names of people and places such as Abul-qassem El-Chabby, Baqah Al-Sharqiyyah).

     Proper nouns require stemming to remove prepositions and conjunctions which are often attached to them. Searches for names of people, places, and organizations are seriously hampered without stemming. For example, these phrases in English appear as one word in Arabic: “for Othman,” “with Othman,” “as Othman,” and “and Othman.” Thus a search for just “Othman” would not find the above variations without stemming.

     Additionally, since Arabic names are frequently the same words as common nouns, part-of-speech tagging becomes critical to differentiating between nouns and proper nouns, particularly when a word’s part-of-speech varies depending on where it appears in a sentence.

     Search Recall and Precision Enhancers for Many Languages

     For English and most European languages, better recall and precision means:
     • Adding lemmas (the dictionary form of a word) to the search index to increase recall, while minimizing the negative impact on precision (cf. previous section “Improving Recall via Lemmas and Stems”)
     • Boosting the ranking of relevant documents on the results page through the use of document metadata such as entities.
     • Increasing recall through comprehensive name search that goes beyond the “Did you mean?” functionality of most search engines.
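Returning to the Arabic proper-noun case on this slide (“for Othman,” “with Othman,” “as Othman,” “and Othman”), the stemming step can be sketched as stripping a single attached particle when the remainder is a known name. Everything here is an illustrative assumption: the `KNOWN_NAMES` set, the function name, and the restriction to four one-letter particles. Production systems use part-of-speech tagging and far richer data to decide when a leading letter is really a particle.

```python
# One-letter Arabic particles that are written attached to the next word.
PROCLITICS = {"و": "and", "ل": "for", "ب": "with", "ك": "as"}

# Hypothetical list of known names (assumption for illustration).
KNOWN_NAMES = {"عثمان"}  # Othman


def strip_proclitic(token, names=KNOWN_NAMES):
    """Remove one leading particle if what remains is a known name."""
    if token in names:
        return token
    if len(token) > 1 and token[0] in PROCLITICS and token[1:] in names:
        return token[1:]
    return token  # leave everything else untouched


VARIANTS = ["لعثمان", "بعثمان", "كعثمان", "وعثمان"]
STEMS = [strip_proclitic(t) for t in VARIANTS]

# "كتاب" (book) starts with a particle letter but is not particle + name,
# so the dictionary check leaves it alone.
BOOK = strip_proclitic("كتاب")
```

Checking the remainder against name data is the design choice that matters: blindly removing the first letter would damage ordinary words that merely begin with one of these letters.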
  9. 9. Boosting Relevant Results

     Once the search engine is finding more good results (better recall) and returning fewer bad results (better precision), the good results need to be at the top of the search results.

     Frequently the most critical words in a search have to do with entities: proper nouns such as names of people, places, and organizations. But when is “Christian” a religious affiliation and when is it part of a person’s name, as in “Christian Dior”? A good entity extractor will know. At indexing time, entities can be added to a document record’s metadata to boost the rank of documents which have entities matching those in the query.

     Comprehensive Name Search to Improve Recall

     The “Did you mean?” function of most modern search engines will find actress “Cate Blanchett” even if the user types “Kate.” However, for less famous people, and to cover a gamut of name variations, a true name matching function is needed to handle “Chuck Berry” vs. “Charles Berry”; “John Kearns” vs. “Jon Cairns”; or even “Baqah Al-Sharqiyyah” and “باقة الشرقية.”

     The Rosette Option for Language Support

     Rosette is a software development kit (SDK) designed for a wide range of large-scale applications that need to identify, classify, analyze, index, and search unstructured text from various sources, and its linguistic analysis capabilities are widely used by search engines to increase both precision and recall. It uses a combination of statistical models, dictionary data, and sophisticated computational linguistics to parse digital text in English and over 25 major European, Asian, and Middle Eastern languages. Years of development and linguistic analysis have gone into the development of Rosette to satisfy the most demanding customers for quality, robustness, and speed.

     Nearly every major web and enterprise search engine since 1999, including Bing, Endeca, goo (Japanese), Google, Microsoft/FAST, and Yahoo!, has used Rosette. New generations of search-based applications for e-discovery, financial compliance, and other fields also use Rosette.

     Rosette is a cross-platform SDK available for Windows and Unix, and offers all its capabilities via a single API, in C, C++, Java, or .NET.

     Rosette Capabilities
     • Language Identification: identification of the primary language of a document (or language regions within a multilingual document) and the file’s encoding in 55 languages and 45 encodings.
     • Unicode Conversion: converts documents in legacy encodings to Unicode.
     • Character Normalization: normalizes characters to a single representation (e.g., character + diacritic to single character with diacritics, half/full-size variants of ASCII characters in Asian languages, and several language-specific normalizations).
     • Linguistic Analysis: lemmatization, tokenization, part-of-speech tagging, decompounding, and more in over 25 languages.
     • Entity Extraction: extracts entities such as people, places, and organizations in 15 languages via statistical models with customizable user-defined entities via regular expressions and entity databases.
     • Name Matching: returns matching names despite spelling variations, initials, nicknames, missing name components, missing spaces, out-of-order name components, the same name written in different languages, and more. Supported in multiple languages including Arabic, Chinese, English, Korean, and Persian.
  10. 10. • Name Translation: translates names from non-Latin script languages to English and standardizes already translated names. Supported in multiple languages including Arabic, Chinese, English, Korean, Russian, and Persian.

     Lemmatization and Stemming
     Needing to support a variety of languages can stretch the capabilities of out-of-the-box search solutions, requiring additional effort to custom-configure the handling of each language and tune its accuracy one by one.

     Chinese, Japanese, and Korean Search
     Normalization of Chinese, Japanese, and Korean characters is done by the Rosette Core Library for Unicode (RCLU):
     • For Chinese, Japanese, and Korean: Converting full-width ASCII characters (ＡＳＤＦ＆％＄) to the usual half-width characters (ASDF&%$).
     • For Japanese: Converting half-width Japanese katakana characters (ｱｲｳｴｵ; one-byte characters invented in the early days of Japanese computing to save on storage) to the usual full-width characters (アイウエオ).

     Pan-Chinese Search
     Rosette’s Chinese script converter can convert text into either simplified or traditional Chinese.

     Germanic and Scandinavian Languages
     Character normalization of plurals is part of the lemmatization of German within Rosette.

     Arabic
     Rosette provides part-of-speech tagging and lemmatization for Arabic.
  11. 11. Boosting Relevant Results with Entities
     Rosette’s entity extractor can push results higher in the rankings.

     Comprehensive Name Search
     Rosette’s name matching functionality handles name variations due to nicknames, initials, missing name components, missing spaces between names, out-of-order name components, the same name represented in different languages, and more. Thus, besides handling “Robert” vs. “Bob,” and “Abdul Rasheed” vs. “Abd-al-Rasheed” vs. “Abd Ar-Rashid,” Rosette also matches “Mao Tse-Tung”, “Mao Zedong”, and “毛泽东” (simplified Chinese).

     PRECISION VS. RECALL VS. F-SCORE IN A NUTSHELL

     Precision and recall are metrics used to evaluate the quality of search results, whether searching for articles on a topic, or finding name matches.

     Suppose you are searching for red balls from a box that contains 7 red balls and 8 green balls. Blindfolded, you pull out 8 balls, of which 4 are red and 4 are green.

     Precision is the number of correct items over the total number of items found. That is, you found 4 red (which you wanted) and pulled out 8 balls in total, thus precision is 4/8 (half of your results were correct) or 50%.

     Recall asks the question: “Of all the correct items, how many did I find?” In this case, there were 7 correct items (because there are 7 red balls) and you found 4, thus your recall is 4/7 or 57%.

     F-score is a measure that attempts to balance precision and recall and is often called the “relevancy” of a system.

     F-score = 2*(Precision*Recall) ÷ (Precision+Recall)

     Thus our F-score above = 2*(4/8*4/7) ÷ (4/8+4/7) = 8/15 = .53, or 53%
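The red-ball arithmetic in the sidebar can be checked with a short Python sketch. The function and variable names are illustrative assumptions; exact fractions are used so the result can be compared directly with the 8/15 given in the text, and the functions do not guard against zero denominators.

```python
from fractions import Fraction


def precision(correct_found, total_found):
    """Share of the retrieved items that are correct."""
    return Fraction(correct_found, total_found)


def recall(correct_found, total_correct):
    """Share of all correct items that were retrieved."""
    return Fraction(correct_found, total_correct)


def f_score(p, r):
    """Harmonic mean of precision and recall: 2*(P*R) / (P+R)."""
    return 2 * (p * r) / (p + r)


# Box with 7 red and 8 green balls; 8 balls drawn, 4 of them red.
P = precision(correct_found=4, total_found=8)
R = recall(correct_found=4, total_correct=7)
F = f_score(P, R)
F_PERCENT = round(float(F) * 100)
```

Here `P` is 1/2, `R` is 4/7, and `F` is 8/15, which rounds to the 53% quoted above.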
  12. 12. ELEMENT 3: RELIABILITY AND SUPPORT

     Even in a technologically sophisticated company, the core business logic will be the main focus of all business operations, so technical support as a safety net is essential. Just as firms rely on calling the vendor when the copier is acting up, a few judicious pieces of advice from search or linguistics experts to settle a nettlesome problem lend assurance to both engineers and upper management.

     On the language support side, the wide language coverage of Basis Technology’s Rosette (over 25 languages covering Asia, Europe, and the Middle East) ensures there will be one point of contact to address questions about any language, instead of a multitude of single-language vendors with varying support contracts. Rosette developers are on the front lines, providing high quality support to customer inquiries.

     Additionally, as a single platform, Rosette’s benchmarks for speed and accuracy are readily available, and implementing one language or over 25 languages is the same amount of work.
  13. 13. SUMMARY & CONCLUSION: ROSETTE LINGUISTICS PLATFORM

     For companies that derive competitive advantage from a better quality or better customized search, search is a development problem, and not a “plug-in-a-solution” problem, and the four search engine essentials are:
     1. Customizability: the ability to innovate on top of the chosen search engine to build value-added features
     2. Speed: the search engine’s query response rates and indexing and updating times
     3. Scalability: the ability to easily handle increased number of queries or volumes of content requiring indexing
     4. Cost: total cost of ownership; return on investment; scaling of cost to search needs

     The Rosette platform fills that gap, through solid linguistic analysis technology and comprehensive dictionary data to perform:
     • Language Identification: The language identifier detects 55 languages with speed and accuracy, having been trained on gigabytes of hand-verified data. It covers a broad range of Asian, Indo-European, and Middle Eastern languages.
     • Linguistic Analysis: Rosette performs a complete linguistic analysis to improve search recall and precision for some of the most linguistically difficult languages.
       ○ Lemmatization: for boosting recall while maintaining precision in many languages
       ○ Tokenization: for Chinese, Japanese, and Korean, which do not have spaces between words
       ○ Chinese script conversion: for pan-Chinese search
       ○ Character normalization: specialized transforms for Arabic and Asian languages
       ○ Decompounding: for Germanic and Scandinavian languages, which form words from multiple words
       ○ Part-of-speech tagging: for context-sensitive lemmatization, and particularly for Arabic search to distinguish between common nouns, requiring lemmatization, and proper nouns, requiring stemming
     • Entity Extraction: To locate names, places, organizations, and other entities, both to boost ranking of results whose entities match the search query and as a base for building faceted search. Rosette plugs into the UpdateRequestProcessor at indexing to populate each document record with entities.
     • Name Matching: Going beyond “Did you mean?”, Rosette finds matching names that differ due to nicknames, initials, missing name components, missing spaces between names, out-of-order name components, the same name represented in different languages, and more.

     NEXT STEPS
     • Request a free product evaluation of Rosette at: -request.html