Open-source Hebrew search



  1. Open-Source Hebrew Search. Itamar Syn-Hershko, SIGTRS Meetup, 22/7/2010, Jerusalem
  2. Introduction
     • The requirement to control masses of information
     • Manual tagging / categorization is no longer an option
     • Scanning text?
     • Using an inverted index: faster, flexible, relevance
     • Measuring a text-retrieval (TR) engine: relevance, precision, recall
     • The perfect search engine is language dependent
     • The perfect Hebrew search engine
     • Introducing: HebMorph
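Precision and recall, two of the measures listed on this slide, can be computed for a single query from the set of retrieved documents and the set of documents judged relevant. The following is a minimal sketch; the document IDs and judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical results and relevance judgments for one query.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(p, r)
```

Precision drops when the engine returns irrelevant documents; recall drops when it misses relevant ones.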
  3. How do search engines work?
     • Inverted index
     • Normalizations: Porter stemmer, s-stemmer, Soundex, etc.
     • Stemming, so that "looking", "looked" and "looker" all match "look", and a search for "book" also returns "books".
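To make the combination of an inverted index and term normalization concrete, here is a small sketch. The `s_stem` function is a deliberately naive English s-stemmer written for this example, not the Porter stemmer, and the sample documents are invented.

```python
from collections import defaultdict


def s_stem(word):
    """Naive s-stemmer: lowercase and strip one trailing 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_index(docs):
    """Map each normalized term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[s_stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Normalize the query the same way as the indexed terms, then look it up."""
    return index.get(s_stem(query), [])


docs = {1: "two books on search", 2: "a book about Hebrew", 3: "glass houses"}
index = build_index(docs)
print(search(index, "book"))
```

Because the same normalization runs at index time and at query time, "book" and "books" end up under one index entry.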
  4. The Challenge
  5. Token Ambiguity
     • With Niqqud, Hebrew is no different than any non-Semitic language
     • Niqqud-less spelling yields more than one possible meaning for almost any given word
     • English: Look, Luke; Wine, Whine; Stack, Stuck.
     • Hebrew: שָנִי, שֵנִי, שְנֵי, שֹנִי, שְנִי
     • Niqqud-less spelling: שני, שני, שני, שני, שני …
  6. Particle Separation
     • Hebrew words carry attached particles (prefixes and suffixes) for context
     • Without removing suffixes, relevant words might be skipped (for example: חבלה)
     • Without removing prefixes, relevant words will not be looked up at all
     • Ambiguity makes affix removal impossible in many cases
     • בית -> הבית, בבית, שבבית, לבית, והבית ...
     • הרכבת -> רותי פספסה את ה רכבת ("Ruti missed the train")
     • הרכבת המוצר מסובכת להפליא ("assembling the product is remarkably complicated")
     • כלבי -> ?
     • שבתו -> ?
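The prefix problem on this slide can be illustrated with a naive candidate generator. This is only a sketch under simplifying assumptions: it treats any leading run of the letters מ, ש, ה, ו, כ, ל, ב as a possible prefix, checks neither valid prefix combinations nor a dictionary, and ignores suffixes. The function name and the `min_stem` parameter are invented for the example.

```python
# Letters that can act as attached particles at the start of a Hebrew word.
PREFIX_LETTERS = set("משהוכלב")


def prefix_candidates(word, min_stem=2):
    """Return every (prefix, remainder) split whose prefix uses only particle letters.

    The unsplit word is always the first candidate, because the leading
    letters may simply belong to the word itself.
    """
    candidates = [("", word)]
    for i, letter in enumerate(word):
        remainder_len = len(word) - (i + 1)
        if letter not in PREFIX_LETTERS or remainder_len < min_stem:
            break
        candidates.append((word[: i + 1], word[i + 1 :]))
    return candidates


print(prefix_candidates("שבבית"))
print(prefix_candidates("הרכבת"))
```

For הרכבת the generator returns both the unsplit word and the split ה + רכבת, which is exactly the ambiguity shown in the two example sentences: string manipulation alone cannot tell which reading is intended.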
  7. Spelling Rules?
     • There is no commonly agreed set of rules for Niqqud-less spelling, unlike the one that exists for diacriticized Hebrew
     • Even agreed-upon spellings are not always widely used
     • Did you know the correct spelling for "mother" is "אימא"?
     • The same word can be spelled differently by different writers, or even by the same writer
     • שירות / שרות / שיירות
     • דוגמא / דוגמה
  8. !(Spelling Rules)
     • Most debates are over the spelling of nouns and loanwords, which have the greatest value in IR
     • An extra layer of ambiguity, where each author or user can choose whichever spelling they like
     • אחשורוש or אחשוורוש?
     • שבדיה or שוודיה?
     • טורקיה or תורכיה?
     • פריס or פריז? Or perhaps פאריז?
  9. Noise Reduction
     • Stop-word ambiguity
     • אשר, כדי, אף ...
     • Stop words as collocations
     • על ידי, אי פעם, אף על פי, שום דבר ...
     • Collocations in which the meaning of a single word changes
     • פי התהום
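One way to handle multi-word stop expressions is to match them as phrases before looking at single tokens. The sketch below assumes pre-tokenized input and a tiny hand-picked phrase list taken from this slide; it is an illustration, not HebMorph's stop-word handling.

```python
# Multi-word stop expressions, stored as tuples of tokens.
STOP_PHRASES = {
    ("על", "ידי"),
    ("אי", "פעם"),
    ("אף", "על", "פי"),
    ("שום", "דבר"),
}
MAX_LEN = max(len(phrase) for phrase in STOP_PHRASES)


def drop_stop_phrases(tokens):
    """Remove stop phrases, trying the longest possible match at each position."""
    kept, i = [], 0
    while i < len(tokens):
        for size in range(MAX_LEN, 1, -1):
            if tuple(tokens[i : i + size]) in STOP_PHRASES:
                i += size  # skip the whole phrase
                break
        else:
            kept.append(tokens[i])  # no phrase starts here; keep the token
            i += 1
    return kept


print(drop_stop_phrases(["נכתב", "על", "ידי", "רותי"]))
print(drop_stop_phrases(["פי", "התהום"]))
```

Matching whole phrases means פי is dropped only as part of אף על פי, and is kept in פי התהום, where it carries meaning.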
  10. Tokenization Challenges
      • Hebrew acronyms use the double-quote character (Gershayim), which most tokenizers treat as punctuation
      • The same goes for Geresh, which is used for abbreviations
      • Geresh is also used for חצ"ץ ג"ז
      • … and ambiguity again: אינצ'
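A tokenizer can keep these marks when they appear inside or at the end of a run of Hebrew letters. The regular expression below is a simplified sketch written for this example, not HebMorph's tokenizer; it accepts both the ASCII quote characters and the dedicated Unicode Gershayim (U+05F4) and Geresh (U+05F3).

```python
import re

HEB = "\u05D0-\u05EA"   # Hebrew letters, alef through tav
GERSHAYIM = '"\u05F4'   # ASCII double quote, Hebrew punctuation Gershayim
GERESH = "'\u05F3"      # ASCII apostrophe, Hebrew punctuation Geresh

# A Hebrew token is a run of letters, optionally followed by either
#   a Gershayim plus more letters (acronym), or
#   a Geresh plus zero or more letters (abbreviation or foreign sound).
# Anything else is matched as a run of Latin letters or digits.
TOKEN_RE = re.compile(
    f"[{HEB}]+(?:[{GERSHAYIM}][{HEB}]+|[{GERESH}][{HEB}]*)?|[A-Za-z0-9]+"
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


print(tokenize('צה"ל הודיע'))
print(tokenize("אינצ'"))
```

A double quote that merely closes a quotation is dropped, because no Hebrew letter follows it. A trailing apostrophe is always kept, so the sketch cannot tell a Geresh from a closing single quote; that is the remaining ambiguity the slide points to.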
  11. Common Texts
      • Various dialects may present out-of-vocabulary (OOV) cases, or change a meaning (חמר, חמרא), and hence require different handling
      • Each corpus might hold more than one dialect
      • Even partial Niqqud can help disambiguation
      • Niqqud-less spelling is the most common nowadays
  12. Ways of Resolution
  13. What to Index?
      • Deciding on an "indexing unit" is the cornerstone of any well-performing search engine
      • For Hebrew we have:
        • The original term (and possibly using wildcards?)
        • Hebrew triliteral root
        • Lemma (דלת ← דלתותינו)
        • Pseudo-lemma, stem
      • Considerations
  14. Hebrew NLP Methods
      • To analyze a Hebrew word, NLP tools have to be used:
        • Dictionary-based approach
        • Algorithmic approach
      • Comparison criteria include:
        • Morphological precision (handling of 4- and 5-letter roots, broken plurals, assimilation, etc.)
        • Handling of loanwords, names and slang
        • Tolerance of spelling differences
        • Disambiguation (error rate, POS, ranking)
  15. Dictionary vs Algorithm
      • Dictionaries are easier to build and maintain, but they need much more ongoing attention and coverage tests
      • Easier to support non-exact matches with an algorithm
      • Prerequisites and dependencies
      • Hand-crafted dictionaries with morphological information, and corpus-generated dictionaries with statistical data
  16. Lemma Disambiguation
      • In order to index a correct lemma, a good disambiguation process needs to be used
      • POS tools, grammatical or statistical, are the only reliable way to correctly eliminate false positives
      • Even with such tools, ambiguity may exist:
      • "המראה של מטוסים ריקים [...]" (המראה can be read as "takeoff" or as "the appearance")
      • "ראש הממשלה בבון" (בבון can be read as "in Bonn" or as "baboon")
  17. NLP-based Hebrew Text Retrieval
      • Filter lemmas based on their rank, morphological characteristics or statistical data
      • OOV cases can be saved as-is, have affixes removed from them, or be compared to a list of known words (e.g. names and addresses)
      • Removal of stop and noise words
      • Term expansion (Soundex, synonyms)
      • Save lemma to index (multiple lemmas at the same position)
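The last point, storing several lemmas at the same position, can be sketched as follows. The lexicon and stop-word list are hypothetical stand-ins for a real morphological analyzer, and the `analyze` function is invented for this example; it only shows how alternative lemmas can share one token position so that phrase queries still line up.

```python
# Hypothetical mini-lexicon: surface form -> candidate lemmas (illustration only).
LEXICON = {
    "הרכבת": ["רכבת", "הרכבה"],  # "the train" vs. "assembly of"
    "בבית": ["בית"],
}
STOP_WORDS = {"את", "של"}


def analyze(tokens):
    """Return (term, position) pairs; alternative lemmas share one position."""
    postings = []
    position = 0
    for token in tokens:
        if token in STOP_WORDS:
            continue  # dropped; this sketch does not reserve a position for it
        # Out-of-vocabulary tokens are saved as-is.
        for lemma in LEXICON.get(token, [token]):
            postings.append((lemma, position))
        position += 1
    return postings


print(analyze(["רותי", "פספסה", "את", "הרכבת"]))
```

Both candidate lemmas of הרכבת are indexed at position 2, so a query for either one matches the document without the indexer having to commit to a single reading.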
  18. Other Text Retrieval Methods
      • Is morphological analysis necessary?
      • Available methods:
        • Light-stemming
        • Word truncation
        • N-grams
        • Skipgrams
        • (Sub-types)
      • Require no extra overhead
      • Favorable, even when not superior
      • Disadvantages: larger index size, slower searches (for some)
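Three of the language-independent methods listed on this slide can be written in a few lines each. The functions below are a minimal sketch with invented names and default parameters; real systems differ in details such as padding and how short words are handled.

```python
def char_ngrams(word, n=4):
    """Overlapping character n-grams; words of length n or less are returned whole."""
    if len(word) <= n:
        return [word]
    return [word[i : i + n] for i in range(len(word) - n + 1)]


def truncate(word, length=4):
    """Word truncation: keep only the first `length` characters."""
    return word[:length]


def skip_bigrams(word, max_skip=1):
    """Character pairs with at most `max_skip` characters skipped between them."""
    pairs = []
    for i in range(len(word)):
        for gap in range(1, max_skip + 2):
            if i + gap < len(word):
                pairs.append(word[i] + word[i + gap])
    return pairs


print(char_ngrams("search"))
print(truncate("searching"))
print(skip_bigrams("abcd"))
```

Each word produces several index terms under n-grams and skipgrams, which is where the larger index size mentioned above comes from.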
  19. … Applied to Semitic Languages
      • Research has shown 4-grams and light stemmers ("light-10") to work better than morphological lemmatizers for Arabic
      • Apparently, good relevance can be achieved without "knowing" the language
      • Computers vs Humans
      • Lemmatization and disambiguation processes do make mistakes
      • Contextual processing can fail for short queries, producing incorrect searches
  20. The Best Retrieval Method for Hebrew Texts
      • Arabic and Hebrew share many morphological phenomena
      • … but they do differ
      • Without trying, we can never know
      • Where HebMorph comes in
  21. HebMorph's Approach
  22. HebMorph
      • … is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals.
      • 2 goals
      • Development is done with Lucene (why?)
      • MorphAnalyzer, Hebrew.SimpleAnalyzer (+ duality)
      • OpenRelevance
  23. Indexing Flow Chart
  24. Searching Wikipedia with BzReader and HebMorph
      • Source available from
  25. The Road Ahead
      • A better tokenizer
      • MorphAnalyzer:
        • Hspell improvements (coverage, lemma probabilities, prefix probabilities)
        • Tolerance guidelines
        • Smarter OOV handling
        • Better stop-word handling
      • Hebrew judgments for OpenRelevance with Orev
      • Comparing various approaches to Hebrew IR
      • Wide availability (Java port underway!)
      • Other uses (NLP, OCR, you name it)
  26. Join Us!
      • The more people join, the more feedback we get, and the better we become.
      • Our mailing list:
      • Code repository (released under GPLv2):
        • http://github.com/synhershko/HebMorph
      • Activity updates:
        • #HebMorph on Twitter
  27. Thank you!