Open-source Hebrew search
  • 1. Open-Source Hebrew Search Itamar Syn-Hershko SIGTRS Meetup 22/7/2010, Jerusalem
  • 2. Introduction
    • The requirement to control masses of information
    • Manual tagging / categorization is no longer an option
    • Scanning text?
    • Using an inverted index: faster, flexible, relevance
    • Measuring a text retrieval (TR) engine: relevance, precision, recall
    • The perfect search engine is language-dependent
    • The perfect Hebrew search engine
    • Introducing: HebMorph
    Open-Source Hebrew Search: Introduction
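
The precision and recall measures named on this slide can be sketched for a single query as follows. This is a minimal illustration; the function name and the document IDs are made up for the example and do not come from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 documents actually relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```

Here two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).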
  • 3. How do search engines work?
    • Inverted index
    • Normalizations: Porter stemmer, s-stemmer, Soundex etc.
    • Stemming: “looking”, “looked” and “looker” all reduce to “look”, so a search for “book” also returns “books”.
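
A minimal sketch of the idea on this slide: an inverted index whose terms are normalized by a very crude s-stemmer, so that "book" and "books" land on the same posting list. The stemmer rule, the function names and the sample documents are assumptions for illustration, not code from HebMorph or Lucene.

```python
from collections import defaultdict


def s_stem(word):
    """Crude s-stemmer: strip one trailing plural 's' (illustrative only)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_index(docs):
    """Map each normalized term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[s_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Normalize the query the same way as the indexed text, then look it up."""
    return sorted(index.get(s_stem(query.lower()), set()))


docs = {1: "two books on search", 2: "a book about Hebrew", 3: "glass houses"}
index = build_index(docs)
```

With this index, `search(index, "book")` and `search(index, "books")` both return documents 1 and 2, because the same normalization is applied at index time and at query time.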
  • 4. The Challenge Open-Source Hebrew Search
  • 5. Token Ambiguity
    • With Niqqud, Hebrew is no more ambiguous than any non-Semitic language
    • Niqqud-less spelling gives almost every word more than one possible reading
    • English: Look, Luke; Wine, Whine; Stack, Stuck.
    • Hebrew: שָנִי , שֵנִי , שְנֵי , שֹנִי , שְנִי
    • Niqqud-less spelling: שני , שני , שני , שני , שני …
    Open-Source Hebrew Search: The Challenge
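
The collapse shown on this slide can be reproduced with a short sketch: once the vowel points are removed, the five pointed words become one string. Stripping by Unicode category is one possible approach chosen for this example, not something the talk prescribes.

```python
import unicodedata


def strip_niqqud(text):
    """Drop combining marks (Unicode category Mn), which includes Hebrew vowel points."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# The five pointed forms from the slide.
pointed = ["שָנִי", "שֵנִי", "שְנֵי", "שֹנִי", "שְנִי"]

# After stripping, all five collapse into a single unpointed spelling.
unpointed = {strip_niqqud(word) for word in pointed}
```

The set `unpointed` ends up with a single member, which is exactly the ambiguity a Hebrew search engine has to deal with at index and query time.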
  • 6. Particles Separation
    • Hebrew words take particles (prefixes and suffixes) that carry context
    • Without removing suffixes, relevant words might be skipped (for example: חבלה )
    • Without removing prefixes, relevant words will not be looked up at all
    • Ambiguity makes affix removal impossible in many cases
    • בית -> הבית , בבית , שבבית , לבית , והבית ...
    • הרכבת -> רותי פספסה את ה רכבת ("Ruti missed the train")
    • הרכבת המוצר מסובכת להפליא ("Assembling the product is amazingly complicated")
    • כלבי -> ?
    • שבתו -> ?
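
The prefix problem above can be sketched as enumerating every way to peel prefix letters off a token and keeping the splits whose remainder is a known word. The tiny lexicon is hypothetical, and the sketch deliberately ignores which prefix combinations are actually legal in Hebrew.

```python
# The seven letters that can act as prefixes.
PREFIX_LETTERS = set("משהוכלב")

# Hypothetical mini-lexicon, just large enough for the slide's examples.
LEXICON = {"בית", "רכבת", "הרכבת", "כלבי", "לבי"}


def candidate_splits(token, lexicon=LEXICON):
    """Return every (prefix, remainder) pair whose remainder is in the lexicon."""
    results = []
    for i in range(len(token)):
        prefix, rest = token[:i], token[i:]
        if any(ch not in PREFIX_LETTERS for ch in prefix):
            break  # a non-prefix letter ends the run of possible prefixes
        if rest in lexicon:
            results.append((prefix, rest))
    return results
```

For שבבית this yields a single split (prefix שב, word בית), but הרכבת and כלבי each come back with two candidates. Nothing in the token itself says which one the writer meant, which is the ambiguity the slide describes.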
  • 7. Spelling Rules?
    • There are no commonly agreed rules for Niqqud-less spelling, unlike the rules that exist for diacriticized Hebrew
    • Even spellings that are agreed upon are not always widely used
    • Did you know the correct spelling for “mother” is “אימא”?
    • The same word can be spelled differently by different writers, or even by the same writer
    • שירות / שרות / שיירות
    • דוגמא / דוגמה
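
One way to tolerate variation like the pairs above is to fold each word into a looser lookup key. The sketch below is a deliberately crude heuristic invented for this example (it is not HebMorph's method): it drops the letters י and ו from the middle of the word and treats a final א as ה.

```python
def tolerant_key(word):
    """Fold common Niqqud-less spelling variants into one key (crude heuristic)."""
    if len(word) < 3:
        return word
    first, middle, last = word[0], word[1:-1], word[-1]
    if last == "א":
        last = "ה"  # treat a final alef like a final he: דוגמא / דוגמה
    # Drop optional vowel letters from the middle of the word.
    middle = middle.replace("י", "").replace("ו", "")
    return first + middle + last
```

All three spellings שירות / שרות / שיירות map to the same key, as do דוגמא / דוגמה. The price is precision: unrelated words that differ only in these letters are conflated too, and variants that swap consonants rather than vowel letters are not unified at all.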
  • 8. !(Spelling Rules)
    • Most debates are over spelling of nouns and loanwords, which have the greatest value in IR
    • An extra layer of ambiguity, where each author or user can choose the spelling he likes
    • אחשורוש או אחשוורוש ?
    • שבדיה או שוודיה ?
    • טורקיה או תורכיה ?
    • פריס או פריז ? או אולי פאריז ?
  • 9. Noise Reduction
    • Stop words ambiguity
    • אשר , כדי , אף ...
    • Stop words as collocations
    • על ידי , אי פעם , אף על פי , שום דבר ...
    • Collocations in which the meaning of a single word changes
    • פי התהום
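
A sketch of treating multi-word stop phrases as units: phrases are matched first, longest first, and single stop words are dropped only afterwards. The word lists are copied from the slide; the function name and matching strategy are assumptions made for this example.

```python
STOP_PHRASES = [("אף", "על", "פי"), ("על", "ידי"), ("אי", "פעם"), ("שום", "דבר")]
STOP_WORDS = {"אשר", "כדי", "אף"}


def remove_noise(tokens):
    """Drop stop phrases (longest match first), then single stop words."""
    phrases = sorted(STOP_PHRASES, key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                i += len(phrase)  # skip the whole phrase
                break
        else:
            # No phrase starts here; fall back to the single-word list.
            if tokens[i] not in STOP_WORDS:
                out.append(tokens[i])
            i += 1
    return out
```

Because על and פי are only removed as part of a listed phrase, a string such as פי התהום passes through untouched. The single-word list stays risky, though: אף is dropped even where it is a content word.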
  • 10. Tokenization Challenges
    • Hebrew acronyms use the double-quote character, which most tokenizers treat as punctuation
    • The same goes for Geresh, which is used for abbreviations
    • Geresh is also used with certain letters to write sounds foreign to Hebrew ( חצ " ץ ג " ז )
    • … and ambiguity again: אינצ '
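
A minimal regex tokenizer sketch that keeps a double quote or Geresh inside a Hebrew token instead of splitting on it. The pattern is an assumption written for this example, not HebMorph's tokenizer, and it only handles Hebrew letters.

```python
import re

# Hebrew letters (alef..tav), optionally joined by an inner quote mark
# (", ', or the dedicated Geresh/Gershayim code points), optionally ending in a Geresh.
TOKEN = re.compile(
    "[\u05D0-\u05EA]+(?:[\"'\u05F3\u05F4][\u05D0-\u05EA]+)*['\u05F3]?"
)


def tokenize(text):
    """Return Hebrew tokens, keeping acronym and abbreviation marks attached."""
    return TOKEN.findall(text)
```

Acronyms such as צה"ל and abbreviations ending in a Geresh survive as single tokens. The ambiguity from the slide shows up immediately, though: a word wrapped in single quotes keeps its closing quote, because the pattern cannot tell a quotation mark from a Geresh.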
  • 11. Common Texts
    • Various dialects may present OOV cases, or change a meaning ( חמר , חמרא ), hence require different handling
    • Each corpus might hold more than one dialect
    • Even partial Niqqud can help disambiguation
    • Niqqud-less spelling is the most common nowadays
  • 12. Ways of Resolution Open-Source Hebrew Search
  • 13. What to Index?
    • Deciding on an “indexing unit” is the cornerstone of any well-performing search engine
    • For Hebrew we have:
      • The original term (and possibly using wildcards?)
      • Hebrew triliteral root
      • Lemma ( דלת ← דלתותינו )
      • Pseudo-lemma, Stem
    • Considerations
    Open-Source Hebrew Search: Ways of Resolution
  • 14. Hebrew NLP Methods
    • To analyze a Hebrew word, NLP tools have to be used:
      • Dictionary-based approach
      • Algorithmic approach
    • Comparison criteria include:
      • Morphological precision (handling of 4- and 5-letter roots, broken plurals, assimilation, etc.)
      • Handling of loanwords, names and slang
      • Toleration of spelling differences
      • Disambiguation (error rate, POS, ranking)
  • 15. Dictionary vs Algorithm
    • Dictionaries are easier to build and maintain, but they need much more on-going attention and coverage tests
    • Easier to support non-exact matches with an algorithm
    • Prerequisites and dependencies
    • Hand-crafted dictionaries with morphological information, and corpora generated dictionaries with statistical data
  • 16. Lemma Disambiguation
    • In order to index a correct lemma, a good disambiguation process needs to be used
    • POS tools, grammatical or statistical, are the only reliable way to eliminate false positives
    • Even with such tools, ambiguity may exist:
    • " המראה של מטוסים ריקים [...]"
    • " ראש הממשלה בבון "
  • 17. NLP-based Hebrew Text Retrieval
    • Filter lemmas based on their rank, morphological characteristics or statistical data
    • OOV cases can be saved as-is, have affixes removed from them, or be compared to a list of known words (e.g. names and addresses)
    • Removal of stop and noise words
    • Term expansion (soundex, synonyms)
    • Save lemma to index (multiple lemmas at the same position)
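
The last step above, saving several lemmas at the same position, can be sketched as follows. The lemma table stands in for a real morphological analyzer and is entirely hypothetical; keeping a position gap where a stop word was removed is a choice made for this sketch.

```python
# Hypothetical analyzer output: surface form -> candidate lemmas.
LEMMAS = {"פספסה": ["פספס"], "הרכבת": ["רכבת", "הרכבה"]}
STOP_WORDS = {"את"}


def analyze(tokens, lemmas=LEMMAS, stop_words=STOP_WORDS):
    """Return (position, term) pairs; ambiguous tokens emit several terms per position."""
    postings = []
    for position, token in enumerate(tokens):
        if token in stop_words:
            continue  # removed, but its position is still counted
        # OOV tokens (not in the lemma table) are stored as-is.
        for lemma in lemmas.get(token, [token]):
            postings.append((position, lemma))
    return postings


postings = analyze("רותי פספסה את הרכבת".split())
```

Both readings of הרכבת are stored at position 3, so a query for either lemma matches the sentence while phrase and proximity queries still see one token at that position.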
  • 18. Other Text Retrieval Methods
    • Is morphological analysis necessary?
    • Available methods:
      • Light-stemming
      • Word truncation
      • N-grams
      • Skipgrams
      • (Sub-types)
    • Require no extra overhead
    • Favorable, even when not superior
    • Disadvantages: larger index size, slower searches (for some)
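
Three of the language-agnostic methods listed above fit in a few lines each. The function names, default sizes and the particular skipgram variant (character pairs with a fixed gap) are assumptions made for this sketch.

```python
def truncate(word, n=4):
    """Word truncation: index only the first n characters."""
    return word[:n]


def char_ngrams(word, n=4):
    """Overlapping character n-grams; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def skipgrams(word, skip=1):
    """Character pairs that skip exactly `skip` characters between them."""
    return [word[i] + word[i + skip + 1] for i in range(len(word) - skip - 1)]
```

None of these needs a dictionary or any knowledge of the language, which is the "no extra overhead" point. The n-gram variants emit several terms per word, which is where the larger index size comes from.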
  • 19. … Applied on Semitic Languages
    • Research has shown 4-grams and light stemmers (“light-10”) to work better than morphological lemmatizers for Arabic
    • Apparently, good relevance can be achieved without ‘knowing’ the language
    • Computers vs Humans
    • Lemmatization and disambiguation processes do make mistakes
    • Contextual processing can fail for short queries, producing incorrect searches
  • 20. The Best Retrieval Method for Hebrew Texts
    • Arabic and Hebrew share many morphological phenomena
    • … but they do differ
    • Without trying, we can never know
    • Where HebMorph comes in
  • 21. HebMorph’s Approach Open-Source Hebrew Search
  • 22. HebMorph
    • … is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals.
    • 2 goals
    • Development is done with Lucene (why?)
    • MorphAnalyzer, Hebrew.SimpleAnalyzer (+ duality)
    • OpenRelevance
    Open-Source Hebrew Search: HebMorph’s Approach
  • 23. Indexing Flow Chart
  • 24. Searching Wikipedia with BzReader and HebMorph
    • Source available from
  • 25. The Road Ahead
    • A better tokenizer
    • MorphAnalyzer:
      • Hspell improvements (coverage, lemma probabilities, prefixes probabilities)
      • Toleration guidelines
      • Smarter OOV handling
      • Better stop words handling
    • Hebrew judgments for OpenRelevance with Orev
    • Comparing various approaches to Hebrew IR
    • Wide availability (Java port underway!)
    • Other uses (NLP, OCR, you name it)
  • 26. Join Us!
    • The more people join, the more feedback we get, and the better we become.
    • Our mailing list:
    • Code repository (Released under GPLv2):
      • http://github.com/synhershko/HebMorph
    • Activity updates:
        • #HebMorph on Twitter
  • 27. Thank you! Open-Source Hebrew Search