Open-source Hebrew search
Transcript

  • 1. Open-Source Hebrew Search. Itamar Syn-Hershko, SIGTRS Meetup, 22/7/2010, Jerusalem
  • 2. Introduction
    • The requirement to control masses of information
    • Manual tagging / categorization is no longer an option
    • Scanning text?
    • Using an inverted index: faster, flexible, relevance
    • Measuring a text retrieval (TR) engine: relevance, precision, recall
    • The perfect search engine is language-dependent
    • The perfect Hebrew search engine
    • Introducing: HebMorph
    Open-Source Hebrew Search: Introduction
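Precision and recall, mentioned above as measures for a retrieval engine, can be computed per query from the set of retrieved documents and the set of documents judged relevant. The following is a minimal sketch; the document ids and the function name are hypothetical and only serve the illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids: 4 results returned, 3 documents judged relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall of two thirds).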
  • 3. How do search engines work?
    • Inverted index
    • Normalizations: Porter stemmer, s-stemmer, Soundex etc.
    • Stemming, so that “looking”, “looked” and “looker” all reduce to “look”, and a search for “book” also returns “books”.
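A toy illustration of an inverted index combined with a normalization step. The stemmer here is a deliberately crude s-stemmer written for this sketch (not the Porter algorithm), and the documents are invented.

```python
from collections import defaultdict


def s_stem(token):
    """Crude s-stemmer (illustrative only): drop one trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[s_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Normalize the query the same way as the indexed text, then look it up."""
    return sorted(index.get(s_stem(query.lower()), set()))


# Invented sample documents.
docs = {1: "two books on search", 2: "a book about Hebrew", 3: "search engines"}
index = build_index(docs)
print(search(index, "book"))   # [1, 2]
print(search(index, "books"))  # [1, 2]
```

Because the same normalization is applied at index time and at query time, “book” and “books” land on the same posting list.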
  • 4. The Challenge Open-Source Hebrew Search
  • 5. Token Ambiguity
    • With Niqqud, Hebrew is no more ambiguous than any non-Semitic language
    • Niqqud-less spelling gives almost every word more than one possible meaning
    • English: Look, Luke; Wine, Whine; Stack, Stuck.
    • Hebrew: שָנִי , שֵנִי , שְנֵי , שֹנִי , שְנִי
    • Niqqud-less spelling: שני , שני , שני , שני , שני …
    Open-Source Hebrew Search: The Challenge
  • 6. Particles Separation
    • Hebrew words attach particles (prefixes and suffixes) for context
    • Without removing suffixes, relevant words might be skipped (for example: חבלה)
    • Without removing prefixes, relevant words will not be looked up at all
    • Ambiguity makes affix removal impossible in many cases
    • בית -> הבית , בבית , שבבית , לבית , והבית ...
    • הרכבת -> רותי פספסה את ה רכבת
    • הרכבת המוצר מסובכת להפליא
    • כלבי -> ?
    • שבתו -> ?
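The prefix problem can be sketched as enumerating every way to split a word into a run of particle letters plus a remaining stem, and keeping the splits whose stem is a known word. This is an illustration under stated assumptions, not HebMorph's actual logic: the three-word lexicon is a toy, and the sketch does not check which prefix combinations are grammatically valid.

```python
# The seven Hebrew letters that can act as one-letter prefix particles.
PREFIX_LETTERS = set("משהוכלב")


def candidate_splits(word, lexicon, max_prefix=3):
    """Return every (prefix, stem) split whose stem appears in the lexicon."""
    splits = []
    for i in range(0, min(max_prefix, len(word) - 1) + 1):
        prefix, stem = word[:i], word[i:]
        if any(ch not in PREFIX_LETTERS for ch in prefix):
            break  # a non-particle letter ends the possible prefix
        if stem in lexicon:
            splits.append((prefix, stem))
    return splits


# Toy lexicon (assumption): "house", "train", "assembly".
lexicon = {"בית", "רכבת", "הרכבת"}
print(candidate_splits("שבבית", lexicon))  # one reading: prefix + "house"
print(candidate_splits("הרכבת", lexicon))  # two readings: "assembly" or "the train"
```

The second call returns two equally valid splits, which is the ambiguity the slide describes: without context, the word can be read either as a single noun or as the definite article followed by a different noun.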
  • 7. Spelling Rules?
    • There is no common agreement on rules for Niqqud-less spelling, unlike the agreement that exists for diacriticized Hebrew
    • Even agreed-upon spellings are not always widely used
    • Did you know the correct spelling for “mother” is “אימא”?
    • The same word can be spelled differently by different writers, or even by the same writer
    • שירות / שרות / שיירות
    • דוגמא / דוגמה
  • 8. !(Spelling Rules)
    • Most debates are over spelling of nouns and loanwords, which have the greatest value in IR
    • An extra layer of ambiguity, where each author or user can choose whichever spelling they prefer
    • אחשורוש or אחשוורוש? (Ahasuerus)
    • שבדיה or שוודיה? (Sweden)
    • טורקיה or תורכיה? (Turkey)
    • פריס or פריז? Or perhaps פאריז? (Paris)
  • 9. Noise Reduction
    • Stop word ambiguity
    • אשר, כדי, אף ...
    • Stop words as collocations
    • על ידי, אי פעם, אף על פי, שום דבר ...
    • Collocations in which the meaning of a single word changes
    • פי התהום
  • 10. Tokenization Challenges
    • Hebrew acronyms use the double-quote character (Gershayim), which most tokenizers treat as punctuation
    • The same goes for Geresh, which is used for abbreviations
    • Geresh is also used for חצ"ץ ג"ז
    • … and ambiguity again: אינצ'
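One way to picture the tokenization issue is a regular expression that keeps a quote mark inside a token when it sits between Hebrew letters, and keeps a trailing Geresh. This is a minimal sketch, not HebMorph's tokenizer; the pattern and the `tokenize` name are assumptions made for the illustration.

```python
import re

LETTERS = "\u05D0-\u05EA"   # Hebrew letters alef..tav, including final forms
MARKS = "\"'\u05F3\u05F4"   # ASCII quotes plus Hebrew Geresh and Gershayim

# A Hebrew token: letters, optionally joined by a mark that is followed by
# more letters (acronyms), optionally ending with a Geresh (abbreviations).
TOKEN = re.compile(
    "[" + LETTERS + "]+(?:[" + MARKS + "][" + LETTERS + "]+)*['\u05F3]?"
    "|[A-Za-z0-9]+"
)


def tokenize(text):
    """Split text into tokens, keeping acronym and abbreviation marks."""
    return TOKEN.findall(text)


print(tokenize('דו"ח של צה"ל'))   # acronyms stay whole
print(tokenize('"שלום"'))          # ordinary quotation marks are dropped
print(tokenize("אינצ'"))           # trailing Geresh is kept
```

The sketch still cannot tell a trailing Geresh from a closing single quote, which is the kind of ambiguity the last bullet points at.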
  • 11. Common Texts
    • Different dialects may produce out-of-vocabulary (OOV) cases or change a word's meaning (חמר, חמרא), and hence require different handling
    • Each corpus might hold more than one dialect
    • Even partial Niqqud can help disambiguation
    • Niqqud-less spelling is the most common nowadays
  • 12. Ways of Resolution Open-Source Hebrew Search
  • 13. What to Index?
    • Deciding on an “indexing unit” is the cornerstone of any well-performing search engine
    • For Hebrew we have:
      • The original term (and possibly using wildcards?)
      • Hebrew triliteral root
      • Lemma ( דלת ← דלתותינו )
      • Pseudo-lemma, Stem
    • Considerations
    Open-Source Hebrew Search: Ways of Resolution
  • 14. Hebrew NLP Methods
    • To analyze a Hebrew word, NLP tools have to be used:
      • Dictionary-based approach
      • Algorithmic approach
    • Comparison criteria include:
      • Morphological precision (handling of 4- and 5-letter roots, broken plurals, assimilation, etc.)
      • Handling of loanwords, names and slang
      • Tolerance of spelling differences
      • Disambiguation (error rate, POS, ranking)
  • 15. Dictionary vs Algorithm
    • Dictionaries are easier to build and maintain, but they need much more ongoing attention and coverage testing
    • Easier to support non-exact matches with an algorithm
    • Prerequisites and dependencies
    • Hand-crafted dictionaries with morphological information, and corpus-generated dictionaries with statistical data
  • 16. Lemma Disambiguation
    • In order to index a correct lemma, a good disambiguation process needs to be used
    • POS tagging tools, grammatical or statistical, are the only reliable way to eliminate false positives
    • Even with such tools, ambiguity may exist:
    • " המראה של מטוסים ריקים [...]"
    • " ראש הממשלה בבון "
  • 17. NLP-based Hebrew Text Retrieval
    • Filter lemmas based on their rank, morphological characteristics or statistical data
    • OOV cases can be saved as-is, have affixes removed from them, or compared to a list of known words (e.g. names and addresses)
    • Removal of stop and noise words
    • Term expansion (soundex, synonyms)
    • Save lemma to index (multiple lemmas at the same position)
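The last steps above (dropping stop words, keeping OOV words as-is, and saving several lemmas at the same position) can be sketched as follows. The lemma table is a hypothetical stand-in for a real morphological analyzer, and keeping a position gap where a stop word was removed is a design choice of this sketch.

```python
def index_terms(tokens, lemmas_of, stop_words=frozenset()):
    """Yield (position, term) pairs; all lemmas of one token share a position."""
    for position, token in enumerate(tokens):
        if token in stop_words:
            continue  # the position is skipped, leaving a gap for phrase queries
        # Out-of-vocabulary tokens have no lemmas and are indexed as-is.
        for lemma in lemmas_of.get(token, [token]):
            yield position, lemma


# Hypothetical analyzer output: the first token is ambiguous between
# "assembly" and "train", so both lemmas are indexed at position 0.
lemmas_of = {"הרכבת": ["הרכבה", "רכבת"], "המוצר": ["מוצר"]}
tokens = ["הרכבת", "של", "המוצר", "XYZ"]
postings = list(index_terms(tokens, lemmas_of, stop_words={"של"}))
print(postings)
```

Storing both lemmas at position 0 lets a query for either reading match the document, at the cost of some precision when the disambiguation step cannot decide.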
  • 18. Other Text Retrieval Methods
    • Is morphological analysis necessary?
    • Available methods:
      • Light-stemming
      • Word truncation
      • N-grams
      • Skipgrams
      • (Sub-types)
    • Require no extra overhead
    • Favorable, even when not superior
    • Disadvantages: larger index size, slower searches (for some)
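Two of the language-agnostic methods listed above, character n-grams and word truncation, are simple enough to sketch directly. The function names and the default lengths are assumptions chosen for the illustration.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole so they remain searchable.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def truncate(word, k=4):
    """Word truncation: keep only the first k characters as the index term."""
    return word[:k]


print(char_ngrams("searching"))  # ['sear', 'earc', 'arch', 'rchi', 'chin', 'hing']
print(truncate("searching"))     # 'sear'
print(char_ngrams("הרכבת"))      # two overlapping 4-grams
```

A nine-letter word produces six index terms instead of one, which is where the larger index size mentioned above comes from.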
  • 19. … Applied on Semitic Languages
    • Research has shown that 4-grams and light stemmers (“light-10”) work better than morphological lemmatizers for Arabic
    • Apparently, good relevance can be achieved without ‘knowing’ the language
    • Computers vs Humans
    • Lemmatization and disambiguation processes do make mistakes
    • Contextual processing can fail for short queries, producing incorrect searches
  • 20. The Best Retrieval Method for Hebrew Texts
    • Arabic and Hebrew share many morphological phenomena
    • … but they do differ
    • Without trying, we can never know
    • Where HebMorph comes in
  • 21. HebMorph’s Approach Open-Source Hebrew Search
  • 22. HebMorph
    • … is a free, open-source effort to make Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals.
    • 2 goals
    • Development is done with Lucene (why?)
    • MorphAnalyzer, Hebrew.SimpleAnalyzer (+ duality)
    • OpenRelevance
    Open-Source Hebrew Search: HebMorph’s Approach
  • 23. Indexing Flow Chart Open-Source Hebrew Search: HebMorph’s Approach
  • 24. Searching Wikipedia with BzReader and HebMorph
    • Source available from
    • http://github.com/synhershko/BzReader
  • 25. The Road Ahead
    • A better tokenizer
    • MorphAnalyzer:
      • Hspell improvements (coverage, lemma probabilities, prefixes probabilities)
      • Toleration guidelines
      • Smarter OOV handling
      • Better stop words handling
    • Hebrew judgments for OpenRelevance with Orev
    • Comparing various approaches to Hebrew IR
    • Wide availability (Java port underway!)
    • Other uses (NLP, OCR, you name it)
  • 26. Join Us!
    • The more people join, the more feedback we get, and the better we become.
    • Our mailing list:
        • https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank
    • Code repository (Released under GPLv2):
      • http://github.com/synhershko/HebMorph
    • Activity updates:
        • http://www.code972.com/blog/hebmorph/
        • #HebMorph on Twitter
  • 27. Thank you! Open-Source Hebrew Search