Alan Nochenson
    IST 511
   10/1/2012
   Motivation
   Real-world example
   Techniques
     Tokenization
     Stop words
     Normalization
     Stemming/lemmatization
   Using a variety of techniques, we want to
    improve IR systems so that they “understand”
    more of what we want from a query
   E.g. When searching for a paper about
    Facebook, the following queries should all
    return the paper
       The facebook, facebook, face-book
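   Damerau–Levenshtein distance is the minimum
    number of single-character operations needed
    to turn one word into another
     Insert
     Delete
     Change (substitute one character)
     Swap (transpose two adjacent characters)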

   adidas ≈ adiidas, adidas ≈ adifas (distance 1, likely the same word)
   But: cat != rat != hat (also distance 1, yet different words)
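
A minimal sketch of the four operations above, assuming the restricted variant of Damerau–Levenshtein known as optimal string alignment (each substring is edited at most once). The function name `osa_distance` is chosen here for illustration and does not come from the slides.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # change (substitute)
            )
            # swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for pair in [("adidas", "adiidas"), ("adidas", "adifas"), ("cat", "rat")]:
    print(pair, osa_distance(*pair))
```

All three pairs come out at distance 1, which is the point of the slide: the distance alone cannot tell a typo from a different word.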
   Breaking up sentences on a variety of rules
       Split on non-alphanumeric?
         Good: The dog ran to the park
         Bad: Ms. O’Hannety went to O’Flaggerty’s pub
           (Ms, O, Hannety, went, to, O, Flaggerty, s, pub)
       Split on space?
         Bad: San Francisco is a great city.
           (San and Francisco become separate tokens)
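
The two splitting rules above can be sketched as follows. This is only an illustration; the function names are made up here, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re


def split_non_alnum(text: str) -> list[str]:
    # Split on any run of characters that are not ASCII letters or digits,
    # dropping the empty strings left at the edges.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def split_whitespace(text: str) -> list[str]:
    # Split on runs of whitespace only; punctuation stays attached.
    return text.split()


print(split_non_alnum("The dog ran to the park"))
print(split_non_alnum("Ms. O’Hannety went to O’Flaggerty’s pub"))
print(split_whitespace("San Francisco is a great city."))
```

The second call reproduces the bad split shown on the slide, and the third keeps the trailing period glued to "city." while still separating "San" from "Francisco".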
   E.g. the German compound
    Lebensversicherungsgesellschaftsangestellter =
    life insurance company employee
   Would not get split by any of the previously
    mentioned methods
   Drop common ‘useless’ words
       How useless are they? (“President of the USA”)
   Including them is not a big problem, space- or
    time-wise
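
A small sketch of stop-word removal, assuming a tiny hand-picked stop list (`STOP_WORDS` is illustrative, not a standard list). It shows what is lost from the example query.

```python
# Illustrative stop list only; real systems use longer, tuned lists.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stop_words(query: str) -> list[str]:
    # Lowercase, split on whitespace, and drop anything in the stop list.
    return [w for w in query.lower().split() if w not in STOP_WORDS]


print(remove_stop_words("President of the USA"))
```

Only "president" and "usa" survive, so the phrase structure of the query is gone. That is one reason to keep stop words when space and time allow.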
   What I did at Amazon (codenamed BrandSims
    normalization)
   Maps words/phrases that are semantically
    related to each other, so they can refer to the
    same content
   E.g. Alan went to the store = Alan go store
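
One simple way to picture such a mapping is a lookup table from variants to a canonical form. The sketch below is an assumption made for illustration, using the Facebook queries from earlier in the deck. It is not the BrandSims system, and the table `CANONICAL` is hypothetical.

```python
# Hypothetical variant -> canonical form table, for illustration only.
CANONICAL = {
    "the facebook": "facebook",
    "face-book": "facebook",
}


def canonicalize(query: str) -> str:
    # Lowercase and collapse whitespace, then look up a canonical form.
    key = " ".join(query.lower().split())
    return CANONICAL.get(key, key)


for q in ["The facebook", "facebook", "face-book"]:
    print(q, "->", canonicalize(q))
```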
   Accents are mainly dropped since they are not
    always supported
   Problematic since in certain languages accents
    are critical to understanding
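
A minimal sketch of dropping accents with Python's standard `unicodedata` module: decompose each character, then discard the combining marks. The function name `strip_accents` is chosen here for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_accents("résumé"))
print(strip_accents("peña"))
```

The second call shows the problem: Spanish "peña" (cliff) collapses to "pena" (sorrow), a different word.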
   Standardize to all caps or all lowercase (more
    common)
   Everywhere in the sentence?
       Bad: We went to the White House
   Better solution is to lowercase only at the
    beginning of a sentence and in titles
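
The difference between the two policies can be sketched as below. This is a simplification that assumes the input is a single sentence; the function names are made up for illustration.

```python
def lowercase_all(sentence: str) -> str:
    # Fold every character to lowercase.
    return sentence.lower()


def lowercase_sentence_start(sentence: str) -> str:
    # Lowercase only the first word; leave mid-sentence capitals alone.
    first, sep, rest = sentence.partition(" ")
    return first.lower() + sep + rest


s = "We went to the White House"
print(lowercase_all(s))
print(lowercase_sentence_start(s))
```

The first policy turns "White House" into "white house", while the second leaves the proper noun intact.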
   More complicated than previous normalization
    techniques
   Goal is to remove things like tense, number,
    possession from strings
   Chop off the end of the word
     Con: Crude and sometimes ineffective
     Pro: Fast and no overhead

   E.g. cookies -> cooki, cup -> c
   Use a vocabulary list and morphological (structural)
    analysis [which may or may not help much]
   Recognize context in a sentence (saw would
    become see if used as a verb, not a noun)
   Porter’s algorithm:
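
The full algorithm is not reproduced here. As a hedged illustration, the sketch below implements only the first rule group of Porter's stemmer (step 1a, which handles plural endings); the later steps are omitted. It is enough to reproduce the cookies -> cooki example from the stemming slide.

```python
def porter_step_1a(word: str) -> str:
    # Rules are tried in order; the first matching suffix wins.
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # ies  -> i    (cookies -> cooki)
    if word.endswith("ss"):
        return word        # ss   -> ss   (caress -> caress)
    if word.endswith("s"):
        return word[:-1]   # s    -> ""   (cats -> cat)
    return word


for w in ["caresses", "cookies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```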
   Understand the type of queries that will be
    submitted
   It is all about tradeoffs between precision and
    recall
   These techniques can be used differently
    depending on the context.
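
For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch, with made-up document IDs:

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r)
```

Aggressive normalization tends to retrieve more documents, which can raise recall and lower precision.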
10-1 Vocab of Terms
