Tomasz Korzeniowski
   tomek@polarrose.com
Information Retrieval
Retrieval strategies
• Vector Space Model
• Latent Semantic Indexing
• Probabilistic Retrieval Strategies
• Language Models
• Inference Networks
• Extended Boolean Retrieval
• Neural Networks
• Genetic Algorithms
• Fuzzy Set Retrieval
Vector space model
Text retrieval
Analysis
Tokenization
Stop-words
Stemming

Lemmatization
http://tartarus.org/~martin/PorterStemmer/
Document

 Term
Term frequency
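Inverse document frequency

A document containing a rare term such as ferrari should get a higher boost for a query on ferrari than a document containing a common term such as insurance gets from a query on insurance.

The document frequency $\mathrm{df}_t$ of a term t is the number of documents in the collection that contain t. It is used to scale the weight of the term. With N the total number of documents in a corpus, the inverse document frequency (idf) of a term t is defined as follows:

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

The idf of a rare term is high, whereas the idf of a frequent term is low.

The tf-idf weighting scheme assigns to term t a weight in document d given by:

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$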
Search
[Figure 7.1: Cosine similarity illustrated, showing the query vector v(q) and document vectors.]
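The cosine similarity named in the figure caption can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `cosine_similarity`, the sample vectors, and the choice to return 0.0 for an all-zero vector are assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical two-dimensional vectors, for illustration only.
q = [1.0, 1.0]
d_same_direction = [3.0, 3.0]   # longer than q, but pointing the same way
d_orthogonal = [1.0, -1.0]      # at a right angle to q

print(cosine_similarity(q, d_same_direction))
print(cosine_similarity(q, d_orthogonal))
```

Because the score depends only on the angle between the vectors, a long document pointing in the same direction as the query scores as high as a short one.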
Q: “gold silver truck”

D1: “Shipment of gold damaged in a fire”

D2: “Delivery of silver arrived in a silver truck”

D3: “Shipment of gold arrived in a truck”
TF

      a   arrived   damaged   delivery   fire   gold   in   of   shipment   silver   truck
D1    1   0         1         0          1      1      1    1    1          0        0
D2    1   1         0         1          0      0      1    1    0          2        1
D3    1   1         0         0          0      1      1    1    1          0        1
Q     0   0         0         0          0      1      0    0    0          1        1
IDF

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

Here N = 3, and logarithms are to the base 10.

  • a          log 3/3 = 0
  • arrived    log 3/2 = 0.176
  • damaged    log 3/1 = 0.477
  • delivery   log 3/1 = 0.477
  • fire       log 3/1 = 0.477
  • gold       log 3/2 = 0.176
  • in         log 3/3 = 0
  • of         log 3/3 = 0
  • shipment   log 3/2 = 0.176
  • silver     log 3/1 = 0.477
  • truck      log 3/2 = 0.176
TF-IDF

      a   arrived   damaged   delivery   fire    gold    in   of   shipment   silver   truck
D1    0   0         0.477     0          0.477   0.176   0    0    0.176      0        0
D2    0   0.176     0         0.477      0       0       0    0    0          0.954    0.176
D3    0   0.176     0         0          0       0.176   0    0    0.176      0        0.176
Q     0   0         0         0          0       0.176   0    0    0          0.477    0.176
SC(Q,D1) = (0)(0)+(0)(0)+(0)(0.477)+(0)(0)+(0)(0.477)+(0.176)(0.176)+(0)(0)+(0)(0)+(0)(0.176)+(0.477)(0)+(0.176)(0) = (0.176)(0.176) ≈ 0.031

SC(Q,D2) = (0.954)(0.477)+(0.176)(0.176) ≈ 0.486

SC(Q,D3) = (0.176)(0.176)+(0.176)(0.176) ≈ 0.062
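The worked example can be reproduced with a short Python sketch. It assumes lowercasing and whitespace tokenization, base-10 logarithms, and the unnormalized dot product used for SC above. The variable and function names are illustrative and not taken from the slides.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"

def term_counts(text):
    # Assumption: lowercase and split on whitespace, no stemming or stop words.
    return Counter(text.lower().split())

tf = {name: term_counts(text) for name, text in docs.items()}
n_docs = len(docs)
vocabulary = sorted({t for counts in tf.values() for t in counts})

# df_t: number of documents containing t; idf_t = log10(N / df_t).
df = {t: sum(1 for counts in tf.values() if t in counts) for t in vocabulary}
idf = {t: math.log10(n_docs / df[t]) for t in vocabulary}

def weights(counts):
    # tf-idf weight for every term that occurs in the collection.
    return {t: counts[t] * idf[t] for t in counts if t in idf}

q_weights = weights(term_counts(query))
scores = {}
for name, counts in tf.items():
    d_weights = weights(counts)
    # Dot product of query and document weight vectors (no length normalization).
    scores[name] = sum(w * d_weights.get(t, 0.0) for t, w in q_weights.items())

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 3))
```

With these assumptions the ranking is D2, D3, D1, matching the hand computation.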
Inverted index
term - 1   (dn,1)    (d10,1)
term - 2   (dn,5)    (dn,3)
term - 3   (d2,11)   (d10,1)
term - 4   (dn,1)    (d2,1)
term - 5   (dn,2)    (d4,3)
...
term - n   (d6,1)    (d7,3)
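A minimal Python sketch of the structure pictured above: each term maps to a postings list of (document, term frequency) pairs. The function name `build_inverted_index`, the whitespace tokenization, and the lowercase document ids are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # fixed order keeps postings sorted by doc_id
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)

docs = {
    "d1": "Shipment of gold damaged in a fire",
    "d2": "Delivery of silver arrived in a silver truck",
    "d3": "Shipment of gold arrived in a truck",
}

index = build_inverted_index(docs)
print(index["silver"])
print(index["gold"])
```

A query then only touches the postings lists of its own terms instead of scanning every document.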
Lucene
Analysis
Lucene includes several built-in analyzers. The primary ones are shown in table 4.2.
We’ll leave discussion of the two language-specific analyzers, RussianAnalyzer
and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper,
PerFieldAnalyzerWrapper, to section 4.4.

Table 4.2   Primary analyzers available in Lucene

  Analyzer             Steps taken
  WhitespaceAnalyzer   Splits tokens at whitespace
  SimpleAnalyzer       Divides text at nonletter characters and lowercases
  StopAnalyzer         Divides text at nonletter characters, lowercases, and removes stop words
  StandardAnalyzer     Tokenizes based on a sophisticated grammar that recognizes e-mail
                       addresses, acronyms, Chinese-Japanese-Korean characters,
                       alphanumerics, and more; lowercases; and removes stop words


The built-in analyzers we discuss in this section (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer) are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and SimpleAnalyzer are both trivial and we don’t cover them in more detail here. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have non-
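The steps listed for the three simpler analyzers can be imitated in plain Python to see how their output differs. This is a rough emulation of the described behavior, not Lucene code: the function names, the sample sentence, and the short stop-word list are assumptions, and StandardAnalyzer's grammar is not attempted.

```python
import re

# Illustrative stop-word list; Lucene ships its own.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}

def whitespace_analyze(text):
    # Split tokens at whitespace only; case and punctuation are kept.
    return text.split()

def simple_analyze(text):
    # Divide text at non-letter characters and lowercase each token.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

def stop_analyze(text, stop_words=STOP_WORDS):
    # Same as simple_analyze, then drop stop words.
    return [token for token in simple_analyze(text) if token not in stop_words]

sample = "The Delivery of silver-plated goods, arrived!"
print(whitespace_analyze(sample))
print(simple_analyze(sample))
print(stop_analyze(sample))
```

The same analyzer has to be applied at index time and at query time, otherwise query terms will not line up with the indexed tokens.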
Index

• IndexWriter
• Directory
• Analyzer
• Document
• Field
Index options: store

  Value         Description
  :no           Don’t store the field.
  :yes          Store the field in its original format.
                Use this value if you want to highlight
                matches or print match excerpts a la Google
                search.
  :compressed   Store the field in compressed format.
Index options: index

        Value                                   Description
        :no                                     Do not make this field searchable.
        :yes                                    Make this field searchable and tokenize
                                                its contents.
        :untokenized                            Make this field searchable but do not
                                                tokenize its contents. Use this value
                                                for fields you wish to sort by.
        :omit_norms                             Same as :yes except omit the norms
                                                file. The norms file can be omitted
                                                if you don’t boost any fields and
                                                you don’t need scoring based on field
                                                length.
        :untokenized_omit_norms                 Same as :untokenized except omit the
                                                norms file.
Ruby Day Kraków: Full Text Search with Ferret
Index options: term_vector

        Value                                   Description
        :no                                     Don’t store term vectors.
        :yes                                    Store term vectors without storing positions
                                                or offsets.
        :with_positions                         Store term vectors with positions.
        :with_offsets                           Store term vectors with offsets.
        :with_positions_offsets                 Store term vectors with positions and offsets.




Search

• IndexSearcher
• Term
• Query
• Hits
Query

• API
 •   new TermQuery(new Term("name", "Tomek"));

• Lucene QueryParser
 •   queryParser.parse("name:Tomek");
TermQuery
 name:Tomek
BooleanQuery
    rambo OR ninja

+rambo +ninja -name:rocky
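The two query forms above map onto set operations over postings. The sketch below is a simplified model, not Lucene's implementation: the `boolean_search` helper, the default field name "text", and the document ids are hypothetical. Required (`+`) clauses are intersected, optional clauses are united when nothing is required, and prohibited (`-`) clauses are subtracted.

```python
# Hypothetical postings: (field, term) -> set of matching document ids.
postings = {
    ("text", "rambo"): {1, 2, 3},
    ("text", "ninja"): {2, 3, 4},
    ("name", "rocky"): {3},
}

def boolean_search(postings, must=(), should=(), must_not=()):
    def docs(key):
        return postings.get(key, set())

    if must:
        # Every required clause has to match.
        result = set.intersection(*(set(docs(key)) for key in must))
    else:
        # With no required clauses, any optional clause may match.
        result = set().union(*(docs(key) for key in should))
    for key in must_not:
        result -= docs(key)  # prohibited clauses remove documents
    return result

# rambo OR ninja
either = boolean_search(postings, should=[("text", "rambo"), ("text", "ninja")])
# +rambo +ninja -name:rocky
both_not_rocky = boolean_search(
    postings,
    must=[("text", "rambo"), ("text", "ninja")],
    must_not=[("name", "rocky")],
)
print(sorted(either), sorted(both_not_rocky))
```

This model ignores scoring; in a real engine optional clauses also raise the score of documents that already satisfy the required ones.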
PhraseQuery
"ninja java" -name:rocky
SloppyPhraseQuery
 "red-faced politicians"~3
RangeQuery
releaseDate:[2000 TO 2007]
WildcardQuery
 sup?r, su*r, super*
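The wildcard patterns above use `?` for exactly one character and `*` for any run of characters, including none. Python's standard `fnmatch` module uses the same two wildcards, so it can illustrate the matching. The term list and the helper name `wildcard_match` are assumptions, and `fnmatch` additionally treats `[...]` as a character class, which this sketch does not rely on.

```python
from fnmatch import fnmatchcase

# Hypothetical dictionary of indexed terms.
terms = ["super", "supper", "sugar", "sur", "superman", "sup"]

def wildcard_match(pattern, terms):
    # fnmatchcase: '?' matches one character, '*' matches zero or more.
    return [term for term in terms if fnmatchcase(term, pattern)]

print(wildcard_match("sup?r", terms))
print(wildcard_match("su*r", terms))
print(wildcard_match("super*", terms))
```

A search engine expands such a pattern against its term dictionary, so patterns with a long literal prefix such as `super*` touch far fewer terms than ones that start with a wildcard.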
FuzzyQuery
      color~

 colour, collor, colro
http://en.wikipedia.org/wiki/Levenshtein_distance


                 color colour - 1

                  colour coller - 2
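The two distances above can be checked with the standard dynamic-programming formulation of Levenshtein distance, where insertions, deletions, and substitutions each cost 1. This is a textbook sketch rather than Lucene's FuzzyQuery code, and the function name is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("color", "colour"))
print(levenshtein("colour", "coller"))
```

Plain Levenshtein distance counts a transposition such as color to colro as two edits, because swapping adjacent letters is not a single operation in this model.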
Equation 1. Levenshtein Distance Score




This means that an exact match will h
corresponding letters will have a score
Boost
title:Spring^10
Information Retrieval with Open Source