In Search Of...
integrating site search


                      Ian Barber
                     @ianbarber
               http://phpir.com
             ian@ibuildings.com
           http://joind.in/2172
How Search Works
 Integrating Search
  Improving Results
       Using Search
Search Performance
          Questions




[Diagram: Query → Parser → Index → Result; Document → Analyser → Index]
Tokenisation



“With AT&T’s help, the F.B.I Miami-Dade office had recovered
$1.1 million from O’Healy’s Ponzi scheme, 10-15% more than
expected.”
PHP Tokenisation

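function tokenise($string) {
    // Lower-case first so that matching is case-insensitive
    $string = strtolower($string);
    // \w+ matches runs of word characters; PREG_OFFSET_CAPTURE
    // returns each token as a (token, byte offset) pair
    preg_match_all('/\w+/', $string,
            $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}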
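As a rough illustration of what this tokeniser produces, the sketch below runs it over a shortened version of the sentence from the previous slide. The sample string uses a plain ASCII apostrophe, and the variable names are assumptions made for this example only.

```php
<?php
// Same approach as the slide: lower-case, then capture word-character runs.
function tokenise($string) {
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

// Shortened sample sentence (assumption: ASCII apostrophe for simplicity).
$tokens = tokenise('With AT&T\'s help, the F.B.I recovered $1.1 million');

// Each entry is array(token, byte offset); print them one per line.
foreach ($tokens as $pair) {
    echo $pair[1], "\t", $pair[0], "\n";
}
```

Note how punctuation splits "AT&T's" into "at", "t" and "s", and "$1.1" into two separate "1" tokens. That is the kind of edge case the example sentence is there to highlight.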




Document Term Pairs
Document ID         Term
    1                the
    1               best
    1                of
    1                the
    ...              ...
   204               and
   204              what
   204              would
Inverted Index
Term              Documents
best    1 (4, 16), 4 (422), 129 (344) ...

what    24 (50, 98), 75 (33, 208) ...

would   99 (32, 599), 201 (344) ...

 ...                    ...
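One way to build a structure like this is sketched below. It assumes the numbers in brackets are byte offsets as returned by PREG_OFFSET_CAPTURE; buildIndex() and the two sample documents are hypothetical and not taken from the talk.

```php
<?php
// Tokeniser in the style of the earlier slide: (token, offset) pairs.
function tokenise($string) {
    preg_match_all('/\w+/', strtolower($string), $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

// Hypothetical helper: maps term => (document ID => list of offsets).
function buildIndex(array $documents) {
    $index = [];
    foreach ($documents as $docID => $text) {
        foreach (tokenise($text) as [$term, $offset]) {
            $index[$term][$docID][] = $offset;
        }
    }
    return $index;
}

// Two made-up documents keyed by document ID.
$index = buildIndex([
    1 => 'the best of the best',
    2 => 'what would the best do',
]);

print_r($index['best']);
```

Here "best" ends up with the postings 1 (4, 16) and 2 (15), the same shape as the rows in the table above.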


Boolean Query Merge
Query: Best Western Hotel
 best     1    4    129   298   305   338
western   4   95    194   204   298   305


working   4   298   305
 hotel    2   40    200   298   355   402

Result: Document 298
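The merge above can be sketched as a pairwise intersection of sorted posting lists, walking both lists in step. intersectPostings() is a hypothetical name; the document IDs are the ones from the slide.

```php
<?php
// Intersect two posting lists that are sorted by ascending document ID.
function intersectPostings(array $a, array $b) {
    $result = [];
    $i = 0;
    $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            // Document contains both terms: keep it and advance both lists.
            $result[] = $a[$i];
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $result;
}

$best    = [1, 4, 129, 298, 305, 338];
$western = [4, 95, 194, 204, 298, 305];
$hotel   = [2, 40, 200, 298, 355, 402];

// "working" is the intermediate result after merging the first two terms.
$working = intersectPostings($best, $western);
$result  = intersectPostings($working, $hotel);

print_r($working);
print_r($result);
```

Because both lists are sorted, each merge is a single linear pass. Merging the shortest lists first keeps the working set small.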
TF-IDF

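function getWeight($docID, $term, $total) {
  // $term: postings for one term, document ID => list of positions
  $tf = count($term[$docID]);            // occurrences in this document
  $idf = log($total / count($term), 2);  // rarer terms score higher
  return $tf * $idf;
}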
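As a rough worked example of the formula above, the sketch below applies the same tf * log2(N / df) weighting to the postings for "best" from the Inverted Index slide. The function name tfidfWeight and the corpus size of 12 documents are assumptions made for illustration.

```php
<?php
// Hypothetical helper mirroring getWeight(): tf * log2(N / df).
// $postings maps document ID => list of positions for one term.
function tfidfWeight(array $postings, $docID, $totalDocs) {
    if (!isset($postings[$docID])) {
        return 0.0; // the term does not occur in this document
    }
    $tf  = count($postings[$docID]);              // term frequency
    $idf = log($totalDocs / count($postings), 2); // inverse document frequency
    return $tf * $idf;
}

// Postings for "best" as shown on the Inverted Index slide.
$best = array(1 => array(4, 16), 4 => array(422), 129 => array(344));

// Assumed corpus size of 12 documents: idf = log2(12 / 3) = 2.
$weightDoc1 = tfidfWeight($best, 1, 12); // tf = 2, so the weight is about 4
$weightDoc4 = tfidfWeight($best, 4, 12); // tf = 1, so the weight is about 2
```

Document 1 scores higher than document 4 only because "best" occurs twice in it; the idf part is the same for every document containing the term.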




                                         11
Document Vector
        socket   what   heavy   steel   ...

Doc 1    0.02    0.3    0.001    0      ...

Doc 2     0       0       0      0      ...

Doc 3   0.001    0.2      0      0      ...

Doc 4     0       0     0.002   0.003   ...


                                              12
Ranked Query Merge
  best     23    42   179   246   333   703

 weight   0.008 0.002 0.023 0.039 0.014 0.001

western    42    88   120   179   246   798

 weight   0.003 0.004 0.023 0.001 0.034 0.004

1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023
                                           13
PHP Similarity
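function score($queryString, $index) {
  // tokenise() returns array(term, offset) pairs (PREG_OFFSET_CAPTURE)
  $query = tokenise($queryString);
  $matches = array();
  foreach($query as $token) {
    $qterm = $token[0];
    if (!isset($index[$qterm])) { continue; } // term not in the index
    foreach($index[$qterm] as $id => $posting) {
      if (!isset($matches[$id])) { $matches[$id] = 0; }
      $matches[$id] += $posting['score'];
    }
  }
  arsort($matches); // sorts in place, highest score first
  return $matches;
}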
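To check the arithmetic on the Ranked Query Merge slide, the sketch below sums the per-term weights from that slide for each document and sorts by the total. The helper name rankedMerge and the array layout (document ID => weight) are assumptions for illustration, not part of the talk's code.

```php
<?php
// Hypothetical helper: sum per-term weights for each document ID,
// then sort so the highest combined score comes first.
function rankedMerge(array $postingsPerTerm) {
    $scores = array();
    foreach ($postingsPerTerm as $postings) {
        foreach ($postings as $docID => $weight) {
            if (!isset($scores[$docID])) {
                $scores[$docID] = 0.0;
            }
            $scores[$docID] += $weight;
        }
    }
    arsort($scores); // sorts in place and keeps document IDs as keys
    return $scores;
}

// Weights copied from the Ranked Query Merge slide.
$best = array(23 => 0.008, 42 => 0.002, 179 => 0.023,
              246 => 0.039, 333 => 0.014, 703 => 0.001);
$western = array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                 179 => 0.001, 246 => 0.034, 798 => 0.004);

$ranked = rankedMerge(array($best, $western));
$top3   = array_slice(array_keys($ranked), 0, 3); // 246, 179, 120
```

Unlike the boolean merge, a document matching only one term (120 here) can still rank, because its single weight is added rather than intersected away.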

                                         14
Integrating Search
                     15
MySQL Full Text Search
CREATE TABLE example (
    id INT(11) NOT NULL auto_increment,
    title VARCHAR(255),
    content TEXT,
    PRIMARY KEY(id),
    FULLTEXT(title,content)
) Engine=MyISAM;

INSERT INTO example (title, content) VALUES
('Mikko & Bacon','Mikko loves bacon'),
('Marcello & Bacon','Marcello hates bacon'),
('Jo & Sausages','Johanna loves sausages'),
('Hollywood & Garlic','Lorenzo hates garlic'),
('James & Cheddar','James is keen on cheeses');
                                              16
MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');

+----+------------------+------------------------+
| id | title            | content                |
+----+------------------+------------------------+
|  1 | Mikko & Bacon    | Mikko loves bacon      |
|  2 | Marcello & Bacon | Marcello hates bacon   |
|  3 | Jo & Sausages    | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)




                                                 17
Sphinx
http://www.sphinxsearch.com




                        18
Sphinx Configuration
source posts
{
  type             =   mysql
  sql_host         =   localhost
  sql_user         =   user
  sql_pass         =   password
  sql_db           =   search

  sql_query        = \
    SELECT id, title, content FROM example;
  sql_attr_multi   = uint tag from query; \
    SELECT example_id, tag_id FROM tags;
}
                                            19
index posts
{
  source     = posts
  path       = /var/data/sphinx/example
  morphology = stem_en

  min_word_len   = 3
  min_prefix_len = 3
  min_infix_len  = 0
  enable_star    = 1
}




                                          20
Stemming
        http://tartarus.org/~martin/PorterStemmer




happening - happen
happened - happen
happens   - happen
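To make the idea concrete, here is a deliberately naive suffix-stripping sketch that reproduces the three examples above. It is not the Porter algorithm linked on this slide, which applies many more rules and conditions; the function name naiveStem and the three-suffix list are assumptions for illustration.

```php
<?php
// Hypothetical, deliberately naive stemmer: strip one common English
// suffix, provided at least three characters of stem remain.
// This is NOT the Porter stemmer, only an illustration of the idea.
function naiveStem($word) {
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word; // no known suffix, leave the word untouched
}

$stems = array_map('naiveStem', array('happening', 'happened', 'happens'));
// All three words reduce to "happen", so they share one index term.
```

The same function has to be applied to both the indexed text and the query terms, otherwise the stemmed index entries will not match.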




                                              21
Command Line Searching
indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon

displaying matches:
1. document=1, weight=3, tag=(1,2)
! id=1
! title=Mikko & Bacon
! content=Mikko loves bacon
words:
1. 'love': 2 documents, 2 hits
2. 'bacon': 2 documents, 4 hits

searchd --config /etc/sphinx.conf

                                              22
Sphinx From PHP

$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);

$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);

$cl->SetFilter('tag', array(1));
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);


                                            23
Swish-E
   http://swish-e.org
pecl install swish-beta
                    24
Filesystem Index With Swish-E
/usr/local/bin/swish-e -S fs -c fs-swish-e.conf


fs-swish-e.conf
IndexDir            /var/data/documents
IndexFile           fs-swish-e.index
IndexOnly           .doc .docx .pdf
FuzzyIndexingMode   Stemming_en1

FileFilter .pdf /usr/local/bin/swish_filter.pl
FileFilter .doc /usr/local/bin/swish_filter.pl
Crawling Content
/usr/local/bin/swish-e -S prog -c www-swish-e.conf


www-swish-e.conf
IndexDir      /usr/local/lib/swish-e/spider.pl
IndexFile     www-swish-e.index
SwishProgParameters default http://phpir.com/

FuzzyIndexingMode Stemming_en1
DefaultContents   HTML
Swish-E With Multiple Indices
$swish     = new Swish(
   'www-swish-e.index fs-swish-e.index'
);
$search    = $swish->prepare();

$queryStr = 'search string goes here';
$result   = $search->execute($queryStr);
$total    = $result->hits;

while($r = $result->nextResult()) {
  echo $r->swishdocpath; // url
}
Lucene




         28
$index = Zend_Search_Lucene::create('idx');
foreach($documents as $title => $content) {
  $doc = new Zend_Search_Lucene_Document();
  $doc->addField(
    Zend_Search_Lucene_Field::Text(
      'title', $title));
  $doc->addField(
    Zend_Search_Lucene_Field::UnStored(
      'content', $content));
  $index->addDocument($doc);
}



                             Build Index
                                         29
$results = $index->find('loves bacon');
foreach($results as $result) {
  echo $result->score, " ";
  echo $result->title, "\n";
}

Output:
0.81656279309067 Mikko and Bacon
0.24800278854758 Marcello & Bacon



       Query Zend Search Lucene
                                          30
$file = file_get_contents($url);

$doc = Zend_Search_Lucene_Document_Html::
                           loadHTML($file);

$doc->addField(
   Zend_Search_Lucene_Field::Text(
     'url', $url));
$index->addDocument($doc);



                             Index HTML
                                         31
Solr
http://lucene.apache.org/solr/
                                 32
Solr Search Index
$options = array( 'hostname' => 'localhost',
                  'port'     => 8983 );

$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', $id);
$doc->addField('cat', $category);
$doc->addField('title', $title);
$doc->addField('text', $text);
$response = $client->addDocument($doc);
$client->commit();


                                          33
Solr Search Client
$client = new SolrClient($options);

$query = new SolrQuery('bacon');
$response = $client->query($query);
$r = $response->getResponse();

foreach($r['response']['docs'] as $d) {
  echo $d->title[0] . "\n";
}



                                          34
Xapian
http://xapian.org




              35
Xapian In PHP
$db = new XapianWritableDatabase(
      'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->set_stemmer(new XapianStem("english"));

$doc = new XapianDocument();
$doc->set_data($content);
$doc->add_value(1, $title);

$i->set_document($doc);
$i->index_text($content);
$db->add_document($doc);
                                         36
Xapian Search In PHP

$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQueryParser();
$qp->set_stemmer(new XapianStem("english"));
$qp->set_database($database);
$qp->set_stemming_strategy(
    XapianQueryParser::STEM_SOME);
$query = $qp->parse_query($queryString);

$enquire->set_query($query);


                                          37
$matches = $enquire->get_mset(0, 10);

$i = $matches->begin();
while(!$i->equals($matches->end())) {
  $n = $i->get_rank() + 1;
  $data = $i->get_document()->get_data();
  $title = $i->get_document()->get_value(1);
  $score = $i->get_percent();
  $i->next();
}




                                         38
Improving Results




                    39
Anchor Text




         40
Parse Anchor Text
$p = file_get_contents('http://phpir.com');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($p);
$links = $dom->getElementsByTagName('a');

foreach($links as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
}
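The loop above only reads each link. One possible next step, sketched below on an inline HTML string so that it runs without network access, is to group anchor text by target URL so it can later be indexed alongside the target document. The helper name extractAnchors and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: collect anchor text grouped by link target.
function extractAnchors($html) {
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $anchors = array();
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $text = trim($link->nodeValue);
        if ($href !== '' && $text !== '') {
            $anchors[$href][] = $text; // one URL may have many descriptions
        }
    }
    return $anchors;
}

// Made-up sample markup with two links to the same page.
$html = '<p><a href="http://phpir.com">PHP information retrieval</a> and '
      . '<a href="http://phpir.com">search blog</a></p>';
$anchors = extractAnchors($html);
```

Each entry in $anchors is text that other pages use to describe the target, which is often a better summary than the target's own wording.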


                                            41
1
         2




         3



    Zone Weighting
                42
$doc = new Zend_Search_Lucene_Document();

$tfield = Zend_Search_Lucene_Field::Text
   ('title', $title);
$tfield->boost = 1.3;
$doc->addField($tfield);

$doc->addField(
  Zend_Search_Lucene_Field::UnStored
   ('content', $content));

$index->addDocument($doc);


                 ZSL Zone Weighting
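As a simplified picture of what a field boost does, the sketch below scores a document as a weighted sum of per-zone match scores. This is an illustration only and not Zend Search Lucene's actual scoring formula; the helper name zoneWeightedScore and the sample scores are assumptions, and only the 1.3 title boost comes from the slide.

```php
<?php
// Hypothetical helper: weighted sum of per-zone match scores.
// Zones without an explicit boost default to 1.0.
function zoneWeightedScore(array $zoneScores, array $boosts) {
    $total = 0.0;
    foreach ($zoneScores as $zone => $score) {
        $boost = isset($boosts[$zone]) ? $boosts[$zone] : 1.0;
        $total += $score * $boost;
    }
    return $total;
}

$boosts = array('title' => 1.3); // same boost as the title field above

// The same raw match score counts for more in the title than in the body.
$titleHit = zoneWeightedScore(array('title' => 0.5, 'content' => 0.0), $boosts);
$bodyHit  = zoneWeightedScore(array('title' => 0.0, 'content' => 0.5), $boosts);
```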
                                            43
Document Authority




                44
Document Weights in ZSL
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
  Zend_Search_Lucene_Field::Text
   ('title', $title));
$doc->addField(
  Zend_Search_Lucene_Field::UnStored
   ('content', $content));

$doc->boost = 1 + ($numComments / 100);

$index->addDocument($doc);

                                            45
Using Search




          46
Summaries & Highlighting




                           47
Sphinx Extract & Highlight
$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 );
$q = 'bacon';
$r = $cl->Query($q);
$text = array();
foreach ($r["matches"] as $doc => $info) {
  $text[$doc] = getTextFromDB($doc);
}

$extracts = $cl->BuildExcerpts($text, 'posts', $q);
foreach($extracts as $extract) {
  echo $extract;
}
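For readers without a Sphinx server to hand, the plain PHP sketch below shows roughly what an excerpt builder does: cut a window of text around the first match and wrap each match in a tag. It is not Sphinx's algorithm; the helper name highlightExcerpt, the character-based window and the bold tag are assumptions for illustration.

```php
<?php
// Hypothetical helper: return a snippet around the first occurrence of
// $term, with every occurrence inside the snippet wrapped in <b> tags.
function highlightExcerpt($text, $term, $radius = 20) {
    $pos = stripos($text, $term);
    if ($pos === false) {
        return substr($text, 0, 2 * $radius); // no match, use the opening text
    }
    $start   = max(0, $pos - $radius);
    $snippet = substr($text, $start, strlen($term) + 2 * $radius);
    $pattern = '/(' . preg_quote($term, '/') . ')/i';
    return preg_replace($pattern, '<b>$1</b>', $snippet);
}

$excerpt = highlightExcerpt('Mikko loves bacon', 'bacon');
// $excerpt is 'Mikko loves <b>bacon</b>'
```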
                                             48
Xapian Spelling Correction
Indexer
$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags(
   XapianTermGenerator::FLAG_SPELLING);
Searcher
$queryString = "strreplace or str_cmp";
$q = new XapianQueryParser();
$q->set_database($database);
$query = $q->parse_query($queryString,
  XapianQueryParser::FLAG_SPELLING_CORRECTION);
echo "Did you mean: " .
  $q->get_corrected_query_string() . "\n";
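The general idea behind a "did you mean" suggestion can be sketched with PHP's built-in levenshtein() function: pick the known term with the smallest edit distance to the misspelt word. This is an illustration and not Xapian's implementation; the helper name suggest, the distance limit and the three-term vocabulary are assumptions.

```php
<?php
// Hypothetical helper: return the vocabulary term closest to $word by
// edit distance, or null when nothing is within $maxDistance edits.
function suggest($word, array $vocabulary, $maxDistance = 2) {
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($vocabulary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Made-up vocabulary standing in for the terms seen at index time.
$vocabulary = array('str_replace', 'strcmp', 'str_repeat');

$first  = suggest('strreplace', $vocabulary); // one insertion from str_replace
$second = suggest('str_cmp', $vocabulary);    // one deletion from strcmp
```

Comparing against every term is only workable for a small vocabulary; a real index needs a cheaper way to shortlist candidates first.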
                                          50
Spelling Correction Output
 php xapsearch.php

Did you mean: str_replace or strcmp

4644 results found for “strreplace or str_cmp”:
1: 2% docid=572
  [phpdocs/html/cc.license.html]
2: 2% docid=7169
  [phpdocs/html/imagick.constants.html]
3: 2% docid=10086
  [phpdocs/html/sqlite3result.fetcharray.html]
4: 2% docid=6132
  [phpdocs/html/function.swf-posround.html]

                                                  51
Results Sorting




                  52
Sorting in ZSL

$q = Zend_Search_Lucene_Search_QueryParser::
 parse('search string');

$results = $index->find($q, 'title');
foreach($results as $result) {
  echo '<h3>', $result->title, "</h3>\n";
  $doc = getDocumentFromDB($result->did);
  echo
    $q->htmlFragmentHighlightMatches($doc);
}


                                          53
Faceted Search




                 54
Faceted Search In Solr
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
$query->setFacet(true);
$query->addFacetField('cat');
$response = $client->query($query);
$r = $response->getResponse();
$f = $r['facet_counts']['facet_fields'];
foreach($f['cat'] as $facet => $count) {
  echo $facet . " " . $count . "\n";
}
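What a facet count reports can be shown without a Solr server: for one field, count how many matching documents carry each value. The sketch below does this over a plain array; the helper name facetCounts and the sample documents and categories are made up for illustration.

```php
<?php
// Hypothetical helper: count result documents per value of one field.
function facetCounts(array $docs, $field) {
    $counts = array();
    foreach ($docs as $doc) {
        if (!isset($doc[$field])) {
            continue; // documents without the field are not counted
        }
        $value = $doc[$field];
        $counts[$value] = isset($counts[$value]) ? $counts[$value] + 1 : 1;
    }
    arsort($counts); // most common value first
    return $counts;
}

// Made-up search results for the query "bacon".
$docs = array(
    array('title' => 'Mikko & Bacon',    'cat' => 'food'),
    array('title' => 'Marcello & Bacon', 'cat' => 'food'),
    array('title' => 'Bacon numbers',    'cat' => 'maths'),
);
$counts = facetCounts($docs, 'cat');
```

The counts are usually rendered as links that re-run the same query with a filter on the chosen value.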


                                           55
More Like This




            56
More Like This
$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);

$qs = array();
$t = $e->begin();
for($t; !$t->equals($e->end()); $t->next()){
  $qs[] = new XapianQuery($t->get_term(),
                  intval($t->get_weight()));
}

$query = new XapianQuery(
                  XapianQuery::OP_OR, $qs);
                                            57
More Like This Example
 php xapsim.php

1656 results found:
1: 100% docid=5959
    [phpdocs/html/function.str-replace.html]
2: 47% docid=5956
    [phpdocs/html/function.str-ireplace.html]
3: 24% docid=5328
    [phpdocs/html/function.preg-replace.html]
4: 18% docid=5958
    [phpdocs/html/function.str-repeat.html]


                                          58
Search Performance




                59
Index Updates
[Diagram: new documents are indexed into a small delta index; queries search both the delta and main indexes; the delta is periodically merged into main]
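One reading of the main-plus-delta scheme is sketched below with plain arrays: new documents go into a small delta index, queries consult both indexes, and the delta is periodically folded into main. The helper names searchIndexes and mergeDelta and the index layout (term => list of document IDs) are assumptions for illustration.

```php
<?php
// Hypothetical helper: look a term up in both indexes and combine the hits.
function searchIndexes($term, array $main, array $delta) {
    $mainHits  = isset($main[$term])  ? $main[$term]  : array();
    $deltaHits = isset($delta[$term]) ? $delta[$term] : array();
    $ids = array_unique(array_merge($mainHits, $deltaHits));
    sort($ids);
    return $ids;
}

// Hypothetical helper: fold the delta postings into the main index.
function mergeDelta(array $main, array $delta) {
    foreach ($delta as $term => $ids) {
        $existing = isset($main[$term]) ? $main[$term] : array();
        $merged   = array_unique(array_merge($existing, $ids));
        sort($merged);
        $main[$term] = $merged;
    }
    return $main;
}

$main  = array('bacon' => array(1, 2));                         // large, rebuilt rarely
$delta = array('bacon' => array(6), 'cheddar' => array(6));     // small, rebuilt often

$hits    = searchIndexes('bacon', $main, $delta); // document 6 is found before any merge
$newMain = mergeDelta($main, $delta);             // after this the delta can be emptied
```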
                                        60
Search Speed
Zend Search Lucene
$index = Zend_Search_Lucene::open('index');
$index->optimize();
Sphinx
 indexer --merge main delta --rotate
Solr
$client = new SolrClient($options);
$client->optimize();

Xapian
 xapian-compact xapindex xapindex2
                                         61
Distributing Search
[Diagram: documents are split across several indexes, and the application queries all of them]
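On the application side, querying several indexes means merging their result lists into one ranking. The sketch below assumes each shard is an array of term => (document ID => score), that document IDs are unique across shards, and that scores from different shards are comparable. The helper name distributedSearch and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: query every shard for a term and return the
// best-scoring documents overall, keyed by document ID.
function distributedSearch($term, array $shards, $limit = 10) {
    $combined = array();
    foreach ($shards as $shard) {
        if (isset($shard[$term])) {
            $combined += $shard[$term]; // array union keeps document IDs as keys
        }
    }
    arsort($combined); // highest score first
    return array_slice($combined, 0, $limit, true);
}

// Made-up shards, each holding a different slice of the documents.
$shards = array(
    array('bacon'   => array(1 => 0.8, 2 => 0.2)),
    array('bacon'   => array(7 => 0.5)),
    array('cheddar' => array(9 => 0.9)),
);

$top = distributedSearch('bacon', $shards, 2); // documents 1 and 7
```

In practice per-shard scores are only comparable when the shards share term statistics, which is one reason the dedicated engines handle distribution themselves.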

                              62
Large Scale Search


      http://www.nutch.org




   http://hadoop.apache.org



                        63
Image Credits
Title                http://www.flickr.com/photos/generated/2084287794/
What Do You Want     http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here         http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search   http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx               http://www.flickr.com/photos/generated/2084287794/
Lucene               http://www.flickr.com/photos/mypanda/7731447/
Swish-e              http://www.flickr.com/photos/ryan_fung/2239687100/
Solr                 http://www.flickr.com/photos/m-j-s/2724756177/
Xapian               http://www.flickr.com/photos/olibac/3522056495/
Using Search         http://www.flickr.com/photos/eneas/175027945/
Improving Search     http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance   http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search   http://www.flickr.com/photos/zedzap/3663508847/

                                                                              64
Questions?




             65
Thank You!

                 Ian Barber
                @ianbarber
          http://phpir.com
        ian@ibuildings.com
      http://joind.in/2172

In Search Of: Integrating Site Search (IPC)

  • 1. In Search Of... integrating site search Ian Barber @ianbarber http://phpir.com ian@ibuildings.com http://joind.in/2172
  • 2. How Search Works Integrating Search Improving Results Using Search Search Performance Questions 2
  • 3. 3
  • 4. Query Query Query Query Query Parser Result Result Result Result Index Analyser Document Document Document Document 4
  • 5. Tokenisation “ With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than ” expected. 5
  • 6. PHP Tokenisation function tokenise($string) { $string = strtolower($string); preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE); return $matches[0]; } 6
  • 7. Document Term Pairs Document ID Term 1 the 1 best 1 of 1 the ... ... 204 and 204 what 204 would 7
  • 8. Inverted Index Term Documents best 1 (4, 16), 4 (422), 129 (344) ... what 24 (50, 98), 75 (33, 208) ... would 99 (32, 599), 201 (344) .. ... ... 8
  • 9. Boolean Query Merge Query: Best Western Hotel best 1 4 129 298 305 338 western 4 95 194 204 298 305 working 4 298 305 hotel 2 40 200 298 355 402 Result: Document 298 9
  • 10. [Slide of overlapping blocks of Lorem ipsum placeholder text]
Nulla ut turpisgravida elementum semper sodales quis purus enim ornare quam, vel gravida est vitae enim elementum semper sodales quis felis sollicitudin dictum sed non ipsum. Aliquam vel condimentum vel nibh. vel condimentum neque. enim vel nibh. ipsum. enim neque. fringilla arcu, et semper lacus egestas non. Praesent ut risus nulla, sed blandit leo. nisi, eget fringilla Nam non eros ipsum. Aliquam Curabitur ornare feugiatjusto. Donec ornare. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend Curabitur volutpateros nisi,lacus, fringilla vitae mauris vehicula consectetur elit metus. Nulla eleifend Nam non laoreet egetvel risus justo. Fusce ut Nam non eros nisi, eget fringilla justo. Quisque eu purus ut lacus egestas dapibus. consectetur arcu vestibulum vel. Donec inmassa et euismod. Vestibulum tincidunt Fusce vel risus vitae mauris amet mi. Nulla ut turpis tincidunt massavitae mauris Vestibulum facilisis sit vehicula id Fusce vel risus et euismod. vehicula felis sollicitudin dictum vel egestas vestibulum,amet in mi. Nulla elementum, vestibulum, justo dapibus fringilla arcu, et semper lacus turpis id sed non ipsum. facilisis sit amet in mi. Nulla ut elementum, facilisis sit justo vel egestas ut turpis id Integer in velit id est dictum bibendum in id mi. purus enim ornareblandit vel gravida est felis sollicitudin dictum sed non ipsum. sed Praesent ut risus nulla, enim vel nibh. quam, leo. purus enim ornare quam, velnon ipsum. felis sollicitudin dictum sed gravida est Praesent ut risus Curabitur volutpat laoreet lacus, ut enim vel ut risus nulla, sed blandit leo. nulla, sed blandit leo. Praesent nibh. consectetur arcu vestibulum vel. Donec Curabitur volutpat laoreet lacus, ut Curabitur volutpat laoreet lacus, ut dapibus Nam nonarcu, nisi, eget fringilla justo. arcu vestibulum vel. Donec fringilla eros consectetur arcu vestibulum vel. Donec et semper lacus consectetur Nam non eros nisi, eget fringilla justo. 
Fusce vel risus vitae mauris vehicula dapibus fringilla arcu, et semper lacus Fusce velfringilla arcu, et semper lacus dapibus risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus dapibus fringilla arcu, et semper lacus
  • 11. TF-IDF function getWeight($docID, $term, $total) { $tf = count($term[$docID]); $idf = log($total / count($term), 2); return $tf * $idf; }
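A minimal worked example of the weighting on this slide, assuming a posting list maps each document ID to the positions where the term occurs (the shape used on the inverted index slide). The posting data, the collection size of 8, and the guard for documents that lack the term are illustrative assumptions, not part of the original deck.

```php
<?php
// Posting list for one term: docID => positions where the term occurs.
// The numbers below are made up for illustration.
function getWeight($docID, array $postings, $total) {
    if (!isset($postings[$docID])) {
        return 0.0; // the term does not occur in this document
    }
    $tf  = count($postings[$docID]);          // term frequency in the document
    $idf = log($total / count($postings), 2); // rarer terms get a higher idf
    return $tf * $idf;
}

$what  = array(24 => array(50, 98), 75 => array(33));
$total = 8; // hypothetical number of documents in the collection

$w24 = getWeight(24, $what, $total);
$w75 = getWeight(75, $what, $total);
$w99 = getWeight(99, $what, $total);
printf("%.2f %.2f %.2f\n", $w24, $w75, $w99);
```

With two of eight documents containing the term, the idf is log2(4) = 2, so the script prints `4.00 2.00 0.00`: document 24 has two occurrences, document 75 has one, and document 99 has none.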
  • 12. Document Vector (columns: socket, what, heavy, steel, ...) Doc 1: 0.02, 0.3, 0.001, 0, ...; Doc 2: 0, 0, 0, 0, ...; Doc 3: 0.001, 0.2, 0, 0, ...; Doc 4: 0, 0, 0.002, 0.003, ...
  • 13. Ranked Query Merge best: 23 (weight 0.008), 42 (0.002), 179 (0.023), 246 (0.039), 333 (0.014), 703 (0.001); western: 42 (0.003), 88 (0.004), 120 (0.023), 179 (0.001), 246 (0.034), 798 (0.004). Result: 1 - 246: 0.073; 2 - 179: 0.024; 3 - 120: 0.023
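  • 14. PHP Similarity function score($queryString, $index) { $query = tokenise($queryString); $matches = array(); foreach($query as $qterm) { $term = $qterm[0]; /* tokenise() returns (term, offset) pairs */ if (!isset($index[$term])) { continue; } foreach($index[$term] as $id => $posting) { if (!isset($matches[$id])) { $matches[$id] = 0; } $matches[$id] += $posting['score']; } } arsort($matches); /* sorts in place and returns a bool */ return $matches; }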
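To make the scoring loop concrete, here is a self-contained sketch that runs the same sum-of-weights merge over the numbers from the Ranked Query Merge slide. The function name `rankedMerge` and the flat `docID => weight` posting shape are assumptions made for this example; the slide's own index stores a `score` key per posting.

```php
<?php
// Posting lists from the "Ranked Query Merge" slide: docID => weight.
$index = array(
    'best'    => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                       246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
);

// Sum the weight of every query term per document, then sort best first.
function rankedMerge(array $terms, array $index) {
    $scores = array();
    foreach ($terms as $term) {
        if (!isset($index[$term])) {
            continue; // an unknown term contributes nothing
        }
        foreach ($index[$term] as $docID => $weight) {
            if (!isset($scores[$docID])) {
                $scores[$docID] = 0.0;
            }
            $scores[$docID] += $weight;
        }
    }
    arsort($scores); // sorts in place and keeps the docID keys
    return $scores;
}

$ranked = rankedMerge(array('best', 'western'), $index);
$top3   = array_slice($ranked, 0, 3, true); // true preserves docID keys
foreach ($top3 as $docID => $score) {
    printf("%d: %.3f\n", $docID, $score);
}
```

The top three come out as 246 (0.073), 179 (0.024) and 120 (0.023), matching the ranking shown on the slide. Unlike the boolean merge, a document only needs to match one term to appear, which is why 120 ranks despite not containing "best".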
  • 16. MySQL Full Text Search CREATE TABLE example ( id INT(11) NOT NULL auto_increment, title VARCHAR(255), content TEXT, PRIMARY KEY(id), FULLTEXT(title,content) ) Engine=MyISAM; INSERT INTO example (title, content) VALUES ('Mikko & Bacon','Mikko loves bacon'), ('Marcello & Bacon','Marcello hates bacon'), ('Jo & Sausages','Johanna loves sausages'), ('Hollywood & Garlic','Lorenzo hates garlic'), ('James & Cheddar','James is keen on cheeses');
  • 17. MySQL FTI Query SELECT * FROM example WHERE MATCH(title,content) AGAINST('loves bacon'); +----+------------------+------------------------+ | id | title | content | +----+------------------+------------------------+ | 1 | Mikko & Bacon | Mikko loves bacon | | 2 | Marcello & Bacon | Marcello hates bacon | | 3 | Jo & Sausages | Johanna loves sausages | +----+------------------+------------------------+ 3 rows in set (0.00 sec)
  • 19. Sphinx Configuration source posts { type = mysql sql_host = localhost sql_user = user sql_pass = password sql_db = search sql_query = SELECT id, title, content FROM example; sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags; }
  • 20. index posts { source = posts path = /var/data/sphinx/example morphology = stem_en min_word_len = 3 min_prefix_len = 3 min_infix_len = 0 enable_star = 1 }
  • 21. Stemming http://tartarus.org/~martin/PorterStemmer happening - happen happened - happen happens - happen
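The idea behind stemming can be shown with a deliberately naive suffix stripper. This is not the Porter algorithm linked on the slide, which applies several ordered rule phases with conditions on the remaining stem; the function name `naiveStem` and its three-suffix rule list are assumptions chosen only to reproduce the slide's three examples.

```php
<?php
// Toy suffix stripper, NOT the Porter algorithm: it only knows three suffixes.
function naiveStem($word) {
    $word = strtolower($word);
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Keep at least three characters so short words are left alone.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

foreach (array('happening', 'happened', 'happens') as $word) {
    echo $word, ' - ', naiveStem($word), "\n";
}
```

All three words reduce to "happen", so a query for any one form matches documents containing the others. A rule list this short mangles plenty of real words, which is why the search engines in this deck ship proper stemmers.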
  • 22. Command Line Searching indexer --config /etc/sphinx.conf --all search --config /etc/sphinx.conf love bacon displaying matches: 1. document=1, weight=3, tag=(1,2) ! id=1 ! title=Mikko & Bacon ! content=Mikko loves bacon words: 1. 'love': 2 documents, 2 hits 2. 'bacon': 2 documents, 4 hits searchd --config /etc/sphinx.conf
  • 23. Sphinx From PHP $cl = new SphinxClient(); $cl->SetServer('localhost', 3312); $cl->SetMatchMode(SPH_MATCH_ANY); $result = $cl->Query('bac*'); $docIDs = array_keys($result["matches"]); $cl->SetFilter('tag', array(1)); $result = $cl->Query('bac*'); $docIDs = array_keys($result["matches"]);
  • 24. Swish-E http://swish-e.org pecl install swish-beta
  • 25. Filesystem Index With Swish-E /usr/local/bin/swish-e -S fs -c fs-swish-e.conf fs-swish-e.conf IndexDir /var/data/documents IndexFile fs-swish-e.index IndexOnly .doc .docx .pdf FuzzyIndexingMode Stemming_en1 FileFilter .pdf /usr/local/bin/swish_filter.pl FileFilter .doc /usr/local/bin/swish_filter.pl
  • 26. Crawling Content /usr/local/bin/swish-e -S prog -c www-swish-e.conf www-swish-e.conf IndexDir /usr/local/lib/swish-e/spider.pl IndexFile www-swish-e.index SwishProgParameters default http://phpir.com/ FuzzyIndexingMode Stemming_en1 DefaultContents HTML
  • 27. Swish-E With Multiple Indices $swish = new Swish( 'www-swish-e.index fs-swish-e.index' ); $search = $swish->prepare(); $queryStr = 'search string goes here'; $result = $search->execute($queryStr); $total = $result->hits; while($r = $result->nextResult()) { echo $r->swishdocpath; // url }
  • 28. Lucene
  • 29. Build Index $index = Zend_Search_Lucene::create('idx'); foreach($documents as $title => $content) { $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'title', $title)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'content', $content)); $index->addDocument($doc); }
  • 30. Query Zend Search Lucene $results = $index->find('loves bacon'); foreach($results as $result) { echo $result->score, " "; echo $result->title, "\n"; } Output: 0.81656279309067 Mikko and Bacon 0.24800278854758 Marcello & Bacon
  • 31. Index HTML $file = file_get_contents($url); $doc = Zend_Search_Lucene_Document_Html::loadHTML($file); $doc->addField( Zend_Search_Lucene_Field::Text( 'url', $url)); $index->addDocument($doc);
  • 33. Solr Search Index $options = array( 'hostname' => 'localhost', 'port' => 8983 ); $client = new SolrClient($options); $doc = new SolrInputDocument(); $doc->addField('id', $id); $doc->addField('cat', $category); $doc->addField('title', $title); $doc->addField('text', $text); $response = $client->addDocument($doc); $client->commit();
  • 34. Solr Search Client $client = new SolrClient($options); $query = new SolrQuery('bacon'); $response = $client->query($query); $r = $response->getResponse(); foreach($r['response']['docs'] as $d) { echo $d->title[0] . "\n"; }
  • 36. Xapian In PHP $db = new XapianWritableDatabase( 'idx', Xapian::DB_CREATE_OR_OPEN); $i = new XapianTermGenerator(); $i->set_stemmer(new XapianStem("english")); $doc = new XapianDocument(); $doc->set_data($content); $doc->add_value(1, $title); $i->set_document($doc); $i->index_text($content); $db->add_document($doc);
  • 37. Xapian Search In PHP $database = new XapianDatabase('idx'); $enquire = new XapianEnquire($database); $qp = new XapianQueryParser(); $qp->set_stemmer(new XapianStem("english")); $qp->set_database($database); $qp->set_stemming_strategy( XapianQueryParser::STEM_SOME); $query = $qp->parse_query($queryString); $enquire->set_query($query);
  • 38. $matches = $enquire->get_mset(0, 10); $i = $matches->begin(); while(!$i->equals($matches->end())) { $n = $i->get_rank() + 1; $data = $i->get_document()->get_data(); $title = $i->get_document()->get_value(1); $score = $i->get_percent(); $i->next(); }
  • 41. Parse Anchor Text $p = file_get_contents('http://phpir.com'); libxml_use_internal_errors(true); $dom = new DOMDocument(); $dom->loadHTML($p); /* loadHTML() is an instance method */ $links = $dom->getElementsByTagName('a'); foreach($links as $link) { $href = $link->getAttribute('href'); $text = $link->nodeValue; }
  • 42. Zone Weighting (diagram: a document split into zones numbered 1, 2 and 3)
  • 43. ZSL Zone Weighting $doc = new Zend_Search_Lucene_Document(); $tfield = Zend_Search_Lucene_Field::Text ('title', $title); $tfield->boost = 1.3; $doc->addField($tfield); $doc->addField( Zend_Search_Lucene_Field::UnStored ('content', $content)); $index->addDocument($doc);
  • 45. Document Weights in ZSL $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text ('title', $title)); $doc->addField( Zend_Search_Lucene_Field::UnStored ('content', $content)); $doc->boost = 1 + ($numComments / 100); $index->addDocument($doc);
  • 48. Sphinx Extract & Highlight $cl = new SphinxClient(); $cl->SetServer( "localhost", 3312 ); $q = 'bacon'; $r = $cl->Query($q); $text = array(); foreach ($r["matches"] as $doc => $info) { $text[$doc] = getTextFromDB($doc); } $extracts = $cl->BuildExcerpts($text, 'posts', $q); foreach($extracts as $extract) { echo $extract; }
  • 50. Xapian Spelling Correction Indexer: $indexer = new XapianTermGenerator(); $indexer->set_database($database); $indexer->set_flags( XapianTermGenerator::FLAG_SPELLING); Searcher: $queryString = "strreplace or str_cmp"; $q = new XapianQueryParser(); $q->set_database($database); $query = $q->parse_query($queryString, XapianQueryParser::FLAG_SPELLING_CORRECTION); echo "Did you mean: " . $q->get_corrected_query_string() . "\n";
  • 51. Spelling Correction Output php xapsearch.php Did you mean: str_replace or strcmp 4644 results found for “strreplace or str_cmp”: 1: 2% docid=572 [phpdocs/html/cc.license.html] 2: 2% docid=7169 [phpdocs/html/imagick.constants.html] 3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html] 4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
  • 53. Sorting in ZSL $q = Zend_Search_Lucene_Search_QueryParser::parse('search string'); $results = $index->find($q, 'title'); foreach($results as $result) { echo '<h3>', $result->title, "</h3>\n"; $doc = getDocumentFromDB($result->did); echo $q->htmlFragmentHighlightMatches($doc); }
  • 55. Faceted Search In Solr $client = new SolrClient($options); $query = new SolrQuery('bacon'); $query->setFacet(true); $query->addFacetField('cat'); /* facets must be requested before the query is sent */ $response = $client->query($query); $r = $response->getResponse(); $f = $r['facet_counts']['facet_fields']; foreach($f['cat'] as $facet => $count) { echo $facet . " " . $count . "\n"; }
  • 57. More Like This $rset = new XapianRset(); $rset->add_document(5959); // str_replace $e = $enquire->get_eset(40, $rset); $qs = array(); for($t = $e->begin(); !$t->equals($e->end()); $t->next()){ $qs[] = new XapianQuery($t->get_term(), intval($t->get_weight())); } $query = new XapianQuery( XapianQuery::OP_OR, $qs);
  • 58. More Like This Example php xapsim.php 1656 results found: 1: 100% docid=5959 [phpdocs/html/function.str-replace.html] 2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html] 3: 24% docid=5328 [phpdocs/html/function.preg-replace.html] 4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
  • 60. Index Updates (diagram: new docs go into a small delta index alongside the main index; queries search main and delta together)
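One way to read the main and delta layout is sketched below with plain PHP arrays standing in for the two indexes. Everything here is an assumption for illustration: the `term => docIDs` shape, the function names, and the sample data. A real engine such as Sphinx does this work on disk.

```php
<?php
// Hypothetical in-memory indexes: term => list of docIDs.
$main  = array('bacon' => array(1, 2), 'sausages' => array(3));
$delta = array(); // small index that receives new documents

// New documents only touch the delta index, so additions stay cheap.
function addToDelta(array &$delta, $docID, array $terms) {
    foreach (array_unique($terms) as $term) {
        $delta[$term][] = $docID;
    }
}

// A query has to consult both indexes and combine the postings.
function search($term, array $main, array $delta) {
    $fromMain  = isset($main[$term]) ? $main[$term] : array();
    $fromDelta = isset($delta[$term]) ? $delta[$term] : array();
    $docIDs = array_values(array_unique(array_merge($fromMain, $fromDelta)));
    sort($docIDs);
    return $docIDs;
}

// Periodically fold the delta into the main index and start a fresh delta.
function mergeDelta(array &$main, array &$delta) {
    foreach ($delta as $term => $docIDs) {
        $existing = isset($main[$term]) ? $main[$term] : array();
        $merged = array_values(array_unique(array_merge($existing, $docIDs)));
        sort($merged);
        $main[$term] = $merged;
    }
    $delta = array();
}

addToDelta($delta, 6, array('bacon', 'cheddar'));
$before = search('bacon', $main, $delta); // document 6 is found via the delta
mergeDelta($main, $delta);
$after = search('bacon', $main, $delta);  // same result, now served from main
```

The search result is the same before and after the merge; what changes is that the expensive rebuild of the large index happens on a schedule rather than on every new document.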
  • 61. Search Speed Zend Search Lucene $index = Zend_Search_Lucene::open('index'); $index->optimize(); Sphinx indexer --merge main delta --rotate Solr $client = new SolrClient($options); $client->optimize(); Xapian xapian-compact xapindex xapindex2
  • 62. Distributing Search (diagram: documents are split across several indexes, which the application queries)
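A rough sketch of the application side of a distributed search, assuming each index (shard) returns `docID => score` for the query and every document lives on exactly one shard. The shard contents and function names are hypothetical, and real shards would be remote servers rather than arrays.

```php
<?php
// Hypothetical shards: the scores each index would return for one query.
$shards = array(
    array(1 => 0.42, 7 => 0.10, 9 => 0.31),
    array(2 => 0.55, 4 => 0.05),
    array(3 => 0.29, 8 => 0.61),
);

// Highest-scoring $k entries, keeping docID keys.
function topK(array $scores, $k) {
    arsort($scores);
    return array_slice($scores, 0, $k, true);
}

function distributedSearch(array $shards, $k) {
    $combined = array();
    foreach ($shards as $shardScores) {
        // Each shard only ships its local top k, which limits traffic.
        // "+" keeps docID keys; a document is assumed to be on one shard only.
        $combined = $combined + topK($shardScores, $k);
    }
    return topK($combined, $k);
}

$results = distributedSearch($shards, 3);
foreach ($results as $docID => $score) {
    printf("%d: %.2f\n", $docID, $score);
}
```

Taking the local top k from each shard is enough to get the global top k, since no document outside a shard's own top k can beat k documents from that same shard. One caveat: scores are only comparable across shards if they are computed from shared statistics, since an idf calculated per shard will differ between them.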
  • 63. Large Scale Search http://www.nutch.org http://hadoop.apache.org
  • 64. Image Credits: Title http://www.flickr.com/photos/generated/2084287794/; What Do You Want http://www.flickr.com/photos/the_justified_sinner/2498066986/; You Are Here http://www.flickr.com/photos/alecvuijlsteke/2692475420/; Integrating Search http://www.flickr.com/photos/squeaks2569/3700355684/; Sphinx http://www.flickr.com/photos/generated/2084287794/; Lucene http://www.flickr.com/photos/mypanda/7731447/; Swish-e http://www.flickr.com/photos/ryan_fung/2239687100/; Solr http://www.flickr.com/photos/m-j-s/2724756177/; Xapian http://www.flickr.com/photos/olibac/3522056495/; Using Search http://www.flickr.com/photos/eneas/175027945/; Improving Search http://www.flickr.com/photos/x-ray_delta_one/3928200642/; Search Performance http://www.flickr.com/photos/maisonbisson/1634408/; Large Scale Search http://www.flickr.com/photos/zedzap/3663508847/
  • 66. Thank You! Ian Barber @ianbarber http://phpir.com ian@ibuildings.com http://joind.in/2172