In Search Of...
Integrating Site Search


                      Ian Barber
                     @ianbarber
               http://phpir.com
             ian@ibuildings.com
           http://joind.in/1556
How Search Works
Integrating Search
 Improving Results
     Using Search
         Questions




[Diagram: Query -> Parser -> Index -> Result; Document -> Analyser -> Index]
Tokenisation



“With AT&T’s help, the F.B.I Miami-Dade office had recovered
$1.1 million from O’Healy’s Ponzi scheme, 10-15% more than
expected.”
PHP Tokenisation

function tokenise($string) {
    $string = strtolower($string);
    // \w+ matches runs of letters, digits and underscores
    preg_match_all('/\w+/', $string,
            $matches, PREG_OFFSET_CAPTURE);
    // Each match is a [token, offset] pair
    return $matches[0];
}
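
To see what a plain word-character pattern does with awkward input like the sample sentence, here is a small runnable sketch. It repeats the function so that it runs on its own, assumes the pattern is `\w+`, and uses a shortened, hypothetical input with a straight ASCII apostrophe.

```php
<?php
// Sketch only: assumes the word pattern is \w+.
function tokenise(string $string): array {
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0]; // list of [token, byte offset] pairs
}

// Hypothetical input: a fragment of the sample sentence.
$tokens = tokenise("AT&T's help");
foreach ($tokens as [$term, $offset]) {
    echo $offset, ': ', $term, PHP_EOL;
}
// Prints:
// 0: at
// 3: t
// 5: s
// 7: help
```

"AT&T's" falls apart into three tokens, which is the kind of case a real analyser has to handle explicitly.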




Document Term Pairs
Document ID         Term
    1                the
    1               best
    1                of
    1                the
    ...              ...
   204               and
   204              what
   204              would
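
A minimal sketch of how such pairs could be produced. The two documents and the `termPairs` helper are hypothetical, chosen only to illustrate the shape of the output.

```php
<?php
// Hypothetical two-document collection, keyed by document ID.
$documents = [
    1 => 'The best of the best',
    2 => 'And what would the best do',
];

// Emit one [docID, term] pair per token, in document order.
function termPairs(array $documents): array {
    $pairs = [];
    foreach ($documents as $docID => $text) {
        preg_match_all('/\w+/', strtolower($text), $matches);
        foreach ($matches[0] as $term) {
            $pairs[] = [$docID, $term];
        }
    }
    return $pairs;
}

$pairs = termPairs($documents);
// $pairs[0] is [1, 'the'], $pairs[1] is [1, 'best'], and so on.
```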
Inverted Index
Term              Documents
best    1 (4, 16), 4 (422), 129 (344) ...

what    24 (50, 98), 75 (33, 208) ...

would   99 (32, 599), 201 (344) ...

 ...                    ...
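
One possible way to build this structure in PHP is sketched below. `buildIndex` and the sample documents are hypothetical, and positions are stored as byte offsets from `PREG_OFFSET_CAPTURE`; the slides do not say whether their positions are offsets or word counts.

```php
<?php
// Build term => [docID => [positions]] from docID => text.
function buildIndex(array $documents): array {
    $index = [];
    foreach ($documents as $docID => $text) {
        preg_match_all('/\w+/', strtolower($text), $matches, PREG_OFFSET_CAPTURE);
        foreach ($matches[0] as [$term, $offset]) {
            $index[$term][$docID][] = $offset;
        }
    }
    ksort($index); // keep terms in sorted order, like a dictionary
    return $index;
}

// Hypothetical documents.
$index = buildIndex([
    1 => 'the best of the best',
    4 => 'what would be best',
]);
// $index['best'] is [1 => [4, 16], 4 => [14]]
```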


Boolean Query Merge
Query: Best Western Hotel
 best     1    4    129   298   305   338
western   4   95    194   204   298   305


working   4   298   305
 hotel    2   40    200   298   355   402

Result: Document 298
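
The merge above can be sketched as a repeated intersection of posting lists. The posting lists are copied from the slide; `booleanAnd` is a hypothetical helper, and `array_intersect` stands in for the linear merge a real engine would run over sorted lists.

```php
<?php
// Posting lists from the slide: term => sorted document IDs.
$postings = [
    'best'    => [1, 4, 129, 298, 305, 338],
    'western' => [4, 95, 194, 204, 298, 305],
    'hotel'   => [2, 40, 200, 298, 355, 402],
];

// AND-merge: keep only documents present in every term's posting list.
function booleanAnd(array $postings, array $terms): array {
    $result = null;
    foreach ($terms as $term) {
        $docs = $postings[$term] ?? [];
        $result = $result === null
            ? $docs
            : array_values(array_intersect($result, $docs));
    }
    return $result ?? [];
}

$working = booleanAnd($postings, ['best', 'western']);          // [4, 298, 305]
$result  = booleanAnd($postings, ['best', 'western', 'hotel']); // [298]
```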
TF-IDF

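// $term is the posting list for one term: document ID => list of positions.
// $total is the number of documents in the collection.
function getWeight($docID, $term, $total) {
  $tf = count($term[$docID]);             // term frequency in this document
  $idf = log($total / count($term), 2);   // inverse document frequency
  return $tf * $idf;
}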
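
As a rough illustration of how the weight behaves, the sketch below runs the same calculation over a tiny hand-made index. The function name tfidfWeight, the postings layout (document ID mapped to a list of positions) and the sample numbers are assumptions made for this example, not data from the talk.

```php
<?php
// Same calculation as getWeight(), repeated so the example is self-contained.
// $postings maps document ID => list of positions of the term (assumed layout).
function tfidfWeight($docID, array $postings, $total) {
    $tf  = count($postings[$docID]);            // occurrences in this document
    $idf = log($total / count($postings), 2);   // rarer terms score higher
    return $tf * $idf;
}

// Hypothetical collection of 8 documents.
$total = 8;
$bacon = array(1 => array(4, 16), 4 => array(422));      // found in 2 of 8 documents
$the   = array(1 => array(0), 2 => array(3), 3 => array(1), 4 => array(7),
               5 => array(2), 6 => array(9), 7 => array(5), 8 => array(6)); // in all 8

$baconWeight = tfidfWeight(1, $bacon, $total);  // 2 * log2(8 / 2)
$theWeight   = tfidfWeight(1, $the, $total);    // 1 * log2(8 / 8)
printf("bacon: %.2f, the: %.2f\n", $baconWeight, $theWeight);
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the ranking however often it occurs.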




                                         11
Document Vector
        socket   what   heavy   steel   ...

Doc 1    0.02    0.3    0.001    0      ...

Doc 2     0       0       0      0      ...

Doc 3   0.001    0.2      0      0      ...

Doc 4     0       0     0.002   0.003   ...
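
One way to use these vectors is to score each document against a query vector with a dot product (cosine similarity would additionally divide by the vector lengths). The sketch below uses the weights from the table; the query "heavy steel", its weights, and the dotProduct helper are assumptions for illustration only.

```php
<?php
// Sparse document vectors: term => weight, copied from the table (zeros omitted).
$docs = array(
    1 => array('socket' => 0.02, 'what' => 0.3, 'heavy' => 0.001),
    2 => array(),
    3 => array('socket' => 0.001, 'what' => 0.2),
    4 => array('heavy' => 0.002, 'steel' => 0.003),
);

// Dot product of two sparse vectors; terms missing from either side add nothing.
function dotProduct(array $a, array $b) {
    $sum = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $sum += $weight * $b[$term];
        }
    }
    return $sum;
}

// Hypothetical query "heavy steel", each term weighted 1.
$query = array('heavy' => 1.0, 'steel' => 1.0);

$scores = array();
foreach ($docs as $id => $vector) {
    $scores[$id] = dotProduct($query, $vector);
}
arsort($scores);   // highest score first, document IDs kept as keys
print_r($scores);
```

Storing only the non-zero weights keeps the vectors small, since most terms do not occur in most documents.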


                                              12
Ranked Query Merge
  best     23    42   179   246   333   703

 weight   0.008 0.002 0.023 0.039 0.014 0.001

western    42    88   120   179   246   798

 weight   0.003 0.004 0.023 0.001 0.034 0.004

1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023
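
The merge above can be sketched in a few lines of PHP: add up each document's weight across the query terms, then sort. The array layout is an assumption for this example; the document IDs and weights are the ones on the slide.

```php
<?php
// Postings from the slide: document ID => weight, for each query term.
$postings = array(
    'best'    => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                       246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
);

// Accumulate each document's weight across every query term.
$scores = array();
foreach ($postings as $term => $docs) {
    foreach ($docs as $id => $weight) {
        if (!isset($scores[$id])) {
            $scores[$id] = 0.0;
        }
        $scores[$id] += $weight;
    }
}

arsort($scores);                           // highest combined weight first
$top = array_slice($scores, 0, 3, true);   // true keeps document IDs as keys
foreach ($top as $id => $score) {
    printf("%d: %.3f\n", $id, $score);
}
```

Unlike the boolean merge, a document matching only one term (such as 120) can still rank if its weight is high enough.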
                                           13
PHP Similarity
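function score($queryString, $index) {
  $matches = array();
  // tokenise() returns (token, offset) pairs; only the token is needed
  foreach(tokenise($queryString) as $qterm) {
    $term = $qterm[0];
    if (!isset($index[$term])) {
      continue; // term is not in the index
    }
    foreach($index[$term] as $id => $posting) {
      if (!isset($matches[$id])) {
        $matches[$id] = 0;
      }
      $matches[$id] += $posting['score'];
    }
  }
  arsort($matches); // sorts in place, highest score first
  return $matches;
}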

                                         14
Integrating Search
                     15
MySQL Full Text Search
CREATE TABLE example (
    id INT(11) NOT NULL auto_increment,
    title VARCHAR(255),
    content TEXT,
    PRIMARY KEY(id),
    FULLTEXT(title,content)
) Engine=MyISAM;

INSERT INTO example (title, content) VALUES
('Mikko & Bacon','Mikko loves bacon'),
('Marcello & Bacon','Marcello hates bacon'),
('Jo & Sausages','Johanna loves sausages'),
('Hollywood & Garlic','Lorenzo hates garlic'),
('James & Cheddar','James is keen on cheeses');
                                              16
MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');

+----+------------------+------------------------+
| id | title            | content                |
+----+------------------+------------------------+
|  1 | Mikko & Bacon    | Mikko loves bacon      |
|  2 | Marcello & Bacon | Marcello hates bacon   |
|  3 | Jo & Sausages    | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)




                                                 17
Sphinx
http://www.sphinxsearch.com




                        18
Sphinx Configuration
source posts
{
  type             = mysql
  sql_host         = localhost
  sql_user         = user
  sql_pass         = password
  sql_db           = search

  sql_query        = \
    SELECT id, title, content FROM example;
  sql_attr_multi   = uint tag from query; \
    SELECT example_id, tag_id FROM tags;
}
                                            19
index posts
{
  source     = posts
  path       = /var/data/sphinx/example
  morphology = stem_en

    min_word_len     =   3
    min_prefix_len   =   3
    min_infix_len    =   0
    enable_star      =   1
}




                                          20
Stemming
        http://tartarus.org/~martin/PorterStemmer




happening - happen
happened - happen
happens   - happen
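
To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm linked above, which applies many more rules and conditions; the naiveStem function and its suffix list are assumptions chosen only to reproduce the three examples on this slide.

```php
<?php
// Toy suffix stripper: NOT the Porter algorithm, just enough to show the idea.
function naiveStem($word) {
    $word = strtolower($word);
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when a reasonably long stem would remain.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

foreach (array('happening', 'happened', 'happens', 'happen') as $word) {
    echo $word, ' - ', naiveStem($word), "\n";
}
```

A rule set this small over-stems plenty of real words (for example "string" or "bus"), which is why search engines rely on a tested stemmer instead.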




                                              21
Command Line Searching
indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon

displaying matches:
1. document=1, weight=3, tag=(1,2)
! id=1
! title=Mikko & Bacon
! content=Mikko loves bacon
words:
1. 'love': 2 documents, 2 hits
2. 'bacon': 2 documents, 4 hits

searchd --config /etc/sphinx.conf

                                              22
Sphinx From PHP

$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);

$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);

$cl->SetFilter('tag', array(1));
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);


                                            23
Swish-E
   http://swish-e.org
pecl install swish-beta
                    24
Filesystem Index With Swish-E
/usr/local/bin/swish-e -S fs -c fs-swish-e.conf


fs-swish-e.conf
IndexDir            /var/data/documents
IndexFile           fs-swish-e.index
IndexOnly           .doc .docx .pdf
FuzzyIndexingMode   Stemming_en1

FileFilter .pdf /usr/local/bin/swish_filter.pl
FileFilter .doc /usr/local/bin/swish_filter.pl
Crawling Content
/usr/local/bin/swish-e -S prog -c www-swish-e.conf


www-swish-e.conf
IndexDir      /usr/local/lib/swish-e/spider.pl
IndexFile     www-swish-e.index
SwishProgParameters default http://phpir.com/

FuzzyIndexingMode Stemming_en1
DefaultContents   HTML
Swish-E With Multiple Indices
$swish     = new Swish(
   'www-swish-e.index fs-swish-e.index'
);
$search    = $swish->prepare();

$queryStr = 'search string goes here';
$result   = $search->execute($queryStr);
$total    = $result->hits;

while($r = $result->nextResult()) {
  echo $r->swishdocpath; // url
}
Lucene




         28
$index = Zend_Search_Lucene::create('idx');
foreach($documents as $title => $content) {
  $doc = new Zend_Search_Lucene_Document();
  $doc->addField(
    Zend_Search_Lucene_Field::Text(
      'title', $title));
  $doc->addField(
    Zend_Search_Lucene_Field::UnStored(
      'content', $content));
  $index->addDocument($doc);
}



                             Build Index
                                         29
$results = $index->find('loves bacon');
foreach($results as $result) {
  echo $result->score, " ";
  echo $result->title, "\n";
}

Output:
0.81656279309067 Mikko and Bacon
0.24800278854758 Marcello & Bacon



       Query Zend Search Lucene
                                          30
$file = file_get_contents($url);

$doc = Zend_Search_Lucene_Document_Html::
                           loadHTML($file);

$doc->addField(
   Zend_Search_Lucene_Field::Text(
     'url', $url));
$index->addDocument($doc);



                             Index HTML
                                         31
Solr
http://lucene.apache.org/solr/
                                 32
Solr Search Index
$options = array( 'hostname' => 'localhost',
                  'port'     => 8983 );

$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', $id);
$doc->addField('cat', $category);
$doc->addField('title', $title);
$doc->addField('text', $text);
$response = $client->addDocument($doc);
$client->commit();


                                          33
Solr Search Client
$client = new SolrClient($options);

$query = new SolrQuery('bacon');
$response = $client->query($query);
$r = $response->getResponse();

foreach($r['response']['docs'] as $d) {
  echo $d->title[0] . "\n";
}



                                          34
Xapian
http://xapian.org




              35
Xapian In PHP
$db = new XapianWritableDatabase(
      'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->set_stemmer(new XapianStem("english"));

$doc = new XapianDocument();
$doc->set_data($content);
$doc->add_value(1, $title);

$i->set_document($doc);
$i->index_text($content);
$db->add_document($doc);
                                         36
Xapian Search In PHP

$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQueryParser();
$qp->set_stemmer(new XapianStem("english"));
$qp->set_database($database);
$qp->set_stemming_strategy(
    XapianQueryParser::STEM_SOME);
$query = $qp->parse_query($queryString);

$enquire->set_query($query);


                                          37
$matches = $enquire->get_mset(0, 10);

$i = $matches->begin();
while(!$i->equals($matches->end())) {
  $n = $i->get_rank() + 1;
  $data = $i->get_document()->get_data();
  $title = $i->get_document()->get_value(1);
  $score = $i->get_percent();
  $i->next();
}




                                         38
Improving Results




                    39
Anchor Text




         40
Parse Anchor Text
$p = file_get_contents('http://phpir.com');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($p);
$links = $dom->getElementsByTagName('a');

foreach($links as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
}


                                            41
1
         2




         3



    Zone Weighting
                42
$doc = new Zend_Search_Lucene_Document();

$tfield = Zend_Search_Lucene_Field::Text
   ('title', $title);
$tfield->boost = 1.3;
$doc->addField($tfield);

$doc->addField(
  Zend_Search_Lucene_Field::UnStored
   ('content', $content));

$index->addDocument($doc);


                 ZSL Zone Weighting
                                            43
Document Authority




                44
Document Weights in ZSL
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
  Zend_Search_Lucene_Field::Text
   ('title', $title));
$doc->addField(
  Zend_Search_Lucene_Field::UnStored
   ('content', $content));

$doc->boost = 1 + ($numComments / 100);

$index->addDocument($doc);

                                            45
Using Search




          46
Summaries & Highlighting




                           47
Sphinx Extract & Highlight
$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 );
$q = 'bacon';
$r = $cl->Query($q);
$text = array();
foreach ($r["matches"] as $doc => $info) {
  $text[$doc] = getTextFromDB($doc);
}

$extracts = $cl->BuildExcerpts($text, 'posts', $q);
foreach($extracts as $extract) {
  echo $extract;
}
                                             48
Xapian Spelling Correction
Indexer
$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags(
   XapianTermGenerator::FLAG_SPELLING);
Searcher
$queryString = "strreplace or str_cmp";
$q = new XapianQueryParser();
$q->set_database($database);
$query = $q->parse_query($queryString,
XapianQueryParser::FLAG_SPELLING_CORRECTION);
echo "Did you mean: " .
  $q->get_corrected_query_string() . "\n";
                                          50
Spelling Correction Output
 php xapsearch.php

Did you mean: str_replace or strcmp

4644 results found for “strreplace or str_cmp”:
1: 2% docid=572
  [phpdocs/html/cc.license.html]
2: 2% docid=7169
  [phpdocs/html/imagick.constants.html]
3: 2% docid=10086
  [phpdocs/html/sqlite3result.fetcharray.html]
4: 2% docid=6132
  [phpdocs/html/function.swf-posround.html]

                                                  51
Results Sorting




                  52
Sorting in ZSL

$q = Zend_Search_Lucene_Search_QueryParser::
 parse('search string');

$results = $index->find($q, 'title');
foreach($results as $result) {
  echo '<h3>', $result->title, "</h3>\n";
  $doc = getDocumentFromDB($result->did);
  echo
    $q->htmlFragmentHighlightMatches($doc);
}


                                          53
Faceted Search




                 54
Faceted Search In Solr
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
// Facet options must be set before the query is sent
$query->setFacet(true);
$query->addFacetField('cat');
$response = $client->query($query);
$r = $response->getResponse();
$f = $r['facet_counts']['facet_fields'];
foreach($f['cat'] as $facet => $count) {
  echo $facet . " " . $count . "\n";
}


                                           55
More Like This




            56
More Like This
$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);

$t = $e->begin();
for($t; !$t->equals($e->end()); $t->next()){
  $qs[] = new XapianQuery($t->get_term(),
                  intval($t->get_weight()));
}

$query = new XapianQuery(
                  XapianQuery::OP_OR, $qs);
                                            57
More Like This Example
 php xapsim.php

1656 results found:
1: 100% docid=5959
    [phpdocs/html/function.str-replace.html]
2: 47% docid=5956
    [phpdocs/html/function.str-ireplace.html]
3: 24% docid=5328
    [phpdocs/html/function.preg-replace.html]
4: 18% docid=5958
    [phpdocs/html/function.str-repeat.html]


                                          58
Image Credits
Title                http://www.flickr.com/photos/generated/2084287794/
What Do You Want     http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here         http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search   http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx               http://www.flickr.com/photos/generated/2084287794/
Lucene               http://www.flickr.com/photos/mypanda/7731447/
Swish-e              http://www.flickr.com/photos/ryan_fung/2239687100/
Solr                 http://www.flickr.com/photos/m-j-s/2724756177/
Xapian               http://www.flickr.com/photos/olibac/3522056495/
Using Search         http://www.flickr.com/photos/eneas/175027945/
Improving Search     http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance   http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search   http://www.flickr.com/photos/zedzap/3663508847/

                                                                              59
Questions?




             60
Thank You!

                 Ian Barber
                @ianbarber
          http://phpir.com
        ian@ibuildings.com
      http://joind.in/1556

In Search Of... (Dutch PHP Conference 2010)

Nulla ut turpisgravida elementum semper sodales quis purus enim ornare quam, vel gravida est vitae enim elementum semper sodales quis felis sollicitudin dictum sed non ipsum. Aliquam vel condimentum vel nibh. vel condimentum neque. enim vel nibh. ipsum. enim neque. fringilla arcu, et semper lacus egestas non. Praesent ut risus nulla, sed blandit leo. nisi, eget fringilla Nam non eros ipsum. Aliquam Curabitur ornare feugiatjusto. Donec ornare. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend Curabitur volutpateros nisi,lacus, fringilla vitae mauris vehicula consectetur elit metus. Nulla eleifend Nam non laoreet egetvel risus justo. Fusce ut Nam non eros nisi, eget fringilla justo. Quisque eu purus ut lacus egestas dapibus. consectetur arcu vestibulum vel. Donec inmassa et euismod. Vestibulum tincidunt Fusce vel risus vitae mauris amet mi. Nulla ut turpis tincidunt massavitae mauris Vestibulum facilisis sit vehicula id Fusce vel risus et euismod. vehicula felis sollicitudin dictum vel egestas vestibulum,amet in mi. Nulla elementum, vestibulum, justo dapibus fringilla arcu, et semper lacus turpis id sed non ipsum. facilisis sit amet in mi. Nulla ut elementum, facilisis sit justo vel egestas ut turpis id Integer in velit id est dictum bibendum in id mi. purus enim ornareblandit vel gravida est felis sollicitudin dictum sed non ipsum. sed Praesent ut risus nulla, enim vel nibh. quam, leo. purus enim ornare quam, velnon ipsum. felis sollicitudin dictum sed gravida est Praesent ut risus Curabitur volutpat laoreet lacus, ut enim vel ut risus nulla, sed blandit leo. nulla, sed blandit leo. Praesent nibh. consectetur arcu vestibulum vel. Donec Curabitur volutpat laoreet lacus, ut Curabitur volutpat laoreet lacus, ut dapibus Nam nonarcu, nisi, eget fringilla justo. arcu vestibulum vel. Donec fringilla eros consectetur arcu vestibulum vel. Donec et semper lacus consectetur Nam non eros nisi, eget fringilla justo. 
Fusce vel risus vitae mauris vehicula dapibus fringilla arcu, et semper lacus Fusce velfringilla arcu, et semper lacus dapibus risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus dapibus fringilla arcu, et semper lacus
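  • 11. TF-IDF
        function getWeight($docID, $term, $total) {
            // $term is the posting list for one term: docID => list of positions
            $tf = count($term[$docID]);            // occurrences in this document
            $idf = log($total / count($term), 2);  // rarer terms weigh more
            return $tf * $idf;
        }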
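A minimal sketch of the TF-IDF slide in use, assuming a tiny made-up corpus and a positional inverted index shaped like the one on the Inverted Index slide (term => docID => positions). The corpus, the variable names, and the guard for a missing document are assumptions added for illustration.

```php
<?php
// Hypothetical toy corpus: document ID => text.
$docs = array(
    1 => 'the best of the best',
    2 => 'the best western hotel',
    3 => 'what would the hotel do',
    4 => 'a western',
);

// Build an inverted index: term => (docID => list of positions).
$index = array();
foreach ($docs as $docID => $text) {
    foreach (explode(' ', $text) as $position => $word) {
        $index[$word][$docID][] = $position;
    }
}

// Same calculation as the slide, plus a guard for documents without the term.
function getWeight($docID, $postings, $total) {
    $tf = isset($postings[$docID]) ? count($postings[$docID]) : 0;
    $idf = log($total / count($postings), 2);
    return $tf * $idf;
}

$total = count($docs);
$bestInDoc1 = getWeight(1, $index['best'], $total);
$theInDoc1 = getWeight(1, $index['the'], $total);
printf("best in doc 1: %.3f\n", $bestInDoc1);
printf("the in doc 1:  %.3f\n", $theInDoc1);
```

Both terms occur twice in document 1, but "the" appears in three of the four documents and "best" in only two, so "best" ends up with the higher weight.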
  • 12. Document Vector
                 socket   what    heavy   steel   ...
        Doc 1    0.02     0.3     0.001   0       ...
        Doc 2    0        0       0       0       ...
        Doc 3    0.001    0.2     0       0       ...
        Doc 4    0        0       0.002   0.003   ...
  • 13. Ranked Query Merge
        best      23      42      179     246     333     703
        weight    0.008   0.002   0.023   0.039   0.014   0.001
        western   42      88      120     179     246     798
        weight    0.003   0.004   0.023   0.001   0.034   0.004
        1 - 246: 0.073
        2 - 179: 0.024
        3 - 120: 0.023
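  • 14. PHP Similarity
        function score($queryString, $index) {
            $matches = array();
            // tokenise() returns (term, offset) pairs because of PREG_OFFSET_CAPTURE
            foreach (tokenise($queryString) as $token) {
                $qterm = $token[0];
                if (!isset($index[$qterm])) {
                    continue; // term is not in the index
                }
                foreach ($index[$qterm] as $id => $posting) {
                    if (!isset($matches[$id])) {
                        $matches[$id] = 0;
                    }
                    $matches[$id] += $posting['score'];
                }
            }
            arsort($matches); // sorts in place; its return value is only a boolean
            return $matches;
        }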
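The scoring loop can be checked against the numbers on the Ranked Query Merge slide. This is a sketch under simplifying assumptions: the posting lists hold a bare weight per document rather than a `$posting['score']` entry, the query is already split into terms, and `rankedMerge` is a hypothetical name.

```php
<?php
// Posting lists from the Ranked Query Merge slide: term => (docID => weight).
$index = array(
    'best'    => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                       246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
);

// Sum the weights of every query term per document, then sort descending.
function rankedMerge(array $terms, array $index) {
    $scores = array();
    foreach ($terms as $term) {
        if (!isset($index[$term])) {
            continue; // unknown term contributes nothing
        }
        foreach ($index[$term] as $docID => $weight) {
            $scores[$docID] = (isset($scores[$docID]) ? $scores[$docID] : 0) + $weight;
        }
    }
    arsort($scores);
    return $scores;
}

$ranked = rankedMerge(array('best', 'western'), $index);
foreach (array_slice($ranked, 0, 3, true) as $docID => $score) {
    printf("%d: %.3f\n", $docID, $score);
}
```

The top three documents come out as 246, 179 and 120, the same order as on the slide. Unlike the Boolean merge, a document such as 120 that matches only one term can still rank.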
  • 16. MySQL Full Text Search
        CREATE TABLE example (
            id INT(11) NOT NULL auto_increment,
            title VARCHAR(255),
            content TEXT,
            PRIMARY KEY(id),
            FULLTEXT(title,content)
        ) Engine=MyISAM;
        INSERT INTO example (title, content) VALUES
            ('Mikko & Bacon','Mikko loves bacon'),
            ('Marcello & Bacon','Marcello hates bacon'),
            ('Jo & Sausages','Johanna loves sausages'),
            ('Hollywood & Garlic','Lorenzo hates garlic'),
            ('James & Cheddar','James is keen on cheeses');
  • 17. MySQL FTI Query
        SELECT * FROM example
        WHERE MATCH(title,content) AGAINST('loves bacon');
        +----+------------------+------------------------+
        | id | title            | content                |
        +----+------------------+------------------------+
        |  1 | Mikko & Bacon    | Mikko loves bacon      |
        |  2 | Marcello & Bacon | Marcello hates bacon   |
        |  3 | Jo & Sausages    | Johanna loves sausages |
        +----+------------------+------------------------+
        3 rows in set (0.00 sec)
  • 19. Sphinx Configuration
        source posts {
            type = mysql
            sql_host = localhost
            sql_user = user
            sql_pass = password
            sql_db = search
            sql_query = SELECT id, title, content FROM example;
            sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags;
        }
  • 20. Sphinx Configuration (continued)
        index posts {
            source = posts
            path = /var/data/sphinx/example
            morphology = stem_en
            min_word_len = 3
            min_prefix_len = 3
            min_infix_len = 0
            enable_star = 1
        }
  • 21. Stemming
        http://tartarus.org/~martin/PorterStemmer
        happening - happen
        happened - happen
        happens - happen
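To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm linked above, which applies several ordered rule sets and measures the stem before removing anything. It only shows why the three example words collapse to one index term. `naiveStem` is a hypothetical name.

```php
<?php
// Toy suffix stripper, NOT the Porter algorithm: it only knows three suffixes.
function naiveStem($word) {
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Require a stem of at least three letters so short words are left alone.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

foreach (array('happening', 'happened', 'happens') as $word) {
    echo $word, ' - ', naiveStem($word), "\n";
}
```

A real stemmer handles many more cases (doubled consonants, "-ies", "-ational" and so on), so in practice use the stemmer built into the search engine, as in the `morphology = stem_en` setting.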
  • 22. Command Line Searching
        indexer --config /etc/sphinx.conf --all
        search --config /etc/sphinx.conf love bacon
        displaying matches:
        1. document=1, weight=3, tag=(1,2)
            id=1
            title=Mikko & Bacon
            content=Mikko loves bacon
        words:
        1. 'love': 2 documents, 2 hits
        2. 'bacon': 2 documents, 4 hits
        searchd --config /etc/sphinx.conf
  • 23. Sphinx From PHP
        $cl = new SphinxClient();
        $cl->SetServer('localhost', 3312);
        $cl->SetMatchMode(SPH_MATCH_ANY);
        // Prefix search across all documents
        $result = $cl->Query('bac*');
        $docIDs = array_keys($result["matches"]);
        // Same search, restricted to documents tagged with tag 1
        $cl->SetFilter('tag', array(1));
        $result = $cl->Query('bac*');
        $docIDs = array_keys($result["matches"]);
  • 24. Swish-E
        http://swish-e.org
        pecl install swish-beta
  • 25. Filesystem Index With Swish-E
        /usr/local/bin/swish-e -S fs -c fs-swish-e.conf
        fs-swish-e.conf:
            IndexDir /var/data/documents
            IndexFile fs-swish-e.index
            IndexOnly .doc .docx .pdf
            FuzzyIndexingMode Stemming_en1
            FileFilter .pdf /usr/local/bin/swish_filter.pl
            FileFilter .doc /usr/local/bin/swish_filter.pl
  • 26. Crawling Content
        /usr/local/bin/swish-e -S prog -c www-swish-e.conf
        www-swish-e.conf:
            IndexDir /usr/local/lib/swish-e/spider.pl
            IndexFile www-swish-e.index
            SwishProgParameters default http://phpir.com/
            FuzzyIndexingMode Stemming_en1
            DefaultContents HTML
  • 27. Swish-E With Multiple Indices
        $swish = new Swish('www-swish-e.index fs-swish-e.index');
        $search = $swish->prepare();
        $queryStr = 'search string goes here';
        $result = $search->execute($queryStr);
        $total = $result->hits;
        while ($r = $result->nextResult()) {
            echo $r->swishdocpath; // url
        }
  • 28. Lucene
  • 29. Build Index
        $index = Zend_Search_Lucene::create('idx');
        foreach ($documents as $title => $content) {
            $doc = new Zend_Search_Lucene_Document();
            $doc->addField(
                Zend_Search_Lucene_Field::Text('title', $title));
            $doc->addField(
                Zend_Search_Lucene_Field::UnStored('content', $content));
            $index->addDocument($doc);
        }
  • 30. Query Zend Search Lucene
        $results = $index->find('loves bacon');
        foreach ($results as $result) {
            echo $result->score, " ";
            echo $result->title, "\n";
        }
        Output:
        0.81656279309067 Mikko and Bacon
        0.24800278854758 Marcello & Bacon
  • 31. Index HTML
        $file = file_get_contents($url);
        $doc = Zend_Search_Lucene_Document_Html::loadHTML($file);
        $doc->addField(
            Zend_Search_Lucene_Field::Text('url', $url));
        $index->addDocument($doc);
  • 33. Solr Search Index
        $options = array(
            'hostname' => 'localhost',
            'port' => 8983
        );
        $client = new SolrClient($options);
        $doc = new SolrInputDocument();
        $doc->addField('id', $id);
        $doc->addField('cat', $category);
        $doc->addField('title', $title);
        $doc->addField('text', $text);
        $response = $client->addDocument($doc);
        $client->commit();
  • 34. Solr Search Client
        $client = new SolrClient($options);
        $query = new SolrQuery('bacon');
        $response = $client->query($query);
        $r = $response->getResponse();
        foreach ($r['response']['docs'] as $d) {
            echo $d->title[0] . "\n";
        }
  • 36. Xapian In PHP
        $db = new XapianWritableDatabase(
            'idx', Xapian::DB_CREATE_OR_OPEN);
        $i = new XapianTermGenerator();
        $i->set_stemmer(new XapianStem("english"));
        $doc = new XapianDocument();
        $doc->set_data($content);
        $doc->add_value(1, $title);
        $i->set_document($doc);
        $i->index_text($content);
        $db->add_document($doc);
  • 37. Xapian Search In PHP
        $database = new XapianDatabase('idx');
        $enquire = new XapianEnquire($database);
        $qp = new XapianQueryParser();
        $qp->set_stemmer(new XapianStem("english"));
        $qp->set_database($database);
        $qp->set_stemming_strategy(
            XapianQueryParser::STEM_SOME);
        $query = $qp->parse_query($queryString);
        $enquire->set_query($query);
  • 38. Xapian Search In PHP (continued)
        $matches = $enquire->get_mset(0, 10);
        $i = $matches->begin();
        while (!$i->equals($matches->end())) {
            $n = $i->get_rank() + 1;
            $data = $i->get_document()->get_data();
            $title = $i->get_document()->get_value(1);
            $score = $i->get_percent();
            $i->next();
        }
  • 41. Parse Anchor Text
        $p = file_get_contents('http://phpir.com');
        libxml_use_internal_errors(true); // suppress warnings from malformed HTML
        $dom = new DOMDocument();
        $dom->loadHTML($p); // instance call; the static call fails on PHP 8
        $links = $dom->getElementsByTagName('a');
        foreach ($links as $link) {
            $href = $link->getAttribute('href');
            $text = $link->nodeValue;
        }
  • 42. Zone Weighting (figure: page zones labelled 1, 2, 3)
  • 43. ZSL Zone Weighting
        $doc = new Zend_Search_Lucene_Document();
        $tfield = Zend_Search_Lucene_Field::Text('title', $title);
        $tfield->boost = 1.3;
        $doc->addField($tfield);
        $doc->addField(
            Zend_Search_Lucene_Field::UnStored('content', $content));
        $index->addDocument($doc);
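The effect of a zone boost can be seen without any library. The sketch below is a simplified model and not how Zend Search Lucene computes its scores: it counts term hits per zone and multiplies by the zone's boost. The 1.3 title boost comes from the slide. The 1.0 content boost, the sample documents, and `zoneScore` are assumptions.

```php
<?php
// Zone boosts: 1.3 for the title as on the slide, 1.0 assumed for the body.
$boosts = array('title' => 1.3, 'content' => 1.0);

// Hypothetical documents split into zones.
$docs = array(
    'a' => array('title' => 'bacon recipes',   'content' => 'how to cook breakfast'),
    'b' => array('title' => 'breakfast ideas', 'content' => 'eggs and bacon'),
);

// Score = sum over zones of (occurrences of the term in the zone * zone boost).
function zoneScore(array $doc, $term, array $boosts) {
    $score = 0.0;
    foreach ($boosts as $zone => $boost) {
        $words = explode(' ', strtolower($doc[$zone]));
        $hits = count(array_keys($words, $term, true));
        $score += $hits * $boost;
    }
    return $score;
}

foreach ($docs as $id => $doc) {
    printf("%s: %.1f\n", $id, zoneScore($doc, 'bacon', $boosts));
}
```

Both documents mention "bacon" once, but document a has it in the title, so it scores 1.3 against 1.0 for document b.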
  • 45. Document Weights in ZSL
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(
            Zend_Search_Lucene_Field::Text('title', $title));
        $doc->addField(
            Zend_Search_Lucene_Field::UnStored('content', $content));
        // Each comment adds 1% to the document's weight
        $doc->boost = 1 + ($numComments / 100);
        $index->addDocument($doc);
  • 48. Sphinx Extract & Highlight
        $cl = new SphinxClient();
        $cl->SetServer("localhost", 3312);
        $q = 'bacon';
        $r = $cl->Query($q);
        $text = array();
        foreach ($r["matches"] as $doc => $info) {
            $text[$doc] = getTextFromDB($doc);
        }
        $extracts = $cl->BuildExcerpts($text, 'posts', $q);
        foreach ($extracts as $extract) {
            echo $extract;
        }
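For readers without a Sphinx server to hand, the idea behind extract-and-highlight can be sketched in plain PHP. This is an illustration of the concept only, not Sphinx's excerpt algorithm. `buildExcerpt` and its `$around` parameter are hypothetical names.

```php
<?php
// Return a few words either side of the first match, with every match in <b> tags.
function buildExcerpt($text, $term, $around = 2) {
    $words = preg_split('/\s+/', trim($text));
    foreach ($words as $i => $word) {
        // Compare case-insensitively, ignoring trailing punctuation.
        if (strcasecmp(rtrim($word, '.,;:!?'), $term) !== 0) {
            continue;
        }
        $start = max(0, $i - $around);
        $window = array_slice($words, $start, $i + $around - $start + 1);
        // Escape the text first so only our own <b> tags reach the page.
        $excerpt = htmlspecialchars(implode(' ', $window));
        $pattern = '/\b' . preg_quote($term, '/') . '\b/i';
        return preg_replace($pattern, '<b>$0</b>', $excerpt);
    }
    return ''; // term not found
}

echo buildExcerpt('Mikko loves bacon and Marcello hates bacon for breakfast', 'bacon'), "\n";
```

A search engine's own excerpt builder is usually the better choice, because it applies the same tokenisation and stemming as the index and so also highlights "bacons" for a query of "bacon".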
  • 50. Xapian Spelling Correction
        Indexer:
            $indexer = new XapianTermGenerator();
            $indexer->set_database($database);
            $indexer->set_flags(
                XapianTermGenerator::FLAG_SPELLING);
        Searcher:
            $queryString = "strreplace or str_cmp";
            $q = new XapianQueryParser();
            $q->set_database($database);
            $query = $q->parse_query($queryString,
                XapianQueryParser::FLAG_SPELLING_CORRECTION);
            echo "Did you mean: " .
                $q->get_corrected_query_string() . "\n";
  • 51. Spelling Correction Output
        php xapsearch.php
        Did you mean: str_replace or strcmp
        4644 results found for “strreplace or str_cmp”:
        1: 2% docid=572 [phpdocs/html/cc.license.html]
        2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
        3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
        4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
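The same "did you mean" behaviour can be approximated with PHP's built-in `levenshtein()`. This sketch is not how Xapian does it internally. It assumes a small hand-written dictionary standing in for the terms in the index, and `suggest` and `$maxDistance` are hypothetical names.

```php
<?php
// Hypothetical dictionary standing in for the terms stored in the index.
$dictionary = array('str_replace', 'strcmp', 'strlen', 'substr', 'or');

// Return the dictionary term with the smallest edit distance to $word,
// or $word itself when nothing is within $maxDistance edits.
function suggest($word, array $dictionary, $maxDistance = 2) {
    $best = $word;
    $bestDistance = $maxDistance + 1;
    foreach ($dictionary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

$corrected = array();
foreach (explode(' ', 'strreplace or str_cmp') as $word) {
    $corrected[] = suggest($word, $dictionary);
}
echo 'Did you mean: ', implode(' ', $corrected), "\n";
```

Scanning the whole dictionary for every word is fine for a sketch but slow on a real index. A search engine's spelling support keeps a dedicated structure for this, which is why the flag has to be set at indexing time.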
  • 53. Sorting in ZSL
        $q = Zend_Search_Lucene_Search_QueryParser::parse('search string');
        $results = $index->find($q, 'title'); // second argument: field to sort by
        foreach ($results as $result) {
            echo '<h3>', $result->title, "</h3>\n";
            $doc = getDocumentFromDB($result->did);
            echo $q->htmlFragmentHighlightMatches($doc);
        }
  • 55. Faceted Search In Solr
        $client = new SolrClient($options);
        $query = new SolrQuery('bacon');
        // Facet options must be set before the query is sent
        $query->setFacet(true);
        $query->addFacetField('cat');
        $response = $client->query($query);
        $r = $response->getResponse();
        $f = $r['facet_counts']['facet_fields'];
        foreach ($f['cat'] as $facet => $count) {
            echo $facet . " " . $count . "\n";
        }
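What Solr returns in `facet_counts` is, in essence, a count of results per field value. A plain-PHP sketch of that idea, assuming a small in-memory result set: the `cat` values below are invented for illustration, and `facetCounts` is a hypothetical name.

```php
<?php
// Hypothetical search results, each with a 'cat' (category) field.
$results = array(
    array('title' => 'Mikko & Bacon',      'cat' => 'meat'),
    array('title' => 'Marcello & Bacon',   'cat' => 'meat'),
    array('title' => 'Jo & Sausages',      'cat' => 'meat'),
    array('title' => 'Hollywood & Garlic', 'cat' => 'vegetable'),
    array('title' => 'James & Cheddar',    'cat' => 'dairy'),
);

// Count how many results fall under each value of $field, most common first.
function facetCounts(array $results, $field) {
    $counts = array_count_values(array_column($results, $field));
    arsort($counts);
    return $counts;
}

foreach (facetCounts($results, 'cat') as $facet => $count) {
    echo $facet . " " . $count . "\n";
}
```

Counting in PHP only works when the whole result set is in memory. Letting the search server compute facets means the counts cover every match, not just the page of results that was fetched.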
  • 57. More Like This
        $rset = new XapianRset();
        $rset->add_document(5959); // str_replace
        // Ask for the 40 terms that best characterise the chosen document
        $e = $enquire->get_eset(40, $rset);
        $qs = array();
        for ($t = $e->begin(); !$t->equals($e->end()); $t->next()) {
            $qs[] = new XapianQuery($t->get_term(),
                intval($t->get_weight()));
        }
        $query = new XapianQuery(XapianQuery::OP_OR, $qs);
  • 58. More Like This Example
        php xapsim.php
        1656 results found:
        1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
        2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
        3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
        4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
  • 59. Image Credits
        Title: http://www.flickr.com/photos/generated/2084287794/
        What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
        You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
        Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
        Sphinx: http://www.flickr.com/photos/generated/2084287794/
        Lucene: http://www.flickr.com/photos/mypanda/7731447/
        Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
        Solr: http://www.flickr.com/photos/m-j-s/2724756177/
        Xapian: http://www.flickr.com/photos/olibac/3522056495/
        Using Search: http://www.flickr.com/photos/eneas/175027945/
        Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
        Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
        Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/
  • 61. Thank You! Ian Barber @ianbarber http://phpir.com ian@ibuildings.com http://joind.in/1556