In Search Of: Integrating Site Search (PHP Barcelona)

Despite being a key method of navigation on many sites, search functionality often gets the short end of the stick in development, either by handing the job over to Google or just enabling full text search on the appropriate column in the database. In this talk we will look at how full text search actually works, how to integrate local text search engines into your PHP application, and how it’s possible to actually provide better and more relevant results than Google itself, at least for your own site.

  1. In Search Of... Integrating Site Search. Ian Barber, @ianbarber, http://phpir.com, ian@ibuildings.com. Friday, 29 October 2010
  2. How Search Works, Integrating Search, Improving Results, Using Search, Search Performance, Questions
  3. (No text on this slide)
  4. Diagram labels: Documents, Analyser, Index, Queries, Query Parser, Results
  5. Tokenisation: “With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected.”
  6. PHP Tokenisation
        function tokenise($string) {
            $string = strtolower($string);
            // \w+ matches runs of word characters; byte offsets are captured too
            preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
            return $matches[0];
        }
  7. Document Term Pairs
        Document ID | Term
        1           | the
        1           | best
        1           | of
        1           | the
        ...         | ...
        204         | and
        204         | what
        204         | would
  8. Inverted Index
        Term  | Documents
        best  | 1 (4, 16), 4 (422), 129 (344) ...
        what  | 24 (50, 98), 75 (33, 208) ...
        would | 99 (32, 599), 201 (344) ...
        ...   | ...
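The inverted index on the slide above maps each term to the documents that contain it, along with positions. A minimal sketch of building such a structure in PHP, assuming the tokeniser from the earlier slide with the `\w+` pattern written out; `buildIndex` and the two sample documents are hypothetical and not taken from the talk:

```php
<?php
// Tokeniser as on the earlier slide: lower-case, then capture words with byte offsets.
function tokenise($string) {
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

// Hypothetical helper: builds term => [docID => [offset, ...]].
function buildIndex(array $documents) {
    $index = array();
    foreach ($documents as $docID => $text) {
        foreach (tokenise($text) as $token) {
            list($term, $offset) = $token;
            $index[$term][$docID][] = $offset;
        }
    }
    ksort($index); // keep terms in dictionary order
    return $index;
}

// Sample documents are made up for illustration.
$index = buildIndex(array(
    1 => 'The best of the best',
    4 => 'What would be best',
));
// $index['best'] now lists document 1 at offsets 4 and 16, and document 4 at offset 14.
```

Storing offsets alongside document IDs is what makes phrase and proximity queries possible later; an index that only needs Boolean matching could store the IDs alone.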
  9. Boolean Query Merge
        Query: Best Western Hotel
        best    | 1 4 129 298 305 338
        western | 4 95 194 204 298 305
        hotel   | 2 40 200 298 355 402
        working | 4 298 305
        Result: Document 298
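The merge on the slide above can be sketched as a pairwise intersection of sorted posting lists. This is an illustrative version using the document IDs from the slide; the function name `intersect` is an assumption, not something the talk names:

```php
<?php
// Intersect two ascending lists of document IDs in a single pass.
function intersect(array $a, array $b) {
    $result = array();
    $i = 0;
    $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            $result[] = $a[$i]; // present in both lists
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++; // advance whichever list is behind
        } else {
            $j++;
        }
    }
    return $result;
}

$postings = array(
    'best'    => array(1, 4, 129, 298, 305, 338),
    'western' => array(4, 95, 194, 204, 298, 305),
    'hotel'   => array(2, 40, 200, 298, 355, 402),
);

$working = intersect($postings['best'], $postings['western']); // 4, 298, 305
$result  = intersect($working, $postings['hotel']);            // 298
```

Because both lists are sorted, each intersection takes a single pass, and intersecting the shortest lists first keeps the working set small.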
  10. (Slide filled with repeated "Lorem ipsum dolor sit amet, consectetur adipiscing elit..." placeholder text)
  11. TF-IDF
        function getWeight($docID, $term, $total) {
            $tf = count($term[$docID]);
            $idf = log($total / count($term), 2);
            return $tf * $idf;
        }
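A worked example may help with the TF-IDF slide above. This sketch assumes that `$term` is the posting list for one term (document ID mapped to a list of positions, as on the Inverted Index slide) and that the collection holds 12 documents; that total is a made-up figure for illustration:

```php
<?php
// $postings maps docID => list of positions for a single term;
// $total is the number of documents in the collection.
function getWeight($docID, array $postings, $total) {
    $tf = count($postings[$docID]);           // occurrences in this document
    $idf = log($total / count($postings), 2); // rarer terms score higher
    return $tf * $idf;
}

// Posting list for "best", taken from the Inverted Index slide.
$best = array(1 => array(4, 16), 4 => array(422), 129 => array(344));
$total = 12; // hypothetical collection size

$w1 = getWeight(1, $best, $total); // tf = 2, idf = log2(12 / 3) = 2
$w4 = getWeight(4, $best, $total); // tf = 1, idf = 2
```

Under these assumptions document 1 scores twice as high as document 4 for "best", purely because the term occurs twice in it.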
  12. Document Vector
              | socket | what | heavy | steel | ...
        Doc 1 | 0.02   | 0.3  | 0.001 | 0     | ...
        Doc 2 | 0      | 0    | 0     | 0     | ...
        Doc 3 | 0.001  | 0.2  | 0     | 0     | ...
        Doc 4 | 0      | 0    | 0.002 | 0.003 | ...
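One common way to compare document vectors like those above is cosine similarity. The slides do not say which measure the talk used, so treat this as an assumption; the function name `cosine` is likewise illustrative. The vectors are stored sparsely, omitting zero weights:

```php
<?php
// Cosine similarity between two sparse term => weight vectors.
function cosine(array $a, array $b) {
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }
    $square = function ($w) { return $w * $w; };
    $normA = sqrt(array_sum(array_map($square, $a)));
    $normB = sqrt(array_sum(array_map($square, $b)));
    if ($normA == 0 || $normB == 0) {
        return 0.0; // an empty vector matches nothing
    }
    return $dot / ($normA * $normB);
}

// Weights from the Document Vector slide.
$doc1 = array('socket' => 0.02, 'what' => 0.3, 'heavy' => 0.001);
$doc3 = array('socket' => 0.001, 'what' => 0.2);
$doc4 = array('heavy' => 0.002, 'steel' => 0.003);

$sim13 = cosine($doc1, $doc3); // close to 1: both are dominated by "what"
$sim14 = cosine($doc1, $doc4); // close to 0: they only share "heavy"
```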
  13. Ranked Query Merge
        best    | 23    | 42    | 179   | 246   | 333   | 703
        weight  | 0.008 | 0.002 | 0.023 | 0.039 | 0.014 | 0.001
        western | 42    | 88    | 120   | 179   | 246   | 798
        weight  | 0.003 | 0.004 | 0.023 | 0.001 | 0.034 | 0.004
        1 - 246: 0.073
        2 - 179: 0.024
        3 - 120: 0.023
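  14. PHP Similarity
        function score($queryString, $index) {
            $query = tokenise($queryString);
            $matches = array();
            foreach ($query as $token) {
                $qterm = $token[0]; // tokenise() returns [term, offset] pairs
                if (!isset($index[$qterm])) {
                    continue; // term not in the index
                }
                foreach ($index[$qterm] as $id => $posting) {
                    if (!isset($matches[$id])) {
                        $matches[$id] = 0;
                    }
                    $matches[$id] += $posting['score'];
                }
            }
            arsort($matches); // sorts in place; its return value is a boolean
            return $matches;
        }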
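Applied to the postings on the Ranked Query Merge slide, summing per-term weights reproduces the ranking shown there. This is a self-contained sketch; `rankDocuments` and the flat `docID => weight` index layout are assumptions chosen to keep the example short:

```php
<?php
// Sum per-term weights for every document, then sort best-first.
function rankDocuments(array $queryTerms, array $index) {
    $scores = array();
    foreach ($queryTerms as $term) {
        if (!isset($index[$term])) {
            continue; // term not in the index
        }
        foreach ($index[$term] as $docID => $weight) {
            if (!isset($scores[$docID])) {
                $scores[$docID] = 0.0;
            }
            $scores[$docID] += $weight;
        }
    }
    arsort($scores); // sorts in place, keeping document IDs as keys
    return $scores;
}

// Weights copied from the Ranked Query Merge slide.
$index = array(
    'best' => array(
        23 => 0.008, 42 => 0.002, 179 => 0.023,
        246 => 0.039, 333 => 0.014, 703 => 0.001,
    ),
    'western' => array(
        42 => 0.003, 88 => 0.004, 120 => 0.023,
        179 => 0.001, 246 => 0.034, 798 => 0.004,
    ),
);

// Keep the three best documents, preserving their IDs as keys.
$top = array_slice(rankDocuments(array('best', 'western'), $index), 0, 3, true);
```

Document 246 comes first because it carries a high weight for both terms (0.039 + 0.034), while 120 reaches third place on a single term.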
  15. Integrating Search
  16. MySQL Full Text Search
        CREATE TABLE example (
            id INT(11) NOT NULL auto_increment,
            title VARCHAR(255),
            content TEXT,
            PRIMARY KEY(id),
            FULLTEXT(title,content)
        ) Engine=MyISAM;
        INSERT INTO example (title, content) VALUES
            ('Mikko & Bacon','Mikko loves bacon'),
            ('Marcello & Bacon','Marcello hates bacon'),
            ('Jo & Sausages','Johanna loves sausages'),
            ('Hollywood & Garlic','Lorenzo hates garlic'),
            ('James & Cheddar','James is keen on cheeses');
  17. MySQL FTI Query
        SELECT * FROM example
        WHERE MATCH(title,content) AGAINST('loves bacon');
        +----+------------------+------------------------+
        | id | title            | content                |
        +----+------------------+------------------------+
        |  1 | Mikko & Bacon    | Mikko loves bacon      |
        |  2 | Marcello & Bacon | Marcello hates bacon   |
        |  3 | Jo & Sausages    | Johanna loves sausages |
        +----+------------------+------------------------+
        3 rows in set (0.00 sec)
  18. Sphinx http://www.sphinxsearch.com
  19. Sphinx Configuration
        source posts {
            type = mysql
            sql_host = localhost
            sql_user = user
            sql_pass = password
            sql_db = search
            sql_query = SELECT id, title, content FROM example;
            sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags;
        }
  20. index posts {
            source = posts
            path = /var/data/sphinx/example
            morphology = stem_en
            min_word_len = 3
            min_prefix_len = 3
            min_infix_len = 0
            enable_star = 1
        }
  21. Stemming http://tartarus.org/~martin/PorterStemmer
        happening → happen
        happened → happen
        happens → happen
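To make the idea on the stemming slide concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm linked above, which applies many more rules and conditions; `naiveStem` and its three-suffix list are assumptions that happen to cover the slide's three examples:

```php
<?php
// Crude illustration only: strips a few common English suffixes.
// This is NOT the Porter stemmer; real stemmers apply far more rules.
function naiveStem($word) {
    $word = strtolower($word);
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when at least three characters of stem remain.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

$stems = array_map('naiveStem', array('happening', 'happened', 'happens'));
// All three words reduce to "happen", so they share one index entry.
```

In practice the engine's own stemmer (for example `morphology = stem_en` in the Sphinx configuration above) would be used instead, applied identically at index time and query time.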
  22. Command Line Searching
        indexer --config /etc/sphinx.conf --all
        search --config /etc/sphinx.conf love bacon
        displaying matches:
        1. document=1, weight=3, tag=(1,2)
            id=1
            title=Mikko & Bacon
            content=Mikko loves bacon
        words:
        1. 'love': 2 documents, 2 hits
        2. 'bacon': 2 documents, 4 hits
        searchd --config /etc/sphinx.conf
  23. Sphinx From PHP
        $cl = new SphinxClient();
        $cl->SetServer('localhost', 3312);
        $cl->SetMatchMode(SPH_MATCH_ANY);
        $result = $cl->Query('bac*');
        $docIDs = array_keys($result["matches"]);
        $cl->SetFilter('tag', array(1));
        $result = $cl->Query('bac*');
        $docIDs = array_keys($result["matches"]);
  24. Swish-E http://swish-e.org
        pecl install swish-beta
  25. Filesystem Index With Swish-E
        fs-swish-e.conf:
            IndexDir /var/data/documents
            IndexFile fs-swish-e.index
            IndexOnly .doc .docx .pdf
            FuzzyIndexingMode Stemming_en1
            FileFilter .pdf /usr/local/bin/swish_filter.pl
            FileFilter .doc /usr/local/bin/swish_filter.pl
        /usr/local/bin/swish-e -S fs -c fs-swish-e.conf
  26. Crawling Content
        www-swish-e.conf:
            IndexDir /usr/local/lib/swish-e/spider.pl
            IndexFile www-swish-e.index
            SwishProgParameters default http://phpir.com/
            FuzzyIndexingMode Stemming_en1
            DefaultContents HTML
        /usr/local/bin/swish-e -S prog -c www-swish-e.conf
  27. Swish-E With Multiple Indices
        $swish = new Swish('www-swish-e.index fs-swish-e.index');
        $search = $swish->prepare();
        $queryStr = 'search string goes here';
        $result = $search->execute($queryStr);
        $total = $result->hits;
        while ($r = $result->nextResult()) {
            echo $r->swishdocpath; // url
        }
  28. Lucene
  29. Build Index
        $index = Zend_Search_Lucene::create('idx');
        foreach ($documents as $title => $content) {
            $doc = new Zend_Search_Lucene_Document();
            $doc->addField(Zend_Search_Lucene_Field::Text('title', $title));
            $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $content));
            $index->addDocument($doc);
        }
  30. Query Zend Search Lucene
        $results = $index->find('loves bacon');
        foreach ($results as $result) {
            echo $result->score, " ";
            echo $result->title, "\n";
        }
        Output:
        0.81656279309067 Mikko and Bacon
        0.24800278854758 Marcello & Bacon
  31. Index HTML
        $file = file_get_contents($url);
        $doc = Zend_Search_Lucene_Document_Html::loadHTML($file);
        $doc->addField(Zend_Search_Lucene_Field::Text('url', $url));
        $index->addDocument($doc);
  32. Solr http://lucene.apache.org/solr/
  33. Solr Search Index
        $options = array(
            'hostname' => 'localhost',
            'port' => 8983
        );
        $client = new SolrClient($options);
        $doc = new SolrInputDocument();
        $doc->addField('id', $id);
        $doc->addField('cat', $category);
        $doc->addField('title', $title);
        $doc->addField('text', $text);
        $response = $client->addDocument($doc);
        $client->commit();
  34. Solr Search Client
        $client = new SolrClient($options);
        $query = new SolrQuery('bacon');
        $response = $client->query($query);
        $r = $response->getResponse();
        foreach ($r['response']['docs'] as $d) {
            echo $d->title[0] . "\n";
        }
  35. Xapian http://xapian.org
  36. Xapian In PHP
        $db = new XapianWritableDatabase('idx', Xapian::DB_CREATE_OR_OPEN);
        $i = new XapianTermGenerator();
        $i->set_stemmer(new XapianStem("english"));
        $doc = new XapianDocument();
        $doc->set_data($content);
        $doc->add_value(1, $title);
        $i->set_document($doc);
        $i->index_text($content);
        $db->add_document($doc);
  37. Xapian Search In PHP
        $database = new XapianDatabase('idx');
        $enquire = new XapianEnquire($database);
        $qp = new XapianQueryParser();
        $qp->set_stemmer(new XapianStem("english"));
        $qp->set_database($database);
        $qp->set_stemming_strategy(XapianQueryParser::STEM_SOME);
        $query = $qp->parse_query($queryString);
        $enquire->set_query($query);
  38. $matches = $enquire->get_mset(0, 10);
        $i = $matches->begin();
        while (!$i->equals($matches->end())) {
            $n = $i->get_rank() + 1;
            $data = $i->get_document()->get_data();
            $title = $i->get_document()->get_value(1);
            $score = $i->get_percent();
            $i->next();
        }
  39. Improving Results
  40. Anchor Text
  41. Parse Anchor Text
        $p = file_get_contents('http://phpir.com');
        libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($p); // loadHTML() is an instance method
        $links = $dom->getElementsByTagName('a');
        foreach ($links as $link) {
            $href = $link->getAttribute('href');
            $text = $link->nodeValue;
        }
  42. Zone Weighting (diagram with zones numbered 1, 2, 3)
  43. ZSL Zone Weighting
        $doc = new Zend_Search_Lucene_Document();
        $tfield = Zend_Search_Lucene_Field::Text('title', $title);
        $tfield->boost = 1.3;
        $doc->addField($tfield);
        $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $content));
        $index->addDocument($doc);
  44. Document Authority
  45. Document Weights in ZSL
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Text('title', $title));
        $doc->addField(Zend_Search_Lucene_Field::UnStored('content', $content));
        $doc->boost = 1 + ($numComments / 100);
        $index->addDocument($doc);
  46. Using Search
  47. Summaries & Highlighting
  48. Sphinx Extract & Highlight
        $cl = new SphinxClient();
        $cl->SetServer("localhost", 3312);
        $q = 'bacon';
        $r = $cl->Query($q);
        $text = array();
        foreach ($r["matches"] as $doc => $info) {
            $text[$doc] = getTextFromDB($doc);
        }
        // Keep the excerpts in the same variable the loop below reads
        $extracts = $cl->BuildExcerpts($text, 'posts', $q);
        foreach ($extracts as $extract) {
            echo $extract;
        }
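Where the search engine does not build excerpts itself, highlighting can be done in plain PHP. The following is a minimal sketch, not anything shown in the talk; `highlight` and the choice of `<b>` tags are assumptions:

```php
<?php
// Wrap whole-word, case-insensitive matches of the query terms in <b> tags.
// The text is split first and escaped piece by piece, so the markup we add
// is never escaped and the document text always is.
function highlight($text, array $terms) {
    if (count($terms) === 0) {
        return htmlspecialchars($text);
    }
    $escaped = array_map(function ($t) { return preg_quote($t, '/'); }, $terms);
    $pattern = '/\b(' . implode('|', $escaped) . ')\b/i';
    $parts = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        // With one capture group, odd-numbered parts are the matched terms.
        $safe = htmlspecialchars($part);
        $out .= ($i % 2 === 1) ? '<b>' . $safe . '</b>' : $safe;
    }
    return $out;
}

$plain = highlight('Mikko loves bacon', array('bacon'));
$mixed = highlight('Bacon & eggs', array('bacon'));
```

This matches exact terms only. If the index stems its terms, the query terms would need the same treatment before matching.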
  49. (No text on this slide)
  50. Xapian Spelling Correction
        Indexer:
            $indexer = new XapianTermGenerator();
            $indexer->set_database($database);
            $indexer->set_flags(XapianTermGenerator::FLAG_SPELLING);
        Searcher:
            $queryString = "strreplace or str_cmp";
            $q = new XapianQueryParser();
            $q->set_database($database);
            $query = $q->parse_query($queryString,
                XapianQueryParser::FLAG_SPELLING_CORRECTION);
            echo "Did you mean: " . $q->get_corrected_query_string() . "\n";
  51. Spelling Correction Output
        php xapsearch.php
        Did you mean: str_replace or strcmp
        4644 results found for “strreplace or str_cmp”:
        1: 2% docid=572 [phpdocs/html/cc.license.html]
        2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
        3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
        4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
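The idea behind a "did you mean" suggestion can be illustrated with PHP's built-in `levenshtein()`. This is a sketch of the general technique, not of how Xapian implements spelling correction; `suggest`, the distance cut-off, and the small dictionary are assumptions:

```php
<?php
// Return the dictionary term closest to $word by edit distance,
// or null if nothing is within $maxDistance edits.
function suggest($word, array $dictionary, $maxDistance = 2) {
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($dictionary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Hypothetical dictionary; in practice this would be the indexed terms.
$dictionary = array('str_replace', 'strcmp', 'strlen', 'substr');

$first  = suggest('strreplace', $dictionary); // one insertion away from str_replace
$second = suggest('str_cmp', $dictionary);    // one deletion away from strcmp
```

Scanning the whole dictionary is only workable for small vocabularies, which is one reason to let the search engine maintain its own spelling data.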
  52. Results Sorting
  53. Sorting in ZSL
        $q = Zend_Search_Lucene_Search_QueryParser::parse('search string');
        $results = $index->find($q, 'title');
        foreach ($results as $result) {
            echo '<h3>', $result->title, "</h3>\n";
            $doc = getDocumentFromDB($result->did);
            echo $q->htmlFragmentHighlightMatches($doc);
        }
  54. Faceted Search
  55. Faceted Search In Solr
        $client = new SolrClient($options);
        $query = new SolrQuery('bacon');
        // Facets must be enabled before the query is sent
        $query->setFacet(true);
        $query->addFacetField('cat');
        $response = $client->query($query);
        $r = $response->getResponse();
        $f = $r['facet_counts']['facet_fields'];
        foreach ($f['cat'] as $facet => $count) {
            echo $facet . " " . $count . "\n";
        }
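What a facet count represents can be shown without a search server: group the matching documents by a field and count each value. This is a plain PHP sketch; `facetCounts` and the sample category values are hypothetical:

```php
<?php
// Count how many matching documents fall into each value of $field.
function facetCounts(array $docs, $field) {
    $values = array();
    foreach ($docs as $doc) {
        if (isset($doc[$field])) {
            $values[] = $doc[$field];
        }
    }
    $counts = array_count_values($values);
    arsort($counts); // most common value first
    return $counts;
}

// Made-up result set for a query such as "bacon".
$matches = array(
    array('id' => 1, 'cat' => 'recipes'),
    array('id' => 2, 'cat' => 'reviews'),
    array('id' => 3, 'cat' => 'recipes'),
);

$facets = facetCounts($matches, 'cat');
foreach ($facets as $facet => $count) {
    echo $facet . " " . $count . "\n";
}
```

An engine such as Solr computes these counts over the full result set inside the index, which avoids fetching every matching document into PHP.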
  56. More Like This
  57. More Like This
        $rset = new XapianRset();
        $rset->add_document(5959); // str_replace
        $e = $enquire->get_eset(40, $rset);
        $t = $e->begin();
        for ($t; !$t->equals($e->end()); $t->next()) {
            $qs[] = new XapianQuery($t->get_term(), intval($t->get_weight()));
        }
        $query = new XapianQuery(XapianQuery::OP_OR, $qs);
  58. More Like This Example
        php xapsim.php
        1656 results found:
        1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
        2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
        3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
        4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
  59. Search Performance
  60. Index Updates (diagram labels: Docs, New, Main, Delta, Query)
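The main-plus-delta scheme in the diagram above keeps a large main index that is rebuilt rarely and a small delta index holding recent changes, with queries consulting both. A minimal sketch of the merge step, assuming each index returns `docID => score` and that a delta entry should supersede a main entry for the same document; the function name is illustrative:

```php
<?php
// Merge docID => score results from the main and delta indexes.
// Delta entries win, since they reflect the newest version of a document.
function searchMainAndDelta(array $mainResults, array $deltaResults) {
    // Array union keeps the left-hand value when a key exists in both.
    $merged = $deltaResults + $mainResults;
    arsort($merged); // best score first, document IDs preserved
    return $merged;
}

// Made-up scores: document 2 was re-indexed into the delta, 7 is new.
$main  = array(1 => 0.5, 2 => 0.3);
$delta = array(2 => 0.9, 7 => 0.4);

$merged = searchMainAndDelta($main, $delta);
```

Periodically folding the delta into the main index, as `indexer --merge main delta --rotate` does for Sphinx, keeps the delta small and the merge cheap.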
  61. Search Speed
        Zend Search Lucene:
            $index = Zend_Search_Lucene::open('index');
            $index->optimize();
        Sphinx:
            indexer --merge main delta --rotate
        Solr:
            $client = new SolrClient($options);
            $client->optimize();
        Xapian:
            xapian-compact xapindex xapindex2
  62. Distributing Search (diagram labels: Application, Index, Documents)
  63. Large Scale Search http://www.nutch.org http://hadoop.apache.org
  64. Image Credits
        Title: http://www.flickr.com/photos/generated/2084287794/
        What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
        You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
        Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
        Sphinx: http://www.flickr.com/photos/generated/2084287794/
        Lucene: http://www.flickr.com/photos/mypanda/7731447/
        Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
        Solr: http://www.flickr.com/photos/m-j-s/2724756177/
        Xapian: http://www.flickr.com/photos/olibac/3522056495/
        Using Search: http://www.flickr.com/photos/eneas/175027945/
        Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
        Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
        Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/
  65. Questions?
  66. Thank You! Ian Barber, @ianbarber, http://phpir.com, ian@ibuildings.com
