Your SlideShare is downloading. ×
0
In Search Of...
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
integrating site search
Friday, 29 October 2010
2
How Search Works
Integrating Search
Improving Results
Using Search
Search Performance
Questions
Friday, 29 October 2010
3
Friday, 29 October 2010
4
Index
DocumentDocumentDocumentDocumentAnalyser
Query
Parser
QueryQueryQueryQuery
ResultResultResultResult
Friday, 29 Oct...
5
With AT&T’s help, the F.B.I
Miami-Dade office had recovered
$1.1 million from O’Healy’s Ponzi
scheme, 10-15% more than
ex...
6
PHP Tokenisation
function tokenise($string) {
$string = strtolower($string);
preg_match_all('/w+/', $string,
$matches, P...
7
Document Term Pairs
Document ID Term
1 the
1 best
1 of
1 the
... ...
204 and
204 what
204 would
Friday, 29 October 2010
8
Inverted Index
Term Documents
best 1 (4, 16), 4 (422), 129 (344) ...
what 24 (50, 98), 75 (33, 208) ...
would 99 (32, 59...
9
Boolean Query Merge
Query: Best Western Hotel
Result: Document 298
best 1 4 129 298 305 338
western 4 95 194 204 298 305...
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Sed sit amet ante vitae enim
elementum semper sodales quis ipsum....
11
TF-IDF
function getWeight($docID, $term, $total) {
$tf = count($term[$docID]);
$idf = log($total / count($term), 2);
re...
12
Document Vector
socket what heavy steel ...
Doc 1 0.02 0.3 0.001 0 ...
Doc 2 0 0 0 0 ...
Doc 3 0.001 0.2 0 0 ...
Doc 4 ...
best 23 42 179 246 333 703
weight 0.008 0.002 0.023 0.039 0.014 0.001
western 42 88 120 179 246 798
weight 0.003 0.004 0.0...
14
PHP Similarity
function score($queryString, $index) {
$query = tokenize($queryString);
$matches = array();
foreach($que...
15
Integrating Search
Friday, 29 October 2010
16
CREATE TABLE example (
id INT(11) NOT NULL auto_increment,
title VARCHAR(255),
content TEXT,
PRIMARY KEY(id),
FULLTEXT(...
17
MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');
+----+------------------+-----...
18
Sphinx
http://www.sphinxsearch.com
Friday, 29 October 2010
19
Sphinx Configuration
source posts
{
type = mysql
sql_host = localhost
sql_user = user
sql_pass = password
sql_db = searc...
20
index posts
{
source = posts
path = /var/data/sphinx/example
morphology = stem_en
min_word_len = 3
min_prefix_len = 3
m...
21
Stemming
happening
happened
happens
http://tartarus.org/~martin/PorterStemmer
- happen
- happen
- happen
Friday, 29 Oct...
22
Command Line Searching
indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon
displaying m...
23
Sphinx From PHP
$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);
$result ...
24
Swish-E
http://swish-e.org
pecl install swish-beta
Friday, 29 October 2010
Filesystem Index With Swish-E
IndexDir /var/data/documents
IndexFile fs-swish-e.index
IndexOnly .doc .docx .pdf
FuzzyIndex...
Crawling Content
IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile www-swish-e.index
SwishProgParameters default http://...
Swish-E With Multiple Indices
$swish = new Swish(
'www-swish-e.index fs-swish-e.index'
);
$search = $swish->prepare();
$qu...
28
Lucene
Friday, 29 October 2010
29
$index = Zend_Search_Lucene::create('idx');
foreach($documents as $title => $content) {
$doc = new Zend_Search_Lucene_D...
30
$results = $index->find('loves bacon');
foreach($results as $result) {
echo $result->score, " ";
echo $result->title, "...
31
$file = file_get_contents($url);
$doc = Zend_Search_Lucene_Document_Html::
loadHTML($file);
$doc->addField(
Zend_Search...
32
Solr
http://lucene.apache.org/solr/
Friday, 29 October 2010
33
Solr Search Index
$options = array( 'hostname' => 'localhost',
'port' => 8983 );
$client = new SolrClient($options);
$d...
34
Solr Search Client
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
$response = $client->query($que...
35
Xapian
http://xapian.org
Friday, 29 October 2010
36
Xapian In PHP
$db = new XapianWritableDatabase(
'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->...
37
Xapian Search In PHP
$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQu...
38
$matches = $enquire->get_mset(0, 10);
$i = $matches->begin();
while(!$i->equals($matches->end())) {
$n = $i->get_rank()...
39
Improving Results
Friday, 29 October 2010
40
Anchor Text
Friday, 29 October 2010
41
$p = file_get_contents('http://phpir.com');
libxml_use_internal_errors(true);
$dom = DomDocument::loadHTML($p);
$links ...
42
1
2
3
Zone Weighting
Friday, 29 October 2010
43
ZSL Zone Weighting
$doc = new Zend_Search_Lucene_Document();
$tfield = Zend_Search_Lucene_Field::Text
('title', $title)...
44
Document Authority
Friday, 29 October 2010
45
Document Weights in ZSL
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
Zend_Search_Lucene_Field::Text
('titl...
46
Using Search
Friday, 29 October 2010
47
Summaries & Highlighting
Friday, 29 October 2010
48
Sphinx Extract & Highlight
$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 );
$q = 'bacon';
$r = $cl->Query...
Friday, 29 October 2010
50
Xapian Spelling Correction
$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags...
51
Spelling Correction Output
php xapsearch.php
Did you mean: str_replace or strcmp
4644 results found for “strreplace or ...
52
Results Sorting
Friday, 29 October 2010
53
Sorting in ZSL
$q = Zend_Search_Lucene_Search_QueryParser::
parse('search string');
$results = $index->find($q, 'title'...
54
Faceted Search
Friday, 29 October 2010
55
Faceted Search In Solr
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
$response = $client->query(...
56
More Like This
Friday, 29 October 2010
57
More Like This
$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);
...
58
More Like This Example
php xapsim.php
1656 results found:
1: 100% docid=5959
[phpdocs/html/function.str-replace.html]
2...
59
Search Performance
Friday, 29 October 2010
60
Index Updates
Docs
Main
New
Delta
Delta Main
Query
Delta Main
Main
DocsDocsDocs
Friday, 29 October 2010
61
Search Speed
$index = Zend_Search_Lucene::open('index');
$index->optimize();
indexer --merge main delta --rotate
Zend S...
62
Distributing Search
Index
Application
Index Index
DocumentDocumentDocumentDocument
Friday, 29 October 2010
63
Large Scale Search
http://www.nutch.org
http://hadoop.apache.org
Friday, 29 October 2010
64
Image Credits
Title http://www.flickr.com/photos/generated/2084287794/
What Do You Want http://www.flickr.com/photos/the_...
Questions?
65
Friday, 29 October 2010
Thank You!
Ian Barber
@ianbarber
http://phpir.com
ian@ibuildings.com
Friday, 29 October 2010
Upcoming SlideShare
Loading in...5
×

In Search Of: Integrating Site Search (PHP Barcelona)

3,595

Published on

Despite being a key method of navigation on many sites, search functionality often gets the short end of the stick in development, either by handing the job over to Google or just enabling full text search on the appropriate column in the database. In this talk we will look at how full text search actually works, how to integrate local text search engines into your PHP application, and how it’s possible to actually provide better and more relevant results than Google itself, at least for your own site.

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
3,595
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
69
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "In Search Of: Integrating Site Search (PHP Barcelona)"

  1. 1. In Search Of... Ian Barber @ianbarber http://phpir.com ian@ibuildings.com integrating site search Friday, 29 October 2010
  2. 2. 2 How Search Works Integrating Search Improving Results Using Search Search Performance Questions Friday, 29 October 2010
  3. 3. 3 Friday, 29 October 2010
  4. 4. 4 Index DocumentDocumentDocumentDocumentAnalyser Query Parser QueryQueryQueryQuery ResultResultResultResult Friday, 29 October 2010
  5. 5. 5 With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected. Tokenisation “ ”Friday, 29 October 2010
  6. 6. 6 PHP Tokenisation function tokenise($string) { $string = strtolower($string); preg_match_all('/w+/', $string, $matches, PREG_OFFSET_CAPTURE); return $matches[0]; } Friday, 29 October 2010
  7. 7. 7 Document Term Pairs Document ID Term 1 the 1 best 1 of 1 the ... ... 204 and 204 what 204 would Friday, 29 October 2010
  8. 8. 8 Inverted Index Term Documents best 1 (4, 16), 4 (422), 129 (344) ... what 24 (50, 98), 75 (33, 208) ... would 99 (32, 599), 201 (344) .. ... ... Friday, 29 October 2010
  9. 9. 9 Boolean Query Merge Query: Best Western Hotel Result: Document 298 best 1 4 129 298 305 338 western 4 95 194 204 298 305 hotel 2 40 200 298 355 402 working 4 298 305 Friday, 29 October 2010
  10. 10. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus egestas non. Quisque eu purus ut lacus egestas dapibus. Integer in velit id est dictum bibendum in id mi. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacusLorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed sit amet ante vitae enim elementum semper sodales quis ipsum. Aliquam vel condimentum neque. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend tincidunt massa et euismod. Vestibulum vestibulum, justo vel egestas elementum, purus enim ornare quam, vel gravida est enim vel nibh. Nam non eros nisi, eget fringilla justo. Fusce vel risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus Friday, 29 October 2010
  11. 11. 11 TF-IDF function getWeight($docID, $term, $total) { $tf = count($term[$docID]); $idf = log($total / count($term), 2); return $tf * $idf; } Friday, 29 October 2010
  12. 12. 12 Document Vector socket what heavy steel ... Doc 1 0.02 0.3 0.001 0 ... Doc 2 0 0 0 0 ... Doc 3 0.001 0.2 0 0 ... Doc 4 0 0 0.002 0.003 ... Friday, 29 October 2010
  13. 13. best 23 42 179 246 333 703 weight 0.008 0.002 0.023 0.039 0.014 0.001 western 42 88 120 179 246 798 weight 0.003 0.004 0.023 0.001 0.034 0.004 1 - 246: 0.073 2 - 179: 0.024 3 - 120: 0.023 Ranked Query Merge 13 Friday, 29 October 2010
  14. 14. 14 PHP Similarity function score($queryString, $index) { $query = tokenize($queryString); $matches = array(); foreach($query as $qterm) { $postings = $index[$qterm]; foreach($postings as $id => $posting) { $matches[$id] += $posting['score']; } } return arsort($matches); } Friday, 29 October 2010
  15. 15. 15 Integrating Search Friday, 29 October 2010
  16. 16. 16 CREATE TABLE example ( id INT(11) NOT NULL auto_increment, title VARCHAR(255), content TEXT, PRIMARY KEY(id), FULLTEXT(title,content) ) Engine=MyISAM; INSERT INTO example (title, content) VALUES ('Mikko & Bacon','Mikko loves bacon'), ('Marcello & Bacon','Marcello hates bacon'), ('Jo & Sausages','Johanna loves sausages'), ('Hollywood & Garlic','Lorenzo hates garlic'), ('James & Cheddar','James is keen on cheeses'); MySQL Full Text Search Friday, 29 October 2010
  17. 17. 17 MySQL FTI Query SELECT * FROM example WHERE MATCH(title,content) AGAINST('loves bacon'); +----+------------------+------------------------+ | id | title | content | +----+------------------+------------------------+ | 1 | Mikko & Bacon | Mikko loves bacon | | 2 | Marcello & Bacon | Marcello hates bacon | | 3 | Jo & Sausages | Johanna loves sausages | +----+------------------+------------------------+ 3 rows in set (0.00 sec) Friday, 29 October 2010
  18. 18. 18 Sphinx http://www.sphinxsearch.com Friday, 29 October 2010
  19. 19. 19 Sphinx Configuration source posts { type = mysql sql_host = localhost sql_user = user sql_pass = password sql_db = search sql_query = SELECT id, title, content FROM example; sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags; } Friday, 29 October 2010
  20. 20. 20 index posts { source = posts path = /var/data/sphinx/example morphology = stem_en min_word_len = 3 min_prefix_len = 3 min_infix_len = 0 enable_star = 1 } Friday, 29 October 2010
  21. 21. 21 Stemming happening happened happens http://tartarus.org/~martin/PorterStemmer - happen - happen - happen Friday, 29 October 2010
  22. 22. 22 Command Line Searching indexer --config /etc/sphinx.conf --all search --config /etc/sphinx.conf love bacon displaying matches: 1. document=1, weight=3, tag=(1,2) ! id=1 ! title=Mikko & Bacon ! content=Mikko loves bacon words: 1. 'love': 2 documents, 2 hits 2. 'bacon': 2 documents, 4 hits searchd --config /etc/sphinx.conf Friday, 29 October 2010
  23. 23. 23 Sphinx From PHP $cl = new SphinxClient(); $cl->SetServer('localhost', 3312); $cl->SetMatchMode(SPH_MATCH_ANY); $result = $cl->Query('bac*'); $docIDs = array_keys($result["matches"]); $cl->SetFilter('tag', array(1)); $result = $cl->Query('bac*'); $docIDs = array_keys($result["matches"]); Friday, 29 October 2010
  24. 24. 24 Swish-E http://swish-e.org pecl install swish-beta Friday, 29 October 2010
  25. 25. Filesystem Index With Swish-E IndexDir /var/data/documents IndexFile fs-swish-e.index IndexOnly .doc .docx .pdf FuzzyIndexingMode Stemming_en1 FileFilter .pdf /usr/local/bin/swish_filter.pl FileFilter .doc /usr/local/bin/swish_filter.pl fs-swish-e.conf /usr/local/bin/swish-e -S fs -c fs-swish-e.conf Friday, 29 October 2010
  26. 26. Crawling Content IndexDir /usr/local/lib/swish-e/spider.pl IndexFile www-swish-e.index SwishProgParameters default http://phpir.com/ FuzzyIndexingMode Stemming_en1 DefaultContents HTML www-swish-e.conf /usr/local/bin/swish-e -S prog -c www-swish-e.conf Friday, 29 October 2010
  27. 27. Swish-E With Multiple Indices $swish = new Swish( 'www-swish-e.index fs-swish-e.index' ); $search = $swish->prepare(); $queryStr = 'search string goes here'; $result = $search->execute($queryStr); $total = $result->hits; while($r = $result->nextResult()) { echo $r->swishdocpath; // url } Friday, 29 October 2010
  28. 28. 28 Lucene Friday, 29 October 2010
  29. 29. 29 $index = Zend_Search_Lucene::create('idx'); foreach($documents as $title => $content) { $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text( 'title', $title)); $doc->addField( Zend_Search_Lucene_Field::UnStored( 'content', $content)); $index->addDocument($doc); } Build Index Friday, 29 October 2010
  30. 30. 30 $results = $index->find('loves bacon'); foreach($results as $result) { echo $result->score, " "; echo $result->title, "n"; } Output: 0.81656279309067 Mikko and Bacon 0.24800278854758 Marcello & Bacon Query Zend Search Lucene Friday, 29 October 2010
  31. 31. 31 $file = file_get_contents($url); $doc = Zend_Search_Lucene_Document_Html:: loadHTML($file); $doc->addField( Zend_Search_Lucene_Field::Text( 'url', $url ); $index->addDocument($doc) Index HTML Friday, 29 October 2010
  32. 32. 32 Solr http://lucene.apache.org/solr/ Friday, 29 October 2010
  33. 33. 33 Solr Search Index $options = array( 'hostname' => 'localhost', 'port' => 8983 ); $client = new SolrClient($options); $doc = new SolrInputDocument(); $doc->addField('id', $id); $doc->addField('cat', $category); $doc->addField('title', $title); $doc->addField('text', $text); $response = $client->addDocument($doc); $client->commit(); Friday, 29 October 2010
  34. 34. 34 Solr Search Client $client = new SolrClient($options); $query = new SolrQuery('bacon'); $response = $client->query($query); $r = $response->getResponse(); foreach($r['response']['docs'] as $d) { echo $d->title[0] . "n"; } Friday, 29 October 2010
  35. 35. 35 Xapian http://xapian.org Friday, 29 October 2010
  36. 36. 36 Xapian In PHP $db = new XapianWritableDatabase( 'idx', Xapian::DB_CREATE_OR_OPEN); $i = new XapianTermGenerator(); $i->set_stemmer(new XapianStem("english")); $doc = new XapianDocument(); $doc->set_data($content); $doc->add_value(1, $title); $i->set_document($doc); $i->index_text($content); $db->add_document($doc); Friday, 29 October 2010
  37. 37. 37 Xapian Search In PHP $database = new XapianDatabase('idx'); $enquire = new XapianEnquire($database); $qp = new XapianQueryParser(); $qp->set_stemmer(new XapianStem("english")); $qp->set_database($database); $qp->set_stemming_strategy( XapianQueryParser::STEM_SOME); $query = $qp->parse_query($queryString); $enquire->set_query($query); Friday, 29 October 2010
  38. 38. 38 $matches = $enquire->get_mset(0, 10); $i = $matches->begin(); while(!$i->equals($matches->end())) { $n = $i->get_rank() + 1; $data = $i->get_document()->get_data(); $title = $i->get_document()->get_value(1); $score = $i->get_percent(); $i->next(); } Friday, 29 October 2010
  39. 39. 39 Improving Results Friday, 29 October 2010
  40. 40. 40 Anchor Text Friday, 29 October 2010
  41. 41. 41 $p = file_get_contents('http://phpir.com'); libxml_use_internal_errors(true); $dom = DomDocument::loadHTML($p); $links = $dom->getElementsByTagName('a'); foreach($links as $link) { $href = $link->getAttribute('href'); $text = $link->nodeValue; } Parse Anchor Text Friday, 29 October 2010
  42. 42. 42 1 2 3 Zone Weighting Friday, 29 October 2010
  43. 43. 43 ZSL Zone Weighting $doc = new Zend_Search_Lucene_Document(); $tfield = Zend_Search_Lucene_Field::Text ('title', $title); $tfield->boost = 1.3; $doc->addField($tfield); $doc->addField( Zend_Search_Lucene_Field::UnStored ('content', $content)); $index->addDocument($doc); Friday, 29 October 2010
  44. 44. 44 Document Authority Friday, 29 October 2010
  45. 45. 45 Document Weights in ZSL $doc = new Zend_Search_Lucene_Document(); $doc->addField( Zend_Search_Lucene_Field::Text ('title', $title)); $doc->addField( Zend_Search_Lucene_Field::UnStored ('content', $content)); $doc->boost = 1 + ($numComments / 100); $index->addDocument($doc); Friday, 29 October 2010
  46. 46. 46 Using Search Friday, 29 October 2010
  47. 47. 47 Summaries & Highlighting Friday, 29 October 2010
  48. 48. 48 Sphinx Extract & Highlight $cl = new SphinxClient(); $cl->SetServer( "localhost", 3312 ); $q = 'bacon'; $r = $cl->Query($q); foreach ($r["matches"] as $doc => $info) { $text[$doc] = getTextFromDB($doc); } $e = $cl->BuildExcerpts($text, 'posts', $q); foreach($extracts as $extract) { echo $extract; } Friday, 29 October 2010
  49. 49. Friday, 29 October 2010
  50. 50. 50 Xapian Spelling Correction $indexer = new XapianTermGenerator(); $indexer->set_database($database); $indexer->set_flags( XapianTermGenerator::FLAG_SPELLING); Indexer $queryString = "strreplace or str_cmp"; $q = new XapianQueryParser(); $q->set_database($database); $query = $q->parse_query($queryString, XapianQueryParser::FLAG_SPELLING_CORRECTION); echo "Did you mean: " . $q->get_corrected_query_string() . "n"; Searcher Friday, 29 October 2010
  51. 51. 51 Spelling Correction Output php xapsearch.php Did you mean: str_replace or strcmp 4644 results found for “strreplace or str_cmp”: 1: 2% docid=572 [phpdocs/html/cc.license.html] 2: 2% docid=7169 [phpdocs/html/imagick.constants.html] 3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html] 4: 2% docid=6132 [phpdocs/html/function.swf-posround.html] Friday, 29 October 2010
  52. 52. 52 Results Sorting Friday, 29 October 2010
  53. 53. 53 Sorting in ZSL $q = Zend_Search_Lucene_Search_QueryParser:: parse('search string'); $results = $index->find($q, 'title'); foreach($results as $result) { echo '<h3>', $result->title, "</h3>n"; $doc = getDocumentFromDB($result->did); echo $q->htmlFragmentHighlightMatches($doc); } Friday, 29 October 2010
  54. 54. 54 Faceted Search Friday, 29 October 2010
  55. 55. 55 Faceted Search In Solr $client = new SolrClient($options); $query = new SolrQuery('bacon'); $response = $client->query($query); $query->setFacet(true); $query->addFacetField('cat'); $r = $response->getResponse(); $f = $r['facet_counts']['facet_fields']; foreach($f['cat'] as $facet => $count) { echo $facet . " " . $count . "n"; } Friday, 29 October 2010
  56. 56. 56 More Like This Friday, 29 October 2010
  57. 57. 57 More Like This $rset = new XapianRset(); $rset->add_document(5959); // str_replace $e = $enquire->get_eset(40, $rset); $t = $e->begin(); for($t; !$t->equals($e->end()); $t->next()){ $qs[] = new XapianQuery($t->get_term(), intval($t->get_weight())); } $query = new XapianQuery( XapianQuery::OP_OR, $qs); Friday, 29 October 2010
  58. 58. 58 More Like This Example php xapsim.php 1656 results found: 1: 100% docid=5959 [phpdocs/html/function.str-replace.html] 2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html] 3: 24% docid=5328 [phpdocs/html/function.preg-replace.html] 4: 18% docid=5958 [phpdocs/html/function.str-repeat.html] Friday, 29 October 2010
  59. 59. 59 Search Performance Friday, 29 October 2010
  60. 60. 60 Index Updates Docs Main New Delta Delta Main Query Delta Main Main DocsDocsDocs Friday, 29 October 2010
  61. 61. 61 Search Speed $index = Zend_Search_Lucene::open('index'); $index->optimize(); indexer --merge main delta --rotate Zend Search Lucene Sphinx $client = new SolrClient($options); $client->optimize(); Solr xapian-compact xapindex xapindex2 Xapian Friday, 29 October 2010
  62. 62. 62 Distributing Search Index Application Index Index DocumentDocumentDocumentDocument Friday, 29 October 2010
  63. 63. 63 Large Scale Search http://www.nutch.org http://hadoop.apache.org Friday, 29 October 2010
  64. 64. 64 Image Credits Title http://www.flickr.com/photos/generated/2084287794/ What Do You Want http://www.flickr.com/photos/the_justified_sinner/ 2498066986/You Are Here http://www.flickr.com/photos/alecvuijlsteke/2692475420/ Integrating Search http://www.flickr.com/photos/squeaks2569/3700355684/ Sphinx http://www.flickr.com/photos/generated/2084287794/ Lucene http://www.flickr.com/photos/mypanda/7731447/ Swish-e http://www.flickr.com/photos/ryan_fung/2239687100/ Solr http://www.flickr.com/photos/m-j-s/2724756177/ Xapian http://www.flickr.com/photos/olibac/3522056495/ Using Search http://www.flickr.com/photos/eneas/175027945/ Improving Search http://www.flickr.com/photos/x-ray_delta_one/3928200642/ Search Performance http://www.flickr.com/photos/maisonbisson/1634408/ Large Scale Search http://www.flickr.com/photos/zedzap/3663508847/ Friday, 29 October 2010
  65. 65. Questions? 65 Friday, 29 October 2010
  66. 66. Thank You! Ian Barber @ianbarber http://phpir.com ian@ibuildings.com Friday, 29 October 2010
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×