In Search Of... integrating site search

Presentation from PHP UK 2010. Despite being a key method of navigation on many sites, search functionality often gets the short end of the stick in development, either by handing the job over to Google or just enabling full text search on the appropriate column in the database. In this talk we will look at how full text search actually works, how to integrate local text search engines into your PHP application, and how it's possible to actually provide better and more relevant results than Google itself, at least for your own site.

Usage Rights

CC Attribution License


Comments

  • thanks!
  • great round up Ian, wish I'd been to the talk too

Speaker Notes

  • Contact details.
  • This is a question we'd often like to ask our users, but with search they tell us. Search is about getting content to the users who want it. Searching users are active and engaged; give them what they want and they are more likely to buy, read, comment, share and so on.
  • This talk covers how full text search works, looks at some different options for integration, and looks at making it better. There will be time for questions at the end, but one does spring to mind now:
  • Why build search, why not let Google do it? Some content is private, on an intranet, in a Facebook-style inbox, or offline. Google has been bad at some content: Twitter for a long time, blogs for a long time. You may want a product focus, like Amazon, or speed of update, like a forum. Now, let's look at how a full text search operates.
  • Search engine structure: raw text → documents (add URL, title, split up etc.) → text analysis → index, and on the other side: search UI → query → query parser → index → results.
  • This is the simplified structure of a search engine. Start with a pool of raw data, chunked into documents; the analyser processes the text in the documents and the index stores it. On the other side is the search UI: the query is parsed by the query parser (much like the analyser), run against the index, and the results are sorted and returned.
  • Tokenising is taking a document and splitting it into the tokens to index. It can be difficult, even with the space character. Commas: strip punctuation and "send 1.1 mil" becomes "send 11 mil"! Hyphens and apostrophes cause similar problems.
  • That said, starting with something simple isn't a bad idea. Here we look for continuous sequences of word characters and capture them with their offsets, which are used for phrase matching. More advanced search engines have better tokenisation, handling the & in AT&T; some instead have a buzzwords file of specific terms such as C++. A sketch of such a tokeniser follows.
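
    Since the notes describe a tokeniser built on runs of word characters captured with offsets, here is a minimal PHP sketch of that idea (an assumption, not the slide's own code):

        <?php
        // Lower-case the text, then capture runs of word characters along with
        // their byte offsets, which are kept for later phrase matching.
        function tokenise($text)
        {
            preg_match_all('/\w+/', strtolower($text), $matches, PREG_OFFSET_CAPTURE);
            $tokens = array();
            foreach ($matches[0] as $match) {
                $tokens[] = array('term' => $match[0], 'offset' => $match[1]);
            }
            return $tokens;
        }

        // note how the apostrophe splits "I've" into two tokens
        print_r(tokenise("The best bacon sandwich I've eaten"));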
  • Pair the extracted tokens with the assigned document ID. Filter out stop words such as 'an', 'the' and 'of', which don't distinguish documents. Position information is included.
  • Invert and merge the pairs, so terms map to documents. Positions are still stored, represented by the parentheses: e.g. 'best' at 4 and 16 in document 1. They are often stored separately, or reduced to a straight count. A term's list of documents is its posting list, and that is enough to start a search.
  • Take the search query and tokenise it the same way (important!). For each term we array_intersect the posting lists. We can do boolean searches by taking the array union for OR, and so on. But there is no ranking: any result containing all the words is as good as any other. We must store the importance of terms to documents, the weight. A sketch of the posting lists and the AND search follows.
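
    A sketch (not the original slide code) of building posting lists and running an AND query with array_intersect; positions are left out to keep it short, and $documents stands in for the real collection:

        <?php
        $documents = array(
            1 => 'The best bacon sandwich of the year',
            2 => 'The best way to cook bacon',
            3 => 'A review of the sandwich shop',
        );
        $stopWords = array('a', 'an', 'the', 'of', 'to');

        $postings = array();   // term => array of doc ids containing it
        foreach ($documents as $docId => $text) {
            preg_match_all('/\w+/', strtolower($text), $matches);
            foreach ($matches[0] as $term) {
                if (!in_array($term, $stopWords)) {
                    $postings[$term][$docId] = $docId;
                }
            }
        }

        // tokenise the query the same way, then intersect the posting lists
        function searchAll(array $postings, $query)
        {
            preg_match_all('/\w+/', strtolower($query), $matches);
            $results = null;
            foreach ($matches[0] as $term) {
                $list = isset($postings[$term]) ? $postings[$term] : array();
                $results = ($results === null) ? $list : array_intersect($results, $list);
            }
            return ($results === null) ? array() : $results;
        }

        print_r(searchAll($postings, 'best bacon'));   // documents 1 and 2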
  • The weighting scheme includes two measures: TF, term frequency, the count of the term in the document; and IDF, inverse document frequency, the rareness of the term across the collection.
  • TF-IDF is a simple but usable weighting algorithm, and the basis of most others. TF is the count of times the term appears; IDF is the total number of documents divided by the number of documents containing the term (e.g. 10 total / 3 with the term), with a log applied to smooth it. Store this score with the document in the posting list for the term. Normalise the scores over a document to account for length, though this still boosts short text.
  • TF-IDF PHP code; a sketch (not the original slide code) follows.
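
    The slide code itself isn't in these notes; as a stand-in, a minimal TF-IDF sketch where $index maps term => array(docId => count of the term in that document):

        <?php
        function tfIdf(array $index, $term, $docId, $totalDocs)
        {
            if (!isset($index[$term][$docId])) {
                return 0.0;
            }
            $tf  = $index[$term][$docId];   // term frequency: times the term appears in the doc
            $df  = count($index[$term]);    // document frequency: docs containing the term
            $idf = log($totalDocs / $df);   // log smooths the inverse document frequency
            return $tf * $idf;
        }

        // e.g. 'bacon' appears twice in doc 1 and is in 3 of 10 documents
        $index = array('bacon' => array(1 => 2, 4 => 1, 9 => 5));
        echo tfIdf($index, 'bacon', 1, 10);   // 2 * log(10/3) ≈ 2.41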
  • A document is a position in N-dimensional space, with one dimension for each term ever seen; most values are 0. The vector is normalised to length 1 (the square root of the sum of the squares of the values).
  • We just look at two terms here to keep it simple. Rather than just looking for matches, we accumulate a score for each matching document. Document 246 is our highest scoring document, picking up two good scores, but 120 makes it in at number three despite not containing 'best' at all.
  • For a two-term, two-dimension case, that looks like this. We calculate the cosine of the angle between the vectors with the dot product: a similarity of 1 means identical, 0 means orthogonal (no shared terms). We can treat a query as a new document; the documents it is most similar to are the best results. We only need to compare against documents that share terms, as the rest will score 0.
  • Look at the query terms and retrieve their posting lists from the index. Treat the query term weights as 1, which is incorrect but fine for relative results. Merge the lists and calculate the dot product by summing the weights, so a full match isn't needed. We could also add phrase search or a positional bias. A sketch of this scoring follows.
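
    A sketch of the ranked search described above (not the slide code): $weighted maps term => array(docId => normalised TF-IDF weight), query term weights are treated as 1, so the dot product is just a sum of weights:

        <?php
        function searchRanked(array $weighted, $query)
        {
            $scores = array();
            preg_match_all('/\w+/', strtolower($query), $matches);
            foreach ($matches[0] as $term) {
                if (!isset($weighted[$term])) {
                    continue;                               // unseen terms score nothing
                }
                foreach ($weighted[$term] as $docId => $weight) {
                    $current = isset($scores[$docId]) ? $scores[$docId] : 0;
                    $scores[$docId] = $current + $weight;   // accumulate the dot product
                }
            }
            arsort($scores);                                // highest scoring documents first
            return $scores;
        }

        $weighted = array(
            'best'  => array(246 => 0.61, 357 => 0.30),
            'bacon' => array(246 => 0.55, 357 => 0.25, 120 => 0.47),
        );
        print_r(searchRanked($weighted, 'best bacon'));     // 246 first; 120 ranks despite lacking 'best'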
  • Two main questions: where does the data come from, and how is the index accessed? We'll look at six PHP-friendly engines, each with a different integration method and each bringing new bits of functionality.
  • MySQL full text search: the data comes from database columns in one table. It is the simplest of all to implement, integrated through a query; note the FULLTEXT index. It is a straight vector space search implementation as described before, and can only be used with MyISAM, not InnoDB. If you're using Postgres, tsearch has been built in since 8.3.
  • Searches use the MATCH ... AGAINST syntax. There is a boolean mode too (all the engines have one), but we focus on natural language mode. Only one document has both words; results are ranked in score order, as MATCH AGAINST returns a float. Note some tricky default configuration: a minimum word length of 4, and exclusion of words that appear in more than 50% of rows. A query sketch follows.
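
    A sketch of natural-language MATCH ... AGAINST via PDO, assuming a posts table with a FULLTEXT index on (title, body); the connection details are placeholders:

        <?php
        $pdo = new PDO('mysql:host=localhost;dbname=blog', 'user', 'secret');

        $sql = 'SELECT id, title, MATCH(title, body) AGAINST(:q1) AS score
                  FROM posts
                 WHERE MATCH(title, body) AGAINST(:q2)
                 ORDER BY score DESC';

        $stmt = $pdo->prepare($sql);
        $stmt->execute(array('q1' => 'best bacon', 'q2' => 'best bacon'));

        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            // MATCH ... AGAINST returns a float, so the results come back ranked
            printf("%.3f  %s\n", $row['score'], $row['title']);
        }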
  • One interesting option is query expansion: blindly expanding the search based on the words in the documents returned. It is usually not a very good idea, because we want more precision than recall; precision is the quality of the results, recall is their completeness. In this case the search is expanded to 'lorenzo', because of marcello's hatred for bacon.
  • You can actually interrogate the index with myisam_ftdump, run from the database directory. However, let's say you want to search a normalised schema directly, across multiple tables.
  • Using Sphinx you can index a more complex query. It is used on Craigslist, and apparently on The Pirate Bay. There is a PHP API for access, or the pecl/sphinx extension, which offers the same interface but faster.
  • Once installed, set up the sphinx.conf file. At the top is the connection configuration (it also works with PostgreSQL). Indexing is driven by sql_query, which could use a view or something more complex. We add attributes, the non-indexed elements of a document, which in Sphinx can only be numeric or timestamp values. Multi-valued attributes support many-to-many relationships such as tags. There are other options, such as sql_query_pre and sql_query_post.
  • Next, tell Sphinx about the index: the minimum length of an indexed word, and prefix or infix indexing for wildcard search (infix matches anywhere in a word, prefix only from the start). We also enable a stemmer.
  • Stemming consistently collapses different forms of the same word to a stem. Here each version is reduced to 'happen'; the result is not always an English word, just a consistent one. This lets us match more words, and is often, but not always, helpful. The most common algorithm is the Porter stemmer, and there is a PHP implementation on its site.
  • Run the indexer command to build the index. It might lock the database table, though there is a ranged-query workaround. The command line search tool defaults to requiring all words; note the stemmer matching 'love' against 'loves'. The last line starts the search daemon.
  • Matching any word and using a wildcard prefix search returns both 'bacon' documents. Adding filters limits results to certain values of the attributes, and now we get just one result. Sphinx can also be built into MySQL as a storage engine and queried via a WHERE argument. A client sketch follows.
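
    A sketch using the sphinxapi.php client bundled with Sphinx (the pecl/sphinx extension is similar); the 'articles' index and 'tag_id' attribute are assumptions:

        <?php
        require 'sphinxapi.php';

        $cl = new SphinxClient();
        $cl->SetServer('localhost', 9312);
        $cl->SetMatchMode(SPH_MATCH_ANY);          // match any of the query words
        $cl->SetFilter('tag_id', array(3));        // only documents with this attribute value

        $result = $cl->Query('bacon', 'articles');
        if ($result !== false && !empty($result['matches'])) {
            foreach ($result['matches'] as $docId => $match) {
                echo $docId, ' (weight ', $match['weight'], ")\n";
            }
        }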
  • From the other end, Swish-e is easy to plug into an existing system at short notice. It is an engine with a long pedigree and has a PHP extension; it is used by quite a few universities. It doesn't support multibyte character sets, which is a bit of a downside. It is great for combinations where you have a bunch of Word documents or similar, plus a website, and you want to search both.
  • First the 'fs' (file system) index: we create a conf file for the indexer, telling it where to look for files. FileFilters extract text from non-text formats such as DOC and PDF (requiring wvWare and xpdf; Apache Tika is another option). IndexDir can be specified multiple times for different document stores.
  • Swish-e also includes an effective web crawler, another way to get data. Getting content over the web loses some of the advantages, but it means you can plug into a website you have no real control over. The indexing mode is 'prog', calling out to the spider script, and the index file gets a different name.
  • Being able to query across the two indexes is very handy: here we search the fs and www indexes and return combined results. Various filters can limit the search to parts of HTML documents, or filter on file system paths.
  • Now we'll look at engines where we index from within PHP. Lucene is the Apache Foundation's search engine: very successful, but it has ports rather than bindings, and the native PHP port is in Zend Framework as Zend_Search_Lucene.
  • It hooks right into the application, making it an easy addition or plugin, with lots of control and easy addition of metadata attributes. Lucene calls them fields: string keys with several value types; a Text field is indexed with the original stored, while UnStored is indexed but not stored. The index is compatible with Java Lucene 2.3, so you can index in Java and search in PHP.
  • Querying is straightforward and quick; the scores are only really interesting as relative values. A sketch of indexing and querying follows.
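
    A minimal Zend_Search_Lucene sketch, assuming Zend Framework 1 is on the include path and /tmp/demo-index is writable:

        <?php
        require_once 'Zend/Search/Lucene.php';

        $index = Zend_Search_Lucene::create('/tmp/demo-index');

        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Text('title', 'The best bacon sandwich'));
        $doc->addField(Zend_Search_Lucene_Field::UnStored('body', 'Indexed but not stored full text...'));
        $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', '/posts/42'));
        $index->addDocument($doc);
        $index->commit();

        foreach ($index->find('bacon') as $hit) {
            // scores only make sense relative to the other hits
            echo $hit->score, '  ', $hit->title, '  ', $hit->url, "\n";
        }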
  • It includes some handy utilities, such as an HTML document parser that automatically produces fields like title and body and lets you add other fields as required. The advantage of it being PHP is that it is easy to hack at and to add new document types. However, it doesn't scale to large collections, so you may prefer one of the Java-based versions, and the easiest way to do that is with Solr.
  • Solr uses Java Lucene, wrapped in a REST-style XML/JSON web service, which is convenient for all the usual SOA reasons. It is in use at CNET, Digg, Netflix and other high profile sites. There is a PHP extension, or a PHP client API.
  • It is not massively different from Zend_Search_Lucene. Solr needs you to create a schema first, to define the fields of the documents. Note the client commit at the bottom: until a document is committed it won't show up in searches. You hardly know you're using a web service.
  • Searching is similar; the XML-based response format means a more complex return structure. Solr is great for larger scale collections and provides good admin functionality, making it enterprise friendly. A sketch using the pecl/solr extension follows.
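
    A sketch using the pecl/solr extension, assuming a core whose schema defines single-valued id, title and body fields and a server on localhost:8983:

        <?php
        $client = new SolrClient(array(
            'hostname' => 'localhost',
            'port'     => 8983,
            'path'     => '/solr',
        ));

        // index a document; nothing is searchable until commit() runs
        $doc = new SolrInputDocument();
        $doc->addField('id', '42');
        $doc->addField('title', 'The best bacon sandwich');
        $doc->addField('body', 'Full text of the post...');
        $client->addDocument($doc);
        $client->commit();

        // search it back; the response is unpacked into nested objects
        $query = new SolrQuery();
        $query->setQuery('bacon');
        $query->setStart(0);
        $query->setRows(10);

        $response = $client->query($query)->getResponse();
        foreach ($response->response->docs as $d) {
            echo $d->title, "\n";
        }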
  • Our last engine is Xapian, a high performance C++ search engine. There is a Solr-like service on top of it called Flax, but we'll look at the engine directly, through its SWIG-based PHP extension and low level API. It gives some cool features and a lot of control, creating its database on the file system or accessible remotely.
  • Note the separation between the document and the indexer, and the integration of the stemmer (English here). We also store a non-indexed attribute, referred to as a 'value' here, for the title.
  • Xapian index (local etc.); a sketch follows.
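
    A sketch of indexing with the Xapian PHP bindings, assuming the class-style SWIG wrapper (xapian.php) that mirrors the C++ names and exposes the DB_CREATE_OR_OPEN constant on the Xapian class; the value slot for the title and the database path are assumptions:

        <?php
        include 'xapian.php';

        $db = new XapianWritableDatabase('/tmp/demo-xapian', Xapian::DB_CREATE_OR_OPEN);

        $indexer = new XapianTermGenerator();
        $indexer->set_stemmer(new XapianStem('english'));

        $text = 'The best bacon sandwich';
        $doc = new XapianDocument();
        $doc->set_data($text);            // opaque data blob, returned with results
        $doc->add_value(0, $text);        // value slot 0 holds the title
        $indexer->set_document($doc);
        $indexer->index_text($text);

        $db->add_document($doc);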
  • The searching is more complicated, but we have more control: STEM_SOME, for example, doesn't stem words that start with a capital letter (proper nouns).
  • Xapian query.
  • Retrieving the results relies on functions wrapping the C++ iterators. Note the percentage score value: overtly relative, but it can be thresholded if needed.
  • Xapian query result; a sketch of the query and result loop follows.
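
    A sketch of querying the index above, again assuming the class-style bindings; STEM_SOME leaves capitalised words (proper nouns) unstemmed, as in the notes:

        <?php
        include 'xapian.php';

        $db = new XapianDatabase('/tmp/demo-xapian');
        $enquire = new XapianEnquire($db);

        $qp = new XapianQueryParser();
        $qp->set_stemmer(new XapianStem('english'));
        $qp->set_database($db);
        $qp->set_stemming_strategy(XapianQueryParser::STEM_SOME);

        $enquire->set_query($qp->parse_query('best bacon'));
        $matches = $enquire->get_mset(0, 10);

        // results come back through thin wrappers around the C++ iterators
        for ($m = $matches->begin(); !$m->equals($matches->end()); $m->next()) {
            echo $m->get_percent(), '%  ', $m->get_document()->get_data(), "\n";
        }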
  • We have a search engine and we know where the data is coming from; how can we improve the results?
  • Link text can be a great source of keywords. To use a classic example from one of the early papers about Google: if someone types 'big blue' into Google, one of the top results is IBM.com, yet the page itself doesn't contain that phrase. Pages that link to it do contain it, and Google indexes the page against it. This is a big win for things like images and videos, where there may be no text at all.
  • We need to parse the document, which is easy in PHP with the DOM extension. We could then add the anchor text to the index as a new field on the target document. Zend_Search_Lucene has a built-in HTML document type, but its getLinks function doesn't include the anchor text.
  • Anchor text extraction; a sketch follows.
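
    A sketch of pulling out anchor text with the DOM extension; the href => anchor-text pairs could then be indexed as an extra field on the pages they point to:

        <?php
        $html = '<p>Read the <a href="/posts/42">best bacon sandwich review</a> here.</p>';

        $dom = new DOMDocument();
        @$dom->loadHTML($html);                    // @ hides warnings from messy real-world markup

        $anchors = array();
        foreach ($dom->getElementsByTagName('a') as $link) {
            $href = $link->getAttribute('href');
            $text = trim($link->textContent);
            if ($href !== '' && $text !== '') {
                $anchors[$href][] = $text;
            }
        }
        print_r($anchors);   // array('/posts/42' => array('best bacon sandwich review'))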
  • The next idea is zone weighting. This is a page from my blog; I know what is important on this page (zones 1 to 3), whereas Google has to guess based on appearance. Green marks boilerplate that shouldn't be indexed. We index these zones as fields and weight them differently.
  • If we break our content down into fields, we can set different 'boost' values on them: boosts greater than 1 mark a field as more important, less than 1 as less important, e.g. to de-emphasise comments. A sketch follows.
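
    A zone-weighting sketch with Zend_Search_Lucene, assuming field-level boosts via the field object's public $boost property; $post and $index are assumed to exist (see the earlier Zend_Search_Lucene sketch):

        <?php
        $doc = new Zend_Search_Lucene_Document();

        $title = Zend_Search_Lucene_Field::Text('title', $post['title']);
        $title->boost = 2.0;                       // > 1: the title zone matters more

        $comments = Zend_Search_Lucene_Field::UnStored('comments', $post['comment_text']);
        $comments->boost = 0.5;                    // < 1: de-emphasise the comments zone

        $doc->addField($title);
        $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $post['body']));
        $doc->addField($comments);

        $index->addDocument($doc);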
  • Document weight captures importance or authority in general, not tied to a specific query. PageRank is one option, but it won't work on a small collection. Alternatives include comments ('great post', or simply the comment count), inbound visitors, and retweets; Google uses a UserRank, PageRank-style calculation on follower counts.
  • Similar to zones, we can boost at the document level. The default is 1; here we add one hundredth for each comment. This could of course be tuned for individual circumstances; a sketch follows.
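
    A document-weighting sketch with Zend_Search_Lucene, adding a hundredth per comment as in the note above; $post, $comments (an array of the post's comments) and $index are assumed:

        <?php
        $doc = new Zend_Search_Lucene_Document();
        $doc->boost = 1 + (count($comments) / 100);   // document boost defaults to 1

        $doc->addField(Zend_Search_Lucene_Field::Text('title', $post['title']));
        $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $post['body']));

        $index->addDocument($doc);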
  • We've got an engine, got the data and got good results; now let's look at ways to improve the search user experience.
  • With UI, do what other websites do; with search, do what Google et al. do. Summaries or snippets show a selection of the page.
  • Building highlighted excerpts with Sphinx.
  • Most search engines have some support for this. With Sphinx, we can pass the query and index name to the BuildExcerpts function to get highlighted contextual snippets; getTextFromDB is just a pretend function that would wrap retrieving the raw full text. A sketch follows.
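
    A snippet-building sketch with the Sphinx PHP API; getTextFromDB() is the pretend function from the notes, and the 'articles' index is an assumption:

        <?php
        require 'sphinxapi.php';

        $cl = new SphinxClient();
        $cl->SetServer('localhost', 9312);

        $docs = array(getTextFromDB($docId));      // raw text lives in the database, not the index
        $opts = array(
            'before_match' => '<b>',
            'after_match'  => '</b>',
            'limit'        => 200,                 // maximum snippet length in characters
        );

        $excerpts = $cl->BuildExcerpts($docs, 'articles', 'best bacon', $opts);
        echo $excerpts[0];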
  • In Swish-e we can do this by storing some of the original text in the search engine: we've added a StoreDescription based on the body, for 1,000 characters, which will appear in the result object as swishdescription. We may want to index more, then choose the part we display based on the presence of the query words.
  • Google highlights the search terms in its summaries; we can do the same on the whole document as well, and it is easy in many engines. Zend_Search_Lucene has highlightMatches, which can work on a stored field or on external text; note that it will add HTML headers if it isn't given a fragment. A sketch follows.
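
    A highlighting sketch with Zend_Search_Lucene: parse the query, then ask it to mark up matching terms in a fragment of HTML such as a stored summary; $summaryHtml is assumed:

        <?php
        require_once 'Zend/Search/Lucene.php';

        $query = Zend_Search_Lucene_Search_QueryParser::parse('best bacon');

        // wraps each matched term in highlighting markup; pass a fragment,
        // or the output comes back wrapped in html/head/body tags
        echo $query->highlightMatches($summaryHtml);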
  • Spelling correction is a really handy function. It is important to correct to known words from the index, rather than to a default dictionary.
  • In this Xapian example we set a spelling flag on both the indexer and the query parser. We had an index built from the PHP documentation, and have mistyped str_replace and strcmp.
  • The function names were corrected, despite not being 'words', because they featured in the index and had a low edit distance from the query. Some low quality results were returned, which is where we might use a threshold. Solr/Lucene has a similar plugin. An engine-agnostic sketch of the idea follows.
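
    Xapian handles this natively once the spelling flags are set; as an engine-agnostic illustration of the key point above (correct against the index vocabulary rather than a dictionary), a sketch using PHP's levenshtein():

        <?php
        function suggest($word, array $indexTerms, $maxDistance = 2)
        {
            $best = null;
            $bestDistance = $maxDistance + 1;
            foreach ($indexTerms as $term) {
                $distance = levenshtein($word, $term);
                if ($distance < $bestDistance) {
                    $best = $term;
                    $bestDistance = $distance;
                }
            }
            return $best;                          // null when nothing is close enough
        }

        // against a vocabulary taken from a PHP-documentation index
        echo suggest('str_repalce', array('str_replace', 'strcmp', 'strlen'));   // str_replace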
  • Another useful idea is sorting result sets on something other than rank; this is an example from Google News. File search, email and private messages may call for other orderings, such as sender, date or subject.
  • Here we've added a sort on title. It can be expensive, as the search engine can't take its normal shortcuts, but it is usually straightforward. A sketch follows.
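
    A sorting sketch with Xapian, assuming the index above where the title was stored in value slot 0; $enquire and $qp are the objects from the earlier query sketch:

        <?php
        $enquire->set_query($qp->parse_query('bacon'));
        $enquire->set_sort_by_value(0, false);     // order by value slot 0 (the title);
                                                   // the second argument controls the direction
        $matches = $enquire->get_mset(0, 10);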
  • Here is a search on Epicurious, the food and cooking site, showing categories and result counts. This is called faceted search, where the categories are facets and a document can have many of them. It is good for product-based search; Solr was built with faceted search in mind, for CNET reviews.
  • We enable faceted mode and set one facet, 'cat'. If we had been duplicating Epicurious, each of the options on the left would have been a facet. We get the results plus an enumeration of the options in each facet with their counts. A sketch follows.
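
    A faceting sketch with pecl/solr, assuming a 'cat' field in the schema and the $client from the earlier Solr sketch:

        <?php
        $query = new SolrQuery();
        $query->setQuery('bacon');
        $query->setFacet(true);                    // turn faceting on
        $query->addFacetField('cat');              // enumerate values of the 'cat' field
        $query->setFacetMinCount(1);               // skip categories with no matches

        $response = $client->query($query)->getResponse();

        // normal results, plus a per-category count of matches
        print_r($response->facet_counts->facet_fields->cat);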
  • The user can offer feedback by selecting 'more like this' to find documents like a given one. It is good for searches with many meanings, such as 'creed' (game, band, belief). This example is from a dissertation search engine.
  • We generate a search from the document the user selected. Xapian has this built in, and it can be done in Solr as well. The top 40 most important terms are extracted (more than one document can be used); here we use str_replace from the PHP documentation index, and combine the terms with ORs.
  • It finds itself, and other good matches. MySQL's full text index has blind query expansion, which gets more results based on the results retrieved; it is not as good, and hella slow! A sketch of the idea follows.
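
    Xapian builds the expanded query for you; as an engine-agnostic sketch of the same idea, take the highest-weighted terms from the selected document and OR them together. $docWeights (term => TF-IDF weight for that document) is an assumed structure:

        <?php
        function moreLikeThis(array $docWeights, $termCount = 40)
        {
            arsort($docWeights);                                  // most important terms first
            $terms = array_keys(array_slice($docWeights, 0, $termCount));
            return implode(' OR ', $terms);                       // loose query: any term may match
        }

        echo moreLikeThis(array('str_replace' => 4.2, 'subject' => 2.8, 'string' => 2.1, 'needle' => 1.8));
        // str_replace OR subject OR string OR needle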
  • Search can be expensive, with lots of data to process, though most engines have some sort of query cache built in. We'll take a quick look at some different aspects of performance.
  • Indexes are designed for more reads than writes, and adding data to a large index can be expensive. One approach is to keep two indexes and merge; Lucene uses segments to do this automatically.
  • A smaller index means less IO, better use of the OS cache and faster results, but slower update speed. Recombine segments, merge deltas, and optimise and compress the index, though this can be an expensive operation. Try to keep the index on local disk, not on the network. A sketch follows.
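
    A housekeeping sketch with Zend_Search_Lucene: optimize() merges the index segments back into one and compacts space left by deleted documents. It can be expensive, so run it from a maintenance job rather than on user requests:

        <?php
        require_once 'Zend/Search/Lucene.php';

        $index = Zend_Search_Lucene::open('/tmp/demo-index');
        $index->optimize();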
  • When the demands are too big for a single server, you need to look at distributing. Replication tends not to give much of a boost here, as the usual problem is an index that is too large and too slow for single queries, rather than query volume. You need to shard the contents based on a hash of something that isn't searched on. Most systems have a way of working with remote backends, to give a single point for searching and sorting.
  • The systems we've talked about will all index tens of thousands of documents, and Xapian and Solr should handle into the millions on one server. Hundreds of millions to billions of documents is web scale, where the challenges are data size and rate of update. Nutch is a FOSS web-scale search engine and crawler created by Doug Cutting of Lucene, who also created Hadoop (MapReduce, a distributed file system etc., without being sued by Google). It is used on thousands of nodes at Yahoo, among others.
  • Thanks!
