open source technologies & search engine design

Search engines have become an almost everyday technology for companies.
Large-scale search, storage, and crawler technologies exist as open source, available to anybody.
But what other behind-the-scenes elements need to be taken into account when designing a search solution for a company?



Open source & corporate usage
  - What to look for & what to avoid (activity, usage, updates, complexity, community)
  - License pitfalls, how open source licenses are modified for corporate gains
  - Just because it's open source doesn't mean you should modify it

Search
  - Design principles, infrastructure, abstraction (don't overcommit to a piece of software)
  - Separation of high-level features: user intent, retrieval, ranking
  - Types of search and their differences: Web, News, Local
  - Now add social to the above

Metrics
  - What can be measured and why is it important? Upstream measuring vs. usage measuring
  - What can go up and down and still be a metric of success?
  - How feedback loops are used to influence search

Published in: Technology
  • What to look for & what to avoid in open source software
      - Activity: Is the project dead? Dead projects are pretty common in places like SourceForge, GitHub, Google Code etc., as well as on private sites, educational sites, and corporate releases.
      - Updates: When was the last commit / release? Is the project functionally complete, so that it does not need updates? Are updates required to match the language or dependent libraries?
      - Usage: Is it commonly used? Is it a standard? A standard might not need updates, e.g. the Expat XML library, a default standard last updated June 2007. Are there known & published limitations & workarounds?
      - Complexity: Does the project solve just a simple problem? If activity / updates are low, consider doing it yourself. If it is highly complex and saves a ton of time, use it with caution.
      - Community: Is there an established community, or is it one or more individuals? Are the developers open to input or contribution?
  • Individuals can be described as a collective with goals that are not shared. Communities allow common problems to bubble up and get prioritized.
  • http://opensource.org/docs/osd
  • Licenses can restrict the environments where software can be used
      - Some restrict commercial usage (insane)
      - Some restrict government / military usage
      - Some prevent you from distributing compiled versions, so you can't use the software in your iPhone app or GPS device
      - Some require attribution / acknowledgement of usage
      - Some prevent modification, either completely (proprietary) or without contributing back the changes
      - Some licenses supersede yours, where usage of the software requires adoption of their license. They can force you to release your source code (TomTom GPS GPL usage, now leader in OSI; TomTom modified the kernel they were using) or prevent you from using a different license in your software
      - Some licenses limit how large your deployment can get before you have to switch to a commercial license
      - AGPL (Affero GPL): http://www.gnu.org/licenses/agpl-3.0.html
  • If it doesn't meet your needs, can you work around the problem before modifying the software? Upgrades are hard to maintain with hacked software. Contribute changes back so you don't have to support them going forward.
      - TCO: mirror the repository if possible. OSDN can go down at critical moments. Security.
      - Develop with a single master: everyone works on the same version of the code (mirrored syncs can be slow and leave developers out of sync), and branching / merging is handled.
  • Inverted index: A standard document search (grep search) requires searching through each document until a match is found. If a score is required, then each matching word has to be counted. The cost could be Σ Words, but there is always some overhead in opening & seeking through files, thus the split of operations (Σ Docs * Σ Words/Doc).
  • Inverted: converts documents of words into an ordered list of unique words, each pointing to the documents that contain it.
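
The inversion described above can be sketched in a few lines of Python, using the three example sentences from the slides. This is a minimal illustration, not how Lucene stores its index: the function names are made up for the example, tokenization is a plain lowercase whitespace split, and a Python dict stands in for the sorted term dictionary.

```python
from collections import defaultdict

# The three example documents from the "Inverted Index" slide.
DOCS = {
    1: "The quick brown fox jumped over the fence",
    2: "The cat and the fiddle",
    3: "The cow jumped over the moon",
}


def build_inverted_index(docs):
    """Map each unique word to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split is enough for this example.
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Order the words alphabetically, as in the slide's word table.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


def search(index, word):
    """Look a single word up in the index; unknown words match nothing."""
    return index.get(word.lower(), [])


index = build_inverted_index(DOCS)
print(search(index, "jumped"))
print(search(index, "the"))
```

A lookup now touches one entry in the word table instead of scanning every document, which is the point of the split.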
  • Document vector: a mathematical representation of the document, usually based on the terms of a corpus. A basic version consists of just term frequencies. Better versions consist of term frequencies over document frequencies. See http://en.wikipedia.org/wiki/Cosine_similarity
    $\cos\Theta = \frac{A \cdot B}{\|A\| \, \|B\|}$
    With term frequencies alone, this results in all documents being similar, with "the" being the strongest signal.
  • From this we've seen document vectors, similarity, term frequency, and inverse document frequency. See the included doc_vectors.xslx
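
As a rough illustration of the notes above, the sketch below builds term-frequency vectors for the three example sentences and compares them with cosine similarity, then repeats the comparison with a TF-IDF weighting. The weighting used here, tf * log(N / df), is an assumption chosen for brevity; it is not necessarily the one used in the spreadsheet, so the numbers will not match the slide.

```python
import math
from collections import Counter

DOCS = {
    1: "The quick brown fox jumped over the fence",
    2: "The cat and the fiddle",
    3: "The cow jumped over the moon",
}


def cosine(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||) for sparse term -> weight maps."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Basic document vectors: raw term frequencies.
tf = {doc_id: Counter(text.lower().split()) for doc_id, text in DOCS.items()}

# Document frequency: in how many documents does each term appear?
df = Counter(term for vector in tf.values() for term in vector)

# Assumed weighting: tf * log(N / df). A term found in every document
# (here "the") gets weight 0 and stops dominating the comparison.
n_docs = len(DOCS)
tfidf = {
    doc_id: {term: count * math.log(n_docs / df[term]) for term, count in vector.items()}
    for doc_id, vector in tf.items()
}

for a, b in [(1, 2), (2, 3), (1, 3)]:
    print(f"Doc {a} vs {b}: tf={cosine(tf[a], tf[b]):.3f} tf-idf={cosine(tfidf[a], tfidf[b]):.3f}")
```

With raw term frequencies every pair looks fairly similar because all three documents contain "the" twice. Once document frequency is taken into account, only documents 1 and 3, which share "jumped" and "over", remain similar.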
  • Separate out search engine design based on user intent, retrieval, and ranking.
  • A search for "Pot Roast" differs from a search for "Diner":
      - Pot Roast: usually recipes
      - Diner: look for a business nearby
    Imagine you run search for a digital newspaper with News, Weather, Sport, Restaurant Reviews, and Recipes indexes. "Search everything" means looking in all indexes and seeing what comes back with the most results. More than likely you have lots of restaurant reviews about pot roast and only 1 recipe. The outcome is unpredictable, driven by data coverage, not by performance. Search everything is also costly. Predefined classifications mean you can focus on a single index, saving hardware. So how do you classify queries? Most new search engines start with DMOZ: http://www.dmoz.org/rdf.html
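
One way to picture the classification step is a small keyword router placed in front of the indexes. Everything in this sketch is hypothetical: the keyword lists are invented for the example (a real system would bootstrap them from a directory such as DMOZ), and the "local" index is an assumed name for the nearby-business data.

```python
# Hypothetical keyword lists per index, hand-written for this example only.
INDEX_KEYWORDS = {
    "recipes": {"roast", "recipe", "stew", "bake"},
    "local": {"diner", "restaurant", "cafe", "nearby"},
    "weather": {"weather", "forecast", "rain"},
}

# Index names for the imagined digital newspaper, plus the assumed "local" index.
ALL_INDEXES = ["news", "weather", "sport", "restaurant_reviews", "recipes", "local"]


def classify(query):
    """Return the index whose keywords overlap the query most, or None."""
    tokens = set(query.lower().split())
    scores = {name: len(tokens & words) for name, words in INDEX_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def route(query):
    """Search a single index when the intent is clear, otherwise search everything."""
    index = classify(query)
    return [index] if index else list(ALL_INDEXES)


print(route("Pot Roast"))
print(route("Diner"))
print(route("election results"))
```

Only the unclassified query pays the cost of searching every index, and the classified ones are no longer at the mercy of whichever index happens to hold the most matching documents.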
  • In complex searches, don't depend on the term score from retrieval. Methods used in retrieval, such as field normalization, shingling etc., can affect the score.
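
A minimal sketch of that separation, assuming a toy in-memory index: retrieval only produces an unscored candidate set, and ranking orders it using signals kept outside the index (business rules, affinity to the classified intent, clicks). The `Candidate` fields, the sample data, and the ordering of the signals are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    category: str         # which kind of content this document is
    clicks: int            # popularity signal collected from metrics
    pinned: bool = False   # business-rule override


# Hypothetical documents and term postings, invented for this example.
DOCS = {
    "recipe-1": Candidate("recipe-1", "recipes", clicks=5),
    "review-1": Candidate("review-1", "restaurant_reviews", clicks=50),
    "review-2": Candidate("review-2", "restaurant_reviews", clicks=1, pinned=True),
}
POSTINGS = {
    "pot": {"recipe-1", "review-1", "review-2"},
    "roast": {"recipe-1", "review-1", "review-2"},
}


def retrieve(query):
    """Retrieval: build the candidate set (docs matching every term), no scoring."""
    ids = None
    for term in query.lower().split():
        matches = POSTINGS.get(term, set())
        ids = matches if ids is None else ids & matches
    return [DOCS[doc_id] for doc_id in sorted(ids or [])]


def rank(candidates, intent):
    """Ranking: business rules first, then affinity to intent, then popularity."""
    return sorted(
        candidates,
        key=lambda c: (c.pinned, c.category == intent, c.clicks),
        reverse=True,
    )


ranked = rank(retrieve("pot roast"), intent="recipes")
print([c.doc_id for c in ranked])
```

Because the ordering never reads a retrieval score, changing how the index is analyzed cannot silently reorder results.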
  • News is now! It is often driven by navigation rather than search ("show me what's happening"): pattern / trending / zeitgeist, and geo / local / national / global. Web depends on coverage, authority, and latent elements: clicks / in-links, and social (bit.ly, fb, twitter etc.).
  • From David Mihm at Search Engine Land: https://searchengineland.com/local-search-complexity-smb-frustration-36839
  • Metrics are usually divided into 2 areas: business intelligence and search quality.
  • A query rewrite is where a user changes their query within a session of, say, 15 minutes while staying in the same concept, e.g. Tools -> Hardware -> Hardware Stores -> Home Depot (click). Short-circuit Tools to Hardware Stores in the future. Clicks are driven by the results displayed to the user. Clicks can decrease, indicating that you are providing the result the user is looking for immediately. The last click on a page of results could be the most relevant. Searches with 1 or more clicks should increase.
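
The search quality counters can be derived from a plain search log. The sketch below assumes a hypothetical log record (`Search`) and counts a rewrite whenever the same user issues a different query within 15 minutes of a search that received no click. Checking that the user stayed "in the same concept" is left out, since that needs a classifier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=15)


@dataclass
class Search:
    user: str
    at: datetime
    query: str
    results: int   # number of results returned
    clicks: int    # clicks on this result page


def quality_metrics(searches):
    """Count searches, query rewrites, no-result searches and searches with 1+ clicks."""
    ordered = sorted(searches, key=lambda s: (s.user, s.at))
    rewrites = 0
    previous = None
    for current in ordered:
        if (
            previous is not None
            and previous.user == current.user
            and current.at - previous.at <= SESSION_WINDOW
            and current.query != previous.query
            and previous.clicks == 0
        ):
            rewrites += 1
        previous = current
    return {
        "searches": len(ordered),
        "rewrites": rewrites,
        "no_results": sum(1 for s in ordered if s.results == 0),
        "with_clicks": sum(1 for s in ordered if s.clicks >= 1),
    }


start = datetime(2011, 2, 16, 9, 0)
log = [
    Search("u1", start, "Tools", results=40, clicks=0),
    Search("u1", start + timedelta(minutes=2), "Hardware", results=35, clicks=0),
    Search("u1", start + timedelta(minutes=4), "Hardware Stores", results=20, clicks=0),
    Search("u1", start + timedelta(minutes=6), "Home Depot", results=3, clicks=1),
    Search("u2", start, "xyzzy", results=0, clicks=0),
]
metrics = quality_metrics(log)
print(metrics)
```

In the sample log the Tools -> Hardware -> Hardware Stores -> Home Depot chain counts as three rewrites ending in one search with a click, which is the pattern a short circuit from Tools to Hardware Stores is meant to shrink.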
    1. Open Source Search: Architectural design considerations
    2. Overview (2/14/11, pjaol@pjaol.com)
         - Open Source & Corporate Usage
         - Search Design Principles
         - Metrics
    3. Open Source & Corporate Usage
         - What to look for & what to avoid
         - License pitfalls
         - Just because it's open source...
    4. What to look for?
         - Activity
         - Updates
         - Usage
         - Complexity
         - Community
    5. "Individuals invent problems. Communities invent solutions."
    6. License pitfalls
         - Licenses are legal bindings
         - There are lots of them: GPL, LGPL, ASF, BSD, Apple, MIT, Mozilla
         - There to protect the developers, IP, and company
    7. License View
    8. Features of licensing
         - Commercial usage
         - Distribution
         - Attribution
         - Modifiable
         - Superseding licensing
         - Scaling limits
         - http://opensource.org/
    9. Just because it's open source...
         - != license to hack
         - TCO, mirror repository
         - Use a single master repository
    10. Search Design Principles
         - Design principles
         - Separation of high level features
         - Types of search and their differences
         - What does social mean to search
    11. Design Principles
         - Core Search Basics
         - Lucene
         - API gotchas
    12. Core Search Basics: Inverted Index
         1) The quick brown fox jumped over the fence
         2) The cat and the fiddle
         3) The cow jumped over the moon
         Σ Docs * Σ Words/Doc
         O(N)
    13. Core Search Basics
         Word      { doc id, ... }
         and       { 2 }
         brown     { 1 }
         cat       { 2 }
         cow       { 3 }
         fence     { 1 }
         fiddle    { 2 }
         fox       { 1 }
         jumped    { 1, 3 }
         moon      { 3 }
         over      { 1, 3 }
         quick     { 1 }
         the       { 1, 2, 3 }
         O(Log(n))
    14. Core Search Basics: How similar are documents?
         - Cosine Similarity
         - Used to find similar results
         - Cluster items, find trends etc.
         - Bring some AI into search
         - Requires creating a "document vector"
    15. (no text on slide)
    16. (no text on slide)
    17. $\cos\Theta = \frac{A \cdot B}{\|A\| \, \|B\|}$
         - Closer to 1 means similar
         - Doc 1 vs 2 = 0.107695892
         - Doc 2 vs 3 = 0.139558473
         - Doc 1 vs 3 = 0.247523241
    18. Separation of high level features
         - User Intent: Try to classify what the user is looking for
         - Retrieval: Find data based on query / classification & data model
         - Ranking: Sort results, based on business rules, relevancy, performance / popularity
    19. User Intent
         - Classifying search queries
         - Allow for disjointed searches
         - Different ranking algorithms
         - Control & override results for traffic shaping
         - Map user queries to the data you have vs. search everything
         - Most search engines bootstrap classifiers with DMOZ
    20. Retrieval
         - Create a candidate set of results
         - Based on your data model for that index
         - Separation is required because data models & data integrity can fluctuate, and coverage changes signal strength
    21. Ranking / Sorting
         - Rank candidate set from retrieval with: affinity to user intent, relevancy, performance / clicks / popularity etc., business rules
         - Make ranking config / override / user friendly!
    22. Types of Search
         - News: Most relevant & latest; Entertainment, Sports, Financial, Viral
         - Web: Best page in index; Textual relevancy
         - Local: Coverage / data quality
    23. Local Data Quality
    24. Don't overkill
    25. Metrics: Basics
         - Searches
         - Clicks
         - Pagination
         - No Results
         - Unique users
         - Measure everything!!
    26. Metrics
    27. Metrics: Search Quality
         - New searches should increase
         - Query rewrites decrease
         - No Results should decrease
         - Clicks might decrease
         - Searches with 1+ clicks should increase
    28. Metrics are as critical as software
         - Not all metrics are obvious
         - Incorporate user intent in overall picture
    29. Q&A: pjaol@pjaol.com
