Introduction to Search Engines

This presentation gives an introduction to search engines: what they are and how they work.
It also gives a brief introduction to Solr and Lucene.

Published in: Technology

  1. 1. ENTERPRISE SEARCH an introduction
  2. 2. Web Search, Desktop Search, Enterprise Search
  3. 3. so what is a Search Engine?
  4. 4. a SOFTWARE that builds an index on text and answers queries using that index
  5. 5. Any search application has two major components: the SEARCH component and the INDEXING component. One is of importance to us developers (read: headache), the other is of importance to the users.
  6. 6. [Diagram] INDEXING component: data is indexed into INDEX FILES. SEARCH component: user sends search query, receives search results.
  7. 7. Let’s start with INDEXING
  8. 8. is it easy to search here . . .
  9. 9. or here . . .
  10. 10. that's information like garbage: no structure; comes in all kinds of shapes, sizes, formats
  11. 11. And this is what indexing does: it makes data accessible in a structured format, easily accessible through search.
  12. 12. so what all needs to be Indexed and Searched ?
  13. 13. various FILE FORMATS: Text Files, HTML, PDF, MS Word, PPT
  14. 14. coming from various DATA SOURCES: Emails, CMS, File System, Database, Web Pages
  15. 15. [Diagram] data (documents) -> Analyzer -> tokenized text -> Index Writer -> INDEX FILES; user sends search query, receives search results.
      The text that should be indexed is fed to the Analyzer, which applies:
      - removing stop words such as "a" or "the"
      - converting all text to lowercase letters for case-insensitive searching
      - stemming (a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish")
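The three analyzer steps on this slide can be sketched in a few lines of Python. This is a toy illustration, not Lucene's Analyzer: the names `STOP_WORDS`, `SUFFIXES`, `stem`, and `analyze` are hypothetical, the stop-word list contains only the two words the slide names, and the suffix stripper is a stand-in for a real stemming algorithm that only covers the slide's "fish" example.

```python
import re

# Assumption: the slide only names "a" and "the" as stop words.
STOP_WORDS = {"a", "the"}

# Toy suffix list, just enough for the slide's "fish" example.
# A real stemmer (e.g. Porter) has far more rules.
SUFFIXES = ("ing", "ed", "er")


def stem(word: str) -> str:
    """Strip one known suffix, keeping at least three letters of the root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> list[str]:
    """Lowercase the text, split it into tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]
```

Calling `analyze("The Fisher fished a fish")` yields three tokens, all `"fish"`: the stop words are gone and the remaining words collapse to one root, which is what lets a query for "fish" match all of them.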
  16. 16. Document 1: Coffee isn't my cup of tea.
      Document 2: Chocolate, men, coffee - some things are better rich.
      INDEX: coffee - 1,2; cup - 1; tea - 1; chocolate - 2; men - 2; things - 2; better - 2; rich - 2
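The index on this slide is an inverted index: each term maps to the documents that contain it. A minimal Python sketch follows, assuming a stop-word list chosen so that the words missing from the slide's index ("isn't", "my", "of", "some", "are") are dropped. The names `STOP_WORDS`, `tokenize`, `build_index`, `DOCS`, and `INDEX` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: exactly the words that do not appear in the slide's index.
STOP_WORDS = {"isn't", "my", "of", "some", "are"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in tokenize(docs[doc_id]):
            if doc_id not in index[term]:
                index[term].append(doc_id)
    return dict(index)


DOCS = {
    1: "Coffee isn't my cup of tea.",
    2: "Chocolate, men, coffee - some things are better rich.",
}
INDEX = build_index(DOCS)
```

Looking up `INDEX["coffee"]` gives both document ids, while `INDEX["cup"]` gives only document 1. Answering a query then becomes a dictionary lookup instead of a scan over every document.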
  17. 17. And now the SEARCH Component
  18. 18. [Diagram] data is indexed into INDEX FILES; user sends search query (search terms), receives search results
  19. 19. [Diagram] Search request terms pass through a Spelling Index (correct search terms + incorrect search terms) and a Taxonomy (search terms + related terms from taxonomy + concept IDs) to the search engine (INDEX). The Results' Page shows search results with: 1) actual location of the result, 2) rank, 3) details, 4) facet categorization.
  20. 20. introducing LUCENE
  21. 21. Full-text search library; Open Source; Documents in XML format; Can operate on its own or via Solr
  22. 24. Ways of storing fields of any document:
      - Indexed: means the field is searchable
      - Stored: means the content can be displayed in the search results; you may choose not to make a stored field searchable. Example: "summary associated with a page"
      - Tokenized: means the field is run through an Analyzer, which converts the content into a sequence of tokens
  23. 25. introducing SOLR [Diagram labels: Solr, Solr, Lucene, Index]
  24. 26. Solr:
      - open source
      - handles index/query requests to Lucene via HTTP and XML (also JSON)
      - manages document update, add and delete requests to Lucene
      - straightforward schema and config files
      - comprehensive HTML Admin Interfaces
      - highly configurable
  25. 27. Adding Documents to SOLR
  26. 28. HTTP POST to /update
      <add><doc boost="2">
        <field name="type">05991</field>
        <field name="from">Apache Solr</field>
        <field name="subject">An intro...</field>
        <field name="category">search</field>
        <field name="category">lucene</field>
        <field name="body">Solr is a full...</field>
      </doc></add>
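The add message on this slide can be assembled programmatically rather than by hand. The Python sketch below builds the same XML with the standard library and prepares, but does not send, the HTTP POST. The URL `http://localhost:8983/solr/update` is an assumption combining the host from the later query slides with the `/update` path named here; `build_add_xml` and `make_update_request` are hypothetical helper names.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: host and port taken from the select URL on the later slides.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(fields, boost=None) -> bytes:
    """Build an <add><doc>...</doc></add> message from (name, value) pairs.

    A list of pairs is used instead of a dict so that a field name such as
    "category" can appear more than once.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="utf-8")


def make_update_request(payload: bytes) -> urllib.request.Request:
    """Prepare the POST request; pass it to urllib.request.urlopen() to send it."""
    return urllib.request.Request(
        UPDATE_URL,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
```

For example, `make_update_request(build_add_xml([("category", "search"), ("category", "lucene")], boost=2))` returns a request object that is ready to send. Building the message through an XML library also escapes characters such as `&` and `<` in field values, which hand-written strings tend to get wrong.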
  27. 29. schema.xml: field indexing and display definition
  28. 30. <field name="subject" type="string" indexed="true" stored="false" multiValued="true"/>
      <field name="genus_species" type="string" indexed="true" stored="false" multiValued="true"/>
      <field name="language" type="string" indexed="true" stored="false" multiValued="true"/>
      <field name="creator" type="string" indexed="true" stored="false" multiValued="true"/>
      <field name="control_num" type="string" indexed="true" stored="false" multiValued="true"/>
      <field name="title_sort" type="string" indexed="true" stored="false"/>
  29. 31. solrconfig.xml: defines cache size, faceted field type, request handler customization
  30. 32. Deleting Documents
      Delete by Id:
      <delete><id>05591</id></delete>
      Delete by Query (multiple documents):
      <delete>
        <query>manufacturer:microsoft</query>
      </delete>
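Both delete messages on this slide share one shape: a `<delete>` element wrapping either an `<id>` or a `<query>`. A small Python sketch that generates them is shown below; the function names `delete_by_id` and `delete_by_query` are hypothetical, and the resulting strings would be POSTed to the same `/update` endpoint used for adding documents.

```python
import xml.etree.ElementTree as ET


def _delete_message(child_tag: str, text: str) -> str:
    """Wrap a single <id> or <query> element in a <delete> message."""
    delete = ET.Element("delete")
    child = ET.SubElement(delete, child_tag)
    child.text = text
    return ET.tostring(delete, encoding="unicode")


def delete_by_id(doc_id: str) -> str:
    """Message that removes the one document with the given unique id."""
    return _delete_message("id", doc_id)


def delete_by_query(query: str) -> str:
    """Message that removes every document matching the query."""
    return _delete_message("query", query)
```

Here `delete_by_id("05591")` and `delete_by_query("manufacturer:microsoft")` reproduce the two messages from the slide, without the line breaks.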
  31. 33. Search Results http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
  32. 34. Default Parameters
      http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price

      param   default    description
      q                  The query
      start   0          Offset into the list of matches
      rows    10         Number of documents to return
      fl      *          Stored fields to return
      qt      standard   Query type; maps to query handler
      df      (schema)   Default field to search
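The way the defaults in this table combine with explicit parameters can be sketched as a small URL builder in Python. The base URL and the default values for `start`, `rows`, and `fl` come from the slide; the helper name `build_select_url` is hypothetical, and `qt` and `df` are left out so the server's own defaults apply.

```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/select"

# Defaults as listed on the slide. Sending them explicitly is equivalent
# to omitting them; they are spelled out here only to make them visible.
DEFAULTS = {"start": 0, "rows": 10, "fl": "*"}


def build_select_url(q: str, **overrides) -> str:
    """Build a select URL; keyword arguments override the defaults."""
    params = {"q": q, **DEFAULTS, **overrides}
    # Keep "," and "*" unescaped so the URL reads like the one on the slide.
    return SELECT_URL + "?" + urlencode(params, safe=",*")
```

With these definitions, `build_select_url("video", rows=2, fl="name,price")` reproduces the URL shown on the slide, and `build_select_url("video")` falls back to ten rows and all stored fields.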
  33. 35. <response>
        <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
        <result numFound="16173" start="0">
          <doc>
            <str name="name">Apple 60 GB iPod with Video</str>
            <float name="price">399.0</float>
          </doc>
          <doc>
            <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
            <float name="price">479.95</float>
          </doc>
        </result>
      </response>
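A client consuming this response mainly needs the total hit count and the returned fields of each document. The Python sketch below parses the XML from the slide with the standard library; `SAMPLE_RESPONSE` and `parse_results` are hypothetical names, and only the `<str>` and `<float>` element types that appear on the slide are handled.

```python
import xml.etree.ElementTree as ET

# The response from the slide, used here as sample input.
SAMPLE_RESPONSE = """
<response>
  <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
  <result numFound="16173" start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
"""


def parse_results(xml_text: str):
    """Return (total number of matches, list of per-document field dicts)."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for child in doc:
            # The element tag carries the type; the "name" attribute is the field.
            value = float(child.text) if child.tag == "float" else child.text
            fields[child.get("name")] = value
        docs.append(fields)
    return int(result.get("numFound")), docs
```

Note that `numFound` is the number of matches in the whole index, while the list of documents is limited by the `rows` parameter, so the two counts usually differ.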
  34. 36. [Architecture diagram] Labels: Solr Core; Lucene; Admin Interface; Standard Request Handler; Disjunction Max Request Handler; Custom Request Handler; Update Handler; Caching; XML Update Interface; Config; Analysis; HTTP Request Servlet; Concurrency; Update Servlet; XML Response Writer; Replication; Schema. Annotations: "Search Requests hit here"; "New document to be added here".