ENTERPRISE SEARCH
an introduction
Web Search
Desktop Search
Enterprise Search
so what is a
Search Engine?
a SOFTWARE
• that builds an index on text
• answers queries using that index
Any search application has
two major components
INDEXING component
- of importance to us developers
(read headache)
SEARCH component
- of importance to the users
[Diagram: in the INDEXING component, data is indexed into INDEX FILES; in the SEARCH component, the user sends a search query and receives search results]
Let’s start with
INDEXING
is it easy to search here . . .
or here . . .
• that’s information piled up like garbage
• no structure
• comes in all kinds of
shapes, sizes, formats
• And this is what indexing does:
it puts data into a structured format
that is easily accessible through search.
so what needs to be
indexed and searched?
various FILE FORMATS
Text Files
HTML
PDF
MS Word
PPT
coming from various DATA SOURCES
Emails
CMS
File System
Database
Web Pages
[Diagram: data (documents) and INDEX FILES; the user sends a search query and receives search results]
The text that should be indexed is fed to the Analyzer, which prepares it by
• removing stop words such as "a" or "the"
• converting all text to lowercase letters for case-insensitive searching
• stemming (a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word "fish")
The tokenized text is then passed to the Index Writer.
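The analyzer steps can be sketched in a few lines of Python. This is only an illustration: the stop-word list is limited to the two words named on the slide, and the `stem` helper is a naive suffix stripper assumed for the example, not a real stemming algorithm such as Porter's.

```python
import re

# Assumption: a tiny stop-word list; real analyzers ship much longer ones.
STOP_WORDS = {"a", "the"}

# Assumption: naive suffix stripping, NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "er")


def stem(word):
    """Strip one known suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase the text, split it into tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(analyze("The fisher fished a fish while fishing"))
# ['fish', 'fish', 'fish', 'while', 'fish']
```

The output of `analyze` is the tokenized text that an index writer would then store.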
Document 1:
Coffee isn't my cup of tea.
Document 2:
Chocolate, men, coffee - some things are better rich.
INDEX
coffee - 1,2
cup - 1
tea - 1
chocolate - 2
men - 2
things - 2
better - 2
rich - 2
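A minimal sketch of how such an inverted index could be built for the two example documents. The stop-word list here is an assumption, chosen so that only the terms listed on the slide remain; each term maps to the IDs of the documents that contain it.

```python
import re
from collections import defaultdict

# Assumption: stop words picked so the two sentences reduce to the slide's terms.
STOP_WORDS = {"isn't", "my", "of", "some", "are"}

DOCUMENTS = {
    1: "Coffee isn't my cup of tea.",
    2: "Chocolate, men, coffee - some things are better rich.",
}


def build_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for token in re.findall(r"[a-z']+", documents[doc_id].lower()):
            if token in STOP_WORDS:
                continue
            postings = index[token]
            if doc_id not in postings:
                postings.append(doc_id)
    return dict(index)


def search(index, term):
    """Look up a single term; the lookup is case-insensitive."""
    return index.get(term.lower(), [])


index = build_index(DOCUMENTS)
for term, doc_ids in index.items():
    print(term, "-", ",".join(str(d) for d in doc_ids))
```

Answering a query is then a dictionary lookup: `search(index, "Coffee")` returns both document IDs, while `search(index, "tea")` returns only the first.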
And now the
SEARCH Component
[Diagram: data is indexed into INDEX FILES; the user sends a search query and receives search results]
[Diagram: the search request terms are checked against the Spelling Index (correct search terms + incorrect search terms), expanded through the Taxonomy (search terms + related terms from taxonomy + concept IDs), and sent to the search engine (INDEX)]
Search results with
1) Actual Location of the result
2) Rank
3) Details
4) Facet Categorization
Results’ Page
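The spelling and taxonomy steps of the search flow can be sketched as follows. Everything here is hypothetical: the spelling index, the taxonomy entries, and the concept IDs are invented for illustration, and `difflib` from the Python standard library stands in for whatever spell checker a real system would use.

```python
import difflib

# Hypothetical spelling index and taxonomy, invented for this example.
SPELLING_INDEX = ["coffee", "tea", "chocolate"]
TAXONOMY = {
    "coffee": {"concept_id": "C1", "related": ["espresso", "latte"]},
    "tea": {"concept_id": "C2", "related": ["chai"]},
}


def correct(term):
    """Return the closest known spelling, or the term itself if none is close."""
    term = term.lower()
    matches = difflib.get_close_matches(term, SPELLING_INDEX, n=1, cutoff=0.8)
    return matches[0] if matches else term


def expand(terms):
    """Attach the corrected spelling, related terms and concept ID to each term."""
    expanded = []
    for term in terms:
        fixed = correct(term)
        entry = TAXONOMY.get(fixed, {})
        expanded.append({
            "original": term,
            "corrected": fixed,
            "related": entry.get("related", []),
            "concept_id": entry.get("concept_id"),
        })
    return expanded


print(expand(["cofee", "tea"]))
```

The expanded terms are what would be handed to the search engine in place of the raw user input.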
introducing
LUCENE
Full-text search library
Open Source
Documents in xml format
Can operate on its own or via Solr
Ways of storing the fields of a document:
Indexed means the field is searchable.
Stored means the content can be displayed in the search results, even if you
choose not to make the field searchable. Example: “summary associated with a page”
Tokenized means the field is run through an Analyzer, which converts the content into
a sequence of tokens.
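To make the three flags concrete, here is a toy model in Python. It is not Lucene's API: the `Field` class and both helper functions are assumptions made up for this sketch, and whitespace splitting stands in for a real Analyzer.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Toy stand-in for a document field (not Lucene's Field class)."""
    name: str
    value: str
    indexed: bool = True    # searchable
    stored: bool = True     # can be shown in the search results
    tokenized: bool = True  # run through an analyzer before indexing


def index_terms(field):
    """Terms that would go into the index for this field."""
    if not field.indexed:
        return []
    if field.tokenized:
        return field.value.lower().split()
    return [field.value]  # the whole value is kept as a single term


def stored_fields(fields):
    """Values that can be displayed with a search result."""
    return {f.name: f.value for f in fields if f.stored}


doc = [
    Field("id", "05991", tokenized=False),
    Field("summary", "An intro to Solr", indexed=False),
    Field("body", "Solr is a search server", stored=False),
]
print([index_terms(f) for f in doc])
print(stored_fields(doc))
```

Here the summary can be displayed but not searched, while the body can be searched but is not returned with the results.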
introducing
SOLR
[Diagram: Solr, Lucene, Index]
• open source
• handles index and query requests to Lucene via HTTP and XML
(also JSON)
• manages document update, add and delete
requests to Lucene
• straightforward schema and config files
• comprehensive HTML Admin Interfaces
• highly configurable
Adding Documents
to SOLR
HTTP POST to /update
<add><doc boost="2">
<field name="type">05991</field>
<field name="from">Apache Solr</field>
<field name="subject">An intro...</field>
<field name="category">search</field>
<field name="category">lucene</field>
<field name="body">Solr is a full...</field>
</doc></add>
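A hedged sketch of how such an add message could be built and posted from Python using only the standard library. The full address `http://localhost:8983/solr/update` is an assumption (host and port borrowed from the query URL used in these slides); the exact path depends on the Solr version and setup. The helper names are made up for the example, and the request is only constructed, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: a local Solr instance; adjust the URL to your own installation.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_add_xml(fields, boost=None):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body):
    """Wrap the XML message in an HTTP POST request object."""
    return urllib.request.Request(
        UPDATE_URL,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


fields = [
    ("type", "05991"),
    ("from", "Apache Solr"),
    ("category", "search"),
    ("category", "lucene"),
]
xml_body = build_add_xml(fields, boost=2)
request = build_update_request(xml_body)
print(xml_body)
# urllib.request.urlopen(request)  # uncomment to send to a running Solr
```

A list of pairs is used instead of a dictionary so that a field such as `category` can appear more than once.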
schema.xml
defines how fields are indexed and displayed
solrconfig.xml
defines cache size, faceted field type, request handler customization
Deleting Documents
• Delete by Id
<delete><id>05591</id></delete>
• Delete by Query (multiple documents)
<delete>
<query>manufacturer:microsoft</query>
</delete>
Search Results
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
Default Parameters
param   default     description
q                   The query
start   0           Offset into the list of matches
rows    10          Number of documents to return
fl      *           Stored fields to return
qt      standard    Query type; maps to query handler
df      (schema)    Default field to search
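As a small illustration, the query URL can be assembled from these parameters with Python's standard library. The `build_select_url` helper is hypothetical, and only the `start`, `rows` and `fl` defaults from the table are filled in.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"

# Defaults taken from the parameter table.
DEFAULTS = {"start": 0, "rows": 10, "fl": "*"}


def build_select_url(q, **overrides):
    """Combine the query with the defaults and any overridden parameters."""
    params = {"q": q, **DEFAULTS, **overrides}
    # Keep "," and "*" readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=",*")


print(build_select_url("video", rows=2, fl="name,price"))
# http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
```

Calling `build_select_url("video")` with no overrides asks for the first 10 matches with all stored fields.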
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
<response><responseHeader><status>0</status>
<QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
<doc>
<str name="name">Apple 60 GB iPod with Video</str>
<float name="price">399.0</float>
</doc>
<doc>
<str name="name">ASUS Extreme N7800GTX/2DHTV</str>
<float name="price">479.95</float>
</doc>
</result>
</response>
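A client would typically turn this XML response into native values. The sketch below parses the response shown here with Python's standard `xml.etree.ElementTree`; the `parse_response` helper is made up for the example and handles only the `str` and `float` elements that appear in it.

```python
import xml.etree.ElementTree as ET

RESPONSE = """
<response><responseHeader><status>0</status>
<QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
<doc>
<str name="name">Apple 60 GB iPod with Video</str>
<float name="price">399.0</float>
</doc>
<doc>
<str name="name">ASUS Extreme N7800GTX/2DHTV</str>
<float name="price">479.95</float>
</doc>
</result>
</response>
""".strip()


def parse_response(xml_text):
    """Return (total number of matches, list of documents as dictionaries)."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        record = {}
        for child in doc:
            value = child.text
            if child.tag == "float":
                value = float(value)
            record[child.get("name")] = value
        docs.append(record)
    return int(result.get("numFound")), docs


num_found, docs = parse_response(RESPONSE)
print(num_found, docs)
```

Note that `numFound` is the total number of matches, while only `rows` documents come back in each response.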
[Diagram: Solr architecture]
HTTP Request Servlet (search requests hit here): Admin Interface, Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler, XML Response Writer
Update Servlet (new documents are added here): XML Update Interface
Solr Core: Config, Schema, Analysis, Caching, Concurrency, Update Handler, Replication
Lucene
FAQ
Website: www.solr.cc
QQ Group: 187670960

Solr China, June 21, Enterprise Search