Open source enterprise search and retrieval platform

Open Source Search & Retrieval Platform

Marc Teutelink
Enterprise Search, EAI, Semantic Web
Date: 21 August 2010

How Apache open source software is used
during the implementation of an
Enterprise Search and Retrieval Platform

(Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)

Marc Teutelink
marc.teutelink@luminis.eu
@mteutelink

•Software architect at Luminis
•15+ years of experience in software development; specialized in
Enterprise Search, Enterprise Application Integration and
Semantic Web technology
•Currently writing “Enterprise Search in Action” for Manning
(Mid-2011)

Agenda

•Enterprise Search
• What is Enterprise Search: Functions and features
• Challenges
• Logical Architecture
•Enterprise Search Solution
• Technology Stack
• Collection Process
• Publication Process
• Enricher framework
• Deployment
•Conclusion

What is Enterprise Search?

“Enterprise Search offers a solution for searching,
finding and presenting enterprise related information
in the larger sense of the word”

Enterprise search is all about searching through documents of
any type and format, from any source, located anywhere, with the
utmost flexibility
• Web search: limited to public documents on the web
• Desktop search: limited to private documents on the local machine
• Enterprise search: no limitations on document type and location

Enterprise Search
(features)

•Information Sources and Types
• Wide range of sources: local and remote filesystems, content repositories,
e-mail, databases, internet, intranet and extranet
• Type not limited: any type ranging from structured to unstructured data, text
and binary formats and compound formats (zip)

•Usage
• Not limited to interactive use → automated business processes

•Security
• Integration with enterprise security infrastructure

•User Interaction and personalization
• Identity enables more personalized search results

Enterprise Search
(features)

•Extended metadata
• More metadata → better and more precise search results
• More control over schema (for example Dynamic Fields)

•Ranking
• More control over ranking: personalized ranking (group)

•Data extraction and derivation
• Extract data using various techniques: XPath, XQuery
• Derive data using external knowledge models: RDBMS, RDF store, web services
• Conditional extraction & derivation

•Managing and monitoring
• On-the-fly management (JMX)
• Real time monitoring

Enterprise Search
(features)

•User Interfaces
• Web search
  • All about selling advertisements to the masses
  • Generic, minimalist screens; focus on ads
• Enterprise search
  • All about finding: rich navigation; focus on quick find
  • Small, targeted audience
  • Specialized and customized screens (use of ontologies, taxonomies and classifications)
  • Use of identity (results customized to user) and Web 2.0
  • Grouping: field collapsing, faceted search & clustering

Enterprise Search
(Challenges)

•Performance and scalability
•Rich functions and features
•Manageability
•Flexibility
•Easy maintenance
•Quick issue and problem solving
•Reduce total cost of ownership

Enterprise Search
(Challenges)

•Manageability
•Flexibility
•Easy maintenance

Commercial Search Engines?

Apache Based (Open Source)
Search & Retrieval Platform

Enterprise Search
(Logical Architecture)
Collection Process (Sources)
• Content Inbound: Pull (Crawling), Pull (Harvesting), Push (SOAP/ReST)
• Content Validation: Syntactic, Semantic
• Content Enrichment: Extraction, Enhancement, Filtering
• Indexing

Publication Process (Actors)
• Request Inbound: HTTP/Get (URL), HTTP/Post (SOAP/ReST), API (Java, Perl, ...)
• Request Validation: Syntactic, Semantic
• Request Enrichment: Redirection (Suggestions), Enhancement (add/remove clauses)
• Searching & Ordering: Filtering, Grouping, Sorting
• Response Enrichment: Redirection (more like this), Enhancement (metadata, editorial)
• Response Outbound: Stateless (XSLT, SolrJS), Stateful (Webapp Framework)

Search Engine: shared by both processes

Enterprise Search
(Collection Process)
Sources
• Any document format
• Any type
• Structured and unstructured
• Textual and binary
• Compound
• Residing anywhere
• Security

Enterprise Search
Collection Process

Content Inbound
• Pull (Crawling/Spidering)
  • Internet, intranet & extranet
  • Local and remote filesystems
• Pull (Harvesting)
  • Databases
  • Content Repositories / Mgmt Systems
  • Webservices inbound
• Push
  • Webservices (SOAP/REST)
  • Real time indexing

Enterprise Search
Collection Process

Content Validation
• Syntactic validation
  • Based on DTD / XML-Schema
  • Structure and limited content
• Semantic validation
  • Based on algorithms: Groovy, XPath, Regex, …
• Think about exception handling
• Placed anywhere in flow
  • During inbound: XML-Schema validation
  • After enrichment: validate derived metadata
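The two validation levels can be sketched in a few lines of Python. This is an illustrative sketch only, not the platform's implementation: a well-formedness check stands in for DTD / XML-Schema validation, and the element names (`title`, `zipcode`) and the postcode rule are assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semantic rule: a postcode such as "7411 AB".
ZIPCODE = re.compile(r"\d{4} ?[A-Z]{2}")


def syntactic_check(raw_xml):
    """Return the parsed root, or None if the payload is not well-formed XML."""
    try:
        return ET.fromstring(raw_xml)
    except ET.ParseError:
        return None


def semantic_check(root):
    """Return a list of rule violations (an empty list means valid)."""
    errors = []
    title = root.findtext("title")
    if not title or not title.strip():
        errors.append("title is missing or empty")
    zipcode = root.findtext("zipcode")
    if zipcode is not None and not ZIPCODE.fullmatch(zipcode.strip()):
        errors.append("zipcode has an unexpected format")
    return errors


def validate(raw_xml):
    """Run both checks; the result is what an exception handler would receive."""
    root = syntactic_check(raw_xml)
    if root is None:
        return ["not well-formed XML"]
    return semantic_check(root)
```

Returning a list of violations instead of raising keeps the decision about what to do with an invalid document (reject, park, repair) in the flow rather than in the validator.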

Enterprise Search
Collection Process

Content Enrichment
• Extraction
  • Metadata
  • Content (free text of document)
• Enhancing
  • Derive new and alter existing metadata
• Filtering
  • Remove (parts of) metadata
• Leverage external knowledge models
• Conditional enrichment

Enterprise Search
Collection Process

Indexing
• Store in search engine(s)
• Content based routing
• Document boosting

Enterprise Search
(Publication Process)

Request Inbound
• HTTP/Get
  • URL based with parameters
  • Response in XML, JSON, …
• HTTP/Post
  • XML (SOAP, REST) request
  • XML (SOAP, REST) response
• API
  • Java, Perl, …
  • Wrappers on HTTP/Get
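A URL-based request with parameters might be assembled as below. This is a sketch under assumptions: the host, port and field names are hypothetical; `q`, `fq`, `rows` and `wt` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, filters=(), rows=10, fmt="json"):
    """Build an HTTP/Get search URL; fmt selects the response format (xml, json, ...)."""
    params = [("q", query), ("rows", str(rows)), ("wt", fmt)]
    params += [("fq", f) for f in filters]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local Solr instance and field names.
url = build_select_url(
    "http://localhost:8983/solr",
    "title:permit",
    filters=["source:Rijksweb"],
)
```

An API wrapper in Java, Perl or any other language is then little more than this URL construction plus parsing of the response.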

Enterprise Search

Request Validation
• Syntactic Validation
  • Correct query syntax?
• Semantic Validation
  • Correct field filters?
  • Based on algorithms: Groovy, Regex
• Placed anywhere in flow
  • @inbound: XML-Schema validation
  • @enrichment: Validate derived request clauses

Enterprise Search

Request Enrichment
• Redirection
  • Spelling suggestions
  • Metadata suggestions
• Enhancing
  • Add/Remove clauses
  • Stemming, synonyms, stop words

Enterprise Search

Searching & Ordering
• Filtering
  • Field search
• Grouping
  • Add group information
  • Field collapsing, faceted search & clustering
• Sorting
  • Sort on field
  • Ranking

Enterprise Search

Response Enrichment
• Redirection
  • Suggestions
  • More like this
• Enhancing
  • Add/Remove response fields
  • Schema information
  • Editorial information

Enterprise Search

Response Outbound
• Stateless
  • No security
  • XSLT, SolrJS
• Stateful
  • Security
  • Web 2.0
  • Web Application Framework

Technology Stack

•Use ESB for the flow: Apache ServiceMix with Camel
• Leverage standard ESB components (Transformers, Validation, Splitter,
Filter, Routers, Scripting)
• Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE)
• Custom: Crawler Apache Nutch
• Leverage only crawl framework
• Extend NutchIndexWriter; asynchronously pushing crawled documents
back into ESB flow (reply-to)
•ESB Makes distributed flow possibleContent based routing
•Hot deploy Easy maintenance
•Reusing services across collection processes
•Search Engine independent

Collection Process Flow
[Diagram: the collection flow expressed in Enterprise Integration Patterns]
• Content Inbound: Push Inbound (Message Endpoint) → Syntactic Validation (Channel Purger) → Splitter, which turns incoming documents (1..N) into individual document messages
• Content Validation: Semantic Validation (Channel Purger); invalid messages go to the Invalid Message Channel
• Content Enrichment: Content Filter and Content Enricher (Message Enricher)
• Content Indexer: Transformer (Message Translator) produces a SOLR document → HTTP Transport (Channel Adapter) → Lucene/SOLR index (SolrJ)
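The shape of this flow (split, validate, divert invalid messages, transform for indexing) can be sketched without an ESB. The sketch below is plain Python, not ServiceMix/Camel configuration; the batch format, the element names and the "every document needs an id" rule are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def split(batch_xml):
    """Splitter: one inbound batch becomes one message per <doc> element."""
    return list(ET.fromstring(batch_xml).iter("doc"))


def is_valid(doc):
    """Semantic validation (hypothetical rule): every document needs an id."""
    return bool(doc.findtext("id"))


def to_index_document(doc):
    """Transformer: flatten child elements into a field dictionary."""
    return {child.tag: (child.text or "").strip() for child in doc}


def collect(batch_xml):
    """Return (documents ready for indexing, messages for the invalid channel)."""
    index_queue, invalid_channel = [], []
    for doc in split(batch_xml):
        if is_valid(doc):
            index_queue.append(to_index_document(doc))
        else:
            invalid_channel.append(ET.tostring(doc, encoding="unicode"))
    return index_queue, invalid_channel
```

Keeping the invalid channel as a separate output mirrors the diagram: a bad document does not stop the batch, it is parked for inspection.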

Technology Stack

•Use flow from Apache Lucene/Solr
• Leverage standard Solr components (synonyms, stopwords,
stemming, MLT, spelling, faceted search, …)
• Custom components: using Solr’s extensibility framework
• Security: authority field in schema with Apache Shiro integration
• Field filters (zipcode,…)

•User interfaces
• Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter
• Stateful: Apache Wicket with Spring
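For the security bullet, one common way to use an authority field is to add a filter query restricting results to the groups of the current user. The sketch below only builds that filter string; where the groups come from (for example the security layer) and the group names are assumptions, and the Apache Shiro integration itself is not shown.

```python
def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def authority_filter(user_groups, field="authority"):
    """Build a filter query matching documents readable by any of the groups."""
    if not user_groups:
        raise ValueError("at least one group is required")
    terms = " OR ".join(quote_term(g) for g in sorted(set(user_groups)))
    return f"{field}:({terms})"
```

For a user in the hypothetical groups `staff` and `hr`, the resulting string can be sent as an extra `fq` parameter with every search request.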

Enterprise Search

[Diagram: logical architecture mapped onto technology]
• Collection process (Sources): ServiceMix/Camel, Nutch
• Publication process (Actors): Apache Wicket, SolrJS/XSLT
• Search Engine: Lucene/SOLR

Enterprise Search

[Diagram: the same architecture, with the Luminis Enricher Framework placed between Sources/Actors and the Search Engine, next to Content Validation and Request Validation]

Luminis Enricher Framework

•Custom Enricher Framework
• Existing ESB & SOLR enricher capabilities not sufficient.

• Enriching = one or more actions (extraction, enhancing &
filtering) performed on documents with fields

• Same enricher to be used for:
  • Collection process:
    • Documents → enriching, filtering & splitting
  • Publication process:
    • Search requests → 'first-components' SearchComponent
    • Search responses → 'last-components' SearchComponent

[Diagram: collection flow with the enricher]
• Content Inbound: incoming documents (1..N) are split into individual document messages
• Semantic Validation (Channel Purger); invalid messages go to the Invalid Message channel
• Content Filter and Content Enricher
• SOLR Indexer (Channel Adapter) → Lucene/SOLR index (SolrJ)

[Diagram: Solr request handling with the enricher plugged in]
• Query (SolrQueryRequest) → RequestHandler (SearchHandler) → Response (XML) → QueryResponseWriter (XSLTResponseWriter, XSLT XML2HTML) → Result ((X)HTML)
• The SearchHandler runs its SearchComponents in three groups: "first-components", "components" (query, facet, mlt, highlight, stats, debug) and "last-components"

(architecture)

•Pipe-and-filter architecture
• Documents flow through series of actions
• Output from one action is input to another action
• Fields from the input document can be used in an action’s clauses: expressions
are filled in by replacing Velocity-style patterns with field values
•Conditional flows supported
•Reuse of flows & Subflows supported

(architecture)

[Diagram: example of a conditional flow]
• Document [[A1,A2],[B]] → Action (remove A2) → Document [[A1],[B]]
• If [B=3]
  • YES: Action (select C where ${B}) → Document [[A1],[B],[C1]]
  • NO: Action (select C where ${A}) → Document [[A1],[B],[C2]]
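The conditional flow above can be mimicked with a handful of functions. This is a minimal sketch of the pipe-and-filter idea, not the Luminis framework: documents are modelled as dictionaries of field name to value list, the helper names are invented, and a small dictionary (`LOOKUP`) stands in for the external source behind "select C where ...".

```python
import re

PATTERN = re.compile(r"\$\{(\w+)\}")


def fill(expression, doc):
    """Replace each ${field} with the first value of that field in the document."""
    return PATTERN.sub(lambda m: str(doc[m.group(1)][0]), expression)


def remove_value(field, value):
    """Action: drop one value from a field."""
    def action(doc):
        out = {k: list(v) for k, v in doc.items()}
        out[field] = [v for v in out.get(field, []) if v != value]
        return out
    return action


def add_from_lookup(field, expression, lookup):
    """Action: add a field whose value is looked up by the filled-in expression."""
    def action(doc):
        out = {k: list(v) for k, v in doc.items()}
        out[field] = [lookup[fill(expression, doc)]]
        return out
    return action


def when(predicate, then, otherwise):
    """Conditional: choose one of two actions per document."""
    return lambda doc: then(doc) if predicate(doc) else otherwise(doc)


def run(doc, actions):
    """Pipe-and-filter: the output of each action is the input of the next."""
    for action in actions:
        doc = action(doc)
    return doc


# Hypothetical knowledge model standing in for "select C where ...".
LOOKUP = {"key=3": "C1", "key=A1": "C2"}

flow = [
    remove_value("A", "A2"),
    when(lambda d: d["B"] == ["3"],
         add_from_lookup("C", "key=${B}", LOOKUP),
         add_from_lookup("C", "key=${A}", LOOKUP)),
]
result = run({"A": ["A1", "A2"], "B": ["3"]}, flow)
```

Because every action takes a document and returns a new one, flows and subflows can be reused simply by placing the same list of actions in several pipelines.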

(Configuration)

•Enricher flow and expression configuration via XML based DSL
• Conditional: if-then-else & switch-case-else (with regex support)
• Actions: Add & remove fields and field values using expressions
• Expression handlers currently supported:
• Field
• Function (execute methods via Java Reflection)
• HttpClient (retrieve content by URL described by field values)
• XSLT, XPath, XQuery (external XML databases)
• JDBC
• SPARQL (OpenRDF)
• Apache Lucene/Solr
• Apache Tika (Meta and Text extraction)
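The list of expression handlers suggests a dispatch on expression type. A rough sketch of such a registry follows; it is an assumption about the general shape, not the framework's API. The `method:field` syntax of the function handler is invented for the example, and Python's `getattr` plays the role that Java Reflection has in the real framework.

```python
HANDLERS = {}


def register(expression_type, handler):
    """Make a handler available under an expression type."""
    HANDLERS[expression_type] = handler


def unregister(expression_type):
    """Remove a handler; unknown types are ignored."""
    HANDLERS.pop(expression_type, None)


def evaluate(expression_type, expression, doc):
    """Dispatch an expression to the handler registered for its type."""
    try:
        handler = HANDLERS[expression_type]
    except KeyError:
        raise LookupError(f"no handler for expression type {expression_type!r}") from None
    return handler(expression, doc)


def field_handler(expression, doc):
    """'field': return the values of the named field."""
    return list(doc.get(expression, []))


def function_handler(expression, doc):
    """'function': call a method by name on each value (hypothetical 'method:field' syntax)."""
    method_name, field = expression.split(":", 1)
    return [getattr(value, method_name)() for value in doc.get(field, [])]


register("field", field_handler)
register("function", function_handler)
```

New handler types can be registered or removed while the flow keeps running, which is the property the OSGi service model provides in the actual platform.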

(Examples)
<enricher name="Field">
  <field name="a">AA1</field>
  <field name="b">BB1</field>
  <multivalue-field name="c">CC1</multivalue-field>
  <if test="field::c" pattern="CC2">
    <then>
      <field name="e">EE1</field>
    </then>
  </if>
  <if test="field::a">
    <then>
      <field name="f">FF1</field>
    </then>
  </if>
  <rename-field name="b">d</rename-field>
  <remove-field name="a"/>
</enricher>

(Examples)
<enricher name="XPath"
    xmlns:str="http://exslt.org/strings"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:html="http://www.w3.org/1999/xhtml">
  <field name="Description" expression-type="xpath">
    //html:meta[@name='DC.description']/@content
  </field>
  <multivalue-field name="Type" expression-type="xpath">
    //html:meta[@name='DC.type' and
      (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
       @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
       @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')
    ]/@content
  </multivalue-field>
  <field name="publisher" expression-type="xpath">
    fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
  </field>
  <field name="…" expression-type="xpath">
    fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
              //html:meta[@name='DC.creator']/@content)
  </field>
</enricher>

(Examples)
<enricher name="SPARQL">
  <field name="place">http://www.my.com/#channels</field>
  <field expression-type="sparql" repository="TESTRDF">
    <![CDATA[
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?definition
      WHERE {
        ?${place} skos:definition ?definition.
      }
    ]]>
  </field>
</enricher>

(Examples)
<enricher name="HttpAndTika">
  <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
  <field expression-type="http" name="content.file">field:content.url</field>
  <field name="auteur" source="field::content.file">xpath://H1</field>
  <multivalue-field expression-type="tika.meta" source="field::content.file"/>
  <field name="content" expression-type="tika.text" source="field::content.file"/>
  <switch test="field::content.url">
    <case pattern=".*.rijksweb.nl.*"><field name="source">Rijksweb</field></case>
    <case pattern=".*.deventer.nl.*"><field name="source">Gemeente Deventer</field></case>
    <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
    <else><field name="source">Overige</field></else>
  </switch>
</enricher>
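The switch-case-else construct with regex patterns can be read as "first matching case wins, otherwise the else branch". The Python sketch below reproduces that reading for the `source` field; whether the framework matches the whole value or only part of it is not stated in the slides, so full-match semantics are an assumption here.

```python
import re

# Patterns and labels copied from the switch example; order matters.
CASES = [
    (r".*.rijksweb.nl.*", "Rijksweb"),
    (r".*.deventer.nl.*", "Gemeente Deventer"),
    (r"file:.*", "Locale Harde Schijf"),
]
DEFAULT = "Overige"


def classify_source(url, cases=CASES, default=DEFAULT):
    """Return the label of the first case whose pattern matches the whole URL."""
    for pattern, label in cases:
        if re.fullmatch(pattern, url):
            return label
    return default
```

Keeping the cases in an ordered list makes the precedence explicit, just as the order of the case elements does in the XML.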

(Technology)

•Enricher and expression handlers are Java-based OSGi services:
• Hot pluggable and updatable
• Flow and expression configuration changes require no restart
• Extensible: new expression handlers are immediately available in actions after installing the OSGi bundle
•Runs in Apache Felix
• Collection Process: ServiceMix contains an OSGi container
• Publication Process: custom OSGi loader for Lucene/Solr
•Centralized & transactional provisioning (Apache Ace)
• Components & Configuration

Deployment Architecture
[Diagram: deployment view; nodes connected over HTTP, HTTP/ReST and JDBC]
• Firewall and HTTP Load Balancer
• Deployment Server: Felix OSGi with Apache Ace (provisioning of components and configuration)
• Master Collection Server: ServiceMix on Felix OSGi with the Luminis Enricher, Nutch and Tika; Apache Tomcat with Lucene/SOLR
• Slave Publication Servers (Slave1, Slave2): Apache Tomcat with Felix OSGi, Lucene/SOLR, the Luminis Enricher and Wicket
• Knowledge Models: SQL database (JDBC) and RDF triple store (OpenRDF)
• Configuration: SOLR solrconfig.xml and schema.xml, Luminis Enricher.xml, ServiceMix config.xml

Conclusions

•Enterprise Search Solution is not Google search

•Open Source paves the way; misses some ingredients
• Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel,
Wicket, MySQL, OpenRDF, Felix/Ace
• Missing ingredients: Enricher

•Interesting developments:
• Apache Chemistry (CMIS)
• Apache Clerezza
• Apache Nutch
• Apache Connectors Framework (ManifoldCF)

Questions & (answers?)

Marc Teutelink
marc.teutelink@luminis.eu

@mteutelink

MEAP December 2010 
