Date: 21 August 2010
Enterprise Search
EAI
Semantic Web
Open Source
Search & Retrieval
Platform
Marc Teutelink
How Apache open source software is used
during the implementation of an
Enterprise Search and Retrieval Platform
(Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)
Marc Teutelink
marc.teutelink@luminis.eu
@mteutelink
•Software architect at Luminis
•15+ years of experience in software development; specialized in
Enterprise Search, Enterprise Application Integration and
Semantic Web technology
•Currently writing “Enterprise Search in Action” for Manning
(Mid-2011)
Agenda
•Enterprise Search
• What is Enterprise Search: Functions and features
• Challenges
• Logical Architecture
•Enterprise Search Solution
• Technology Stack
• Collection Process
• Publication Process
• Enricher framework
• Deployment
•Conclusion
What is Enterprise Search?
“Enterprise Search offers a solution for searching,
finding and presenting enterprise related information
in the larger sense of the word”
Enterprise search is all about searching through documents of
any type and format, from any source, located anywhere, with the
utmost flexibility
• Web search: limited to public documents on the web
• Desktop search: limited to private documents on the local machine
• Enterprise search: no limitations on document type and location
Enterprise Search
(features)
•Information Sources and Types
• Wide range of sources: local and remote filesystems, content repositories,
e-mail, databases, internet, intranet and extranet
• Type not limited: any type ranging from structured to unstructured data, text
and binary formats and compound formats (zip)
•Usage
• Not limited to interactive use → automated business processes
•Security
• Integration with enterprise security infrastructure
•User Interaction and personalization
• Identity enables more personalized search results
Enterprise Search
(features)
•Extended metadata
• More metadata → better and more precise search results
• More control over schema (for example Dynamic Fields)
•Ranking
• More control over ranking: personalized ranking (group)
•Data extraction and derivation
• Extract data using various techniques: XPath, XQuery
• Derive data: using external knowledge models: RDBMS, RDF Store, Web Services
• Conditional extraction & derivation
•Managing and monitoring
• On-the-fly management (JMX)
• Real time monitoring
Enterprise Search
(features)
•User Interfaces
• Web search
• All about selling advertisements to the masses
• Generic & minimalistic screens; focus on ads
• Enterprise search
• All about finding: rich navigation; focus on quick find
• Small targeted audience
• Specialized and customized screens (use of ontologies, taxonomies
and classifications)
• Use of identity (results customized to user) and Web 2.0
• Grouping
• field collapsing, faceted search & clustering
Enterprise Search
(Challenges)
•Performance and scalability
•Rich functions and features
•Manageability
•Flexibility
•Easy maintenance
•Quick issue and problem solving
•Reduce total cost of ownership
Commercial Search Engines?
Apache Based (Open Source)
Search & Retrieval Platform
Enterprise Search
(Logical Architecture)
[Figure: logical architecture, two processes around the search engine]
•Collection Process (from Sources to Search Engine)
• Content Inbound: Pull (crawling), Pull (harvesting), Push (SOAP/REST)
• Content Validation: syntactic, semantic
• Content Enrichment: extraction, enhancement, filtering
• Indexing
•Publication Process (between Actors and Search Engine)
• Request Inbound: HTTP/Get (URL), HTTP/Post (SOAP/REST), API (Java, Perl, ...)
• Request Validation: syntactic, semantic
• Request Enrichment: redirection (suggestions), enhancement (add/remove clauses)
• Searching & Ordering: filtering, grouping, sorting
• Response Enrichment: redirection (more like this), enhancement (metadata, editorial)
• Response Outbound: stateless (XSLT, SolrJS), stateful (webapp framework)
Enterprise Search
(Collection Process)
Sources
• Any document format
• Any type
• Structured and unstructured
• Textual and binary
• Compound
• Residing Anywhere
• Security
Enterprise Search
(Collection Process)
Content Inbound
• Pull (Crawling/Spidering)
• Internet, intranet & extranet
• Local and remote filesystems
• Pull (Harvesting)
• Databases
• Content Repositories / Mgmt Systems
• Webservices inbound
• Push
• Webservices (SOAP/REST)
• Real time indexing
Enterprise Search
(Collection Process)
Content Validation
• Syntactic validation
• Based on DTD / XML-Schema
• Structure and limited content
• Semantic validation
• Based on algorithms:
• Groovy, XPath, Regex, …
• Think about exception handling
• Placed anywhere in flow
• During inbound: XML-Schema validation
• After Enrichment: Validate derived metadata
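The two kinds of validation can be sketched as follows. This is only an illustration, not the platform's implementation: Python's standard library has no DTD/XML-Schema validator, so the syntactic step is reduced to a well-formedness and required-element check, and the element names and the zipcode rule are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

REQUIRED = ("title", "zipcode")            # hypothetical required elements
ZIPCODE = re.compile(r"\d{4} ?[A-Z]{2}")   # hypothetical semantic rule

def syntactic_errors(xml_text):
    """Structure check: well-formed XML that contains the required elements."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    return [f"missing <{name}>" for name in REQUIRED if root.find(name) is None]

def semantic_errors(xml_text):
    """Content check: a value must satisfy a rule, here a regular expression."""
    root = ET.fromstring(xml_text)
    zipcode = root.findtext("zipcode", default="")
    return [] if ZIPCODE.fullmatch(zipcode) else [f"bad zipcode: {zipcode!r}"]

def validate(xml_text):
    """Run the syntactic step first; only valid structure reaches the semantic step."""
    errors = syntactic_errors(xml_text)
    return errors or semantic_errors(xml_text)
```

Returning a list of errors instead of raising keeps the exception handling in one place: a flow can route any document with a non-empty list to an invalid-message channel.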
Enterprise Search
(Collection Process)
Content Enrichment
• Extraction
• Metadata
• Content (free text of document)
• Enhancing
• Derive new and alter existing metadata
• Filtering
• Remove (parts of) metadata
• Leverage external knowledge models
• Conditional enrichment
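A minimal sketch of the three enrichment actions on a document represented as a dictionary. The field names and the lookup table that stands in for an external knowledge model are hypothetical sample data.

```python
# Hypothetical external knowledge model: zipcode prefix -> municipality.
KNOWLEDGE_MODEL = {"7411": "Deventer", "7311": "Apeldoorn"}

def extract(raw):
    """Extraction: pull metadata (title) and free text (content) out of the raw source."""
    title, _, body = raw.partition("\n")
    return {"title": title.strip(), "content": body.strip()}

def enhance(doc, zipcode):
    """Enhancing: derive new metadata from the knowledge model."""
    enriched = dict(doc, zipcode=zipcode)
    municipality = KNOWLEDGE_MODEL.get(zipcode[:4])
    if municipality is not None:           # conditional enrichment
        enriched["municipality"] = municipality
    return enriched

def filter_fields(doc, unwanted=("zipcode",)):
    """Filtering: remove metadata that should not reach the index."""
    return {key: value for key, value in doc.items() if key not in unwanted}

def enrich(raw, zipcode):
    """Apply extraction, enhancing and filtering in sequence."""
    return filter_fields(enhance(extract(raw), zipcode))
```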
Enterprise Search
(Collection Process)
Indexing
• Store in search engine(s)
• Content based routing
• Document boosting
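Content-based routing and document boosting can be illustrated without a real search engine. In this sketch the index names, the `source` and `type` fields and the boost values are all assumptions; a real indexer would pass the chosen target and boost on to the engine.

```python
def route(doc):
    """Content-based routing: choose the target index from a field value."""
    return "intranet" if doc.get("source") == "intranet" else "public"

def boost(doc):
    """Document boosting: a weight that makes some documents rank higher."""
    return 2.0 if doc.get("type") == "policy" else 1.0

def index_all(docs):
    """Collect (id, boost) pairs per target index instead of calling an engine."""
    indexes = {}
    for doc in docs:
        indexes.setdefault(route(doc), []).append((doc["id"], boost(doc)))
    return indexes
```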
Enterprise Search
(Publication Process)
Request Inbound
• HTTP/Get
• URL based with parameters
• Response in XML, JSON, …
• HTTP/Post
• XML (SOAP, REST) request
• XML (SOAP, REST) response
• API
• Java, Perl, …
• Wrappers on HTTP/Get
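A URL-based request is just a base address plus parameters, which is also why API wrappers on HTTP/Get are thin. The sketch below uses the Solr parameter names `q`, `rows` and `wt`; the host and path are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://localhost:8983/solr/select"   # hypothetical host and path

def search_url(query, rows=10, response_format="json"):
    """Build a URL-based search request; every option travels as a parameter."""
    params = {"q": query, "rows": rows, "wt": response_format}
    return f"{BASE_URL}?{urlencode(params)}"

def parameters(url):
    """Read the parameters back from a request URL."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
```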
Enterprise Search
(Publication Process)
Request Validation
• Syntactic Validation
• Correct Query syntax?
• Semantic Validation
• Correct Field Filters?
• Based on algorithms: Groovy, Regex
• Placed anywhere in flow
• @inbound: XML-Schema validation
• @enrichment: Validate derived request clauses
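As an illustration of the difference between the two checks, the sketch below validates a query string. It is deliberately simplified (it does not parse the full query syntax), and the field names and the zipcode rule are hypothetical.

```python
import re

KNOWN_FIELDS = {"title", "content", "zipcode"}             # hypothetical schema fields
FIELD_RULES = {"zipcode": re.compile(r"\d{4}[A-Z]{2}")}    # hypothetical field filter

def syntax_ok(query):
    """Syntactic validation: balanced parentheses and double quotes."""
    depth = 0
    for char in query:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and query.count('"') % 2 == 0

def field_filters_ok(query):
    """Semantic validation: each field:value clause uses a known field and a valid value."""
    for field, value in re.findall(r"(\w+):(\w+)", query):
        rule = FIELD_RULES.get(field)
        if field not in KNOWN_FIELDS or (rule and not rule.fullmatch(value)):
            return False
    return True
```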
Enterprise Search
(Publication Process)
Request Enrichment
• Redirection
• Spelling suggestions
• Metadata suggestions
• Enhancing
• Add/Remove clauses
• Stemming, Synonyms, stop words
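Adding and removing clauses can be sketched on a plain keyword query. The stop word list and synonym table are hypothetical, and a search engine would normally apply such rules in its own analysis chain rather than in application code.

```python
STOP_WORDS = {"the", "a", "of"}                              # hypothetical stop words
SYNONYMS = {"permit": ["licence"], "car": ["automobile"]}    # hypothetical synonyms

def enrich_query(query):
    """Remove stop-word clauses and add synonym clauses to a keyword query."""
    clauses = []
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue                                         # remove clause
        alternatives = [term] + SYNONYMS.get(term, [])
        if len(alternatives) > 1:                            # add clauses
            clauses.append("(" + " OR ".join(alternatives) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)
```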
Enterprise Search
(Publication Process)
Searching & Ordering
• Filtering
• Field Search
• Grouping
• Add group information
• Field collapsing, Faceted Search & Clustering
• Sorting
• Sort on Field
• Ranking
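To make the grouping and sorting terms concrete, here is a small in-memory sketch over a ranked result list. It only shows what each operation computes; a search engine does this inside the index, and the document fields are hypothetical.

```python
from collections import Counter

def facet_counts(docs, field):
    """Faceted search: count the matching documents per value of a field."""
    return Counter(doc[field] for doc in docs if field in doc)

def collapse(docs, field):
    """Field collapsing: keep only the first (best ranked) document per field value."""
    seen, collapsed = set(), []
    for doc in docs:
        if doc[field] not in seen:
            seen.add(doc[field])
            collapsed.append(doc)
    return collapsed

def sort_on_field(docs, field, descending=False):
    """Sorting: order results on a field value instead of on rank."""
    return sorted(docs, key=lambda doc: doc[field], reverse=descending)
```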
Enterprise Search
(Publication Process)
Response Enrichment
• Redirection
• Suggestions
• More like this
• Enhancing
• Add/Remove response fields
• Schema information
• Editorial information
Enterprise Search
(Publication Process)
Response Outbound
• Stateless
• No security
• XSLT, SolrJS
• Stateful
• Security
• Web 2.0
• Web Application Framework
Technology Stack
(Collection Process)
•Use ESB for the flow: Apache ServiceMix with Camel
• Leverage standard ESB components (Transformers, Validation, Splitter,
Filter, Routers, Scripting)
• Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE)
• Custom crawler: Apache Nutch
• Leverage only the crawl framework
• Extend NutchIndexWriter; asynchronously push crawled documents
back into the ESB flow (reply-to)
•ESB makes distributed flow possible → content-based routing
•Hot deploy → easy maintenance
•Reusing services across collection processes
•Search Engine independent
Collection Process Flow
[Figure: the collection process as a message flow]
•Content Inbound: Push Inbound (Message Endpoint) receives a Documents Message
•Content Validation: Syntactic Validation and Semantic Validation (both Channel Purgers); rejected messages go to an Invalid Message Channel
•Splitter: one Documents Message becomes N Document Messages
•Transformer (Message Translator)
•Content Enrichment: Enricher (Content Filter, Content Enricher)
•Content Indexer: HTTP Transport (Channel Adapter) delivers SOLR Document Messages to Lucene/SOLR (SolrJ) and its index
Technology Stack
(Publication Process)
•Use flow from Apache Lucene/Solr
• Leverage standard Solr components (synonyms, stopwords,
stemming, MLT, spelling, faceted search, …)
• Custom components: using Solr’s extensibility framework
• Security: authority field in schema with Apache Shiro integration
• Field filters (zipcode,…)
•User interfaces
• Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter
• Stateful: Apache Wicket with Spring
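One way an authority field could be used for security is as a filter query added to every request. This sketch assumes that each document lists the roles allowed to see it in the `authority` field and that the caller obtains the user's roles from the security layer (for example Apache Shiro); neither detail is shown in the slides. `fq` is Solr's filter query parameter.

```python
def secured_params(query, user_roles):
    """Restrict a search to documents whose 'authority' field holds one of the user's roles."""
    if not user_roles:
        raise ValueError("a user without roles may not search")
    allowed = " OR ".join(sorted(user_roles))
    return {"q": query, "fq": f"authority:({allowed})"}
```

Role names containing query syntax characters would need escaping before being placed in the filter.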
Enterprise Search
(Logical Architecture)
[Figure: the logical architecture annotated with the technology used for each part]
•Collection process flow: ServiceMix/Camel
•Crawling: Nutch
•Search engine and publication flow: Lucene/SOLR
•Response outbound: SolrJS/XSLT (stateless) and Apache Wicket (stateful)
Luminis Enricher Framework
•Custom Enricher Framework
• Existing ESB & SOLR enricher capabilities not sufficient.
• Enriching = one or more actions (extraction, enhancing &
filtering) performed on documents with fields
• Same enricher to be used for:
• Collection process:
• Documents → enriching, filtering & splitting
• Publication process:
• Search request → 'first-components' SearchComponent
• Search response → 'last-components' SearchComponent
[Figure: Solr request handling in the publication process]
•A Query (SOLRQueryRequest) enters the RequestHandler (SearchHandler)
•The handler has three component groups: "first-components", "components" and "last-components"
•Components (SearchComponent): query, facet, mlt, highlight, stats, debug
•The XML response is rendered by the XSLTResponseWriter (QueryResponseWriter) with an XML2HTML stylesheet (XSLT) into the (X)HTML result
Luminis Enricher Framework
(architecture)
•Pipe-and-filter architecture
• Documents flow through series of actions
• Output from one action is input to another action
• Fields from input document can be used in action’s clauses: values in
expressions filled by replacing velocity type patterns with field values
•Conditional flows supported
•Reuse of flows & Subflows supported
[Figure: example flow]
•Input: Document [[A1,A2],[B]]
•Action (remove A2) → Document [[A1],[B]]
•If [B=3], with a YES and a NO branch:
• Action (select C where ${B}) → Document [[A1],[B],[C1]]
• Action (select C where ${A}) → Document [[A1],[B],[C2]]
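The pipe-and-filter idea, including the replacement of Velocity-style `${field}` patterns and a conditional subflow, can be sketched in a few lines. This is not the Luminis implementation: the action names, the expressions and the choice of which branch is taken when B equals 3 are assumptions made for the example.

```python
import re

PLACEHOLDER = re.compile(r"\$\{(\w+)\}")

def fill(expression, doc):
    """Replace ${field} patterns in an expression with the document's field values."""
    return PLACEHOLDER.sub(lambda match: str(doc[match.group(1)]), expression)

def remove_field(name):
    """Action: drop a field; returns a new document."""
    return lambda doc: {key: value for key, value in doc.items() if key != name}

def add_field(name, expression):
    """Action: add a field whose value is an expression filled from the input document."""
    return lambda doc: {**doc, name: fill(expression, doc)}

def when(predicate, then_flow, else_flow):
    """Conditional flow: run one of two subflows."""
    return lambda doc: run(then_flow if predicate(doc) else else_flow, doc)

def run(flow, doc):
    """Pipe-and-filter: the output of each action is the input of the next."""
    for action in flow:
        doc = action(doc)
    return doc

flow = [
    remove_field("A2"),
    when(lambda doc: doc["B"] == 3,
         [add_field("C", "selected by B=${B}")],
         [add_field("C", "selected by A1=${A1}")]),
]
```

Because every action takes a document and returns a document, flows and subflows are plain lists that can be reused and nested.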
Luminis Enricher Framework
(Configuration)
•Enricher flow and expression configuration via XML based DSL
• Conditional: if-then-else & switch-case-else (with regex support)
• Actions: Add & remove fields and field values using expressions
• Expression handlers currently supported:
• Field
• Function (execute methods via Java Reflection)
• HttpClient (retrieve content by URL described by field values)
• XSLT, XPath, XQuery (external XML databases)
• JDBC
• SPARQL (OpenRDF)
• Apache Lucene/Solr
• Apache Tika (Meta and Text extraction)
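The list of expression handlers suggests a lookup from expression type to handler. A minimal sketch of such a dispatch, assuming a dictionary as registry; the handler names mirror the first two list entries, and the string-method call is only a Python stand-in for Java reflection.

```python
HANDLERS = {}   # expression type -> handler function

def handler(expression_type):
    """Decorator that registers a handler under an expression type."""
    def register(func):
        HANDLERS[expression_type] = func
        return func
    return register

@handler("field")
def field_handler(expression, doc):
    """Return the value of the field named by the expression."""
    return doc[expression]

@handler("function")
def function_handler(expression, doc):
    """Call a string method by name on a field value, e.g. 'title.upper'."""
    field, method = expression.split(".")
    return getattr(str(doc[field]), method)()

def evaluate(expression_type, expression, doc):
    """Dispatch an expression to the handler registered for its type."""
    if expression_type not in HANDLERS:
        raise ValueError(f"no handler for expression type {expression_type!r}")
    return HANDLERS[expression_type](expression, doc)
```

New handlers become usable as soon as they are registered, without changing `evaluate`.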
Luminis Enricher Framework
(Examples)
<enricher name="Field">
  <field name="a">AA1</field>
  <field name="b">BB1</field>
  <field name="b">BB2</field>
  <multivalue-field name="c">CC1</multivalue-field>
  <multivalue-field name="c">CC2</multivalue-field>
  <if test="field::c" pattern="CC2">
    <then>
      <field name="e">EE1</field>
    </then>
  </if>
  <if test="field::a">
    <then>
      <field name="f">FF1</field>
    </then>
  </if>
  <rename-field name="b">d</rename-field>
  <remove-field name="a"/>
</enricher>
<enricher name="XPath"
    xmlns:str="http://exslt.org/strings"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:html="http://www.w3.org/1999/xhtml">
  <field name="Description" expression-type="xpath">
    //html:meta[@name='DC.description']/@content
  </field>
  <multivalue-field name="Type" expression-type="xpath">
    //html:meta[@name='DC.type' and
      (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
       @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
       @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')
    ]/@content
  </multivalue-field>
  <field name="publisher" expression-type="xpath">
    fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
  </field>
  <field name="publisher" expression-type="xpath">
    fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
      //html:meta[@name='DC.creator']/@content)
  </field>
</enricher>
<enricher name="SPARQL">
  <field name="place">http://www.my.com/#channels</field>
  <field expression-type="sparql" repository="TESTRDF">
    <![CDATA[
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?definition
      WHERE {
        ?${place} skos:definition ?definition.
      }
    ]]>
  </field>
</enricher>
<enricher name="HttpAndTika">
  <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
  <field expression-type="http" name="content.file">field:content.url</field>
  <field name="auteur" source="field::content.file">xpath://H1</field>
  <multivalue-field expression-type="tika.meta" source="field::content.file"/>
  <field name="content" expression-type="tika.text" source="field::content.file"/>
  <switch test="field::content.url">
    <case pattern=".*.rijksweb.nl.*"><field name="source">Rijksweb</field></case>
    <case pattern=".*.deventer.nl.*"><field name="source">Gemeente Deventer</field></case>
    <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
    <else><field name="source">Overige</field></else>
  </switch>
</enricher>
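The switch in the HttpAndTika example picks a source label from the URL by regular expression. The sketch below reproduces that decision with the same patterns and values; that the first matching case wins and that a pattern must match the whole URL are assumptions about the DSL. Note that the dots in the patterns are unescaped, so they match any character.

```python
import re

# Patterns and values copied from the switch in the HttpAndTika example.
CASES = [
    (r".*.rijksweb.nl.*", "Rijksweb"),
    (r".*.deventer.nl.*", "Gemeente Deventer"),
    (r"file:.*", "Locale Harde Schijf"),
]
DEFAULT = "Overige"

def source_for(url):
    """Return the value of the first case whose pattern matches the whole URL."""
    for pattern, source in CASES:
        if re.fullmatch(pattern, url):
            return source
    return DEFAULT
```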
Luminis Enricher Framework
(Technology)
•Enricher and expression handlers are Java-based OSGi
services:
• Hot pluggable and updatable
• Flow and expression configuration changes → no restart
• Extensible: new expression handlers are immediately available in
actions after installing the OSGi bundle
•Runs in Apache Felix
• Collection Process: ServiceMix contains OSGi container
• Publication Process: Custom OSGi loader for Lucene/Solr
•Centralized & transactional provisioning (Apache Ace)
• Components & configuration
Deployment Architecture
[Figure: deployment architecture]
•Devices: Firewall, HTTP Load Balancer
•Slave Publication Servers (Slave1, Slave2), each an Apache Tomcat container with:
• Wicket (Apache), Lucene/SOLR (Apache), Enricher (Luminis), Felix OSGi (Apache)
• Config: SOLR::schema.xml, SOLR::solrconfig.xml, Luminis:Enricher.xml
•Master Collection Server, an Apache Tomcat container with:
• ServiceMix (Apache), Nutch (Apache), Tika (Apache), Lucene/SOLR (Apache), Enricher (Luminis), Felix OSGi (Apache)
• Config: servicemix::config.xml, SOLR::schema.xml, SOLR::solrconfig.xml, Luminis:Enricher.xml
•Knowledge Models: an RDF triple store (OpenRDF) and an SQL database
•Deployment Server: Ace (Apache) on Felix OSGi (Apache)
•Connections: HTTP, HTTP/ReST, JDBC and provisioning
Conclusions
•Enterprise Search Solution is not Google search
•Open Source paves the way; misses some ingredients
• Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel,
Wicket, MySQL, OpenRDF, Felix/Ace
• Missing ingredients: Enricher
•Interesting developments:
• Apache Chemistry (CMIS)
• Apache Clerezza
• Apache Nutch
• Apache Connectors Framework (ManifoldCF)
Questions & (answers?)
Marc Teutelink
marc.teutelink@luminis.eu
@mteutelink
MEAP December 2010 

More Related Content

Viewers also liked

ProjectHub
ProjectHubProjectHub
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
sebastian_nagel
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
Julien Nioche
 
Search engine
Search engineSearch engine
Search engine
Alisha Korpal
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
Alfresco Software
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
Julien Nioche
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Andy Jackson
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
Tommaso Teofili
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
Lucidworks (Archived)
 
Apache tika
Apache tikaApache tika
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
Jukka Zitting
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
David Gil Sánchez
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
Paolo Mottadelli
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
Rahul Singh
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
Iván Campaña Naranjo
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
Toni de la Fuente
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
Alfresco Software
 
Introducción a Solr
Introducción a SolrIntroducción a Solr
Introducción a Solr
Jorge Luis Betancourt Gonzalez
 
Conferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo SolrConferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo Solr
Jorge Luis Betancourt Gonzalez
 

Viewers also liked (20)

ProjectHub
ProjectHubProjectHub
ProjectHub
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Apache tika
Apache tikaApache tika
Apache tika
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Introducción a Solr
Introducción a SolrIntroducción a Solr
Introducción a Solr
 
Conferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo SolrConferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo Solr
 

Open source enterprise search and retrieval platform

  • 1. Date: 21 August 2010. Enterprise Search, EAI, Semantic Web. Open Source Search & Retrieval Platform. Marc Teutelink
  • 2. How Apache open source software is used during the implementation of an Enterprise Search and Retrieval Platform (Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)
  • 3. Marc Teutelink marc.teutelink@luminis.eu @mteutelink •Software architect at Luminis •15+ years of experience in software development; specialized in Enterprise Search, Enterprise Application Integration and Semantic Web technology •Currently writing “Enterprise Search in Action” for Manning (Mid-2011)
  • 4. Agenda •Enterprise Search • What is Enterprise Search: Functions and features • Challenges • Logical Architecture •Enterprise Search Solution • Technology Stack • Collection Process • Publication Process • Enricher framework • Deployment •Conclusion
  • 5. What is Enterprise Search? “Enterprise Search offers a solution for searching, finding and presenting enterprise related information in the larger sense of the word” Enterprise search is all about searching through documents of any type and format, from any source, located anywhere, with the utmost flexibility • Web search: limited to public documents on the web • Desktop search: limited to private documents on the local machine • Enterprise search: no limitations on document type and location
  • 6. Enterprise Search (features) •Information Sources and Types • Wide range of sources: local and remote filesystems, content repositories, e-mail, databases, internet, intranet and extranet • Type not limited: any type ranging from structured to unstructured data, text and binary formats and compound formats (zip) •Usage • Not limited to interactive use → automated business processes •Security • Integration with enterprise security infrastructure •User Interaction and personalization • Identity enables more personalized search results
  • 7. Enterprise Search (features) •Extended metadata • More metadata → better and more precise search results • More control over schema (for example Dynamic Fields) •Ranking • More control over ranking: personalized ranking (group) •Data extraction and derivation • Extract data using various techniques: XPath, XQuery • Derive data using external knowledge models: RDBMS, RDF Store, Web Services • Conditional extraction & derivation •Managing and monitoring • On-the-fly management (JMX) • Real time monitoring
  • 8. Enterprise Search (features) •User Interfaces • Web search • All about selling advertisements to the masses • Generalistic & minimalistic screens; focus on ads • Enterprise search • All about finding: rich navigation; focus on quick find • Small targeted audience • Specialized and customized screens (use of ontologies, taxonomies and classifications) • Use of identity (results customized to user) and web 2.0 • Grouping • Field collapsing, faceted search & clustering
  • 9. Enterprise Search (Challenges) •Performance and scalability •Rich functions and features •Manageability •Flexibility •Easy maintenance •Quick issue and problem solving •Reduce total cost of ownership
  • 10. Enterprise Search (Challenges, list repeated from slide 9) Commercial Search Engines?
  • 11. Enterprise Search (Challenges, list repeated from slide 9) Apache Based (Open Source) Search & Retrieval Platform
  • 12. Enterprise Search (Logical Architecture) [Diagram] Collection Process, between Sources and Search Engine: Content Inbound (Pull (Crawling), Pull (Harvesting), Push (SOAP/REST)); Content Validation (Syntactic, Semantic); Content Enrichment (Extraction, Enhancement, Filtering); Indexing. Publication Process, between Actors and Search Engine: Request Inbound (HTTP/Get (URL), HTTP/Post (SOAP/REST), API (Java, Perl, ...)); Request Validation (Syntactic, Semantic); Request Enrichment (Redirection (Suggestions), Enhancement (add/remove clauses)); Searching & Ordering (Filtering, Grouping, Sorting); Response Enrichment (Redirection (more like this), Enhancement (metadata, editorial)); Response Outbound (Stateless (XSLT, SolrJS), Stateful (Webapp Framework))
  • 13. Enterprise Search (Collection Process) Sources • Any document format • Any type • Structured and unstructured • Textual and binary • Compound • Residing anywhere • Security [Diagram: Collection Process, see slide 12]
  • 14. Enterprise Search (Collection Process) Content Inbound • Pull (Crawling/Spidering) • Internet, intranet & extranet • Local and remote filesystems • Pull (Harvesting) • Databases • Content Repositories / Mgmt Systems • Webservices inbound • Push • Webservices (SOAP/REST) • Real time indexing [Diagram: Collection Process, see slide 12]
  • 15. Enterprise Search (Collection Process) Content Validation • Syntactic validation • Based on DTD / XML-Schema • Structure and limited content • Semantic validation • Based on algorithms: • Groovy, XPath, Regex, … • Think about exception handling • Placed anywhere in flow • During inbound: XML-Schema validation • After enrichment: validate derived metadata [Diagram: Collection Process, see slide 12]
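The two validation levels on slide 15 can be illustrated with a small sketch. It is not the platform's implementation: the well-formedness check only stands in for DTD / XML-Schema validation, and the `zipcode` rule, the function names and the "invalid" list (playing the role of an invalid message channel) are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semantic rule: a zipcode looks like "7411 AB" or "7411AB".
ZIPCODE = re.compile(r"^\d{4} ?[A-Z]{2}$")


def syntactic_ok(raw_xml):
    """Syntactic check: well-formedness only, standing in for schema validation."""
    try:
        ET.fromstring(raw_xml)
        return True
    except ET.ParseError:
        return False


def semantic_ok(raw_xml):
    """Semantic check: an algorithmic (regex) rule on the content of one field."""
    zipcode = ET.fromstring(raw_xml).findtext("zipcode")
    return zipcode is not None and ZIPCODE.match(zipcode) is not None


def validate(messages):
    """Split messages into valid ones and ones destined for an invalid channel."""
    valid, invalid = [], []
    for raw in messages:
        if syntactic_ok(raw) and semantic_ok(raw):
            valid.append(raw)
        else:
            invalid.append(raw)
    return valid, invalid
```

Keeping the rejected messages instead of dropping them reflects the "think about exception handling" bullet: someone has to be able to inspect what was refused.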
  • 16. Enterprise Search (Collection Process) Content Enrichment • Extraction • Metadata • Content (free text of document) • Enhancing • Derive new and alter existing metadata • Filtering • Remove (parts of) metadata • Leverage external knowledge models • Conditional enrichment [Diagram: Collection Process, see slide 12]
  • 17. Enterprise Search (Collection Process) Indexing • Store in search engine(s) • Content based routing • Document boosting [Diagram: Collection Process, see slide 12]
  • 18. Enterprise Search (Publication Process) Request Inbound • HTTP/Get • URL based with parameters • Response in XML, JSON, … • HTTP/Post • XML (SOAP, REST) request • XML (SOAP, REST) response • API • Java, Perl, … • Wrappers on HTTP/Get [Diagram: Publication Process, see slide 12]
  • 19. Enterprise Search (Publication Process) Request Validation • Syntactic Validation • Correct Query syntax? • Semantic Validation • Correct Field Filters? • Based on algorithms: Groovy, Regex • Placed anywhere in flow • @inbound: XML-Schema validation • @enrichment: Validate derived request clauses [Diagram: Publication Process, see slide 12]
  • 20. Enterprise Search (Publication Process) Request Enrichment • Redirection • Spelling suggestions • Metadata suggestions • Enhancing • Add/Remove clauses • Stemming, Synonyms, stop words [Diagram: Publication Process, see slide 12]
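As a rough illustration of the "add/remove clauses" idea on slide 20, the sketch below drops stop words and expands synonyms in a query string. In Solr this kind of work is normally configured in the analysis chain rather than hand-written; the word lists and the function name here are purely hypothetical.

```python
# Hypothetical word lists; a real deployment would load these from configuration.
STOP_WORDS = {"the", "a", "of"}
SYNONYMS = {"car": ["automobile"]}


def enrich_query(query):
    """Remove stop-word clauses and add OR-clauses for known synonyms."""
    clauses = []
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue  # remove clause
        alternatives = [term] + SYNONYMS.get(term, [])
        if len(alternatives) > 1:
            clauses.append("(" + " OR ".join(alternatives) + ")")  # add clauses
        else:
            clauses.append(term)
    return " ".join(clauses)
```

For example, `enrich_query("The price of a car")` yields `price (car OR automobile)`.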
  • 21. Enterprise Search (Publication Process) Searching & Ordering • Filtering • Field Search • Grouping • Add group information • Field collapsing, Faceted Search & Clustering • Sorting • Sort on Field • Ranking [Diagram: Publication Process, see slide 12]
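Filtering, grouping and sorting from slide 21 map onto standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`, `sort`). The sketch below only assembles such an HTTP/Get URL; the base URL, the field names and the helper name are assumptions, and filter values containing spaces or special characters would need additional quoting.

```python
from urllib.parse import urlencode


def build_select_url(base_url, text, field_filters, facet_fields, sort=None):
    """Assemble a Solr /select URL with field filters, facets and a sort order."""
    params = [("q", text)]
    # Filtering: one fq (filter query) per field filter.
    params += [("fq", f"{name}:{value}") for name, value in field_filters.items()]
    # Grouping: ask Solr for facet counts on the given fields.
    if facet_fields:
        params.append(("facet", "true"))
        params += [("facet.field", name) for name in facet_fields]
    # Sorting: for example "date desc".
    if sort:
        params.append(("sort", sort))
    return base_url + "/select?" + urlencode(params)
```

Usage, with made-up field names: `build_select_url("http://localhost:8983/solr", "search", {"source": "Rijksweb"}, ["type"], "date desc")`.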
  • 22. Enterprise Search (Publication Process) Response Enrichment • Redirection • Suggestions • More like this • Enhancing • Add/Remove response fields • Schema information • Editorial information [Diagram: Publication Process, see slide 12]
  • 23. Enterprise Search (Publication Process) Response outbound • Stateless • No security • XSLT, SolrJS • Stateful • Security • Web 2.0 • Web Application Framework [Diagram: Publication Process, see slide 12]
  • 24. Technology Stack (Collection Process) •Use ESB for the flow: Apache ServiceMix with Camel • Leverage standard ESB components (Transformers, Validation, Splitter, Filter, Routers, Scripting) • Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE) • Custom: Crawler Apache Nutch • Leverage only crawl framework • Extend NutchIndexWriter; asynchronously pushing crawled documents back into ESB flow (reply-to) •ESB makes distributed flow possible •Content based routing •Hot deploy → easy maintenance •Reusing services across collection processes •Search Engine independent
  • 25. Collection Process Flow [Diagram: collection flow drawn with enterprise integration patterns, grouped into Content Inbound, Content Validation, Content Enrichment and Content Indexer. Components shown: Push Inbound (Message Endpoint); Syntactic Validation (Channel Purger); Splitter (one Documents Message into N Document Messages); Semantic Validation (Channel Purger); Invalid Message Channels; Content Enricher; Content Filter; Transformer (Message Translator) producing a SOLR Document Message; HTTP Transport (Channel Adapter) to the Lucene/Solr index via SOLRJ]
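The flow on slide 25 can be approximated as a chain of small functions. This is a sketch only: the platform itself uses ServiceMix/Camel components for these steps, the order of the steps is simplified, and the concrete rules (a required `id`, a derived `length` field, a dropped `internal` field) are invented for the example.

```python
def split(batch):
    """Splitter: one documents message becomes N document messages."""
    return list(batch["documents"])


def is_valid(doc):
    # Hypothetical semantic rule: every document needs a non-empty id.
    return bool(doc.get("id"))


def enrich(doc):
    # Hypothetical enrichment: derive a 'length' field from the content.
    enriched = dict(doc)
    enriched["length"] = len(doc.get("content", ""))
    return enriched


def content_filter(doc):
    # Hypothetical filter: remove an internal field before indexing.
    return {name: value for name, value in doc.items() if name != "internal"}


def run_collection_flow(batch, index, invalid_channel):
    """Route each document either to the index or to the invalid message channel."""
    for doc in split(batch):
        if not is_valid(doc):
            invalid_channel.append(doc)
            continue
        index.append(content_filter(enrich(doc)))
```

Here `index` and `invalid_channel` are plain lists standing in for the search engine and the invalid message channel.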
  • 26. Technology Stack (Publication Process) •Use flow from Apache Lucene/Solr • Leverage standard Solr components (synonyms, stopwords, stemming, MLT, spelling, faceted search, …) • Custom components: using Solr’s extensibility framework • Security: authority field in schema with Apache Shiro integration • Field filters (zipcode,…) •User interfaces • Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter • Stateful: Apache Wicket with Spring
  • 27. Enterprise Search (Logical Architecture) [Diagram repeated from slide 12]
  • 28. Enterprise Search (Logical Architecture) [Diagram repeated from slide 12, annotated with the technologies used: Lucene/SOLR, ServiceMix/Camel, Nutch, Apache Wicket, SolrJS/XSLT]
  • 29. Enterprise Search (Logical Architecture) [Diagram repeated from slide 12, annotated with the Luminis Enricher Framework]
  • 30. Luminis Enricher Framework •Custom Enricher Framework • Existing ESB & SOLR enricher capabilities not sufficient. • Enriching = one or more actions (extraction, enhancing & filtering) performed on documents with fields • Same enricher to be used for: • Collection process: • Documents → enriching, filtering & splitting • Publication process: • Search requests → 'first-components' SearchComponent • Search responses → 'last-components' SearchComponent
  • 31. Luminis Enricher Framework (text repeated from slide 30) [Diagram: collection process flow as on slide 25, here ending in a SOLR Indexer (Channel Adapter) that writes to the Lucene/Solr index via SOLRJ]
  • 32. Luminis Enricher Framework (text repeated from slide 30) [Diagram: collection process flow as on slide 31, plus Solr request handling: a Query (SOLRQueryRequest) enters the RequestHandler (SearchHandler), which runs "first-components", "components" (the SearchComponents query, facet, mlt, highlight, stats, debug) and "last-components"; the XML Response is rendered by the XSLTResponseWriter (QueryResponseWriter) using an XML2HTML XSLT into the (X)HTML result]
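How an enricher can sit in front of and behind Solr's standard search components can be modelled in a few lines. This is a simplified model, not Solr code: the component names `requestEnricher` and `responseEnricher` are hypothetical, and the sketch ignores any special placement Solr may apply to individual components such as debug.

```python
# Standard components as listed on the slide.
DEFAULT_COMPONENTS = ["query", "facet", "mlt", "highlight", "stats", "debug"]


def component_chain(first_components=(), last_components=()):
    """first-components run before, last-components after the standard ones."""
    return list(first_components) + DEFAULT_COMPONENTS + list(last_components)


def handle(request, components, chain):
    """Run every component in chain order against a shared response object."""
    response = {}
    for name in chain:
        components[name](request, response)
    return response
```

With `component_chain(["requestEnricher"], ["responseEnricher"])` the request enricher sees the query before any standard component does, and the response enricher sees the response after all of them have contributed.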
  • 33. Luminis Enricher Framework (architecture) •Pipe-and-filter architecture • Documents flow through series of actions • Output from one action is input to another action • Fields from input document can be used in action’s clauses: values in expressions filled by replacing Velocity-style patterns with field values •Conditional flows supported •Reuse of flows & subflows supported
  • 34. Luminis Enricher Framework (architecture) (text repeated from slide 33) [Diagram: example flow. Document [[A1,A2],[B]] passes Action (remove A2) and becomes Document [[A1],[B]]; a condition If [B=3] then splits the flow into a YES and a NO branch, which run Action (select C where ${B}) or Action (select C where ${A}) and yield Document [[A1],[B],[C1]] or Document [[A1],[B],[C2]]]
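The pipe-and-filter idea with pattern substitution can be sketched as follows. Several things are assumptions rather than facts from the slides: that the YES branch uses `${B}` and produces C1 while the NO branch uses `${A}` and produces C2, that a pattern is replaced by the first value of the field, and that "select C where ..." is a lookup in some external model (here a plain dictionary).

```python
from string import Template


def expand(expression, doc):
    """Replace ${field} patterns with the first value of that field."""
    first_values = {name: values[0] for name, values in doc.items() if values}
    return Template(expression).substitute(first_values)


def remove_value(doc, field, value):
    """Action: remove one value from a field, returning a new document."""
    out = {name: list(values) for name, values in doc.items()}
    out[field] = [v for v in out[field] if v != value]
    return out


def select_c(doc, where, lookup):
    """Action: derive field C by looking up the expanded where-clause."""
    out = {name: list(values) for name, values in doc.items()}
    out["C"] = [lookup[expand(where, doc)]]
    return out


def flow(doc, lookup):
    """Output of one action is the input of the next; the branch is conditional."""
    doc = remove_value(doc, "A", "A2")
    if doc["B"] == ["3"]:
        return select_c(doc, "${B}", lookup)
    return select_c(doc, "${A}", lookup)
```

Each action returns a new document instead of mutating its input, which keeps the actions independent and easy to reuse in other flows.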
  • 35. Luminis Enricher Framework (Configuration) •Enricher flow and expression configuration via XML based DSL • Conditional: if-then-else & switch-case-else (with regex support) • Actions: Add & remove fields and field values using expressions • Expression handlers currently supported: • Field • Function (execute methods via Java Reflection) • HttpClient (retrieve content by URL described by field values) • XSLT, XPath, XQuery (external XML databases) • JDBC • SPARQL (OpenRDF) • Apache Lucene/Solr • Apache Tika (Meta and Text extraction)
  • 36. Luminis Enricher Framework (Examples)
    <enricher name="Field">
      <field name="a">AA1</field>
      <field name="b">BB1</field>
      <field name="b">BB2</field>
      <multivalue-field name="c">CC1</multivalue-field>
      <multivalue-field name="c">CC2</multivalue-field>
      <if test="field::c" pattern="CC2">
        <then>
          <field name="e">EE1</field>
        </then>
      </if>
      <if test="field::a">
        <then>
          <field name="f">FF1</field>
        </then>
      </if>
      <rename-field name="b">d</rename-field>
      <remove-field name="a"/>
    </enricher>
  • 37. Luminis Enricher Framework (Examples)
    <enricher name="XPath"
        xmlns:str="http://exslt.org/strings"
        xmlns:fn="http://www.w3.org/2005/xpath-functions"
        xmlns:html="http://www.w3.org/1999/xhtml">
      <field name="Description" expression-type="xpath">
        //html:meta[@name='DC.description']/@content
      </field>
      <multivalue-field name="Type" expression-type="xpath">
        //html:meta[@name='DC.type' and
          (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
           @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
           @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')]/@content
      </multivalue-field>
      <field name="publisher" expression-type="xpath">
        fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
      </field>
      <field name="publisher" expression-type="xpath">
        fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
                  //html:meta[@name='DC.creator']/@content)
      </field>
    </enricher>
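What the first XPath expression in this example extracts can be tried out with Python's standard library. This is only an approximation: `xml.etree.ElementTree` supports a limited XPath subset, so the sketch selects the `meta` elements and reads their `content` attribute in Python, and it cannot evaluate functions such as `fn:concat`. The sample page and the helper name are made up.

```python
import xml.etree.ElementTree as ET

# Same namespace prefix as declared on the enricher element.
NS = {"html": "http://www.w3.org/1999/xhtml"}


def meta_content(xhtml, name):
    """Return the content attribute of every html:meta element with this name."""
    root = ET.fromstring(xhtml)
    metas = root.findall(f".//html:meta[@name='{name}']", NS)
    return [meta.get("content") for meta in metas]
```

Called with an XHTML page and `"DC.description"`, it returns the same values that `//html:meta[@name='DC.description']/@content` selects.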
  • 38. Luminis Enricher Framework (Examples)
    <enricher name="SPARQL">
      <field name="place">http://www.my.com/#channels</field>
      <field expression-type="sparql" repository="TESTRDF">
        <![CDATA[
          PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
          SELECT ?definition
          WHERE { ?${place} skos:definition ?definition. }
        ]]>
      </field>
    </enricher>
  • 39. Luminis Enricher Framework (Examples)
    <enricher name="HttpAndTika">
      <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
      <field expression-type="http" name="content.file">field:content.url</field>
      <!-- "auteur" is Dutch for "author" -->
      <field name="auteur" source="field::content.file">xpath://H1</field>
      <multivalue-field expression-type="tika.meta" source="field::content.file"/>
      <field name="content" expression-type="tika.text" source="field::content.file"/>
      <switch test="field::content.url">
        <case pattern=".*.rijksweb.nl.*"><field name="source">Rijksweb</field></case>
        <case pattern=".*.deventer.nl.*"><field name="source">Gemeente Deventer</field></case>
        <!-- "Locale Harde Schijf" = local hard disk; "Overige" = other -->
        <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
        <else><field name="source">Overige</field></else>
      </switch>
    </enricher>
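    The <switch> in the "HttpAndTika" example assigns a source label based on the document URL. A minimal Python sketch of that first-match-wins logic follows, assuming each pattern is a regular expression that must match the whole URL (as Java's String.matches would); the slide does not state the matching rule. The name classify_source is hypothetical, and the patterns and labels are copied from the example.

    ```python
    import re

    # (pattern, label) pairs in the order the <case> elements appear.
    CASES = [
        (r".*.rijksweb.nl.*", "Rijksweb"),
        (r".*.deventer.nl.*", "Gemeente Deventer"),
        (r"file:.*", "Locale Harde Schijf"),
    ]
    DEFAULT_LABEL = "Overige"  # the <else> branch


    def classify_source(url):
        """Return the label of the first case whose pattern matches the URL."""
        for pattern, label in CASES:
            # Assumption: full-match semantics, first matching case wins.
            if re.fullmatch(pattern, url):
                return label
        return DEFAULT_LABEL


    if __name__ == "__main__":
        for url in ("http://www.rijksweb.nl/nieuws",
                    "file:/data/report.pdf",
                    "http://example.org/"):
            print(url, "->", classify_source(url))
    ```

    Note that the dots in the original patterns are unescaped, so each one matches any character; writing \. would restrict the match to literal dots in the host name.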
  • 40. Luminis Enricher Framework (Technology)
    •Enricher and expression handlers are Java-based OSGi services:
      • Hot-pluggable and updatable
      • Flow and expression configuration changes require no restart
      • Extensible: new expression handlers are immediately available in actions after installing the OSGi bundle
    •Runs in Apache Felix
      • Collection Process: ServiceMix contains an OSGi container
      • Publication Process: custom OSGi loader for Lucene/Solr
    •Centralized & transactional provisioning (Apache Ace): components & configuration
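    The "immediately available after installing the bundle" behaviour follows from looking handlers up in a service registry at evaluation time rather than wiring them at startup. The Python sketch below illustrates only that pattern; it is not OSGi and not the Luminis API. The class ExpressionHandlerRegistry, its methods, and the "literal" and "upper" handlers are hypothetical.

    ```python
    class ExpressionHandlerRegistry:
        """Maps an expression type to a handler, resolved on every evaluation."""

        def __init__(self):
            self._handlers = {}

        def register(self, expression_type, handler):
            # Stands in for a bundle being installed and publishing a service.
            self._handlers[expression_type] = handler

        def unregister(self, expression_type):
            # Stands in for a bundle being stopped or uninstalled.
            self._handlers.pop(expression_type, None)

        def evaluate(self, expression_type, expression, fields):
            # Late lookup: handlers registered after startup are found here.
            handler = self._handlers.get(expression_type)
            if handler is None:
                raise LookupError(f"no handler for {expression_type!r}")
            return handler(expression, fields)


    if __name__ == "__main__":
        registry = ExpressionHandlerRegistry()
        registry.register("literal", lambda expr, fields: expr)
        print(registry.evaluate("literal", "AA1", {}))

        # A handler added later is usable without restarting anything.
        registry.register("upper", lambda expr, fields: fields[expr].upper())
        print(registry.evaluate("upper", "a", {"a": "aa1"}))
    ```

    Because the lookup happens per call, replacing or removing a handler also takes effect on the next evaluation, which is the property the slide attributes to hot-pluggable OSGi services.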
  • 41. Deployment Architecture (deployment diagram; elements listed as labelled on the slide)
    • <<device>> Master Collection Server
      • Components: <<Container>> Apache Tomcat, Enricher (Luminis), Nutch (Apache), ServiceMix (Apache), Tika (Apache), Lucene/SOLR (Apache), Felix OSGi (Apache)
      • <<config>> SOLR::solrconfig.xml, Luminis:Enricher.xml, SOLR::schema.xml, servicemix::config.xml
    • <<device>> Slave Publication Server (Slave1) and <<device>> Slave Publication Server (Slave2), each with:
      • Components: <<Container>> Apache Tomcat, Enricher (Luminis), Lucene/SOLR (Apache), Wicket (Apache), Felix OSGi (Apache)
      • <<config>> SOLR::schema.xml, Luminis:Enricher.xml, SOLR::solrconfig.xml
    • <<device>> Deployment Server: Ace (Apache), Felix OSGi (Apache)
    • <<device>> Firewall
    • <<device>> HTTP Load Balancer
    • Data stores: OpenRDF <<Data Container>> SQL <<Database>> Knowledge Models <<RDFTripleStore>> Knowledge Models
    • Connector labels: <<HTTP>>, <<HTTP/ReST>>, <<JDBC>>, <<PROVISIONING>>
  • 42. Conclusions
    •Enterprise Search Solution is not Google search
    •Open Source paves the way; misses some ingredients
      • Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel, Wicket, MySQL, OpenRDF, Felix/Ace
      • Missing ingredients: Enricher
    •Interesting developments:
      • Apache Chemistry (CMIS)
      • Apache Clerezza
      • Apache Nutch
      • Apache Connectors Framework (ManifoldCF)
  • 43. Questions & (answers?) Marc Teutelink marc.teutelink@luminis.eu @mteutelink MEAP December 2010 