Open source enterprise search and retrieval platform

  • Speaker notes, slide 6: content repository vs. content management systems. Security: mention LDAP. Identity: you have to be authorized.
  • Speaker notes, slides 13-24: Security: logging in on the source.
  • Open source enterprise search and retrieval platform

    1. Open Source Search & Retrieval Platform
       Enterprise Search, EAI, Semantic Web
       Marc Teutelink
       Date: 21 August 2010
    2. How Apache open source software is used during the implementation of an Enterprise Search and Retrieval Platform (Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)
    3. Marc Teutelink, marc.teutelink@luminis.eu, @mteutelink
       • Software architect at Luminis
       • 15+ years of experience in software development; specialized in Enterprise Search, Enterprise Application Integration and Semantic Web technology
       • Currently writing "Enterprise Search in Action" for Manning (mid-2011)
    4. Agenda
       • Enterprise Search
         • What is Enterprise Search: functions and features
         • Challenges
         • Logical architecture
       • Enterprise Search Solution
         • Technology stack
         • Collection process
         • Publication process
         • Enricher framework
         • Deployment
       • Conclusion
    5. What is Enterprise Search?
       "Enterprise Search offers a solution for searching, finding and presenting enterprise-related information in the larger sense of the word"
       Enterprise search is all about searching through documents of any type and format, from any source, located anywhere, with the utmost flexibility.
       • Web search: limited to public documents on the web
       • Desktop search: limited to private documents on the local machine
       • Enterprise search: no limitations on document type and location
    6. Enterprise Search (features)
       • Information sources and types
         • Wide range of sources: local and remote filesystems, content repositories, e-mail, databases, internet, intranet and extranet
         • Type not limited: any type ranging from structured to unstructured data, text and binary formats, and compound formats (zip)
       • Usage
         • Not limited to interactive use: also automated business processes
       • Security
         • Integration with the enterprise security infrastructure
       • User interaction and personalization
         • Identity enables more personalized search results
    7. Enterprise Search (features)
       • Extended metadata
         • More metadata means better and more precise search results
         • More control over schema (for example Dynamic Fields)
       • Ranking
         • More control over ranking: personalized ranking (group)
       • Data extraction and derivation
         • Extract data using various techniques: XPath, XQuery
         • Derive data using external knowledge models: RDBMS, RDF store, web services
         • Conditional extraction and derivation
       • Managing and monitoring
         • On-the-fly management (JMX)
         • Real-time monitoring
    8. Enterprise Search (features)
       • User interfaces
         • Web search
           • All about selling advertisements to the masses
           • Generic and minimalistic screens; focus on ads
         • Enterprise search
           • All about finding: rich navigation; focus on quick find
           • Small targeted audience
           • Specialized and customized screens (use of ontologies, taxonomies and classifications)
           • Use of identity (results customized to user) and web 2.0
           • Grouping: field collapsing, faceted search & clustering
    9. Enterprise Search (Challenges)
       • Performance and scalability
       • Rich functions and features
       • Manageability
       • Flexibility
       • Easy maintenance
       • Quick issue and problem solving
       • Reduce total cost of ownership
    10. Enterprise Search (Challenges), same list, with the question: Commercial search engines?
    11. Enterprise Search (Challenges), same list, with the answer: Apache-based (open source) Search & Retrieval Platform
    12. Enterprise Search (Logical Architecture)
        [Figure: two parallel flows that meet at the search engine]
        • Collection process: Sources → Content Inbound (pull by crawling, pull by harvesting, push via SOAP/ReST) → Content Validation (syntactic, semantic) → Content Enrichment (extraction, enhancement, filtering) → Indexing → Search Engine
        • Publication process: Actors → Request Inbound (HTTP/Get by URL, HTTP/Post via SOAP/ReST, API for Java, Perl, ...) → Request Validation (syntactic, semantic) → Request Enrichment (redirection: suggestions; enhancement: add/remove clauses) → Searching & Ordering (filtering, grouping, sorting) → Response Enrichment (redirection: more like this; enhancement: metadata, editorial) → Response Outbound (stateless: XSLT, SolrJS; stateful: webapp framework)
    13. Enterprise Search (Collection Process): Sources
        • Any document format
        • Any type
          • Structured and unstructured
          • Textual and binary
          • Compound
        • Residing anywhere
        • Security
    14. Enterprise Search (Collection Process): Content Inbound
        • Pull (crawling/spidering)
          • Internet, intranet & extranet
          • Local and remote filesystems
        • Pull (harvesting)
          • Databases
          • Content repositories / management systems
          • Webservices inbound
        • Push
          • Webservices (SOAP/REST)
          • Real-time indexing
    15. Enterprise Search (Collection Process): Content Validation
        • Syntactic validation
          • Based on DTD / XML Schema
          • Structure and limited content
        • Semantic validation
          • Based on algorithms: Groovy, XPath, Regex, …
        • Think about exception handling
        • Placed anywhere in the flow
          • During inbound: XML Schema validation
          • After enrichment: validate derived metadata
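The split between syntactic and semantic validation on the slide above can be illustrated with a minimal Python sketch. It is not the platform's code: the slide names DTD / XML Schema and Groovy/XPath/Regex, while this sketch swaps in a well-formedness and required-field check plus one regex rule. The field names (`id`, `title`, `zipcode`) and the zipcode format are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = {"id", "title"}          # hypothetical required fields
ZIPCODE = re.compile(r"\d{4}\s?[A-Z]{2}")  # assumed zipcode format


def syntactic_errors(xml_text):
    """Structural problems: malformed XML or missing required fields."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    present = {child.tag for child in root}
    return [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - present)]


def semantic_errors(xml_text):
    """Content problems that a structural check would not catch."""
    root = ET.fromstring(xml_text)
    zipcode = root.findtext("zipcode")
    if zipcode is not None and not ZIPCODE.fullmatch(zipcode):
        return [f"invalid zipcode: {zipcode}"]
    return []


def route(xml_text):
    """Send a document to the 'valid' or the 'invalid' channel."""
    # Semantic checks only run when the syntactic checks found nothing.
    errors = syntactic_errors(xml_text) or semantic_errors(xml_text)
    return ("invalid", errors) if errors else ("valid", [])
```

Routing failures to a separate channel, rather than raising, is one way to handle the slide's "think about exception handling" point: the flow keeps running and invalid documents can be inspected later.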
    16. Enterprise Search (Collection Process): Content Enrichment
        • Extraction
          • Metadata
          • Content (free text of document)
        • Enhancing
          • Derive new and alter existing metadata
        • Filtering
          • Remove (parts of) metadata
        • Leverage external knowledge models
        • Conditional enrichment
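As a rough illustration of the three enrichment actions listed above, the following Python sketch runs extraction, enhancement and filtering over a document held as a plain dictionary. The field names (`raw`, `city`, `province`, `title`) and the in-memory lookup table standing in for an external knowledge model are assumptions, not part of the original platform.

```python
KNOWLEDGE_MODEL = {"Deventer": "Overijssel"}  # hypothetical external lookup


def extract(doc):
    """Extraction: take the first line of the raw text as the title."""
    enriched = dict(doc)
    lines = doc.get("raw", "").splitlines()
    enriched["title"] = lines[0].strip() if lines else ""
    return enriched


def enhance(doc):
    """Enhancement: derive 'province' from 'city' via the knowledge model."""
    enriched = dict(doc)
    province = KNOWLEDGE_MODEL.get(doc.get("city"))
    if province is not None:  # conditional enrichment: only when known
        enriched["province"] = province
    return enriched


def filter_fields(doc, drop=("raw",)):
    """Filtering: remove fields that should not reach the index."""
    return {key: value for key, value in doc.items() if key not in drop}


def enrich(doc):
    """Run the three actions in order; each returns a new document."""
    for step in (extract, enhance, filter_fields):
        doc = step(doc)
    return doc
```

For example, `enrich({"raw": "Council notice\nbody text", "city": "Deventer"})` yields a document with `city`, `title` and `province`, and without the `raw` field.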
    17. Enterprise Search (Collection Process): Indexing
        • Store in search engine(s)
          • Content-based routing
        • Document boosting
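Content-based routing and document boosting, as named on the indexing slide, might look roughly like the sketch below. The index names, the routing rules and the boost values are invented for illustration; in the actual platform this role is played by the ESB and the search engine.

```python
def choose_index(doc):
    """Content-based routing: pick a target index from the document itself."""
    if doc.get("type") == "news":
        return "news-index"
    if doc.get("language") == "nl":
        return "dutch-index"
    return "default-index"


def boost_for(doc):
    """Document boosting: an index-time weight, higher for editorial content."""
    return 2.0 if doc.get("editorial") else 1.0


def index(docs):
    """Group (boost, id) pairs per target index."""
    indexes = {}
    for doc in docs:
        indexes.setdefault(choose_index(doc), []).append((boost_for(doc), doc["id"]))
    return indexes
```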
    18. Enterprise Search (Publication Process): Request Inbound
        • HTTP/Get
          • URL based with parameters
          • Response in XML, JSON, …
        • HTTP/Post
          • XML (SOAP, REST) request
          • XML (SOAP, REST) response
        • API
          • Java, Perl, …
          • Wrappers on HTTP/Get
    19. Enterprise Search (Publication Process): Request Validation
        • Syntactic validation
          • Correct query syntax?
        • Semantic validation
          • Correct field filters?
          • Based on algorithms: Groovy, Regex
        • Placed anywhere in the flow
          • At inbound: XML Schema validation
          • At enrichment: validate derived request clauses
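A small Python sketch of the two request checks above: a syntactic check on the query string and a semantic check on field filters. The filterable fields and their value patterns are hypothetical, and the syntactic check covers only balanced parentheses and quotes, far less than a real query parser would verify.

```python
import re

FIELD_RULES = {  # hypothetical filterable fields and their value patterns
    "zipcode": re.compile(r"\d{4}[A-Z]{2}"),
    "year": re.compile(r"\d{4}"),
}


def syntactic_ok(query):
    """Balanced parentheses and an even number of double quotes."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and query.count('"') % 2 == 0


def semantic_errors(filters):
    """Check that every filter names a known field with a plausible value."""
    errors = []
    for field, value in filters.items():
        rule = FIELD_RULES.get(field)
        if rule is None:
            errors.append(f"unknown field: {field}")
        elif not rule.fullmatch(value):
            errors.append(f"bad value for {field}: {value}")
    return errors
```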
    20. Enterprise Search (Publication Process): Request Enrichment
        • Redirection
          • Spelling suggestions
          • Metadata suggestions
        • Enhancing
          • Add/remove clauses
          • Stemming, synonyms, stop words
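The request enrichment steps above can be sketched in a few lines of Python. In the platform these are standard Solr components; the sketch only mimics the idea with a stop-word set, a synonym table and the standard library's `difflib` for spelling suggestions. The word lists are made up, and stemming is left out.

```python
import difflib

STOP_WORDS = {"the", "a", "of"}          # hypothetical stop-word list
SYNONYMS = {"car": ["automobile"]}       # hypothetical synonym table
VOCABULARY = ["enterprise", "search", "retrieval", "car"]  # indexed terms


def enhance(query):
    """Remove stop-word clauses and add synonym clauses."""
    clauses = []
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue
        clauses.append(term)
        clauses.extend(SYNONYMS.get(term, []))
    return clauses


def suggest(term):
    """Redirection: propose the closest known term, or None."""
    matches = difflib.get_close_matches(term, VOCABULARY, n=1)
    return matches[0] if matches else None
```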
    21. Enterprise Search (Publication Process): Searching & Ordering
        • Filtering
          • Field search
        • Grouping
          • Add group information
          • Field collapsing, faceted search & clustering
        • Sorting
          • Sort on field
          • Ranking
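To make filtering, grouping and sorting concrete, here is a toy Python sketch over an in-memory list of documents. It only illustrates the three operations; the search engine does this against its index, and the documents and field names below are invented.

```python
from collections import Counter

DOCS = [  # hypothetical indexed documents
    {"id": 1, "type": "news", "year": 2010, "score": 0.4},
    {"id": 2, "type": "permit", "year": 2010, "score": 0.9},
    {"id": 3, "type": "news", "year": 2009, "score": 0.7},
]


def search(docs, filters, facet_field, sort_field):
    """Filter on exact field values, count facet values, sort descending."""
    hits = [d for d in docs if all(d.get(f) == v for f, v in filters.items())]
    facets = Counter(d[facet_field] for d in hits)
    hits.sort(key=lambda d: d[sort_field], reverse=True)
    return hits, dict(facets)
```

The facet counts are computed on the filtered hits, which is what lets a user interface show how many results remain per category after a filter is applied.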
    22. Enterprise Search (Publication Process): Response Enrichment
        • Redirection
          • Suggestions
          • More like this
        • Enhancing
          • Add/remove response fields
          • Schema information
          • Editorial information
    23. Enterprise Search (Publication Process): Response Outbound
        • Stateless
          • No security
          • XSLT, SolrJS
        • Stateful
          • Security
          • Web 2.0
          • Web application framework
    24. Technology Stack (Collection Process)
        • Use an ESB for the flow: Apache ServiceMix with Camel
          • Leverage standard ESB components (transformers, validation, splitter, filter, routers, scripting)
          • Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE)
          • Custom: crawler based on Apache Nutch
            • Leverage only the crawl framework
            • Extend NutchIndexWriter to asynchronously push crawled documents back into the ESB flow (reply-to)
        • The ESB makes a distributed flow possible: content-based routing
        • Hot deploy: easy maintenance
        • Services are reused across collection processes
        • Search engine independent
    25. Collection Process Flow
        [Figure: message flow in four stages connected by message channels]
        • Content Inbound: Document Push Inbound (Message Endpoint) → Syntactic Validation (Channel Purger) → Splitter (one message into N documents)
        • Content Validation: Semantic Validation (Channel Purger); invalid messages go to the Invalid Message Channel
        • Content Enrichment: Content Filter, Content Enricher (Message Enricher)
        • Content Indexer: Transformer (Message Translator) producing a SOLR document message → HTTP Transport (Channel Adapter) → Lucene/SOLR index (SOLRJ)
    26. Technology Stack (Publication Process)
        • Use the flow from Apache Lucene/Solr
          • Leverage standard Solr components (synonyms, stop words, stemming, MLT, spelling, faceted search, …)
          • Custom components, using Solr's extensibility framework
            • Security: authority field in schema with Apache Shiro integration
            • Field filters (zipcode, …)
        • User interfaces
          • Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter
          • Stateful: Apache Wicket with Spring
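An HTTP/Get request against Solr's select handler could be assembled as in the Python sketch below. The parameters `q`, `fq`, `wt`, `facet` and `facet.field` are standard Solr request parameters. Everything else is an assumption: the base URL, the role names, and in particular the idea that security is applied as a filter query on the `authority` field. The slide only says that an authority field exists and that Apache Shiro is integrated, not how the filter is built.

```python
from urllib.parse import urlencode


def build_select_url(base_url, text, roles, facet_fields=()):
    """Build a Solr /select URL for a free-text query."""
    params = [("q", text), ("wt", "json")]
    # Assumed security model: restrict hits to documents whose 'authority'
    # field matches one of the caller's roles. Role names are not escaped
    # here, so they must not contain Solr query syntax characters.
    if roles:
        params.append(("fq", "authority:(" + " OR ".join(roles) + ")"))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url + "/select?" + urlencode(params)
```

Passing the security restriction as `fq` rather than adding it to `q` keeps it out of relevance scoring, which is the usual reason to prefer a filter query for access control.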
    27. Enterprise Search (Logical Architecture)
        [Figure: the logical architecture diagram of slide 12, repeated]
    28. Enterprise Search (Logical Architecture)
        [Figure: the same diagram with technologies mapped onto it: ServiceMix/Camel and Nutch on the collection process, Apache Wicket and SolrJS/XSLT on the publication process, Lucene/SOLR as the search engine]
    29. Enterprise Search (Logical Architecture)
        [Figure: the same diagram with the Luminis Enricher Framework placed over the validation and enrichment steps]
    30. Luminis Enricher Framework
        • Custom Enricher Framework
          • Existing ESB & SOLR enricher capabilities are not sufficient
          • Enriching = one or more actions (extraction, enhancing & filtering) performed on documents with fields
          • Same enricher to be used for:
            • Collection process: documents (enriching, filtering & splitting)
            • Publication process:
              • Search requests: 'first-components' SearchComponent
              • Search response: 'last-components' SearchComponent
    31. Luminis Enricher Framework
        [Figure: the bullets of slide 30 shown over the collection process flow diagram, with the Content Enricher (Message Enricher) highlighted in the Content Enrichment stage]
    32. Luminis Enricher Framework
        [Figure: the bullets of slide 30 shown over the Solr request chain: SOLRQueryRequest (query) → SearchHandler (RequestHandler) with "first-components", "components" (SearchComponents: query, facet, mlt, highlight, stats, debug) and "last-components" → XML response → XSLTResponseWriter (QueryResponseWriter) with XSLT XML2HTML → (X)HTML query result]
    33. Luminis Enricher Framework (architecture)
        • Pipe-and-filter architecture
          • Documents flow through a series of actions
          • Output from one action is input to another action
          • Fields from the input document can be used in an action's clauses: values in expressions are filled by replacing Velocity-type patterns with field values
        • Conditional flows supported
        • Reuse of flows and subflows supported
    34. Luminis Enricher Framework (architecture)
        Bullets repeated from slide 33, plus an example flow:
        [Figure: Document [[A1,A2],[B]] → Action (remove A2) → Document [[A1],[B]] → If [B=3]
          • YES: Action (select C where ${B}) → Document [[A1],[B],[C1]]
          • NO: Action (select C where ${A}) → Document [[A1],[B],[C2]]]
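The example flow on this slide can be traced with a minimal pipe-and-filter sketch in Python. This is not the Luminis framework, which is Java/OSGi based; it only mirrors the diagram. Documents are dictionaries of multi-valued fields, `${field}` patterns are filled with the first value of that field, and the `STORE` lookup table standing in for "select C where ..." is an assumption.

```python
import re

PATTERN = re.compile(r"\$\{(\w+)\}")
STORE = {"3": "C1", "A1": "C2"}  # hypothetical backing store for 'select'


def fill(expression, doc):
    """Replace each ${field} pattern with the first value of that field."""
    return PATTERN.sub(lambda m: str(doc[m.group(1)][0]), expression)


def remove_value(field, value):
    """Action: drop one value from a multi-valued field."""
    def action(doc):
        out = {k: list(v) for k, v in doc.items()}
        out[field] = [v for v in out.get(field, []) if v != value]
        return out
    return action


def select(target, expression, lookup):
    """Action: fill the expression, look it up, store the result."""
    def action(doc):
        out = {k: list(v) for k, v in doc.items()}
        out[target] = [lookup(fill(expression, doc))]
        return out
    return action


def conditional(test, then, otherwise):
    """Conditional flow: choose one of two actions per document."""
    return lambda doc: then(doc) if test(doc) else otherwise(doc)


def run(doc, actions):
    """Pipe: the output of each action is the input of the next."""
    for action in actions:
        doc = action(doc)
    return doc


FLOW = [
    remove_value("A", "A2"),
    conditional(
        lambda d: d["B"] == ["3"],
        select("C", "${B}", STORE.get),
        select("C", "${A}", STORE.get),
    ),
]
```

With `B` equal to 3 the document takes the YES branch and ends with `C1`; any other value of `B` takes the NO branch and ends with `C2`, as in the figure.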
    35. Luminis Enricher Framework (Configuration)
        • Enricher flow and expression configuration via an XML-based DSL
          • Conditional: if-then-else & switch-case-else (with regex support)
          • Actions: add & remove fields and field values using expressions
          • Expression handlers currently supported:
            • Field
            • Function (execute methods via Java Reflection)
            • HttpClient (retrieve content by URL described by field values)
            • XSLT, XPath, XQuery (external XML databases)
            • JDBC
            • SPARQL (OpenRDF)
            • Apache Lucene/Solr
            • Apache Tika (meta and text extraction)
    36. Luminis Enricher Framework (Examples)

        <enricher name="Field">
          <field name="a">AA1</field>
          <field name="b">BB1</field>
          <field name="b">BB2</field>
          <multivalue-field name="c">CC1</multivalue-field>
          <multivalue-field name="c">CC2</multivalue-field>
          <if test="field::c" pattern="CC2">
            <then>
              <field name="e">EE1</field>
            </then>
          </if>
          <if test="field::a">
            <then>
              <field name="f">FF1</field>
            </then>
          </if>
          <rename-field name="b">d</rename-field>
          <remove-field name="a"/>
        </enricher>
    37. Luminis Enricher Framework (Examples)
        The Field example of slide 36, overlaid with an XPath example (string quotes lost in extraction restored):

        <enricher name="XPath"
                  xmlns:str="http://exslt.org/strings"
                  xmlns:fn="http://www.w3.org/2005/xpath-functions"
                  xmlns:html="http://www.w3.org/1999/xhtml">
          <field name="Description" expression-type="xpath">
            //html:meta[@name='DC.description']/@content
          </field>
          <multivalue-field name="Type" expression-type="xpath">
            //html:meta[@name='DC.type' and
              (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
               @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
               @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')
            ]/@content
          </multivalue-field>
          <field name="publisher" expression-type="xpath">
            fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
          </field>
          <field name="publisher" expression-type="xpath">
            fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
                      //html:meta[@name='DC.creator']/@content)
          </field>
        </enricher>
    38. Luminis Enricher Framework (Examples)
        The Field and XPath examples of slides 36 and 37, overlaid with a SPARQL example:

        <enricher name="SPARQL">
          <field name="place">http://www.my.com/#channels</field>
          <field expression-type="sparql" repository="TESTRDF">
            <![CDATA[
              PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
              SELECT ?definition
              WHERE {
                ?${place} skos:definition ?definition.
              }
            ]]>
          </field>
        </enricher>
    39. Luminis Enricher Framework (Examples)
        The examples of slides 36 to 38, overlaid with an HTTP and Tika example:

        <enricher name="HttpAndTika">
          <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
          <field expression-type="http" name="content.file">field:content.url</field>
          <field name="auteur" source="field::content.file">xpath://H1</field>
          <multivalue-field expression-type="tika.meta" source="field::content.file"/>
          <field name="content" expression-type="tika.text" source="field::content.file"/>
          <switch test="field::content.url">
            <case pattern=".*.rijksweb.nl.*"><field name="source">Rijksweb</field></case>
            <case pattern=".*.deventer.nl.*"><field name="source">Gemeente Deventer</field></case>
            <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
            <else><field name="source">Overige</field></else>
          </switch>
        </enricher>
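The switch-case-else construct in the HTTP and Tika example assigns a `source` value by matching the document URL against regular expressions. A Python sketch of that behaviour follows. The patterns and values are copied from the slide; the matching semantics (whole-string match, first matching case wins) are an assumption, since the slide does not spell them out.

```python
import re

CASES = [  # (pattern, source) pairs in the order given on the slide
    (r".*.rijksweb.nl.*", "Rijksweb"),
    (r".*.deventer.nl.*", "Gemeente Deventer"),
    (r"file:.*", "Locale Harde Schijf"),
]
DEFAULT = "Overige"  # value of the <else> branch


def classify(url):
    """Return the source of the first case whose pattern matches the URL."""
    for pattern, source in CASES:
        if re.fullmatch(pattern, url):
            return source
    return DEFAULT
```

Note that the dots in the slide's patterns are unescaped, so they match any character; a pattern such as `.*\.deventer\.nl.*` would be stricter.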
    40. Luminis Enricher Framework (Technology)
        • Enricher and expression handlers are Java-based OSGi services:
          • Hot pluggable and updatable
          • Flow and expression configuration changes require no restart
          • Extensible: new expression handlers are immediately available in actions after installing the OSGi bundle
        • Runs in Apache Felix
          • Collection process: ServiceMix contains an OSGi container
          • Publication process: custom OSGi loader for Lucene/Solr
        • Centralized & transactional provisioning (Apache Ace): components & configuration
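The hot-pluggable handler idea can be approximated by a registry that is consulted at evaluation time, so handlers added or removed while the process runs take effect on the next call. The Python sketch below shows only that lookup pattern. It is not OSGi, and the handler names (`field`, `upper`) and function signatures are invented for illustration.

```python
HANDLERS = {}  # expression type -> handler(expression, doc)


def register(expression_type, handler):
    """Make a handler available, replacing any earlier one of that type."""
    HANDLERS[expression_type] = handler


def unregister(expression_type):
    """Remove a handler; later evaluations of that type will fail."""
    HANDLERS.pop(expression_type, None)


def evaluate(expression_type, expression, doc):
    """Look the handler up at call time, so registry changes apply at once."""
    try:
        handler = HANDLERS[expression_type]
    except KeyError:
        raise LookupError(f"no handler for expression type {expression_type!r}") from None
    return handler(expression, doc)


# Two toy handlers: return a field value, and return it upper-cased.
register("field", lambda expression, doc: doc.get(expression))
register("upper", lambda expression, doc: str(doc.get(expression, "")).upper())
```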
    41. Deployment Architecture
        [Figure: deployment diagram. Recoverable elements:
          • Devices: Firewall, HTTP Load Balancer, Deployment Server, Master Collection Server, Slave Publication Servers (Slave1, Slave2)
          • Containers and components: Apache Felix OSGi, Apache Ace, Apache ServiceMix, Apache Nutch, Apache Tika, OpenRDF, Apache Tomcat, Lucene/SOLR, Apache Wicket, Luminis Enricher
          • Configuration: SOLR solrconfig.xml, SOLR schema.xml, Luminis Enricher.xml, ServiceMix config.xml
          • Data containers: knowledge models in a SQL database and in an RDF triple store
          • Connections: HTTP, HTTP/ReST, JDBC, provisioning]
    42. Conclusions
        • An Enterprise Search solution is not Google search
        • Open source paves the way, but misses some ingredients
          • Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel, Wicket, MySQL, OpenRDF, Felix/Ace
          • Missing ingredients: Enricher
        • Interesting developments:
          • Apache Chemistry (CMIS)
          • Apache Clerezza
          • Apache Nutch
          • Apache Connectors Framework (ManifoldCF)
    43. Questions & (answers?)
        Marc Teutelink, marc.teutelink@luminis.eu, @mteutelink
        MEAP December 2010
