The document discusses how Apache open source software is used in implementing an enterprise search and retrieval platform. It provides an overview of enterprise search, including its key features and challenges. It then outlines the logical architecture of an enterprise search solution, covering the collection and publication processes. The collection process involves pulling or pushing content from various sources, validating, enriching and indexing it. The publication process involves handling search requests, filtering, grouping results and returning responses.
Apache Solr is a popular, fast open source enterprise search platform that uses
Lucene as its core search engine. Solr’s major features include full-text search, hit
highlighting, faceted search, dynamic clustering, database integration, and complex queries.
Solr scales through distributed search and index replication, and it powers the
search and navigation features of many of the world's largest internet sites.
Open source enterprise search and retrieval platform
1. Open Source Search & Retrieval Platform
Enterprise Search, EAI, Semantic Web
Marc Teutelink
Date: 21 August 2010
2. How Apache open source software is used
during the implementation of an
Enterprise Search and Retrieval Platform
(Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)
3. Marc Teutelink
marc.teutelink@luminis.eu
@mteutelink
•Software architect at Luminis
•15+ years of experience in software development; specialized in
Enterprise Search, Enterprise Application Integration and
Semantic Web technology
•Currently writing “Enterprise Search in Action” for Manning
(Mid-2011)
4. Agenda
•Enterprise Search
• What is Enterprise Search: Functions and features
• Challenges
• Logical Architecture
•Enterprise Search Solution
• Technology Stack
• Collection Process
• Publication Process
• Enricher framework
• Deployment
•Conclusion
5. What is Enterprise Search?
“Enterprise Search offers a solution for searching,
finding and presenting enterprise related information
in the larger sense of the word”
Enterprise search is all about searching through documents of
any type and format, from any source, located anywhere, with the
utmost flexibility
• Web search: limited to public documents on the web
• Desktop search: limited to private documents on the local machine
• Enterprise search: no limitations on document type and location
6. Enterprise Search
(features)
•Information Sources and Types
• Wide range of sources: local and remote filesystems, content repositories,
e-mail, databases, internet, intranet and extranet
• Type not limited: any type ranging from structured to unstructured data, text
and binary formats and compound formats (zip)
•Usage
• Not limited to interactive use: also automated business processes
•Security
• Integration with enterprise security infrastructure
•User Interaction and personalization
• Identity enables more personalized search results
7. Enterprise Search
(features)
•Extended metadata
• More metadata: better and more precise search results
• More control over schema (for example Dynamic Fields)
•Ranking
• More control over ranking: personalized ranking (group)
•Data extraction and derivation
• Extract data using various techniques: XPath, XQuery
• Derive data using external knowledge models: RDBMS, RDF store, web services
• Conditional extraction & derivation
•Managing and monitoring
• On-the-fly management (JMX)
• Real time monitoring
8. Enterprise Search
(features)
•User Interfaces
• Web search
• All about selling advertisements to the masses
• Generic & minimalistic screens; focus on ads
• Enterprise search
• All about finding: rich navigation; focus on quick find
• Small targeted audience
• Specialized and customized screens (use of ontologies, taxonomies
and classifications)
• Use of identity (results customized to user) and web 2.0
• Grouping
• field collapsing, faceted search & clustering
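The two simplest grouping techniques named on this slide can be sketched in a few lines. This is a minimal illustration in Python, not how Lucene/Solr implements them; the hit list and the field names `type` and `site` are made up for the example.

```python
from collections import Counter

# Hypothetical search hits; the field names are illustrative only.
hits = [
    {"id": 1, "type": "pdf", "site": "intranet"},
    {"id": 2, "type": "pdf", "site": "extranet"},
    {"id": 3, "type": "doc", "site": "intranet"},
    {"id": 4, "type": "pdf", "site": "intranet"},
]

def facet_counts(results, field):
    """Faceted search: count how many results carry each value of `field`."""
    return Counter(doc[field] for doc in results if field in doc)

def collapse(results, field):
    """Field collapsing: keep only the first result per distinct value of `field`."""
    seen = set()
    collapsed = []
    for doc in results:
        key = doc.get(field)
        if key not in seen:
            seen.add(key)
            collapsed.append(doc)
    return collapsed

type_facets = facet_counts(hits, "type")
per_site = collapse(hits, "site")
```

Facet counts drive the navigation menus ("pdf (3), doc (1)"), while collapsing keeps one representative hit per group so a single source cannot dominate the result list.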
9. Enterprise Search
(Challenges)
•Performance and scalability
•Rich functions and features
•Manageability
•Flexibility
•Easy maintenance
•Quick issue and problem solving
•Reduce total cost of ownership
10. Enterprise Search
(Challenges)
•Performance and scalability
•Rich functions and features
•Manageability
•Flexibility
•Easy maintenance
•Quick issue and problem solving
•Reduce total cost of ownership
Commercial Search Engines?
11. Enterprise Search
(Challenges)
•Performance and scalability
•Rich functions and features
•Manageability
•Flexibility
•Easy maintenance
•Quick issue and problem solving
•Reduce total cost of ownership
Apache Based (Open Source)
Search & Retrieval Platform
30. Luminis Enricher Framework
•Custom Enricher Framework
• Existing ESB & SOLR enricher capabilities not sufficient.
• Enriching = one or more actions (extraction, enhancing &
filtering) performed on documents with fields
• Same enricher to be used for:
• Collection process:
• Documents enriching, filtering & splitting
• Publication process:
• Search requests: ‘first-components’ SearchComponent
• Search responses: ‘last-components’ SearchComponent
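The idea of hooking the same enricher in before and after the standard search components can be modelled as an ordered chain. The sketch below is plain Python, not Solr code: `run_search`, the three component functions and the tiny corpus are all hypothetical stand-ins, and only the names "first-components" and "last-components" come from the slide.

```python
def run_search(request, first_components, components, last_components):
    """Run a request through first-, standard and last-components, in that order."""
    response = {"docs": []}
    for component in [*first_components, *components, *last_components]:
        component(request, response)
    return response

# Hypothetical enricher used as a first-component: rewrites the incoming request.
def enrich_request(request, response):
    request["q"] = request["q"].strip().lower()

# Stand-in for the standard query component; a real one would consult the index.
def query_component(request, response):
    corpus = ["enterprise search", "web search", "desktop search"]
    response["docs"] = [{"title": t} for t in corpus if request["q"] in t]

# Hypothetical enricher used as a last-component: decorates the outgoing response.
def enrich_response(request, response):
    for doc in response["docs"]:
        doc["label"] = doc["title"].title()

result = run_search(
    {"q": "  Enterprise "},
    [enrich_request],
    [query_component],
    [enrich_response],
)
```

Because the enrichers sit at the two ends of the chain, the standard components in the middle stay untouched, which is presumably what makes the same enricher reusable for both requests and responses.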
31. Luminis Enricher Framework
[Diagram: collection process, shown with the bullets of slide 30. Labels recoverable from the figure:
• Content Inbound: Push Inbound (Message Endpoint), Syntactic Validation (Channel Purger), Splitter, Document Messages
• Content Validation: Semantic Validation (Channel Purger), Invalid Message Channel
• Content Enrichment: Content Filter, Content Enricher (Message Enricher)
• Content Indexer: SOLR Indexer (Channel Adapter), SOLR Document, Lucene/SOLR (SOLRJ), Lucene/Solr index]
32. Luminis Enricher Framework
[Diagram: publication process, shown over the collection process diagram and the bullets of slide 30. Labels recoverable from the figure:
• <<SOLRQueryRequest>> Query
• <<SearchHandler>> RequestHandler with "first-components", "components" and "last-components"
• <<SearchComponent>>: query, facet, mlt, highlight, stats, debug
• <<XML>> Response
• <<QueryResponseWriter>> XSLTResponseWriter with <<XSLT>> XML2HTML
• <<(X)HTML>> Result]
33. Luminis Enricher Framework
(architecture)
•Pipe-and-filter architecture
• Documents flow through series of actions
• Output from one action is input to another action
• Fields from the input document can be used in an action’s clauses: expressions
are filled in by replacing Velocity-style patterns with field values
•Conditional flows supported
•Reuse of flows & Subflows supported
34. Luminis Enricher Framework
(architecture)
•Pipe-and-filter architecture
• Documents flow through series of actions
• Output from one action is input to another action
• Fields from the input document can be used in an action’s clauses: expressions
are filled in by replacing Velocity-style patterns with field values
•Conditional flows supported
•Reuse of flows & Subflows supported
Example flow:
Document [[A1,A2],[B]]
-> Action (remove A2) -> Document [[A1],[B]]
-> If [B=3]
YES -> Action (select C where ${B}) -> Document [[A1],[B],[C1]]
NO -> Action (select C where ${A}) -> Document [[A1],[B],[C2]]
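The example flow on this slide can be mimicked in a small pipe-and-filter sketch. It is written in Python for brevity and is not the framework's actual API: `fill`, `remove_value`, `add_field`, `branch`, `run_flow` and the `lookup` table are hypothetical names, and the lookup stands in for whatever external source a real "select" expression would query.

```python
import re

PATTERN = re.compile(r"\$\{(\w+)\}")

def fill(expression, doc):
    """Replace each ${field} pattern with the first value of that field."""
    return PATTERN.sub(lambda m: str(doc[m.group(1)][0]), expression)

def remove_value(field, value):
    """Action: drop one value from a multi-valued field."""
    def action(doc):
        out = {k: list(v) for k, v in doc.items()}
        out[field] = [v for v in out[field] if v != value]
        return out
    return action

def add_field(field, expression, lookup):
    """Action: add `field` with the value `lookup` returns for the filled-in expression."""
    def action(doc):
        out = {k: list(v) for k, v in doc.items()}
        out[field] = [lookup(fill(expression, doc))]
        return out
    return action

def branch(condition, then_action, else_action):
    """Conditional flow: pick one of two actions based on the document."""
    def action(doc):
        return then_action(doc) if condition(doc) else else_action(doc)
    return action

def run_flow(doc, actions):
    """Pipe and filter: the output of each action is the input of the next."""
    for action in actions:
        doc = action(doc)
    return doc

# Hypothetical lookup standing in for an external source such as a database.
def lookup(query):
    return {"select C where 3": "C1", "select C where A1": "C2"}[query]

flow = [
    remove_value("A", "A2"),
    branch(
        lambda d: d["B"] == [3],
        add_field("C", "select C where ${B}", lookup),
        add_field("C", "select C where ${A}", lookup),
    ),
]

yes_doc = run_flow({"A": ["A1", "A2"], "B": [3]}, flow)
no_doc = run_flow({"A": ["A1", "A2"], "B": [5]}, flow)
```

Each action returns a new document rather than mutating its input, so flows and subflows can be reused without one run leaking state into another.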
35. Luminis Enricher Framework
(Configuration)
•Enricher flow and expression configuration via XML based DSL
• Conditional: if-then-else & switch-case-else (with regex support)
• Actions: Add & remove fields and field values using expressions
• Expression handlers currently supported:
• Field
• Function (execute methods via Java Reflection)
• HttpClient (retrieve content by URL described by field values)
• XSLT, XPath, XQuery (external XML databases)
• JDBC
• SPARQL (OpenRDF)
• Apache Lucene/Solr
• Apache Tika (metadata and text extraction)
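One way to picture the expression handlers is as a registry that maps a handler name to a function, so that an action only needs to name the handler it wants. The sketch below is an assumption about the general shape, not the framework's XML DSL or Java API: the registry, the `handler` decorator, the "function:field" argument format and the use of Python's `math` module as the reflection target are all invented for illustration. Only the "Field" and "Function" handler types come from the slide.

```python
import math

HANDLERS = {}

def handler(name):
    """Register an expression handler under `name`."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register

@handler("field")
def field_handler(doc, argument):
    # Return a copy of the values of an existing field.
    return list(doc.get(argument, []))

@handler("function")
def function_handler(doc, argument):
    # Look a function up by name at run time (a rough counterpart to Java
    # reflection) and apply it to every value of the named field.
    # Hypothetical argument format: "function:field".
    func_name, field = argument.split(":")
    func = getattr(math, func_name)
    return [func(v) for v in doc.get(field, [])]

def evaluate(doc, handler_name, argument):
    """Dispatch an expression to the handler registered under `handler_name`."""
    return HANDLERS[handler_name](doc, argument)

doc = {"size": [16, 81]}
copied = evaluate(doc, "field", "size")
roots = evaluate(doc, "function", "sqrt:size")
```

Adding a new kind of expression then means registering one more function; the actions that call `evaluate` do not change.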
40. Luminis Enricher Framework
(Technology)
•Enricher and expression handlers are Java-based OSGi
services:
• Hot pluggable and updatable
• Flow and expression configuration changes require no restart
• Extensible: new expression handlers are immediately available in
actions after installing the OSGi bundle
•Runs in Apache Felix
• Collection Process: ServiceMix contains an OSGi container
• Publication Process: Custom OSGi loader for Lucene/Solr
•Centralized & transactional provisioning (Apache Ace)
• Components & Configuration