DataScience Meeting II - Roman Kern - Building an open source based search solution - first steps

Building an open-source based search solution - first steps

Roman Kern

Institute of Knowledge Management
Graz University of Technology
Know-Center Graz
rkern@tugraz.at, rkern@know-center.at

Data Science Meetup / 2012-04-12

Overview

Motivation

Background

Solr Ecosystem

Solr Features

Conclusions

Motivation

Search
Change in users' expectations
Missing or sub-optimal search causes frustration

Science
Information retrieval
Success story
Mostly focused on web search

Industry
Enterprise search
Heterogeneous data sources

Background of the Speaker

http://a1.net

http://wissen.de

Apache Lucene Umbrella Project

Components
Search engine ⇒ Lucene
Search server ⇒ Solr
Web search engine ⇒ Nutch
Lightweight crawler ⇒ Droids
File-format parsing ⇒ Tika
Communicate with CMS ⇒ ManifoldCF
Distributed coordination ⇒ ZooKeeper
Natural language processing ⇒ OpenNLP
Related projects: Hadoop, Mahout, Carrot2, ...

Common aspects
Apache license, implemented in Java, community

Lucene

Search Engine Library
Java API
Only for expert users
Search-Index
File-system
In-memory index
Advanced features
Incremental indexing
Update while searching
Base for many projects
Solr
ir-lib
elasticsearch
LIA (Lucene in Action)

http://lucene.apache.org/core/

Nutch

Web search engine
Builds upon Solr
Web crawler
Link database, crawl database
Distributed
Runs on Hadoop
Mode of operation
Crawl a single domain
Crawl the web with seed sites

http://nutch.apache.org/

Droids

Crawler component
Lightweight crawler
Main features
Throttling
Multi-threaded
Well behaved (robots.txt)

http://incubator.apache.org/droids/

Tika

Text extraction
Text & meta-data
File-formats
Office
Microsoft Formats (Apache POI)
OpenDocument
Common text formats
PDF (PDFBox)
HTML (tagsoup)
Non-text
Images
Sound

http://tika.apache.org/

ManifoldCF

Content Management System Connectors
Communicate with CMS/DMS
Connectors
FileNet P8 (IBM)
Documentum (EMC)
LiveLink (OpenText)
Meridio (Autonomy)
Windows shares (Microsoft)
SharePoint (Microsoft)
More: Alfresco, JDBC, ...
Data is then stored and indexed
e.g. Solr

http://incubator.apache.org/connectors/

ZooKeeper

Distributed coordination
Orchestrate servers
Distributed
Configuration
Name lookup
Synchronization

http://zookeeper.apache.org/

OpenNLP

Natural language processing
Process plain text
Maximum entropy classification with beam search
Models
Sentence splitting
Token splitting
Part-of-speech (POS) tagging
Named entity recognition
more: chunker, parser, co-reference resolution

http://opennlp.sourceforge.net/

Hadoop

Distributed computing
Scale out framework
Distributed file-system
Data is partitioned
Stored on multiple nodes
Map/Reduce paradigm
Map your algorithms to mappers & reducers
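
The Map/Reduce idea can be illustrated with a small single-process sketch. This is not Hadoop code: the function names (mapper, reducer, map_reduce) are made up for the illustration, and the grouping step stands in for the shuffle that a real cluster would perform across nodes.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Combine all counts that were emitted for one word."""
    return word, sum(counts)


def map_reduce(lines):
    # Stand-in for the shuffle phase: group mapper output by key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    # Each key is reduced independently, which is what allows scale-out.
    return dict(reducer(word, counts) for word, counts in grouped.items())


word_counts = map_reduce(["search with Solr", "Solr builds on Lucene"])
```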

Related projects: HBase, Pig, Hive, ...

http://hadoop.apache.org/

Mahout

Distributed machine learning
Scale out framework
Machine learning
Recommender systems
Clustering
Classification
Integration
Standalone
Hadoop
Amazon EC2

http://mahout.apache.org/

Details

Search Server

What Solr is
Web-Service
Full-text indexing & search
Support to store arbitrary content

What Solr isn’t
Solr ≠ grep
Not a database
But somewhat similar to NoSQL databases

Solr vs. IR-Lib
Solr: easy to use, easy to integrate, XML configuration
IR-Lib: expert knowledge to use, Java configuration, fast

Index Structure

Inverted Index
Dictionary of words (terms)
Map from term to document
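
A toy version of this structure, as a minimal sketch: the dictionary maps each term to the set of documents that contain it. The whitespace tokenization and the sample documents are assumptions made for the illustration; Lucene's actual index format is far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {1: "open source search", 2: "search server", 3: "open data"}
index = build_inverted_index(docs)

# An AND query is the intersection of the posting sets of its terms.
hits = sorted(index["search"] & index["open"])
```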

Document
List of fields
Input fields are then mapped according to the schema

Field-types
Defined in the schema
Type (string, boolean, date, number), internally mapped to string

Index Management

API
HTTP Server
Various formats (XML, binary, JavaScript, ...)

Document life-cycle
There is no in-place update: a changed document is deleted and re-inserted
Delete (done automatically by Solr)
Insert
Implications
A unique id is necessary
Use batch updates
Commit, rollback (and optimize)
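
A batch update in the XML format can be sketched as follows. The field names (id, title) and the sample documents are hypothetical, and the schema is assumed to declare id as the unique key, so re-sending a document with the same id replaces the earlier version. The messages would then be POSTed to the server's update handler; the HTTP call is left out here.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build one <add> message holding a whole batch of documents."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical documents; "id" is assumed to be the unique key.
batch = [
    {"id": "doc-1", "title": "First steps"},
    {"id": "doc-2", "title": "Solr features"},
]
add_xml = add_message(batch)

# Changes become visible to searches only after a commit message.
commit_xml = ET.tostring(ET.Element("commit"), encoding="unicode")
```

Sending many documents in one add message and committing once at the end is the batch pattern recommended above.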

Input Handling

Different input formats
XML
CSV
JDBC (database)
DIH (data import handler)
Support incremental updates (via timestamps)
Solr Cell
Binary content
Apache Tika
Text content and metadata

Text Processing

Scope
During indexing & query

Tokenization
Split text into tokens
Lower-casing
Stemming (e.g. ponies, pony ⇒ poni; triplicate ⇒ triplic; ...)
Synonyms (via thesaurus)
Stop-word filtering
Multi-word splitting (e.g. Wi-Fi ⇒ Wi, Fi)
n-grams, soundex, umlauts
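
A few of these steps can be chained in a small sketch. This is not Solr's analyzer: the stop-word list is illustrative, the regular expression is a simplification, and stemming and synonyms are left out. The same chain has to run on documents at indexing time and on the query, otherwise terms will not match.

```python
import re

# Illustrative stop-word list, not the one shipped with Solr.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def analyze(text):
    """Tokenize, lower-case, and drop stop words."""
    # Splitting on non-alphanumerics also breaks "Wi-Fi" into "Wi", "Fi".
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOP_WORDS]


tokens = analyze("The Wi-Fi of the Ponies")
```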

Query Processing

Query parsers
Lucene query parser (rich syntax)
AND, OR, NOT, range queries, wildcards, fuzzy queries, phrase queries
Boosting of individual parts
Example: ((boltzmann OR schroedinger) NOT einstein)
Dismax query parser
No query syntax
Searches over multiple fields (separate boost for each field)
Configure how many terms are mandatory
Distance between terms is used for ranking (phrase boosting)

Dismax is a good starting point, but may become expensive
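
As a rough sketch, a Dismax request could be assembled like this. The parameter names (defType, qf, mm, pf) are Solr's Dismax parameters; the field names title and body and the boost value are assumptions for the example and would have to exist in the schema.

```python
from urllib.parse import urlencode


def dismax_params(user_query):
    """Request parameters for a Dismax search over two hypothetical fields."""
    return {
        "defType": "dismax",      # select the Dismax query parser
        "q": user_query,          # raw user input, no query syntax needed
        "qf": "title^2.0 body",   # fields to search, with per-field boosts
        "mm": "2",                # at least two terms must match
        "pf": "title body",       # phrase boosting for terms that are close
    }


query_string = urlencode(dismax_params("boltzmann schroedinger"))
```

The resulting string would be appended to the URL of a search request handler.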

Search Features

Query filter
Additional query
No impact on ranking
Results are cached

Boosting query
Only in Dismax

Query elevation
Fix certain queries

Request handler
Pre-define clauses
Invariants

Search Result

Ranking
Relevance
Sort on field value (only single term per document)

Available data & features
Sequence of IDs & score
Stored fields
Snippets (plus highlighting)
Facets
Count the search hits
Types: field value, dates, queries
Sort, prefix, ...
Could be used for term suggestion (aka. query suggestion)
Field collapsing (grouping)
Spell checking (did-you-mean)
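
What field-value facets compute can be shown on a toy hit list. This sketch counts in application code purely for illustration; Solr computes the counts on the server from the index. The field name type and the sample hits are made up.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many hits carry each value of the given field."""
    return Counter(doc[field] for doc in hits if field in doc)


# Hypothetical search hits with a single-valued "type" field.
hits = [
    {"id": 1, "type": "pdf"},
    {"id": 2, "type": "html"},
    {"id": 3, "type": "pdf"},
]
counts = facet_counts(hits, "type")

# Sorting by count yields the most frequent values first.
top = counts.most_common(1)
```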

Additional Solr Features

Query by Example
More like this

Stats
Per field
Min, max, sum, missing, ...

Admin-GUI
Webapp to troubleshoot queries
Browse schema

JMX
Read properties & statistics
Can be accessed remotely

Integration

Deployment
Within a web application server
Embedded

Monitor
Log output

Access
Various language bindings
Java, Ruby, JavaScript, PHP, ...

Multi-core

Multiple indices
Each index has its own configuration

Operations
Reload (when configuration has been changed)
Rename
Swap
Merge
Create, Status

Scale Solr

Replication
Master and slave nodes
Replication
Slaves poll master

Dispatch search request
Load balancer

Sharding Indexes

Single index
Index spread over multiple machines
Search is done in parallel

Mapping
Application has to provide a deterministic mapping
Document ⇒ index
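
One common way to obtain such a deterministic mapping is to hash the document id, sketched below. The function name shard_for and the choice of MD5 are assumptions for the illustration, not something Solr prescribes. Python's built-in hash() is avoided because it is randomized per process for strings, which would break determinism across restarts.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard number in a stable way."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# The same id always lands on the same shard, so a re-sent document
# replaces its earlier version instead of creating a duplicate elsewhere.
shards = {doc_id: shard_for(doc_id, 4) for doc_id in ["doc-1", "doc-2", "doc-3"]}
```

Note that changing the number of shards changes the mapping, so the documents would have to be re-indexed.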

Conclusions

Ecosystem
Vibrant community
Corporate backing

Solr
Easy to get started
Hard to optimize for specific requirements

The End

Thank you!
