Apache Solr, the open source enterprise search engine
1. Apache Solr
the enterprise search platform
Luca Bonesini | Titulus User Group, Kion – Bologna, 4 Dec 2013
2. About me
Luca Bonesini
Computer scientist
Javelin thrower
Programmer
Bass guitar player
Systems administrator
Entrepreneur
IT Manager
Husband
Pre-sales engineer
Mountain biker
Webmaster
Father²
Salesman
Singer
Marketer
http://www.lucabonesini.it
@lbonesini
http://it.linkedin.com/in/lucabonesini/
l.bonesini@sourcesense.com
+39 366 688 7125
3. Sourcesense
Making sense of Open Source
Contributors
Lucene/Solr
Apache Chemistry
Apache Jackrabbit
OpenSSO-Alfresco
Committers
Lead developer
Hibernate Search
Lucene
Project
Infinispan
Apache/UIMA project
integration
JBoss GateIn Portal
4. Lucene and Solr
What are they?
5. Apache Lucene (core)
Search by ASF
“Apache Lucene is a high-performance, full-featured text search engine library written
entirely in Java. It is a technology suitable for
nearly any application that requires full-text
search, especially cross-platform”.
http://lucene.apache.org/core/
fast and efficient scoring and indexing algorithms
lots of contributions to make common tasks easier: highlighting, spatial,
query parsers, benchmarking tools, etc.
most widely deployed search library on the planet
6. Apache Solr
Search by ASF
“Solr is the popular, blazing fast open source
enterprise search platform from the Apache Lucene
project. Its major features include powerful full-text
search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database
integration, rich document (e.g., Word, PDF)
handling, and geospatial search”.
Highly reliable, scalable, fault tolerant, distributed indexing, replication,
load-balanced querying, automated failover and recovery, centralized
configuration.
7. Apache Solr
Search by ASF
Solr is written in Java and runs as a standalone full-text
search server within a servlet container such as Jetty.
Solr uses the Lucene Java search library at its core for
full-text indexing and search, and has REST-like
HTTP/XML and JSON APIs that make it easy to use from
virtually any programming language.
http://lucene.apache.org/solr
Access Lucene over HTTP: Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are configuration tasks in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
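To make "access Lucene over HTTP" concrete, here is a minimal sketch that only builds a query URL for Solr's `/select` request handler. The host, port, core name (`collection1`) and the field names `title` and `category` are assumptions for illustration and are not taken from the slides; `q`, `wt`, `rows`, `facet` and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10, facet_fields=()):
    """Return a URL for Solr's /select handler that asks for a JSON response."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceting is switched on once, then one facet.field is sent per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical Solr location and field names, used only for illustration.
url = build_select_url(
    "http://localhost:8983/solr/collection1",
    "title:lucene",
    rows=5,
    facet_fields=["category"],
)
print(url)
```

Against a running Solr instance, the resulting URL could be fetched with any HTTP client (for example `urllib.request.urlopen(url)`), which is why the same API is usable from virtually any programming language.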
9. Enterprise Search: what and how.
“Enterprise search is the practice of
making content from multiple enterprise-type sources, such as databases and
intranets, searchable to a defined
audience”. [wikipedia]
Ingestion → Processing and analysis → Indexing → Query parsing → Matching
Ingestion: pull, push, integration, API, crawler, connector.
Processing and analysis: document types and formats (XML, HTML, Office, etc.) to plain text; stemming, lemmatization, synonym expansion, entity extraction, part-of-speech tagging, tokenization.
Indexing: dictionary of all unique words in the corpus; ranking; term frequency.
Query parsing: user query; faceting; paging.
Matching: query-index comparison; references to source documents.
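The stages above can be sketched as a toy in-memory pipeline. This is an illustration under simplifying assumptions, not how Lucene or Solr are implemented: analysis is just lowercasing plus tokenization, the index is a dictionary of unique terms with per-document term frequencies, and ranking is a plain sum of term frequencies (Lucene's real scoring is considerably more elaborate). The sample documents and function names are made up for the example.

```python
import re
from collections import Counter, defaultdict


def analyze(text):
    """Processing and analysis: lowercase the text and split it into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing: map each unique term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(analyze(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Query parsing and matching: rank documents by summed term frequency."""
    scores = Counter()
    for term in analyze(query):
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Ingestion: in a real system these would arrive via a crawler or connector.
docs = {
    "doc1": "Solr is a search server built on Lucene",
    "doc2": "Lucene is a Java search library",
    "doc3": "Jetty is a servlet container",
}
index = build_index(docs)
results = search(index, "lucene search")
print(results)
```

The list returned by `search` contains document ids only, which mirrors the "references to source documents" produced by the matching stage.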
10. Enterprise Search: what and how.
11. Enterprise Search: what and how.
● Crawler: an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (also called Web spider, ant, automatic indexer, or Web scutter).
● Precision/Recall: in pattern recognition and information retrieval, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.
● Stemming: the process of reducing inflected (or sometimes derived) words to their stem, base or root form (e.g. "fishing", "fished", and "fisher" to the root word "fish").
● Lemmatization: in linguistics, the process of grouping together the different inflected forms of a word so they can be analysed as a single item (e.g. the word "better" has "good" as its lemma).
● Named-entity recognition (entity extraction): a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
● Part of speech: a linguistic category of words (or more precisely lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question (e.g. noun and verb).
● Tokenization: the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
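Two of the definitions above lend themselves to a short worked example. The sketch below computes precision and recall from sets of document ids and performs a very simple tokenization; the document ids and the sample sentence are invented for illustration, and the regular-expression tokenizer is a deliberate simplification of what a real text analyzer does.

```python
import re


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: fraction of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


# Four documents were returned, two of which are among the three relevant ones.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
tokens = tokenize("Fishing, fished and fisher!")
print(p, r, tokens)
```

Here precision is 2/4 and recall is 2/3: returning more documents tends to raise recall at the expense of precision, which is the trade-off a search engine's analysis and ranking settings try to balance.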
12. Search and Open Source
13. Enterprise Search: products and vendors
Vendors of proprietary enterprise
search software
AskMeNow, Attivio, Concept Searching Limited,
Content Analyst Company LLC, Coveo, Dassault
Systèmes (acquired Exalead), Denodo,
Dieselpoint, Inc., dtSearch Corp., EMC Corp.,
Exorbyte GmbH, Expert System S.p.A., Exterro, Inc.,
Fabasoft, Funnelback, Google Search Appliance,
HP (acquired Autonomy Corporation which in
turn acquired Verity K2 and Ultraseek), IBM
(acquired Vivisimo), Inbenta, inter:gator
Enterprise Search, ISYS Search Software,
MarkLogic, Microsoft (includes Microsoft Search
Server, Fast Search & Transfer), Mindbreeze,
Neofonie (includes WeFind), Omniture (acquired
by Adobe Systems), Open Text Corporation,
Oracle Corporation (includes Secure Enterprise
Search and Endeca Technologies Inc.),
Perception Software, PolySpot, Q-go, Q-Sensei,
Recommind, SAP (includes SAP NetWeaver
Enterprise Search, Search Services in SAP
NetWeaver AS ABAP, and Search and
Classification TREX), Sinequa, SLI Systems,
Sophia Search Limited, TeraText, X1 Technologies,
Inc., ZyLAB Technologies, ZL Technologies
Free and open source
enterprise search software
Apache Solr, DataparkSearch,
ElasticSearch, ht://Dig,
Jumper 2.0, mnoGoSearch,
OpenSearchServer,
Searchdaimon, Sphinx
Vendors of open source
enterprise search software
30 Digits, Apache Software Foundation, Lucid Works,
Sematext, Flax
14. Open Source: they do it too.
15. Open Source
Open Standard
Interoperability
Innovation
Because Innovation = Bu$ine$$
OAGi OASIS
W3C IETF IEEE
ETSI Ecma OGF
IEC ISO ITU
CENELEC CEN
BSI UNI CEI
DKE DIN
AFNOR GIETS
LDTI
17. Solr features
● Advanced Full-Text Search Capabilities
● Optimized for High Volume Web Traffic
● Standards Based Open Interfaces: XML, JSON and HTTP
● Comprehensive HTML Administration Interfaces
● Server statistics exposed over JMX for monitoring
● Linearly scalable, auto index replication, auto failover and recovery
● A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
● Powerful Extensions to the Lucene Query Language
● Faceted Search and Filtering
● Geospatial Search with support for multiple points per document and geo polygons
● Advanced, Configurable Text Analysis
● Highly Configurable and User Extensible Caching
● Performance Optimizations
● External Configuration via XML
● An AJAX based administration interface
● Monitorable Logging
● Fast near real-time incremental indexing and index replication
● Highly Scalable Distributed search with sharded index across multiple hosts
● JSON, XML, CSV/delimited-text, and binary update formats
● Easy ways to pull in data from databases and XML files from local disk and HTTP sources
● Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
● Near Real-time indexing
● Flexible and Adaptable with XML configuration
● Extensible Plugin Architecture
● Apache UIMA integration for configurable metadata extraction
● Multiple search indices
Related Projects: Apache Hadoop, Apache ManifoldCF, Apache Lucene.Net, Apache Lucy, Apache Mahout, Apache Nutch, Apache OpenNLP, Apache Tika, Apache ZooKeeper
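Of the features listed above, "Faceted Search and Filtering" is perhaps the easiest to illustrate without a server. The following sketch mimics the idea in plain Python: count how many documents fall under each value of a field, then narrow the result set by one of those values. It is a conceptual illustration only, not Solr's implementation, and the documents and the field names `format` and `year` are hypothetical.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count documents per value of `field`, most frequent value first."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is not None:
            counts[value] += 1
    return counts.most_common()


def filter_by(docs, field, value):
    """Keep only the documents whose `field` equals `value` (a facet filter)."""
    return [doc for doc in docs if doc.get(field) == value]


# Hypothetical result set; a real one would come back from a search query.
docs = [
    {"id": 1, "format": "pdf", "year": 2012},
    {"id": 2, "format": "word", "year": 2013},
    {"id": 3, "format": "pdf", "year": 2013},
    {"id": 4, "format": "pdf", "year": 2013},
]

by_format = facet_counts(docs, "format")
pdf_only = filter_by(docs, "format", "pdf")
by_year_within_pdf = facet_counts(pdf_only, "year")
print(by_format, by_year_within_pdf)
```

This is the guided-navigation pattern: the counts are shown next to the results, and clicking a value applies the corresponding filter and recomputes the remaining facets.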
18. Search, already a 'commodity'
Search is Everywhere! Keyword search is a commodity
Holistic view of the data and the users is critical
Scalable Search, Discovery and Analytics are the key to unlocking this view of users and data
Documents
Content
Relationships
User interaction
Access
Traditional
• Fast, fuzzy text matching across a large document collection
• De-normalized data, “light” relational
• Top N problems
• Key-value (top 1)
• Recommendations
• “Good enough” classification, clustering
• Faceting, slicing and dicing of enumerated data
• Spatial, spell checking, record linkage, highlighting
And:
● eCommerce
● Search + Recs + Analysis of users
● Knowledge Management
● Financial, transportation, pharma
● Fraud detection
● Social media
● Trend monitoring
● Information technology
● Log monitoring, analysis
● Healthcare
● DNA Analysis
• NoSQL