Solr Masterclass Bangkok, June 2014

Apache Solr
Masterclass
From zero to hero
June 2014
www.slideshare.net/arafalov/solr-masterclass-bangkok-june-2014

Alexandre Rafalovitch
www.outerthoughts.com

Web search engines are quite sophisticated

But the real search needs are much DEEPER and BROADER

Searching people and companies

Understanding full-text search
SELECT *  
FROM database 
WHERE field LIKE '%word%'
This DOES NOT scale
Instead:
break text into tokens
domain-specific processing (e.g. lower-casing)
build fast-access structures
algorithms for term, phrase, and proximity search
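The steps on this slide can be sketched in a few lines of Python. This is only a toy illustration of the idea (tokenize, lower-case, build a positional inverted index, then answer term and phrase queries from it); it is not how Lucene is implemented, and the function names and sample documents are made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Break text into lower-cased word tokens (a very simple analysis step)."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map each token to {doc_id: [positions]} so lookups avoid scanning every document."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index

def search_term(index, word):
    """Return the sorted ids of documents containing the word."""
    return sorted(index.get(word.lower(), {}))

def search_phrase(index, phrase):
    """Return ids of documents where the phrase tokens sit at consecutive positions."""
    tokens = tokenize(phrase)
    if not tokens:
        return []
    hits = []
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            if all(start + offset in index.get(token, {}).get(doc_id, [])
                   for offset, token in enumerate(tokens)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical sample documents, keyed by id.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library used by Solr",
    3: "Databases can search with LIKE",
}
index = build_index(docs)
print(search_term(index, "Solr"))
print(search_phrase(index, "search library"))
```

Unlike the LIKE '%word%' query, a lookup here touches only the entry for the requested token, which is why this kind of structure keeps working as the collection grows.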

Basic search engine features
Search (Duh!): keyword, phrase, field-specific
Positive and negative terms
Sort: relevancy, recency
Pagination
Compact summary in results
SPEED

Advanced search engine features
Facet/taxonomy-based navigation with live counts
Language-specific processing
Domain-specific text processing (WiFi = Wi-Fi = WIFI)
Geographic search
More-like-this, did-you-mean, autocomplete
Scaling/Clustering
NOT web crawling - different, but related

Search engine solutions?
Solr
Elasticsearch
Xapian
Sphinx
Groonga
Searchdaimon
{F}lexSearch
Algolia (SaaS)
Searchify (SaaS)
ForageJS
Lunr.js
FACT-Finder
dtSearch
MarkLogic
Verity
Fast
Most databases
…AND MORE

Open Source Search Evolution
[Figure: used with permission from SemaText]

Secret Ingredient - Lucene
Solr
Elasticsearch
SwiftType
Galene (LinkedIn’s)
PyLucene (Python wrapper)
Lucene.NET (C# port)
Scalable, high-performance indexing
Incremental indexing
Full-text search
Information-retrieval algorithms
Implemented in Java
Written in 1999, still going strong

Secret Ingredient - Solr
Certified distributions: LucidWorks, HelioSearch
Big Data platforms: Cloudera, Hortonworks HDP
Hosted and SaaS: Amazon CloudSearch, WebSolr, SolrHQ, SearchBox
Lucene full-text search
XML and REST config
Schema/Schemaless
SolrCloud (clustering)
Caching
Near real-time
Rich-document indexing (Tika inside)
Plugins, components, processors

Solr Ecosystem sample
Drupal
Project Blacklight
LuxDB
SolrMeter
CrafterCMS
Typo3
Magento
HippoCMS
ColdFusion
SolrNet
DataStax
Dovecot
NGData Lily
Basho Riak
YaCy
Apache ManifoldCF
Apache Camel
Franz AllegroGraph
BitNami Solr Stack
Carrot2
Broadleaf Commerce
Cloudera CDK
CodeLibs Fess (フェス)
Splunk
Alfresco
Rosette by BasisTech
Luwak by Flax
Quepid by OSC
TwigKit
SPM by SemaText
SILK by LucidWorks
Banana (O/S Solr Kibana)

DEMO - Basic
Unzip
Go to example directory
Run Solr
Import some documents from example docs
grep -l store *.xml | xargs ./post.sh
Show off Solr 4 admin panel

DEMO - Browse handler
Restart Solr with -Dsolr.clustering.enabled=true
Visit http://localhost:8983/solr/browse/
Show off
Search
Facets - Categories and Ranges
Spatial/Geo-distance
Clusters

Start for free
Download, unzip, cd example; java -jar start.jar
Go through basic tutorial in docs/tutorial.html
Copy example directory, modify schema.xml until happy
If coming from Elasticsearch, look at example-schemaless
Do NOT follow this path to production
Example schema is a kitchen sink! Read it as a story.
<solr>/example/solr/collection1/conf/{schema.xml|solrconfig.xml}

Simplest Solr - directory layout
solr-home - point here with -Dsolr.solr.home
  collection1 - default collection name, without solr.xml
    conf - configuration directory for the collection
      schema.xml - defines fields and types
      solrconfig.xml - defines low-level configuration but also components, handlers, and chains for UpdateRequestProcessor

Simplest Solr - schema.xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema version="1.5" name="simplest-solr">
  <fieldType name="string" class="solr.StrField"/>

  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

  <uniqueKey>id</uniqueKey>
</schema>
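One quick way to sanity-check a schema this small before starting Solr is to parse it and list what it declares. The sketch below uses only Python's standard library; the function name and the summary layout are illustrative and not part of Solr, and it checks XML well-formedness only, not whether Solr will accept the schema.

```python
import xml.etree.ElementTree as ET

# The schema from the slide, inlined as bytes so the XML declaration parses cleanly.
SCHEMA_XML = b"""<?xml version="1.0" encoding="UTF-8" ?>
<schema version="1.5" name="simplest-solr">
  <fieldType name="string" class="solr.StrField"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
  <uniqueKey>id</uniqueKey>
</schema>
"""

def summarize_schema(xml_bytes):
    """Return the names of the types and fields a schema.xml declares."""
    root = ET.fromstring(xml_bytes)
    return {
        "name": root.get("name"),
        "types": [t.get("name") for t in root.findall("fieldType")],
        "fields": [f.get("name") for f in root.findall("field")],
        "dynamic_fields": [d.get("name") for d in root.findall("dynamicField")],
        "unique_key": root.findtext("uniqueKey"),
    }

summary = summarize_schema(SCHEMA_XML)
print(summary)
```

The summary makes the design of the minimal schema visible: one type, one required id field, and a catch-all dynamic field that accepts everything else as multi-valued strings.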

Simplest Solr - solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_4_9</luceneMatchVersion>
  <requestDispatcher handleSelect="false">
    <httpCaching never304="true" />
  </requestDispatcher>
  <requestHandler name="/select" class="solr.SearchHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
  <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
</config>

DEMO
https://github.com/arafalov/simplest-solr-config
java -Dsolr.solr.home=…./simplest-solr
Go to <solr>/example/exampledocs
grep -l store *.xml | xargs ./post.sh (same, same)
Check Admin UI
Query - same, but different (multivalue, date)
Schema browser

Lots of things missing
Some admin UI items disabled (Ping, Files)
No Near-Real-Time or atomic/partial update
No types (apart from String)
No dynamic schema
No SolrCloud
DOES NOT MATTER. NOT YET!

Two ways of learning
You can follow a path (going forward)
A tutorial
A book
Learn what it teaches
You can reach for the goal (going backwards)
Have an idea
Try to achieve it
Learn what’s on the critical path
Both are valuable. The second is harder, but gives you more.

Goal-driven Solr
1. Start with the simplest configuration that works
2. Get something in (import data)
3. Get something out (display data)
4. Celebrate!
5. Decide/Fine-tune what/how you want to find things
6. Change the schema to match
7. Change the import/display to match
8. GOTO 5 (never really stops)

Getting data in
curl
post.jar (in example/exampledocs); try "java -jar post.jar -h" for help
Admin UI (core/Documents)
Clients (SolrJ, among 33 at various levels of support: https://leanpub.com/solr-clients/)
Formats: XML, JSON, CSV, other formats (processed with Tika)
DataImportHandler to pull data from external sources
Big Data connectors (Hadoop, Flume, etc.)
Big Data integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS)
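As a rough sketch of the JSON route, the snippet below prepares an HTTP POST for the /update handler using only Python's standard library. The host, port, core name (collection1), the commit=true parameter choice, and the sample document are assumptions for illustration; the request is built but deliberately not sent, so it runs without a Solr instance.

```python
import json
import urllib.request

# Assumed local Solr 4 instance with the default core; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"

def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Prepare a POST that sends a JSON array of documents to the update handler."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical document; with the catch-all dynamicField any field name is accepted.
docs = [{"id": "book-1", "title": "Solr Masterclass notes", "cat": ["search", "training"]}]
request = build_update_request(docs)

# To actually index the document (requires a running Solr):
# urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```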

Getting data out
curl
Web browser
Admin UI (core/Query)
Clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV)
UI toolkits (Cloudera HUE, TwigKit)
Internal post-processors (we saw VelocityResponseWriter at /browse)
Needs middleware or a strong proxy in front - not secure otherwise
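Whatever client is used, getting data out comes down to an HTTP GET against the /select handler. The sketch below only assembles such a URL with common parameters (q, start, rows, wt); the host, core name, and the example query are assumptions, and nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed local Solr 4 instance with the default core.
SELECT_BASE = "http://localhost:8983/solr/collection1/select"

def build_select_url(query, start=0, rows=10, base=SELECT_BASE):
    """Build a paginated search URL that asks for a JSON response."""
    params = {"q": query, "start": start, "rows": rows, "wt": "json"}
    return base + "?" + urlencode(params)

url = build_select_url("cat:electronics", rows=5)
print(url)
```

Pasting the printed URL into a browser or passing it to curl is the quickest way to see a raw response while a proper front end is still missing.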

Celebrate!
You achieved a basic end-to-end test
You got Solr running
You figured out how to display it
You now know where the issues are
FIX THOSE NEXT

Fine-tune schema
Solr is not friends with your data; it’s here to get your documents found.
<field name="features" stored="true" indexed="true" type="text_general" multiValued="true"/>
stored=true - that’s for you
indexed=true - that’s for Solr, where the magic happens
type="type_name" - defines which analyzer chain to use
See Admin UI core/Analysis
See http://www.solr-start.com/info/analyzers/ for full list

Analyzers - English
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    ….
  </analyzer>
  ….
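To make the chain idea concrete, here is a toy Python imitation of the steps listed above: tokenize, drop stop words, lower-case, strip possessives, protect some words, stem. Every function is a simplified stand-in, not the Lucene implementation. In particular the stemmer is a crude suffix stripper rather than the Porter algorithm, and the stop word and protected word sets are invented substitutes for lang/stopwords_en.txt and protwords.txt.

```python
import re

# Invented stand-ins for lang/stopwords_en.txt and protwords.txt.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is"}
PROTECTED = {"news"}

def tokenize(text):
    """Rough stand-in for the tokenizer: keep runs of letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def remove_stopwords(tokens):
    """Drop stop words, ignoring case (mirrors ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def strip_possessive(tokens):
    """Remove a trailing 's from each token."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]

def toy_stem(tokens):
    """Crude suffix stripping that skips protected words. NOT the Porter algorithm."""
    stemmed = []
    for t in tokens:
        if t in PROTECTED:
            stemmed.append(t)
        elif t.endswith("ing") and len(t) > 5:
            stemmed.append(t[:-3])
        elif t.endswith("s") and not t.endswith("ss") and len(t) > 3:
            stemmed.append(t[:-1])
        else:
            stemmed.append(t)
    return stemmed

def analyze(text):
    """Run the steps in the same order as the filters in the fieldType."""
    tokens = tokenize(text)
    for step in (remove_stopwords, lowercase, strip_possessive, toy_stem):
        tokens = step(tokens)
    return tokens

print(analyze("The Engine's searching of the News feeds"))
```

The order matters, just as it does in the XML: each step sees the output of the previous one, so the possessive stripping here relies on tokens already being lower-cased.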

Analyzers - Persian
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PersianCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt" />
  </analyzer>
</fieldType>

copyField FTW
<copyField source="cat" dest="text"/>
<copyField source="*_t" dest="text" maxChars="3000"/>
Indexing book authors
"Schildt, Herbert; Wolpert, Lewis; Davies, P."
For searching: Tokenized, case-folded, punctuation-stripped:
schildt / herbert / wolpert / lewis / davies / p
For sorting: Untokenized, case-folded, punctuation-stripped:
schildt herbert wolpert lewis davies p
For faceting: Primary author only, using a solr.StrField:
Schildt, Herbert
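The three representations of the author field can be reproduced with a short Python sketch. In Solr each of them would come from a separate field with its own type, filled via copyField; the helper functions below are hypothetical and only show what each destination field ends up holding.

```python
import re

def search_tokens(authors):
    """Tokenized, case-folded, punctuation-stripped form used for searching."""
    return re.findall(r"\w+", authors.lower())

def sort_key(authors):
    """Single untokenized, case-folded, punctuation-stripped value used for sorting."""
    return " ".join(search_tokens(authors))

def primary_author(authors):
    """Unmodified first author, suitable for a string field used in faceting."""
    return authors.split(";")[0].strip()

raw = "Schildt, Herbert; Wolpert, Lewis; Davies, P."
print(" / ".join(search_tokens(raw)))
print(sort_key(raw))
print(primary_author(raw))
```

The point of the slide is that one source value feeds several purpose-built fields, so the same input is analyzed differently depending on whether it is searched, sorted, or faceted.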

Fine-tune search
Default query parser supports Lucene search syntax:
text +compulsory -negated field:value
uses default field or explicit field
not very good for complex analysis
eDisMax supports that plus searching across many fields
Many more specialised types: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
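As an illustration of searching across many fields with eDisMax, the sketch below assembles the request parameters: defType selects the parser and qf lists the fields to search with optional boosts. The field names, boost values, and query text are made up for the example.

```python
from urllib.parse import urlencode, parse_qs

def edismax_params(user_query, field_boosts):
    """Encode an eDisMax request that searches several fields with per-field boosts."""
    qf = " ".join(f"{name}^{boost}" for name, boost in field_boosts.items())
    return urlencode({"defType": "edismax", "q": user_query, "qf": qf})

# Hypothetical fields: matches in "name" count twice as much as matches in "features".
query_string = edismax_params("wifi +router -used", {"name": 2, "features": 1})
print(query_string)
```

The resulting string would be appended to a /select URL; the user's text keeps its +compulsory and -negated markers, while the field list lives in qf instead of in the query itself.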

Fine-tune indexing
UpdateRequestProcessor
after you send your data to Solr
before it hits the schema
Deal with missing values, do pre-processing, identify languages; the secret to schemaless mode (see example-schemaless)
Defined in solrconfig.xml, search for updateRequestProcessorChain
Full list at: http://www.solr-start.com/info/update-request-processors/
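Conceptually, an update processor chain is a sequence of small steps that each receive a document and pass a possibly modified document on. The Python sketch below models only that idea; it is not the Solr API, and the three processors (including the naive language guess) are invented for illustration.

```python
def trim_strings(doc):
    """Strip surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}

def add_default_category(doc):
    """Fill in a missing value before the document reaches the schema."""
    doc.setdefault("cat", "uncategorized")
    return doc

def guess_language(doc):
    """Toy language identification: ASCII-only titles are labelled English."""
    doc["language"] = "en" if doc.get("title", "").isascii() else "unknown"
    return doc

def run_chain(doc, processors):
    """Pass a copy of the document through each processor in order."""
    doc = dict(doc)
    for processor in processors:
        doc = processor(doc)
    return doc

incoming = {"id": "1", "title": "  Solr basics "}
result = run_chain(incoming, [trim_strings, add_default_category, guess_language])
print(result)
```

In Solr the equivalent ordering is declared in an updateRequestProcessorChain in solrconfig.xml, and as with this sketch, later steps see the changes made by earlier ones.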

Fine-tune display
Sorting
Faceting - automatic taxonomy with counts (indexed value)
Highlighting
MoreLikeThis
Statistics
Grouping, Pivoting
Debug for troubleshooting
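Faceting boils down to counting how many matching documents carry each value of a field. The following sketch computes such counts over a small in-memory result set; the documents and the field name are invented, and Solr itself counts indexed values inside the index rather than looping over documents like this.

```python
from collections import Counter

def facet_counts(docs, field):
    """Count documents per field value, most frequent first."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]  # treat a single value like a one-element list
        counts.update(values)
    return counts.most_common()

# Hypothetical search results with a multi-valued "cat" field.
results = [
    {"id": "1", "cat": ["electronics", "camera"]},
    {"id": "2", "cat": ["electronics"]},
    {"id": "3", "cat": "camera"},
    {"id": "4", "cat": ["electronics", "memory"]},
    {"id": "5"},
]
print(facet_counts(results, "cat"))
```

Because the counts come from the values as indexed, a tokenized text field would produce one facet entry per token, which is why facet fields are usually plain string types.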

Documentation
Solr Wiki - old but still has a lot of information
Solr Reference Guide - new; online and downloadable
http://www.solr-start.com/ - my resources for learners
http://heliosearch.org/author/joel-bernstein/ - about new features

With Solr, how far can I go?
Cloudera (Big Data) has more than USD 1,000,000,000 in investments - opportunities?
8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)

First steps
Install Solr 4.9
Go through the tutorial - gives you the basics and an end-to-end test
Join the Slack chat (invitations are coming)
Tweet #SolrMasterclassBkk, @SolrStart, if you have space :-)
Attend breakout sessions
Choose your own adventure (next)

Path 1 - Solr indexing book
Great for first timers
Gets you from zero to comfortable
All examples are provided
If you are stuck, I will help you
Probably will not win you any prizes…
Do it for the skills

Path 2 - Your own dataset
Get it in at any cost
Get it displayed
Start iterating
Book a time slot to discuss your questions
Demo tips
Explain problem domain (what is your dataset)
Show how far you got
Discuss the challenges

Path 3 - Need a dataset
Index your favourite Git repository (e.g. Solr):
https://github.com/arafalov/git-to-solr
Your own WordPress blog export (with DataImportHandler)
Your own hard-drive
Demo tips
How far did you get
Concentrate on displaying something cool (statistics?)
Coolest Solr feature you found

Path 4 - A bigger challenge
Project Gutenberg (ask me for a copy of RDF dump)
World Cup matches data: http://worldcup.sfg.io/
Twitter feed (e.g. with Spring XD/Integration)
Your own photographs collection (Tika extracts metadata)

DEMO Rules
There are no rules
And the prizes are not terribly important
What we are looking for is learning
Make something new out of something old
Learn a new feature and show others
Learn, teach, share - everybody wins

Accelerate your learning
If you still feel like a beginner, buy my book - seriously. That’s what it’s for
All code/data is at: https://github.com/arafalov/solr-indexing-book
Buy Solr in Action - recently published and a great reference,
follow @ManningBooks for discounts
Use my www.solr-start.com resources and join the mailing list
(I’ll do that for you this time)
Join solr-user mailing list - full of advanced hackers
Watch Lucid Revolution videos for background
Start helping out on Stack Overflow #solr
Blog what you learned, tweet with #Solr

Other Search-related books
Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1
Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid
Enterprise Search by Martin White

