Solr Masterclass Bangkok, June 2014

Presentation given to a mostly Thai audience in Bangkok, June 2014.

Published in: Internet, Technology

Transcript of "Solr Masterclass Bangkok, June 2014"

  1. Apache Solr Masterclass: From zero to hero. June 2014. www.slideshare.net/arafalov/solr-masterclass-bangkok-june-2014
  2. Alexandre Rafalovitch, www.outerthoughts.com
  3. Web search engines are quite sophisticated
  4. (no text)
  5. But the real search needs are much DEEPER and BROADER
  6. Searching code
  7. Searching people and companies
  8. Searching products
  9. Searching library material
  10. Searching languages
  11. Understanding full-text search
      - SELECT * FROM database WHERE field LIKE '%word%'
      - This DOES NOT scale
      - Instead:
          - break text into tokens
          - domain-specific processing (e.g. lower-casing)
          - build fast-access structures
          - algorithms for term, phrase, and proximity search
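The "instead" steps on slide 11 can be sketched in a few lines of Python. This is only a toy illustration of the idea (tokenize, lower-case, build an inverted index, intersect postings); the function names and sample documents are made up for this example and are not how Lucene is implemented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into lower-cased word tokens (a crude stand-in for real analysis)."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build the fast-access structure: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample documents.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library",
    3: "A database is not a search engine",
}
index = build_index(docs)
print(sorted(search(index, "Search LIBRARY")))  # prints [2]
```

Unlike `LIKE '%word%'`, a lookup here touches only the postings of the query terms, not every row.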
  12. Basic search engine features
      - Search (Duh!): keyword, phrase, field-specific
      - Positive and negative terms
      - Sort: relevancy, recency
      - Pagination
      - Compact summary in results
      - SPEED
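As a rough sketch of how several of these features map onto a Solr request, the snippet below builds a `/select` URL with the standard `q`, `sort`, `start`, and `rows` parameters. The host, port, and `collection1` core match the stock example setup used elsewhere in this deck; the field names `name`, `cat`, and `price` are assumptions borrowed from the example schema and may not exist in your own.

```python
from urllib.parse import urlencode

# Assumed: a local example Solr with the default collection1 core.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def build_query_url(q, sort=None, page=0, page_size=10):
    """Build a select URL: q carries the terms, start/rows implement pagination."""
    params = {"q": q, "start": page * page_size, "rows": page_size, "wt": "json"}
    if sort:
        params["sort"] = sort  # without it, Solr sorts by relevancy (score)
    return SOLR_SELECT + "?" + urlencode(params)


# Phrase search on one field, one negative term, sorted by price, third page.
url = build_query_url('name:"hard drive" -cat:memory', sort="price asc", page=2)
print(url)
```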
  13. Advanced search engine features
      - Facets/taxonomy-based navigation with live counts
      - Language-specific processing
      - Domain-specific text processing (WiFi = Wi-Fi = WIFI)
      - Geographic search
      - More-like-this, did-you-mean, autocomplete
      - Scaling/Clustering
      - NOT web crawling - different, but related
  14. Search engine solutions?
      - Solr, Elasticsearch, Xapian, Sphinx, Groonga, Searchdaimon, {F}lexSearch, Algolia (SaaS), Searchify (SaaS), ForageJS, Lunr.js, FACT-Finder, DtSearch, MarkLogic, Verity, Fast, most databases
      - …AND MORE
  15. Open Source Search Evolution (used with permission from SemaText)
  16. Secret Ingredient - Lucene
      - Solr, Elasticsearch, SwiftType, Galene (LinkedIn’s), PyLucene (Python wrapper), Lucene.net (C# port)
      - Scalable, high-performance indexing
      - Incremental indexing
      - Full-text search
      - Information-Retrieval algorithms
      - Implemented in Java
      - Written in 1999, still going strong
  17. Secret Ingredient - Solr
      - Certified distributions: LucidWorks, HelioSearch
      - Big Data platforms: Cloudera, Hortonworks HDP
      - Hosted and SaaS: Amazon CloudSearch, WebSolr, SolrHQ, SearchBox
      - Lucene full-text search
      - XML and REST config
      - Schema/Schemaless
      - SolrCloud (clustering)
      - Caching
      - Near real-time
      - Rich-document indexing (Tika inside)
      - Plugins, components, processors
  18. Solr Ecosystem sample
      - Drupal, Project Blacklight, LuxDB, SolrMeter, CrafterCMS, Typo3, Magenta, HippoCMS, ColdFusion, SolrNet, DataStax, Dovecot, NGData Lily, Basho Riak, YaCy, Apache ManifoldCF, Apache Camel, Franz AllegroGraph, BitNami Solr Stack, Carrot2, Broadleaf Commerce, Cloudera CDK, CodeLibs Fess (フェス), Splunk, Alfresco, Rosette by BasisTech, Luwak by Flax, Quepid by OSC, TwigKit, SPM by SemaText, SILK by LucidWorks, Banana (O/S Solr Kibana)
  19. DEMO Time
  20. DEMO - Basic
      - Unzip
      - Go to example directory
      - Run Solr
      - Import some documents from example docs
      - grep -l store *.xml | xargs ./post.sh
      - Show off Solr 4 admin panel
  21. DEMO - Browse handler
      - Restart Solr with -Dsolr.clustering.enabled=true
      - Visit http://localhost:8983/solr/browse/
      - Show off:
          - Search
          - Facets - Categories and Ranges
          - Spatial/Geo-distance
          - Clusters
  22. Getting into Solr
  23. Start for free
      - Download, unzip, cd example; java -jar start.jar
      - Go through basic tutorial in docs/tutorial.html
      - Copy example directory, modify schema.xml until happy
      - If coming from Elasticsearch, look at example-schemaless
      - Do NOT follow this path to production
      - Example schema is a kitchen sink!!! Read it as a story.
      - <solr>/example/solr/collection1/conf/{schema.xml|solrconfig.xml}
  24. Simplest Solr - directory layout
      solr-home - point here with -Dsolr.solr.home
        collection1 - default collection name, without solr.xml
          conf - configuration directory for the collection
            schema.xml - defines fields and types
            solrconfig.xml - defines low-level configuration but also components, handlers, and chains for UpdateRequestProcessor
  25. Simplest Solr - schema.xml
      <?xml version="1.0" encoding="UTF-8" ?>
      <schema version="1.5" name="simplest-solr">
        <fieldType name="string" class="solr.StrField"/>

        <field name="id" type="string" indexed="true" stored="true" required="true"/>
        <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

        <uniqueKey>id</uniqueKey>
      </schema>
  26. Simplest Solr - solrconfig.xml
      <?xml version="1.0" encoding="UTF-8" ?>
      <config>
        <luceneMatchVersion>LUCENE_4_9</luceneMatchVersion>
        <requestDispatcher handleSelect="false">
          <httpCaching never304="true" />
        </requestDispatcher>
        <requestHandler name="/select" class="solr.SearchHandler" />
        <requestHandler name="/update" class="solr.UpdateRequestHandler" />
        <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
        <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
      </config>
  27. DEMO
      - https://github.com/arafalov/simplest-solr-config
      - java -Dsolr.solr.home=…./simplest-solr
      - Go to <solr>/example/exampledocs
      - grep -l store *.xml | xargs ./post.sh (same, same)
      - Check Admin UI
      - Query - same, but different (multivalue, date)
      - Schema browser
  28. Lots of things missing
      - Some admin UI items disabled (Ping, Files)
      - No Near-Real-Time or atomic/partial update
      - No types (apart from String)
      - No dynamic schema
      - No SolrCloud
      - DOES NOT MATTER. NOT YET!
  29. Two ways of learning
      - You can follow a path (going forward)
          - A tutorial
          - A book
          - Learn what it teaches
      - You can reach for the goal (going backwards)
          - Have an idea
          - Try to achieve it
          - Learn what’s on the critical path
      - Both are valuable. The second is harder, but gives you more.
  30. Goal-driven Solr
      1. Start with the simplest configuration that works
      2. Get something in (import data)
      3. Get something out (display data)
      4. Celebrate!
      5. Decide/Fine-tune what/how you want to find things
      6. Change the schema to match
      7. Change the import/display to match
      8. GOTO 5 (never really stops)
  31. Getting data in
      - curl
      - post.jar (in example/exampledocs); try “java -jar post.jar -h” for help
      - Admin UI (core/Documents)
      - Clients (SolrJ, among 33 at various levels of support: https://leanpub.com/solr-clients/)
      - Formats: XML, JSON, CSV, other formats (processed with Tika)
      - DataImportHandler to pull data from external sources
      - BigData connectors (Hadoop, Flume, etc.)
      - BigData integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS)
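For the JSON route, a minimal Python sketch of what curl or post.jar would send might look like this. It only builds the request; the URL assumes the stock local example with the `collection1` core, and the sample document and its field names are invented for illustration.

```python
import json
from urllib.request import Request

# Assumed: local example Solr; commit=true makes the documents searchable right away.
UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(docs):
    """Wrap a list of documents as a JSON POST for the /update handler."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document; only "id" is required by the simplest schema above.
request = build_update_request(
    [{"id": "book-1", "title": "Solr indexing", "cat": ["book", "search"]}]
)
print(request.get_method(), request.full_url)
# With Solr running, send it with: urllib.request.urlopen(request)
```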
  32. Getting data out
      - curl
      - Web browser
      - Admin UI (core/Query)
      - Clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV)
      - UI toolkits (Cloudera HUE, TwigKit)
      - Internal post-processors (we saw VelocityResponseWriter at /browse)
      - Needs middleware or strong proxy - not secure otherwise
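When a client asks for `wt=json`, the answer is a JSON object whose `response` section holds `numFound` and the matching `docs`. The sketch below parses a hand-written sample in that shape; the sample values are made up, and a real response carries more fields.

```python
import json

# Hand-written sample in the shape of a Solr JSON response (values are invented).
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 1},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "book-1", "title": ["Solr indexing"]},
      {"id": "book-2", "title": ["Search basics"]}
    ]
  }
}
"""


def summarize(raw):
    """Return the total hit count and the ids of the documents on this page."""
    result = json.loads(raw)["response"]
    return result["numFound"], [doc["id"] for doc in result["docs"]]


total, ids = summarize(SAMPLE_RESPONSE)
print(total, ids)  # prints 2 ['book-1', 'book-2']
```

In line with the last bullet on the slide, code like this belongs in middleware rather than in a browser talking to Solr directly.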
  33. Celebrate!
      - You achieved a basic end-to-end test
      - You got Solr running
      - You figured out how to display it
      - You now know where the issues are
      - FIX THOSE NEXT
  34. Fine-tune schema
      - Solr is not friends with your data, it’s here to get your documents found.
      - <field name="features" stored="true" indexed="true" type="text_general" multiValued="true"/>
      - stored=true - that’s for you
      - indexed=true - that’s for Solr, where the magic happens
      - type="type_name" - defines what analyser chain to use
      - See Admin UI core/Analysis
      - See http://www.solr-start.com/info/analyzers/ for full list
  35. Analyzers - English
      <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPossessiveFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
          <filter class="solr.PorterStemFilterFactory"/>
          ….
        </analyzer>
        ….
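To get a feel for what such a chain does to text, here is a toy Python imitation of the first few stages in the same order: tokenize, drop stop words ignoring case, lower-case, strip the possessive. It is an approximation for illustration only: the stop-word list is a tiny invented sample rather than stopwords_en.txt, and protected words and Porter stemming are left out.

```python
import re

# Tiny illustrative stop list; the real one lives in lang/stopwords_en.txt.
STOPWORDS = {"a", "an", "and", "the", "of", "is"}


def tokenize(text):
    """Rough stand-in for the tokenizer: words, keeping an inner apostrophe."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def stop_filter(tokens):
    """Drop stop words, comparing case-insensitively (ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def lower_case(tokens):
    return [t.lower() for t in tokens]


def english_possessive(tokens):
    """Strip a trailing 's, as in author's -> author."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def analyze(text):
    """Run the stages in the same order as the filters in the fieldType."""
    tokens = tokenize(text)
    for stage in (stop_filter, lower_case, english_possessive):
        tokens = stage(tokens)
    return tokens


print(analyze("The Author's Guide and the Engine"))  # prints ['author', 'guide', 'engine']
```

The Admin UI Analysis screen shows the real output of each stage for any field type.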
  36. Analyzers - Persian
      <fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <charFilter class="solr.PersianCharFilterFactory"/>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.ArabicNormalizationFilterFactory"/>
          <filter class="solr.PersianNormalizationFilterFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt" />
        </analyzer>
      </fieldType>
  37. copyField FTW
      - <copyField source="cat" dest="text"/>
      - <copyField source="*_t" dest="text" maxChars="3000"/>
      - Indexing book authors: “Schildt, Herbert; Wolpert, Lewis; Davies, P.”
      - For searching: tokenized, case-folded, punctuation-stripped: schildt / herbert / wolpert / lewis / davies / p
      - For sorting: untokenized, case-folded, punctuation-stripped: schildt herbert wolpert lewis davies p
      - For faceting: primary author only, using a solr.StrField: Schildt, Herbert
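The three author representations on this slide can be reproduced with a short Python sketch. In Solr each form would come from a separate field (fed by copyField or an update processor) with its own field type; the helper names below are invented, and plain string handling stands in for the analyzers.

```python
import re

AUTHORS = "Schildt, Herbert; Wolpert, Lewis; Davies, P. "


def for_searching(value):
    """Tokenized, case-folded, punctuation-stripped."""
    return re.findall(r"\w+", value.lower())


def for_sorting(value):
    """Untokenized: one case-folded, punctuation-stripped string."""
    return " ".join(for_searching(value))


def for_faceting(value):
    """Primary author only, kept exactly as written."""
    return value.split(";")[0].strip()


print(" / ".join(for_searching(AUTHORS)))  # schildt / herbert / wolpert / lewis / davies / p
print(for_sorting(AUTHORS))                # schildt herbert wolpert lewis davies p
print(for_faceting(AUTHORS))               # Schildt, Herbert
```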
  38. Fine-tune search
      - Default query parser supports Lucene search syntax:
          - text +compulsory -negated field:value
          - uses default field or explicit field
          - not very good for complex analysis
      - eDisMax supports that plus searching across many fields
      - Many more specialised types: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
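The difference between the two parsers shows up in the request parameters. A hedged sketch: `df`, `defType`, and `qf` are standard Solr parameters, while the field names `text` and `title` and the boost value are assumptions chosen for this example.

```python
from urllib.parse import urlencode


def lucene_params(q, default_field="text"):
    """Default parser: terms without a field prefix go to the single default field."""
    return {"q": q, "df": default_field}


def edismax_params(user_text, fields):
    """eDisMax: the same user text is searched across several (optionally boosted) fields."""
    return {"q": user_text, "defType": "edismax", "qf": " ".join(fields)}


print(urlencode(lucene_params("text +compulsory -negated field:value")))
print(urlencode(edismax_params("solr masterclass", ["title^2", "text"])))
```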
  39. Fine-tune indexing
      - UpdateRequestProcessor
          - after you send your data to Solr
          - before it hits the schema
      - Deal with missing values, do pre-processing, identify languages, secret to schemaless mode (see example-schemaless)
      - Defined in solrconfig.xml, search for updateRequestProcessorChain
      - Full list at: http://www.solr-start.com/info/update-request-processors/
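Conceptually, a processor chain is a sequence of small steps that each take a document and pass it on, changed or not. The Python below is only a mental model of that idea, not Solr's API: real processors are Java classes configured in solrconfig.xml, and every name and rule here (including the crude language guess) is invented for illustration.

```python
def trim_strings(doc):
    """Pre-processing: strip stray whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def add_default_category(doc):
    """Deal with a missing value by filling in a default."""
    doc.setdefault("cat", ["uncategorized"])
    return doc


def guess_language(doc):
    """Crude placeholder for language identification: ASCII-only titles count as English."""
    doc["language"] = "en" if doc.get("title", "").isascii() else "other"
    return doc


# Hypothetical chain; order matters, just as it does in updateRequestProcessorChain.
CHAIN = (trim_strings, add_default_category, guess_language)


def run_chain(doc, processors=CHAIN):
    """Pass the document through each processor before it would reach the schema."""
    for processor in processors:
        doc = processor(doc)
    return doc


print(run_chain({"id": "1", "title": "  Solr basics "}))
```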
  40. Fine-tune display
      - Sorting
      - Faceting - automatic taxonomy with counts (indexed value)
      - Highlighting
      - MoreLikeThis
      - Statistics
      - Grouping, Pivoting
      - Debug for troubleshooting
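Faceting is switched on with request parameters such as `facet=true` and `facet.field=cat`, and in the default JSON output the counts for a field come back as one flat list that alternates value and count. A small sketch of turning that list into pairs for display; the sample data and the `cat` field are invented.

```python
from urllib.parse import urlencode

# Standard facet parameters; the "cat" field is an assumption for this example.
FACET_PARAMS = urlencode({"q": "*:*", "facet": "true", "facet.field": "cat", "wt": "json"})

# Hand-written fragment in the shape of the facet section of a JSON response.
SAMPLE = {
    "facet_counts": {
        "facet_fields": {"cat": ["electronics", 12, "memory", 3, "book", 0]}
    }
}


def facet_pairs(response, field):
    """Turn the flat [value, count, value, count, ...] list into (value, count) pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    pairs = zip(flat[0::2], flat[1::2])
    return [(value, count) for value, count in pairs if count > 0]


print(FACET_PARAMS)
print(facet_pairs(SAMPLE, "cat"))  # prints [('electronics', 12), ('memory', 3)]
```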
  41. Documentation
      - Solr WIKI - old but still has a lot of information
      - Solr Reference Guide - new; online and downloadable
      - http://www.solr-start.com/ - my resources for learners
      - http://heliosearch.org/author/joel-bernstein/ - about new features
  42. With Solr, how far can I go?
      - Cloudera (BigData) has more than USD 1,000,000,000 in investments - opportunities?
      - 8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)
  43. Hackathon
  44. First steps
      - Install Solr 4.9
      - Go through the tutorial - gives you basics and end-to-end test
      - Join the Slack chat (invitations are coming)
      - Tweet #SolrMasterclassBkk, @SolrStart, if you have space :-)
      - Attend breakout sessions
      - Choose your own adventure (next)
  45. Path 1 - Solr indexing book
      - Great for first timers
      - Gets you from zero to comfortable
      - All examples are provided
      - If you are stuck, I will help you
      - Probably will not win you any prizes…
      - Do it for the skills
  46. Path 2 - Your own dataset
      - Get it in at any cost
      - Get it displayed
      - Start iterating
      - Book a time slot to discuss your questions
      - Demo tips
          - Explain problem domain (what is your dataset)
          - Show how far you got
          - Discuss the challenges
  47. Path 3 - Need a dataset
      - Index your favourite Git repository (e.g. Solr): https://github.com/arafalov/git-to-solr
      - Your own WordPress blog export (with DataImportHandler)
      - Your own hard-drive
      - Demo tips
          - How far did you get
          - Concentrate on displaying something cool (statistics?)
          - Coolest Solr feature you found
  48. Path 4 - A bigger challenge
      - Project Gutenberg (ask me for a copy of RDF dump)
      - WorldCup matches data: http://worldcup.sfg.io/
      - Twitter feed (e.g. with Spring XD/Integration)
      - Your own photographs collection (Tika extracts metadata)
  49. DEMO Rules
      - There are no rules
      - And the prizes are not terribly important
      - What we are looking for is learning
      - Make something new out of something old
      - Learn a new feature and show others
      - Learn, teach, share - everybody wins
  50. For later
  51. Accelerate your learning
      - If you still feel like a beginner, buy my book - seriously. That’s what it’s for
      - All code/data is at: https://github.com/arafalov/solr-indexing-book
      - Buy Solr in Action - recently published and a great reference; follow @ManningBooks for discounts
      - Use my www.solr-start.com resources and join the mailing list (I’ll do that for you this time)
      - Join the solr-user mailing list - full of advanced hackers
      - Watch Lucid Revolution videos for background
      - Start helping out on Stack Overflow #solr
      - Blog what you learned, tweet with #Solr
  52. Other Search-related books
      - Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1
      - Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid
      - Enterprise Search by Martin White
  53. Alexandre Rafalovitch, www.outerthoughts.com