Solr: 4 big features
1. APACHE SOLR
Four Big Features:
• Faceting
• Query auto-complete
• Geospatial
• Scaling
March 2014. Presented by David Smiley at the Boston Java Meetup Group
2. About David Smiley
➢ Software Engineer (14 years)
○ Search (5 years)
○ Java, Web, Spatial
➢ Part-time employed at MITRE
➢ Part-time search consultant
➢ Apache Lucene / Solr committer & PMC
➢ Published 1st book on Solr
➢ Presented at several conferences
➢ Taught several Solr classes
3. Faceting
• Do you know what I mean by “faceting”?
• AKA: faceted navigation, or parametric search
• Popular Apps:
• eBay, Amazon, and many e-commerce sites
• Apps I use that don’t use faceting but I wish they did:
• http://search.maven.org and all Maven repository software: Nexus, Artifactory, Archiva
• JIRA
• Compare this to: http://jirasearch.mikemccandless.com/
4. Faceted Navigation & Analytics
by example…
Notice the counts
Optionally start with a keyword search or filter
Extremely useful feature supported by very few platforms:
Solr, ElasticSearch, Sphinx, … (no DBs)
Credit: Trey Grainger; CareerBuilder
5. How to: Field Faceting
• Index setup: schema.xml:
<field name="category" type="string" />
<field name="manufacturer" type="string" />
• Facet search:
http://localhost:8983/solr/
collection1/
select?
q=*:*&
facet=true&
facet.field=category&
facet.field=manufacturer
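A minimal Python sketch of the same request, assuming the localhost URL and `collection1` core from the slide. The helper names `facet_url` and `pair_up` are illustrative, not part of Solr, and `wt=json` and `rows=0` are additions here so that only counts come back as JSON. With the default JSON settings, each facet field is returned as a flat list alternating value and count; the sample response below is made up to show that shape.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/collection1/select"  # URL from the slide

def facet_url(fields, q="*:*"):
    """Build a field-faceting request; facet.field may be repeated."""
    params = [("q", q), ("wt", "json"), ("rows", "0"), ("facet", "true")]
    params += [("facet.field", f) for f in fields]
    return SOLR_SELECT + "?" + urlencode(params)

def pair_up(flat):
    """Turn a flat [value, count, value, count, ...] list into (value, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))

url = facet_url(["category", "manufacturer"])

# Hypothetical fragment shaped like facet_counts.facet_fields in a response
sample = {"category": ["electronics", 12, "memory", 3]}
counts = pair_up(sample["category"])
print(url)
print(counts)
```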
6. How to: Numeric/Date Faceting
• Index setup: schema.xml:
<field name="timestamp" type="tdate" />
• Facet search:
http://localhost:8983/solr/
collection1/
select?
q=*:*&
facet=true&
facet.range=timestamp&
facet.range.start=NOW/YEAR-10YEAR&
facet.range.end=NOW/YEAR+1YEAR&
facet.range.gap=+1YEAR
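One practical catch with date math in a URL: a literal `+` in a query string decodes to a space, so it has to be sent as `%2B`. A small sketch, assuming the same localhost URL, that lets `urlencode` do the escaping and shows what happens when the `+` is left raw:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

params = {
    "q": "*:*",
    "facet": "true",
    "facet.range": "timestamp",
    "facet.range.start": "NOW/YEAR-10YEAR",
    "facet.range.end": "NOW/YEAR+1YEAR",
    "facet.range.gap": "+1YEAR",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)

# Round trip: what a server would see after decoding the escaped URL
decoded = parse_qs(urlsplit(url).query)

# The pitfall: an unescaped '+' arrives as a leading space
broken = parse_qs("facet.range.gap=+1YEAR")
print(url)
```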
7. Query Suggest / Autocomplete
If you aren’t doing this then
you really should!
8. Several Types
• Instant search
• Direct navigation to documents, usually by name/title/id, etc.
• Implement via edge n-grams or a Suggester
• Ex: iTunes, Netflix, …
• Query log completion
• Searches user queries you’ve captured & indexed
• Implement via edge n-grams or FreeTextSuggester
• Ex: Google
• Term completion
• Completes indexed words
• Implement via facet.prefix technique or a Suggester
• Facet / field value completion
• Ex: Mint.com
Not mutually exclusive!
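To make "edge n-grams" concrete, here is a pure-Python illustration of the idea, not Solr code: at index time each word is expanded into its leading substrings, so whatever the user has typed so far can be matched as one exact term. In Solr this expansion would be configured in the field type's analysis chain; the function names and sample titles are hypothetical.

```python
from collections import defaultdict

def edge_ngrams(term, min_len=1, max_len=10):
    """Return the leading substrings of term, shortest first."""
    term = term.lower()
    return [term[:n] for n in range(min_len, min(len(term), max_len) + 1)]

def build_index(titles):
    """Map every edge n-gram of every word to the titles containing that word."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            for gram in edge_ngrams(word):
                index[gram].add(title)
    return index

def complete(index, typed):
    """Look up the typed prefix as a single exact term."""
    return sorted(index.get(typed.lower(), ()))

titles = ["Solr in Action", "Solar Power Basics", "Lucene in Action"]
index = build_index(titles)
print(complete(index, "sol"))
```

The trade-off is a larger index in exchange for very cheap lookups at query time.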
9. Tools for Completing / Suggesting
• The Suggester
• A specialization of the spell-check Solr component
• 8 implementations to choose from! Different pros/cons
• Weighted? Analyzing? Infix? Highlight? Fuzzy? N-gram model?
• Faceting with facet.prefix
• Respects your current filters – don’t suggest a 0-result response
• Edge n-grams, with standard search
• Terms component
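A sketch of the facet.prefix technique as a request builder, assuming the `category` field and localhost URL from the faceting slides. The function name and the `manufacturer:acme` filter are hypothetical. Because the suggestions are facet values computed over the filtered result set, anything already excluded by `fq` is not suggested. Note that the prefix is matched against indexed terms, so on a plain string field it is case-sensitive.

```python
from urllib.parse import urlencode

def suggest_url(typed, field="category", filters=(), limit=10):
    """Build a request whose facet values are completions of `typed`."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),              # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", typed),    # keep only values starting with the typed text
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),    # never suggest a value with zero results
    ]
    params += [("fq", f) for f in filters]
    return "http://localhost:8983/solr/collection1/select?" + urlencode(params)

url = suggest_url("el", filters=["manufacturer:acme"])
print(url)
```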
11. Geospatial Features
• Lucene/Solr can index text, numbers, dates, and spatial data
• Features:
• Index latitude & longitude coordinates or any X Y pairs
• Index polygons or other geometry
• Query by point-radius, rectangle, or polygon geometry
• Including “IsWithin” vs “Intersects” vs “Contains” predicates
• 2d/flat Euclidean OR geodetic spherical world model
• Sort or relevancy-boost by distance to indexed points
The NoSQL solutions with the best spatial support are CouchDB, MongoDB, Solr, and ElasticSearch
12. How to: Spatial Filter & Sort
• Index setup:
schema.xml:
<field name="geo" type="location_rpt" />
• Index "latitude,longitude" in your document:
37.7752,-100.0232
• Filter & sort:
http://localhost:8983/solr/
collection1/
select?
q=*:*&
fq={!geofilt}&
sort=geodist() asc&
sfield=geo&
pt=45.15,-93.85&
d=5
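What the filter and sort above ask for, written as a brute-force Python sketch: keep documents within `d` kilometers of `pt`, nearest first. This is only an illustration of the semantics; Solr answers the query from its spatial index rather than by scanning every document. The function names and sample documents are hypothetical, and the Earth radius is an approximate mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geofilt_sorted(docs, pt, d_km):
    """Keep docs within d_km of pt and return their ids, nearest first."""
    lat, lon = pt
    hits = [(haversine_km(lat, lon, *doc["geo"]), doc["id"]) for doc in docs]
    return [doc_id for dist, doc_id in sorted(hits) if dist <= d_km]

docs = [
    {"id": "near", "geo": (45.16, -93.85)},
    {"id": "nearer", "geo": (45.151, -93.85)},
    {"id": "far", "geo": (37.7752, -100.0232)},
]
result = geofilt_sorted(docs, pt=(45.15, -93.85), d_km=5)
print(result)
```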
13. Cool Technology Under the Hood
• Grid / tile based recursive indexed structure using a prefix tree / trie indexing approach on standard Lucene inverted index
• Future:
• Precise indexed shapes
• Geodetic polygons
• Hilbert curve ordering
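Geohash is one well-known grid scheme of this kind, and it makes the prefix-tree idea easy to see: each extra character subdivides the previous cell, so a cell's code is a prefix of every cell inside it. The sketch below is a plain Python geohash encoder for illustration, not Lucene's implementation; which grid a given field type actually uses depends on its configuration. `cell_terms` is a hypothetical helper showing the kind of terms that could go into an inverted index.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, length):
    """Encode a point as a geohash string of the given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, out, use_lon = [], [], True  # bits alternate, longitude first
    while len(out) < length:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid  # keep the upper half
        else:
            bits.append(0)
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        if len(bits) == 5:  # every 5 bits become one base-32 character
            out.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(out)

def cell_terms(lat, lon, max_len):
    """All enclosing grid cells of a point, coarsest first."""
    full = geohash(lat, lon, max_len)
    return [full[:n] for n in range(1, max_len + 1)]

terms = cell_terms(57.64911, 10.40744, 3)
print(terms)
```

A query shape can then be covered by a handful of cells at mixed depths, and matching becomes ordinary term lookup.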
14. Scaling Solr
Solr’s mechanisms for scaling:
• Replication
• Eliminates single point of failure
• Reduces query load on any one node
• Backups
• Distributed-search (for sharded indexes)
• For large collections with many millions of documents
• SolrCloud
• Combines distributed-search and real-time replicated indexing
• Centrally manages configuration
• A higher-level logical API that manages lots of coordination underneath
• Advanced: doc routing, shard splitting, migration
15. Replication & Sharding
Illustrated with a metaphor of an encyclopedia at a library
26 Shards across, 3 Replicas down:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
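The encyclopedia metaphor as a toy Python simulation: entries are split into volumes by first letter (shards), each volume exists in several identical copies (replicas), and a search consults one copy of every volume and merges the answers. Sharding by letter is only the metaphor; a real cluster would typically distribute documents by hash. All names and data here are hypothetical.

```python
import random

def build_cluster(entries, replicas=3):
    """Shard entries by first letter; each shard holds `replicas` identical copies."""
    shards = {}
    for entry in entries:
        shards.setdefault(entry[0].upper(), []).append(entry)
    return {
        letter: [sorted(docs) for _ in range(replicas)]
        for letter, docs in shards.items()
    }

def search(cluster, word, rng=random):
    """Ask one replica of every shard, then merge the partial results."""
    hits = []
    for copies in cluster.values():
        replica = rng.choice(copies)  # any copy can answer, which spreads query load
        hits += [e for e in replica if word.lower() in e.lower()]
    return sorted(hits)

cluster = build_cluster(["Apple", "Apricot", "Banana", "Pineapple"])
found = search(cluster, "apple")
print(found)
```

Losing one copy of a volume costs nothing but capacity; losing all copies of a volume makes part of the collection unsearchable.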
17. More Advanced SolrCloud Features
• Document routing customization
• Answers: Which shard does a document belong in?
• Hash (i.e. random) distribution
• Or keep certain related documents together (ex: for same user)
• Helps scale when searching by a subset
• Or manage it yourself manually (ex: index by month)
• Shard splitting
• When your shard(s) get to be too big
• Live; no down-time
• Inter-collection document migration
• Copies a subset of one collection to another, possibly new collection
• Live; no down-time
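A simplified Python sketch of the document-routing idea: hash a routing key to pick the shard, and use a shared prefix before `!` in the id to keep related documents together. SolrCloud's composite-id routing uses the `!` separator, but it works with MurmurHash3 and per-shard hash ranges; the CRC32-modulo scheme below is a stand-in chosen to keep the example short, and `shard_for` is a hypothetical name.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Pick a shard from the routing key: the part before '!' if present, else the whole id."""
    route_key = doc_id.split("!", 1)[0]
    # Stand-in hash; a real cluster would not use CRC32 modulo shard count
    return zlib.crc32(route_key.encode("utf-8")) % num_shards

ids = ["alice!doc1", "alice!doc2", "bob!doc1", "doc42"]
placement = {doc_id: shard_for(doc_id, 4) for doc_id in ids}
print(placement)
```

A search restricted to one user's documents then only needs to visit the shard that user's prefix maps to.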
18. That’s all for now; thanks for coming!
Need Lucene/Solr guidance or custom development?
Contact me: dsmiley@apache.org
ETA: June 2014