© Copyright 2013
Open Source Search FTW!
Grant Ingersoll
Lucene/Solr Committer, Apache Software Foundation
CTO, LucidWorks
@gsingers
© 2013 LucidWorks
Preaching to the Converted!
• Embrace fuzziness!
• Search is a system building block
• If the algorithms fit, use them!
• Search use leads to search abuse
• Scoring features are everywhere
http://cheezburger.com/5243950080
Topics
• Quick Intro to Lucene and Solr
• What's new in Lucene and Solr 4.x?
- Lucene/Solr for Info Retrieval
• (Ab)Using Search Engine Tech. for Fun and Profit
Quick Intro to Lucene and Solr
Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search engine usages
- Object stores, Record linkage, Social, mobile -> web
• Open Dev. > Open Source
• "The Apache Way"
- Meritocracy – Those who do, decide!
• Always Be Testing
- Randomized system tests are all the rage
- http://vimeo.com/32087114
• Patches Welcome!
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
- Per segment
- FieldCache can be controlled to only load new segments
- Soft commit -- faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
- Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• String -> BytesRef
- Much improved data structure
- … means less memory and less garbage collection effort
Up and to the Right
• http://people.apache.org/~mikemccand/lucenebench/indexing.html
Lucene: Flexibility
• Flexible Index Formats
- New posting list codecs: Block, Simple Text, Append (HDFS..), etc.
- Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
- Decoupled from TF/IDF
- Built in alternatives include BM25 & DFR, and others
» http://en.wikipedia.org/wiki/Okapi_BM25
» http://terrier.org/docs/v3.5/dfr_description.html
- Add your own
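To make the BM25 alternative concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not Lucene's implementation: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2`, `b=0.75` (common defaults) are assumptions for illustration, and the IDF uses one common smoothed variant.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (hypothetical helper)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for this example).
docs = [
    "open source search".split(),
    "search engines rank documents".split(),
    "polar bears relax".split(),
]
scores = bm25_scores(["search"], docs)
```

Both matching documents contain "search" once, so the shorter one scores higher; that length normalization is the main thing BM25 adds over a plain TF/IDF product.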
FS(A|T)
• Keys:
- byte[] – write-once
- Linear-time build of minimal automata (n log n if input is not sorted)
- Compression
- Reverse lookups
- Weights (used for auto-suggest)
- Pluggable Algebra
• Uses:
- Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
- FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
- http://slidesha.re/vKtpVA
- http://bit.ly/Pkjyu0
- "Smaller Representation of Finite State Automata"
» Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
Solr 4: New Features
• Search/Faceting/Relevance
- New Relevance Function Queries (tf, df, others)
- Pivot Faceting
- Pseudo-join
- Improved Spatial (more later)
- Full support for Lucene Codecs, pluggable scoring
• Indexing
- New Update Processors, including scripting option
- Near real time
• Codec/Similarity support from Lucene 4
• Other
- New Admin UI
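Pivot faceting nests the counts for one field under each value of another. A minimal sketch of the idea in plain Python follows; the helper `pivot_facets`, the field names `category` and `format`, and the sample documents are all hypothetical, and this is not how Solr computes pivots internally.

```python
from collections import Counter, defaultdict


def pivot_facets(docs, outer, inner):
    """Count values of `inner` nested under each value of `outer` (hypothetical helper)."""
    nested = defaultdict(Counter)
    for doc in docs:
        nested[doc[outer]][doc[inner]] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {value: dict(counts) for value, counts in nested.items()}


# Made-up documents for illustration.
docs = [
    {"category": "book", "format": "ebook"},
    {"category": "book", "format": "print"},
    {"category": "book", "format": "ebook"},
    {"category": "music", "format": "cd"},
]
pivots = pivot_facets(docs, "category", "format")
```

Here `pivots["book"]` holds the per-format counts for books only, which is the "facet within a facet" view that a flat facet list cannot give.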
Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle
• Indexing:
- "geo":"43.17614,-90.57341"
- "geo":"Circle(4.56,1.23 d=0.0710)"
- "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
- fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
- fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
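The filter queries above need URL encoding before they go over HTTP. A small sketch of building such a request URL with the Python standard library; the base URL, the `collection1` core name, and the helper `spatial_filter_url` are assumptions, while the `geo` field and the rectangle come from the slide.

```python
from urllib.parse import urlencode


def spatial_filter_url(base_url, field, shape):
    """Build a select URL with an Intersects filter on `field` (hypothetical helper)."""
    params = {
        "q": "*:*",  # match everything, then filter spatially
        "fq": f'{field}:"Intersects({shape})"',
    }
    return base_url + "?" + urlencode(params)


# Assumed local Solr endpoint; adjust host and core name to your setup.
url = spatial_filter_url(
    "http://localhost:8983/solr/collection1/select",
    "geo",
    "-74.093 41.042 -69.347 44.558",
)
```

Letting `urlencode` handle the quotes, parentheses, and spaces avoids the hand-escaping mistakes that are easy to make with shape strings.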
Scaling Solr
• Distributed/sharded indexing & search
- Auto distributes updates and queries to appropriate shards
- Near Real Time (NRT) indexing capable
• Dynamically scalable
- New SolrCloud instances add indexing and query capacity
- Supports re-balancing
• Reliable
- No single point of failure
- Transactions logged
- Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
New in 4.4 (just released)
• HDFS-backed directory for storing index and transaction logs in Apache Hadoop
• New Core discovery capabilities
• Schemaless/External Schema/Field Guessing
• Schema APIs
• Add documents from the Admin UI
Hacking Search Engines for Fun and Profit
… Find your Keys, Store Your Content
• Lucene/Solr is a fast key-value store
- Bonus: search your values!
• NoSQL before NoSQL was cool
• Solr: distributed key/value
- Durable, Isolated, Redundant, Fast, Real-time
- Joins, Column Storage
• Solr or Tika + Lucene can index popular office formats
• Solr can back up/replicate and scale as content grows
… Find Love! Upsell! Cross-sell!
• Cross recommendation as search
- with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
- but not as a search engine for content
- more like a search engine for behavior
• See Ted Dunning's talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
- http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
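A rough sketch of "a search engine for behavior": each item is indexed with the items that co-occur with it in user histories, and a user's recent behavior becomes the query. Everything here (the helper names, the toy histories, raw co-occurrence counts as the score) is an assumption for illustration; real systems usually filter co-occurrences statistically before indexing them.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_indicator_index(histories):
    """Map each item to a Counter of items that co-occur with it in a history."""
    indicators = defaultdict(Counter)
    for items in histories:
        for a, b in permutations(set(items), 2):
            indicators[a][b] += 1
    return indicators


def recommend(indicators, recent, limit=3):
    """'Search' the indicator index using a user's recent behavior as the query."""
    scores = Counter()
    for item, co_items in indicators.items():
        if item in recent:
            continue  # do not recommend what the user already interacted with
        scores[item] = sum(co_items[r] for r in recent)
    return [item for item, score in scores.most_common(limit) if score > 0]


# Made-up click histories, one list per user.
histories = [
    ["keys", "tardis", "love"],
    ["keys", "tardis"],
    ["love", "polar bear"],
]
index = build_indicator_index(histories)
suggestions = recommend(index, {"keys"})
```

In a real search engine the indicator items would sit in a field on each item document and the scoring loop would be an ordinary ranked query against that field.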
… Avoid Delays
… Time travel?
• Leverage Solr's new spatial capabilities to index non-spatial data, such as time ranges
- Useful for Open Hours, Shifts, etc.
• Query using rectangle intersections
- q = shift:"Intersects(0 19 23 365)"
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
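One way to read the rectangle trick: index each time range as a 2-D point with x = start and y = end, then find overlapping ranges with a rectangle query. The sketch below assumes that reading, with values running from 0 to 365; the helper names and the sample shifts are hypothetical, and the rectangle order (min_x, min_y, max_x, max_y) mirrors the query on the slide.

```python
def shift_to_point(start, end):
    """Index a time range as a 2-D point: x = start, y = end."""
    return (start, end)


def overlap_query(query_start, query_end, min_value=0, max_value=365):
    """Rectangle (min_x, min_y, max_x, max_y) matching every range that overlaps the query.

    A range overlaps [query_start, query_end] when it starts no later than
    query_end (x <= query_end) and ends no earlier than query_start (y >= query_start).
    """
    return (min_value, query_start, query_end, max_value)


def intersects(point, rect):
    """True when the point lies inside the rectangle (bounds inclusive)."""
    x, y = point
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y


rect = overlap_query(19, 23)  # same numbers as Intersects(0 19 23 365)
# Made-up shifts: name -> (start, end)
shifts = {"early": (1, 18), "mid": (15, 20), "late": (24, 30)}
matches = [
    name
    for name, (start, end) in shifts.items()
    if intersects(shift_to_point(start, end), rect)
]
```

Only the "mid" shift overlaps 19 to 23: "early" ends too soon and "late" starts too late, which is exactly what the two rectangle bounds express.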
Boldly go forth and rank!
• Faster
• More Flexible
• Easier than ever scaling
• More reliable than ever
• Reduced cost of experimentation
Where to Next?
• Lucene/Solr EU Conference:
- Dublin, IE, November 4-7: http://lucenerevolution.org/
- CFP Open Now
• Lucene/Solr
- http://lucene.apache.org
- {java-user|solr-user}@lucene.apache.org
- SIGIR '12 Open Source Workshop
» http://opensearchlab.otago.ac.nz/paper_10.pdf
• LucidWorks
- http://www.lucidworks.com
- Commercial support, products, etc. for Lucene/Solr
• Me
- grant@lucidworks.com
- @gsingers on Twitter
- "Taming Text" – Engineer's guide to open source search and NLP
» http://www.manning.com/ingersoll
Credits
• All of the Lucene/Solr committers and contributors
• Polar bear: http://gaijinexplorer.blogspot.ie/2012/12/its-all-just-relaxing.html
• Volunteers: http://www.poconohealthsystem.org/?id=228&sid=1
• Not Hiring: http://naijaguardianjobs.com/wp-content/uploads/2013/03/Not-Hiring-The-American.jpg
• Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/
• Love: http://www.msruntheus.com/above-all-love-each-other-deeply/
• TARDIS: http://2.bp.blogspot.com/-ysN8JskY4WM/UEZNhBywQKI/AAAAAAAABdg/gXE0A9OO6Mk/s1600/13881_doctor_who.jpg

Editor's Notes

  • #3: Search abuse. Can discuss how I started out just doing free text, but then a curious thing happened: I started to see people using the engine for things like key/value stores, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage, and much, much more.
  • #5: What is Lucene? What is Solr?
  • #16: Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions.
  • #19: Oh, BTW, it can do search over the values. Keys can be anything, not just strings.