Next Generation Search with
Lucene and Solr 4
Grant Ingersoll
CTO, LucidWorks
Read More: http://ibm.co/1dJvL9k

Overview talk on Lucene and Solr 4 features, using search for alternative problems.

  • The bar is raised: when we first started Lucid, the problems were all about standing up Lucene or Solr or dealing with performance issues. Now the large majority are about taking search to the next level: better relevance, personalization, recommendations, etc.
  • Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions
  • Characteristics. Conflicts from other clients
  • Oh, BTW, it can do search over the values. Keys can be anything, not just strings

    1. Next Generation Search with Lucene and Solr 4
       Grant Ingersoll, CTO, LucidWorks
       Read More: http://ibm.co/1dJvL9k
       © Copyright 2013
    2. Search is Dead, Long Live Search
       • Search is Everywhere!
       • The Bar is Raised
       • Holistic view of the data AND the users is critical
       © 2013 LucidWorks
    3. Search is good for…
       • Classic: Fast, fuzzy text matching across a large document collection
       • NoSQL and De-normalized data
         - "light" relational
       • Top N problems
       • Faceting, slicing and dicing of numerical/enumerated data
       • Spatial, spell checking, record linkage, highlighting
    5. Lucene: Speed and Memory
       • Native Near Real Time (NRT) support
         - Per segment
         - FieldCache can be controlled to only load new segments
         - Soft commit: faster without fsync, allows quicker update visibility
       • DWPT (DocumentsWriterPerThread)
         - Faster, more consistent indexing speed
       • Faster fuzzy & wildcard query processing
       • String -> BytesRef
         - Much improved data structure
         - … means less memory and less garbage collection effort
    6. Up and to the Right
       • http://people.apache.org/~mikemccand/lucenebench/indexing.html
    7. Lucene: Flexibility
       • Flexible Index Formats
         - New posting list codecs: Block, Simple Text, Append (HDFS..), etc
         - Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks
       • Pluggable Scoring
         - Decoupled from TF/IDF
         - Built in alternatives include BM25 & DFR, and others
           » http://en.wikipedia.org/wiki/Okapi_BM25
           » http://terrier.org/docs/v3.5/dfr_description.html
         - Add your own
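
To make the "pluggable scoring" point on slide 7 concrete, here is a minimal sketch of the textbook Okapi BM25 score for a single term. It is not Lucene's Similarity code: the function names, the parameter defaults (k1=1.2, b=0.75) and the smoothed IDF variant are assumptions chosen for illustration.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Smoothed inverse document frequency (assumed variant, always positive)."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document with the classic BM25 formula.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document in tokens
    avg_doc_len -- average document length in the collection
    doc_count   -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    # Length normalisation: long documents need more occurrences to score the same.
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates instead of growing linearly as in plain TF/IDF.
    saturated_tf = tf * (k1 + 1.0) / (tf + length_norm)
    return bm25_idf(doc_count, doc_freq) * saturated_tf
```

The saturation term is the main practical difference from TF/IDF: each extra occurrence of a term adds less to the score than the one before it.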
    8. FS(A|T)
       • Keys:
         - byte[] – write-once
         - Linear time build of min. automata (nlogn if not sorted, which isn't our case)
         - Compression
         - Reverse lookups
         - Weights (used for auto-suggest)
         - Pluggable Algebra
       • Uses:
         - Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
         - FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
       • More:
         - http://slidesha.re/vKtpVA
         - http://bit.ly/Pkjyu0
         - "Smaller Representation of Finite State Automata"
           » Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
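
The "linear time build of min. automata" bullet on slide 8 relies on keys arriving in sorted order. The sketch below shows the general idea with the well-known incremental construction for sorted input (Daciuk et al.). It builds a plain acceptor (FSA) with no outputs or weights and is not Lucene's FST builder; the `State` class and all function names are hypothetical.

```python
class State:
    def __init__(self):
        self.edges = {}     # label -> State
        self.final = False  # True if a key ends here

    def signature(self):
        # Two states are equivalent if they agree on finality and on
        # (label, target) pairs; targets are already canonical when this is called.
        return (self.final, tuple((c, id(self.edges[c])) for c in sorted(self.edges)))


def build_minimal_fsa(sorted_keys):
    """Build a minimal acyclic automaton from unique keys in sorted order."""
    root = State()
    register = {}   # signature -> canonical state
    unchecked = []  # (parent, label, child) along the path of the previous key
    previous = None

    def minimize(depth):
        # Freeze states below `depth`, merging each with an equivalent one if known.
        while len(unchecked) > depth:
            parent, label, child = unchecked.pop()
            sig = child.signature()
            if sig in register:
                parent.edges[label] = register[sig]
            else:
                register[sig] = child

    for key in sorted_keys:
        if previous is not None and key <= previous:
            raise ValueError("keys must be unique and sorted")
        common = 0
        if previous:
            limit = min(len(key), len(previous))
            while common < limit and key[common] == previous[common]:
                common += 1
        minimize(common)  # the suffix of the previous key can no longer change
        node = unchecked[-1][2] if unchecked else root
        for label in key[common:]:
            child = State()
            node.edges[label] = child
            unchecked.append((node, label, child))
            node = child
        node.final = True
        previous = key
    minimize(0)
    return root


def accepts(root, key):
    node = root
    for label in key:
        node = node.edges.get(label)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)
```

For the keys tap, taps, top, tops the shared suffixes collapse, so the automaton needs 5 states where a plain trie would need 8. That suffix sharing is where the compression bullet comes from.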
    9. Recent Additions
       • Replication module
       • New Faceting capabilities
       • New Suggester to handle infix suggestions
       • Analysis Additions
         - Norwegian, Scandinavian alternatives
       • Memory and FST improvements
    11. Solr 4: New Features
        • Search/Faceting/Relevance
          - New Relevance Function Queries (tf, df, others)
          - Pivot Faceting
          - Pseudo-join
          - Improved Spatial (more later)
          - Full support for Lucene Codecs, pluggable scoring
        • Indexing
          - New Update Processors, including scripting option
          - Near real time
        • Codec/Similarity support from Lucene 4
        • Other
          - New Admin UI
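
As a rough illustration of three of the search features listed on slide 11, the sketch below only builds request parameters; it does not contact a server. The field names (cat, inStock, manu_id, text) and helper names are hypothetical. The parameter syntax (`facet.pivot`, the `{!join}` query parser, `termfreq`/`docfreq` function queries) should be checked against the reference guide for the Solr version in use.

```python
from urllib.parse import urlencode


def pivot_facet_params(fields, query="*:*"):
    """Parameters for a pivot (nested) facet over the given fields."""
    return {
        "q": query,
        "rows": 0,  # only the facet counts are wanted
        "facet": "true",
        "facet.pivot": ",".join(fields),
    }


def join_query(from_field, to_field, inner_query):
    """Pseudo-join: run inner_query, then match docs whose to_field equals
    a from_field value of the inner matches."""
    return "{!join from=%s to=%s}%s" % (from_field, to_field, inner_query)


def term_stats_fl(field, term):
    """Field list returning per-document term statistics as pseudo-fields."""
    return "id,tf:termfreq(%s,'%s'),df:docfreq(%s,'%s')" % (field, term, field, term)


if __name__ == "__main__":
    print(urlencode(pivot_facet_params(["cat", "inStock"])))
    print(urlencode({"q": join_query("manu_id", "id", "name:apple")}))
    print(urlencode({"q": "text:solr", "fl": term_stats_fl("text", "solr")}))
```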
    12. Geospatial improvements
        • Index shapes other than points (circles, polygons, etc)
        • More complex interactions than point in a circle using Well Known Text
        • Indexing:
          - "geo":"43.17614,-90.57341"
          - "geo":"Circle(4.56,1.23 d=0.0710)"
          - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
        • Searching:
          - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
          - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
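
The shape strings on slide 12 are easy to get wrong by hand, so here is a small sketch that generates them. It only formats strings in the syntax shown on the slide. The helper names are hypothetical, and whether a given Solr field type accepts each shape depends on the schema configuration.

```python
import json


def point(lat, lon):
    """Point in the "lat,lon" form used on the slide."""
    return "%s,%s" % (lat, lon)


def polygon(ring):
    """WKT polygon from a list of (x, y) pairs; the ring must be closed."""
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("polygon ring must be closed (first point == last point)")
    return "POLYGON((%s))" % ", ".join("%s %s" % (x, y) for x, y in ring)


def intersects_fq(field, shape):
    """Filter query matching documents whose shape intersects the given one.

    A bare "minX minY maxX maxY" string is treated as a rectangle, as on the slide.
    """
    return '%s:"Intersects(%s)"' % (field, shape)


if __name__ == "__main__":
    ring = [(-10, 30), (-40, 40), (-10, -20), (40, 20), (0, 0), (-10, 30)]
    doc = {"id": "1", "geo": point(43.17614, -90.57341)}  # "id" is an assumed key field
    print(json.dumps(doc))
    print(intersects_fq("geo", polygon(ring)))
```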
    13. Scaling Solr
        • Distributed/sharded indexing & search
          - Auto distributes updates and queries to appropriate shards
          - Near Real Time (NRT) indexing capable
        • Dynamically scalable
          - New SolrCloud instances add indexing and query capacity
          - Supports re-balancing
        • Reliable
          - No single point of failure
          - Transactions logged
          - Robust, automatic recovery
        • http://wiki.apache.org/solr/SolrCloud
    14. Solr as NoSQL
        • Characteristics
          - Non-traditional data stores
          - Not designed for SQL type queries
          - Distributed fault tolerant architecture
          - Document oriented, data format agnostic (JSON, XML, CSV, binary)
        • Update durability via transaction log
        • Real-time /get fetches latest version w/o hard commit
        • Versioning and optimistic locking
          - w/ Real Time GET, allows read/write/update w/o conflicts
        • Atomic updates
          - Can add/remove/change and increment a field in existing doc w/o re-indexing
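
To show what the atomic update and optimistic locking bullets on slide 14 look like on the wire, the sketch below builds a JSON update body using the `set`, `add` and `inc` modifiers and an optional `_version_` value. It only constructs the payload. The helper name and the example field names are hypothetical, and it assumes the body is posted to the collection's update handler. With a positive `_version_`, Solr is expected to reject the update if the stored version differs.

```python
import json


def atomic_update(doc_id, set_fields=None, add_fields=None, inc_fields=None, version=None):
    """Build a JSON body that modifies fields of an existing document.

    set_fields -- replace a field's value (a null value removes the field)
    add_fields -- append a value to a multi-valued field
    inc_fields -- increment a numeric field by the given amount
    version    -- expected _version_ for optimistic locking (assumed semantics:
                  the update fails with a conflict if it does not match)
    """
    doc = {"id": doc_id}
    for modifier, fields in (("set", set_fields), ("add", add_fields), ("inc", inc_fields)):
        for name, value in (fields or {}).items():
            doc[name] = {modifier: value}
    if version is not None:
        doc["_version_"] = version
    return json.dumps([doc])


if __name__ == "__main__":
    # Typical flow: read the document and its _version_ via real-time GET,
    # then send the update carrying that version.
    print(atomic_update("book-1", set_fields={"title": "Taming Text"},
                        inc_fields={"copies": 1}, version=1234567890))
```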
    15. Recent Additions
        • HDFS backed directory for storing index and transaction logs in Apache Hadoop
        • New Core discovery capabilities
        • Schemaless/External Schema/Field Guessing
        • Schema APIs
        • Add documents from the Admin UI
    16. Applications
    17. … Find your Keys, Store Your Content
        • Lucene/Solr is a fast key-value store
          - Bonus: search your values!
        • NoSQL before NoSQL was cool
        • Solr: distributed key/value
          - Durable, Isolated, Redundant, Fast, Real-time
          - Joins, Column Storage
        • Solr or Tika + Lucene can index popular office formats
        • Solr can backup/replicate and scale as content grows
    18. … Find Love! Upsell! Cross-sell!
        • Cross recommendation as search
          - with search used to build cross recommendation!
        • Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
        • (Ab)use of a search engine
          - but not as a search engine for content
          - more like a search engine for behavior
        • See Ted Dunning's talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
          - http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
        • Go get Mahout/Myrrix or just do it in y(our) search engine
    19. … Avoid Delays
    20. … Wibbly-wobbly Timey-wimey Stuff
        • Leverage Solr's new spatial capabilities to index non-spatial data, such as time ranges
          - Useful for Open Hours, Shifts, etc.
        • Query using rectangle intersections
          - q = shift:"Intersects(0 19 23 365)"
        • https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
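
One way to read the rectangle query on slide 20 is sketched below. It assumes each time range is indexed as a 2D point with x = start and y = end, inside a "world" of 0 to 365. That mapping is an assumption for illustration, not something the slide states; the linked talk has the actual setup. Under it, a range overlaps the query window [19, 23] exactly when start <= 23 and end >= 19, which is a rectangle test on the point.

```python
def overlap_rectangle(query_start, query_end, world_min=0, world_max=365):
    """Rectangle (min_x, min_y, max_x, max_y) containing every (start, end) point
    whose range overlaps [query_start, query_end]."""
    # start may be anything up to query_end; end may be anything from query_start on.
    return (world_min, query_start, query_end, world_max)


def point_in_rectangle(point, rect):
    """Plain-Python stand-in for the spatial Intersects test on a point."""
    x, y = point
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y


def intersects_query(field, rect):
    """Format the rectangle in the query syntax shown on the slide."""
    return '%s:"Intersects(%d %d %d %d)"' % ((field,) + tuple(rect))


if __name__ == "__main__":
    shifts = {"early": (1, 10), "late": (20, 30), "long": (5, 200)}  # hypothetical data
    rect = overlap_rectangle(19, 23)
    print(intersects_query("shift", rect))
    print(sorted(name for name, p in shifts.items() if point_in_rectangle(p, rect)))
```

The appeal of the trick is that an overlap condition over two values becomes a single spatial filter the index already knows how to answer.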
    21. Summary
        • Lucene/Solr 4.x:
          - Faster
          - More Flexible
          - Easier than ever scaling
          - More reliable than ever
        • If you need to rank a bunch of stuff according to some notion of similarity, a search engine is the way to go
    22. Where to Next?
        • Full article: http://ibm.co/1dJvL9k
        • http://www.lucidworks.com
        • http://lucene.apache.org/
        • Training: http://bit.ly/lws-training
        • LucidWorks Search (Solr++) more info: http://bit.ly/lws-moreinfo
        • Twitter: @gsingers, @LucidWorks
        • Taming Text: http://www.manning.com/ingersoll