Open Source Search FTW

http://sigir2013.ie/industry_track.html#GrantIngersoll
Abstract: Apache Lucene and Solr are the most widely deployed search technology on the planet, powering sites like Twitter, Wikipedia, Zappos and countless applications across a large array of domains. They are also free, open source, extensible and extremely scalable. Lucene and Solr also contain a large number of features for solving common information retrieval problems ranging from pluggable posting list compression and scoring algorithms to faceting and spell checking. Increasingly, Lucene and Solr also are being (ab)used to power applications going way beyond the search box. In this talk, we'll explore the features and capabilities of Lucene and Solr 4.x, as well as look at how to (ab)use your search engine technology for fun and profit.

  • Search abuse: can discuss how I started out doing just free-text search, but then a curious thing happened. I started to see people using the engine for things like key/value storage, denormalized DBs, browsing engines, plagiarism detection, teaching languages, record linkage and much, much more.
  • What is Lucene? What is Solr?
  • Power users are often more likely to recover. Tools for recovery: auto-suggest, related searches, spelling suggestions.
  • Oh, BTW, it can do search over the values. Keys can be anything, not just strings.

    1. Open Source Search FTW! (© Copyright 2013)
       Grant Ingersoll: Lucene/Solr Committer, Apache Software Foundation; CTO, LucidWorks; @gsingers
    2. Preaching to the Converted! (© 2013 LucidWorks)
       • Embrace fuzziness!
       • Search is a system building block
       • If the algorithms fit, use them!
       • Search use leads to search abuse
       • Scoring features are everywhere
       http://cheezburger.com/5243950080
    3. Topics
       • Quick Intro to Lucene and Solr
       • What's new in Lucene and Solr 4.x?
         - Lucene/Solr for Info Retrieval
       • (Ab)Using Search Engine Tech. for Fun and Profit
    4. Quick Intro to Lucene and Solr
    5. Relax, You're Among Friends
       • Large, diverse search community with many non-traditional search engine usages
         - Object stores, Record linkage, Social, mobile -> web
       • Open Dev. > Open Source
       • "The Apache Way"
         - Meritocracy: those who do, decide!
       • Always Be Testing
         - Randomized system tests are all the rage
         - http://vimeo.com/32087114
       • Patches Welcome!
    8. Lucene: Speed and Memory
       • Native Near Real Time (NRT) support
         - Per segment
         - FieldCache can be controlled to only load new segments
         - Soft commit: faster because it skips the fsync, so updates become visible sooner
       • DWPT (DocumentsWriterPerThread)
         - Faster, more consistent indexing speed
       • Faster fuzzy & wildcard query processing
       • String -> BytesRef
         - Much improved data structure
         - … means less memory and less garbage collection effort
    9. Up and to the Right
       • http://people.apache.org/~mikemccand/lucenebench/indexing.html
    10. Lucene: Flexibility
        • Flexible Index Formats
          - New posting list codecs: Block, Simple Text, Append (HDFS..), etc
          - Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks
        • Pluggable Scoring
          - Decoupled from TF/IDF
          - Built in alternatives include BM25 & DFR, and others
            » http://en.wikipedia.org/wiki/Okapi_BM25
            » http://terrier.org/docs/v3.5/dfr_description.html
          - Add your own
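To make the pluggable scoring bullet concrete, here is a minimal sketch of the textbook Okapi BM25 score for a single query term in a single document. It is not Lucene's Similarity API; the function name `bm25_term_score` and the defaults `k1=1.2`, `b=0.75` are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one term in one document (illustrative only)."""
    # Inverse document frequency: rare terms weigh more.
    # The 1.0 inside the log keeps the value non-negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

Swapping this function for another (DFR, plain TF/IDF, or your own) without touching the index is the kind of experiment that decoupled scoring is meant to allow.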
    11. FS(A|T)
        • Keys:
          - byte[], write-once
          - Linear time build of min. automata (n log n if not sorted)
          - Compression
          - Reverse lookups
          - Weights (used for auto-suggest)
          - Pluggable Algebra
        • Uses:
          - Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
          - FuzzyQuery is 100x faster: http://bit.ly/hgO65c
        • More:
          - http://slidesha.re/vKtpVA
          - http://bit.ly/Pkjyu0
          - "Smaller Representation of Finite State Automata"
            » Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
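A fuzzy query matches dictionary terms within a small edit distance of the query term. The sketch below shows only what "within edit distance" means, using plain dynamic programming over every term; it is deliberately not the automaton-based approach the slide refers to, which avoids scanning the whole dictionary. The names `edit_distance` and `fuzzy_match` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_match(term, dictionary, max_edits=2):
    """Return the dictionary terms within max_edits of term (brute force)."""
    return [t for t in dictionary if edit_distance(term, t) <= max_edits]
```

For example, `fuzzy_match("lucene", ["lucent", "lucerne", "solr"])` keeps the first two terms and drops `"solr"`.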
    13. Solr 4: New Features
        • Search/Faceting/Relevance
          - New Relevance Function Queries (tf, df, others)
          - Pivot Faceting
          - Pseudo-join
          - Improved Spatial (more later)
          - Full support for Lucene Codecs, pluggable scoring
        • Indexing
          - New Update Processors, including scripting option
          - Near real time
        • Codec/Similarity support from Lucene 4
        • Other
          - New Admin UI
    14. Geospatial improvements
        • Index shapes other than points (circles, polygons, etc)
        • More complex interactions than point in a circle
        • Indexing:
          - "geo":"43.17614,-90.57341"
          - "geo":"Circle(4.56,1.23 d=0.0710)"
          - "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
        • Searching:
          - fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
          - fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
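The `fq` values on this slide contain quotes, parentheses and spaces, so they need URL encoding before they go on the wire. A minimal sketch of building such a request URL with the standard library follows; the host, port and collection name in `SOLR_SELECT` are assumptions, as are the helper names, and the field name `geo` is taken from the slide.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host and collection for a real deployment.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def intersects_filter(field, shape):
    """Build a filter clause of the form field:"Intersects(shape)"."""
    return '{0}:"Intersects({1})"'.format(field, shape)


def build_query_url(q, fq):
    """URL-encode the main query and one filter query."""
    return SOLR_SELECT + "?" + urlencode({"q": q, "fq": fq, "wt": "json"})


# Bounding box from the slide: min_x min_y max_x max_y.
bbox_fq = intersects_filter("geo", "-74.093 41.042 -69.347 44.558")
url = build_query_url("*:*", bbox_fq)
```

Passing the polygon string from the slide as `shape` works the same way; only the text inside `Intersects(...)` changes.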
    15. Scaling Solr
        • Distributed/sharded indexing & search
          - Auto distributes updates and queries to appropriate shards
          - Near Real Time (NRT) indexing capable
        • Dynamically scalable
          - New SolrCloud instances add indexing and query capacity
          - Supports re-balancing
        • Reliable
          - No single point of failure
          - Transactions logged
          - Robust, automatic recovery
        • http://wiki.apache.org/solr/SolrCloud
    16. New in 4.4 (just released)
        • HDFS backed directory for storing index and transaction logs in Apache Hadoop
        • New Core discovery capabilities
        • Schemaless/External Schema/Field Guessing
        • Schema APIs
        • Add documents from the Admin UI
    17. Hacking Search Engines for Fun and Profit
    18. … Find your Keys, Store Your Content
        • Lucene/Solr is a fast key-value store
          - Bonus: search your values!
        • NoSQL before NoSQL was cool
        • Solr: distributed key/value
          - Durable, Isolated, Redundant, Fast, Real-time
          - Joins, Column Storage
        • Solr or Tika + Lucene can index popular office formats
        • Solr can backup/replicate and scale as content grows
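The "key-value store whose values are searchable" idea can be sketched with a toy in-memory structure: a dictionary for key lookups plus an inverted index over the stored values. This is not Lucene or Solr code; the class `TinySearchableStore` and its whitespace tokenization are assumptions made purely for illustration.

```python
from collections import defaultdict


class TinySearchableStore:
    """Toy key/value store with an inverted index over the values."""

    def __init__(self):
        self._docs = {}                    # key -> stored value
        self._postings = defaultdict(set)  # token -> keys whose value contains it

    def put(self, key, value):
        if key in self._docs:
            self.delete(key)  # drop stale postings before re-indexing
        self._docs[key] = value
        for token in value.lower().split():
            self._postings[token].add(key)

    def delete(self, key):
        value = self._docs.pop(key)
        for token in value.lower().split():
            self._postings[token].discard(key)

    def get(self, key):
        """Plain key lookup, as in any key/value store."""
        return self._docs.get(key)

    def search(self, query):
        """Return the keys whose values contain every query token."""
        sets = [self._postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()
```

Keys only need to be hashable here, so a tuple such as `("user", 42)` works as well as a string.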
    19. … Find Love! Upsell! Cross-sell!
        • Cross recommendation as search
          - with search used to build cross recommendation!
        • Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
        • (Ab)use of a search engine
          - but not as a search engine for content
          - more like a search engine for behavior
        • See Ted Dunning's talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
          - http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
        • Go get Mahout/Myrrix or just do it in y(our) search engine
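One way to read "a search engine for behavior" is this: store, for each item, the other items that tend to co-occur with it, then use a user's recent behavior as the query. The sketch below is a toy version of that reading, not the method from the referenced talk. The `indicators` data is assumed to be computed elsewhere, the function name `recommend` is hypothetical, and a raw overlap count stands in for the relevance score a real engine would compute.

```python
def recommend(history, indicators, top_n=3):
    """Rank items by how many of their indicator items appear in the user's history."""
    seen = set(history)
    scores = {}
    for item, related in indicators.items():
        if item in seen:
            continue  # do not recommend what the user already interacted with
        overlap = len(seen & set(related))
        if overlap:
            scores[item] = overlap
    # Highest overlap first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical co-occurrence data: item -> items that tend to appear with it.
indicators = {
    "tent": ["boots", "stove"],
    "stove": ["tent", "boots", "lantern"],
    "novel": ["bookmark"],
    "boots": ["tent"],
}
picks = recommend(["boots", "lantern"], indicators)
```

In a search engine, `indicators` would be an indexed field on each item document and `history` would become the query terms.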
    20. … Avoid Delays
    21. … Time travel?
        • Leverage Solr's new spatial capabilities to index non-spatial data, such as time ranges
          - Useful for Open Hours, Shifts, etc.
        • Query using rectangle intersections
          - q = shift:"Intersects(0 19 23 365)"
          - https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
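The trick of treating time ranges as spatial data can be sketched without Solr. Assume each range is indexed as a 2D point with x = start and y = end; that mapping is an assumption about how the slide's query is meant to be read. Every range that overlaps a query window then falls inside one rectangle. All function names below are hypothetical.

```python
def shift_to_point(start, end):
    """Represent a time range as a 2D point: x = start, y = end."""
    return (start, end)


def overlap_rectangle(q_start, q_end, min_value, max_value):
    """Rectangle (min_x, min_y, max_x, max_y) containing every range that overlaps [q_start, q_end]."""
    # A range overlaps the window when it starts no later than q_end (x <= q_end)
    # and ends no earlier than q_start (y >= q_start).
    return (min_value, q_start, q_end, max_value)


def intersects(point, rect):
    """Point-in-rectangle test, inclusive on all edges."""
    x, y = point
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y


# Window [19, 23] on a 0..365 scale gives the same four numbers as the slide's query.
rect = overlap_rectangle(19, 23, 0, 365)
```

Under this reading, `Intersects(0 19 23 365)` asks for every shift that starts at or before 23 and ends at or after 19, in whatever units the field was indexed with.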
    22. Boldly go forth and rank!
        • Faster
        • More Flexible
        • Easier than ever scaling
        • More reliable than ever
        • Reduced cost of experimentation
    23. Where to Next?
        • Lucene/Solr EU Conference:
          - Dublin, IE, November 4-7: http://lucenerevolution.org/
          - CFP Open Now
        • Lucene/Solr
          - http://lucene.apache.org
          - {java-user|solr-user}@lucene.apache.org
          - SIGIR '12 Open Source Workshop
            » http://opensearchlab.otago.ac.nz/paper_10.pdf
        • LucidWorks
          - http://www.lucidworks.com
          - Commercial support, products, etc. for Lucene/Solr
        • Me
          - grant@lucidworks.com
          - @gsingers on Twitter
          - "Taming Text": Engineer's guide to open source search and NLP
            » http://www.manning.com/ingersoll
    24. Credits
        • All of the Lucene/Solr committers and contributors
        • Polar bear: http://gaijinexplorer.blogspot.ie/2012/12/its-all-just-relaxing.html
        • Volunteers: http://www.poconohealthsystem.org/?id=228&sid=1
        • Not Hiring: http://naijaguardianjobs.com/wp-content/uploads/2013/03/Not-Hiring-The-American.jpg
        • Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/
        • Love: http://www.msruntheus.com/above-all-love-each-other-deeply/
        • TARDIS: http://2.bp.blogspot.com/-ysN8JskY4WM/UEZNhBywQKI/AAAAAAAABdg/gXE0A9OO6Mk/s1600/13881_doctor_who.jpg
