Introduction to Solr

Introduction to Solr: a talk given by Radu Gheorghe at the Big Data Bucharest Meetup.

Published in: Technology

  1. 1. Introduction to Solr. Radu Gheorghe, Sematext Group, Inc.
  2. 2. About me Logsene SPM ES API metrics ... Products Services + https://sematext.com/blog/author/radu7gheorghe/ + https://www.manning.com/books/elasticsearch-in-action
  3. 3. Agenda: What is Solr; When to use it; When not to use it; How it works; Demo. Pleeeeease ask questions. Otherwise it will be boring :(
  4. 4. What is Solr? Open source search engine, based on Apache Lucene*. Distributed (SolrCloud) or not (master-slave). * Actually, the two projects merged in 2010
  5. 5. More on search: the term dictionary and its friends
     Documents: 1) Big data is fun 2) Other text 3) Bucharest is big
     After analysis, the term dictionary looks like this:
     Term       Docs  Positions  counts, stored, etc
     big        1,3   [0],[2]    ...
     bucharest  3     [0]
     data       1     [1]
     fun        1     ...
     is         1,3
     other      2
     text       2
     Queries: big AND data; “big data”
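
The term dictionary on slide 5 can be sketched in a few lines of Python. This is a toy model, not how Lucene stores things: the `analyze` function (lowercase, split on whitespace) and the helper names are assumptions made for illustration. It shows why `big AND data` only needs the doc lists, while the phrase query “big data” also needs positions.

```python
from collections import defaultdict

# The three example documents from the slide.
docs = {
    1: "Big data is fun",
    2: "Other text",
    3: "Bucharest is big",
}

def analyze(text):
    """Toy analysis chain (an assumption): lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search_and(index, *terms):
    """Docs containing every term: intersect the doc lists."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_phrase(index, first, second):
    """Docs where `second` directly follows `first`, checked via positions."""
    hits = []
    for doc_id in search_and(index, first, second):
        second_positions = set(index[second][doc_id])
        if any(p + 1 in second_positions for p in index[first][doc_id]):
            hits.append(doc_id)
    return hits

index = build_index(docs)
print(dict(index["big"]))                   # {1: [0], 3: [2]}
print(search_and(index, "big", "data"))     # [1]
print(search_phrase(index, "big", "data"))  # [1]
```
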
  6. 6. Segments and merging
  7. 7. The [relevancy] score. BM25: bag-of-words, based on TF-IDF. q=big AND data. Example texts: "big big big big big big" and "I have big big big data". Term Frequency: more occurrences in the document, more weight. Inverse Document Frequency: fewer occurrences in the index, more weight
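
To make the two rules on slide 7 concrete, here is a minimal sketch of the textbook BM25 formula. It is not Solr's actual scoring code; the `k1 = 1.2` and `b = 0.75` values are the commonly cited defaults, and the two-document corpus is just the example text from the slide, tokenized by hand.

```python
import math

K1, B = 1.2, 0.75  # commonly cited BM25 defaults (assumed here)

def idf(doc_count, docs_with_term):
    """Rarer terms in the index get more weight."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))

def bm25_term(tf, doc_len, avg_doc_len, doc_count, docs_with_term):
    """More occurrences in the document give more weight, with saturation."""
    length_norm = K1 * (1 - B + B * doc_len / avg_doc_len)
    return idf(doc_count, docs_with_term) * tf * (K1 + 1) / (tf + length_norm)

def score(query_terms, doc, corpus):
    """Sum per-term scores for one tokenized doc (bag of words, order ignored)."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    total = 0.0
    for term in query_terms:
        docs_with_term = sum(1 for d in corpus if term in d)
        if docs_with_term:
            total += bm25_term(doc.count(term), len(doc), avg_len,
                               len(corpus), docs_with_term)
    return total

doc_a = "big big big big big big".split()
doc_b = "i have big big big data".split()
corpus = [doc_a, doc_b]

# doc_a wins on "big" alone; only doc_b scores on "data",
# and only doc_b would match q=big AND data at all.
print(score(["big"], doc_a, corpus), score(["big"], doc_b, corpus))
print(score(["big", "data"], doc_b, corpus))
```
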
  8. 8. Relevancy tuning. Doc 1: title: Big Data; description: this is a book about big data; published: 2016. Doc 2: title: Spark Rulz; description: big data big data big data big data; published: 2015. q=big AND data. Options: boost fields, boost values
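
One way to express the two kinds of boosts from slide 8 is through eDisMax request parameters. The sketch below only builds the request URL; the host, the `books` collection name and the boost factors are hypothetical, and the field names are taken from the example documents on the slide. `qf` weights fields (title matches count three times as much as description matches), while `bq` boosts documents with a particular value.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "q": "big AND data",
    "defType": "edismax",         # query parser that understands qf and bq
    "qf": "title^3 description",  # boost fields: title weighs more
    "bq": "published:2016^2",     # boost values: favor docs published in 2016
}

query_string = urlencode(params)
# Hypothetical local Solr node and collection name.
url = "http://localhost:8983/solr/books/select?" + query_string

print(url)
print(parse_qs(query_string)["qf"])  # ['title^3 description']
```
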
  9. 9. Back to sorting: where the inverted index fails
     Term (rating)  Docs
     1 [star]       1,2,8,5,128
     2              7,84,129
     3              3,29,345
     4              11,123,455
     5              12,14,16,17
     Search returned docs 84, 455, 12 and 8. Now sort them by rating. ¯\_(ツ)_/¯
  10. 10. Enter doc values
      Doc  Terms
      8    1
      12   3
      84   5
      129  4
      455  2
      Search returned docs 84, 455, 12 and 8. Now sort them by rating. Similar to, but not quite like, stored fields*. * Faster retrieval for doc values. For analyzed text, you’re stuck with stored fields and in-memory field cache
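
The contrast between slides 9 and 10 can be sketched as two lookups. With doc values, the rating of a hit is a direct doc-to-value lookup; with only an inverted index, you have to scan terms until you find the one whose doc list contains the hit. The values below are the doc-values table from slide 10; the function names are made up for illustration.

```python
# Doc values: doc ID -> rating, as in the table on slide 10.
rating_doc_values = {8: 1, 12: 3, 84: 5, 129: 4, 455: 2}

def sort_by_rating(hits, doc_values, descending=True):
    """One direct lookup per hit, then sort."""
    return sorted(hits, key=lambda doc_id: doc_values[doc_id], reverse=descending)

def rating_from_inverted_index(doc_id, inverted_index):
    """Without doc values: scan every term's doc list to find the doc."""
    for rating, postings in inverted_index.items():
        if doc_id in postings:
            return rating
    return None

# The same data, flipped into term -> docs form.
inverted_index = {}
for doc_id, rating in rating_doc_values.items():
    inverted_index.setdefault(rating, []).append(doc_id)

hits = [84, 455, 12, 8]
print(sort_by_rating(hits, rating_doc_values))          # [84, 12, 455, 8]
print(rating_from_inverted_index(455, inverted_index))  # 2
```
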
  11. 11. Facets. The search returns doc IDs; request: facet=true, facet.field=host. Docs: doc1: host=server01; doc2: host=server02; doc3: host=server01; doc4: host=server01. Result: server01: 3, server02: 1. Uses doc values, usually*. * Can be filter cache on low cardinality fields (depends on facet.method)
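
Conceptually, the field facet on slide 11 is a count of doc values over the matching doc IDs. A minimal sketch, using the four documents from the slide (the function name is an assumption, and real Solr may use the filter cache instead, as the footnote says):

```python
from collections import Counter

# Doc values for the host field: doc ID -> host.
host_doc_values = {
    "doc1": "server01",
    "doc2": "server02",
    "doc3": "server01",
    "doc4": "server01",
}

def facet_counts(matching_doc_ids, doc_values):
    """Look up each matching doc's value and count occurrences."""
    return Counter(doc_values[doc_id] for doc_id in matching_doc_ids)

# Here the search is assumed to match all four docs.
counts = facet_counts(host_doc_values.keys(), host_doc_values)
print(counts.most_common())  # [('server01', 3), ('server02', 1)]
```
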
  12. 12. Facets can be hierarchical
      Request:
      top_genres:{
        terms:{
          field: genre,
          limit: 5,
          facet:{
            top_authors:{
              terms:{
                field: author,
                limit: 2
              }
            }
          }
        }
      }
      Response:
      "top_genres":{
        "buckets":[
          { "val":"Fantasy", "count":5432,
            "top_authors":{ // top authors in the "Fantasy" genre
              "buckets":[{ "val":"Mercedes Lackey", "count":121},
                         { "val":"Piers Anthony", "count":98} ] } },
          { "val":"Mystery", "count":4322,
            "top_authors":{ // top authors in the "Mystery" genre
              "buckets":[{ "val":"James Patterson", "count":146},
              ...
      Can also be numeric/date ranges or functions like avg, sum, unique or percentile
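
The nesting on slide 12 can be mimicked locally: bucket the docs by one field, then run the same bucketing on each bucket's docs. The sketch below is a plain-Python illustration that produces the same response shape (`val`, `count`, nested `buckets`); the four book records are invented for the example and `terms_facet` is not a Solr function.

```python
from collections import defaultdict

# Hypothetical documents, only for illustration.
books = [
    {"genre": "Fantasy", "author": "Mercedes Lackey"},
    {"genre": "Fantasy", "author": "Mercedes Lackey"},
    {"genre": "Fantasy", "author": "Piers Anthony"},
    {"genre": "Mystery", "author": "James Patterson"},
]

def terms_facet(docs, field, limit, sub_facets=None):
    """Top `limit` values of `field` by count, with optional nested facets."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc)
    # Most frequent first; ties broken alphabetically to keep output stable.
    top = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:limit]
    buckets = []
    for value, members in top:
        bucket = {"val": value, "count": len(members)}
        for name, (sub_field, sub_limit) in (sub_facets or {}).items():
            # The sub-facet only sees docs that fell into this bucket.
            bucket[name] = terms_facet(members, sub_field, sub_limit)
        buckets.append(bucket)
    return {"buckets": buckets}

result = terms_facet(books, "genre", 5, {"top_authors": ("author", 2)})
print(result)
```
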
  13. 13. Beyond the shards: streaming aggregations. Sources: search, facet, jdbc, ... Decorators: rollup, unique, innerJoin, parallel, ... Diagram components: shard1, shard2, worker1, worker2, Solr endpoint, client app
  14. 14. Beyond the shards: streaming aggregations. Sources: search, facet, jdbc, ... Decorators: rollup, unique, innerJoin, parallel, ... ⇒ Parallel SQL, Text Classification, Graph Traversal. Diagram components: shard1, shard2, worker1, worker2, Solr endpoint, client app
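
The source/decorator split on slides 13 and 14 maps naturally onto generators: a source yields tuples, and a decorator wraps another stream. The following is a rough Python analogy, not Solr's streaming API: the function names merely borrow the `search` and `rollup` names from the slide, and the shard contents are invented. It assumes each shard returns tuples already sorted by the grouping field, which is what lets the rollup work in one pass without holding everything in memory.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical per-shard results, each already sorted by host.
shard1 = [{"host": "server01"}, {"host": "server01"}, {"host": "server02"}]
shard2 = [{"host": "server01"}, {"host": "server03"}]

def search(*shards, sort_field):
    """Source: lazily merge sorted per-shard streams into one sorted stream."""
    return heapq.merge(*shards, key=itemgetter(sort_field))

def rollup(stream, over):
    """Decorator: count tuples per value of `over`; needs sorted input."""
    for value, group in groupby(stream, key=itemgetter(over)):
        yield {over: value, "count": sum(1 for _ in group)}

result = list(rollup(search(shard1, shard2, sort_field="host"), over="host"))
print(result)
```
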
  15. 15. Master-slave. Diagram: the indexer sends docs to the master; the master replicates segments to slave1, slave2 and slave3; the searcher sends queries to the slaves
  16. 16. Master-slave: high-QPS on static data. Diagram: the indexer sends docs to the master; the master replicates segments to slave1, slave2 and slave3; the searcher sends queries to the slaves. Benefits: simple; battle-tested; index data only once; slaves can cache like crazy; separate roles ⇒ separate (see: optimized) hardware and configs
  17. 17. SolrCloud. Diagram components: Solr nodes hosting leader1, replica1, leader2 and replica2; Zookeeper; indexer; searcher
  18. 18. SolrCloud. Diagram components: Solr nodes hosting leader1, replica1, leader2 and replica2; Zookeeper; indexer; searcher. Benefits: near-realtime search; durability; scales both reads and writes; no SPOF; central config, nicer APIs
  19. 19. In a nutshell
      Typical use-cases:
      • Product search (books, movies, bikes, weapons… anything that requires relevancy)
      • Time-series data (logs, metrics, social media...)
      • Search on top of (or as a source of) other Big Data tools (Spark, HDFS…)
      • Search on top of (or alongside) relational DBs
      Typical challenges:
      • Updates (though there’s WiP for numeric doc values in SOLR-5944)
      • Not really schema-less (schema can only be appended)
      • Doesn’t like sparse data (again, there’s ongoing work to make it better, see LUCENE-7253)
      • Some relational, stream and batch processing capabilities, but not the tool for those jobs
  20. 20. Demo Commands available at https://github.com/sematext/meetups/blob/master/introduction_to_solr_demo_commands.sh
  21. 21. Thank you! Radu Gheorghe: radu.gheorghe@sematext.com, @radu0gheorghe. Sematext: info@sematext.com, http://sematext.com, @sematext. Join us! We are hiring: http://sematext.com/jobs (Backend, UI, Sales, Consulting, Trainers)
