Battle of the giants: Apache Solr vs ElasticSearch

Slides from my talk during ApacheCon EU 2012 - "Battle of the giants: Apache Solr vs ElasticSearch". Video available at http://player.vimeo.com/video/55645629

Transcript of "Battle of the giants: Apache Solr vs ElasticSearch"

  1. Battle of the Giants: Apache Solr 4.0 vs ElasticSearch 0.20
     Rafał Kuć, Sematext International
     @kucrafal @sematext sematext.com
  2. Who Am I
     • "Solr 3.1 Cookbook" author (4.0 inc)
     • Sematext consultant & engineer
     • Solr.pl co-founder
     • Father and husband
     Copyright 2012 Sematext Int’l. All rights reserved
  3. What Will I Talk About?
  4. Under the Hood
     • ElasticSearch 0.20
       - Apache Lucene 3.6.1
     • Apache Solr 4.0
       - Apache Lucene 4.0
  5. Architecture
     • What we expect
       - Scalability
       - Fault tolerance
       - High availability
       - Features
     • What we are also looking for
       - Manageability
       - Installation ease
       - Tools
  6. ElasticSearch Cluster Architecture
     • Distributed
     • Fault tolerant
     • Only ElasticSearch nodes
     • Single leader
     • Automatic leader election
  7. SolrCloud Cluster Architecture
     • Distributed
     • Fault tolerant
     • Apache Solr + ZooKeeper ensemble
     • Leader per shard
     • Automatic leader election
  8. Collection vs Index
     • Collection: the main logical index in Solr
     • Index: the main logical structure in ElasticSearch
     • Collections and indices can be spread among different nodes in the cluster
  9. Multiple Document Types in Index
     • ElasticSearch: multiple document types in a single index
     • Apache Solr: multiple document types in a single collection, with a shared schema.xml
  10. Shards and Replicas
      • Index / Collection can have many shards
      • Each shard can have 0 or more replicas
      • Replicas are automatically updated
      • Replicas can be promoted to leaders when a leader shard goes off-line
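As a rough illustration of the shard and replica arithmetic on this slide, the sketch below counts how many shard copies a cluster has to host. The function name and the example numbers are made up for illustration and do not come from the talk.

```python
def total_shard_copies(shards: int, replicas_per_shard: int) -> int:
    """Each shard exists once as a leader/primary plus once per replica."""
    if shards < 1 or replicas_per_shard < 0:
        raise ValueError("need at least one shard and a non-negative replica count")
    return shards * (1 + replicas_per_shard)


# Hypothetical layouts: 8 shards without replicas, then 8 shards with 2 replicas each.
print(total_shard_copies(8, 0))
print(total_shard_copies(8, 2))
```

With 0 replicas there is nothing to promote when a leader goes off-line, which is why the replica count matters for fault tolerance.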
  11. Index and Query Routing
      • Control where documents are going
      • Control where queries are going
      • Manual data distribution
  12. Querying Without Routing
      [Diagram: Application, Collection / Index, Shard 1 to Shard 8]
  13. Query With Routing
      [Diagram: Application, Collection / Index, Shard 1 to Shard 8]
  14. Routing Docs and Queries in Solr
      • Requires some effort
      • Defaults to hash based on document identifiers
      • Can be turned off using solr.NoOpDistributingUpdateProcessorFactory
      <updateRequestProcessorChain>
        <processor class="solr.LogUpdateProcessorFactory" />
        <processor class="solr.RunUpdateProcessorFactory" />
        <processor class="solr.NoOpDistributingUpdateProcessorFactory" />
      </updateRequestProcessorChain>
  15. Routing Docs and Queries - ElasticSearch
      • The routing parameter controls the target shard to which a document or query is forwarded
      • Defaults to the document identifier
      • Can be changed to any value
      curl -XPUT 'localhost:9200/sematext/test/1?routing=1234' -d '{
        "title" : "Test routing document"
      }'
      curl -XGET 'localhost:9200/sematext/test/_search/?q=*&routing=1234'
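The idea behind routing can be sketched in a few lines: hash the routing value, take it modulo the number of shards, and send the document or query only to that shard. This is a simplified model under stated assumptions. `zlib.crc32` is a stand-in hash chosen for the example, not the function either search server uses, and the function names are hypothetical.

```python
import zlib
from typing import List, Optional


def pick_shard(routing_value: str, number_of_shards: int) -> int:
    """Map a routing value to one shard (crc32 is an illustrative stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


def shards_to_query(number_of_shards: int, routing_value: Optional[str] = None) -> List[int]:
    """Without routing every shard is queried; with routing only one is."""
    if routing_value is None:
        return list(range(number_of_shards))
    return [pick_shard(routing_value, number_of_shards)]


print(shards_to_query(8))          # all eight shards
print(shards_to_query(8, "1234"))  # a single shard
```

Because the same routing value always hashes to the same shard, a document indexed with `routing=1234` can be found by a query that passes the same value.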
  16. Apache Solr Index Structure
      • Field types defined in schema.xml file
      • Fields defined in schema.xml file
      • Allows automatic value copying
      • Allows dynamic fields
      • Allows custom similarity definition
  17. ElasticSearch Index Structure
      • Schema-less
      • Analyzers and filters defined with HTTP API
      • Fields defined with an HTTP request
      • Multi-field support
      • Allows nested documents
      • Allows parent-child relationship
      • Allows structured data
  18. Index Structure Manipulation
      • Possible to some extent in Solr as well as ElasticSearch
      • ElasticSearch allows dynamic mappings update (not always)
  19. Aliasing
      • Solr
        - Allows core aliasing
      • ElasticSearch
        - Allows index aliasing
        - We can add filter to alias
        - We can add index routing
        - We can add search routing
  20. Server Configuration
      • Solr
        - Static in solrconfig.xml
        - Can be reloaded during runtime with collection/core reload
      • ElasticSearch
        - Static in elasticsearch.yml
        - Properties can be changed during runtime (although not all) without reloading
  21. ElasticSearch Gateway Module
      • Your data time machine
      • Stores indices and meta data
      • Currently available:
        - Local
        - Shared FS
        - Hadoop
        - S3
  22. Discovery
      • Apache Solr uses ZooKeeper
      • ElasticSearch uses Zen Discovery
  23. ElasticSearch Zen Discovery
      • Allows automatic node discovery
      • Provides multicast and unicast discovery methods
      • Automatic master detection
      • Two-way failure detection
  24. Apache Solr & Apache ZooKeeper
      • Requires additional software
      • ZooKeeper ensemble with 1+ ZooKeeper instances
      • Prevents split-brain situations
      • Holds collections configurations
      • Solr needs to know address of one of the ZooKeeper instances
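ZooKeeper avoids split-brain by requiring a majority of the ensemble to agree, so the ensemble size determines how many instance failures can be survived. The sketch below shows that arithmetic; the function names are hypothetical and only model the majority rule.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one instance")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many instances may fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for size in (1, 3, 4, 5):
    print(size, quorum_size(size), tolerated_failures(size))
```

A single instance works but tolerates no failures, and an even-sized ensemble tolerates no more failures than the odd size below it, which is why odd sizes are the usual choice.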
  25. API
      • HTTP REST API in ElasticSearch or Query String for simple queries
      • HTTP with Query String in Apache Solr
      • Both provide specialized Java API
        - SolrJ for Apache Solr and CloudSolrServer
        - ElasticSearch with TransportClient for remote connections
  26. Apache Solr and Query String
      • Queries are built of request parameters
      • Some degree of structuring allowed (local params)
      curl 'http://localhost:8983/solr/select?q=text:weird&sort=date+desc'
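Since a Solr query is just a set of request parameters, a client can assemble it with any URL-encoding helper. The following is a minimal sketch using only the Python standard library; `solr_select_url` is a hypothetical helper, and the host and field names are taken from the slide's example.

```python
from urllib.parse import urlencode


def solr_select_url(base: str, params: dict) -> str:
    """Build a /select URL; urlencode escapes reserved characters and spaces."""
    return base + "/select?" + urlencode(params)


url = solr_select_url(
    "http://localhost:8983/solr",
    {"q": "text:weird", "sort": "date desc"},
)
print(url)
```

Letting the library do the escaping avoids hand-writing `+` for spaces and keeps `&` from being misread when the URL is passed through a shell.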
  27. ElasticSearch REST End-Points
      • Simple queries built of request parameters
      • Structured queries built as JSON objects
      curl -XGET 'localhost:9200/sematext/test/_search/?q=_all:weird&sort=date:desc'
      curl -XGET 'localhost:9200/sematext/test/_search' -d '{
        "query" : { "term" : { "_all" : "weird" } },
        "sort" : { "date" : { "order" : "desc" } }
      }'
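A structured ElasticSearch query is a JSON object in which `query` and `sort` are sibling keys of the request body. Building it as a data structure and serializing it avoids unbalanced braces. This is a sketch with a hypothetical helper name; the field names mirror the slide's example.

```python
import json


def term_search_body(field: str, value: str, sort_field: str, order: str = "desc") -> dict:
    """Request body with a term query and a sort clause as top-level siblings."""
    return {
        "query": {"term": {field: value}},
        "sort": [{sort_field: {"order": order}}],
    }


body = term_search_body("_all", "weird", "date")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the request body of a `_search` call.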
  28. Data Handling
      • Solr
        - Multiple formats allowed as input
        - Can return results in multiple formats
      • ElasticSearch
        - JSON in / JSON out
  29. Single or Batch
      • Solr
        - Single or multiple documents per request
      • ElasticSearch
        - Single document with a standard indexing call
        - _bulk end-point exposed for batch indexing
        - _bulk UDP end-point can be exposed for low latency batch indexing
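The ElasticSearch `_bulk` end-point takes newline-delimited JSON: one action line followed by one document source line per indexed document, with a trailing newline. The sketch below builds such a body; the helper name and the sample documents are assumptions for illustration.

```python
import json
from typing import Iterable, Tuple


def bulk_index_body(index: str, doc_type: str, docs: Iterable[Tuple[str, dict]]) -> str:
    """Serialize (id, source) pairs into the action/source line pairs _bulk expects."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The body has to end with a newline after the last line.
    return "\n".join(lines) + "\n"


payload = bulk_index_body(
    "sematext",
    "test",
    [("1", {"title": "First document"}), ("2", {"title": "Second document"})],
)
print(payload, end="")
```

Each JSON object has to stay on a single line, so the payload should not be pretty-printed.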
  30. Partial Document Updates
      • Not based on LUCENE-3837 proposed by Andrzej Białecki
      • Document reindexing on the side of search server
      • Both servers use versioning to prevent changes being overwritten
      • Can lead to decreased network traffic in some cases
  31. ElasticSearch Partial Doc Update
      • Special end-point exposed: _update
      • Supports parameters like routing, parent, replication, percolate, etc. (similar to Index API)
      • Uses scripts to perform document updates
      curl -XPOST 'localhost:9200/sematext/test/12345/_update' -d '{
        "script" : "ctx._source.enabled = enabled",
        "params" : { "enabled" : true }
      }'
  32. Apache Solr Partial Doc Update
      • Sent to the standard update handler
      • Requires _version_ field to be present
      curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[
        { "id" : "12345", "enabled" : { "set" : true } }
      ]'
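Both partial update mechanisms re-index the whole document on the server side and rely on versions so that a concurrent change is not silently overwritten. The toy in-memory store below models that behaviour only. It is not how either server is implemented, and every name in it is hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class ToyStore:
    def __init__(self):
        self._docs = {}  # doc id -> (version, fields)

    def index(self, doc_id, fields):
        """Store a full document and bump its version."""
        version = self._docs.get(doc_id, (0, {}))[0] + 1
        self._docs[doc_id] = (version, dict(fields))
        return version

    def partial_update(self, doc_id, changes, expected_version):
        """Merge changes into the stored document, then re-index it as a whole."""
        version, fields = self._docs[doc_id]
        if version != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {version}")
        self._docs[doc_id] = (version + 1, {**fields, **changes})
        return version + 1

    def get(self, doc_id):
        return self._docs[doc_id]


store = ToyStore()
v1 = store.index("12345", {"title": "Test", "enabled": False})
v2 = store.partial_update("12345", {"enabled": True}, expected_version=v1)
print(store.get("12345"))
```

Only the changed field travels over the network, which is where the reduced traffic mentioned in the talk comes from.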
  33. Solr Collections API
      • Built on top of Core Admin
      • Allows:
        - Collection creation
        - Collection reload
        - Collection deletion
  34. ElasticSearch Indices REST API
      • Allows:
        - Index creation
        - Index deletion
        - Index closing and opening
        - Index refreshing
        - Existence checking
  35. Analysis Chain Definition
      • Solr
        - Static in schema.xml
        - Can be reloaded during runtime with collection/core reload
      • ElasticSearch
        - Static in elasticsearch.yml
        - Defined during index/type creation with REST call
        - Possible to change with update mapping call (not all changes allowed)
  36. Multilingual Data Handling
      • Both ElasticSearch and Apache Solr built on top of Apache Lucene
      • Solr: analyzers defined per field in schema.xml file
      • ElasticSearch: analyzer defined in mappings, but can be set during query or specified on the basis of field values
  37. Results Grouping
      • Available in Apache Solr only
      • Allows for results grouping based on:
        - Field value
        - Query
        - Function query (not available during distributed searching)
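Grouping on a field value can be pictured as bucketing an already ranked result list by that field and keeping the top few documents per bucket. The sketch below is a toy model of that idea, not Solr's implementation; the function name, the `limit` parameter, and the sample documents are invented for the example.

```python
from collections import OrderedDict


def group_by_field(ranked_results, field, limit=1):
    """Bucket ranked documents by a field value, keeping `limit` docs per group."""
    groups = OrderedDict()
    for doc in ranked_results:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return groups


ranked = [
    {"id": 1, "category": "books"},
    {"id": 2, "category": "music"},
    {"id": 3, "category": "books"},
]
for value, docs in group_by_field(ranked, "category").items():
    print(value, [doc["id"] for doc in docs])
```

Groups appear in the order of their best-ranked document, since the input list is assumed to be sorted by score.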
  38. Prospective Search
      • Allows for checking if a document matches a stored query
      • Not available in Apache Solr
      • Available in ElasticSearch under the name of Percolator
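Prospective search inverts the usual flow: queries are stored first, and each incoming document is checked against them. The toy class below models that with plain Python predicates standing in for stored queries. It says nothing about how the Percolator works internally, and all names are hypothetical.

```python
class ToyPercolator:
    def __init__(self):
        self._queries = {}  # query name -> predicate over a document dict

    def register(self, name, predicate):
        """Store a query under a name."""
        self._queries[name] = predicate

    def percolate(self, doc):
        """Return the names of all stored queries the document matches."""
        return sorted(name for name, matches in self._queries.items() if matches(doc))


percolator = ToyPercolator()
percolator.register("yellow-things", lambda d: d.get("color") == "Yellow")
percolator.register("cheap-things", lambda d: d.get("price", 0) < 10)
print(percolator.percolate({"color": "Yellow", "price": 25}))
```

A typical use is alerting: a user's saved search is registered once, and every new document reports which saved searches it would satisfy.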
  39. Spellchecker
      • Allows checking and correcting spelling mistakes
      • Not available in ElasticSearch currently
      • Multiple implementations available in Apache Solr
        - IndexBasedSpellChecker
        - WordBreakSolrSpellChecker
        - DirectSolrSpellChecker
  40. Full Text Search Capabilities
      • Variety of queries
      • Ability to control score calculation
      • Different query parsers available
      • Advanced Lucene queries (like SpanQueries) exposed
  41. Score Calculation
      • Leverage Lucene scoring capabilities
      • Control over document importance
      • Control over query importance
      • Control over term and phrase importance
  42. Apache Solr and Score Influence
      • Index time
        - Document boosts
        - Field boosts
      • Query time
        - Term boosts
        - Field boosts
        - Phrases boost
        - Function queries
  43. ElasticSearch and Score Influence
      • Index time
        - Document and field boosts
      • Query time
        - Different queries provide different boost controls
        - Can calculate distributed term frequencies
        - Negative and Positive boosting queries
        - Custom score filters
      • Scripts
        - Control scoring with scripts
  44. Nested Objects
      • Possible only in ElasticSearch
      • Indexed as separate documents
      • Stored in the same part of the index as the root document
      • Hidden from standard queries and filters
      • Need appropriate queries and filters (nested)
  45. More Like This
      • Lets us find similar documents
      • Solr
        - More Like This Component
      • ElasticSearch
        - More Like This Query
        - More Like This Field Query
        - _mlt REST end-point
  46. Solr Parent-Child Relationship
      • Used at query time
      • Multi core joins possible
      http://localhost:8983/solr/select?q={!join from=parent to=id}color:Yellow
  47. ElasticSearch Parent-Child Handling
      • Proper indexing required
      • Indexed as separate documents
      • Standard queries don’t return child documents
      • In order to retrieve parent docs one should use appropriate queries and filters (has_child, has_parent, top_children)
  48. Filters
      • Used to narrow down query results
      • Good candidates for caching and reuse
      • Supported by ElasticSearch and Apache Solr
      • Should be used for repeatable query elements
  49. Apache Solr Filter Queries
      • Multiple filters per query
      • Filters are additive
      • Different query parsers can be used
      • Local params can be used
      • Narrow down faceting results
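In Solr, several filters are passed as repeated `fq` parameters, and a document has to match every one of them. The sketch below builds such a request URL and models the combined effect as a set intersection. The helper names and the `inStock` field are assumptions for illustration; `color:Yellow` is borrowed from the join example in the talk.

```python
from urllib.parse import urlencode


def solr_filtered_url(base, query, filters):
    """Build a /select URL with one fq parameter per filter."""
    params = [("q", query)] + [("fq", f) for f in filters]
    return base + "/select?" + urlencode(params)


def apply_filters(candidate_ids, filter_matches):
    """Toy model: each filter narrows the result set further (intersection)."""
    result = set(candidate_ids)
    for matches in filter_matches:
        result &= set(matches)
    return result


filtered_url = solr_filtered_url(
    "http://localhost:8983/solr", "text:weird", ["color:Yellow", "inStock:true"]
)
print(filtered_url)
print(sorted(apply_filters([1, 2, 3, 4], [[1, 2, 3], [2, 3, 5]])))
```

Keeping each condition in its own `fq` lets the server cache and reuse the filters independently of each other.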
  50. ElasticSearch Filtered Queries
      • Can be defined using queries exposed by the Query DSL
      • Can be used for custom score calculation (i.e., custom filters score query)
      • Doesn’t narrow down faceting results by default (facets have their own filters)
  51. Filter Cache Control
      • Both Solr and ElasticSearch let us control cache for filters
      • Solr
        - Using local params and cache property
      • ElasticSearch
        - _cache property
        - _cache_key property
  52. Faceting
      • Both provide common facets
        - Terms
        - Range & query
        - Terms statistics
        - Spatial distance
      • Solr
        - Pivot faceting
      • ElasticSearch
        - Histograms
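A terms facet boils down to counting how often each field value occurs in the matching documents and returning the most frequent values. The sketch below is a toy model over an in-memory list, not either server's implementation; the function name and sample data are hypothetical.

```python
from collections import Counter


def terms_facet(docs, field, size=10):
    """Count field values across documents and return the `size` most common."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return counts.most_common(size)


matching_docs = [
    {"id": 1, "color": "Yellow"},
    {"id": 2, "color": "Red"},
    {"id": 3, "color": "Yellow"},
    {"id": 4},  # documents without the field are skipped
]
print(terms_facet(matching_docs, "color"))
```

Range, histogram, and statistical facets follow the same pattern with a different bucketing or aggregation step.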
  53. Real Time Or Not?
      • Allow getting a document that is not yet indexed
      • Don’t need searcher reopening
      • ElasticSearch
        - Separate Get and Multi Get APIs
      • Apache Solr
        - Separate Realtime Get Handler
        - Can be used as a search component
  54. Caches and Warming
      • ElasticSearch and Solr allow caching
      • Both allow running warming queries
      • ElasticSearch by default doesn’t limit cache sizes
  55. Solr Caches
      • Types
        - Filter Cache
        - Query Result Cache
        - Document Cache
      • Implementation choices
        - LRUCache
        - FastLRUCache
        - LFUCache
      • Other configuration options:
        - Size
        - Maximum size
        - Autowarming count
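To make the "LRU" in the cache implementation names concrete, here is a minimal least-recently-used cache with a size limit. It only demonstrates the eviction policy and is not Solr's LRUCache; the class name and keys are hypothetical.

```python
from collections import OrderedDict


class ToyLRUCache:
    def __init__(self, size):
        self.size = size
        self._entries = OrderedDict()  # oldest entry first, most recent last

    def get(self, key):
        """Return the cached value (or None) and mark the entry as recently used."""
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        """Insert or refresh an entry, evicting the least recently used one if full."""
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.size:
            self._entries.popitem(last=False)


cache = ToyLRUCache(size=2)
cache.put("fq=color:Yellow", {1, 3})
cache.put("fq=inStock:true", {2, 3})
cache.get("fq=color:Yellow")          # touch the first entry
cache.put("fq=price:[0 TO 10]", {3})  # evicts the untouched second entry
print(cache.get("fq=inStock:true"))
```

An LFU policy would evict by access count instead of recency, which favours entries that are hit often even if not recently.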
  56. ElasticSearch Caches
      • Types
        - Filter Cache
        - Field Data Cache
      • Implementation choices
        - Resident
        - Soft
        - Weak
      • Other configuration options:
        - Max size (entries per segment)
        - Expiration time
  57. Cluster State Monitoring
      • Apache Solr: multiple mbeans exposed by JMX
      • ElasticSearch: multiple REST end-points exposed to get different statistics
  58. ElasticSearch Statistics API
      • Health and State Check
      • Nodes Information and Statistics
      • Cache Statistics
      • Index Segments Information
      • Index Information and Statistics
      • Mappings Information
  59. Cluster Monitoring
  60. Cluster Monitoring with SPM
  61. Cluster Settings Update
      • ElasticSearch lets us:
        - Control rebalancing
        - Control recovery
        - Control allocation
        - Change the above on the live cluster
  62. Custom Shard Allocation
      • Possible in ElasticSearch
      • Cluster level:
      curl -XPUT 'localhost:9200/_cluster/settings' -d '{
        "persistent" : {
          "cluster.routing.allocation.exclude._ip" : "192.168.2.1"
        }
      }'
      • Index level:
      curl -XPUT 'localhost:9200/sematext/' -d '{
        "index.routing.allocation.include.tag" : "nodeOne,nodeTwo"
      }'
  63. Moving Shards and Replicas
      • Possible in ElasticSearch, not available in Solr
      • Allows moving shards and replicas to any node in the cluster on demand
      • Available in ElasticSearch:
      curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
        "commands" : [
          {"move" : {"index" : "sematext", "shard" : 0, "from_node" : "node1", "to_node" : "node2"}},
          {"allocate" : {"index" : "sematext", "shard" : 1, "node" : "node3"}}
        ]
      }'
  64. And The Winner Is?
  65. How to Reach Us
      • Rafał Kuć
        - Twitter: @kucrafal
        - E-mail: rafal.kuc@sematext.com
      • Sematext
        - Twitter: @sematext
        - Website: http://sematext.com
      • Solr vs ElasticSearch series:
        - http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
  66. We Are Hiring!
      • Dig Search?
      • Dig Analytics?
      • Dig Big Data?
      • Dig Performance?
      • Dig working with and in open-source?
      • We’re hiring world-wide! http://sematext.com/about/jobs.html