Battle of the Giants
Rafał Kuć – Sematext Group, Inc.
@kucrafal @sematext sematext.com
Ich bin ein…
Sematext consultant & engineer
Solr Cookbook series author
„ElasticSearch Server” author
„Mastering ElasticSearch” author
Solr.pl co-founder
Father and husband 
Copyright 2013 Sematext Group. Inc. All rights reserved
Under the Hood
Lucene 4.3
Expectations
Scalability
Fault tolerance
High availability
Features
Manageability
Ease of installation
Tools
Support
Expectations vs Reality
Only ElasticSearch nodes
Single leader
Solr + ZooKeeper
Leader per shard
Distributed
Fault tolerant
Automatic leader election
All Time Top Committers
Active Contributors
The Code
The Mailing Lists
Trends
Collection vs Index
Collections and indices can be spread among different nodes in the cluster
Collection (Solr) – main logical index
Index (ElasticSearch) – main logical structure
Apache Solr Index Structure
Fields and types defined in schema
Automatic value copying
Dynamic fields
Custom similarity
Custom postings format
Multiple document types require shared schema
Can be read using API
ElasticSearch Index Structure
Schema-less
Fields and types defined with HTTP API
Multi-field support
Nested and parent-child documents
Custom similarity
Custom postings format
Multiple document types with different structures
Can be read and written using API
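A minimal sketch of defining fields and types over HTTP, assuming the 0.90-era mapping syntax that was current for this deck (later releases changed it). The index `sematext` and type `test` come from the other examples here; the field names and the helper function names are hypothetical.

```shell
# Prints a mapping body for the "test" type with two typed fields
# (field names are illustrative assumptions).
mapping_body() {
  cat <<'EOF'
{
  "test" : {
    "properties" : {
      "title" : { "type" : "string" },
      "enabled" : { "type" : "boolean" }
    }
  }
}
EOF
}

# Sends the mapping to a node assumed to run on localhost:9200.
# Defined here but not called.
put_mapping() {
  mapping_body | curl -s -XPUT 'localhost:9200/sematext/test/_mapping' -d @-
}
```

Calling `put_mapping` against a running node writes the mapping; a GET on the same URL reads it back.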
Shards and Replicas
Many shards
0 or more replicas
Replica can become leader
Replicas can be created on live cluster
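On the ElasticSearch side, adding replicas to a live cluster can be sketched as an index settings update. The replica count of 2, the index name `sematext`, the node address and the helper names are assumptions for illustration.

```shell
# Prints the settings body that sets the replica count to 2
# (the value is an arbitrary example).
replica_settings_body() {
  cat <<'EOF'
{ "index" : { "number_of_replicas" : 2 } }
EOF
}

# Sends the body to a node assumed to run on localhost:9200.
# Defined here but not called.
update_replicas() {
  replica_settings_body | curl -s -XPUT 'localhost:9200/sematext/_settings' -d @-
}
```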
Configuration
Solr: static in solrconfig.xml; can be reloaded with core reload
ElasticSearch: static in elasticsearch.yml; changeable at runtime
Discovery
Zen Discovery (ElasticSearch)
Apache ZooKeeper (Solr)
Solr & ZooKeeper
Requires additional software
Prevents split-brain situations
Holds collection configurations
ZooKeeper ensemble needed
ElasticSearch Zen Discovery
Automatic node discovery
Multicast and unicast discovery methods
Automatic master detection
Two-way failure detection
HTTP FTW
HTTP REST API in ElasticSearch, or query string for simple queries
HTTP with query string in Apache Solr
Both provide a specialized Java API
Results Grouping
Group on:
field value
query result
function query
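For Solr, grouping on a field value can be sketched as a plain query-string request. The `group`, `group.field` and `group.limit` parameters are standard Solr grouping parameters (`group.query` and `group.func` cover the other two cases); the field name passed in and the helper name are hypothetical.

```shell
# Prints a select URL that groups results by the value of one field.
# $1 = field to group on (supplied by the caller).
solr_group_url() {
  printf 'localhost:8983/solr/select?q=*:*&group=true&group.field=%s&group.limit=3\n' "$1"
}

# Example against a running Solr (not executed here):
#   curl -s "$(solr_group_url category)"
```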
Prospective Search
Called Percolator
Matches documents to stored queries
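A rough sketch of the two percolator steps, assuming the pre-1.0 ElasticSearch API that was current for this deck (queries registered under `_percolator/<index>/<name>`, documents sent to `_percolate`); later releases changed both endpoints. The query name `yellow_things`, the `color` field and the helper names are hypothetical.

```shell
# Body of the stored query.
percolator_query_body() {
  cat <<'EOF'
{ "query" : { "term" : { "color" : "yellow" } } }
EOF
}

# Body wrapping the document to match against the stored queries.
percolate_doc_body() {
  cat <<'EOF'
{ "doc" : { "color" : "yellow" } }
EOF
}

# Both senders assume a node on localhost:9200; defined but not called.
register_query() {
  percolator_query_body | curl -s -XPUT 'localhost:9200/_percolator/sematext/yellow_things' -d @-
}
percolate_doc() {
  percolate_doc_body | curl -s -XGET 'localhost:9200/sematext/test/_percolate' -d @-
}
```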
Full Text Search Capabilities
Variety of queries
Control score calculation
Different query parsers
Advanced Lucene queries
Score Calculation
Leverage Lucene scoring
Control importance of:
documents
queries
terms
phrases
Similarity configuration
Apache Solr and Score Influence
Index-time boosting
Query-time:
Term boosts
Field boosts
Phrase boosts
Function queries
Sub-queries used for boosting
ElasticSearch and Score Influence
Index-time
Query-time
Different queries provide different boost controls
Can calculate distributed term frequencies
Negative and positive boosting queries
Custom score filters
Scripts
ElasticSearch Query Rescore
Reorders the top N hits using another query
Executed on shards before results are returned to the node handling the request
Not executed with the scan and count search types
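A minimal sketch of a rescore request: a cheap main query selects candidates, and a phrase query reorders the top 50 of them on each shard. The `title` field, the query text, the weights and the helper names are illustrative assumptions.

```shell
# Prints a search body with a rescore section.
rescore_body() {
  cat <<'EOF'
{
  "query" : { "match" : { "title" : "search server" } },
  "rescore" : {
    "window_size" : 50,
    "query" : {
      "rescore_query" : { "match_phrase" : { "title" : "search server" } },
      "query_weight" : 0.7,
      "rescore_query_weight" : 1.2
    }
  }
}
EOF
}

# Sends the search to a node assumed to run on localhost:9200.
# Defined here but not called.
search_with_rescore() {
  rescore_body | curl -s -XPOST 'localhost:9200/sematext/_search' -d @-
}
```

`window_size` is the N in "top N hits"; it applies per shard.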
ElasticSearch Nested Objects
Indexed as separate documents
Stored in the same part of index as root doc
Hidden from standard queries and filters
Need appropriate queries and filters (nested)
Top-level documents can be sorted on the basis of nested ones
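A sketch of the dedicated `nested` query mentioned above. It assumes a hypothetical field `variations` mapped as type `nested` with a `color` sub-field; the helper names are also made up.

```shell
# Prints a search body that matches root documents through their nested objects.
nested_query_body() {
  cat <<'EOF'
{
  "query" : {
    "nested" : {
      "path" : "variations",
      "query" : { "term" : { "variations.color" : "yellow" } }
    }
  }
}
EOF
}

# Sends the search to a node assumed to run on localhost:9200.
# Defined here but not called.
search_nested() {
  nested_query_body | curl -s -XPOST 'localhost:9200/sematext/_search' -d @-
}
```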
Solr Parent-Child Relationship
Used at query time
Multi-core joins possible
select?q={!join from=parent to=id}color:Yellow
ElasticSearch Parent-Child
Proper indexing required
Indexed as separate documents
Standard queries don’t return child documents
Retrieve parent docs using queries and filters (has_child, has_parent, top_children)
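A sketch of retrieving parents through their children with `has_child`. It assumes a hypothetical child type `variation` whose mapping declares its `_parent` type and whose documents were indexed with a parent id; the `color` field and helper names are illustrative.

```shell
# Prints a search body that returns parent documents having a matching child.
has_child_body() {
  cat <<'EOF'
{
  "query" : {
    "has_child" : {
      "type" : "variation",
      "query" : { "term" : { "color" : "yellow" } }
    }
  }
}
EOF
}

# Sends the search to a node assumed to run on localhost:9200.
# Defined here but not called.
search_parents() {
  has_child_body | curl -s -XPOST 'localhost:9200/sematext/_search' -d @-
}
```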
Filters
Used to narrow down query results
Good candidates for caching and reuse
Additive
Solr:
Can use different query parsers
Can use local params
Narrows down faceting results
ElasticSearch:
Defined using Query DSL
Can be used for score calculation
Doesn’t narrow down faceting results
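The faceting difference can be sketched side by side. The Solr request uses `fq`, which restricts both hits and facet counts. The ElasticSearch body assumes the 0.90-era top-level `filter` and `facets` sections, where the filter restricts hits only. Field names, values and helper names are hypothetical.

```shell
# Solr: prints a select URL with a filter query and a field facet.
# $1 = filter query, $2 = facet field.
solr_filtered_facet_url() {
  printf 'localhost:8983/solr/select?q=*:*&fq=%s&facet=true&facet.field=%s\n' "$1" "$2"
}

# ElasticSearch: prints a search body whose top-level filter narrows the hits
# while the facet is still computed over everything the query matches.
es_filtered_facet_body() {
  cat <<'EOF'
{
  "query" : { "match_all" : {} },
  "filter" : { "term" : { "color" : "yellow" } },
  "facets" : {
    "categories" : { "terms" : { "field" : "category" } }
  }
}
EOF
}
```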
Faceting
Terms
Range & query
Terms statistics
Spatial distance
Pivot
Histograms
Real Time or Not?
Get not yet indexed docs from transaction log
Don’t need searcher reopening
ElasticSearch: separate Get and Multi Get API
Solr: separate Realtime Get handler
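A sketch of the real-time lookups on both sides. It assumes Solr's realtime get handler is registered at `/get` with the update log enabled, and reuses the `sematext` index and `test` type from the other examples; the helper names are hypothetical.

```shell
# Solr: prints the realtime get URL for one document id.
solr_realtime_get_url() {
  printf 'localhost:8983/solr/get?id=%s\n' "$1"
}

# ElasticSearch: prints the Get API URL for one document id.
es_get_url() {
  printf 'localhost:9200/sematext/test/%s\n' "$1"
}

# ElasticSearch: prints a Multi Get body for two ids,
# meant for localhost:9200/sematext/test/_mget.
es_mget_body() {
  printf '{ "ids" : [ "%s", "%s" ] }\n' "$1" "$2"
}

# Example against running servers (not executed here):
#   curl -s "$(solr_realtime_get_url 12345)"
#   curl -s "$(es_get_url 12345)"
```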
Data Handling
Single and batch indexing supported
ElasticSearch: JSON in / JSON out (and YAML)
Solr: different formats allowed (XML, JSON, CSV, binary)
Partial Document Updates
Not based on LUCENE-3837
Server-side doc reindexing
Both servers use versioning
Decreases network traffic
Apache Solr Partial Doc Update
Sent to the standard update handler
Requires _version_ field
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[ {
"id" : "12345",
"enabled" : {
"set" : true
}
} ]'
ElasticSearch Partial Doc Update
Special endpoint exposed: _update
Supports parameters like routing, parent, replication, percolate, etc. (similar to Index API)
Uses scripts to perform document updates
curl -XPOST 'localhost:9200/sematext/test/12345/_update' -d '{
"script" : "ctx._source.enabled = enabled",
"params" : {
"enabled" : true
}
}'
Solr Collections API
Collection:
creation
reload
deletion
shard splitting
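Collection creation can be sketched as a single Collections API call. `action=CREATE`, `name`, `numShards` and `replicationFactor` are standard parameters; the values passed in and the helper name are illustrative assumptions.

```shell
# Prints a Collections API URL that creates a collection.
# $1 = collection name, $2 = number of shards, $3 = replication factor.
solr_create_collection_url() {
  printf 'localhost:8983/solr/admin/collections?action=CREATE&name=%s&numShards=%s&replicationFactor=%s\n' "$1" "$2" "$3"
}

# Example against a running SolrCloud (not executed here):
#   curl -s "$(solr_create_collection_url sematext 2 2)"
```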
ElasticSearch Indices REST API
Index
creation
deletion
closing and opening
refreshing
existence checking
Apache Solr Shard Splitting
admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1
Cluster State Monitoring
Solr: multiple MBeans exposed by JMX
ElasticSearch: multiple REST endpoints exposed to get different statistics
ElasticSearch Statistics API
Health and state check
Nodes information
Cache statistics
Segments information
Index information
Mappings information
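As a rough map from the checks above to endpoints, assuming the paths available in the ElasticSearch releases of this deck's era and the `sematext` index used elsewhere here. Cache statistics are reported inside the index and node statistics rather than at a separate URL. The helper name and its keywords are hypothetical.

```shell
# Prints the URL for one kind of check.
# $1 = health | state | nodes | segments | index | mappings
es_stats_url() {
  case "$1" in
    health)   echo 'localhost:9200/_cluster/health' ;;
    state)    echo 'localhost:9200/_cluster/state' ;;
    nodes)    echo 'localhost:9200/_nodes' ;;
    segments) echo 'localhost:9200/sematext/_segments' ;;
    index)    echo 'localhost:9200/sematext/_stats' ;;
    mappings) echo 'localhost:9200/sematext/_mapping' ;;
    *)        echo "unknown check: $1" >&2; return 1 ;;
  esac
}

# Example against a running node (not executed here):
#   curl -s "$(es_stats_url health)"
```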
SPM – „One to rule them all”
ElasticSearch Cluster Settings Update
Control
rebalancing
recovery
allocation
Change cluster configuration properties
ElasticSearch Custom Shard Allocation
Cluster level:
curl -XPUT localhost:9200/_cluster/settings -d '{
"persistent" : {
"cluster.routing.allocation.exclude._ip" : "192.168.2.1"
}
}'
Index level:
curl -XPUT localhost:9200/sematext/_settings/ -d '{
"index.routing.allocation.include.tag" : "nodeOne,nodeTwo"
}'
Moving Shards and Replicas
Move shards between nodes on demand
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
"commands" : [
{"move" : {"index" : "sematext", "shard" : 0, "from_node" : "node1",
"to_node" : "node2"}},
{"allocate" : {"index" : "sematext", "shard" : 1, "node" : "node3"}}
]
}'
The Verdict
And the Winner Is?
We Are Hiring!
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig working with and in open source?
We’re hiring worldwide!
http://sematext.com/about/jobs.html
Rafał Kuć
@kucrafal
rafal.kuc@sematext.com
Sematext
@sematext
http://sematext.com
http://blog.sematext.com
ElasticSearch Server 25% off:
MREESS25
Thank You!

Battle of the Giants round 2
