New features and changes in Elasticsearch




Usage Rights: CC Attribution License

    Presentation Transcript

    • ELASTICSEARCH: What’s new since 0.90? (techtalk @ ferret)
    • Latest stable release: Elasticsearch 1.1.0, released 25.03.2014, based on Lucene 4.6.1
    • BREAKING CHANGES in versions 1.x
    • CONFIGURATION
      • The cluster.routing.allocation settings (disable_allocation, disable_new_allocation and disable_replica_allocation) have been replaced by the single setting:
          cluster.routing.allocation.enable: all|primaries|new_primaries|none
      • Elasticsearch on 64-bit Linux now uses mmapfs by default. Make sure that you set MAX_MAP_COUNT to a sufficiently high number. The RPM and Debian packages default this value to 262144.
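    The consolidated allocation setting can be changed at runtime through the cluster settings API. A minimal sketch of building the request body, assuming a transient (rather than persistent) update; the body would be PUT to /_cluster/settings:

```python
import json

# Sketch: request body for the cluster settings API using the new
# consolidated setting that replaces the three disable_* flags.
body = {
    "transient": {
        "cluster.routing.allocation.enable": "primaries"
    }
}

# Serialize for the HTTP request; sending it is left to the client.
payload = json.dumps(body)
print(payload)
```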
    • MULTI-FIELDS
      Existing multi-fields will be upgraded to the new format automatically.

      Old format:
        "title": {
          "type": "multi_field",
          "fields": {
            "title": { "type": "string" },
            "raw":   { "type": "string", "index": "not_analyzed" }
          }
        }

      New format:
        "title": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
    • STOPWORDS
      • Previously, the standard and pattern analyzers used the list of English stopwords by default, which caused some hard-to-debug indexing issues.
      • Now they are set to use the empty stopwords list (i.e. _none_) instead.
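    Indexes that relied on the old behaviour can opt back in explicitly. A minimal sketch of index settings that restore English stopwords for a standard-type analyzer; the analyzer name my_standard is illustrative:

```python
import json

# Sketch: index settings re-enabling English stopwords, since the
# default stopwords list for the standard analyzer is now empty.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_standard": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    }
}
print(json.dumps(settings, indent=2))
```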
    • RETURN VALUES
      • The ok return value has been removed from all response bodies as it added no useful information.
      • The found, not_found and exists return values have been unified as found on all relevant APIs.
      • Field values, in response to the fields parameter, are now always returned as arrays. Metadata fields are always returned as scalars.
      • The analyze API no longer supports the text response format, but does support JSON and YAML.
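    Because values requested via the fields parameter are now always arrays, client code can index into them uniformly. A small sketch; the hit below is a made-up response fragment, not actual API output:

```python
# Sketch: a hypothetical 1.x get-response fragment. Field values under
# "fields" are always lists, even when the field holds a single value,
# and the old ok/exists/not_found flags are unified as "found".
hit = {
    "_id": "1",
    "found": True,
    "fields": {"title": ["Elasticsearch 1.1.0"]},
}

# Always a list, so [0] is safe for single-valued fields.
title = hit["fields"]["title"][0]
print(title)
```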
    • DEPRECATIONS
      • Per-document boosting with the _boost field has been removed. You can use function_score instead.
      • The custom_score and custom_boost_score queries are no longer supported. You can use function_score instead.
      • The field query has been removed. Use the query_string query instead.
      • The path parameter in mappings has been deprecated. Use the copy_to parameter instead.
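    A minimal sketch of the kind of rewrite the deprecations imply: a custom_score-style boost expressed as a function_score query with a script_score. The field popularity and the script are illustrative, not from the slides:

```python
import json

# Sketch: function_score replacing the removed custom_score query.
# The script multiplies the text relevance score by a document field.
query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "elasticsearch"}},
            "script_score": {
                "script": "_score * doc['popularity'].value"
            },
        }
    }
}
print(json.dumps(query))
```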
    • AGGREGATIONS since version 1.0.0
    • AGGREGATION TYPES
      • Bucketing aggregations: aggregations that build buckets, where each bucket is associated with a key and a document criterion. Examples: range, terms, histogram. Bucketing aggregations can have sub-aggregations (bucketing or metric), which are computed for the buckets generated by their parent aggregation.
      • Metrics aggregations: aggregations that keep track of and compute metrics over a set of documents. Examples: min, max, stats.
    • Request:
        {
          "aggs": {
            "price_ranges": {
              "range": {
                "field": "price",
                "ranges": [ { "to": 50 }, { "from": 100 } ]
              },
              "aggs": {
                "price_stats": { "stats": { "field": "price" } }
              }
            }
          }
        }

      Response:
        {
          "aggregations": {
            "price_ranges": {
              "buckets": [
                {
                  "to": 50,
                  "doc_count": 2,
                  "price_stats": { "count": 2, "min": 20, "max": 47, "avg": 33.5, "sum": 67 }
                },
                …
              ]
            }
          }
        }
    • CARDINALITY
      The cardinality aggregation is a metric aggregation that computes approximate unique counts based on the HyperLogLog++ algorithm. It has two nice properties: it is close to accurate on low cardinalities, and its fixed memory usage means that estimating high cardinalities doesn't blow up memory.

        { "aggs": { "author_count": { "cardinality": { "field": "author" } } } }
    • PERCENTILES (1.1.0)
      A percentiles aggregation computes approximate values of arbitrary percentiles based on the t-digest algorithm. Computing exact percentiles is not reasonably feasible, as it would require shards to stream all values to the node that coordinates search execution, which could be gigabytes on a high-cardinality field.

      Request:
        { "aggs": { "load_time_outlier": { "percentiles": { "field": "load_time" } } } }

      Response:
        {
          ...
          "aggregations": {
            "load_time_outlier": {
              "1.0": 15,
              "5.0": 20,
              "25.0": 23,
              "50.0": 25,
              "75.0": 29,
              "95.0": 60,
              "99.0": 150
            }
          }
        }
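    Instead of the default percentile set, specific percentiles can be requested via the percents parameter. A small sketch; the field name load_time follows the slide's example, while the particular percents chosen are illustrative:

```python
import json

# Sketch: a percentiles aggregation requesting only the percentiles
# we care about, rather than the default 1/5/25/50/75/95/99 set.
agg = {
    "aggs": {
        "load_time_outlier": {
            "percentiles": {
                "field": "load_time",
                "percents": [50, 95, 99],
            }
        }
    }
}
print(json.dumps(agg))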
    • SIGNIFICANT_TERMS (1.1.0)
      An aggregation that identifies terms that are significant rather than merely popular in a result set. Significance is related to the changes in document frequency observed between everyday use in the corpus and frequency observed in the result set.

      Request:
        {
          "query": { "terms": { "force": [ "British Transport Police" ] } },
          "aggregations": {
            "significantCrimeTypes": {
              "significant_terms": { "field": "crime_type" }
            }
          }
        }

      Response:
        {
          "aggregations": {
            "significantCrimeTypes": {
              "doc_count": 47347,
              "buckets": [
                {
                  "key": "Bicycle theft",
                  "doc_count": 3640,
                  "score": 0.371235374214817,
                  "bg_count": 66799
                },
                …
              ]
            }
          }
        }
    • IMPROVEMENTS 1.1.0
    • TERMS AGGREGATION
      • Before 1.1.0, terms aggregations returned up to size terms, so the way to get all matching terms back was to set size to an arbitrarily high number, larger than the number of unique terms.
      • Since version 1.1.0, to get ALL terms just set size=0.
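    A minimal sketch of such a request body; the field name category is illustrative:

```python
import json

# Sketch: since 1.1.0, size 0 on a terms aggregation means
# "return every unique term" instead of a fixed-size top list.
agg = {
    "aggs": {
        "all_categories": {
            "terms": {"field": "category", "size": 0}
        }
    }
}
print(json.dumps(agg))
```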
    • MULTI-FIELD SEARCH
      The multi_match query now supports three types of execution:
      • best_fields (field-centric, default): find the field that best matches the query string. Useful for finding a single concept like “full text search” in either the title or the body field.
      • most_fields (field-centric): find all matching fields and add up their scores. Useful for matching against multi-fields, where the same text has been analyzed in different ways to improve the relevance score: with/without stemming, shingles, edge-ngrams etc.
      • cross_fields (term-centric): a new execution mode which looks for each term in any of the listed fields. Useful for documents whose identifying features are spread across multiple fields, such as first_name and last_name, and it supports the minimum_should_match operator in a more natural way than the other two modes.
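    A minimal sketch of a cross_fields query over the name fields mentioned above; the query string and the use of the and operator are illustrative:

```python
import json

# Sketch: cross_fields treats first_name and last_name as one big
# field, requiring each term to appear in at least one of them.
query = {
    "query": {
        "multi_match": {
            "query": "Will Smith",
            "type": "cross_fields",
            "fields": ["first_name", "last_name"],
            "operator": "and",
        }
    }
}
print(json.dumps(query))
```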
    • CAT API since version 1.0.0
    • JSON is great… for computers. Human eyes, especially when looking at an ssh terminal, need compact and aligned text. The cat API aims to meet this need.

        $ curl 'localhost:9200/_cat/nodes?h=ip,port,heapPercent,name'
        9300 40.3 Captain Universe
        9300 15.3 Kaluu
        9300 17.0 Yellowjacket
        9300 12.3 Remy LeBeau
        9300 43.9 Ramsey, Doug
    • TRIBE NODES since version 1.0.0
    • The tribes feature allows a tribe node to act as a federated client across multiple clusters.

      elasticsearch.yml:
        tribe:
          t1:
            cluster.name: cluster_one
          t2:
            cluster.name: cluster_two

      The merged global cluster state means that almost all operations work in the same way as on a single cluster: distributed search, suggest, percolation, indexing, etc. However, there are a few exceptions:
      • The merged view cannot handle indices with the same name in multiple clusters.
      • Master-level read operations (e.g. Cluster State, Cluster Health) will automatically execute with a local flag set to true since there is no master.
      • Master-level write operations (e.g. Create Index) are not allowed. These should be performed on a single cluster.
    • BACKUP & RESTORE since version 1.0.0
    • REPOSITORIES
      Before any snapshot or restore operation can be performed, a snapshot repository should be registered in Elasticsearch.

        $ curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
            "type": "fs",
            "settings": {
              "location": "/mount/backups/my_backup",
              "compress": true
            }
          }'

      Supported repository types:
      • fs (filesystem)
      • S3
      • HDFS (Hadoop)
      • Azure
    • SNAPSHOTS
      A repository can contain multiple snapshots of the same cluster. Snapshots are identified by unique names within the cluster.

        $ curl -XPUT "localhost:9200/_snapshot/my_backup/snapshot_1" -d '{ "indices": "index_1,index_2" }'

      • The index snapshot process is incremental.
      • Only one snapshot process can be executed in the cluster at any time.
      • The snapshotting process is executed in a non-blocking fashion.
    • RESTORE
      A snapshot can be restored using the following command:

        $ curl -XPOST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -d '{
            "indices": "index_1,index_2",
            "rename_pattern": "index_(.+)",
            "rename_replacement": "restored_index_$1"
          }'

      • The restore operation can be performed on a functioning cluster.
      • An existing index can only be restored if it is closed.
      • The restored persistent settings are added to the existing persistent settings.
    • ELASTICSEARCH-PY Official low-level client for Elasticsearch
    • Features:
      • translating basic Python data types to and from JSON (datetimes are not decoded for performance reasons)
      • configurable automatic discovery of cluster nodes
      • persistent connections
      • load balancing (with pluggable selection strategy) across all available nodes
      • failed connection penalization (time-based: failed connections won't be retried until a timeout is reached)
      • thread safety
      • pluggable architecture

      Versioning: there are two branches, master and 0.4. The master branch is used to track all the changes for Elasticsearch 1.0 and beyond, whereas 0.4 tracks Elasticsearch 0.90.