elasticsearch - advanced features in practice

How we used faceted search, the percolator and the scroll API to identify suspicious contracts published by the Slovak government.

Published in: Technology, News & Politics
Transcript of "elasticsearch - advanced features in practice"

  1. elasticsearch: ADVANCED FEATURES IN PRACTICE
      @JSUCHAL, #RUBYSLAVA
  2. elasticwhat?
      - based on Apache Lucene
      - REST API
      - Data & API in JSON
      - Schema-free
      - Real time
      - Distributed
      - Advanced functionality
  3. Quickstart
      1. Download & extract from http://www.elasticsearch.org/download/
      2. $ bin/elasticsearch -f
      3. There is no step 3.
  4. Quickstart - index

          $ curl -XPOST http://localhost:9200/rubyslava/talks/1 -d '{
              "title" : "elasticsearch - advanced features in practice",
              "presenter" : "jsuchal",
              "presented_at" : "2011-09-22T19:00:00",
              "message" : "hopefully clear",
              "tags" : ["elasticsearch", "rocks"]
          }'
          => {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}
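The same indexing call can be sketched from Ruby with only the standard library. This is a minimal illustration, not the talk's code: `build_index_request` is a hypothetical helper, and the request is only built here, not sent, so it runs without a node on `localhost:9200`.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper: build (but do not send) the HTTP request that
# indexes one document under /<index>/<type>/<id>, as in the curl call.
def build_index_request(document, index:, type:, id:, host: "http://localhost:9200")
  uri = URI("#{host}/#{index}/#{type}/#{id}")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(document)
  request
end

talk = {
  "title" => "elasticsearch - advanced features in practice",
  "presenter" => "jsuchal",
  "presented_at" => "2011-09-22T19:00:00",
  "message" => "hopefully clear",
  "tags" => ["elasticsearch", "rocks"]
}

request = build_index_request(talk, index: "rubyslava", type: "talks", id: 1)

# Sending it requires a running node, so it is left commented out:
#   uri = request.uri
#   response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
#   puts response.body
```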
  5. Quickstart - search

          $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'
          => {
            "took" : 2,
            "timed_out" : false,
            "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
            "hits" : {
              "total" : 1,
              "max_score" : 0.054244425,
              "hits" : [ {
                "_index" : "rubyslava",
                "_type" : "talks",
                "_id" : "1",
                "_score" : 0.054244425,
                "_source" : {
                  "title" : "elasticsearch - advanced features in practice",
                  "presenter" : "jsuchal",
                  "presented_at" : "2011-09-22T19:00:00",
                  "message" : "hopefully clear",
                  "tags" : ["elasticsearch", "rocks"]
                }
              } ]
            }
          }
  6. Advanced features
      - Search
        - analyzer, stemming, ngrams, ascii folding & custom analyzers
        - boosting, fragment highlighting, fuzzy search
        - …
      - Facets
      - Percolate
      - Scroll
      - and more…
  7. The Case Study
      - Find suspicious government contracts
        - using heuristics
          - IT contract where price > 1M euro
          - Supplier company age < 3 months
        - using crowdsourcing
      - Data
        - Central government contract repositories
          - www.crz.gov.sk, zmluvy.egov.sk
        - ~70K contracts in 8 months
        - 100+ GB pdf/doc/scan
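To make the two example heuristics concrete, here is a small in-memory sketch in plain Ruby. It is only an illustration: the field names (`category`, `price_eur`, `supplier_founded_on`, `signed_on`) and the heuristic names are hypothetical, and "3 months" is approximated as 90 days.

```ruby
require "date"

# Hypothetical heuristics, each a named predicate over one contract.
# Field names are assumptions; "3 months" is approximated as 90 days.
HEURISTICS = {
  "expensive-it-contract" => lambda { |c|
    c[:category] == "IT" && c[:price_eur] > 1_000_000
  },
  "young-supplier" => lambda { |c|
    (c[:signed_on] - c[:supplier_founded_on]) < 90
  }
}.freeze

# Return the names of all heuristics that a contract matches.
def matching_heuristics(contract)
  HEURISTICS.select { |_name, rule| rule.call(contract) }.keys
end

contract = {
  category: "IT",
  price_eur: 2_500_000,
  supplier_founded_on: Date.new(2011, 5, 1),
  signed_on: Date.new(2011, 6, 15)
}

flags = matching_heuristics(contract)
```

In the talk's setup these rules live in elasticsearch as queries rather than as Ruby lambdas; the sketch only shows the shape of the matching problem.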
  8. The Solution: Faceted search
  9. The Solution
      - Faceted search
        - Search
          - e.g. Find all contracts by Orange Slovakia
        - Analyze
          - e.g. Which department has most contracts with Orange Slovakia?
          - e.g. What is the contract price distribution for Orange Slovakia?
        - …
        - Define penalty heuristics
  10. Facets
      - Types
        - term, range, histogram, statistical, geo distance

          $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?pretty' -d '{
              "query" : { "match_all" : { } },
              "facets" : {
                "tags_facet" : {
                  "terms" : { "field" : "tags", "size" : 10 }
                }
              }
          }'
  11. Facets - results

          {
            "took" : 2,
            …
            "hits" : { … },
            "facets" : {
              "tags_facet" : {
                "_type" : "terms",
                "missing" : 0,
                "total" : 2,
                "other" : 0,
                "terms" : [
                  { "term" : "rocks", "count" : 1 },
                  { "term" : "elasticsearch", "count" : 1 }
                ]
              }
            }
          }
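As a rough mental model of the numbers in a terms facet result, the counting can be imitated in plain Ruby over in-memory documents. This is a sketch under assumptions, not how elasticsearch computes facets: `terms_facet` is a hypothetical helper, and breaking ties alphabetically is a choice made here, so the order of equally frequent terms may differ from a real response.

```ruby
# Hypothetical in-memory imitation of a terms facet over one field.
# Ties are broken alphabetically here; a real node may order them differently.
def terms_facet(documents, field, size: 10)
  values = documents.flat_map { |doc| Array(doc[field]) }
  counts = values.each_with_object(Hash.new(0)) { |term, acc| acc[term] += 1 }
  top = counts.sort_by { |term, count| [-count, term] }.first(size)

  {
    "missing" => documents.count { |doc| Array(doc[field]).empty? },
    "total" => values.size,
    "other" => values.size - top.sum { |_term, count| count },
    "terms" => top.map { |term, count| { "term" => term, "count" => count } }
  }
end

talks = [
  { "title" => "elasticsearch - advanced features in practice",
    "tags" => ["elasticsearch", "rocks"] }
]

facet = terms_facet(talks, "tags")
```

With the single talk indexed in the quickstart, this yields the same `missing`, `total`, `other` and per-term counts as the response on the slide.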
  12. Facets - advanced
      - Problem
        - Generate options for facets with some selected restrictions
      - Solution
        - global facet
        - facet_filter

          {
            "facets" : {
              "<FACET NAME>" : {
                "<FACET TYPE>" : { ... },
                "global" : true,
                "facet_filter" : {
                  "term" : { "supplier.untouched" : "Orange Slovakia, a.s." }
                }
              }
            }
          }
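The template above can be filled in programmatically. The following sketch builds such a request body as a Ruby hash; `filtered_global_facet` is a hypothetical helper, and the `departments` facet name and `department` field are made up for illustration. Only the `supplier.untouched` filter comes from the slide.

```ruby
require "json"

# Hypothetical helper: wrap any facet definition so that it is computed
# globally (ignoring the main query) but restricted by its own filter.
def filtered_global_facet(name, facet_type, facet_options, filter)
  {
    "facets" => {
      name => {
        facet_type => facet_options,
        "global" => true,
        "facet_filter" => filter
      }
    }
  }
end

body = filtered_global_facet(
  "departments",                                  # assumed facet name
  "terms",
  { "field" => "department", "size" => 10 },      # assumed field
  { "term" => { "supplier.untouched" => "Orange Slovakia, a.s." } }
)

json = JSON.pretty_generate(body)
```

The resulting `json` string can be passed as the `-d` payload of a `_search` request like the one on the Facets slide.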
  13. 13. Percolate Problem  New contract/document added, which heuristics does it match? Solution 1. Save heuristics/searches in percolator index 2. Percolate new documents
  14. Percolate

          $ curl -XPUT localhost:9200/_percolator/rubyslava/heuristic-1 -d '{
              "query" : {
                "term" : { "tags" : "rubyslava" }
              }
          }'

          $ curl -XPOST http://localhost:9200/rubyslava/talks/_percolate -d '{
              "doc" : {
                "tags" : ["rubyslava", "rocks", "too"]
              }
          }'
          => {"ok":true,"matches":["heuristic-1"]}
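Percolation inverts the usual search: queries are stored first, and each new document is matched against them. The idea can be sketched in memory with plain Ruby. `TinyPercolator` is a hypothetical class that supports only exact term queries; it is an illustration of the concept, not a client for the real percolator.

```ruby
# Hypothetical in-memory percolator: stores named term queries and
# reports which of them a given document matches.
class TinyPercolator
  def initialize
    @queries = {}
  end

  # Register a term query under a name, loosely analogous to
  # PUT _percolator/<index>/<name> on the slide.
  def register(name, field, value)
    @queries[name] = [field, value]
  end

  # Return the names of all stored queries that match the document.
  def percolate(doc)
    @queries.select { |_name, (field, value)| Array(doc[field]).include?(value) }.keys
  end
end

percolator = TinyPercolator.new
percolator.register("heuristic-1", "tags", "rubyslava")

matches = percolator.percolate("tags" => ["rubyslava", "rocks", "too"])
```

Here `matches` holds the same heuristic name that the `_percolate` call on the slide returns for the same document.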
  15. 15. Scroll Problem  New heuristic added and matches many (1K+) documents  Add heuristic to all matching documents  + Offset performance problem known in RDBMS Solution  Use async background job  Scroll through results (a.k.a. cursor)
  16. Scroll

          $ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
              "query" : { "match_all" : {} }
          }'
          => {
            "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
            "took" : 2,
            …
            "hits" : […]
          }

          $ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'
          => more results & repeat
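  17. Ruby Scroll API
      - Mimics find_each in ActiveRecord

          # initiate_scroll and scroll are helpers not shown on the slide.
          def find_each(query, &block)
            scroll_id = nil
            processed = 0
            loop do
              # The first request opens the scroll; later ones fetch the next batch.
              result = scroll_id ? scroll(scroll_id) : initiate_scroll(query)
              # Keep the most recent scroll id for the next request.
              scroll_id = result.scroll_id
              # Stop on an empty batch so the loop cannot spin forever.
              break if result.hits.size.zero?
              result.hits.each do |document|
                yield document
              end
              processed += result.hits.size
              break if processed >= result.hits.total
            end
          end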
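The scroll loop can be tried without a running cluster by swapping the HTTP helpers for a stub. In the sketch below, `FakeScrollClient`, `Hits` and `Result` are hypothetical stand-ins that only mimic the interface the slide relies on (`scroll_id`, `hits.each`, `hits.size`, `hits.total`), and this `find_each` takes the client as an extra argument, which the slide's version does not.

```ruby
# Hypothetical stand-ins for a search response.
Result = Struct.new(:scroll_id, :hits)

class Hits
  attr_reader :total

  def initialize(documents, total)
    @documents = documents
    @total = total
  end

  def each(&block)
    @documents.each(&block)
  end

  def size
    @documents.size
  end
end

# Hypothetical client that serves an in-memory array in fixed-size batches.
class FakeScrollClient
  def initialize(documents, batch_size)
    @documents = documents
    @batch_size = batch_size
    @offsets = {} # scroll_id => position of the next unread document
  end

  def initiate_scroll(_query)
    next_batch(0)
  end

  def scroll(scroll_id)
    next_batch(@offsets.fetch(scroll_id))
  end

  private

  def next_batch(offset)
    batch = @documents[offset, @batch_size] || []
    scroll_id = "scroll-#{offset + batch.size}"
    @offsets[scroll_id] = offset + batch.size
    Result.new(scroll_id, Hits.new(batch, @documents.size))
  end
end

# Yield every matching document, one batch at a time.
def find_each(client, query)
  scroll_id = nil
  processed = 0
  loop do
    result = scroll_id ? client.scroll(scroll_id) : client.initiate_scroll(query)
    scroll_id = result.scroll_id
    break if result.hits.size.zero?
    result.hits.each { |document| yield document }
    processed += result.hits.size
    break if processed >= result.hits.total
  end
end

seen = []
client = FakeScrollClient.new((1..7).to_a, 3)
find_each(client, { "query" => { "match_all" => {} } }) { |doc| seen << doc }
```

After the call, `seen` contains all seven documents in order, fetched in three batches.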
  18. Tutorials & Guides
      - http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
      - http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained
      - http://www.elasticsearch.org/guide/