elasticsearch - advanced features in practice


How we used faceted search, the percolator and the scroll API to identify suspicious contracts published by the Slovak government.



Transcript

  • 1. elasticsearch ADVANCED FEATURES IN PRACTICE @JSUCHAL #RUBYSLAVA
  • 2. elasticwhat?
      • based on Apache Lucene
      • REST API
      • Data & API in JSON
      • Schema-free
      • Real time
      • Distributed
      • Advanced functionality
  • 3. Quickstart
      1. Download & extract from http://www.elasticsearch.org/download/
      2. $ bin/elasticsearch -f
      3. There is no step 3.
  • 4. Quickstart - index
      $ curl -XPOST http://localhost:9200/rubyslava/talks/1 -d '{
          "title" : "elasticsearch - advanced features in practice",
          "presenter" : "jsuchal",
          "presented_at" : "2011-09-22T19:00:00",
          "message" : "hopefully clear",
          "tags" : ["elasticsearch", "rocks"]
        }'
      => {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}
  • 5. Quickstart - search
      $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'
      => {
        "took" : 2,
        "timed_out" : false,
        "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
        "hits" : {
          "total" : 1,
          "max_score" : 0.054244425,
          "hits" : [ {
            "_index" : "rubyslava",
            "_type" : "talks",
            "_id" : "1",
            "_score" : 0.054244425,
            "_source" : {
              "title" : "elasticsearch - advanced features in practice",
              "presenter" : "jsuchal",
              "presented_at" : "2011-09-22T19:00:00",
              "message" : "hopefully clear",
              "tags" : ["elasticsearch", "rocks"]
            }
          } ]
        }
      }
  • 6. Advanced features
      • Search
          • analyzer, stemming, ngrams, ascii folding & custom analyzers
          • boosting, fragment highlighting, fuzzy search
          • …
      • Facets
      • Percolate
      • Scroll
      • and more…
  • 7. The Case Study
      • Find suspicious government contracts
          • using heuristics
              • IT contract where price > 1M euro
              • Supplier company age < 3 months
          • using crowdsourcing
      • Data
          • Central government contract repositories
              • www.crz.gov.sk, zmluvy.egov.sk
          • ~70K contracts in 8 months
          • 100+ GB of pdf/doc/scans
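The two heuristics on this slide can be sketched as plain Ruby predicates. This is only an illustration of the rules as stated, not the authors' implementation: the `Contract` struct, its field names, and the heuristic names are all assumptions made up for the example.

```ruby
require "date"

# Hypothetical contract record; the field names are assumptions for illustration.
Contract = Struct.new(:category, :price_eur, :supplier_founded_on, :signed_on,
                      keyword_init: true)

# Each heuristic is a (hypothetical) name plus a predicate over a contract.
HEURISTICS = {
  # IT contract where price > 1M euro
  "expensive_it_contract" => ->(c) { c.category == "IT" && c.price_eur > 1_000_000 },
  # Supplier company age < 3 months: the contract was signed before the
  # founding date shifted forward by three months.
  "young_supplier" => ->(c) { c.signed_on < (c.supplier_founded_on >> 3) }
}.freeze

# Returns the names of all heuristics that fire for the given contract.
def matching_heuristics(contract)
  HEURISTICS.select { |_name, rule| rule.call(contract) }.keys
end

contract = Contract.new(
  category: "IT",
  price_eur: 2_500_000,
  supplier_founded_on: Date.new(2011, 6, 1),
  signed_on: Date.new(2011, 7, 15)
)
p matching_heuristics(contract) # => ["expensive_it_contract", "young_supplier"]
```

In the talk these rules are expressed as saved searches rather than Ruby lambdas; the sketch only makes the conditions themselves concrete.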
  • 8. The Solution
      • Faceted search
  • 9. The Solution Faceted search  Search  e.g. Find all contracts by Orange Slovakia  Analyze  e.g. Which department has most contracts with Orange Slovakia?  e.g. What is the contract price distribution for Orange Slovakia? …  Define penalty heuristics
  • 10. Facets Types  term, range, histogram, statistical, geo distance$ curl -XPOST http://localhost:9200/rubyslava/talks/_search?pretty -d { "query" : { "match_all" : { } }, "facets" : { "tags_facet" : { "terms" : { "field" : "tags", "size" : 10 } } }}
  • 11. Facets - results
      {
        "took" : 2,
        …
        "hits" : { … },
        "facets" : {
          "tags_facet" : {
            "_type" : "terms",
            "missing" : 0,
            "total" : 2,
            "other" : 0,
            "terms" : [
              { "term" : "rocks", "count" : 1 },
              { "term" : "elasticsearch", "count" : 1 }
            ]
          }
        }
      }
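To show how a client might consume this response, here is a small sketch that turns the `terms` array of a facet into a term-to-count hash. The `facet_counts` helper is a hypothetical name, and the sample response is a trimmed copy of the one on the slide (the elided `took` and `hits` parts are left out).

```ruby
require "json"

# Trimmed copy of the facet response from the slide.
response = JSON.parse(<<~JSON)
  {
    "facets": {
      "tags_facet": {
        "_type": "terms", "missing": 0, "total": 2, "other": 0,
        "terms": [
          { "term": "rocks", "count": 1 },
          { "term": "elasticsearch", "count": 1 }
        ]
      }
    }
  }
JSON

# Hypothetical helper: maps each facet term to its document count.
def facet_counts(response, facet_name)
  terms = response.fetch("facets").fetch(facet_name).fetch("terms")
  terms.map { |entry| [entry["term"], entry["count"]] }.to_h
end

p facet_counts(response, "tags_facet") # => {"rocks"=>1, "elasticsearch"=>1}
```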
  • 12. Facets - advanced Problem  Generate options for facets with some selected restrictions Solution  global facet  facet_filter { "facets" : { "<FACET NAME>" : { "<FACET TYPE>" : { ... }, "global" : true, "facet_filter" : { "term" : { “supplier.untouched" : “Orange Slovakia, a.s."} } } } }
  • 13. Percolate Problem  New contract/document added, which heuristics does it match? Solution 1. Save heuristics/searches in percolator index 2. Percolate new documents
  • 14. Percolate
      $ curl -XPUT localhost:9200/_percolator/rubyslava/heuristic-1 -d '{
          "query" : {
            "term" : { "tags" : "rubyslava" }
          }
        }'
      $ curl -XPOST http://localhost:9200/rubyslava/talks/_percolate -d '{
          "doc" : {
            "tags" : ["rubyslava", "rocks", "too"]
          }
        }'
      => {"ok":true,"matches":["heuristic-1"]}
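The idea behind percolation (store the queries, then ask which of them a new document matches) can be illustrated with a tiny in-memory model. This is a deliberately simplified sketch, not how Elasticsearch implements the percolator: the `TinyPercolator` class is hypothetical and understands only `term` queries on exact field values.

```ruby
# Simplified in-memory model of "search in reverse": queries are stored,
# documents are matched against them. Only "term" queries are modelled.
class TinyPercolator
  def initialize
    @queries = {}
  end

  # Step 1: save a heuristic/search under a name.
  def register(name, query)
    @queries[name] = query
  end

  # Step 2: return the names of all saved queries the document matches.
  def percolate(doc)
    @queries.select { |_name, query| term_match?(query, doc) }.keys
  end

  private

  # A term query matches when every listed field contains the given value.
  def term_match?(query, doc)
    query.fetch("term").all? do |field, value|
      Array(doc[field]).include?(value)
    end
  end
end

percolator = TinyPercolator.new
percolator.register("heuristic-1", { "term" => { "tags" => "rubyslava" } })
percolator.register("heuristic-2", { "term" => { "tags" => "python" } })

p percolator.percolate({ "tags" => ["rubyslava", "rocks", "too"] }) # => ["heuristic-1"]
```

The data mirrors the slide: the same query and document yield the same single match, `heuristic-1`.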
  • 15. Scroll Problem  New heuristic added and matches many (1K+) documents  Add heuristic to all matching documents  + Offset performance problem known in RDBMS Solution  Use async background job  Scroll through results (a.k.a. cursor)
  • 16. Scroll
      $ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
          "query": { "match_all" : {} }
        }'
      => {
        "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
        "took" : 2,
        …
        "hits" : […],
      }
      $ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'
      => more results & repeat
  • 17. Ruby Scroll API
      • Mimics find_each in ActiveRecord
      def find_each(query, &block)
        scroll_id = nil
        processed = 0
        begin
          if scroll_id
            # Later iterations: fetch the next batch for the open scroll.
            result = scroll(scroll_id)
          else
            # First iteration: run the query and open the scroll.
            result = initiate_scroll(query)
            scroll_id = result.scroll_id
          end
          result.hits.each do |document|
            yield document
          end
          processed += result.hits.size
        end while processed < result.hits.total
      end
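The slide's `find_each` relies on `initiate_scroll` and `scroll` helpers that are not shown. The following self-contained sketch exercises the same loop against an in-memory stand-in, so the control flow can be run without a cluster. `FakeScrollBackend` and the `Page` struct are hypothetical, and the result shape is simplified compared to the slide (`page.documents` and `page.total` instead of `result.hits`).

```ruby
# Hypothetical, simplified result shape: one batch of documents plus the total hit count.
Page = Struct.new(:scroll_id, :documents, :total)

# In-memory stand-in for the scroll endpoints; it pages through an array.
class FakeScrollBackend
  def initialize(documents, page_size)
    @documents = documents
    @page_size = page_size
    @cursors = {}
  end

  # Opens a scroll and returns the first batch.
  def initiate_scroll(_query)
    next_page("scroll-1", 0)
  end

  # Returns the next batch for an open scroll.
  def scroll(scroll_id)
    next_page(scroll_id, @cursors.fetch(scroll_id))
  end

  private

  def next_page(scroll_id, offset)
    @cursors[scroll_id] = offset + @page_size
    Page.new(scroll_id, @documents[offset, @page_size] || [], @documents.size)
  end
end

# Yields every matching document, fetching one batch at a time.
def find_each(backend, query)
  page = backend.initiate_scroll(query)
  processed = 0
  loop do
    page.documents.each { |document| yield document }
    processed += page.documents.size
    # Stop when everything was seen; the empty-batch check guards against looping forever.
    break if processed >= page.total || page.documents.empty?
    page = backend.scroll(page.scroll_id)
  end
end

seen = []
backend = FakeScrollBackend.new((1..7).to_a, 3)
find_each(backend, { "match_all" => {} }) { |doc| seen << doc }
p seen # => [1, 2, 3, 4, 5, 6, 7]
```

Compared with offset-based paging, the caller never passes a growing offset; it only hands the scroll id back to get the next batch.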
  • 18. Tutorials & Guides
      • http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
      • http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained
      • http://www.elasticsearch.org/guide/