elasticsearch - advanced features in practice
How we used faceted search, the percolator, and the scroll API to identify suspicious contracts published by the Slovak government.

Transcript

  • 1. elasticsearch - ADVANCED FEATURES IN PRACTICE @JSUCHAL #RUBYSLAVA
  • 2. elasticwhat?
    • based on Apache Lucene
    • REST API
    • Data & API in JSON
    • Schema-free
    • Real time
    • Distributed
    • Advanced functionality
  • 3. Quickstart
    1. Download & extract from http://www.elasticsearch.org/download/
    2. $ bin/elasticsearch -f
    3. There is no step 3.
  • 4. Quickstart - index
    $ curl -XPOST http://localhost:9200/rubyslava/talks/1 -d '{
      "title" : "elasticsearch - advanced features in practice",
      "presenter" : "jsuchal",
      "presented_at" : "2011-09-22T19:00:00",
      "message" : "hopefully clear",
      "tags" : ["elasticsearch", "rocks"]
    }'
    => {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}
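The same indexing call can be sketched from Ruby using only the standard library. This is a minimal sketch, not the presenter's code: the helper name `build_index_request` and the `host` default are assumptions, and the request is only built here, not sent, so it can be inspected without a running node.

```ruby
require "json"
require "net/http"
require "uri"

# Build (but do not send) the HTTP request that indexes one document.
# Index, type, and id mirror the slide; the helper itself is hypothetical.
def build_index_request(index, type, id, document, host: "http://localhost:9200")
  uri = URI("#{host}/#{index}/#{type}/#{id}")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(document)
  request
end

talk = {
  "title" => "elasticsearch - advanced features in practice",
  "presenter" => "jsuchal",
  "presented_at" => "2011-09-22T19:00:00",
  "message" => "hopefully clear",
  "tags" => ["elasticsearch", "rocks"]
}

request = build_index_request("rubyslava", "talks", 1, talk)
puts "#{request.method} #{request.path}"

# To send it against a running node (assumed to listen on localhost:9200):
# response = Net::HTTP.start(request.uri.hostname, request.uri.port) do |http|
#   http.request(request)
# end
# puts response.body
```

Keeping request construction separate from sending makes the payload easy to check in a unit test.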
  • 5. Quickstart - search
    $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'
    =>
    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 0.054244425,
        "hits" : [ {
          "_index" : "rubyslava",
          "_type" : "talks",
          "_id" : "1",
          "_score" : 0.054244425,
          "_source" : {
            "title" : "elasticsearch - advanced features in practice",
            "presenter" : "jsuchal",
            "presented_at" : "2011-09-22T19:00:00",
            "message" : "hopefully clear",
            "tags" : ["elasticsearch", "rocks"]
          }
        } ]
      }
    }
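When the search URL is built in code rather than typed into a shell, the query string should be encoded so that characters such as `&`, `:` and spaces survive intact. A minimal sketch, assuming a hypothetical helper `search_uri` and a local node:

```ruby
require "uri"

# Build the URI for a query-string search (hypothetical helper).
# URI.encode_www_form takes care of escaping the q parameter.
def search_uri(index, type, q, host: "http://localhost:9200")
  uri = URI("#{host}/#{index}/#{type}/_search")
  uri.query = URI.encode_www_form(q: q, pretty: true)
  uri
end

uri = search_uri("rubyslava", "talks", "presenter:jsuchal AND tags:rocks")
puts uri
```

The field-qualified query used here (`presenter:jsuchal`) is only an illustration; the slide searches for the bare term `jsuchal`.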
  • 6. Advanced features
    • Search
        • analyzer, stemming, ngrams, ASCII folding & custom analyzers
        • boosting, fragment highlighting, fuzzy search
        • …
    • Facets
    • Percolate
    • Scroll
    • and more…
  • 7. The Case Study
    • Find suspicious government contracts
        • using heuristics
            • IT contract where price > 1M euro
            • Supplier company age < 3 months
        • using crowdsourcing
    • Data
        • Central government contract repositories
            • www.crz.gov.sk, zmluvy.egov.sk
        • ~70K contracts in 8 months
        • 100+ GB pdf/doc/scan
  • 8. The Solution: Faceted search
  • 9. The Solution Faceted search  Search  e.g. Find all contracts by Orange Slovakia  Analyze  e.g. Which department has most contracts with Orange Slovakia?  e.g. What is the contract price distribution for Orange Slovakia? …  Define penalty heuristics
  • 10. Facets Types  term, range, histogram, statistical, geo distance$ curl -XPOST http://localhost:9200/rubyslava/talks/_search?pretty -d { "query" : { "match_all" : { } }, "facets" : { "tags_facet" : { "terms" : { "field" : "tags", "size" : 10 } } }}
  • 11. Facets - results
    {
      "took" : 2,
      …
      "hits" : { … },
      "facets" : {
        "tags_facet" : {
          "_type" : "terms",
          "missing" : 0,
          "total" : 2,
          "other" : 0,
          "terms" : [
            { "term" : "rocks", "count" : 1 },
            { "term" : "elasticsearch", "count" : 1 }
          ]
        }
      }
    }
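A terms facet comes back as a list of term/count pairs, which is usually easier to work with as a hash. The sketch below parses a response shaped like the one on the slide (with the elided parts left out); the helper name `term_counts` is hypothetical.

```ruby
require "json"

# Response fragment shaped like the slide's output; elided fields are omitted.
response_body = <<~BODY
  {
    "took": 2,
    "facets": {
      "tags_facet": {
        "_type": "terms",
        "missing": 0,
        "total": 2,
        "other": 0,
        "terms": [
          { "term": "rocks", "count": 1 },
          { "term": "elasticsearch", "count": 1 }
        ]
      }
    }
  }
BODY

# Turn one terms facet into a { term => count } hash (hypothetical helper).
def term_counts(response, facet_name)
  facet = response.fetch("facets").fetch(facet_name)
  facet.fetch("terms").map { |entry| [entry["term"], entry["count"]] }.to_h
end

counts = term_counts(JSON.parse(response_body), "tags_facet")
counts.each { |term, count| puts "#{term}: #{count}" }
```

Using `fetch` instead of `[]` makes a missing facet fail loudly with a `KeyError` rather than a `nil` error later on.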
  • 12. Facets - advanced
    • Problem
        • Generate options for facets with some selected restrictions
    • Solution
        • global facet
        • facet_filter
    {
      "facets" : {
        "<FACET NAME>" : {
          "<FACET TYPE>" : { ... },
          "global" : true,
          "facet_filter" : {
            "term" : { "supplier.untouched" : "Orange Slovakia, a.s." }
          }
        }
      }
    }
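Filling in the template above for one concrete question ("which department has the most contracts with Orange Slovakia?") could look like the following sketch. The helper name and the `department` field are assumptions made for illustration; only `supplier.untouched` and the `global`/`facet_filter` keys come from the slide.

```ruby
require "json"

# Build a search body whose facet ignores the main query ("global")
# but is narrowed by its own filter ("facet_filter").
# The helper and the "department" field name are hypothetical.
def global_filtered_facet(name, field, filter_field, filter_value, size: 10)
  {
    "query" => { "match_all" => {} },
    "facets" => {
      name => {
        "terms" => { "field" => field, "size" => size },
        "global" => true,
        "facet_filter" => { "term" => { filter_field => filter_value } }
      }
    }
  }
end

body = global_filtered_facet("departments", "department",
                             "supplier.untouched", "Orange Slovakia, a.s.")
puts JSON.pretty_generate(body)
```

The resulting hash can be serialized with `JSON.generate` and sent as the body of a `_search` request.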
  • 13. Percolate Problem  New contract/document added, which heuristics does it match? Solution 1. Save heuristics/searches in percolator index 2. Percolate new documents
  • 14. Percolate
    $ curl -XPUT localhost:9200/_percolator/rubyslava/heuristic-1 -d '{
      "query" : {
        "term" : { "tags" : "rubyslava" }
      }
    }'
    $ curl -XPOST http://localhost:9200/rubyslava/talks/_percolate -d '{
      "doc" : {
        "tags" : ["rubyslava", "rocks", "too"]
      }
    }'
    => {"ok":true,"matches":["heuristic-1"]}
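Percolation inverts the usual search: queries are stored first and each new document is matched against them. The toy class below models only that idea in memory, for `term` queries; it is an illustration with hypothetical names and says nothing about how Elasticsearch implements the percolator internally.

```ruby
# Toy in-memory model of percolation: queries are registered up front,
# documents are matched against them later. Only "term" queries are
# supported; this is NOT Elasticsearch's implementation.
class ToyPercolator
  def initialize
    @queries = {}
  end

  # Register a query under a name, e.g. { "term" => { "tags" => "rubyslava" } }.
  def register(name, query)
    @queries[name] = query
  end

  # Return the names of all registered queries that the document matches.
  def percolate(doc)
    @queries.select { |_name, query| matches?(query, doc) }.keys
  end

  private

  # A term query matches when the field equals the value or,
  # for array fields, contains it.
  def matches?(query, doc)
    query.fetch("term").all? do |field, value|
      Array(doc[field]).include?(value)
    end
  end
end

percolator = ToyPercolator.new
percolator.register("heuristic-1", { "term" => { "tags" => "rubyslava" } })
percolator.register("heuristic-2", { "term" => { "tags" => "java" } })

matches = percolator.percolate({ "tags" => ["rubyslava", "rocks", "too"] })
puts matches.inspect
```

As on the slide, the document tagged `rubyslava` matches `heuristic-1` only; `heuristic-2` is an extra query added here to show a non-match.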
  • 15. Scroll Problem  New heuristic added and matches many (1K+) documents  Add heuristic to all matching documents  + Offset performance problem known in RDBMS Solution  Use async background job  Scroll through results (a.k.a. cursor)
  • 16. Scroll
    $ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
      "query": { "match_all" : {} }
    }'
    =>
    {
      "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
      "took" : 2,
      …
      "hits" : […]
    }
    $ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'
    => more results & repeat
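  • 17. Ruby Scroll API
    • Mimics find_each in ActiveRecord
    def find_each(query)
      scroll_id = nil
      processed = 0
      loop do
        # The first iteration opens the scroll; later ones fetch the next page.
        if scroll_id
          result = scroll(scroll_id)
        else
          result = initiate_scroll(query)
          scroll_id = result.scroll_id
        end
        result.hits.each { |document| yield document }
        processed += result.hits.size
        # Stop once everything is processed, or if a page comes back empty
        # (guards against looping forever when the total is never reached).
        break if result.hits.empty? || processed >= result.hits.total
      end
    end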
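The scroll loop can be tried without a running cluster by swapping in a stand-in client. Everything below is a sketch under stated assumptions: `FakeScrollClient`, its `Page` struct, and the `client` argument are hypothetical, and the page object is flatter than the `result.hits.total` shape used on the slide.

```ruby
# Hypothetical stand-in for a scroll-capable client: it serves an array
# in fixed-size pages and hands out a dummy scroll id.
class FakeScrollClient
  Page = Struct.new(:scroll_id, :hits, :total)

  def initialize(documents, page_size)
    @documents = documents
    @page_size = page_size
    @offset = 0
  end

  # Open a scroll and return the first page.
  def initiate_scroll(_query)
    @offset = 0
    next_page
  end

  # Return the next page for an open scroll.
  def scroll(_scroll_id)
    next_page
  end

  private

  def next_page
    hits = @documents[@offset, @page_size] || []
    @offset += hits.size
    Page.new("scroll-1", hits, @documents.size)
  end
end

# Yield every matching document, one page at a time,
# until the backend returns an empty page.
def find_each(client, query)
  page = client.initiate_scroll(query)
  until page.hits.empty?
    page.hits.each { |document| yield document }
    page = client.scroll(page.scroll_id)
  end
end

seen = []
client = FakeScrollClient.new((1..7).to_a, 3)
find_each(client, { "query" => { "match_all" => {} } }) { |doc| seen << doc }
puts seen.inspect
```

Stopping on the first empty page, rather than comparing a running count against the reported total, keeps the loop finite even if documents are deleted while the scroll is open.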
  • 18. Tutorials & Guides
    • http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
    • http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained
    • http://www.elasticsearch.org/guide/