elasticsearch - advanced features in practice

How we used faceted search, the percolator, and the scroll API to identify suspicious contracts published by the Slovak government.

Published in: Technology, News & Politics

Transcript

  • 1. elasticsearch ADVANCED FEATURES IN PRACTICE @JSUCHAL #RUBYSLAVA
  • 2. elasticwhat?
      - based on Apache Lucene
      - REST API
      - Data & API in JSON
      - Schema-free
      - Real time
      - Distributed
      - Advanced functionality
  • 3. Quickstart
      1. Download & extract from http://www.elasticsearch.org/download/
      2. $ bin/elasticsearch -f
      3. There is no step 3.
  • 4. Quickstart - index
      $ curl -XPOST http://localhost:9200/rubyslava/talks/1 -d '{
          "title" : "elasticsearch - advanced features in practice",
          "presenter" : "jsuchal",
          "presented_at" : "2011-09-22T19:00:00",
          "message" : "hopefully clear",
          "tags" : ["elasticsearch", "rocks"]
        }'
      => {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}
  • 5. Quickstart - search
      $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'
      =>
      {
        "took" : 2,
        "timed_out" : false,
        "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
        "hits" : {
          "total" : 1,
          "max_score" : 0.054244425,
          "hits" : [ {
            "_index" : "rubyslava",
            "_type" : "talks",
            "_id" : "1",
            "_score" : 0.054244425,
            "_source" : {
              "title" : "elasticsearch - advanced features in practice",
              "presenter" : "jsuchal",
              "presented_at" : "2011-09-22T19:00:00",
              "message" : "hopefully clear",
              "tags" : ["elasticsearch", "rocks"]
            }
          } ]
        }
      }
  • 6. Advanced features
      - Search
          - analyzer, stemming, ngrams, ascii folding & custom analyzers
          - boosting, fragment highlighting, fuzzy search
          - …
      - Facets
      - Percolate
      - Scroll
      - and more…
  • 7. The Case Study
      - Find suspicious government contracts
          - using heuristics
              - IT contract where price > 1M euro
              - Supplier company age < 3 months
          - using crowdsourcing
      - Data
          - Central government contract repositories
              - www.crz.gov.sk, zmluvy.egov.sk
          - ~70K contracts in 8 months
          - 100+ GB pdf/doc/scan
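The two heuristics on this slide could be written down as saved searches. The following is only a sketch: the field names (`sector`, `price_eur`, `supplier_age_days`) are hypothetical, since the slides do not show the real index mapping, and it assumes the supplier's age at signing time has been precomputed into a numeric field, because a plain range query cannot compare two fields of the same document.

```ruby
require "json"

# Hypothetical field names; the slides do not show the real index mapping.

# Heuristic 1: IT contract where price > 1M euro.
def expensive_it_contract_query(min_price_eur = 1_000_000)
  {
    "query" => {
      "bool" => {
        "must" => [
          { "term"  => { "sector" => "it" } },
          { "range" => { "price_eur" => { "gt" => min_price_eur } } }
        ]
      }
    }
  }
end

# Heuristic 2: supplier company younger than 3 months (about 90 days).
# Assumes supplier_age_days was computed when the contract was indexed.
def young_supplier_query(max_age_days = 90)
  { "query" => { "range" => { "supplier_age_days" => { "lt" => max_age_days } } } }
end

HEURISTIC_QUERIES = {
  "expensive-it-contract" => expensive_it_contract_query,
  "young-supplier"        => young_supplier_query
}.freeze

# Print each heuristic as the JSON body that would be sent to the server.
HEURISTIC_QUERIES.each do |name, query|
  puts "#{name}: #{JSON.generate(query)}"
end
```

Keeping each heuristic as a named query body means the same definition can be reused for searching existing contracts and for matching newly added ones.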
  • 8. The Solution
      - Faceted search
  • 9. The Solution Faceted search  Search  e.g. Find all contracts by Orange Slovakia  Analyze  e.g. Which department has most contracts with Orange Slovakia?  e.g. What is the contract price distribution for Orange Slovakia? …  Define penalty heuristics
  • 10. Facets Types  term, range, histogram, statistical, geo distance$ curl -XPOST http://localhost:9200/rubyslava/talks/_search?pretty -d { "query" : { "match_all" : { } }, "facets" : { "tags_facet" : { "terms" : { "field" : "tags", "size" : 10 } } }}
  • 11. Facets - results
      {
        "took" : 2,
        …
        "hits" : { … },
        "facets" : {
          "tags_facet" : {
            "_type" : "terms",
            "missing" : 0,
            "total" : 2,
            "other" : 0,
            "terms" : [
              { "term" : "rocks", "count" : 1 },
              { "term" : "elasticsearch", "count" : 1 }
            ]
          }
        }
      }
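To turn a facet response like the one on this slide into filter options for a UI, the `terms` array can be reduced to a term-to-count hash. This is a minimal sketch using only Ruby's standard `json` library; the helper name `facet_counts` is made up for illustration, and the sample body contains just the `facets` part of the response.

```ruby
require "json"

# Trimmed copy of the facet part of the response from the slide.
FACET_RESPONSE = <<~JSON
  {
    "facets": {
      "tags_facet": {
        "_type": "terms", "missing": 0, "total": 2, "other": 0,
        "terms": [
          { "term": "rocks", "count": 1 },
          { "term": "elasticsearch", "count": 1 }
        ]
      }
    }
  }
JSON

# Returns { term => count } for one named terms facet.
# facet_counts is an illustrative helper, not part of any client library.
def facet_counts(body, facet_name)
  facet = JSON.parse(body).fetch("facets").fetch(facet_name)
  facet.fetch("terms").each_with_object({}) do |entry, counts|
    counts[entry.fetch("term")] = entry.fetch("count")
  end
end

facet_counts(FACET_RESPONSE, "tags_facet").each do |term, count|
  puts "#{term} (#{count})"
end
```

Using `fetch` instead of `[]` makes a missing facet fail loudly rather than silently returning an empty list of options.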
  • 12. Facets - advanced
      - Problem
          - Generate options for facets with some selected restrictions
      - Solution
          - global facet
          - facet_filter
      {
        "facets" : {
          "<FACET NAME>" : {
            "<FACET TYPE>" : { ... },
            "global" : true,
            "facet_filter" : {
              "term" : { "supplier.untouched" : "Orange Slovakia, a.s." }
            }
          }
        }
      }
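The global facet plus `facet_filter` pattern can be generated from the currently selected restriction. The sketch below only builds the request body as a Ruby hash; the helper name and the `department.untouched` field are assumptions for illustration, while `supplier.untouched` and the supplier name come from the slide.

```ruby
require "json"

# Builds a terms facet that ignores the main query ("global") but is
# narrowed by one selected restriction ("facet_filter").
# global_filtered_facet is an illustrative helper name.
def global_filtered_facet(name, field, filter_field, filter_value, size = 10)
  {
    "facets" => {
      name => {
        "terms"        => { "field" => field, "size" => size },
        "global"       => true,
        "facet_filter" => { "term" => { filter_field => filter_value } }
      }
    }
  }
end

# "department.untouched" is a hypothetical field; only the supplier
# field appears in the slides.
body = global_filtered_facet(
  "departments", "department.untouched",
  "supplier.untouched", "Orange Slovakia, a.s."
)
puts JSON.pretty_generate(body)
```

With one such facet per filterable field, each facet can list its options under the other selected restrictions without being narrowed by its own selection.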
  • 13. Percolate Problem  New contract/document added, which heuristics does it match? Solution 1. Save heuristics/searches in percolator index 2. Percolate new documents
  • 14. Percolate
      $ curl -XPUT localhost:9200/_percolator/rubyslava/heuristic-1 -d '{
          "query" : { "term" : { "tags" : "rubyslava" } }
        }'
      $ curl -XPOST http://localhost:9200/rubyslava/talks/_percolate -d '{
          "doc" : { "tags" : ["rubyslava", "rocks", "too"] }
        }'
      => {"ok":true,"matches":["heuristic-1"]}
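Percolation inverts the usual search: queries are stored first, and each new document is checked against them. The sketch below is an in-memory analogy in plain Ruby, not the Elasticsearch percolator itself. Heuristics are modeled as lambdas, and `heuristic-2` with its `price` field is a made-up second rule added only to show a non-match.

```ruby
# Stored heuristics: name => predicate over a document hash.
# This imitates the idea of the percolator locally; it does not
# talk to Elasticsearch.
HEURISTICS = {
  "heuristic-1" => ->(doc) { Array(doc["tags"]).include?("rubyslava") },
  # Hypothetical second rule, not from the slides.
  "heuristic-2" => ->(doc) { doc.fetch("price", 0) > 1_000_000 }
}.freeze

# Returns the names of all stored heuristics that match the document.
def percolate(doc, heuristics = HEURISTICS)
  heuristics.select { |_name, matches| matches.call(doc) }.keys
end

p percolate({ "tags" => ["rubyslava", "rocks", "too"] })
# prints ["heuristic-1"]
```

The shape mirrors the two steps on the slides: register the searches once, then run every incoming document through them to learn which heuristics it triggers.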
  • 15. Scroll Problem  New heuristic added and matches many (1K+) documents  Add heuristic to all matching documents  + Offset performance problem known in RDBMS Solution  Use async background job  Scroll through results (a.k.a. cursor)
  • 16. Scroll
      $ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
          "query" : { "match_all" : {} }
        }'
      =>
      {
        "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
        "took" : 2,
        …
        "hits" : […]
      }
      $ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'
      => more results & repeat
  • 17. Ruby Scroll API
      - Mimics find_each in ActiveRecord
      def find_each(query, &block)
        scroll_id = nil
        processed = 0
        begin
          if scroll_id
            # Later passes: fetch the next page of the open scroll.
            result = scroll(scroll_id)
          else
            # First pass: open the scroll and remember its id.
            result = initiate_scroll(query)
            scroll_id = result.scroll_id
          end
          # Stop on an empty page so the loop cannot spin forever.
          break if result.hits.size.zero?
          result.hits.each(&block)
          processed += result.hits.size
        end while processed < result.hits.total
      end
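The scroll loop can be exercised without a running server by swapping in a fake backend. Everything below is a hypothetical stand-in: `Hits`, `Result` and `FakeScrollBackend` only imitate the `scroll_id`, `hits.size` and `hits.total` interface that the slide's code relies on, and this `find_each` takes the backend as an explicit argument so the sketch is self-contained.

```ruby
# Hypothetical stand-ins for the result objects the slide's code expects.
class Hits
  attr_reader :total

  def initialize(documents, total)
    @documents = documents
    @total = total
  end

  def each(&block)
    @documents.each(&block)
  end

  def size
    @documents.size
  end
end

Result = Struct.new(:scroll_id, :hits)

# Serves an in-memory array page by page, like a server-side cursor.
class FakeScrollBackend
  def initialize(documents, page_size)
    @documents = documents
    @page_size = page_size
    @offsets = {} # scroll_id => offset of the next page
  end

  def initiate_scroll(_query)
    page("scroll-1", 0)
  end

  def scroll(scroll_id)
    page(scroll_id, @offsets.fetch(scroll_id))
  end

  private

  def page(scroll_id, offset)
    @offsets[scroll_id] = offset + @page_size
    slice = @documents[offset, @page_size] || []
    Result.new(scroll_id, Hits.new(slice, @documents.size))
  end
end

# Yields every matching document, one page at a time.
def find_each(backend, query)
  scroll_id = nil
  processed = 0
  loop do
    result = scroll_id ? backend.scroll(scroll_id) : backend.initiate_scroll(query)
    scroll_id = result.scroll_id
    break if result.hits.size.zero? # guard against an endless loop
    result.hits.each { |document| yield document }
    processed += result.hits.size
    break if processed >= result.hits.total
  end
end

backend = FakeScrollBackend.new((1..7).map { |i| { "id" => i } }, 3)
seen = []
find_each(backend, { "query" => { "match_all" => {} } }) { |doc| seen << doc["id"] }
p seen
# prints [1, 2, 3, 4, 5, 6, 7]
```

Because the cursor position lives behind the scroll id, the caller never computes offsets, which is what sidesteps the deep-offset cost mentioned on the Scroll slide.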
  • 18. Tutorials & Guides
      - http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
      - http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained
      - http://www.elasticsearch.org/guide/
