elasticsearch - advanced features in practice
How we used faceted search, the percolator, and the scroll API to identify suspicious contracts published by the Slovak government.

    Presentation Transcript

    • elasticsearch: ADVANCED FEATURES IN PRACTICE @JSUCHAL #RUBYSLAVA
    • elasticwhat?
        • Based on Apache Lucene
        • REST API
        • Data & API in JSON
        • Schema-free
        • Real time
        • Distributed
        • Advanced functionality
    • Quickstart
        1. Download & extract from http://www.elasticsearch.org/download/
        2. $ bin/elasticsearch -f
        3. There is no step 3.
    • Quickstart - index
        $ curl -XPOST http://localhost:9200/rubyslava/talks/1 -d '{
            "title" : "elasticsearch - advanced features in practice",
            "presenter" : "jsuchal",
            "presented_at" : "2011-09-22T19:00:00",
            "message" : "hopefully clear",
            "tags" : ["elasticsearch", "rocks"]
        }'
        => {"ok":true,"_index":"rubyslava","_type":"talks","_id":"1","_version":1}
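The same index call can be prepared from Ruby with only the standard library. This is a minimal sketch, not the talk's own code: the helper name `build_index_request` is hypothetical, and the request is only built, not sent, so it runs without a node.

```ruby
require "json"
require "net/http"
require "uri"

# Build (but do not send) the HTTP request that indexes one document.
# Hypothetical helper for illustration; index, type and id mirror the curl call.
def build_index_request(base_url, index, type, id, document)
  uri = URI.join(base_url, "/#{index}/#{type}/#{id}")
  request = Net::HTTP::Post.new(uri.path, "Content-Type" => "application/json")
  request.body = JSON.generate(document)
  [uri, request]
end

talk = {
  "title" => "elasticsearch - advanced features in practice",
  "presenter" => "jsuchal",
  "presented_at" => "2011-09-22T19:00:00",
  "message" => "hopefully clear",
  "tags" => ["elasticsearch", "rocks"]
}

uri, request = build_index_request("http://localhost:9200", "rubyslava", "talks", 1, talk)
puts "#{request.method} #{uri}"
puts request.body

# To actually send it (requires a running node on localhost:9200):
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
```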
    • Quickstart - search
        $ curl -XPOST 'http://localhost:9200/rubyslava/talks/_search?q=jsuchal&pretty'
        =>
        {
          "took" : 2,
          "timed_out" : false,
          "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 },
          "hits" : {
            "total" : 1,
            "max_score" : 0.054244425,
            "hits" : [ {
              "_index" : "rubyslava",
              "_type" : "talks",
              "_id" : "1",
              "_score" : 0.054244425,
              "_source" : {
                "title" : "elasticsearch - advanced features in practice",
                "presenter" : "jsuchal",
                "presented_at" : "2011-09-22T19:00:00",
                "message" : "hopefully clear",
                "tags" : ["elasticsearch", "rocks"]
              }
            } ]
          }
        }
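To show how a client might consume that response, here is a small sketch that parses a trimmed copy of the JSON above and pulls out the hits. The helper name `extract_hits` is an assumption made for illustration. It uses only fields visible in the response on the slide.

```ruby
require "json"

# Trimmed copy of the search response shown on the slide.
response_body = <<~BODY
  {
    "took": 2,
    "timed_out": false,
    "hits": {
      "total": 1,
      "max_score": 0.054244425,
      "hits": [
        {
          "_index": "rubyslava", "_type": "talks", "_id": "1",
          "_score": 0.054244425,
          "_source": {
            "title": "elasticsearch - advanced features in practice",
            "presenter": "jsuchal",
            "tags": ["elasticsearch", "rocks"]
          }
        }
      ]
    }
  }
BODY

# Hypothetical helper: flatten the nested "hits.hits" array into simple hashes.
def extract_hits(body)
  response = JSON.parse(body)
  response.fetch("hits").fetch("hits").map do |hit|
    { id: hit["_id"], score: hit["_score"], source: hit["_source"] }
  end
end

hits = extract_hits(response_body)
hits.each { |hit| puts "#{hit[:id]}: #{hit[:source]["title"]}" }
```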
    • Advanced features Search  analyzer, stemming, ngrams, ascii folding & custom analyzers  boosting, fragment highlighting, fuzzy search  … Facets Percolate Scroll and more…
    • The Case Study Find suspicious government contracts  using heuristics  IT contract where price > 1M euro  Supplier company age < 3 months  using crowdsourcing Data  Central government contract repositories  www.crz.gov.sk, zmluvy.egov.sk  ~70K contracts in 8 months  100+ GB pdf/doc/scan
    • The Solution: Faceted search
    • The Solution
        • Faceted search
            • Search
                • e.g. Find all contracts by Orange Slovakia
            • Analyze
                • e.g. Which department has the most contracts with Orange Slovakia?
                • e.g. What is the contract price distribution for Orange Slovakia?
                • …
            • Define penalty heuristics
    • Facets Types  term, range, histogram, statistical, geo distance$ curl -XPOST http://localhost:9200/rubyslava/talks/_search?pretty -d { "query" : { "match_all" : { } }, "facets" : { "tags_facet" : { "terms" : { "field" : "tags", "size" : 10 } } }}
    • Facets - results
        {
          "took" : 2,
          …
          "hits" : { … },
          "facets" : {
            "tags_facet" : {
              "_type" : "terms",
              "missing" : 0,
              "total" : 2,
              "other" : 0,
              "terms" : [
                { "term" : "rocks", "count" : 1 },
                { "term" : "elasticsearch", "count" : 1 }
              ]
            }
          }
        }
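To make the fields in that result concrete, the following sketch computes a terms facet in plain Ruby over an in-memory list of documents. It is not how elasticsearch implements facets. It assumes that `missing` counts documents without the field, `total` counts all term occurrences, and `other` is whatever falls outside the top `size` terms. The tie-breaking order (alphabetical) is also an assumption.

```ruby
# Illustrative, in-memory terms facet; mirrors the shape of the result above.
def terms_facet(documents, field, size)
  counts = Hash.new(0)
  missing = 0

  documents.each do |doc|
    values = Array(doc[field]).uniq
    if values.empty?
      missing += 1 # document has no value for the field
    else
      values.each { |value| counts[value] += 1 }
    end
  end

  total = counts.values.sum
  # Highest count first; ties broken alphabetically (an assumption).
  top = counts.sort_by { |term, count| [-count, term] }.first(size)

  {
    "_type" => "terms",
    "missing" => missing,
    "total" => total,
    "other" => total - top.sum { |_term, count| count },
    "terms" => top.map { |term, count| { "term" => term, "count" => count } }
  }
end

docs = [
  { "tags" => ["elasticsearch", "rocks"] },
  { "tags" => ["elasticsearch", "ruby"] },
  { "title" => "untagged talk" }
]

facet = terms_facet(docs, "tags", 2)
puts facet.inspect
```

With `size` set to 2, the third term no longer fits in `terms`, so its count shows up in `other` instead.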
    • Facets - advanced Problem  Generate options for facets with some selected restrictions Solution  global facet  facet_filter { "facets" : { "<FACET NAME>" : { "<FACET TYPE>" : { ... }, "global" : true, "facet_filter" : { "term" : { “supplier.untouched" : “Orange Slovakia, a.s."} } } } }
    • Percolate Problem  New contract/document added, which heuristics does it match? Solution 1. Save heuristics/searches in percolator index 2. Percolate new documents
    • Percolate
        $ curl -XPUT localhost:9200/_percolator/rubyslava/heuristic-1 -d '{
            "query" : { "term" : { "tags" : "rubyslava" } }
        }'
        $ curl -XPOST http://localhost:9200/rubyslava/talks/_percolate -d '{
            "doc" : { "tags" : ["rubyslava", "rocks", "too"] }
        }'
        => {"ok":true,"matches":["heuristic-1"]}
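The idea behind percolation is search turned around: queries are stored first, and each new document is then checked against them. The toy class below shows that flow in memory. The class `TinyPercolator` is invented for illustration, supports only single-term queries, and says nothing about how elasticsearch does it internally.

```ruby
# Toy, in-memory percolator: stores named term queries, matches documents.
class TinyPercolator
  def initialize
    @queries = {}
  end

  # Register a query equivalent to { "term" => { field => term } }.
  def register(name, field, term)
    @queries[name] = [field, term]
  end

  # Return the names of all stored queries that the document matches.
  def percolate(doc)
    @queries.select { |_name, (field, term)| Array(doc[field]).include?(term) }.keys
  end
end

percolator = TinyPercolator.new
percolator.register("heuristic-1", "tags", "rubyslava")
percolator.register("heuristic-2", "tags", "python")

matches = percolator.percolate("tags" => ["rubyslava", "rocks", "too"])
puts matches.inspect
```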
    • Scroll Problem  New heuristic added and matches many (1K+) documents  Add heuristic to all matching documents  + Offset performance problem known in RDBMS Solution  Use async background job  Scroll through results (a.k.a. cursor)
    • Scroll
        $ curl -XGET 'http://localhost:9200/rubyslava/talks/_search?scroll=5m&pretty' -d '{
            "query": { "match_all" : {} }
        }'
        =>
        {
          "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTs5MzpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzk0OlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7OTU6UGJpc19yNlRSUUtHUUhIRl9rek00UTs5MjpQYmlzX3I2VFJRS0dRSEhGX2t6TTRROzkxOlBiaXNfcjZUUlFLR1FISEZfa3pNNFE7MDs=",
          "took" : 2,
          …
          "hits" : […]
        }
        $ curl -XGET 'http://localhost:9200/_search/scroll?scroll=5m&scroll_id=cXVlcnlUaGVuRmV0Y2g…'
        => more results & repeat
    • Ruby Scroll API Mimics find_each in ActiveRecord def find_each(query, &block) scroll_id = nil processed = 0 begin unless scroll_id result = initiate_scroll(query) scroll_id = result.scroll_id else result = scroll(scroll_id) end result.hits.each do |document| yield document end processed += result.hits.size end while processed < result.hits.total end
    • Tutorials & Guides
        • http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
        • http://www.slideshare.net/clintongormley/terms-of-endearment-the-elasticsearch-query-dsl-explained
        • http://www.elasticsearch.org/guide/