The Case Study
Find suspicious government contracts
  using heuristics
    IT contract where price > 1M euro
    Supplier company age < 3 months
  using crowdsourcing
Data
  Central government contract repositories: www.crz.gov.sk, zmluvy.egov.sk
  ~70K contracts in 8 months
  100+ GB of pdf/doc/scan files

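The two heuristics on this slide can be sketched as plain Ruby predicates. This is only an illustration: the Contract struct, its field names, and the helper names are hypothetical, and treating the two rules as independent checks is an assumption about how the slide's bullets are meant to be read.

```ruby
require "date"

# Hypothetical contract record; the field names are assumptions for illustration.
Contract = Struct.new(:category, :price_eur, :signed_on, :supplier_founded_on,
                      keyword_init: true)

# Heuristic from the slide: IT contract where price > 1M euro.
def expensive_it_contract?(contract)
  contract.category == "IT" && contract.price_eur > 1_000_000
end

# Heuristic from the slide: supplier company age < 3 months.
# Assumption: age is measured at the signing date. Date#<< shifts back by months.
def young_supplier?(contract)
  contract.supplier_founded_on > (contract.signed_on << 3)
end

# Returns the names of all heuristics that a contract matches.
def suspicious_reasons(contract)
  reasons = []
  reasons << :expensive_it_contract if expensive_it_contract?(contract)
  reasons << :young_supplier if young_supplier?(contract)
  reasons
end

# Invented example record, not taken from the real repositories.
example = Contract.new(category: "IT", price_eur: 1_500_000,
                       signed_on: Date.new(2012, 5, 1),
                       supplier_founded_on: Date.new(2012, 3, 15))
puts suspicious_reasons(example).inspect
```

Keeping each heuristic as a small predicate makes it straightforward to add a new rule later without touching the existing ones.
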
The Solution
Faceted search
  Search
    e.g. Find all contracts by Orange Slovakia
  Analyze
    e.g. Which department has the most contracts with Orange Slovakia?
    e.g. What is the contract price distribution for Orange Slovakia?
    …
  Define penalty heuristics

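To make the "Analyze" questions concrete, here is a minimal in-memory sketch of what a facet computes: counts per department and a price histogram for one supplier. It does not use any search engine API. The sample rows are invented for illustration, and facet_counts and price_histogram are hypothetical helper names.

```ruby
# Invented sample rows; not real contract data.
SAMPLE_CONTRACTS = [
  { supplier: "Orange Slovakia", department: "Department A", price_eur: 40_000 },
  { supplier: "Orange Slovakia", department: "Department A", price_eur: 250_000 },
  { supplier: "Orange Slovakia", department: "Department B", price_eur: 1_200_000 },
  { supplier: "Other Supplier",  department: "Department B", price_eur: 90_000 }
].freeze

# Counts contracts per value of `facet` for one supplier (a terms-style facet).
def facet_counts(contracts, supplier:, facet:)
  contracts
    .select { |c| c[:supplier] == supplier }
    .each_with_object(Hash.new(0)) { |c, counts| counts[c[facet]] += 1 }
end

# Groups one supplier's contract prices into fixed-width buckets,
# keyed by the lower bound of each bucket (a histogram-style facet).
def price_histogram(contracts, supplier:, bucket_eur: 100_000)
  contracts
    .select { |c| c[:supplier] == supplier }
    .each_with_object(Hash.new(0)) do |c, counts|
      counts[(c[:price_eur] / bucket_eur) * bucket_eur] += 1
    end
end

by_department = facet_counts(SAMPLE_CONTRACTS, supplier: "Orange Slovakia", facet: :department)
top_department, top_count = by_department.max_by { |_, count| count }
puts "#{top_department}: #{top_count} contracts"
puts price_histogram(SAMPLE_CONTRACTS, supplier: "Orange Slovakia").inspect
```

In a real deployment the search backend would compute these aggregates itself; the sketch only shows the shape of the answer each question expects.
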
Scroll Problem
New heuristic added and matches many (1K+) documents
  Add heuristic to all matching documents
  + Offset performance problem known in RDBMS
Solution
  Use async background job
  Scroll through results (a.k.a. cursor)

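The offset problem can be illustrated with simple arithmetic. The sketch below assumes a store that must walk past `offset` rows before returning each page, which is the usual cost model for offset paging. Both function names are hypothetical.

```ruby
# Total rows the store walks to serve every page with offset paging,
# assuming each page costs `offset` skipped rows plus the rows returned.
def rows_touched_with_offset(total, page_size)
  (0...total).step(page_size).sum do |offset|
    offset + [page_size, total - offset].min
  end
end

# With a cursor (scroll), the store keeps its position, so each row is walked once.
def rows_touched_with_cursor(total)
  total
end

puts rows_touched_with_offset(1000, 100) # → 5500
puts rows_touched_with_cursor(1000)      # → 1000
```

The offset cost grows roughly quadratically with the number of pages, which is why a cursor fits a background job that has to visit every matching document.
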
Ruby Scroll API
Mimics find_each in ActiveRecord

    def find_each(query, &block)
      scroll_id = nil
      processed = 0
      begin
        if scroll_id
          # Later iterations: fetch the next batch for the open scroll.
          result = scroll(scroll_id)
        else
          # First iteration: open the scroll and remember its id.
          result = initiate_scroll(query)
          scroll_id = result.scroll_id
        end
        result.hits.each do |document|
          yield document
        end
        processed += result.hits.size
      # Stop once every matching document has been yielded.
      end while processed < result.hits.total
    end
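To show how find_each might be called, the sketch below wires the method from the slide to an in-memory stand-in. FakeSearch, Hits, ScrollResult, the page size, and the bodies of initiate_scroll and scroll are all hypothetical; the slide does not show how the real helpers talk to the search backend.

```ruby
# Hypothetical result objects exposing only what find_each reads.
class Hits
  attr_reader :total

  def initialize(documents, total)
    @documents = documents
    @total = total
  end

  def each(&block)
    @documents.each(&block)
  end

  def size
    @documents.size
  end
end

ScrollResult = Struct.new(:scroll_id, :hits)

# In-memory stand-in for the search backend (assumption, for illustration only).
class FakeSearch
  PAGE_SIZE = 2

  def initialize(documents)
    @documents = documents
    @cursors = {}
  end

  # Same control flow as the method on the slide.
  def find_each(query)
    scroll_id = nil
    processed = 0
    begin
      if scroll_id
        result = scroll(scroll_id)
      else
        result = initiate_scroll(query)
        scroll_id = result.scroll_id
      end
      result.hits.each { |document| yield document }
      processed += result.hits.size
    end while processed < result.hits.total
  end

  private

  # Opens a cursor over all documents containing `query`; returns the first page.
  def initiate_scroll(query)
    scroll_id = "scroll-#{@cursors.size + 1}"
    matches = @documents.select { |doc| doc.include?(query) }
    @cursors[scroll_id] = { matches: matches, position: 0 }
    scroll(scroll_id)
  end

  # Returns the next page for an open cursor and advances its position.
  def scroll(scroll_id)
    cursor = @cursors.fetch(scroll_id)
    page = cursor[:matches][cursor[:position], PAGE_SIZE] || []
    cursor[:position] += page.size
    ScrollResult.new(scroll_id, Hits.new(page, cursor[:matches].size))
  end
end

search = FakeSearch.new(["A: Orange", "B: Telekom", "C: Orange", "D: Orange"])
seen = []
search.find_each("Orange") { |document| seen << document }
puts seen.inspect
```

A background job could call find_each in this way and attach the new heuristic to each yielded document, without ever paging by offset.
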