Turning search upside down

As part of their work with large media monitoring companies, Flax has developed a technique for applying tens of thousands of stored Lucene queries to a document in under a second. We'll talk about how we built intelligent filters to reduce the number of actual queries applied and how we extended Lucene to extract the exact hit positions of matches, the challenges of implementation, and how it can be used, including applications that monitor hundreds of thousands of news stories every day.

Turning search upside down Presentation Transcript

  • 1. Turning Search Upside Down: using Lucene for very fast stored queries
    Charlie Hull & Alan Woodward, Managing Director & Principal Developer at Flax
    www.flax.co.uk @FlaxSearch
  • 2. Who are Flax?
      • Search engine specialists with decades of experience
      • Developers, innovators and strategists
      • Based in Cambridge, UK
      • Technology agnostic – but open source exponents
      • UK Authorized Partner of LucidWorks
      • Customers include Reed Specialist Recruitment, NLA, Gorkana, Financial Times, News UK, UK Cabinet Office, Australian Associated Press, Accenture, University of Cambridge...
      • We also run a Meetup group:
  • 4. Who are we?
      • Charlie Hull
          – Managing Director and co-founder at Flax
          – I've been working in search since 1999
          – When I'm not running Flax I'm sometimes a stiltwalker & juggler
      • Alan Woodward
          – Principal developer at Flax
          – I used to run a 500m document FAST ESP index for Proquest
          – Lucene/Solr committer
          – When I'm not hacking I sing in the Cambridge University Musical Society Chorus
  • 5. Media monitoring
      • Standard search – few queries over many documents
      • Monitoring search – many queries over each document
      • Customers' interests manually turned into queries
      • Humans probably still have the final say on relevance
      • Eventual result is a list of articles emailed - or even printed
      • Exact position of a match is important to the customer
  • 6. Media monitoring - parameters
      • Tens of thousands of stored expressions or 'Keywords' – can't rewrite these so must use same syntax!
      • Hundreds of thousands of articles to monitor every day
      • Source data can sometimes be scanned & OCR'd
      • False positives cost human operator time: false negatives cost customers!
      • Traditional approach is brute force using standard search engine software
  • 7. Media monitoring – a Keyword ("PALM BEACH COUNTY"; W/48 (("TOURIS*"; OR "TOUR"; OR "TOURS"; OR "TRAVEL*"; OR "HOLIDAY*"; OR "HOL"; OR "HOLS"; OR "HOTEL*"; OR "VISIT*"; OR "TRIP"; OR "TRIPS"; OR "DAYTRIP*"; OR "BEACH"; OR "!BEACHES"; OR "COAST"; OR "!COASTLINE*"; OR "ABTA"; OR "DAY TRIP*"; OR “SUITE"; OR "SUITES"; OR "A%CCOMMODATION"; OR "BED AND ! BREAKFAST"; OR "B&B"; OR "BED & !BREAKFAST"; OR "!BREAKFAST AND BOARD"; OR "FULL BOARD"; OR "HALF BOARD"; OR "ALL !INCLUSIVE"; OR "THINGS TO DO"; OR "HOSP? TALITY"; OR "SHORT BREAK*"; OR "!WEEKEND BREAK"; OR "CITY BREAK*"; OR "!SIGHTSEE*"; OR "!VACATION*"; OR "E%XCURSION*"; OR "FLY* WITH"; OR "FLY* THERE"; OR "FLY* DRIVE"; OR "!GETAWAY"; OR "!BACKPACK*"; OR "BACK PACK*"; OR "!ECOTOURIS*"; OR "! WATERSPORT*"; OR "WATER SPORT*"; OR "FESTIVAL*"; OR "RESORT* & SPA"; OR "RESORT* AND SPA"; OR "WHALE WATCH*"; OR "GET THERE"; OR "WHERE TO STAY"; OR "GETTING THERE"; OR "STAYCATION*"; OR "VILLA"; OR "VILLAS"; OR "AIRPORT*"; OR "SPA"; OR "SPAS"; OR "OUTDOOR EVENT*"; OR "OUTDOOR ADVENTURE*"; OR "OUTDOOR PURSUIT*"; OR "OUTDOOR ACTIVIT*"; OR "CLIMBING WALL*"; OR "CLIMBING CENTRE*"; OR "ROCK CLIMB*"; OR "WHITE WATER RAFTING";) OR ("PLACES"; W/4 ("TO STAY"; OR "TO SEE"; OR "TO EAT";)) OR (("FLIGHT*"; OR "FLY"; OR "FLYING"; OR "CRUISE*";) W/4 ("OFFER"; OR "AVAILABLE"; OR "DEPART*"; OR "FROM"; OR "TRANSFER*";))))
  • 8. Media monitoring – another Keyword (((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR ";G.P.R.S"; OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE COMM*"; OR ";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR ";H.S.U.P.A"; OR ";HIGH SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O"; OR ";SMS"; OR ";SHORT MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR ";!CELLPHONE*"; OR ";!TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR ";TELCO*"; OR ";VODAFONE"; OR ";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*"; OR ";TEXT MESSAG*"; OR ";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND ! WIRELESS";) W/48 ((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR ";!REFINANC*"; OR ";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR ";TAKEOVER*"; OR ";BUYOUT*"; OR ";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*"; OR ";ACCOUNT*"; OR ";MONEY"; OR ";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*"; OR ";JOINT*"; OR ";NEW VENTURE*"; OR ";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";! 
EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR ";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR ";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR ";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MISSELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*"; OR ";FIBRE OPTIC*"; OR ";TAX"; OR ";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR ";STAFF"; OR ";WORKER*"; OR ";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR ";UNFAIR"; OR ";%UNSCRUPULOUS"; OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*"; OR ";PRICE PLAN*"; OR ";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR ((";CALLS COST";) W/6 (";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))
  • 9. Media monitoring – an article
    <?xml version="1.0" encoding="UTF-8"?>
    <document>
      <attributes>
        <field name="ARTICLEID">2277480</field>
        <field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field>
        <field name="SOURCEID" type="STRING">2</field>
        <field name="ARTAREA" type="TEXT">161.82</field>
        <field name="PAGESUM" type="STRING">1</field>
        <field name="SDATE" type="STRING">2011-08-02</field>
        <field name="PAGE" type="STRING">A03 NEW NAA</field>
        <field name="CDATE" type="DATE">2011-08-02 01:08:16</field>
        <field name="AUTHORNAME" type="STRING">MICHELLE GRATTAN</field>
        <field name="ARTAREAPERCENT" type="STRING">7.66</field>
        <field name="SENTENCES" type="STRING">LONG-time Victorian Labor leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch with his local community.</field>
        <field name="TEXT">Holding a man of integrity, honour By MICHELLE GRATTAN LONG-time Victorian Labor leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch with his local community. Mr Holding, sitting in the seat of Richmond, was Victorian opposition leader from 1967 to 1977 before entering federal politics as MP for Melbourne Ports in 1977. He held several portfolios in the Hawke government...</field>
        <field name="ORGFILENAME" type="STRING">AAAGE20110802NAANEWA03.pdf</field>
        <field name="SUBTITLE" type="STRING"></field>
      </attributes>
    </document>
  • 10. So how does it work?
      • Incoming article is added to a Lucene MemoryIndex
      • We parse the original Keyword syntax into a store of Lucene queries
      • Stored queries are run against this single-document index
      • Matches from each query are gathered and emitted
      • Simple!
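The flow on slide 10 can be modelled without Lucene at all. The sketch below is a toy under stated assumptions, not the Flax implementation: the class name BruteForceMonitor is hypothetical, a set of lower-cased tokens stands in for the MemoryIndex, and each stored query is reduced to a predicate over that set. It only illustrates that every stored query is evaluated for every incoming article.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Toy model of the brute-force approach: every stored query runs against every document. */
final class BruteForceMonitor {
    // Stored queries keyed by id; a query is modelled as a predicate over the document's terms.
    private final Map<String, Predicate<Set<String>>> queries = new LinkedHashMap<>();

    void addQuery(String id, Predicate<Set<String>> query) {
        queries.put(id, query);
    }

    // Stand-in for the single-document index: lower-case the text and split on non-alphanumerics.
    static Set<String> tokenize(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Run all stored queries against one document and collect the ids of those that match.
    List<String> match(String text) {
        Set<String> terms = tokenize(text);
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<Set<String>>> entry : queries.entrySet()) {
            if (entry.getValue().test(terms)) {
                matched.add(entry.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        BruteForceMonitor monitor = new BruteForceMonitor();
        monitor.addQuery("tourism", t -> t.contains("beach") && (t.contains("hotel") || t.contains("holiday")));
        monitor.addQuery("telecoms", t -> t.contains("vodafone"));
        System.out.println(monitor.match("A new hotel opens on the beach"));
    }
}
```

The cost of match grows linearly with the number of stored queries, which is the scaling problem the next slides address.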
  • 12. It can't be that easy?
    Problems:
      • Doesn't scale very well with increasing numbers of queries
      • Complex queries need to be rewritten for every document.
    Solution:
      • Cut down the number of queries run over each document
      • Important not to get false negatives
  • 13. “Turning search upside down”
      • Narrow down our query space by selecting queries likely to match a given document.
      • Sounds a bit like... a search?
      • Create an index of queries, using terms extracted from the query objects themselves.
      • Incoming documents are converted to a disjunction query using the MemoryIndex's terms list.
      • Specialised Collector then runs each matching query against the document.
      • Massively reduces the number of queries run against a given document.
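The "index of queries" idea on slide 13 can be sketched with a plain inverted map. This is an illustration only: QueryIndex and its method names are hypothetical, and an in-memory map replaces the Lucene index the talk describes. Each stored query is filed under the terms extracted from it, and the document's term list is then used as one large OR over that map to select candidates.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy query index: maps each extracted term to the ids of the stored queries that need it. */
final class QueryIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // File a stored query under every term extracted from it.
    void add(String queryId, Collection<String> extractedTerms) {
        for (String term : extractedTerms) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(queryId);
        }
    }

    // The document's terms act as a disjunction: any query sharing a term is a candidate.
    Set<String> candidates(Set<String> documentTerms) {
        Set<String> ids = new TreeSet<>();
        for (String term : documentTerms) {
            Set<String> hits = postings.get(term);
            if (hits != null) {
                ids.addAll(hits);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        QueryIndex index = new QueryIndex();
        index.add("tourism", Arrays.asList("beach"));
        index.add("telecoms", Arrays.asList("vodafone", "phone"));
        index.add("finance", Arrays.asList("merger"));

        Set<String> documentTerms = new HashSet<>(Arrays.asList("new", "hotel", "beach"));
        // Only the candidates would then be run in full against the document.
        System.out.println(index.candidates(documentTerms));
    }
}
```

Candidates still have to be run as real queries against the document; the index only rules out queries that cannot match, which is why the choice of extracted terms must never cause a false negative.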
  • 14. Indexing Queries
      • TermQueries are simple – just extract the term
      • For Disjunction queries, we extract terms from all subclauses
      • For Conjunction queries, we can just use a single subclause
      • Positional queries are treated as conjunctions
      • Simple Numeric queries treated as terms
      • Empty term sets are indexed with special terms to match all docs
      • NumericRange queries are ignored
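The extraction rules on slide 14 amount to a recursive walk over the query tree. The sketch below uses hypothetical stand-in types (Term, Or, And, Range) rather than Lucene's query classes, and the marker ANYTERM is an invented name for the "special term" that matches all documents. The slide does not say which subclause of a conjunction is used; choosing the one with the fewest terms is an assumption made here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy term extractor following the slide's rules; the query types are stand-ins, not Lucene classes. */
final class TermExtractor {
    // Hypothetical marker: queries indexed under it must be treated as candidates for every document.
    static final String ANYTERM = "__anyterm__";

    interface Query {}
    static final class Term implements Query {
        final String text;
        Term(String text) { this.text = text; }
    }
    static final class Or implements Query {
        final List<Query> clauses;
        Or(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }
    static final class And implements Query {
        final List<Query> clauses;
        And(Query... clauses) { this.clauses = Arrays.asList(clauses); }
    }
    /** Stands in for a query with no extractable terms, such as a numeric range. */
    static final class Range implements Query {}

    static Set<String> extract(Query query) {
        Set<String> terms = new TreeSet<>();
        if (query instanceof Term) {
            terms.add(((Term) query).text);
        } else if (query instanceof Or) {
            // Disjunction: any subclause may match, so terms from all of them are needed.
            for (Query clause : ((Or) query).clauses) {
                terms.addAll(extract(clause));
            }
        } else if (query instanceof And) {
            // Conjunction: one subclause is enough; prefer the smallest set that is not match-all.
            Set<String> best = null;
            for (Query clause : ((And) query).clauses) {
                Set<String> sub = extract(clause);
                if (sub.contains(ANYTERM)) {
                    continue;
                }
                if (best == null || sub.size() < best.size()) {
                    best = sub;
                }
            }
            if (best != null) {
                terms.addAll(best);
            }
        }
        // Nothing extractable: index under the match-all marker to avoid false negatives.
        if (terms.isEmpty()) {
            terms.add(ANYTERM);
        }
        return terms;
    }

    public static void main(String[] args) {
        Query keyword = new And(new Term("palm"),
                new Or(new Term("hotel"), new Term("holiday"), new Term("beach")));
        System.out.println(extract(keyword));
    }
}
```

A disjunction that contains an unextractable subclause keeps the ANYTERM marker in its term set, so it is never filtered out; a conjunction can safely skip such a subclause as long as another subclause yields real terms.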
  • 15. Indexing Queries
    RegexpQueries are a bit more complex:
      • Extract the longest exact substring from the regex and index that
      • Create ngrams from the MemoryIndex termlist when building the document disjunction query
      • Reduces the effectiveness of the query filtering, but still allows significant speedups.
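A rough sketch of the regular-expression handling on slide 15 follows. It is a simplification under several assumptions: RegexpPresearch and both method names are hypothetical, only a small regex subset is understood (literals, '.', character classes, escapes and the quantifiers '*', '+', '?', '{n,m}'), and any pattern containing alternation or grouping gives up and returns an empty string, which the caller would index as match-all. The ngram step here emits every substring of a document term, which is the simplest way to guarantee the indexed substring can be found.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy regex pre-filter: index the longest required literal, look it up among document ngrams. */
final class RegexpPresearch {

    // Longest run of characters that must appear verbatim in any string matching the regex.
    static String longestLiteral(String regex) {
        if (regex.indexOf('|') >= 0 || regex.indexOf('(') >= 0) {
            return ""; // alternation or groups: no single literal is guaranteed, so give up
        }
        String best = "";
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            char next = i + 1 < regex.length() ? regex.charAt(i + 1) : '\0';
            boolean literal = Character.isLetterOrDigit(c);
            // A character followed by '*', '?' or '{' may be absent or repeated, so it is not required.
            boolean optional = next == '*' || next == '?' || next == '{';
            if (literal && !optional) {
                run.append(c);
            }
            // End the run at operators, optional characters, and after a '+'-repeated character.
            if (!literal || optional || next == '+') {
                if (run.length() > best.length()) {
                    best = run.toString();
                }
                run.setLength(0);
            }
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '[' || c == '{') {
                char closer = c == '[' ? ']' : '}';
                while (i < regex.length() && regex.charAt(i) != closer) {
                    i++; // skip the character class or repetition count
                }
            }
        }
        return run.length() > best.length() ? run.toString() : best;
    }

    // All substrings of a document term, so an indexed literal can match inside a longer word.
    static Set<String> ngrams(String term) {
        Set<String> grams = new TreeSet<>();
        for (int start = 0; start < term.length(); start++) {
            for (int end = start + 1; end <= term.length(); end++) {
                grams.add(term.substring(start, end));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        String indexed = longestLiteral("touris.*");
        System.out.println(indexed + " found in document term: " + ngrams("ecotourism").contains(indexed));
    }
}
```

Because every document term expands into many ngrams, more stored queries become candidates than with whole-term matching, which is the reduced filtering effectiveness the slide mentions.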
  • 16. Hacking Lucene & Solr
    To allow MemoryIndex to expose positions we fixed for Lucene 4.x:
      • LUCENE-3831 - Passing a null fieldname to MemoryFields#terms in MemoryIndex throws a NPE
      • LUCENE-3827 - Make term offsets work in MemoryIndex
    Contributed to the work on adding interval iterators to Lucene Scorers:
      • LUCENE-2878 - Allow Scorer to expose positions and payloads aka. nuke spans
    ...and then we rebuilt Solr on top of these modifications.
  • 18. Why not use Elasticsearch percolators?
      • Need to emit exact hit positions using Interval Iterators
      • Percolator doesn't allow such efficient optimisations based on document-specific filters – we need to cut down the number of queries we run as much as possible.
      • The fastest way to run a query is not to run it at all!
  • 19. Introducing 'Luwak'
    A library for running many stored queries, fast.
    Features
      • Works with standard Lucene queries (TermQuery, BooleanQuery, RegexpQuery...)
      • Can be extended to work with custom query types
      • Run multiple instances of it to scale for document volume
      • Based on our fork of Lucene (should be merged into trunk soon though...)
      • Available Real Soon Now on Github
  • 20. Using Luwak
        Monitor monitor = new Monitor();
        monitor.addQueries(querylist);
        InputDocument doc = source.getNextDocument();
        DocumentMatches matches = monitor.match(doc);
    DocumentMatches object contains exact hit positions per-field for each query that matched the InputDocument, as well as some metrics (querycount, preptime, querytime).
  • 21. Luwak output
        {
          "stats": {"preptime":3,"querytime":23,"querycount":1},
          "id":"C:TESTBMA20130417e10a0737.xml",
          "uniqueId":"e10a0737",
          "Content":"… lots of xml …",
          "QueryMatches":[
            {"QueryId":"1",
             "FieldHighlights":[
               {"Highlights":[
                  {"WordOffset":11,"Offset":80,"Length":7,"Word":"quibble"},
                  {"WordOffset":115,"Offset":739,"Length":7,"Word":"quibble"},
                  {"WordOffset":117,"Offset":749,"Length":7,"Word":"quibble"}],
                "Field":"bodytext"}]}]}
  • 22. How fast is Luwak?
    For simple keywords (<20 terms):
      • 70,000 keywords applied per second to an article
      • Tested on a Macbook
      • 20 times faster than our own previous implementation (using Xapian)
    For more complex keywords (some run to three pages!)
      • 20,000 keywords applied in 0.5 seconds
      • Processes approximately 2000 articles/hour
  • 23. Building Luwak into a monitoring system
      • Monitors are packaged up into a service using Dropwizard
      • We run multiple instances of the monitoring system to handle the required article throughput
      • Queries held in a central database, and updates pushed out to each monitor instance using notification queues
      • Articles are supplied in NewsML format, either on disk or via message brokers.
      • We build a separate Solr index of articles – a searchable archive – for testing new Keywords and for end customers to use
      • Everything uses RESTful Web Service APIs
  • 27. What next?
      • 3 clients are live with one in progress
      • Replacing systems built on dtSearch, Verity & Autonomy
      • Can we create query parsers for other legacy systems?
      • Improve Luwak performance by sharding the Keyword set
      • We know of cases of 1.5m keywords, hundreds of articles/sec
      • Get Lucene intervals code into trunk, so Luwak isn't based on a fork
      • Find other uses for Luwak?
      • Add metadata to items based on their content
      • Classification into a taxonomy, nodes defined by Keywords
  • 28. Thank you! Any questions?
    charlie@flax.co.uk @FlaxSearch
    alan@flax.co.uk @romseygeek
    www.flax.co.uk/blog
    +44 (0) 8700 118334