Turning Search Upside Down - how Flax works with media monitoring companies to build powerful and scalable 'inverted search' systems, applying hundreds of thousands of stored queries to millions of documents in real time. Features Apache Lucene/Solr as a replacement for Autonomy IDOL and our Luwak library as a replacement for Autonomy Verity.
Turning search upside down with powerful open source search software
1. Turning Search Upside Down
with open source software
Charlie Hull - Managing Director
28th November 2014
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
5. Who are Flax?
We design, build and support open source powered
search applications
Based in Cambridge U.K., technology agnostic &
independent – but open source exponents & committers
UK Authorized Partner of
Customers include Reed Specialist Recruitment, Mydeco,
NLA, Gorkana, Financial Times, News UK, EMBL-EBI,
Accenture, University of Cambridge, UK Government...
Customers in recruitment, government, e-commerce,
news & media, bioinformatics, consulting, law...
8. What I'll cover today
Monitoring & classifying data with stored queries
Media monitoring - how it used to be done
What we can't change
Turning search upside down with Flax's Luwak
Case studies:
Media monitoring in Scandinavia
Who else is using Luwak?
Conclusions
11. Monitoring & classifying with
stored queries
'Saved searches' / 'alerting' / 'stored profiles'
Use cases:
– Monitor news for client interests
– Tag/classify incoming data according to certain rules
Volume & speed
– 10k → 1m stored queries
– 10k → 1m new items a day
– Latencies as low as 50-100ms
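A rough back-of-envelope calculation shows why these volumes rule out a naive approach. The sketch below assumes every stored query is evaluated against every new item and that the load is spread evenly across the day; the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def brute_force_rate(stored_queries: int, items_per_day: int) -> float:
    """Query evaluations per second if every query runs against every item."""
    return stored_queries * items_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # The low and high ends of the volume range quoted on the slide.
    for queries, items in [(10_000, 10_000), (1_000_000, 1_000_000)]:
        rate = brute_force_rate(queries, items)
        print(f"{queries:,} queries x {items:,} items/day "
              f"= {rate:,.0f} evaluations/second")
```

Under these assumptions the low end needs over a thousand query evaluations per second, and the high end over eleven million.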
UK/Europe – Gorkana, Precise, Cision;
USA – BurellesLuce
18. Media monitoring
– how it used to be done
Manual process of translating client interests into stored
queries
– many pages of Boolean terms
– 10k-250k characters
– Evolved over many years, grown not built
– Difficult to maintain and troubleshoot
19. Like this...
((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S"
OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR
"HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH
SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT
MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*"
OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR
"T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR
"SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48
(("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR
"OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR
"BUY-OUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR
"CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR
"PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL
YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE
PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR
"!COMPLAIN*" OR "MIS-SOLD" OR "MIS-SELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*"
OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR
"STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR
"UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*"
OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR
(("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))
…and that's an easy one!
20. Media monitoring
– how it used to be done
Complex pipelines with many sources monitored
– print & scan & OCR
– XML feeds (e.g. NLA media access in UK)
– Web content, Twitter etc.
Lots of search engines running in parallel
– dtSearch on 100+ servers, then merged
– 'Stored query' functionality built in e.g. Verity
24. What we can't change...
Syntax of the stored queries (sometimes)
– Rewriting these is a huge, risky job for the business
Volume keeps increasing
The need to test new and old queries on archive data -
using the same syntax
Source data will often be low quality
– Scans
– PDF extracts
– Re-typed XML
26. Speaking the same language
Parse the old query syntax into a new form
– Either on the fly (dtSearch, Verity)
– Or off-line (Verity)
– In theory we can translate any old syntax
– A neutral syntax protects against lock-in
• Why not?
Testing very, very important
– Agree a test set
– Watch for broken queries (in several ways)
– Use the client's resources to help
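One way to work with an agreed test set is to compare, query by query, what the old system matched with what the new one matches. The sketch below is a minimal illustration, assuming both systems can export their matches as a mapping from query ID to a set of document IDs; the names `compare_matches` and `QueryDiff` are hypothetical.

```python
from typing import Dict, NamedTuple, Set


class QueryDiff(NamedTuple):
    missing: Set[str]  # matched by the old system but not the new one
    extra: Set[str]    # matched by the new system but not the old one


def compare_matches(old: Dict[str, Set[str]],
                    new: Dict[str, Set[str]]) -> Dict[str, QueryDiff]:
    """Return a diff for every query whose matches differ between systems."""
    diffs: Dict[str, QueryDiff] = {}
    for query_id in sorted(set(old) | set(new)):
        old_docs = old.get(query_id, set())
        new_docs = new.get(query_id, set())
        if old_docs != new_docs:
            diffs[query_id] = QueryDiff(old_docs - new_docs,
                                        new_docs - old_docs)
    return diffs


if __name__ == "__main__":
    old_matches = {"q1": {"doc1", "doc2"}, "q2": {"doc3"}, "q3": set()}
    new_matches = {"q1": {"doc1"}, "q2": {"doc3"}, "q3": {"doc4"}}
    for query_id, diff in compare_matches(old_matches, new_matches).items():
        print(query_id, "missing:", sorted(diff.missing),
              "extra:", sorted(diff.extra))
```

A difference does not always mean the translation is wrong: an "extra" match may show that the old query was broken, so each diff still needs a person to review it.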
28. Turning search upside down..
[Diagram: documents are run against the stored queries to produce a result. 1 million queries, some 250k characters long, with complex rules; 1 million new documents a day; results within 5-100ms; $$$$$$]
32. Turning search upside down..
[Diagram: 1. Pre-query: each document selects a subset of ~200 of the 1 million stored queries. 2. Search: the document is searched with that subset to produce the result.]
35. Turning search upside down..
How to avoid running 1 million searches on 1 million
documents? – don't run all of them!
– 1. Presearch - Find the likely candidates first by
searching the stored queries using the document
– 2. Search - Then perform a normal search, but with
far fewer queries, and do it in memory for speed
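The two steps above can be sketched in a few lines. This is a toy illustration, not Luwak's implementation: it assumes every stored query is just a set of terms that must all appear in the document, and the class and method names are hypothetical. Each query is filed under one of its required terms, so the presearch step can never miss a query that would match.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


class TinyMonitor:
    """Toy monitor: each stored query is a set of terms that must ALL appear."""

    def __init__(self) -> None:
        self.queries: Dict[str, FrozenSet[str]] = {}
        self.by_term: Dict[str, Set[str]] = defaultdict(set)

    def add_query(self, query_id: str, required_terms: Set[str]) -> None:
        terms = frozenset(term.lower() for term in required_terms)
        if not terms:
            raise ValueError("a stored query needs at least one term")
        self.queries[query_id] = terms
        # File the query under a single required term. Any matching document
        # must contain that term, so the presearch cannot miss the query.
        # (A real system would pick the rarest term, not the first by name.)
        self.by_term[min(terms)].add(query_id)

    def presearch(self, doc_terms: Set[str]) -> Set[str]:
        """Step 1: use the document's terms to find candidate queries."""
        candidates: Set[str] = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        return candidates

    def match(self, text: str) -> List[str]:
        """Step 2: fully evaluate only the candidates against the document."""
        doc_terms = set(text.lower().split())
        return sorted(query_id for query_id in self.presearch(doc_terms)
                      if self.queries[query_id] <= doc_terms)


if __name__ == "__main__":
    monitor = TinyMonitor()
    monitor.add_query("telecoms-profit", {"vodafone", "profit"})
    monitor.add_query("tax-law", {"tax", "law"})
    monitor.add_query("handsets", {"handset"})
    print(monitor.match("Vodafone reports record profit on handset sales"))
```

The presearch may return queries that turn out not to match (false positives are fine), but it must never drop one that would, which is why the second step still runs a full evaluation.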
36. Introducing Luwak
Based on a (slight) fork of Apache Lucene
A library for efficiently running many stored queries
Two stages:
– Presearcher (find candidate queries)
– Monitor Searcher (run candidate queries against a
single document in-memory index)
Very fast – 70-100k queries per second when first tested
and 3-4 times faster now
https://github.com/flaxsearch/luwak
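To give a feel for the second stage, here is a minimal sketch of a single-document in-memory index that records word positions, with a proximity check in the spirit of the W/48 operator in the example query. It does not use Luwak or Lucene, the function names are hypothetical, and the exact way dtSearch or Verity count distance may differ from this simple version.

```python
from collections import defaultdict
from typing import Dict, List


def index_document(text: str) -> Dict[str, List[int]]:
    """Map each lower-cased word to the positions where it occurs."""
    positions: Dict[str, List[int]] = defaultdict(list)
    for position, token in enumerate(text.lower().split()):
        positions[token].append(position)
    return dict(positions)


def within(index: Dict[str, List[int]], left: str, right: str,
           distance: int) -> bool:
    """True if `left` and `right` occur within `distance` words of each other."""
    return any(abs(a - b) <= distance
               for a in index.get(left, [])
               for b in index.get(right, []))


if __name__ == "__main__":
    doc = index_document("vodafone said the merger would lift profit next year")
    print(within(doc, "vodafone", "profit", 12))
    print(within(doc, "vodafone", "profit", 3))
```

Because the index holds only one document, it is small enough to build and discard in memory for every incoming item.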
37. The gory details...
Custom QueryParsers for dtSearch, Verity VQL/OTL etc.
Hacked Lucene to expose positional information
LUCENE-3831, LUCENE-3827 in Lucene 4.x
LUCENE-2878 'Nuke Spans' coming in Lucene 5.0
Works with standard Lucene queries
TermQuery, BooleanQuery, RegexpQuery...
Can be extended to work with custom query types
Lots (and lots) of performance tuning
38. Using Luwak in a larger system
Run multiple instances of Luwak to handle throughput
Use message passing platforms (e.g. RabbitMQ, Apache
Kafka)
Build a separate index of articles – a searchable archive –
for testing new stored queries and/or for customers
Use RESTful Web Service APIs everywhere
Test against known matches from older systems
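The shape of such a system can be sketched with the standard library alone. In the illustration below an in-process queue stands in for a message broker such as RabbitMQ or Kafka, and worker threads stand in for separate Luwak instances; none of it is the API of those products, and `run_workers` and the matcher are hypothetical names.

```python
import queue
import threading
from typing import Callable, Iterable, List, Tuple

Matcher = Callable[[str], List[str]]


def run_workers(documents: Iterable[Tuple[str, str]], matcher: Matcher,
                worker_count: int = 2) -> List[Tuple[str, List[str]]]:
    """Fan (doc_id, text) pairs out to workers and collect their matches."""
    inbox: queue.Queue = queue.Queue()
    results: List[Tuple[str, List[str]]] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = inbox.get()
            if item is None:  # sentinel: no more documents for this worker
                break
            doc_id, text = item
            matched = matcher(text)
            with lock:
                results.append((doc_id, matched))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for document in documents:
        inbox.put(document)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(results)  # workers finish in any order, so sort by doc_id


if __name__ == "__main__":
    def keyword_matcher(text: str) -> List[str]:
        return ["telecoms"] if "vodafone" in text.lower() else []

    docs = [("a", "Vodafone profit rises"), ("b", "Weather forecast")]
    print(run_workers(docs, keyword_matcher, worker_count=3))
```

Because each document is matched independently, throughput can be raised by adding workers without changing the matching code.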
42. Likely problems
Some old queries will be broken
– Replicate the broken behaviour, or fix it?
Some queries will need troubleshooting
– Queries can be very large
– Possibly in a foreign language
Performance tuning is hard
– JVM settings, caching, storage types
Remember this is a different system – 1 to 1 matching
correspondence is impossible!
43. Case study –
media monitoring in Scandinavia
“The leading Danish provider of media intelligence, i.e.
media search, media monitoring and media analysis”
Old system based on Verity (monitoring) and Autonomy
IDOL (archive search)
Issues:
Systems at full capacity - even with lots of hardware
Archive search very, very slow
Complex workflow with some very old parts (FTP, batch
files...)
Proprietary query format (Verity VQL)
45. Case study –
media monitoring in Scandinavia
Flax engaged in 2013 as part of a complete re-architecture
Phased approach:
1. Translate all old queries to new in-house syntax 'IQL'
2. New monitoring pipeline (RabbitMQ, RESTful Web
Services, Flax's Luwak)
3. New archive search (Apache Solr)
Performance/volume targets:
Monitoring 20 articles/second, aim for 100 eventually
100 qps for archive search, response within 250ms
17000 stored queries, 85 million articles in archive
48. Case study –
media monitoring in Scandinavia
Monitoring using 2 servers, Archive search using 8
servers (including failover)
6 cores, 18GB per server
Launch more servers to scale the system
All stored queries translated to new IQL syntax
Including the ones Verity didn't tell us weren't working...
Monitoring customers are being migrated in stages from
Oct 2014, Archive search beta testing during Oct 2014,
completion by end 2014
49. Case study –
media monitoring in Scandinavia
The benefits:
– No vendor lock-in (open source, IQL)
– Industry standard software (Apache Lucene/Solr,
RabbitMQ)
– Predictable scaling & costs
50. Who else is using Luwak?
Bloomberg
http://www.flax.co.uk/blog/2014/03/24/london-search-meetup-serious-solr-at-bloomberg-elasticsearch-
Booz Allen Hamilton
http://www.slideshare.net/BryanBende/realtime-inverted-search-nyc-aslug-oct-2014
Early versions are installed at
We know of users in other sectors e.g. recruitment
We're hoping to make it a standard part of Lucene/Solr
and Elasticsearch (i.e. to improve the Percolator)
Apache Samza integration?
https://github.com/romseygeek/samza-luwak
55. Conclusions
To solve a hard problem, sometimes you need to turn it
upside down
When you migrate to open source, you can keep some of
the parts of your system (if you must)
Open source helps you scale - with predictable costs
Leading companies are doing it
Go and turn Search Upside Down!
56. Another thing...
BioSolr – a funded project to improve the way search
technology is used in bioinformatics
– EBI & NCBI are involved
– http://www.flax.co.uk/blog/2014/10/02/biosolr-begins-with-a-workshop-day/
– Already producing Solr patches e.g. SOLR-1387
– Come and join us!