Turning search upside down with powerful open source search software

Turning Search Upside Down
with open source software
Charlie Hull - Managing Director
28th November 2014
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

Who are Flax?
We design, build and support open source powered
search applications

Who are Flax?
search applications
Based in Cambridge U.K., technology agnostic &
independent – but open source exponents & committers

Who are Flax?
search applications
UK Authorized Partner of

Who are Flax?
search applications
Customers include Reed Specialist Recruitment, Mydeco,
NLA, Gorkana, Financial Times, News UK, EMBL-EBI,
Accenture, University of Cambridge, UK Government...

Who are Flax?
search applications
Customers in recruitment, government, e-commerce,
news & media, bioinformatics, consulting, law...

What I'll cover today
Monitoring & classifying data with stored queries
Media monitoring - how it used to be done
What we can't change
Turning search upside down with Flax's Luwak
Case studies:
Media monitoring in Scandinavia
Who else is using Luwak?
Conclusions

Monitoring & classifying with
stored queries
'Saved searches' / 'alerting' / 'stored profiles'

stored queries
Use cases:
– Monitor news for client interests
– Tag/classify incoming data according to certain rules

stored queries
Use cases:
– Monitor news for client interests
– Tag/classify incoming data according to certain rules
Volume & speed
– 10k → 1m stored queries
– 10k → 1m new items a day
– Latencies as low as 50-100ms

Media monitoring
Provide a news monitoring service to clients

Media monitoring
Usually also provide a searchable news archive

Media monitoring
People translate requirements & verify output...

Media monitoring
..but volume is handled by software

Media monitoring
..but volume is handled by software
UK/Europe – Gorkana, Precise, Cision;
USA – BurellesLuce

Media monitoring
– how it used to be done
Manual process of translating client interests into stored
queries

Media monitoring
Manual process of translating client interests into stored
queries
– many pages of Boolean terms
– 10k-250k characters
– Evolved over many years, grown not built
– Difficult to maintain and troubleshoot

Like this...
(((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR ";G.P.R.S";
OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE COMM*"; OR
";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR ";H.S.U.P.A"; OR ";HIGH
SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O"; OR ";SMS"; OR ";SHORT
MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR ";!CELLPHONE*"; OR ";!
TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR ";TELCO*"; OR ";VODAFONE"; OR
";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*"; OR ";TEXT MESSAG*"; OR
";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !WIRELESS";) W/48
((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR ";!REFINANC*"; OR
";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR ";TAKEOVER*"; OR ";BUYOUT*"; OR
";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*"; OR ";ACCOUNT*"; OR ";MONEY"; OR
";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*"; OR ";JOINT*"; OR ";NEW VENTURE*"; OR
";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL
YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR ";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE
PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR ";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR
";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MIS-SELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*";
OR ";FIBRE OPTIC*"; OR ";TAX"; OR ";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR
";STAFF"; OR ";WORKER*"; OR ";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR
";UNFAIR"; OR ";%UNSCRUPULOUS"; OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*";
OR ";PRICE PLAN*"; OR ";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR
((";CALLS COST";) W/6 (";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))
…and that's an easy one!

Media monitoring
Complex pipelines with many sources monitored
– print & scan & OCR
– XML feeds (e.g. NLA media access in UK)
– Web content, Twitter etc.
Lots of search engines running in parallel
– dtSearch on 100+ servers, then merged
– 'Stored query' functionality built in e.g. Verity

What we can't change...
Syntax of the stored queries (sometimes)
– Rewriting these is a huge, risky job for the business

Volume keeps increasing

The need to test new and old queries on archive data -
using the same syntax

The need to test new and old queries on archive data -
using the same syntax
Source data will often be low quality
– Scans
– PDF extracts
– Re-typed XML

Speaking the same language
Parse the old query syntax into a new form
– Either on the fly (dtSearch, Verity)
– Or off-line (Verity)
– In theory we can translate any old syntax
– A neutral syntax protects against lock-in
• Why not?

Speaking the same language
Parse the old query syntax into a new form
– Either on the fly (dtSearch, Verity)
– Or off-line (Verity)
– In theory we can translate any old syntax
– A neutral syntax protects against lock-in
• Why not?
Testing very, very important
– Agree a test set
– Watch for broken queries (in several ways)
– Use the client's resources to help

Turning search upside down..
Docs
Result
QQueureyry Stored
Queries

Docs
Result
QQueureyry Stored
Queries
1 million queries
Some 250k long
Complex rules
1 million new
documents a
day
Within 5-100ms

Docs
Result
QQueureyry Stored
Queries
1 million queries
Some 250k long
Complex rules
1 million new
documents a
day
$$$$$$
Within 5-100ms

Docs
QQueureyry Stored
Queries 1.
Pre
Query
Subset
1 million queries
Some 250k long
Complex rules
~200
Doc
1 million new
documents a
day

Docs
QQueureyry Stored
Queries 1.
Pre
Query
Subset
Doc
Result
1 million queries
Some 250k long
Complex rules
~200
2.
Search
1 million new
documents a
day

How to avoid running 1 million searches on 1 million
documents?

How to avoid running 1 million searches on 1 million
documents? – don't run all of them!
– 1. Presearch - Find the likely candidates first by
searching the stored queries using the document

• How to avoid running 1 million searches on 1 million
documents? – don't run all of them!
– 1. Presearch - Find the likely candidates first by
searching the stored queries using the document
– 2. Search - Then perform a normal search, but with
far fewer queries, and do it in memory for speed

Introducing Luwak
Based on a (slight) fork of Apache Lucene
A library for efficiently running many stored queries
Two stages:
– Presearcher (find candidate queries)
– Monitor Searcher (run candidate queries against a
single document in-memory index)
Very fast – 70-100k queries per second when first tested
and 3-4 times faster now
https://github.com/flaxsearch/luwak

The gory details...
Custom QueryParsers for dtSearch, Verity VQL/OTL etc.
Hacked Lucene to expose positional information
LUCENE-3831, LUCENE-3827 in Lucene 4.x
LUCENE-2878 'Nuke Spans' coming in Lucene 5.0
Works with standard Lucene queries
TermQuery, BooleanQuery, RegexpQuery...
Can be extended to work with custom query types
Lots (and lots) of performance tuning

Using Luwak in a larger system
Run multiple instances of Luwak to handle throughput
Use message passing platforms (e.g. RabbitMQ, Apache
Kafka)
Build a separate index of articles – a searchable archive –
for testing new stored queries and/or for customers
Use RESTful Web Service APIs everywhere
Test against known matches from older systems

Likely problems
Some old queries will be broken
– Replicate the broken behaviour, or fix it?

Likely problems
Some queries will need troubleshooting
– Queries can be very large
– Possibly in a foreign language

Likely problems
Performance tuning is hard
– JVM settings, caching, storage types

Likely problems
Performance tuning is hard
– JVM settings, caching, storage types
Remember this is a different system – 1 to 1 matching
correspondence is impossible!

Case study –
media monitoring in Scandinavia
“The leading Danish provider of media intelligence, i.e.
media search, media monitoring and media analysis”
Old system based on Verity (monitoring) and Autonomy
IDOL (archive search)
Issues:
Systems at full capacity - even with lots of hardware
Archive search very, very slow
Complex workflow with some very old parts (FTP, batch
files...)
Proprietary query format (Verity VQL)

Case study –
Flax engaged in 2013 as part of a complete re-architecture
Phased approach:
1. Translate all old queries to new in-house syntax 'IQL'
2. New monitoring pipeline (RabbitMQ, RESTful Web
Services, Flax's Luwak)
3. New archive search (Apache Solr)

Case study –
Flax engaged in 2013 as part of a complete re-architecture
Phased approach:
1. Translate all old queries to new in-house syntax 'IQL'
2. New monitoring pipeline (RabbitMQ, RESTful Web
Services, Flax's Luwak)
3. New archive search (Apache Solr)
Performance/volume targets:
Monitoring 20 articles/second, aim for 100 eventually
100 qps for archive search, response within 250ms
17000 stored queries, 85 million articles in archive

Case study –
Monitoring using 2 servers, Archive search using 8
servers (including failover)
6 cores, 18GB per server
launch more servers to scale the system

Case study –
All stored queries translated to new IQL syntax
Including the ones Verity didn't tell us weren't working...

Case study –
All stored queries translated to new IQL syntax
Including the ones Verity didn't tell us weren't working...
Monitoring customers are being migrated in stages from
Oct 2014, Archive search beta testing during Oct 2014,
completion by end 2014

Case study –
The benefits:
– No vendor lock-in (open source, IQL)
– Industry standard software (Apache Lucene/Solr,
RabbitMQ)
– Predictable scaling & costs

Who else is using Luwak?
Bloomberg
http://www.flax.co.uk/blog/2014/03/24/london-search-meetup-serious-solr-at-bloomberg-elasticsearch-Booz Allen Hamilton
http://www.slideshare.net/BryanBende/realtime-inverted-search-nyc-aslug-oct-2014
Early versions are installed at
We know of users in other sectors e.g. recruitment
We're hoping to make it a standard part of Lucene/Solr
and Elasticsearch (i.e. to improve the Percolator)
Apache Samza integration?https://github.com/romseygeek/samza-luwak

Conclusions
To solve a hard problem, sometimes you need to turn it
upside down

Conclusions
upside down
When you migrate to open source, you can keep some of
the parts of your system (if you must)

Conclusions
upside down
Open source helps you scale - with predictable costs

Conclusions
upside down
Leading companies are doing it

Conclusions
upside down
Leading companies are doing it
Go and turn Search Upside Down!

Another thing...
BioSolr – a funded project to improve the way search
technology is used in bioinformatics
– EBI & NCBI are involved
– http://www.flax.co.uk/blog/2014/10/02/biosolr-begins-with-a-workshop-day/
– Already producing Solr patches e.g SOLR-1387
– Come and join us!

Thankyou!
Any questions?
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch

Turning search upside down with powerful open source search software

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Turning search upside down with powerful open source search software

Similar to Turning search upside down with powerful open source search software (20)

More from Charlie Hull

More from Charlie Hull (6)

Recently uploaded

Recently uploaded (20)

Turning search upside down with powerful open source search software