What's the story with Open Source?

Searching and monitoring news media with open source technology
Charlie Hull, Flax
BCS IRSG Search Solutions 2010
Photo source: http://www.flickr.com/photos/shironekoeuro/

www.flax.co.uk 2
What is Flax?
Search engine specialists
Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
Based in Cambridge UK
Contributors to and users of Xapian
Recently selected as UK Authorized Partner by Lucid Imagination
Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
Apache Lucene and Solr are trademarks of The Apache Software Foundation

The challenges
Content is created for publication, not for search
Content isn't published consistently or available to all
Ranking is never simple
“We just want something like Google”
Every system will have to scale beyond its originally planned size
Every project is different

So how do we build news search?

Indexing
Historical, daily & updates (i.e. later editions)
Must cope with high volume, quickly
Essential metadata – byline, title, source
File format translation not always necessary
BUT pre-processing sometimes required
Content restriction & embargo data
Solution
Lightweight, customisable index scripts using powerful open source libraries

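from datetime import datetime  # needed for the date field below

import xapian
import flax.core

# create a new Xapian database on disk
db = xapian.WritableDatabase('db', xapian.DB_CREATE)

# describe the fields once and store the mapping with the database
fm = flax.core.Fieldmap()
fm.language = 'en' # stem for English
fm.setfield('mytext', False) # freetext field
fm.setfield('mydate', True) # filter field
fm.save(db)

# build one document, index its fields and add it
doc = fm.document()
doc.index('mytext', "I don't like spam.")
doc.index('mydate', datetime(2010, 2, 3, 12, 0))
fm.add_document(db, doc)
db.flush() # write pending changes to disk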
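
Embargo and restriction data stored at index time only helps if it is checked again at search time. The following is a minimal sketch in plain Python, not part of Flax or Xapian: the `Article` class, its `embargo_until` and `allowed_groups` fields, and the `visible` and `filter_results` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Article:
    """Hypothetical record holding the restriction data captured at index time."""
    title: str
    embargo_until: Optional[datetime] = None  # None means no embargo
    allowed_groups: frozenset = frozenset()   # empty means unrestricted


def visible(article, user_groups, now):
    """Return True if a user in `user_groups` may see `article` at time `now`."""
    if article.embargo_until is not None and now < article.embargo_until:
        return False  # still under embargo
    if article.allowed_groups and not (article.allowed_groups & set(user_groups)):
        return False  # restricted to groups the user is not in
    return True


def filter_results(results, user_groups, now):
    """Drop search results the user is not yet allowed to see."""
    return [a for a in results if visible(a, user_groups, now)]
```

In a real system the same check would more likely be expressed as a filter on the search engine query itself, so that hidden articles never affect result counts.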

Searching
Free text with Boolean operators
Filters for metadata & date ranges
Combine date and relevance ranking
Faceted search where appropriate
Saved searches & alerting
'More like this'
Content restriction & embargo filters
Solution
Template-based user interface scripts, again using open source libraries
Beware JavaScript & older browsers!
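
One possible way to combine date and relevance ranking is to blend the engine's text score with a recency score that decays over time. This is a sketch under stated assumptions rather than the method used in the systems described here: it assumes the relevance score has already been normalised to the range 0 to 1, and `blended_score`, `half_life_days` and `recency_weight` are illustrative names.

```python
from datetime import datetime


def blended_score(relevance, published, now, half_life_days=7.0, recency_weight=0.5):
    """Blend a normalised relevance score (0..1) with an exponential recency decay.

    The recency part is 1.0 for a story published right now and halves
    every `half_life_days`. `recency_weight` sets how much it counts.
    """
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - recency_weight) * relevance + recency_weight * recency


def rank(hits, now):
    """Sort (title, relevance, published) tuples by blended score, best first."""
    return sorted(hits, key=lambda h: blended_score(h[1], h[2], now), reverse=True)
```

With these defaults a slightly less relevant story from today can outrank a more relevant one from a month ago. How strongly to favour freshness is a tuning decision for each project.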

Administration
Indexing failures common
Logging is essential
Log to text as a first pass, reports later
Scalability
Content is always growing
Both indexing & searching must scale
Open source search libraries provide distributed indexing, replication, remote indexes
Not simple to get this right!
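
The "log to text first" advice can be sketched with Python's standard `logging` module. The `make_logger` and `index_batch` helpers and the `index_one` callback are hypothetical names. The point is only that a failed document is recorded with its traceback and the batch carries on.

```python
import logging


def make_logger(log_path):
    """Return a logger that appends plain-text lines to `log_path`."""
    logger = logging.getLogger("indexer")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def index_batch(docs, index_one, logger):
    """Index each (doc_id, text) pair; log failures and keep going.

    Returns the list of document ids that failed, for a later report.
    """
    failed = []
    for doc_id, text in docs:
        try:
            index_one(doc_id, text)
        except Exception:
            logger.exception("failed to index %s", doc_id)
            failed.append(doc_id)
        else:
            logger.info("indexed %s", doc_id)
    return failed
```

A plain text log like this can be searched by hand from day one, and the returned list of failed ids is a natural starting point for reports later.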

Available open source technologies
Languages – C/C++, Java, Python, JavaScript
Search libraries – Xapian, Lucene
Search bindings/servers – Xappy, Flax.core, Solr
External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
We can use whatever works!

Some examples
Newspaper Licensing Agency – NLA Clipshare
20 million newspaper stories
6500 users
Content from every major newspaper (and most regionals)
Used by journalists, clippings agencies, media monitors
Replacing internal systems at major newspapers
One of very few ways to search content from all the papers within hours of publication
http://www.nla-clipshare.com

Some examples
Financial Times – press cuttings
Web Service for easy integration
XML source data
Faceted search
Area filters (whole article, body, headline, byline or any combination)
Synonyms, spelling suggestions
Built from scratch in a fortnight
Designed as a prototype, scaled to production use without significant change
http://presscuttings.ft.com

A different task – news monitoring
Non-traditional use of search
Many automated searches on incoming content
Searches reflect complex client needs
False positives require human checking
False negatives should never occur!
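
News monitoring turns search around: the queries are stored and each incoming article is tested against them. The sketch below shows that idea in plain Python with simple term sets. The `Profile` class and its `all_of`, `any_of` and `none_of` parameters are hypothetical, and real client profiles are far richer than this.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Profile:
    """A stored client search: required, optional and excluded terms."""

    def __init__(self, name, all_of=(), any_of=(), none_of=()):
        self.name = name
        self.all_of = frozenset(t.lower() for t in all_of)
        self.any_of = frozenset(t.lower() for t in any_of)
        self.none_of = frozenset(t.lower() for t in none_of)

    def matches(self, terms):
        if not self.all_of <= terms:
            return False  # a required term is missing
        if self.any_of and not (self.any_of & terms):
            return False  # none of the alternatives is present
        return not (self.none_of & terms)  # reject if an excluded term appears


def matching_profiles(article_text, profiles):
    """Return the names of all stored profiles that match one incoming article."""
    terms = tokenize(article_text)
    return [p.name for p in profiles if p.matches(terms)]
```

Because a missed article is worse than an extra one, profiles in this style tend to be written loosely, with human checkers removing the false positives afterwards.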

An example
Durrants Ltd.
Thousands of client search profiles
Hundreds of thousands of articles per day
Complex publication hierarchy
Established pipeline
Solution
Flexible query language allows OCR errors, punctuation, fuzzy matching, weighting
Supports features of previous engine
Scalable master-slave architecture
Accuracy improved in some cases from 95% rejected to 95% accepted
Hardware budget 15% of previous system
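
To give a feel for what tolerating OCR errors means, here is a small sketch that uses Python's standard `difflib` as a stand-in. It is not the query language built for this project. The `fuzzy_contains` helper and its `cutoff` parameter are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher


def fuzzy_contains(term, text, cutoff=0.8):
    """Return True if any word in `text` is similar enough to `term`.

    Similarity is difflib's ratio (0..1). A lower `cutoff` tolerates more
    OCR damage but lets through more false positives.
    """
    term = term.lower()
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if SequenceMatcher(None, term, token).ratio() >= cutoff:
            return True
    return False
```

For example, `fuzzy_contains("durrants", "Report by Durrarts Ltd")` accepts the misread name, because only one of the eight characters differs.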

Why open source?
Flexible, extendable
Powerful & scalable
Lower cost
Commercial support available as necessary
Freedom to innovate

Looking to the future
More and more content including social media
Multiple delivery platforms
Search-powered websites & applications
'No-SQL'
Cloud
Search no longer a bolt-on, but a platform for innovation
Open source no longer an outsider, but the obvious choice

Thank you!
Questions?
charlie@flax.co.uk
www.flax.co.uk/blog
Twitter: @FlaxSearch
Photo source: http://www.flickr.com/photos/katerha/4259440136/
