What's the story with Open Source? Presentation Transcript

  • What's the story with open source?
    Searching and monitoring news media with open source technology
    Charlie Hull, Flax
    BCS IRSG Search Solutions 2010
    Photo source: http://www.flickr.com/photos/shironekoeuro/
  • What is Flax?
    • Search engine specialists
    • Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
    • Based in Cambridge UK
    • Contributors to and users of Xapian
    • Recently selected as UK Authorized Partner by Lucid Imagination
    • Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
    Apache Lucene and Solr are trademarks of The Apache Software Foundation
  • The challenges
    • Content is created for publication, not for search
    • Content isn't published consistently or available to all
    • Ranking is never simple
    • “We just want something like Google”
    • Every system will have to scale beyond its originally planned size
    • Every project is different
  • So how do we build news search?
    • Indexing
      • Historical, daily & updates (i.e. later editions)
      • Must cope with high volume, quickly
      • Essential metadata – byline, title, source
      • File format translation not always necessary
      • BUT Pre-processing sometimes required
      • Content restriction & embargo data
    • Solution
      • Lightweight, customisable index scripts using powerful open source libraries
  • So how do we build news search?
        from datetime import datetime  # needed for the date value indexed below

        import xapian
        import flax.core

        db = xapian.WritableDatabase('db', xapian.DB_CREATE)
        fm = flax.core.Fieldmap()
        fm.language = 'en'             # stem for English
        fm.setfield('mytext', False)   # freetext field
        fm.setfield('mydate', True)    # filter field
        fm.save(db)

        doc = fm.document()
        doc.index('mytext', "I don't like spam.")
        doc.index('mydate', datetime(2010, 2, 3, 12, 0))
        fm.add_document(db, doc)
        db.flush()
  • So how do we build news search?
    • Searching
      • Free text with Boolean operators
      • Filters for metadata & date ranges
      • Combine date and relevance ranking
      • Faceted search where appropriate
      • Saved searches & Alerting
      • 'More like this'
      • Content restriction & embargo filters
    • Solution
      • Template-based user interface scripts, again using open source libraries
      • Beware JavaScript & older browsers!
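
The slides do not say how date and relevance are combined or how embargoes are enforced, so the following is only a minimal sketch under stated assumptions: the `Hit` class, its field names, and the exponential half-life decay are all hypothetical choices made for illustration, not Flax's or Xapian's method. It shows one way an engine's text-match score could be discounted by article age, with embargoed articles hidden until their release time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Hit:
    """One search result; the field names here are illustrative only."""
    title: str
    relevance: float                          # text-match score, higher is better
    published: datetime
    embargo_until: Optional[datetime] = None  # None means no embargo


def blended_score(hit, now, half_life_days=7.0):
    """Halve the relevance score for every `half_life_days` of article age."""
    age_days = max((now - hit.published).total_seconds() / 86400.0, 0.0)
    return hit.relevance * 0.5 ** (age_days / half_life_days)


def rank(hits, now, half_life_days=7.0):
    """Hide embargoed hits, then sort the rest by blended score, best first."""
    visible = [h for h in hits
               if h.embargo_until is None or h.embargo_until <= now]
    return sorted(visible,
                  key=lambda h: blended_score(h, now, half_life_days),
                  reverse=True)
```

A shorter half-life favours breaking news, while a longer one behaves more like pure relevance ranking; the right value is likely to differ per project.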
  • So how do we build news search?
    • Administration
        • Indexing failures common
        • Logging is essential
        • Log to text as a first pass, reports later
    • Scalability
        • Content is always growing
        • Both indexing & searching must scale
        • Open source search libraries provide distributed indexing, replication, remote indexes
        • Not simple to get this right!
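
As a rough illustration of "log to text as a first pass", the sketch below uses Python's standard `logging` module. The function names, the log format, and the `index_one` callable are hypothetical; the talk does not describe the actual logging setup. The idea shown is that one bad article is recorded with its traceback and the batch carries on, so reports can be built from the text log later.

```python
import logging


def make_text_logger(path):
    """Return a logger that appends timestamped lines to a plain text file."""
    log = logging.getLogger("indexer")
    log.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    return log


def index_batch(articles, index_one, log):
    """Index each article, logging failures instead of aborting the batch.

    `index_one` is a hypothetical callable that raises on bad input.
    Returns (indexed_count, failed_ids).
    """
    indexed, failed = 0, []
    for article in articles:
        try:
            index_one(article)
            indexed += 1
        except Exception:
            # log.exception records the traceback alongside the message
            log.exception("indexing failed for article id=%s",
                          article.get("id"))
            failed.append(article.get("id"))
    log.info("batch done: %d indexed, %d failed", indexed, len(failed))
    return indexed, failed
```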
  • So how do we build news search?
    • Available open source technologies
      • Languages – C/C++, Java, Python, JavaScript
      • Search libraries – Xapian, Lucene
      • Search bindings/servers – Xappy, Flax.core, Solr
      • External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
      • Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
      • We can use whatever works!
  • Some examples
      Newspaper Licensing Agency – NLA Clipshare
        • 20 million newspaper stories
        • 6500 users
        • Content from every major newspaper (and most regionals)
        • Used by journalists, clippings agencies, media monitors
        • Replacing internal systems at major newspapers
        • One of very few ways to search content from all the papers within hours of publication
    http://www.nla-clipshare.com
  • Some examples
      Financial Times – press cuttings
      • Web Service for easy integration
      • XML source data
      • Faceted search
      • Area filters (whole article, body, headline, byline or any combination)
      • Synonyms, spelling suggestions
      • Built from scratch in a fortnight
      • Designed as a prototype, scaled to production use without significant change
    http://presscuttings.ft.com
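
Faceted search boils down to counting how many hits fall under each value of a metadata field. The following is a minimal sketch using only the standard library; the dictionary-based hits and the `source` field are assumptions for illustration, and a real search library would normally compute these counts inside the engine.

```python
from collections import Counter


def facet_counts(hits, field):
    """Count how many hits carry each value of `field`, skipping hits without it."""
    return Counter(hit[field] for hit in hits if field in hit)
```

A user interface would typically show the most frequent values (for example via `most_common`) as clickable filters beside the result list.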
  • A different task – news monitoring
      Non-traditional use of search
      • Many automated searches on incoming content
      • Searches reflect complex client needs
      • False positives require human checking
      • False negatives should never occur!
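
Monitoring inverts the usual search pattern: the queries are stored and each incoming article is tested against them. The sketch below illustrates that idea with a deliberately simple profile format (lists of required and excluded words). The format and function names are hypothetical; real client profiles are described in the talk as far more complex.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matching_profiles(article_text, profiles):
    """Return the sorted names of stored profiles that match one article.

    `profiles` maps a profile name to a dict with optional keys
    "all" (every word must appear) and "none" (no word may appear).
    """
    terms = tokenize(article_text)
    matched = []
    for name, rule in profiles.items():
        required = set(rule.get("all", []))
        excluded = set(rule.get("none", []))
        if required <= terms and not (excluded & terms):
            matched.append(name)
    return sorted(matched)
```

Because a missed article is worse than a spurious one in this setting, profiles tend to be written to over-match, with human checking to remove the false positives.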
  • A different task – news monitoring
      An example
      • Durrants Ltd.
          • Thousands of client search profiles
          • Hundreds of thousands of articles per day
          • Complex publication hierarchy
          • Established pipeline
      • Solution
          • Flexible query language allows OCR errors, punctuation, fuzzy matching, weighting
          • Supports features of previous engine
          • Scalable master-slave architecture
      • Accuracy improved, in some cases from 95% of results rejected to 95% accepted
      • Hardware budget 15% of previous system
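
The slides do not describe how the query language tolerates OCR errors. As a stand-in, this sketch uses `difflib.SequenceMatcher` from the Python standard library to accept words that are close to, but not exactly, the search term. The function name and the 0.8 threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def fuzzy_contains(text, term, threshold=0.8):
    """Return True if any word in `text` is at least `threshold` similar to `term`."""
    term = term.lower()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if SequenceMatcher(None, word, term).ratio() >= threshold:
            return True
    return False
```

With this threshold, a scanned "Barc1ays" (digit 1 in place of the letter l) still matches the term "barclays". Lowering the threshold catches more OCR damage at the cost of more false positives for human checkers.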
  • Why open source?
    • Flexible, extendable
    • Powerful & scalable
    • Lower cost
    • Commercial support available as necessary
    • Freedom to innovate
  • Looking to the future
    • More and more content including social media
    • Multiple delivery platforms
    • Search-powered websites & applications
    • 'No-SQL'
    • Cloud
      • Search no longer a bolt-on, but a platform for innovation
      • Open source no longer an outsider, but the obvious choice
  • Thank you! Questions?
    • [email_address]
    • www.flax.co.uk/blog
    • Twitter: @FlaxSearch
    Photo source: http://www.flickr.com/photos/katerha/4259440136/