What's the story with Open Source?


  1. What's the story with open source? Searching and monitoring news media with open source technology. Charlie Hull, Flax. BCS IRSG Search Solutions 2010. Photo source: http://www.flickr.com/photos/shironekoeuro/
  3. What is Flax? (www.flax.co.uk) Search engine specialists; formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd; based in Cambridge, UK; contributors to and users of Xapian; recently selected as UK Authorized Partner by Lucid Imagination; customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen. Apache Lucene and Solr are trademarks of The Apache Software Foundation.
  10. The challenges: content is created for publication, not for search; content isn't published consistently or available to all; ranking is never simple; “We just want something like Google”; every system will have to scale beyond its originally planned size; every project is different.
  19. So how do we build news search? Indexing: historical, daily & updates (i.e. later editions); must cope with high volume, quickly; essential metadata (byline, title, source); file format translation not always necessary, BUT pre-processing sometimes required; content restriction & embargo data. Solution: lightweight, customisable index scripts using powerful open source libraries.
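  20. So how do we build news search? Example index script:

      from datetime import datetime  # needed for the date field below

      import xapian
      import flax.core

      # Create a new Xapian database on disk
      db = xapian.WritableDatabase('db', xapian.DB_CREATE)

      # Describe the fields once and store the mapping in the database
      fm = flax.core.Fieldmap()
      fm.language = 'en'            # stem for English
      fm.setfield('mytext', False)  # freetext field
      fm.setfield('mydate', True)   # filter field
      fm.save(db)

      # Build a document, index its fields and add it to the database
      doc = fm.document()
      doc.index('mytext', "I don't like spam.")
      doc.index('mydate', datetime(2010, 2, 3, 12, 0))
      fm.add_document(db, doc)
      db.flush()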
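
The indexing slides list essential metadata, occasional pre-processing and embargo data as requirements. A minimal sketch of such a pre-processing step follows, using only the Python standard library. The dictionary layout, the field names (`body`, `published`, `embargo_until`) and the date format are assumptions made for illustration; they are not taken from the talk or from any Flax API.

```python
import re
from datetime import datetime

# The metadata the slides call essential
REQUIRED_FIELDS = ("title", "byline", "source")
DATE_FORMAT = "%Y-%m-%d %H:%M"  # assumed input format


def prepare_article(raw):
    """Normalise one raw article record (hypothetical dict layout) before indexing."""
    missing = [name for name in REQUIRED_FIELDS if not (raw.get(name) or "").strip()]
    if missing:
        raise ValueError("missing metadata: " + ", ".join(missing))

    # Collapse stray whitespace in the metadata fields
    doc = {name: " ".join(raw[name].split()) for name in REQUIRED_FIELDS}

    # Pre-processing: replace leftover markup with spaces, then collapse whitespace
    text = re.sub(r"<[^>]+>", " ", raw.get("body") or "")
    doc["body"] = " ".join(text.split())

    # Dates become datetime objects so they can be used as filter fields later
    doc["published"] = datetime.strptime(raw["published"], DATE_FORMAT)
    embargo = raw.get("embargo_until")
    doc["embargo_until"] = datetime.strptime(embargo, DATE_FORMAT) if embargo else None
    return doc
```

The resulting dictionary could then be passed field by field to whichever index script is in use.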
  30. So how do we build news search? Searching: free text with Boolean operators; filters for metadata & date ranges; combine date and relevance ranking; faceted search where appropriate; saved searches & alerting; 'More like this'; content restriction & embargo filters. Solution: template-based user interface scripts, again using open source libraries. Beware JavaScript & older browsers!
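
Two of the search requirements above, embargo filters and combining date with relevance ranking, can be sketched in a few lines of plain Python. This is one possible approach under stated assumptions, not the method used in the projects described: hits are assumed to be dictionaries with `relevance`, `published` and `embargo_until` keys, and the half-life decay is an illustrative choice.

```python
from datetime import datetime, timedelta


def visible(hit, now):
    """Embargo filter: hide hits whose embargo time has not yet passed."""
    return hit["embargo_until"] is None or hit["embargo_until"] <= now


def blended_score(hit, now, half_life_days=7.0):
    """Halve a hit's relevance score for every half_life_days of age."""
    age_days = max((now - hit["published"]).total_seconds(), 0.0) / 86400.0
    return hit["relevance"] * 0.5 ** (age_days / half_life_days)


def rank(hits, now, half_life_days=7.0):
    """Drop embargoed hits, then order the rest by the blended score."""
    allowed = [h for h in hits if visible(h, now)]
    return sorted(allowed, key=lambda h: blended_score(h, now, half_life_days), reverse=True)
```

A shorter half-life favours breaking news; a longer one behaves more like pure relevance ranking. Either way the embargo check runs before ranking, so restricted content never reaches the result list.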
  34. So how do we build news search? Administration: indexing failures common; logging is essential; log to text as a first pass, reports later. Scalability: content is always growing; both indexing & searching must scale; open source search libraries provide distributed indexing, replication, remote indexes. Not simple to get this right!
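
The advice to log to text first can be followed with Python's standard `logging` module. The sketch below is illustrative only: the logger name, the log format and the `index_one` callback are assumptions, and `index_one` stands in for whatever function actually adds an article to the index.

```python
import logging


def make_text_logger(path):
    """Return a logger that appends timestamped lines to a plain text file."""
    log = logging.getLogger("indexer")
    log.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    return log


def index_batch(articles, index_one, log):
    """Index each article; log failures and carry on so one bad file cannot stop the run."""
    failed = []
    for article in articles:
        try:
            index_one(article)
        except Exception:
            # log.exception records the traceback alongside the message
            log.exception("indexing failed for %s", article.get("id", "<unknown>"))
            failed.append(article)
        else:
            log.info("indexed %s", article.get("id", "<unknown>"))
    return failed
```

Returning the failed articles makes it easy to retry them later, and the text log can be turned into reports once the basic pipeline is stable.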
  36. So how do we build news search? Available open source technologies. Languages: C/C++, Java, Python, JavaScript. Search libraries: Xapian, Lucene. Search bindings/servers: Xappy, Flax.core, Solr. External libraries: pyparsing, CherryPy, xmllib, mxODBC, ... Presentation & UI: HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ... We can use whatever works!
  38. Some examples: Newspaper Licensing Agency (NLA Clipshare). 20 million newspaper stories; 6500 users; content from every major newspaper (and most regionals); used by journalists, clippings agencies, media monitors; replacing internal systems at major newspapers; one of very few ways to search content from all the papers within hours of publication. http://www.nla-clipshare.com
  43. Some examples: Financial Times press cuttings. Web Service for easy integration; XML source data; faceted search; area filters (whole article, body, headline, byline or any combination); synonyms, spelling suggestions; built from scratch in a fortnight; designed as a prototype, scaled to production use without significant change. http://presscuttings.ft.com
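
Area filters and facets, as listed for the press cuttings service, can be illustrated with a toy in-memory search. This is a sketch under assumptions, not the Financial Times implementation: articles are plain dictionaries, matching is a case-insensitive substring test rather than a real index lookup, and the `section` field used for faceting is hypothetical.

```python
from collections import Counter

# Searching every area is the "whole article" option
AREAS = ("headline", "byline", "body")


def search(articles, term, areas=AREAS):
    """Return articles where term occurs (as a substring) in any of the chosen areas."""
    term = term.lower()
    return [a for a in articles if any(term in a[area].lower() for area in areas)]


def facet_counts(hits, field):
    """Count how many hits fall under each value of a metadata field."""
    return Counter(h[field] for h in hits)
```

Restricting `areas` to `("headline",)` narrows the search to headlines only, while the facet counts give the numbers typically shown next to each refinement link.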
  49. A different task: news monitoring. Non-traditional use of search; many automated searches on incoming content; searches reflect complex client needs; false positives require human checking; false negatives should never occur!
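
Monitoring inverts the usual search pattern: the queries are stored and each incoming article is tested against them. The brute-force sketch below shows the idea only. The `Profile` class, its `all_of` / `any_of` / `none_of` parameters and the simple word matching are hypothetical, and a production system would use a search engine's query language rather than Python sets.

```python
import re


def tokens(text):
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class Profile:
    """A stored client search: required, optional and excluded words."""

    def __init__(self, client, all_of=(), any_of=(), none_of=()):
        self.client = client
        self.all_of = set(all_of)
        self.any_of = set(any_of)
        self.none_of = set(none_of)

    def matches(self, words):
        if not self.all_of <= words:                  # every required word must be present
            return False
        if self.any_of and not (self.any_of & words):  # at least one optional word
            return False
        return not (self.none_of & words)              # no excluded word may appear


def route(article_text, profiles):
    """Return the clients whose stored profile matches an incoming article."""
    words = tokens(article_text)
    return [p.client for p in profiles if p.matches(words)]
```

Because a missed article is worse than an extra one, profiles in this style tend to be written broadly, with the excluded words and human checking removing the false positives.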
  53. A different task: news monitoring. An example: Durrants Ltd. Thousands of client search profiles; hundreds of thousands of articles per day; complex publication hierarchy; established pipeline. Solution: flexible query language allows for OCR errors, punctuation, fuzzy matching, weighting; supports features of previous engine; scalable master-slave architecture. Accuracy improved in some cases from 95% rejected to 95% accepted; hardware budget 15% of previous system.
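
To show what tolerating OCR errors and punctuation can mean in practice, here is a small sketch using `difflib` from the Python standard library. It is not the query language built for Durrants; the function names and the 0.8 similarity threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lower-case the text and drop punctuation, keeping only letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_contains(term, text, threshold=0.8):
    """True if any word in text is at least `threshold` similar to term."""
    term = term.lower()
    return any(
        SequenceMatcher(None, term, word).ratio() >= threshold
        for word in words(text)
    )
```

A typical OCR slip such as "rn" read as "m" ("govemment" for "government") still scores above the threshold, while a different word such as "governor" does not. Lowering the threshold catches more errors at the cost of more false positives for the human checkers.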
  58. Why open source? Flexible, extendable; powerful & scalable; lower cost; commercial support available as necessary; freedom to innovate.
  66. Looking to the future: more and more content, including social media; multiple delivery platforms; search-powered websites & applications; 'No-SQL'; cloud; search no longer a bolt-on, but a platform for innovation; open source no longer an outsider, but the obvious choice.
  67. Thank you! Questions? charlie@flax.co.uk, www.flax.co.uk/blog, Twitter: @FlaxSearch. Photo source: http://www.flickr.com/photos/katerha/4259440136/
