Transcript

  • 1. What's the story with open source?
    Searching and monitoring news media with open source technology
    Charlie Hull, Flax
    BCS IRSG Search Solutions 2010
    Photo source: http://www.flickr.com/photos/shironekoeuro/
  • 2. What is Flax?
    • Search engine specialists
    • Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
    • Based in Cambridge UK
    • Contributors to and users of Xapian
    • Recently selected as UK Authorized Partner by Lucid Imagination
    • Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
    Apache Lucene and Solr are trademarks of The Apache Software Foundation
  • 9. The challenges
    • Content is created for publication, not for search
    • Content isn't published consistently or available to all
    • Ranking is never simple
    • “We just want something like Google”
    • Every system will have to scale beyond its originally planned size
    - Every project is different
  • 30. So how do we build news search?
    • Indexing
      • Historical, daily & updates (i.e. later editions)
      • Must cope with high volume, quickly
      • Essential metadata – byline, title, source
      • File format translation not always necessary
      • BUT Pre-processing sometimes required
      • Content restriction & embargo data
    • Solution
      • Lightweight, customisable index scripts using powerful open source libraries
  • 59. So how do we build news search?
      from datetime import datetime

      import xapian
      import flax.core

      # Create a new Xapian database on disk
      db = xapian.WritableDatabase('db', xapian.DB_CREATE)

      # Describe the fields and save the field map to the database
      fm = flax.core.Fieldmap()
      fm.language = 'en'            # stem for English
      fm.setfield('mytext', False)  # freetext field
      fm.setfield('mydate', True)   # filter field
      fm.save(db)

      # Build a document, add it to the database and flush to disk
      doc = fm.document()
      doc.index('mytext', "I don't like spam.")
      doc.index('mydate', datetime(2010, 2, 3, 12, 0))
      fm.add_document(db, doc)
      db.flush()
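
The indexing slides call for essential metadata (byline, title, source) and note that pre-processing is sometimes required. A minimal sketch of such a pre-processing step follows, using only the Python standard library. The XML layout and the element names `article`, `title`, `byline`, `source` and `body` are assumptions made for illustration and do not come from any real publisher feed or from Flax's own scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; real feeds will differ per publisher.
SAMPLE = """<article>
  <title>Council approves new bridge</title>
  <byline>By A. Reporter</byline>
  <source>Example Gazette</source>
  <body>  The council voted   on Tuesday.  </body>
</article>"""

# Metadata that must be present before an article is worth indexing.
REQUIRED = ("title", "byline", "source")


def extract_fields(xml_text):
    """Return a dict of cleaned fields; raise ValueError if metadata is missing."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name in REQUIRED + ("body",):
        node = root.find(name)
        text = node.text if node is not None and node.text else ""
        # Collapse runs of whitespace left over from the publishing system.
        fields[name] = " ".join(text.split())
    missing = [name for name in REQUIRED if not fields[name]]
    if missing:
        raise ValueError("missing metadata: " + ", ".join(missing))
    return fields


fields = extract_fields(SAMPLE)
```

The resulting dictionary could then be passed field by field to an indexing call such as `doc.index(...)` on the slide above; rejecting articles with missing metadata early keeps bad records out of the index.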
  • 60. So how do we build news search?
    • Searching
      • Free text with Boolean operators
      • Filters for metadata & date ranges
      • Combine date and relevance ranking
      • Faceted search where appropriate
      • Saved searches & Alerting
      • 'More like this'
      • Content restriction & embargo filters
    • Solution
      • Template-based user interface scripts, again using open source libraries
      • Beware JavaScript & older browsers!
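
Two of the search requirements above, combining date with relevance ranking and applying embargo filters, can be sketched in a few lines. This is an illustration under stated assumptions, not the method used in the projects described: the exponential recency decay, the parameter names `half_life_days` and `date_weight`, and the `embargo_until` field are all hypothetical, and the relevance scores are assumed to be already normalised to the range 0 to 1.

```python
from datetime import datetime, timedelta


def recency(published, now, half_life_days=7.0):
    """Score in (0, 1]: 1.0 for an article published now, halving every half_life_days."""
    age_days = max((now - published).total_seconds(), 0.0) / 86400.0
    return 0.5 ** (age_days / half_life_days)


def rank(hits, now, date_weight=0.3):
    """Drop embargoed hits, then sort by a weighted blend of relevance and recency.

    Each hit is a dict with 'id', 'relevance' (0..1), 'published' and an
    optional 'embargo_until' datetime.
    """
    visible = [h for h in hits
               if h.get("embargo_until") is None or h["embargo_until"] <= now]

    def blended(h):
        return ((1.0 - date_weight) * h["relevance"]
                + date_weight * recency(h["published"], now))

    return sorted(visible, key=blended, reverse=True)


now = datetime(2010, 10, 21, 12, 0)
hits = [
    {"id": "old-strong", "relevance": 0.9, "published": now - timedelta(days=70)},
    {"id": "new-weak", "relevance": 0.7, "published": now - timedelta(hours=1)},
    {"id": "embargoed", "relevance": 1.0, "published": now,
     "embargo_until": now + timedelta(days=1)},
]
ordered = [h["id"] for h in rank(hits, now)]
```

With `date_weight=0.3` the fresh article outranks the older, more relevant one, and the embargoed article is never returned. Setting `date_weight=0` falls back to pure relevance ranking, which is one way to let users choose between the two orderings.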
  • 104. So how do we build news search?
    • Administration
      • Indexing failures common
      • Logging is essential
      • Log to text as a first pass, reports later
    • Scalability
      • Content is always growing
      • Both indexing & searching must scale
      • Open source search libraries provide distributed indexing, replication, remote indexes
      • Not simple to get this right!
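
The administration advice above (indexing failures are common, log to text first, build reports later) might look like the following sketch, which uses the standard `logging` module. The function names `make_logger`, `index_batch` and `demo_index_one`, the log format and the sample articles are assumptions for illustration; `demo_index_one` stands in for whatever real indexing call a project uses.

```python
import logging
import os
import tempfile


def make_logger(path):
    """Create a logger that appends plain-text lines to the file at path."""
    logger = logging.getLogger("indexer")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def index_batch(articles, index_one, logger):
    """Index each article; log failures and keep going. Returns (ok, failed)."""
    ok = failed = 0
    for article in articles:
        try:
            index_one(article)
            ok += 1
        except Exception:
            # logger.exception records the traceback alongside the message.
            logger.exception("failed to index %s", article.get("id", "<no id>"))
            failed += 1
    logger.info("batch done: %d indexed, %d failed", ok, failed)
    return ok, failed


def demo_index_one(article):
    """Stand-in for a real indexing call; rejects articles without a title."""
    if not article.get("title"):
        raise ValueError("missing title")


log_path = os.path.join(tempfile.mkdtemp(), "indexing.log")
logger = make_logger(log_path)
counts = index_batch([{"id": "a1", "title": "Budget"}, {"id": "a2"}],
                     demo_index_one, logger)
```

One bad article does not stop the batch, and the text log keeps enough detail (article id plus traceback) to produce summary reports later with ordinary text tools.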
  • 119. So how do we build news search?
    • Available open source technologies
      • Languages – C/C++, Java, Python, JavaScript
      • Search libraries – Xapian, Lucene
      • Search bindings/servers – Xappy, Flax.core, Solr
      • External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
      • Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
      • We can use whatever works!
  • 130. Some examples
      Newspaper Licensing Agency – NLA Clipshare
        • 20 million newspaper stories
        • 6500 users
        • Content from every major newspaper (and most regionals)
        • Used by journalists, clippings agencies, media monitors
        • Replacing internal systems at major newspapers
        • One of very few ways to search content from all the papers within hours of publication
    http://www.nla-clipshare.com
  • 144. Some examples
      Financial Times – press cuttings
        • Web Service for easy integration
        • XML source data
        • Faceted search
        • Area filters (whole article, body, headline, byline or any combination)
        • Synonyms, spelling suggestions
        • Built from scratch in a fortnight
        • Designed as a prototype, scaled to production use without significant change
    http://presscuttings.ft.com
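
The area filters mentioned above restrict a search to the headline, byline, body or any combination of them. The sketch below shows the idea with a plain in-memory substring scan; it is not how a search engine would implement it (an engine would use per-field index terms), and the function name `search_areas`, the field names and the sample articles are all invented for illustration.

```python
AREAS = ("headline", "byline", "body")


def search_areas(articles, term, areas=AREAS):
    """Return ids of articles whose chosen areas contain term (case-insensitive)."""
    unknown = set(areas) - set(AREAS)
    if unknown:
        raise ValueError("unknown areas: " + ", ".join(sorted(unknown)))
    needle = term.lower()
    return [a["id"] for a in articles
            if any(needle in a.get(area, "").lower() for area in areas)]


ARTICLES = [
    {"id": 1, "headline": "Markets rally", "byline": "Jane Smith",
     "body": "Shares rose."},
    {"id": 2, "headline": "Smith resigns", "byline": "Tom Jones",
     "body": "The chairman left."},
]
byline_only = search_areas(ARTICLES, "smith", areas=("byline",))
anywhere = search_areas(ARTICLES, "smith")
```

Searching the byline alone finds only the article written by Smith, while searching every area also finds the article about Smith.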
  • 157. A different task – news monitoring
      Non-traditional use of search
      • Many automated searches on incoming content
      • Searches reflect complex client needs
      • False positives require human checking
      • False negatives should never occur!
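
News monitoring inverts the usual search flow: the queries are stored and each incoming article is run against all of them. A minimal sketch of that idea follows, assuming a deliberately simple profile format with `all`, `any` and `none` term sets. The format, the client names and the function names are hypothetical; real client profiles are described in the talk as far more complex.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


# Hypothetical stored searches, one per client.
PROFILES = {
    "client-a": {"all": {"wind", "farm"}, "none": {"hurricane"}},
    "client-b": {"all": {"bank"}, "any": {"merger", "takeover"}},
}


def matches(profile, words):
    """True if words satisfy the profile's all / any / none term sets."""
    if not profile.get("all", set()) <= words:
        return False
    any_terms = profile.get("any")
    if any_terms and not (any_terms & words):
        return False
    return not (profile.get("none", set()) & words)


def route(article_text, profiles=PROFILES):
    """Return the sorted names of every profile the article matches."""
    words = tokenize(article_text)
    return sorted(name for name, p in profiles.items() if matches(p, words))


matched = route("Plans for an offshore wind farm were approved after a bank takeover.")
```

Because a missed article is worse than an extra one in this setting, profiles of this kind tend to be written loosely and the matches passed to a human checker, which is the trade-off the slide describes.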
  • 168. A different task – news monitoring
      An example
      • Durrants Ltd.
          • Thousands of client search profiles
          • Hundreds of thousands of articles per day
          • Complex publication hierarchy
          • Established pipeline
      • Solution
          • Flexible query language allows for OCR errors, punctuation, fuzzy matching, weighting
          • Supports features of previous engine
          • Scalable master-slave architecture
      • Accuracy improved in some cases from 95% rejected to 95% accepted
      • Hardware budget 15% of previous system
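
To show what tolerating OCR errors through fuzzy matching can mean in practice, here is a small sketch built on `difflib.SequenceMatcher` from the Python standard library. It is a stand-in technique chosen for illustration, not the query language built for this project, and the function name `fuzzy_contains` and the 0.8 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def fuzzy_contains(term, text, threshold=0.8):
    """True if any word in text is at least `threshold` similar to term."""
    term = term.lower()
    for word in re.findall(r"\w+", text.lower()):
        if SequenceMatcher(None, term, word).ratio() >= threshold:
            return True
    return False


# "Durrauts" simulates an OCR misread of "Durrants" (n read as u).
hit = fuzzy_contains("Durrants", "Clippings supplied by Durrauts Ltd")
miss = fuzzy_contains("Durrants", "The newspaper printed nothing")
```

Lowering the threshold catches more OCR damage but also admits look-alike words, so the setting is a direct trade between false negatives and the false positives that need human checking.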
  • 186. Why open source?
    • Flexible, extendable
    • Powerful & scalable
    • Lower cost
    • Commercial support available as necessary
    - Freedom to innovate
  • 200. Looking to the future
    • More and more content including social media
    • Multiple delivery platforms
    • Search-powered websites & applications
    • 'No-SQL'
    • Cloud
      • Search no longer a bolt-on, but a platform for innovation
      • Open source no longer an outsider, but the obvious choice
  • 227. Thank you! Questions?
          • [email_address]
    www.flax.co.uk/blog Twitter: @FlaxSearch Photo source: http://www.flickr.com/photos/katerha/4259440136/