1. What's the story with open source?
Searching and monitoring news media with open source technology
Charlie Hull, Flax
BCS IRSG Search Solutions 2010
Photo source: http://www.flickr.com/photos/shironekoeuro/
3. www.flax.co.uk 3
What is Flax?
Search engine specialists
Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
Based in Cambridge UK
Contributors to and users of Xapian
Recently selected as UK Authorized Partner by Lucid Imagination
Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
Apache Lucene and Solr are trademarks of The Apache Software Foundation
10. www.flax.co.uk 10
The challenges
Content is created for publication, not for search
Content isn't published consistently or available to all
Ranking is never simple
“We just want something like Google”
Every system will have to scale beyond its originally planned size
Every project is different
19. www.flax.co.uk 19
So how do we build news search?
Indexing
Historical, daily & updates (i.e. later editions)
Must cope with high volume, quickly
Essential metadata – byline, title, source
File format translation not always necessary
BUT Pre-processing sometimes required
Content restriction & embargo data
Solution
Lightweight, customisable index scripts using powerful open source libraries
20. www.flax.co.uk 20
So how do we build news search?
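from datetime import datetime
import xapian
import flax.core
# Create the Xapian database on disk
db = xapian.WritableDatabase('db', xapian.DB_CREATE)
# Describe the fields once and save the mapping with the database
fm = flax.core.Fieldmap()
fm.language = 'en' # stem for English
fm.setfield('mytext', False) # freetext field
fm.setfield('mydate', True) # filter field
fm.save(db)
# Build a document, index each field, then add it and flush to disk
doc = fm.document()
doc.index('mytext', "I don't like spam.")
doc.index('mydate', datetime(2010, 2, 3, 12, 0))
fm.add_document(db, doc)
db.flush()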
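Before an index script like the one above can run, the essential metadata (byline, title, source) and any embargo data have to be pulled out of the source file. The following is a minimal sketch of that pre-processing step using only the Python standard library. The XML layout, the element names and the `extract_fields` helper are assumptions made for illustration; real newspaper feeds differ per publisher.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical article format, invented for this sketch.
SAMPLE = """<article>
  <title>Council approves new bridge</title>
  <byline>Jane Smith</byline>
  <source>Example Gazette</source>
  <published>2010-02-03T12:00:00</published>
  <embargo>2010-02-04T00:00:00</embargo>
  <body>The council voted on Tuesday to approve the bridge.</body>
</article>"""

REQUIRED = ("title", "byline", "source", "published")
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S"


def extract_fields(xml_text):
    """Return a dict of fields ready to hand to an index script."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name in REQUIRED:
        value = (root.findtext(name) or "").strip()
        if not value:
            # Fail loudly so the problem shows up in the indexing log.
            raise ValueError("missing essential metadata: " + name)
        fields[name] = value
    fields["published"] = datetime.strptime(fields["published"], DATE_FORMAT)
    fields["body"] = (root.findtext("body") or "").strip()
    # Embargo is optional: None means the article can be shown immediately.
    embargo = (root.findtext("embargo") or "").strip()
    fields["embargo"] = datetime.strptime(embargo, DATE_FORMAT) if embargo else None
    return fields


fields = extract_fields(SAMPLE)
```

Each value in `fields` could then be passed to a call such as `doc.index(...)`, with the dates going into filter fields.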
30. www.flax.co.uk 30
So how do we build news search?
Searching
Free text with Boolean operators
Filters for metadata & date ranges
Combine date and relevance ranking
Faceted search where appropriate
Saved searches & Alerting
'More like this'
Content restriction & embargo filters
Solution
Template-based user interface scripts, again using open source libraries
Beware JavaScript & older browsers!
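Two of the search requirements listed above, combining date with relevance ranking and applying content restriction and embargo filters, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how Flax or Xapian implements it: relevance is assumed to be already normalised to the range 0 to 1, and the half-life decay, the `date_weight` parameter and the dictionary keys are all invented for the example.

```python
from datetime import datetime, timedelta


def blended_score(relevance, published, now, half_life_days=7.0, date_weight=0.5):
    """Mix a 0..1 relevance score with a recency score that halves every half_life_days."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - date_weight) * relevance + date_weight * recency


def visible(doc, now, user_sources):
    """Hide embargoed articles and sources the user is not licensed to see."""
    if doc["embargo"] is not None and doc["embargo"] > now:
        return False
    return doc["source"] in user_sources


def rank(docs, now, user_sources):
    """Filter first, then sort by the blended score, best first."""
    allowed = [d for d in docs if visible(d, now, user_sources)]
    return sorted(
        allowed,
        key=lambda d: blended_score(d["relevance"], d["published"], now),
        reverse=True,
    )


now = datetime(2010, 10, 1, 12, 0)
docs = [
    {"id": "old", "relevance": 0.9, "published": now - timedelta(days=70),
     "embargo": None, "source": "Gazette"},
    {"id": "new", "relevance": 0.6, "published": now,
     "embargo": None, "source": "Gazette"},
    {"id": "embargoed", "relevance": 1.0, "published": now,
     "embargo": now + timedelta(days=1), "source": "Gazette"},
    {"id": "restricted", "relevance": 1.0, "published": now,
     "embargo": None, "source": "Other"},
]
ranked_ids = [d["id"] for d in rank(docs, now, user_sources={"Gazette"})]
```

With an even weighting, today's moderately relevant story outranks a highly relevant one from ten weeks ago; lowering `date_weight` shifts the balance back towards relevance.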
34. www.flax.co.uk 34
So how do we build news search?
Administration
Indexing failures common
Logging is essential
Log to text as a first pass, reports later
Scalability
Content is always growing
Both indexing & searching must scale
Open source search libraries provide distributed indexing, replication, remote indexes
Not simple to get this right!
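The administration advice above (indexing failures are common, so log to text first and build reports later) might look like this with Python's standard `logging` module. The `index_one` function is a hypothetical stand-in for a real indexing call, and the log is written to an in-memory stream only to keep the sketch self-contained; swapping the handler for `logging.FileHandler("indexing.log")` would write a plain text file instead.

```python
import io
import logging


def make_logger(stream):
    """Plain-text logger; replace StreamHandler with FileHandler for a log file."""
    log = logging.getLogger("indexer-sketch")
    log.setLevel(logging.INFO)
    log.propagate = False
    log.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log.addHandler(handler)
    return log


def index_one(article):
    """Hypothetical stand-in for the real indexing call."""
    if not article.get("body"):
        raise ValueError("empty body")


def index_batch(articles, log):
    """Index every article; one bad article must not stop the batch."""
    indexed, failed = 0, 0
    for article in articles:
        try:
            index_one(article)
            indexed += 1
        except Exception:
            failed += 1
            # log.exception records the traceback along with the message.
            log.exception("failed to index %s", article.get("id", "<unknown>"))
    log.info("batch done: %d indexed, %d failed", indexed, failed)
    return indexed, failed


stream = io.StringIO()
log = make_logger(stream)
articles = [
    {"id": "a1", "body": "Some text"},
    {"id": "a2", "body": ""},
    {"id": "a3", "body": "More text"},
]
indexed, failed = index_batch(articles, log)
log_text = stream.getvalue()
```

Because every line has a fixed format, reports can be produced later by scanning the text log for `ERROR` lines and the per-batch summary.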
36. www.flax.co.uk 36
So how do we build news search?
Available open source technologies
Languages – C/C++, Java, Python, JavaScript
Search libraries – Xapian, Lucene
Search bindings/servers – Xappy, Flax.core, Solr
External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
We can use whatever works!
38. www.flax.co.uk 38
Some examples
Newspaper Licensing Agency – NLA Clipshare
20 million newspaper stories
6500 users
Content from every major newspaper (and most regionals)
Used by journalists, clippings agencies, media monitors
Replacing internal systems at major newspapers
One of very few ways to search content from all the papers within hours of publication
http://www.nla-clipshare.com
43. www.flax.co.uk 43
Some examples
Financial Times – press cuttings
Web Service for easy integration
XML source data
Faceted search
Area filters (whole article, body, headline, byline or any combination)
Synonyms, spelling suggestions
Built from scratch in a fortnight
Designed as a prototype, scaled to production use without significant change
http://presscuttings.ft.com
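As a rough illustration of the area filters mentioned above, the sketch below restricts a search to any combination of article areas. It uses an in-memory list and simple substring matching purely for illustration; the field names, `ARTICLES` and `search_areas` are assumptions, and the real service is built on a search engine, not on code like this.

```python
# Hypothetical sample data; the field names are assumptions.
ARTICLES = [
    {"id": 1, "headline": "Bank merger agreed",
     "byline": "A. Reporter", "body": "Shares rose."},
    {"id": 2, "headline": "Markets close higher",
     "byline": "B. Writer", "body": "The merger lifted bank stocks."},
]


def search_areas(articles, term, areas=("headline", "byline", "body")):
    """Return ids of articles containing `term` in any of the chosen areas.

    Searching all three areas corresponds to a "whole article" search.
    """
    needle = term.lower()
    return [
        article["id"]
        for article in articles
        if any(needle in article.get(area, "").lower() for area in areas)
    ]
```

Calling `search_areas(ARTICLES, "merger", areas=("headline",))` limits the match to headlines, while the default searches the whole article.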
49. www.flax.co.uk 49
A different task – news monitoring
Non-traditional use of search
Many automated searches on incoming content
Searches reflect complex client needs
False positives require human checking
False negatives should never occur!
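The idea of running many stored searches against each incoming article can be sketched as follows. This is a toy model under stated assumptions: the `PROFILES` structure (required and excluded terms) and the function names are hypothetical, and real client profiles are far more complex than simple term sets.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical stored profiles: a profile matches an article when every
# term in "all" is present and no term in "none" is present.
PROFILES = {
    "client-a": {"all": {"bank", "merger"}, "none": {"river"}},
    "client-b": {"all": {"football"}, "none": set()},
}


def matching_profiles(article, profiles):
    """Return the sorted names of the stored profiles that match one article."""
    terms = tokenize(article)
    return sorted(
        name
        for name, profile in profiles.items()
        if profile["all"] <= terms and not (profile["none"] & terms)
    )
```

Here the usual roles are reversed: the queries are stored and each new document is tested against them, so every incoming article triggers many automated searches.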
A different task – news monitoring
An example
Durrants Ltd.
Thousands of client search profiles
Hundreds of thousands of articles per day
Complex publication hierarchy
Established pipeline
Solution
Flexible query language allows OCR errors, punctuation, fuzzy matching, weighting
Supports features of previous engine
Scalable master-slave architecture
Accuracy improved in some cases from 95% rejected to 95% accepted
Hardware budget 15% of previous system
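To show why fuzzy matching helps with OCR errors, here is a small sketch using Python's standard `difflib`. It is not the query language described above: `fuzzy_contains` and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_contains(term, text, threshold=0.8):
    """Return True if any word in `text` is close enough to `term`.

    Similarity is difflib's ratio (0.0 to 1.0); the threshold is an
    assumed value that trades false positives against false negatives.
    """
    term = term.lower()
    for word in text.lower().split():
        # Drop surrounding punctuation so "Parliament." still matches
        word = word.strip(".,;:!?\"'()")
        if SequenceMatcher(None, term, word).ratio() >= threshold:
            return True
    return False
```

With this sketch a typical OCR confusion such as "rn" for "m" (for example "Parliarnent") still matches "parliament". Lowering the threshold catches more damaged words at the cost of more false positives for human checkers.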
58. www.flax.co.uk 58
Why open source?
Flexible, extendable
Powerful & scalable
Lower cost
Commercial support available as necessary
Freedom to innovate
66. www.flax.co.uk 66
Looking to the future
More and more content including social media
Multiple delivery platforms
Search-powered websites & applications
'No-SQL'
Cloud
Search no longer a bolt-on, but a platform for innovation
Open source no longer an outsider, but the obvious choice