2. Overview
● Not an overly technical talk
● (...wait while nerds move to next room...)
● I will present a complete solution =>
focus on architecture
● Two months from inception to
deployment
● A good example of using Perl in business
applications, drawing heavily on CPAN
3. Some background
● From 2006 to 2008 we developed a
distributed solution for tracking
advertisements, built entirely in Perl.
● In June we had the idea of applying those
results to the online news delivery market...
5. www.newsnow.co.uk
● 24/7 coverage of 33932+ sources in 20 languages from 141
countries
● TV news websites
● Online magazines and newswires
● Delivery options:
– Within minutes of publication, on a fully-branded secure Client
Portal
– Searchable 30-day archive and 'drill-down' facility (with Client
Portal)
● Search options
– Match articles only when given keywords occur within the same
sentence, clause, paragraph or article
– Reject articles that come from the wrong sources, are in the
wrong subject areas, or that specify irrelevant keywords or
phrases
– Match 1, 10 or 100s of keywords and phrases simultaneously
6. www.newsnow.co.uk
...They do a lot of things...
...We started 2 months ago so => much
less...
...Still...
...We do it better...
7. SoftNews
● Distributed acquisition
● Grabbing Phases (fetching, filtering,
comparing, transforming) strictly
decoupled
● Builds on very powerful CPAN
libraries
● A “stitch & glue” delivery portal that
already offers enhanced features
8. SoftNews: main goals
Acquisition, Storing, Delivery
● “Topic” oriented acquisition
● Scalable
● High accuracy (negligible false positives)
● Fast text indexing of massive data
collection
● NLP/text processing techniques (stemming,
positive/negative mentions, ...)
● Pluggable, customizable services
● Tag search, text highlight
● “Visual aids” (tag clouds, trend graphs, ...)
10. SoftNews: main issues
● Many sites monitored at fixed intervals
– Polling time must be respected in time-critical
domains
– Should run with limited hardware/network
resources
● Large number of documents
– need fast indexing for retrieval
– provide the user with tools to conveniently
navigate the text collection
13. SoftNews: Grabbing
● Look for something (=> Processor)
● Reject rubbish (=> Filter handlers)
● Remember what it already has (=> Comparer)
Relies on the MediaCampaign internet grabber
architecture
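As an illustration of the Comparer step ("remember what it already has"), here is a minimal sketch written in Python for brevity rather than in the Perl the project itself uses. It assumes, purely for the example, that "already has" means an identical page body once whitespace is collapsed. The class name `Comparer` comes from the slide, but the methods `fingerprint` and `is_new` are hypothetical and not the MediaCampaign API.

```python
import hashlib


class Comparer:
    """Remembers fingerprints of documents that were already grabbed."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(text):
        # Collapse whitespace so trivial layout changes do not look like new content.
        normalized = " ".join(text.split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def is_new(self, text):
        """Return True the first time a document is seen, False afterwards."""
        digest = self.fingerprint(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


comparer = Comparer()
first = comparer.is_new("Breaking  news:\nPerl is alive")
second = comparer.is_new("Breaking news: Perl is alive")   # same text, different spacing
third = comparer.is_new("Something else entirely")
```

A real deployment would presumably persist the fingerprints between polling runs; the in-memory set only keeps the sketch short.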
14. Acquisition: time constraints
Fetching process:
Strict time constraints
Network latency
Comparing and filtering processes
Loose time constraints
Lightweight
Transforming process:
Loose time constraints
May be a heavyweight process
Currently applied only for Flash animations
15. SoftNews: acquisition deployment options
Go simple: One processing chain for each polling
interval
[Diagram: one chain per polling interval (10 mins, 12 hours, ...). Each chain runs Fetcher, Comparer, Filter, Transform, with a Queue between consecutive stages.]
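The "one processing chain per polling interval" layout can be sketched as a set of stages connected by queues. This is a Python illustration only (the project itself is in Perl). The function names `stage` and `run_chain`, the `STOP` sentinel, and the toy `fetch`/`compare`/`keep_topic`/`transform` callables are all hypothetical stand-ins, and the "web" is a dictionary so the example runs offline.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox):
    """Run one stage: read items from inbox, apply work, forward non-None results."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            return
        result = work(item)
        if result is not None:  # None means "drop this item"
            outbox.put(result)


def run_chain(items, stages):
    """Connect the stages with queues, feed the items in, and collect the output."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)


# Toy stand-ins for the four stages (assumptions, not the real handlers).
FAKE_WEB = {
    "a": "election poll results",
    "b": "cooking recipes",
    "c": "election poll results",  # duplicate of page "a"
}
seen = set()


def fetch(url):
    return FAKE_WEB[url]


def compare(text):
    if text in seen:
        return None
    seen.add(text)
    return text


def keep_topic(text):
    return text if "election" in text else None


def transform(text):
    return text.title()


results = run_chain(["a", "b", "c"], [fetch, compare, keep_topic, transform])
```

Because each stage only talks to its neighbours through a queue, a slow Transform does not hold up a Fetcher that has strict polling deadlines. Running one such chain per polling interval keeps the configuration simple.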
(US Polls news)
~150 web sites – 1 month: > 300.000 ...... ~
18. Filtering...
Word-pattern based retrieval
The more words provided, the more accurate the results will be
The need for speed
More pages processed with a faster search
Fully configurable
Deal with different topics and different web page layouts
Exploits KinoSearch ranking features
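To see why more keywords tend to give more accurate filtering, consider this deliberately simplified scoring sketch in Python. It is not KinoSearch's ranking; it simply counts which fraction of the topic keywords appears in a page. The function names `keyword_score` and `accept`, the threshold, and the sample texts are assumptions made for the example.

```python
import re


def keyword_score(text, keywords):
    """Fraction of the topic keywords that occur in the text (case-insensitive)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = sum(1 for kw in keywords if kw.lower() in words)
    return hits / len(keywords)


def accept(text, keywords, threshold=0.5):
    """Keep a page only if enough of the topic keywords are present."""
    return keyword_score(text, keywords) >= threshold


on_topic = "The latest poll shows the Senate candidate ahead before the election."
off_topic = "Our reader poll: which pasta recipe is best?"

narrow_topic = ["poll"]
rich_topic = ["election", "poll", "senate", "candidate"]

# A single keyword cannot tell the two pages apart...
narrow = (accept(on_topic, narrow_topic), accept(off_topic, narrow_topic))
# ...while a richer word pattern rejects the cooking page.
rich = (accept(on_topic, rich_topic), accept(off_topic, rich_topic))
```

With only `"poll"` both pages pass. With four keywords the off-topic page matches one out of four and falls below the threshold.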
19. KinoSearch
What is KinoSearch
Text search engine library
A specialized and lightweight DBMS good for one thing:
fast search, ranked by relevance
Loose port of Apache Lucene
20. KinoSearch: features
Can handle millions of documents
Assigns each document a score, based on found
keywords
Advanced features:
Normalizer
Case-insensitive search
Horses => horses
Tokenizer
Split text into tokens
“shoots and leaves” => “shoots|leaves”
Stemmer
Normalize word endings
horse, horses, horsing, horsed => hors
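The three analysis steps can be mimicked with a few lines of Python. This is a toy sketch, not KinoSearch's analyzers and not the Porter stemmer: the suffix list is hand-picked so that the slide's "horse" example works, and the stop-word list is an assumption (the slide's "shoots|leaves" result suggests that "and" is dropped somewhere in the chain). All function names are hypothetical.

```python
import re

STOP_WORDS = {"and", "the", "a", "of"}       # assumed stop list
SUFFIXES = ("ing", "ed", "es", "s", "e")     # toy suffix list, longest first


def normalize(text):
    """Normalizer: lowercase so that search is case-insensitive."""
    return text.lower()


def tokenize(text):
    """Tokenizer: split text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text)


def remove_stop_words(tokens):
    """Drop very common words that carry little meaning for search."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Stemmer: strip one known suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Full chain: normalize, tokenize, drop stop words, stem."""
    tokens = remove_stop_words(tokenize(normalize(text)))
    return [stem(t) for t in tokens]


lowered = normalize("Horses")
tokens = remove_stop_words(tokenize(normalize("shoots and leaves")))
stems = [stem(w) for w in ("horse", "horses", "horsing", "horsed")]
```

Applying the same chain to both the indexed documents and the query is what lets a search for "horses" match a page that only mentions "horse".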
22. Delivery...
● Builds on Eadt, an MDE platform
● Took us 3 days from design to deployment...
● Let's have a look!
23. Conclusions
● CPAN is full of gems
● Perl proved to be the best solution for
spidering, text processing, indexing,...
● Some (sane) Perl hacking on holiday
may not be too bad...
Thank you!