Internet News
Retrieval
Marco Masetti (grubert)
masetti@linux.it
Overview
● Not too technical a talk
● (...wait while nerds move to next room...)
● I will present a complete solution =>
focus on architecture
● Two months from inception to
deployment
● A good example of using Perl in business
applications, drawing heavily on CPAN
Some background
● In 2006-2008 we developed a
distributed solution for tracking
advertisements, built completely in Perl.
● In June we had the idea of applying those results
to the online news delivery market...
“Some (two) months ago...”
Had a look at this: www.newsnow.co.uk
www.newsnow.co.uk
● 24/7 coverage of 33932+ sources in 20 languages from 141
countries
● TV news websites
● Online magazines and newswires
● Delivery options:
– Within minutes of publication, on a fully-branded secure Client
Portal
– Searchable 30-day archive and 'drill-down' facility (with Client
Portal)
● Search options
– Match articles only when given keywords occur within the same
sentence, clause, paragraph or article
– Reject articles that come from the wrong sources, are in the
wrong subject areas, or that specify irrelevant keywords or
phrases
– Match 1, 10 or 100s of keywords and phrases simultaneously
www.newsnow.co.uk
...They do a lot of things...
...We started 2 months ago so => much
less...
...Still...
...We do it better...
SoftNews
● Distributed acquisition
● Grabbing Phases (fetching, filtering,
comparing, transforming) strictly
decoupled
● Builds on very powerful CPAN
libraries
● A “Stitch&glue” delivery portal with
already enhanced features
SoftNews: main goals
Acquisition, Storing, Delivery
● “Topic” oriented acquisition
● Scalable
● High accuracy (negligible false positives)
● Fast text indexing of massive data
collection
● NLP/Text processing techniques (stemming,
positive/negative mentioning, ...)
● Pluggable, customizable services
● Tag search, text highlight
● “Visual aids” (tag clouds, trend graphs, ...)
SoftNews: domains tackled
● (EURO2008) European Soccer
Championship
● US Elections
SoftNews: main issues
● Many sites monitored at fixed intervals
– Polling time must be respected in time-critical
domains
– Should run with limited hardware/network
resources
● Large number of documents
– need fast indexing for retrieval
– provide the user with tools to conveniently
navigate the text collection
Workflow
Acquiring Storing Delivering
Acquisition
[Diagram: sites circular queue => fetcher => filter => comparer => transformer, with archives passed between stages and a config for each stage]
● Phases: Fetching, Filtering, Comparing, Transforming
● Modules: YAML, WWW::Mechanize, HTML::Parser, XML::Simple, Net::FTP, Archive::Zip, Log::Log4perl, KinoSearch, Cache::Cache, Digest::MD5, GD, Imager, SWF
● Archive: artifacts, metadata
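The "sites circular queue" that feeds the fetcher can be pictured as a ring of sites visited in turn. The sketch below is only an illustration in Python (SoftNews itself is written in Perl); the function name `polling_order` and the site names are hypothetical.

```python
from collections import deque


def polling_order(sites, rounds):
    """Return the order in which sites are visited over the given rounds."""
    ring = deque(sites)
    visited = []
    for _ in range(rounds * len(sites)):
        visited.append(ring[0])  # the site at the head is fetched next
        ring.rotate(-1)          # then it moves to the back of the ring
    return visited


# Hypothetical site names, two full rounds of the ring.
order = polling_order(["site-a", "site-b", "site-c"], rounds=2)
print(order)
```

A real fetcher would also wait for the polling interval between rounds; that timing is left out here.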
SoftNews: Grabbing
● Look for something (=> Processor)
● Reject rubbish (=> Filter handlers)
● Remember what it already has (=> Comparer)
Relies on the MediaCampaign internet grabber
architecture
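One way a Comparer can "remember what it already has" is to keep a digest of every page seen so far. This is a minimal Python sketch under that assumption, not the MediaCampaign implementation; the class name, the in-memory set, and the sample pages are hypothetical.

```python
import hashlib


class Comparer:
    """Remembers content digests so already-seen pages are skipped."""

    def __init__(self):
        # Hypothetical in-memory store; a real one would persist across runs.
        self._seen = set()

    def is_new(self, content: str) -> bool:
        """Return True the first time a given content is offered."""
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


comparer = Comparer()
pages = ["Poll: race tightens", "Poll: race tightens", "Final score 2-1"]
fresh = [page for page in pages if comparer.is_new(page)]
print(fresh)
```

Storing digests instead of full pages keeps the memory footprint small, at the cost of only detecting exact duplicates.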
Acquisition: time constraints
● Fetching process:
– Strict time constraints
– Network latency
● Comparing and filtering processes:
– Loose time constraints
– Lightweight
● Transforming process:
– Loose time constraints
– May be a heavyweight process
– Currently applied only to Flash animations
SoftNews: acq deployment opts
Go simple: one processing chain for each polling interval
● 10 mins: Fetcher => Queue => Comparer => Queue => Filter => Queue => Transformer
● 12 hours: Fetcher => Queue => Comparer => Queue => Filter => Queue => Transformer
● ...
(US Polls news)
~150 web sites – 1 month: > 300.000 ...... ~
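A processing chain whose stages are decoupled by queues can be sketched as below. This is a Python illustration only, assuming one worker thread per stage; `run_chain`, `stage`, and the stand-in stage functions are hypothetical names. It adds an input and an output queue around the three inter-stage queues of the slide.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox):
    """Run one stage: take items from inbox, pass non-None results on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        result = work(item)
        if result is not None:  # None means "drop this item"
            outbox.put(result)


def run_chain(items, steps):
    """Connect the steps with queues, run them, and collect the output."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=stage, args=(step, queues[i], queues[i + 1]))
        for i, step in enumerate(steps)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)
    for thread in threads:
        thread.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)


seen = set()


def fetch(url):
    return f"content of {url}"  # stand-in for a real HTTP fetch


def compare(doc):
    if doc in seen:
        return None  # already archived, drop it
    seen.add(doc)
    return doc


def keep_relevant(doc):
    return doc if "news" in doc else None  # stand-in for the real filter


def transform(doc):
    return doc.upper()  # stand-in for a heavyweight conversion


out = run_chain(
    ["site-a/news", "site-a/news", "site-b/ads"],
    [fetch, compare, keep_relevant, transform],
)
print(out)
```

Because each stage only talks to its queues, a slow transformer does not hold up a fetcher that must respect its polling interval.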
SoftNews: acq deployment opts
Massive acquisition: remote distributed fetchers/filters...
[Diagram: Fetcher, Comparer, Filter and Transformer nodes, plus further remote Fetcher ... Filter nodes]
Configuring and monitoring
acquisition...
● A control GUI is provided, controlling all activities.
● Modules: YAML, Prima
Filtering...
● Word-pattern based retrieval
– The more words provided, the more accurate the results will be
● The need for speed
– More pages processed with a faster search
● Fully configurable
– Deal with different topics and different web page layouts
● Exploits KinoSearch ranking features
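To make word-pattern filtering concrete, the sketch below scores a page by the fraction of topic words it contains and accepts it above a threshold. It is a deliberately simple Python stand-in, not KinoSearch ranking; the function names, the threshold, and the sample data are assumptions.

```python
import re


def relevance(text, topic_words):
    """Fraction of the topic words that occur in the text (0.0 to 1.0)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    wanted = {word.lower() for word in topic_words}
    if not wanted:
        return 0.0
    return len(wanted & tokens) / len(wanted)


def accept(text, topic_words, threshold=0.5):
    """Keep a page only if enough topic words are present."""
    return relevance(text, topic_words) >= threshold


# Hypothetical topic and pages.
topic = ["election", "poll", "senate", "candidate"]
page_a = "New poll shows the Senate candidate ahead before the election"
page_b = "Exit poll for the talent show final"

print(relevance(page_a, topic), accept(page_a, topic))
print(relevance(page_b, topic), accept(page_b, topic))
```

With a single word ("poll") both pages would match; the longer word list is what separates them, which is the point of the first bullet.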
KinoSearch
● What is KinoSearch
– Text search engine library
– A specialized and lightweight DBMS good for one thing: fast search, ranked by relevance
– Loose port of Apache Lucene
KinoSearch: features
● Can handle millions of documents
● Assigns each document a score, based on found keywords
● Advanced features
– Normalizer: case-insensitive search (Horses => horses)
– Tokenizer: split text into tokens (“shoots and leaves” => “shoots|leaves”)
– Stemmer: normalize word endings (horse, horses, horsing, horsed => hors)
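The normalizer, tokenizer and stemmer form a chain applied to text before indexing. The Python sketch below mimics that chain with a toy stopword list and a naive suffix stripper; it is not KinoSearch code, and a real stemmer uses far more rules than the `SUFFIXES` tuple assumed here.

```python
import re

STOPWORDS = {"and", "the", "a", "of"}     # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s", "e")  # naive rules, checked in order


def normalize(text):
    """Lowercase the text so that search is case-insensitive."""
    return text.lower()


def tokenize(text):
    """Split text into word tokens and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full chain: normalize, tokenize, then stem each token."""
    return [stem(token) for token in tokenize(normalize(text))]


print(normalize("Horses"))
print(tokenize("shoots and leaves"))
print([stem(word) for word in ["horse", "horses", "horsing", "horsed"]])
```

Queries go through the same chain as documents, so a search for "Horses" and a document containing "horsing" end up comparing the same stem.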
Storing...
[Diagram: archives => loader => KinoSearch DBMS, MySQL DBMS, KinoSearch DBMS]
● Modules: YAML, Archive::Zip, XML::Simple, KinoSearch, Lingua::EN::Keywords, Lingua::EN::Tagger, Log::Log4perl
Delivery...
● Built on top of Eadt, an MDE platform
● Took us 3 days from design to deployment...
● Let's have a look!
Conclusions
● CPAN is full of gems
● Perl proved to be the best solution for
spidering, text processing, indexing, ...
● Some (and sane) Perl hacking on holiday
may not be too bad...
Thank you !
