newsfeed.ijs.si


     Mitja Trampuš and Blaž Novak
       AI Lab, Jozef Stefan Institute



 Monitor.   Clean.          Expose.
Download.   Enrich.          Use.



 Monitor.   Clean.          Expose.
Download.   Enrich.          Use.
• Sources: RSS, Google News, private feeds
  – 150 000 feeds
  – 15 000 publishers


• Sources of sources:
  – Bootstrap from public listings
  – Parse news articles for <link> entries
• Quality management:
  – Punish technical errors
  – Adjustable crawl time


• Discovery delay for articles: 3 hours
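
The two quality-management ideas above (punish technical errors, adjust the crawl time) could fit together roughly as follows. This is a minimal sketch assuming a simple multiplicative back-off; the names FeedState and update_after_crawl, the interval bounds and the multipliers are illustrative assumptions, not taken from the newsfeed.ijs.si code.

```python
from dataclasses import dataclass

MIN_INTERVAL_S = 15 * 60        # assumed lower bound: 15 minutes
MAX_INTERVAL_S = 24 * 60 * 60   # assumed upper bound: 1 day


@dataclass
class FeedState:
    """Per-feed scheduling state (hypothetical structure)."""
    url: str
    interval_s: float = 60 * 60  # assumed starting interval: 1 hour
    error_count: int = 0


def update_after_crawl(feed, new_articles, failed):
    """Adjust the crawl interval of one feed after a crawl attempt."""
    if failed:
        # Punish technical errors (timeouts, HTTP 5xx, unparsable XML).
        feed.error_count += 1
        feed.interval_s *= 2
    elif new_articles > 0:
        # Productive feed: crawl it more often.
        feed.error_count = 0
        feed.interval_s /= 2
    else:
        # Nothing new since the last crawl: slow down gently.
        feed.error_count = 0
        feed.interval_s *= 1.5
    # Keep the interval inside the assumed bounds.
    feed.interval_s = max(MIN_INTERVAL_S, min(MAX_INTERVAL_S, feed.interval_s))
    return feed
```

With bounds like these, a feed that fails repeatedly drifts toward one crawl per day, while an active feed settles at the minimum interval.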



 Monitor.   Clean.          Expose.
Download.   Enrich.          Use.
• Methods in published papers work great
  – If evaluated on 10 sites

• Heuristic: Find the first block-level HTML
  element with lots of <p>aragraphs
  – failing that, a <td> or <div> with lots of text
  – avoid elements with lots of markup
  – site-independent
• Support for rNews/Schema.org
• Pitfalls
   – Pages with no content
   – Comments
   – Copyright notices
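
The cleaning heuristic above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' implementation: the thresholds MIN_PARAGRAPHS, MIN_TEXT_CHARS and MAX_MARKUP_RATIO, the set of container tags, and the function name extract_article are all hypothetical.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}
CONTAINERS = {"div", "td", "article", "section", "body"}
MIN_PARAGRAPHS = 3       # assumed meaning of "lots of paragraphs"
MIN_TEXT_CHARS = 200     # assumed meaning of "lots of text"
MAX_MARKUP_RATIO = 0.05  # assumed limit on tags per character of text


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []  # mixed list of str (text) and Node (child elements)

    def children(self):
        return [i for i in self.items if isinstance(i, Node)]

    def text(self):
        if self.tag in SKIP:
            return ""
        parts = [i if isinstance(i, str) else i.text() for i in self.items]
        return " ".join(" ".join(parts).split())

    def walk(self):
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children():
            yield from child.walk()


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.current.tag == "p":
            self.current = self.current.parent  # implicit </p>
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.items.append(data)


def markup_ratio(node):
    """Descendant tags per character of text; high for menus and link lists."""
    tags = sum(1 for _ in node.walk()) - 1
    return tags / max(len(node.text()), 1)


def extract_article(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(builder.root.walk())
    # Rule 1: first container in document order with many direct <p> children.
    for node in nodes:
        paragraphs = [c for c in node.children() if c.tag == "p" and c.text()]
        if node.tag in CONTAINERS and len(paragraphs) >= MIN_PARAGRAPHS:
            return "\n".join(p.text() for p in paragraphs)
    # Rule 2: failing that, the <td> or <div> with the most text and little markup.
    candidates = [n for n in nodes if n.tag in ("td", "div")
                  and len(n.text()) >= MIN_TEXT_CHARS
                  and markup_ratio(n) <= MAX_MARKUP_RATIO]
    if candidates:
        return max(candidates, key=lambda n: len(n.text())).text()
    return None  # treated as a content-less page
```

Returning None when neither rule fires is one way to flag content-less pages. Comments and copyright notices would still need extra rules that this sketch does not attempt.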


• Evaluation
   – 150 sites, one page per site
      • include content-less pages
   – 95% precision, 95% recall
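
The slides do not say how precision and recall were computed. One common way to score a text extractor, shown here purely as an assumption, is token overlap between the extracted text and a hand-made reference text for the same page. The function name precision_recall and the conventions for empty texts are illustrative.

```python
from collections import Counter


def precision_recall(extracted, reference):
    """Token-level precision and recall of extracted text against a reference."""
    ext = Counter(extracted.split())
    ref = Counter(reference.split())
    overlap = sum((ext & ref).values())  # multiset intersection
    # Convention (assumed): an empty side counts as a perfect score for that
    # measure, so a content-less page with empty output scores (1.0, 1.0).
    precision = overlap / sum(ext.values()) if ext else 1.0
    recall = overlap / sum(ref.values()) if ref else 1.0
    return precision, recall
```

Under this convention, extracting any text from a content-less page yields a precision of 0, which is why including such pages in the test set matters.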



 Monitor.   Clean.          Expose.
Download.   Enrich.          Use.
• Language detection:
  – 50 common languages: Chromium CLD
  – Long tail: Naive Bayes on character trigrams
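
For the long-tail languages, a Naive Bayes classifier over character trigrams can be sketched as below. This is a toy version assuming add-alpha smoothing and whitespace padding; the class name TrigramNaiveBayes and its methods are hypothetical, and a real model would be trained on far more text per language.

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    """Overlapping character trigrams of the lowercased, padded text."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # add-alpha smoothing (assumed)
        self.counts = defaultdict(Counter)  # language -> trigram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all trigrams seen in training

    def train(self, language, text):
        grams = trigrams(text)
        self.counts[language].update(grams)
        self.docs[language] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = trigrams(text)
        total_docs = sum(self.docs.values())
        best, best_score = None, -math.inf
        for language, counts in self.counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.docs[language] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best, best_score = language, score
        return best
```

Working in log space avoids numeric underflow on long articles, since the product of many small trigram probabilities would otherwise round to zero.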


• Language stats:
  – English 52%, German 7%, Spanish 7%,
    French 4%, Russian 3%, ...,
    Chinese 1%, Slovene 0.2%
  – 40 languages with >100 articles daily
  – 99% accuracy
• enrycher.ijs.si
  – DMOZ categorization
  – Named entity detection, resolution
  – (Sentiment)
  – (Deep parsing)
  – English, Slovene, more languages coming

• Geo-tagging
  – Publisher (WHOIS, public listings)
  – Content (named entities)



 Monitor.   Clean.          Expose.
Download.   Enrich.          Use.
• XML, gzip  filesystem cache
• HTTP service (polling)
• Command-line client
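
A consumer of the gzip-compressed XML files might stream them as sketched below. The element and attribute names (article, id, title, lang, body) are hypothetical placeholders; the real newsfeed.ijs.si schema is not shown in the slides and may differ.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_articles(path):
    """Yield one dict per <article> element in a gzip-compressed XML file.

    The schema used here is an assumption made for illustration only.
    """
    with gzip.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == "article":
                yield {
                    "id": elem.get("id"),
                    "title": elem.findtext("title", default=""),
                    "lang": elem.findtext("lang", default=""),
                    "body": elem.findtext("body", default=""),
                }
                elem.clear()  # drop the element's children to limit memory use
```

Streaming with iterparse keeps memory use modest even when one file holds many articles, which suits a client that polls the service and processes each new file as it arrives.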



• Live demo, API:
  http://newsfeed.ijs.si/
• Data volume: 100 000 articles/day
  Peak throughput: 10 articles/second
• One machine for semantic processing
  One machine for everything else
• Processing: Python, (Java, C++)
  Infrastructure: PostgreSQL, zeromq
  – Downloaders communicate through the DB
  – Processing strictly sequential, service-oriented
     • Each service: In case of errors, pass through
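
The "strictly sequential, pass through on errors" design above can be sketched as a plain loop over services. This is an in-process illustration only: the function name run_pipeline and the dict-based article are assumptions, and the real system connects separate services rather than Python callables.

```python
import logging

log = logging.getLogger("pipeline")


def run_pipeline(article, services):
    """Apply each (name, function) service in order.

    If a service raises, the article is passed on unchanged to the next one.
    """
    for name, service in services:
        try:
            # Work on a copy so a failing service cannot leave partial edits.
            article = service(dict(article))
        except Exception:
            log.exception("service %s failed; passing article through", name)
    return article
```

The effect is that a failure in one enrichment step costs only that step's annotations: the article still reaches the output, just with fewer fields.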
• Feed quality management

• Increase the number of sources
  – Non-western in particular


• Compute news clusters
mitja.trampus@ijs.si
 blaz.novak@ijs.si

Internals Of An Aggregated Web News Feed


Editor's Notes

  1. This talk is about a particular newsfeed aggregator. It is interesting for the data mining community because it provides data. Born out of data necessity at the dept.
  2. It all starts with the sources.
  3. It’s not too hard to get MANY feeds. But are they good? After discovery, things happen fast.
  4. We’re not the first to clean HTML; there are even dedicated challenges out there.
  5. Here are some things to watch out for. Pages with no content: you HAVE to detect them unless the RSS sources are carefully curated by hand.
  6. We’ve got the text, now we grab the green marker and tag the text with additional information.
  7. Deep parsing – problematic in real time (up to 10 docs/s)
  8. One machine = nasty details, e.g. overloading DNS servers.