Internals Of An Aggregated Web News Feed

  1. newsfeed.ijs.si
     Mitja Trampuš and Blaž Novak
     AI Lab, Jozef Stefan Institute
  2. txt Monitor. Clean. Expose. Download. Enrich. Use.
  3. txt Monitor. Clean. Expose. Enrich. Use. Download.
  4. • Sources: RSS, Google News, private feeds
       – 150 000 feeds
       – 15 000 publishers
     • Sources of sources:
       – Bootstrap from public listings
       – Parse news articles for <link> entries
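
The "parse news articles for <link> entries" step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code; the names `FeedLinkParser` and `discover_feeds` and the set of accepted MIME types are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as feeds here (an assumption for this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {name: (value or "") for name, value in attrs}
        rel = a.get("rel", "").lower().split()
        is_feed = a.get("type", "").lower() in FEED_TYPES
        if "alternate" in rel and is_feed and a.get("href"):
            # Resolve relative hrefs against the URL of the article page.
            self.feeds.append(urljoin(self.base_url, a["href"]))


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised in the page's <link> tags."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Each downloaded article page could be passed through `discover_feeds`, and any URL not yet in the feed list added as a new candidate source.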
  5. • Quality management:
       – Punish technical errors
       – Adjustable crawl time
     • Discovery delay for articles: 3 hours
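
The slide does not say how errors are punished or how crawl time is adjusted. One plausible reading is a per-feed polling interval that backs off on failures and tightens when a feed yields new articles. The sketch below illustrates that idea only; `FeedState`, `update`, the interval bounds, and the multipliers are all hypothetical.

```python
from dataclasses import dataclass

# Bounds on the polling interval, in seconds (assumed values).
MIN_INTERVAL = 5 * 60
MAX_INTERVAL = 24 * 3600


@dataclass
class FeedState:
    interval: float = 3600.0  # seconds until the next crawl of this feed
    penalty: int = 0          # consecutive technical errors seen so far


def update(state, ok, new_articles):
    """Adjust a feed's crawl interval after one crawl attempt."""
    if not ok:
        # Technical error (timeout, HTTP 5xx, unparsable XML): back off.
        state.penalty += 1
        state.interval = min(MAX_INTERVAL, state.interval * 2)
    elif new_articles > 0:
        # Productive feed: poll more often to reduce discovery delay.
        state.penalty = 0
        state.interval = max(MIN_INTERVAL, state.interval / 2)
    else:
        # Healthy but quiet feed: relax slowly.
        state.penalty = 0
        state.interval = min(MAX_INTERVAL, state.interval * 1.5)
    return state
```

The `penalty` counter could also be used to drop a feed entirely after enough consecutive failures; that policy is left out here.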
  6. txt Monitor. Clean. Expose. Download. Enrich. Use.
  7. • Methods in published papers work great
       – If evaluated on 10 sites
     • Heuristic: Find the first block-level HTML element with lots of <p>aragraphs
       – failing that, a <td> or <div> with lots of text
       – avoid elements with lots of markup
       – site-independent
     • Support for rNews/Schema.org
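
The first part of the heuristic (first block-level element with many paragraphs) can be sketched roughly as below. This is an illustration with the standard-library HTML parser, not the authors' implementation. The set of block tags, the threshold `min_paragraphs`, and all names are assumptions, and the fallback rules (text-heavy <td>/<div>, avoiding markup-heavy elements) are not covered.

```python
from html.parser import HTMLParser

# Elements considered as candidate content containers (an assumption).
BLOCK_TAGS = {"article", "section", "main", "div", "td", "body"}


class _Block:
    def __init__(self, tag):
        self.tag = tag
        self.paragraphs = []  # text of <p> elements directly inside this block


class ParagraphBlockFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # currently open block elements
        self.blocks = []    # every block element, in document order
        self.p_text = None  # text chunks while inside a <p>, else None

    def _close_paragraph(self):
        if self.p_text is not None and self.stack:
            text = " ".join("".join(self.p_text).split())
            if text:
                self.stack[-1].paragraphs.append(text)
        self.p_text = None

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._close_paragraph()
            block = _Block(tag)
            self.blocks.append(block)
            self.stack.append(block)
        elif tag == "p":
            self._close_paragraph()  # tolerate a missing </p>
            self.p_text = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()
        elif tag in BLOCK_TAGS:
            self._close_paragraph()
            # Pop back to the matching open tag; tolerate sloppy nesting.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i].tag == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        if self.p_text is not None:
            self.p_text.append(data)


def extract_main_text(html, min_paragraphs=3):
    """Text of the first block with at least min_paragraphs <p> children."""
    finder = ParagraphBlockFinder()
    finder.feed(html)
    finder.close()
    for block in finder.blocks:
        if len(block.paragraphs) >= min_paragraphs:
            return "\n".join(block.paragraphs)
    return None  # treated here as a content-less page
```

Because the rule looks only at generic structure and not at site-specific class names or ids, the same code applies to any publisher, which matches the "site-independent" point on the slide.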
  8. • Pitfalls
       – Pages with no content
       – Comments
       – Copyright notices
     • Evaluation
       – 150 sites, one page per site
         • include content-less pages
       – 95% precision, 95% recall
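
The slide does not state how precision and recall were computed. One common choice for cleaning tasks is token-level overlap between the extracted text and a hand-labelled gold text, sketched below. The function name and the convention for empty (content-less) pages are assumptions for illustration.

```python
from collections import Counter


def token_precision_recall(extracted, gold):
    """Token-level precision and recall of extracted text against gold text."""
    ext = Counter(extracted.split())
    ref = Counter(gold.split())
    # Multiset intersection: tokens present in both, counted with multiplicity.
    overlap = sum((ext & ref).values())
    n_ext = sum(ext.values())
    n_ref = sum(ref.values())
    # Convention assumed here: an empty side scores 1.0, so a content-less
    # page that yields no extracted text counts as fully correct.
    precision = overlap / n_ext if n_ext else 1.0
    recall = overlap / n_ref if n_ref else 1.0
    return precision, recall
```

Under this convention, text extracted from a content-less page gets precision 0, so comments or copyright notices mistaken for content are penalized.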
  9. txt Clean. Monitor. Expose. Download. Enrich. Use.
  10. • Language detection:
        – 50 common languages: Chromium CLD
        – Long tail: Naive Bayes on character trigrams
      • Language stats:
        – English 52%, German 7%, Spanish 7%, French 4%, Russian 3%, ..., Chinese 1%, Slovene 0.2%
        – 40 languages with >100 articles daily
        – 99% accuracy
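
The long-tail detector is described only as "Naive Bayes on character trigrams". A minimal sketch of that technique follows, with add-alpha smoothing. The class name, the padding scheme, and the smoothing constant are assumptions, and the Chromium CLD path is not shown.

```python
import math
from collections import Counter


def trigrams(text):
    """Character trigrams of lowercased text, padded to mark the boundaries."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramNaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha     # add-alpha smoothing constant
        self.counts = {}       # language -> Counter of trigram counts
        self.totals = {}       # language -> total number of trigrams seen
        self.docs = Counter()  # language -> number of training texts
        self.vocab = set()     # all trigrams seen in any language

    def train(self, language, text):
        grams = trigrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.docs[language] += 1
        self.vocab.update(grams)

    def log_score(self, language, text):
        counts = self.counts[language]
        # One extra vocabulary slot stands in for unseen trigrams.
        denom = self.totals[language] + self.alpha * (len(self.vocab) + 1)
        score = math.log(self.docs[language] / sum(self.docs.values()))
        for gram in trigrams(text):
            score += math.log((counts[gram] + self.alpha) / denom)
        return score

    def predict(self, text):
        """Language with the highest posterior log-probability."""
        return max(self.counts, key=lambda lang: self.log_score(lang, text))
```

Character trigrams need no tokenizer or dictionary, which is convenient for long-tail languages where only a small amount of training text is available.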
  11. • enrycher.ijs.si
        – DMOZ categorization
        – Named entity detection, resolution
        – (Sentiment)
        – (Deep parsing)
        – English, Slovene, more languages coming
      • Geo-tagging
        – Publisher (WHOIS, public listings)
        – Content (named entities)
  12. txt Monitor. Clean. Expose. Download. Enrich. Use.
  13. • XML, gzip filesystem cache
      • HTTP service (polling)
      • Command-line client
      • Live demo, API: http://newsfeed.ijs.si/
  14. • Data volume: 100 000 articles/day
        Peak throughput: 10 articles/second
      • One machine for semantic processing
        One machine for everything else
      • Processing: Python, (Java, C++)
        Infrastructure: PostgreSQL, zeromq
        – Downloaders communicate through the DB
        – Processing strictly sequential, service-oriented
          • Each service: In case of errors, pass through
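
The "strictly sequential, pass through on errors" design can be illustrated in-process as below. In the real system the services are separate components connected via zeromq; here they are modelled as plain Python callables, and `run_pipeline` and the dict-based article representation are assumptions for the sketch.

```python
import logging

logger = logging.getLogger("pipeline")


def run_pipeline(article, services):
    """Run services in order; a failing service leaves the article unchanged."""
    for service in services:
        try:
            # Give each service a copy so a crash cannot leave the
            # article half-modified.
            result = service(dict(article))
        except Exception:
            logger.exception("service %r failed, passing article through",
                             getattr(service, "__name__", service))
            continue
        article = result
    return article
```

With this policy an article still reaches the output when, for example, named entity detection fails; it simply lacks that enrichment, so one faulty service does not stall the feed.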
  15. • Feed quality management
      • Increase the number of sources
        – Non-western in particular
      • Compute news clusters
  16. mitja.trampus@ijs.si
      blaz.novak@ijs.si
