IPTC EXTRA Spring 2018

An update on the EXTRA project, an open source, rules-based classifier for news content, including the application for additional funding from Google DNI for FRANCIS.

Published in: Technology

  1. “Extra” by Jeremy Brooks https://flic.kr/p/4aKH3c
  2. EXTRA and FRANCIS
     Stuart Myles, Associated Press, 24th April 2018
     © 2018 IPTC (www.iptc.org) All rights reserved
     https://flic.kr/p/fBshW3 https://flic.kr/p/atFSAr
  3. Rules-Based Classification
     • Rules are better for breaking news than statistical methods
       – You don’t need 50 examples before you can start tagging
       – A rule for a new topic doesn’t require other rules to change
     • More consistent and scalable than hand tagging
     • Easier to explain why rules classify content
       – Machine learning methods can be “black boxes”
       – Easier to precisely explain, and correct, mistakes
  4. EXTRA: EXTraction Rules Apparatus
     • Rules-based classification of text
     • Open source software: https://iptc.github.io/extra/
     • EXTRA was developed by the IPTC
     • €50,000 grant from the Digital News Initiative: https://www.digitalnewsinitiative.com/fund/
     • You can use your own taxonomy, rules and formats
       – Example rules help us drive development of the EXTRA system
       – You can use the example rules to see how to develop your own
       – Rules could apply IPTC Media Topics or any other taxonomy
  5. Development Process
     • The EXTRA software was developed by Infalia
       – All software is open source
     • Two linguists are creating rules in English and German
       – Sample rules to apply IPTC Media Topics
     • Example news corpora licensed for EXTRA
       – English from Thomson Reuters
       – German from APA
  6. EXTRA Components
     • Elasticsearch Percolator + Custom Code
     • Classification
     • Rule authoring
     • Corpus Testing
     • Schema Management
  7. Classification using Percolator
     • Elasticsearch
       – A sophisticated, open source full-text search engine
       – Lets you query documents stored in an index
     • Elasticsearch Percolator
       – Store queries in an index and match documents to queries
       – Classification uses the percolator to match documents to rules
     • EXTRA Rule Language
       – Rule-writer-friendly language (easier than ES DSL)
       – Access to all ES features, plus custom operators
  8. Schema and Rules Example
     • Two fields, headline and body, with body also queryable by paragraph
       – headline
       – body
       – body_paragraph
     • A rule requiring that “angela merkel” and “us elections” appear in the same paragraph:
         (prox/unit=paragraph/distance=1
           (body adj "angela merkel")
           (body adj "us elections")
         )
  9. FRANCIS*
     Using machine learning to empower rule-based classification of news with semantics.
     • “Aboutness” evaluation
       – Given that a story is about a topic, how much is it about it?
     • Rule suggestion
       – Suggest rules based on a pre-tagged corpus
     • Enriched rule operators
       – For example, nested “count” operators
       – Using EXTRA as the foundation
     * St Francis de Sales is the patron saint of writers and journalists
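
The percolator idea from slide 7 (store the queries, then match each incoming document against them) can be illustrated with a small in-memory sketch. This is a toy written for illustration only: it does not use Elasticsearch or EXTRA, and the names `ToyPercolator`, `Rule`, `contains` and the topic labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Document = Dict[str, str]


@dataclass
class Rule:
    """A stored rule: a topic label plus a predicate over a document (hypothetical)."""
    topic: str
    matches: Callable[[Document], bool]


class ToyPercolator:
    """Toy stand-in for a percolator: rules are stored, documents are matched against them."""

    def __init__(self) -> None:
        self._rules: List[Rule] = []

    def register(self, rule: Rule) -> None:
        # Adding a rule never touches the rules that are already stored.
        self._rules.append(rule)

    def classify(self, doc: Document) -> List[str]:
        # Return the topic of every stored rule that matches this document.
        return [rule.topic for rule in self._rules if rule.matches(doc)]


def contains(field: str, phrase: str) -> Callable[[Document], bool]:
    """Build a case-insensitive 'field contains phrase' predicate."""
    needle = phrase.lower()
    return lambda doc: needle in doc.get(field, "").lower()


percolator = ToyPercolator()
# Topic labels below are made up for the example, not IPTC Media Topics codes.
percolator.register(Rule("politics/election", contains("body", "us elections")))
percolator.register(Rule("sport/soccer", contains("headline", "world cup")))

doc = {
    "headline": "Merkel comments on vote",
    "body": "Angela Merkel spoke about the US elections.",
}
topics = percolator.classify(doc)
```

Registering a rule for a new topic only appends to the stored rules, which mirrors the point on slide 3 that a new rule does not require other rules to change.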

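The intent of the slide 8 rule (both phrases must occur in the same paragraph of the body) can be sketched in plain Python. This is not how EXTRA or Elasticsearch evaluates the `prox` operator; it assumes paragraphs are separated by blank lines and only covers the "same paragraph" case described on the slide. The function names are hypothetical.

```python
import re
from typing import List


def split_paragraphs(body: str) -> List[str]:
    """Split a body into paragraphs, assuming blank lines separate them."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def phrases_in_same_paragraph(body: str, phrases: List[str]) -> bool:
    """Return True if at least one paragraph contains every phrase.

    Matching is case-insensitive and respects word boundaries, so
    "us elections" does not match inside "previous elections".
    """
    patterns = [
        re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in phrases
    ]
    return any(
        all(pattern.search(paragraph) for pattern in patterns)
        for paragraph in split_paragraphs(body)
    )


rule_phrases = ["angela merkel", "us elections"]

same_paragraph = (
    "Angela Merkel commented on the US elections today.\n\n"
    "Markets were calm."
)
different_paragraphs = (
    "Angela Merkel visited Paris.\n\n"
    "The US elections dominated the news."
)

hit = phrases_in_same_paragraph(same_paragraph, rule_phrases)
miss = phrases_in_same_paragraph(different_paragraphs, rule_phrases)
```

Here `hit` is true because both phrases share the first paragraph, while `miss` is false because they fall in separate paragraphs.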