IPTC EXTRA and EXTRA+ November 2017


EXTRA is an open source, rules-based classification engine developed by the IPTC and supported by a grant from the Google Digital News Initiative (DNI). Why are rules better than machine learning for breaking news? How can automation better support the manual crafting of news rules?

Published in: Technology

  1. “Extra” by Jeremy Brooks
  2. EXTRA and EXTRA+
     Stuart Myles * Associated Press * 6th November 2017
     © 2017 IPTC ( All rights reserved
  3. Rules-Based Classification
     • Rules better for breaking news than statistical methods
       – You don’t need 50 examples before you can start tagging
       – A rule for a new topic doesn’t require other rules to change
     • More consistent and scalable than hand tagging
     • Easier to explain why rules classify content
       – Machine learning methods are still “black boxes”
       – Easier to precisely explain - and correct - mistakes
     • You can use your own taxonomy, rules and formats
       – Example rules help us drive development of the EXTRA system
       – You can use the example rules to see how to develop your own
       – Rules could apply IPTC Media Topics or any other taxonomy
  4. EXTRA
     EXTraction Rules Apparatus
     Rules-based classification of text
     Open source software
     EXTRA was being developed by the IPTC
     €50,000 Grant from the Digital News Initiative
  5. Development Process
     The EXTRA software is being developed by Infalia
       - All software is open source
     Two linguists creating rules in English and German
       - Sample rules to apply IPTC Media Topics
     Example news corpora licensed for EXTRA
       - English from Thomson Reuters
       - German from APA
  6. EXTRA Components
     Elasticsearch Percolator + Custom Code
     Classification
     Rule authoring
     Corpus Testing
     Schema Management
  7. Classification using Percolator
     • Elasticsearch
       – A sophisticated, open source full-text search engine
       – Lets you query documents stored in an index
     • Elasticsearch Percolator
       – Store queries in an index and match documents to queries
       – Classification uses the percolator to match documents to rules
     • EXTRA Rule Language
       – Rule-writer-friendly language (easier than ES DSL)
       – Access to all ES features, plus custom operators
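
To make the percolator idea on slide 7 concrete, here is a minimal sketch of the JSON bodies involved, built as plain Python dictionaries. It is not EXTRA's code: the index name, the `topic` field and the helper functions are hypothetical, and the mapping uses the typeless layout of recent Elasticsearch versions, which may differ from the version EXTRA was built on. Only the `percolator` field type and the `percolate` query come from Elasticsearch itself.

```python
import json

# Hypothetical index name, chosen only for illustration.
INDEX = "extra-rules"


def percolator_mapping():
    """Mapping for an index that stores rules as percolator queries."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},  # the stored rule
                "body": {"type": "text"},         # field the rules search
            }
        }
    }


def rule_document(topic, phrase):
    """A stored rule: apply `topic` when `phrase` occurs in the body."""
    return {"topic": topic, "query": {"match_phrase": {"body": phrase}}}


def percolate_request(document):
    """Search body asking which stored rules match `document`."""
    return {"query": {"percolate": {"field": "query", "document": document}}}


if __name__ == "__main__":
    story = {"body": "Angela Merkel spoke today."}
    print(json.dumps(percolate_request(story), indent=2))
```

The usual search direction is reversed here: the rules are the indexed items and the incoming story is the query input, so each hit returned for the request is a rule (and its topic) that applies to the story.
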
  8. Schema and Rules
     • EXTRA Schema
       – Documents must be in (or converted to) a JSON format
       – But it can be any JSON format you choose
       – Allows validating that your rules reference valid fields
     • Granular, field-by-field control of analyzers
       – Such as whether and how to stem, e.g. by language
       – Different ways to tokenize fields, e.g. for slug
       – Allow a field to be queried as a whole or tokenized by sentence or paragraph
       – Allows validating that operators are valid by field type
         • E.g. to flag that your rule references paragraphs in a field that has none
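
The validation described on slide 8 can be sketched in a few lines. This is an illustration under stated assumptions, not the EXTRA schema format: the `SCHEMA` dictionary, the unit names and `validate_reference` are all hypothetical.

```python
# Hypothetical schema: field name -> units a rule may query that field by.
SCHEMA = {
    "headline": {"whole"},
    "body": {"whole", "sentence", "paragraph"},
}


def validate_reference(schema, field, unit="whole"):
    """Return a list of problems with one (field, unit) reference in a rule."""
    if field not in schema:
        return [f"unknown field: {field}"]
    if unit not in schema[field]:
        return [f"field {field} cannot be queried by {unit}"]
    return []
```

With this schema, a rule that references paragraphs in `body` passes, while the same reference against `headline` is flagged, which is the kind of mistake the slide says the schema should catch before a rule is used.
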
  9. Schema and Rules Example
     • Two fields, headline and body, with body allowed to be queried by paragraph
       headline body body_paragraph
     • A rule to require that “angela merkel” and “us elections” appear in the same paragraph
       (prox/unit=paragraph/distance=1 (body adj "angela merkel") (body adj "us elections") )
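
The intent of the rule on slide 9, both phrases in the same paragraph, can be approximated in plain Python. This is only a sketch of that stated intent, not of the `prox` operator's exact semantics or of how EXTRA evaluates it: the function name is hypothetical, paragraphs are assumed to be separated by blank lines, and matching is a simple case-insensitive substring test.

```python
import re


def same_paragraph(body, phrase_a, phrase_b):
    """True if both phrases occur in at least one paragraph of `body`.

    Paragraphs are assumed to be separated by blank lines. Matching is a
    case-insensitive substring test, with no stemming or tokenizing.
    """
    a, b = phrase_a.lower(), phrase_b.lower()
    for paragraph in re.split(r"\n\s*\n", body.lower()):
        if a in paragraph and b in paragraph:
            return True
    return False
```

Because this uses substring matching, "us elections" would also match inside "previous elections". A search engine works on analyzed tokens and avoids that kind of false match, which is one reason the schema's per-field analyzer control matters.
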
  10. EXTRA Source Code
      • The core classification engine
        – cql parsers, cql to es mapper, rule schema dict classes, dao classes, etc
      • EXTRA “extra” code
        – API, UI, docker files for deployment
      • Open source
        – MIT license for EXTRA-specific code
        – Apache license for Elasticsearch
  11. EXTRA Timetable
      • EXTRA was completed in Summer 2017
      • You can access the source code now
        – Feedback welcome
      • We have applied for a second round of funding: EXTRA+
      • Join the (low frequency) email list to stay up-to-date
  12. EXTRA+
      Enriching Rule-based Classification of News with Powerful Semantics
      • “aboutness” evaluation
        – Given that a story is about a topic, how much is it about it?
      • Rule suggestion
        – Suggest rules based on a pre-tagged corpus
      • Enriched rule operators
        – For example, nested “count” operators
  13. Date and Place of Next Meeting
      Athens, 23rd – 25th April 2018
      Thank you and goodbye!!