“Extra” by Jeremy Brooks https://flic.kr/p/4aKH3c
EXTRA and FRANCIS
Stuart Myles * Associated Press * 24th April 2018
© 2018 IPTC (www.iptc.org) All rights reserved
https://flic.kr/p/fBshW3
https://flic.kr/p/atFSAr
Rules-Based Classification
• Rules better for breaking news than statistical methods
– You don’t need 50 examples before you can start tagging
– A rule for a new topic doesn’t require other rules to change
• More consistent and scalable than hand tagging
• Easier to explain why rules classify content
– Machine learning methods can be “black boxes”
– Easier to precisely explain - and correct - mistakes
EXTRA
EXTraction Rules Apparatus
Rules-based classification of text
Open source software https://iptc.github.io/extra/
EXTRA was developed by the IPTC
€50,000 Grant from the Digital News Initiative
https://www.digitalnewsinitiative.com/fund/
You can use your own taxonomy, rules and formats
- Example rules help us drive development of the EXTRA system
- You can use the example rules to see how to develop your own
- Rules could apply IPTC Media Topics or any other taxonomy
Development Process
The EXTRA software was developed by Infalia
- All software is open source
Two linguists creating rules in English and German
- Sample rules to apply IPTC Media Topics
Example news corpora licensed for EXTRA
- English from Thomson Reuters
- German from APA
EXTRA Components
• Elasticsearch Percolator + Custom Code
• Classification
• Rule authoring
• Corpus Testing
• Schema Management
Classification using Percolator
• Elasticsearch
– A sophisticated, open source full-text search engine
– Lets you query documents stored in an index
• Elasticsearch Percolator
– Store queries in an index and match documents to queries
– Classification uses the percolator to match documents to rules
• EXTRA Rule Language
– Rule-writer-friendly language (easier than ES DSL)
– Access to all ES features, plus custom operators
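
As a rough illustration of the percolator idea (stored queries, with incoming documents matched against them), the sketch below builds the three JSON bodies involved as plain Python dictionaries. It is not taken from the EXTRA code base. The `percolator` field type and the `percolate` query are standard Elasticsearch features, but the field names `query` and `topic`, the helper function names, and the use of a simple `match_phrase` query as a stand-in for a compiled rule are assumptions made for this example. The mapping uses the typeless form accepted by Elasticsearch 7 and later; earlier versions also required a mapping type name.

```python
import json


def percolator_mapping():
    """Index mapping: one field stores queries, the others mirror the document schema."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},  # stored rule, held as a query
                "headline": {"type": "text"},
                "body": {"type": "text"},
            }
        }
    }


def rule_document(topic, phrase):
    """A stored rule: the topic to apply (hypothetical field) plus the query that triggers it."""
    return {
        "topic": topic,
        "query": {"match_phrase": {"body": phrase}},
    }


def percolate_request(headline, body):
    """Search body asking which stored queries match this one document."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {"headline": headline, "body": body},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(percolator_mapping(), indent=2))
    print(json.dumps(rule_document("politics", "angela merkel"), indent=2))
    print(json.dumps(percolate_request("Summit", "Angela Merkel met ..."), indent=2))
```

Each hit returned for the percolate request would be a stored rule whose query matched the document, so the `topic` values of the hits are the candidate tags for that story.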
Schema and Rules Example
• Two fields - headline and body - with body allowed to be
queried by paragraph
headline
body
body_paragraph
• A rule to require that “angela merkel” and “us elections”
appear in the same paragraph
(prox/unit=paragraph/distance=1
(body adj "angela merkel")
(body adj "us elections")
)
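
To make the intent of the rule above concrete, here is a toy Python check of "both phrases appear in the same paragraph". It is only a sketch of the idea, not how EXTRA or Elasticsearch evaluates the rule. It assumes paragraphs are separated by blank lines, matches case-insensitively on whole words, and does not model the `distance` parameter. The function names are invented for this example.

```python
import re


def split_paragraphs(body):
    """Split text into paragraphs on blank lines (an assumption of this sketch)."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def contains_phrase(text, phrase):
    """Case-insensitive whole-word phrase match, tolerant of extra whitespace."""
    words = [re.escape(w) for w in phrase.lower().split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return re.search(pattern, text.lower()) is not None


def phrases_in_same_paragraph(body, phrases):
    """True if at least one paragraph contains every phrase."""
    return any(
        all(contains_phrase(paragraph, phrase) for phrase in phrases)
        for paragraph in split_paragraphs(body)
    )


if __name__ == "__main__":
    story = (
        "Angela Merkel spoke in Berlin on Monday.\n\n"
        "Commentators turned to the US elections later in the week."
    )
    print(phrases_in_same_paragraph(story, ["angela merkel", "us elections"]))
```

In the sample story the two phrases fall in different paragraphs, so the check fails; a story with both phrases in one paragraph would pass.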
FRANCIS*
Using machine learning to empower rule-based
classification of news with semantics.
• “aboutness” evaluation
– Given that a story is about a topic, how much is it about it?
• Rule suggestion
– Suggest rules based on a pre-tagged corpus
• Enriched rule operators
– For example, nested “count” operators
– Using EXTRA as the foundation
* St Francis de Sales is the patron saint of writers and journalists

IPTC EXTRA Spring 2018
