“Extra” by Jeremy Brooks https://flic.kr/p/4aKH3c
EXTRA and EXTRA+
Stuart Myles * Associated Press * 6th November 2017
© 2017 IPTC (www.iptc.org) All rights reserved
https://flic.kr/p/kAXGfC
Rules-Based Classification
• Rules better for breaking news than statistical methods
– You don’t need 50 examples before you can start tagging
– A rule for a new topic doesn’t require other rules to change
• More consistent and scalable than hand tagging
• Easier to explain why rules classify content
– Machine learning methods are still “black boxes”
– Easier to precisely explain - and correct - mistakes
• You can use your own taxonomy, rules and formats
– Example rules help us drive development of the EXTRA system
– You can use the example rules to see how to develop your own
– Rules could apply IPTC Media Topics or any other taxonomy
EXTRA
EXTraction Rules Apparatus
Rules-based classification of text
Open source software
EXTRA is developed by the IPTC
€50,000 Grant from the Digital News Initiative
https://www.digitalnewsinitiative.com/fund/
https://iptc.github.io/extra/
Development Process
The EXTRA software is being developed by Infalia
- All software is open source
Two linguists creating rules in English and German
- Sample rules to apply IPTC Media Topics
Example news corpora licensed for EXTRA
- English from Thomson Reuters
- German from APA
EXTRA Components
• Elasticsearch Percolator + Custom Code
• Classification
• Rule authoring
• Corpus Testing
• Schema Management
Classification using Percolator
• Elasticsearch
– A sophisticated, open source full-text search engine
– Lets you query documents stored in an index
• Elasticsearch Percolator
– Store queries in an index and match documents to queries
– Classification uses the percolator to match documents to rules
• EXTRA Rule Language
– Rule-writer-friendly language (easier than ES DSL)
– Access to all ES features, plus custom operators
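The percolator idea (store the queries, then match each incoming document against them) can be illustrated with a toy, in-memory sketch. This is not EXTRA's code: ToyPercolator, contains_phrase and the topic names are hypothetical, and phrase matching here is a plain substring test rather than an Elasticsearch analyzer. The percolate_request helper only builds the body of an Elasticsearch percolate query, assuming the rules are stored in a field named "query".

```python
from typing import Callable, Dict, List

# A stored rule is modelled as a predicate over a JSON-like document.
# (Hypothetical toy model; real rules would be stored Elasticsearch queries.)
Rule = Callable[[Dict[str, str]], bool]


def contains_phrase(field: str, phrase: str) -> Rule:
    """Build a rule that fires when `phrase` occurs in `field` (case-insensitive)."""
    def rule(doc: Dict[str, str]) -> bool:
        return phrase.lower() in doc.get(field, "").lower()
    return rule


class ToyPercolator:
    """Stores rules by topic and matches documents against all of them."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def register(self, topic: str, rule: Rule) -> None:
        # Adding a rule for a new topic leaves the existing rules untouched.
        self._rules[topic] = rule

    def percolate(self, doc: Dict[str, str]) -> List[str]:
        # "Reverse search": the document is the input, the topics are the hits.
        return sorted(topic for topic, rule in self._rules.items() if rule(doc))


def percolate_request(doc: Dict[str, str], field: str = "query") -> dict:
    """Body of an Elasticsearch percolate query for one candidate document.

    Assumes the index maps `field` with the "percolator" type.
    """
    return {"query": {"percolate": {"field": field, "document": doc}}}


percolator = ToyPercolator()
percolator.register("politics", contains_phrase("body", "election"))
percolator.register("sport", contains_phrase("body", "football"))
story = {"headline": "Results due", "body": "The election results are expected tonight."}
matched_topics = percolator.percolate(story)
```

Here matched_topics holds the topics whose rules fired for the story, which is the classification result in this simplified model.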
Schema and Rules
• EXTRA Schema
– Documents must be in (or converted to) a JSON format
– But it can be any JSON format you choose
– Allows validating that your rules reference valid fields
• Granular, field-by-field control of analyzers
– Such as whether and how to stem, e.g. by language
– Different ways to tokenize fields, e.g. for slug
– Allow a field to be queried as a whole or tokenized by sentence or paragraph
– Allows validating that operators are valid by field type
• E.g. to flag that your rule references paragraphs in a field that has none
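The two validation checks described above (rules must reference known fields, and operators must suit the field type) might look like the following sketch. The FieldSpec class, the SCHEMA contents and the error messages are hypothetical and only mirror the idea; EXTRA's actual schema format is not shown on the slides.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One schema field and the text units it can be queried by (hypothetical)."""
    name: str
    units: FrozenSet[str] = frozenset()


# Assumed example schema: body is tokenized by paragraph, headline is not.
SCHEMA: Dict[str, FieldSpec] = {
    "headline": FieldSpec("headline"),
    "body": FieldSpec("body", frozenset({"paragraph"})),
}


def validate_rule(fields: List[str], unit: Optional[str],
                  schema: Dict[str, FieldSpec]) -> List[str]:
    """Return a list of problems; an empty list means the rule passes both checks."""
    errors: List[str] = []
    for field_name in fields:
        spec = schema.get(field_name)
        if spec is None:
            # Check 1: the rule references a field the schema does not define.
            errors.append(f"unknown field: '{field_name}'")
        elif unit is not None and unit not in spec.units:
            # Check 2: the operator needs a unit this field was not tokenized by.
            errors.append(f"field '{field_name}' cannot be queried by {unit}")
    return errors
```

For instance, validate_rule(["headline"], "paragraph", SCHEMA) reports that headline has no paragraphs, which is the kind of mistake the schema is meant to flag before a rule is stored.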
Schema and Rules Example
• Two fields, headline and body, with body allowed to be queried by paragraph
headline
body
body_paragraph
• A rule to require that “angela merkel” and “us elections” appear in the same paragraph
(prox/unit=paragraph/distance=1
(body adj "angela merkel")
(body adj "us elections")
)
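What the rule above asks for can be approximated in a few lines of Python. This is only a sketch of the intended behaviour, not how EXTRA evaluates rules: it assumes paragraphs are separated by blank lines and uses case-insensitive substring matching, with no stemming or tokenization. The function names are hypothetical.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines (an assumption about how paragraphs are delimited)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def same_paragraph(text: str, phrases: List[str]) -> bool:
    """True when a single paragraph contains every phrase (case-insensitive)."""
    for paragraph in split_paragraphs(text):
        lowered = paragraph.lower()
        if all(phrase.lower() in lowered for phrase in phrases):
            return True
    return False


PHRASES = ["angela merkel", "us elections"]
together = "Angela Merkel commented on the US elections today.\n\nMarkets rose."
apart = "Angela Merkel spoke in Berlin.\n\nThe US elections are next week."
```

With these inputs, same_paragraph(together, PHRASES) is true and same_paragraph(apart, PHRASES) is false, because in the second text the two phrases fall in different paragraphs.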
EXTRA Source Code
• The core classification engine
– CQL parsers, CQL to ES mapper, rule schema dict classes, DAO classes, etc.
https://github.com/iptc/extra-core
• EXTRA “extra” code
– API, UI, docker files for deployment
https://github.com/iptc/extra-ext
• Open source
– MIT license for EXTRA-specific code
– Apache license for Elasticsearch
EXTRA Timetable
• EXTRA was completed in Summer 2017
• You can access the source code now
– Feedback welcome
• We have applied for a second round of funding: EXTRA+
• Join the (low frequency) email list to stay up-to-date
https://groups.yahoo.com/neo/groups/iptc-extra/info
EXTRA+
Enriching Rule-based Classification of News with Powerful Semantics
• “aboutness” evaluation
– Given that a story is about a topic, how much is it about it?
• Rule suggestion
– Suggest rules based on a pre-tagged corpus
• Enriched rule operators
– For example, nested “count” operators
Date and Place of Next Meeting
Athens 23rd – 25th April 2018
https://flic.kr/p/atFSAr
Thank you and goodbye!!
