
EXTRA Open Source Rules Classification for News

Update on IPTC's EXTRA project, an open source engine for classifying news using rules.

Published in: Technology

  1. “Extra” by Jeremy Brooks https://flic.kr/p/4aKH3c
  2. An Update on EXTRA
     Stuart Myles, Associated Press, 16th May 2017
     © 2017 IPTC (www.iptc.org) All rights reserved
     https://flic.kr/p/tiRXEB
  3. Rules-Based Classification
     • Rules better for breaking news than statistical methods
       – You don’t need 50 examples before you can start tagging
       – A rule for a new topic doesn’t require other rules to change
     • More consistent and scalable than hand tagging
     • Easier to explain why rules classify content
       – Machine learning methods are still “black boxes”
       – Easier to precisely explain, and correct, mistakes
     • You can use your own taxonomy, rules and formats
       – Example rules help us drive development of the EXTRA system
       – You can use the example rules to see how to develop your own
       – Rules could apply IPTC Media Topics or any other taxonomy
  4. EXTRA: EXTraction Rules Apparatus
     • Rules-based classification of text
     • Open source software
     • EXTRA is being developed by the IPTC
     • €50,000 grant from the Digital News Initiative
       https://www.digitalnewsinitiative.com/fund/
     • https://iptc.github.io/extra/
  5. Development Process
     • The EXTRA software is being developed by Infalia
       – All software is open source
     • Two linguists are creating rules in English and German
       – Sample rules to apply IPTC Media Topics
     • Example news corpora licensed for EXTRA
       – English from Thomson Reuters
       – German from APA
  6. EXTRA Components
     • Elasticsearch Percolator + Custom Code
     • Classification
     • Rule authoring
     • Corpus Testing
     • Schema Management
  7. Classification using Percolator
     • Elasticsearch
       – A sophisticated, open source full-text search engine
       – Lets you query documents stored in an index
     • Elasticsearch Percolator
       – Store queries in an index and match documents to queries
       – Classification uses the percolator to match documents to rules
     • EXTRA Rule Language
       – Rule-writer-friendly language (easier than ES DSL)
       – Access to all ES features, plus custom operators
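The percolator idea on slide 7 (store the queries, then match each incoming document against them) can be illustrated with a toy sketch in plain Python. Nothing here calls Elasticsearch: the `Rule` class, the topic names and the "all terms must appear" semantics are assumptions made up for illustration, and EXTRA's real rule language is considerably richer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A stored query: the topic to assign and the terms that must all appear."""
    topic: str
    required_terms: tuple


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(document, rules):
    """Return the topics of every stored rule that the document satisfies."""
    tokens = tokenize(document)
    return [r.topic for r in rules if all(t in tokens for t in r.required_terms)]


# Hypothetical rules; adding a new one does not require touching the others.
RULES = [
    Rule("politics/election", ("election", "vote")),
    Rule("sport/football", ("football",)),
]

print(classify("Early vote counts in the election are in.", RULES))
# ['politics/election']
```

In Elasticsearch itself, the stored rules would live in an index as documents with a field of type `percolator`, and matching would be done with a `percolate` query rather than a Python loop.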
  8. Schema and Rules
     • EXTRA Schema
       – Documents must be in (or converted to) a JSON format
       – But it can be any JSON format you choose
       – Allows validating that your rules reference valid fields
     • Granular, field-by-field control of analyzers
       – Such as whether and how to stem, e.g. by language
       – Different ways to tokenize fields, e.g. for slug
       – Allow a field to be queried as a whole or tokenized by sentence or paragraph
       – Allows validating that operators are valid by field type
         • E.g. to flag that your rule references paragraphs in a field that has none
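The two validation checks on slide 8 (does the rule reference a known field, and is the requested unit available on that field) might look roughly like this. The schema layout, the unit names and `validate_reference` are hypothetical and are not taken from the EXTRA code base.

```python
# Hypothetical schema: field name -> the units that field can be queried by.
SCHEMA = {
    "headline": {"units": {"field"}},
    "body": {"units": {"field", "sentence", "paragraph"}},
}


def validate_reference(schema, field, unit="field"):
    """Return a list of problems with one rule reference (empty if it is valid)."""
    if field not in schema:
        return [f"unknown field: {field}"]
    if unit not in schema[field]["units"]:
        return [f"field '{field}' cannot be queried by {unit}"]
    return []


print(validate_reference(SCHEMA, "body", "paragraph"))
# []
print(validate_reference(SCHEMA, "headline", "paragraph"))
# ["field 'headline' cannot be queried by paragraph"]
```

Running checks like these when a rule is saved, rather than when it is executed, lets a rule writer see the mistake before any document is classified.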
  9. Schema and Rules Example
     • Two fields, headline and body, with body allowed to be queried by paragraph
       headline
       body
       body_paragraph
     • A rule to require that “angela merkel” and “us elections” appear in the same paragraph
       (prox/unit=paragraph/distance=1
         (body adj "angela merkel")
         (body adj "us elections")
       )
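The intent of the rule on slide 9, as the slide describes it, is that both phrases occur in the same paragraph of the body. A rough plain-Python approximation of that intent follows; splitting paragraphs on blank lines and matching by case-insensitive substring are assumptions, and this is not how EXTRA or Elasticsearch evaluates the `prox` operator.

```python
import re


def paragraphs(body):
    """Split text into paragraphs on blank lines, dropping empty ones."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def same_paragraph(body, phrase_a, phrase_b):
    """True if at least one paragraph contains both phrases.

    Matching is by case-insensitive substring, so it would also hit
    longer words that happen to contain a phrase.
    """
    a, b = phrase_a.lower(), phrase_b.lower()
    return any(a in p.lower() and b in p.lower() for p in paragraphs(body))


together = "Angela Merkel commented on the US elections.\n\nMarkets were calm."
apart = "Angela Merkel met ministers in Berlin.\n\nAnalysts discussed the US elections."

print(same_paragraph(together, "angela merkel", "us elections"))  # True
print(same_paragraph(apart, "angela merkel", "us elections"))     # False
```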
  10. EXTRA Source Code
      • The core classification engine
        – cql parsers, cql to es mapper, rule schema dict classes, dao classes, etc.
          https://github.com/iptc/extra-core
      • EXTRA “extra” code
        – API, UI, docker files for deployment
          https://github.com/iptc/extra-ext
      • Open source
        – MIT license for EXTRA-specific code
        – Apache license for Elasticsearch
  11. EXTRA Timetable
      • First phase of the EXTRA project is due to complete in Summer 2017
      • You can access the source code now
        – Feedback welcome
      • Will there be a second phase? TBD…
      • Join the (low frequency) email list to stay up-to-date
        https://groups.yahoo.com/neo/groups/iptc-extra/info
  12. News Metadata Summit
      • Proposal: dedicate part of our next face-to-face meeting to descriptive news metadata
      • Gather academics, vendors, linguists, product owners
      • Discuss use cases, techniques, technologies
        – “Face off” between machine learning, deep learning, rules…
      • Demo final version of EXTRA
      • Let me know if you’re interested in participating
  13. Date and Place of Next Meeting
      Barcelona, 6th – 8th November 2017
      https://flic.kr/p/kAXGfC
      Thanks and goodbye!!
