HLT 2013 - Adapting News-Trained Entity Extraction to New Domains and Emerging Text Genres by Gregor Stewart

Many of the most robust Human Language Technologies, including statistical part of speech taggers and entity extractors, are developed primarily using high-quality newswire data sources. The performance of these technologies on texts in other genres, including short texts like tweets and even sub-genres of news like market summaries, is typically poor. Adapting such technologies to these increasingly important genres is still very difficult and an active area of commercial and academic research. In this presentation, Mr. Stewart will highlight the ways in which newswire-trained modules typically fail on the most important emerging text genres, outline the most effective and lowest cost methods to adapt these resources that researchers and practitioners have discovered, and offer guidance on what degree of improvement users can expect to see in the short to medium term.

Speaker notes
  • Getting “high quality” entities from text. Doing it quickly and accurately. Guiding people in their use.
  • Adapt to your needs. A system that adapts to data. Could use click-through data to factor evidence.
  • Note: time and place are missing from the diagram; they affect both vocabulary and grammar.
  • Operational priorities change too quickly to merit the development of a model for interest, and the learned model would probably miss many things that we wanted to see. Task / Problem(s) / Challenge(s) / Approach(es) / Solution.
  • Traditionally, finding (putative) entity mentions in text: mark spans that we think refer to something “in the world”; for each, make a guess about the kind of thing it refers to, e.g. PERSON, PLACE, ORGANISATION; optionally, group the mentions that you think co-refer into chains. Most often called Named Entity Recognition (NER). An embarrassingly good method combines a statistical B-I-O sequence model with lists and known patterns. The statistical model is typically trained using local features over annotated newswire text: abundance, quality.
  • Deterministic or explicit components: gazetteers (lists, e.g. company names, product names); regular expressions (patterns). Probabilistic or implicit components: training and testing data (e.g. annotated newswire, raw domain text); features (e.g. metadata_subject=markets, prior_word_class=543); learners; model(s), i.e. what the learner outputs. Combiner/redactor: adjudication between component outputs; entity joining / in-document co-reference resolution; modify joining rules; set confidence thresholds; identify entity types consistently; set weight or length preferences. Easier: novel entities with a small number of forms; novel, highly productive but structured entities – regular expressions; forms we know aren’t entities – blacklists; broad vocabulary and style shift – using unsupervised word class models. Harder: new entity types – additional annotated data and feature engineering; structure change – additional annotated data if within bounds set by features; fine grained entities – lots of data and annotation. (A minimal sketch of combining these components follows these notes.)
  • Extraction and co-reference performance varies greatly by entity type and language. Brittle to changes in domain and genre: distribution of entity types; vocabulary differences; “grammar” or structure, inc. document length, abbreviation. Data sparsity means: fine grained, rarer entities can be very difficult to extract; performance on very short texts is typically very low. Entity types decided up front / embedded in models.
  • First step is to build a representation of the entity-base structured to make feature evaluation easy, so we can learn to link. Our system begins by building an index of the information in the knowledge base; the entity-base can be anything from a list, to a database, to a graph, to a rich, semi-structured text resource like Wikipedia. For each entity in the knowledge base, we create an entry in the index containing information that is known to be useful for efficiently differentiating it from other entities (called features), e.g. the non-stop words in a canonical mention sentence, like “president” and “USA” in the opening line of Barack Obama’s Wikipedia page.
  • (AT LEFT) Let’s focus on four of these coreference chains: Hyon Song-wol, Ri Sol-ju, Wangjeasan Light Music Band and Chosun. In a first pass, we compare the surface form of the mentions in each chain with the labels of the entities in the index; this generates a small number of candidates for closer consideration. In a second pass, we score the degree of similarity of these mentions and their surrounding context, like the non-stop words in the sentences they appear in, with the contents of the candidate entity entries. We can think of this as trying to find which entity or entities the mentions are closest to in some space, called the feature space. Here we can see that the Hyon Song-wol and Wangjeasan groups are quite closely associated with the entities we would expect them to be; that Chosun is equally well associated with two entities; and that Ri Sol-ju is not particularly closely associated with any of the known entities.
  • (AT LEFT) In this example, our scoring resolves Hyon Song-wol and Wangjeasan correctly to the respective entries in Wikipedia; correctly identifies that Ri Sol-ju is a genuinely new name or “ghost” entity which we may wish to create a knowledge base entry for; but incorrectly associates Chosun with the Wikipedia page for Korea, rather than the news agency Chosun Ilbo. Had Chosun been correctly tagged as an ORG at the NER stage, it would almost certainly have been resolved correctly. This example emphasizes how important high quality foundational linguistic components are for higher level tasks, and how flexibility must be built into downstream algorithms to prevent the errors that do occur from being unrecoverable.
  • If recent TAC data is anything to go by, entity linking is expected to associate very different strings.
  • REX Field Training Kit: a package of tools and processes for English, Pashto. Provides guidelines for: effective use of gazetteer and regular expression components; annotation of data and training of supervised models. A clustering tool allows adaptation to domain vocabulary for languages that have word class data. Slated for 2.0: coverage for all languages REX supports, inc. Korean, Arabic; seed resources for specific domains.
  • Less data: better balance between stability of models and volume of data available for adaptation. Less effort: automated adaptation, tools and UIs for annotation projects. Inline, e.g. task performance as feedback, in addition to correction. Online, e.g. dynamic knowledge sources without discontinuities.
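To make the combination described in the components note above concrete, here is a minimal, illustrative Python sketch (not Basis Technology's REX API): gazetteer and regular-expression matches are merged with spans from a statistical tagger, and overlaps are adjudicated with a simple preference for deterministic, then longer, spans. The gazetteer entries, pattern, entity types and adjudication rule are hypothetical examples.

```python
import re

# Hypothetical example resources; REX's real gazetteers and patterns are configured differently.
GAZETTEER = {"kalashnikov": "WEAPON", "rpg-7": "WEAPON"}       # user-defined list
PATTERNS = {"DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b")}      # user-defined pattern

def deterministic_spans(text):
    """Exact gazetteer matches plus regex matches, as (start, end, type, source)."""
    spans = []
    lowered = text.lower()
    for phrase, etype in GAZETTEER.items():
        start = lowered.find(phrase)
        while start != -1:
            spans.append((start, start + len(phrase), etype, "deterministic"))
            start = lowered.find(phrase, start + 1)
    for etype, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), etype, "deterministic"))
    return spans

def redact(det_spans, stat_spans):
    """Toy overlap adjudication: prefer deterministic spans, then longer spans."""
    ranked = sorted(det_spans + [s + ("statistical",) for s in stat_spans],
                    key=lambda s: (s[3] != "deterministic", -(s[1] - s[0])))
    chosen = []
    for s in ranked:
        if all(s[1] <= c[0] or s[0] >= c[1] for c in chosen):  # keep only non-overlapping spans
            chosen.append(s)
    return sorted(chosen)

text = "The cache included a Kalashnikov, recovered on 2012-01-15 near Aleppo."
statistical = [(63, 69, "LOCATION")]   # pretend the B-I-O model tagged "Aleppo"
print(redact(deterministic_spans(text), statistical))
```

A real redactor would also handle entity joining and in-document co-reference, as the extraction architecture slide below shows; this toy skips both.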

    1. Getting Things. Gregor William Stewart, Director of Product Management, Text Analytics, Basis Technology Corporation
    2. Introduction.  Product Manager for Text Analytics, including: – Rosette Linguistics Platform – Entity Analytics – Name Indexing and Translation – Chat Translator – Highlight  Questing for: – Quality: accuracy, performance – Coverage: languages, domains, genres – Integration: tasks, workflows, UX – Innovation: new aggregates, functions
    3. Overview. [Overview diagram; columns: Source, Tasks, Technologies, Adaptations; cell labels: Properties, Description, Action (Input/Output), “Out of Box” Comparison, Problem(s), Components, Suggested Adaptations, Challenge(s), Process, Potential Benefits, Approach(es), Adaptation Opportunities, Costs, Solution, Signal(s).]  Focus on entity analytics in four stages of the processing and exploitation of SOCOM-2012-0000011-HT  Reaching “state of the art” in practice means adapting to source, task and user.
    4. Source: SOCOM-2012-0000011-HT.  An Arabic language source document  Letters/emails from one colleague to others, regarding policy  Written years before it was acquired, processed  Perhaps imperfectly transcribed, or OCRed into our forensics platform  Of uncertain provenance, content value  Not a current web news article for wide consumption, with metadata. [The slide reproduces the Arabic text of SOCOM-2012-0000011-HT; it is not legibly recoverable from this transcript.] Labels: Vocabulary, Form, Domain, “Grammar”.
    5. Task: Triage.  Triage: should we process further and/or urgently?  Too few trained, trusted linguists to review all the documents in time  Enable non-linguist to do linguist’s job  Gisting: MT All vs. MT Names alone  Combine Entity Extraction with Specialized Machine Translation  Integrate into Triage workflow  Signal: Documents Selected (How are guidelines interpreted?)
    6. Technology: Entity Extraction (1)
    7. Technology: Entity Extraction (2)
    8. Technology: Entity Extraction (3). [Architecture diagram: Input Text is processed by a Deterministic Extractor (Exact Match / Gazetteer, Pattern Match / Regex, with User Defined Lists and User Defined Patterns) and a Probabilistic Extractor (Supervised Model trained on Tagged Text, Unsupervised Model built from Domain Text); an Entity Redactor (Overlap Adjudication, Entity Joining, Filtering) combines their outputs into Output Text.]
    9. Adaptation: Entity Extraction to Triage  Out of the box: – False +/- because contextual cues are fewer/different. – Weapon in this document missed, because not a default entity type.  Adaptation: – Add custom entity type(s) via deterministic extractor, e.g. weapons list  Benefit: – Highlights important documents that might otherwise be missed. – Fast and unlikely to affect performance of other components  Difficulties: – Requires forethought, maintenance of lists and patterns in many languages, but much less work than developing a new model
    10. Task: Translation.  Produce standardized, “user language” versions of the source document  Too few translators; name standardization particularly labor intensive  Speed up translation without compromising quality  MT All reduces translation productivity  NER, Coref and Name Translation/Standardization  Signals: Resource Selections, Corrections, Resolutions
    11. Adaptation: Extraction to Translation (1)  Out of the box: – Same problems as in Gisting case, only now they matter more.  Adaptation: – Train unsupervised model to help with form and domain differences – Tune co-reference algorithm to most important entity types – Develop form/domain specific resource sets, and allow users to select them.  Benefit: – Fewer errors in highlighting should mean translation actually speeds up  Difficulties: – Often hard to amass a big enough corpus of like material for model building. – Form/Domain may be ephemeral
    12. Adaptation: Extraction to Translation (2). Thanks:~ Itai_Rolnick$ cat en_wc.txt | grep -i " aleppo " | tr ' ' '\n' | shuf | head  Unsupervised algorithm clusters words with distributional similarities together  Word cluster ID is one feature used in learning the sequence model  Based on Collins & Singer (1999)  Part of REX Field Training Kit  Shown: random sample of words clustered with “Aleppo” in a ~10GB English model: Loveland -- City in Colorado; Svetogorsk -- Town in Russia; MASSOUD -- ? Probably also of a village; Atiak -- Town in Uganda; Waltha -- typo for Waltham? - town in Mass; BASILICA -- type of Church?; Sapukai -- Town in Paraguay; Yeisk -- Town in Russia; Descoberto -- Town in Brazil; SINKHOLE -- ? A pub in Belgium??  Note they’re almost all LOCs  Would an annotated training corpus ever cover so many remote entities? (A small sketch of the cluster-ID feature appears after the transcript.)
    13. Task: Cataloging.  Distill content into an index, to facilitate search and further refinement at scale  Impossible to annotate more than a tiny fraction of documents by hand  High quality automated enrichment that makes efficient use of knowledge resources and structure in data  Many approaches, e.g. LSI, topic modeling, document classification  Entity resolution is a robust extension of NER; data and knowledge driven.  Signals: mentions/aliases, shallow relationships between entities
    14. Technology: Entity Resolution (1). [Diagram: clusters of “Alberto Fernandez” mention variants (Alberto Amos Fernandez…, Alberto Fernandiz…, Alberto Fernandez de la Puebla…, Albert Fernandez…, Alberto M. Fernandez…, Alburto Fernandez…) with contextual clues (… born in Cuba, … US Ambassador, … Chief of Cabinet, … Argentina…, … Prof of Criminal Law…, … born Sept 7, 1984, … cycling, … Madrid, nickname “El Galleta”?) and questions: Sportsmen? YES; Ratio of Politicians to Sportsmen? 2:1.]
    15. Technology: Entity Resolution (2)
    16. Technology: Entity Resolution (3)
    17. Technology: Entity Resolution (4)
    18. Technology: Entity Resolution (5). [Resolution Engine diagram: a Knowledge Base is indexed into an Entity Index (Seeded, Learned); an Entity Mention goes through Candidate Selection and Ranking, producing a Link or a Ghost.] (A minimal sketch of this two-pass resolution appears after the transcript.)
    19. Adaptation: EntRes to Cataloging (1)  Out of the box: – Quality dependent on output of extraction and order of input – Lots of ghosts, poor links if Wikipedia-based KB doesn’t contain entities in document – Seeding context selection may not be suited to domain  Adaptations: – Custom KB, sized and suited to the domain and languages – Seeding using context most likely to match in your domain – Choose Linking or Learning mode – Choose evidence factoring scheme that meets your operational needs  Benefits: – Linking throughput is high, accuracy is high, ghosts are informative (because fewer confounders) – System can maintain low latency after ingestion of many documents – Linking accuracy can remain high after ingestion of many documents  Difficulties: – Each element requires experimentation and thought – Changes likely to cause discontinuities unless re-indexing
    20. Adaptation: EntRes to Cataloging (2)  In Linking mode: – Link to existing KB or declare unknown, discarding context – State size is constant, latency stable  In Learning mode: – Link to existing KB or create New, storing context – State size increases, increasing latency – Semantic drift – Confidence measure gets complicated  Scaling with learning introduces the need to factor evidence.  Evidence factoring schemes need to be customized to use cases. (The resolution sketch after the transcript illustrates the Linking/Learning distinction with a learn flag.)
    21. Task: Retrieval.  Find relevant information for further analysis  String-based retrieval methods are easy to understand, but require a lot of effort and distract from the task.  Deliver search modalities that are more productive but still interpretable and correctable  Search using entity-driven facets, as well as keywords  Signals: query log, click through, curation, corrections
    22. Adaptation: EntRes to Retrieval  Out of the box: – Entity labels not in user’s language confusing – Returns results that can’t be easily summarized as a Boolean, cf aliases – Complex, potentially misleading measure of confidence  Adaptations: – Use name translation for non user-language labels, e.g. from KB – Present users with cues to expansion in string terms, e.g. mentions – Present confidence measure carefully  Benefits: – User spends less time confused, search is more productive  Difficulties: – Users still want to do things like exclude certain mentions. (A small faceted-search sketch appears after the transcript.)
    23. Summary  News-trained NER is OK for Triage, but adding entity types via lists and patterns could improve results considerably.  Speeding up Translation requires a better fit: unsupervised adaptation and custom resource selection could make the difference between time saved and time wasted.  Cataloging by resolved entities enables powerful search, but relies on high quality extraction; Learning mode requires evidence factoring at scale.  Entity-based search is incredibly productive compared to Boolean and keyword approaches, but users need cues that explain expansion and robust measures of confidence.
    24. Remaining Challenges  Current reality: even “simple” adaptation can be difficult: – Too much knowledge, experience required – Too much data required, e.g. 10GB for unsupervised – Mostly “out of band” – Usually offline  Through the REX Field Training Kit and Entity Resolution API, Basis is lowering the barriers to manual adaptation to sources, tasks and users today.  Integration of explicit signals, e.g. corrections, and implicit signals, e.g. selections, is ongoing.
    25. Q&A. gregor@basistech.com, Director of Product Management, Text Analytics, Basis Technology Corporation
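The word-class idea on slide 12 can be made concrete with a small Python sketch: each token looks up a cluster ID learned from raw text, and that ID is emitted as one feature for the B-I-O sequence model. The cluster IDs and feature names below are hypothetical placeholders, not REX's actual word-class format.

```python
# Hypothetical cluster assignments; in practice these come from clustering ~10GB of raw text
# (Collins & Singer 1999 style) and are stored in a word-class file.
CLUSTERS = {"aleppo": "C1041", "atiak": "C1041", "yeisk": "C1041", "reported": "C207"}

def token_features(tokens, clusters):
    """Per-token features for a B-I-O sequence model; the cluster ID lets the model
    generalize from seen locations (e.g. 'Aleppo') to unseen cluster-mates."""
    feats = []
    for i, tok in enumerate(tokens):
        feats.append({
            "word.lower": tok.lower(),
            "word.istitle": tok.istitle(),
            "word.cluster": clusters.get(tok.lower(), "UNK"),
            "prev.cluster": clusters.get(tokens[i - 1].lower(), "BOS") if i else "BOS",
        })
    return feats

print(token_features("shelling reported near Atiak".split(), CLUSTERS))
```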
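The resolution engine on slides 14-20 (and in the speaker notes) can be summarized in a short, illustrative Python sketch: seed an entity index from a knowledge base using the non-stop words of a canonical sentence as features, select candidates by surface-form overlap, rank them by context overlap, and either link, declare a ghost, or (with a learn flag standing in for Learning mode) store the ghost and its context so state grows. Thresholds, scoring and naming are hypothetical simplifications, not the Basis Entity Resolution API.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "was", "to"}

def features(text):
    """Non-stop, lower-cased words, e.g. 'singer', 'band' from a canonical sentence."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def build_index(knowledge_base):
    """knowledge_base: {entity_id: canonical_sentence} -> seeded entity index."""
    return {eid: {"label": eid, "features": features(sent)}
            for eid, sent in knowledge_base.items()}

def resolve(mention, context, index, learn=False, threshold=0.3):
    """Pass 1: candidate selection by surface-form overlap with entity labels.
       Pass 2: rank candidates by context/feature overlap; link, learn, or ghost."""
    mention_words = set(mention.lower().split())
    candidates = [eid for eid, e in index.items()
                  if mention_words & set(e["label"].lower().replace("_", " ").split())]
    ctx = features(context)
    scored = sorted(((len(ctx & index[c]["features"]) / (len(ctx) or 1), c)
                     for c in candidates), reverse=True)
    if scored and scored[0][0] >= threshold:
        return "LINK", scored[0][1]
    if learn:  # Learning mode: store the ghost and its context; state grows.
        ghost_id = f"ghost:{mention}"
        index[ghost_id] = {"label": mention, "features": ctx}
        return "NEW", ghost_id
    return "GHOST", None   # Linking mode: declare unknown, context discarded.

kb = {"Hyon_Song-wol": "Hyon Song-wol is a North Korean singer and band leader",
      "Chosun_Ilbo": "The Chosun Ilbo is a South Korean newspaper"}
index = build_index(kb)
print(resolve("Hyon Song-wol", "the singer Hyon Song-wol led the band", index))
print(resolve("Ri Sol-ju", "Ri Sol-ju appeared at the concert", index, learn=True))
```

In Linking mode (learn=False) the unknown mention's context is simply discarded, which keeps the index size, and hence latency, roughly constant, mirroring the contrast drawn on slide 20.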
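Finally, a tiny Python sketch of the entity-faceted retrieval discussed on slides 21-22: documents in the catalog carry resolved entity IDs plus the mention strings that matched, so a facet query retrieves by ID and the mentions are shown back to the user as cues to the expansion. The document structure and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog: each document records resolved entity IDs and the mention strings
# that resolved to them.
DOCS = [
    {"id": "d1", "entities": {"Hyon_Song-wol": ["Hyon Song-wol"]},
     "text": "Hyon Song-wol performed with the band."},
    {"id": "d2", "entities": {"Hyon_Song-wol": ["the singer", "Hyon Song wol"]},
     "text": "Reports about the singer resurfaced."},
]

def build_facet_index(docs):
    index = defaultdict(list)
    for doc in docs:
        for entity_id in doc["entities"]:
            index[entity_id].append(doc["id"])
    return index

def facet_search(entity_id, docs, index):
    """Retrieve by resolved entity ID and surface the mention strings as expansion cues."""
    hits = [d for d in docs if d["id"] in set(index.get(entity_id, []))]
    cues = sorted({m for d in hits for m in d["entities"][entity_id]})
    return hits, cues

index = build_facet_index(DOCS)
hits, cues = facet_search("Hyon_Song-wol", DOCS, index)
print(len(hits), "documents; matched via mentions:", cues)
```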
