MIT Program on Information Science Talk -- Ophir Frieder on Searching in Harsh Environments

Searching in Harsh Environments
Ophir Frieder
Computer Science Dept. | Georgetown University &
Biostatistics, Bioinformatics, & Biomathematics | Georgetown University Medical Center
ophir@ir.cs.georgetown.edu March 2016

Correcting the Search Myth
If it’s search, then Google solved it!
 Some of what Google solved
 Was solved by others first
 Google’s focus is computerized data,
 Much data are not digitized
 Google is hardly a key social media player
 Social media data are everywhere

Diverse Search Applications
 Complex Document Information Processing
 The whole is greater than the sum of its parts
 Searching is easy
 Unless it is in adverse (misspelled) environments
 Social Media Search & Surveillance
 Detecting outbreaks in their infancy

Talk Outline
Engineering Research
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
Computer Science

 Complex documents include
 handwritten notes,
 diagrams,
 graphics,
 printed or formatted text
 Point solutions exist: OCR, Information Retrieval, Information Extraction, Image Processing, Text Clustering, Computational Stylistics, …
 No definition of state-of-the-art for the integrated problem
 Manual partitioning/collating: expensive, time-consuming, error-prone
Some are even more complex!

Existing Technology Point Solutions
Optical character recognition (OCR)
Document clustering and browsing
Document structure extraction
Extraction from tables/lists
Handwriting analysis and signature recognition
Figure caption identification and extraction
Conventional and image retrieval systems
Entity and relationship extraction

CDIP Processing Architecture
[Figure: architecture diagram. Labels: Complex Document Images; Layers; OCR; Table Extraction; Logo Extraction; Signature Match; Doc Metadata; Text Extraction; Entity Tagging; Enhance; CDIP Metadata Database; Integrated Retrieval; Data Mining; Analyst]

Integration Helps
Without Logos: At which institution?
Without Text: What positions do I hold?
Ophir Frieder
McDevitt Prof. of Comp. Sci. & Inf. Proc. & Prof. of Biostatistics, Bioinformatics, & Biomathematics

Technology comes and goes, but…
Benchmarks (Collections) are ever (forever) lasting

Test Collection Characteristics
 Cover the richness of inputs
 Range of formats, lengths, & genres
 Variance in print and image quality
 Documents should include:
 Handwritten text and notations
 Diverse fonts
 Graphical elements
 graphs, tables, photos, logos, and diagrams

Test Collection Characteristics
 Sufficiently high volume of documents
 Vast volume of redundant & irrelevant documents
 Support diverse applications
 Include private communications within and between groups planning activities and deploying resources
 Publicly available data!
 Minimal cost
 Minimal licensing

CDIP Test Collection
 Data made public via legal proceedings
 Master Settlement Agreement subset of UCSF Legacy Tobacco Document Library
 Documents scanned by individual companies; hence scan quality varies widely
 ~ 7 million documents
 ~ 42 million scanned TIFF format pages (~ 1.5 TB)
 ~ 5 GB Metadata
 ~ 100 GB OCR
Dataset: https://ir.nist.gov/cdip/cdip-images/

The CDIP Test Collection (NIST TREC V1.0)
 Used multiple years in TREC Legal Track
 Records (62 GB) made available to TREC participants (through FTP/DVD)
 40 queries simulating legal case investigations, with relevance judgments produced by 35 lawyers
 Novel queries with relevance judgments generated by tobacco researchers

Evaluation
 CDIP Benchmark data – as a novel text test collection for “live scenarios”
 NIST TREC Legal Track, 2006 - 2009
 Housed permanently at NIST
 Complex Document search
 Ground truth difficult
 Hand-checked sub-collection of 800 documents

Preliminary Results
Completed:
Subset of 800 documents
Manually labelled authorship & organizational unit
Evaluated:
Authorship, organizational, monetary, date, and address-based retrieval tasks
Ongoing:
Subset of 20K documents.
Open Problem:
Performance evaluation (measures) for larger sets

System Configuration Screen

Query: ATC Logo + “income forecast” + > $500,000

Query: RJR Logo + “filtration efficiency” + signature

Query: Five signatures with the highest dollar total

Query: Associations of a given person (Dr. D. Stone)

Collaborators
Initial effort
 Gady Agam – Illinois Inst. of Tech.
 Shlomo Argamon – Illinois Inst. of Tech.
 David Doermann – Univ. of Maryland DARPA
 David Grossman – Illinois Inst. of Tech. Grossman Lab
 David D. Lewis – DDL Consulting
 Sargur Srihari – SUNY Buffalo
Ongoing effort
 Gideon Frieder – George Washington Univ.
 Jon Parker – Georgetown Univ. MITRE

References
 S. Argamon, G. Agam, O. Frieder, D. Grossman, D. Lewis, G. Sohn, and K. Voorhees, “A Complex Document Information Processing Prototype,” ACM SIGIR, 2006.
 D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, “Building a Test Collection for Complex Document Information Processing,” ACM SIGIR, 2006.
 G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, “Content-Based Document Image Retrieval in Complex Document Collections,” Document Recognition and Retrieval, 2007.
 G. Bal, G. Agam, O. Frieder, and G. Frieder, “Interactive Degraded Document Enhancement and Ground Truth Generation,” Document Recognition and Retrieval, 2008.
 T. Obafemi-Ajayi, G. Agam, and O. Frieder, “Historical Document Enhancement Using LUT Classification,” International Journal on Document Analysis and Recognition, 13(1), March 2010.
 J. Parker, G. Frieder, and O. Frieder, “Automatic Enhancement and Binarization of Degraded Document Images,” International Conference on Document Analysis and Recognition, 2013.
 J. Parker, G. Frieder, and O. Frieder, “Robust Binarization of Degraded Document Images using Heuristics,” Document Recognition and Retrieval XXI, San Francisco, California, February 2014.
 Parker, et al., “System and Method for Enhancing the Legibility of Degraded Images,” US Patent #8,995,782, March 31, 2015.
 Frieder, et al., “System and Method for Enhancing the Legibility of Images,” US Patent #9,269,126, February 23, 2016.

Talk Outline
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
Computer Science

Spelling in Adverse Conditions
 Foreign language (Yizkor Books)
 User unfamiliar with character pronunciation
 Multiple languages within a document
 Domain specific (Medical)
 Terms unfamiliar to the general audience

Yizkor Books
 Yizkor = Hebrew word for “remember”
 Firsthand accounts of events that preceded, took place during, and followed the Second World War
 Document destroyed communities and the people who perished
 Started in the early 1940s; highest activity in the 1960s and 1970s
 Published in 13 languages, across 6 continents
 One of the largest collections resides at the USHMM
 Access restricted due to limited number, fragile state, and prevention of destruction or theft

Traditional Access
User requested; archivist driven
 Requires “complete” understanding of books
 High human resource costs
 Inefficient & slow
 Often fails to obtain complete, if any, results

Metadata Search Access
Provides an intuitive search capability for apprehensive but interested users
Creates and queries collection metadata

Yizkor Interface
 Centralized index
 Global access
 Efficient search
 Accurate search
 Multi-lingual spelling correction

Spell Checker
 Upon entering a misspelled query, users are presented with a ranked list of suggestions
 Percentages represent similarity to original query as measured by our algorithms

Query Processing
Language independent string manipulation for auto-correction via a voting algorithm

Language Independent Correction
Simplistic Rules Work ! or ?
 Replace first and last characters by a wild card, in succession;
 Retain only first and last characters and insert a wild card;
 Retain only first and last two characters and insert a wild card;
 Replace middle n-characters by a wild card, in succession;
 Replace first half by a wild card;
 Replace second half by a wild card;
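The rules above can be sketched as a pattern generator feeding a simple vote count. This is a minimal illustration only, not the USHMM implementation: the function names, the reading of "middle n-characters" as a sliding interior window, the one-vote-per-matching-pattern scheme, and the sample dictionary are all assumptions.

```python
from collections import Counter


def wildcard_patterns(query, n=2):
    """Return (prefix, suffix) pairs, each standing for prefix + '*' + suffix."""
    q = query.lower()
    size = len(q)
    half = size // 2
    patterns = [
        ("", q[1:]),      # rule 1a: first character replaced by the wild card
        (q[:-1], ""),     # rule 1b: last character replaced by the wild card
        (q[:1], q[-1:]),  # rule 2: keep only the first and last characters
        (q[:2], q[-2:]),  # rule 3: keep only the first two and last two characters
        ("", q[half:]),   # rule 5: first half replaced by the wild card
        (q[:half], ""),   # rule 6: second half replaced by the wild card
    ]
    # Rule 4 (assumed reading): slide an n-character window across the interior.
    for start in range(1, size - n):
        patterns.append((q[:start], q[start + n:]))
    # Drop the degenerate pattern '*', which would match every term.
    return [p for p in patterns if p[0] or p[1]]


def matches(term, prefix, suffix):
    """True if term fits prefix + '*' + suffix without prefix and suffix overlapping."""
    return (len(term) >= len(prefix) + len(suffix)
            and term.startswith(prefix) and term.endswith(suffix))


def suggest(query, dictionary, n=2, top=5):
    """Rank dictionary terms by how many wildcard patterns they satisfy (one vote each)."""
    votes = Counter()
    for prefix, suffix in wildcard_patterns(query, n):
        for term in dictionary:
            if matches(term, prefix, suffix):
                votes[term] += 1
    return votes.most_common(top)


# Hypothetical place-name dictionary, for illustration only.
print(suggest("warsxw", ["warsaw", "warsau", "krakow", "lublin"]))
```

Because every rule reduces to "prefix, wild card, suffix", no language-specific knowledge (phonetics, keyboard layout) is needed, which is the point of the language-independent design.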

Single Character Correction
Mitton 1996 – “Spellchecking by Computers”
Operation                                 Method      Found (%)   Rank
Add Single Random Character               D-M Sound   41.41       N/A
                                          N-Gram      94.97       2.58
                                          USHMM       100         1.71 / 1.71
Remove Single Random Character            D-M Sound   41.96       N/A
                                          N-Gram      93.40       3.46
                                          USHMM       99.97       2.54 / 2.57
Replace Single Random Character           D-M Sound   57.89       N/A
                                          N-Gram      85.02       4.77
                                          USHMM       97.97       3.75 / 3.00
Swap Random Adjacent Pair of Characters   D-M Sound   31.45       N/A
                                          N-Gram      92.06       3.24
                                          USHMM       100         2.15 / 2.01
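One way to produce test queries for the four operations (add, remove, replace, swap) is sketched below. It is an assumption about how such queries could be generated, not the authors' test harness: the `corrupt` function, its parameters, and the default Latin alphabet are hypothetical. Setting `k` above 1 gives a multiple-character variant.

```python
import random
import string


def corrupt(term, operation, k=1, rng=None, alphabet=string.ascii_lowercase):
    """Apply `operation` to `term` k times at random positions."""
    rng = rng or random.Random()
    chars = list(term)
    for _ in range(k):
        if operation == "add":
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(alphabet))
        elif operation == "remove":
            if len(chars) > 1:  # never delete the last remaining character
                del chars[rng.randrange(len(chars))]
        elif operation == "replace":
            if chars:
                i = rng.randrange(len(chars))
                # Choose a different character so the term really changes.
                chars[i] = rng.choice([c for c in alphabet if c != chars[i]])
        elif operation == "swap":
            if len(chars) > 1:
                i = rng.randrange(len(chars) - 1)
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:
            raise ValueError(f"unknown operation: {operation}")
    return "".join(chars)


rng = random.Random(0)  # fixed seed so a run is repeatable
for op in ("add", "remove", "replace", "swap"):
    print(op, corrupt("warsaw", op, rng=rng))
```

Note that a swap of two identical adjacent characters leaves the term unchanged; a stricter harness might resample in that case.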

Multiple Character Correction
Add Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   19.58       N/A
2       N-Gram     92.00       3.45
2       USHMM      99.38       2.55 / 2.42
3       DM Sound   10.69       N/A
3       N-Gram     87.91       4.20
3       USHMM      97.53       3.19 / 3.02
4       DM Sound   6.75        N/A
4       N-Gram     83.86       4.97
4       USHMM      95.04       3.87 / 3.80
Remove Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   20.47       N/A
2       N-Gram     84.79       4.78
2       USHMM      97.83       4.62 / 3.88
3       DM Sound   10.75       N/A
3       N-Gram     74.48       5.77
3       USHMM      92.73       6.41 / 4.80
4       DM Sound   9.70        N/A
4       N-Gram     69.98       6.04
4       USHMM      86.34       7.12 / 5.15

Replace Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   16.80       N/A
2       N-Gram     80.73       4.44
2       USHMM      93.88       4.19 / 3.33
3       DM Sound   9.11        N/A
3       N-Gram     69.23       5.15
3       USHMM      85.83       5.51 / 3.84
4       DM Sound   5.63        N/A
4       N-Gram     57.83       5.94
4       USHMM      75.03       6.79 / 4.78
Swap Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   17.33       N/A
2       N-Gram     54.66       6.92
2       USHMM      71.69       7.55 / 5.46
3       DM Sound   9.19        N/A
3       N-Gram     42.91       7.30
3       USHMM      57.65       8.61 / 6.19
4       DM Sound   7.15        N/A
4       N-Gram     34.42       8.60
4       USHMM      46.31       9.32 / 7.30

Applying operational technology to a medical domain…
Corrected spelling within a Medical Terms Dictionary

Transcription Errors
 “What is a prescribing error?”, J. Quality in Health Care, 2000; 9:232–237.
 “Reducing medication errors and increasing patient safety: Case studies in clinical pharmacology”, J. Clinical Pharmacology, July 2003. vol. 43 no. 7: 768-783.
 “Preventing medication errors in community pharmacy: root-cause analysis of transcription errors”, Quality and Safety in Health Care, 2007;16:285-290.
 “10 strategies for minimizing dispensing errors”, Pharmacy Times, Jan. 20th, 2010
Note: Although many of the transcription errors are not spelling errors, some indeed are!

Medical Term Data Set
Hosford Medical Terms Dictionary v.3.0
 Number of terms: 9,883
 Term length characteristics (in characters):
 Average: 10.58
 Minimum: 2
 Maximum: 30
 Median: 10
 Mode: 10

Single Character Correction
Operation                                 Method      Found (%)   Rank
Add Single Random Character               D-M Sound   38.54       N/A
                                          3-Gram      99.67       1.08
                                          Med-Find    100         1.03 / 1.03
Remove Single Random Character            D-M Sound   44.84       N/A
                                          3-Gram      99.52       1.16
                                          Med-Find    100         1.07 / 1.07
Replace Single Random Character           D-M Sound   62.73       N/A
                                          3-Gram      96.39       1.50
                                          Med-Find    99.54       1.42 / 1.27
Swap Random Adjacent Pair of Characters   D-M Sound   29.99       N/A
                                          3-Gram      98.76       1.19
                                          Med-Find    99.99       1.10 / 1.08
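For orientation, a character 3-gram baseline of the kind compared against here might look like the sketch below. The slides do not say how the 3-gram similarity was computed; the Dice coefficient, the `#` padding, and the helper names are assumptions, and the three sample terms are illustrative rather than taken from the Hosford dictionary.

```python
def char_ngrams(term, n=3):
    """Set of character n-grams, padded so word boundaries contribute grams too."""
    padded = "#" * (n - 1) + term.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b, n=3):
    """Dice coefficient between the n-gram sets of two terms (1.0 = identical sets)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def rank(query, dictionary, n=3):
    """Dictionary terms ordered from most to least similar to the query."""
    return sorted(dictionary, key=lambda term: (-dice(query, term, n), term))


def found_rank(query, target, dictionary, n=3):
    """1-based position of the intended term in the ranking, or None if absent."""
    ranking = rank(query, dictionary, n)
    return ranking.index(target) + 1 if target in ranking else None


terms = ["asthma", "anemia", "angina"]
print(rank("asthna", terms))
print(found_rank("asthna", "asthma", terms))
```

Averaging `found_rank` over many corrupted queries gives a figure comparable in spirit to the Rank column, under the assumption that Rank means the mean position of the intended term.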

Add Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   16.40       N/A
2       3-Gram     98.48       1.29
2       Med-Find   99.55       1.17 / 1.15
3       DM Sound   7.00        N/A
3       3-Gram     97.11       1.46
3       Med-Find   98.36       1.27 / 1.23
4       DM Sound   3.90        N/A
4       3-Gram     94.96       1.86
4       Med-Find   96.79       1.38 / 1.31
Remove Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   19.49       N/A
2       3-Gram     96.21       1.67
2       Med-Find   99.07       1.76 / 1.61
3       DM Sound   8.52        N/A
3       3-Gram     90.29       2.40
3       Med-Find   95.21       2.54 / 2.13
4       DM Sound   3.84        N/A
4       3-Gram     81.83       3.08
4       Med-Find   88.88       3.54 / 2.70

Replace Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   12.34       N/A
2       3-Gram     94.54       1.64
2       Med-Find   97.88       1.57 / 1.40
3       DM Sound   5.41        N/A
3       3-Gram     87.95       2.08
3       Med-Find   92.58       1.90 / 1.64
4       DM Sound   3.05        N/A
4       3-Gram     79.46       2.86
4       Med-Find   85.42       2.19 / 1.78
Swap Multiple Characters
Chars   Method     Found (%)   Rank
2       DM Sound   15.11       N/A
2       3-Gram     76.36       3.99
2       Med-Find   82.51       3.02 / 2.25
3       DM Sound   7.60        N/A
3       3-Gram     61.13       5.95
3       Med-Find   66.85       3.89 / 2.70
4       DM Sound   5.08        N/A
4       3-Gram     48.91       7.51
4       Med-Find   54.22       4.61 / 2.87

Collaborators
Key Personnel
 Michlean Amir – USHMM
 Rebecca Cathey – BAE Systems
 Gideon Frieder – George Washington Univ.
 Jason Soo – Georgetown/MITRE
Many comments by “prototype” users

References
 J. Soo, R. Cathey, O. Frieder, M. Amir, and G. Frieder, “Yizkor Books: A Voice for the Silent Past,” ACM Seventeenth Conference on Information and Knowledge Management (CIKM) – Industrial Track, Napa Valley, California, October 2008.
 J. Soo and O. Frieder, “On Foreign Name Search,” ACM Thirty-Second European Conference on Information Retrieval (ECIR), Milton Keynes, United Kingdom, March 2010.
 J. Soo and O. Frieder, “On Searching Misspelled Collections,” Journal of the Association for Information Science and Technology (JASIST), 66(6), June 2015.
 J. Soo and O. Frieder, “Revisiting Known-Item Retrieval in Degraded Document Collections,” Document Recognition and Retrieval (DRR), San Francisco, California, February 2016.
 J. Soo and O. Frieder, “Searching Corrupted Document Collections,” Twelfth IAPR Document Analysis Systems (DAS), Santorini, Greece, April 2016.

Talk Outline
Searching & Mining Social Media
Searching in Adverse Conditions
Complex Document Information Processing
Computer Science

Motivation
 Public health surveillance
 Demands considerable human efforts
 Often delayed identification
 Typically: need topic of interest
 Ideally: detect without focus
Motivated to expedite detection
 Social media the answer?

Related Efforts
 Social Media
 Known topic problem
 Detection of specific disease (Influenza)
 Correlate occurrence of flu-related words with official Influenza-like-illness data
 Summarize influenza-related tweets
 Complex solutions
 Detect multiple health conditions via complex learning algorithms
 Use access-limited resources
 Query logs

Hypothesis: Generation vs. Validation
 Goal: extract more general health-related information from social media streams
 The Old Way:
 Evaluate a pre-existing hypothesis using SM data
 Q: “Is flu occurring more frequently?”
 A: “Yes”
 Our Way:
 Generate a hypothesis from SM data
 Q: “Are any illnesses occurring more frequently? If so, which ones?”
 A: “Yes, Flu”

Tweet Corpus
 Collected by (JHU)
 2 billion tweets (May 2009 - Oct 2010)
 Filtered multiple times to yield medically related tweets
 Using a 20,000 health-related key-phrase list
 High-recall / low-precision health tweets
 SVM to increase precision
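The first, high-recall filtering pass can be pictured as a plain key-phrase match, as in the sketch below. The function names and the three sample phrases are hypothetical stand-ins for the 20,000-entry list, and the later SVM precision step is not shown.

```python
import re


def build_matcher(key_phrases):
    """Compile one case-insensitive regex that matches any phrase as whole words."""
    # Longest phrases first so multi-word phrases win over their substrings.
    ordered = sorted((re.escape(p.lower()) for p in key_phrases), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(ordered) + r")\b")


def health_candidates(tweets, key_phrases):
    """Keep tweets that mention at least one key phrase (high recall, low precision)."""
    if not key_phrases:
        return []
    matcher = build_matcher(key_phrases)
    return [tweet for tweet in tweets if matcher.search(tweet.lower())]


phrases = ["sore throat", "flu", "fever"]  # illustrative only
tweets = ["Sore throat and fever today", "Great game last night", "I have the flu"]
print(health_candidates(tweets, phrases))
```

A match of this kind cannot tell "I have the flu" from "flu season ticket sales", which is why a classifier is applied afterwards to raise precision.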

Frequent Word Set Identification
 Preprocessing
 Punctuation mark removal
 Text lower-cased & tokenized
 Stop-word removal
 Duplicate term removal
 Medical synonym expansion (MedSyn)
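The preprocessing steps listed above can be sketched as one small function. The stop-word list and the synonym table here are tiny hypothetical stand-ins (MedSyn itself is not reproduced), and treating "expansion" as adding synonyms next to the original term is an assumption.

```python
import string

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "with", "for", "i", "up", "this", "about"}
# Hypothetical stand-in for the MedSyn synonym resource.
SYNONYMS = {"influenza": ["flu"], "pyrexia": ["fever"]}


def preprocess(tweet, stop_words=STOP_WORDS, synonyms=SYNONYMS):
    """Return the ordered, de-duplicated, synonym-expanded terms of one tweet."""
    # 1. Punctuation mark removal (ASCII punctuation only in this sketch).
    no_punct = tweet.translate(str.maketrans("", "", string.punctuation))
    # 2. Lower-case and tokenize on whitespace.
    tokens = no_punct.lower().split()
    # 3. Stop-word removal.
    kept = [t for t in tokens if t not in stop_words]
    # 4 and 5. Synonym expansion, with duplicates removed while keeping order.
    expanded = []
    for term in kept:
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return list(dict.fromkeys(expanded))


print(preprocess("This morning woke up with fever, sore throat, and flu"))
print(preprocess("Influenza, influenza everywhere!"))
```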

Frequent Word Set Identification
#    Tweet Content
T1   Pounding headache, sore throat, low grade fever, flu
T2   Sleep, a perfect cure to forget about the pain!
T3   This morning woke up with fever, sore throat, and flu
T4   Cough, flu, sore throat. I couldn’t ask for a better combination
T5   Got you down? Fever, muscle aches, cough,

Term Set           Support
flu, sore throat   3
fever              3
cough              2
Frequent Term Sets (threshold 3): {{flu, sore throat}, {fever}}
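The support counts in this example can be reproduced with a brute-force sketch. The term lists below are a hand reduction of tweets T1 to T5 (with "sore throat" kept as one term), and reporting only maximal frequent sets is an assumption made to match the slide; at corpus scale an Apriori- or FP-growth-style miner would be the usual choice.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(tweets, threshold, max_size=3):
    """Return maximal term sets whose support (number of tweets) meets the threshold."""
    support = Counter()
    for terms in tweets:
        unique = sorted(set(terms))
        for size in range(1, min(max_size, len(unique)) + 1):
            for combo in combinations(unique, size):
                support[frozenset(combo)] += 1
    frequent = {s: c for s, c in support.items() if c >= threshold}
    # Keep a set only if no frequent proper superset exists.
    return {s: c for s, c in frequent.items()
            if not any(s < other for other in frequent)}


# Hand-reduced terms for T1..T5 (an assumption about the preprocessed output).
tweets = [
    ["headache", "sore throat", "fever", "flu"],
    ["sleep", "cure", "pain"],
    ["fever", "sore throat", "flu"],
    ["cough", "flu", "sore throat"],
    ["fever", "muscle aches", "cough"],
]
for term_set, count in frequent_term_sets(tweets, threshold=3).items():
    print(sorted(term_set), count)
```

With threshold 3 this yields {flu, sore throat} and {fever}; cough has support 2 and is dropped.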

Decide “Is Trending”
$$\mathrm{isTrending}(t) = \mathrm{isFrequent}(t) \;\text{AND}\; \left( \frac{\mathrm{prevalence}(t)}{\mathrm{prevalence}(t-1)} \geq \mathrm{growth\_rate} \right)$$
 Word sets prevalent throughout are irrelevant
 For example: {feel, sick}
 Relevancy: only “Is Trending” word sets interest us.
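A minimal sketch of the trending decision follows. It reads the rule as "frequent now, and prevalence grew by at least growth_rate since the previous period"; the comparison operator, the handling of a zero previous value, the parameter names, and the monthly numbers are all assumptions for illustration.

```python
def is_trending(prevalence, t, min_prevalence, growth_rate):
    """Decide whether a word set trends at period t of its prevalence series."""
    if t < 1:
        return False  # no earlier period to compare against
    current, previous = prevalence[t], prevalence[t - 1]
    is_frequent = current >= min_prevalence
    if previous == 0:
        # Assumption: appearing from nothing counts as growth if frequent enough.
        return is_frequent
    return is_frequent and current / previous >= growth_rate


# Made-up monthly prevalence values, not data from the talk.
feel_sick = [100, 102, 101, 99]   # very frequent but flat
allergies_feel = [5, 6, 20, 45]   # rises sharply in the last two periods

print([is_trending(feel_sick, t, 10, 2.0) for t in range(4)])
print([is_trending(allergies_feel, t, 10, 2.0) for t in range(4)])
```

The flat series never trends despite its high prevalence, while the rising series trends only once it is both frequent and growing.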

Track Word Set Time Series
 Time-series used to determine word sets with a significant increase in prevalence
[Figure: two differing word set tracks by month. {feel, sick}: very frequent, does not trend. {allergies, feel}: trends in April and May. Output: trending decision.]

Query Wikipedia
Query a trending word set in Wikipedia
Why Wikipedia?
Comprehensive range of topics including health topics
Written in layman’s English resembling the tweets considered

Filter Wikipedia Results
Retrieved articles determine if frequent word set is health-related
Health-related nature judged by two metrics:
Ratio of medical tokens in introduction
Presence of International Statistical Classification of Diseases and Related Health Problems (ICD) codes.

Ratio of Medical Tokens
Article health-related if ratio of health tokens in introduction surpasses threshold
Process:
Tokenize introduction
Remove stop words
Count the tokens and medical tokens
If # medical_token / # token > 0.75 then health-related
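The process above can be sketched as follows. The 0.75 threshold comes from the slide; the tokenizer, the stop-word list, the medical vocabulary, and the sample introductions are hypothetical.

```python
import string


def medical_token_ratio(introduction, medical_terms, stop_words):
    """Fraction of non-stop-word tokens in the introduction that are medical terms."""
    tokens = [t.strip(string.punctuation).lower() for t in introduction.split()]
    tokens = [t for t in tokens if t and t not in stop_words]
    if not tokens:
        return 0.0
    return sum(t in medical_terms for t in tokens) / len(tokens)


def is_health_related(introduction, medical_terms, stop_words, threshold=0.75):
    """Apply the slide's rule: health-related if the ratio exceeds the threshold."""
    return medical_token_ratio(introduction, medical_terms, stop_words) > threshold


# Illustrative vocabularies only.
stop_words = {"is", "an", "by", "a", "the", "of"}
medical_terms = {"influenza", "infectious", "disease", "virus"}

print(is_health_related("Influenza is an infectious disease caused by a virus.",
                        medical_terms, stop_words))
print(is_health_related("Paris is the capital of France.", medical_terms, stop_words))
```

In the first sentence four of the five remaining tokens are medical (0.8), so it passes; the second has none.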

ICD Codes
 Health-related Wikipedia articles typically contain an info-box with ICD-9 & ICD-10 codes.
 ICD code – strong health-related indicator
[Figure: a Wikipedia article’s info box and ICD codes]

Detection – 2010 Flu Season
[Figure: tweet time series from June 09 to Oct 10]
[Figure: weekly flu cases in US from June 09 to Oct 10]

Social Media Mining Accuracy
 Landing on Hudson and Mumbai Terror Attack
 Flu Tweets (Lampos and Cristianini 2010; Culotta 2010)
 …
 Hurricane Sandy Coordination Communication
 …
 …
 Fake Celebrity Deaths (Jeff Goldblum)

Sinus (Anatomy)
[Figure: fraction of cumulative signal (%) for Sinus (anatomy); y-axis from 0 to 0.6]

Allergic Response
[Figure: signal (%) for Allergic Response and Sinus (anatomy); y-axis from 0 to 0.6]

Food Allergy
[Figure: signal (%) for Food Allergy, Allergic Response, and Sinus (anatomy); y-axis from 0 to 0.6]

Summary
 Our Approach:
 Filter a corpus to be topic specific
 Identify trending word sets
 Connect multiple trending word sets to topics of interest
 Detect trending topic of interest – Generate Hypotheses

Future Work
 Run framework on a larger scale
 Increase data volume: 2 billion → 200 billion
 Increase temporal resolution: months → weeks → days
 Use resources besides Wikipedia and ICD to filter out non-medically related trending topics
 Detect other types of trends by changing the filters to suit a new topic of interest
 Deploy globally

Collaborators
Key Personnel
 Nazli Goharian – Georgetown University
 Alek Kolcz – Twitter PushD
 Jon Parker – Johns Hopkins/Georgetown MITRE
 Andrew Yates – Georgetown University
Many comments by “prototype” users

References
 A. Yates, J. Parker, N. Goharian, and O. Frieder, “A Framework for Public Health Surveillance,” 9th Language Resources and Evaluation Conference (LREC-2014), Reykjavik, Iceland, May 2014.
 J. Parker, A. Yates, N. Goharian, and O. Frieder, “Health Related Hypothesis Generation using Social Media Data,” Social Network Analysis and Mining, 5(7), March 2015.
 A. Yates, N. Goharian, and O. Frieder, “Learning the Relationships between Drug, Symptom, and Medical Condition Mentions in Social Media,” AAAI 10th International Conference on Web and Social Media (ICWSM), Cologne, Germany, May 2016.
 A. Yates, A. Kolcz, N. Goharian, and O. Frieder, “Effects of Sampling on Twitter Trend Detection,” 10th Language Resources and Evaluation Conference (LREC-2016), Portoroz, Slovenia, May 2016.

Summary
 Complex Document Information Processing
 The whole is greater than the sum of its parts
 Searching is easy
 Unless it is in adverse (misspelled) environments
 Social Media Search: Surveillance in a positive light
 Detecting outbreaks in their infancy
