Searching in Harsh Environments
Ophir Frieder
Computer Science Dept. | Georgetown University &
Biostatistics, Bioinformatics, & Biomathematics| Georgetown University Medical Center
ophir@ir.cs.georgetown.edu March 2016
Correcting the Search Myth
If it’s search, then Google solved it!
 Some of what Google solved
 Was solved by others first
 Google’s focus is computerized data,
 Much data are not digitized
 Google is hardly a key social media player
 Social media data are everywhere
2
Diverse Search Applications
 Complex Document Information Processing
 The whole is greater than the sum of its parts
 Searching is easy
 Unless it is in adverse (misspelled) environments
 Social Media Search & Surveillance
 Detecting outbreaks in their infancy
3
Talk Outline
Engineering Research
Searching &
Mining Social
Media
Searching in
Adverse
Conditions
Complex
Document
Information
Processing
Computer Science
4
Documents are Complex!
5
 Complex documents include
 handwritten notes,
 diagrams,
 graphics,
 printed or formatted text
 Point solutions exist:
 OCR, Information Retrieval,
Information Extraction,
Image Processing, Text
Clustering, Computational
Stylistics, …
 No definition of state-of-the-art
for the integrated problem
 Manual partitioning/collating:
Expensive, time-consuming,
error-prone
Some are even more complex!
6
Optical character recognition
(OCR)
Document clustering and browsing
Document structure extraction
Extraction from tables/lists
Handwriting analysis and signature
recognition
Figure caption identification
and extraction
Conventional and image retrieval
systems
Entity and relationship extraction
Existing Technology Point Solutions
7
Complex
Document
Images
Layers
OCR
Table Extraction
Logo Extraction
Signature Match
Doc Metadata
Text Extraction
Entity Tagging
CDIP Metadata
Database
Analyst
Integrated
Retrieval
Data
Mining
Enhance
CDIP Processing Architecture
8
Enhancement
9
10
Enhancement
11
Enhancement
12
Enhancement
13
Without Logos: At which institution?
Without Text: What positions do I hold?
Ophir Frieder
McDevitt Prof. of Comp. Sci. & Inf. Proc.
&
Prof. of Biostatistics, Bioinformatics, & Biomathematics
Integration Helps
13
Technology comes and goes
but….
Benchmarks (Collections) are ever (forever) lasting
14
 Cover the richness of inputs
 Range of formats, lengths, & genres
 Variance in print and image quality
 Documents should include:
 Handwritten text and notations
 Diverse fonts
 Graphical elements
 graphs, tables, photos, logos, and diagrams
Test Collection Characteristics
15
 Sufficiently high volume of documents
 Vast volume of redundant & irrelevant documents
 Support diverse applications
 Include private communications within and between
groups planning activities and deploying resources
 Publicly available data!
 Minimal cost
 Minimal licensing
16
Test Collection Characteristics
17
 Data made public via legal proceedings
 Master Settlement Agreement subset of UCSF Legacy
Tobacco Document Library
 Documents scanned by individual companies; hence scan
quality widely varies
 ~ 7 million documents
 ~ 42 million scanned TIFF format pages (~ 1.5 TB)
 ~ 5 GB Metadata
 ~ 100 GB OCR
Dataset: https://ir.nist.gov/cdip/cdip-images/
17
CDIP Test Collection
The CDIP Test Collection
(NIST TREC V1.0)
18
 Used multiple years in TREC Legal Track
 Records (62GB) made available to TREC
participants (through ftp/dvd)
 40 queries simulating legal case investigations
with relevance judgments produced by 35 lawyers.
 Novel queries with relevance judgments generated
by tobacco researchers
 CDIP Benchmark data – as a novel text
test collection for “live scenarios”
 NIST TREC Legal Track, 2006 - 2009
 Housed permanently at NIST
 Complex Document search
 Ground truth difficult
 Hand-checked sub-collection of 800 documents
Evaluation
19
Completed:
Subset of 800 documents
Manually labelled authorship & organizational unit
Evaluated:
Authorship, organizational, monetary, date, and
address-based retrieval tasks
Ongoing:
Subset of 20K documents.
Open Problem:
Performance evaluation (measures) for larger sets
Preliminary Results
20
System Configuration Screen
21
22
Query: ATC Logo + “income forecast” + > $500,000
23
Query: RJR Logo + “filtration efficiency” + signature
24
Query: Five signatures with the highest dollar total
25
Query: Associations of a given person (Dr. D. Stone)
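The example queries above combine evidence from several layers (logo, text, monetary amounts). A minimal sketch of that kind of conjunctive filtering follows. The `DocRecord` schema, its field names, and `integrated_query` are hypothetical illustrations, not the CDIP prototype's actual metadata model.

```python
from dataclasses import dataclass, field


@dataclass
class DocRecord:
    """Hypothetical per-document metadata produced by the extraction layers."""
    doc_id: str
    logos: set[str] = field(default_factory=set)        # logo labels found in the image
    text: str = ""                                       # OCR text
    amounts: list[float] = field(default_factory=list)   # dollar amounts found in the text


def integrated_query(records, logo=None, phrase=None, min_amount=None):
    """Return ids of documents satisfying every constraint that was supplied."""
    hits = []
    for rec in records:
        if logo is not None and logo not in rec.logos:
            continue
        if phrase is not None and phrase.lower() not in rec.text.lower():
            continue
        if min_amount is not None and not any(a > min_amount for a in rec.amounts):
            continue
        hits.append(rec.doc_id)
    return hits
```

Under these assumptions, a call such as `integrated_query(records, logo="ATC", phrase="income forecast", min_amount=500_000)` mirrors the first example query.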
Collaborators
Initial effort
 Gady Agam – Illinois Inst. of Tech.
 Shlomo Argamon – Illinois Inst. of Tech.
 David Doermann – Univ. of Maryland DARPA
 David Grossman – Illinois Inst. of Tech. Grossman Lab
 David D. Lewis – DDL Consulting
 Sargur Srihari – SUNY Buffalo
Ongoing effort
 Gideon Frieder – George Washington Univ.
 Jon Parker – Georgetown Univ. MITRE
26
 S. Argamon, G. Agam, O. Frieder, D. Grossman, D. Lewis, G. Sohn, and K. Voorhees, “A Complex
Document Information Processing Prototype,” ACM SIGIR, 2006.
 D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, “Building a Test Collection for
Complex Document Information Processing,” ACM SIGIR, 2006.
 G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, “Content-Based Document Image
Retrieval in Complex Document Collections,” Document Recognition and Retrieval, 2007.
 G. Bal, G. Agam, O. Frieder, and G. Frieder, “Interactive Degraded Document Enhancement and Ground
Truth Generation,” Document Recognition and Retrieval, 2008.
 T. Obafemi-Ajayi, G. Agam, and O. Frieder, “Historical Document Enhancement Using LUT Classification,”
International Journal on Document Analysis and Recognition, 13(1), March 2010.
 J. Parker, G. Frieder, and O. Frieder, “Automatic Enhancement and Binarization of Degraded Document
Images,” International Conference on Document Analysis and Recognition, 2013.
 J. Parker, G. Frieder, and O. Frieder, "Robust Binarization of Degraded Document Images using
Heuristics," Document Recognition and Retrieval XXI, San Francisco, California, February 2014.
 Parker, et al., “System and Method for Enhancing the Legibility of Degraded Images,” US Patent
#8,995,782, March 31, 2015.
 Frieder, et al., “System and Method for Enhancing the Legibility of Images,” US Patent #9,269,126,
February 23, 2016.
References
27
Talk Outline
Engineering Research
Searching &
Mining Social
Media
Searching in
Adverse
Conditions
Complex
Document
Information
Processing
Computer Science
28
Spelling in Adverse Conditions
 Foreign language (Yizkor Books)
 User unfamiliar with character pronunciation
 Multiple languages within a document
 Domain specific (Medical)
 Terms unfamiliar to the general audience
29
Yizkor Books
 Yizkor = Hebrew word for “remember”
 Firsthand accounts of events that preceded, took place
during, and followed Second World War
 Documents destroyed communities and people who perished
 Started early 1940’s; highest activity in 1960’s and 1970’s
 Published in 13 languages, across 6 continents
 One of the largest collections resides at the USHMM
 Access restricted due to limited number, fragile
state, and prevention of destruction or theft
30
Traditional Access
User requested; archivist driven
 Requires “complete” understanding of books
 High human resource costs
 Inefficient & slow
 Often fails to obtain complete, if any, results
31
Metadata Search Access
Provides an intuitive search capability for
apprehensive but interested users
Creates and queries collection metadata
32
Yizkor Interface
 Centralized index
 Global access
 Efficient search
 Accurate search
 Multi-lingual spelling
correction
33
Search Results
34
Spell Checker
 Upon entering a
misspelled query,
users are presented
with a ranked list of
suggestions
 Percentages
represent similarity to
original query as
measured by our
algorithms
35
Query Processing
Language independent
string manipulation for
auto-correction via a
voting algorithm
36
Language Independent Correction
Simplistic Rules Work ! or ?
 Replace first and last characters by a wild card, in succession;
 Retain only first and last characters and insert a wild card;
 Retain only first and last two characters and insert a wild card;
 Replace middle n-characters by a wild card, in succession;
 Replace first half by a wild card;
 Replace second half by a wild card;
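The rules above can be read as a pattern generator whose matches are pooled by voting. The sketch below is one plausible reading, not the deployed USHMM system: the function names, the fixed middle width `n`, and the choice to count one vote per matching pattern are all assumptions. It uses `fnmatch` with `*` as the wild card and assumes lower-cased terms without glob metacharacters.

```python
import fnmatch


def wildcard_patterns(term: str, n: int = 2) -> list[str]:
    """Generate wildcard patterns from a (possibly misspelled) term, one per rule."""
    if len(term) < 4:
        return [term]  # too short to mask safely
    half = len(term) // 2
    mid = (len(term) - n) // 2
    patterns = [
        "*" + term[1:],                        # replace first character
        term[:-1] + "*",                       # replace last character
        term[0] + "*" + term[-1],              # keep only first and last characters
        term[:2] + "*" + term[-2:],            # keep only first two and last two characters
        term[:mid] + "*" + term[mid + n:],     # replace the middle n characters
        "*" + term[half:],                     # replace first half
        term[:half] + "*",                     # replace second half
    ]
    return list(dict.fromkeys(patterns))       # drop duplicates, keep order


def vote(term: str, lexicon: list[str], n: int = 2) -> list[tuple[str, int]]:
    """Rank lexicon entries by how many patterns they match (assumed voting scheme)."""
    votes: dict[str, int] = {}
    for pattern in wildcard_patterns(term, n):
        for candidate in fnmatch.filter(lexicon, pattern):
            votes[candidate] = votes.get(candidate, 0) + 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

For instance, `vote("warsawa", ["warsaw", "warszawa", "vienna"])` ranks the candidates by vote count; the percentages shown in the interface would need an additional similarity measure that the slides do not specify.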
37
Single Character Correction
Mitton 1996 – “Spellchecking by Computers”

Add Single Random Character
Method     Found (%)  Rank
D-M Sound  41.41      N/A
N-Gram     94.97      2.58
USHMM      100        1.71 / 1.71

Remove Single Random Character
Method     Found (%)  Rank
D-M Sound  41.96      N/A
N-Gram     93.40      3.46
USHMM      99.97      2.54 / 2.57

Replace Single Random Character
Method     Found (%)  Rank
D-M Sound  57.89      N/A
N-Gram     85.02      4.77
USHMM      97.97      3.75 / 3.00

Swap Random Adjacent Pair of Characters
Method     Found (%)  Rank
D-M Sound  31.45      N/A
N-Gram     92.06      3.24
USHMM      100        2.15 / 2.01
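The four single-character corruptions used in this experiment can be sketched as follows. This is an illustrative test-data generator under stated assumptions: the lowercase ASCII alphabet and the function names are not from the talk, and the real experiments covered multilingual terms.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet, for illustration only


def add_char(word: str, rng: random.Random) -> str:
    """Insert one random character at a random position."""
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(ALPHABET) + word[i:]


def remove_char(word: str, rng: random.Random) -> str:
    """Delete one randomly chosen character."""
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def replace_char(word: str, rng: random.Random) -> str:
    """Replace one randomly chosen character with a different one."""
    i = rng.randrange(len(word))
    choices = [c for c in ALPHABET if c != word[i]]
    return word[:i] + rng.choice(choices) + word[i + 1:]


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Swap one random pair of adjacent characters (needs at least 2 characters)."""
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]
```

Passing a seeded `random.Random` keeps a corrupted test set reproducible across runs.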
38
Multiple Character Correction
Add Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  19.58      N/A
2      N-Gram    92.00      3.45
2      USHMM     99.38      2.55 / 2.42
3      DM Sound  10.69      N/A
3      N-Gram    87.91      4.20
3      USHMM     97.53      3.19 / 3.02
4      DM Sound  6.75       N/A
4      N-Gram    83.86      4.97
4      USHMM     95.04      3.87 / 3.80

Remove Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  20.47      N/A
2      N-Gram    84.79      4.78
2      USHMM     97.83      4.62 / 3.88
3      DM Sound  10.75      N/A
3      N-Gram    74.48      5.77
3      USHMM     92.73      6.41 / 4.80
4      DM Sound  9.70       N/A
4      N-Gram    69.98      6.04
4      USHMM     86.34      7.12 / 5.15
39
Multiple Character Correction
Replace Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  16.80      N/A
2      N-Gram    80.73      4.44
2      USHMM     93.88      4.19 / 3.33
3      DM Sound  9.11       N/A
3      N-Gram    69.23      5.15
3      USHMM     85.83      5.51 / 3.84
4      DM Sound  5.63       N/A
4      N-Gram    57.83      5.94
4      USHMM     75.03      6.79 / 4.78

Swap Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  17.33      N/A
2      N-Gram    54.66      6.92
2      USHMM     71.69      7.55 / 5.46
3      DM Sound  9.19       N/A
3      N-Gram    42.91      7.30
3      USHMM     57.65      8.61 / 6.19
4      DM Sound  7.15       N/A
4      N-Gram    34.42      8.60
4      USHMM     46.31      9.32 / 7.30
40
Applying operational technology to a
medical domain…
Corrected spelling within a
Medical Terms Dictionary
41
Transcription Errors
 “What is a prescribing error?”, J. Quality in Health Care, 2000;
9:232–237.
 “Reducing medication errors and increasing patient safety: Case
studies in clinical pharmacology”, J. Clinical Pharmacology, July
2003. vol. 43 no. 7: 768-783.
 “Preventing medication errors in community pharmacy: root-cause
analysis of transcription errors”, Quality and Safety in Health Care,
2007;16:285-290.
 “10 strategies for minimizing dispensing errors”, Pharmacy Times, Jan.
20th, 2010
Note: Although many of the transcription errors are
not spelling errors; some indeed are!
42
Medical Term Data Set
Hosford Medical Terms Dictionary v.3.0
 Number of terms: 9,883
 Term characteristics:
 Average: 10.58
 Minimum: 2
 Maximum: 30
 Median: 10
 Mode: 10
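Summary statistics like those above can be computed directly from a term list. The sketch assumes the reported characteristics are term lengths in characters, which the slide does not state explicitly; `length_stats` is a hypothetical helper name.

```python
import statistics


def length_stats(terms: list[str]) -> dict[str, float]:
    """Summarize term lengths (in characters) for a dictionary of terms."""
    lengths = [len(t) for t in terms]
    return {
        "count": len(lengths),
        "average": round(statistics.mean(lengths), 2),
        "minimum": min(lengths),
        "maximum": max(lengths),
        "median": statistics.median(lengths),
        "mode": statistics.mode(lengths),
    }
```

Running it over the full dictionary file would be the way to check the assumption against the figures on the slide.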
43
Single Character Correction
Add Single Random Character
Method     Found (%)  Rank
D-M Sound  38.54      N/A
3-Gram     99.67      1.08
Med-Find   100        1.03 / 1.03

Remove Single Random Character
Method     Found (%)  Rank
D-M Sound  44.84      N/A
3-Gram     99.52      1.16
Med-Find   100        1.07 / 1.07

Replace Single Random Character
Method     Found (%)  Rank
D-M Sound  62.73      N/A
3-Gram     96.39      1.50
Med-Find   99.54      1.42 / 1.27

Swap Random Adjacent Pair of Characters
Method     Found (%)  Rank
D-M Sound  29.99      N/A
3-Gram     98.76      1.19
Med-Find   99.99      1.10 / 1.08
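For context on the 3-Gram baseline in these tables, here is a minimal character-trigram ranker. The slides do not say which n-gram similarity was used; the Dice coefficient, the `#` padding, and the function names below are assumptions chosen for illustration.

```python
def char_ngrams(term: str, n: int = 3) -> set[str]:
    """Set of character n-grams, padded so word boundaries contribute grams."""
    padded = "#" * (n - 1) + term.lower() + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient between the n-gram sets of two terms."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def rank_candidates(query: str, dictionary: list[str], k: int = 5, n: int = 3) -> list[str]:
    """Return the k dictionary terms most similar to the query, ties broken alphabetically."""
    return sorted(dictionary, key=lambda t: (-dice(query, t, n), t))[:k]
```

With a suggestion list like this, "Found" can be read as the share of corrupted terms whose original appears among the suggestions, and "Rank" as its average position; that reading of the column headers is also an assumption.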
44
Multiple Character Correction
Add Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  16.40      N/A
2      3-Gram    98.48      1.29
2      Med-Find  99.55      1.17 / 1.15
3      DM Sound  7.00       N/A
3      3-Gram    97.11      1.46
3      Med-Find  98.36      1.27 / 1.23
4      DM Sound  3.90       N/A
4      3-Gram    94.96      1.86
4      Med-Find  96.79      1.38 / 1.31

Remove Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  19.49      N/A
2      3-Gram    96.21      1.67
2      Med-Find  99.07      1.76 / 1.61
3      DM Sound  8.52       N/A
3      3-Gram    90.29      2.40
3      Med-Find  95.21      2.54 / 2.13
4      DM Sound  3.84       N/A
4      3-Gram    81.83      3.08
4      Med-Find  88.88      3.54 / 2.70
45
Multiple Character Correction
Replace Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  12.34      N/A
2      3-Gram    94.54      1.64
2      Med-Find  97.88      1.57 / 1.40
3      DM Sound  5.41       N/A
3      3-Gram    87.95      2.08
3      Med-Find  92.58      1.90 / 1.64
4      DM Sound  3.05       N/A
4      3-Gram    79.46      2.86
4      Med-Find  85.42      2.19 / 1.78

Swap Multiple Characters
Chars  Method    Found (%)  Rank
2      DM Sound  15.11      N/A
2      3-Gram    76.36      3.99
2      Med-Find  82.51      3.02 / 2.25
3      DM Sound  7.60       N/A
3      3-Gram    61.13      5.95
3      Med-Find  66.85      3.89 / 2.70
4      DM Sound  5.08       N/A
4      3-Gram    48.91      7.51
4      Med-Find  54.22      4.61 / 2.87
46
Collaborators
Key Personnel
 Michlean Amir – USHMM
 Rebecca Cathey – BAE Systems
 Gideon Frieder – George Washington Univ.
 Jason Soo – Georgetown/MITRE
Many comments by “prototype” users
47
 J. Soo, R. Cathey, O. Frieder, M. Amir, and G. Frieder, “Yizkor Books: A Voice for the Silent Past,” ACM
Seventeenth Conference on Information and Knowledge Management (CIKM) – Industrial Track, Napa
Valley, California, October 2008.
 J. Soo and O. Frieder, “On Foreign Name Search,” ACM Thirty-Second European Conference on
Information Retrieval (ECIR), Milton Keynes, United Kingdom, March 2010.
 J. Soo and O. Frieder, “On Searching Misspelled Collections,” Journal of the Association for Information
Science and Technology (JASIS), 66(6), June 2015.
 J. Soo and O. Frieder, “Revisiting Known-Item Retrieval in Degraded Document Collections," Document
Recognition and Retrieval (DRR), San Francisco, California, February 2016.
 J. Soo and O. Frieder, “Searching Corrupted Document Collections," Twelfth IAPR Document
Analysis Systems (DAS), Santorini, Greece, April 2016.
References
48
Talk Outline
Engineering Research
Searching &
Mining Social
Media
Searching in
Adverse
Conditions
Complex
Document
Information
Processing
Computer Science
49
Motivation
 Public health surveillance
 Demands considerable human efforts
 Often delayed identification
 Typically: need topic of interest
 Ideally: detect without focus
Motivated to expedite detection
 Social media the answer?
50
Related Efforts
 Social Media
 Known topic problem
 Detection of specific disease (Influenza)
 Correlate occurrence of flu-related words with official
Influenza-like-illness data
 Summarize influenza-related tweets
 Complex solutions
 Detect multiple health conditions via complex
learning algorithms
 Use access-limited resources
 Query logs
51
Hypothesis: Generation vs. Validation
 Goal: extract more general health-related information from
social media streams
 The Old Way:
 Evaluate a pre-existing hypothesis using SM data
 Q: “Is flu occurring more frequently?”
 A: “Yes”
 Our Way:
 Generate a hypothesis from SM data
 Q: “Are any illnesses occurring more frequently? If so,
which ones?”
 A: “Yes, Flu”
52
Tweet Corpus
 Collected at Johns Hopkins University (JHU)
 2 billion tweets (May 2009 - Oct 2010)
 Filtered multiple times to yield medically related tweets
 Using a 20,000 health-related key-phrase list
 High-recall / low-precision health tweets
 SVM to increase precision
53
Framework: High Level View
54
Partition Corpus By Time
55
Frequent Word Set Identification
 Preprocessing
 Punctuation mark removal
 Text lower-cased & tokenized
 Stop-word removal
 Duplicate term removal
 Medical synonym expansion (MedSyn)
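The preprocessing steps above can be sketched as a small pipeline. The stop-word list and the synonym table are tiny hypothetical stand-ins (MedSyn itself is not reproduced here), and mapping synonyms to one canonical form before removing duplicates is an assumption about how the expansion is applied.

```python
import string

# Illustrative stand-ins; real lists would be far larger.
STOP_WORDS = {"a", "an", "and", "the", "to", "with", "up", "this", "i", "for", "about"}
SYNONYMS = {"temperature": "fever", "influenza": "flu"}  # hypothetical MedSyn-like table

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation only


def preprocess(tweet: str) -> list[str]:
    """Strip punctuation, lower-case, tokenize, drop stop words, normalize, dedupe."""
    tokens = tweet.translate(_PUNCT_TABLE).lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return list(dict.fromkeys(tokens))  # remove duplicates, keep first-seen order
```

Note that `string.punctuation` covers ASCII only, so typographic quotes and apostrophes common in tweets would need separate handling.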
56
Frequent Word Set Identification
# Tweet Content
T1 Pounding headache, sore throat, low grade fever, flu
T2 Sleep, a perfect cure to forget about the pain!
T3 This morning woke up with fever, sore throat, and flu
T4 Cough, flu, sore throat. I couldn’t ask for a better combination
T5 Got you down? Fever , muscle aches, cough,
Term Set Support
flu, sore throat 3
fever 3
Cough 2
Frequent Term Sets: {{flu, sore throat}, {fever}} -- Threshold 3
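The support counts in this example can be reproduced with a brute-force enumeration. This is a sketch for small inputs, not a scalable frequent-itemset miner such as Apriori or FP-Growth, and the hand-built term sets for T1 to T5 (including treating "sore throat" as one term) are assumptions.

```python
from itertools import combinations


def frequent_term_sets(transactions, min_support, max_size=2):
    """Count every term set up to max_size and keep those meeting min_support."""
    frequent = {}
    for size in range(1, max_size + 1):
        counts = {}
        for terms in transactions:
            for combo in combinations(sorted(terms), size):
                counts[combo] = counts.get(combo, 0) + 1
        frequent.update({c: n for c, n in counts.items() if n >= min_support})
    return frequent


def maximal(frequent):
    """Keep only frequent sets that are not proper subsets of another frequent set."""
    return {s: n for s, n in frequent.items()
            if not any(set(s) < set(other) for other in frequent)}


TWEETS = [
    {"headache", "sore throat", "fever", "flu"},   # T1
    {"sleep", "cure", "pain"},                     # T2
    {"fever", "sore throat", "flu"},               # T3
    {"cough", "flu", "sore throat"},               # T4
    {"fever", "muscle aches", "cough"},            # T5
]
```

Calling `maximal(frequent_term_sets(TWEETS, min_support=3))` keeps {flu, sore throat} and {fever}, in line with the threshold-3 result shown on the slide.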
57
Decide “Is Trending”
isTrending(t) = isFrequent(t) AND ( prevalence(t) / prevalence(t-1) > growth_rate )
 Word sets prevalent throughout - irrelevant
 For example: {feel, sick}
 Relevancy: only “Is Trending” word sets interest us.
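The trending decision can be sketched as a function over a per-period prevalence series. The strict comparison against `growth_rate`, the `min_prevalence` frequency test, and the handling of a zero previous period are assumptions; the slide only gives the overall form of the rule.

```python
def is_trending(prevalence, t, min_prevalence, growth_rate):
    """Decide whether a word set trends in period t (t >= 1).

    prevalence: fraction of tweets containing the word set, one value per period.
    """
    current, previous = prevalence[t], prevalence[t - 1]
    is_frequent = current >= min_prevalence
    if previous == 0:
        # Assumed convention: any appearance after an empty period counts as growth.
        return is_frequent and current > 0
    return is_frequent and current / previous > growth_rate
```

A word set such as {feel, sick} that is prevalent in every period yields a ratio near 1 and is rejected, while a seasonal set such as {allergies, feel} passes in the periods where its prevalence jumps.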
58
Track Word Set Time Series
 Time-series used to determine word sets with a
significant increase in prevalence
Two differing word set tracks by month
59
{feel, sick}
very frequent,
does not trend
{allergies, feel}
trends in April
and May
Trending
Decision
Query a trending word set in Wikipedia
Why Wikipedia?
Comprehensive range of topics including
health topics
Written in layman’s English resembling tweets
considered
60
Query Wikipedia
Filter Wikipedia Results
Retrieved articles determine if frequent
word set is health-related
Health-related nature judged by two
metrics:
Ratio of medical tokens in introduction
Presence of International Statistical
Classification of Diseases and Related
Health Problem (ICD) codes.
61
Ratio of Medical Tokens
Article health-related if ratio of health tokens
in introduction surpasses threshold
Process:
Tokenize introduction
Remove stop words
Count the tokens and medical tokens
If # medical_token / # token > 0.75
then health-related
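The process above maps to a few lines of code. The vocabulary and stop-word sets are supplied by the caller and are hypothetical here; the whitespace tokenizer with punctuation stripping is a simplification of whatever tokenizer the framework uses.

```python
def medical_token_ratio(intro: str, medical_vocab: set[str], stop_words: set[str]) -> float:
    """Fraction of non-stop-word tokens in an article introduction that are medical."""
    tokens = [t.strip(".,;:()\"'").lower() for t in intro.split()]
    tokens = [t for t in tokens if t and t not in stop_words]
    if not tokens:
        return 0.0
    return sum(t in medical_vocab for t in tokens) / len(tokens)


def is_health_related(intro: str, medical_vocab: set[str], stop_words: set[str],
                      threshold: float = 0.75) -> bool:
    """Apply the ratio test with the 0.75 threshold from the slide."""
    return medical_token_ratio(intro, medical_vocab, stop_words) > threshold
```

The threshold is a parameter so that it can be tuned if a different tokenizer or vocabulary shifts the ratio.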
62
ICD Codes
 Health-related Wikipedia articles typically contain
info-box with ICD-9 & ICD-10 codes.
 ICD code – strong health-related indicator
A Wikipedia article’s info box and ICD codes
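A rough way to test for ICD codes is to look for a non-empty ICD field in the info-box wikitext. The field names matched below (`ICD9`, `ICD10`, with or without a hyphen) are an assumption about the info-box template in use, and `has_icd_code` is a hypothetical helper rather than the framework's actual check.

```python
import re

# Matches a non-empty "| ICD9 = ..." or "| ICD10 = ..." info-box field (assumed names).
ICD_FIELD = re.compile(r"\|\s*ICD-?(?:9|10)\s*=[ \t]*\S", re.IGNORECASE)


def has_icd_code(infobox_wikitext: str) -> bool:
    """Return True if the info-box text contains a populated ICD-9 or ICD-10 field."""
    return bool(ICD_FIELD.search(infobox_wikitext))
```

An article passing this check would count as health-related regardless of its medical-token ratio, if the two metrics are combined with OR; the slides do not say how they are combined.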
63
Detection – 2010 Flu Season
Tweet time series from June
09 to Oct 10
Weekly flu cases in US from
June 09 to Oct 10
64
Social Media Mining Accuracy
 Landing on Hudson and Mumbai Terror Attack
 Flu Tweets (Lampos and Cristianini 2010; Culotta 2010)
 …
 Hurricane Sandy Coordination Communication
 …
 …
 Fake Celebrity Deaths (Jeff Goldblum)
65
Sinus (Anatomy)
[Plot: Fraction of Cumulative Signal (%); series: Sinus (anatomy)]
66
Allergic Response
[Plot: Fraction of Cumulative Signal (%); series: Allergic Response, Sinus (anatomy)]
67
Food Allergy
[Plot: Fraction of Cumulative Signal (%); series: Food Allergy, Allergic Response, Sinus (anatomy)]
68
Summary
 Our Approach:
 Filter a corpus to be topic specific
 Identify trending word sets
 Connect multiple trending word sets to topics of interest
 Detect trending topic of interest – Generate Hypotheses
69
Future Work
 Run framework on a larger scale
 Increase data volume: 2 billion → 200 billion
 Increase temporal resolution: months → weeks → days
 Use resources besides Wikipedia and ICD to filter out non-
medically related trending topics
 Detect other types of trends by changing the filters to suit a
new topic of interest
 Deploy globally
70
Collaborators
Key Personnel
 Nazli Goharian – Georgetown University
 Alek Kolcz – Twitter PushD
 Jon Parker – Johns Hopkins/Georgetown MITRE
 Andrew Yates – Georgetown University
Many comments by “prototype” users
71
References
 A. Yates, J. Parker, N. Goharian, and O. Frieder, “A Framework for Public
Health Surveillance,” 9th Language Resources and Evaluation
Conference (LREC-2014), Reykjavik, Iceland, May 2014.
 J. Parker, A. Yates, N. Goharian, and O. Frieder, “Health Related
Hypothesis Generation using Social Media Data,” Social Network
Analysis and Mining, 5(7), March 2015.
 A. Yates, N. Goharian, and O. Frieder, “Learning the Relationships
between Drug, Symptom, and Medical Condition Mentions in Social
Media,” AAAI 10th International Conference on Web and Social
Media (ICWSM), Cologne, Germany, May 2016.
 A. Yates, A. Kolcz, N. Goharian, and O. Frieder, “Effects of Sampling
on Twitter Trend Detection,” 10th Language Resources and
Evaluation Conference (LREC-2016), Portoroz, Slovenia, May 2016.
72
Summary
 Complex Document Information Processing
 The whole is greater than the sum of its parts
 Searching is easy
 Unless it is in adverse (misspelled) environments
 Social Media Search: Surveillance in a positive light
 Detecting outbreaks in their infancy
73
Thanks!
Questions?
74

More Related Content

What's hot

"Reproducibility from the Informatics Perspective"
"Reproducibility from the Informatics Perspective""Reproducibility from the Informatics Perspective"
"Reproducibility from the Informatics Perspective"
Micah Altman
 
Linking Data to Publications through Citation and Virtual Archives
Linking Data to Publications through Citation and Virtual ArchivesLinking Data to Publications through Citation and Virtual Archives
Linking Data to Publications through Citation and Virtual Archives
Micah Altman
 
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESBROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
Micah Altman
 
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
Micah Altman
 
Wilbanks Can We Simultaneously Support Both Privacy & Research?
Wilbanks Can We Simultaneously Support Both Privacy & Research?Wilbanks Can We Simultaneously Support Both Privacy & Research?
Wilbanks Can We Simultaneously Support Both Privacy & Research?
National Information Standards Organization (NISO)
 
Best Practices for Sharing Economics Data
Best Practices for Sharing Economics DataBest Practices for Sharing Economics Data
Best Practices for Sharing Economics Data
Micah Altman
 
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
Micah Altman
 
Critically Assembling Data, Processes & Things: Toward and Open Smart City
Critically Assembling Data, Processes & Things: Toward and Open Smart CityCritically Assembling Data, Processes & Things: Toward and Open Smart City
Critically Assembling Data, Processes & Things: Toward and Open Smart City
Communication and Media Studies, Carleton University
 
Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...
Micah Altman
 
Data Citation Rewards and Incentives
 Data Citation Rewards and Incentives Data Citation Rewards and Incentives
Data Citation Rewards and Incentives
Micah Altman
 
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
La ricerca scientifica nell'era dei Big Data - Sabina LeonelliLa ricerca scientifica nell'era dei Big Data - Sabina Leonelli
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
Ismel - Istituto per la Memoria e la Cultura del Lavoro, dell'Impresa e dei Diritti Sociali
 
A brave new world: student surveillance in higher education
A brave new world: student surveillance in higher educationA brave new world: student surveillance in higher education
A brave new world: student surveillance in higher education
University of South Africa (Unisa)
 
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Micah Altman
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
University of Washington
 
Automating Homelessness
Automating HomelessnessAutomating Homelessness
Open Data is Not Enough: Making Data Sharing Work
Open Data is Not Enough: Making Data Sharing WorkOpen Data is Not Enough: Making Data Sharing Work
Open Data is Not Enough: Making Data Sharing Work
Research Data Alliance
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
University of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 
Writing Analytics for Epistemic Features of Student Writing #icls2016 talk
Writing Analytics for Epistemic Features of Student Writing #icls2016 talkWriting Analytics for Epistemic Features of Student Writing #icls2016 talk
Writing Analytics for Epistemic Features of Student Writing #icls2016 talk
Simon Knight
 
Well-Being - A Sunset Conversation
Well-Being - A Sunset ConversationWell-Being - A Sunset Conversation
Well-Being - A Sunset Conversation
Micah Altman
 

What's hot (20)

"Reproducibility from the Informatics Perspective"
"Reproducibility from the Informatics Perspective""Reproducibility from the Informatics Perspective"
"Reproducibility from the Informatics Perspective"
 
Linking Data to Publications through Citation and Virtual Archives
Linking Data to Publications through Citation and Virtual ArchivesLinking Data to Publications through Citation and Virtual Archives
Linking Data to Publications through Citation and Virtual Archives
 
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCESBROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
BROWN BAG TALK WITH MICAH ALTMAN, SOURCES OF BIG DATA FOR SOCIAL SCIENCES
 
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
Brown Bag: New Models of Scholarly Communication for Digital Scholarship, by ...
 
Wilbanks Can We Simultaneously Support Both Privacy & Research?
Wilbanks Can We Simultaneously Support Both Privacy & Research?Wilbanks Can We Simultaneously Support Both Privacy & Research?
Wilbanks Can We Simultaneously Support Both Privacy & Research?
 
Best Practices for Sharing Economics Data
Best Practices for Sharing Economics DataBest Practices for Sharing Economics Data
Best Practices for Sharing Economics Data
 
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
July IAP: Confidential Information - Storage, Sharing, & Publication - with M...
 
Critically Assembling Data, Processes & Things: Toward and Open Smart City
Critically Assembling Data, Processes & Things: Toward and Open Smart CityCritically Assembling Data, Processes & Things: Toward and Open Smart City
Critically Assembling Data, Processes & Things: Toward and Open Smart City
 
Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...Matching Uses and Protections for Government Data Releases: Presentation at t...
Matching Uses and Protections for Government Data Releases: Presentation at t...
 
Data Citation Rewards and Incentives
 Data Citation Rewards and Incentives Data Citation Rewards and Incentives
Data Citation Rewards and Incentives
 
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
La ricerca scientifica nell'era dei Big Data - Sabina LeonelliLa ricerca scientifica nell'era dei Big Data - Sabina Leonelli
La ricerca scientifica nell'era dei Big Data - Sabina Leonelli
 
A brave new world: student surveillance in higher education
A brave new world: student surveillance in higher educationA brave new world: student surveillance in higher education
A brave new world: student surveillance in higher education
 
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Automating Homelessness
Automating HomelessnessAutomating Homelessness
Automating Homelessness
 
Open Data is Not Enough: Making Data Sharing Work
Open Data is Not Enough: Making Data Sharing WorkOpen Data is Not Enough: Making Data Sharing Work
Open Data is Not Enough: Making Data Sharing Work
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
Writing Analytics for Epistemic Features of Student Writing #icls2016 talk
Writing Analytics for Epistemic Features of Student Writing #icls2016 talkWriting Analytics for Epistemic Features of Student Writing #icls2016 talk
Writing Analytics for Epistemic Features of Student Writing #icls2016 talk
 
Well-Being - A Sunset Conversation
Well-Being - A Sunset ConversationWell-Being - A Sunset Conversation
Well-Being - A Sunset Conversation
 

Viewers also liked

Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...
Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...
Program on Information Science Brown Bag:David Weinberger on Libraries as Pla...
Micah Altman
 
Dulin PermaCC Talk for MIT PIS
Dulin PermaCC Talk for MIT PISDulin PermaCC Talk for MIT PIS
Dulin PermaCC Talk for MIT PIS
Micah Altman
 
BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...BROWN BAG TALK WITH CHAOQUN NI- TRANSFORMATIVE INTERACTIONS IN THE SCIENTIFIC...
MIT Program on Information Science Talk -- Ophir Frieder on Searching in Harsh Environments

  • 1. Searching in Harsh Environments Ophir Frieder Computer Science Dept. | Georgetown University & Biostatistics, Bioinformatics, & Biomathematics| Georgetown University Medical Center ophir@ir.cs.georgetown.edu March 2016
  • 2. Correcting the Search Myth: If it’s search, then Google solved it!
      - Some of what Google solved was solved by others first
      - Google’s focus is computerized data; much data are not digitized
      - Google is hardly a key social media player; social media data are everywhere
  • 3. Diverse Search Applications
      - Complex Document Information Processing: the whole is greater than the sum of its parts
      - Searching is easy, unless it is in adverse (misspelled) environments
      - Social Media Search & Surveillance: detecting outbreaks in their infancy
  • 4. Talk Outline (diagram labels): Engineering Research; Searching & Mining Social Media; Searching in Adverse Conditions; Complex Document Information Processing; Computer Science
  • 6. Complex documents include handwritten notes, diagrams, graphics, and printed or formatted text
      - Point solutions exist: OCR, Information Retrieval, Information Extraction, Image Processing, Text Clustering, Computational Stylistics, …
      - No definition of state-of-the-art for the integrated problem
      - Manual partitioning/collating: expensive, time-consuming, error-prone
      - Some are even more complex!
  • 7. Existing Technology Point Solutions
      - Optical character recognition (OCR)
      - Document clustering and browsing
      - Document structure extraction
      - Extraction from tables/lists
      - Handwriting analysis and signature recognition
      - Figure caption identification and extraction
      - Conventional and image retrieval systems
      - Entity and relationship extraction
  • 8. CDIP Processing Architecture (diagram labels: Complex Document Images; Layers; OCR; Table Extraction; Logo Extraction; Signature Match; Doc Metadata; Text Extraction; Entity Tagging; CDIP Metadata Database; Analyst; Integrated Retrieval; Data Mining; Enhance)
  • 13. Integration Helps
      - Without logos: at which institution?
      - Without text: what positions do I hold?
      - Ophir Frieder, McDevitt Prof. of Comp. Sci. & Inf. Proc. & Prof. of Biostatistics, Bioinformatics, & Biomathematics
  • 14. Technology comes and goes but… Benchmarks (Collections) are ever (forever) lasting
  • 15. Test Collection Characteristics
      - Cover the richness of inputs
      - Range of formats, lengths, & genres
      - Variance in print and image quality
      - Documents should include: handwritten text and notations; diverse fonts; graphical elements (graphs, tables, photos, logos, and diagrams)
  • 16. Test Collection Characteristics
      - Sufficiently high volume of documents
      - Vast volume of redundant & irrelevant documents
      - Support diverse applications
      - Include private communications within and between groups planning activities and deploying resources
      - Publicly available data!
      - Minimal cost
      - Minimal licensing
  • 17. CDIP Test Collection
      - Data made public via legal proceedings
      - Master Settlement Agreement subset of UCSF Legacy Tobacco Document Library
      - Documents scanned by individual companies; hence scan quality varies widely
      - ~ 7 million documents
      - ~ 42 million scanned TIFF format pages (~ 1.5 TB)
      - ~ 5 GB metadata
      - ~ 100 GB OCR
      - Dataset: https://ir.nist.gov/cdip/cdip-images/
  • 18. The CDIP Test Collection (NIST TREC V1.0)
      - Used for multiple years in the TREC Legal Track
      - Records (62 GB) made available to TREC participants (through FTP/DVD)
      - 40 queries simulating legal case investigations, with relevance judgments produced by 35 lawyers
      - Novel queries with relevance judgments generated by tobacco researchers
  • 19. Evaluation
      - CDIP benchmark data as a novel text test collection for “live scenarios”
      - NIST TREC Legal Track, 2006 - 2009
      - Housed permanently at NIST
      - Complex document search
      - Ground truth difficult
      - 800-document hand-checked sub-collection
  • 20. Preliminary Results
      - Completed: subset of 800 documents; manually labelled authorship & organizational unit
      - Evaluated: authorship, organizational, monetary, date, and address-based retrieval tasks
      - Ongoing: subset of 20K documents
      - Open problem: performance evaluation (measures) for larger sets
  • 21. System Configuration Screen
  • 22. Query: ATC Logo + “income forecast” + > $500,000
  • 23. Query: RJR Logo + “filtration efficiency” + signature
  • 24. Query: Five signatures with the highest dollar total
  • 25. Query: Associations of a given person (Dr. D. Stone)
  • 26. Collaborators
      Initial effort
      - Gady Agam – Illinois Inst. of Tech.
      - Shlomo Argamon – Illinois Inst. of Tech.
      - David Doermann – Univ. of Maryland DARPA
      - David Grossman – Illinois Inst. of Tech. Grossman Lab
      - David D. Lewis – DDL Consulting
      - Sargur Srihari – SUNY Buffalo
      Ongoing effort
      - Gideon Frieder – George Washington Univ.
      - Jon Parker – Georgetown Univ. MITRE
  • 27. References
      - S. Argamon, G. Agam, O. Frieder, D. Grossman, D. Lewis, G. Sohn, and K. Voorhees, “A Complex Document Information Processing Prototype,” ACM SIGIR, 2006.
      - D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, “Building a Test Collection for Complex Document Information Processing,” ACM SIGIR, 2006.
      - G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, “Content-Based Document Image Retrieval in Complex Document Collections,” Document Recognition and Retrieval, 2007.
      - G. Bal, G. Agam, O. Frieder, and G. Frieder, “Interactive Degraded Document Enhancement and Ground Truth Generation,” Document Recognition and Retrieval, 2008.
      - T. Obafemi-Ajayi, G. Agam, and O. Frieder, “Historical Document Enhancement Using LUT Classification,” International Journal on Document Analysis and Recognition, 13(1), March 2010.
      - J. Parker, G. Frieder, and O. Frieder, “Automatic Enhancement and Binarization of Degraded Document Images,” International Conference on Document Analysis and Recognition, 2013.
      - J. Parker, G. Frieder, and O. Frieder, “Robust Binarization of Degraded Document Images using Heuristics,” Document Recognition and Retrieval XXI, San Francisco, California, February 2014.
      - Parker, et al., “System and Method for Enhancing the Legibility of Degraded Images,” US Patent #8,995,782, March 31, 2015.
      - Frieder, et al., “System and Method for Enhancing the Legibility of Images,” US Patent #9,269,126, February 23, 2016.
  • 28. Talk Outline (diagram labels): Engineering Research; Searching & Mining Social Media; Searching in Adverse Conditions; Complex Document Information Processing; Computer Science
  • 29. Spelling in Adverse Conditions
      - Foreign language (Yizkor Books)
      - User unfamiliar with character pronunciation
      - Multiple languages within a document
      - Domain specific (Medical)
      - Terms unfamiliar to the general audience
  • 30. Yizkor Books
      - Yizkor = Hebrew word for “remember”
      - Firsthand accounts of events that preceded, took place during, and followed the Second World War
      - Document destroyed communities and people who perished
      - Started in the early 1940s; highest activity in the 1960s and 1970s
      - Published in 13 languages, across 6 continents
      - One of the largest collections resides at the USHMM
      - Access restricted due to limited number, fragile state, and prevention of destruction or theft
  • 31. Traditional Access: user requested; archivist driven
      - Requires “complete” understanding of books
      - High human resource costs
      - Inefficient & slow
      - Often fails to obtain complete, if any, results
  • 32. Metadata Search Access
      - Provides an intuitive search capability for apprehensive but interested users
      - Creates and queries collection metadata
  • 33. Yizkor Interface
      - Centralized index
      - Global access
      - Efficient search
      - Accurate search
      - Multi-lingual spelling correction
  • 35. Spell Checker
      - Upon entering a misspelled query, users are presented with a ranked list of suggestions
      - Percentages represent similarity to the original query as measured by our algorithms
  • 36. Query Processing: language-independent string manipulation for auto-correction via a voting algorithm
  • 37. Language Independent Correction: Simplistic Rules Work (! or ?)
      - Replace first and last characters by a wild card, in succession
      - Retain only first and last characters and insert a wild card
      - Retain only first and last two characters and insert a wild card
      - Replace middle n-characters by a wild card, in succession
      - Replace first half by a wild card
      - Replace second half by a wild card
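
The wildcard rules on this slide, combined with the voting step named on the previous slide, might be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the names `wildcard_patterns` and `suggest` are hypothetical, "first and last two characters" is read as the first two plus the last two characters, "middle n-characters" is read as a centered window of growing width, and each matching pattern is assumed to cast one vote for a dictionary term.

```python
from collections import Counter


def wildcard_patterns(query):
    """Return (prefix, suffix) pairs; each stands for the pattern prefix*suffix."""
    q, n = query, len(query)
    if n < 4:
        return [(q, "")]  # assumption: very short queries become a prefix search
    half = n // 2
    patterns = [
        ("", q[1:]),        # rule 1a: first character replaced by a wildcard
        (q[:-1], ""),       # rule 1b: last character replaced by a wildcard
        (q[0], q[-1]),      # rule 2: keep only the first and last characters
        (q[:2], q[-2:]),    # rule 3: keep the first two and last two (assumed reading)
        ("", q[half:]),     # rule 5: first half replaced by a wildcard
        (q[:half], ""),     # rule 6: second half replaced by a wildcard
    ]
    # Rule 4 (assumed reading): replace a centered middle window of growing width.
    for width in range(1, n - 1):
        start = (n - width) // 2
        patterns.append((q[:start], q[start + width:]))
    return patterns


def suggest(query, dictionary, top_k=5):
    """Rank dictionary terms by how many wildcard patterns they match."""
    votes = Counter()
    for prefix, suffix in wildcard_patterns(query):
        for term in dictionary:
            long_enough = len(term) >= len(prefix) + len(suffix)
            if long_enough and term.startswith(prefix) and term.endswith(suffix):
                votes[term] += 1  # one vote per matching pattern (assumption)
    return [term for term, _ in votes.most_common(top_k)]


# Toy dictionary, made up for illustration.
print(suggest("warsow", ["warsaw", "wroclaw", "krakow", "lodz"]))
```

Because every pattern is plain prefix and suffix matching, the sketch makes no assumption about alphabet or language, which is in the spirit of the "language independent" claim on the slide.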
  • 38. Single Character Correction (Mitton 1996 – “Spellchecking by Computers”)
      Operation                               Method      Found    Rank
      Add Single Random Character             D-M Sound   41.41    N/A
                                              N-Gram      94.97    2.58
                                              USHMM       100      1.71 / 1.71
      Remove Single Random Character          D-M Sound   41.96    N/A
                                              N-Gram      93.40    3.46
                                              USHMM       99.97    2.54 / 2.57
      Replace Single Random Character         D-M Sound   57.89    N/A
                                              N-Gram      85.02    4.77
                                              USHMM       97.97    3.75 / 3.00
      Swap Random Adjacent Pair of Characters D-M Sound   31.45    N/A
                                              N-Gram      92.06    3.24
                                              USHMM       100      2.15 / 2.01
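
The four single-character error types named on this slide (add, remove, replace, swap adjacent) could be simulated roughly as below to produce misspelled test queries. This is a sketch only: the function name `corrupt`, the lowercase ASCII alphabet, and the uniform choice of position are assumptions, and the evaluation reported on the slide may have generated its errors differently.

```python
import random
import string


def corrupt(word, operation, rng):
    """Apply one random single-character error to `word` (needs len(word) >= 2)."""
    alphabet = string.ascii_lowercase  # assumption: a real test would use the collection's alphabets
    if operation == "add":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(alphabet) + word[i:]
    if operation == "remove":
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    if operation == "replace":
        i = rng.randrange(len(word))
        others = [c for c in alphabet if c != word[i]]  # force an actual change
        return word[:i] + rng.choice(others) + word[i + 1:]
    if operation == "swap":
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    raise ValueError(f"unknown operation: {operation}")


rng = random.Random(2016)  # fixed seed so runs are repeatable
for op in ("add", "remove", "replace", "swap"):
    print(op, corrupt("yizkor", op, rng))
```

Note that a swap of two identical adjacent characters leaves the word unchanged, so a fuller test harness would probably want to retry in that case.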
  • 39. Multiple Character Correction
      Add Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   19.58       N/A
        2       N-Gram     92.00       3.45
        2       USHMM      99.38       2.55 / 2.42
        3       DM Sound   10.69       N/A
        3       N-Gram     87.91       4.20
        3       USHMM      97.53       3.19 / 3.02
        4       DM Sound   6.75        N/A
        4       N-Gram     83.86       4.97
        4       USHMM      95.04       3.87 / 3.80
      Remove Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   20.47       N/A
        2       N-Gram     84.79       4.78
        2       USHMM      97.83       4.62 / 3.88
        3       DM Sound   10.75       N/A
        3       N-Gram     74.48       5.77
        3       USHMM      92.73       6.41 / 4.80
        4       DM Sound   9.70        N/A
        4       N-Gram     69.98       6.04
        4       USHMM      86.34       7.12 / 5.15
  • 40. Multiple Character Correction
      Replace Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   16.80       N/A
        2       N-Gram     80.73       4.44
        2       USHMM      93.88       4.19 / 3.33
        3       DM Sound   9.11        N/A
        3       N-Gram     69.23       5.15
        3       USHMM      85.83       5.51 / 3.84
        4       DM Sound   5.63        N/A
        4       N-Gram     57.83       5.94
        4       USHMM      75.03       6.79 / 4.78
      Swap Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   17.33       N/A
        2       N-Gram     54.66       6.92
        2       USHMM      71.69       7.55 / 5.46
        3       DM Sound   9.19        N/A
        3       N-Gram     42.91       7.30
        3       USHMM      57.65       8.61 / 6.19
        4       DM Sound   7.15        N/A
        4       N-Gram     34.42       8.60
        4       USHMM      46.31       9.32 / 7.30
  • 41. Applying operational technology to a medical domain… Corrected spelling within a Medical Terms Dictionary
  • 42. Transcription Errors
      - “What is a prescribing error?”, J. Quality in Health Care, 2000; 9:232–237.
      - “Reducing medication errors and increasing patient safety: Case studies in clinical pharmacology”, J. Clinical Pharmacology, July 2003, vol. 43 no. 7: 768-783.
      - “Preventing medication errors in community pharmacy: root-cause analysis of transcription errors”, Quality and Safety in Health Care, 2007;16:285-290.
      - “10 strategies for minimizing dispensing errors”, Pharmacy Times, Jan. 20th, 2010
      - Note: Although many of the transcription errors are not spelling errors, some indeed are!
  • 43. Medical Term Data Set: Hosford Medical Terms Dictionary v.3.0
      - Number of terms: 9,883
      - Term characteristics: average 10.58; minimum 2; maximum 30; median 10; mode 10
  • 44. Single Character Correction
      Operation                               Method      Found    Rank
      Add Single Random Character             D-M Sound   38.54    N/A
                                              3-Gram      99.67    1.08
                                              Med-Find    100      1.03 / 1.03
      Remove Single Random Character          D-M Sound   44.84    N/A
                                              3-Gram      99.52    1.16
                                              Med-Find    100      1.07 / 1.07
      Replace Single Random Character         D-M Sound   62.73    N/A
                                              3-Gram      96.39    1.50
                                              Med-Find    99.54    1.42 / 1.27
      Swap Random Adjacent Pair of Characters D-M Sound   29.99    N/A
                                              3-Gram      98.76    1.19
                                              Med-Find    99.99    1.10 / 1.08
  • 45. Multiple Character Correction
      Add Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   16.40       N/A
        2       3-Gram     98.48       1.29
        2       Med-Find   99.55       1.17 / 1.15
        3       DM Sound   7.00        N/A
        3       3-Gram     97.11       1.46
        3       Med-Find   98.36       1.27 / 1.23
        4       DM Sound   3.90        N/A
        4       3-Gram     94.96       1.86
        4       Med-Find   96.79       1.38 / 1.31
      Remove Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   19.49       N/A
        2       3-Gram     96.21       1.67
        2       Med-Find   99.07       1.76 / 1.61
        3       DM Sound   8.52        N/A
        3       3-Gram     90.29       2.40
        3       Med-Find   95.21       2.54 / 2.13
        4       DM Sound   3.84        N/A
        4       3-Gram     81.83       3.08
        4       Med-Find   88.88       3.54 / 2.70
  • 46. Multiple Character Correction
      Replace Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   12.34       N/A
        2       3-Gram     94.54       1.64
        2       Med-Find   97.88       1.57 / 1.40
        3       DM Sound   5.41        N/A
        3       3-Gram     87.95       2.08
        3       Med-Find   92.58       1.90 / 1.64
        4       DM Sound   3.05        N/A
        4       3-Gram     79.46       2.86
        4       Med-Find   85.42       2.19 / 1.78
      Swap Multiple Characters
        Chars   Method     Found (%)   Rank
        2       DM Sound   15.11       N/A
        2       3-Gram     76.36       3.99
        2       Med-Find   82.51       3.02 / 2.25
        3       DM Sound   7.60        N/A
        3       3-Gram     61.13       5.95
        3       Med-Find   66.85       3.89 / 2.70
        4       DM Sound   5.08        N/A
        4       3-Gram     48.91       7.51
        4       Med-Find   54.22       4.61 / 2.87
  • 47. Collaborators
      Key Personnel
      - Michlean Amir – USHMM
      - Rebecca Cathey – BAE Systems
      - Gideon Frieder – George Washington Univ.
      - Jason Soo – Georgetown/MITRE
      Many comments by “prototype” users
  • 48. References
      - J. Soo, R. Cathey, O. Frieder, M. Amir, and G. Frieder, “Yizkor Books: A Voice for the Silent Past,” ACM Seventeenth Conference on Information and Knowledge Management (CIKM) – Industrial Track, Napa Valley, California, October 2008.
      - J. Soo and O. Frieder, “On Foreign Name Search,” ACM Thirty-Second European Conference on Information Retrieval (ECIR), Milton Keynes, United Kingdom, March 2010.
      - J. Soo and O. Frieder, “On Searching Misspelled Collections,” Journal of the Association for Information Science and Technology (JASIS), 66(6), June 2015.
      - J. Soo and O. Frieder, “Revisiting Known-Item Retrieval in Degraded Document Collections,” Document Recognition and Retrieval (DRR), San Francisco, California, February 2016.
      - J. Soo and O. Frieder, “Searching Corrupted Document Collections,” Twelfth IAPR Document Analysis Systems (DAS), Santorini, Greece, April 2016.
  • 49. Talk Outline (diagram labels): Engineering Research; Searching & Mining Social Media; Searching in Adverse Conditions; Complex Document Information Processing; Computer Science
  • 50. Motivation
      - Public health surveillance
      - Demands considerable human effort
      - Often delayed identification
      - Typically: need topic of interest
      - Ideally: detect without focus
      - Motivated to expedite detection
      - Social media the answer?
  • 51. Related Efforts
      - Social media
      - Known topic problem
      - Detection of a specific disease (influenza)
      - Correlate occurrence of flu-related words with official influenza-like-illness data
      - Summarize influenza-related tweets
      - Complex solutions
      - Detect multiple health conditions via complex learning algorithms
      - Use access-limited resources
      - Query logs
  • 52. Hypothesis: Generation vs. Validation
      - Goal: extract more general health-related information from social media streams
      - The old way: evaluate a pre-existing hypothesis using SM data
          Q: “Is flu occurring more frequently?”
          A: “Yes”
      - Our way: generate a hypothesis from SM data
          Q: “Are any illnesses occurring more frequently? If so, which ones?”
          A: “Yes, Flu”
  • 53. Tweet Corpus
      - Collected by (JHU)
      - 2 billion tweets (May 2009 - Oct 2010)
      - Filtered multiple times to yield medically related tweets
      - Using a 20,000 health-related key-phrase list
      - High-recall / low-precision health tweets
      - SVM to increase precision
  • 56. Frequent Word Set Identification: Preprocessing
      - Punctuation mark removal
      - Text lower-cased & tokenized
      - Stop-word removal
      - Duplicate term removal
      - Medical synonym expansion (MedSyn)
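
The preprocessing steps listed on this slide can be sketched as a small pipeline. Everything concrete here is an assumption made for illustration: the function name `preprocess`, the tiny `STOP_WORDS` set, and the `SYNONYMS` dictionary standing in for MedSyn, whose actual format and contents are not given in the talk.

```python
import string

# Tiny illustrative stop-word list (hypothetical; a real list would be far larger).
STOP_WORDS = {"a", "about", "and", "for", "i", "the", "this", "to", "up", "with"}

# Hypothetical stand-in for the MedSyn synonym resource.
SYNONYMS = {"flu": ["influenza"]}


def preprocess(tweet):
    """Apply the slide's steps in order and return a list of unique terms."""
    # 1. Punctuation mark removal (ASCII punctuation only in this sketch).
    text = tweet.translate(str.maketrans("", "", string.punctuation))
    # 2. Lower-case and tokenize on whitespace.
    tokens = text.lower().split()
    # 3. Stop-word removal.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Duplicate term removal, keeping first-seen order.
    tokens = list(dict.fromkeys(tokens))
    # 5. Synonym expansion: append synonyms that are not already present.
    for term in list(tokens):
        for synonym in SYNONYMS.get(term, []):
            if synonym not in tokens:
                tokens.append(synonym)
    return tokens


print(preprocess("This morning woke up with fever, sore throat, and FLU flu!"))
```

Whitespace tokenization splits multi-word terms such as "sore throat" into separate tokens; handling them as single terms would need a phrase lexicon, which this sketch leaves out.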
  • 57. Frequent Word Set Identification
      #    Tweet Content
      T1   Pounding headache, sore throat, low grade fever, flu
      T2   Sleep, a perfect cure to forget about the pain!
      T3   This morning woke up with fever, sore throat, and flu
      T4   Cough, flu, sore throat. I couldn’t ask for a better combination
      T5   Got you down? Fever, muscle aches, cough,

      Term Set           Support
      flu, sore throat   3
      fever              3
      cough              2

      Frequent term sets: {{flu, sore throat}, {fever}} (threshold 3)
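
The support counts on this slide can be reproduced with a brute-force sketch like the one below. The per-tweet term sets are hand-extracted from the five example tweets, which is an assumption about what preprocessing would yield; the function name `frequent_term_sets` and the cap on set size are likewise illustrative. A production system would more likely use a dedicated frequent-itemset algorithm such as Apriori or FP-growth.

```python
from collections import Counter
from itertools import combinations

# Terms hand-extracted from the slide's five tweets (assumed preprocessing output).
TWEETS = {
    "T1": {"headache", "sore throat", "fever", "flu"},
    "T2": {"sleep", "cure", "pain"},
    "T3": {"fever", "sore throat", "flu"},
    "T4": {"cough", "flu", "sore throat"},
    "T5": {"fever", "muscle aches", "cough"},
}


def frequent_term_sets(tweets, threshold, max_size=2):
    """Return maximal term sets whose support (tweet count) meets the threshold."""
    support = Counter()
    for terms in tweets.values():
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(terms), size):
                support[frozenset(combo)] += 1
    frequent = {s: c for s, c in support.items() if c >= threshold}
    # Keep only maximal sets: drop any set contained in a larger frequent set.
    return {s: c for s, c in frequent.items()
            if not any(s < other for other in frequent)}


for term_set, count in frequent_term_sets(TWEETS, threshold=3).items():
    print(sorted(term_set), count)
```

With a threshold of 3 this yields {flu, sore throat} and {fever}, matching the slide; {cough} has support 2 and is dropped.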
  • 58. Decide “Is Trending”
      $$\mathrm{isTrending}(t) = \mathrm{isFrequent}(t) \ \text{AND} \ \left( \frac{\mathrm{prevalence}(t)}{\mathrm{prevalence}(t-1)} > \mathrm{growth\_rate} \right)$$
      - Word sets prevalent throughout are irrelevant, for example {feel, sick}
      - Relevancy: “Is Trending” word sets interest us.
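
The trending test on this slide can be illustrated with a short sketch. It assumes the ratio prevalence(t) / prevalence(t-1) must strictly exceed growth_rate and that "frequent" means meeting a minimum support; the function name `is_trending`, the parameter `min_support`, the handling of a zero previous value, and all the monthly counts below are made up for illustration.

```python
def is_trending(prevalence, t, min_support, growth_rate):
    """Return True if the word set is frequent at period t and growing fast enough.

    `prevalence` is a list of counts per period; t must be at least 1.
    """
    if t < 1:
        raise ValueError("t must be >= 1 so that a previous period exists")
    current, previous = prevalence[t], prevalence[t - 1]
    is_frequent = current >= min_support
    if previous == 0:
        # Assumption: a frequent set appearing from nothing counts as trending.
        return is_frequent
    return is_frequent and current / previous > growth_rate


# Made-up monthly counts echoing the talk's two example word sets.
feel_sick = [900, 950, 920, 940]        # always frequent, never grows much
allergies_feel = [40, 60, 300, 420]     # jumps in the later months

print(is_trending(feel_sick, 2, min_support=100, growth_rate=1.3))       # False
print(is_trending(allergies_feel, 2, min_support=100, growth_rate=1.3))  # True
```

The first series shows why frequency alone is not enough: a word set that is always common never crosses the growth threshold.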
  • 59. Trending Decision: Track Word Set Time Series
      - Time series used to determine word sets with a significant increase in prevalence
      - Figure: two differing word set tracks by month
      - {feel, sick}: very frequent, does not trend
      - {allergies, feel}: trends in April and May
  • 60. Query Wikipedia
      - Query a trending word set in Wikipedia
      - Why Wikipedia? Comprehensive range of topics, including health topics; written in layman’s English resembling the tweets considered
  • 61. Filter Wikipedia Results
      - Retrieved articles determine if a frequent word set is health-related
      - Health-related nature judged by two metrics: ratio of medical tokens in the introduction; presence of International Statistical Classification of Diseases and Related Health Problems (ICD) codes
  • 62. Ratio of Medical Tokens
      - Article is health-related if the ratio of health tokens in its introduction surpasses a threshold
      - Process: tokenize introduction; remove stop words; count the tokens and medical tokens
      - If # medical_token / # token > 0.75, then health-related
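
The medical-token ratio test can be sketched as follows. The 0.75 threshold comes from the slide; the function name `is_health_related`, the small `STOP_WORDS` set, and the `MEDICAL_TERMS` lexicon are hypothetical placeholders, since the talk does not say which word lists were used.

```python
import string

# Hypothetical word lists for illustration only.
STOP_WORDS = {"a", "an", "by", "is", "of", "the"}
MEDICAL_TERMS = {"influenza", "infectious", "disease", "virus", "fever", "infection"}


def is_health_related(introduction, threshold=0.75):
    """Return True if medical tokens make up more than `threshold` of the tokens."""
    # Tokenize on whitespace, trimming surrounding punctuation and lower-casing.
    tokens = [t.strip(string.punctuation).lower() for t in introduction.split()]
    # Remove stop words and any tokens left empty after trimming.
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    if not tokens:
        return False  # nothing left to judge
    medical = sum(1 for t in tokens if t in MEDICAL_TERMS)
    return medical / len(tokens) > threshold


print(is_health_related("Influenza is an infectious disease caused by a virus."))  # True
print(is_health_related("Paris is the capital of France."))                        # False
```

In the first example four of the five remaining tokens are in the lexicon (0.8), so the introduction passes; the second has none.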
  • 63. ICD Codes
      - Health-related Wikipedia articles typically contain an info-box with ICD-9 & ICD-10 codes
      - ICD code: strong health-related indicator
      - Figure: a Wikipedia article’s info box and ICD codes
  • 64. Detection – 2010 Flu Season (figure panels: tweet time series from June 09 to Oct 10; weekly flu cases in US from June 09 to Oct 10)
  • 65. Social Media Mining Accuracy
      - Landing on Hudson and Mumbai Terror Attack
      - Flu Tweets (Lampos and Cristianini 2010; Culotta 2010)
      - …
      - Hurricane Sandy Coordination Communication
      - …
      - …
      - Fake Celebrity Deaths (Jeff Goldblum)
  • 69. Summary: Our Approach
      - Filter a corpus to be topic specific
      - Identify trending word sets
      - Connect multiple trending word sets to topics of interest
      - Detect trending topic of interest: generate hypotheses
  • 70. Future Work
      - Run framework on a larger scale
      - Increase data volume: 2 billion → 200 billion
      - Increase temporal resolution: months → weeks → days
      - Use resources besides Wikipedia and ICD to filter out non-medically related trending topics
      - Detect other types of trends by changing the filters to suit a new topic of interest
      - Deploy globally
  • 71. Collaborators
      Key Personnel
      - Nazli Goharian – Georgetown University
      - Alek Kolcz – Twitter PushD
      - Jon Parker – Johns Hopkins/Georgetown MITRE
      - Andrew Yates – Georgetown University
      Many comments by “prototype” users
  • 72. References
      - A. Yates, J. Parker, N. Goharian, and O. Frieder, “A Framework for Public Health Surveillance,” 9th Language Resources and Evaluation Conference (LREC-2014), Reykjavik, Iceland, May 2014.
      - J. Parker, A. Yates, N. Goharian, and O. Frieder, “Health Related Hypothesis Generation using Social Media Data,” Social Network Analysis and Mining, 5(7), March 2015.
      - A. Yates, N. Goharian, and O. Frieder, “Learning the Relationships between Drug, Symptom, and Medical Condition Mentions in Social Media,” AAAI 10th International Conference on Web and Social Media (ICWSM), Cologne, Germany, May 2016.
      - A. Yates, A. Kolcz, N. Goharian, and O. Frieder, “Effects of Sampling on Twitter Trend Detection,” 10th Language Resources and Evaluation Conference (LREC-2016), Portoroz, Slovenia, May 2016.
  • 73. Summary
      - Complex Document Information Processing: the whole is greater than the sum of its parts
      - Searching is easy, unless it is in adverse (misspelled) environments
      - Social Media Search, surveillance in a positive light: detecting outbreaks in their infancy