IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data

Indexing and searching
of noisy data

Franciska de Jong
University of Twente
cluster Human Media Interaction
Enschede, The Netherlands

Erasmus University
Erasmus Studio for e-research
Rotterdam, The Netherlands
http://hmi.ewi.utwente.nl/~fdejong

IMPACT Closing Event - The Hague

Overview

Part I: Noisy data analysis – other examples
Part II: Emerging scenarios of scholarly use
Part III: From noisy (meta)data towards
metadata mining


Noisy Channel for Spelling Correction

J&M Figure 5.23

noise: limitations in spelling skills

Noisy Channel for Speech Recognition

J&M Figure 9.2

noise: limitations in sound captured

Noisy Channel for Machine Translation

J&M Figure 25.15
noise: loss of information through translation

Noisy Channel for OCR

J&M Figure 5.23

noise:
loss of information through typesetting/handwriting
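
The four noisy-channel slides above share one decoding rule. As a brief reminder, it can be written in the standard textbook form below. This formulation is not taken from the slides themselves, and the symbol names are assumptions: $O$ is the observed noisy signal (a misspelled word, an audio signal, a source-language sentence, or a scanned page) and $W$ is the intended text.

```latex
\hat{W} = \operatorname*{argmax}_{W} P(W \mid O)
        = \operatorname*{argmax}_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \operatorname*{argmax}_{W} P(O \mid W)\, P(W)
```

Here $P(O \mid W)$ is the channel model (how the noise distorts $W$) and $P(W)$ is the language model; $P(O)$ can be dropped because it does not depend on $W$.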

Decoding spoken audio
• Audio modelling: collect data on the ground
truth for audio segments
• Language modelling: collect data on
co-occurrences of words
• 100 hours of speech
• Text data (500 M words)

There is no data like more data

After decoding
• multiple hypotheses with varying probabilities
of being correct
• selection from n-best list: errors unavoidable
• post-editing can be an option, but never
without extra costs
– time (editors), money (editing platform)
– complexity of workflow
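
Selection from an n-best list can be illustrated with a minimal Python sketch. The hypotheses, the log scores and the `lm_weight` parameter are all invented for illustration; real recognisers combine their scores in their own ways.

```python
import math

# Hypothetical n-best list: (hypothesis, acoustic log-score, language-model log-score).
# All numbers are invented for illustration only.
N_BEST = [
    ("recognise speech", -12.0, -4.0),
    ("wreck a nice beach", -11.5, -9.0),
    ("recognise peach", -12.5, -7.5),
]

def select_best(n_best, lm_weight=1.0):
    """Return (hypothesis, combined log-score) pairs, best first."""
    scored = [(text, am + lm_weight * lm) for text, am, lm in n_best]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def posteriors(scored):
    """Normalise combined log-scores into probabilities over the n-best list."""
    top = max(score for _, score in scored)
    weights = [math.exp(score - top) for _, score in scored]
    total = sum(weights)
    return [(text, w / total) for (text, _), w in zip(scored, weights)]

ranked = select_best(N_BEST)
best_text = ranked[0][0]
probs = posteriors(ranked)
```

Even the top-ranked hypothesis receives a probability below 1, which is one way to see why errors remain unavoidable after selection.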


Impact of noise on access tasks
• Content/metadata with a certain amount of
errors
• Search with reduced accuracy:
– missed hits (false negatives)
– incorrect hits (false positives; ‘noise’)
• Noisy data less suited for presentation layer
– pdf versus ascii
– original audio versus transcript; alternatives: word
clouds, related content
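
The effect of noise on search accuracy can be made concrete with a small Python sketch. The three-document collection, its texts and the query term are invented for illustration; the noisy versions mimic OCR- or ASR-like errors.

```python
def search(docs, term):
    """Return the ids of documents whose text contains the query term as a word."""
    return {doc_id for doc_id, text in docs.items() if term in text.lower().split()}

# Hypothetical mini-collection: reference text versus noisy text.
REFERENCE = {
    "d1": "the treaty was signed in Utrecht",
    "d2": "a letter sent from Leiden",
    "d3": "trade routes through Utrecht",
}
NOISY = {
    "d1": "the treaty was signed in Utrecht",
    "d2": "a letter sent from Utrecht",    # substitution error: spurious match
    "d3": "trade routes through Utrccht",  # character error: match lost
}

relevant = search(REFERENCE, "utrecht")   # what a search on clean text would find
retrieved = search(NOISY, "utrecht")      # what a search on noisy text finds

false_negatives = relevant - retrieved    # missed hits
false_positives = retrieved - relevant    # incorrect hits
precision = len(relevant & retrieved) / len(retrieved)
recall = len(relevant & retrieved) / len(relevant)
```

In this toy case one hit is missed and one incorrect hit is returned, so both precision and recall drop to one half.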

Access to interviews: transcript generation
[Figure: diagram with the components: metadata; multimedia interview archive; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; search engine; query; result presentation; users: general public, archivists, researchers; automatic metadata extraction (summarization, text mining, tagging)]

Optimization Strategies (1)
• Error correction: post-editing, better
recognition
• Improved recognition
– typically effective for core collections (WER below
20%)
– less effective for the long tail
Case: interviews with Willem Frederik Hermans
• With models for news: 81% WER
• Aim: reduction to around 60%
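
For readers unfamiliar with the WER (word error rate) figures quoted above, here is a minimal Python sketch of the usual definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the example sentences are assumptions made for illustration, not material from the talk.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)

# Invented example: one substitution ("was" -> "is") and one deletion ("in").
wer = word_error_rate(
    "the interview was recorded in amsterdam",
    "the interview is recorded amsterdam",
)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.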

Optimization Strategies (2)
• Dedicated/task-specific evaluation
– for search applications errors in function words are
less critical than errors in e.g. names of persons
and locations
• Dedicated weighting schemes for search tasks
– assign confidence scores to fragments found and
rerank search results accordingly
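
One possible reading of confidence-based reranking is sketched below in Python. The linear interpolation, the `confidence_weight` parameter, the `Hit` type and all scores are assumptions for illustration; the slides do not specify a weighting scheme.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    fragment_id: str
    relevance: float    # hypothetical search-engine score, scaled to 0..1
    confidence: float   # hypothetical recogniser confidence for the matched words, 0..1

def rerank(hits, confidence_weight=0.5):
    """Sort hits by a linear mix of relevance and recognition confidence."""
    def combined(hit):
        return (1 - confidence_weight) * hit.relevance + confidence_weight * hit.confidence
    return sorted(hits, key=combined, reverse=True)

# Invented search results for one query.
hits = [
    Hit("interview-03@00:12:40", relevance=0.90, confidence=0.35),
    Hit("interview-07@00:03:15", relevance=0.80, confidence=0.95),
    Hit("interview-01@00:45:02", relevance=0.60, confidence=0.90),
]
reranked = [hit.fragment_id for hit in rerank(hits)]
```

With the default weight, the fragment with the highest relevance but the lowest confidence drops from first to last place; setting `confidence_weight=0.0` restores the plain relevance ranking.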


Access to interviews: support for users
[Figure: diagram with the components: metadata; multimedia interview archive; automatic speech transcription (speech/non-speech detection, speaker detection, speech recognition); transcripts with time stamps and semantic annotations; search engine; query; result presentation; users: general public, archivists, researchers; automatic metadata extraction (summarization, text mining, tagging)]

Part II: Emerging scenarios of scholarly use


DLs and knowledge discovery
• Focus of attention for analysis is no longer the
document alone.
• Room for statistical methods to analyse entire
collections, archives, libraries.
• Tools that automatically detect and capture
various semantic layers and feed the patterns
found back into the metadata structures.
• Discovery versus item finding: room for
serendipity and data-driven content
exploration.

Paradigm evolution
Examples per paradigm, for science and for information studies:
• Experimental work
– science: direct observation
– information studies: interpretation/decoding of texts
• Theoretical modeling
– science: E = mc², a² + b² = c²
– information studies: S → NP VP, Principle of Compositionality
• Computational simulation modeling
– science: change
– information studies: GIS for visualisation of mobility patterns
• Data-intensive computing
– science: particle physics, astronomy
– information studies: text-mining: cross-document entity linking for cultural heritage libraries; rule-based parsing of large corpora (typology studies)

More than search: metadata
extraction
• For large-scale digital (distributed) collections the
potential added value of automatically generated
metadata is becoming more and more apparent.
• Automatic content labeling:
– not just a matter of speeding up the annotation process and
enlarging the scope of analysis, also
– starting point for generating annotation layers at collection
level, and
– basis for link structures for all kinds of semantic aspects of
content, such as chronological trends, topic shifts, style and
authenticity.
– potentially noisy

“Multi”-issues for DL metadata (1)
• Multi-layer
– beyond tombstone: content description at
fragment level (full text, full content, etc.)
– free text annotation versus thesaurus-based
labeling
• Multiple media formats
– text, text, text
– spoken audio, video, still images, music, scores,
numerical data, sensor data, census data, etc.

“Multi”-issues for DL metadata (2)
• Multiple perspectives
– cover more than local context
– cover more than one domain perspective
– cover more than one language
• Multiple values due to uncertainty
– multiple human annotators
– automatic labeling extracted from potentially
noisy data
– dynamics in collection/context

Scholarly use
• Comparative perspective
– Quantitative and qualitative issues
• Need for enhanced content presentation:
– Multiple layers
– Links to context
– Links to related content
• Emerging methodological shift
– Enhanced collection exploration (think of Google
n-grams)


Part III
From noisy data/metadata towards metadata
mining


Metadata mining: crucial steps
• Treat all annotation types (classical
metadata, automatically extracted
metadata, scholarly annotation, community
tagging) as assets.
• Learn how to integrate the various types and
layers to enhance accessibility and to be able to
exploit the knowledge captured in metadata
– Exploiting manual annotation for machine learning
training
– Detection of collection-level semantic features
– Innovative interface and interaction design

What can metadata mining bring?
• Quality added to metadata for increased accessibility
of content:
– structured search (full text + classification-based)
– navigation across collections, rich presentation layers
• Increased insight in relations between data
collections (across media types, languages, etc.)
• Increased understanding of knowledge production
as captured by metadata and annotation processing
• Support for capturing the essence of association and
analogy.
There is no data like metadata!

Issues for metadata models
Old
• annotation interoperability (e.g., metadata
integration for content annotated with coding
tools such as thesauri and ontologies)
New
• how to capture fuzziness and uncertainty coming
from multiple sources and/or statistical
processing
• coding of change over time (e.g., metadata for
the dynamics of temporal and geo-spatial details)
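
One way to picture the "new" modelling issues is a metadata field that holds several candidate values, each with a source, a confidence and an optional period of validity. The Python sketch below is purely illustrative: the class names, field names, sources and values are all hypothetical and do not correspond to any existing metadata standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Assertion:
    value: str
    source: str                       # e.g. a human annotator or an automatic tagger
    confidence: float                 # 1.0 for values treated as certain
    valid_from: Optional[int] = None  # first year the value applies (None = open)
    valid_to: Optional[int] = None    # last year the value applies (None = open)

@dataclass
class MetadataField:
    name: str
    assertions: List[Assertion] = field(default_factory=list)

    def add(self, assertion):
        self.assertions.append(assertion)

    def values_at(self, year, min_confidence=0.0):
        """Return assertions valid in the given year, most confident first."""
        def covers(a):
            starts_ok = a.valid_from is None or a.valid_from <= year
            ends_ok = a.valid_to is None or year <= a.valid_to
            return starts_ok and ends_ok
        matches = [a for a in self.assertions
                   if covers(a) and a.confidence >= min_confidence]
        return sorted(matches, key=lambda a: a.confidence, reverse=True)

# Invented example: a place name that changes over time, plus a noisy automatic label.
place = MetadataField("place_name")
place.add(Assertion("Oldname", "catalogue", 1.0, valid_to=1900))
place.add(Assertion("Newname", "catalogue", 1.0, valid_from=1901))
place.add(Assertion("Newnane", "ocr-entity-tagger", 0.4, valid_from=1901))

names_1950 = [a.value for a in place.values_at(1950)]
names_1950_strict = [a.value for a in place.values_at(1950, min_confidence=0.5)]
names_1850 = [a.value for a in place.values_at(1850)]
```

Keeping the low-confidence value alongside the curated one, rather than discarding it, leaves the choice of threshold to the application that consumes the metadata.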


Issues for scholarly users
Individual level
• Learn to deal with imperfection
• Understand the limitations of technological
innovation
Community level
• Stay tuned with developers
• Organize methodology teaching
• Study emerging practices
• Share success stories

Issues for developers
• Learn about scholarly practices
• Stay tuned with users during the entire
process
• Organize structured feedback loops
• Study best practices
• Share responsibility for centers of expertise


Issues for e-humanities
• e-humanities is e-research
• multiple media, multiple platforms
• keep connecting!


Contact
• email:
f.m.g.dejong@utwente.nl or
fdejong@ese.eur.nl
• url: http://hmi.ewi.utwente.nl/~fdejong

