Indexing and searching of noisy data

Franciska de Jong
University of Twente, cluster Human Media Interaction, Enschede, The Netherlands
Erasmus University, Erasmus Studio for e-research, Rotterdam, The Netherlands
http://hmi.ewi.utwente.nl/~fdejong




                   IMPACT Closing Event - The Hague                       1
Overview

Part I: Noisy data analysis – other examples
Part II: Emerging scenarios of scholarly use
Part III: From noisy (meta)data towards
          metadata mining




Noisy Channel for Spelling Correction
[Figure: J&M Figure 5.23]
noise: limitations in spelling skills

Noisy Channel for Speech Recognition
[Figure: J&M Figure 9.2]
noise: limitations in sound captured

Noisy Channel for Machine Translation
[Figure: J&M Figure 25.15]
noise: loss of information through translation

Noisy Channel for OCR
[Figure: J&M Figure 5.23]
noise: loss of information through typesetting/handwriting
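The noisy-channel slides share one decoding rule: for an observation o, pick the source word w that maximizes P(w) · P(o | w). Below is a minimal Python sketch for the spelling case. It is an illustration only: the vocabulary, the prior probabilities and the channel model (a fixed penalty per character edit) are assumptions made here, not taken from J&M or from the talk.

```python
import math

# Hypothetical unigram prior P(w); the numbers are invented for illustration.
PRIOR = {"the": 0.6, "then": 0.25, "than": 0.1, "thaw": 0.05}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def channel_log_prob(observed: str, word: str, error_rate: float = 0.1) -> float:
    """Assumed channel model: every edit multiplies the score by error_rate.

    This is an unnormalized stand-in for log P(observed | word).
    """
    return edit_distance(observed, word) * math.log(error_rate)


def decode(observed: str) -> str:
    """Return the word w maximizing log P(w) + log P(observed | w)."""
    return max(
        PRIOR,
        key=lambda w: math.log(PRIOR[w]) + channel_log_prob(observed, w),
    )


print(decode("thn"))
```

The same argmax applies to speech recognition, machine translation and OCR; only the channel model and the unit of observation (audio frames, foreign words, character images) change.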
Decoding spoken audio
• Audio modelling: collect data on the ground
  truth for audio segments
• Language modelling: collect data on
  co-occurrences of words
• 100 hours of speech
• Text data (500 M words)

There is no data like more data
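Language modelling from word co-occurrences can be sketched as a bigram model. The Python below is a minimal illustration with assumptions of its own: maximum-likelihood estimates without smoothing, whitespace tokenization, and a three-sentence toy corpus invented for the example. A production recognizer would use smoothing and far more text, in line with the 500 M words mentioned above.

```python
from collections import Counter

START, END = "<s>", "</s>"  # assumed sentence boundary markers


def train_bigrams(sentences):
    """Count histories and adjacent word pairs in a list of sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent co-occurrences
    return histories, bigrams


def bigram_prob(histories, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for an unseen history."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]


# Toy corpus, invented for illustration.
corpus = [
    "the archive holds interviews",
    "the archive holds newspapers",
    "the library holds books",
]
histories, bigrams = train_bigrams(corpus)
print(bigram_prob(histories, bigrams, "the", "archive"))
```

With so little data most word pairs are unseen and get probability zero, which is the practical reason behind the slogan on the slide.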
After decoding
• multiple hypotheses with varying probabilities
  of being correct
• selection from n-best list: errors unavoidable
• post-editing can be an option, but never
  without extra costs
  – time (editors), money (editing platform)
  – complexity of workflow
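Selection from an n-best list can be sketched as follows. This is a minimal Python illustration, assuming each hypothesis carries a single log score from the decoder; the hypotheses and scores are invented. It also shows why errors are unavoidable: the top-scoring hypothesis is simply the most probable one under the models, not necessarily the correct one.

```python
import math
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    text: str
    log_prob: float  # decoder score; higher means more probable


def select_best(nbest: List[Hypothesis]) -> Hypothesis:
    """Pick the hypothesis with the highest decoder score."""
    return max(nbest, key=lambda h: h.log_prob)


def posteriors(nbest: List[Hypothesis]) -> List[float]:
    """Normalize scores over the list as a rough per-hypothesis confidence."""
    top = max(h.log_prob for h in nbest)
    weights = [math.exp(h.log_prob - top) for h in nbest]  # subtract top for stability
    total = sum(weights)
    return [w / total for w in weights]


# Invented example: the intended utterance is "recognize speech".
nbest = [
    Hypothesis("recognize speech", -4.1),
    Hypothesis("wreck a nice beach", -3.9),
    Hypothesis("recognise peach", -6.0),
]
print(select_best(nbest).text)
print([round(p, 2) for p in posteriors(nbest)])
```

Keeping the normalized scores alongside the transcript is one way to pass uncertainty on to later stages instead of discarding it at selection time.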


Impact of noise on access tasks
• Content/metadata with a certain amount of
  errors
• Search with reduced accuracy:
  – missed hits (false negatives)
  – incorrect hits (false positives; ‘noise’)
• Noisy data less suited for presentation layer
  – PDF versus ASCII
  – original audio versus transcript; alternatives: word
    clouds, related content
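Missed hits and incorrect hits are what recall and precision measure. The Python sketch below compares an exact-match search over clean text with the same search over a noisy version. The three documents and both recognition errors are invented for illustration.

```python
def search(index, term):
    """Return the ids of documents that contain the term as a whole word."""
    return {doc_id for doc_id, text in index.items() if term in text.split()}


def precision_recall(retrieved, relevant):
    """Precision drops with false positives, recall with false negatives."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall


# Ground-truth text versus (invented) recognition output for the same documents.
clean = {
    "d1": "market in rotterdam",
    "d2": "ships leave rotterdam",
    "d3": "fair in amsterdam",
}
noisy = {
    "d1": "market in rotterdam",
    "d2": "ships leave rotterdarn",  # "m" misread as "rn": a missed hit
    "d3": "fair in rotterdam",       # wrong city recognized: an incorrect hit
}

relevant = search(clean, "rotterdam")
retrieved = search(noisy, "rotterdam")
print(precision_recall(retrieved, relevant))
```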
Access to interviews: transcript generation
[Diagram. Components shown:
  – multimedia interview archive, with metadata
  – automatic speech transcription: speech/non-speech detection,
    speaker detection, speech recognition
  – transcripts with time stamps and semantic annotations
  – automatic metadata extraction: summarization, text mining, tagging
  – search engine, query, result presentation
  – users: general public, archivists, researchers]
Optimization Strategies (1)
• Error correction: post-editing, better
  recognition
• Improved recognition
  – typically effective for core collections (WER below
    20%)
  – less effective for the long tail
Case: interviews with Willem Frederik Hermans
• With models for news: 81% WER
• Aim: reduction to around 60%
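WER (word error rate) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A minimal Python sketch, assuming whitespace tokenization and equal cost for substitutions, insertions and deletions; the example sentences are invented:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the interview was recorded in amsterdam"
hypothesis = "the interview is recorded amsterdam"
print(f"{word_error_rate(reference, hypothesis):.0%}")
```

Because insertions count as errors, WER can exceed 100%. An 81% WER therefore means roughly four errors for every five reference words.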
Optimization Strategies (2)
• Dedicated/task-specific evaluation
  – for search applications, errors in function words are
    less critical than errors in e.g. names of persons
    and locations
• Dedicated weighting schemes for search tasks
  – assign confidence scores to fragments found and
    rerank search results accordingly
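Confidence-based reranking can be sketched as a linear mix of the search engine's relevance score and the recognizer's confidence for the matched fragment. The Python below is an illustration under assumptions: both scores lie in [0, 1], the mixing weight is a hypothetical tuning parameter, and the hits are invented.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    fragment_id: str
    relevance: float   # score from the search engine, assumed in [0, 1]
    confidence: float  # recognizer confidence for the matched words, in [0, 1]


def rerank(hits: List[Hit], weight: float = 0.5) -> List[Hit]:
    """Sort hits by (1 - weight) * relevance + weight * confidence, best first.

    weight = 0 reproduces the original ranking by relevance alone.
    """
    def combined(hit: Hit) -> float:
        return (1 - weight) * hit.relevance + weight * hit.confidence

    return sorted(hits, key=combined, reverse=True)


# Invented search results for one query.
hits = [
    Hit("A", relevance=0.9, confidence=0.2),
    Hit("B", relevance=0.8, confidence=0.9),
    Hit("C", relevance=0.5, confidence=0.95),
]
print([h.fragment_id for h in rerank(hits)])
```

Fragment A matches the query best but rests on a doubtful recognition, so it drops below the two fragments the recognizer is more confident about.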



Access to interviews: support for users
[Diagram: same components as on the slide "Access to interviews:
transcript generation".]
Part II: Emerging scenarios of scholarly use




Digital libraries (DLs) and knowledge discovery
• Focus of attention for analysis is no longer the
  document alone.
• Room for statistical methods to analyse entire
  collections, archives, libraries.
• Tools that automatically detect and capture
  various semantic layers and feed the patterns
  found back into the metadata structures.
• Discovery versus item finding: room for
  serendipity and data-driven content
  exploration.
Paradigm evolution

                          Science examples     Information studies examples
Experimental work         direct observation   interpretation/decoding of texts
Theoretical modeling      E = mc²              S → NP VP
                          a² + b² = c²         Principle of Compositionality
Computational modeling    change simulation    GIS for visualisation of mobility
                                               patterns
Data-intensive computing  particle physics,    text-mining: cross-document entity
                          astronomy            linking for cultural heritage
                                               libraries
                                               rule-based parsing of large corpora
                                               (typology studies)
More than search: metadata extraction
• For large-scale digital (distributed) collections the
  potential added value of automatically generated
  metadata is becoming more and more apparent.
• Automatic content labeling:
   – not just a matter of speeding up the annotation process and
     enlarging the scope of analysis, but also
   – a starting point for generating annotation layers at collection
     level, and
   – a basis for link structures for all kinds of semantic aspects of
     content, such as chronological trends, topic shifts, style and
     authenticity.
   – potentially noisy
“Multi”-issues for DL metadata (1)
• Multi-layer
  – beyond tombstone: content description at
    fragment level (full text, full content, etc.)
  – free text annotation versus thesaurus-based
    labeling
• Multiple media formats
  – text, text, text
  – spoken audio, video, still images, music, scores,
    numerical data, sensor data, census data, etc.
“Multi”-issues for DL metadata (2)
• Multiple perspectives
  – cover more than local context
  – cover more than one domain perspective
  – cover more than one language
• Multiple values due to uncertainty
  – multiple human annotators
  – automatic labeling extracted from potentially
    noisy data
  – dynamics in collection/context
Scholarly use
• Comparative perspective
  – Quantitative and qualitative issues
• Need for enhanced content presentation:
  – Multiple layers
  – Links to context
  – Links to related content
• Emerging methodological shift
  – Enhanced collection exploration (think of Google
    n-grams)

Part III: From noisy data/metadata towards metadata mining




Metadata mining: crucial steps
• Treat all annotation types (classical
  metadata, automatically extracted
  metadata, scholarly annotation, community
  tagging) as assets.
• Learn how to integrate the various types and
  layers to enhance accessibility and to be able to
  exploit the knowledge captured in metadata
  – Exploiting manual annotation for machine learning
    training
  – Detection of collection-level semantic features
  – Innovative interface and interaction design
What can metadata mining bring?
• Quality added to metadata for increased accessibility
  of content:
   – structured search (full text + classification-based)
   – navigation across collections, rich presentation layers
• Increased insight into relations between data
  collections (across media types, languages, etc.)
• Increased understanding of knowledge production
  as captured by metadata and annotation processing
• Support for capturing the essence of association and
  analogy.
There is no data like metadata!
Issues for metadata models
Old
• annotation interoperability (e.g., metadata
  integration for content annotated with coding
  tools such as thesauri and ontologies)
New
• how to capture fuzziness and uncertainty coming
  from multiple sources and/or statistical
  processing
• coding of change over time (e.g., metadata for
  the dynamics of temporal and geo-spatial details)
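One possible way to meet both new requirements is to store a metadata field as a list of assertions, each with a source, a confidence and a validity interval. The Python sketch below is only an illustration of that idea, not an existing metadata standard: the type and function names are hypothetical. The city names and years are historical, while the low-confidence automatic label is invented to represent a noisy extraction.

```python
from typing import List, NamedTuple, Optional


class Assertion(NamedTuple):
    value: str
    source: str                       # e.g. "archivist", "ner-tool" (hypothetical)
    confidence: float                 # 1.0 for trusted manual annotation
    valid_from: Optional[int] = None  # first year the value holds (inclusive)
    valid_to: Optional[int] = None    # year the value stops holding (exclusive)


def values_at(assertions: List[Assertion], year: int,
              min_confidence: float = 0.0) -> List[str]:
    """Values valid in the given year, most confident first."""
    live = [
        a for a in assertions
        if (a.valid_from is None or a.valid_from <= year)
        and (a.valid_to is None or year < a.valid_to)
        and a.confidence >= min_confidence
    ]
    return [a.value for a in sorted(live, key=lambda a: a.confidence, reverse=True)]


# A place-name field whose value changes over time.
place_name = [
    Assertion("Petrograd", "archivist", 1.0, 1914, 1924),
    Assertion("Leningrad", "archivist", 1.0, 1924, 1991),
    Assertion("Saint Petersburg", "archivist", 1.0, 1991, None),
    Assertion("Stalingrad", "ner-tool", 0.3, 1924, 1991),  # invented noisy label
]
print(values_at(place_name, 1950))
print(values_at(place_name, 1950, min_confidence=0.5))
```

Keeping conflicting values side by side, rather than forcing one value per field, lets an interface decide per task how much uncertainty to show.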

Issues for scholarly users
Individual level
• Learn to deal with imperfection
• Understand the limitations of technological
  innovation
Community level
• Stay in tune with developers
• Organize methodology teaching
• Study emerging practices
• Share success stories
Issues for developers
• Learn about scholarly practices
• Stay in tune with users during the entire
  process
• Organize structured feedback loops
• Study best practices
• Share responsibility for centers of expertise



Issues for e-humanities
• e-humanities is e-research
• multiple media, multiple platforms
• keep connecting!




Contact
• email:
  f.m.g.dejong@utwente.nl or
  fdejong@ese.eur.nl
• url:     http://hmi.ewi.utwente.nl/~fdejong





IMPACT Final Event 26-06-2012 - Franciska de Jong - Indexing and searching of ‘noisy’ data
