1. Humans in the automated annotation loop (and their potential for Artificial Intelligence and Machine Learning)
Jens Edlund, KTH Royal Institute of Technology
Dept. of Speech, Music and Hearing
2. About me
• Speech!
• Assoc. prof. at KTH Speech, Music & Hearing / Director at Språkbanken Tal
• Master in linguistics and phonetics, PhD speech technology, Docent
speech communication
• Full time researcher (+ teaching, management…)
• Mainly human face-to-face interaction (humanities)
• Also Speech technology (technology/computer science)
• Mainly methodology, method development
3. About KTH Speech, Music and Hearing
• A research institution
• One of the oldest speech labs in the world (founded by Gunnar Fant in
1951)
4. Structure of this talk
• About speech (controversy?)
• HITL example 1: Iterative transcription (old hat…)
• Hope to surprise:
• HITL example 2: Getting more from the labour of professionals
• HITL example 3: Labelling and exploring with Edyson (Demo!)
5. Speech vs writing
• Speech is often, but not always, perceived as a special case of writing
• Speech is consistently treated as a special case of writing
• But
• Speech predates writing
• Speech is the most commonly occurring form of language
• Writing is a special case of speech?
• In practice, there are many similarities, but the differences are
huge
6. Some characteristics of speech
• Speech is transient
• It exists only in the present
• This is true, in a sense, even if recorded
• Speech is largely interactive and emergent
• It is created, edited, and understood dynamically
• This is true for read speech as well
• A string of words is a poor representation of the meaning of something spoken
• So, looking for the “text in speech” misses a lot
• There’s more to speech than transcription
7. How (not) to analyse archive speech
• “Turn it into text”
• Automatic transcriptions
• Manual correction
• Annotation
• Text storage
• And dig the audio back down again…
8. But…
• Current methods are not designed for this type of speech
• Text is not speech
• Different use cases call for very different analyses
• There is a very real danger in standardizing too soon
9. HITL example 1:
Analysis as an iterative process
• Automatic transcription
• Produces (erroneous) machine transcriptions
• Manual correction
• Produces (correct) manual transcription
• So we have the sound, a negative example, and a positive example
• This is very good for training
• Take home message: don’t throw away materials that can be used
to improve the automatic methods
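The take-home message above can be sketched as code. This is a minimal, hypothetical illustration (the data structures and function name are mine, not from the talk) of keeping the machine output as a negative example alongside the corrected positive example, instead of discarding it:

```python
# Sketch: keep (utterance, machine transcript, corrected transcript)
# triples so that corrections can feed back into ASR training.
# All names here are illustrative assumptions, not an actual pipeline.

def make_training_triples(machine, corrected):
    """Pair automatic and manually corrected transcriptions.

    machine / corrected: dicts mapping utterance id -> transcript text.
    Returns a list of (utterance_id, negative, positive) examples,
    keeping only utterances where the correction changed something.
    """
    triples = []
    for utt_id, hyp in machine.items():
        ref = corrected.get(utt_id)
        if ref is None:
            continue  # not yet manually corrected; keep for later
        if hyp != ref:
            triples.append((utt_id, hyp, ref))  # negative + positive pair
    return triples
```

Together with the audio itself, each triple gives a training signal: what the recogniser said, and what it should have said.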
12. Edyson
• Edyson is a web based tool that implements and combines several
techniques to achieve generic audio browsing and annotation.
• TDA – temporally disassembled audio
• MMAE – massively multichannel audio environments
• Techniques for organising audio according to some feature space
• These techniques are used, in different configurations, for a range
of other tasks at TMH as well, e.g. perception experiments and
evaluation.
13. Temporally disassembled audio
• In itself nothing new
• Sound sliced into small segments is a staple of speech technology
• The novelty lies in seamlessly flipping between a temporal
(normal) view of sound and some temporally disassembled spatial
view.
• Some minor novelty in using this spatial view for illustration and for HITL
• Example: Google AI experiments.
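A minimal sketch of the disassembly step, assuming a mono signal as a NumPy array (this is not Edyson's implementation, just the staple slicing operation the slide refers to). The key point is keeping each slice's original start time, so the temporal view can always be restored:

```python
import numpy as np

def disassemble(audio, sr, frame_ms=100):
    """Slice a mono signal into fixed-length frames, remembering each
    frame's original start time so the temporal (normal) view of the
    sound can be reassembled from the spatial one."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    # Drop the ragged tail and view the signal as (n_frames, frame_len).
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    start_times = np.arange(n_frames) * frame_ms / 1000.0  # seconds
    return frames, start_times
```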
14. Feature space (ML, AI)
• Features typically extracted from very short speech samples, e.g.
10 or possibly 100 ms
• Often destructive (asymmetrical) extraction
• Cannot go back to the same sound from the feature description
• Features generally meaningless to humans
• In the context of one frame, at least
• Voicing and harmonics, for example, cannot really be seen in a single
sample
• Segments cannot be listened to
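Two classic frame-level features illustrate the "destructive" point: log energy and zero-crossing rate (chosen here as familiar examples; the slide does not prescribe specific features). Neither number lets you reconstruct the waveform, and neither means much to a human looking at a single frame:

```python
import numpy as np

def frame_features(frame):
    """Two lossy per-frame features. The extraction is one-way:
    the waveform cannot be recovered from these numbers."""
    energy = float(np.log(np.sum(frame ** 2) + 1e-10))  # log energy
    # Fraction of adjacent sample pairs where the sign flips.
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)
    return energy, zcr
```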
15. Dimensionality reduction
• High number of features = high number of dimensions
• Hard to visualise
• Hard to get one’s head around at all
• Bad for HITL!
• Common solution
• Dimensionality reduction (to 2 or 3)
• Ordering in 2D that attempts to retain original topography
• Very often image processing techniques!
• (E.g. Google example, image clustering from yesterday)
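As a concrete stand-in for the reduction step, here is a plain PCA projection to 2D built on NumPy's SVD. This is an assumption for illustration: tools like the ones discussed typically use neighbourhood-preserving methods (t-SNE, UMAP) rather than PCA, but the interface is the same, many features in, two coordinates out:

```python
import numpy as np

def pca_2d(features):
    """Project an (n_frames, n_features) matrix to 2 dimensions with
    PCA: centre the data, take the two strongest principal axes."""
    centred = features - features.mean(axis=0)
    # SVD of the centred data; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:2].T  # shape (n_frames, 2)
```

Each frame then becomes a point in a 2D plane that a human can scan, click, and listen to.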
16. Features in summary
• Standard way of treating audio in signal processing/speech tech
• Difficult to understand for humans – removed from acoustic
realisation
• Many existing, strong, methods for sorting, categorising, etc.
• Choice of features, measure of distances, etc. are active research
areas in both signal processing and ML.
• Not our goal here!
17. Listen to 10 hours in 10 minutes?
• How?
• Sampling
• E.g. six 10-second samples from each hour.
• Covers 1/60th of the data, picked at 60 places.
• Huge risk – need not be representative at all.
• Increased validity is linearly correlated with time consumption.
• More than one sound at a time
• With 60 simultaneously, we hear all 10 h in 10 minutes
• Feasible?
• Well maybe. Think cocktail party. But very difficult to make judgements.