Looking at Archival Sound: Visual features of a spoken word archive’s web interface that enhance the listening experience
Annie Murray and Jared Wiercinski, Concordia University, Montreal, Quebec, Canada. September 5, 2011. International Association of Sound and Audiovisual Archives (IASA) 42nd Annual Conference, Digital Sense and Nonsense: Digital Decision Making in Sound and Audiovisual Collections, Frankfurt, Germany, 3-8 September 2011
JW What helps researchers listen in deep and engaged ways to poetry that is delivered on the web? Our work considers how visual aspects of web-based archives for poetry recordings can enhance the listening experience for users, so that they can better understand and utilize the recordings. Drawing from studies in a variety of disciplines that demonstrate that much of our learning is multimodal, the SpokenWeb project in Montreal, Canada is using digitized live recordings of a Montreal poetry reading series from 1965-1972 that featured performances by major North American poets. Our project investigates the features that will be most conducive to scholarly engagement with recorded poetry recitation and performance. We are especially interested in the idea that what you look at while you listen can change what you hear. Consequently, we will discuss visual features of sound archives such as tethering audio playback to a written transcript, visualizing sound, and including videos and images.
AM Before getting started on the specific features we are looking at, we thought it important to mention why we feel such collections merit this kind of attention and development: we think they have enormous research potential and may have been unfairly neglected. Since our case study involves poetry, we cite poet and critic Charles Bernstein, who says that:
JW Before examining how certain visual features of an archive’s interface can enhance the listener’s experience, it is helpful to consider the benefits of audio playback on its own. What are the benefits of hearing an audio recording of a spoken word performance (e.g., a poetry reading, speech, or interview) as opposed to reading the same content in a text document? What cognitive, emotional, interpretive or aesthetic benefits can be enjoyed when hearing content in audio form? In his essay Open Ears, Schafer (2011) reminds us of the Latin origins of the word audio:
The Latin word audire (ow-dear-eh) (to hear) has many derivations. One may have an “audience” with the king, that is, a chance to have him hear your petitions. One’s financial affairs are “audited” by an accountant, because originally accounts were read aloud for clarity.
Implicit in these examples, we think, is an understanding that the printed word can be very ambiguous, sometimes to the point of being completely unclear in meaning, and that hearing something read aloud can reduce this ambiguity. When something is read aloud it provides the listener with considerably more information than could be obtained from a written document alone. For example, an audio recording can reveal: changes in the volume, tempo, or pitch of a speaker’s voice; pauses in speech; improvised or changed sections of an established work; the speaker’s emotional tone; the speaker’s intentional and unintentional non-speech sounds (for example: laughter, coughs, breathing, sighing, hesitations or stutters); and finally, sounds originating from the audience (for example: laughter, questions, comments, heckling, applause, silence).
AM We are specifically interested in how visual features paired with the audio can provide complementary cognitive, emotional, and aesthetic benefits. We emphasize “complementary” here, because there are many situations when a user’s experience of an interface would be best characterized as being multimodal, involving more than one sense modality at a time. To explore how visual features can enhance the listener’s experience of a spoken word archive, it was helpful to examine literature from a variety of disciplines, including cognitive science, education, literary studies, and library and information studies.
Cognitive science (JW) There are significant cognitive issues related to the user experience of the audio and visual aspects of a spoken word archive’s web interface. Concerning speech perception, there is a wide range of studies that examine the relationship, sometimes complementary and other times competitive, between audio and visual information. On the complementary side, Theeuwes, van der Burg, Olivers, and Bronkhorst note that “in the 1950s it was shown that the presentation of the face of the speaker can improve speech recognition compared with auditory-only presentation” (2007, p. 196). On the competitive side, there is evidence for the dominant effect of visual information over auditory information. The effect of viewing speech while listening is so strong that if the two are in conflict (for example, they are not synchronized or are otherwise mismatched), what a person hears will be partially or completely determined by what they see (Kislyuk, Möttönen, & Sams, 2008, p. 2175). Concerning desktop user interfaces, Dowell and Shmueli state that the characteristics of visual text have historically made it generally preferable to speech output, and note that Smith and Mosier’s (1986) “[l]ong standing guidelines advise that speech should not be used for presenting complex content and, wherever possible, that a visual display of information should be used instead of speech” (2008, p. 782). To examine this relationship further, their 2008 study investigated how well a user comprehends verbal information under three conditions: presented only visually, presented only as speech, and presented as a visual display combined with speech output. Their results show no difference across conditions for short sentences. For long sentences, however, participants had better results in the visual-only and multimodal conditions than in the speech-only condition.
Education (AM) A number of the education studies we reviewed demonstrated the success of blending sensory modalities to increase comprehension and learning. For example, Chen (2011) showed that learning a second language is easier when videos are subtitled; Ginsberg (1940) showed that it is easier to understand Shakespeare plays when recordings are played alongside text; and Mayer (1997) showed that problem solving is more successful for students who receive both visual and verbal explanations.
Literary studies (AM) Poets and scholars of poetry such as Bernstein (1998, 2009), Eichhorn (2009) and Furr (2010) have drawn attention to the importance of sounded poetry for increased, complementary or even new critical engagement with poems, and have noted the relative marginalization of poetry recordings as a subject for serious literary research. Middleton (2005) and Swigg (2007) suggest that simultaneously listening to poetry and reading/seeing the text of a poem is essential for full comprehension.
Library and information studies (JW) Several reports from the Library and Information Studies literature have examined features that can enhance sound archives. In 2003 Ascensio conducted a user requirement study in order to determine enhancement and development features for moving pictures and sound portals. A selection of the most relevant requests that emerged from the study includes: “Consider making manipulation tools available via the portal” (p. 21); make it possible to “Search on waveform or image characteristics” (p. 30); and portals should “lead to analysis and functional tools (if portal doesn’t have them)” (p. 30). Barrett, Duffy, & Marshalay’s (2004) report on the HOTBED project also includes a user needs analysis. Information from the needs analysis was used to inform the design of their digital music system. Three user needs analysis sessions amongst staff and music students produced a “wish list” of desired features.
A selection of the most relevant of these includes: the ability to “Slow [a recording] down without altering the pitch” (p. 6); and the ability to select phrases and loop them (p. 6). They said the portal should include “Relevant images” (p. 6). They also emphasized that the portal design should take into account that there is a “Visual component to learning by ear” (p. 6). And, finally, users reported that “video is also of great benefit learning aurally...” [that is “a-u” aurally] (p. 10).
JW An excellent example of pairing audio playback with a written transcription of the spoken word content comes from the Radiolab player demo (http://hyper-audio.org/r/), a collaborative design effort from Radiolab, Mozilla, and SoundCloud. This site features both a transcript of the spoken word content and a waveform display, which is a visual representation of the recorded audio signal’s amplitude over time. And a signal’s amplitude, for those who might not know, is the physical correlate of perceived volume (or how loud something sounds). These two features are interactive: a user can click on different positions in either the transcript or the waveform display in order to navigate the content. The site also offers two-way synchronization or tethering, which is to say that clicking on a particular point in the waveform display changes the highlighted position in the transcript, and vice versa. These features provide several advantages to users. First, users can search, browse, or navigate the content using the method that best suits their purposes. For example, when searching for a particular word, phrase, or sentence, they can skim the transcript or use their browser’s search function. Or, they can click on certain points in the waveform display in order to quickly jump to various positions in the audio playback. Secondly, this tethering allows the listener to follow along in a very precise way. The synchronization of the transcript is done in such a way that the exact word being spoken is simultaneously highlighted in the transcript (that is to say, via the “edge” in the transcript between already-played and yet-to-come text). Therefore, if a user were listening to and viewing the example shown, they would simultaneously hear and see the word “and”. [DEMO]
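To make the idea of two-way tethering concrete, here is a minimal sketch in Python. It is not how the Radiolab demo is implemented; it only assumes that each word in a transcript has a known start time. The word timings and the function names `word_index_at` and `seek_time_for` are hypothetical and chosen for illustration.

```python
from bisect import bisect_right

# Hypothetical word-level timings: (start time in seconds, word).
# In practice these would come from manual or automatic alignment.
TIMED_WORDS = [
    (0.0, "We"),
    (0.4, "are"),
    (0.7, "listening"),
    (1.5, "and"),
    (1.8, "reading"),
]

START_TIMES = [start for start, _ in TIMED_WORDS]


def word_index_at(seconds):
    """Audio -> transcript: index of the word being spoken at a playback time."""
    # bisect_right counts the words that have started by `seconds`;
    # the last of those is the word currently being spoken.
    index = bisect_right(START_TIMES, seconds) - 1
    return max(index, 0)


def seek_time_for(index):
    """Transcript -> audio: playback position to jump to when a word is clicked."""
    return START_TIMES[index]


# Playback at 1.6 seconds highlights the word "and";
# clicking the word "listening" seeks playback to 0.7 seconds.
highlighted = TIMED_WORDS[word_index_at(1.6)][1]
seek_to = seek_time_for(2)
```

Keeping the start times in a sorted list means both directions of the tether are cheap: a binary search from time to word, and a direct lookup from word to time.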
AM The SpokenWeb site includes a waveform display via the SoundCloud audio player, which can be useful in a number of ways. First, it allows a user to see how long a recording is, and to understand the current position of playback relative to the entire recording. [DEMO] Waveform displays also allow for improved navigation and browsing. A user can click on different sections of the waveform to move quickly from section to section and hear non-adjacent parts of the recording. [DEMO] In some circumstances users with some experience with waveform displays can navigate the content without even listening to it. This is possible because changes in the volume of the recording typically show up as noticeable changes in the waveform, and often these changes in the waveform indicate different parts or sections in the recording. This screenshot shows where different parts of a recording of a poetry reading map to changes in the waveform: Part A is a musical introduction featuring Indian music; in Part B things quiet down, with singing interspersed with music; in Part C George Bowering introduces Allen Ginsberg; and in Parts D, E, and F, Ginsberg recites separate poems. [DEMO]
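The link between volume changes and visible sections can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the SoundCloud player's method: the recording is a list of samples between -1 and 1, each bar of the display shows the peak amplitude of one bucket of samples, and a section boundary is guessed wherever the peak level changes by a chosen ratio. The function names, the synthetic signal, and the ratio of 2.0 are all hypothetical.

```python
import math


def peak_envelope(samples, bucket_size):
    """Reduce raw samples to one peak amplitude per bucket (one bar of the display)."""
    return [
        max(abs(s) for s in samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


def section_boundaries(envelope, ratio=2.0):
    """Return bucket indexes where the peak level jumps or drops by `ratio` or more."""
    boundaries = []
    for i in range(1, len(envelope)):
        previous, current = envelope[i - 1], envelope[i]
        quieter, louder = min(previous, current), max(previous, current)
        if louder > 0 and louder >= ratio * quieter:
            boundaries.append(i)
    return boundaries


def tone(amplitude, n):
    """Synthetic stand-in for audio: a sine wave of the given amplitude, n samples long."""
    return [amplitude * math.sin(2 * math.pi * 5 * t / 400) for t in range(n)]


# Loud music, then quieter speech, then loud again.
samples = tone(0.9, 400) + tone(0.2, 400) + tone(0.8, 400)
envelope = peak_envelope(samples, bucket_size=100)
print(section_boundaries(envelope))  # prints [4, 8]
```

A real recording would need smoothing before a threshold like this became reliable, since applause, laughter, and pauses also change the level.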
JW An example of sound visualization that utilizes colour to differentiate separate aspects of an audio recording comes from the website Mashup Breakdown, which is a media player designed to clearly show the component parts of musical mashups. A mashup is a song composed entirely of parts taken from existing recordings, often put together in unexpected ways. The site’s interface includes a fairly simple master timeline at the top and uses different coloured blocks, stacked vertically and of varying lengths, to indicate when different samples (differentiated by different colours) enter and exit the main mix. [DEMO?] Further, it’s easy to see how this interface could be adapted for musical education in other genres. The different coloured blocks could be used, for instance, to represent different instruments or musical themes entering and exiting a classical or jazz piece.
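The stacked-block layout described here can be sketched as a small scheduling problem: each sample (or instrument, or theme) occupies a time span, and spans that overlap must be drawn on different rows. The Python below is a minimal sketch of one way to assign rows, not Mashup Breakdown's actual code; the block data and the function name `assign_lanes` are hypothetical.

```python
import heapq


def assign_lanes(blocks):
    """Give each (start, end, label) block the lowest free row, so that
    blocks overlapping in time are stacked vertically."""
    placed = []
    busy = []   # min-heap of (end_time, lane) for rows currently in use
    free = []   # min-heap of row numbers that have been released
    next_lane = 0
    for start, end, label in sorted(blocks):
        # Release rows whose block has finished by the time this one starts.
        while busy and busy[0][0] <= start:
            _, lane = heapq.heappop(busy)
            heapq.heappush(free, lane)
        if free:
            lane = heapq.heappop(free)
        else:
            lane = next_lane
            next_lane += 1
        heapq.heappush(busy, (end, lane))
        placed.append((label, lane))
    return placed


# Hypothetical samples entering and exiting a mix (times in seconds).
blocks = [
    (0, 30, "sample A"),
    (10, 45, "sample B"),
    (30, 60, "sample C"),
    (50, 70, "sample D"),
]
layout = assign_lanes(blocks)
# layout == [("sample A", 0), ("sample B", 1), ("sample C", 0), ("sample D", 1)]
```

Reusing the lowest free row keeps the display compact, which matters when the same layout is applied to a long classical or jazz piece with many entries and exits.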
AM The Variations Audio Timeliner features a basic black line to represent a linear timeline, with a grey marker that moves along the timeline to indicate the current position of playback. On top of the line, however, are coloured bubbles that delineate different sections of a musical work: “Bubbles are the time-spans between timepoints on the line which may be used to represent the phrases, periods, or larger formal sections of music.” Further, the site also features annotations that are timed to display when playback enters a specific bubble or section. These annotations are intended to be used by instructors, so that notes about specific sections of the piece show up at the appropriate time, and are correlated with a visual breakdown of the music in order to improve a student’s comprehension of a musical work’s component parts and of how they interrelate, all while they listen. Finally, it’s easy to see how this interface could be adapted for literary analysis.
JW The RECITE team has experimented with open source speech analysis software named Praat, which uses both waveform and spectrogram displays in order to support phonetic analysis. A spectrogram display is a graphic representation of an audio signal that is often displayed as a graph with three dimensions: time on the horizontal axis, sound wave frequency (i.e., the physical correlate of pitch) on the vertical axis, and a third dimension to represent amplitude (often represented by colour). The RECITE team studied the “visual curves that represent pitch changes” in Eliot’s reading of The Waste Land in order to help “describe the formal significance of Eliot’s mode of poetic recitation”. One of their findings was that Eliot made heavy use of what Camlot refers to as a “drone” style of intonation when reciting his poetry. Camlot (2011) notes that “...a typical declarative sentence will have a high then falling intonation” but that Eliot’s reading has fewer pitch variations when compared with an actor’s reading of the same work. This slide shows a waveform and spectrogram visualisation of an actor’s reading of The Burial of the Dead. [DEMO]
This slide shows Eliot’s own reading of the same work. The markup (in blue) highlights how much flatter Eliot’s intonation is than the actor’s. [DEMO] The RECITE team’s use of Praat suggests, we think, how sound visualisation features, such as a spectrogram display, could be a useful web-interface feature for scholars who are using spoken word audio collections.
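For readers who want to see what lies behind a spectrogram display, the following Python is a minimal sketch of the underlying idea: the signal is cut into short frames, and each frame is analysed for the strength of each frequency. It is not Praat's algorithm (Praat's own analysis is considerably more sophisticated), and it uses a slow, naive Fourier transform for clarity. The function names, the synthetic signals, and the sample rate and frame size are assumptions made for illustration.

```python
import cmath
import math


def spectrogram(samples, frame_size, hop):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per time frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window reduces leakage between neighbouring frequency bins.
        windowed = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size))
            for n, s in enumerate(frame)
        ]
        # Naive discrete Fourier transform, keeping bins up to the Nyquist frequency.
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


def dominant_frequencies(frames, sample_rate, frame_size):
    """Frequency (Hz) of the strongest bin in each frame: a crude pitch contour."""
    return [
        max(range(len(m)), key=m.__getitem__) * sample_rate / frame_size
        for m in frames
    ]


SAMPLE_RATE = 8000
FRAME = 64


def sine(freq, n):
    return [math.sin(2 * math.pi * freq * t / SAMPLE_RATE) for t in range(n)]


# A flat "drone" at one pitch, and a contour that starts high and then falls.
drone = sine(1000, 512)
falling = sine(1500, 256) + sine(500, 256)

drone_contour = dominant_frequencies(spectrogram(drone, FRAME, FRAME), SAMPLE_RATE, FRAME)
falling_contour = dominant_frequencies(spectrogram(falling, FRAME, FRAME), SAMPLE_RATE, FRAME)
```

Here `drone_contour` stays at 1000.0 Hz in every frame, while `falling_contour` drops from 1500.0 Hz to 500.0 Hz. The range of such a contour (its maximum minus its minimum) is one very rough way to put a number on the difference between a flat intonation and a more varied one.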
AM Examples of some visually-oriented archival items that could enhance sound recordings are: photos of the poet, or of people at the reading; promotional brochures, posters, or programs from the event; and clippings from newspapers. Finally, photographs of the original reels or cassettes could be included with digital recordings. This could ground the user in the physical artifacts from the poetry reading. Since users may not have access to fragile analog recordings, the photos of the originals might increase the thrill of archival intimacy in an online space. We can see this on the Library of Congress Jukebox and on the Bob Dylan Audio player where a vinyl record spins as the listener hears digital recordings.
JW Given the scholarly potential of spoken word archives and the rapid development of web technologies to support media, developers of sound archives should consider incorporating visual features to enhance the user experience. Written transcripts tethered to audio playback, sound visualization and accompanying video and images are promising features that could enhance sound archives. In fact, studies in various disciplines attest to the multimodal nature of learning which these features exemplify. It is also important to provide users with control over enhancement features. Developers should provide a site with clean design so unfettered listening can take place, and provide a flexible interface that allows users to select or deselect visual features. Finally, they should respect the needs of users who mainly just want to listen.