Australians at War Oral History Digitisation: Current
Transcription Practices and Emerging Technologies
Recently DAMsmart digitised 11,000 hours of war veteran interviews
from all major conflicts in Australian modern history. In addition,
DAMsmart created and delivered a Media Asset Management system
populated with the digitised records for delivery to the Australian
Defence Force Academy. The commissioned system tailored by
DAMsmart brings together existing information and oral history
transcription data into one media rich asset management system. In
this presentation Andrew Martin will explore the digitisation process,
challenges and achievements in building a turnkey media/archive
system, and explore new technologies that will open up the search
and find capabilities of media archives.
Australians at War Film Archive: Current Transcription Practices and Emerging Technologies.
1. We Get your Content future ready.
Australians at War Film Archive
Oral History Digitisation: Current Transcription Practices and Emerging Technologies.
7. www.DAMsmart.com.au
Credits
Produced by:
MULLION CREEK PRODUCTIONS PTY LTD
Project Director:
MICHAEL CAULFIELD
Producer:
LIZ BUTLER
Consultant Historians:
PROFESSOR JOAN BEAUMONT
DR MICHAEL MCKERNAN
DR JOHN REEVE
DR RICHARD REID
DR PETER STANLEY
DR ALAN STEPHENS
PRODUCTION TEAM:
Production Supervisor:
KYLIE FLEMING
Production Managers:
BRYONY KING
ANNELLA POWELL
TRACEY SHARP
Production Coordinators:
TANIA HORNE
ANNELLA POWELL
TRACEY SHARP
WENDY TRUELOVE
CORINA YIANNOUKAS
Production Accountant:
JOHN RUSSELL
Production Assistants:
LISA CAMILLERI
LEAH GIBSON
JOANNE MARTIN
JADE SUINE
LUCY WATERER
LOUISE WHALLEY
KRISTY WILSON
Researchers:
BRETT BARLOW
PHILLIPA CANNON
VICKI ESLICK
SARAH GURICH
BRADLEY HAMMOND
ANGELA HAMMOND
SERENA PORGES
BRONWYN REED
ELIZABETH HALLORAN RICHARDS
Dubbing Officers:
BARRY ELVERD
KIEREN ROBISON
Senior Supervising Editors:
DIANNE BRAMICH
MICHAEL CAULFIELD
MICHELE CUNNINGHAM
DR WAYNE GEERLING
Supervising Editors:
DIANNE BRAMICH
KEN BURSLEM
KIT CANDLIN
MICHELE CUNNINGHAM
CATHERINE DYSON
CHRISTOPHER ELEY
KATE HABGOOD
RON HARPER
CHRISTOPHER HOUGHTON
CHRISTOPHER KEATING
HILARY MCGEACHY
MYLES MCMULLEN
JEANETTE RIMMER
JOHN ROBBINS
JOANNE STEWART
CRAIG TIBBITTS
LEONARD HARRY WISE
Transcription Editors:
GEMMA BATTERSBY
KATE BATTERSBY
DALE BLAIR
PHILLIP BRADLEY
DIANNE BRAMICH
KEN BURSLEM
COLIN CAIRNES
KIT CANDLIN
MICHELE CUNNINGHAM
KIRSTY DE GARIS
CATHERINE DYSON
LUCINDA EDSELIUS
DARREN ELDER
CHRISTOPHER ELEY
ROD FAULKNER
KATE HABGOOD
MAT HARDY
RON HARPER
ROSALIND HEARDER
CHRISTOPHER HOUGHTON
DR JUDITH JEFFERY
CHRISTOPHER KEATING
JOHN KERR
MATTHEW LIBBIS
CHRISTOPHER LINKE
IAN MACKAY
HILARY MCGEACHY
MYLES MCMULLEN
JAMES MORRIS
LOUISE PASCALE
TRISH PATON
CATHY PRYOR
JEANETTE RIMMER
JOHN ROBBINS
KATHY SPORT
JOANNE STEWART
VANESSA STUART
CRAIG TIBBITTS
JOSH WADDELL
ALAN WILSON
LEONARD HARRY WISE
Camera Training Consultants:
PETER COLEMAN
KATHRYN MILLISS
Transcribers:
SUE BARTIMOTE
MATTHEW BIENEK
ELLA BOWMAN
MARJORY BRADLEY
ALISON BRUCE
ALISON BURGE
HELEN CARVER
MELISSA CAULFIELD
BARBARA DADD
AMANDA DRAKE
KRISTINA GOTTSCHALL
ANGELA GRAY
GRAHAM JOHNS
SHARON JOHNSON
CLAIRE JONES
KERRY KLEMENS
DAVID MARTIN
DARRYL MASON
RACHEL MEEHAN
FRANCES MILLER
LOUISE MUDDLE
KIM O'DONNELL
JESSICA RICHARDS
KAREN SIMS
KATE SMITH
MARYANNE SMITH
MONICA STEFFAN
JUSTINE WILLIAMS
Additional Transcription Services:
CLEVER TYPES
THE LAST DRAFT PTY LTD
INTERVIEWING TEAM:
Interviewers:
JULIAN ARGUS - 3.5
MARTIN BALL - 1
REBECCA BARRY - 2
MICHAEL BENNETT - 3
DENISE BLAZEK - 3.5
COLIN CAIRNES - 4
ELLEN CARPENTER - 2
LOUISE CHARMAN - 1
KIRSTY DE GARIS - 1
SERGEI DE SILVA-RANASINGHE - 5
SIMON DIKKENBERG - 2
CATHERINE DYSON - 4
CHRISTOPHER ELEY - 5
KEIRNAN FITZPATRICK - 4
ISABEL FOX - 2
ROSEMARY FRANCIS - 1
KYLIE GREY - 1.5
ZELDA GRIMSHAW - 1
MATTHEW HARDY - 2
NAOMI HOMEL - 3
CHRISTOPHER HOUGHTON - 3.5
IANTO KELLY - 2
SEAN KENNEDY - 1.5
STELLA KINSELLA - 2
ANNIE LETCH - 1
DAVID LEVELL - 1
DENE MASON - 1
CLAIRE MCCARTHY - 1
NICOLE MCCUAIG - 2
MYLES MCMULLEN - 2
COLIN MOWBRAY - 2
KRISTEN MURRAY - 1
PATRICK NOLAN - 1
KAREN NOBES - 1
ROBERT NUGENT - 2.5
LOUISE PASCALE - 2
HEATHER PHILLIPS - 2.5
CATHY PRYOR - 2
SOPHIE RELF - 1
SUE ROBERTS - 1
CHRISTOPHER SALISBURY - 1
GRAHAM SHIRLEY - 2
KATHY SPORT - 5
World War I
World War II
The Occupation of Japan
The Korean War
The Malayan Emergency
The Indonesian Confrontation
The Vietnam War
Gulf War 1
The War Against Terror
Conflicts in Iraq and Afghanistan
1 Interview Containing Multiple Tapes
01:00:00:00
02:00:00:00
03:00:00:00
Tape 1
01:00:32:00
Q: Can you give us a summary of your life?
A: I was born on the 19th of August, 1917 in Windsor. My father was
away at the war and I didn't see him until I was two. I grew up, on
his return - in 1919, we
01:01:00:00
moved to Caulfield where I lived until the Second World War. I went
to Caulfield North State School. And the local church, and had a
pleasant childhood although with memories of war. In 1937,
01:01:30:00
I'll go back a bit, I attended Caulfield North Central School, became
dux there and then went to Wesley like my father on a scholarship. I
changed careers - choices - while at Wesley, and decided to become
an actuary. I joined the militia in 1937 and I was a sergeant by the
time the war broke out.
01:02:00:00
I was in the 2nd Medium Brigade and was seconded to other
regiments for particular reasons. The 15th Field Artillery Regiment.
Later I returned to service with the Mediums
01:02:30:00
but in 19 - I was commissioned in 1940, too, I was transferred to the
2nd Field Artillery Training Regiment in Puckapunyal, later moving
to Greta. And I tried to get into the air force at one stage, but the
war took an unusual turn. I
01:03:00:00
was - people were competing, well I was competing for where to
serve. Although I tried to get into the air force I finally was
transferred to Melbourne to become a ballistician in the last two
years of the war, helping to make range stables for the short 25
pounder gun. During the war, in September '41, I married and had
01:03:30:00
a wonderful marriage lasting nearly 54 years. After the war I
resumed my actuarial studies and qualified; I became manager of
the Collective Insurance. At the age of 60 I was
A reader was needed to conform the timecode in human-typed text, which took many variant forms, e.g.:
01:00:01;30
000:01:30
11,000 hrs of content = 1,320,000 30-second timecode points!
01:03:00:00
it leads back to the, the fogey, old Masters I had at Melbourne Grammar
School.
Q:
Can I ask how your mother came to be living in Australia and where she
came from in England?
A:
I don’t know where she came from but her mother was an English
widow and she had four daughters and she emigrated to Australia with
these four daughters. And my mother was the eldest of the daughters.
And she was a powerful old lady my grandmother.
01:03:00:00 Marker
6	Event		it leads back to
the, the fogey, old Masters I had
at Melbourne Grammar School. Q:
Can I ask how your mother came to
be living in Australia and where
she came from in England? A: I
don’t know where she came from
but her mother was an English
widow and she had four daughters
and she emigrated to Australia
with these four daughters. And my
mother was the eldest of the
daughters. And she was a powerful
old lady my grandmother
New, Emerging Technologies for Oral History Archive Collections
Issues Around Existing Technologies
Manual transcription: Laborious and Expensive
Not Synchronised with Video unless a MAM system is utilised
Existing Speech to Text Technology
Large Vocabulary Continuous Speech Recognition (LVCSR)
Keyword Spotting
DOES YOUR AV ARCHIVE NEED SAVING?
+61 2 6242 6456| andrew@damsmart.com.au
www.DAMsmart.com.au | @DAMsmart_
THANG-CUE!!
Editor's Notes
oh yes and grow great audiovisual archivists
... and Orange is also known for growing great ideas, which is how the Australians at War Film Archive came to be, created by a local company called Mullion Creek Productions, run by Michael Caulfield and Liz Butler.
In the year 2000 a television series named 'Australians at War' was produced by Mullion Creek Productions, commissioned by the Department of Veterans' Affairs and broadcast on the ABC. The series was accompanied by a series guide, an education kit and Flash-based quiz games on the website. From the success of that series came the idea to establish a national audiovisual collection that spanned all major conflicts in Australia’s modern history. And at this point time was running out: there weren’t many veterans still alive from World War One.
From this point Mullion Creek Productions and the Department of Veterans' Affairs put together a consulting group of eminent military historians that would establish the methodology, practice and content of the project.
The two main goals of the collection were diversity and comprehensiveness. Researchers conducted initial interviews with over 6,500 potential interviewees before the final choices were made.
With these goals as the driving force for the collection, the archive would encompass the battlefront, the home front, media and entertainment, children, teachers, wives, workers and clerics. Interviewers would speak with people for whom the war ended 86 years ago, and with people who came home from the war 14 days before the interview.
So after 12 months of preparation, production commenced!
A project like this required a lot of human resources, shown here by the size of the production team.
The interviewing process for the archive was conducted by two-person teams in every state and territory across Australia. Each team interviewed in 'tours' of eight weeks, filming 300 hours of material per team, per tour. Note that I couldn’t fit all of the interviewing team onto this slide, and that there was a total of 70 transcribers and editors.
In total, 2005 people were interviewed, with recordings of significant length: most ran six hours, with many others running eight or nine, all without rehearsal.
DVCAM tapes were sent back from the field to Orange, where Mullion Creek Productions duplicated the tapes and transcribed the interviews into text with 30-second timecode points relating to the original DVCAM tapes. As you can appreciate, the task was huge and the project enormous: at full tilt, 50 interviews were arriving in Orange per week.
So what’s in the collection?
The conflicts represented are World War One, World War Two, the Occupation of Japan, the Korean War, the Malayan Emergency, the Indonesian Confrontation, the Vietnam War, Gulf War 1, the War Against Terror and conflicts in Iraq and Afghanistan.
Interviews were also conducted with men and women who served on UN, recovery and peacekeeping missions.
Interviews were shot with a green screen background for flexibility in future use, and photos of the interviewees were also included in the collection.
So the race against time to capture the stories was won, with 11,000 hours of material recorded. A searchable website was established; however, if one wanted to view the recording itself, a request for the original tape was needed, as web server and streaming technology was not quite ready for such a large collection.
And another race had already started: the obsolescence and degradation of the tapes. It is widely known that DVCAM is not a suitable preservation format, even under correct storage conditions. Coupled with obsolescence of the players themselves, this meant parts of the collection were at risk of permanent loss. Another driver for the digitisation of the collection was to create a website where interviews could be searched and viewed.
This is where DAMsmart comes into the picture.
After negotiations, Mullion Creek Productions and the Department of Veterans' Affairs decided ADFA was an appropriate host institution for the collection, and DAMsmart was awarded the contract to digitise the collection and create a digital preservation platform with audiovisual attributes.
DAMsmart commenced in 2013 in our facility in Mitchell. Tapes arrived, were checked off and were entered into our management system. A dedicated multi-stream platform was created, with 8 video streams running across 2 shifts. QC on the files was conducted on shared storage, and low resolution files were transcoded.
Interviews were archived as a package containing high resolution video files, online proxy files, transcripts and photos.
The archiving platform is DNA Evolution from Storage DNA, using LTO5/LTFS technology. Once an LTO5 tape was full, the data was verified. Once successfully verified, the LTO tape was duplicated.
This is a high level overview of the digitisation process, which in many ways was quite straightforward for this particular collection compared with creating the searchable video archive.
Before I go into this, let’s take a look at another story: 35 WW2 AWAS MASTERED H264 Betty Simmons
These clips are from the 100 Stories collection, drawn from interviews digitised by DAMsmart. Clips have been streaming on Qantas flights, and audio versions have aired on ABC Radio National.
OK, so DAMsmart needed to import all video files and metadata into the Media Asset Management system including transcripts for viewing. As we had already archived the files into the LTO system, the next task was to create the collection access layer.
One of the main features that DAMsmart looks for in any system is how easily, and how automatically, data can be moved in and out of it. Both CatDV and DNA Evolution have XML smarts, which means no vendor lock-in.
One thing we noticed when testing metadata import was that we needed to prepare the transcripts as 'sidecar files'. CatDV recognises metadata for a clip if it is XML and has the same filename and location as the video asset.
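As a rough illustration of the sidecar convention, here is a minimal Python sketch. It assumes "same filename" means the same base name with an .xml extension, and the .mov extension and function names are hypothetical; it is not DAMsmart's tooling or anything from CatDV itself.

```python
from pathlib import Path


def sidecar_path(video: Path) -> Path:
    """Return the XML path that shares the video's folder and base filename."""
    # Assumption: the sidecar differs from the video only in its extension.
    return video.with_suffix(".xml")


def missing_sidecars(folder: Path, video_suffix: str = ".mov") -> list[Path]:
    """List the video files in a folder that have no matching XML sidecar."""
    return sorted(
        video
        for video in folder.glob(f"*{video_suffix}")
        if not sidecar_path(video).exists()
    )
```

A check like missing_sidecars would be run before an import, so that any clip without a transcript sidecar is caught early.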
Each tape started at a different timecode hour, incremented according to its place in the order of recording. As an example, Tape 1 starts at 01:00:00:00, Tape 2 at 02:00:00:00, and so on. One of our very clever team members, Artak, created an application that used the change in hour as a delimiter and split the XML to match the video files.
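The hour-as-delimiter idea can be sketched in a few lines of Python. This is only an illustration: it works on already-parsed (timecode, text) pairs rather than on the XML itself, and the function names are hypothetical, not taken from Artak's application.

```python
from itertools import groupby


def hour_of(entry):
    """Return the hour field of an entry's HH:MM:SS:FF timecode as an int."""
    timecode, _text = entry
    return int(timecode.split(":")[0])


def split_by_tape(entries):
    """Split transcript entries into per-tape runs wherever the hour changes.

    Assumes entries are in recording order and that the timecode hour
    equals the tape number (Tape 1 starts at 01:00:00:00, and so on).
    """
    return [(hour, list(run)) for hour, run in groupby(entries, key=hour_of)]
```

Each (hour, run) pair would then be written out as the sidecar for the matching video file.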
We also needed to change the format of the timecode to be compatible with CatDV, so Artak created an app for that also. Marker points would now represent the timecode points and would need to increment every 30 seconds.
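To make the 30-second increment concrete, here is a small Python sketch that generates the expected marker timecodes for a tape. It assumes the HH:MM:SS:FF form with frames always at 00 and the tape number as the starting hour; the function name is hypothetical and this is not the actual conversion app.

```python
def marker_timecodes(tape_number: int, duration_seconds: int, step: int = 30) -> list[str]:
    """Return HH:MM:SS:00 marker timecodes every `step` seconds for one tape."""
    markers = []
    for offset in range(0, duration_seconds, step):
        # Tape N is assumed to start at N hours of timecode.
        total = tape_number * 3600 + offset
        hours, remainder = divmod(total, 3600)
        minutes, seconds = divmod(remainder, 60)
        markers.append(f"{hours:02d}:{minutes:02d}:{seconds:02d}:00")
    return markers


# The collection-wide count quoted in the talk: 11,000 hours at one point per 30 seconds.
total_points = 11_000 * 3600 // 30
```

The same arithmetic gives the figure from the talk: 120 points per hour across 11,000 hours is 1,320,000 timecode points.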
However, through further testing it was discovered that some timecodes had incompatible or illegal characters (which is fairly understandable considering the size of the transcription process). So another application was needed to parse all timecode points in the transcription and correct them where needed.
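A first step in that kind of clean-up is simply finding the offenders. The Python sketch below flags any timecode that is not in a strict HH:MM:SS:FF form; the pattern and function name are assumptions for illustration, and how each bad value should be corrected is deliberately left out because the talk does not say.

```python
import re

# Assumed canonical form: two-digit fields separated by colons only.
TIMECODE = re.compile(r"\d{2}:[0-5]\d:[0-5]\d:\d{2}")


def suspect_timecodes(timecodes):
    """Return (position, value) for every timecode that is not HH:MM:SS:FF."""
    return [
        (position, value)
        for position, value in enumerate(timecodes)
        if TIMECODE.fullmatch(value) is None
    ]
```

Run over the two malformed examples shown on the slide (01:00:01;30 and 000:01:30), both would be flagged, while a well-formed point such as 01:03:00:00 passes.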
Another issue was with characters that caused problems when importing. Even though XML is human and machine readable, that doesn’t necessarily mean the data within the XML is fully compatible. As an example, we had perfectly legitimate carriage returns that looked correct when reading the transcript but would upset the import process, so that only a portion of the transcript was imported.
To get around this, DAMsmart created another app to extract all words and recreate all words with spaces only.
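In Python the same idea fits in one line. This is a sketch of the technique only, with a hypothetical function name, not the app DAMsmart built.

```python
def collapse_whitespace(text: str) -> str:
    """Rebuild text from its words, joined by single spaces only."""
    # str.split() with no arguments splits on any run of whitespace,
    # including carriage returns, newlines and tabs.
    return " ".join(text.split())
```

For example, a transcript passage broken across lines with carriage returns comes back as a single line of words separated by spaces, which is the shape the import expected.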
Once the XML files were cleansed and updated, we went through the import procedure to make the proxy files viewable in CatDV, link each record to the DNA Evolution system so it was clear which LTO tape held the high resolution file, and match the transcript to the video, making it searchable to the word.
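To show what "searchable to the word" can mean in practice, here is a toy Python word index over timecoded markers. It is an illustration under assumptions (simple tokenising, in-memory dictionary, hypothetical names) and says nothing about how CatDV indexes text internally.

```python
import re
from collections import defaultdict


def build_word_index(markers):
    """Map each lower-cased word to the marker timecodes whose text contains it."""
    index = defaultdict(list)
    for timecode, text in markers:
        # A set avoids listing the same marker twice for a repeated word.
        for word in set(re.findall(r"[a-z0-9'’]+", text.lower())):
            index[word].append(timecode)
    return dict(index)
```

Looking up a word then returns the 30-second marker points where it was spoken, which is where playback can start.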
That process was automated and complete in a weekend!
Security!!
It’s also important to note that there are embargoed interviews in this collection, for a number of reasons: extremely disturbing war accounts, extreme profanity and, with some of the more modern conflicts, sensitive defence information. This material is not available on the website; however, it was digitised by DAMsmart, and only particular users of the CatDV system have access to those embargoed interviews.
Note here we have 3 Production Groups. DAMsmart also digitised 60,000 images for UNSW that are managed in the same system, but don’t get me started on that one!
DAMsmart delivered the system to ADFA this year, and provided training for general usage and administration.
In total the data requirement was 135TB x 2 (200 x LTO5/LTFS tapes) with 10 TB of online proxies.
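As a quick sanity check on those numbers, the Python below works out the average load per tape. It assumes the 200 tapes hold both copies between them and uses the native (uncompressed) LTO-5 capacity of 1.5 TB; LTFS overhead is ignored, so treat the result as approximate.

```python
archive_tb = 135        # one copy of the collection, from the talk
copies = 2              # original set plus duplicate
tapes = 200             # assumed to cover both copies
lto5_native_tb = 1.5    # native LTO-5 capacity, before any LTFS overhead

total_tb = archive_tb * copies
tb_per_tape = total_tb / tapes
fill_ratio = tb_per_tape / lto5_native_tb

print(f"{total_tb} TB across {tapes} tapes")
print(f"{tb_per_tape:.2f} TB per tape, about {fill_ratio:.0%} of native capacity")
```

On those assumptions the tapes average roughly 1.35 TB each, which fits the description of writing each tape until it was full.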
So the AAWFA story doesn’t end here: alongside the DAMsmart digitisation and asset management work, Link Digital have been working on creating a new website for all Australians to access the collection. Specialist work was also required to modify the transcripts to be searchable on the web.
Video and audio only representations will be made available.
As with the Australians at War Film Archive, the true richness of any oral history collection is having a searchable transcription. Searchability applies to any audiovisual recording that is valued in a collection that warrants access.
Considering the terrific scale of transcription work conducted by Mullion Creek Productions, and the size of the oral history project, I’d like to talk about technology that opens the door to speech based archives in an innovative and automated way.
There are a few issues around manual transcription, such as the labour and expense of transcribing and time-referencing very large collections, and, if an intelligent MAM is not used in conjunction, the video and transcription not being in sync.
Existing speech to text technology is available, using techniques such as Large Vocabulary Continuous Speech Recognition, which (simply put, for my benefit only) adds a time reference and looks at a chain of words in a vocabulary, and keyword spotting, which searches for a word or phrase.
There are limitations to these techniques. LVCSR is limited in its ability to adapt to language, which is an evolving, living thing. LVCSR can also only recognise words found in its lexicon. Specialised terminology, such as names of people and places, is also omitted from LVCSR lexicons to keep them small enough to process words in a timely manner.
As an alternative, phonetic search analyses the input speech as phonetic strings and creates a highly compressed phonetic representation of the audio for later reference and searching.
So the indexing is only performed once, and high resolution audio files are then not needed online. Phonetic search doesn’t rely on this phonetic track alone; it also requires language packs to understand the phrasing and nuance of the language.
Speed and accuracy: using phonetic technology, the indexing process is only needed once and search can be performed as often as needed. The indexing phase only categorises potential sets of phonemes, rather than tying words to spoken audio, which makes indexing fast.
As mentioned LVCSR needs to be updated when new words are added. When this occurs the entire collection needs to be reanalysed. When phonetic search is performed, the dictionary is referenced in the search phase, which is relatively fast, and adding new words incurs only another search.
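The index-once, search-many idea can be shown with a toy Python sketch. Everything here is illustrative: the phoneme symbols, the sample index, the place names and the exact-match search are all assumptions, and a real phonetic search engine would score approximate matches rather than require an exact sequence.

```python
# Toy phonetic index: (seconds, phoneme) pairs, built once from the audio.
PHONETIC_INDEX = [
    (12.0, "T"), (12.1, "AH"), (12.2, "B"), (12.3, "R"), (12.4, "UH"), (12.5, "K"),
    (30.0, "K"), (30.1, "AH"), (30.2, "K"), (30.3, "OW"), (30.4, "D"), (30.5, "AH"),
]

# Pronunciation dictionary consulted at search time only (hypothetical entries).
PRONUNCIATIONS = {"tobruk": ["T", "AH", "B", "R", "UH", "K"]}


def search(index, phonemes):
    """Return the start times where the phoneme sequence occurs in the index."""
    if not phonemes:
        return []
    symbols = [symbol for _time, symbol in index]
    width = len(phonemes)
    return [
        index[start][0]
        for start in range(len(symbols) - width + 1)
        if symbols[start:start + width] == phonemes
    ]


# Adding a new word changes only the dictionary; the index is left untouched.
PRONUNCIATIONS["kokoda"] = ["K", "AH", "K", "OW", "D", "AH"]
hits = search(PHONETIC_INDEX, PRONUNCIATIONS["kokoda"])
```

The point of the sketch is the last two lines: a word that was unknown when the audio was indexed can still be found, because only the search-time dictionary had to change.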
Researchers can also enter sound-it-out searches, misspelled searches, or even words with different pronunciations, and still get accurate results.
Following is an example of how a phonetic string is represented, using a fairly modern term that has recently popped up.
In conclusion: I can see real access benefits for oral history collections utilising this technology, further opening up collections and harnessing automated digital processing once collections are digitised. DAMsmart is looking at ways this new technology can improve access to large audiovisual collections.