Presented by Dr. Lai Tee Phang in the "Tools and Technologies for Enhancing Access to Audiovisual" session at the Access, Outreach, and Use of Moving Image Archives Stream during the AMIA 2015 Conference on Friday, November 20, 2015.
Tools & Technologies for Enhancing Access to Audiovisual - the Singapore Journey
1. Tools & Technologies for enhancing access
to Audiovisual - the Singapore Journey
Dr Phang Lai Tee (phang_lai_tee@nlb.gov.sg)
National Archives of Singapore
AMIA Conference
20 Nov 2015
Curated Stream
2. } A “Little Red Dot” …
Greetings from Singapore
3. } Island city-state
} Population: 5.5 million
} Multi-racial community – Chinese
74%, Malay 14%, Indian 9%, other
ethnicities 3%
} Area: 710 sq km
} Government: Parliamentary
Democracy
} We celebrated our Golden Jubilee
} We mourned the passing of our
founding Prime Minister
Who we are
4. Introduction to the
National Archives of Singapore (NAS)
} 1968: Established by Act of Parliament
} Aug 1993: Came under National Heritage Board (NHB)
} 1996: AudioVisual Archives Division formally set up
} Nov 2012: Transferred to National Library Board (NLB)
Conveniently located in
Singapore’s Civic District
(1 Canning Rise)
6. New opportunities, new challenges
} Content is king
} Increased digitisation funding
} Robust IT infrastructure for resource hungry AV
} Experienced in improving searchability of content
} Dared to innovate & try new technologies
} How to be visible in a sea of books (enhanced discovery?)
} Pressure to widen access
} Branding of archives
} Archival principles…
8. Treasure Trove of AV Content
} Recommendation of Advisory Council on Culture and the Arts chaired by
then 2nd Deputy Prime Minister Ong Teng Cheong in 1989
} Strengthen the national heritage collection in all media to cover
sound and moving images
} Over 100,000 AV recordings covering 60 years of broadcasting history of
Singapore
} AV recordings capturing defining moments and key government initiatives
in Singapore’s 50 years of independence
} Sound recordings documenting the recording history of Singapore and the
region from 1903 to the 1970s
13. Search beyond the Archives – Expandable
} OneSearch, Many Sources
} Data harmonization and linkages across different descriptive frameworks
and systems for the benefit of users
14. Avoiding pitfalls
} Beware of the mapping
} ISAD-G, MARC, Dublin Core
} Creator/publisher, transferring agency/source of acquisition
} One date vs. many dates
} Know your collection well and the differences in
descriptions and definitions
} Mapping alone may not be adequate
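The mapping pitfalls on this slide can be illustrated with a minimal sketch. It assumes a simplified crosswalk from an ISAD(G)-style description to Dublin Core-style keys; the source field names (`name_of_creator`, `dates_of_creation`, `transferring_agency`), the sample record, and the `map_record` helper are all hypothetical and are not the actual NAS or NLB schema. The sketch keeps every date instead of collapsing to one, and sets aside fields such as the transferring agency rather than forcing them into a near-match like publisher.

```python
# Hypothetical crosswalk: ISAD(G)-style field -> Dublin Core-style key.
# The left-hand names are illustrative only, not a real NAS/NLB schema.
CROSSWALK = {
    "title": "title",
    "name_of_creator": "creator",
    "dates_of_creation": "date",
}


def map_record(record):
    """Return (mapped, needs_review) for one archival description."""
    mapped, needs_review = {}, {}
    for field, value in record.items():
        if field in CROSSWALK:
            target = CROSSWALK[field]
            # Keep every value: an archival unit may carry many dates.
            values = value if isinstance(value, list) else [value]
            mapped.setdefault(target, []).extend(values)
        else:
            # No safe one-to-one equivalent (e.g. transferring agency is
            # not a publisher), so a person decides how to describe it.
            needs_review[field] = value
    return mapped, needs_review


# Made-up sample record, for illustration only.
sample = {
    "title": "Sample broadcast",
    "name_of_creator": "Example Broadcaster",
    "dates_of_creation": ["1965", "1966"],
    "transferring_agency": "Example Ministry",
}
mapped, needs_review = map_record(sample)
print(mapped)
print(needs_review)
```

Routing unmapped fields to a review list reflects the last bullet: mapping alone may not be adequate, so the differences are surfaced rather than hidden.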
17. Enhance findability of non-textual content -
voice to text transcription
} 6,000 hours of broadcasts and speeches done
} Useful guide for writing synopses; minimises the need to make notes
when listening to audio and reduces time taken by 25% (for transcripts
with good accuracy)
} Problem with names and non-English words
} Sarong became sorrow, Blakang Mati became Locomotiv
} Saudara Joko Senyoto became John Paulson
} Dr Goh Keng Swee became…
} Accuracy highly dependent on clarity of recording and speaker’s
accent; can be improved through training
} There are portions that can only be understood by listening to the
audio repeatedly
} Not suitable for broadcasts with multiple languages, certain series
19. Using text analytics to automatically identify related content
} Each text tokenised; tokens parsed and weighted (TF/IDF)
} Weighted tokens similarity computed
} Similarity = 0.295
} Expandable - Mahout
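The steps on this slide (tokenise, weight by TF/IDF, compute similarity) can be sketched in a few lines. This is a plain-Python illustration, not the Mahout pipeline the slide names. The tokeniser, the smoothed IDF formula, the choice of cosine similarity, and the sample texts are all assumptions made for the example, and the function names `tokenise`, `tfidf_vectors`, and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document."""
    token_lists = [tokenise(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: number of documents each token appears in.
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency
            # (an assumed variant; other IDF formulas are common too).
            token: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse weighted-token vectors."""
    dot = sum(w * vec_b.get(token, 0.0) for token, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up synopses, for illustration only.
docs = [
    "national day parade broadcast",
    "national day rally speech",
    "sound recording of regional music",
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

A score near 1 means two texts share heavily weighted tokens, and 0 means they share none. Where to set the cut-off for "related" is a tuning decision the slide does not state.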
20. Using clustering to handle large datasets
Clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some
sense or another) to each other than to those in other groups (clusters)
Mahout K-Means Clustering
with Cosine Distance
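As a rough illustration of k-means with cosine distance, the sketch below uses plain Python rather than Mahout. The function names, the seeded random initialisation, the iteration cap, and the toy vectors are assumptions made for the example. Vectors are scaled to unit length so that cosine distance reduces to one minus the dot product.

```python
import math
import random


def normalise(vector):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine_distance(a, b):
    """1 - cosine similarity, for vectors already scaled to unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def kmeans_cosine(vectors, k, iterations=20, seed=0):
    """Return a cluster label (0..k-1) for each input vector."""
    points = [normalise(v) for v in vectors]
    # Assumed initialisation: k distinct points picked with a fixed seed.
    centroids = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so we have converged.
        labels = new_labels
        # Update step: each centroid becomes the re-normalised mean of
        # its members; an empty cluster keeps its previous centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalise(mean)
    return labels


# Toy 2-D vectors standing in for weighted-token vectors.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(kmeans_cosine(toy, k=2))
```

In practice the inputs would be high-dimensional TF/IDF vectors, and the number of clusters would be chosen by experiment.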
34. Expandable – Project by NLB
} Use machine translation technology & KOS (Knowledge
Organisation System) names database to translate non-English
content/local personality names to English
} Apply text-mining & keyword classification to recommend
related library & archives content across languages
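The names-database step can be pictured as a lookup that rewrites known name variants into their English forms before text mining. The sketch below is a minimal illustration under that assumption: the two-entry `NAME_VARIANTS` table merely stands in for a KOS names database, the `normalise_names` helper is hypothetical, and the machine translation of the remaining text is not covered.

```python
# Illustrative stand-in for a KOS names database: variant -> English form.
NAME_VARIANTS = {
    "李光耀": "Lee Kuan Yew",
    "新加坡": "Singapore",
}


def normalise_names(text, variants=NAME_VARIANTS):
    """Replace known name variants in the text with their English forms."""
    # Replace longer variants first so a short variant never splits a
    # longer one that contains it.
    for variant in sorted(variants, key=len, reverse=True):
        text = text.replace(variant, f" {variants[variant]} ")
    # Collapse the padding spaces added around each replacement.
    return " ".join(text.split())


print(normalise_names("李光耀在新加坡"))
```

Once names share a single English form, the same keyword matching can relate items catalogued in different languages.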
35. In the pipeline
} Extend in-premises access to the libraries
} Image analytics
} Linked data (by NLB)
} Crowdsourcing for home movies?