Large-Scale Computational Research in Arts & Humanities

Presented by John Coleman at the JISC Future of Research conference, 19th October 2010


  1. Large-scale computational research in Arts and Humanities, using mostly unwritten (audio/visual) media
     John Coleman, Faculty of Linguistics, Philology and Phonetics, University of Oxford
     The future of research? 19/10/10
  2. • What am I talking about? (Show and tell)
     • How is it affecting the A/H research landscape?
     • Implications (cost, strategy etc.)
  3. Once upon a time ...
     There she weaves by night and day
     A magic web with colours gay
  4. 1967: writing
     Important/jj as/cs was/bedz Mr./np O'Donnell's/np$ essay/nn ,/, his/pp$ thesis/nn is/bez so/ql restricting/jj as/cs to/to deny/vb Faulkner/np the/at stature/nn which/wdt he/pps obviously/rb has/hvz ./. He/pps and/cc also/rb Mr./np Cowley/np and/cc Mr./np Warren/np have/hv fallen/vbn to/in the/at temptation/nn which/wdt besets/vbz many/ap of/in us/ppo to/to read/vb into/in our/pp$ authors/nns --/-- Nathaniel/np Hawthorne/np ,/, for/in example/nn ,/, and/cc Herman/np Melville/np --/-- protests/vbz against/in modernism/nn ,/, material/jj progress/nn ,/, and/cc science/nn which/wdt are/ber genuine/jj protests/nns of/in our/pp$ own/jj but/cc may/md not/* have/hv been/ben theirs/pp$$ ./. Faulkner's/np$ total/nn works/nns today/nr ,/, and/cc in/in fact/nn those/dts of/in his/pp$ works/nns which/wdt existed/vbd in/in 1946/cd when/wrb Mr./np Cowley/np made/vbd his/pp$ comment/nn ,/, or/cc in/in 1939/cd ,/, when/wrb Mr./np O'Donnell/np wrote/vbd his/pp$ essay/nn ,/, reveal/vb no/at such/jj simple/jj attitude/nn toward/in the/at South/nr-tl ./. If/cs he/pps is/bez a/at traditionalist/nn ,/, he/pps is/bez an/at eclectic/jj traditionalist/nn ./. If/cs he/pps condemns/vbz the/at recent/jj or/cc the/at present/nn ,/, he/pps condemns/vbz the/at past/nn with/in no/ql less/ap force/nn ./. If/cs he/pps sees/vbz the/at heroic/jj in/in a/at Sartoris/np or/cc a/at Sutpen/np ,/, he/pps sees/vbz also/rb --/-- and/cc he/pps shows/vbz --/-- the/at blind/jj and/cc the/at mean/jj ,/, and/cc he/pps sees/vbz the/at Compson/np family/nn disintegrating/vbg from/in within/rb ./. He/pps is/bez not/* one/cd to/to remain/vb more/ql comfortably/rb and/cc
  5. XML TEI: still writing
     <inscript id="halu0001"> <sourceDesc> <physObj type="ashlar" engrave="engraved" color=""> <desc> Negev. Elusa (Haluza). 100-299 CE. Limestone ashlar dressed as a tabula ansata. </desc> <letterHgt min="1.7" max="3.5">1.7-3.0 cm Aramaic, 2.5-3.5 cm Greek</letterHgt> <dateRange calendar="Gregorian" from="100" to="299"> 100 CE to 299 CE <note> Based on the Greek and Palmyrene script. </note> </dateRange> <discovery> <place region="Negev" city="Elusa (Haluza)" site="Foundation of an abandoned Beduin structure" locus=""> <note> The inscription consists of two lines of Greek followed by one of
  7. VRE for the Study of Documents and Manuscripts: writing
     A trial of Kathryn Sutherland’s Jane Austen manuscript project; supported by CCH, King's College London, but here ported to our VRE-SDM demonstrator
  8. From writing to video
  9. 2005
  10. 2008: YouTube surpasses Yahoo as the world’s #2 search engine
  11. Researchers of the future
     • Just as comfortable creating multimodal online content (video, games, websites etc.) as writing essays
     • Online video is interesting; TV is boring (passive)
  12. Texts
  13. Texts
  14. Texts
  15. Texts
     Здесь Будет ]ружен Памятник ]божденный т[
  16. Texts
     Здесь Будет ]ружен Памятник ]божденный т[руд]
  17. Texts
     Здесь Будет ]ружен Памятник ]божденный т[руд]
     Here will be erected a monument to liberated labour
  18. Vocal tract movements in speech
  19. Resonance tuning in soprano singing and vocal tract shaping
     Erik Bresch, Speech Production and Articulation kNowledge Group, University of Southern California
  20. Mining a Year of Speech: a “Digging into Data” project
     http://www.phon.ox.ac.uk/mining/
  21. John Coleman, Greg Kochanski, Ladan Ravary, Sergio Grau, Oxford University Phonetics Laboratory; Lou Burnard; Jonnie Robinson, The British Library; with support from
  22. Mark Liberman, Jiahong Yuan, Chris Cieri, Phonetics Laboratory and Linguistic Data Consortium, UPenn; with support from NSF
  23. The “Digging into Data” challenge
     • “The creation of vast quantities of Internet accessible digital data and the development of techniques for large-scale data analysis and visualization have led to remarkable new discoveries in genetics, astronomy and other fields ...
     • With books, newspapers, journals, films, artworks, and sound recordings being digitized on a massive scale, it is possible to apply data analysis techniques to large collections of diverse cultural heritage resources as well as scientific data.”
  24. The “Year of Speech”
     • A grove of corpora, held at various sites, with a common indexing scheme and search tools
     • US English material: 2,240 hrs of telephone conversations
     • 1,255 hrs of broadcast news
     • As-yet unpublished talk-show conversations (1,000 hrs), Supreme Court oral arguments (5,000 hrs), political speeches and debates
     • British English: spoken part of the British National Corpus, 10 million words of transcribed speech
     • Recently digitized in collaboration with the British Library
  25. C-SPAN
     • US cable TV channel covering Senate/House proceedings, committees, current-affairs discussion shows
     • 20-year archive of publicly open video
     • Large parts of the proceedings officially transcribed and published
  26. Digging for audio: kinds of questions someone might ask
     1. When did X say Y? For example, "find the video clip where George Bush said 'read my lips'."
     2. How do arguments work? For example, how do different people handle interruptions?
     3. How frequent are linguistic features such as phrase-final rising intonation ("uptalk") across different age groups, genders, social classes, and places?
     4. Who says “ask” and who says “aks”?
  27. British National Corpus
     • Collected in the early 1990s by a consortium of dictionary makers (Collins, Longman, OUP) and academics (Oxford, Lancaster, Oslo-Bergen)
     • 100-million-word text (XML) corpus, of which 10 million words is transcribed speech
     • c. 4.2 million words is demographically sampled recordings of unplanned conversations
     • British Market Research Bureau loaned Sony Walkmans to recruits
     • c. 5 million words is "context-governed" speech (educational, business, public speeches/meetings, 'leisure': sports, clubs, broadcast, phone-ins etc.)
     • Transcribed by audio typists and structured in an XML database with rich metadata annotations
  28. A few speech samples from the BNC
     • A domestic drama
     • Political commentary/current affairs
     • Are dogs people too?
  29. Practicalities
     • To be of much practical use, such very large corpora must be indexed at word and phoneme level
     • All included speech corpora must therefore have associated text transcriptions
     • We use the Penn Phonetics Laboratory Forced Aligner to associate each word and segment with the corresponding start and end points in the sound files
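Once a forced aligner has attached start and end times to every word, a phrase query ("find the clip where X said Y") reduces to scanning for consecutive aligned words. A minimal sketch, assuming a toy record format invented for illustration (it is not the actual output format of the Penn aligner):

```python
# Each record: (audio_file, word, start_seconds, end_seconds), in corpus order.
# These records and times are invented for illustration.
alignment = [
    ("tape042.wav", "read", 12.10, 12.31),
    ("tape042.wav", "my", 12.31, 12.44),
    ("tape042.wav", "lips", 12.44, 12.90),
    ("tape042.wav", "no", 13.05, 13.30),
]

def find_phrase(records, phrase):
    """Return (file, start, end) for each occurrence of a word sequence."""
    words = phrase.lower().split()
    hits = []
    for i in range(len(records) - len(words) + 1):
        window = records[i:i + len(words)]
        # All words must match and lie in the same audio file.
        if all(r[1].lower() == w and r[0] == window[0][0]
               for r, w in zip(window, words)):
            hits.append((window[0][0], window[0][2], window[-1][3]))
    return hits

print(find_phrase(alignment, "read my lips"))
# → [('tape042.wav', 12.1, 12.9)]
```

A real index over thousands of hours would of course need an inverted index rather than a linear scan, but the returned time stamps play the same role: they point a media player directly at the relevant stretch of audio.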
  30. 'Speech in the wild'
  31. 'Speech in the wild'
  32. Rethinking language
     • Dogs
     • Parrot talk (to/about, not by)
     • Talk to inanimate objects
     • We can look forward to ...
     • Listen they were all going [belch] that ain't a burp he said
     • Like I'd be talking like this and suddenly it'll go [mimics microphone noises]
     • He simply went [sound effect] through his nose
  33. A future of research?
     • Survey of audio-visual tools and resources in the Humanities (AHRC ICT Strategy project)
     • http://www.phon.ox.ac.uk/ict
     • Key findings:
       ◦ Growing but relatively poorly supported use of audio & video in many subjects (Music, Modern Languages, Modern History, Archaeology, Classics, Art, Linguistics)
       ◦ Annotation, search and browse tools are essential
       ◦ The digital data storage and processing power required vastly outstrip text and photos, and are commensurate with e-Science grid computing
  34. How big is “big science”?
     • Human genome: 3 GB
     • DASS audio sampler: 350 GB
     • Hubble space telescope: 0.5 TB/year
     • 'Year of Speech': 1-2 TB
     • Sloan digital sky survey: 16 TB
     • Beazley Archive & partners: 20 TB
     • Ruskin School of Art student projects: 30 TB
     • 10m Google Books: ~150 TB
     • Survivors of the Shoah Visual History Foundation: 180 TB
     • Large Hadron Collider: 15 PB/year (= 100 × Google Books)
     • Photographic collections, film libraries, museum catalogues etc. are pretty large nowadays
  35. How big is “big science”?
     Same list as the previous slide, with a dividing line drawn to mark off the humanities collections.
  36. Why does big matter?
     • What kinds of questions you can study depends on the material you've got. (Obviously.)
     • Humanities deals with rare and unique works and interpretations, not repeatable events.
     • To study rare events/things and connections, it can be important simply to have a lot of data, as much as possible, in order to have enough examples.
  37. Rare(ish) events in English (instances in the BNC)
     • I’[n] trying: 160
     • See[n] to: 310
     • Alar[ŋ] clock: 18
     • Swimmi[m] pool: 44
     • Getti[m] paid: 19
     • Weddi[m] present: 15
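The counts above make the "why big matters" point concrete: against the roughly 10-million-word spoken BNC mentioned earlier, even the commonest of these variants occurs only a few dozen times per million words. A back-of-envelope sketch (counts are those reported on the slide):

```python
# Occurrences per million words for the variant pronunciations above,
# against the ~10-million-word spoken part of the BNC.
SPOKEN_BNC_WORDS = 10_000_000

counts = {
    "I'[n] trying": 160,
    "see[n] to": 310,
    "alar[ng] clock": 18,
    "swimmi[m] pool": 44,
    "getti[m] paid": 19,
    "weddi[m] present": 15,
}

for item, n in counts.items():
    per_million = n / (SPOKEN_BNC_WORDS / 1_000_000)
    print(f"{item}: {per_million:.1f} per million words")
```

The rarest items here turn up under twice per million words, so a corpus a tenth the size would yield only a handful of tokens, too few for any breakdown by age, gender, or region.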
  38. Challenges: technology
     • Amount of material
     • Storage:
       ◦ CD quality: 635 MB/hour
       ◦ Uncompressed .wav files: 115 MB/hour
       ◦ 16 acoustic analysis parameters: 1.44 MB/hour
       ◦ 2.8 GB/day
       ◦ 85 GB/month
       ◦ 1.02 TB/year
     • Computing:
       ◦ distance measures, etc.
       ◦ alignment of labels
       ◦ searching and browsing
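The per-day, per-month and per-year figures above follow from the 115 MB/hour rate for uncompressed mono .wav files, recorded around the clock. A quick check, using decimal units (1 GB = 1,000 MB) and an average month of about 30.4 days:

```python
# Verifying the slide's storage arithmetic for continuous recording
# at 115 MB/hour of uncompressed speech audio.
MB_PER_HOUR = 115

per_day_gb = MB_PER_HOUR * 24 / 1000               # ≈ 2.76 GB/day
per_month_gb = MB_PER_HOUR * 24 * 30.4 / 1000      # ≈ 84 GB/month
per_year_tb = MB_PER_HOUR * 24 * 365 / 1_000_000   # ≈ 1.01 TB/year

print(f"{per_day_gb:.2f} GB/day, {per_month_gb:.0f} GB/month, "
      f"{per_year_tb:.2f} TB/year")
```

This confirms the slide's rounded figures of 2.8 GB/day, 85 GB/month and 1.02 TB/year to within rounding.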
  39. Challenges: technology
     • Storing 1.02 TB/year: not really a problem in the 21st century
     • A 1 TB (1,000 GB) hard drive costs c. £65
     • Computing (distance measures, alignments, labels etc.): multiprocessor cluster
  40. Collaboration, not collection (diagram)
     • Search interface 1 (e.g. Oxford); Search interface 2 (e.g. BL); Search interface 3 (e.g. Penn); Search interface 4 (e.g. Lancaster?)
     • BNC-XML database: retrieve time stamps
     • Spoken BNC recordings: BL sound server(s)
     • LDC database: retrieve time stamps
     • Spoken LDC recordings: various locations
  41. Collaboration, not collection (diagram, extended)
     • Database of time stamps produced using consistent indexing standards
     • Your recordings: whatever location
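The diagram's point is that the shared infrastructure holds only time stamps and pointers; the recordings stay wherever each institution keeps them. A minimal sketch of what one entry in such a federated index might look like, with the field names and URL scheme entirely invented for illustration:

```python
# Hypothetical federated index entry: time stamps plus a pointer to the
# server that actually holds the audio. Nothing here reflects a real
# BL or LDC API; the endpoint and fields are made up.
index_entry = {
    "corpus": "spoken-BNC",
    "audio_server": "https://sounds.example.org",  # hypothetical endpoint
    "file": "KB7/KB7-031.wav",
    "word": "monument",
    "start": 421.88,
    "end": 422.40,
}

def clip_request(entry):
    """Build the request a client could send to the holding institution's
    sound server to fetch just the relevant stretch of audio."""
    return (f"{entry['audio_server']}/{entry['file']}"
            f"?start={entry['start']}&end={entry['end']}")

print(clip_request(index_entry))
```

Because only the time-stamp database needs consistent indexing standards, any site can plug its own recordings into the grove without shipping terabytes of audio anywhere.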
  42. Challenges: dispersal/aggregation
     • Dispersed resources; grid computing
     • Need for international standards (for authorisation etc.)
     • Humanities research may require new support structures (cf. 'big science' comparisons)
     • 'Federated library' or 'national research laboratory' models?
  43. Challenges: dispersal/technology
     • Finding stuff
     • Doing something with it
     • Transformation, new interpretations
  44. Challenges: human
     • Human aspect more important than hardware
     • Who is qualified to carry out such work?
     • Employment prospects
     • What training provision is required?
     • Should training in computer programming become normal for arts/humanities students?
  45. Possible impacts
     • Will open up Year of Speech data and tools to linguistics, phonetics, speech communication, oral history, education
     • Automatic and reliable indexing of spoken on-line materials would be a “killer app”
     • Caveat: it is practically impossible to predict the impact of developments in the market (cf. Microsoft, Google, YouTube) or of technologies that come to market (transistors, lasers, holograms)
     • So it’s even harder to reliably predict the impacts of cutting-edge research
  46. Thank you for your time and attention
     http://www.phon.ox.ac.uk/mining/
     http://bvreh.humanities.ox.ac.uk/
     http://www.phon.ox.ac.uk/ict
  48. Spoken Babylonian
     Martin Worthington: Babylonian and Assyrian Poetry and Literature: An Archive of Recordings
     http://people.pwf.cam.ac.uk/mjw65/BAPLAR/Archive
     The Righteous Sufferer (Ludlul bēl nēmeqi), part of Tablet II, read by Margaret Jaques Cavigneaux
  49. Babylonian Karaoke
     1 šattamma ana balāṭ adanna īteq
       "One whole year to the next! The appointed time passed."
     2 asaḫḫurma lemun lemunma
       "As I turned around, it was more and more terrible;"
     3 zapurtī ūtaṣṣapa išartī ul uttu
       "My ill luck was on the increase, I could find no good fortune."
     4 ila alsīma ul iddina pānīšu
     5 usalli ištarī ul ušaqqâ rēšīša
  51. Visualisation using 3-D/4-D models
     Screen renderings of the Odeion at Pompeii; 3D visualisation and research by Martin Blazeby, King's Visualisation Lab
  52. Visualisation using 3-D/4-D models
     Screen renderings of the Odeion at Pompeii; 3D visualisation and research by Martin Blazeby, King's Visualisation Lab
