Computational History and the Transformation of Public Discourse in Finland, 1640 - 1910 (COMHIS)


  1. Consortium partners:
     • National Library of Finland, Centre for Preservation and Digitisation
     • University of Helsinki, Faculty of Humanities
     • University of Turku, Dept of Information Technology
     • University of Turku, Dept of Cultural History
     More info at http://goo.gl/tMH4RE
  2. Researchers:
     • National Library of Finland, Centre for Preservation and Digitisation: Kimmo Kettunen (PI), Mika Koistinen, Teemu Ruokolainen
     • University of Helsinki, Faculty of Humanities: Mikko Tolonen (PI), Leo Lahti, Jani Marjanen, Hege Roivainen, Ville Vaara
     • University of Turku, Dept of Information Technology: Tapio Salakoski (PI), Filip Ginter, Aleksi Vesanto
     • University of Turku, Dept of Cultural History: Hannu Salmi (Consortium PI), Asko Nivala, Heli Rantala, Reetta Sippola
  3. COMHIS Overview
  4. COMHIS Overview
     • Work package 1: Publishing Trends and the Development of Public Discourse
       • WP 1.1: Large-scale Analysis of Library Catalogue Metadata Collections
       • WP 1.2: Intellectual Geography and Transcending of National Borders
     • Work package 2: Viral Texts and Social Networks of Finnish Public Discourse in Newspapers and Journals, 1771–1910
       • WP 2.1: Improving the Quality of Newspaper Digital Archives
       • WP 2.2: Virality of Newspaper and Journal Discourse in Nineteenth-Century Finland: Cultural Rhizomes and Social Networks
     • Work package 3: Data Analytical Ecosystem for Newspapers and Historical Document Collections
       • WP 3.1: Quantitative Tools for Bibliographic Library Catalogue Metadata Collections and Finnish Book Production (1488–1910)
       • WP 3.2: Machine Learning Methods for Text Mining
       • WP 3.3: Text Reuse and Paraphrasing in Finnish Newspapers and Journals, 1771–1910
       • WP 3.4: Open Source Statistical Workflows
  5. National Library of Finland
     NLF has a large digitized newspaper and journal collection, 1771–1920 (and newer): http://digi.kansalliskirjasto.fi
     • Newspapers: 4,501,147 pages digitized
       • Free use (up to 1920): 2,954,424 pages (65%)
       • Copyrighted material (1921 onwards): 1,546,723 pages (35%)
     • Journals: 6,378,717 pages digitized
       • Free use (up to 1920): 2,161,748 pages (33%)
       • Copyrighted material (1921 onwards): 4,216,969 pages (67%)
  6. Article extraction: several articles
  7. Metadata: Books and Newspapers
  8. Number of distinct newspapers published per year, by language, 1800–1920
  9. Full-text mining: Newspapers and Journals
  10. Text reuse
     • How much did newspapers and journals share each other's content?
     • We have found 8 million clusters of repeated texts in the corpus of Finnish newspapers and journals, 1771–1910. These clusters include a total of 49 million occurrences (hits).
     • Different forms of text reuse: advertisements, notices, news, anecdotes, poems, etc.
     • Long-term reuse
     • Viral chains, explosive replication
  11. The problem of OCR’d text: example pair
     • Multa tä@tä fyNlkÄsiii kchtalostu ,ct , Abouil Asi,3 wic!lä ticiun't>t ,mitää>«, »vaalii luiftti iloista M,m<iä Tshiragauissa, ©elä fi:föf3>i'öi että uiUfatfpäim -uhkaisiloui i Hviarat, miinto fu^tiaani 'fatifefi- fuffotai» lÄuja THi roinin, puutarhassa ja, ipici'ilitsi hwi'tt<iiöii fmmiamcrk^iUi ja anoo» »imilyMla,
     • Mutta tästä synkästä kohtalosta ei Abbul Asib »ielä tiennyt mitään, vaan »ietti iloista elämää TshiraganiSsa. Sekä sis»Stä «ttä ulkoapäin uhkasivat «aarat. mutta sulttaani katseli lukkotaisteluja Tfhiiaaanin puutarhassa ja palkitsi voittajan lunnicnnerleillä ja ar° vonimityksillä.
  12. Finding text reuse
     • We used a programme called NCBI BLAST
       • It is used to compare and align biological sequences, such as protein sequences
       • It finds all similar sub-sequence pairs
     • Our data is just text, not protein sequences
     • We had to encode our data as protein sequences
       • 23 amino acids are available
       • We formed a mapping from the 23 most common letters to the available amino acids
       • We encoded the data using this mapping and discarded characters that had no equivalent
       • “This is an example sentence” → “DSCHCHBEGBNQFGHGEDGEG”
     • BLAST outputs all similar sub-sequences from our data
     • We formed clusters by assigning all sub-sequences that overlap enough to the same cluster
  13. Publications
     • Kimmo Kettunen, Tuula Pääkkönen: “Measuring Lexical Quality of a Historical Finnish Newspaper Collection? Analysis of Garbled OCR Data with Basic Language Technology Tools and Means”, LREC 2016.
     • Kimmo Kettunen, Eetu Mäkelä, Juha Kuokkala, Teemu Ruokolainen, Jyrki Niemi: “Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910”, LWDA 2016: 124-135.
     • Tuula Pääkkönen, Jukka Kervinen, Kimmo Kettunen, Asko Nivala, Eetu Mäkelä: “Exporting Finnish Digitized Historical Newspaper Contents for Offline Use”, D-Lib Magazine 22(7/8) (2016).
     • Mikko Tolonen, Jani Marjanen, Niko Ilomäki, Hege Roivainen and Leo Lahti: “Printing in a Periphery: a Quantitative Study of Finnish Knowledge Production, 1640-1828”, Proceedings of Digital Humanities 2016, long papers, Kraków, Poland, July 2016.
     • Mikko Tolonen, Leo Lahti and Niko Ilomäki: “A Quantitative Analysis of History in the ESTC catalogue”, Liber Quarterly, 25(2), pp. 87–116, 2016. DOI: http://doi.org/10.18352/lq.10112
     • Aleksi Vesanto, Asko Nivala, Tapio Salakoski, Hannu Salmi and Filip Ginter: “A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora”, Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), 330–333, http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf
     • Aleksi Vesanto, Asko Nivala, Heli Rantala, Tapio Salakoski, Hannu Salmi and Filip Ginter: “Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771-1910”, Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), 54–58, http://www.ep.liu.se/ecp/133/010/ecp17133010.pdf
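
The encoding and clustering steps listed on the "Finding text reuse" slide can be illustrated with a minimal Python sketch. This is not the project's code. The 23-letter alphabet string, the function names, the hit format `(document, start, end)`, and the `min_overlap` threshold are all assumptions made for illustration, since the slides do not specify them. The sketch does not run BLAST itself. It only shows how text might be mapped to amino-acid letters beforehand and how reported hit pairs might be grouped afterwards.

```python
from collections import Counter

# Assumed alphabet: 20 standard amino-acid codes plus B, Z and X (23 letters).
# The slides say 23 amino acids were available but do not list them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYBZX"


def build_mapping(corpus: str) -> dict[str, str]:
    """Map the most common letters of the corpus to amino-acid codes."""
    counts = Counter(ch for ch in corpus.lower() if ch.isalpha())
    # Most frequent first; ties broken alphabetically so the result is stable.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    # zip stops after 23 letters, so rarer letters get no equivalent.
    return dict(zip(ranked, AMINO_ACIDS))


def encode(text: str, mapping: dict[str, str]) -> str:
    """Encode text, discarding characters that have no equivalent."""
    return "".join(mapping[ch] for ch in text.lower() if ch in mapping)


def overlap_ratio(a: tuple, b: tuple) -> float:
    """Share of the shorter span covered by the overlap of two spans."""
    if a[0] != b[0]:  # spans in different documents never overlap
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    shorter = min(a[2] - a[1], b[2] - b[1])
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def cluster_hits(hit_pairs: list, min_overlap: float = 0.8) -> list:
    """Group spans: paired hits join, and so do spans that overlap enough."""
    spans = sorted({span for pair in hit_pairs for span in pair})
    parent = {span: span for span in spans}

    def find(span):
        # Follow parent links to the cluster representative (union-find).
        while parent[span] != span:
            parent[span] = parent[parent[span]]
            span = parent[span]
        return span

    for a, b in hit_pairs:
        parent[find(a)] = find(b)
    # Quadratic comparison: fine for a sketch, too slow for a real corpus.
    for i, a in enumerate(spans):
        for b in spans[i + 1:]:
            if overlap_ratio(a, b) >= min_overlap:
                parent[find(a)] = find(b)

    clusters = {}
    for span in spans:
        clusters.setdefault(find(span), set()).add(span)
    return list(clusters.values())
```

In this sketch, a hit pair such as `(("A", 0, 100), ("B", 10, 110))` stands for two aligned passages reported by the alignment tool. Two hits that touch nearly the same stretch of document "A" end up in one cluster even if they were reported separately.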
