Are New Digital Literacies Skills Needed? (RSCD 2018)


Remarrying research and collection services around access to corpora and text mining: are new technical literacy skills needed? Presented by Ingrid Mason (Deployment Strategist, AARNet) at the Research Support Community Day 2018.

  1. 1. Are new technical literacy skills needed? Remarrying research and collection services around access to corpora and text mining. INGRID MASON DEPLOYMENT STRATEGIST
  2. 2. I think yes, and I’m a bit excited about it.
  3. 3. Text & data mining (TDM) is being used by a range of researchers to target relevant literature and in HASS research. More research support will need to be provided. Should TDM services be coordinated nationally?
  4. 4. Many many questions: What library technical skills are needed (if there is a growing research support need)? Where do researchers go if they want to find, use, move, store or create a corpus? How do researchers learn to build, evaluate, and text mine a corpus? Where can/does/should this specialist service sit (in Library Research Support or in eResearch or in Faculty or in national research infrastructure services)? Psst. I don’t have answers, just the questions at this point. Sorry!
  5. 5. O M G O M G O M G OK, what’s a corpus? Find a definition, somewhere reliable [searches the web]. What does a corpus look like? Linguists will know this [searches the web]. How on earth do you “make that blob of stuff accessible”? [compute/storage?] How big is that text blob and what’s it made of? Corpus analyst? [new job title?] Who do I know that knows how to build a corpus? Ah, Steve Cassidy from Alveo VL. What makes for a well-balanced/formed corpus? Breathe, reach for library skills. What about commercially hosted text blobs? Read: Kylie Poulton’s VALA 2018 paper.
  6. 6. I’m a corpus building & TDM novice - I feel like an imposter. I’m old style but I’d like to give this a go. Would you? Schonfeld, Roger C & Christine Wolff-Eisenberg (2017). Taking a Closer Look at Talent Management: Findings from the US Library Survey, 10 April 2017. Ithaka S+R Blog. Last accessed: 18/04/2017
  10. 10. Digital Humanities Australasia 2016 Hobart, Australia
  11. 11. Digital Humanities Australasia 2016 Hobart, Australia
  12. 12. Alan Liu’s DH Toychest: Data Collections and Datasets. Question: How does this arrangement of resources in Liu’s DH Toychest change my understanding of collecting resources for research and supporting research? Answer: Quite a lot, I feel out of my depth, but also very intrigued and my fingers are tingling. Why? Challenge: I need to start looking into corpora and have a go at constructing a corpus (hint: two projects this year).
  14. 14. Library Technical Skills. Research support in: Research Data Management / Digital Scholarship / Digital Curation / Research Techniques. Using: IPython (now Jupyter) notebook - Natural Language Toolkit / Library Carpentry or Data Carpentry or Software Carpentry / Text Mining with R (O’Reilly). Psst: we aim for Jupyter notebooks connected to CloudStor (one notebook per person to play with).
  15. 15. A Trend: Expertise lies in the university to support text mining for research and scholarly literature searches. Biomedical Text Mining: An important problem that text mining attempts to address is information overload and overlook. Examples of solutions to this problem include Information Extraction, Document Summarisation, and Document Classification. In the following example we demonstrate the use of Text Mining to classify sentences in biomedical articles and extract key units of information. This provides a way for busy professionals to reduce the amount of information to which they are exposed and focus only on salient aspects in which they are interested. From: Text Mining Collaboration - UNSW
  16. 16. Learn More: Some history and definition of the terms (and more) is offered. Text mining & Text analysis - what is the difference? Text mining began with the computational and information management fields (e.g. database searching and information retrieval), whereas Text analysis began in the humanities with the manual analysis of text (e.g. Bible concordances and newspaper indexes). More recently, the two terms have become synonymous, and now generally refer to the use of computational methods to search, retrieve, and analyse text data. "Text mining or text analytics is an umbrella term describing a range of techniques that seek to extract useful information from document collections through the identification and exploration of interesting patterns in the unstructured textual data of various types of documents – such as books, web pages, emails, reports or product descriptions." (Truyens & van Eecke, 2014) From: Text Mining and Text Analysis - UQ (Research Techniques)
  17. 17. Digital Scholarship: How can research support for corpus building and text mining be scaled up? Text and data mining: Analyse large scale datasets in your research. Data mining is the process of applying open-ended computational methods to large scale datasets to discover new insights that may not be revealed through targeted smaller scale analyses. When the datasets used are bodies of text, this process is often termed text mining and can provide a complementary approach to traditional close readings of texts. Text and data mining (TDM) approaches can open up new areas of scholarly enquiry. Research Data Management - USYD (RDM)
  18. 18. Institutional vs National Services for Corpus Building & TDM? More library minds and coordination are needed in this space. What overlap is there with CAUL/CEIRC & NCRIS?
  19. 19. Sydney Stock Exchange Records - Institutional Digitisation for research. AARNet partnership with ANU Library and Noel Butlin Archive. Stock and Share Lists include ~199 registers of printed and written (copperplate) information that requires format conversion and automated translation. Records include company names, price of stocks, and share transactions from 1901-1950. An archival series that can be delivered for search and browse via an interface. A corpus that can be built and text mined and analysed via an interface.
  20. 20. HASS DEVL - National. The Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab (DEVL) will bring together fragmented data, tools and services into a shared workspace. Key outcomes from the project will be: ● Lowering barriers to entry for HASS infrastructure ● Increased interoperability between existing HASS platforms ● A more joined-up data landscape ● Data curation for better reuse, reproduction, and publishing of research data sets ● Game-changing skills and training activities. Funding and co-investment via NCRIS and institutional partners. https://www.ands-
  22. 22. HASS DEVL Data curation package: - Datasets sourced from Prosecution Project, NLA/TROVE, SLQ and APO - Datasets processed via Alveo and AURIN - Data curation framework between UoM, Alveo, AURIN, and NLA/TROVE. Will these composites of digital objects be a digital collection, a dataset, a data collection, a series, a demo corpus, a text corpus, or a linguistic corpus? We will need to explore this question together [please all don your curator’s hat].
  23. 23. Digital Collections ● AU government gazettes (NLA) ● QLD records of railway workers / publicans / government workers (SLQ) ● Court records from various states and territories (PP) ● Historical census data (ADA) ● Grey literature (APO) Trick question: which of these collections could be text mined and/or become a corpus?
  24. 24. UL Research Support?
  25. 25. Want to know more? #datawhodunnit #datalibs AARNet m/xmnpn eRSA BvBun INGRID MASON DEPLOYMENT STRATEGIST Read: Kylie Poulton’s VALA 2018 TDM Paper
  26. 26. Definitions and Examples
  27. 27. Text Mining ● Identifying linguistic patterns in text (as data) ● Categorising, clustering, or identifying named entities ● Abstracting, analysing and summarising (the textual content) ● Constrained by the extent and scope of the textual data ● Using programming languages like R or tools like Voyant
  28. 28. Text Corpora: The selection, extraction and processing of the text may involve linguistic methods but may not be for the purpose of studying language, rather to investigate the nature of text as semantic content. Take a look at Visualising Raynal - three editions of Guillaume-Thomas Raynal’s Histoire des deux Indes (1770, 1774, 1780). Part of the ANU Digitizing Raynal project led by Glenn Roe (working with Centre for Literary and Linguistic Computing (UoN)). PDFs from BNF (1770 + 1780) and Bodleian (1774).
  29. 29. Corpus (Corpora) If in doubt - dictionary time! a : all the writings or works of a particular kind or on a particular subject; especially : the complete works of an author b : a collection or body of knowledge or evidence; especially : a collection of recorded utterances used as a basis for the descriptive analysis of a language
  30. 30. Linguistic Corpora ● Australian National Corpus ● June Farris (Subject Specialist) at University of Chicago Library ● Linguistic Data Consortium (UPenn)
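
For anyone wanting to have a go: the text mining slide (27) talks about identifying patterns in text as data, and the Learn More slide (16) mentions concordances. The sketch below shows roughly what that can look like in a Jupyter notebook. It is only an illustration, not something from the talk: it uses the Python standard library rather than the Natural Language Toolkit, the three-document corpus is made up, and the function names (tokenize, term_frequencies, concordance) are hypothetical. A real corpus would be loaded from files and would need more careful tokenisation.

```python
import re
from collections import Counter

# Made-up three-document corpus for illustration only;
# a real corpus would be read from files or a collection API.
corpus = {
    "doc1": "The library supports text mining. Text mining needs a corpus.",
    "doc2": "A corpus is a collection of text.",
    "doc3": "Researchers build a corpus and mine the text.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_frequencies(docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts


def concordance(docs, keyword, width=2):
    """Keyword-in-context: up to `width` tokens either side of each hit."""
    lines = []
    for name, text in docs.items():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((name, left, tok, right))
    return lines


freqs = term_frequencies(corpus)
print(freqs.most_common(5))
for line in concordance(corpus, "corpus"):
    print(line)
```

Counting words and listing a keyword in context are about the simplest text mining steps there are, which is why they make a reasonable first notebook exercise. Categorising, clustering and named entity recognition would need a library such as NLTK or a tool like Voyant.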