Data Mining for scholarship


An introduction to text mining on digital content

Published in: Education, Technology
  • Introduce something on a+h focus

    1. The Million Book Challenge: data mining for scholarship. Alastair Dunning, Digitisation Programme Manager, JISC. 0203 006 6065 [email_address]
    2. From Keyhole to Open Door <ul><li>Many scholars still approach digitised data with the ‘search+browse’ paradigm </li></ul><ul><li>All digitised resources are initially constructed in this way </li></ul><ul><ul><li>E.g. British Library 19th-Century Newspapers – over 1m pages of text, billions of words – keyword searches tend to return lists of hundreds of pages </li></ul></ul><ul><li>Yet digitised resources can be analysed in their entirety </li></ul>
    3. Open Door possibilities <ul><li>Machine translation </li></ul><ul><li>Analogue to text – e.g. identifying footnotes within text, spotting the beginning and end of entries in encyclopedias and gazetteers </li></ul><ul><li>Information Extraction </li></ul><ul><ul><li>(Semantic) recognition of people, places, dates, organizations, citations, etc. </li></ul></ul><ul><li>For scholars, new types of research in understanding primary (or secondary) sources </li></ul>
    4. A Case Study: 17th-Century News <ul><li>Thanks to Ian Gregory and Andrew Hardie (University of Lancaster) for this study </li></ul><ul><li>Lancaster Newsbooks Corpus </li></ul><ul><ul><li>800,000 words of 1650s English newsbooks: </li></ul></ul><ul><ul><li>Every surviving newsbook from mid-Dec 1653 to end of May 1654 </li></ul></ul><ul><ul><li>Freely available via AHDS Catalogue </li></ul></ul><ul><ul><li>Methodological and technical problems exist – glossed over here </li></ul></ul>
    5. Step 1: Recognising geographies <ul><li>Extracting individual “mentions” of place names from the corpus </li></ul><ul><li>Proper nouns are identified via part-of-speech tagging, a well-established technology within linguistics </li></ul>
    6. <ul><ul><li>In the Patrick of Liverpoole – which we lately recovered from the Brest men of War – was one Walter Roche – who was to carry her to Brest – and he informed us - that there are these Ships following belonging to Brest – who do so vex us in these Seas – viz. </li></ul></ul><ul><ul><li><p> In_II the_AT <em> Patrick_NP1 </em> of_IO <em> <reg orig=&quot;Liverpoole&quot;> Liverpool_NP1 </reg> </em> ,_, which_DDQ we_PPIS2 lately_RR recovered_VVN from_II the_AT <em> Brest_JJT </em> men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 <em> Walter_NP1 Roche_NP1 </em> ,_, who_PNQS was_VBDZ to_TO carry_VVI her_PPHO1 to_II <em> Brest_NP1 </em> ,_, and_CC he_PPHS1 informed_VVD us_PPIO2 ,_, that_CST there_EX are_VBR these_DD2 Ships_NN2 following_II belonging_VVG to_II <em> Brest_NP1 </em> ,_, who_PNQS do_VD0 so_RR vex_VVI us_PPIO2 in_II these_DD2 Seas_NN2 ,_, <em> viz._REX </em> </p> </li></ul></ul>
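As a rough illustration of how proper nouns can be pulled out of tagged text like the sample above, the sketch below splits each `word_TAG` token and keeps those tagged `NP1`. The function name `proper_nouns` and the shortened sample string are assumptions made for this example; the slides do not describe the study's actual tooling.

```python
# Minimal sketch: collect proper nouns from word_TAG formatted text.
# TAGGED is a shortened copy of the slide's sample with the markup removed;
# proper_nouns() is a hypothetical helper, not the study's own code.
TAGGED = (
    "In_II the_AT Patrick_NP1 of_IO Liverpool_NP1 ,_, which_DDQ we_PPIS2 "
    "lately_RR recovered_VVN from_II the_AT Brest_JJT men_NN2 of_IO War_NN1 ,_, "
    "was_VBDZ one_MC1 Walter_NP1 Roche_NP1 ,_, who_PNQS was_VBDZ to_TO "
    "carry_VVI her_PPHO1 to_II Brest_NP1"
)


def proper_nouns(tagged_text, tags=("NP1",)):
    """Return the words whose part-of-speech tag is in `tags`, in order."""
    found = []
    for token in tagged_text.split():
        # Split on the last underscore so the tag is always the final part.
        word, sep, tag = token.rpartition("_")
        if sep and tag in tags:
            found.append(word)
    return found


MENTIONS = proper_nouns(TAGGED)
print(MENTIONS)
```

The result still mixes ship names and personal names (Patrick, Walter, Roche) with places, and it misses the first "Brest", which the sample tags as `JJT` rather than `NP1`. Tagging alone therefore does not isolate place names.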
    7. Step 2: Extracting place names and assigning co-ordinates <ul><li>Proper nouns compared to a gazetteer </li></ul><ul><ul><li>We chose </li></ul></ul><ul><ul><li>Places outside Europe filtered out </li></ul></ul><ul><ul><li>SQL database relational join </li></ul></ul><ul><ul><li>Filters out non-place-name proper nouns </li></ul></ul><ul><ul><li>Problem: duplicate place names (e.g. Newcastle in Ireland) </li></ul></ul><ul><li>Each instance of a place name is associated with (one or more) sets of co-ordinates </li></ul>
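The relational join described above can be sketched with Python's built-in `sqlite3` module. The table names, column names and rounded co-ordinates below are all illustrative assumptions; the slide does not say which gazetteer or schema was used.

```python
# Minimal sketch: join proper-noun mentions against a gazetteer table.
# Schema and (approximate) co-ordinates are hypothetical, for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mentions (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE gazetteer (name TEXT, region TEXT, lat REAL, lon REAL);
    """
)
conn.executemany(
    "INSERT INTO mentions (word) VALUES (?)",
    [("Patrick",), ("Liverpool",), ("Walter",), ("Brest",), ("Newcastle",)],
)
conn.executemany(
    "INSERT INTO gazetteer VALUES (?, ?, ?, ?)",
    [
        ("Liverpool", "England", 53.4, -3.0),
        ("Brest", "France", 48.4, -4.5),
        ("Newcastle", "England", 55.0, -1.6),
        ("Newcastle", "Ireland", 52.4, -9.1),
    ],
)

# An inner join keeps only mentions that match a gazetteer entry, so
# non-place proper nouns such as "Patrick" and "Walter" drop out.
ROWS = conn.execute(
    """
    SELECT m.id, m.word, g.region, g.lat, g.lon
    FROM mentions AS m
    JOIN gazetteer AS g ON g.name = m.word
    ORDER BY m.id, g.region
    """
).fetchall()

for row in ROWS:
    print(row)
```

The single "Newcastle" mention comes back twice, once per gazetteer entry, which is the duplicate place-name problem noted on the slide.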
    8. Step 3: And on to GIS… (GIS – Geographical Information System). Shown: Google Earth; ArcGIS; density smoothing in ArcGIS
    9. Step 4: Mapping by theme <ul><li>What is being discussed in relation to each mention of each place-name? </li></ul><ul><ul><li>We cannot tell just from the dates and co-ordinates </li></ul></ul><ul><li>Solution: concordance + semantic tagging </li></ul><ul><ul><li>USAS (UCREL Semantic Analysis System, University of Lancaster) </li></ul></ul><ul><li>Finding all terms related to a theme, e.g. money, cash, sterling, pound. </li></ul>
    10. Identifying thematic associations (a): semantic tags in immediate context <ul><li><hit_word>Dunkirk</hit_word> </li></ul><ul><li><text>DutchDiurn03</text> </li></ul><ul><li>of a rich Fleet from Z5 Z5 I1:1u Z2 Z5 </li></ul><ul><li>Dunkirk Z2 </li></ul><ul><li>, consisting of about forty A1:8u Z5 A13:4 N1 </li></ul>
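One way to picture the "semantic tags in immediate context" idea is a small window search around each place-name hit. The sketch below pairs the words and tags from the Dunkirk concordance line as they appear on the slide. The function `themed_mentions`, the window size and the rule of matching on the tag's leading code are assumptions for illustration, not the study's method.

```python
# Minimal sketch: find theme-tagged words near each mention of a place name.
# TOKENS pairs each word with the semantic tag shown on the slide;
# themed_mentions() and its window size are hypothetical.
import re

TOKENS = [
    ("of", "Z5"), ("a", "Z5"), ("rich", "I1:1u"), ("Fleet", "Z2"),
    ("from", "Z5"), ("Dunkirk", "Z2"), ("consisting", "A1:8u"),
    ("of", "Z5"), ("about", "A13:4"), ("forty", "N1"),
]


def top_level(tag):
    """Reduce a tag such as 'I1:1u' to its leading code, 'I1'."""
    return re.split(r"[:.]", tag)[0]


def themed_mentions(tokens, place, theme, window=5):
    """Return (position, themed_words) for each mention with a theme hit."""
    hits = []
    for i, (word, _tag) in enumerate(tokens):
        if word != place:
            continue
        # Context is up to `window` tokens either side, excluding the hit.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        themed = [w for w, t in context if top_level(t) == theme]
        if themed:
            hits.append((i, themed))
    return hits


HITS = themed_mentions(TOKENS, "Dunkirk", "I1")
print(HITS)
```

Here the word "rich" carries an I1 (money) tag within five tokens of "Dunkirk", so this mention would count towards a money-themed map. Searching the same context for G3 (warfare) finds nothing.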
    11. Mapping war… Tag G3 (“warfare, etc.”): 780 mentions <ul><li>Problems: </li></ul><ul><li>March – 18 mentions, 2 places </li></ul><ul><li>Munster – 12 mentions, 3 places </li></ul><ul><li>Newcastle – 5 mentions in west of Ireland </li></ul><ul><li>Manchester </li></ul><ul><ul><li>Middleton – 63 mentions, General in a rebellion in Scotland </li></ul></ul><ul><ul><li>Whalley – 10 mentions, General in a regiment of horse </li></ul></ul>
    12. Mapping money and government. Tag I1 (Money): 140 mentions. Tag G1 (Government): 293 mentions.
    13. 1m Books Challenge (I) <ul><li>Such analysis could be done on any dataset </li></ul><ul><li>Concept developed by Greg Crane, Tufts University, Director of Perseus Project </li></ul><ul><li>Taken up by six funding bodies (from the US, UK, Canada and Germany) to create an international grant competition – name t.b.c. </li></ul><ul><li>Competition to forge international partnerships to undertake the type of work highlighted in the case study </li></ul>
    14. 1m Books Challenge (II) <ul><li>Will involve scholars, computer scientists, information managers and publishers </li></ul><ul><li>Competition is seeking to open up publishers’ content to allow for this type of analysis </li></ul><ul><li>Competition due to open in January 2009; significant time built into the call to allow for relationships with publishers to be developed </li></ul>
    15. Technical & Legal Issues (I) <ul><li>Scholars need to have a copy of the corpus / dataset to be analysed </li></ul><ul><ul><li>Difficulties in actually transferring large corpora </li></ul></ul><ul><ul><li>Obvious IPR risks – material could be passed on </li></ul></ul><ul><li>Or publishers need to make entire corpus available online </li></ul><ul><ul><li>Technically complex; requires powerful infrastructure; how does online content interact with analytical tools? </li></ul></ul>
    16. Technical & Legal Issues (II) <ul><li>Experiments need to be repeatable </li></ul><ul><ul><li>Peer-review demands that other scholars have access to a corpus to review peers’ analyses </li></ul></ul><ul><li>Where are enriched datasets stored? </li></ul><ul><ul><li>Proliferating number of enriched datasets </li></ul></ul><ul><ul><li>Demand for delivering enriched datasets (or parts of them) for research and teaching </li></ul></ul><ul><li>Who owns the IPR in an enriched dataset? </li></ul><ul><ul><li>Original publisher – yes. Researcher? Software, thesaurus and gazetteer creators? </li></ul></ul><ul><li>How does this work for records, images, maps, audio, video? </li></ul>
    17. Significant Challenges <ul><li>Significant challenges exist for all stakeholders </li></ul><ul><li>But possibilities for exploiting investment in creating digital content </li></ul><ul><li>And potential for new avenues of research which scholars will wish to explore </li></ul><ul><li>‘Million Books Challenge’ will help explore some of these issues </li></ul>