Data Mining for scholarship

    Notes on slide 1

    Introduce something on a+h focus

    Data Mining for scholarship - Presentation Transcript

    1. The Million Book Challenge: data mining for scholarship Alastair Dunning Digitisation Programme Manager, JISC 0203 006 6065 [email_address]
    2. From Keyhole to Open Door
      • Many scholars still approach digitised data with the ‘search+browse’ paradigm
      • All digitised resources are initially constructed in this way
        • E.g. British Library 19th-Century Newspapers – over 1m pages of text, billions of words – keyword searches tend to reveal lists of 100s of pages
      • Yet digitised resources can be analysed in their entirety
    3. Open Door possibilities
      • Machine translation
      • Analogue to text – e.g. identifying footnotes within text, spotting the beginning and end of entries in encyclopedias and gazetteers
      • Information Extraction
        • (Semantic) recognition of people, places, dates, and organizations, citations etc.
      • For scholars, new types of research in understanding primary (or secondary) sources
    4. A Case Study: 17th-Cent. News
      • Thanks to Ian Gregory, Andrew Hardie (University of Lancaster) for this study
      • Lancaster Newsbooks Corpus
        • 800,000 words of 1650s English newsbooks:
        • Every surviving newsbook from mid-Dec 1653 to end of May 1654
        • Freely available via AHDS Catalogue
        • Methodological and technical problems exist – skirted over here
    5. 1. Recognising geographies
      • Extracting individual “mentions” of place names from the corpus
      • Proper nouns are identified via part-of-speech tagging, a well-established technology within linguistics
        • In the Patrick of Liverpoole – which we lately recovered from the Brest men of War – was one Walter Roche – who was to carry her to Brest – and he informed us - that there are these Ships following belonging to Brest – who do so vex us in these Seas – viz.
        • <p> In_II the_AT <em> Patrick_NP1 </em> of_IO <em> <reg orig="Liverpoole"> Liverpool_NP1 </reg> </em> ,_, which_DDQ we_PPIS2 lately_RR recovered_VVN from_II the_AT <em> Brest_JJT </em> men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 <em> Walter_NP1 Roche_NP1 </em> ,_, who_PNQS was_VBDZ to_TO carry_VVI her_PPHO1 to_II <em> Brest_NP1 </em> ,_, and_CC he_PPHS1 informed_VVD us_PPIO2 ,_, that_CST there_EX are_VBR these_DD2 Ships_NN2 following_II belonging_VVG to_II <em> Brest_NP1 </em> ,_, who_PNQS do_VD0 so_RR vex_VVI us_PPIO2 in_II these_DD2 Seas_NN2 ,_, <em> viz._REX </em> </p>
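
As a rough illustration of the step above, the sketch below pulls proper nouns out of a shortened copy of the tagged sentence. It assumes every token has the form word_TAG and that proper-noun tags start with "NP", as NP1 does in the sample; the function name `proper_nouns` is hypothetical and this is not the Lancaster team's actual code.

```python
import re

# Shortened copy of the tagged sentence shown on the slide.
TAGGED = ('<p> In_II the_AT <em> Patrick_NP1 </em> of_IO <em> '
          '<reg orig="Liverpoole"> Liverpool_NP1 </reg> </em> ,_, which_DDQ '
          'we_PPIS2 lately_RR recovered_VVN from_II the_AT <em> Brest_JJT </em> '
          'men_NN2 of_IO War_NN1 ,_, was_VBDZ one_MC1 <em> Walter_NP1 Roche_NP1 '
          '</em> ,_, who_PNQS was_VBDZ to_TO carry_VVI her_PPHO1 to_II <em> '
          'Brest_NP1 </em> ,_, </p>')


def proper_nouns(tagged_text, tag_prefix="NP"):
    """Return the words whose part-of-speech tag starts with tag_prefix."""
    # Drop XML markup such as <em> and <reg orig="...">, keeping word_TAG tokens.
    without_markup = re.sub(r"<[^>]+>", " ", tagged_text)
    nouns = []
    for token in without_markup.split():
        # Split on the last underscore: "Liverpool_NP1" -> ("Liverpool", "NP1").
        word, sep, tag = token.rpartition("_")
        if sep and tag.startswith(tag_prefix):
            nouns.append(word)
    return nouns


print(proper_nouns(TAGGED))
```

The result still contains personal names and a ship's name (Walter, Roche, Patrick), and "Brest" tagged as JJT in "Brest men of War" is skipped, which is why a later filtering step against a gazetteer is needed.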
    6. 2. Extracting place names and assigning co-ordinates
      • Proper nouns compared to a gazetteer
        • We chose http://www.world-gazetteer.com
        • Places outside Europe filtered out
        • SQL database relational join
        • Filters out non-place-name proper nouns
        • Problem: duplicate place names (e.g. Newcastle in Ireland)
      • Each instance of a place name is associated with (one or more) sets of coordinates
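
The relational join described above might look roughly like the following sketch, which uses Python's built-in sqlite3 module. The table layout, the function name `match_places` and the coordinates are all assumptions for illustration: the coordinates are approximate and were not taken from the gazetteer the project used.

```python
import sqlite3

# Illustrative rows only: approximate coordinates, not a real gazetteer export.
GAZETTEER = [
    ("Liverpool", 53.41, -2.98),
    ("Brest", 48.39, -4.49),
    ("Newcastle", 54.98, -1.61),  # Newcastle upon Tyne
    ("Newcastle", 52.45, -9.06),  # a Newcastle in the west of Ireland
]
# Proper nouns as they might come out of the tagging step.
MENTIONS = ["Patrick", "Liverpool", "Walter", "Roche", "Brest", "Newcastle"]


def match_places(mentions, gazetteer):
    """Join proper-noun mentions to gazetteer rows by exact name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (id INTEGER PRIMARY KEY, word TEXT)")
    conn.execute("CREATE TABLE gazetteer (name TEXT, lat REAL, lon REAL)")
    conn.executemany("INSERT INTO mentions (word) VALUES (?)",
                     [(w,) for w in mentions])
    conn.executemany("INSERT INTO gazetteer VALUES (?, ?, ?)", gazetteer)
    # The inner join drops proper nouns with no gazetteer entry and
    # returns one row per candidate location for ambiguous names.
    rows = conn.execute(
        "SELECT m.word, g.lat, g.lon FROM mentions AS m "
        "JOIN gazetteer AS g ON g.name = m.word "
        "ORDER BY m.id, g.lat"
    ).fetchall()
    conn.close()
    return rows


for row in match_places(MENTIONS, GAZETTEER):
    print(row)
```

Non-place names such as Walter and Roche drop out because they have no gazetteer row, while Newcastle comes back twice, which is the duplicate place name problem noted on the slide.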
    7. 3. And on to GIS… (GIS – Geographical Information System)
      • Google Earth
      • ArcGIS
      • Density smoothing in ArcGIS
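
The slides do not say which smoothing method was used in ArcGIS. As a generic illustration of density smoothing, the sketch below computes a simple Gaussian kernel density over a coarse grid. The function name `kernel_density`, the bandwidth and the sample points are assumptions, and treating degrees of latitude and longitude as flat coordinates is a deliberate simplification.

```python
import math


def kernel_density(points, grid_lats, grid_lons, bandwidth=1.0):
    """Return a grid (list of rows) of summed Gaussian kernel values.

    points is a list of (lat, lon) pairs, one per place-name mention.
    Distances are measured in degrees on a flat plane (a simplification).
    """
    grid = []
    for lat in grid_lats:
        row = []
        for lon in grid_lons:
            total = 0.0
            for p_lat, p_lon in points:
                d2 = (lat - p_lat) ** 2 + (lon - p_lon) ** 2
                total += math.exp(-d2 / (2 * bandwidth ** 2))
            row.append(total)
        grid.append(row)
    return grid


# Hypothetical mention coordinates: two near Liverpool, one near Brest.
POINTS = [(53.4, -3.0), (53.5, -2.9), (48.4, -4.5)]
GRID = kernel_density(POINTS, grid_lats=[48, 51, 54], grid_lons=[-6, -3, 0])

for row in GRID:
    print(["%.3f" % value for value in row])
```

Cells close to several mentions receive higher values than cells near a single mention or none, which is what a smoothed density map displays.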
    8. 4. Mapping by theme
      • What is being discussed in relation to each mention of each place-name?
        • We cannot tell just from the dates and co-ordinates
      • Solution: concordance + semantic tagging
        • USAS system (University of Lancaster Semantic Analysis System)
      • Finding all terms related to a theme, e.g. money, cash, sterling, pound.
    9. Identifying thematic associations (a): semantic tags in immediate context
      • <hit_word>Dunkirk</hit_word>
      • <text>DutchDiurn03</text>
      • of/Z5 a/Z5 rich/I1:1u Fleet/Z2 from/Z5
      • Dunkirk/Z2
      • , consisting/A1:8u of/Z5 about/A13:4 forty/N1
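
A minimal sketch of the idea on this slide: look at the semantic tags in a window around each mention of a place and check whether any belongs to a theme. The pairing of words with tags below is an assumption based on the order in which they appear in the concordance line, the tags are copied as printed, and the function name `mentions_with_theme` is hypothetical.

```python
# Word/tag pairs for the Dunkirk concordance line (alignment assumed).
TOKENS = [
    ("of", "Z5"), ("a", "Z5"), ("rich", "I1:1u"), ("Fleet", "Z2"),
    ("from", "Z5"), ("Dunkirk", "Z2"), ("consisting", "A1:8u"),
    ("of", "Z5"), ("about", "A13:4"), ("forty", "N1"),
]


def mentions_with_theme(tokens, place, theme, window=5):
    """Find mentions of place with a theme tag among nearby words.

    Returns a list of (token_index, matching_context_words) tuples.
    A context word matches when its semantic tag starts with theme.
    """
    hits = []
    for i, (word, _tag) in enumerate(tokens):
        if word != place:
            continue
        start = max(0, i - window)
        # Words before and after the mention, excluding the mention itself.
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        matched = [w for w, t in context if t.startswith(theme)]
        if matched:
            hits.append((i, matched))
    return hits


print(mentions_with_theme(TOKENS, "Dunkirk", "I1"))  # money-related context
print(mentions_with_theme(TOKENS, "Dunkirk", "G3"))  # warfare-related context
```

Here the mention of Dunkirk is linked to the money theme through "rich" and has no warfare tag in its context. Prefix matching is a simplification: a real tagset may need exact category matching so that, for example, a prefix does not also match a longer, unrelated tag.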
    10. Mapping war…
      • Tag: G3 “warfare, etc” – 780 mentions
      • Problems:
      • March – 18 mentions, 2 places
      • Munster – 12 mentions, 3 places
      • Newcastle – 5 mentions in west of Ireland
      • Manchester
        • Middleton – 63 mentions, General in a rebellion in Scotland
        • Whalley – 10 mentions, General in a regiment of horse
    11. Mapping money and government
      • I1 Money – 140 mentions
      • G1 Government – 293 mentions
    12. 1m Books Challenge (I)
      • Such analysis could be done on any dataset
      • Concept developed by Greg Crane, Tufts University, Director of Perseus Project
      • Taken up by six funding bodies to create international grant competition (from US, UK, Canada, Germany) – name t.b.c.
      • Competition to forge international partnerships to undertake the type of work highlighted in the case study
    13. 1m Books Challenge (II)
      • Will involve scholars, computer scientists, information managers and publishers
      • Competition is seeking to open up publishers’ content to allow for this type of analysis
      • Competition due to open in January 2009; significant time built into the call to allow for relationships with publishers to be developed
    14. Technical & Legal Issues (I)
      • Scholars need to have a copy of the corpus / dataset to be analysed
        • Difficulties in actually transferring large corpora
        • Obvious IPR risks – material could be passed on
      • Or publishers need to make entire corpus available online
        • Technically complex; requires powerful infrastructure; how does online content interact with analytical tools?
    15. Technical & Legal Issues (II)
      • Experiments need to be repeatable
        • Peer-review demands that other scholars have access to a corpus to review peers’ analyses
      • Where are enriched datasets stored?
        • Proliferating number of enriched datasets
        • Demand for delivering enriched datasets (or parts of them) for research and teaching
      • Who owns the IPR in an enriched data set?
        • Original publisher – yes. Researcher? Software, thesaurus and gazetteer creators?
      • How does this work for records, images, maps, audio, video?
    16. Significant Challenges
      • Significant challenges exist for all stakeholders
      • But possibilities for exploiting investment in creating digital content
      • And potential for new avenues of research which scholars will wish to explore
      • ‘Million Books Challenge’ will help explore some of these issues

    Alastair Dunning, 12 months ago

    An introduction to text mining on digital content

    © All Rights Reserved
