Exploring Challenges in Mining Historical Text


    1. Exploring Challenges in Mining Historical Text
       Beatrice Alex, Claire Grover, Richard Tobin and Ewan Klein
       Working with text: Tools, techniques and approaches for text mining - Edinburgh - 07/07/2012
    2. Overview
       ‣ Project
       ‣ Data
       ‣ Preprocessing historical text
         ‣ Improvements to OCR
         ‣ Language identification
         ‣ Text mining tables
       ‣ Text mining
         ‣ Improved commodity identification
         ‣ Ports-based geo-grounding
         ‣ Relation extraction
    3. Project (01/2012-12/2014)
       ‣ Funded by Digging into Data (round 2)
       ‣ Partners
         ‣ Ewan Klein, Claire Grover, Bea Alex (text mining)
         ‣ Colin Coates, Jim Clifford (historical analysis)
         ‣ James Reid (data integration)
         ‣ Aaron Quigley, Uta Hinrichs (information visualisation)
    4. Trading Consequences
       ‣ What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century?
       ‣ Help historians to discover novel patterns and explore new hypotheses.
       ‣ Example questions:
         ‣ What were the routes and volumes of international trade in resource commodities 1850-1914?
         ‣ What were the local environmental consequences of the demand for these resources?
    5. Geolocating Cinchona
    6. Trading Consequences
       ‣ Scope: global, but with a focus on Canadian natural resource flows to test the reliability and efficacy of our methods
       ‣ Methods:
         ‣ Text mining and geo-parsing to transform the text into structured data, e.g. a relational database
         ‣ Query interface targeted at historians
         ‣ Information visualisation for interactive exploration
    7. Historical Data
       ‣ Digitised sources from the 19th-century British Empire, currently being processed:
         ‣ Early Canadiana Online: 83,038 files
         ‣ JSTOR data: 1,000 XML files
         ‣ House of Commons Parliamentary Papers: 4,135 files
         ‣ Books: selected books on nineteenth-century trade
       ‣ Further sources:
         ‣ ProQuest data
         ‣ Encyclopaedia Britannica, JSTOR Plants, Forestry Journals?, The Botanist?
    8. Processing Historical Data
       ‣ Challenges so far:
         ‣ Different formats
         ‣ Low-quality OCRed text
           ‣ Old/low-quality prints, quality of OCR technology
           ‣ Historical English: historical word variants, ſ (long s) characters mixed up with f by OCR
           ‣ Artefacts in original documents: headers/footers, page numbers, notes in margins, end-of-line hyphenation
         ‣ Text in different languages
         ‣ Information in tables
    10. Improvements to OCR
        ‣ Normalisation and post-correction
        ‣ Fixed end-of-line hyphenation
          ‣ Dehyphenate all token-splitting hyphens using a dictionary-based approach (the dictionary is the system dictionary plus the text of the current document)
        ‣ Added f-to-s conversion
          ‣ Convert all false f characters to s using a corpus-based approach (the corpus is a collection of historical documents from Project Gutenberg)
        ‣ Example: reduced the number of words unrecognised by the spell checker from 61 to 21 -> approx. 66% reduction
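The two post-correction steps on this slide can be illustrated with a minimal Python sketch. It is not the project's actual tool: the function names `dehyphenate` and `fix_long_s`, the regular expression, and the rule "pick the variant with the highest corpus count" are all assumptions made for illustration. The dictionary and the corpus counts are passed in as plain Python collections.

```python
import re
from itertools import product


def dehyphenate(text, dictionary):
    """Join words split by an end-of-line hyphen when the joined form is known.

    Assumption: the lexicon is the supplied dictionary plus every word that
    occurs in the document itself, as described on the slide.
    """
    lexicon = {w.lower() for w in dictionary}
    lexicon |= {w.lower() for w in re.findall(r"[A-Za-z]+", text)}

    def join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in lexicon:
            return left + right          # token-splitting hyphen: remove it
        return left + "-" + right        # unknown joined form: keep the hyphen

    return re.sub(r"([A-Za-z]+)-\s*\n\s*([A-Za-z]+)", join, text)


def fix_long_s(word, corpus_counts):
    """Replace false 'f' characters with 's' when the corpus attests the variant.

    Assumption: every f/s combination is tried and the variant with the
    highest corpus count wins; the original word is kept on ties. This is
    exponential in the number of 'f' characters, which is fine for a sketch.
    """
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    best, best_count = word, corpus_counts.get(word.lower(), 0)
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        count = corpus_counts.get(candidate.lower(), 0)
        if count > best_count:
            best, best_count = candidate, count
    return best


sample = "The com-\nmodity trade and the well-\nknown port."
print(dehyphenate(sample, {"commodity"}))
print(fix_long_s("colonifts", {"colonists": 12}))
```

In this sketch "com-/modity" is joined because "commodity" is in the dictionary, while "well-/known" keeps its hyphen because "wellknown" is not attested anywhere.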
    11. Improvements to OCR
    12. Improvements to OCR
    13. Improvements to OCR
        ‣ Extensive evaluation of both tools against a human-corrected/normalised gold standard
        ‣ Reduced word error rate by 12.5% in a random Canadiana sample (word accuracy: 0.776 -> 0.804)
        ‣ Improvements have an effect on later text mining steps and would also be beneficial for searching text in any IR system (e.g. JSTOR database search for "French colonifts")
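The 12.5% figure is a relative reduction in word error rate, not an absolute change in accuracy. A short check of the arithmetic, assuming word error rate is taken as 1 minus word accuracy (the helper name is illustrative):

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative reduction in error rate, with error rate = 1 - accuracy."""
    err_before = 1.0 - acc_before   # 0.224
    err_after = 1.0 - acc_after     # 0.196
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(0.776, 0.804)
print(f"{reduction:.1%}")  # prints 12.5%
```

The absolute gain in accuracy is 2.8 points (0.776 to 0.804); dividing that by the original error rate of 0.224 gives the 12.5% quoted on the slide.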
    14. Language Identification
        ‣ Most sources do not contain language information like Canadiana does
        ‣ The table displays the number of text elements in Canadiana per language, ignoring notes and titles

        ISO Code   Language          Frequency
        eng        English           2,677,498
        fra        French            1,208,811
        deu        German                2,886
        chn        Chinook jargon        2,488
        moh        Mohawk                1,547
        oji        Ojibwa                1,395
        emg        Eastern Meohang         835
        enb        Markweeta               666
        cre        Cree                    501
        iro        Iroquoian               324
        alg        Algonquian              210
        nge        Ngemba                  157
        nld        Dutch                   131
        lat        Latin                   119
        mic        Micmac                   61
        gla        Scottish Gaelic          22
    15. Language Identification
        ‣ Make use of automatic language identification using TextCat, especially for the JSTOR data, which is also multi-lingual.
        ‣ LID is done for each paragraph and for the entire document by taking the most frequent language tag assigned.
        ‣ Can limit processing to English (and French) documents only.
        ‣ 740 English documents (out of 1,000)
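The paragraph-then-document voting scheme can be sketched as follows. The project uses TextCat for the per-paragraph decision; the stopword counter below is only a toy stand-in, and the names `identify_paragraph`, `identify_document` and the `STOPWORDS` table are assumptions for illustration.

```python
from collections import Counter

# Toy stand-in for a real identifier such as TextCat: counts stopword hits.
STOPWORDS = {
    "eng": {"the", "of", "and", "to", "in"},
    "fra": {"le", "la", "les", "de", "et"},
}


def identify_paragraph(paragraph):
    """Return the language whose stopwords occur most often in the paragraph."""
    tokens = paragraph.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(scores, key=scores.get)


def identify_document(paragraphs):
    """Tag each paragraph, then label the document with the most frequent tag."""
    tags = [identify_paragraph(p) for p in paragraphs]
    doc_lang = Counter(tags).most_common(1)[0][0]
    return tags, doc_lang


paragraphs = [
    "The trade of timber in the colonies",
    "Le commerce de la fourrure et les ports",
    "Exports to the ports of England and Wales",
]
tags, doc_lang = identify_document(paragraphs)
print(tags, doc_lang)
```

Keeping the per-paragraph tags alongside the document label makes it possible to process only the English (or French) paragraphs of a mixed document.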
    16. Text Mining Tables
    17. Text Mining Tables
        ‣ Tables contain a lot of relevant information but are difficult to mine.
        ‣ HCPP documents contain coordinates for each table entry.
          <w p="961,1777,1026,1807" v="d">Rio</w>
          <w p="1026,1777,1170,1807" v="d">Janeiro</w>
          ...
          <w p="961,1892,1087,1921" v="n">Culcutta</w>
          <w p="1496,1530,1565,1555" v="o">141</w>
          <w p="1565,1525,1631,1555" v="d">bags</w>
          <w p="1227,1774,1336,1804" v="d">Wood</w>
          <w p="1353,1791,1366,1799" v="o">-</w>
          <w p="1494,1776,1565,1804" v="o">338</w>
          <w p="1565,1783,1676,1803" v="d">planks</w>
          <w p="1704,1791,1718,1799" v="o">-</w>
        ‣ Planning to do a feasibility study for a table mining algorithm.
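One way such a feasibility study might start is by grouping words into table rows using their coordinates. The sketch below is an assumption, not the project's algorithm: it reads the `p` attribute as `x1,y1,x2,y2` (left, top, right, bottom), and it starts a new row whenever the vertical centre jumps by more than a hypothetical `tolerance` of 15 units.

```python
import xml.etree.ElementTree as ET


def group_rows(xml_fragment, tolerance=15):
    """Group <w> elements into rows by vertical position, left to right."""
    root = ET.fromstring(f"<page>{xml_fragment}</page>")
    words = []
    for w in root.iter("w"):
        x1, y1, x2, y2 = (int(n) for n in w.get("p").split(","))
        words.append(((y1 + y2) / 2, x1, w.text))
    words.sort()  # top to bottom, then left to right

    rows, last_y = [], None
    for y, x, text in words:
        # A large vertical jump means the word belongs to a new row.
        if last_y is None or y - last_y > tolerance:
            rows.append([])
        rows[-1].append((x, text))
        last_y = y
    return [[text for _, text in sorted(row)] for row in rows]


fragment = """
<w p="961,1777,1026,1807" v="d">Rio</w>
<w p="1026,1777,1170,1807" v="d">Janeiro</w>
<w p="961,1892,1087,1921" v="n">Culcutta</w>
<w p="1496,1530,1565,1555" v="o">141</w>
<w p="1565,1525,1631,1555" v="d">bags</w>
<w p="1227,1774,1336,1804" v="d">Wood</w>
<w p="1353,1791,1366,1799" v="o">-</w>
<w p="1494,1776,1565,1804" v="o">338</w>
<w p="1565,1783,1676,1803" v="d">planks</w>
<w p="1704,1791,1718,1799" v="o">-</w>
"""
rows = group_rows(fragment)
for row in rows:
    print(row)
```

On the words shown on the slide, this puts "Rio Janeiro Wood - 338 planks -" in one row. Assigning cells to columns would need a similar pass over the horizontal coordinates.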
    18. Text Mining Pipeline
        ‣ Steps after the OCR improvements and LID:
          ‣ Tokenisation
          ‣ Part-of-speech tagging
          ‣ Lemmatisation
          ‣ WordNet lookup to find commodities
          ‣ Named-entity recognition including commodity lexicon lookup
          ‣ Port-based geo-grounding
          ‣ Chunking
    20. Commodities Identification
        ‣ WordNet lookup using an approximation of commodity named entities:
          ‣ Noun phrases with hypernyms such as substance, physical matter, plant or animal in WordNet.
          ‣ Each NP which leads to a match is assigned a wn="true" attribute.
        ‣ Commodities gazetteer lookup using a list of commodities derived by historians.
          ‣ Strings matching the entries in the gazetteer are assigned a commlex="true" attribute.
        ‣ Words/phrases with wn="true" and commlex="true" are good candidates.
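The combination of the two lookups can be sketched in a few lines of Python. The `HYPERNYMS` table is a small hand-written stand-in for WordNet hypernym chains and `COMMODITY_LEXICON` is a hypothetical gazetteer; neither reflects the project's actual resources.

```python
# Hypothetical, hand-written stand-in for WordNet hypernym chains.
HYPERNYMS = {
    "cotton": ["fibre", "substance"],
    "cinchona": ["tree", "plant"],
    "whisky": ["liquor", "substance"],
    "notice": ["announcement", "communication"],
}
# Hypernym classes named on the slide.
TARGET_CLASSES = {"substance", "physical matter", "plant", "animal"}
# Hypothetical gazetteer entries; the real list was derived by historians.
COMMODITY_LEXICON = {"cotton", "cinchona", "timber"}


def annotate(tokens):
    """Attach wn and commlex flags to every token."""
    annotated = []
    for tok in tokens:
        key = tok.lower()
        wn = any(h in TARGET_CLASSES for h in HYPERNYMS.get(key, []))
        commlex = key in COMMODITY_LEXICON
        annotated.append({"word": tok, "wn": wn, "commlex": commlex})
    return annotated


def candidates(annotated):
    """Keep only tokens flagged by both the WordNet and the gazetteer lookup."""
    return [a["word"] for a in annotated if a["wn"] and a["commlex"]]


annotated = annotate(["raw", "cotton", "whisky", "timber", "notice"])
print(candidates(annotated))
```

Requiring both flags trades recall for precision: "whisky" (WordNet only) and "timber" (gazetteer only, in this toy table) are both dropped.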
    21. Ports-based Geo-grounding
        ‣ Started with non-optimised geo-resolution.
        ‣ Incorporated the list of ports. Locations are assigned an is_port="1" or an is_port="0" attribute.
          ‣ Grounding now ignores non-port candidates in case of ambiguous location mentions.
          ‣ is_port locations are also given a higher weight in the scoring.
        ‣ Hypothesis: ports are more likely to be significant locations in historic documents about trade.
        ‣ Not tested yet, as gold standard data is needed.
    22. Ports-based Geo-grounding
        ‣ Example: Dalhousie is in the list of ports as:
          DALHOUSIE                -66.4   48.1
          Geo-grounding in non-optimised resolver:
          <ent id="rb3" type="location" lat="32.5333300" long="75.9833300" in-country="IN" gazref="geonames:1273648" feat-type="ppl" pop-size="7601">
            <parts> <part ew="w136" sw="w136">Dalhousie</part> </parts>
          </ent>
          Geo-grounding in ports-dependent resolver:
          <ent id="rb2" type="location" lat="48.0550200" long="-66.3847200" in-country="CA" gazref="geonames:6943599" feat-type="ppl" pop-size="0">
            <parts> <part ew="w97" sw="w97">Dalhousie</part> </parts>
          </ent>
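The effect of the ports list on the Dalhousie example can be reproduced with a toy resolver. Several things here are assumptions, since the slides do not describe the scoring: the baseline ranks candidates by population alone, a candidate counts as a port if it lies within 0.5 degrees of the port-list entry of the same name, and the extra scoring weight for ports is left out because the filter already decides this example.

```python
# Port list entry from the slide, stored as name -> (lat, long).
PORTS = {"dalhousie": (48.1, -66.4)}

# The two gazetteer candidates shown on the slide.
CANDIDATES = [
    {"gazref": "geonames:1273648", "lat": 32.53333, "long": 75.98333,
     "country": "IN", "pop": 7601},
    {"gazref": "geonames:6943599", "lat": 48.05502, "long": -66.38472,
     "country": "CA", "pop": 0},
]


def is_port(name, cand, max_deg=0.5):
    """True if the candidate lies close to a port of the same name."""
    port = PORTS.get(name.lower())
    if port is None:
        return False
    return (abs(cand["lat"] - port[0]) <= max_deg
            and abs(cand["long"] - port[1]) <= max_deg)


def resolve(name, candidates, use_ports=True):
    """Pick one candidate; with use_ports, ambiguous mentions prefer ports."""
    pool = candidates
    if use_ports and len(candidates) > 1:
        ports = [c for c in candidates if is_port(name, c)]
        if ports:
            pool = ports  # ignore non-port candidates
    # Assumed baseline score: larger population wins.
    return max(pool, key=lambda c: c["pop"])


print(resolve("Dalhousie", CANDIDATES, use_ports=False)["country"])
print(resolve("Dalhousie", CANDIDATES, use_ports=True)["country"])
```

Without the ports list the larger Indian town wins; with it, the Canadian port is selected even though its recorded population is 0.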
    23. Ports-based Geo-grounding
        ‣ Geo-grounding assumes that each text is a coherent whole. All locations contribute to the resolution of all others. May have to change that.
        ‣ Segmentation (e.g. of books) into smaller units might improve the resolution.
        ‣ Need to consider old spellings of place names.
    24. Relation Extraction
        ‣ Crude way to identify commodity-location relations:
          ‣ Sentences (s) containing words (w) with commlex="true" and wn="true", together with a location.
        Good: The quantity of raw cotton imported annually into the United Kingdom—take for example, the year 1854—amounted to, at least, 887,335,9041bs., of which the United States supplied 722,154,101 lbs.
        Of interest: Another kind of quinine-yieldmg bark has been discovered on the western side of the Cordillera, which produces more sulphate than the common cinchona; and as the cinchona grows on both sides of the Cordillera, it may be inferred that the new plant will be found also in the lands of Gualaquiza and Canelos.
        Bad: The first-class refreshment room, Central Station, Leeds, has a notice that only five-year old whisky is sold there. OR This paper was concealed in the handle of a spear, carried from Omdurman to Gedarif in that way.
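The crude co-occurrence rule amounts to pairing every commodity candidate with every location in the same sentence. A minimal sketch, assuming each sentence is a list of token dictionaries carrying the `commlex`, `wn` and `type` annotations (this data layout is an assumption, not the project's XML):

```python
from itertools import product


def extract_relations(sentences):
    """Pair each commodity candidate with each location in the same sentence."""
    relations = []
    for sent in sentences:
        commodities = [t["word"] for t in sent
                       if t.get("commlex") and t.get("wn")]
        locations = [t["word"] for t in sent if t.get("type") == "location"]
        relations.extend(product(commodities, locations))
    return relations


sentences = [
    [  # abridged from the "Good" example
        {"word": "cotton", "commlex": True, "wn": True},
        {"word": "United Kingdom", "type": "location"},
        {"word": "United States", "type": "location"},
    ],
    [  # abridged from the first "Bad" example
        {"word": "Leeds", "type": "location"},
        {"word": "whisky", "commlex": True, "wn": True},
    ],
]
relations = extract_relations(sentences)
print(relations)
```

The second sentence shows why the rule is called crude: it yields a whisky-Leeds pair even though the sentence says nothing about trade.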
    25. Relation Extraction
    26. Relation Extraction
        ‣ Need to improve the relation extraction.
        ‣ Will look at pattern-based relation extraction exploiting vocabulary like "import", "export", "ship", "shipment", "trade", "manufacture", "grow" etc.
        ‣ Will annotate a small test corpus for evaluation.
        ‣ Need to distinguish between irrelevant or false commodity-location relations and commodity-location relations referring to trade.
    27. Thank You
        ‣ Questions?
    28. Example Input
        ‣ Different sources converted into common XML format
    29. Example Output
