Digital History and Big Data: text mining historical documents on trade in the British empire
 

    Presentation Transcript

    • Digital history and big data: Text mining historical documents on trade in the British Empire
      Beatrice Alex, balex@inf.ed.ac.uk
      Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013
    • Overview
      What is text mining?
      Text mining in digital history
      Trading Consequences
      “Big data”
      Visualisation
      Challenge of noisy data
      Collaborating with historians
    • Text Mining
      Describes a set of linguistic, statistical and machine learning techniques that model and structure the information content of textual resources.
      Turns unstructured text into structured data (e.g. relational database or linked data).
      Is very useful for analysing large text collections automatically.
    • Text Mining
      TM methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation and syntactic parsing (chunking).
      Currently our focus is on named entity recognition, entity grounding and relation extraction.
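The first two pre-processing steps named above, sentence detection and tokenisation, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the pipeline used in the project: the regular expressions, the function names `split_sentences` and `tokenise`, and the sample text are all invented for this example.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and an
# uppercase letter. Real sentence detectors also handle abbreviations etc.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# A token is a number (with optional , or . separators), a word (optionally
# with internal apostrophes or hyphens), or a single punctuation character.
_TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['’-]\w+)*|[^\w\s]")


def split_sentences(text):
    """Split text into sentences using the naive boundary rule above."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def tokenise(sentence):
    """Split one sentence into word, number and punctuation tokens."""
    return _TOKEN.findall(sentence)


# Invented sample text, not taken from the document collections.
sample = "Timber was shipped in 1871. Prices rose by 6,127 units!"
sentences = split_sentences(sample)
tokens = [tokenise(s) for s in sentences]
```

Keeping "6,127" as one token matters for later steps such as quantity recognition, which is why the number pattern is tried before the word pattern.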
    • TM in Digital History
      Goal: By analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses.
      Methods: linguistic text analysis, named entity recognition, geo-grounding and relation extraction to transform the text into structured data.
      Sea-change to methods used in ‘traditional’ history.
    • “Traditional” Historical Research
      Cinchona plantations in George King’s A Manual of Cinchona Cultivation in India (1880).
      Global Fats Supply 1894-98
    • Trading Consequences
      Digging into Data II project (till Dec. 2013)
      Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex, Dr. Claire Grover, Clare Llewellyn, Richard Tobin, James Reid, Nicola Osborne, Ian Fieldhouse
    • Trading Consequences
      What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century?
      Scope: global, but with focus on Canadian natural resources.
      Example questions:
      ‣ What were the routes and volumes of international trade in resource commodities in the nineteenth century?
      ‣ What were the local environmental consequences of this demand for these resources?
    • Document Collections
      Big data for historians:
    • Mined Information
      Example sentence:
      Extracted entities:
      commodity: cassia bark
      date: 1871
      location: Padang
      location: America
      quantity + unit: 6,127 piculs
    • Mined Information
      Example sentence:
      Normalised and grounded entities:
      commodity: cassia bark
      date: 1871 (year=1871)
      location: Padang (lat=-0.94924; long=100.35427; country=ID)
      location: America (lat=39.76; long=-98.50; country=n/a)
      quantity + unit: 6,127 piculs
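One way to hold the normalised and grounded entities listed above as structured data is a small record type. The following is a sketch only: the `Entity` class, its field names and the `is_grounded` helper are hypothetical and not taken from the project; the values are the ones shown on the slide, with "n/a" represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    """One mined entity; grounding fields stay None when not applicable."""
    kind: str                      # e.g. "commodity", "date", "location"
    text: str                      # surface form as found in the text
    year: Optional[int] = None     # normalised value for dates
    lat: Optional[float] = None    # grounding for locations
    long: Optional[float] = None
    country: Optional[str] = None  # country code; None stands for "n/a"


entities = [
    Entity("commodity", "cassia bark"),
    Entity("date", "1871", year=1871),
    Entity("location", "Padang", lat=-0.94924, long=100.35427, country="ID"),
    Entity("location", "America", lat=39.76, long=-98.50),
    Entity("quantity", "6,127 piculs"),
]


def is_grounded(entity):
    """True when the entity carries both latitude and longitude."""
    return entity.lat is not None and entity.long is not None


grounded_places = [e.text for e in entities if is_grounded(e)]
```

Records of this shape map directly onto rows of a relational table, which is the kind of structured output the talk describes.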
    • Mined Information
      Example sentence:
      Extracted entity attributes and relations:
      origin location: Padang
      destination location: America
      commodity–date relation: cassia bark – 1871
      commodity–location relation: cassia bark – Padang
      commodity–location relation: cassia bark – America
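The commodity–date and commodity–location relations listed above can be reproduced by pairing entities that occur in the same sentence. This is a deliberately simple sketch that assumes co-occurrence is enough to link two entities; the talk does not say how relations are actually extracted, and the `pair_relations` helper is hypothetical. Assigning the origin and destination attributes needs sentence context and is not attempted here.

```python
from itertools import product

# Entities from the slide, as (kind, text) pairs for one sentence.
entities = [
    ("commodity", "cassia bark"),
    ("date", "1871"),
    ("location", "Padang"),
    ("location", "America"),
]


def pair_relations(entities, left_kind, right_kind):
    """Link every entity of left_kind to every entity of right_kind."""
    lefts = [text for kind, text in entities if kind == left_kind]
    rights = [text for kind, text in entities if kind == right_kind]
    label = f"{left_kind}-{right_kind}"
    return [(label, left, right) for left, right in product(lefts, rights)]


relations = (
    pair_relations(entities, "commodity", "date")
    + pair_relations(entities, "commodity", "location")
)
```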
    • Commodity Ontology
    • Improved Search & Visualisations
    • Noisy Data
      Optical character recognition (OCR) output contains many errors, and often the structure of the page layout is lost.
      Sophistication of the OCR engine and scanning equipment.
      Quality of the original print and paper.
      Use of historical language.
      Information in page margins (header, page numbers, etc.).
      Information in tables.
      Language of the text.
    • Fixing Noisy Data
      Text normalisation and correction:
      End-of-line soft hyphen removal
      Dehyphenate all tokens split by end-of-line hyphens using a dictionary-based approach.
      “False f”-to-s conversion
      Convert all false f characters to s using a corpus.
      Example: reduced number of words unrecognised by spell checker from 61 to 21 -> 67%, on average 12% reduction in word error rate in a random sample (Alex et al., 2012).
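Both corrections can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method of Alex et al. (2012): a small in-memory word set (`lexicon`) stands in for the dictionary and corpus mentioned above, and the function names and sample words are invented for this example.

```python
import re
from itertools import product

# Assumption: a tiny word list stands in for a real dictionary or corpus.
lexicon = {"plantations", "vessels", "fish", "first"}


def dehyphenate(text, lexicon):
    """Join 'part-<newline>rest' when the joined word is in the lexicon."""
    def join(match):
        joined = match.group(1) + match.group(2)
        # Leave genuine hyphens (e.g. 'well-known') untouched.
        return joined if joined.lower() in lexicon else match.group(0)
    return re.sub(r"(\w+)-\n(\w+)", join, text)


def fix_false_f(token, lexicon):
    """Replace f with s wherever that turns an unknown token into a known word."""
    if token.lower() in lexicon:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    # Try every combination of keeping or swapping each f.
    for choice in product((False, True), repeat=len(positions)):
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "s"
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            return candidate
    return token  # no known word found: leave the token as it was


joined = dehyphenate("the planta-\ntions in India", lexicon)
fixed = [fix_false_f(t, lexicon) for t in ["veffels", "fifh", "firft", "offer"]]
```

Checking each candidate against a word list keeps real f's in place: "fifh" becomes "fish", not "sish", and an unknown word such as "offer" is left unchanged.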
    • How Noisy Is Too Noisy?
      Extract from document 10.2307/60238580 in FCOC:
      qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifTpapua}X3sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx a"3(s aq} tnojjssfitns q}TM Sni5[ooi si jb}s }S.ii; aqxpapnaoSB q}Bq naABSjjqS;H °1 ssbui s.uauuaqsu aqx
    • The Users (Historians)
      Involvement of historians:
      Everything is based on the use cases and built on users’ hypotheses/research questions.
      They are responsible for identification of relevant collections and are involved in the ontology development.
      They provide feedback for us to improve technology iteratively: partners at York use the prototype for their research and track errors; workshop at CHESS 2013 with a group of independent historians.
      Clarity on the text mining accuracy is IMPORTANT.
    • Summary
      Text mining historic documents in Trading Consequences.
      Processing “big data”.
      Power of visualising structured data.
      Fixing noisy data.
      Importance of two-way collaboration between technology experts and users in digital history.
    • Thank you
      Questions? Fire away or contact me at: balex@inf.ed.ac.uk