Digital History and Big Data: text mining historical documents on trade in the British empire

  1. Beatrice Alex, balex@inf.ed.ac.uk. Digital scholarship: day of ideas 2, Edinburgh, 02/05/2013. Digital history and big data: Text mining historical documents on trade in the British Empire.
  2. Overview: What is text mining? Text mining in digital history. Trading Consequences. “Big data”. Visualisation. Challenge of noisy data. Collaborating with historians.
  3. Text Mining: Describes a set of linguistic, statistical and machine learning techniques that model and structure the information content of textual resources. Turns unstructured text into structured data (e.g. a relational database or linked data). Is very useful for analysing large text collections automatically.
  4. Text Mining: TM methods often rely on a set of linguistic pre-processing steps such as tokenisation, sentence detection, part-of-speech tagging, lemmatisation and syntactic parsing (chunking). Currently our focus is on named entity recognition, entity grounding and relation extraction.
  5. TM in Digital History: Goal: by analysing large amounts of digitised data, help historians to discover novel patterns and explore hypotheses. Methods: linguistic text analysis, named entity recognition, geo-grounding and relation extraction to transform the text into structured data. A sea change compared with the methods used in ‘traditional’ history.
  6. “Traditional” Historical Research: Cinchona plantations in George King’s A Manual of Cinchona Cultivation in India (1880). Global Fats Supply 1894-98.
  7. Trading Consequences: Digging into Data II project (till Dec. 2013). Edinburgh Team: Prof. Ewan Klein, Dr. Beatrice Alex, Dr. Claire Grover, Clare Llewellyn, Richard Tobin, James Reid, Nicola Osborne, Ian Fieldhouse.
  8. TRADING CONSEQUENCES
  9. Trading Consequences: What does archival text say about the economic and environmental consequences of global commodity trading during the nineteenth century? Scope: global, but with focus on Canadian natural resources. Example questions:
     ‣ What were the routes and volumes of international trade in resource commodities in the nineteenth century?
     ‣ What were the local environmental consequences of this demand for these resources?
  10. Document Collections: Big data for historians.
  11. Mined Information: Example sentence [not captured in the transcript].
  12. Mined Information: Example sentence, extracted entities:
      ‣ commodity: cassia bark
      ‣ date: 1871
      ‣ location: Padang
      ‣ location: America
      ‣ quantity + unit: 6,127 piculs
  13. Mined Information: Example sentence, normalised and grounded entities:
      ‣ commodity: cassia bark
      ‣ date: 1871 (year=1871)
      ‣ location: Padang (lat=-0.94924; long=100.35427; country=ID)
      ‣ location: America (lat=39.76; long=-98.50; country=n/a)
      ‣ quantity + unit: 6,127 piculs
  14. Mined Information: Example sentence, extracted entity attributes and relations:
      ‣ origin location: Padang
      ‣ destination location: America
      ‣ commodity–date relation: cassia bark – 1871
      ‣ commodity–location relation: cassia bark – Padang
      ‣ commodity–location relation: cassia bark – America
  15. Commodity Ontology
  16. Improved Search & Visualisations
  17. Improved Search & Visualisations
  18. Improved Search & Visualisations
  19. Noisy Data: Optical character recognition output contains many errors, and often the structure of the page layout is lost.
      ‣ Sophistication of the OCR engine and scanning equipment.
      ‣ Quality of the original print and paper.
      ‣ Use of historical language.
      ‣ Information in page margins (header, page numbers, etc.).
      ‣ Information in tables.
      ‣ Language of the text.
  20. Fixing Noisy Data: Text normalisation and correction:
      ‣ End-of-line soft hyphen removal: dehyphenate all token-splitting hyphens using a dictionary-based approach.
      ‣ “False f”-to-s conversion: convert all false f characters to s using a corpus.
      Example: reduced the number of words unrecognised by the spell checker from 61 to 21 -> 67%; on average a 12% reduction in word error rate in a random sample (Alex et al., 2012).
  21. Fixing Noisy Data
  22. Fixing Noisy Data
  23. How Noisy Is Too Noisy? Extract from document 10.2307/60238580 in FCOC:
      qBiu si }S3A:req s,uauuaqsu aq} }Bq} uirepo.ifTpapua}X3sSuiav }qSuq Jiaq} qiiM jib ui snnS bbs aqx a"3(s aq} tnojjssfitns q}TM Sni5[ooi si jb}s }S.ii; aqxpapnaoSB q}Bq naABSjjqS;H °1 ssbui s.uauuaqsu aqx
  24. The Users (Historians): Involvement of historians:
      ‣ Everything is based on the use cases and builds on users’ hypotheses/research questions.
      ‣ They are responsible for identification of relevant collections and are involved in the ontology development.
      ‣ They provide feedback for us to improve the technology iteratively: partners at York use the prototype for their research and track errors; workshop at CHESS 2013 with a group of independent historians.
      ‣ Clarity on the text mining accuracy is IMPORTANT.
  25. Summary: Text mining historic documents in Trading Consequences. Processing “big data”. Power of visualising structured data. Fixing noisy data. Importance of two-way collaboration between technology experts and users in digital history.
  26. Thank you: Questions? Fire away or contact me at: balex@inf.ed.ac.uk
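
Slides 12 to 14 show one sentence being turned into entities, grounded entities and relations. As a rough illustration of what "structured data" could look like at that point, here is a minimal Python sketch. The class names, field names and the `relation_rows` helper are hypothetical and are not taken from the Trading Consequences system; only the values come from the slides, with `country=n/a` represented as `None`.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    kind: str   # e.g. "commodity", "date", "location"
    text: str   # surface form found in the sentence
    grounding: dict = field(default_factory=dict)  # normalised values, if any


@dataclass
class Relation:
    kind: str
    source: Entity
    target: Entity


def relation_rows(relations):
    """Flatten relations into (kind, source text, target text) tuples,
    e.g. as rows for a relational table."""
    return [(r.kind, r.source.text, r.target.text) for r in relations]


# Values copied from slides 12 to 14; all names here are illustrative only.
cassia = Entity("commodity", "cassia bark")
year = Entity("date", "1871", {"year": 1871})
padang = Entity(
    "location", "Padang",
    {"lat": -0.94924, "long": 100.35427, "country": "ID", "role": "origin"},
)
america = Entity(
    "location", "America",
    {"lat": 39.76, "long": -98.50, "country": None, "role": "destination"},
)
quantity = Entity("quantity", "6,127 piculs")

relations = [
    Relation("commodity-date", cassia, year),
    Relation("commodity-location", cassia, padang),
    Relation("commodity-location", cassia, america),
]

for row in relation_rows(relations):
    print(row)
```

Keeping the surface text separate from the grounding means the original wording stays available for checking, which matters when the accuracy of the mining has to be made clear to historians.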
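
The two normalisation steps named on slide 20 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of Alex et al. (2012): the tiny `LEXICON` set stands in for the dictionary and corpus mentioned on the slide, and the function names `dehyphenate` and `fix_false_f` are made up for this example. Punctuation and capital letters are not handled.

```python
import itertools

# Hypothetical toy lexicon; a real system would use a large dictionary
# or corpus frequencies instead.
LEXICON = {"the", "ship", "sailed", "from", "a", "of", "first",
           "possession", "commerce", "vessels"}


def dehyphenate(lines, lexicon):
    """Join words split by an end-of-line hyphen.

    The hyphen is dropped only when the joined form is in the lexicon;
    otherwise it is kept, since it may belong to a real compound.
    """
    words = []
    for line in lines:
        tokens = line.split()
        if words and words[-1].endswith("-") and tokens:
            joined = words[-1][:-1] + tokens[0]
            if joined.lower() in lexicon:
                words[-1] = joined                  # soft hyphen: remove it
            else:
                words[-1] = words[-1] + tokens[0]   # keep the hyphen
            tokens = tokens[1:]
        words.extend(tokens)
    return " ".join(words)


def fix_false_f(token, lexicon):
    """Replace misrecognised long-s characters ('f') with 's'.

    Tries every subset of 'f' positions, smallest first, and returns the
    first candidate found in the lexicon; unknown tokens are left as is.
    """
    if token.lower() in lexicon:
        return token
    positions = [i for i, ch in enumerate(token) if ch == "f"]
    for n in range(1, len(positions) + 1):
        for combo in itertools.combinations(positions, n):
            chars = list(token)
            for i in combo:
                chars[i] = "s"
            candidate = "".join(chars)
            if candidate.lower() in lexicon:
                return candidate
    return token


text = dehyphenate(["the firft veffels sailed from a pos-", "session"], LEXICON)
print(" ".join(fix_false_f(tok, LEXICON) for tok in text.split()))
```

Checking every candidate against a word list keeps genuine f-words such as "from" untouched, which is the reason the conversion cannot simply replace every f with s.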
