
IMPACT Final Conference - Gregory Crane

Published in: Technology, Education

  1. 1. OCR and the Transformation of the Humanities. Gregory Crane and David Bamman (Tufts University); Bruce Robertson (Mount Allison University); John Darlington and Brian Fuchs (Imperial College London)
  8. 8. Three basic changes
      • Transformation of scale of questions
          • Breadth and Depth
      • Student researchers and citizen scholars
          • Not enough professors and library professionals
      • Globalization of cultural heritage
          • Not enough expertise in Europe + North America
  9. 9. Towards Dynamic Variorum Editions. Gregory Crane and David Bamman (Tufts University); Bruce Robertson (Mount Allison University); John Darlington and Brian Fuchs (Imperial College London)
  10. 10. Thanks to …
      • Digging into Data Phase 1
      • National Endowment for the Humanities
      • JISC (UK)
      • SSHRC (Canada)
      • National Science Foundation
      • Mellon Foundation
      • Google Digital Humanities
      • Cantus Foundation
      • German Research Foundation
  12. 12. The Dynamic Variorum as grand challenge
      • How do you build self-organizing collections?
  13. 13. What is a variorum?
      • Short for cum notis variorum, “with notes of different people”
  17. 17. New Variorum Shakespeare Series: “New” = 140 years old; “New” vs. 1821 Shakespeare Variorum
  18. 18. Heinsius’ Claudian
  21. 21. NVS 2011
  43. 43. What was in the 1873 NVS Macbeth?
      • Index
      • [Table of contents]
      • Sources
      • Adaptations
      • General Topics
      • Bibliographies
      • Running Text
      • Multiple Versions
      • Annotations
  44. 44. Brown’s Intermedia c. 1990
  45. 45. The problem…
      • Not feasible to summarize scholarship on any major canonical author by manual means
      • An issue in 1665 and in 1905 but much worse now…
      • How do we generate a Variorum edition from the very large collections that make this such a challenge? How do we make scale an advantage?
  48. 48. Shakespeare as an easy case… c. 500 years of English …
  50. 50. Greco-Roman World: From Rabat to Kandahar …
  51. 51. c. 100 CE papyrus from Euclid (c. 300 BCE) http://www.math.ubc.ca/~cass/Euclid/papyrus/papyrus.html
  52. 52. 800-1000 CE: Greek into Arabic Hunayn Ibn Ishaq (809–873), Arabic version of the Prognosticon from the Hippocratic Corpus http://www.nlm.nih.gov/exhibition/odysseyofknowledge/
  53. 53. c. 1200-1300: Arabic into Latin Medieval Translation of the Prognosticon from Arabic into Latin
  54. 54. Return of Greek sources c. 1500 This first edition of Dioscorides' Greek text, printed in Venice in 1499 by Aldo Manuzio (ca. 1447–1515)
  55. 55. Status as of October 2011
      • What do you do with a billion words?
          • 2000 years of Latin
      • How do you integrate data across languages?
          • Projecting markup over noisy data
      • How do you trace ideas?
          • Detecting changes within and across languages
      • How do you get the data you need?
          • Customizing OCR for a pre-modern language
      • How do you scale up your services?
          • From workflows to Cloud-based design
  56. 56. Disciplines and Speakers
      • David Bamman, Tufts University
          • Computational Linguistics
      • Bruce Robertson, Mount Allison University
          • Digital Classics
      • Brian Fuchs, Imperial College London
          • Software Engineering
  57. 57. 1. Computational Linguistics David Bamman Tufts University (Carnegie Mellon University) United States
  58. 58. Overview: Publications
      • Bamman, David and Gregory Crane (2011), “Measuring Historical Word Sense Variation,” Proceedings of the 11th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2011). Nominee, Best Paper Award.
      • Bamman, David, Alison Babeu, and Gregory Crane (2010), “Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection,” in: Proceedings of the 10th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2010). Winner, Best Paper Award.
      • Bamman, David and David Smith (forthcoming), “Extracting Two Thousand Years of Latin from a Million Book Library,” Journal of Computing and Cultural Heritage.
  59. 60. 2000+ Years of Latin
  60. 61. Goal: Tracking Language Change
      • Lexical change (new vocabulary, shifts in word sense)
      • Syntactic change (including the influence of the author’s L1 on the Latin syntax)
      • Topical change (the rise of new genres)
      • Identifying the spread of variation across authors
  61. 62. Corpus development
      • Data source
          • 1.2M books from the Internet Archive (snapshot of collection from 2009)
          • 25,886 works catalogued as Latin
      • Metadata problems
          • 1. Language identification (many of these works are not Latin)
          • 2. Historical date info (dates of publication != dates of composition)
  62. 63. 25,886 works catalogued as Latin in the IA, charted by “date.”
  63. 66. Language ID
      • Language ID to identify which of these works actually have Latin as a major language.
          • Trained a language classifier (alias-i LingPipe) on:
              • 24 editions of Wikipedia
              • Perseus classical corpus
              • Known badly-OCR’d Greek in the IA
      • Results
          • 10,263 of 25,886 books catalogued as Latin are not recognizably so (mostly Greek)
          • 6,790 books not catalogued as Latin in the 1.2M collection are in fact so (98% precision)
          • Net: 22,413 Latin books containing 2.97 billion words
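The net figure on this slide follows from the two corrections: 25,886 − 10,263 + 6,790 = 22,413. As a rough illustration of how a character n-gram language identifier of this general kind can work, here is a minimal sketch. It is not the LingPipe classifier used in the project; the class name, the trigram size, the add-one smoothing, and the toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padded with spaces at both ends."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageId:
    """Toy character n-gram classifier with add-one smoothing (hypothetical)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # language -> Counter of n-grams
        self.totals = {}   # language -> number of n-grams seen
        self.vocab = set()

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def classify(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)

        def log_prob(language):
            counts, total = self.counts[language], self.totals[language]
            return sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)

        return max(self.counts, key=log_prob)


# Catalogue arithmetic from the slide.
net_latin_books = 25_886 - 10_263 + 6_790

if __name__ == "__main__":
    lid = NgramLanguageId()
    lid.train("latin", "gallia est omnis divisa in partes tres quarum unam incolunt belgae")
    lid.train("english", "the quick brown fox jumps over the lazy dog and the cat sat on the mat")
    print(lid.classify("omnis gallia in partes divisa est"), net_latin_books)
```

A production classifier would be trained on far more text per language (the slide lists 24 Wikipedia editions, the Perseus corpus, and badly OCR'd Greek), which is what makes the decision robust on noisy OCR.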
  64. 67. Composition dating
      • With undergraduate students in Classics, established dates of composition for each Latin text. So far, considered 10,398 of them:
          • 7,055 dated
          • 3,343 excluded as not representative of language use, e.g., reference works (dictionaries, catalogues, lists of manuscripts)
      • From these 7,055 works, we extract just the Latin to create a dated historical corpus of 389 million words.
  65. 68. 25,886 works catalogued as Latin in the IA, charted by “date.”
  66. 69. 7,055 Latin works in the IA, charted by date of composition.
  67. 70. “America” (1066)
  68. 71. “de” (2,955,462)
  69. 72. “oratio”
  70. 73. “lead” vs. “iron”
  71. 74. Polysemy: words have many senses.
      Lead
          • (verb) cause to go; be in command
          • (noun) position of advantage; chief part in play; element Pb; graphite in pencil
      Iron
          • (verb) to smooth w. an iron
          • (noun) element Fe; tool with flat steel base used to smooth clothes; golf club
      Oratio
          • (noun) speech; prayer
  72. 75. Measuring sense variation
      • Method: Train broad-coverage word sense disambiguation using aligned parallel texts
      • English/French (Diab and Resnik 02), English/Chinese (Chan and Ng 05, Ng et al. 03), English/Portuguese (Specia et al. 05), English/Vietnamese (Dinh 02).
      • Parallel text alignment
      • Identify translations (130 English translations manually identified by students from a representative range of dates)
      • Word align Latin text <-> English text (ca. 1.3M words)
      • Induce a sense inventory from the alignment
      • Word sense disambiguation
      • Train a WSD classifier on noisily aligned texts
      • Automatically classify remaining 387M words
      • Track lexical change
  73. 76. WSD via parallel texts
      • SMT based on Brown et al. (1990)
      • Different senses for a word in one language are translated by different words in another.
      • “Bank” (English)
          • financial institution = French “banque”
          • side of a river = French “rive” (e.g., la rive gauche)
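The idea that different senses surface as different translations can be sketched as a simple sense-induction step: group word-aligned pairs by source word and keep the translations that account for a meaningful share of its alignments. This is only a minimal sketch; the function name, the `min_share` threshold, and the toy counts are assumptions, not the project's actual procedure.

```python
from collections import Counter, defaultdict


def induce_senses(aligned_pairs, min_share=0.2):
    """Build a sense inventory from (source_word, translation) alignment pairs.

    A translation is kept as a sense label when it covers at least
    `min_share` of the source word's alignments; rarer translations are
    treated as alignment noise. The threshold is an illustrative assumption.
    """
    by_source = defaultdict(Counter)
    for source_word, translation in aligned_pairs:
        by_source[source_word][translation] += 1

    inventory = {}
    for source_word, counts in by_source.items():
        total = sum(counts.values())
        inventory[source_word] = sorted(
            t for t, c in counts.items() if c / total >= min_share
        )
    return inventory


if __name__ == "__main__":
    # Invented toy alignments: "the" stands in for a noisy alignment.
    pairs = [("oratio", "speech")] * 6 + [("oratio", "prayer")] * 3 + [("oratio", "the")]
    print(induce_senses(pairs))
```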
  74. 77. (Dynamic Lexicon)
  75. 78. (Bootstrapping multilingual digital libraries) +
      • Projecting XML markup across editions and translations (Bamman and Crane 2010)
      • Alignment of the source document with the target document in a cascading process: document -> sentence -> word
      • Projection of XML tags in the source document to the target document in a way that exploits the linguistic similarity of the text pair.
  76. 79. 2. Parallel text alignment
      • Sentence level: Moore’s Bilingual Sentence Aligner (Moore 2002)
          • aligns sentences that are 1-1 translations of each other w/ high precision (98.5% on a corpus of 10K English-Hindi sentences)
      • Word level: MGIZA++ (Gao and Vogel 2008)
          • parallel version of GIZA++ (Och and Ney 2003), an implementation of IBM Models 1-5
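To give a feel for what the word-level step computes, here is a minimal sketch of IBM Model 1, the simplest of the models that GIZA++ implements, trained with expectation-maximisation on a toy corpus. It is not the GIZA++ or MGIZA++ code: it omits the NULL word and the later models, and the function names and the three toy sentence pairs are assumptions for illustration.

```python
from collections import defaultdict


def ibm_model1(sentence_pairs, iterations=20):
    """Estimate t[(target, source)] = P(target | source) with EM (IBM Model 1).

    sentence_pairs: list of (source_tokens, target_tokens).
    Simplification: no NULL source word is added.
    """
    target_vocab = {w for _, target in sentence_pairs for w in target}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words in its sentence.
        for source, target in sentence_pairs:
            for e in target:
                norm = sum(t[(e, f)] for f in source)
                for f in source:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: renormalise the expected counts per source word.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


def align(source, target, t):
    """Link each target word to its most probable source word."""
    return [(e, max(source, key=lambda f: t[(e, f)])) for e in target]


if __name__ == "__main__":
    corpus = [
        (["villa", "magna"], ["large", "farm"]),
        (["villa", "parva"], ["small", "farm"]),
        (["oratio", "parva"], ["small", "speech"]),
    ]
    table = ibm_model1(corpus)
    print(align(["oratio", "parva"], ["small", "speech"], table))
```

Because "villa" co-occurs with "farm" in two sentences and "parva" with "small" in two, EM gradually pushes the remaining probability mass onto the pairs that only co-occur once, which is how the rarer words get aligned without any dictionary.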
  77. 80. 3. Sense induction
  78. 81. 4. WSD Training Source word oratione (oratio) Sense label prayer Training context ad spem pertinent, quae in … dominica continentur
  79. 82. 5. WSD Classification
      • For all words without an aligned translation, use the surrounding context to determine the most likely sense.
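Classifying a word's sense from its surrounding context can be illustrated with a small bag-of-words Naive Bayes model, one of the classifier families evaluated later in the talk. This is a minimal sketch only: the class name, the add-one smoothing, and the invented toy contexts are assumptions, and the real system was trained on noisily aligned corpus text rather than hand-written examples.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    """Toy bag-of-words Naive Bayes sense classifier (illustrative only)."""

    def __init__(self):
        self.sense_counts = Counter()            # sense -> training instances
        self.word_counts = defaultdict(Counter)  # sense -> context word counts
        self.vocab = set()

    def train(self, sense, context_words):
        self.sense_counts[sense] += 1
        self.word_counts[sense].update(context_words)
        self.vocab.update(context_words)

    def classify(self, context_words):
        total_instances = sum(self.sense_counts.values())
        vocab_size = len(self.vocab)

        def log_score(sense):
            sense_total = sum(self.word_counts[sense].values())
            score = math.log(self.sense_counts[sense] / total_instances)
            for word in context_words:
                # Add-one smoothing so unseen context words do not zero the score.
                score += math.log((self.word_counts[sense][word] + 1) / (sense_total + vocab_size))
            return score

        return max(self.sense_counts, key=log_score)


if __name__ == "__main__":
    wsd = NaiveBayesWSD()
    # Invented contexts for "oratio"; not drawn from the project's data.
    wsd.train("prayer", ["dominica", "deus", "spem"])
    wsd.train("prayer", ["deus", "ecclesia"])
    wsd.train("speech", ["senatus", "cicero", "forum"])
    wsd.train("speech", ["cicero", "orator"])
    print(wsd.classify(["deus", "spem"]))
```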
  80. 83. 5a. WSD static evaluation
      • Created held-out test set of 105 instances of 5 Latin nouns with known shifts in meaning, sampled uniformly from 21 centuries. Evaluated 7 different WSD classifiers + simple baseline of most frequent sense overall (MFS).

      System             villa    pastor   miles    scientia  oratio   Average
      5-gram LM          54.8%    69.2%    90.2%    73.7%     61.4%    69.9%
      6-gram LM          58.3%    61.5%    91.2%    65.8%     63.8%    68.1%
      Bayes              63.5%    62.3%    92.6%    70.2%     48.0%    67.3%
      Token Unigram LM   63.5%    62.4%    92.6%    70.2%     48.0%    67.3%
      Token Bigram LM    64.3%    62.4%    92.6%    70.2%     48.8%    67.7%
      TF/IDF             64.3%    60.7%    82.8%    70.2%     49.6%    65.5%
      KNN                64.3%    73.5%    84.4%    63.2%     40.1%    65.1%
      MFS Baseline       60.9%    66.7%    92.6%    79.0%     60.6%    72.0%
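The Average column in this table matches the unweighted mean of the five per-word accuracies (for the MFS row, (60.9 + 66.7 + 92.6 + 79.0 + 60.6) / 5 ≈ 72.0). A minimal sketch of the most-frequent-sense baseline and of that averaging is given below; the function names and the toy sense labels are assumptions for illustration.

```python
from collections import Counter
from statistics import mean


def mfs_accuracy(training_senses, test_senses):
    """Accuracy of always predicting the most frequent training sense."""
    most_frequent, _ = Counter(training_senses).most_common(1)[0]
    hits = sum(sense == most_frequent for sense in test_senses)
    return hits / len(test_senses)


def average_accuracy(per_word_accuracies):
    """Unweighted mean over words, rounded to one decimal as in the table."""
    return round(mean(per_word_accuracies), 1)


if __name__ == "__main__":
    # Invented labels: "speech" is the majority sense in training.
    print(mfs_accuracy(["speech", "speech", "speech", "prayer"],
                       ["speech", "prayer", "speech", "speech"]))
    # MFS Baseline row from the slide: villa, pastor, miles, scientia, oratio.
    print(average_accuracy([60.9, 66.7, 92.6, 79.0, 60.6]))
```

The baseline is strong on static accuracy precisely because it ignores time: a word whose dominant sense covers most instances overall is hard to beat on a pooled test set.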
  83. 86. 5b. WSD time series evaluation
      • Evaluated via mean square error between gold standard time series and automatically classified one.

      System             villa   pastor  miles   scientia  oratio  Average
      5-gram LM          .056    .034    .052    .044      .137    .065
      6-gram LM          .053    .053    .052    .022      .022    .040
      Bayes              .047    .060    .055    .040      .228    .086
      Token Unigram LM   .047    .060    .055    .044      .230    .086
      Token Bigram LM    .047    .060    .055    .044      .230    .087
      TF/IDF             .037    .050    .049    .040      .189    .073
      KNN                .101    .028    .054    .039      .248    .094
      MFS Baseline       .228    .170    .014    .091      .338    .178
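The time-series metric can be sketched as follows. The slide does not spell out how each series is built, so this example assumes that a series holds, per time period, the proportion of instances assigned to one sense; the function name and the toy values are likewise assumptions.

```python
def mean_square_error(gold_series, predicted_series):
    """Mean of squared differences between two equally long time series."""
    if not gold_series or len(gold_series) != len(predicted_series):
        raise ValueError("series must be non-empty and of equal length")
    squared = [(g - p) ** 2 for g, p in zip(gold_series, predicted_series)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Invented proportions of the "prayer" sense of "oratio" in four periods.
    gold = [0.0, 0.1, 0.6, 0.8]
    predicted = [0.1, 0.1, 0.5, 0.9]
    print(mean_square_error(gold, predicted))
```

A baseline that always predicts the same sense produces a flat series, which explains why it can score well on static accuracy yet poorly on this measure for words whose dominant sense changes over time.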
  84. 87. “oratio”
  85. 88. 6. Tracking lexical change: “oratio”
  86. 89. Acknowledgments
      • This work was supported by grants from:
          • The Digging into Data Challenge (“Towards Dynamic Variorum Editions”)
          • The National Science Foundation (IIS-910884, “Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR”)
          • The National Endowment for the Humanities (PR-50013-08, “The Dynamic Lexicon: Cyberinfrastructure and the Automated Analysis of Historical Languages”)
      • Thanks are also due to research assistants Alison Darling, Elise Goodman-Tuchmayer, Daniel Libatique, Lee Marmor, John Owen and Erin Shanahan.
  87. 90. 2. Digital Classics Bruce Robertson Mount Allison University Canada
  88. 91. Digitizing and Viewing Difficult Texts: Lessons From Ancient Greek
      • The 19th century provides a vast array of editions of Greek text, many still very useful
          • Yet they could not be accessed digitally
      • What tools and workflows might help us digitize diverse texts such as these?
      • What applications can we create to make the resulting OCR data useful to researchers and students?
  89. 92. Diversity of 19th Century Fonts and Layout
  90. 94. Character Classification and the Modern Undergraduate
      • Performing optical character recognition requires a great deal of 'training'
      • This is perfectly suited to the undergraduate researcher
      • Ph.D. student asks: “why isn't this part of the beginning Greek curriculum?”
          • It introduces students to the beauty and heritage of the typography of their subject
          • It immediately engages them in a vital research project
      • (True of all languages where learning a new character set is a preliminary skill)
  91. 95. Results
  92. 97. http://www.youtube.com/watch?v=OIjaq7ds2J8
  93. 99. Lessons Learned
      • Undergraduates provide excellent middle-tier academic labour
      • Shared dictionary data will be fundamental to a cloud-based approach
      • Include as many languages as possible from the beginning
  94. 100. Future Work
      • Continue to improve Greek OCR engine based on 'Gamera'
      • Integrate visualization tools that aid students of the language
      • Implement many dictionaries: English, French, Latin, etc.
      • Integrate other crowd-sourcing opportunities so interested viewers can:
          • Verify or correct dubious OCR results
          • Identify the grammar or syntax of words
  95. 101. 3. Software Engineering John Darlington Brian Fuchs Imperial College London United Kingdom
  96. 102. ICL’s role in DVE
      • High-throughput infrastructure for
          • OCR for Greek and Latin
          • Text-based feature extraction
      • E-Science utility computing infrastructure
      • High-level functional interfaces for e-Science
  97. 103. DVE: Context at SCG
      • E-Science Frameworks
          • Grid
          • Cloud
          • Parallel Processing
          • Functional / Declarative approaches
      • Internet services and economics
          • Healthcare
          • Music
          • Mobile Applications
          • Transport
  98. 104. OCR parallel challenge
      • The key to OCR at scale is minimising the need for eyeballs,
      • i.e. “ground-truth”: manual checking against the original.
  99. 105. Rapid OCR using MapReduce + a Cloud IaaS
  100. 106. Infrastructure: State of play
      • 6 node static Hadoop testbed
      • 160 node Eucalyptus cluster on old Opteron chips
      • 20 dual quad-core machines with 16TB storage on fibre
      • Stack assembled and deployed
      • Initial training sets tested
  101. 107. Throughput Infrastructure
      • Boschetti Aligner
          • OCR post-processing for Greek/Latin developed at PDL
          • multiple sequence alignment dynamic algorithm (like BLAST, Clustal, MrBayes)
          • Bayesian classifier to select the most probable sequence of characters
          • spell-checking filtered by OCR evidence
  102. 108. Throughput Infrastructure
      • MapReduce
          • = functional Map/Fold.
          • Made famous by Google, but developed by others.
          • Map, then reduce
          • Map: apply a function in parallel to a bunch of key/value pairs.
          • Reduce: apply a function in parallel to each group of similar k/v pair outputs from Map.
  104. 110. Throughput Infrastructure
      • MapReduce
          • E.g. count occurrences of words in docs
          • Map:
              • Count the words in 1000 documents (in parallel)
              • Map(docname, doc.txt) -> ‘mittitur’:1, ‘cura’:1
          • Reduce:
              • Group the output by word, and add up occurrences (in parallel)
              • Reduce(word:count) -> ‘mittitur’:23, ‘cura’:10, …
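The word-count example can be sketched in a few lines of plain Python that mimic the three phases (map, group by key, reduce) in a single process. This is not Hadoop code; the function names and the two toy documents are assumptions, and a real job would run the map and reduce calls in parallel across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_document(docname, text):
    """Map: emit a (word, 1) pair for every token in one document."""
    return [(word, 1) for word in text.lower().split()]


def group_by_key(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word(word, counts):
    """Reduce: add up the occurrences of one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce sequentially over {docname: text}."""
    mapped = chain.from_iterable(
        map_document(name, text) for name, text in documents.items()
    )
    return dict(reduce_word(word, counts) for word, counts in group_by_key(mapped).items())


if __name__ == "__main__":
    docs = {"a.txt": "mittitur cura mittitur", "b.txt": "cura"}
    print(word_count(docs))
```

Each `map_document` call touches only its own document and each `reduce_word` call only its own key, which is what lets a framework distribute them freely.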
  105. 111. Throughput Infrastructure
      • Eucalyptus
          • Open Source Cloud Computing
          • UC, Santa Barbara spin-off
          • compatible with Amazon EC2/S3
          • Supported in Ubuntu as of 10.4.
  106. 112. Throughput Infrastructure
      • Hadoop
          • Apache Distributed File System for MapReduce jobs.
          • MapReduce Engine: co-ordinates MapReduce
  107. 113. Cluster provisioning
      • Create an image with the whole stack
      • Deploy the image as many times as nodes are required
      • Push required config data to the nodes
      • Turn on
      • Keep storage separate (i.e. don’t use HDFS to store data)
  108. 114. OCR parallel methods
      • Run parallel jobs on the same scans
      • Score results
      • Use highest score in the next round
  109. 115. OCR parallel methods
      • 3 different OCR engines per page
      • x different filters per page
      • x different filters per section of page
      • = c. 30 runs per scan
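The run, score, and select loop described on these two slides can be sketched as a search over engine and filter combinations. This is a minimal sketch only: `run_ocr` and `score` are hypothetical callables supplied by the caller, the engine and filter names are placeholders, and the combinations are evaluated sequentially here rather than as parallel jobs.

```python
from itertools import product


def best_run(page, engines, filters, run_ocr, score):
    """Try every (engine, filter) pair on one page and keep the best result.

    run_ocr(page, engine, filter_name) -> text and score(text) -> number are
    hypothetical callables; no real OCR library is invoked in this sketch.
    Returns a tuple (score, engine, filter_name, text).
    """
    results = []
    for engine, filter_name in product(engines, filters):
        text = run_ocr(page, engine, filter_name)
        results.append((score(text), engine, filter_name, text))
    # Highest score wins; ties go to the combination tried first.
    return max(results, key=lambda result: result[0])


if __name__ == "__main__":
    # Canned outputs stand in for real OCR runs.
    canned = {
        ("gamera", "none"): "mittitur cuxa",
        ("gamera", "despeckle"): "mittitur cura",
        ("tesseract", "none"): "mitlitur cuxa",
        ("tesseract", "despeckle"): "mittitur cuxa",
    }
    lexicon = {"mittitur", "cura"}
    winner = best_run(
        "page-1",
        ["gamera", "tesseract"],
        ["none", "despeckle"],
        run_ocr=lambda page, engine, filt: canned[(engine, filt)],
        score=lambda text: sum(word in lexicon for word in text.split()),
    )
    print(winner)
```

The winning combination would then seed the next round, as the slide's "use highest score in the next round" suggests.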
  110. 116. OCR vote and error prediction methods Courtesy: Federico Boschetti
  111. 117. Alignment voting
      • Map: Run three OCR engines/training sets on each page
          • Gamera
          • Tesseract: training set 1
          • Tesseract: training set 2
      • Reduce:
          • spell check and compare
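One simple way to "compare" three engine outputs is a column-wise majority vote over already aligned strings. This is a simplified stand-in, not the Boschetti aligner, which the talk describes as using multiple sequence alignment with a Bayesian classifier and spell-checking. The function name, the `-` gap symbol, and the assumption that alignment has already been done are all illustrative.

```python
from collections import Counter

GAP = "-"  # assumed gap marker inserted by a prior alignment step


def vote(aligned_outputs):
    """Pick the most common character in each column of aligned OCR outputs.

    All strings must have the same length. A column whose winner is the gap
    symbol contributes nothing. Ties go to the earliest engine in the list.
    """
    if len({len(output) for output in aligned_outputs}) > 1:
        raise ValueError("aligned outputs must have equal length")
    chosen = []
    for column in zip(*aligned_outputs):
        winner, _ = Counter(column).most_common(1)[0]
        if winner != GAP:
            chosen.append(winner)
    return "".join(chosen)


if __name__ == "__main__":
    print(vote(["mittitur", "mitlitur", "mittitur"]))
    print(vote(["cu-ra", "cu-ra", "cuira"]))
```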
  112. 118. Training set voting
      • Map: Run random pages on all avail. training sets.
      • Reduce: Check against dictionary, and score.
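Scoring OCR output against a dictionary, which avoids the need for manually checked ground truth, can be sketched as the share of tokens found in a word list. The function names, the punctuation handling, and the toy outputs are assumptions; a real setup for Greek or Latin would also need to handle inflected forms, for instance via a morphological lexicon.

```python
def dictionary_score(text, dictionary):
    """Fraction of tokens in `text` that appear in `dictionary` (0.0 if empty)."""
    tokens = [token.strip(".,;:!?").lower() for token in text.split()]
    tokens = [token for token in tokens if token]
    if not tokens:
        return 0.0
    return sum(token in dictionary for token in tokens) / len(tokens)


def rank_training_sets(outputs, dictionary):
    """Order training-set names by the dictionary score of their OCR output.

    `outputs` maps a (hypothetical) training-set name to the text it produced
    on the same sample pages.
    """
    return sorted(
        outputs,
        key=lambda name: dictionary_score(outputs[name], dictionary),
        reverse=True,
    )


if __name__ == "__main__":
    lexicon = {"mittitur", "cura"}
    sample = {"set-2": "mittitur cuxa", "set-1": "mittitur cura."}
    print(rank_training_sets(sample, lexicon))
```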
  113. 119. Tiling
      • Map: Run several filters over different parts of a page to compensate for local minima (i.e. blotches)
      • Reduce: score the output and compare.
  114. 120. Why Eucalyptus?
      • Scalable
          • Amazon/NGS hybrid possibilities
      • Reusable
          • Very fast start-up/tear-down.
      • Configurable
          • Quickly configure custom throughput clusters
  115. 121. Why MapReduce?
      • “Shared Nothing” architecture
      • = suited to “dumb” processes like page OCR
      Why Hadoop?
      • Easy to integrate with other FS’s, e.g. S3
      • Excellent customisation options
      • Most flexible implementation of MapReduce (cf. GridGain)
  116. 122. Why not MapReduce?
      • Requires extensive refactoring.
      • Only a subset of functional possibilities.
      Why not Hadoop?
      • Filesystem is slooooowwww…
      • Resource intensive.
      • Headnode is a bottleneck…
  117. 123. Challenges for the future
      • Feature extraction, e.g.
          • Named Entities
          • Part of Speech tagging
          • Multi-lingual alignment
      • Iteration is hard with distributed systems!
  118. 124. Conclusions
  120. 126. Three conclusions
      • Increased intellectual range
          • Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
  122. 129. Sometimes Greek philosophy does have an impact… Plato’s Republic and the Guardians; the Islamic Republic of Iran and the Guardianship of Islamic Jurists
  124. 131. Three conclusions
      • Increased intellectual range
          • Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
      • Cultural heritage -> network of cultures
          • We share Greco-Roman Cultural Heritage
  125. 132. Students of Greek and Latin
  128. 135. How do we work together?
  130. 137. Three conclusions
      • Increased intellectual range
          • Greco-Roman Antiquity is an enabling subject to understand cultural tectonic forces at work today
      • Cultural heritage -> network of cultures
          • We share Greco-Roman Cultural Heritage
      • Decentralized Lab Culture in the Humanities
          • Even/esp. hard subjects need contributions from student researchers and citizen scholars
  135. 142. Student Researchers: Tufts, Furman, Holy Cross, Mount Allison, Houston
  136. 143. Huge Open Collections
      • Provide the net public with physical access to unprecedented bodies of cultural heritage
      • Researchers and automated systems provide initial intellectual access BUT…
      • These alone cannot succeed without student researchers and citizen scholars
  137. 144. Three basic changes
      • Transformation of scale of questions
          • Breadth and Depth
      • Student researchers and citizen scholars
          • Not enough professors and library professionals
      • Globalization of cultural heritage
          • Not enough expertise in Europe + North America
  138. 145. We can (if we choose) transform our ability to advance the intellectual life of society
  139. 146. Thank you!
