Digital Frontiers 2015: eMOP's Imprint (Printer's and Publisher's) DB


  1. eMOP’s Printers and Publishers: Toward Crafting an Early Modern Print Database
     Matthew Christy, Elizabeth Grumbach
  2. eMOP Info / eMOP Resources / More eMOP
     • emop.tamu.edu
     • eMOP ImprintDB: github.com/Early-Modern-OCR/ImprintDB
     • Mellon Grant Proposal: idhmc.tamu.edu/projects/Mellon/eMOPPublic.pdf
     • Facebook: The Early Modern OCR Project
     • Twitter: #emop, @IDHMC_Nexus, @mandellc, @matt_christy, @EMGrumbach
     Digital Frontiers 2015 - eMOP ImprintDB, Sept. 18, 2015
  3. Goals
     • The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, to develop and test tools and techniques to apply Optical Character Recognition (OCR) to early modern English documents from the hand press period, roughly 1475-1800.
     • eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is insufficient for scholarly research.
  4. [Image slide: timeline of the early modern period, with notes on orthography and typefaces]
  5. Wrangling Data
     The Numbers
       • EEBO: ~125,000 documents, ~13 million page images (1475-1700)
       • ECCO: ~182,000 documents, ~32 million page images (1700-1800)
       • TCP: ~46,000 double-keyed hand transcriptions (44,000 EEBO, 2,200 ECCO) – Groundtruth
       • Total: >300,000 documents & ~45 million page images
     The Data
       • ECCO page images (1 pg/image)
       • ECCO original OCR results (doc-level XML files)
       • ECCO TCP transcriptions (doc-level XML and text files)
       • EEBO page images (2 pgs/image)
       • EEBO TCP transcriptions (doc-level XML and text files)
  6. eMOP DB
     • Document metadata
     • File locations
     • Page images
     • Pages text
     • Groundtruth text
     • OCR Results
     • Pages text
     • Scores against Groundtruth
     • Results of analysis
       • noise measure
       • skew measure
       • multiple column coords
       • corrections made
  7. The Problems: Early Modern Imprints
     • Missing
     • Incorrect
       • accidentally by printer
       • accidentally by DB provider
       • purposefully
     • No standard format or consistent inclusion of information
     • Inconsistent spelling and use of initials
     • Use of conversational language
     • Use of non-English (Latin, Welsh) or a mix of languages
     Example imprint: Imprinted at London : by John Jugge, dwellyng at the north doore of Paules
  8. The Solution: Early Modern Imprints
     Iterative application of regular expressions to cull out the data:
       • Who the work was Printed By
       • Who the work was Printed For
       • Who the work was Sold By
       • The Place of printing (London, Cambridge, Dublin, etc.)
       • The Location of printing (“the north doore of Paules”)
       • Date (gathered from separate metadata field)
     Printed by: Iohn Iugge
     Place: London
     Location: the north doore of Paules : 1580?
     Terms to identify the printer:
       • “printed”, sometimes also accompanied by “by”
       • prynted
       • reprinted or re-printed
       • imprinted
       • pressed
       • brintwyd (Welsh)
       • Typis, presso, pressare, excudebat, … (Latin)
       • etc.
  9. Results
     <work>
       <emopNO>140776</emopNO>
       <eccoNO>67101600</eccoNO>
       <tcpNO>NULL</tcpNO>
       <estcNO>T077294</estcNO>
       <imprintORIG>[London] : In the Savoy: printed by John Nutt; for John Walthoe, 1713.</imprintORIG>
       <date>1713</date>
       <imprintCLN>London : in the Savoy: printed by John Nutt; for John Walthoe,</imprintCLN>
       <place>London</place>
       <printedBy>John Nutt</printedBy>
       <printedFor>John Walthoe</printedFor>
       <location>in the Savoy</location>
     </work>
     source: http://bit.ly/1hXpVpd
  10. eMOP Outcomes - Github
      https://github.com/Early-Modern-OCR/ImprintDB
  11. Outcomes – DB of EM Printers
      source: http://blog.volkovlaw.com/2013/03/the-future-of-compliance-what-will-the-new-tools-look-like/
  12. The end
      For eMOP questions please contact us at: mchristy@tamu.edu, egrumbac@tamu.edu, mandell@tamu.edu

Editor's Notes

  • [LIZ: this is just my “find out more info about eMOP” slide]
  • The Early Modern OCR Project (eMOP) is an Andrew W. Mellon Foundation funded grant project running out of the IDHMC at Texas A&M. Our goal is to develop and test tools and techniques to improve Optical Character Recognition (or OCR) outcomes for printed English documents from the hand press period, roughly 1475-1800. The basic premise of eMOP is to use typeface and book history techniques to train modern OCR engines specifically on the typefaces in our collection of documents, and thereby improve the accuracy of the OCR results. eMOP’s immediate goal is to make machine-readable, or improve the machine readability of, 45 million pages of text from two major proprietary databases: Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO). Generally, eMOP aims to improve the visibility of early modern texts by making their contents fully searchable. The current paradigm of searching special collections for early modern materials by either metadata alone or “dirty” OCR is inefficient for scholarly research.
  • [LIZ: Not sure you want this in here or not, sometimes it’s useful]
    This slide gives you a good idea of the timeline of the documents we worked with on eMOP, which constitutes the “early modern” period.
    We worked with over 300,000 documents.
    You can also see information on right about orthography (spelling/language conventions) and typefaces for this period. These were both issues that complicated eMOP.
    For example, EEBO had quite a lot of blackletter typefaces that were quite distinct from each other, requiring more specific OCR training for those typefaces. And spelling irregularity throughout this period made checking and correcting OCR output that much more difficult. (We used alternate spelling lists gathered from several sources.)

  • The data came to us in many different forms that had to be normalized, organized, and scraped for metadata:
    For example:
    Some page images were 1 page per image, some were 2 pages per image. So we had to figure out a page-numbering scheme that would be consistent and would match up with groundtruth files either way.
    AND metadata had to be generated from a variety of different formats and sources (XML files, Excel spreadsheets, Unix file info, etc., depending on the source), then ingested to create the eMOP DB.
    That in itself was a major undertaking that took about the first 3 months of the project.
    Afterwards we continued to discover that some of the metadata supplied to us was incorrect, contradictory, or missing. So we had to do as much cleanup of the data as we could as we went along.
    We still have more cleanup to do, but we think we have one of the most comprehensive and correct collections of metadata for early modern printed documents
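
One way to picture the page-numbering problem is the sketch below. It is only an illustration: the talk does not describe the numbering scheme eMOP actually adopted, so the function name `pages_for_image` and the rule itself (number pages consecutively, with a two-page image contributing a left and a right page) are assumptions.

```python
def pages_for_image(image_index, pages_per_image):
    """Return the page numbers covered by one image file.

    Hypothetical scheme, not eMOP's documented one: images are numbered
    from 1, and pages are numbered consecutively across images, so a
    2-pages-per-image scan (EEBO-style) yields a left and a right page.
    """
    if pages_per_image not in (1, 2):
        raise ValueError("expected 1 or 2 pages per image")
    if image_index < 1:
        raise ValueError("image_index is 1-based")
    first_page = (image_index - 1) * pages_per_image + 1
    return list(range(first_page, first_page + pages_per_image))


# ECCO-style image (1 page per image) and EEBO-style image (2 pages per image)
ecco_pages = pages_for_image(3, 1)
eebo_pages = pages_for_image(3, 2)
print(ecco_pages, eebo_pages)
```

With a rule like this, a groundtruth page can be looked up by page number regardless of how many pages share an image file.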
  • So that is an overview of the eMOP DB, and here’s an internal view of the eMOP DB. The details aren’t necessarily important, except to say that our DB contains all of this information on each individual page from EEBO & ECCO, which is a great deal.

    [ENTER]

    AND from this eMOP DB, what we extracted to create the printers and publishers DB was actually from this single field. In this wks_publisher field was contained the imprint line from each individual work OCR’d by eMOP in the last two years (all 300,000+ documents).

    Our original intention was to use this data to identify which printers used which typefaces over the early modern period. We would then apply that research to our OCR font training and specifically apply, for example, Caslon font training to a document printed with the Caslon typeface – much like our “modern” OCR engines are trained to read modern fonts like Helvetica or Times New Roman. After much trial and error, we realized that 1) while we could connect some printers and typefaces, creating a database with this information would take much longer than the grant period and 2) these connections weren’t needed due to the way the Tesseract OCR engine worked.

    With that said, it’s still an eMOP goal to produce this DB, so we took the first steps to creating an “Imprint Database” containing all the information we could glean from the Imprint line of these books (confirming that information with other metadata on hand, in the eMOP DB).
  • There are HOWEVER a lot of issues involved with programmatically identifying publishing information from early modern imprints:
    Imprints can be missing or misleading, either intentionally (political pamphlets) or unintentionally, or the data could have been entered incorrectly into the DB by the provider.
    Use of conversational English (so, there was no standard format for how this information was displayed – usually it was stylized as a “sentence” instead of the more easily algorithmically identifiable modern publisher information in a book).
    There is also a good deal of non-English or mixed language information in these imprint lines. The use of Latin place names was particularly common, but my favorite examples come from the Welsh language documents that we have in our DB. It’s relatively easy to find someone with experience in medieval Latin; but slightly harder to find an expert in early modern Welsh.

    In the image on the right we can see the imprint (at the bottom) [hit Enter] contains [hit Enter] :
    conversational English – in the form of a sentence
    the use of “I” in place of “J” – which complicates things
    “dwellyng”? Is that the publishing location? – which is an example of inconsistent spelling, and something that we didn’t take into account in our first pass at algorithmically pulling out publishing location information
  • To do this work we used a set of regular expressions applied over several iterations. The regular expressions looked for cues (key words, initials, punctuation, etc.) to break the phrase up and then identify the category:
    personal name
    place name
    role (who the work was printed for / who it was printed by)
    location of printing, for which we looked for cues like “at” or “by” (but not preceded by “sold”, “printed”, etc.)


    To give you an idea of how complicated this was [hit Enter] , we discovered that the printer of a document could be identified by a number of keywords (or cues…or clues):
    “printed”, sometimes also accompanied by “by”
    prynted
    reprinted or re-printed
    imprinted
    pressed
    brintwyd (Welsh)
    Typis, presso, pressare, excudebat, … (Latin)
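
The iterative approach described in this note can be sketched roughly as follows. This is not the project's actual code: the cue list is only a subset of the keywords named above, and `PRINT_CUE`, `PLACES`, `RULES`, and `parse_imprint` are hypothetical names introduced for illustration. The sketch splits the imprint on punctuation, then runs one regular expression per pass over whatever segments are still unclassified.

```python
import re

# Assumed subset of the printer cues listed in the talk (English and Welsh only).
PRINT_CUE = r"(?:re-?printed|imprinted|printed|prynted|pressed|brintwyd)"

# Assumed tiny gazetteer; the talk names these three places as examples.
PLACES = {"london", "cambridge", "dublin"}

# One rule per pass, applied in order; each rule captures the field value.
RULES = [
    ("soldBy", re.compile(r"^(?:and\s+)?sold\s+by\s+(.+)$", re.I)),
    ("printedBy", re.compile(rf"^{PRINT_CUE}\s+by\s+(.+)$", re.I)),
    ("printedFor", re.compile(rf"^(?:{PRINT_CUE}\s+)?for\s+(.+)$", re.I)),
    ("place", re.compile(rf"^{PRINT_CUE}\s+at\s+(.+)$", re.I)),
    ("printedBy", re.compile(r"^by\s+(.+)$", re.I)),
    ("location", re.compile(r"^dwell[yi]ng\s+at\s+(.+)$", re.I)),
    ("location", re.compile(r"^((?:at|in)\s+.+)$", re.I)),
]


def parse_imprint(imprint):
    """Classify the pieces of one imprint line into named fields."""
    # Drop editorial brackets, then break the phrase up on punctuation cues.
    cleaned = re.sub(r"[\[\]]", "", imprint)
    segments = [s.strip(" .") for s in re.split(r"[:;,]", cleaned)]
    # Dates come from a separate metadata field, so year-like segments are skipped.
    remaining = [s for s in segments if s and not re.fullmatch(r"\d{4}\??", s)]

    fields = {}
    for field, pattern in RULES:  # one pass per rule
        for segment in list(remaining):
            match = pattern.match(segment)
            if match and field not in fields:
                fields[field] = match.group(1).strip()
                remaining.remove(segment)
    # Final pass: a bare segment that is a known place name.
    for segment in list(remaining):
        if segment.lower() in PLACES and "place" not in fields:
            fields["place"] = segment
            remaining.remove(segment)
    return fields


print(parse_imprint("[London] : In the Savoy: printed by John Nutt; for John Walthoe, 1713."))
print(parse_imprint("Imprinted at London : by John Jugge, dwellyng at the north doore of Paules"))
```

Rule order matters here: the more specific cues (“sold by”, “printed by”) run before the bare “by” and “at”/“in” rules, which mirrors the caveat above about not treating a “by” preceded by “sold” or “printed” as a location cue. Latin cues, initials, and variant spellings of names are left out of this sketch.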
  • The result was a text file (one each for EEBO and ECCO separately), and those text files contained:
    The eMOP # of the work (from the eMOP DB)
    The original imprint line
    A cleaned version of the imprint line
    (We kept the original, and a cleaned/formatted version of the imprint line as a reference. We don’t expect that we got everything right with our regular expressions, so this is a way for scholars, or anyone that we collaborate with in the future, to see the original imprint line in order to double-check or correct the formatted imprint data.)
    The date (from the eMOP DB; this is based on metadata created by the providers, Gale-Cengage Learning and ProQuest)
    And when available:
    Place
    Printed By
    Printed For
    Sold By
    Location

    We then transformed the text files into XML files for easy use and portability to multiple formats. (We can use this to ingest into a DB, transform to HTML, turn into a spreadsheet, etc).

    We also added ESTC and TCP numbers for each document, when we had them. We collected this information from various sources.
    And we also added ECCO and EEBO numbers as identifiers. For ECCO there is one number, but for EEBO there are two.
    It was the eMOP DB that allowed us to tie all these numbers together and to the Imprints.
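
As a rough illustration of the text-to-XML step, the sketch below builds a `<work>` element using the element names shown on the Results slide. The input format is an assumption (a plain dict per work, since the layout of the intermediate text files is not shown), as are the names `FIELD_ORDER`, `ID_FIELDS`, and `record_to_work`, and the choice to write `NULL` for any missing identifier (the slide shows this only for `tcpNO`). It covers ECCO-style records only.

```python
import xml.etree.ElementTree as ET

# Element names and their order as they appear on the Results slide.
FIELD_ORDER = [
    "emopNO", "eccoNO", "tcpNO", "estcNO",
    "imprintORIG", "date", "imprintCLN",
    "place", "printedBy", "printedFor", "location",
]
ID_FIELDS = {"emopNO", "eccoNO", "tcpNO", "estcNO"}


def record_to_work(record):
    """Turn one parsed imprint record (a dict) into a <work> element."""
    work = ET.Element("work")
    for name in FIELD_ORDER:
        value = record.get(name)
        if not value and name in ID_FIELDS:
            value = "NULL"  # assumption: missing identifiers are written as NULL
        if value:  # optional fields are written only when available
            ET.SubElement(work, name).text = value
    return work


record = {
    "emopNO": "140776",
    "eccoNO": "67101600",
    "estcNO": "T077294",
    "imprintORIG": "[London] : In the Savoy: printed by John Nutt; for John Walthoe, 1713.",
    "date": "1713",
    "imprintCLN": "London : in the Savoy: printed by John Nutt; for John Walthoe,",
    "place": "London",
    "printedBy": "John Nutt",
    "printedFor": "John Walthoe",
    "location": "in the Savoy",
}
xml_text = ET.tostring(record_to_work(record), encoding="unicode")
print(xml_text)
```

Keeping both `imprintORIG` and `imprintCLN` in every record is what lets a later reader check the extracted fields against the source line.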
  • We’ve taken these XML files we created and made them available on Github for anyone to make use of. With a few caveats:
    Because these imprints came to us from proprietary sources, we can’t technically share them. However, we can share the imprint info from those works which are available via the ESTC (English Short Title Catalogue). So on the Github page the XML files contain only those imprints which have ESTC numbers.
    For EEBO that’s 115,789 out of ~139,000 works (84%)
    For ECCO that’s 207,662 out of ~211,000 works (98%)
    We also included the schemas used to validate these files in Github
    We would really like it if scholars who download these files let us know if they end up finding problems and making corrections.

    This Github page is also available via the eMOP website [hit Enter] at emop.tamu.edu. Just click on the Github Repo tab to see all of eMOP’s open resources in Github.
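
The ESTC restriction described in this note amounts to a simple filter over the `<work>` records. The sketch below is an assumption-laden illustration, not the project's release script: the `<works>` wrapper element, the function name `shareable_works`, the second sample record (emopNO 999999 is made up), and the idea that a missing ESTC number appears as `NULL` (as the slide shows for `tcpNO`) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <works> root and the second record are invented.
SAMPLE = """
<works>
  <work>
    <emopNO>140776</emopNO>
    <estcNO>T077294</estcNO>
    <printedBy>John Nutt</printedBy>
  </work>
  <work>
    <emopNO>999999</emopNO>
    <estcNO>NULL</estcNO>
    <printedBy>Unknown</printedBy>
  </work>
</works>
"""


def shareable_works(root):
    """Keep only the works that carry a usable ESTC number."""
    keep = []
    for work in root.iter("work"):
        estc = (work.findtext("estcNO") or "").strip()
        if estc and estc != "NULL":
            keep.append(work)
    return keep


root = ET.fromstring(SAMPLE)
kept = shareable_works(root)
share = len(kept) / len(list(root.iter("work")))
print(f"{len(kept)} shareable work(s), {share:.0%} of the sample")
```

The same ratio, computed over the full files, is what the percentages quoted above describe.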
  • So, going forward we want to implement a better solution for sharing this data and having some kind of centralized clearing-house for it.
    We are planning on implementing this as a single online database (eXistDB) to make it easily searchable.
    We also want to create an online mechanism, via some kind of form, which would allow users to identify errors and request corrections, keeping this work in a single location for everyone to take advantage of.
    Eventually, we can tie this DB to other available open, online DBs, using the identifying numbers (eMOP, ESTC, ECCO, EEBO, TCP).
  • Please contact us with any questions.
