"Big Picture" Backgrounder for Crowdsourcing Ground-Truth - FactMiners, PRImA, & eMOP Knight Prototype Fund Entry

“Big Picture” Backgrounder for
FactMiners, PRImA, & eMOP’s
Knight Prototype Fund Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
Step 1. Crowdsourcing Ground-Truth
A PDF-format compilation of the four “silent Ignite Talk”
slide-shows submitted in support of FactMiners &
PRImA’s recent Knight News Challenge entry.

FactMiners’ Prototype Fund entry is the 1st step in
our “Turn Text Soup into Smart Data” R&D agenda
• Originally submitted in support of an unfunded entry in the
current Knight News Challenge, these four short slide-
shows provide a “Big Picture” overview of our collaborative
research and development agenda
• Problem: Text Soup
OCR’s “Dirty” Little Secret
• Goal: Smart Data
From “Readable” to “Computable”
• Solution: Technology
Machine Learning & Smart Data
• Solution: People
Crowdsourcing Ground-Truth
FactMiners, PRImA, & eMOP: Knight Prototype Fund entry – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives:
Step 1. Crowdsourcing Ground-Truth”
Please @KnightFdn folk, don’t deprive
my little boy of a good education. His
Citizen Science teachers need to
create training materials, tests, &
answer sheets for his school at the
Internet Archive.
011010..01.. INIT… Hello, world!
Prototype Fund funding will help me
serve you better…and this part is all
about what will be done by this project.

Problem: Text Soup
OCR’s “Dirty” Little Secret
FactMiners & PRImA’s
Knight News Challenge Entry
Part 1
Recently submitted in support of our unfunded Knight News
Challenge entry, this short slide-show describes a key element of
our collaborative R&D agenda. Our current Prototype Fund
submission is the first strategic step in pursuit of this mission.

Q: What is “Text Soup”?
• A: The uncorrected and
usually hidden text “layer”
that is generated by OCR
(optical character recognition)
during bulk scanning and
digitization of historic and
cultural heritage documents.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
Scanned Images
or photos of pages!

Q: How is “Text Soup” Used?
• A: Primarily “behind the
scenes” to support “full text”
search.
• Good for things like:
• Show me the pages with the
word “razor” on them in this
book.
• What books are about shaving?
• What words are found in
proximity to the word “strop”?
Scanned
Image of text!
Hidden text
layer…
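
The proximity query on this slide can be sketched over a page's hidden text layer. This is a minimal illustration, assuming the OCR text is available as a plain string; the `words_near` function and the sample sentence are hypothetical and do not reflect any Internet Archive search API.

```python
import re
from collections import Counter

def words_near(text, target, window=3):
    """Count the words found within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    neighbors = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Skip the target occurrence itself, keep everything else in the window.
            neighbors.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return neighbors

# Invented OCR text for one page (illustration only).
page_text = (
    "Draw the razor along the leather strop before shaving. "
    "A good strop keeps the razor keen."
)
near = words_near(page_text, "strop", window=2)
```

A query like this works on flat text alone, which is why Text Soup supports it: no knowledge of articles, ads, or page layout is needed.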

Q: What are Text Soup’s limits?
• Automated OCR
(text recognition) is a
“one size fits all” process in
the workflow of bulk
scanning and digitization.
• Good for basic books &
monographs with simple
document structure…

• Newspapers & magazines have complex
document structures
• Multiple articles, multiple
authors, text continuations,
advertisements, images,
sidebars, text used as art
in design, etc.
• All this data is locked in
our archives waiting
to be “fact-mined”

• On these pages from Softalk magazine we have lots of
“facts” in ads and a monthly column
• We can’t “locate” facts
and assess their meaning
based on the jumbled or
missing info in their
Text Soup.
Complex
document
structures
not identified!

We have to “tame” Text Soup to unlock
“facts” in archive data.
• Our project will focus on recognizing complex
document structure and on “fact-revealing”
content modeling.
• In the next slideshow, we describe our vision for
“fact-mining” Smart Data from newspaper &
magazine digital archives…

Goal: Smart Data
From “readable” to “computable”
Part 2

Q: What is Smart Data?
• A: Smart Data is self-descriptive
data that can “carry on a conversation”
with Smart Programs to support
access, editing, and visualization of
the data itself.
The “actual” data of the database
To access the “actual” data of the database,
Smart Programs “talk” to an embedded
“database about the database” (AKA a metamodel).

Q: What does Smart Data look like?
• A: Smart Data includes BOTH the
complex document structure
of the source AND the underlying
conceptual model of the source
content.

Q: What can Smart Data do?
• A: Turn expensive, time-
consuming, labor-intensive
research studies into “Just ask!”
queries
• Good for things like:
• How did local reporting of race
relations impact public policy in
Indiana in the 1950s?
• Did advertising or editorial
coverage account for the
popularity of programs in the
Softalk Bestseller lists?

Q: How “smart” is our Smart Data design?
• We spent a year researching
museum informatics and
prototyping Smart Data designs.
• Our software architecture is based
on CIDOC-CRM (Conceptual
Reference Model for Museums)
microservice workflows and
PRESSoo, the ISSN.org
metamodel for serial publications
Timeline (Winter, 2013; Spring, 2014; Fall, 2014; Summer, 2015):
• Neo4j GraphGist Challenge, 1st place for Metamodel Subgraph domain model.
• Semi-finals, Ashoka/LEGO “Re-imagine Learning” Challenge.
• #MW2014 FactMiners demo. Introduced to #cidocCRM.
• Museum Computer Network Emerging Professional Scholarship. #MCN2014 paper & demo.
• “Massively Addressable Text” published in peer-reviewed CODE|WORDS.
• #HILT2015 Crowdsourcing Course.
• DPLA Community Reps.
• Internet Archive Content Partner.
• ICOM #cidocCRM SIG member.
• Incorporate PRESSoo into design.
• Begin PRImA Collaboration.

Q: How “open” is our Smart Data design?
• Using a metamodel
subgraph design
pattern to embed and pass
info about data and its access
and transformation is
technology neutral &
future-proof.
Without Smart Data
Database
Domain knowledge written
into task-specific programs:
10 Load X
20 Print X
30 Goto 10
With Smart Data
Metamodel statically stored
within #TEI header section of
source documents (standard text files):
<teiHeader>
<metamodel />
<structure />
<content />
Any “smart” DB
For dynamic Linked Open Data access,
the DB need only have import & the
ability to represent data structures
read from the metamodel header.
“Smart” program in
any language:
10 Load metamodel
20 Configure editors
30 Do stuff…
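
The “load metamodel, configure editors” steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `<teiHeader>`, `<metamodel>`, and `<content>` element names come from the slide, but the `<field>` children, their `name` and `type` attributes, the sample items, and the `load_metamodel` and `configure_editors` helpers are hypothetical. They are not part of the TEI standard or of any FactMiners code.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: the header carries a metamodel describing the data.
DOC = """
<doc>
  <teiHeader>
    <metamodel>
      <field name="title" type="str"/>
      <field name="page" type="int"/>
    </metamodel>
  </teiHeader>
  <content>
    <item title="Feature Article" page="12"/>
    <item title="Advertiser Index" page="96"/>
  </content>
</doc>
"""

# Type names the metamodel may use, mapped to Python converters (an assumption).
CASTS = {"str": str, "int": int}

def load_metamodel(root):
    """Step 10: read field names and types from the embedded metamodel."""
    return {f.get("name"): f.get("type") for f in root.find("teiHeader/metamodel")}

def configure_editors(metamodel):
    """Step 20: build one converter per field from the metamodel alone."""
    return {name: CASTS[type_name] for name, type_name in metamodel.items()}

root = ET.fromstring(DOC)
editors = configure_editors(load_metamodel(root))

# Step 30: "do stuff" with the data, guided only by what the metamodel says.
records = [
    {name: cast(item.get(name)) for name, cast in editors.items()}
    for item in root.find("content")
]
```

The program contains no knowledge of titles or page numbers; adding a field to the metamodel would change what it reads without changing its code, which is the point of the pattern.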

We have a design to “tame” Text Soup and
unlock “facts” in archive data.
• An innovative design combining international standards
for conceptual modeling of museum collections
(CIDOC-CRM and PRESSoo) with a “self-descriptive”
software/database design pattern provides the
foundation for mining Smart Data from Text Soup.
• In the next slideshow, we describe our design for the
technology to “fact-mine” Smart Data from
newspaper & magazine digital archives…

Solution: Technology
Machine Learning & Smart Data
Part 3

Q: Can Robots* read magazines?
• Yes (mostly)…when looking at
layout & text recognition within
the individual page
• No...in terms of recognizing the
complex document structure of the
whole issue
• Our challenge is to move from
individual page to whole-issue
document structure recognition.
* “Robot” = Software Agent
(AKA, a computer program)
From page…
…to pages!

Q: What’s the 1st Step & Where to do it?
• We start by teaching Robot
agents to find & understand
the TOC (Table of Contents)
and Advertiser Index pages
of newspapers & magazines.
• The best place to do this
applied research is in the
collections of the Internet
Archive.
Bring ‘em on!
I can’t get
enough TOC.

Q: Why TOCs & Advertiser Indexes?
• A: TOCs (Table of Contents) &
Advertiser Indexes reveal the
complex document structure of
newspapers & magazines.
• Like a Sudoku puzzle, the TOC &
Ad Index provide helpful “filled-in
answers” about the types of
content to be found within pages
of the newspaper or magazine.

Q: Why the Internet Archive?
• Thousands of “Text Soup era”
newspaper & magazine collections that
can be enhanced through research.
• The Archive’s Scanning Service flags
TOC pages & generates a TOC-
specific XML-encoded file during
its standard digitization workflow.
• The current Archive TOC OCR
analysis does not “see” &
understand the complex TOCs of
magazines & newspapers.

Q: What TOC Robots will we develop?
• TOC-Spotter is an Image/Scene
Recognition software agent to crawl
the Archive in search of TOC &
Ad Index pages.
• TOC-Reader is a software agent
extending PRImA recognition &
evaluation technologies with
Machine Learning capabilities to do
“deep reading” with the assistance of the
TOC Pattern Reference Library.
Dot dot dot… Check!
Number… Check!
YO! Gotta TOC here!
Great! Let me take
a good look at it.
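
The “dot leaders, then a page number” cue in the dialogue above can be illustrated on text alone. This is a rough sketch, not the TOC-Spotter design, which is described as image/scene recognition: the regular expression, the `parse_toc_line` and `looks_like_toc` helpers, the 0.5 threshold, and the sample lines are all assumptions made for illustration.

```python
import re

# A TOC-style line: a title, a run of three or more dots, then a page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*\.{3,}\s*(?P<page>\d+)$")

def parse_toc_line(line):
    """Return (title, page) for a TOC-style line, or None if it does not match."""
    m = TOC_LINE.match(line.strip())
    return (m.group("title"), int(m.group("page"))) if m else None

def looks_like_toc(lines, threshold=0.5):
    """Guess that a page is a TOC when enough non-blank lines are TOC-style."""
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if parse_toc_line(ln))
    return hits / len(lines) >= threshold

# Invented sample pages (illustration only).
toc_page = [
    "CONTENTS",
    "Feature Article .......... 12",
    "Monthly Column ...... 34",
    "Advertiser Index ... 96",
]
body_page = ["We have lots of facts in ads.", "See page 12 for more."]
```

Here `looks_like_toc(toc_page)` is true and `looks_like_toc(body_page)` is false. Real magazine TOCs vary far more than this pattern allows, which is the motivation for a trained recognizer.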

Q: How will this help?
• By running our TOC-Agents early in
digitization workflows, we can make
smarter within-page layout
recognition decisions during bulk
OCR of the issue’s subsequent pages.
• We can generate “best guess”
structure-revealing meta-tags in
appropriate files as part of the
standard Archive scanning workflow.
Let’s see… Based on my notes,
that’d be an ad, a feature article,
another ad…and there’s the
Ad Index!
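
The “based on my notes” idea above, using TOC entries to make a best guess about later pages, can be sketched like this. It is a toy example under stated assumptions: the `label_pages` helper, the `(start_page, kind)` entry format, and the sample entries are hypothetical, and a real issue would need to handle continuations and mixed pages.

```python
from bisect import bisect_right

def label_pages(toc_entries, last_page):
    """Give each page the kind of the latest TOC entry starting at or before it."""
    entries = sorted(toc_entries)
    starts = [start for start, _ in entries]
    labels = {}
    for page in range(1, last_page + 1):
        i = bisect_right(starts, page) - 1
        # Pages before the first listed entry have no guess yet.
        labels[page] = entries[i][1] if i >= 0 else "unknown"
    return labels

# Invented (start_page, kind) entries as a TOC reader might report them.
toc = [(3, "ad"), (4, "feature article"), (9, "ad"), (10, "ad index")]
labels = label_pages(toc, last_page=10)
```

Labels like these are only a starting hypothesis for within-page layout recognition, which would still need to confirm or correct them.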

Q: What will be accomplished?
• The structure-mapped text files
generated by the TOC-Reader
agent will be ready for FactMiners'
Semantic tagging (AKA “fact-
mining”) of the issue’s content.
• These files will be compatible with
PRImA’s Aletheia program for use
in crowdsourced Ground-Truth
development of the TOC Pattern
Reference Library.
Welcome to the
TOC Pattern Reference Library

• Our immediate PRImA-inspired technology agenda is to
develop “Robot” assistance (software agents) to find,
recognize & deeply understand the TOCs (Tables of
Contents) and Advertiser Indexes of issues in the
Internet Archive’s magazine & newspaper collections.
• In our last slideshow, we describe the people dimension
of our strategy to “fact-mine” Smart Data from
newspaper & magazine digital archives…

Solution: People
Crowdsourcing Ground-Truth
Part 4

Q: Why do we need people?
• If all we had to do was write some
smart “Robot” programs & simply put
them to work, we wouldn’t need people.
• But writing smart code is just the “birth”
of a Machine-Learning Robot.
• We have to teach our Robots how to
read magazines & newspapers!
011010..01.. INIT…
What am I? Who will
teach me? What’s a
magazine?

Q: What is Ground-Truth?
• Teaching means training: lessons, study
materials, tests & their answer sheets, etc.
• An “answer sheet” in OCR research is
called a Ground-Truth solution – the
human-crafted “perfect answer” to
recognition of a scanned page.
• To teach our Robots to read magazines,
we’ll need a pile of TOC* Ground-Truth!
*TOC being “Table of Contents”
See our 3rd Silent Ignite slideshow
for more on TOCs & Technology
Yes, your Honor…
That is EXACTLY
what I saw and ONLY
what I saw on the
Table of Contents
page shown to me
as Exhibit A.
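
Using a Ground-Truth “answer sheet” to grade a recognition result can be sketched with a character error rate, a common OCR accuracy measure. This is a minimal illustration covering text only; a Ground-Truth solution as described here also covers page layout and structure, which this sketch ignores. The function names and sample strings are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(ocr_text, ground_truth):
    """Edit distance to the Ground-Truth text, per Ground-Truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)

# Invented example: two characters misread by OCR.
ground_truth = "Table of Contents"
ocr_text = "Tab1e of Contcnts"
cer = character_error_rate(ocr_text, ground_truth)
```

The score is only as trustworthy as the answer sheet, which is why the Ground-Truth has to be human-crafted.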

Q: What’s the TOC Pattern Reference Library?
• It will be a Special Purpose Research
Collection at the Internet Archive to
be used to “teach Robots to read
magazines & newspapers.”
• It will include a TOC Image Dataset,
TOC Ground-Truth Solutions, & an Open
Source library of TOC-Spotting &
TOC-Reading software.
Welcome to the
TOC Pattern Reference Library
Yes, counsel, in answer to
your question allow me to
reference material from
the Library.

Q: How will Citizen Scientists help?
• “Volunpeers” are already generating
Ground-Truth data for the TOC Pattern
Reference Library through our
project on the Zooniverse
crowdsourcing platform.
• In addition to refining the workflow for
Ground-Truth data collection, this
project will develop Zooniverse data
export to PRImA’s Aletheia.

Q: What is Aletheia “FactMiners Ed.”?
• Aletheia is PRImA’s desktop & web
Ground-Truth Tool.
• Funding will allow PRImA to add
features to Aletheia to support
“whole issue” modeling in
Ground-Truth Solutions.
• We get a Power Tool for Citizen
Scientists who want to “dig deeper”
into Internet Archive newspaper &
magazine collections as pioneer
FactMiners!

• We are confident that the applied research project
submitted as our Knight News Challenge entry will
make substantive contributions to the domain of Open
Data by helping to turn Text Soup into Smart Data in
newspaper & magazine archives.
• We hope you have enjoyed all four of our “silent Ignite Talk”
video slideshows. We welcome your comments, questions,
& (of course) “applause” at: https://goo.gl/99Vn5M

FactMiners & PRImA:
Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in Newspaper & Magazine Archives”
https://goo.gl/99Vn5M
• Team
• Jim Salmons, FactMiners
• Timlynn Babitsky, FactMiners
• Apostolos Antonacopoulos, PRImA
• Christian Clausner, PRImA
This is the final slide for our
prior submission to the News Challenge
and not to be confused with our current
Prototype Fund entry…