FactMiners, PRImA, & eMOP’s
Knight Prototype Fund Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
A PDF-format compilation of the four “silent Ignite Talk”
slide-shows submitted in support of FactMiners &
PRImA’s recent Knight News Challenge entry.
“Big Picture” Backgrounder for
Step 1. Crowdsourcing Ground-Truth
FactMiners’ Prototype Fund entry is the 1st step in
our “Turn Text Soup into Smart Data” R&D agenda
• Originally submitted in support of an unfunded entry in the
current Knight News Challenge, these four short slide-
shows provide a “Big Picture” overview of our collaborative
research and development agenda
• Problem: Text Soup
OCR’s “Dirty” Little Secret
• Goal: Smart Data
From “Readable” to “Computable”
• Solution: Technology
Machine Learning & Smart Data
• Solution: People
Crowdsourcing Ground-Truth
FactMiners, PRImA, & eMOP: Knight Prototype Fund entry – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives:
Step 1. Crowdsourcing Ground-Truth”
Please @KnightFdn folk, don’t deprive
my little boy of a good education. His
Citizen Science teachers need to
create training materials, tests, &
answer sheets for his school at the
Internet Archive.
011010..01.. INIT… Hello, world!
Prototype Fund funding will help me
serve you better…and this part is all
about what will be done by this project.
Problem: Text Soup
OCR’s “Dirty” Little Secret
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
Part 1
Recently submitted in support of our unfunded Knight News
Challenge entry, this short slide-show describes a key element of
our collaborative R&D agenda. Our current Prototype Fund
submission is the first strategic step in pursuit of this mission.
Q: What is “Text Soup”?
• A: The uncorrected and
usually hidden text “layer”
that is generated by OCR
(optical character recognition)
during bulk scanning and
digitization of historic and
cultural heritage documents.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
Scanned Images
or photos of pages!
Q: How is “Text Soup” Used?
• A: Primarily “behind the
scenes” to support “full text”
search.
• Good for things like:
• Show me the pages with the
word “razor” on them in this
book.
• What books are about shaving?
• What words are found in
proximity to the word “strop” ?
Scanned
Image of text!
Hidden text
layer…
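The proximity question above ("What words are found near the word strop?") can be sketched in a few lines of Python. This is only an illustration of what a full-text search layer can do with Text Soup; the function name `words_near`, the `window` parameter, and the sample page text are assumptions made for this example, not part of any archive's search system.

```python
import re

def words_near(text, target, window=3):
    """Return the set of words found within `window` tokens of `target`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    near = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            near.update(tokens[lo:i])                  # words before the hit
            near.update(tokens[i + 1:i + 1 + window])  # words after the hit
    return near

# Hypothetical OCR output for one page (not taken from a real scan).
page_text = "Sharpen the razor on a leather strop before each shave"
neighbours = words_near(page_text, "strop", window=2)
```

A search like this only needs a flat stream of words, which is why Text Soup serves it well even when the page's structure is unknown.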
Q: What are Text Soup’s limits?
• Automated OCR
(text recognition) is a
“one size fits all” process in
the workflow of bulk
scanning and digitization.
• Good for basic books &
monographs with simple
document structure…
Q: What are Text Soup’s limits?
• Newspapers & magazines have complex
document structures
• Multiple articles, multiple
authors, text continuations,
advertisements, images,
sidebars, text used as art
in design, etc.
• All this data is locked in
our archives waiting
to be “fact-mined”
Q: What are Text Soup’s limits?
• On these pages from Softalk magazine we have lots of
“facts” in ads and a monthly column
• We can’t “locate” facts
and assess their meaning
based on the jumbled or
missing info in its
Text Soup.
Complex
document
structures
not identified!
We have to “tame” Text Soup to unlock
“facts” in archive data.
• Our project will focus on recognizing complex
document structure and on “fact-revealing”
content modeling.
• In the next slideshow, we describe our vision for
“fact-mining” Smart Data from newspaper &
magazine digital archives…
Goal: Smart Data
From “readable” to “computable”
Part 2
Q: What is Smart Data?
• A: Smart Data is self-descriptive
data that can “carry on a conversation”
with Smart Programs to support
access, editing, and visualization of
the data itself.
The “actual” data of the database
To access the “actual” data of the database,
Smart Programs “talk” to an embedded
“database about the database” (AKA a metamodel).
Q: What does Smart Data look like?
• A: Smart Data includes BOTH the
complex document structure
of the source AND the underlying
conceptual model of the source
content.
Q: What can Smart Data do?
• A: Turn expensive, time-
consuming, labor-intensive
research studies into “Just ask!”
queries
• Good for things like:
• How did local reporting of race
relations impact public policy in
Indiana in the 1950s?
• Did advertising or editorial
coverage account for the
popularity of programs in the
Softalk Bestseller lists?
Q: How “smart” is our Smart Data design?
• We spent a year researching
museum informatics and
prototyping Smart Data designs.
• Our software architecture is based
on CIDOC-CRM (Conceptual
Reference Model for Museums),
microservice workflows, and
PRESSoo, the ISSN.org
metamodel for serial publications.
Winter, 2013
Spring, 2014
Fall, 2014
Summer, 2015
Neo4j GraphGist Challenge,
a 1st place for Metamodel
Subgraph domain model
Semi-finals Ashoka/LEGO
“Re-imagine Learning” Challenge.
#MW2014 FactMiners demo.
Introduced to #cidocCRM.
Museum Computer Network
Emerging Professional Scholarship.
#MCN2014 paper & demo.
“Massively Addressable Text” published
in peer-reviewed CODE|WORDS.
#HILT2015 Crowdsourcing Course
DPLA Community Reps.
Internet Archive Content Partner.
ICOM #cidocCRM SIG member.
Incorporate PRESSoo into design.
Begin PRImA Collaboration.
Q: How “open” is our Smart Data design?
• Using a metamodel
subgraph design
pattern to embed and pass
info about data and its access
and transformation is
technology neutral &
future-proof.
Without Smart Data
With Smart Data
Database
10 Load X
20 Print X
30 Goto 10
Domain knowledge written
into task-specific programs
Metamodel statically stored
within #TEI header section of
source documents std. text files
<teiHeader>
<metamodel />
<structure />
<content />
Any “smart” DB
For dynamic Linked Open Data access,
DB need only have import &
ability to represent data structures
read from metamodel header.
10 Load metamodel
20 Configure editors
30 Do stuff…
“Smart” program in
any language
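The “With Smart Data” side of the comparison can be sketched as a tiny Python program that first loads the metamodel and only then touches the data. This is a minimal sketch under stated assumptions: the dictionary layout, the type names `Article` and `Advertisement`, and the helper `describe` are all hypothetical and are not the FactMiners metamodel itself.

```python
# A toy "Smart Data" store: the metamodel travels with the data it describes.
# Every name and field here is a hypothetical illustration.
smart_data = {
    "metamodel": {
        "Article": {"title": "str", "start_page": "int"},
        "Advertisement": {"advertiser": "str", "page": "int"},
    },
    "records": [
        {"type": "Article", "title": "Example Feature", "start_page": 12},
        {"type": "Advertisement", "advertiser": "Example Co.", "page": 13},
    ],
}

def describe(store, record):
    """Render a record using only what the embedded metamodel says about it."""
    fields = store["metamodel"][record["type"]]   # "load metamodel"
    parts = [f"{name}={record[name]!r}" for name in fields]
    return f"{record['type']}({', '.join(parts)})"

descriptions = [describe(smart_data, r) for r in smart_data["records"]]
```

The program contains no knowledge of articles or advertisements; adding a new record type means extending the metamodel, not rewriting `describe`.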
We have a design to “tame” Text Soup and
unlock “facts” in archive data.
• An innovative design combining international standards
for conceptual modeling of museum collections
(cidocCRM and PRESSoo) together with a “self-
descriptive” software/database design pattern provide the
foundation for mining Smart Data from Text Soup.
• In the next slideshow, we describe our design for the
technology to “fact-mine” Smart Data from
newspaper & magazine digital archives…
Solution: Technology
Machine Learning & Smart Data
Part 3
Q: Can Robots* read magazines?
• Yes (mostly)…when looking at
layout & text recognition within
the individual page
• No...in terms of recognizing the
complex document structure of the
whole issue
• Our challenge is to move from
individual page to whole-issue
document structure recognition.
* “Robot” = Software Agent
(AKA, a computer program)
From page…
…to pages!
Q: What’s the 1st Step & Where to do it?
• We start by teaching Robot
agents to find & understand
the TOC (Table of Contents)
and Advertiser Index pages
of newspapers & magazines.
• The best place to do this
applied research is in the
collections of the Internet
Archive.
Bring ‘em on!
I can’t get
enough TOC.
Q: Why TOCs & Advertiser Indexes?
• A: TOCs (Table of Contents) &
Advertiser Indexes reveal the
complex document structure of
newspapers & magazines.
• Like a Sudoku puzzle, the TOC &
Ad Index provide helpful “filled-in
answers” about the types of
content to be found within pages
of the newspaper or magazine.
Q: Why the Internet Archive?
• Thousands of “Text Soup era”
newspaper & magazine collections that
can be enhanced through research.
• The Archive’s Scanning Service flags
TOC pages & generates a TOC-
specific XML-encoded file during
its standard digitization workflow.
• The current Archive TOC OCR
analysis does not “see” &
understand the complex TOCs of
magazines & newspapers.
Q: What TOC Robots will we develop?
• TOC-Spotter is an Image/Scene
Recognition software agent to crawl
the Archive in search of TOC &
Ad Index pages.
• TOC-Reader is a software agent
extending PRImA recognition &
evaluation technologies with
Machine Learning capabilities to do
“deep reading” with the assist of the
TOC Pattern Reference Library.
Dot dot dot… Check!
Number… Check!
YO! Gotta TOC here!
Great! Let me take
a good look at it.
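The Robot's checklist ("Dot dot dot… Check! Number… Check!") hints at the visual cues of a TOC page: dot leaders followed by page numbers. The slides describe TOC-Spotter as an image/scene recognition agent; the sketch below is a much simpler text-only stand-in that applies the same cue to OCR lines. The pattern `TOC_LINE`, the function `looks_like_toc`, the `min_hits` threshold, and the sample lines are all assumptions for illustration.

```python
import re

# A TOC-style line: some title text, a run of dot leaders, then a page number.
TOC_LINE = re.compile(r"^\s*\S.*?\s*(?:\.\s?){3,}\s*\d+\s*$")

def looks_like_toc(ocr_lines, min_hits=3):
    """Guess whether a page is a TOC by counting dot-leader lines."""
    hits = sum(1 for line in ocr_lines if TOC_LINE.match(line))
    return hits >= min_hits

# Hypothetical OCR lines from two pages.
toc_page = [
    "Contents",
    "Editorial ........ 2",
    "Feature Story . . . . 14",
    "Bestsellers.....30",
]
body_page = [
    "Sharpen the razor on a leather strop.",
    "See page 12 for details.",
]

is_toc = looks_like_toc(toc_page)
is_body_toc = looks_like_toc(body_page)
```

A rule like this would miss TOCs that use columns or graphics instead of dot leaders, which is one reason the project pairs its agents with Machine Learning and a reference library of real examples.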
Q: How will this help?
• By running our TOC-Agents early in
digitization workflows, we can make
smarter within-page layout
recognition decisions during bulk
OCR of the issue’s subsequent pages.
• We can generate “best guess”
structure-revealing meta-tags in
appropriate files as part of the
standard Archive scanning workflow.
Let’s see… Based on my notes,
that’d be an ad, a feature article,
another ad…and there’s the
Ad Index!
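The Robot's “notes” can be pictured as a lookup from page number to expected content, built from the TOC and Advertiser Index before the rest of the issue is processed. The following is a minimal sketch assuming the TOC has already been read into (title, start page) pairs; the function `label_pages`, the label strings, and the sample entries are hypothetical.

```python
def label_pages(toc, ad_index, page_count):
    """Map each page number to a list of best-guess content labels.

    toc: list of (title, start_page), assumed sorted by start_page.
    ad_index: list of (advertiser, page).
    """
    labels = {page: [] for page in range(1, page_count + 1)}
    for i, (title, start) in enumerate(toc):
        # Assumption: an article runs until the next TOC entry begins.
        end = toc[i + 1][1] - 1 if i + 1 < len(toc) else page_count
        for page in range(start, end + 1):
            labels[page].append(f"article: {title}")
    for advertiser, page in ad_index:
        labels.setdefault(page, []).append(f"ad: {advertiser}")
    return labels

# Hypothetical six-page issue.
toc = [("Editorial", 2), ("Feature Story", 4)]
ad_index = [("Example Software", 3), ("Example Hardware", 5)]
guesses = label_pages(toc, ad_index, page_count=6)
```

These labels are only “best guesses”: real magazines continue articles on non-adjacent pages, so the output is a starting hint for layout recognition rather than a final answer.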
Q: What will be accomplished?
• The structure-mapped text files
generated by the TOC-Reader
agent will be ready for FactMiners'
Semantic tagging (AKA “fact-
mining”) of the issue’s content.
• These files will be compatible with
PRImA’s Aletheia program for use
in crowdsourced Ground-Truth
development of the TOC Pattern
Reference Library.
Welcome to the
TOC Pattern Reference Library
We have a design to “tame” Text Soup and
unlock “facts” in archive data.
• Our immediate PRImA-inspired technology agenda is to
develop “Robot” assistance (software agents) to find,
recognize & deeply understand the TOCs (Table of
Contents) and Advertiser Indexes of magazines in the
Internet Archive magazine & newspaper collections.
• In our last slideshow, we describe the people dimension
of our strategy to “fact-mine” Smart Data from
newspaper & magazine digital archives…
Solution: People
Crowdsourcing Ground-Truth
Part 4
Q: Why do we need people?
• If all we had to do was write some
smart “Robot” programs & simply put
them to work, we wouldn’t need people.
• But writing smart code is just the “birth”
of a Machine-Learning Robot.
• We have to teach our Robots how to
read magazines & newspapers!
011010..01.. INIT…
What am I? Who will
teach me? What’s a
magazine?
Q: What is Ground-Truth?
• Teaching means training: lessons, study
materials, tests & their answer sheets, etc.
• An “answer sheet” in OCR research is
called a Ground-Truth solution – the
human-crafted “perfect answer” to
recognition of a scanned page.
• To teach our Robots to read magazines,
we’ll need a pile of TOC* Ground-Truth!
*TOC being “Table of Contents”
See our 3rd Silent Ignite slideshow
for more on TOCs & Technology
Yes, your Honor…
That is EXACTLY
what I saw and ONLY
what I saw on the
Table of Contents
page shown to me
as Exhibit A.
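To see how an “answer sheet” is used, a recognition result can be scored against its Ground-Truth text. The sketch below uses character error rate based on Levenshtein edit distance, a common generic OCR metric; it is not a description of PRImA's own evaluation tools, which also assess page layout. The sample strings and function names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def char_error_rate(ocr_text, ground_truth):
    """Edits needed to turn the OCR text into the Ground-Truth, per character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)

# Hypothetical Ground-Truth line and a typical OCR misreading of it.
ground_truth = "Table of Contents"
ocr_text = "Tab1e of Contcnts"
cer = char_error_rate(ocr_text, ground_truth)
```

The lower the score, the closer the Robot's reading is to the human-crafted answer; a score of 0 means an exact match.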
Q: What’s the TOC Pattern Reference Library?
• It will be a Special Purpose Research
Collection at the Internet Archive to
be used to “teach Robots to read
magazines & newspapers.”
• Will include a TOC Image Dataset,
TOC Ground-Truth Solutions, & an Open
Source library of TOC-Spotting &
TOC-Reading software.
Welcome to the
TOC Pattern Reference Library
Yes, counsel, in answer to
your question allow me to
reference material from
the Library.
Q: How will Citizen Scientists help?
• “Volunpeers” are already generating
Ground-Truth data for the TOC Pattern
Reference Library through our
project on the Zooniverse
crowdsourcing platform.
• In addition to refining the workflow for
Ground-Truth data collection, this
project will develop Zooniverse data
export to PRImA’s Aletheia.
Q: What is Aletheia “FactMiners Ed.”?
• Aletheia is PRImA’s desktop & web
Ground-Truth Tool.
• Funding will allow PRImA to add
features to Aletheia to support
“whole issue” modeling in
Ground-Truth Solutions.
• We get a Power Tool for Citizen
Scientists who want to “dig deeper”
into Internet Archive newspaper &
magazine collections as pioneer
FactMiners!
We have a design to “tame” Text Soup and
unlock “facts” in archive data.
• We are confident that the applied research project
submitted as our Knight News Challenge entry will
make substantive contributions to the domain of Open
Data by helping to turn Text Soup into Smart Data in
newspaper & magazine archives.
• We hope you have enjoyed all four of our “silent Ignite Talk”
video slideshows. We welcome your comments, questions,
& (of course) “applause” at: https://goo.gl/99Vn5M
FactMiners & PRImA:
Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
https://goo.gl/99Vn5M
• Team
• Jim Salmons, FactMiners
• Timlynn Babitsky, FactMiners
• Apostolos Antonacopoulos, PRImA
• Christian Clausner, PRImA
This is the final slide for our
prior submission to the News Challenge
and not to be confused with our current
Prototype Fund entry…

More Related Content

Similar to "Big Picture" Backgrounder for Crowdsourcing Ground-Truth - FactMiners, PRImA, & eMOP Knight Prototype Fund Entry

FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart Data
FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart DataFactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart Data
FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart Data
Jim Salmons
 
Data Mining Newspapers Metadata
Data Mining Newspapers MetadataData Mining Newspapers Metadata
Data Mining Newspapers Metadata
Jean-Philippe Moreux
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data Science
Mark West
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
Vladimir Alexiev, PhD, PMP
 
JavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceJavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data Science
Mark West
 
Webinar: Metadata Enrichment in Publishing
Webinar: Metadata Enrichment in PublishingWebinar: Metadata Enrichment in Publishing
Webinar: Metadata Enrichment in Publishing
Ontotext
 
IoT as a metaphor!
IoT as a metaphor!IoT as a metaphor!
IoT as a metaphor!
PG Madhavan
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar
 
IoT and the pervasive nature of fast data and apache spark
IoT and the pervasive nature of fast data and apache sparkIoT and the pervasive nature of fast data and apache spark
IoT and the pervasive nature of fast data and apache spark
Stephen Dillon
 
IoT and the Pervasive Nature of Fast Data and Apache Spark
IoT and the Pervasive Nature of Fast Data and Apache SparkIoT and the Pervasive Nature of Fast Data and Apache Spark
IoT and the Pervasive Nature of Fast Data and Apache SparkStephen Dillon
 
Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...
Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...
Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...
Joy Palmer
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
Mark West
 
Big data
Big dataBig data
Big data
Prince Barai
 
Liferay and Big Data
Liferay and Big DataLiferay and Big Data
Liferay and Big Data
Miguel Pastor
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Spark Summit
 
Data Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher ZeitungData Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher Zeitung
René Pfitzner
 
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceGeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
Mark West
 
Data sciences and marketing analytics
Data sciences and marketing analyticsData sciences and marketing analytics
Data sciences and marketing analytics
MJ Xavier
 

Similar to "Big Picture" Backgrounder for Crowdsourcing Ground-Truth - FactMiners, PRImA, & eMOP Knight Prototype Fund Entry (20)

FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart Data
FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart DataFactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart Data
FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart Data
 
Data Mining Newspapers Metadata
Data Mining Newspapers MetadataData Mining Newspapers Metadata
Data Mining Newspapers Metadata
 
NDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data ScienceNDC Oslo : A Practical Introduction to Data Science
NDC Oslo : A Practical Introduction to Data Science
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
 
JavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data ScienceJavaZone 2018 - A Practical(ish) Introduction to Data Science
JavaZone 2018 - A Practical(ish) Introduction to Data Science
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
Webinar: Metadata Enrichment in Publishing
Webinar: Metadata Enrichment in PublishingWebinar: Metadata Enrichment in Publishing
Webinar: Metadata Enrichment in Publishing
 
IoT as a metaphor!
IoT as a metaphor!IoT as a metaphor!
IoT as a metaphor!
 
presentation
presentationpresentation
presentation
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
IoT and the pervasive nature of fast data and apache spark
IoT and the pervasive nature of fast data and apache sparkIoT and the pervasive nature of fast data and apache spark
IoT and the pervasive nature of fast data and apache spark
 
IoT and the Pervasive Nature of Fast Data and Apache Spark
IoT and the Pervasive Nature of Fast Data and Apache SparkIoT and the Pervasive Nature of Fast Data and Apache Spark
IoT and the Pervasive Nature of Fast Data and Apache Spark
 
Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...
Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...
Beyond Usage Stats (Or, demonstrating value & marketing services when you hav...
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
 
Big data
Big dataBig data
Big data
 
Liferay and Big Data
Liferay and Big DataLiferay and Big Data
Liferay and Big Data
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott Cordo
 
Data Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher ZeitungData Science & Data Products at Neue Zürcher Zeitung
Data Science & Data Products at Neue Zürcher Zeitung
 
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data ScienceGeeCon Prague 2018 - A Practical-ish Introduction to Data Science
GeeCon Prague 2018 - A Practical-ish Introduction to Data Science
 
Data sciences and marketing analytics
Data sciences and marketing analyticsData sciences and marketing analytics
Data sciences and marketing analytics
 

Recently uploaded

一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 

Recently uploaded (20)

一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 

"Big Picture" Backgrounder for Crowdsourcing Ground-Truth - FactMiners, PRImA, & eMOP Knight Prototype Fund Entry

  • 1. FactMiners, PRImA, & eMOP’s Knight Prototype Fund Entry Turn Text Soup into Smart Data in Newspaper & Magazine Archives” A PDF-format compilation of the four “silent Ignite Talk” slide-shows submitted in support of FactMiners & PRImA’s recent Knight News Challenge entry. “Big Picture” Backgrounder for Step 1. Crowdsourcing Ground-Truth
  • 2. FactMiners’ Prototype Fund entry is the 1st step in our “Turn Text Soup into Smart Data” R&D agenda • Originally submitted in support of an unfunded entry in the current Knight News Challenge, these four short slide- shows provide a “Big Picture” overview of our collaborative research and development agenda • Problem: Text Soup OCR’s “Dirty” Little Secret • Goal: Smart Data From “Readable” to “Computable” • Solution: Technology Machine Learning & Smart Data • Solution: People Crowdsourcing Ground-Truth FactMiners, PRImA, & eMOP: Knight Prototype Fund entry – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives: Step 1. Crowdsourcing Ground-Truth” Please @KnightFdn folk, don’t deprive my little boy of a good education. His Citizen Science teachers need to create training materials, tests, & answer sheets for his school at the Internet Archive. 011010..01.. INIT… Hello, world! Prototype Fund funding will help me serve you better…and this part is all about what will be done by this project.
  • 3. Problem: Text Soup OCR’s “Dirty” Little Secret FactMiners & PRImA’s Knight News Challenge Entry Turn Text Soup into Smart Data in Newspaper & Magazine Archives” 1 Part Recently submitted in support of our unfunded Knight News Challenge entry, this short slide-show describes a key element of our collaborative R&D agenda. Our current Prototype Fund submission is the first strategic step in pursuit of this mission.
  • 4. Q: What is “Text Soup”? • A: The uncorrected and usually hidden text “layer” that is generated by OCR (optical character recognition) during bulk scanning and digitization of historic and cultural heritage documents. FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives” Scanned Images or photos of pages!
  • 5. Q: How is “Text Soup” Used? • A: Primarily “behind the scenes” to support “full text” search. • Good for things like: • Show me the pages with the word “razor” on them in this book. • What books are about shaving? • What words are found in proximity to the word “strop”? Scanned Image of text! Hidden text layer…
  • 6. Q: What are Text Soup’s limits? • Automated OCR (text recognition) is a “one size fits all” process in the workflow of bulk scanning and digitization. • Good for basic books & monographs with simple document structure…
  • 7. Q: What are Text Soup’s limits? • Newspapers & magazines have complex document structures • Multiple articles, multiple authors, text continuations, advertisements, images, sidebars, text used as art in design, etc. • All this data is locked in our archives waiting to be “fact-mined”
  • 8. Q: What are Text Soup’s limits? • On these pages from Softalk magazine we have lots of “facts” in ads and a monthly column. • We can’t “locate” facts and assess their meaning based on the jumbled or missing info in the pages’ Text Soup. Complex document structures not identified!
  • 9. We have to “tame” Text Soup to unlock “facts” in archive data. • Our project will focus on recognizing complex document structure and on “fact-revealing” content modeling. • In the next slideshow, we describe our vision for “fact-mining” Smart Data from newspaper & magazine digital archives…
  • 10. Part 2. Goal: Smart Data: From “readable” to “computable.” FactMiners & PRImA’s Knight News Challenge Entry: “Turn Text Soup into Smart Data in Newspaper & Magazine Archives.”
  • 11. Q: What is Smart Data? • A: Smart Data is self-descriptive data that can “carry on a conversation” with Smart Programs to support access, editing, and visualization of the data itself. The “actual” data of the database. To access the “actual” data of the database, Smart Programs “talk” to an embedded “database about the database” (AKA a metamodel).
  • 12. Q: What does Smart Data look like? • A: Smart Data includes BOTH the complex document structure of the source AND the underlying conceptual model of the source content.
  • 13. Q: What can Smart Data do? • A: Turn expensive, time-consuming, labor-intensive research studies into “Just ask!” queries. • Good for things like: • How did local reporting of race relations impact public policy in Indiana in the 1950s? • Did advertising or editorial coverage account for the popularity of programs in the Softalk Bestseller lists?
  • 14. Q: How “smart” is our Smart Data design? • We spent a year researching museum informatics and prototyping Smart Data designs. • Our software architecture is based on CIDOC-CRM (Conceptual Reference Model for Museums) microservice workflows and PRESSoo, the ISSN.org metamodel for serial publications. Timeline (Winter 2013, Spring 2014, Fall 2014, Summer 2015), milestones in slide order: Neo4j GraphGist Challenge, a 1st place for Metamodel Subgraph domain model; semi-finals, Ashoka/LEGO “Re-imagine Learning” Challenge; #MW2014 FactMiners demo; introduced to #cidocCRM; Museum Computer Network Emerging Professional Scholarship; #MCN2014 paper & demo; “Massively Addressable Text” published in peer-reviewed CODE|WORDS; #HILT2015 Crowdsourcing Course; DPLA Community Reps; Internet Archive Content Partner; ICOM #cidocCRM SIG member; incorporate PRESSoo into design; begin PRImA collaboration.
  • 15. Q: How “open” is our Smart Data design? • Using a metamodel subgraph design pattern to embed and pass info about data and its access and transformation is technology neutral & future-proof. Without Smart Data: domain knowledge is written into task-specific programs that run against the database (10 Load X, 20 Print X, 30 Goto 10). With Smart Data: the metamodel is statically stored within the #TEI header section of source documents, which are std. text files (<teiHeader> <metamodel /> <structure /> <content />). Any “smart” DB: for dynamic Linked Open Data access, the DB need only have import & the ability to represent data structures read from the metamodel header. “Smart” program in any language: 10 Load metamodel, 20 Configure editors, 30 Do stuff…
  • 16. We have a design to “tame” Text Soup and unlock “facts” in archive data. • An innovative design combining international standards for conceptual modeling of museum collections (cidocCRM and PRESSoo) with a “self-descriptive” software/database design pattern provides the foundation for mining Smart Data from Text Soup. • In the next slideshow, we describe our design for the technology to “fact-mine” Smart Data from newspaper & magazine digital archives…
  • 17. Part 3. Solution: Technology: Machine Learning & Smart Data. FactMiners & PRImA’s Knight News Challenge Entry: “Turn Text Soup into Smart Data in Newspaper & Magazine Archives.”
  • 18. Q: Can Robots* read magazines? • Yes (mostly)…when looking at layout & text recognition within the individual page • No...in terms of recognizing the complex document structure of the whole issue • Our challenge is to move from individual page to whole-issue document structure recognition. * “Robot” = Software Agent (AKA, a computer program) From page… …to pages!
  • 19. Q: What’s the 1st Step & Where to do it? • We start by teaching Robot agents to find & understand the TOC (Table of Contents) and Advertiser Index pages of newspapers & magazines. • The best place to do this applied research is in the collections of the Internet Archive. Bring ‘em on! I can’t get enough TOC.
  • 20. Q: Why TOCs & Advertiser Indexes? • A: TOCs (Tables of Contents) & Advertiser Indexes reveal the complex document structure of newspapers & magazines. • Like a Sudoku puzzle, the TOC & Ad Index provide helpful “filled-in answers” about the types of content to be found within pages of the newspaper or magazine.
  • 21. Q: Why the Internet Archive? • Thousands of “Text Soup era” newspaper & magazine collections that can be enhanced through research. • The Archive’s Scanning Service flags TOC pages & generates a TOC-specific XML-encoded file during its standard digitization workflow. • The current Archive TOC OCR analysis does not “see” & understand the complex TOCs of magazines & newspapers.
  • 22. Q: What TOC Robots will we develop? • TOC-Spotter is an Image/Scene Recognition software agent to crawl the Archive in search of TOC & Ad Index pages. • TOC-Reader is a software agent extending PRImA recognition & evaluation technologies with Machine Learning capabilities to do “deep reading” with the assistance of the TOC Pattern Reference Library. Dot dot dot… Check! Number… Check! YO! Gotta TOC here! Great! Let me take a good look at it.
  • 23. Q: How will this help? • By running our TOC-Agents early in digitization workflows, we can make smarter within-page layout recognition decisions during bulk OCR of the issue’s subsequent pages. • We can generate “best guess” structure-revealing meta-tags in appropriate files as part of the standard Archive scanning workflow. Let’s see… Based on my notes, that’d be an ad, a feature article, another ad…and there’s the Ad Index!
  • 24. Q: What will be accomplished? • The structure-mapped text files generated by the TOC-Reader agent will be ready for FactMiners' Semantic tagging (AKA “fact-mining”) of the issue’s content. • These files will be compatible with PRImA’s Aletheia program for use in crowdsourced Ground-Truth development of the TOC Pattern Reference Library. Welcome to the TOC Pattern Reference Library
  • 25. We have a design to “tame” Text Soup and unlock “facts” in archive data. • Our immediate PRImA-inspired technology agenda is to develop “Robot” assistance (software agents) to find, recognize & deeply understand the TOCs (Table of Contents) and Advertiser Indexes of magazines in the Internet Archive magazine & newspaper collections. • In our last slideshow, we describe the people dimension of our strategy to “fact-mine” Smart Data from newspaper & magazine digital archives…
  • 26. Part 4. Solution: People: Crowdsourcing Ground-Truth. FactMiners & PRImA’s Knight News Challenge Entry: “Turn Text Soup into Smart Data in Newspaper & Magazine Archives.”
  • 27. Q: Why do we need people? • If all we had to do was write some smart “Robot” programs & simply put them to work, we wouldn’t need people. • But writing smart code is just the “birth” of a Machine-Learning Robot. • We have to teach our Robots how to read magazines & newspapers! 011010..01.. INIT… What am I? Who will teach me? What’s a magazine?
  • 28. Q: What is Ground-Truth? • Teaching means training: lessons, study materials, tests & their answer sheets, etc. • An “answer sheet” in OCR research is called a Ground-Truth solution – the human-crafted “perfect answer” to recognition of a scanned page. • To teach our Robots to read magazines, we’ll need a pile of TOC* Ground-Truth! * TOC being “Table of Contents.” See our 3rd Silent Ignite slideshow for more on TOCs & Technology. Yes, your Honor… That is EXACTLY what I saw and ONLY what I saw on the Table of Contents page shown to me as Exhibit A.
  • 29. Q: What’s the TOC Pattern Reference Library? • It will be a Special Purpose Research Collection at the Internet Archive to be used to “teach Robots to read magazines & newspapers.” • It will include a TOC Image Dataset, TOC Ground-Truth Solutions, & an Open Source library of TOC-Spotting & TOC-Reading software. Welcome to the TOC Pattern Reference Library. Yes, counsel, in answer to your question allow me to reference material from the Library.
  • 30. Q: How will Citizen Scientists help? • “Volunpeers” are already generating Ground-Truth data for the TOC Pattern Reference Library through our project on the Zooniverse crowdsourcing platform. • In addition to refining the workflow for Ground-Truth data collection, this project will develop Zooniverse data export to PRImA’s Aletheia.
  • 31. Q: What is Aletheia “FactMiners Ed.”? • Aletheia is PRImA’s desktop & web Ground-Truth Tool. • Funding will allow PRImA to add features to Aletheia to support “whole issue” modeling in Ground-Truth Solutions. • We get a Power Tool for Citizen Scientists who want to “dig deeper” into Internet Archive newspaper & magazine collections as pioneer FactMiners!
  • 32. We have a design to “tame” Text Soup and unlock “facts” in archive data. • We are confident that the applied research project submitted as our Knight News Challenge entry will make substantive contributions to the domain of Open Data by helping to turn Text Soup into Smart Data in newspaper & magazine archives. • We hope you have enjoyed all four of our “silent Ignite Talk” video slideshows. We welcome your comments, questions, & (of course) “applause” at: https://goo.gl/99Vn5M
  • 33. FactMiners & PRImA: Our Knight News Challenge Entry • “Turn Text Soup into Smart Data in Newspaper & Magazine Archives” https://goo.gl/99Vn5M • Team • Jim Salmons, FactMiners • Timlynn Babitsky, FactMiners • Apostolos Antonacopoulos, PRImA • Christian Clausner, PRImA This is the final slide for our prior submission to the News Challenge and not to be confused with our current Prototype Fund entry…
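
As a rough illustration of the metamodel pattern on slide 15 (“10 Load metamodel, 20 Configure editors, 30 Do stuff…”), the Python sketch below reads field declarations from a header and then uses them, rather than hard-coded domain knowledge, to interpret the records. It is only a sketch under stated assumptions: the slide names `<teiHeader>` and `<metamodel />` but gives no schema, so the `<field>` declarations, the attribute names (`of`, `name`, `datatype`), and the sample records are all hypothetical and are not the FactMiners design.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Slide 15 names only <teiHeader> and
# <metamodel />; the <field> declarations, attribute names, and record
# elements here are invented for illustration.
SAMPLE = """\
<TEI>
  <teiHeader>
    <metamodel>
      <field of="Article" name="title" datatype="string"/>
      <field of="Article" name="startPage" datatype="int"/>
      <field of="Advertisement" name="advertiser" datatype="string"/>
    </metamodel>
  </teiHeader>
  <text>
    <Article title="Example Feature" startPage="12"/>
    <Advertisement advertiser="Example Co."/>
  </text>
</TEI>
"""

# Map the metamodel's declared datatypes onto Python converters.
CASTS = {"string": str, "int": int}


def load_metamodel(root):
    """'10 Load metamodel': collect field declarations per record type."""
    schema = {}
    for field in root.findall("./teiHeader/metamodel/field"):
        converter = CASTS[field.get("datatype")]
        schema.setdefault(field.get("of"), {})[field.get("name")] = converter
    return schema


def read_records(root, schema):
    """'20 Configure' and '30 Do stuff': read data as the metamodel directs."""
    records = []
    for elem in root.find("text"):
        fields = schema.get(elem.tag)
        if fields is None:
            continue  # record type not described by the metamodel
        record = {"type": elem.tag}
        for name, converter in fields.items():
            if name in elem.attrib:
                record[name] = converter(elem.attrib[name])
        records.append(record)
    return records


root = ET.fromstring(SAMPLE)
schema = load_metamodel(root)
records = read_records(root, schema)
for record in records:
    print(record)
```

In this sketch, supporting a new record type or field means editing only the header of the document; the program itself stays unchanged, which is the point the slide makes about “smart” programs in any language.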