Problem: Text Soup
OCR’s “Dirty” Little Secret
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
A self-running video slideshow.
One slide every 15 seconds.
Pause as needed. 
Q: What is “Text Soup”?
• A: The uncorrected and
usually hidden text “layer”
that is generated by OCR
(optical character recognition)
during bulk scanning and
digitization of historic and
cultural heritage documents.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
Scanned Images
or photos of pages!
Q: How is “Text Soup” Used?
• A: Primarily “behind the
scenes” to support “full text”
search.
• Good for things like:
• Show me the pages with the
word “razor” on them in this
book.
• What books are about shaving?
• What words are found in
proximity to the word “strop”?
Scanned
Image of text!
Hidden text
layer…
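The kinds of queries listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pages` dictionary is invented sample text standing in for an OCR text layer, and the helper names `pages_with_word` and `words_near` are hypothetical, not part of any archive's actual search system.

```python
import re
from collections import Counter

# Hypothetical OCR text layer: page number -> uncorrected text (invented sample data)
pages = {
    1: "A good razor needs a good strop and a steady hand",
    2: "Shaving soap and brushes for the modern gentleman",
    3: "Keep the strop oiled and the razor dry",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def pages_with_word(pages, word):
    """Return the sorted page numbers whose text layer contains the word."""
    word = word.lower()
    return sorted(n for n, text in pages.items() if word in tokenize(text))

def words_near(pages, target, window=2):
    """Count the words found within `window` tokens of each occurrence of the target."""
    target = target.lower()
    counts = Counter()
    for text in pages.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - window)
                counts.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return counts

print(pages_with_word(pages, "razor"))
print(words_near(pages, "strop").most_common(3))
```

Queries like these only need the words and a rough position for each one, which is why an uncorrected text layer is often good enough for full-text search.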
Q: What are Text Soup’s limits?
• Automated OCR
(text recognition) is a
“one size fits all” process in
the workflow of bulk
scanning and digitization.
• Good for basic books &
monographs with simple
document structure…
Q: What are Text Soup’s limits?
• Newspapers & magazines have complex
document structures
• Multiple articles, multiple
authors, text continuations,
advertisements, images,
sidebars, text used as art
in design, etc.
• All this data is locked in
our archives waiting
to be “fact-mined”
Q: What are Text Soup’s limits?
• On these pages from Softalk magazine we have lots of
“facts” in ads and a monthly column
• We can’t “locate” facts
and assess their meaning
based on the jumbled or
missing info in their
Text Soup.
Complex
document
structures
not identified!
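One way text can end up jumbled when document structure is not identified can be shown with a toy example. This is a simplified sketch, not a description of how any particular OCR engine works: the two-column `rows` data is invented, and both function names are hypothetical.

```python
# Hypothetical two-column page (invented sample data):
# each row holds (line from the left column, line from the right column)
rows = [
    ("The editor reviews the", "FOR SALE: disk drive,"),
    ("newest word processor", "lightly used, $200."),
    ("for this month's issue.", "Call after 6 pm."),
]

def naive_reading_order(rows):
    """Read straight across the page, ignoring the column boundary."""
    return " ".join(f"{left} {right}" for left, right in rows)

def column_aware_reading_order(rows):
    """Read each column top to bottom, keeping the two regions separate."""
    left = " ".join(l for l, _ in rows)
    right = " ".join(r for _, r in rows)
    return [left, right]

print(naive_reading_order(rows))
for region in column_aware_reading_order(rows):
    print(region)
```

In the naive reading, the article and the advertisement are interleaved, so a "fact" such as the price can no longer be tied to the ad it came from. Once the two regions are identified, each one reads cleanly on its own.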
We have to “tame” Text Soup to unlock
“facts” in archive data.
• Our project will focus on recognizing complex
document structure and on “fact-revealing”
content modeling.
• In the next slideshow, we describe our vision for
“fact-mining” Smart Data from newspaper &
magazine digital archives…
FactMiners & PRImA:
Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in
Newspaper & Magazine Archives” -
https://goo.gl/99Vn5M
• Team
• Jim Salmons, FactMiners
• Timlynn Babitsky, FactMiners
• Apostolos Antonacopoulos, PRImA
• Christian Clausner, PRImA

FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup
