FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup


This is the first of four short "silent Ignite Talk" video slideshows that explain FactMiners and PRImA's entry in the Knight News Challenge. Up first, "text soup"... What is it? What can we do about it?


Problem: Text Soup
OCR’s “Dirty” Little Secret
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
A self-running video slideshow.
One slide every 15 seconds.
Pause as needed. 

Q: What is “Text Soup”?
• A: The uncorrected and
usually hidden text “layer”
that is generated by OCR
(optical character recognition)
during bulk scanning and
digitization of historic and
cultural heritage documents.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
Scanned Images
or photos of pages!

Q: How is “Text Soup” Used?
• A: Primarily “behind the
scenes” to support “full text”
search.
• Good for things like:
• Show me the pages with the
word “razor” on them in this
book.
• What books are about shaving?
• What words are found in
proximity to the word “strop”?
Scanned
Image of text!
Hidden text
layer…
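The three kinds of query above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular archive implements search: the `pages` sample text and the helper names (`tokenize`, `build_index`, `words_near`) are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical OCR text layer: page number -> uncorrected text.
pages = {
    1: "A good razor needs a leather strop to keep its edge",
    2: "Shaving soap and brush sold here",
    3: "Strop the razor before each shave",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map each word to the set of page numbers it appears on."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_no)
    return index

def words_near(pages, target, window=2):
    """Collect words found within `window` tokens of `target` on any page."""
    nearby = set()
    for text in pages.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                nearby.update(t for t in tokens[lo:hi] if t != target)
    return nearby

index = build_index(pages)
print(sorted(index["razor"]))               # pages containing "razor"
print(sorted(words_near(pages, "strop")))   # words close to "strop"
```

Note that this kind of search only needs to know which words occur where. It never needs to know which article, ad, or column a word belongs to.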

Q: What are Text Soup’s limits?
• Automated OCR
(text recognition) is a
“one size fits all” process in
the workflow of bulk
scanning and digitization.
• Good for basic books &
monographs with simple
document structure…

Q: What are Text Soup’s limits?
• Newspapers & magazines have complex
document structures
• Multiple articles, multiple
authors, text continuations,
advertisements, images,
sidebars, text used as art
in design, etc.
• All this data is locked in
our archives waiting
to be “fact-mined”

Q: What are Text Soup’s limits?
• These pages from Softalk magazine hold lots of
“facts” in ads and a monthly column.
• We can’t “locate” those facts
or assess their meaning
from the jumbled or
missing info in the pages’
Text Soup.
Complex
document
structures
not identified!
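One way such jumbling can arise is when text lines are read straight across a multi-column page. The sketch below is a toy illustration under stated assumptions: the `lines` data, the coordinates, and the fixed `column_width` are all hypothetical, and real layout analysis is far more involved than a sort.

```python
# Hypothetical OCR output: (x, y, text) for each recognized line on a
# two-column page. Column 1 is an article; column 2 is an advertisement.
lines = [
    (0, 0, "New disk drive"),
    (100, 0, "SALE: joysticks"),
    (0, 10, "ships this fall"),
    (100, 10, "only $29.95"),
]

def naive_order(lines):
    """Read strictly top to bottom, left to right, ignoring columns."""
    ordered = sorted(lines, key=lambda line: (line[1], line[0]))
    return [text for _, _, text in ordered]

def column_order(lines, column_width=100):
    """Read each column top to bottom before moving to the next column."""
    ordered = sorted(lines, key=lambda line: (line[0] // column_width, line[1]))
    return [text for _, _, text in ordered]

print(" ".join(naive_order(lines)))   # article and ad interleaved
print(" ".join(column_order(lines)))  # article first, then the ad
```

Even the column-aware ordering only keeps the two blocks apart. It still does not say which block is an article and which is an ad, which is the structural information the slides describe as missing.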

We have to “tame” Text Soup to unlock
“facts” in archive data.
• Our project will focus on recognizing complex
document structure and on “fact-revealing”
content modeling.
• In the next slideshow, we describe our vision for
“fact-mining” Smart Data from newspaper &
magazine digital archives…

FactMiners & PRImA:
Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”:
https://goo.gl/99Vn5M
• Team
• Jim Salmons, FactMiners
• Timlynn Babitsky, FactMiners
• Apostolos Antonacopoulos, PRImA
• Christian Clausner, PRImA

