This is the first of four short "silent Ignite Talk" video slideshows that explain FactMiners and PRImA's entry in the Knight News Challenge. Up first, "text soup"... What is it? What can we do about it?
FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup
1. Problem: Text Soup
OCR’s “Dirty” Little Secret
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
A self-running video slideshow.
One slide every 15 seconds.
Pause as needed.
2. Q: What is “Text Soup”?
• A: The uncorrected and
usually hidden text “layer”
that is generated by OCR
(optical character recognition)
during bulk scanning and
digitization of historic and
cultural heritage documents.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
Scanned Images
or photos of pages!
3. Q: How is “Text Soup” Used?
• A: Primarily “behind the
scenes” to support “full text”
search.
• Good for things like:
• Show me the pages with the
word “razor” on them in this
book.
• What books are about shaving?
• What words are found in
proximity to the word “strop”?
Scanned
Image of text!
Hidden text
layer…
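The two word-level queries on this slide can be sketched in a few lines of Python. This is only an illustration, not how any particular archive implements search: the `pages` dictionary is made-up sample data standing in for an OCR text layer, and the helper names `pages_with_word` and `words_near` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical OCR text layer: page number -> uncorrected text (made-up sample data).
pages = {
    1: "A good razor needs a leather strop and a steady hand.",
    2: "Shaving soap lathers best with a badger brush.",
    3: "Hang the strop on a hook, then draw the razor along it.",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pages_with_word(pages, word):
    """Return the sorted page numbers whose text layer contains the word."""
    word = word.lower()
    return sorted(n for n, text in pages.items() if word in tokenize(text))

def words_near(pages, target, window=2):
    """Count the words found within `window` tokens of each hit on `target`."""
    target = target.lower()
    counts = Counter()
    for text in pages.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                start = max(0, i - window)
                counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

print(pages_with_word(pages, "razor"))
print(words_near(pages, "strop").most_common(3))
```

Both queries only need a bag of words per page, which is why Text Soup is good enough for them even when the text is uncorrected.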
4. Q: What are Text Soup’s limits?
• Automated OCR
(text recognition) is a
“one size fits all” process in
the workflow of bulk
scanning and digitization.
• Good for basic books &
monographs with simple
document structure…
5. Q: What are Text Soup’s limits?
• Newspapers & magazines have complex
document structures
• Multiple articles, multiple
authors, text continuations,
advertisements, images,
sidebars, text used as art
in design, etc.
• All this data is locked in
our archives waiting
to be “fact-mined”
6. Q: What are Text Soup’s limits?
• On these pages from Softalk magazine we have lots of
“facts” in ads and a monthly column
• We can’t “locate” facts
or assess their meaning
from the jumbled or
missing info in these
pages’ Text Soup.
Complex
document
structures
not identified!
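One way to see why unidentified structure produces jumbled text is a toy two-column page. The sketch below is an assumption-laden illustration, not the project's method: the `words` list is invented data (x position, y position, text), the gutter position is assumed to be known, and the function names are hypothetical.

```python
# Hypothetical word boxes from a two-column page: (x, y, text).
# Left column is an ad, right column is an article (made-up sample data).
words = [
    (0, 0, "Razor"), (1, 0, "sale"), (10, 0, "Strop"), (11, 0, "care"),
    (0, 1, "ends"), (1, 1, "Friday"), (10, 1, "takes"), (11, 1, "patience"),
]

def naive_order(words):
    """Read straight across the page, row by row, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    return " ".join(text for _, _, text in ordered)

def column_order(words, gutter_x):
    """Split at an assumed, already-known gutter, then read each column."""
    left = [w for w in words if w[0] < gutter_x]
    right = [w for w in words if w[0] >= gutter_x]
    return [naive_order(left), naive_order(right)]

print(naive_order(words))
print(column_order(words, gutter_x=5))
```

Reading across the rows interleaves the ad with the article, while the column-aware pass keeps each block intact. Real newspaper and magazine pages are far harder, since the layout itself has to be recognized first.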
7. We have to “tame” Text Soup to unlock
“facts” in archive data.
• Our project will focus on recognizing complex
document structure and on “fact-revealing”
content modeling.
• In the next slideshow, we describe our vision for
“fact-mining” Smart Data from newspaper &
magazine digital archives…
8. FactMiners & PRImA:
Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in
Newspaper & Magazine Archives” -
https://goo.gl/99Vn5M
• Team
• Jim Salmons, FactMiners
• Timlynn Babitsky, FactMiners
• Apostolos Antonacopoulos, PRImA
• Christian Clausner, PRImA