This is a compilation of the four short "silent Ignite Talk" slideshows submitted with our recent unfunded Knight News Challenge entry. Our current Knight Prototype Fund entry focuses on the "People" (Part 4) component of our larger applied research agenda.
Notes and Letters of Support for Crowdsourcing Ground Truth - FactMiners, PRI...Jim Salmons
This is not really a slideshow but rather a source document that is a PDF containing a one-page set of comments about our Knight Prototype Fund project proposal, including letters of support from our collaborative partners.
FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Goal: Smart Data (Jim Salmons)
This is the second of four short "silent Ignite Talk" video slideshows that explain FactMiners and PRImA's entry in the Knight News Challenge. Our Goal: Smart Data. What is it? What's it look like? What can you do with it? etc.
"Big Picture" Backgrounder for Crowdsourcing Ground-Truth - FactMiners, PRImA, & eMOP Knight Prototype Fund Entry
1. “Big Picture” Backgrounder for
FactMiners, PRImA, & eMOP’s
Knight Prototype Fund Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives:
Step 1. Crowdsourcing Ground-Truth”
A PDF-format compilation of the four “silent Ignite Talk”
slide-shows submitted in support of FactMiners &
PRImA’s recent Knight News Challenge entry.
2. FactMiners’ Prototype Fund entry is the 1st step in
our “Turn Text Soup into Smart Data” R&D agenda
• Originally submitted in support of an unfunded entry in the
current Knight News Challenge, these four short slide-
shows provide a “Big Picture” overview of our collaborative
research and development agenda
• Problem: Text Soup
OCR’s “Dirty” Little Secret
• Goal: Smart Data
From “Readable” to “Computable”
• Solution: Technology
Machine Learning & Smart Data
• Solution: People
Crowdsourcing Ground-Truth
FactMiners, PRImA, & eMOP: Knight Prototype Fund entry – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives:
Step 1. Crowdsourcing Ground-Truth”
Please @KnightFdn folk, don’t deprive
my little boy of a good education. His
Citizen Science teachers need to
create training materials, tests, &
answer sheets for his school at the
Internet Archive.
011010..01.. INIT… Hello, world!
Prototype Fund funding will help me
serve you better…and this part is all
about what will be done by this project.
3. Problem: Text Soup
OCR’s “Dirty” Little Secret
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
Part 1
Recently submitted in support of our unfunded Knight News
Challenge entry, this short slide-show describes a key element of
our collaborative R&D agenda. Our current Prototype Fund
submission is the first strategic step in pursuit of this mission.
4. Q: What is “Text Soup”?
• A: The uncorrected and
usually hidden text “layer”
that is generated by OCR
(optical character recognition)
during bulk scanning and
digitization of historic and
cultural heritage documents.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
Scanned Images
or photos of pages!
5. Q: How is “Text Soup” Used?
• A: Primarily “behind the
scenes” to support “full text”
search.
• Good for things like:
• Show me the pages with the
word “razor” on them in this
book.
• What books are about shaving?
• What words are found in
proximity to the word “strop” ?
Scanned
Image of text!
Hidden text
layer…
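The kinds of queries listed on this slide can be sketched in a few lines. This is a minimal illustration, assuming the hidden OCR text layer is available as plain text keyed by page number; the sample pages and the function names `build_index` and `words_near` are hypothetical and not part of the project.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split OCR text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(pages):
    """Map each word to the set of page numbers it appears on."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_no)
    return index

def words_near(pages, target, window=3):
    """Collect words found within `window` tokens of `target` on any page."""
    neighbours = set()
    for text in pages.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                neighbours.update(tokens[lo:i] + tokens[i + 1:hi])
    return neighbours

# Hypothetical text layer for three scanned pages.
pages = {
    12: "Sharpen the razor on a leather strop before shaving",
    13: "A good strop keeps the blade keen",
    40: "Advertisements for soap and brushes",
}

index = build_index(pages)
print(sorted(index["razor"]))                       # pages containing "razor"
print(sorted(words_near(pages, "strop", window=2)))  # words close to "strop"
```

Nothing in this index knows which words belong to an article, an ad, or a sidebar, which is the limitation the following slides describe.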
6. Q: What are Text Soup’s limits?
• Automated OCR
(text recognition) is a
“one size fits all” process in
the workflow of bulk
scanning and digitization.
• Good for basic books &
monographs with simple
document structure…
7. Q: What are Text Soup’s limits?
• Newspapers & magazines have complex
document structures
• Multiple articles, multiple
authors, text continuations,
advertisements, images,
sidebars, text used as art
in design, etc.
• All this data is locked in
our archives waiting
to be “fact-mined”
8. Q: What are Text Soup’s limits?
• On these pages from Softalk magazine we have lots of
“facts” in ads and a monthly column
• We can’t “locate” facts
and assess their meaning
based on the jumbled or
missing info in its
Text Soup.
Complex
document
structures
not identified!
9. We have to “tame” Text Soup to unlock
“facts” in archive data.
• Our project will focus on recognizing complex
document structure and on “fact-revealing”
content modeling.
• In the next slideshow, we describe our vision for
“fact-mining” Smart Data from newspaper &
magazine digital archives…
10. Goal: Smart Data
From “readable” to “computable”
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
Part 2
Recently submitted in support of our unfunded Knight News
Challenge entry, this short slide-show describes a key element of
our collaborative R&D agenda. Our current Prototype Fund
submission is the first strategic step in pursuit of this mission.
11. Q: What is Smart Data?
• A: Smart Data is self-descriptive
data that can “carry on a conversation”
with Smart Programs to support
access, editing, and visualization of
the data itself.
The “actual” data of the database
To access the “actual” data of the database,
Smart Programs “talk” to an embedded
“database about the database” (AKA a metamodel )
12. Q: What does Smart Data look like?
• A: Smart Data includes BOTH the
complex document structure
of the source AND the underlying
conceptual model of the source
content.
13. Q: What can Smart Data do?
• A: Turn expensive, time-
consuming, labor-intensive
research studies into “Just ask!”
queries
• Good for things like:
• How did local reporting of race
relations impact public policy in
Indiana in the 1950s?
• Did advertising or editorial
coverage account for the
popularity of programs in the
Softalk Bestseller lists?
14. Q: How “smart” is our Smart Data design?
• We spent a year researching
museum informatics and
prototyping Smart Data designs.
• Our software architecture is based
on CIDOC-CRM (Conceptual
Reference Model for Museums),
microservice workflows, and
PRESSoo, the ISSN.org
metamodel for serial publications.
Winter, 2013
Spring, 2014
Fall, 2014
Summer, 2015
Neo4j GraphGist Challenge,
a 1st place for Metamodel
Subgraph domain model
Semi-finals Ashoka/LEGO
“Re-imagine Learning” Challenge.
#MW2014 FactMiners demo.
Introduced to #cidocCRM.
Museum Computer Network
Emerging Professional Scholarship.
#MCN2014 paper & demo.
“Massively Addressable Text” published
in peer-reviewed CODE|WORDS.
#HILT2015 Crowdsourcing Course
DPLA Community Reps.
Internet Archive Content Partner.
ICOM #cidocCRM SIG member.
Incorporate PRESSoo into design.
Begin PRImA Collaboration.
15. Q: How “open” is our Smart Data design?
• Using a metamodel
subgraph design
pattern to embed and pass
info about data and its access
and transformation is
technology neutral &
future-proof.
Without Smart Data
With Smart Data
Database
10 Load X
20 Print X
30 Goto 10
Domain knowledge written
into task-specific programs
Metamodel statically stored
within #TEI header section of
source documents std. text files
<teiHeader>
<metamodel />
<structure />
<content />
Any “smart” DB
For dynamic Linked Open Data access,
DB need only have import &
ability to represent data structures
read from metamodel header.
10 Load metamodel
20 Configure editors
30 Do stuff…
“Smart” program in
any language
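The "load metamodel, configure editors, do stuff" idea in this diagram can be sketched as follows. This is a hedged illustration only: the dictionary layout, the type names, and the functions `describe` and `validate` are assumptions made for the example, not the project's actual metamodel subgraph, CIDOC-CRM, or PRESSoo structures.

```python
# A hypothetical "smart" dataset: the data travels with a description of itself.
dataset = {
    "metamodel": {
        "Article": {"title": "string", "start_page": "integer"},
        "Advertisement": {"advertiser": "string", "page": "integer"},
    },
    "records": [
        {"type": "Article", "title": "Example Feature", "start_page": 4},
        {"type": "Advertisement", "advertiser": "Example Co.", "page": 7},
    ],
}

# Mapping from metamodel type names to Python types (an assumption of this sketch).
TYPE_CHECKS = {"string": str, "integer": int}

def describe(dataset):
    """Step 1: load the metamodel and report which record types and fields exist."""
    return {name: sorted(fields) for name, fields in dataset["metamodel"].items()}

def validate(dataset):
    """Step 2: use the metamodel, not hard-coded domain knowledge, to check records."""
    problems = []
    for i, record in enumerate(dataset["records"]):
        fields = dataset["metamodel"].get(record.get("type"))
        if fields is None:
            problems.append(f"record {i}: unknown type {record.get('type')!r}")
            continue
        for field, type_name in fields.items():
            if not isinstance(record.get(field), TYPE_CHECKS[type_name]):
                problems.append(f"record {i}: field {field!r} should be {type_name}")
    return problems

print(describe(dataset))
print(validate(dataset))
```

The program contains no knowledge of articles or advertisements; adding a new record type to the metamodel is enough for the same code to describe and check it.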
16. We have a design to “tame” Text Soup and
unlock “facts” in archive data.
• An innovative design combining international standards
for conceptual modeling of museum collections
(cidocCRM and PRESSoo) together with a “self-
descriptive” software/database design pattern provides the
foundation for mining Smart Data from Text Soup.
• In the next slideshow, we describe our design for the
technology to “fact-mine” Smart Data from
newspaper & magazine digital archives…
17. Solution: Technology
Machine Learning & Smart Data
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
Part 3
Recently submitted in support of our unfunded Knight News
Challenge entry, this short slide-show describes a key element of
our collaborative R&D agenda. Our current Prototype Fund
submission is the first strategic step in pursuit of this mission.
18. Q: Can Robots* read magazines?
• Yes (mostly)…when looking at
layout & text recognition within
the individual page
• No...in terms of recognizing the
complex document structure of the
whole issue
• Our challenge is to move from
individual page to whole-issue
document structure recognition.
* “Robot” = Software Agent
(AKA, a computer program)
From page…
…to pages!
19. Q: What’s the 1st Step & Where to do it?
• We start by teaching Robot
agents to find & understand
the TOC (Table of Contents)
and Advertiser Index pages
of newspapers & magazines.
• The best place to do this
applied research is in the
collections of the Internet
Archive.
Bring ‘em on!
I can’t get
enough TOC.
20. Q: Why TOCs & Advertiser Indexes?
• A: TOCs (Table of Contents) &
Advertiser Indexes reveal the
complex document structure of
newspapers & magazines.
• Like a Sudoku puzzle, the TOC &
Ad Index provide helpful “filled-in
answers” about the types of
content to be found within pages
of the newspaper or magazine.
21. Q: Why the Internet Archive?
• Thousands of “Text Soup era”
newspaper & magazine collections that
can be enhanced through research.
• The Archive’s Scanning Service flags
TOC pages & generates a TOC-
specific XML-encoded file during
its standard digitization workflow.
• The current Archive TOC OCR
analysis does not “see” &
understand the complex TOCs of
magazines & newspapers.
22. Q: What TOC Robots will we develop?
• TOC-Spotter is an Image/Scene
Recognition software agent to crawl
the Archive in search of TOC &
Ad Index pages.
• TOC-Reader is a software agent
extending PRImA recognition &
evaluation technologies with
Machine Learning capabilities to do
“deep reading” with the assistance of the
TOC Pattern Reference Library.
Dot dot dot… Check!
Number… Check!
YO! Gotta TOC here!
Great! Let me take
a good look at it.
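The "Dot dot dot… Check! Number… Check!" cue can be illustrated with a simple text heuristic. Note that the slide describes TOC-Spotter as an image/scene recognition agent; the sketch below swaps in a much simpler stand-in that works on OCR text lines, and the regular expression, the 0.5 threshold, and the sample page are all assumptions.

```python
import re

# A dot leader (three or more dots, optionally spaced) followed by a page
# number at the end of the line, e.g. "Some Article .......... 12".
TOC_LINE = re.compile(r"(?:\.\s?){3,}\s*\d{1,4}\s*$")

def toc_score(page_text):
    """Return the fraction of non-empty lines that look like TOC entries."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if TOC_LINE.search(line))
    return hits / len(lines)

def looks_like_toc(page_text, threshold=0.5):
    """Flag a page as a TOC candidate when enough lines match the pattern."""
    return toc_score(page_text) >= threshold

# Hypothetical OCR output for two pages.
toc_page = (
    "Features\n"
    "The First Article .......... 12\n"
    "Another Article . . . . . 30\n"
    "Reviews Column ...... 45"
)
prose_page = "This is ordinary body text.\nIt ends in 1984."

print(toc_score(toc_page), looks_like_toc(toc_page))
print(toc_score(prose_page), looks_like_toc(prose_page))
```

A text-only rule like this would miss the design-heavy TOCs common in magazines, which is one reason a trained recognizer backed by Ground-Truth examples is proposed instead.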
23. Q: How will this help?
• By running our TOC-Agents early in
digitization workflows, we can make
smarter within-page layout
recognition decisions during bulk
OCR of the issue’s subsequent pages.
• We can generate “best guess”
structure-revealing meta-tags in
appropriate files as part of the
standard Archive scanning workflow.
Let’s see… Based on my notes,
that’d be an ad, a feature article,
another ad…and there’s the
Ad Index!
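The "based on my notes" step can be sketched as turning TOC entries into a best-guess label for every page. This is a simplified illustration, assuming each entry is a (start page, content type) pair and that an item runs until the next one starts; the function name `label_pages` and the sample entries are hypothetical.

```python
def label_pages(toc_entries, last_page):
    """Assign a best-guess content type to each page from TOC start pages.

    toc_entries: list of (start_page, content_type) pairs.
    last_page:   final page number of the issue.
    Pages before the first entry are left unlabelled.
    """
    entries = sorted(toc_entries)
    labels = {}
    for i, (start, kind) in enumerate(entries):
        # Each item is assumed to run until the page before the next item.
        end = entries[i + 1][0] - 1 if i + 1 < len(entries) else last_page
        for page in range(start, end + 1):
            labels[page] = kind
    return labels

# Hypothetical entries read from a TOC and Advertiser Index.
toc_entries = [(3, "ad"), (4, "feature article"), (9, "ad"), (10, "ad index")]
labels = label_pages(toc_entries, last_page=10)
for page in sorted(labels):
    print(page, labels[page])
```

Real issues interleave ads with articles and continue text across non-adjacent pages, so labels like these are only hints to guide within-page layout recognition, not final answers.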
24. Q: What will be accomplished?
• The structure-mapped text files
generated by the TOC-Reader
agent will be ready for FactMiners'
Semantic tagging (AKA “fact-
mining”) of the issue’s content.
• These files will be compatible with
PRImA’s Aletheia program for use
in crowdsourced Ground-Truth
development of the TOC Pattern
Reference Library.
Welcome to the
TOC Pattern Reference Library
25. We have a design to “tame” Text Soup and
unlock “facts” in archive data.
• Our immediate PRImA-inspired technology agenda is to
develop “Robot” assistance (software agents) to find,
recognize & deeply understand the TOCs (Table of
Contents) and Advertiser Indexes of magazines in the
Internet Archive magazine & newspaper collections.
• In our last slideshow, we describe the people dimension
of our strategy to “fact-mine” Smart Data from
newspaper & magazine digital archives…
26. Solution: People
Crowdsourcing Ground-Truth
FactMiners & PRImA’s
Knight News Challenge Entry
“Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
Part 4
Recently submitted in support of our unfunded Knight News
Challenge entry, this short slide-show describes a key element of
our collaborative R&D agenda. Our current Prototype Fund
submission is the first strategic step in pursuit of this mission.
27. Q: Why do we need people?
• If all we had to do was write some
smart “Robot” programs & simply put
them to work, we wouldn’t need people.
• But writing smart code is just the “birth”
of a Machine-Learning Robot.
• We have to teach our Robots how to
read magazines & newspapers!
011010..01.. INIT…
What am I? Who will
teach me? What’s a
magazine?
28. Q: What is Ground-Truth?
• Teaching means training: lessons, study
materials, tests & their answer sheets, etc.
• An “answer sheet” in OCR research is
called a Ground-Truth solution – the
human-crafted “perfect answer” to
recognition of a scanned page.
• To teach our Robots to read magazines,
we’ll need a pile of TOC* Ground-Truth!
*TOC being “Table of Contents”
See our 3rd Silent Ignite slideshow
for more on TOCs & Technology
Yes, your Honor…
That is EXACTLY
what I saw and ONLY
what I saw on the
Table of Contents
page shown to me
as Exhibit A.
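How a Ground-Truth "answer sheet" is used to grade a recognizer can be sketched with a character error rate. This is a generic illustration, assuming plain-text OCR output and Ground-Truth for the same line; it is not PRImA's evaluation method, and the sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def character_error_rate(ocr_text, ground_truth):
    """Edits needed to fix the OCR output, relative to the Ground-Truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)

# Hypothetical OCR output graded against its human-made answer sheet.
ground_truth = "Table of Contents"
ocr_output = "Tab1e of Contents"
print(character_error_rate(ocr_output, ground_truth))
```

A score of 0.0 means the recognizer reproduced the answer sheet exactly; a large collection of such graded examples is what lets a Machine Learning agent be trained and tested.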
29. Q: What’s the TOC Pattern Reference Library?
• It will be a Special Purpose Research
Collection at the Internet Archive to
be used to “teach Robots to read
magazines & newspapers.”
• It will include a TOC Image Dataset,
TOC Ground-Truth Solutions, & an Open
Source library of TOC-Spotting &
TOC-Reading software.
Welcome to the
TOC Pattern Reference Library
Yes, counsel, in answer to
your question allow me to
reference material from
the Library.
30. Q: How will Citizen Scientists help?
• “Volunpeers” are already generating
Ground-Truth data for the TOC Pattern
Reference Library through our
project on the Zooniverse
crowdsourcing platform.
• In addition to refining the workflow for
Ground-Truth data collection, this
project will develop Zooniverse data
export to PRImA’s Aletheia.
31. Q: What is Aletheia “FactMiners Ed.”?
• Aletheia is PRImA’s desktop & web
Ground-Truth Tool.
• Funding will allow PRImA to add
features to Aletheia to support
“whole issue” modeling in
Ground-Truth Solutions.
• We get a Power Tool for Citizen
Scientists who want to “dig deeper”
into Internet Archive newspaper &
magazine collections as pioneer
FactMiners!
32. We have a design to “tame” Text Soup and
unlock “facts” in archive data.
• We are confident that the applied research project
submitted as our Knight News Challenge entry will
make substantive contributions to the domain of Open
Data by helping to turn Text Soup into Smart Data in
newspaper & magazine archives.
• We hope you have enjoyed all four of our “silent Ignite Talk”
video slideshows. We welcome your comments, questions,
& (of course) “applause” at: https://goo.gl/99Vn5M
33. FactMiners & PRImA:
Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in
Newspaper & Magazine Archives”
https://goo.gl/99Vn5M
• Team
• Jim Salmons, FactMiners
• Timlynn Babitsky, FactMiners
• Apostolos Antonacopoulos, PRImA
• Christian Clausner, PRImA
This is the final slide for our
prior submission to the News Challenge
and not to be confused with our current
Prototype Fund entry…