LANCASTER, FEBRUARY 7TH
Dr. Juan Miguel Cejuela
PhD in Computer Science
Focus of his research has been on Text Mining and Machine Learning
Munich, Germany
Jorge Campos
MSc Computer Engineering
Focus on product development and frontend
Gdansk, Poland
@tagtog_net
Source: Ex Machina; 2015; director&writer: Alex Garland; production: Universal Pictures International (UPI), Film4, DNA Films
AI → Machine Learning → Deep Learning
Tokens: a, garage, has, charged, me, £, 541, …
isEnglishWord(MOT) = no
hasMixedCase(eBay) = yes
Token ! ? :(, :pouting_face:
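The token and feature examples above can be sketched in a few lines of Python. This is only an illustration: the sample sentence, the tiny `ENGLISH_WORDS` lexicon, and the function names are assumptions made for the example, and "mixed case" is taken here to mean lowercase letters plus an uppercase letter after the first character.

```python
import re

# Hypothetical mini-lexicon; a real system would load a full word list.
ENGLISH_WORDS = {"a", "garage", "has", "charged", "me", "for", "an"}

def tokenize(text):
    """Split text into word/number tokens and single punctuation or symbol tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def is_english_word(token):
    """Look the lowercased token up in the (assumed) lexicon."""
    return token.lower() in ENGLISH_WORDS

def has_mixed_case(token):
    """True when a token has lowercase letters plus an uppercase letter after position 0."""
    return any(c.islower() for c in token) and any(c.isupper() for c in token[1:])

# Assumed sample sentence built around the tokens shown on the slide.
tokens = tokenize("A garage has charged me £541 for an MOT")
features = {t: (is_english_word(t), has_mixed_case(t)) for t in tokens}
```

Features like these are what a classical model receives as input instead of the raw characters.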
A machine learning model is an automatic algorithm that
maps inputs to outputs, for which it is optimized (learned).
• Supervised Learning
• Semi-Supervised Learning
• Unsupervised Learning
• Online Learning
• Active Learning
Unstructured data is text-heavy
Social Media, Voice, PDFs,
scientific articles, reports, etc.
IDC and EMC: Data will grow to 40 ZB by 2020
Unstructured data grows faster than structured data
Techniques to turn (large amounts of) unstructured text
that is understandable by humans, i.e. natural language,
into unambiguous, structured knowledge.
https://www.youtube.com/watch?v=_eewlRCfewQ&feature=youtu.be&t=560
https://www.bbc.co.uk/programmes/p06jt7j4
Ambiguity
(AI-complete problem)
Pipeline of Tasks
(Errors compound)
… → … → … → …
Language Detection &
Topic Modeling
[Ms. Smith didn’t wait; she bound her dress and left.]
Sentence segmentation
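A minimal sketch of why sentence segmentation is harder than splitting on periods: the "Ms." in the example must not end a sentence. The abbreviation list and the function name below are assumptions for illustration, the second sentence of the sample text is borrowed from the coreference example, and production segmenters use far richer rules or trained models.

```python
import re

# Assumed abbreviation list; extend it for the corpus at hand.
ABBREVIATIONS = {"Ms.", "Mr.", "Mrs.", "Dr.", "e.g.", "i.e."}

def split_sentences(text):
    """Split after ., ! or ? followed by whitespace, unless the period ends a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to the abbreviation, not a sentence end
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "Ms. Smith didn't wait; she bound her dress and left. London is the capital of England."
sentences = split_sentences(text)
```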
[Ms.] [Smith] [didn’t] [wait] [;] …
Word segmentation &
Tokenization
Ms. → PROPN
Smith → PROPN
did → VERB
n’t → ADV
…
Part-of-Speech (POS) tagging
Stemming & Lemmatization
Ms. Smith didn’t wait; she bound her dress and left.
→ bind? bound?
Spelling Corrector
Parsing
(Constituency, Shallow, Dependency)
Abbreviation Expansion
Coreference Resolution
London is the capital of England. It was founded
by the Romans, who named it Londinium.
• London is the capital of England
• London was founded by the Romans
• London was named [by the Romans] Londinium
UTF-8, Special Characters
Córdoba → C□rdoba
Ü õ Å ñ ö Œ Ô è í â
가-힣 русский язык বাংলা !ह#द% ελληνικά ‫ا‬‫ﻟ‬‫ﻌ‬َ‫ر‬َ‫ﺑ‬ِ‫ﯾ‬‫ﱠ‬‫ﺔ‬
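One way the "Córdoba → C□rdoba" damage can arise is a mismatch between the encoding used to write a file and the one used to read it. The sketch below assumes a Latin-1 writer and a UTF-8 reader, which is only one of several possible causes. It also shows that the same visible string can exist in two Unicode normalization forms, which matters when annotation offsets are counted in characters.

```python
import unicodedata

name = "C\u00f3rdoba"  # "Córdoba" with a precomposed ó

# Bytes written as Latin-1 but read back as UTF-8: the ó byte is invalid UTF-8
# and is replaced by U+FFFD, the replacement character.
garbled = name.encode("latin-1").decode("utf-8", errors="replace")

# The same visible text can also be stored in two Unicode forms.
composed = unicodedata.normalize("NFC", name)    # ó as one code point
decomposed = unicodedata.normalize("NFD", name)  # o followed by a combining accent

lengths = (len(composed), len(decomposed))
```

Normalizing all documents to one form (for example NFC) before annotation keeps character offsets stable across tools.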
Many file formats
• Annotation Type? NER? RE? Doc. Classification?
• How many annotators?
• How many documents?
• Which documents?
• Training & Testing?
• Time frame?
• Costs?
• …?
• Documents' source?
• Representative sample?
• Biases?
• Have several rounds to train the annotators
• Redefine the guidelines progressively
• Discard all first annotations until annotation quality plateaus
true positive (tp)
false negative (fn)
false positive (fp)
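The three counts above are the usual inputs to precision, recall, and F1. The sketch below assumes annotations are represented as `(start, end, label)` tuples and compared by exact match. The representation, the function name, and the sample spans are illustrative assumptions, not something the slides prescribe.

```python
def evaluate(gold, predicted):
    """Compare two sets of (start, end, label) annotations with exact matching."""
    tp = len(gold & predicted)   # annotated in both
    fn = len(gold - predicted)   # in gold, missed by the prediction
    fp = len(predicted - gold)   # predicted but absent from gold
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fn": fn, "fp": fp,
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical annotations for one document.
gold = {(0, 6, "LOC"), (22, 29, "LOC"), (53, 59, "ORG")}
predicted = {(0, 6, "LOC"), (53, 59, "ORG"), (71, 80, "LOC")}
scores = evaluate(gold, predicted)
```

The same function can compare two human annotators by treating one of them as "gold".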
1 annotator/document vs.
X >= 2 annotators/document ?
Always some repetition to
ensure IAA remains high
New annotators must go
through the same training
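Inter-annotator agreement (IAA) can be measured in several ways. The slides do not name a metric, so the sketch below uses Cohen's kappa for two annotators labelling the same documents as one common choice. The function name and the sample labels are assumptions for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical document-level labels from two annotators.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Recomputing the score on the repeated documents in each annotation round shows whether agreement stays high over time.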
Test; 20%
Validation; 20%
Training; 60%
Don't tell the annotators!
Report statistics (IAA, #Docs, etc.) for each set
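A 60/20/20 split like the one above can be made reproducible with a fixed random seed. The function name, the seed, and the document ids below are hypothetical. The sketch splits at document level and reports only the number of documents per set, so other statistics such as IAA would have to be added separately.

```python
import random

def split_corpus(doc_ids, seed=7, train=0.6, validation=0.2):
    """Shuffle document ids reproducibly and cut them into three disjoint sets."""
    ids = sorted(doc_ids)               # fixed starting order, independent of input order
    random.Random(seed).shuffle(ids)    # seeded shuffle, so the split can be regenerated
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * validation)
    return {
        "training": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # whatever remains, here 20%
    }

docs = [f"doc-{i:03d}" for i in range(50)]  # hypothetical document ids
sets = split_corpus(docs)
sizes = {name: len(members) for name, members in sets.items()}
```

Assigning documents to sets before annotation starts, without telling the annotators which set a document belongs to, keeps the test set from being treated differently.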
Active Learning
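In active learning, the model helps choose which documents to annotate next. The slide does not specify a strategy, so the sketch below assumes uncertainty sampling for a binary classifier, where the documents with a predicted probability closest to 0.5 are sent to the annotators first. The function name and the scores are hypothetical.

```python
def select_for_annotation(probabilities, batch_size=2):
    """Pick the documents whose predicted probability is closest to 0.5."""
    ranked = sorted(probabilities, key=lambda doc_id: abs(probabilities[doc_id] - 0.5))
    return ranked[:batch_size]

# Hypothetical model outputs: P(document is relevant) for unlabelled documents.
model_scores = {"doc-a": 0.97, "doc-b": 0.52, "doc-c": 0.08, "doc-d": 0.41, "doc-e": 0.75}
next_batch = select_for_annotation(model_scores)
```

After the selected documents are annotated, the model is retrained and the selection is repeated.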
https://www.tagtog.net/-corpora
✅ Define clear goals
✅ Beware data biases
✅ Ensure IAA is high (and remains high)
❤ Publish your data
info@tagtog.net
@tagtog_net
The Text Annotation Tool to Train AI

AI for digital humanities, with tagtog.net -- Lancaster University 2019 workshop