Introduction to AI & NLP (for digital humanities)
AND
Hands-on tagtog.net, a text annotation tool
Lancaster University 2019 workshop, February 7th
#dh #textMining #nlproc #nlu #training #trainingData #trainingCorpus #corpus #corpora #deepLearning #machineLearning #session #academia #industry #textAnalytics #juanmirocks #cejuela
4. Dr. Juan Miguel Cejuela (Munich, Germany)
PhD in Computer Science
Focus of his research has been on Text Mining and Machine Learning
Jorge Campos (Gdansk, Poland)
MSc Computer Engineering
Focus on product development and frontend
@tagtog_net
26. Unstructured data is text-heavy
Social Media, Voice, PDFs,
scientific articles, reports, etc.
IDC and EMC: Data will grow to 40 ZB by 2020
Unstructured data grows faster than structured data
33. Techniques to turn large amounts of unstructured text
that is understandable by humans (i.e. natural language)
into unambiguous, structured knowledge.
56. Coreference Resolution
London is the capital of England. It was founded
by the Romans, who named it Londinium.
• London is the capital of England
• London was founded by the Romans
• London was named [by the Romans] Londinium
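The rewriting above can be sketched with a toy heuristic: replace each pronoun with the most recently mentioned candidate entity. This is only an illustration, not how a real coreference system works. The function name `resolve_pronouns` and the hand-picked candidate set are assumptions made for this example.

```python
import re

def resolve_pronouns(text, entities, pronouns=("it",)):
    """Replace each pronoun with the most recently seen candidate entity.

    Toy heuristic (hypothetical helper): the candidate entities are
    supplied by hand, and the nearest preceding one always wins.
    """
    last_entity = None
    out = []
    # Split into word and non-word chunks so the text can be rejoined unchanged.
    for token in re.findall(r"\w+|\W+", text):
        if token in entities:
            last_entity = token
            out.append(token)
        elif token.lower() in pronouns and last_entity is not None:
            out.append(last_entity)
        else:
            out.append(token)
    return "".join(out)

slide_text = ("London is the capital of England. It was founded "
              "by the Romans, who named it Londinium.")
resolved = resolve_pronouns(slide_text, entities={"London"})
print(resolved)
```

The heuristic only works here because "London" is the sole candidate. Adding "England" to the candidates would make "It" resolve to the wrong entity, which is why real systems need syntactic and semantic evidence.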
57. UTF-8, Special Characters!
Córdoba → C□rdoba
Ü õ Å ñ ö Œ Ô è í â
가-힣 русский язык বাংলা हिन्दी ελληνικά العربية
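A minimal sketch of how "Córdoba" can lose its accented character, using only Python's built-in `str.encode` and `bytes.decode`. The variable names are illustrative. The lossy case prints "?" where the slide shows "□", since the placeholder depends on the software involved.

```python
# "\u00f3" is the precomposed character "ó"; written as an escape so the
# example does not depend on how this file itself is encoded.
text = "C\u00f3rdoba"

# Correct round trip: encode and decode with the same encoding.
utf8_bytes = text.encode("utf-8")        # "ó" takes two bytes in UTF-8
round_trip = utf8_bytes.decode("utf-8")

# Wrong decoder: UTF-8 bytes read as Latin-1 give two wrong characters.
mojibake = utf8_bytes.decode("latin-1")

# Lossy encoder: ASCII cannot represent "ó", so it is replaced by "?".
lossy = text.encode("ascii", errors="replace").decode("ascii")

print(len(text), len(utf8_bytes))
print(round_trip, mojibake, lossy)
```

Once the character has been replaced, the original cannot be recovered, so it helps to fix the encoding at read time rather than after annotation has started.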
61. • Annotation Type? NER? RE? Doc. Classification?
• How many annotators?
• How many documents?
• Which documents?
• Training & Testing?
• Time frame?
• Costs?
• …?
73. 1 annotator/document vs.
X >= 2 annotators/document ?
Always some repetition to
ensure IAA remains high
New annotators must go
through the same training