IC-SDV 2019
April 9-10, 2019
Nice, France
Addressing requirements for
real-world deployments of ML & NLP
Stefan Geißler, Kairntech
Agenda
Looking back: the NLP landscape has changed
dramatically
Algorithms → Data!
Support dataset creation: The Kairntech Sherpa
Kairntech? Who are we
Conclusion
Looking back: the NLP landscape has changed
2000:
Very few open source components
Lexicons, Taggers, Morphology,
Parsers mostly proprietary, complex to
install and maintain, limited coverage
« Make or Buy »
High level of manual efforts in
creating and maintaining lexical
knowledge bases, rule systems
Today
2019:
Sharing! (Github, …)
Lexicons, Taggers, Morphology,
Parsers often in the public domain
« Combine & Adapt »
Broad success of learning-based
approaches
2019: A tipping point in ML & NLP?
• « 2018 was the ‘ImageNet’ moment for deep learning in NLP » (S.
Ruder)
  • In image processing, a deep learning network won a public
  contest by a large margin in 2012. In 2018 we saw exciting NLP
  models implementing transfer learning: ELMo, ULMFiT, BERT
• « ML Engineering in NLP will truly blossom in 2019 » (E. Ameisen)
  • Focus on tools beyond model building! Link NLP/AI to production
  use! What does it mean to build data-driven products and
  services?
• « Enough papers: Let’s build AI now! » (A. Ng, 2017)
• « AI is the new electricity! »
Example: Named Entity Recognition
Cf.
https://www.researchgate.net/publication/329933780_A_Survey_on_Deep_Learning_for_Named_Entity_Recognition/download
Many / most of
these approaches
available with
code
NLP: A commodity?
Named entity recognition in four steps:
$ pip install spacy
$ python -m spacy download en
$ cat > testspacy.py
import spacy
# 'en' is the spaCy 2.x shortcut; spaCy 3+ needs a full model name such as en_core_web_sm
nlp = spacy.load('en')
doc = nlp("Angela Merkel will meet Emmanuel Macron at the summit in Amsterdam")
for entity in doc.ents:
    print(entity.text)
CTRL-D
$ python testspacy.py
Angela Merkel
Emmanuel Macron
Amsterdam
Algorithms are a commodity
Even the top-scoring system from the list earlier is available on GitHub:
https://github.com/zalandoresearch/flair
For the record:
The survey does not list DeLFT
(https://github.com/kermitt2/delft),
implemented by the Kairntech chief
ML expert, which
• Scores exactly 93.09% on
CoNLL-2003, too
• Creates models that are very
compact (~5 MB vs. >150 MB)
• Loads a model in ~2 s at initialization
Nice and easy
But…
Pain points
• Off-the-shelf NLP models often don’t work for
specific needs
• Implementation is slowed down by the need to
build specific training datasets
• AI/NLP services often require integration of
business glossaries & knowledge graphs
• Absence of maintenance leads to quality deviations
Frequent requirements in real-world projects
• In many commercial scenarios around entity extraction, an entity not only has to be
recognized but also typed
  • A DATE in a contract may be the date when the contract becomes effective,
  when it was signed, or when it will be terminated
  • A PERSON in a legal opinion may be the defendant, the lawyer, the judge, the
  witness …
  • A DISEASE in a clinical study may be the core therapeutic area or a peripheral,
  occasional adverse event
• This is beyond the public named entity recognition modules
• Typically, no training corpora exist for these decisions. They must be established
within a project.
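To make "recognized but also typed" concrete, here is a minimal sketch of a data structure that carries both the generic entity label and a project-specific role. The class name `TypedEntity`, the `role` field, the role names and the sample dates are all hypothetical illustrations; they come neither from a public NER module nor from the Kairntech product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TypedEntity:
    """An entity mention with a generic label and an optional project-specific role."""
    text: str
    label: str                  # generic type, as a public NER model might return it
    role: Optional[str] = None  # hypothetical project-specific typing, e.g. "signature_date"


def untyped(entities: List[TypedEntity]) -> List[TypedEntity]:
    """Return the entities that were recognized but still lack a role."""
    return [e for e in entities if e.role is None]


# Invented example: three dates found in a contract, only two of them typed so far.
contract_entities = [
    TypedEntity("1 March 2019", "DATE", role="effective_date"),
    TypedEntity("15 February 2019", "DATE", role="signature_date"),
    TypedEntity("28 February 2022", "DATE"),
]

for entity in untyped(contract_entities):
    print(f"{entity.label} '{entity.text}' still needs a role")
```

The remaining untyped entities are exactly the ones for which project-specific training data would have to be created.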
You don’t have to take my word for it.
Let’s listen to what the experts say:
• Algorithms are a commodity, data is gold
Peter Norvig:
“We [at Google] don't have
better algorithms than anyone
else; we just have more data!”
“More data beats clever
algorithms.”
Angela Merkel:
“Data is the new oil of the 21st
century!”
So: We need data, not only algorithms
Charts copied from https://hackernoon.com/%EF%B8%8F-big-challenge-in-deep-learning-training-data-31a88b97b282
Requirements
What will be more important for
the success of your project?
Driving the training accuracy from, say,
92.4% to 93.6% on a pre-defined data set?
or
ML components that allow high quality with
small training sets and moderate annotation and
training time?
Example
• The CoNLL-2003 data set used in many academic NER
experiments contains >100,000 entities
• Assume 30 s per entity → about 100 person-days of pure annotation
time! (With a single annotator)
Unrealistic in most commercial project settings.
Commercial projects have requirements that are different
from academic research!
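The estimate above can be checked with a few lines of arithmetic. This is a sketch under one added assumption that the slide does not state: a person-day is taken to be 8 working hours. The function name `annotation_person_days` is made up for this illustration.

```python
def annotation_person_days(n_entities, seconds_per_entity=30, hours_per_day=8):
    """Estimate pure annotation time in person-days for a single annotator.

    hours_per_day=8 is an assumption; the slide only gives the entity count
    and the 30 seconds per entity.
    """
    total_hours = n_entities * seconds_per_entity / 3600
    return total_hours / hours_per_day


days = annotation_person_days(100_000)
print(f"{days:.0f} person-days")
```

With 100,000 entities this comes to roughly 104 person-days, in line with the "about 100" quoted on the slide.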
On dataset preparation: Requirements
Web-based (no install), intuitive GUI, usable by domain experts
Limit manual annotation efforts: Active Learning
Collaboration (work in teams, measure inter-annotator agreement)
Not just NER annotation: Entity typing, document categorization, …
Must facilitate deployment-to-production
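One common way to measure inter-annotator agreement between two annotators is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below is a plain-Python illustration of that measure, not necessarily the metric the Sherpa implements; the label sequences are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two annotators disagree on one of six entity mentions.
annotator_1 = ["PER", "PER", "LOC", "ORG", "LOC", "PER"]
annotator_2 = ["PER", "ORG", "LOC", "ORG", "LOC", "PER"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Here the raw agreement is 5/6, the chance agreement is 1/3, and kappa is therefore 0.75.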
Why another tool?
• WebAnno:
  • Scientific focus: « Annotate corpora to allow the study of
  linguistic phenomena »
  • Sentence-based, losing all layout information
• spaCy/Prodi.gy:
  • Focus on local/lexical named entity recognition. The underlying
  model by default considers a narrow window of n (n=4) words
  to the left and right.
• Brat:
  • Interface only. Integration with model building, semi-automatic
  suggestions, deployment?
Kairntech Sherpa
[Diagram: the Sherpa annotation environment. Labels: raw or pre-annotated corpora (text, audio, …); ML model; curated annotations; automatic annotation suggestions; user; datasets and ML models]
Search, collaboration, manual & assisted annotation, quality
metrics, synchronisation into the ML model
Active Learning?
• Reduce the effort of manually annotating data by presenting the user with data in
some informed order:
  • Ask the user for feedback on the samples that promise the highest benefit:
  the samples the model is least certain about*
(*) Diagrams taken from datacamp.com
• Active Learning applied to NLP tasks has been shown to reduce the amount of
required training data dramatically
  • With AL, 7% of the samples yield the same quality as naive selection
  (cf. Laws 2012: https://d-nb.info/1030521204/34)
  • In a project that would mean 1 day of annotation instead of 14 days
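Selecting "the samples that are least certain" can be sketched as least-confidence sampling: rank the unlabeled samples by the probability the current model assigns to its top prediction and ask the user about the lowest-ranked ones first. The function name `least_confident` and the probability values below are invented for illustration; a real system would take these probabilities from its own model.

```python
def least_confident(probabilities, k):
    """Return the indices of the k samples whose top predicted probability is lowest."""
    confidence = [(max(p), i) for i, p in enumerate(probabilities)]
    return [i for _, i in sorted(confidence)[:k]]


# Invented class probabilities for four unlabeled samples (three classes each).
unlabeled_probs = [
    [0.98, 0.01, 0.01],  # model is very sure
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # model is almost guessing
]

print(least_confident(unlabeled_probs, k=2))
```

In this example the samples at index 3 and index 1 would be presented to the annotator first, since the model is least sure about them.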
Benefits of AL?
• Accuracy on a
(simple) ML task grows as the number
of samples grows
• Naive selection (« Random »,
orange line) grows slowly
• Informed selection (« QBC »,
« query by committee », red
line) grows much faster
• AL promises to reduce the effort
required for manual
annotation
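Query by committee can be sketched as follows: several models (the committee) each predict a label for every unlabeled sample, and the samples on which the committee disagrees most are sent to the annotator first. The sketch below measures disagreement with vote entropy, which is one common choice and an assumption here; the chart does not say which disagreement measure was used. The votes are invented.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes for one sample (0 means full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def query_by_committee(committee_votes, k):
    """Return the indices of the k samples with the highest committee disagreement."""
    ranked = sorted(
        range(len(committee_votes)),
        key=lambda i: vote_entropy(committee_votes[i]),
        reverse=True,
    )
    return ranked[:k]


# Invented votes of a three-member committee on three unlabeled samples.
committee_votes = [
    ["PER", "PER", "PER"],  # full agreement
    ["PER", "ORG", "LOC"],  # maximal disagreement
    ["ORG", "ORG", "LOC"],  # partial disagreement
]

print(query_by_committee(committee_votes, k=2))
```

The sample with three different votes is queried first, followed by the one with a two-to-one split; the unanimous sample is left for later.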
A non-expert workflow for dataset creation
Ask the
application for
suggestions
(De-) validate
and retrain
Once satisfied,
export/deploy
About Kairntech
• Kairntech: The company
  • Founded in December 2018, 10 partners
  • France (Paris & Grenoble/Meylan), Germany
  (Heidelberg)
• Kairntech: The team
  • Background in software engineering, machine
  learning, sales, management
  • 15+ years of experience in NLP development and
  deployment at Xerox, IBM, TEMIS. Development of
  components currently in production (at CERN, NASA,
  EPO, …)
Kairntech: Our profile
• Industrialize the creation of document sets (training
corpora) by offering an environment for data
preparation by domain experts that is easy and efficient to use
• The transformation of data sets into document analysis
services, adding value to enterprise knowledge
repositories (e.g. knowledge graphs)
• Industrial deployment and maintenance of these services.
Kairntech: Our offering
Conclusions
• So much data!
  • But very little of it labelled and useful for supervised learning
• So many pretrained models!
  • But most of the time they do not quite do what you need in
  your project
• So many algorithms!
  • But a library alone will not allow you to implement the solution
  you need
• Kairntech is there to support you!
Thank you for your attention!
Stefan.Geissler@kairntech.com

Stefan Geissler kairntech - SDC Nice Apr 2019

  • 1. IC-SDV 2019 April 9-10, 2019 Nice, France Addressing requirements for real-world deployments of ML & NLP Stefan Geißler, Kairntech
  • 2. Agenda Looking back: the NLP landscape has changed dramatically Algorithms  Data! Support dataset creation: The Kairntech Sherpa Kairntech? Who are we Conclusion
  • 3. Looking back : NLP landscape has changed 2000: Very few open source components Lexicons, Taggers, Morphology, Parsers mostly proprietory, complex to install and maintain, limited coverage « Make or Buy » High level of manual efforts in creating and maintaining lexical knowledge bases, rule systems
  • 4. Today 2019: Sharing! (Github, …) Lexicons, Taggers, Morphology, Parsers often in the public domain « Combine & Adapt » Broad success of learning-based approaches
  • 5. 2019: A tipping point in ML & NLP?  « 2018 was the ‘image net’ moment for deep learning in NLP’ (S. Ruder)  In Image Processing in 2012 a Deep Learning network won a public contest by a large margin. Now in 2018 we saw exciting NLP models implementing transfer learning: ELMo, UMLfit, BERT  « ML Engineering in NLP will truly blossom in 2019 » (E. Ameisen)  Focus on Tools beyond model building! Link NLP/AI to production use! What does it mean to build data-driven products and services?  « Enough papers: Let’s build AI now! » (A. Ng, 2017)  « AI is the new electricity! »
  • 6. Example: Named Entity Recognition Cf. https://www.researchgate.net/publication/329933780_A_Survey_on_Deep_Learning_for_Named_Entity_Recognition/download Many / most of these approaches available with code
  • 7. NLP: A commodity? Named entity recognition in four steps: $ pip install spacy $ python –m spacy download en $ cat > testspacy.py import spacy nlp = spacy.load(‘en’) doc = nlp(“Angela Merkel will meet Emmanuel Macron at the summit in Amsterdam”) for entity in doc.ents: print(entity.text) CRTL-D $ python testspacy.py Angela Merkel Emmanuel Macron Amsterdam
  • 8. Algorithms are commodity Even the top scoring system from the list earlier is available on github: https://github.com/zalandoresearch/flair For the protocol: The survey does not list Delft ( https://github.com/kermitt2/delft), implemented by the Kairntech chief ML expert and which •Scores exactly at 93,09% on Conll2003, too •Creates models that are very compact (~5MB vs. >150MB) •Loads model in ~2sec at initialization
  • 10. Pain points  Off-the-shelf NLP models often don’t work for specific needs  Implementation is slowed down by the need of building specific training dataset  AI/NLP services are often require integration of business glossaries & knowledge graph  Absence of maintenance leads to quality deviations
  • 11. Frequent requirements in real-world projects  In many commercial scenarios around entity extraction, an entity not only has to be recognized but also typed  A DATE in a contract may be the date when the contract becomes effective, when it was signed, when it will be terminated  A PERSON in a legal opinion may be the defendant, the lawyer, the judge, the witness …  A DISEASE in clinical study may be the core therapeutic area or a peripheral occasional adverse event  This is beyond the public named entity recognition modules  Typically, for these decisions no training corpora exist. They must be established within a project.
  • 12. You don’t have to take my word on that. Let’s listen to what the experts say:  Algorithms are commodity, data is gold Peter Norvig: “We [at Google] don't have better algorithms than anyone else; we just have more data!” “More data beats clever algorithms.” Angela Merkel: “Data is the new oil of the 21st century!“
  • 13. So: We need data, not only algorithms Charts copied from https://hackernoon.com/%EF%B8%8F-big-challenge-in-deep-learning-training-data-31a88b97b282
  • 14. Requirements What will be more important for the success of your project? Driving the training accuracy from, say, 92,4 to 93,6% on a pre-defined data set? or ML components that allow high quality with small training sets and moderate annotation and training time?
  • 15. Example  The Conll2003 data set used in many academic NER experiments contains >100000 entities  Assume 30sec per entity  100 person days pure annotation time! (With one single annotator) Unrealistic in most commercial project settings. Commercial projects have requirements that are different from academic research!
  • 16. On dataset preparation: Requirements Web-based (no install), intuitive GUI, usable by domain experts Limit manual annotation efforts: Active Learning Collaboration (work in teams, measure inter-annotator agreement) Not just NER annotation: Entity typing, document categorization, … Must facilitate deployment-to-production
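Inter-annotator agreement is commonly measured with Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below is a plain-Python illustration for two annotators; the slide does not say which agreement metric the tool uses, and the label sequences are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: the annotators disagree on one of six mentions.
annotator_1 = ["PER", "PER", "LOC", "ORG", "LOC", "PER"]
annotator_2 = ["PER", "LOC", "LOC", "ORG", "LOC", "PER"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a low kappa is a hint that the annotation guidelines need sharpening before more data is labelled.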
  • 17. Why another tool?  WebAnno:  Scientific focus: « Annotate corpora to allow the study of linguistic phenomena »  Sentence-based, losing all layout information  Spacy/Prodi.gy:  Focus on local/lexical named entity recognition. Underlying model by default considers a narrow window of n (n=4) words left and right.  Brat:  Interface-only. Integration with model building, semi-automatic suggestions, deployment?
  • 18. Kairntech Sherpa Annotation environment [Diagram labels: Raw or preannotated Corpora: Text, Audio, … | ML model | Curated Annotations | Automatic Annotation Suggestions | User | Datasets and ML models] Search, Collaboration, Manual & assisted annotation, Quality metrics, Synchronisation into ML model
  • 19. Active Learning?  Reduce effort in manual annotation of data by presenting the user with data in some informed order:  Ask the user for feedback on the samples that promise the highest benefit: Samples that are least certain* (*) Diagrams used from datacamp.com  Active Learning applied to NLP tasks has been shown to reduce the amount of required training data dramatically  7% of the samples under an AL regime yield the same quality as naive selection (cf. Laws 2012: https://d-nb.info/1030521204/34)  In a project that would mean 1 day of annotation instead of 14 days
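Selecting the "least certain" samples can be sketched with least-confidence sampling, one common uncertainty criterion. This is an illustration only: the slide does not say which criterion is used, and the function name and probability values are invented.

```python
def least_confident(probabilities, k):
    """Return the indices of the k samples whose top class probability is lowest.

    `probabilities` holds one predicted class distribution per unlabelled sample.
    """
    confidence = [max(dist) for dist in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]


# Invented model outputs for four unlabelled samples over three classes.
predictions = [
    [0.98, 0.01, 0.01],  # model is sure: little to gain from annotating
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],  # model is nearly guessing: most informative
]
to_annotate = least_confident(predictions, k=2)
print(to_annotate)
```

In an annotation loop, the selected samples would be shown to the user first, the model retrained on the new labels, and the ranking recomputed.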
  • 20. Benefits of AL?  Growing accuracy on a (simple) ML task as number of samples grows  Naive selection (« Random », orange line) growing slowly  Informed selection (« QBC », « query by committee », red line) grows much faster  AL promises to reduce effort required for manual annotation
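Query by committee ranks unlabelled samples by how strongly several models disagree about them. A minimal sketch using vote entropy as the disagreement measure follows; the committee votes are invented, and the experiment behind the chart may have used a different disagreement measure.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label votes cast by the committee for one sample."""
    total = len(votes)
    return -sum((count / total) * math.log(count / total)
                for count in Counter(votes).values())


def query_by_committee(committee_votes, k):
    """Return the indices of the k samples with the highest committee disagreement."""
    scores = [vote_entropy(votes) for votes in committee_votes]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Invented votes from a three-model committee for three unlabelled samples.
committee_votes = [
    ["PER", "PER", "PER"],  # full agreement
    ["PER", "ORG", "LOC"],  # full disagreement
    ["ORG", "ORG", "LOC"],  # partial disagreement
]
selected = query_by_committee(committee_votes, k=2)
print(selected)
```

Samples on which the committee agrees contribute little new information, which is why the informed selection curve rises faster than random selection.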
  • 21. A non-expert workflow for dataset creation Ask the application for suggestions (De-) validate and retrain Once satisfied, export/deploy
  • 22. About Kairntech  Kairntech: The company  Created in December 2018, 10 partners  France (Paris & Grenoble/Meylan), Germany (Heidelberg)  Kairntech: The team  Background in Software engineering, Machine Learning, Sales, Management  15+ years of experience in NLP development and deployment from Xerox, IBM, TEMIS. Development of components currently in production at CERN, NASA, EPO…
  • 23. Kairntech: Our profile  Industrialize the creation of document sets (training corpora) by offering an environment for data preparation by domain experts that is easy and efficient to use  The transformation of data sets into document analysis services, adding value to enterprise knowledge repositories (e.g. knowledge graphs)  Industrial deployment and maintenance of these services.
  • 25. Conclusions  So much data!  But very little of it labelled and useful for supervised learning  So many pretrained models!  But most of the time they do not quite do what you need in your project  So many algorithms!  But a library alone will not allow you to implement the solution you need  Kairntech is there to support you!
  • 26. Thank you for your attention ! Stefan.Geissler@kairntech.com

Editor's Notes

  1. Attention: Numbers are not always comparable! Are the models trained with or without the validation set? Are the numbers the best of a set of n experiments? Or the average of n experiments? We have spent some effort redoing the experiments reported in the literature, and there are often slight variations. This does not mean that there is dishonesty involved! But it means that when results are within a few tenths of a percent, the question “which approach is best” becomes blurry.
  2. https://blog.floydhub.com/ten-trends-in-deep-learning-nlp/ What does it mean for me? Can this research be applied to everyday applications? Or is the underlying technology still evolving so rapidly that it is not worth investing time developing an approach which may be considered obsolete with the next research paper?
  3. Also doccano, talen,