IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data scientists to solve but were too busy to ask - Stefan Geißler (kairntech, Germany)

Applications of machine learning to NLP tasks receive a lot of attention today and have been shown to yield state-of-the-art results on a wide range of tasks. We describe several cases where machine learning is deployed productively under the usual constraints of real-world projects: fast throughput, reasonably small training corpora, and high-quality results. We observe a general trend towards open source; our components, too, are open source. With the software mostly freely available, a key success criterion for many NLP projects today is therefore first and foremost the expertise required to combine, tune, and apply open source components.

  1. IC-SDV 2019, April 9-10, 2019, Nice, France
     Addressing requirements for real-world deployments of ML & NLP
     Stefan Geißler, Kairntech
  2. Agenda
     ❖ Looking back: the NLP landscape has changed dramatically
     ❖ Algorithms → Data!
     ❖ Supporting dataset creation: the Kairntech Sherpa
     ❖ Kairntech: who are we?
     ❖ Conclusion
  3. Looking back: the NLP landscape has changed
     2000:
     ❖ Very few open-source components
     ❖ Lexicons, taggers, morphology, parsers mostly proprietary, complex to install and maintain, limited coverage
     ❖ « Make or Buy »
     ❖ High level of manual effort in creating and maintaining lexical knowledge bases and rule systems
  4. Today
     2019:
     ❖ Sharing! (GitHub, …)
     ❖ Lexicons, taggers, morphology, parsers often in the public domain
     ❖ « Combine & Adapt »
     ❖ Broad success of learning-based approaches
  5. 2019: A tipping point in ML & NLP?
     ❖ « 2018 was the 'ImageNet moment' for deep learning in NLP » (S. Ruder)
     ❖ In image processing, a deep learning network won a public contest by a large margin in 2012. In 2018 we saw exciting NLP models implementing transfer learning: ELMo, ULMFiT, BERT
     ❖ « ML engineering in NLP will truly blossom in 2019 » (E. Ameisen)
     ❖ Focus on tools beyond model building! Link NLP/AI to production use! What does it mean to build data-driven products and services?
     ❖ « Enough papers: let's build AI now! » (A. Ng, 2017)
     ❖ « AI is the new electricity! »
  6. Example: Named Entity Recognition
     Cf. https://www.researchgate.net/publication/329933780_A_Survey_on_Deep_Learning_for_Named_Entity_Recognition/download
     Many, if not most, of these approaches are available with code
  7. NLP: A commodity?
     Named entity recognition in four steps:
     $ pip install spacy
     $ python -m spacy download en
     $ cat > testspacy.py
     import spacy
     nlp = spacy.load('en')
     doc = nlp("Angela Merkel will meet Emmanuel Macron at the summit in Amsterdam")
     for entity in doc.ents:
         print(entity.text)
     CTRL-D
     $ python testspacy.py
     Angela Merkel
     Emmanuel Macron
     Amsterdam
  8. Algorithms are commodity
     Even the top-scoring system from the list earlier is available on GitHub:
     https://github.com/zalandoresearch/flair
     For the record: the survey does not list DeLFT (https://github.com/kermitt2/delft), implemented by the Kairntech chief ML expert, which
     • also scores exactly 93.09% on CoNLL-2003
     • creates models that are very compact (~5 MB vs. >150 MB)
     • loads its model in ~2 s at initialization
  9. Nice and easy
     But…
  10. Pain points
      ❖ Off-the-shelf NLP models often don't work for specific needs
      ❖ Implementation is slowed down by the need to build a specific training dataset
      ❖ AI/NLP services often require integration of business glossaries & knowledge graphs
      ❖ Absence of maintenance leads to quality deviations
  11. Frequent requirements in real-world projects
      ❖ In many commercial scenarios around entity extraction, an entity not only has to be recognized but also typed
      ❖ A DATE in a contract may be the date when the contract becomes effective, when it was signed, or when it will be terminated
      ❖ A PERSON in a legal opinion may be the defendant, the lawyer, the judge, the witness, …
      ❖ A DISEASE in a clinical study may be the core therapeutic area or a peripheral, occasional adverse event
      ❖ This is beyond the public named entity recognition modules
      ❖ Typically, no training corpora exist for these decisions. They must be established within a project.
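To make the typing problem concrete, here is a toy Python sketch of the kind of decision a generic NER model does not make: assigning a finer-grained type to a recognized DATE based on cue words in its sentence. The cue lists, labels, and function name are illustrative assumptions, not part of any Kairntech product.

```python
# Illustrative sketch only: typing a DATE entity via cue words in its
# sentence. The cue lists and labels below are invented for this example.
TYPE_CUES = {
    "EFFECTIVE_DATE": ("effective", "comes into force"),
    "SIGNATURE_DATE": ("signed", "executed"),
    "TERMINATION_DATE": ("terminated", "expires"),
}

def type_date(sentence: str) -> str:
    """Return a finer-grained type for a DATE found in `sentence`."""
    lower = sentence.lower()
    for date_type, cues in TYPE_CUES.items():
        if any(cue in lower for cue in cues):
            return date_type
    return "DATE"  # fall back to the generic, untyped entity

print(type_date("This agreement is effective as of 1 May 2019."))
print(type_date("The contract was signed on 12 March 2019."))
```

In practice such distinctions are learned from annotated examples rather than from hand-written cue lists, which is exactly why project-specific training corpora are needed.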
  12. You don't have to take my word on that. Let's listen to what the experts say:
      → Algorithms are commodity, data is gold
      Peter Norvig: "We [at Google] don't have better algorithms than anyone else; we just have more data!" "More data beats clever algorithms."
      Angela Merkel: "Data is the new oil of the 21st century!"
  13. So: We need data, not only algorithms
      Charts copied from https://hackernoon.com/%EF%B8%8F-big-challenge-in-deep-learning-training-data-31a88b97b282
  14. Requirements
      What will be more important for the success of your project?
      Driving the training accuracy from, say, 92.4% to 93.6% on a pre-defined data set?
      or
      ML components that allow high quality with small training sets and moderate annotation and training time?
  15. Example
      ❖ The CoNLL-2003 data set used in many academic NER experiments contains >100,000 entities
      ❖ Assume 30 s per entity
      ➔ ~100 person-days of pure annotation time (with a single annotator)
      ➔ Unrealistic in most commercial project settings
      ➔ Commercial projects have requirements that differ from academic research!
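The back-of-the-envelope calculation behind the 100-person-day figure, assuming 8-hour working days:

```python
entities = 100_000          # entities in the CoNLL-2003 data set
seconds_per_entity = 30     # assumed annotation time per entity
hours = entities * seconds_per_entity / 3600
person_days = hours / 8     # 8-hour working days
print(round(hours), round(person_days))  # ~833 hours, ~104 person-days
```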
  16. On dataset preparation: Requirements
      ❖ Web-based (no install), intuitive GUI, usable by domain experts
      ❖ Limit manual annotation effort: Active Learning
      ❖ Collaboration (work in teams, measure inter-annotator agreement)
      ❖ Not just NER annotation: entity typing, document categorization, …
      ❖ Must facilitate deployment to production
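Inter-annotator agreement is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal stdlib-only sketch; the label sequences are made up for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label
    # if each labels at random according to their own label distribution.
    chance = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / (n * n)
    return (observed - chance) / (1 - chance)

annotator_1 = ["PER", "ORG", "PER", "O"]
annotator_2 = ["PER", "ORG", "O", "O"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.64
```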
  17. Why another tool?
      ❖ WebAnno:
        ❖ Scientific focus: « Annotate corpora to allow the study of linguistic phenomena »
        ❖ Sentence-based, losing all layout information
      ❖ spaCy / Prodigy:
        ❖ Focus on local/lexical named entity recognition. The underlying model by default considers a narrow window of n (n = 4) words to the left and right.
      ❖ brat:
        ❖ Interface only. Integration with model building, semi-automatic suggestions, deployment?
  18. Kairntech Sherpa
      [Architecture diagram] Raw or preannotated corpora (text, audio, …) flow into the annotation environment; curated annotations train the ML model, which feeds automatic annotation suggestions back to the user. Users work with datasets and ML models: search, collaboration, manual & assisted annotation, quality metrics, synchronisation into the ML model.
  19. Active Learning?
      ❖ Reduce the effort of manual data annotation by presenting the user with data in some informed order:
        ❖ Ask the user for feedback on the samples that promise the highest benefit: the samples the model is least certain about* ((*) diagrams from datacamp.com)
      ❖ Active learning applied to NLP tasks has been shown to reduce the amount of required training data dramatically
        ❖ 7% of the samples under an AL regime yield the same quality as naive selection (cf. Laws 2012: https://d-nb.info/1030521204/34)
        ❖ In a project, that would mean 1 day of annotation instead of 14 days
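The "least certain first" idea can be sketched in a few lines. The toy confidence scores below stand in for a real model's probability for its top prediction; everything here is illustrative, not Sherpa's API:

```python
import random

random.seed(0)  # make the toy pool reproducible

# Toy pool of unlabeled examples; `confidence` stands in for the current
# model's probability for its top prediction on each example.
pool = [{"id": i, "confidence": random.random()} for i in range(20)]

def least_confident(pool, k):
    """Select the k examples the model is least certain about."""
    return sorted(pool, key=lambda ex: ex["confidence"])[:k]

batch = least_confident(pool, 3)  # these go to the annotator first
print([ex["id"] for ex in batch])
```

Query by committee (slide 20) replaces the single confidence score with disagreement among several models, but the selection loop stays the same.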
  20. Benefits of AL?
      ❖ Growing accuracy on a (simple) ML task as the number of samples grows
      ❖ Naive selection (« Random », orange line) grows slowly
      ❖ Informed selection (« QBC », query by committee, red line) grows much faster
      ❖ AL promises to reduce the effort required for manual annotation
  21. A non-expert workflow for dataset creation
      Ask the application for suggestions → (de-)validate and retrain → once satisfied, export/deploy
  22. About Kairntech
      ❖ Kairntech: The company
        ❖ Created in Dec 2018, 10 partners
        ❖ France (Paris & Grenoble/Meylan), Germany (Heidelberg)
      ❖ Kairntech: The team
        ❖ Background in software engineering, machine learning, sales, management
        ❖ 15+ years of experience in NLP development and deployment from Xerox, IBM, TEMIS. Development of components currently in production at CERN, NASA, EPO, …
  23. Kairntech: Our profile
      ❖ Industrialize the creation of document sets (training corpora) by offering an environment for data preparation by domain experts, easy and efficient to use
      ❖ Transform data sets into document analysis services, adding value to enterprise knowledge repositories (e.g. knowledge graphs)
      ❖ Industrial deployment and maintenance of these services
  24. Kairntech: Our offering
      ❖ Consulting (feasibility studies, methodology, …)
      ❖ Professional services (data set & model creation, annotation pipelines, knowledge graphs, maintenance)
      ❖ Kairntech Platform « Sherpa » (Sherpa Annotation Tool, Sherpa Knowledge Supervisor)
  25. Conclusions
      ❖ So much data! But very little of it is labelled and useful for supervised learning
      ❖ So many pretrained models! But most of the time they do not quite do what you need in your project
      ❖ So many algorithms! But a library alone will not let you implement the solution you need
      ❖ Kairntech is there to support you!
  26. Thank you for your attention!
      Stefan.Geissler@kairntech.com
