Accommodating the Deep Learning Revolution by a Development Process Methodology
Jochen L. Leidner
Coburg University of Applied Sciences and Arts, Coburg, Germany
KnowledgeSpaces®
UG (haftungsbeschränkt), Coburg, Germany
University of Sheffield, Sheffield, UK
2022-10-11
Overview
●
Introduction: Motivation, Pre-Trained Language Model Revolution
●
Quick Recap: Some Machine Learning Methodologies (CRISP-DM, KDD, SEMMA, Data-to-Value)
●
Before and After Pre-Trained LMs
●
Comparison: Where Project Work is Spent: Pre-BERT and Post-BERT
●
A Comment about Energy
●
Summary & Conclusion
A Step Change in NLP: Deep Learning and Pre-Trained Language Models
●
In recent years, Pre-Trained Language Models (PTLMs) like Google’s BERT have emerged (Devlin et
al., 2018/2019).
●
This has led to enormous improvements in accuracy on most NLP tasks.
●
PTLMs show that transfer learning is possible by splitting up training into two phases.
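To give a concrete feel for what a pre-trained (but not yet fine-tuned) masked language model already does out of the box, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline. It is not part of the original slides; the checkpoint name and example sentence are illustrative assumptions only.

```python
# Minimal illustration (not from the slides): querying a pre-trained masked LM as-is.
# The checkpoint and the example sentence are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT was pre-trained to predict masked tokens from generic text.
predictions = fill_mask("The capital of France is [MASK].")
for p in predictions[:3]:
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```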
BERT: An Example Pre-trained Neural Language Model – Pre-Training versus Fine-Tuning
●
Two Training Phases:
– Pre-training: train deep neural network with masked sentence pairs on generic language (billions of words
from books, Wikipedia)
– Fine-tuning: adapt the generic LM to a specific task (e.g. question answering) using supervised learning (extra
training rounds on top of the pre-trained LM); see the sketch below
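To make the fine-tuning phase concrete, the sketch below (not from the original slides) fine-tunes a pre-trained BERT checkpoint for sentence classification with the Hugging Face Trainer API. The checkpoint, the SST-2 example dataset, and all hyperparameters are illustrative assumptions, not the setup used in the talk.

```python
# Minimal fine-tuning sketch (illustrative only; checkpoint, dataset and
# hyperparameters are assumptions, not the setup used in the talk).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Phase 1 artifact: a generically pre-trained language model checkpoint.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Phase 2: adapt the generic LM to one task using a small labelled dataset
# (here: SST-2 sentiment classification, chosen only as an example).
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-finetuned-sst2",   # hypothetical output directory
    num_train_epochs=2,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,                # enables dynamic padding of batches
)
trainer.train()
```

The point of the sketch is the division of labour: the expensive pre-training has already been paid for in the published checkpoint, and only a comparatively small supervised pass remains.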
Practical Questions
●
RQ 1. How do PTLMs change the way NLP projects are done?
●
RQ 2. In particular, how do PTLMs interact with existing methodologies?
Some Methodologies
●
KDD
●
CRISP-DM (Azevedo and Santos, 2008)
●
SEMMA
●
Data-to-Value (Leidner, 2013; Leidner 2022a,b)
The Data-to-Value Methodology (Leidner, 2013; Leidner 2022a,b) (1 of 2)
The Data-to-Value Methodology (Leidner, 2013; Leidner 2022a,b) (2 of 2)
(Slide annotation: minor fine-tuning of the methodology is sufficient.)
Before and After PTLMs
Before:
●
Every classifier/regressor was a bespoke activity
(100% custom development from scratch)
●
Relatively slow and expensive to build
●
Knobs: more labelled training data, more
features
After:
●
Classifiers can be derived from PTLMs
(80% re-use and 20% custom development → fine-tuning)
●
Rapid/agile prototyping, cheap to get started
●
Knobs: more unlabelled training data, more
labelled training data, and three training regimes (see the zero-shot sketch after this slide):
– Zero-shot (apply PTLM as-is)
– Fine-tuning only (take the pre-trained LM and
add a few hundred training rounds using
annotated data)
– Pre-training (huge unlabelled data) and fine-
tuning (small labelled data)
(Slide annotation: effort increases across the three regimes, from zero-shot to pre-training plus fine-tuning.)
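The zero-shot regime from the list above can be illustrated with a short sketch (not from the original slides). The model name "facebook/bart-large-mnli", the input sentence, and the candidate labels are assumptions chosen purely for illustration.

```python
# Zero-shot regime: apply a pre-trained model as-is, with no task-specific training.
# Model name, input text and candidate labels are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The quarterly earnings exceeded analyst expectations.",
    candidate_labels=["finance", "sports", "politics"],
)
# The pipeline returns the candidate labels ranked by score.
print(result["labels"][0], round(result["scores"][0], 3))
```

Moving from this regime to fine-tuning, and then to pre-training plus fine-tuning, trades increasing cost and effort for higher task accuracy.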
Comparison: How Time May Be Spent – Before and After PTLMs
Before:
●
Data Collection & Pre-Processing 70%
●
Annotation 10%
●
Feature Engineering 10%
●
Model Training 7%
●
Evaluation 3%
After:
●
Data Collection & Pre-Processing 50% - 70%
●
Annotation 2% - 10%
●
Feature Engineering 0%
●
Model Training 0% - 12%
●
Evaluation 3%
Percentages are estimates (an empirical study would be needed but is hard to obtain); ranges reflect the different training regimes.
(Slide annotation: the graphic symbolizes the size of the project.)
Deep Learning & Energy Consumption
●
Pre-training neural models is resource-intensive (Strubell,
Ganesh and McCallum, 2019).
●
Individual estimates vary, but cloud cost
and environmental footprint are real concerns (a rough back-of-envelope calculation is sketched below).
●
While experiments show that "bigger is better" (in terms of F1),
there is a research drive to "distill" smaller models.
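To give a feel for the orders of magnitude involved, here is a hedged back-of-envelope sketch. It is not taken from the slides or from Strubell et al.; every input value below is an assumption chosen purely for illustration.

```python
# Back-of-envelope energy estimate for a hypothetical pre-training run.
# All input values are illustrative assumptions, not measured figures.
num_gpus = 8               # assumed number of GPUs used in parallel
avg_power_watts = 300.0    # assumed average power draw per GPU (W)
train_hours = 96.0         # assumed wall-clock training time (h)
pue = 1.6                  # assumed data-centre power usage effectiveness
co2_kg_per_kwh = 0.4       # assumed grid carbon intensity (kg CO2 per kWh)

energy_kwh = num_gpus * avg_power_watts * train_hours * pue / 1000.0
co2_kg = energy_kwh * co2_kg_per_kwh
print(f"Estimated energy use: {energy_kwh:.0f} kWh (~{co2_kg:.0f} kg CO2)")
```

Even under these modest assumptions the cost is non-trivial, and real large-scale pre-training runs use far more hardware for far longer, which is why distillation and model re-use are attractive.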
Summary & Conclusions
●
PTLMs have made NLP projects more agile.
– While more unlabelled data may be needed, less labelled data may be required (sufficient data is
sometimes unavailable in industrial practice).
– Most importantly, the feature engineering cycle is removed from projects.
– PTLMs offer three training regimes: zero-shot, fine-tuning, and pre-training plus fine-tuning, with increasing cost/effort.
●
As artifacts, they are also clunkier and less energy-efficient.
●
Implications:
– Research: Increasingly large models mean that some academic teams are excluded from research (expensive
GPU clusters are required) → research moves to industry (similar to the semiconductor space).
– Business: Public availability of PTLMs creates a more level playing field, makes competitive
differentiation harder, and reduces barriers to entry.
References
●
Devlin, Jacob, Ming-Wei Chang, Kenton Lee and Kristina Toutanova (2018) BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding, Technical Report/Unpublished ArXiv Pre-print,
https://arxiv.org/abs/1810.04805.
●
Devlin, Jacob, Ming-Wei Chang, Kenton Lee and Kristina Toutanova (2019) "BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding" Proc. NAACL-HLT, Minneapolis.
●
Azevedo, A. and Santos, M. F. (2008) "KDD, SEMMA and CRISP-DM: a parallel overview", Proc. IADIS
European Conference on Data Mining, Amsterdam, 24-26 July 2008, 182–185.
●
Leidner, Jochen L. (2013) "Data-to-Value", unpublished lecture notes, Big Data and Language Technology,
University of Zurich, Zurich, Switzerland.
●
Leidner, Jochen L. (2022a) Data-to-Value: An Evaluation-First Methodology for Natural Language Projects,
Technical Report/Unpublished ArXiv Pre-print https://arxiv.org/abs/2201.07725.
●
Leidner, Jochen L. (2022b) "Data-to-Value: An Evaluation-First Methodology for Natural Language Projects",
Proceedings of the 27th International Conference on Natural Language & Information Systems (NLDB 2022),
Valencia, Spain, 15-17 June 2022, LNCS 13286, 517–523.
●
Strubell, Emma, Ananya Ganesh and Andrew McCallum (2019) "Energy and Policy Considerations for Deep
Learning in NLP", ArXiv pre-print, https://arxiv.org/pdf/1906.02243.pdf.
Abstract
Accommodating the Deep Learning Revolution by a Development Process Methodology
Word embeddings, deep learning, transformer models and other pre-trained neural language models
(sometimes recently referred to as "foundational models") have fundamentally changed the way state-of-the-
art systems for natural language processing and information access are built today. The "Data-to-Value"
process methodology (Leidner 2013; Leidner 2022a,b) has been devised to embody best practices for the
construction of natural language engineering solutions; it can assist practitioners and has also been used to
transfer industrial insights into the university classroom.
This talk recaps how the methodology supports engineers in building systems more consistently and then
outlines the changes in the methodology to adapt it to the deep learning age. The cost and energy
implications will also be discussed.
About the Presenter
Prof. Dr. Jochen L. Leidner, M.A., M.Phil., Ph.D., FRGS, is Professor for Explainable and Responsible Artificial
Intelligence in Insurance at Coburg University of Applied Sciences and a Visiting Professor in the Department of
Computer Science, University of Sheffield. He is also the founder and CEO of KnowledgeSpaces.
His experience includes positions as Director of Research at Thomson Reuters and Refinitiv in London,
where he headed the R&D team that he had founded. He was also the Royal Academy of Engineering Visiting
Professor of Data Analytics at the Department of Computer Science, University of Sheffield (2017-2020).
His background includes a Master's in computational linguistics, English and computer science (University of Erlangen-Nuremberg), a
Master's in Computer Speech, Text and Internet Technology (University of Cambridge) and a PhD in Informatics (University of Edinburgh),
which won the first ACM SIGIR Doctoral Consortium Award.
His scientific contributions include leading the teams that developed the QED and ALYSSA open-domain question answering systems
(evaluated at US NIST/DARPA TREC), proposing a new algorithm and comparing existing algorithms for spatial resolution of named
entities, and information extraction of usual and unusual things (e.g. event extraction, company risk mining, sentiment analysis).
At Thomson Reuters he led projects in the vertical domains of finance, regulatory/law enforcement, legal, pharmacology, and news. His
code and machine learning models have been transitioned into products deployed at institutions ranging from international banks to the
U.S. Supreme Court.
Prior to Thomson Reuters, he worked for SAP and founded or co-founded a number of start-ups. He has lived and worked in
Germany, Scotland, the USA, Switzerland and the UK, and has taught at various universities (Erlangen, Saarbrücken, Frankfurt, Zurich and
now Coburg), and is a scientific expert for the European Commission (FP7, H2020, Horizon Europe) and other funding bodies. He is an
author or co-author of several dozen peer-reviewed publications (incl. one best paper award), has authored/co-edited two books and holds
several patents in the areas of information retrieval, natural language processing, and mobile computing.
He has been twice winner of the Thomson Reuters inventor of the year award for the best patent application.
About KnowledgeSpaces®
●
Contact for consulting:
E-Mail: info@knowledgespaces.de
Phone: +49 (172) 904 8908
