Building a Pipeline for State-of-the-Art Natural Language Processing Using Hugging Face Tools

The natural language processing (NLP) landscape has radically changed with the arrival of transformer networks in 2017.

  1. The pipeline for State-of-the-Art NLP Hugging Face
  2. Agenda Lysandre DEBUT Machine Learning Engineer @ Hugging Face, maintainer and core contributor of huggingface/transformers Anthony MOI Technical Lead @ Hugging Face, maintainer and core contributor of huggingface/tokenizers Some slides were adapted from a previous Hugging Face talk by Thomas Wolf, Victor Sanh and Morgan Funtowicz
  3. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  4. Hugging Face
  5. Hugging Face Most popular open source NLP library ▪ 1,000+ research paper mentions ▪ Used in production by 1,000+ companies
  6. Hugging Face
  7. Today’s Menu
  8. Subjects we’ll dive into today ● NLP: Transfer learning, transformer networks ● Tokenizers: from text to tokens ● Transformers: from tokens to predictions
  9. Transfer Learning - Transformer networks One big training to rule them all
  10. NLP took a turn in 2018 Self-supervised Training & Transfer Learning Large Text Datasets Compute Power The arrival of the transformer architecture
  11. Transfer learning In a few diagrams
  12. Sequential transfer learning Learn on one task/dataset, transfer to another task/dataset word2vec GloVe skip-thought InferSent ELMo ULMFiT GPT BERT DistilBERT Text classification Word labeling Question-Answering ... Pre-training Adaptation Computationally intensive step General purpose model
  13. Transformer Networks Very large models - State of the Art in several tasks
  14. Transformer Networks ● Very large networks ● Can be trained on very big datasets ● Better than previous architectures at maintaining long-term dependencies ● Require a lot of compute to be trained Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. In NAACL, 2019.
  15. Transformer Networks Pre-training Base model → Pre-trained language model Very large corpus $$$ in compute Days of training
  16. Transformer Networks Fine-tuning Pre-trained language model → Fine-tuned language model Training can be done on a single GPU Small dataset Easily reproducible
  17. Model Sharing Reduced compute, cost, energy footprint From 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT, by Victor Sanh
  18. A deeper look at the inner mechanisms Pipeline, pre-training, fine-tuning
  19. Transfer Learning pipeline in NLP From text to tokens, from tokens to prediction: Tokenizer → Pre-trained model → Adaptation head. “Jim Henson was a puppeteer” is tokenized (Jim Henson was a puppet ##eer), converted to vocabulary indices (11067 5567 245 120 7756 9908), passed through the pre-trained model, and fed to the task-specific model for the final prediction (True 0.7886 / False 0.223)
  20. Pre-training The rise of language modeling pre-training Many currently successful pre-training approaches are based on language modeling: learning to predict P_θ(text) or P_θ(text | other text) Advantages: - Doesn’t require human annotation - self-supervised - Many languages have enough text to learn high-capacity models - Versatile - can be used to learn both sentence and word representations with a variety of objective functions
  21. Language Modeling Objectives - MLM The pipeline for State-of-the-Art Natural Language Processing Tokenization → ['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing'] Masking → ['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', [MASK], 'Language', 'Process', '##ing'] Prediction for [MASK] → 'Natural', 'Artificial', 'Machine', 'Processing', 'Speech'
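To make the masked-language-modeling objective concrete, here is a minimal sketch (not code from the talk) using the transformers fill-mask pipeline; the checkpoint name and example sentence are illustrative choices, and the exact scores will vary.

```python
from transformers import pipeline

# Fill-mask pipeline backed by a masked language model
# (distilbert-base-uncased is an illustrative choice).
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

# The model proposes the most likely tokens for the [MASK] position.
for prediction in fill_mask("The pipeline for State-of-the-Art [MASK] Language Processing."):
    print(prediction["sequence"], round(prediction["score"], 3))
```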
  22. Language Modeling Objectives - CLM The pipeline for State-of-the-Art Natural Language Processing Tokenization → ['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing'] Prediction of the continuation → ['Process', '##ing', '(', 'NL', '##P', ')', 'software', 'which', 'will', 'allow', 'a', 'user', 'to', 'develop']
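As a rough illustration of the causal objective (again, not the talk's code), the sketch below asks GPT-2 for the single most likely next token of a prompt; it assumes PyTorch and a recent transformers version where model outputs expose `.logits`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 is trained with the causal language modeling objective:
# predict the next token given everything to its left.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The pipeline for State-of-the-Art Natural Language", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # (batch, sequence_length, vocab_size)

next_token_id = int(logits[0, -1].argmax())   # most likely continuation of the prompt
print(tokenizer.decode(next_token_id))
```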
  23. Tokenization It doesn’t have to be slow
  24. Tokenization Its role in the pipeline - Convert input strings to a set of numbers: “Jim Henson was a puppeteer” → Jim Henson was a puppet ##eer → 11067 5567 245 120 7756 9908 - Goal: find the most meaningful and smallest possible representation
  25. Some examples Let’s dive into the nitty-gritty
  26. Word-based Word by word tokenization Split on spaces: Let’s | do | tokenization! Split on punctuation: Let | ’s | do | tokenization | ! ▪ Split on spaces, or following specific rules to obtain words ▪ What to do with punctuation? ▪ Requires large vocabularies: dog != dogs, run != running ▪ Out-of-vocabulary (aka <UNK>) tokens for unknown words
  27. Character Character by character tokenization L | e | t | ’ | s | d | o | t | o | k | e | n | i | z | a | t | i | o | n | ! ▪ Split on characters individually ▪ Do we include spaces or not? ▪ Smaller vocabularies ▪ But lack of meaning -> characters don’t necessarily have a meaning separately ▪ End up with a huge number of tokens to be processed by the model
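The two naive strategies on slides 26-27 are easy to reproduce in a few lines of plain Python; this is just an illustration of the trade-off, not code from the talk.

```python
import re

sentence = "Let's do tokenization!"

# Word-level: split on whitespace, or additionally split punctuation off.
space_tokens = sentence.split()                      # ["Let's", 'do', 'tokenization!']
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)   # ['Let', "'", 's', 'do', 'tokenization', '!']

# Character-level: tiny vocabulary, but many tokens and little meaning per token.
char_tokens = list(sentence)

print(space_tokens, word_tokens, char_tokens, sep="\n")
```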
  28. Byte Pair Encoding Welcome subword tokenization ▪ First introduced by Philip Gage in 1994, as a compression algorithm ▪ Applied to NLP by Rico Sennrich et al. in “Neural Machine Translation of Rare Words with Subword Units”. ACL 2016.
  29. Byte Pair Encoding Welcome subword tokenization Initial alphabet: A B C ... a b c ... ? ! ... ▪ Start with a base vocabulary using Unicode characters seen in the data ▪ Most frequent pairs get merged into a new token: 1. T + h => Th 2. Th + e => The
  30. Byte Pair Encoding Welcome subword tokenization ▪ Fewer out-of-vocabulary tokens ▪ Smaller vocabularies Let’s</w> do</w> token ization</w> !</w>
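A toy sketch of the BPE training loop described above: count symbol pairs in a small word-frequency table and greedily merge the most frequent pair. This is a deliberate simplification for illustration, not the tokenizers implementation.

```python
from collections import Counter

# Toy corpus: words already split into characters, with an end-of-word marker.
corpus = {("l", "o", "w", "</w>"): 5,
          ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6}

def most_frequent_pair(corpus):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):                 # learn three merge rules
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
```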
  31. And a lot more So many algorithms... ▪ Byte-level BPE as used in GPT-2 (Alec Radford et al., OpenAI) ▪ WordPiece as used in BERT (Jacob Devlin et al., Google) ▪ SentencePiece (Unigram model) (Taku Kudo et al., Google)
  32. Tokenizers Why did we build it? ▪ Performance ▪ One API for all the different tokenizers ▪ Easy to share and reproduce your work ▪ Easy to use any tokenizer, and re-train it on a new language/dataset
  33. The tokenization pipeline Inner workings Normalization → Pre-tokenization → Tokenization → Post-processing
  34. The tokenization pipeline Inner workings Normalization ▪ Strip ▪ Lowercase ▪ Removing diacritics ▪ Deduplication ▪ Unicode normalization (NFD, NFC, NFKC, NFKD)
  35. The tokenization pipeline Inner workings Pre-tokenization ▪ Set of rules to split: - Whitespace use - Punctuation use - Something else?
  36. The tokenization pipeline Inner workings Tokenization ▪ Actual tokenization algorithm: - BPE - Unigram - Word level
  37. The tokenization pipeline Inner workings Post-processing ▪ Add special tokens: for example [CLS], [SEP] with BERT ▪ Truncate to match the maximum length of the model ▪ Pad all sequences in a batch to the same length ▪ ...
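The four stages on slides 33-37 map directly onto objects in the tokenizers library. The sketch below wires them together for a small BPE tokenizer trained from scratch; the file path and special-token choices are placeholders, and the calls shown follow the API documented for recent tokenizers releases.

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# 1. Normalization: Unicode NFD, lowercasing, stripping accents.
tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

# 2. Pre-tokenization: split on whitespace and punctuation.
tokenizer.pre_tokenizer = Whitespace()

# 3. Tokenization model: BPE, trained on your own text files (placeholder path).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)

# 4. Post-processing: add [CLS]/[SEP] around single sentences, BERT-style.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", tokenizer.token_to_id("[CLS]")),
                    ("[SEP]", tokenizer.token_to_id("[SEP]"))],
)

print(tokenizer.encode("The pipeline for State-of-the-Art Natural Language Processing").tokens)
```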
  38.–43. Tokenizers Let’s see some code!
  44. Tokenizers How to install it?
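In the spirit of the live-coding slides 38-44, here is a minimal sketch of installing the library and encoding a sentence with a pretrained vocabulary; `Tokenizer.from_pretrained` is available in recent tokenizers releases, and the "bert-base-uncased" name is an illustrative choice.

```python
# Installation (from PyPI): pip install tokenizers
from tokenizers import Tokenizer

# Load a pretrained tokenizer definition from the Hugging Face hub.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer.encode("The pipeline for State-of-the-Art Natural Language Processing")
print(encoding.tokens)   # WordPiece tokens, including [CLS]/[SEP]
print(encoding.ids)      # vocabulary indices fed to the model
```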
  45. Transformers Using complex models shouldn’t be complicated
  46. Transformers An explosion of Transformer architectures, same API ▪ BERT: WordPiece tokenization, MLM & NSP ▪ ALBERT: SentencePiece tokenization, MLM & SOP, repeating layers ▪ GPT-2: byte-level BPE tokenization, CLM
  47. Transformers As flexible as possible Runs and trains on: ▪ CPU ▪ GPU ▪ TPU With optimizations: ▪ XLA ▪ TorchScript ▪ Half-precision ▪ Others All models BERT & RoBERTa More to come!
  48. Transformers Tokenization to prediction transformers.PreTrainedTokenizer → transformers.PreTrainedModel: “The pipeline for State-of-the-Art Natural Language Processing” → [[464, 11523, 329, 1812, 12, ..., 15417, 28403]] → Tensor(batch_size, sequence_length, hidden_size) with the base model, or a task-specific prediction with a task-specific head
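A hedged sketch of the tokenizer-to-model hand-off pictured above, using the transformers Auto classes; it assumes PyTorch and a recent transformers version, and the checkpoint name is illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenization: text -> token ids (plus attention mask), as PyTorch tensors.
inputs = tokenizer("The pipeline for State-of-the-Art Natural Language Processing",
                   return_tensors="pt")

# Prediction: the base model returns hidden states of shape
# (batch_size, sequence_length, hidden_size); a task-specific head goes on top.
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```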
  49. Transformers Available pre-trained models transformers.PreTrainedTokenizer transformers.PreTrainedModel ▪ We publicly host pre-trained tokenizer vocabularies and model weights ▪ 1,611 model/tokenizer pairs at the time of writing
  50. Transformers Pipelines transformers.Pipeline ▪ Pipelines handle both the tokenization and prediction ▪ Reasonable defaults ▪ SOTA models ▪ Customizable
  51. A few use-cases That’s where it gets interesting
  52. Transformers Sentiment analysis/Sequence classification (pipeline)
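A minimal sketch of the sentiment-analysis use-case on this slide, relying on the pipeline's default checkpoint (exact scores will vary by model version):

```python
from transformers import pipeline

# Downloads a default sentiment-analysis checkpoint on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("Using complex models shouldn't be complicated."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```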
  53. Transformers Question Answering (pipeline)
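A sketch of the question-answering pipeline with a made-up question/context pair; the default extractive QA checkpoint is used.

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")
result = question_answerer(
    question="What does Hugging Face maintain?",
    context="Hugging Face maintains the transformers and tokenizers libraries.",
)
print(result["answer"], result["score"])   # span extracted from the context
```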
  54. Transformers Causal language modeling/Text generation
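The text-generation use-case maps onto the text-generation pipeline; a sketch with GPT-2 (sampling settings left at their defaults, so output is illustrative):

```python
from transformers import pipeline

# Causal language modeling: the model extends the prompt token by token.
generator = pipeline("text-generation", model="gpt2")
print(generator("The pipeline for State-of-the-Art Natural Language Processing",
                max_length=30, num_return_sequences=1))
```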
  55.–60. Transformers Sequence Classification - Under the hood
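Slides 55-60 stepped through sequence classification without the pipeline abstraction; below is a condensed sketch of those steps, assuming PyTorch, a recent transformers version, and an illustrative fine-tuned checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Tokenize: text -> input ids + attention mask.
inputs = tokenizer("Using complex models shouldn't be complicated.", return_tensors="pt")

# 2. Forward pass: the classification head returns one logit per class.
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Convert logits to probabilities and map them to the model's label names.
probabilities = torch.softmax(logits, dim=-1)[0]
for label_id, prob in enumerate(probabilities):
    print(model.config.id2label[label_id], float(prob))
```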
  61. Transformers Training models Example scripts (TensorFlow & PyTorch): - Named Entity Recognition - Sequence Classification - Question Answering - Language modeling (fine-tuning & from scratch) - Multiple Choice Trains on TPU, CPU, GPU Example scripts for PyTorch Lightning
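Besides the example scripts, fine-tuning can also be written with the Trainer API; a hedged sketch assuming a recent transformers version and the datasets library for data, with placeholder hyperparameters:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize a small sentiment dataset (SST-2 from GLUE).
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda batch: tokenizer(batch["sentence"], truncation=True),
                      batched=True)

args = TrainingArguments(output_dir="sst2-finetuned",
                         num_train_epochs=1,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model,
                  args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  tokenizer=tokenizer)   # enables padding via the default data collator
trainer.train()
```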
  62. Transformers Just grazed the surface The transformers library covers a lot more ground: - ELECTRA - Reformer - Longformer - Encoder-decoder architectures - Translation & Summarization
  63. Transformers + Tokenizers The full pipeline? Data (🤗 nlp) → Tokenization (Tokenizers) → Prediction (Transformers) → Metrics (🤗 nlp)
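A rough sketch of the Data → Tokenization → Prediction → Metrics loop from slide 63. The 🤗 nlp library has since been renamed datasets; this assumes a datasets release that still ships load_metric, and the dataset/model choices are illustrative.

```python
from datasets import load_dataset, load_metric
from transformers import pipeline

# Data: a small slice of SST-2 (sentiment) from GLUE.
dataset = load_dataset("glue", "sst2", split="validation[:16]")

# Tokenization + prediction: a pipeline bundles the tokenizer and the model.
classifier = pipeline("sentiment-analysis")
predictions = [1 if output["label"] == "POSITIVE" else 0
               for output in classifier(dataset["sentence"])]

# Metrics: compare the predictions with the reference labels.
metric = load_metric("glue", "sst2")
print(metric.compute(predictions=predictions, references=dataset["label"]))
```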
  64. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
