Many companies today are solving a variety of NLP tasks (classification, chatbots, clustering, question-answering systems, etc.), and as experience accumulates, the most effective pipelines have started to take shape.
In this talk we draw on the best practices in the field (AllenNLP) and our own experience. We explain how to structure your pipeline and what each of its parts involves: how to organize input data properly, how to iterate over a dataset, what the vocabulary should look like, how to prepare data, and more. We give examples from real tasks and show how this approach helps with reproducibility and ease of further use.
2. You’ll find out
✓ What a good pipeline is
✓ How to process input data effectively
✓ How to build a training pipeline
✓ Why declarative syntax is useful
✓ How to turn a training pipeline into a prediction pipeline with little effort
3. Pipeline
• a sequence of processing and analysis steps applied to data for a specific purpose
• saves design and coding time if you expect to encounter similar tasks
• simplifies deployment to production
4. What is a good pipeline?
• Reusable
(can be applied to different tasks with minor changes)
• Structured
(a logical chain that is easy to understand)
• Documented
(fully commented, argument types defined, a README)
• Covered by tests
(simple checks for input data, batch shapes and model output, plus unit tests for each pipeline step)
5. Input data in NLP
• Text is always part of the input
(a token, sentence, article, dialogue, HTML page, etc.)
• Tags/labels
• Spans (start and end indexes)
6. • For text input we need to create a text field
• A Text Field should have the following methods (a minimal sketch follows the diagram below):
- preprocess (preprocesses text; argument: a list of preprocessors)
- tokenize (tokenizes text; argument: a tokenizer)
- index (indexes text; argument: a vocabulary)
- pad (pads text; argument: the maximum number of tokens)
- as_tensor (returns a tensor; the basic method for all fields)
[Diagram: Input text → Preprocess → Tokenize → Index → Pad → As tensor → Text Field]
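A minimal sketch of such a field, assuming a torch backend and a vocabulary object with a get_index lookup (both are assumptions of this sketch, not the exact AllenNLP API; count_vocab_items is a hypothetical hook the Vocabulary sketch on slide 10 will use):

import torch

class TextField:
    """Holds one text input and carries it through the processing chain."""

    def __init__(self, text):
        self.text = text
        self.tokens = None
        self.indices = None

    def preprocess(self, preprocessors):
        # each preprocessor is a plain callable str -> str
        for preprocessor in preprocessors:
            self.text = preprocessor(self.text)
        return self

    def tokenize(self, tokenizer):
        # tokenizer is a callable str -> list of str
        self.tokens = tokenizer(self.text)
        return self

    def count_vocab_items(self):
        # namespace and items this field contributes to the vocabulary
        return "text", self.tokens

    def index(self, vocab):
        # unknown tokens fall back to the OOV index inside the vocabulary
        self.indices = [vocab.get_index(t, namespace="text") for t in self.tokens]
        return self

    def pad(self, max_tokens, pad_index=0):
        # truncate or right-pad to a fixed number of tokens
        self.indices = (self.indices[:max_tokens]
                        + [pad_index] * max(0, max_tokens - len(self.indices)))
        return self

    def as_tensor(self):
        return torch.tensor(self.indices, dtype=torch.long)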
7. Label Field
• The label field should be convertible to the appropriate model format based on the vocabulary:
- index (indexes labels; argument: a vocabulary)
- as_tensor (returns a tensor with the proper shape)
[Diagram: Input labels → Index → As tensor → Label Field]
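A matching sketch under the same assumptions; labels live in their own vocabulary namespace:

import torch

class LabelField:
    """Holds one label and converts it to a tensor via the vocabulary."""

    def __init__(self, label):
        self.label = label
        self.label_index = None

    def count_vocab_items(self):
        # labels get their own namespace, separate from text tokens
        return "labels", [self.label]

    def index(self, vocab):
        self.label_index = vocab.get_index(self.label, namespace="labels")
        return self

    def as_tensor(self):
        # a scalar tensor; for one-hot output, expand to the namespace size
        return torch.tensor(self.label_index, dtype=torch.long)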
8. Other Fields
• Metadata Field – other information that will not be used for training
(raw text, author, topic, etc.)
• Categorical Field – any categorical data that can be encoded as embeddings or one-hot vectors (post type, source, etc.)
• Span Field – a start, end or inside index
• Index Field – the index of the right answer over the sequence
• Create your own fields that implement an "as_tensor" method, for comfortable use inside your model; two sketches follow
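Two of these fields are small enough to sketch in full; MetadataField returning its payload as-is mirrors how such fields are usually skipped by batching code (an assumption of this sketch):

import torch

class SpanField:
    """Start and end token indices, e.g. an answer span in a QA task."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def as_tensor(self):
        return torch.tensor([self.start, self.end], dtype=torch.long)


class MetadataField:
    """Anything the model should not train on but we want to keep around."""

    def __init__(self, metadata):
        self.metadata = metadata

    def as_tensor(self):
        # returned untouched; the batching code skips non-tensor values
        return self.metadata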
9. Instance
• An Instance is a collection of fields
• One sample – one instance
• Has type Mapping[str, Field]; all fields are keyed
• Get a field as a tensor by key ("text" and "labels" in our example)
• Should contain an "index_fields" method to index all fields with a given vocabulary
[Diagram: a Text Field keyed as "text" and a Label Field keyed as "labels" inside one Instance]
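A sketch under the same assumptions as the field sketches above:

class Instance:
    """One training sample: a mapping from field name to field."""

    def __init__(self, fields):
        # e.g. {"text": TextField(...), "labels": LabelField(...)}
        self.fields = dict(fields)

    def __getitem__(self, key):
        return self.fields[key]

    def index_fields(self, vocab):
        # index every field that needs the vocabulary
        for field in self.fields.values():
            if hasattr(field, "index"):
                field.index(vocab)
        return self

    def as_tensor_dict(self):
        return {name: field.as_tensor() for name, field in self.fields.items()}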
10. Vocabulary
• Vocabulary is a class where all vocabularies (text, labels, categories) are available by namespace
• We should be able to create a vocabulary from a list of instances, from counters or from predefined files
• Adds special tokens: OOV (out-of-vocabulary) and padding
• Should offer statistics plus "string to index" and "index to string" lookups
[Diagram: a text vocab and a label vocab as namespaces inside one Vocabulary, built from Instances]
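A sketch of such a class; the count_vocab_items hook it relies on is the hypothetical helper from the field sketches above:

from collections import Counter

class Vocabulary:
    """Per-namespace string-index mappings with OOV and padding tokens."""

    PAD, OOV = "@@PADDING@@", "@@UNKNOWN@@"

    def __init__(self):
        self._token_to_index = {}
        self._index_to_token = {}

    @classmethod
    def from_instances(cls, instances):
        vocab = cls()
        counters = {}  # namespace -> Counter of items
        for instance in instances:
            for field in instance.fields.values():
                if hasattr(field, "count_vocab_items"):
                    namespace, items = field.count_vocab_items()
                    counters.setdefault(namespace, Counter()).update(items)
        for namespace, counter in counters.items():
            tokens = [cls.PAD, cls.OOV] + sorted(counter)
            vocab._token_to_index[namespace] = {t: i for i, t in enumerate(tokens)}
            vocab._index_to_token[namespace] = dict(enumerate(tokens))
        return vocab

    def get_index(self, token, namespace="text"):
        mapping = self._token_to_index[namespace]
        return mapping.get(token, mapping[self.OOV])

    def get_token(self, index, namespace="text"):
        return self._index_to_token[namespace][index]

    def size(self, namespace="text"):
        return len(self._token_to_index[namespace])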
11. Starting a pipeline schema
[Diagram: the four numbered steps below, from input data to indexed fields]
1. Input data to Fields
2. Fields to Instance
3. List of Instances to Vocabulary
4. Index Fields with Vocabulary
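Put together, the four steps read as straight-line code (reusing the sketch classes above):

# 1. input data to fields
text_field = TextField("NLP pipelines are reusable").tokenize(str.split)
label_field = LabelField("positive")

# 2. fields to instance
instance = Instance({"text": text_field, "labels": label_field})

# 3. list of instances to vocabulary
vocab = Vocabulary.from_instances([instance])

# 4. index fields with the vocabulary
instance.index_fields(vocab)
batch_ready = instance.as_tensor_dict()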
12. Preprocessing and tokenization
• Apply preprocessing and tokenization before setting text on a field
• It is better to store the raw text in metadata (where possible) so you can verify that preprocessing and tokenization worked correctly
• Preprocessing is applied to the raw text and returns preprocessed text
• The preprocessed text is sent to a tokenizer, which returns a list of tokens
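For example, two hypothetical preprocessors chained before the simplest possible tokenizer (the names mirror the config example later, but the implementations here are purely illustrative):

import html
import re

# hypothetical preprocessors: plain str -> str callables, applied in order
def unescape_entities(text):
    return html.unescape(text)

def replace_urls(text, tag=" urlTag "):
    return re.sub(r"https?://\S+", tag, text)

raw = "Check &amp; read https://example.com now"
preprocessed = replace_urls(unescape_entities(raw))
tokens = preprocessed.split()  # the simplest possible tokenizer
# keep `raw` in a metadata field so the chain can be audited later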
13. Dataset Reader
• Our pipeline schema should be used in a dataset reader
• Each task should have its own dataset reader, in which you construct the instances and the vocabulary
• Can be lazy (stream instances instead of loading everything into memory)
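A sketch of a reader for a tab-separated text/label file (the file format is an assumption; the field and Instance classes are the sketches above):

class ClassificationDatasetReader:
    """One task, one reader: turns raw files into instances."""

    def __init__(self, preprocessors, tokenizer, lazy=False):
        self.preprocessors = preprocessors
        self.tokenizer = tokenizer
        self.lazy = lazy

    def _instances(self, data_path):
        with open(data_path, encoding="utf-8") as f:
            for line in f:
                text, label = line.rstrip("\n").split("\t")
                text_field = (TextField(text)
                              .preprocess(self.preprocessors)
                              .tokenize(self.tokenizer))
                yield Instance({"text": text_field,
                                "labels": LabelField(label)})

    def read(self, data_path):
        generator = self._instances(data_path)
        # a lazy reader streams instances; an eager one materializes them
        return generator if self.lazy else list(generator)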
14. Iterator
• Gets output from the Dataset Reader
• Types:
- a "Simple Iterator" just shuffles the data
- a "Bucket Iterator" groups sequences of similar length into batches, which minimizes padding
• Indexes and pads fields during training
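A minimal bucket iterator sketch, assuming instances have already been indexed so their text fields can be padded:

import random

def bucket_iterator(instances, batch_size):
    """Group sequences of similar length to minimize padding per batch."""
    instances = sorted(instances, key=lambda inst: len(inst["text"].tokens))
    batches = [instances[start:start + batch_size]
               for start in range(0, len(instances), batch_size)]
    random.shuffle(batches)  # keep epochs stochastic at the batch level
    for batch in batches:
        # pad every text field in the batch to the longest sequence in it
        longest = max(len(inst["text"].tokens) for inst in batch)
        for inst in batch:
            inst["text"].pad(longest)
        yield batch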
15. Trainer
• Arguments:
- train and validation datasets (lists of Instances)
- an iterator
- the model definition
- an optimizer and a loss
- the training configuration (number of epochs, shuffling, metrics)
- callbacks (early stopping, learning-rate schedule, logging, etc.)
• The whole training process is driven by the Trainer class
[Diagram: Dataset, Iterator, Model, Optimizer, Configuration and Callbacks all feed into the Trainer]
16. Config
• a config file in JSON or XML format
• all pipeline steps are defined in the config file
• declarative syntax lets you change an experiment without changing code
"data_folder": "data/training_data",
"dataset_reader": {
"type": "multilabel_classification"
}
"preprocessing": {
"lower": true,
"max_seq_len": 150,
"include_length": true,
"label_one_hot": true,
“preprocessors": [ ["HtmlEntitiesUnescaper"], ["BoldTagReplacer"],
["HtmlTagReplacer", [" "]], ["URLReplacer", [" urlTag “]]]
}
Data
"type": "bucket_iterator",
"params": {
"batch_size": 1000,
"shuffle": true,
"sort_key": {
"field": "text",
"type": "length"
}
}
Iterator
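The declarative style usually rests on a small registry that maps the "type" string in the config to a class; a sketch, assuming the config above is saved as config.json:

import json

READERS = {}  # maps the config "type" string to a reader class

def register_reader(name):
    def decorator(cls):
        READERS[name] = cls
        return cls
    return decorator

@register_reader("multilabel_classification")
class MultilabelClassificationReader:
    def __init__(self, preprocessing):
        self.preprocessing = preprocessing

def get_reader(reader_config, preprocessing):
    # the "type" key picks the class; swapping readers needs no code change
    return READERS[reader_config["type"]](preprocessing)

with open("config.json") as f:
    config = json.load(f)
reader = get_reader(config["dataset_reader"], preprocessing=config["preprocessing"])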
18. Saving
• The config file (you can always reproduce an experiment if you have its config file)
• The vocabulary (to keep embedding indexes and labels matched)
• The model weights (to load the model back)
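A sketch of saving all three artifacts side by side (vocab.save is a hypothetical method; the weights assume a torch model):

import json
import os
import torch

def save_experiment(model_path, config, vocab, model):
    os.makedirs(model_path, exist_ok=True)
    # the config is enough to re-run the experiment from scratch
    with open(os.path.join(model_path, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
    # the vocabulary keeps embedding and label indexes stable at load time
    vocab.save(os.path.join(model_path, "vocab"))  # hypothetical method
    # the weights are restored later with model.load_state_dict(torch.load(...))
    torch.save(model.state_dict(), os.path.join(model_path, "model"))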
19. Predictor
• Input data in JSON format (the API format)
• A Dataset Reader adapted to the JSON format
• An Iterator for batch processing
• Options to get predictions with probabilities, return the top-5 most probable labels, change thresholds, and other options for validating results; a small sketch of the top-k step follows
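As an illustration, the top-k step could look like this (get_token is from the Vocabulary sketch above; logits is one sample's raw model output):

import torch

def top_k_labels(logits, vocab, k=5, threshold=0.0):
    """Turn raw model output into the k most probable labels."""
    probabilities = torch.softmax(logits, dim=-1)
    values, indices = probabilities.topk(k)
    return [(vocab.get_token(index.item(), namespace="labels"), value.item())
            for value, index in zip(values, indices)
            if value.item() >= threshold]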
21. Training
import os
# training_config is the parsed JSON config described above
# define reader
reader = get_reader(training_config["dataset_reader"], preprocessing=training_config["preprocessing"])
# load train and val datasets
train_dataset = reader.read(data_path=os.path.join(training_config["data_folder"], "train_data.p"))
val_dataset = reader.read(data_path=os.path.join(training_config["data_folder"], "val_data.p"))
# get vocabulary and set token vectors
vocab = Vocabulary.from_dataset(train_dataset + val_dataset)
vocab.set_vectors(training_config["embeddings"])
# define data iterator
iterator = get_iterator(training_config["iterator"], vocab=vocab)
# define model, optimizer, loss and metrics
model = get_model(training_config["model"], vocab)
loss = get_loss(training_config["loss"]["type"], training_config["loss"]["params"])
opt = get_opt(training_config["optimizer"], model.parameters())
metrics = get_metrics(training_config["metrics"])
# define trainer — pass the datasets and iterator too, per the argument list above
trainer = Trainer(training_params=training_config["training_params"],
                  train_dataset=train_dataset, val_dataset=val_dataset,
                  iterator=iterator, model=model, loss=loss,
                  optimizer=opt, metrics=metrics)
# train model
trainer.train()
30. Predicting
import os
# config is the parsed JSON config; the "deploy" section holds prediction-time settings
# define reader
reader = get_reader(config["deploy"]["dataset_reader"], preprocessing=config["preprocessing"])
# load vocab
vocab = Vocabulary.from_file(os.path.join(config["model_path"], "vocab"))
# define iterator
iterator = get_iterator(config["deploy"]["iterator"], vocab=vocab)
# load model
model = load_model(os.path.join(config["model_path"], "model"), vocab=vocab)
# define predictor
predictor = Predictor(model=model, reader=reader, iterator=iterator)
# function which predicts your input
def predict(data):
    return predictor.predict(data)
38. Conclusions
• Abstractions make a pipeline more structured, logical and understandable
• Declarative syntax helps keep a pipeline simple and lets you define the whole experiment in a config without changing code
• The production pipeline is almost the same as the training one
• With a good foundation, prototyping becomes very fast