Many companies today are solving a variety of NLP tasks (classification, chatbots, clustering, question-answering systems, etc.), and as experience accumulates, the most effective pipelines have started to take shape.
In this talk we draw on the best practices in the field (AllenNLP) and our own experience. We explain how to structure your pipeline and what each of its parts involves: how to organize input data properly, how to iterate over a dataset, what the vocabulary should look like, how to prepare data, and more. We give examples from real tasks and show how this approach helps with reproducibility and ease of further use.
2. You’ll find out
✓ What a good pipeline is
✓ How to process input data effectively
✓ How to build a training pipeline
✓ Why declarative syntax is useful
✓ How to turn a training pipeline into a prediction pipeline with little effort
3. Pipeline
• a sequence of processing and analysis steps applied to data for a specific purpose
• saves design and coding time if you expect to encounter similar tasks
• simplifies deployment to production
4. What is a good pipeline?
• Reusable
(can be applied to different tasks with minor changes)
• Structured
(a logical chain that is easy to understand)
• Documented
(fully commented, argument types defined, a README)
• Covered by tests
(simple checks for input data, batch shapes and model output, plus unit tests for each pipeline step)
5. Input data in NLP
• Text is always part of the input
(a token, sentence, article, dialogue, HTML page, etc.)
• Tags/labels
• Spans (start and end indexes)
6. • For text input we need to create a text field
• A Text Field should have the following methods (a minimal sketch follows the diagram below):
- preprocess (preprocesses text; argument: a list of preprocessors)
- tokenize (tokenizes text; argument: a tokenizer)
- index (indexes text; argument: a vocabulary)
- pad (pads text; argument: the maximum number of tokens)
- as_tensor (returns a tensor; the basic method for all fields)
[Diagram: Input text → Preprocess → Tokenize → Index → Pad → As tensor → Text Field]
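A minimal sketch of such a field, assuming a torch backend and a vocabulary object with a get_index lookup (both are assumptions of this sketch, not the exact AllenNLP API; count_vocab_items is a hypothetical hook the Vocabulary sketch on slide 10 will use):

import torch

class TextField:
    """Holds one text input and carries it through the processing chain."""

    def __init__(self, text):
        self.text = text
        self.tokens = None
        self.indices = None

    def preprocess(self, preprocessors):
        # each preprocessor is a plain callable str -> str
        for preprocessor in preprocessors:
            self.text = preprocessor(self.text)
        return self

    def tokenize(self, tokenizer):
        # tokenizer is a callable str -> list of str
        self.tokens = tokenizer(self.text)
        return self

    def count_vocab_items(self):
        # namespace and items this field contributes to the vocabulary
        return "text", self.tokens

    def index(self, vocab):
        # unknown tokens fall back to the OOV index inside the vocabulary
        self.indices = [vocab.get_index(t, namespace="text") for t in self.tokens]
        return self

    def pad(self, max_tokens, pad_index=0):
        # truncate or right-pad to a fixed number of tokens
        self.indices = (self.indices[:max_tokens]
                        + [pad_index] * max(0, max_tokens - len(self.indices)))
        return self

    def as_tensor(self):
        return torch.tensor(self.indices, dtype=torch.long)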
7. Label Field
• The label field should be convertible to the appropriate model format based on the vocabulary:
- index (indexes labels; argument: a vocabulary)
- as_tensor (returns a tensor with the proper shape)
[Diagram: Input labels → Index → As tensor → Label Field]
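A matching sketch under the same assumptions; labels live in their own vocabulary namespace:

import torch

class LabelField:
    """Holds one label and converts it to a tensor via the vocabulary."""

    def __init__(self, label):
        self.label = label
        self.label_index = None

    def count_vocab_items(self):
        # labels get their own namespace, separate from text tokens
        return "labels", [self.label]

    def index(self, vocab):
        self.label_index = vocab.get_index(self.label, namespace="labels")
        return self

    def as_tensor(self):
        # a scalar tensor; for one-hot output, expand to the namespace size
        return torch.tensor(self.label_index, dtype=torch.long)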
8. Other Fields
• Metadata Field – other information that will not be used for training
(raw text, author, topic, etc.)
• Categorical Field – any categorical data that can be encoded as embeddings or one-hot vectors (post type, source, etc.)
• Span Field – a start, end or inside index
• Index Field – the index of the right answer over the sequence
• Create your own fields that implement an "as_tensor" method, for comfortable use inside your model; two sketches follow
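Two of these fields are small enough to sketch in full; MetadataField returning its payload as-is mirrors how such fields are usually skipped by batching code (an assumption of this sketch):

import torch

class SpanField:
    """Start and end token indices, e.g. an answer span in a QA task."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def as_tensor(self):
        return torch.tensor([self.start, self.end], dtype=torch.long)


class MetadataField:
    """Anything the model should not train on but we want to keep around."""

    def __init__(self, metadata):
        self.metadata = metadata

    def as_tensor(self):
        # returned untouched; the batching code skips non-tensor values
        return self.metadata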
9. Instance
• An Instance is a collection of fields
• One sample – one instance
• Has type Mapping[str, Field]; all fields are keyed
• Get a field as a tensor by key ("text" and "labels" in our example)
• Should contain an "index_fields" method to index all fields with a given vocabulary
[Diagram: a Text Field keyed as "text" and a Label Field keyed as "labels" inside one Instance]
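A sketch under the same assumptions as the field sketches above:

class Instance:
    """One training sample: a mapping from field name to field."""

    def __init__(self, fields):
        # e.g. {"text": TextField(...), "labels": LabelField(...)}
        self.fields = dict(fields)

    def __getitem__(self, key):
        return self.fields[key]

    def index_fields(self, vocab):
        # index every field that needs the vocabulary
        for field in self.fields.values():
            if hasattr(field, "index"):
                field.index(vocab)
        return self

    def as_tensor_dict(self):
        return {name: field.as_tensor() for name, field in self.fields.items()}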
10. Vocabulary
• Vocabulary is a class where all vocabularies (text, labels, categories) are available by namespace
• We should be able to create a vocabulary from a list of instances, from counters or from predefined files
• Adds special tokens: OOV (out-of-vocabulary) and padding
• Should offer statistics plus "string to index" and "index to string" lookups
[Diagram: a text vocab and a label vocab as namespaces inside one Vocabulary, built from Instances]
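A sketch of such a class; the count_vocab_items hook it relies on is the hypothetical helper from the field sketches above:

from collections import Counter

class Vocabulary:
    """Per-namespace string-index mappings with OOV and padding tokens."""

    PAD, OOV = "@@PADDING@@", "@@UNKNOWN@@"

    def __init__(self):
        self._token_to_index = {}
        self._index_to_token = {}

    @classmethod
    def from_instances(cls, instances):
        vocab = cls()
        counters = {}  # namespace -> Counter of items
        for instance in instances:
            for field in instance.fields.values():
                if hasattr(field, "count_vocab_items"):
                    namespace, items = field.count_vocab_items()
                    counters.setdefault(namespace, Counter()).update(items)
        for namespace, counter in counters.items():
            tokens = [cls.PAD, cls.OOV] + sorted(counter)
            vocab._token_to_index[namespace] = {t: i for i, t in enumerate(tokens)}
            vocab._index_to_token[namespace] = dict(enumerate(tokens))
        return vocab

    def get_index(self, token, namespace="text"):
        mapping = self._token_to_index[namespace]
        return mapping.get(token, mapping[self.OOV])

    def get_token(self, index, namespace="text"):
        return self._index_to_token[namespace][index]

    def size(self, namespace="text"):
        return len(self._token_to_index[namespace])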
11. Starting a pipeline schema
[Diagram: the four numbered steps below, from input data to indexed fields]
1. Input data to Fields
2. Fields to Instance
3. List of Instances to Vocabulary
4. Index Fields with Vocabulary
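Put together, the four steps read as straight-line code (reusing the sketch classes above):

# 1. input data to fields
text_field = TextField("NLP pipelines are reusable").tokenize(str.split)
label_field = LabelField("positive")

# 2. fields to instance
instance = Instance({"text": text_field, "labels": label_field})

# 3. list of instances to vocabulary
vocab = Vocabulary.from_instances([instance])

# 4. index fields with the vocabulary
instance.index_fields(vocab)
batch_ready = instance.as_tensor_dict()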
12. Preprocessing and tokenization
• Apply preprocessing and tokenization before setting text on a field
• It is better to store the raw text in metadata (where possible) so you can verify that preprocessing and tokenization worked correctly
• Preprocessing is applied to the raw text and returns preprocessed text
• The preprocessed text is sent to a tokenizer, which returns a list of tokens
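For example, two hypothetical preprocessors chained before the simplest possible tokenizer (the names mirror the config example later, but the implementations here are purely illustrative):

import html
import re

# hypothetical preprocessors: plain str -> str callables, applied in order
def unescape_entities(text):
    return html.unescape(text)

def replace_urls(text, tag=" urlTag "):
    return re.sub(r"https?://\S+", tag, text)

raw = "Check &amp; read https://example.com now"
preprocessed = replace_urls(unescape_entities(raw))
tokens = preprocessed.split()  # the simplest possible tokenizer
# keep `raw` in a metadata field so the chain can be audited later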
13. Dataset Reader
• Our pipeline schema should be used in a dataset reader
• Each task should have its own dataset reader, in which you construct the instances and the vocabulary
• Can be lazy (stream instances instead of loading everything into memory)
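A sketch of a reader for a tab-separated text/label file (the file format is an assumption; the field and Instance classes are the sketches above):

class ClassificationDatasetReader:
    """One task, one reader: turns raw files into instances."""

    def __init__(self, preprocessors, tokenizer, lazy=False):
        self.preprocessors = preprocessors
        self.tokenizer = tokenizer
        self.lazy = lazy

    def _instances(self, data_path):
        with open(data_path, encoding="utf-8") as f:
            for line in f:
                text, label = line.rstrip("\n").split("\t")
                text_field = (TextField(text)
                              .preprocess(self.preprocessors)
                              .tokenize(self.tokenizer))
                yield Instance({"text": text_field,
                                "labels": LabelField(label)})

    def read(self, data_path):
        generator = self._instances(data_path)
        # a lazy reader streams instances; an eager one materializes them
        return generator if self.lazy else list(generator)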
14. Iterator
• Gets output from the Dataset Reader
• Types:
- a "Simple Iterator" just shuffles the data
- a "Bucket Iterator" groups sequences of similar length into batches, which minimizes padding
• Indexes and pads fields during training
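A minimal bucket iterator sketch, assuming instances have already been indexed so their text fields can be padded:

import random

def bucket_iterator(instances, batch_size):
    """Group sequences of similar length to minimize padding per batch."""
    instances = sorted(instances, key=lambda inst: len(inst["text"].tokens))
    batches = [instances[start:start + batch_size]
               for start in range(0, len(instances), batch_size)]
    random.shuffle(batches)  # keep epochs stochastic at the batch level
    for batch in batches:
        # pad every text field in the batch to the longest sequence in it
        longest = max(len(inst["text"].tokens) for inst in batch)
        for inst in batch:
            inst["text"].pad(longest)
        yield batch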
15. Trainer
• Arguments:
- train and validation datasets (lists of Instances)
- an iterator
- the model definition
- an optimizer and a loss
- the training configuration (number of epochs, shuffling, metrics)
- callbacks (early stopping, learning-rate schedule, logging, etc.)
• The whole training process is driven by the Trainer class
[Diagram: Dataset, Iterator, Model, Optimizer, Configuration and Callbacks all feed into the Trainer]
16. Config
• a config file in JSON or XML format
• all pipeline steps are defined in the config file
• declarative syntax lets you change an experiment without changing code
"data_folder": "data/training_data",
"dataset_reader": {
"type": "multilabel_classification"
}
"preprocessing": {
"lower": true,
"max_seq_len": 150,
"include_length": true,
"label_one_hot": true,
“preprocessors": [ ["HtmlEntitiesUnescaper"], ["BoldTagReplacer"],
["HtmlTagReplacer", [" "]], ["URLReplacer", [" urlTag “]]]
}
Data
"type": "bucket_iterator",
"params": {
"batch_size": 1000,
"shuffle": true,
"sort_key": {
"field": "text",
"type": "length"
}
}
Iterator
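The declarative style usually rests on a small registry that maps the "type" string in the config to a class; a sketch, assuming the config above is saved as config.json:

import json

READERS = {}  # maps the config "type" string to a reader class

def register_reader(name):
    def decorator(cls):
        READERS[name] = cls
        return cls
    return decorator

@register_reader("multilabel_classification")
class MultilabelClassificationReader:
    def __init__(self, preprocessing):
        self.preprocessing = preprocessing

def get_reader(reader_config, preprocessing):
    # the "type" key picks the class; swapping readers needs no code change
    return READERS[reader_config["type"]](preprocessing)

with open("config.json") as f:
    config = json.load(f)
reader = get_reader(config["dataset_reader"], preprocessing=config["preprocessing"])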
18. Saving
• The config file (you can always reproduce an experiment if you have its config file)
• The vocabulary (to keep embedding indexes and labels matched)
• The model weights (to load the model back)
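A sketch of saving all three artifacts side by side (vocab.save is a hypothetical method; the weights assume a torch model):

import json
import os
import torch

def save_experiment(model_path, config, vocab, model):
    os.makedirs(model_path, exist_ok=True)
    # the config is enough to re-run the experiment from scratch
    with open(os.path.join(model_path, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
    # the vocabulary keeps embedding and label indexes stable at load time
    vocab.save(os.path.join(model_path, "vocab"))  # hypothetical method
    # the weights are restored later with model.load_state_dict(torch.load(...))
    torch.save(model.state_dict(), os.path.join(model_path, "model"))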
19. Predictor
• Input data in JSON format (the API format)
• A Dataset Reader adapted to the JSON format
• An Iterator for batch processing
• Options to get predictions with probabilities, return the top-5 most probable labels, change thresholds, and other options for validating results; a small sketch of the top-k step follows
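As an illustration, the top-k step could look like this (get_token is from the Vocabulary sketch above; logits is one sample's raw model output):

import torch

def top_k_labels(logits, vocab, k=5, threshold=0.0):
    """Turn raw model output into the k most probable labels."""
    probabilities = torch.softmax(logits, dim=-1)
    values, indices = probabilities.topk(k)
    return [(vocab.get_token(index.item(), namespace="labels"), value.item())
            for value, index in zip(values, indices)
            if value.item() >= threshold]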
21. Training
import os
# training_config is the parsed JSON config described above
# define reader
reader = get_reader(training_config["dataset_reader"], preprocessing=training_config["preprocessing"])
# load train and val datasets
train_dataset = reader.read(data_path=os.path.join(training_config["data_folder"], "train_data.p"))
val_dataset = reader.read(data_path=os.path.join(training_config["data_folder"], "val_data.p"))
# get vocabulary and set token vectors
vocab = Vocabulary.from_dataset(train_dataset + val_dataset)
vocab.set_vectors(training_config["embeddings"])
# define data iterator
iterator = get_iterator(training_config["iterator"], vocab=vocab)
# define model, optimizer, loss and metrics
model = get_model(training_config["model"], vocab)
loss = get_loss(training_config["loss"]["type"], training_config["loss"]["params"])
opt = get_opt(training_config["optimizer"], model.parameters())
metrics = get_metrics(training_config["metrics"])
# define trainer — pass the datasets and iterator too, per the argument list above
trainer = Trainer(training_params=training_config["training_params"],
                  train_dataset=train_dataset, val_dataset=val_dataset,
                  iterator=iterator, model=model, loss=loss,
                  optimizer=opt, metrics=metrics)
# train model
trainer.train()
30. Predicting
import os
# config is the parsed JSON config; the "deploy" section holds prediction-time settings
# define reader
reader = get_reader(config["deploy"]["dataset_reader"], preprocessing=config["preprocessing"])
# load vocab
vocab = Vocabulary.from_file(os.path.join(config["model_path"], "vocab"))
# define iterator
iterator = get_iterator(config["deploy"]["iterator"], vocab=vocab)
# load model
model = load_model(os.path.join(config["model_path"], "model"), vocab=vocab)
# define predictor
predictor = Predictor(model=model, reader=reader, iterator=iterator)
# function which predicts your input
def predict(data):
    return predictor.predict(data)
38. Conclusions
• Abstractions make a pipeline more structured, logical and understandable
• Declarative syntax helps keep a pipeline simple and lets you define the whole experiment in a config without changing code
• The production pipeline is almost the same as the training one
• With a good foundation, prototyping becomes very fast