Yong Liu
Corey Zumar
High Performance Transfer Learning for
Classifying Intent of Sales Engagement
Emails: An Experimental Study
Outline
• Data Science Research Objectives
• Sales Engagement Platform (SEP)
• Use Cases and Technical Challenges
• Experiments and Datasets
• Results
• MLflow Integration and Experiments Tracking
• Summary and Future Work
Data Science Research Objectives
• Establish a high performance transfer learning
evaluation framework for email classification
• Three research questions:
– Which embeddings and pre-trained LMs are to be used?
– Which transfer learning implementation strategies (feature-
based vs. fine-tuning) are to be used?
– How many labeled samples are needed?
Sales Engagement Platform (SEP)
• A new category of software
[Diagram] The Sales Engagement Platform (SEP) (e.g., Outreach) sits between CRMs (e.g., Salesforce, Microsoft Dynamics, SAP) and sales reps.
SEP Encodes and Automates Sales
Activities into Workflows/Pipelines
• Automates execution and capture of activities (e.g., emails) and records them in a CRM
• Schedules and reminds the rep when it is the right time to do manual tasks (e.g., phone call, custom manual email)
• Enables reps to perform one-on-one personalized outreach to up to 10x more prospects than before
Why Email Intent Classification Is
Needed
• Email content is critical for driving results for
prospecting and other stages of the sales process
• A replier's intent-based email metric (e.g., positive, objection, unsubscribe) is much more informative than a simple "reply rate"
• A/B testing with a better metric can pick the winning email content/template more confidently
Why Email Intent Classification Is
Challenging @ SEP
• Different contexts and players: different roles are involved throughout the sales process and across different orgs
• Limited labeled sales-engagement-domain emails: GDPR and privacy/compliance constraints; labeling emails is time-consuming and, in many orgs on a SEP, not possible at all
Why Transfer Learning?
• Using pretrained language models opens the door to high performance transfer learning (HPTL):
– Fewer training samples
– Better accuracy
– Reduced model training time and engineering complexity
• Pretrained language models such as BERT have achieved state-of-the-art scores on the NLP GLUE leaderboard (https://gluebenchmark.com/)
– However, whether such benchmark success translates readily to practical applications is still unknown
A List of Pretrained LMs and
Embeddings for Experiments
• GloVe
– count-based, context-free word embeddings released in 2014
• ELMo
– context-aware, character-based embeddings built on a recurrent neural network (RNN) architecture, released in 2018
• Flair
– contextual string embeddings released in 2018
• BERT
– state-of-the-art transformer-based deep bidirectional language model released in late 2018 by Google
(a feature-based usage sketch for these embeddings follows below)
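As a concrete illustration of the feature-based approach, here is a minimal sketch of stacking several of these pretrained embeddings into a single email representation with the Flair framework; the embedding identifiers ("glove", "news-forward", "bert-base-uncased") and the exact Flair 0.4-era API are assumptions, not the experiments' actual code.

from flair.data import Sentence
from flair.embeddings import (WordEmbeddings, FlairEmbeddings,
                              BertEmbeddings, DocumentPoolEmbeddings)

# Feature-based: the pretrained LMs stay frozen and act only as featurizers
glove = WordEmbeddings("glove")                  # count-based, context-free
flair_fwd = FlairEmbeddings("news-forward")      # contextual string embeddings
flair_bwd = FlairEmbeddings("news-backward")
bert = BertEmbeddings("bert-base-uncased")       # transformer-based LM

# Pool token-level embeddings into one fixed-length vector per email
doc_embedding = DocumentPoolEmbeddings([glove, flair_fwd, flair_bwd, bert])

email = Sentence("Please remove me from your email list.")
doc_embedding.embed(email)
features = email.get_embedding()   # input to a downstream intent classifier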
Experimental Email Dataset
Example Intents and Emails
• Positive: "Actually, I'd be interested in talking Friday. Do you have some time around 10am?"
• Objection: "Thanks for reaching out. This is not something I am interested in at this time."
• Unsubscribe: "Please remove me from your email list."
• Not-sure: "Mike, in regards to? John"
Two Sets of Experiment Runs
• Using different pretrained language models (LMs) and embeddings: feature-based vs. fine-tuning
– Using the full set of training examples
• Different labeled training sizes with the feature-based and fine-tuning approaches
– Increasingly larger training sizes: 50, 100, 200, 300, 500, 1000, 2000, 3000 (a sketch of this sweep follows below)
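A minimal sketch of how this training-size sweep can be driven and logged with MLflow; train_df, test_df, and train_and_evaluate are hypothetical stand-ins for the actual data splits and for either the feature-based or the fine-tuning pipeline.

import mlflow

TRAIN_SIZES = [50, 100, 200, 300, 500, 1000, 2000, 3000]

for size in TRAIN_SIZES:
    with mlflow.start_run():
        mlflow.log_param("approach", "bert-finetuning")   # or "feature-based"
        mlflow.log_param("train_size", size)
        # train_and_evaluate: hypothetical helper returning the test F1 score
        f1 = train_and_evaluate(train_df.sample(n=size, random_state=42), test_df)
        mlflow.log_metric("micro_avg_f1_score", f1)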
Result (1): Different Embeddings
• BERT fine-tuning has the best F1 score
• Among feature-based approaches, GloVe performs slightly better
• Classical ML baselines such as LightGBM + TF-IDF underperform both BERT fine-tuning and the feature-based approaches
Result (2): Scaling Effect with
Different Training Sample Sizes
• BERT fine-tuning outperforms all feature-based approaches when the training example size is greater than 300
• When the training size is small (< 100), BERT+Flair (feature-based) performs better
• To achieve an F1 score > 0.8, BERT fine-tuning needs at least 500 training examples, while the feature-based approach needs at least 2000 training examples
Introducing MLflow
• Open machine learning platform
• Works with any ML library & language
• Runs the same way anywhere (e.g., any cloud)
• Designed to be useful for 1 or 1000+ person orgs
• Integrates with Databricks
MLflow Components
• Tracking: record and query experiments (code, configs, results, etc.)
• Projects: packaging format for reproducible runs on any platform
• Models: general model format that supports diverse deployment tools
Key Concepts in Tracking
Parameters: key-value inputs to your code
Metrics: numeric values (can update over time)
Artifacts: arbitrary files, including data and models
Source: training code that ran
Version: version of the training code
Tags and Notes: any additional info
MLflow Tracking: Example Code
import mlflow
import mlflow.tensorflow

with mlflow.start_run():
    mlflow.log_param("layers", layers)
    mlflow.log_param("alpha", alpha)
    # ... train model ...
    mlflow.log_metric("mse", model.mse())
    mlflow.log_artifact(model.plot(test_df))    # plot() is assumed to return a local image path
    mlflow.tensorflow.log_model(model, "model")
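Runs logged this way can be browsed and compared side by side with the local mlflow ui command, or on a remote tracking server such as the managed one in Databricks.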
Model Format
[Diagram] MLflow Models: a standard format for ML models that supports multiple "flavors" (Flavor 1, Flavor 2, ...); ML frameworks and inference code log models into the format, while batch & stream scoring and serving tools consume them.
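As a hedged illustration of how a model logged in this format is consumed downstream, a model saved under the artifact path "model" (as in the earlier tracking example) can be loaded back through the generic python_function flavor; the run ID below is a placeholder, not a real run.

import mlflow.pyfunc

# Load the model via the framework-agnostic python_function flavor
model = mlflow.pyfunc.load_model("runs:/<run_id>/model")

# Score a batch, e.g. a pandas DataFrame of prepared emails
predictions = model.predict(test_df)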
MLflow to Manage Hundreds of
Experiments
• PyTorch models for the feature-based approach
– Using the Flair framework
• TensorFlow for BERT fine-tuning
– Using the bert-tensorhub framework
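One way to keep hundreds of runs across the two frameworks above comparable is to group them under a single MLflow experiment and tag each run with its approach and framework; the experiment name and tag values below are illustrative assumptions, not the exact setup.

import mlflow

mlflow.set_experiment("email-intent-classification")   # hypothetical experiment name

with mlflow.start_run():
    mlflow.set_tag("approach", "feature-based")    # or "fine-tuning"
    mlflow.set_tag("framework", "pytorch-flair")   # or "tensorflow-bert-tensorhub"
    mlflow.log_param("embedding", "glove")         # illustrative parameter
    # ... training code for this configuration ...
    f1 = 0.0  # placeholder; replace with the score computed by the training code
    mlflow.log_metric("micro_avg_f1_score", f1)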
MLflow Tracking All Experiments
MLflow Logs
Artifacts/Parameters/Metrics/Models
mlflow.log_metric("micro_avg_f1_score_avg", np.asarray(test_scores).mean())
Images Can Be Logged as Artifacts
mlflow.log_artifact(tSNE_img, 'run_{0}'.format(run_id))
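For completeness, a minimal sketch of how such a t-SNE figure might be produced and logged; the embedding matrix X and intent labels y are hypothetical stand-ins for the document embeddings and labels used in the experiments.

import matplotlib
matplotlib.use("Agg")               # headless rendering
import matplotlib.pyplot as plt
import mlflow
from sklearn.manifold import TSNE

# X: (n_emails, embedding_dim) document embeddings; y: integer intent labels
coords = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE of email embeddings by intent")

tSNE_img = "tsne_intents.png"
plt.savefig(tSNE_img)

with mlflow.start_run() as run:
    mlflow.log_artifact(tSNE_img, 'run_{0}'.format(run.info.run_id))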
Summary
• Transfer learning by fine-tuning BERT outperforms all feature-based approaches using different embeddings/pretrained LMs when the training example size is greater than 300
• Pretrained language models solve the cold-start problem when there is very little training data
– E.g., with as few as 50 labeled examples, the F1 score reaches 0.67 with BERT+Flair using the feature-based approach
• However, reaching an F1 score > 0.8 may still require one to two thousand examples for a feature-based approach, or about 500 examples for fine-tuning a pre-trained BERT language model
• MLflow proved useful and powerful for tracking all the experiments
Future Work
• MLflow: from experimentation to production
– Pick the best model for deployment
• Extend to cross-org transfer learning
– Using data from one or multiple orgs for training and then applying the model to other orgs
Acknowledgements
• Outreach Data Science Team
• Databricks MLflow team
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
