Yong Liu
Corey Zumar
High Performance Transfer Learning for
Classifying Intent of Sales Engagement
Emails: An Experimental Study
Outline
• Data Science Research Objectives
• Sales Engagement Platform (SEP)
• Use Cases and Technical Challenges
• Experiments and Datasets
• Results
• MLflow Integration and Experiments Tracking
• Summary and Future Work
Data Science Research Objectives
• Establish a high performance transfer learning
evaluation framework for email classification
• Three research questions:
– Which embeddings and pre-trained LMs are to be used?
– Which transfer learning implementation strategies (feature-
based vs. fine-tuning) are to be used?
– How many labeled samples are needed?
Sales Engagement Platform (SEP)
• A new category of software
[Diagram] The Sales Engagement Platform (SEP) (e.g., Outreach) sits between CRMs (e.g., Salesforce, Microsoft Dynamics, SAP) and sales reps.
SEP Encodes and Automates Sales
Activities into Workflows/Pipelines
• Automates execution and capture of activities (e.g., emails) and records them in a CRM
• Schedules and reminds the rep when it is the right time to do manual tasks (e.g., phone call, custom manual email)
• Enables reps to perform one-on-one personalized outreach to up to 10x more prospects than before
Why Email Intent Classification Is
Needed
• Email content is critical for driving results for
prospecting and other stages of the sales process
• A replier's intent-based email metric (e.g., positive, objection, unsubscribe) is much more informative than a simple "reply rate"
• A/B testing with a better metric can pick the winning email content/template more confidently
Why Email Intent Classification Is
Challenging @ SEP
• Different contexts and players: different roles are involved throughout the sales process and across different orgs
• Limited labeled sales-engagement-domain emails: GDPR and privacy/compliance constraints; labeling emails is time-consuming and, in many orgs on a SEP, not possible at all
Why Transfer Learning?
• Using pretrained language models opens the door to high performance transfer learning (HPTL):
– Fewer training samples
– Better accuracy
– Reduced model training time and engineering complexity
• Pretrained language models such as BERT have achieved state-of-the-art scores on the NLP GLUE leaderboard (https://gluebenchmark.com/)
– However, whether such benchmark success translates readily to practical applications is still unknown
A List of Pretrained LMs and
Embeddings for Experiments
• GloVe
– count-based, context-free word embeddings released in 2014
• ELMo
– context-aware, character-based embeddings built on a recurrent neural network (RNN) architecture, released in 2018
• Flair
– contextual string embeddings released in 2018
• BERT
– state-of-the-art transformer-based deep bidirectional language model released in late 2018 by Google
(a feature-based usage sketch for these embeddings follows below)
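As a concrete illustration of the feature-based approach, here is a minimal sketch of stacking several of these pretrained embeddings into a single email representation with the Flair framework; the embedding identifiers ("glove", "news-forward", "bert-base-uncased") and the exact Flair 0.4-era API are assumptions, not the experiments' actual code.

from flair.data import Sentence
from flair.embeddings import (WordEmbeddings, FlairEmbeddings,
                              BertEmbeddings, DocumentPoolEmbeddings)

# Feature-based: the pretrained LMs stay frozen and act only as featurizers
glove = WordEmbeddings("glove")                  # count-based, context-free
flair_fwd = FlairEmbeddings("news-forward")      # contextual string embeddings
flair_bwd = FlairEmbeddings("news-backward")
bert = BertEmbeddings("bert-base-uncased")       # transformer-based LM

# Pool token-level embeddings into one fixed-length vector per email
doc_embedding = DocumentPoolEmbeddings([glove, flair_fwd, flair_bwd, bert])

email = Sentence("Please remove me from your email list.")
doc_embedding.embed(email)
features = email.get_embedding()   # input to a downstream intent classifier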
Experimental Email Dataset
Example Intents and Emails
• Positive: "Actually, I'd be interested in talking Friday. Do you have some time around 10am?"
• Objection: "Thanks for reaching out. This is not something I am interested in at this time."
• Unsubscribe: "Please remove me from your email list."
• Not-sure: "Mike, in regards to? John"
Two Sets of Experiment Runs
• Using different pretrained language models (LMs) and embeddings: feature-based vs. fine-tuning
– Using the full set of training examples
• Different labeled training sizes with the feature-based and fine-tuning approaches
– Increasingly larger training sizes: 50, 100, 200, 300, 500, 1000, 2000, 3000 (a sketch of this sweep follows below)
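A minimal sketch of how this training-size sweep can be driven and logged with MLflow; train_df, test_df, and train_and_evaluate are hypothetical stand-ins for the actual data splits and for either the feature-based or the fine-tuning pipeline.

import mlflow

TRAIN_SIZES = [50, 100, 200, 300, 500, 1000, 2000, 3000]

for size in TRAIN_SIZES:
    with mlflow.start_run():
        mlflow.log_param("approach", "bert-finetuning")   # or "feature-based"
        mlflow.log_param("train_size", size)
        # train_and_evaluate: hypothetical helper returning the test F1 score
        f1 = train_and_evaluate(train_df.sample(n=size, random_state=42), test_df)
        mlflow.log_metric("micro_avg_f1_score", f1)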
Result (1): Different Embeddings
• BERT fine-tuning has the best F1 score
• Among feature-based approaches, GloVe performs slightly better
• Classical ML baselines such as LightGBM + TF-IDF underperform both BERT fine-tuning and the feature-based approaches
Result (2): Scaling Effect with
Different Training Sample Sizes
• BERT fine-tuning outperforms all feature-based approaches when the training example size is greater than 300
• When the training size is small (< 100), BERT+Flair (feature-based) performs better
• To achieve an F1 score > 0.8, BERT fine-tuning needs at least 500 training examples, while the feature-based approach needs at least 2000 training examples
Introducing MLflow
• Open machine learning platform
• Works with any ML library & language
• Runs the same way anywhere (e.g., any cloud)
• Designed to be useful for 1 or 1000+ person orgs
• Integrates with Databricks
MLflow Components
• Tracking: record and query experiments (code, configs, results, etc.)
• Projects: packaging format for reproducible runs on any platform
• Models: general model format that supports diverse deployment tools
Key Concepts in Tracking
Parameters: key-value inputs to your code
Metrics: numeric values (can update over time)
Artifacts: arbitrary files, including data and models
Source: training code that ran
Version: version of the training code
Tags and Notes: any additional info
MLflow Tracking: Example Code
import mlflow
import mlflow.tensorflow

with mlflow.start_run():
    mlflow.log_param("layers", layers)
    mlflow.log_param("alpha", alpha)
    # ... train model ...
    mlflow.log_metric("mse", model.mse())
    mlflow.log_artifact(model.plot(test_df))    # plot() is assumed to return a local image path
    mlflow.tensorflow.log_model(model, "model")
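Runs logged this way can be browsed and compared side by side with the local mlflow ui command, or on a remote tracking server such as the managed one in Databricks.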
Model Format
[Diagram] MLflow Models: a standard format for ML models that supports multiple "flavors" (Flavor 1, Flavor 2, ...); ML frameworks and inference code log models into the format, while batch & stream scoring and serving tools consume them.
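As a hedged illustration of how a model logged in this format is consumed downstream, a model saved under the artifact path "model" (as in the earlier tracking example) can be loaded back through the generic python_function flavor; the run ID below is a placeholder, not a real run.

import mlflow.pyfunc

# Load the model via the framework-agnostic python_function flavor
model = mlflow.pyfunc.load_model("runs:/<run_id>/model")

# Score a batch, e.g. a pandas DataFrame of prepared emails
predictions = model.predict(test_df)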
MLflow to Manage Hundreds of
Experiments
• PyTorch models for the feature-based approach
– Using the Flair framework
• TensorFlow for BERT fine-tuning
– Using the bert-tensorhub framework
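One way to keep hundreds of runs across the two frameworks above comparable is to group them under a single MLflow experiment and tag each run with its approach and framework; the experiment name and tag values below are illustrative assumptions, not the exact setup.

import mlflow

mlflow.set_experiment("email-intent-classification")   # hypothetical experiment name

with mlflow.start_run():
    mlflow.set_tag("approach", "feature-based")    # or "fine-tuning"
    mlflow.set_tag("framework", "pytorch-flair")   # or "tensorflow-bert-tensorhub"
    mlflow.log_param("embedding", "glove")         # illustrative parameter
    # ... training code for this configuration ...
    f1 = 0.0  # placeholder; replace with the score computed by the training code
    mlflow.log_metric("micro_avg_f1_score", f1)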
MLflow Tracking All Experiments
MLflow Logs
Artifacts/Parameters/Metrics/Models
mlflow.log_metric("micro_avg_f1_score_avg", np.asarray(test_scores).mean())
Images Can Be Logged as Artifacts
mlflow.log_artifact(tSNE_img, 'run_{0}'.format(run_id))
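For completeness, a minimal sketch of how such a t-SNE figure might be produced and logged; the embedding matrix X and intent labels y are hypothetical stand-ins for the document embeddings and labels used in the experiments.

import matplotlib
matplotlib.use("Agg")               # headless rendering
import matplotlib.pyplot as plt
import mlflow
from sklearn.manifold import TSNE

# X: (n_emails, embedding_dim) document embeddings; y: integer intent labels
coords = TSNE(n_components=2, random_state=42).fit_transform(X)

plt.figure(figsize=(6, 6))
plt.scatter(coords[:, 0], coords[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE of email embeddings by intent")

tSNE_img = "tsne_intents.png"
plt.savefig(tSNE_img)

with mlflow.start_run() as run:
    mlflow.log_artifact(tSNE_img, 'run_{0}'.format(run.info.run_id))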
Summary
• Transfer learning by fine-tuning BERT outperforms all feature-based approaches using different embeddings/pretrained LMs when the training example size is greater than 300
• Pretrained language models solve the cold-start problem when there is very little training data
– E.g., with as few as 50 labeled examples, the F1 score reaches 0.67 with BERT+Flair using the feature-based approach
• However, reaching an F1 score > 0.8 may still require one to two thousand examples for a feature-based approach, or about 500 examples for fine-tuning a pre-trained BERT language model
• MLflow proved useful and powerful for tracking all the experiments
Future Work
• MLflow: from experimentation to production
– Pick the best model for deployment
• Extend to cross-org transfer learning
– Using data from one or multiple orgs for training and then applying the model to other orgs
Acknowledgements
• Outreach Data Science Team
• Databricks MLflow team
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
