Deep Learning for Natural Language
Processing Using Apache Spark and
TensorFlow
Alexis Roos – Director of Machine Learning @alexisroos
Wenhao Liu – Senior Data Scientist
Activity Intelligence team
Agenda
Introduction
Email Classification
Deep Learning
Model Architecture
TensorFrames/SparkDL
Demo
Wrap up
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions
proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other
than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other
financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded
services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and
services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of
our security measures, the outcome of any litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our
relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer
deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that
could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for
the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our
Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or
at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and
does not intend to update these forward-looking statements.
Statement under the Private Securities Litigation Reform Act of 1995
Forward-Looking Statement
Doing Well and Doing Good
#1 World’s Most
Innovative Companies
Best Places to Work
for LGBTQ Equality
#1 The World’s Best
Workplaces
#1 Workplace for
Giving Back
#1 Top 50 Companies
that Care
The World’s Most
Innovative Companies
#1 The Future 50
Salesforce Keeps Getting Smarter with Einstein
Guide Marketers
Einstein Engagement Scoring
Einstein Segmentation (pilot)
Einstein Vision for Social
Assist Service Agents
Einstein Bots (pilot)
Einstein Agent (pilot)
Einstein Vision for Field Service (pilot)
Coach Sales Reps
Einstein Forecasting (pilot)
Einstein Lead & Opportunity Scoring
Einstein Activity Capture
Advise Retailers
Einstein Product Recommendations
Einstein Search Dictionaries
Einstein Predictive Sort
Empower Admins & Developers
Einstein Prediction Builder (pilot)
Einstein Vision & Language
Einstein Discovery
Help Community Members
Einstein Answers (pilot)
Community Sentiment (pilot)
Einstein Recommendations
Austin Buchan
CEO, College Forward
Agenda
Introduction
Email Classification
Deep Learning
Model Architecture
TensorFrames/SparkDL
Demo
Wrap up
Enhance CRM experience using AI and activity
(Diagram: Einstein Activity Capture ingests emails, meetings, tasks, calls, etc. Insights are extracted (pricing discussed, executive involved, scheduling requested, angry email, competition mentioned, etc.) and suggested actions surface in the AI Inbox, timelines, other Salesforce apps, and more.)
Email classification use case
What types of emails do Sales users receive?
• Emails from customers
• Scheduling requests, pricing requests, competitor mentioned, etc.
• Emails from coworkers
• Marketing emails
• Newsletters
• Telecom, Spotify, iTunes, Amazon purchases
• etc
Scheduling requests
We want to identify scheduling requests from customers
Example 1 (a scheduling request):
“Hi Alexis, Can we get together Thursday afternoon? Best, John”
Example 2 (not a scheduling request):
“Hello Wenhao, Can you send me that really important document? Thanks, Mark”
Example 3 (an automated email):
“Welcome to Business review! Your subscription is active. Your next letter will be emailed on May 25th 2018.”
Before scoring: filtering and parsing
• Right language
• Automated vs. non-automated
• Inbound / outbound
• Within or outside the organization
• etc
(Sample email, annotated by section:)
INTRO: Hey Alexis,
BODY: Let’s meet with Ascander on Friday to discuss the $10,000/year rate. Ascander’s phone number is (123) 456-7890.
SIGNATURE: Thanks, Noah Bergman, Engineer at Salesforce, (123) 456-7890
CONFIDENTIALITY NOTICE: The contents of this email and any attachments are confidential and are intended solely for addressee…
REPLY CHAIN (header information): From: Alexis alexis@salesforce.com | Date: April 1, 2017 | Subject: Important Document | Noah, how much does your product cost?
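A toy sketch of the parsing step: segmenting a raw email into the sections above with regex heuristics. The trigger patterns and state machine here are illustrative only; the production parser is far more involved.

```python
import re

# Illustrative trigger patterns; the real parser uses richer signals.
GREETING = re.compile(r"^(hi|hey|hello|dear)\b", re.I)
CLOSING = re.compile(r"^(thanks|best|regards|cheers)[,.!]?$", re.I)
CONFIDENTIAL = re.compile(r"\bconfidential\b", re.I)
REPLY = re.compile(r"^(from|date|subject|to):", re.I)

def segment_email(text):
    """Label each line of an email as INTRO, BODY, SIGNATURE,
    CONFIDENTIALITY NOTICE, or REPLY CHAIN via a simple state machine."""
    sections = {}
    state = "INTRO"
    for line in text.strip().splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if REPLY.match(stripped):
            state = "REPLY CHAIN"
        elif CONFIDENTIAL.search(stripped) and state in ("SIGNATURE", "CONFIDENTIALITY NOTICE"):
            state = "CONFIDENTIALITY NOTICE"
        elif CLOSING.match(stripped) and state == "BODY":
            state = "SIGNATURE"
        elif state == "INTRO" and not GREETING.match(stripped):
            state = "BODY"
        sections.setdefault(state, []).append(stripped)
    return {name: " ".join(lines) for name, lines in sections.items()}
```

Only the BODY section is fed to the downstream classifier; signatures, boilerplate notices, and reply chains would otherwise pollute the features.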
“Basic” NLP text classifier
Steps:
• Normalize and tokenize
• Generate n-grams
• Remove stop words
• Compute TF, with a minimum-count threshold to bound vocabulary size
• Compute IDF and filter n-grams based on an IDF threshold
Shortcomings:
• Lack of generalization, as the classifier is limited to tokens seen in training data
• A collection of n-grams does not take ordering or sequences into account
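The steps above can be sketched in plain Python. The stop-word list and thresholds are illustrative; the real pipeline uses Spark's Tokenizer, NGram, HashingTF, and IDF stages.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists have hundreds of entries.
STOP_WORDS = {"the", "a", "an", "we", "can", "to", "you", "me", "that"}

def tokenize(text):
    # Normalize: lowercase, keep alphanumeric tokens (and apostrophes).
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(docs, n=2, min_tf=1):
    """Tokenize, drop stop words, add n-grams, then weight by TF * IDF."""
    grams_per_doc = []
    for doc in docs:
        toks = [t for t in tokenize(doc) if t not in STOP_WORDS]
        grams_per_doc.append(toks + ngrams(toks, n))
    # Document frequency: in how many documents does each gram appear?
    df = Counter()
    for grams in grams_per_doc:
        df.update(set(grams))
    idf = {g: math.log((len(docs) + 1) / (c + 1)) for g, c in df.items()}
    return [{g: tf * idf[g] for g, tf in Counter(grams).items() if tf >= min_tf}
            for grams in grams_per_doc]
```

The shortcomings are visible here too: any token absent from the training vocabulary simply has no feature, and the bag of grams discards word order beyond the n-gram window.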
Word2Vec or GloVe
• Unsupervised learning algorithms for obtaining vector representations of words.
• Training is performed on aggregated global word-word co-occurrence statistics from a corpus.
• Word vectors for individual tokens capture their semantics.
word2VecModel.findSynonyms("cost", 5)
MONEY
price
license
nominal
budget
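findSynonyms is essentially a nearest-neighbor search by cosine similarity over the word vectors. A toy sketch with made-up 2-D vectors (real GloVe/Word2Vec embeddings have 50-300 dimensions):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def find_synonyms(vectors, word, k):
    # Rank all other words by cosine similarity to the query word's vector.
    target = vectors[word]
    others = ((w, cosine(target, v)) for w, v in vectors.items() if w != word)
    return [w for w, _ in sorted(others, key=lambda p: -p[1])[:k]]

# Made-up 2-D vectors chosen so money-related words cluster together.
vecs = {"cost": [0.9, 0.1], "price": [0.85, 0.2], "budget": [0.7, 0.3],
        "thursday": [-0.2, 0.9], "afternoon": [-0.3, 0.8]}
synonyms = find_synonyms(vecs, "cost", 2)  # -> ["price", "budget"]
```

Because similarity is measured in the embedding space rather than on surface tokens, this generalizes to words never seen in the labeled training set.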
High-level Architecture
Our current machine learning pipeline is pure Scala / Spark, which has served us well.
(Pipeline: Raw Emails → Filtering → Filtered Emails → Text Preprocessing → Feature Extraction (Word2Vec, LDA, TF/IDF, n-grams) → our model, or other ML models implemented in Scala/Spark)
Agenda
Introduction
Email Classification
Deep Learning
Model Architecture
TensorFrames/SparkDL
Demo
Wrap up
“Loose” brain inspiration: structure of cells
What are Neural Networks: feed forward networks
What are Neural Networks: recurrent networks
“I grew up in France… I speak fluent French.”
• RNNs suffer from vanishing or exploding gradients
• LSTMs chain cells that store and use memory across a sequence, controlled through gates and operations
• Designed to be chained into an RNN
LSTM
https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
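A single LSTM cell step, written out in scalar form to make the gate structure concrete. The weights here are arbitrary numbers for illustration; real cells use weight matrices and vector-valued states.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM cell update (scalar toy version of the standard equations)."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])          # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])          # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])          # output gate
    c_tilde = math.tanh(w["wc"] * x + w["uc"] * h_prev + w["bc"])  # candidate state
    c = f * c_prev + i * c_tilde   # cell state: gated mix of old memory and new input
    h = o * math.tanh(c)           # hidden state / output of the cell
    return h, c
```

The additive update of the cell state c is what lets gradients flow across long sequences, avoiding the vanishing-gradient problem of plain RNNs.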
Agenda
Introduction
Email Classification
Deep Learning
Model Architecture
TensorFrames/SparkDL
Demo
Wrap up
High Level Model Architecture
We present a “simple” BiLSTM model for text classification.
(Diagram: input tokens x0…x3 feed forward LSTM cells Cf0…Cf3 and backward LSTM cells Cb0…Cb3; the outputs Of3 and Ob0 are concatenated.)
• Tokens are mapped into word embeddings (GloVe pretrained on Wikipedia)
• The word embedding for each token is fed into both forward and backward recurrent networks with LSTM (Long Short-Term Memory*) cells
• The “last” outputs of the forward and backward RNNs are concatenated and taken as input by the sigmoid unit for binary classification
* Hochreiter & Schmidhuber 1997
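The overall flow (run both directions, take each "last" output, combine into a sigmoid unit) can be sketched with scalar stand-ins for the cells. The step functions and weights below are placeholders, not the trained model; in the real network the cells are LSTMs and the states are vectors that get concatenated rather than summed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_last_output(xs, step, h0=0.0):
    # Run a recurrent cell over the sequence and keep only the last output.
    h = h0
    for x in xs:
        h = step(x, h)
    return h

def bilstm_classify(xs, fwd_step, bwd_step, w_f, w_b, b):
    """Binary classifier on the 'last' outputs of both directions."""
    o_f = rnn_last_output(xs, fwd_step)                   # forward pass
    o_b = rnn_last_output(list(reversed(xs)), bwd_step)   # backward pass
    # Stand-in for concatenation: a weighted sum into the sigmoid unit.
    return sigmoid(w_f * o_f + w_b * o_b + b)

# Placeholder cell: a plain tanh recurrence with made-up weights.
step = lambda x, h: math.tanh(0.5 * x + 0.5 * h)
p = bilstm_classify([0.2, -0.1, 0.4], step, step, w_f=1.0, w_b=1.0, b=0.0)
```

Reading the sequence in both directions lets the classifier condition on context to the left and right of every token, which a single-direction RNN cannot do.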
Detailed Considerations for the Model
About dropout and regularization
• We applied dropout on recurrent connections* and inputs, as well as L2 regularization on the model parameters.
(Diagram: the BiLSTM unrolled over x0…x3, with dropout applied on the recurrent connections of both directions.)
trainable_vars = tf.trainable_variables()
regularization_loss = tf.reduce_sum(
    [tf.nn.l2_loss(v) for v in trainable_vars])
loss = original_loss + reg_weight * regularization_loss
*Gal & Ghahramani NIPS 2016
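The loss construction in the snippet can be mirrored in plain Python to show what tf.nn.l2_loss contributes: the sum of squared entries divided by two, summed over all trainable parameters. The reg_weight and parameter values below are made up.

```python
def l2_loss(v):
    # Mirrors tf.nn.l2_loss: sum of squares, divided by 2.
    return sum(x * x for x in v) / 2.0

def regularized_loss(original_loss, params, reg_weight):
    # Total loss = task loss + weighted L2 penalty over all parameters.
    regularization_loss = sum(l2_loss(v) for v in params)
    return original_loss + reg_weight * regularization_loss

loss = regularized_loss(0.8, [[1.0, -2.0], [3.0]], reg_weight=0.01)  # ≈ 0.87
```

The penalty pulls large weights toward zero, which (together with dropout) keeps the model from memorizing the training emails.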
Detailed Considerations for the Model
About variable sequence lengths
Emails come in different lengths: some are extremely short while others are long.
• One-word email: “Thanks”
• Emails of 800+ words are also common in business correspondence
Solution: dynamic_rnn + max length + sequence sampling
• tf.nn.dynamic_rnn (or tf.nn.bidirectional_dynamic_rnn) allows variable-length input sequences
tf.nn.dynamic_rnn(
    cell=lstm_cell,
    inputs=input_data,
    sequence_length=seq_len
)
Other Model Architectures Considered
• Single-direction RNN
• Single-direction RNN with GRU
• Single-direction RNN with LSTM
• Average pooling for outputs
• Max pooling for all outputs
• CNN on top of outputs
• …
We “settled” on the current architecture after extensive experimentation.
Agenda
Introduction
Email Classification
Deep Learning
Model Architecture
TensorFrames/SparkDL
Demo
Wrap up
Our workflow around Spark is completely in the Scala/Spark stack
• Train a SparkML model in the notebook environment and save it out
• At scoring time, load the pretrained SparkML model (part of a SparkML Pipeline) and call its transform method
Question: Can we use a TF model as if it were a native Scala/Spark function?
Fitting a TensorFlow model into a Spark pipeline
(Pipeline: Raw Emails → Filtering → Filtered Emails → Text Preprocessing → Feature Extraction (Word2Vec, LDA, TF/IDF, n-grams) → other ML models implemented in Scala/Spark)
Scala/Spark Pipeline + TensorFlow Model
TensorFrames / SparkDL as Interface
(Diagram: Raw Emails → Filtering → Filtered Emails → Text Preprocessing → Encoded Input. The encoded input, a [BatchSize × SequenceLength] matrix of token IDs such as
[[10 19853 3920 8425 43 … 18646]
[235 489 165638 46562 … 16516]],
is mapped through tf.nn.embedding_lookup against a [VocabularySize × EmbeddingLength] embedding matrix such as
[[0.19853 0.3920 0.8646 0.459 … 0.1865]
…
[0.684 0.1894 0.1564 0.9874 … 0.354]],
producing a [BatchSize × SequenceLength × EmbeddingLength] input tensor.)
* Shi Yan, Understanding LSTM and its diagrams
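What embedding_lookup does, in miniature: each token ID indexes a row of the embedding matrix, turning a 2-D matrix of IDs into a 3-D tensor of vectors. The vocabulary and embedding values below are made up.

```python
def embedding_lookup(embedding_matrix, ids_batch):
    """Pure-Python analogue of tf.nn.embedding_lookup.

    Maps a [batch, seq_len] matrix of token IDs to a
    [batch, seq_len, embedding_len] tensor of vectors.
    """
    return [[embedding_matrix[i] for i in row] for row in ids_batch]

# Toy 4-word vocabulary with 2-dimensional embeddings (made-up numbers).
emb = [[0.0, 0.0], [0.1, 0.9], [0.4, 0.2], [0.7, 0.3]]
tensor = embedding_lookup(emb, [[1, 3], [2, 0]])
# tensor[0][1] == [0.7, 0.3]
```

Doing the lookup inside the TensorFlow graph means the Spark side only has to ship integer IDs across the interface, not dense float tensors.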
TensorFrames turns a TensorFlow model into a UDF.
Save → Load → Score
Save the model:
%python
graph_def = tfx.strip_and_freeze_until(["input_data", "predicted"], sess.graph, sess=sess)
tf.train.write_graph(graph_def, "/model", "model.pb", False)
Load the model:
%scala
val graph = new com.databricks.sparkdl.python.GraphModelFactory()
  .sqlContext(sqlContext)
  .fetches(asJava(Seq("prediction")))
  .inputs(asJava(Seq("input_data")), asJava(Seq("input_data")))
  .graphFromFile("/model/model.pb")
graph.registerUDF("model")
Score with the model:
%scala
val predictions = inputDataSet.selectExpr("InputData", "model(InputData)")
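The Save → Load → Score pattern, reduced to a toy: serialize model parameters to disk, load them in a separate step, and score with the loaded model as an ordinary function. In the real pipeline the artifact is a frozen TensorFlow graph registered as a Spark UDF via SparkDL; everything below (the JSON format, the logistic scorer, the file name) is an illustrative stand-in.

```python
import json
import math
import os
import tempfile

def score(weights, features):
    # Logistic scorer: the "model as a function" we want after loading.
    z = weights["b"] + sum(w * x for w, x in zip(weights["w"], features))
    return 1.0 / (1.0 + math.exp(-z))

# Save: serialize trained parameters (stand-in for the frozen graph).
weights = {"w": [0.5, -0.25], "b": 0.1}
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump(weights, f)

# Load: read the parameters back, independent of the training process.
with open(path) as f:
    loaded = json.load(f)

# Score: apply the loaded model like an ordinary function (the UDF role).
p = score(loaded, [1.0, 2.0])
```

The key property is the same in both cases: once loaded, the model is just a function over rows, so it can sit inside a SQL expression next to native Scala/Spark transforms.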
Agenda
Introduction
Email Classification
Deep Learning
Model Architecture
TensorFrames/SparkDL
Demo
Wrap up
Lessons Learned
• A well-tuned LSTM model can outperform traditional ML approaches
• But data preparation is still needed and is key to success
• Spark can play nicely with TensorFlow, using TensorFrames as the interface
• We can work end to end in a single notebook, mixing Spark/Scala with TF/Python
• The model outperforms our classical ML approach and is being productized
salesforce.com/careers
Alexis Roos
Director of Machine Learning, @alexisroos
