5. Until the 1990s
Data
• Fewer domains
• Less “innovative” motivation
• Organizations don’t prioritize data
6. Until the 1990s
Machine Learning
• Most projects are done locally within their domains
• Cross-talk is rare
• It is not a leading academic topic
• Neural nets are a theoretical topic with nearly no commercial implications
8. Outcomes
• Abundant data is available
• A huge interest in data’s potential insights
• Financial data becomes interesting (e.g. a boost in the hedge-fund industry)
• Academia - researchers focus on data as their main study
Data became “science”
10. Which Problems does ML typically solve?
• Which animal can be seen in the image?
• Was this text’s writer happy?
• Is this recorded utterance human?
• Can we predict our customers’ satisfaction?
12. Memories from Junior High
Function
A deterministic engine that receives an input X and outputs Y
Y = 3X + 5
Y = -3X² + 4X - 0.5
Y = e^(-X) + tan(2X) + 5
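The slide’s three functions can be written directly as code, which makes the “deterministic engine” point concrete: the same X always produces the same Y.

```python
import math

# Deterministic engines: same input X always yields the same output Y.
def f1(x):
    return 3 * x + 5

def f2(x):
    return -3 * x**2 + 4 * x - 0.5

def f3(x):
    return math.exp(-x) + math.tan(2 * x) + 5

print(f1(2))  # 11
print(f2(1))  # 0.5
```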
13. Which Problems does ML typically solve?
• Which animal can be seen in the image?
• Was this text’s writer happy?
• Is this recorded utterance human?
• Can we predict our customers’ satisfaction?
14. Gaps Between Life and Junior High
The input X is not a single number but a collection of information:
• Images
• SQL columns
• Text
No teacher, no function
In most of the problems Y exists
ML is about finding the optimal map from X to Y
15. A Few Words about Y
• It can be numbers, vectors or categories.
• In this lecture, it is only categories:
(dog, cat, pineapple) or (malicious, benign)
• These types of problems are called “Classification”
16. A Few Words about Y - One Hot
Y indicates whether an image is a dog, cat or pineapple
• Dog - 1: Y = [1, 0, 0]
• Cat - 2: Y = [0, 1, 0]
• Pineapple - 3: Y = [0, 0, 1]
This representation has two virtues:
• Easy to demystify
• It is a probability vector
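The mapping on this slide can be sketched in a few lines; the class list and helper name are illustrative, not from the deck.

```python
# Map each class label to its one-hot probability vector.
CLASSES = ["dog", "cat", "pineapple"]

def one_hot(label):
    vec = [0] * len(CLASSES)
    vec[CLASSES.index(label)] = 1  # 1 at the class position, 0 elsewhere
    return vec

print(one_hot("dog"))        # [1, 0, 0]
print(one_hot("pineapple"))  # [0, 0, 1]
```

Because exactly one entry is 1 and the rest are 0, each vector sums to 1, which is what makes it a valid probability vector.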
27. Training - Continued
• What are the Z’s?
• Z1, Z2, Z3 are numbers that indicate the network’s raw outputs (logits)
• We are data scientists; we need probabilities to work with
Softmax
28. A Few Words about Y - One Hot
Y indicates whether an image is a dog, cat or pineapple
• Dog - 1: Y = [1, 0, 0]
• Cat - 2: Y = [0, 1, 0]
• Pineapple - 3: Y = [0, 0, 1]
This representation has two virtues:
• Easy to demystify
• It is a probability vector
29.
30. Loss Function
We have Y
We. Have Sigma’s
Loss function
It measures the distance between probabilities (KL-divergence)
we aim to minimize it by updating the weights according to the gradient
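For a one-hot Y, the KL divergence between Y and the softmax output reduces to cross-entropy: minus the log of the probability assigned to the true class. A minimal sketch:

```python
import math

def cross_entropy(y_true, y_pred):
    # Distance between the one-hot target and the predicted
    # probability vector; eps guards against log(0).
    eps = 1e-12
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

# A confident correct prediction gives a smaller loss than a hesitant one.
print(cross_entropy([1, 0, 0], [0.9, 0.05, 0.05]))
print(cross_entropy([1, 0, 0], [0.4, 0.3, 0.3]))
```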
34. Motivation
• Text has neither a natural analytic structure nor a quantitative representation
• We wish to solve “text driven” problems (e.g. sentiment analysis)
35. One Hot Coding
Dictionary with 1000 words
We use vectors of length 1000
1. Apple -> [1, 0, 0, …, 0]
2. Ball -> [0, 1, 0, …, 0]
…
10. Eating -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …, 0]
…
122. I -> [0, 0, 0, … (121 zeros), 1, 0, 0, …, 0]
…
633. Playing -> [0, … (632 zeros), 1, 0, 0, …, 0]
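The slide’s scheme in code, using the word indices shown above (converted from the slide’s 1-based numbering to Python’s 0-based indexing):

```python
VOCAB_SIZE = 1000

# Word -> 0-based index; the slide numbers these 1, 2, 10, 122, 633.
word_index = {"apple": 0, "ball": 1, "eating": 9, "i": 121, "playing": 632}

def one_hot_word(word):
    # Length-1000 vector with a single 1 at the word's index.
    vec = [0] * VOCAB_SIZE
    vec[word_index[word]] = 1
    return vec

print(one_hot_word("playing").index(1))  # 632
```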
36. Next Word Prediction
Eating (an) apple -> [10], [1]
X = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …, 0]
Y = [1, 0, 0, 0, …]
X is a vector of length 1000 with a single 1 in place 10
Y has a 1 only in the first place
37. Example 2
Playing (a) ball -> [633] [2]
X=[0,0,0,0,0. (in place 633) 1, 0,0 0 ..0] (length 1000)
Y= [0,1,0,0,0…]
X vector of length 1000 with single 1 at palce 633
Y has 1 only on the second place
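Both training pairs can be built with one helper that takes the slide’s 1-based word positions (Eating = 10, Playing = 633; targets Apple = 1, Ball = 2):

```python
VOCAB_SIZE = 1000

def one_hot(place_1based, length=VOCAB_SIZE):
    # Slide positions are 1-based; Python lists are 0-based.
    vec = [0] * length
    vec[place_1based - 1] = 1
    return vec

# "Eating (an) apple": input word 10, target word 1.
X1, Y1 = one_hot(10), one_hot(1)
# "Playing (a) ball": input word 633, target word 2.
X2, Y2 = one_hot(633), one_hot(2)
```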
41. Common KPIs
There are several useful KPIs - binary case:
• Accuracy - how often the model was right
• False Positive - the model detects “True” for a “False” example
• Recall - how many of the real “True” examples the model detects
• Precision - among all the “True” detections, how many were correct
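These three KPIs computed from scratch on binary labels (1 = “True”, 0 = “False”); a minimal sketch:

```python
def binary_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision

acc, rec, prec = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(acc, rec, prec)  # 0.5 0.5 0.5
```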
45. Challenges in Phishing
• Ambiguity of detection
• Independence is obscure
• How do we need to label?
• Imbalanced traffic
• Explainability is mandatory
47. Use Case
• Cloud traffic of emails
• A typical customer has several million emails per day
• Phishing’s rate is about 1 per 10,000 emails
• The system classifies each email into one of four categories:
Clean
Phishing
Spam
Marketing
48. Avanan’s Model
A two-step XGBoost:
• Combination of tabular features with some text analysis
• In order to achieve good precision, the second step takes place only if the first step detects phishing
• The second model uses DistilBERT
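The cascade logic described above can be sketched generically; the placeholder models and keyword rules below are purely illustrative stand-ins, not Avanan’s actual XGBoost/DistilBERT code:

```python
# Two-step cascade: the cheap first model screens all traffic, and the
# expensive second model runs only on emails the first flags as phishing,
# which is what buys the extra precision.
def cascade_predict(email, fast_model, slow_model):
    first = fast_model(email)
    if first != "phishing":
        return first          # clean / spam / marketing: stop after step one
    return slow_model(email)  # second opinion on phishing verdicts only

# Toy stand-in models for demonstration.
fast = lambda e: "phishing" if "urgent" in e else "clean"
slow = lambda e: "phishing" if "password" in e else "clean"

print(cascade_predict("urgent: reset your password", fast, slow))  # phishing
print(cascade_predict("weekly newsletter", fast, slow))            # clean
```

Note how the second model can overturn a first-step phishing verdict, reducing false positives without rerunning the heavy model on all traffic.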
49. Labeling Protocol
An enormous number of emails
• Labeling emails that were detected by the model as phishing or spam
• Precision is well measured
• However, new types of phishing are hardly detected; as a result, the recall measurements carry high risk
51. Our Objectives
• Construct a DL model
The model’s inputs are both text and tabular data
The model’s outputs remain the four categories
Performance requirements:
Optimizing recall at 98% precision
52. DL - “Engineering”
• Which type of embedding?
• How to combine embedded text with tabular data?
• After combining, what kind of network do we wish to have?
• Loss
60. A Solution (A, not The)
• We use Cross Entropy as the loss (L)
• We aim to reduce our FP (our objective is the precision level)
L_new = L + F(phishing score)
What is F?
61. Our New F
• For every real phishing training example it outputs 0
• For every non-phishing example it outputs:
a positive function that increases with the phishing probability
For a big probability, we wish a big slope
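One way to satisfy these requirements is a convex penalty such as a shifted exponential; this is an illustrative choice (the deck does not specify F’s exact form, and the `scale` parameter is an assumption):

```python
import math

def penalty_F(is_phishing, p_phishing, scale=2.0):
    # The slide's F: zero for real phishing examples; otherwise a
    # positive, increasing function of the phishing probability whose
    # slope grows with the probability (convexity gives the big slope
    # at big probabilities).
    if is_phishing:
        return 0.0
    return math.exp(scale * p_phishing) - 1.0

def new_loss(ce_loss, is_phishing, p_phishing):
    # L_new = L + F(phishing score)
    return ce_loss + penalty_F(is_phishing, p_phishing)
```

So a non-phishing email scored 0.9 for phishing is penalized far more heavily than one scored 0.1, pushing the model toward fewer false positives and higher precision.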
64. Avanan’s 6th Patent
Loss Function
• Usage of tanh
• 1D Wasserstein metric
• Usage of neural ODEs