5. Until the 1990s
Data
• Fewer domains
• Less “innovative” motivation
• Organizations don’t prioritize data
6. Until the 1990s
Machine Learning
• Most projects are done locally within their domains
• Cross-talk is rare
• It is not a leading academic topic
• Neural nets are a theoretical topic with nearly no commercial implications
8. Outcomes
• Abundant data is available
• A huge interest in data’s potential insights
• Financial data becomes interesting (e.g. a boost in the hedge-fund industry)
• Academia - researchers focus on data as their main study
Data became “science”
10. Which Problems does ML typically solve?
• Which animal can be seen in the image?
• Was this text’s writer happy?
• Is this recorded utterance human?
• Can we predict our customers’ satisfaction?
12. Memories from Junior High
Function
A deterministic engine that receives an input X and outputs Y
Y = 3X + 5
Y = -3X² + 4X - 0.5
Y = e^(-X) + tan(2X) + 5
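The slide’s three functions can be written directly as code, which makes the “deterministic engine” point concrete: the same X always produces the same Y.

```python
import math

# Deterministic engines: same input X always yields the same output Y.
def f1(x):
    return 3 * x + 5

def f2(x):
    return -3 * x**2 + 4 * x - 0.5

def f3(x):
    return math.exp(-x) + math.tan(2 * x) + 5

print(f1(2))  # 11
print(f2(1))  # 0.5
```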
13. Which Problems does ML typically solve?
• Which animal can be seen in the image?
• Was this text’s writer happy?
• Is this recorded utterance human?
• Can we predict our customers’ satisfaction?
14. Gaps Between Life and Junior High
The input X is not a single number but a collection of information:
• Images
• SQL columns
• Text
No teacher, no function
In most of the problems Y exists
ML is about finding the optimal map from X to Y
15. A Few Words about Y
• It can be numbers, vectors or categories.
• In this lecture, it is only categories:
(dog, cat, pineapple) or (malicious, benign)
• These types of problems are called “Classification”
16. A Few Words about Y - One Hot
Y indicates whether an image is a dog, cat or pineapple
• Dog - 1: Y = [1, 0, 0]
• Cat - 2: Y = [0, 1, 0]
• Pineapple - 3: Y = [0, 0, 1]
This representation has two virtues:
• Easy to demystify
• It is a probability vector
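The mapping on this slide can be sketched in a few lines; the class list and helper name are illustrative, not from the deck.

```python
# Map each class label to its one-hot probability vector.
CLASSES = ["dog", "cat", "pineapple"]

def one_hot(label):
    vec = [0] * len(CLASSES)
    vec[CLASSES.index(label)] = 1  # 1 at the class position, 0 elsewhere
    return vec

print(one_hot("dog"))        # [1, 0, 0]
print(one_hot("pineapple"))  # [0, 0, 1]
```

Because exactly one entry is 1 and the rest are 0, each vector sums to 1, which is what makes it a valid probability vector.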
27. Training - Continued
• What are the Z’s?
• Z1, Z2, Z3 are numbers that indicate the network’s raw outputs (logits)
• We are data scientists; we need probabilities to work with
Softmax
28. A Few Words about Y - One Hot
Y indicates whether an image is a dog, cat or pineapple
• Dog - 1: Y = [1, 0, 0]
• Cat - 2: Y = [0, 1, 0]
• Pineapple - 3: Y = [0, 0, 1]
This representation has two virtues:
• Easy to demystify
• It is a probability vector
29.
30. Loss Function
We have Y
We. Have Sigma’s
Loss function
It measures the distance between probabilities (KL-divergence)
we aim to minimize it by updating the weights according to the gradient
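For a one-hot Y, the KL divergence between Y and the softmax output reduces to cross-entropy: minus the log of the probability assigned to the true class. A minimal sketch:

```python
import math

def cross_entropy(y_true, y_pred):
    # Distance between the one-hot target and the predicted
    # probability vector; eps guards against log(0).
    eps = 1e-12
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

# A confident correct prediction gives a smaller loss than a hesitant one.
print(cross_entropy([1, 0, 0], [0.9, 0.05, 0.05]))
print(cross_entropy([1, 0, 0], [0.4, 0.3, 0.3]))
```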
34. Motivation
• Text has neither a natural analytic structure nor a quantitative representation
• We wish to solve “text driven” problems (e.g. sentiment analysis)
35. One Hot Coding
Dictionary with 1000 words
We use vectors of length 1000
1. Apple -> [1, 0, 0, …, 0]
2. Ball -> [0, 1, 0, …, 0]
…
10. Eating -> [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …, 0]
…
122. I -> [0, 0, 0, … (121 zeros), 1, 0, 0, …, 0]
…
633. Playing -> [0, … (632 zeros), 1, 0, 0, …, 0]
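The slide’s scheme in code, using the word indices shown above (converted from the slide’s 1-based numbering to Python’s 0-based indexing):

```python
VOCAB_SIZE = 1000

# Word -> 0-based index; the slide numbers these 1, 2, 10, 122, 633.
word_index = {"apple": 0, "ball": 1, "eating": 9, "i": 121, "playing": 632}

def one_hot_word(word):
    # Length-1000 vector with a single 1 at the word's index.
    vec = [0] * VOCAB_SIZE
    vec[word_index[word]] = 1
    return vec

print(one_hot_word("playing").index(1))  # 632
```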
36. Next Word Prediction
Eating (an) apple -> [10], [1]
X = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …, 0]
Y = [1, 0, 0, 0, …]
X is a vector of length 1000 with a single 1 in place 10
Y has a 1 only in the first place
37. Example 2
Playing (a) ball -> [633] [2]
X=[0,0,0,0,0. (in place 633) 1, 0,0 0 ..0] (length 1000)
Y= [0,1,0,0,0…]
X vector of length 1000 with single 1 at palce 633
Y has 1 only on the second place
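Both training pairs can be built with one helper that takes the slide’s 1-based word positions (Eating = 10, Playing = 633; targets Apple = 1, Ball = 2):

```python
VOCAB_SIZE = 1000

def one_hot(place_1based, length=VOCAB_SIZE):
    # Slide positions are 1-based; Python lists are 0-based.
    vec = [0] * length
    vec[place_1based - 1] = 1
    return vec

# "Eating (an) apple": input word 10, target word 1.
X1, Y1 = one_hot(10), one_hot(1)
# "Playing (a) ball": input word 633, target word 2.
X2, Y2 = one_hot(633), one_hot(2)
```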
41. Common KPIs
There are several useful KPIs - binary case:
• Accuracy - how often the model was right
• False Positive - the model detects “True” for a “False” example
• Recall - how many of the real “True” examples the model detects
• Precision - among all the “True” detections, how many were correct
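These three KPIs computed from scratch on binary labels (1 = “True”, 0 = “False”); a minimal sketch:

```python
def binary_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision

acc, rec, prec = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(acc, rec, prec)  # 0.5 0.5 0.5
```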
45. Challenges in Phishing
• Ambiguity of detection
• Independence is obscure
• How do we need to label?
• Imbalanced traffic
• Explainability is mandatory
47. Use Case
• Cloud traffic of emails
• A typical customer has several million emails per day
• Phishing’s rate is about 1 per 10,000 emails
• The system classifies each email into one of four categories:
Clean
Phishing
Spam
Marketing
48. Avanan’s Model
A two-step XGBoost:
• Combination of tabular features with some text analysis
• In order to achieve good precision, the second step takes place only if the first step detects phishing
• The second model uses DistilBERT
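The cascade logic described above can be sketched generically; the placeholder models and keyword rules below are purely illustrative stand-ins, not Avanan’s actual XGBoost/DistilBERT code:

```python
# Two-step cascade: the cheap first model screens all traffic, and the
# expensive second model runs only on emails the first flags as phishing,
# which is what buys the extra precision.
def cascade_predict(email, fast_model, slow_model):
    first = fast_model(email)
    if first != "phishing":
        return first          # clean / spam / marketing: stop after step one
    return slow_model(email)  # second opinion on phishing verdicts only

# Toy stand-in models for demonstration.
fast = lambda e: "phishing" if "urgent" in e else "clean"
slow = lambda e: "phishing" if "password" in e else "clean"

print(cascade_predict("urgent: reset your password", fast, slow))  # phishing
print(cascade_predict("weekly newsletter", fast, slow))            # clean
```

Note how the second model can overturn a first-step phishing verdict, reducing false positives without rerunning the heavy model on all traffic.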
49. Labeling Protocol
An enormous number of emails
• Labeling emails that were detected by the model as phishing or spam
• Precision is well measured
• However, new types of phishing are hardly detected; as a result, the recall measurements carry high risk
51. Our Objectives
• Construct a DL model
The model’s inputs are both text and tabular data
The model’s outputs remain the four categories
Performance requirements:
Optimizing recall at 98% precision
52. DL - “Engineering”
• Which type of embedding?
• How to combine embedded text with tabular data?
• After combining, what kind of network do we wish to have?
• Loss
60. A Solution (A, not The)
• We use Cross Entropy as the loss (L)
• We aim to reduce our FP (our objective is the precision level)
L_new = L + F(phishing score)
What is F?
61. Our New F
• For every real phishing training example it outputs 0
• For every non-phishing example it outputs:
a positive function that increases with the phishing probability
For a big probability, we wish a big slope
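One way to satisfy these requirements is a convex penalty such as a shifted exponential; this is an illustrative choice (the deck does not specify F’s exact form, and the `scale` parameter is an assumption):

```python
import math

def penalty_F(is_phishing, p_phishing, scale=2.0):
    # The slide's F: zero for real phishing examples; otherwise a
    # positive, increasing function of the phishing probability whose
    # slope grows with the probability (convexity gives the big slope
    # at big probabilities).
    if is_phishing:
        return 0.0
    return math.exp(scale * p_phishing) - 1.0

def new_loss(ce_loss, is_phishing, p_phishing):
    # L_new = L + F(phishing score)
    return ce_loss + penalty_F(is_phishing, p_phishing)
```

So a non-phishing email scored 0.9 for phishing is penalized far more heavily than one scored 0.1, pushing the model toward fewer false positives and higher precision.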
64. Avanan’s 6th Patent
Loss Function
• Usage of tanh
• 1D Wasserstein metric
• Usage of neural ODEs