Text-based regression for web domain reputation scoring
Deep Domain
Outline
Motivation (1)
Web proxy appliances provide web reputation scoring
• Included as component of the commercial product
• Draws on vendor’s broad information base
– Whois information
– Customer reporting of known badness
– Association with known bad domains
Arguably not the best indicator available
• Scores range from -10 to +10; negative is not necessarily malicious
• May not have web reputation score for new and/or uncommon domains
Motivation (2)
How much information is encoded in the domain text itself?
• Spoiler: a lot, but not all…
• Can we supplement new domains/domains with no web rep score?
• Data enrichment
Can we model patterns associated with malicious domains?
• Build a regression model using web rep score as the target variable
• How to build a model from plain text (without words)?
www.google.com → +8.0
What is Deep Learning? (1)
Deep learning refers to the application of neural networks
• These are not new!
• Originally motivated by biological neurons
• The perceptron is a simple network analogous to logistic regression
What is Deep Learning? (2)
So why all the buzz?
• Advances in algorithms and hardware allow more complexity
• New models are often novel and interesting NN architectures
• Allows the model to learn features
Where is Deep Learning Used? (1)
Object detection and image recognition:
*Code from https://github.com/datitran/object_detector_app
Model courtesy of Google Tensorflow research group
Where is Deep Learning Used? (2)
Google Translate uses deep learning with text:
• Expert translation back to English:
– Deep learning – is the coolest thing ever existed
Common Neural Network Architectures (1)
Convolutional neural network (CNN)
• Great for images or video
Common Neural Network Architectures (2)
Recurrent neural network (RNN)
• Great for sequence data (time series, sentences, characters)
RNNs for Text
Word-level RNN
• Input sequence is an array of words:
'the quick brown fox jumped…' → ['the','quick','brown',…]
Character-level RNN
• Input sequence is an array of characters:
'the quick brown fox jumped…' → ['t','h','e',' ','q',…]
Model Input Data (1)
Input data
• Approximately 1.6M distinct hostnames
• Extracted from web proxy requests made over a 10 day period
• For hostnames with multiple scores, use majority voting
domain web_rep
ebm.creditonemail.com 3.0
rockypoint360.com -6.2
ir.consumerportfolio.com 1.6
csbsju.imodules.com 0.3
phoneworld.com.pk 7.8
… …
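The majority-vote collapse above could be sketched with pandas (column names follow the table; the tiny DataFrame and the conflicting scores are illustrative, not real data):

```python
import pandas as pd

# Toy data: one hostname appears with conflicting scores
df = pd.DataFrame({
    "domain": ["rockypoint360.com", "rockypoint360.com",
               "rockypoint360.com", "phoneworld.com.pk"],
    "web_rep": [-6.2, -6.2, 1.0, 7.8],
})

# Keep the most frequent score per hostname (majority vote)
majority = (
    df.groupby("domain")["web_rep"]
      .agg(lambda s: s.mode().iloc[0])
      .reset_index()
)
```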
Model Input Data (2)
Neural networks generally work with vector and matrix input
• Need to convert series of characters to series of numbers
• Some implementations require all sequences to be the same length
Preprocessing procedure
• Create character encoding on the fly based on characters in data
– This is the input encoding vocabulary
• Zero-pad sequences so they are all the same length (impose cutoff)
• Make sure to write vocabulary out to disk for later use!!!
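A minimal sketch of the preprocessing above (function and file names are hypothetical), assuming characters map to integer ids with 0 reserved for padding:

```python
import json

def build_vocab(hostnames):
    """Create the character encoding on the fly from the data."""
    chars = sorted(set("".join(hostnames)))
    return {c: i + 1 for i, c in enumerate(chars)}  # 0 is reserved for padding

def encode(hostname, vocab, maxlen=72):
    """Encode one hostname as integers, left zero-padded to maxlen."""
    ids = [vocab.get(c, 0) for c in hostname[:maxlen]]  # impose the cutoff
    return [0] * (maxlen - len(ids)) + ids

vocab = build_vocab(["ebm.creditonemail.com", "phoneworld.com.pk"])
x = encode("phoneworld.com.pk", vocab, maxlen=24)

# Write the vocabulary out to disk for later use at serving time
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
```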
Deep Learning Frameworks
Many open-source deep learning frameworks available
• Theano: One of the O.G.s in the deep learning space
• Caffe: Developed at UC Berkeley, around the same time as Theano
• Torch: Used in Facebook research. Written in Lua?
• TensorFlow: Developed by the Google Brain team, now reaching maturity
• Keras: High-level Python API; works with a Theano or TensorFlow backend
For this project, used Keras with TensorBoard for monitoring
Model Architecture
Considered a basic LSTM architecture, with optional dropout
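A sketch of such a model in Keras (the embedding size and optimizer are assumptions; the slides only specify the LSTM with optional dropout and the regression target):

```python
from tensorflow.keras import layers, models

def build_model(vocab_size, maxlen=72, hidden=100, dropout=0.2):
    model = models.Sequential([
        layers.Input(shape=(maxlen,)),         # integer-encoded characters
        layers.Embedding(vocab_size + 1, 32),  # +1 for the padding index; dim 32 is an assumption
        layers.LSTM(hidden, dropout=dropout),  # optional dropout on the LSTM inputs
        layers.Dense(1),                       # linear output for regression
    ])
    model.compile(loss="mse", metrics=["mae"], optimizer="adam")
    return model

model = build_model(vocab_size=40)
```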
Training the Model (1)
Initial training to obtain baseline
• Input data length capped at 72 characters:
'the quick brown fox jumped…' → ['t','h','e',' ','q',…] (72 characters max)
• 100 hidden units for LSTM layer (output vectors of size 100)
• Train 15 epochs on 50% of training data set
• MSE for loss function, look at MAE for performance
Results
• Train time: 5h 54m
• Final (validation) MAE: 0.82
Training the Model (2)
Results seem not too bad!
• MAE steadily decreasing on training set:
Training the Model (3)
Results seem not too bad!
• Not so steady on validation set: overfitting!
Training the Model (4)
Look at reducing the number of hidden units
• Reduce model complexity (and training time)
Training the Model (5)
Explore the effects of training data size
Training the Model (6)
Still overfitting!
Model Results (1)
Not too bad to get a ~0.81 MAE on ~5 days of data
• Or is it??
• Need to take a deeper look beyond a single scalar metric
Model Results (2)
How can we improve the model around the edges?
• More training data
• More data sources (e.g. internal previous behavior, whois lookups)
• Weight the loss function for higher/lower values of the target variable
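Weighting the loss could be sketched with per-sample weights that emphasize scores near the -10/+10 extremes (the weighting formula here is a hypothetical choice, not one from the project):

```python
import numpy as np

def score_weights(y, base=1.0, scale=0.5):
    """Weight samples more the closer |score| is to the -10/+10 extremes."""
    return base + scale * np.abs(y) / 10.0

# Keras accepts these via model.fit(x, y, sample_weight=score_weights(y))
y = np.array([-9.5, 0.3, 7.8])
w = score_weights(y)
```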
Deployment Options
Several options for deployment
• Integration into current data ingest pipelines to enhance data
• Input into other predictive models
• Stand-alone tool for investigators
Models are always more fun when they're interactive!
• Create simple Python Flask app to serve model
• Automate build and deployment via Docker
Flask Application Example
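A minimal sketch of such a Flask app serving the model (the route, file names, and `encode` helper are hypothetical; the character encoding must match the one used in training):

```python
import json

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
MAXLEN = 72
model = None  # loaded at startup below
vocab = {}

def encode(hostname):
    """Integer-encode and left zero-pad, matching the training preprocessing."""
    ids = [vocab.get(c, 0) for c in hostname[:MAXLEN]]
    return [0] * (MAXLEN - len(ids)) + ids

@app.route("/score")
def score():
    hostname = request.args.get("domain", "")
    x = np.array([encode(hostname)])
    return jsonify(domain=hostname, web_rep=float(model.predict(x)[0][0]))

if __name__ == "__main__":
    from tensorflow.keras.models import load_model
    model = load_model("model.h5")                 # hypothetical artifact names
    vocab.update(json.load(open("vocab.json")))    # the vocabulary written at training time
    app.run(host="0.0.0.0", port=5000)
```

A Dockerfile then only needs Python, TensorFlow, Flask, and the two saved artifacts.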