Text-based regression for web domain reputation scoring
Deep Domain
Outline
Motivation (1)
Web proxy appliances provide web reputation scoring
• Included as component of the commercial product
• Draws on vendor’s broad information base
– Whois information
– Customer reporting of known badness
– Association with known bad domains
Arguably not the best indicator available
• Scores range from -10 to +10; negative is not necessarily malicious
• May not have web reputation score for new and/or uncommon domains
Motivation (2)
How much information is encoded in the domain text itself?
• Spoiler: a lot, but not all…
• Can we supplement new domains/domains with no web rep score?
• Data enrichment
Can we model patterns associated with malicious domains?
• Build a regression model using web rep score as the target variable
• How to build a model from plain text (without words)?
www.google.com → +8.0
What is Deep Learning? (1)
Deep learning refers to the application of neural networks
• These are not new!
• Originally motivated by biological neurons
• The perceptron is a simple network analogous to logistic regression
What is Deep Learning? (2)
So why all the buzz?
• Advances in algorithms and hardware allow more complexity
• New models are often novel and interesting NN architectures
• Allows the model to learn features
Where is Deep Learning Used? (1)
Object detection and image recognition:
*Code from https://github.com/datitran/object_detector_app
Model courtesy of Google Tensorflow research group
Where is Deep Learning Used? (2)
Google Translate uses deep learning with text:
• Expert translation back to English:
– Deep learning – is the coolest thing ever existed
Common Neural Network Architectures (1)
Convolutional neural network (CNN)
• Great for images or video
Common Neural Network Architectures (2)
Recurrent neural network (RNN)
• Great for sequence data (time series, sentences, characters)
RNNs for Text
Word-level RNN
• Input sequence is an array of words:
'the quick brown fox jumped…' → ['the','quick','brown',…]
Character-level RNN
• Input sequence is an array of characters:
'the quick brown fox jumped…' → ['t','h','e',' ','q',…]
Model Input Data (1)
Input data
• Approximately 1.6M distinct hostnames
• Extracted from web proxy requests made over a 10 day period
• For hostnames with multiple scores, use majority voting
domain web_rep
ebm.creditonemail.com 3.0
rockypoint360.com -6.2
ir.consumerportfolio.com 1.6
csbsju.imodules.com 0.3
phoneworld.com.pk 7.8
… …
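The majority-vote collapse above could be sketched with pandas (column names follow the table; the tiny DataFrame and the conflicting scores are illustrative, not real data):

```python
import pandas as pd

# Toy data: one hostname appears with conflicting scores
df = pd.DataFrame({
    "domain": ["rockypoint360.com", "rockypoint360.com",
               "rockypoint360.com", "phoneworld.com.pk"],
    "web_rep": [-6.2, -6.2, 1.0, 7.8],
})

# Keep the most frequent score per hostname (majority vote)
majority = (
    df.groupby("domain")["web_rep"]
      .agg(lambda s: s.mode().iloc[0])
      .reset_index()
)
```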
Model Input Data (2)
Neural networks generally work with vector and matrix input
• Need to convert series of characters to series of numbers
• Some implementations require all sequences to be the same length
Preprocessing procedure
• Create character encoding on the fly based on characters in data
– This is the input encoding vocabulary
• Zero-pad sequences so they are all the same length (impose cutoff)
• Make sure to write vocabulary out to disk for later use!!!
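A minimal sketch of the preprocessing above (function and file names are hypothetical), assuming characters map to integer ids with 0 reserved for padding:

```python
import json

def build_vocab(hostnames):
    """Create the character encoding on the fly from the data."""
    chars = sorted(set("".join(hostnames)))
    return {c: i + 1 for i, c in enumerate(chars)}  # 0 is reserved for padding

def encode(hostname, vocab, maxlen=72):
    """Encode one hostname as integers, left zero-padded to maxlen."""
    ids = [vocab.get(c, 0) for c in hostname[:maxlen]]  # impose the cutoff
    return [0] * (maxlen - len(ids)) + ids

vocab = build_vocab(["ebm.creditonemail.com", "phoneworld.com.pk"])
x = encode("phoneworld.com.pk", vocab, maxlen=24)

# Write the vocabulary out to disk for later use at serving time
with open("vocab.json", "w") as f:
    json.dump(vocab, f)
```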
Deep Learning Frameworks
Many open-source deep learning frameworks available
• Theano: One of the O.G.s in the deep learning space
• Caffe: Developed at UC Berkeley, around the same time as Theano
• Torch: Used in Facebook research. Written in Lua?
• TensorFlow: Developed by the Google Brain team, now reaching maturity
• Keras: High-level Python API; works with a Theano or TensorFlow backend
For this project, used Keras with TensorBoard for monitoring
Model Architecture
Considered a basic LSTM architecture, with optional dropout
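A sketch of such a model in Keras (the embedding size and optimizer are assumptions; the slides only specify the LSTM with optional dropout and the regression target):

```python
from tensorflow.keras import layers, models

def build_model(vocab_size, maxlen=72, hidden=100, dropout=0.2):
    model = models.Sequential([
        layers.Input(shape=(maxlen,)),         # integer-encoded characters
        layers.Embedding(vocab_size + 1, 32),  # +1 for the padding index; dim 32 is an assumption
        layers.LSTM(hidden, dropout=dropout),  # optional dropout on the LSTM inputs
        layers.Dense(1),                       # linear output for regression
    ])
    model.compile(loss="mse", metrics=["mae"], optimizer="adam")
    return model

model = build_model(vocab_size=40)
```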
Training the Model (1)
Initial training to obtain baseline
• Input data length capped at 72 characters:
'the quick brown fox jumped…' → ['t','h','e',' ','q',…] (72 characters max)
• 100 hidden units for LSTM layer (output vectors of size 100)
• Train 15 epochs on 50% of training data set
• MSE for loss function, look at MAE for performance
Results
• Train time: 5h 54m
• Final (validation) MAE: 0.82
Training the Model (2)
Results seem not too bad!
• MAE steadily decreasing on training set:
Training the Model (3)
Results seem not too bad!
• Not so steady on validation set: overfitting!
Training the Model (4)
Look at reducing the number of hidden units
• Reduce model complexity (and training time)
Training the Model (5)
Explore the effects of training data size
Training the Model (6)
Still overfitting!
Model Results (1)
Not too bad to get a ~0.81 MAE on ~5 days of data
• Or is it??
• Need to take a deeper look beyond a single scalar metric
Model Results (2)
How can we improve the model around the edges?
• More training data
• More data sources (e.g. internal previous behavior, whois lookups)
• Weight the loss function for higher/lower values of the target variable
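Weighting the loss could be sketched with per-sample weights that emphasize scores near the -10/+10 extremes (the weighting formula here is a hypothetical choice, not one from the project):

```python
import numpy as np

def score_weights(y, base=1.0, scale=0.5):
    """Weight samples more the closer |score| is to the -10/+10 extremes."""
    return base + scale * np.abs(y) / 10.0

# Keras accepts these via model.fit(x, y, sample_weight=score_weights(y))
y = np.array([-9.5, 0.3, 7.8])
w = score_weights(y)
```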
Deployment Options
Several options for deployment
• Integration into current data ingest pipelines to enhance data
• Input into other predictive models
• Stand-alone tool for investigators
Models are always more fun when they're interactive!
• Create simple Python Flask app to serve model
• Automate build and deployment via Docker
Flask Application Example
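A minimal sketch of such a Flask app serving the model (the route, file names, and `encode` helper are hypothetical; the character encoding must match the one used in training):

```python
import json

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
MAXLEN = 72
model = None  # loaded at startup below
vocab = {}

def encode(hostname):
    """Integer-encode and left zero-pad, matching the training preprocessing."""
    ids = [vocab.get(c, 0) for c in hostname[:MAXLEN]]
    return [0] * (MAXLEN - len(ids)) + ids

@app.route("/score")
def score():
    hostname = request.args.get("domain", "")
    x = np.array([encode(hostname)])
    return jsonify(domain=hostname, web_rep=float(model.predict(x)[0][0]))

if __name__ == "__main__":
    from tensorflow.keras.models import load_model
    model = load_model("model.h5")                 # hypothetical artifact names
    vocab.update(json.load(open("vocab.json")))    # the vocabulary written at training time
    app.run(host="0.0.0.0", port=5000)
```

A Dockerfile then only needs Python, TensorFlow, Flask, and the two saved artifacts.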