This document describes MUDPIPE, a system for detecting malicious URLs using machine learning. It collects URL data, extracts features that indicate phishing, trains a classification model to label URLs as legitimate or phishing, and exposes the model as a REST API. The system classifies web traffic in real time so that phishing sites can be blocked, and it is retrained periodically to improve accuracy and keep up with new phishing techniques.
Detect Phishing URLs with MUDPIPE Machine Learning Model
1. MUDPIPE
Malicious URL Detection for Phishing Identification and Prevention
EMAIL : arjun.job14@gmail.com
LINKEDIN: https://www.linkedin.com/in/arjunbm
2. PHISHING INTRODUCTION
Phishing is the fraudulent practice of sending emails purporting to be from reputable companies in order to induce individuals to reveal personal information, such as passwords and credit card numbers.
MOTIVES: Financial gain, reputational damage, identity theft, fame & notoriety
Phishing website indicators:
• Looks visually similar to the original website
• Email creates a sense of urgency to force user action
• Fake HTTPS certificate & domain name
• Provides attractive offers that tempt the user to respond
3. APPROACH & METHODOLOGY
• Data Collection & Validation
• Parameter Determination
  • Address Bar based Features
  • HTML and JavaScript based Features
  • Domain based Features
  • Abnormal Based Features
  • URL Blacklist Features
• Feature Extraction
• Algorithm selection and output classification
• Create baseline model with initial dataset
• Evaluate performance of model and fine-tuning
• Apply test data on pre-trained baseline model & make prediction
• Compare with known data sources & further fine-tune results
• Retrain model at frequent intervals for better accuracy, context and relevancy
• Classification model pickled and exposed as REST API
• Pre-trained classification model used to classify and predict incoming URLs
• In-line integration with outgoing web traffic at egress points for centralized monitoring and control
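The feature extraction step for the "Address Bar based Features" category can be sketched as below. This is a minimal illustration, not the MUDPIPE implementation: the slides name the feature category only, so the individual features chosen here (URL length, IP address as host, "@" symbol, dot count, hyphen in host, HTTPS scheme) and the function names `host_is_ip` and `address_bar_features` are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(host: str) -> bool:
    """Return True if the host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url: str) -> dict:
    """Extract a few illustrative address-bar features from a URL.

    The feature set is an assumption made for this sketch; the slides
    do not list the individual features.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_is_ip": int(host_is_ip(host)),
        "has_at_symbol": int("@" in url),
        "host_dot_count": host.count("."),
        "host_has_hyphen": int("-" in host),
        "uses_https": int(parts.scheme == "https"),
    }


if __name__ == "__main__":
    print(address_bar_features("http://192.168.0.1/login@secure"))
```

Each feature is returned as a number so that the resulting dictionary can be turned directly into a row of the training dataset.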
4. ARCHITECTURE / WORKFLOW
[Workflow diagram; only the labels below are recoverable]
Components: BASELINE DATA SET, TEST DATA, WEB TRAFFIC (UNKNOWN DATA), FEATURE EXTRACTION, TRAIN, BASELINE ML MODEL, EXPOSE AS REST API, PREDICTION, RETRAIN
INPUT
• Address Bar based Features
• HTML and JavaScript based Features
• Domain based Features
• Abnormal Based Features
• URL Blacklist Features
OUTPUT
• CLASSIFICATION = 0: LEGITIMATE
• CLASSIFICATION = 1: PHISHING
• COMPARE WITH KNOWN SOURCES
• PROBABILITY OF PREDICTION
SECURITY ACTION (BLOCK / ALLOW)
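The output stage of the workflow (classification label, comparison with known sources, prediction probability, and the resulting block/allow action) could be combined roughly as follows. This is a sketch under assumptions: the slides do not say how the model's probability and the known-source check are merged, so the `decide` function, the `Verdict` type and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass

# Label encoding taken from the workflow slide.
LEGITIMATE = 0
PHISHING = 1


@dataclass
class Verdict:
    classification: int   # 0 = legitimate, 1 = phishing
    probability: float    # model's estimated probability of phishing
    action: str           # "BLOCK" or "ALLOW"


def decide(phishing_probability: float,
           in_known_blacklist: bool,
           threshold: float = 0.5) -> Verdict:
    """Turn a model probability and a known-source lookup into an action.

    Assumption for this sketch: a hit in a known source overrides the
    model, otherwise the probability is compared with a threshold.
    """
    is_phishing = in_known_blacklist or phishing_probability >= threshold
    return Verdict(
        classification=PHISHING if is_phishing else LEGITIMATE,
        probability=phishing_probability,
        action="BLOCK" if is_phishing else "ALLOW",
    )


if __name__ == "__main__":
    print(decide(0.92, in_known_blacklist=False))
    print(decide(0.10, in_known_blacklist=True))
```

Keeping the probability in the verdict lets a security analyst see how confident the model was, even when a known-source hit decided the action.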
6. DEPLOYING TO PRODUCTION
• Context specific use-cases:
• Certain sub-nets within the org might require access to certain websites to support business functionality
• Org might want to block access to sites even though they are classified as “suspicious” by commercial software
• Infrastructure & capacity planning considerations: client, load balancer, web server, queues, etc.
• REST-API approach: train, retrain & predictions
• Develop automation test cases for your model (especially on feature engineering side)
• Automate evaluation of the production model, so that changes to the model can be back-tested efficiently on historical data to determine whether they are improvements
• Possibly have different end-points exposed for different sections of the network or for different departments
• Have a fall-back or default value for parameters which fail to get processed by the Feature Engineering module
• Decouple the input and the output for the model; model should still work if parameters are added, modified or deleted in feature engineering
• Single egress point for web traffic, where the ML model can be plugged-in with the REST API
• Have a fail-open or kill-switch mechanism for traffic to flow through if model processing fails
• Place model operation in “monitoring” or “non-blocking” mode initially, which allows the ML model to get additional data, allows for fine-tuning and prevents errors
• Supplement with existing controls like spam filtering, black-listing, etc
• Model should refer to other data sources as well for fine-tuning in the initial stages
• Baselining and retraining the model at frequent intervals; also maintaining model versions
• Provide security analysts with an option to tweak/edit input data for contextual representation
• Deploying the MODEL client-side versus server-side
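Three of the points above (default values for features that fail to process, a fail-open path when model processing fails, and an initial "monitoring" mode) can be sketched together. This is an illustration only: the names `FEATURE_DEFAULTS`, `fill_defaults` and `gate`, the default values, and the use of a callable `predict` in place of the REST call are all assumptions.

```python
import logging
from typing import Callable, Mapping

log = logging.getLogger("mudpipe.gate")

# Hypothetical feature schema with fall-back values.
FEATURE_DEFAULTS = {"url_length": 0, "host_is_ip": 0, "uses_https": 0}


def fill_defaults(raw: Mapping[str, object],
                  defaults: Mapping[str, object] = FEATURE_DEFAULTS) -> dict:
    """Build the model input from the schema, not from the raw output.

    Missing or None features take their default; features the model does
    not know about are ignored, so feature engineering can change without
    breaking prediction.
    """
    features = {}
    for name, default in defaults.items():
        value = raw.get(name)
        features[name] = default if value is None else value
    return features


def gate(url: str,
         predict: Callable[[str], int],
         monitor_only: bool = True) -> str:
    """Return "BLOCK" or "ALLOW" for a URL.

    Fails open: any error in model processing lets the traffic through.
    In monitoring mode a phishing verdict is logged but not enforced.
    """
    try:
        label = predict(url)  # 1 = phishing, 0 = legitimate
    except Exception:
        log.exception("model processing failed, failing open for %s", url)
        return "ALLOW"
    if label == 1 and monitor_only:
        log.warning("monitoring mode: would block %s", url)
        return "ALLOW"
    return "BLOCK" if label == 1 else "ALLOW"
```

For example, `gate(url, predict, monitor_only=False)` enforces blocking, while the default `monitor_only=True` only records what would have been blocked.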
7. BENEFITS
• Reduce dependency on third-party external software, along with its cost & licensing
• Re-use the org’s in-house data rather than contributing it towards improving commercial software
• Better insights into online behavior of employees
• Real-time protection for employees who access malicious websites or click on phishing links
• Detect and prevent unknown phishing attacks, as new patterns are created by attackers
• Next level of intelligence on top of signature-based prevention techniques & blacklists
• Email filtering solutions help in filtering phishing/spam emails, but this provides holistic protection for all outgoing internet traffic
• Centralized solution implemented org-wide and no dependency on client-side agents/software
• Anti-phishing: move from offline to real-time; move from reactive to proactive
8. AUDIENCE TAKE-AWAYS
• Provide insights into building an ML pipeline, data engineering & feature extraction
• Learn how to solve a “Classification” problem using ML
• Cyber Security Analysts can use the feature extraction component to quickly analyze indicators and hence expedite incident response
• Helps security engineers to build more intelligent products, tailored to their own org requirements
• Helps understand the constituents/factors that identify malicious URLs
• Learn how to fingerprint a URL for phishing indicators using various data sources and components
• How to create/obtain baseline dataset for training the baseline ML model
• Learn how to deploy ML model in production
• Learn how to retrain the model for better accuracy and relevancy
• Learn how to identify top influencing variables which determine model output