MACHINE LEARNING & CYBER SECURITY
Detecting Malicious URLs in the Haystack
Team
Triss
Data Scientist working in cyber security
Loves motorsport, chin-ups, learning and the
West Coast Eagles (mighty big birds!)
Alistair
@dizzy_data
Research Masters student working in cyber security.
Enjoys random strolls in foreign cities.
A little story...
So why are we here?
We have been working in data science and cyber security for some time...
We hope to share with you:
1. How Design Thinking can help teams create more meaningful machine learning products;
2. How data science frameworks can provide structure to machine learning product development;
3. How Python can make your machine learning dreams become reality
How our project started
Design Thinking
Source: https://dribbble.com/stories/2019/03/22/what-is-design-thinking
Method To The Madness
Design Thinking + Data Science Process
Threat Science Framework
A framework for building human-centred machine learning in cyber security defence
Know The User
Nail The Problem
Ideate
Know The Threat
Data Acquisition & Understanding
Feature Engineering
Modeling & Evaluation
Deployment
Know The User
Know The User - Challenges
Management of numerous security tools
Alert fatigue
High staff turnover and knowledge loss
Know The User - Security Analyst Persona
Security Analysts work in Security Operations Centres and are tasked
with defending organisations against adversarial cyber threats
Goals:
- Maintain security architecture
- Defend against myriad threat vectors (Incident Response)
- Identify security flaws
Needs:
- Rapid incident response
- Rich tool set
- Coverage across the cyber attack life cycle
- Free time to work on interesting projects such as threat hunting
Pain points:
- Alert fatigue
- Lack of integration across tools and intelligence feeds
- Keeping up with a constantly evolving threat landscape. What will the next attack look like?
Nail the problem
Problem statements (POV)
{User} needs {User’s need} so that {benefit}
Security Teams face a broad and complex threat landscape. Historically, the common
answer has been to adopt numerous tools and hire more staff to build an adequate defence.
However, this approach has proven to be unsustainable.
Security Analysts need rapid and intelligent cyber defence capability so that they can stand a chance
against a growing and often superior threat.
Ideate - How Might We
How might we statements
Form "How might we" questions from the POV or Problem Statement
- Brainstorming (generate ideas from a seed question)
- Brainwriting (each team member generates a few ideas, thinks deeply about them, then prioritises)
- Mindmapping (grouping ideas together)
Ideate - The Vision
How might we build an automated and intelligent
ecosystem of machine learning models that work in
unison to provide superior defence against an
ever-evolving threat landscape?
Ideate - One Prototype At A Time
Source: https://www.reddit.com/r/reactiongifs/
How might we detect malicious URLs using machine learning and Python?
Know The Threat
Phishing
Typosquatting
Domain Generation Algorithms (DGA)
Cybersquatting
Spam
Malware
Data Acquisition & Understanding
BENIGN
MALICIOUS
ENRICHMENT
Data Acquisition & Understanding
Domain Creation Date
Domain Update Date
Domain Expiry Date
Host IP
Country
Registrar
WHOIS status
Domain name
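The WHOIS date fields above are usually easier for a model to use once they are turned into durations. The following is a minimal sketch of that idea, not the pipeline used in the talk; the function name, the feature names and the sample record are all hypothetical.

```python
from datetime import date


def whois_date_features(created, updated, expires, observed):
    """Turn WHOIS dates into day counts. A missing date yields None."""

    def days_between(start, end):
        if start is None or end is None:
            return None
        return (end - start).days

    return {
        "domain_age_days": days_between(created, observed),
        "days_since_update": days_between(updated, observed),
        "days_to_expiry": days_between(observed, expires),
        "registration_span_days": days_between(created, expires),
    }


# Hypothetical record for illustration only (not from the talk's data).
features = whois_date_features(
    created=date(2019, 1, 1),
    updated=date(2019, 1, 15),
    expires=date(2020, 1, 1),
    observed=date(2019, 1, 31),
)
```

Returning None for missing dates leaves the choice of imputation strategy to a later feature engineering step.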
Exploratory Data Analysis
Exploratory Data Analysis
Feature Engineering
Digit Percentage
Binning
Shannon Entropy
IANA Designation
Special Characters
Normalization
Standardisation
Embeddings
One Hot Encoding
Impute Missing Values
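Three of the features listed above (digit percentage, Shannon entropy and special characters) can be computed directly from the domain or URL string. The sketch below shows one way to do so using only the standard library. The function names are hypothetical, and treating every non-alphanumeric character as "special" is an assumption, since the slide does not define the term.

```python
import math
from collections import Counter


def digit_percentage(text):
    """Percentage (0 to 100) of characters in text that are digits."""
    if not text:
        return 0.0
    return 100.0 * sum(ch.isdigit() for ch in text) / len(text)


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def special_character_count(text):
    """Count characters that are neither letters nor digits (an assumed definition)."""
    return sum(1 for ch in text if not ch.isalnum())


# Hypothetical domain strings for illustration.
example_features = {
    name: (digit_percentage(name), shannon_entropy(name), special_character_count(name))
    for name in ("example.com", "x9k2-q7w1z.info")
}
```

Higher entropy and a higher share of digits are often associated with algorithmically generated domains, which is why such features are commonly tried for DGA detection.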
Modeling & Evaluation - Cross Validation
Train | Test & Validation
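The slide does not say which validation scheme was used. As an illustration of the general idea, the sketch below generates k-fold cross-validation splits with the standard library only; the function name and the default values are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, validation


# Each sample is used for validation exactly once across the k folds.
splits = list(k_fold_indices(n_samples=10, k=5, seed=1))
```

A separate test set held back before this step would still be needed for a final, unbiased estimate of performance.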
Modeling & Evaluation - Candidate Models
Deep Neural Network
Random Forest
Word Embedding

Models      Deep Neural Network   Random Forest   Word Embedding
Accuracy    0.86                  0.83            0.79
F1 Score    0.86                  0.83            0.80
Modeling & Evaluation: F1 Score
Image source: https://towardsdatascience.com/precision-vs-recall-386cf9f89488
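The F1 score is the harmonic mean of precision and recall. A minimal sketch of the calculation from raw counts follows; the function name and the example counts are hypothetical and are not the talk's results.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) computed from confusion counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 malicious URLs caught, 2 false alarms, 2 missed.
precision, recall, f1 = precision_recall_f1(8, 2, 2)
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot score well on F1 by maximising precision or recall alone.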
Modeling & Evaluation: Confusion Matrix
Deep Neural Network (figure: confusion matrix over the classes Benign and Malicious)
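A confusion matrix such as the one on this slide can be tallied from paired actual and predicted labels. The sketch below is illustrative only: the function name, the label strings and the sample labels are assumptions, and rows are taken to be actual classes with columns as predicted classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels=("benign", "malicious")):
    """Return a matrix where rows are actual labels and columns are predicted labels."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Hypothetical labels for illustration only.
actual = ["benign", "benign", "malicious", "malicious", "malicious"]
predicted = ["benign", "malicious", "malicious", "malicious", "benign"]
matrix = confusion_matrix(actual, predicted)
```

With this layout, the off-diagonal cells are the false alarms (benign predicted as malicious) and the misses (malicious predicted as benign).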
Modeling & Evaluation: Explainability
Global & local model explanation (SHAP)
Source: https://github.com/slundberg/shap#sample-notebooks
Modeling & Evaluation - Parameter Tuning
Grid search
(exhaustive)
Random search
(random)
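The difference between the two strategies can be sketched in a few lines: grid search enumerates every combination in the search space, while random search samples a fixed number of combinations. The function names and the search space below are hypothetical and do not reflect the parameters tuned in the talk.

```python
import itertools
import random


def grid_candidates(space):
    """Yield every combination of parameter values (exhaustive)."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_candidates(space, n_trials, seed=0):
    """Yield n_trials combinations, sampling each parameter independently."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {name: rng.choice(values) for name, values in space.items()}


# Hypothetical search space for a tree ensemble.
space = {"n_estimators": [100, 300, 500], "max_depth": [5, 10, None]}
grid = list(grid_candidates(space))
sampled = list(random_candidates(space, n_trials=4))
```

Each candidate would then be scored with cross validation and the best one kept. Grid search cost grows multiplicatively with each added parameter, whereas random search keeps a fixed budget.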
Deployment
UNDER
CONSTRUCTION
Shout outs
Katie Ford (@katiegford) for the wonderful artwork
Yi Fang and Paul who continue to push us
Research Papers and the wonderful Python community
References
Vanhoenshoven, F., Nápoles, G., Falcon, R., Vanhoof, K. and Köppen, M. Detecting malicious URLs using machine learning techniques.
What is design thinking? https://dribbble.com/stories/2019/03/22/what-is-design-thinking
Interaction Design. https://www.interaction-design.org/literature/article/stage-1-in-the-design-thinking-process-empathise-with-your-users
Inc - Brainstorming
Microsoft Team Data Science Process. https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
Choi, H., Zhu, B.B. and Lee, H., 2011. Detecting Malicious Web Links and Identifying Their Attack Types. WebApps, 11(11), p.218.
Bilge, L., Kirda, E., Kruegel, C. and Balduzzi, M., 2011, February. EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis. In NDSS (pp. 1-17).
Thank you

Machine Learning & Cyber Security: Detecting Malicious URLs in the Haystack

  • 1. M A C H I N E L E A R N I N G & C Y B E R S E C U R I T Y Detecting Malicious URLs in the Haystack
  • 2. Team Triss Data Scientist working in cyber security Loves motorsport, chin-ups, learning and the West Coast Eagles (mighty big birds!) Alistair @dizzy_data Research Masters student working in cyber security. Enjoy random strolls in foreign cities.
  • 4. So why are we here? We have been working in data science and cyber security for some time... We hope to share with you: 1. How Design Thinking can help teams create more meaningful machine learning products; 2. How data science frameworks can provide structure to machine learning product development; 3. How Python can make your machine learning dreams become reality
  • 5. How our project started
  • 7. Method To The Madness + Design Thinking Data Science Process
  • 8. Threat Science Framework A framework for building human-centred machine learning in cyber security defence Know The User Modeling & Evaluation Data Acquisition & Understanding Feature Engineering Deployment Nail The Problem Ideate Know The Threat
  • 10. Know The User - Challenges Management of numerous security tools Alert fatigue High staff turnover and knowledge loss
  • 11. Know The User - Security Analyst Persona Security Analysts working in Security Operations Centres tasked with defending organisations against cyber adversarial threats Goals: - Maintain security architecture - Defend against myriad threat vectors (Incident Response) - Identify security flaws Needs: - Rapid incident response - Rich tool set - Coverage across the cyber attack life cycle - Free time to work on interesting projects such as threat hunting Pain points: - Alert fatigue - Lack of integration across tools and intelligence feeds - Keeping up with a constantly evolving threat landscape. What will the next attack look like?
  • 12. Nail the problem Problem statements (POV) {User} needs {User’s need} so that {benefit} Security Teams are faced with a broad and complex threat landscape. Historically, the common answer has been to focus on adopting numerous tools and staff to build any adequate defence. However, this approach has proven to be unsustainable. Security Analysts need rapid and intelligent cyber defence capability so that they can stand a chance against a growing and often superior threat
  • 13. Ideate - How Might We How might we statements How might we Form the POV or Problem Statement - Brainstorming (generate ideas from a seed question) - Brainwriting (each team member generates a few ideas, think deeply about them, then prioritise) - Mindmapping (grouping ideas together)
  • 14. Ideate - The Vision How might we build an automated and intelligent ecosystem of machine learning models that work in unison to provide superior defence against an ever-evolving threat landscape
  • 15. Ideate - One Prototype At A Time Source: https://www.reddit.com/r/reactiongifs/ How might we detect malicious URLs using machine learning and Python
  • 16. Phishing Typosquatting Domain Generation Algorithms (DGA) Cybersquatting Spam Malware Know The Threat
  • 17. Data Acquisition & Understanding BENIGN MALICIOUS ENRICHMENT
  • 18. Data Acquisition & Understanding Domain Creation Date Domain Update Date Domain Expiry Date Host IP Country Registrar WHOIS status Domain name
  • 21. Feature Engineering Digit Percentage Binning Shannon Entropy IANA Designation Special Characters Normalization Standardisation Embeddings One Hot Encoding Impute Missing Values
  • 22. Modeling & Evaluation - Cross Validation Test & Validation | Train
  • 23. Modeling & Evaluation - Candidate Models Deep Neural Network
  • 24. Modeling & Evaluation - Candidate Models Random Forest, Deep Neural Network
  • 25. Modeling & Evaluation - Candidate Models Random Forest, Deep Neural Network, Word Embedding
  • 26. Modeling & Evaluation - Candidate Models Random Forest, Deep Neural Network, Word Embedding. Results: Deep Neural Network (Accuracy 0.86, F1 Score 0.86); Random Forest (Accuracy 0.83, F1 Score 0.83); Word Embedding (Accuracy 0.79, F1 Score 0.80)
  • 27. Modeling & Evaluation: F1 Score Image source: https://towardsdatascience.com/precision-vs-recall-386cf9f89488
  • 28. Modeling & Evaluation: Confusion Matrix Deep Neural Network (axis labels: Benign, Malicious)
  • 29. Modeling & Evaluation: Explainability Global & local model explanation (SHAP) Source: https://github.com/slundberg/shap#sample-notebooks
  • 30. Modeling & Evaluation - Parameter Tuning Grid search (exhaustive) Random search (random)
  • 32. Shout outs Katie Ford (@katiegford) for the wonderful artwork Yi Fang and Paul who continue to push us Research Papers and the wonderful Python community
  • 33. References Detecting malicious URLs using machine learning techniques (Frank Vanhoenshoven; Gonzalo Nápoles; Rafael Falcon; Koen Vanhoof; Mario Köppen). What is design thinking? https://dribbble.com/stories/2019/03/22/what-is-design-thinking. Interactive Design https://www.interaction-design.org/literature/article/stage-1-in-the-design-thinking-process-empathise-with-your-users. Inc - Brainstorming. Microsoft Data Science Process https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
  • 34. References Choi, H., Zhu, B.B. and Lee, H., 2011. Detecting Malicious Web Links and Identifying Their Attack Types. WebApps, 11(11), p.218. Bilge, L., Kirda, E., Kruegel, C. and Balduzzi, M., 2011, February. EXPOSURE: Finding Malicious Domains Using Passive DNS Analysis. In Ndss (pp. 1-17).

Editor's Notes

  1. Hi, I’m Alistair… Data Scientist in security. I like Aussie Rules, cooking ... Hello, I’m Triss, a Research Masters student working in security.
  2. Around five years ago, I was developing websites for a number of clients. One afternoon, I got a call from an angry client, informing me... I was embarrassed, concerned for the client, and the incident cost a lot of time and money to fix. Now, in some way, shape or form, a lot of us have probably come across malicious stuff on the Internet.
  3. So why are we here? We hope to share with you: How Design Thinking can help teams create more meaningful machine learning products; How data science frameworks can provide structure to machine learning product development; How Python can make your machine learning dreams become reality
  4. Back in 2018, Triss and I were very keen to use machine learning to detect bad stuff on networks. Our team offered a bunch of standard detection use cases, but we struggled to decide what to work on. We were jumping between open-source data sets, trying to build things without a clear direction. The detection use cases on a spreadsheet didn’t resonate. Constructive criticisms that came our way included: What exactly is the problem you are trying to solve? Can you even get real-world data for that use case? Do you know what you’re doing? This was mistake number 1. We hadn’t even thought about our end-user yet. We didn’t properly understand the domain we were trying to serve. And there we were, attempting to jump straight into the code.
  5. This is where Design Thinking was introduced to us. It’s not something you hear a lot about in cyber security and data science. So what is design thinking? Design thinking is a form of creative problem solving. It provides a set of tools for all folks, including non-creatives, to come up with great ideas for meaningful problems. It forces teams to focus on the end-user, which leads to better products. If you’re starting a project in any domain (data science, security, finance, open-source), we suggest you take a look at what Design can offer your team. For our approach, we focus on the first three phases before kicking off our development. https://dribbble.com/stories/2019/03/22/what-is-design-thinking
  6. This is a bit like a forced marriage Design Thinking
  7. Threat Science Framework. So this is our highly leveraged Threat Science Framework: an end-to-end pattern used to guide our threat detection projects from ideation to deployment, picking the best parts from the processes I mentioned before. I’ll go through each step. Know thy user (empathize): Put yourself in the end user’s shoes. How do they feel? What is their goal? What are their challenges? Define the problem (define): After empathizing with the end user, you can define the problem you’d like to solve, i.e. the problem statement. Ideate: This is where you work with a diverse set of peers to come up with wild and wonderful ideas to solve the detection problem. This phase thrives when you include the unusual suspects. Know thy threat: Next, begin to understand the threat you are trying to detect; in our case, the potential indicators of malicious websites. This understanding will drive what data we need to acquire. Data Acquisition & Understanding: Now we gather whatever data we can and explore it to understand it as best we can. Other tasks include cleaning and wrangling the data. This is, in my opinion and that of many others, the most time-consuming step. Feature engineering: Through getting to know your data, you can begin to refine and enhance its feature space in preparation for the modeling phase. Techniques include One Hot Encoding and Label Encoding for categorical variables, Normalization and Standardization of numerical variables, Binning & Discretization, and Feature selection. Modeling & Evaluation: Next we set up our experiment, first partitioning our dataset into training and validation subsets, then defining our performance metric (which aligns to our problem), and then evaluating which models perform best. Tools: Fast.ai, Scikit-learn, PyTorch, Tensorflow, Keras. Deployment: Lastly we deploy our model and allow it to be used through the interfaces best suited to our end-user. Tools: Flask, Starlette, Docker, Kubernetes. [Picture of the process we defined]
  8. With this approach, before we start anything, we get to know the end-user. Our end-users were cyber security analysts working in security defence. Design Thinking offers a bunch of tools and exercises to get to know your end-user: Interviewing, Service Safaris, Guided Tours, Empathy maps, Affinity maps, Personas, Threat Intelligence, Open-source data, MITRE ATT&CK™. There are endless examples online if you’d like to search for these.
  9. So we interviewed security analysts and experts to identify some key challenges faced in the industry. And these were... Management of numerous security tools: The cyber security industry is a behemoth, and along with it comes a booming security software market that promises the world. Often security teams have too many tools at their disposal, meaning analysts must navigate between them to achieve an outcome. Blue Teams employ a wide range of tools allowing them to detect an attack, collect forensic data, perform data analysis, and make changes to thwart future attacks and mitigate threats. Alert fatigue: With numerous tools come even more alerts. Analysts are inundated each day with false positives, leading to “alert fatigue” and diminishing performance. Alarm fatigue or alert fatigue occurs when one is exposed to a large number of frequent alarms (alerts) and consequently becomes desensitized to them. Hard to hold onto top talent: The industry requires talented security analysts to maintain and utilise the complex security tools in the market. Not to mention, it takes considerable investment not only to grow newcomers but also to retain them. From this we were able to build a persona of the user we’d like to help. [Picture for each challenge]
  10. Once we were across the key challenges, we built a persona of our end-user: Security Analyst, Blue Team. Goals: Defend, defend, defend. Respond, respond, respond. Needs: Fast and effective triage and incident response. No false positives! Tools providing detection and response capability across the attack cycle. Free time to work on interesting projects and threat hunting. Pain points: Alert fatigue. Triaging numerous false positives. Keeping up with a constantly evolving threat landscape. What will the next attack look like? [Picture portrait of security analyst] [Picture of the attack kill chain]
  11. Problem/Needs Statements How might we Statements Now you know your end-user really well. A typical structure when defining a problem/needs statement looks like so: <Users> need <something> so that <benefit>. For our project, we came up with: Security Analysts need automated, intelligent monitoring and response capability across the cyber attack life cycle so that they can best defend against an ever-evolving threat landscape. This phase ensures you have a coherent problem to solve.
  12. [Diagram showing a bunch of models working together] [Get high resolution] Arrived at a how might we statement Real problem And a broad Great ways to bring all your ideas together include: Brainstorming (generate ideas from a seed question), Brainwriting (each team member generates a few ideas, thinks deeply about them, then prioritises), Mindmapping (grouping ideas together). It’s really important to include a diverse set of inputs for this. We consulted: Data Engineers, Data Scientists, Threat Hunters, Security Analysts, Project Managers. And… we finally landed... https://zwick.nyc/nickel-dime-savings-app-design-sprint
  13. What came out of our How might we mode...
  14. So now we have an idea we’d like to build out, and it’s time to understand the threat domain. Malicious URLs can be associated with a number of different threats, including: Phishing, Cybersquatting, Typosquatting, Domain Hijacking, Registrar Hacking, Domain Generation Algorithms (DGA). Researching these threats gave us an understanding of their potential indicators. For example: DGA URLs are associated with highly randomised strings. Typosquatting URLs include substrings that are highly similar to common web destinations such as google and facebook, e.g. goggle or facebok, to lure users. It is also worth mentioning that URLs associated with these threats may share common attributes, such as a short domain life or domain expiry date.
  15. We begin by ingesting the data using the pandas library to get a view of what the data structure looks like: get shape and dimension, screenshots etc. These are the data sources we have used for our model. Alexa Top 1 Million Domains: We made an assumption that these popular sites would be reliable examples of benign URLs due to their popularity and traffic. The Alexa rank is calculated based on the browsing behavior of Internet users. Using a combination of estimated average daily Unique Visitors and Pageviews over a course of 3 months, the site ranking is calculated. Traffic ranks are updated daily. Unique Visitors are users who visit a site on a given day. Pageviews are the total number of user URL requests for a site. The data is collected using one of 25,000 browser extensions for Google Chrome, Firefox, and Internet Explorer. From <https://www.iplocation.net/alexa-traffic-rank> IANA IPv4 Address Space Registry: Each registry is allocated a range of IPv4 addresses. The allocation of Internet Protocol version 4 (IPv4) address space to various registries is listed here. Originally, all the IPv4 address space was managed directly by the IANA. Later, parts of the address space were allocated to various other registries to manage for particular purposes or regional areas of the world. PhishTank database: a database of phishing website URLs. Malware Domain List: a list of domains with malware. AlienVault Reputation Database: a list of IP addresses with reputation values, categorising malicious and benign hosts. Domain Generation Algorithm (DGA) database. What are the limitations of the data? What would a solution like this require in product? What are the possible data sources available for integration and data enrichment?
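The two indicators mentioned in this note can be approximated with the standard library alone. The sketch below is an illustration under assumptions, not the talk's actual feature code: it uses Shannon entropy as a proxy for "highly randomised" strings and `difflib.SequenceMatcher` as a stand-in similarity measure for look-alike names. The brand list and the sample labels are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def closest_brand(label, brands):
    """Return (brand, similarity) for the brand most similar to `label`.

    Similarity is SequenceMatcher's ratio, between 0.0 and 1.0.
    """
    scored = [(brand, SequenceMatcher(None, label, brand).ratio()) for brand in brands]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical examples, not taken from the talk's data sets.
BRANDS = ["google", "facebook"]
for label in ["google", "goggle", "facebok", "xk2qz9vb7tq1"]:
    brand, score = closest_brand(label, BRANDS)
    print(f"{label:14s} entropy={shannon_entropy(label):.2f} "
          f"closest={brand} similarity={score:.2f}")
```

A high similarity that is not an exact match hints at typosquatting, while a high entropy hints at a generated name. Any thresholds would need to be chosen from real data.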
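As a rough sketch of the ingestion step described in this note, the snippet below loads a benign list and a malicious list with pandas, labels them, and inspects the combined frame. The column names and the rows are assumptions made for illustration; the real feeds have their own formats and would be read from files or URLs rather than in-memory strings.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the real feeds; every row here is hypothetical.
benign_csv = io.StringIO(
    "rank,domain\n1,google.com\n2,facebook.com\n3,wikipedia.org\n"
)
malicious_csv = io.StringIO(
    "url\nhttp://goggle-login.example/verify\nhttp://xk2qz9vb7tq1.example/\n"
)

# Label benign examples 0 and malicious examples 1.
benign = pd.read_csv(benign_csv)
benign = benign.rename(columns={"domain": "url"})[["url"]].assign(label=0)
malicious = pd.read_csv(malicious_csv).assign(label=1)

urls = pd.concat([benign, malicious], ignore_index=True)

# First look at structure: dimensions, column types and class balance.
print(urls.shape)
print(urls.dtypes)
print(urls["label"].value_counts())
```

Checking the class balance early is useful because it influences both the split strategy and the choice of evaluation metric.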
  16. Binning Normalization Digits percentage Domain Age
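The four features listed in this note could be computed along these lines. This is a minimal sketch using only the standard library; the bin boundaries (30 and 365 days), the function names and the sample domains are assumptions for illustration, not values from the talk.

```python
from datetime import date


def digit_percentage(domain):
    """Share of characters in the domain that are digits (0.0 to 1.0)."""
    if not domain:
        return 0.0
    return sum(ch.isdigit() for ch in domain) / len(domain)


def domain_age_days(created, observed):
    """Domain age in days between the WHOIS creation date and the observation date."""
    return (observed - created).days


def age_bin(age_days):
    """Bin the age into coarse categories; the boundaries are assumed."""
    if age_days < 30:
        return "new"
    if age_days < 365:
        return "recent"
    return "established"


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical domains with made-up creation dates.
observed = date(2019, 8, 1)
domains = [
    ("example.com", date(2010, 5, 1)),
    ("login-4821-secure.example", date(2019, 7, 20)),
    ("shop24.example", date(2019, 1, 15)),
]
ages = [domain_age_days(created, observed) for _, created in domains]
scaled_ages = min_max_normalize(ages)
for (name, _), age, scaled in zip(domains, ages, scaled_ages):
    print(f"{name:28s} digits={digit_percentage(name):.2f} "
          f"age={age}d bin={age_bin(age)} scaled={scaled:.2f}")
```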
  17. Say you have trained a machine learning model and obtained an amazing accuracy of 90%. How do you know it is performing well on real-world data that the model has not seen before? Does it actually work? This is why having a test set that represents unseen data is very important. It’s common practice to split the data into a training set, a test set and a validation set. The training set is used to train your model. The test set is used to test the performance of your model on previously unseen data, so that you know how well it performs. The validation set is typically used for parameter and model selection. It’s common to hold out around 20% of the data as the test set, but that ultimately depends on the overall size of your data. If your data has a million rows, 1 or 2% each for the test and validation sets would be sufficient.
  18. We have attempted a few models for this use case, and the top performing models are: deep neural network; random forest; word embedding, which is a lexicon-based model. Neural network: A neural network with 3 or more layers is considered a deep neural network. The layers and hidden units allow the model to learn complex features and high-dimensional data. Typically, a deep neural network requires a lot of tuning, which means you need extensive knowledge of the model in order to use it. They are usually trained using conventional frameworks such as PyTorch or TensorFlow. Alternatively, you can use pretrained models for transfer learning, and libraries like fast.ai can make things a lot easier and faster, especially for beginners who are keen to get their hands dirty with DNNs. Fast.ai is a wrapper for PyTorch. It offers multiple pretrained models that you can use and some useful additional functionality. We used the tabular model from fast.ai for this use case. Pros and cons of deep learning: DL is powerful and can be used on many difficult learning tasks, such as image classification and videos. It’s able to perform effective automatic feature extraction, reducing the need for manual feature engineering. The disadvantages are: it requires massive amounts of training data; it can require huge computing power; architectures can be complex and hard to tune; the models may not be easily interpretable, so you don’t know why they select certain features.
  19. Random forest: A random forest is an ensemble of decision trees, which basically means a collection of trees. Decision trees predict the final label by splitting at multiple decision points based on selected features. Having many trees provides better generalisation and reduces overfitting. The algorithm makes predictions based on a majority vote across the individual trees. Pros: commonly used and generates good predictions (performs pretty well); doesn’t require extensive scaling of data; able to handle a mixture of feature types, numerical and categorical. Cons: the model is difficult to interpret; not suitable for high-dimensional data.
  20. Word embedding: Word embeddings are typically used to associate words with their label. For instance, in the IMDb movie review use case, we are able to find words that are associated with positive and negative reviews. In our use case, we used a character-based embedding, which creates a multi-dimensional vector representation of each character in the URL. This was trained using TensorFlow.
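One way to produce the three sets described in this note is to call scikit-learn's `train_test_split` twice. The sketch below assumes a 60/20/20 split and uses synthetic placeholder data; the proportions, the 30% malicious rate and the variable names are illustrative choices, not the talk's configuration.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 samples with one dummy feature, 30% labelled malicious (1).
X = [[i] for i in range(1000)]
y = [1 if i % 10 < 3 else 0 for i in range(1000)]

# First hold out 40% of the data, then split that half-and-half into
# validation and test sets, giving 60% / 20% / 20% overall.
# stratify keeps the class balance similar in every subset.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The test set should only be used once, at the very end, so that its score remains an honest estimate of performance on unseen data.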
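A random forest of the kind described here can be sketched with scikit-learn. The data below is synthetic (generated with `make_classification`), so the printed accuracy says nothing about URL detection; the hyperparameters are assumed defaults for illustration. One detail worth knowing: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, which usually gives the same label as a majority vote but is not identical to it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an engineered feature table (not the talk's data).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# An ensemble of 100 decision trees, each trained on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
importances = forest.feature_importances_
print(f"trees={len(forest.estimators_)} test accuracy={accuracy:.2f}")
print("most important feature index:", int(importances.argmax()))
```

The `feature_importances_` attribute gives a coarse view of which inputs the forest relies on, which partly offsets the interpretability concern raised in the note.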
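To make the idea of a character-based embedding concrete, the sketch below maps each character of a URL to an integer id and looks up a vector for it in an embedding matrix. It is only an illustration of the data flow: the vocabulary, the padding scheme, the 8-dimensional size and the randomly initialised matrix are all assumptions, whereas in a trained model (the talk used TensorFlow) the matrix would be learned together with the classifier.

```python
import string

import numpy as np

# Assumed character vocabulary: lowercase letters, digits and common URL punctuation.
VOCAB = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
PAD, UNK = 0, 1  # reserved ids for padding and unknown characters
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}


def encode_url(url, max_len=32):
    """Convert a URL into a fixed-length list of character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in url.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


# Randomly initialised stand-in for a learned embedding matrix.
EMBED_DIM = 8
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(VOCAB) + 2, EMBED_DIM))

ids = encode_url("http://goggle.example/login")
vectors = embedding_matrix[ids]  # one row per character position
print(ids[:10])
print(vectors.shape)  # (max_len, EMBED_DIM)
```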
  21. The table shows performance that we have obtained for each of our models AND deep neural network is the winner!
  22. Here we have a few methods of evaluating machine learning models. Accuracy is pretty straightforward: it’s the number of correct predictions divided by the total number of predictions. And we have the F1 score, which is quite the standard evaluation method, especially when it comes to machine learning competitions like Kaggle. The F1 score is most suitable for our use case because it takes both false positives and false negatives into consideration. We don’t want our model to give too many false positives to security analysts, because that would slow down incident response as they would be wasting time investigating false alerts. The F1 score is also preferred over precision or recall alone, because optimising one usually trades off the other; for example, optimising precision will give you a lower recall value, whereas optimising the F1 score gives you a balance of both. The diagram on the right is an example of a confusion matrix, which shows true positives and true negatives as well as false positives and false negatives.
  23. Here’s the confusion matrix for our winning model, the deep neural network. A confusion matrix gives a good visual representation of your model’s performance. As you can see here, our deep neural network model has a higher false negative rate compared to its false positive rate.
  24. Explainability of machine learning models has attracted a lot of interest lately. Complex models like deep neural networks are called black boxes because it is difficult to know what is going on within the model. Ideally, we want to avoid models that rely on undesirable features, such as those that could lead to biases in production; for instance, in the context of image classification, relying on a snowy background to decide whether a picture contains a wolf instead of relying on the wolf itself. SHAP is a very cool library which gives further insight into the internal workings of models. You are able to get a global or local explanation of your model’s predictions. Global tells you which features influenced the model as a whole; local tells you which features influenced the outcome of individual predictions. Here’s an example of how it works. Take the meerkat example: the red is positive SHAP values that increase the probability of it being a meerkat according to your model, and blue is negative SHAP values that reduce the probability of the class. This shows that the model is relying on the eyes to detect whether it’s a picture of a meerkat. You can find out more about this library on their GitHub page, which also contains more examples.
  25. What is parameter tuning? Parameter tuning is trying out different values for your model parameters in order to obtain a better result. Take a neural network as an example: to tune it, we would try different numbers of layers, hidden units, learning rates, and activation functions, such as sigmoid or ReLU. There are 2 main methods of optimizing parameters: grid search and random search. Grid search searches exhaustively through the range of values that you have provided and gives you the best combination within that range. The downside is that it can be very time-consuming and computationally expensive. Random search, on the other hand, samples randomly from the range of values provided. It requires less processing time, but we aren’t guaranteed to find the optimal combination. I’ll hand this back to Alistair for the closing bit.
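The relationship between the confusion matrix and the metrics in this note can be written out in a few lines. The sketch below computes accuracy, precision, recall and F1 from the four confusion-matrix counts, treating "malicious" as the positive class. The counts are made up for illustration and are not the talk's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything flagged malicious, how much really was?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly malicious, how much did we flag?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 80 true positives, 10 false positives,
# 20 false negatives and 90 true negatives.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a model cannot reach a high F1 by maximising precision while neglecting recall, or the other way round.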
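The contrast between the two search strategies can be sketched with scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The example tunes a small random forest on synthetic data purely for illustration; the parameter grid, the F1 scoring choice, the 3-fold cross validation and the budget of 4 random trials are all assumptions rather than the settings used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data, kept small so that the search runs quickly.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# 2 * 3 * 2 = 12 candidate combinations.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# Grid search: evaluates every combination in the grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3
)
grid.fit(X, y)

# Random search: evaluates only n_iter randomly sampled combinations.
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_grid,
    n_iter=4,
    scoring="f1",
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid:  ", len(grid.cv_results_["params"]), "candidates,", grid.best_params_)
print("random:", len(rand.cv_results_["params"]), "candidates,", rand.best_params_)
```

Here random search does a third of the work of grid search. It may land on a slightly worse combination, which is the trade-off the note describes.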