TLSH for the SOC
Jonathan Oliver
About Me
• Data Scientist at Trend Micro
• PhD at Monash University
• Data Mining consultant for NASA and FAA
• Data Scientist at Mailfrontier
• Inventor of TLSH
• Adjunct Professor at University of Queensland
This Talk
What?
• TLSH Tools for processing malware
• Data derived from Malware Bazaar
Why?
• Label new / unknown samples
How?
• Clustering Malware Bazaar using standard ML tools
• (HAC-T / DBSCAN)
• Visualization of clusters (from Malware Bazaar)
Quick Intro to TLSH
• Trend Micro Locality Sensitive Hash
• pip install py-tlsh
• Open source code at https://github.com/trendmicro/tlsh
• Fuzzy Hash
• With advantages from Machine Learning
• Works with Sklearn, Jupyter Notebooks and DBSCAN
• Adopted by VirusTotal
• Adopted by Malware Bazaar
• A part of the STIX standard
What does a TLSH digest look like?
chrome.exe
SHA256:c70b8cbb2ac962b343535454e4f2bcb3e48d83a04792c64bc768d59b3c1bf403
T11c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db
T1c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db
chrome.exe
SHA256:723aa4a407160bd99430de690f1f0d34af4a6622e2c44fe95be3bda3d7c344b3
Distance Calculation
T11c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db
T1c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db
Differences at the marked positions: 1 + 1 + 3 + 3 + 3
Total Distance = 11
0-30 Very Close Match
31-60 Close Match
61-100 Possible Match
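The bands above can be applied mechanically once a distance is available. A minimal sketch, assuming the `py-tlsh` package, whose `tlsh.diff` compares two digests; the helper names `match_band` and `tlsh_distance`, and the "no match" label for scores above 100, are illustrative assumptions and not part of the talk.

```python
def match_band(distance: int) -> str:
    """Map a TLSH distance onto the bands listed on the slide."""
    if distance < 0:
        raise ValueError("TLSH distances are non-negative")
    if distance <= 30:
        return "very close match"
    if distance <= 60:
        return "close match"
    if distance <= 100:
        return "possible match"
    # Assumption: the slide lists no band above 100, so treat it as no match.
    return "no match"


def tlsh_distance(digest_a: str, digest_b: str) -> int:
    """Distance between two TLSH digests; needs `pip install py-tlsh`."""
    import tlsh  # imported lazily so match_band works without the package

    return tlsh.diff(digest_a, digest_b)
```

With the total of 11 shown for the two chrome.exe digests, `match_band(11)` falls in the very close band.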
Malware Bazaar
As of 17 Sept 2021, Malware Bazaar (https://bazaar.abuse.ch/) has a dataset with
• 389300 samples
• 323709 samples have a label
We have clustered this dataset and found 16452 clusters
https://github.com/trendmicro/tlsh/tree/master/tlshCluster/malbaz
Use Cases / Motivation
Typical Use Case
Demo (1)
• Clustered Malware Bazaar
• Cluster output and pattern file from 2021-09-17 provided at https://github.com/trendmicro/tlsh/tree/master/tlshCluster/malbaz
• Use this to predict the malware family of Malware Bazaar 2021-09-18
Demo (1): Predicting Signature
• Difficult task as there are 592 distinct signatures in Malware Bazaar
• Associated 164 / 246 samples to clusters.
• We split the predictions into 3 categories
• Correct Signature 132/164
• Incorrect 13/164
• Inconclusive 19/164
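The counts above translate into rates with a few lines of arithmetic. This is only a restatement of the slide's numbers; the "accuracy among conclusive predictions" figure is a derived quantity that the talk does not quote.

```python
# Counts taken from the slide.
total_samples = 246
associated = 164
outcomes = {"correct": 132, "incorrect": 13, "inconclusive": 19}

# The three categories should account for every associated sample.
assert sum(outcomes.values()) == associated


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


coverage = pct(associated, total_samples)
rates = {name: pct(count, associated) for name, count in outcomes.items()}

# Derived figure (not on the slide): correct share of conclusive predictions.
conclusive = outcomes["correct"] + outcomes["incorrect"]
accuracy_when_conclusive = pct(outcomes["correct"], conclusive)

print(f"associated with a cluster: {coverage}%")
for name, rate in rates.items():
    print(f"{name}: {rate}%")
print(f"correct among conclusive: {accuracy_when_conclusive}%")
```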
Demo (1): Uses in the SOC
• Automatic labelling of unknown samples
• Scalable
• Suitable for Automation
• Associates unknown samples with similar historical samples
• Understand scope of the threat
• YARA rules
• …
⇒ Take suitable action
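Associating an unknown sample with labelled historical samples can be sketched as a nearest-neighbour lookup under a distance threshold. This is an illustrative sketch, not the clustering code from the TLSH repository: `label_unknown`, the family names, and the default threshold of 30 (borrowed from the "very close match" band) are assumptions, and `toy_distance` is a stand-in so the example runs without `py-tlsh`.

```python
from typing import Callable, Iterable, Optional, Tuple


def label_unknown(
    unknown: str,
    labelled: Iterable[Tuple[str, str]],
    distance: Callable[[str, str], int],
    threshold: int = 30,
) -> Optional[str]:
    """Return the label of the closest known digest within `threshold`, else None."""
    best_label, best_dist = None, threshold + 1
    for digest, label in labelled:
        d = distance(unknown, digest)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


def toy_distance(a: str, b: str) -> int:
    """Stand-in for the demo: count of differing characters. NOT the TLSH distance."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Hypothetical digests and family names, for illustration only.
known = [("AAAA1111", "FamilyA"), ("BBBB2222", "FamilyB")]
near = label_unknown("AAAA1112", known, toy_distance, threshold=2)
far = label_unknown("CCCC3333", known, toy_distance, threshold=2)
print(near, far)
```

In practice the `distance` argument would be a real digest comparison such as `tlsh.diff`, and a `None` result marks the sample as needing one of the more expensive methods.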
Demo (2)
• Understanding Clustering
• Dendrograms for malware
• See https://github.com/trendmicro/tlsh/blob/master/tlshCluster/malbaz.ipynb
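A dendrogram is a picture of the merge history of an agglomerative clustering. The sketch below records that history with a naive single-linkage procedure in pure Python. It is a simplified stand-in for illustration, not HAC-T and not the notebook's code; the function name, the cutoff parameter, and the integer toy data are assumptions.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Merge = Tuple[float, Tuple[int, ...], Tuple[int, ...]]


def single_linkage(
    items: Sequence, distance: Callable, cutoff: float
) -> Tuple[List[List[int]], List[Merge]]:
    """Merge the two closest clusters until the closest pair exceeds `cutoff`.

    Returns the final clusters (as lists of item indices) and the merges in
    the order they happened, which is the data a dendrogram draws.
    """
    clusters = [[i] for i in range(len(items))]
    merges: List[Merge] = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance is the closest pair of members.
            d = min(
                distance(items[i], items[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > cutoff:
            break
        merges.append((d, tuple(clusters[a]), tuple(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters, merges


# Toy data: integers with absolute difference as the distance.
values = [1, 2, 10, 11, 50]
clusters, merges = single_linkage(values, lambda x, y: abs(x - y), cutoff=5)
print(clusters)
print(merges)
```

This naive loop recomputes every pairwise distance on each pass, so it is only suitable for small examples.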
Digging Deeper
• Why TLSH is the way that it is.
• Why it uses K-skip-grams
• Comparison of TLSH with other Similarity Digests
• Comparison of Clustering Methods
Why K-skip-grams?
• Work on short strings / files
• Hard to attack
Kskip Ngrams
Data: A B C D E F G H I J
Ngram Features (N=4)
ABCD BCDE CDEF DEFG EFGH FGHI GHIJ
Kskip-Ngram N=4 K=2
AB AC AD BC BD BE CD CE CF DE DF DG EF EG EH FG FH FI GH GI GJ HI HJ IJ
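Both feature lists on this slide can be regenerated from the data string. A minimal sketch: `ngrams` slides a window of N characters, and `windowed_pairs` pairs each character with the characters that follow it inside the same window of 4, which reproduces the skip-gram list shown. The function names are illustrative, and this mirrors the slide's example only; it is not the feature extraction code used inside TLSH.

```python
from typing import List


def ngrams(data: str, n: int) -> List[str]:
    """Contiguous substrings of length n."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def windowed_pairs(data: str, window: int) -> List[str]:
    """Pair each character with every later character in the same window."""
    pairs = []
    for i in range(len(data)):
        for j in range(i + 1, min(i + window, len(data))):
            pairs.append(data[i] + data[j])
    return pairs


data = "ABCDEFGHIJ"
print(" ".join(ngrams(data, 4)))
print(" ".join(windowed_pairs(data, 4)))
```

Changing one character of the input affects at most four of the 4-grams, whereas the skip features spread each character across more, shorter features.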
Selecting K and N for Kskip-Ngrams
Computational Complexity (low score is good)
              N=3  N=4  N=5  N=6  N=7  N=8  …
K=5            21
K=4            15   35
K=3            10   20   35
K=2             6   10   15   21
K=1             3    4    5    6    7
K=0 (Ngram)     1    1    1    1    1    1
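The complexity scores listed for this slide coincide with the binomial coefficient C(N+K-1, K), reading each row from N=3 upward. That is an observation about the numbers shown, not a formula stated in the talk, and the function name below is an assumption.

```python
from math import comb


def complexity_score(n: int, k: int) -> int:
    """Binomial coefficient that matches the slide's entries (an observation)."""
    return comb(n + k - 1, k)


# Rebuild the grid for N = 3..8 and K = 0..5.
table = {k: [complexity_score(n, k) for n in range(3, 9)] for k in range(6)}
for k in sorted(table, reverse=True):
    print(f"K={k}", *table[k])
```

The slide leaves the more expensive cells blank; the sketch fills in the whole grid, so only the leading entries of each row correspond to values from the talk.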
Kskip-Ngram versus Ngrams
GAN-like experiment
[Diagram: Real World Data, Adversarial Agent, Discriminator; outcomes: Match / No Match]
Selecting K and N for Kskip-Ngrams
Adversarial Agent (Search Width = 15) (low score is good)
K=5 7.5
K=4 11.3
K=3 13.7
K=2 16.1
K=1 16.0
K=0 (Ngram) 25.4 31.2 32 43.4 57.4
N=3 N=4 N=5 N=6 N=7 N=8 …
Selecting K and N for Kskip-Ngrams
Accuracy
Comparing LSH / Similarity Digests
Ref: Martín-Pérez et al. “Bringing order to approximate matching: Classification and attacks on similarity digest algorithms”
Metric Trees for Nearest Neighbor Search
Nodes contain (item, distance)
Metric Trees: Do not work for (bounded) Similarity Measures
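One common metric tree is the vantage-point tree, where each node stores an item and a distance radius. The sketch below assumes that is the kind of structure meant here; the class and function names and the integer toy metric are illustrative. The pruning step in `nearest` relies on the triangle inequality, which is the property a similarity score that is not a true distance metric may fail to provide.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Node:
    item: Any                 # the vantage point
    radius: float             # median distance from item to the items below it
    inside: Optional["Node"]  # items with distance <= radius
    outside: Optional["Node"] # items with distance > radius


def build(items: List[Any], dist: Callable) -> Optional[Node]:
    if not items:
        return None
    vantage, rest = items[0], items[1:]
    if not rest:
        return Node(vantage, 0.0, None, None)
    dists = sorted(dist(vantage, x) for x in rest)
    radius = dists[len(dists) // 2]
    inside = [x for x in rest if dist(vantage, x) <= radius]
    outside = [x for x in rest if dist(vantage, x) > radius]
    return Node(vantage, radius, build(inside, dist), build(outside, dist))


def nearest(node: Optional[Node], query: Any, dist: Callable,
            best: Optional[Tuple[float, Any]] = None):
    """Return (distance, item) for the closest item to `query`."""
    if node is None:
        return best
    d = dist(query, node.item)
    if best is None or d < best[0]:
        best = (d, node.item)
    # Search the side the query falls in first.
    if d <= node.radius:
        first, second = node.inside, node.outside
    else:
        first, second = node.outside, node.inside
    best = nearest(first, query, dist, best)
    # Triangle inequality: the other side can only hold a closer item
    # if the query is within best[0] of the boundary.
    if abs(d - node.radius) <= best[0]:
        best = nearest(second, query, dist, best)
    return best


def abs_dist(a: int, b: int) -> int:
    return abs(a - b)


points = [3, 17, 42, 8, 25, 60, 1]
tree = build(points, abs_dist)
print(nearest(tree, 20, abs_dist))
```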
Comparing Clustering Approaches
Types of Clustering
• Similarity of the files
• Fuzzy Hashes
• Feature based
• Deep Learning
• YARA Rules
• Apply a pattern (Smart pattern)
• Sandbox / behavioural analysis
• …
Fuzzy Hashes
• Cryptographic Hashes:
• Any change completely changes the hash
• Useful for collecting evidence
• Fuzzy Hashes:
• Have the convenience of cryptographic hashes
• Can measure the Similarity between files
• Speed and Scale
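The first point, that any change completely changes a cryptographic hash, is easy to see with the standard library. A small sketch using SHA-256 on two inputs that differ in a single character; the example strings are arbitrary.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"  # one byte changed

h1, h2 = sha256_hex(original), sha256_hex(modified)
# Count the hex positions where the two digests disagree.
differing = sum(a != b for a, b in zip(h1, h2))

print(h1)
print(h2)
print(f"{differing} of {len(h1)} hex digits differ")
```

Because the two digests are unrelated, comparing them says nothing about how similar the inputs were, which is the gap a fuzzy hash is meant to fill.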
Potential Issues with Clustering
• Scale
• Does the method scale up to 10 million / 100 million files?
• Access to the file
• Does the method need to process the file?
• Manual effort
• Packers
• Multiple malware families may use the same packer
• Some methods will distinguish; other methods will not
Category    Technique             Speed / Scale  Access to file  Manual effort  Can separate families that share a packer
Similarity  Fuzzy Hash            Fast           No              No             No
            Feature based ML      Slow           Yes             Features       No
            Deep Learning         Slow           Yes             Network        ?
            YARA rules            Medium         Yes             Yes            Yes
            Smart Pattern         Fast           Yes             Yes            Yes
            Sandbox / Behavioral  Slow           Yes             No             Yes
Clustering Solutions
• Use multiple methods of clustering
• Split clustering / categorization into phases
1. Large scale / quick / cheap
• Fuzzy hashes (TLSH) are ideal
2. When needed, use more expensive methods
• Extensive security knowledge required
• Sandboxes
• Smart Patterns
• YARA rules
• Deep Learning
• etc
Conclusion
• Get the tools.
• pip install py-tlsh
• Open Source (Apache license)
• https://github.com/trendmicro/tlsh
• Fuzzy Hashes / TLSH / Telfhash are really useful tools
• Working with huge databases
• Use standard dev-ops / ML tools for malware
• Jupyter notebooks
• Sklearn
• DBSCAN
• Dendrograms for visualizing clustering
Resources
• TLSH
• https://github.com/trendmicro/tlsh
• Papers on TLSH
• http://tlsh.org/papers.html
• Malware Bazaar
• https://bazaar.abuse.ch/
Thanks to University of Queensland
