© 2016 Cloudera, Inc. All rights reserved. 1
Malware Tracking at Scale
About me
• Michael Bentley
• Formerly Director of Research and Response @ Lookout
• Currently working on data mining projects
• KK6WCN
• michael@setnorth.com
Agenda
• What we are trying to accomplish
• How basic heuristics work
• Where basic heuristics don’t work
• Tracking with pairwise similarity and EMR
• Visualizations to help extract more information
• Mistakes and caveats
What are we trying to accomplish
• Searching for major versions of software (malware)
• Find ways to detect it with simple heuristics
• Find ways to track it
• Dataset discovery
Simple heuristics
• Detect on static data
• Detect on metadata created by the analysis stack
[Diagram: applications, analysis, acquisition. Detection inputs: hashes, strings, who signed it / certificate]
Simple heuristics - hashes
[Diagram: hashes of the APK file, the icon, and the DEX file]
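A minimal sketch of the hashing heuristic, assuming the APK is read as an ordinary ZIP archive. The names `hash_apk_parts` and `build_fake_apk` and the icon path `res/drawable/icon.png` are hypothetical; real APKs keep icons under varying paths. SHA-1 is chosen only because the deck later identifies applications by SHA1.

```python
import hashlib
import io
import zipfile


def hash_apk_parts(apk_bytes, members=("classes.dex", "res/drawable/icon.png")):
    """Return SHA-1 hex digests for the whole APK and selected members.

    The member names are illustrative assumptions; missing members are skipped.
    """
    digests = {"apk": hashlib.sha1(apk_bytes).hexdigest()}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        names = set(archive.namelist())
        for member in members:
            if member in names:
                digests[member] = hashlib.sha1(archive.read(member)).hexdigest()
    return digests


def build_fake_apk(dex_payload, icon_payload):
    """Build a tiny in-memory ZIP standing in for an APK (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("classes.dex", dex_payload)
        archive.writestr("res/drawable/icon.png", icon_payload)
    return buffer.getvalue()
```

Hashing the parts separately means a repackaged app with a new icon still matches on its DEX hash even though the whole-file hash changes.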
Simple heuristics - string detection
• Nice ASCII string delimited by null bytes
• Malicious class path
• Byte code
• Exact match in one or both directions of string
• Ctrl + F
[Figure label: null byte]
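A rough sketch of null-delimited string matching. It treats the input as a flat byte blob split on null bytes, which simplifies how a real DEX file stores strings. The function names and the class path used in the usage note are hypothetical, and reading "one or both directions" as "anchored to a null byte on one or both sides" is an assumption.

```python
def null_delimited_strings(blob, min_len=4):
    """Yield printable-ASCII strings that sit between null bytes in a blob."""
    for token in blob.split(b"\x00"):
        if len(token) >= min_len and all(0x20 <= byte <= 0x7E for byte in token):
            yield token.decode("ascii")


def match_signature(blob, needle, anchor_start=True, anchor_end=True):
    """Check whether `needle` occurs anchored to null bytes.

    anchor_start / anchor_end pick which side(s) of the needle must touch a
    null delimiter (or the edge of the blob): one direction or both.
    """
    for token in blob.split(b"\x00"):
        if anchor_start and anchor_end:
            hit = token == needle
        elif anchor_start:
            hit = token.startswith(needle)
        elif anchor_end:
            hit = token.endswith(needle)
        else:
            hit = needle in token
        if hit:
            return True
    return False
```

For example, `match_signature(blob, b"Lcom/evil/payload/Loader;")` only fires when the whole null-delimited token equals that made-up class path, while `anchor_end=False` turns it into a prefix match.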
Simple heuristics - certificates
• Same malware
• Different certificates
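One way to surface the "same malware, different certificates" pattern is to group samples by payload hash and count distinct signers. This is only a sketch: the record layout (DEX SHA1 paired with a certificate fingerprint) and the function names are assumptions, not something the slides specify.

```python
from collections import defaultdict


def certs_per_payload(samples):
    """Map each DEX hash to the set of certificate fingerprints that signed it.

    `samples` is an iterable of (dex_sha1, cert_fingerprint) pairs.
    """
    signers = defaultdict(set)
    for dex_sha1, cert_fingerprint in samples:
        signers[dex_sha1].add(cert_fingerprint)
    return dict(signers)


def multi_signed(samples):
    """Return the DEX hashes that appear under more than one certificate."""
    grouped = certs_per_payload(samples)
    return sorted(dex for dex, certs in grouped.items() if len(certs) > 1)
```

A payload that shows up in `multi_signed` is a case where a certificate-based heuristic alone would miss repackaged copies.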
Where simple heuristics are good
• Good for things that don’t change
• Computationally cheap
• About the same scenario for network (IDS) or application inspection (malware detection)
Where it’s problematic
• Anything with funding/making money.
• Malware created in Eastern Europe, Asia, Italy (Hacking Team)
• Mass creation of certificates
• Code taken from Stack Overflow
• Anything with basic string obfuscation
• Hunting for new major versions
Enter pairwise similarity
You’re about to see a spreadsheet at a big data conference
http://gunshowcomic.com/648
Application pairwise similarity
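The slides do not say which similarity measure was used, so the sketch below swaps in Jaccard similarity over feature sets purely for illustration. The function names and the idea of representing each app as a set of features (strings, class names, and so on) are assumptions.

```python
from itertools import combinations


def jaccard(features_a, features_b):
    """Jaccard similarity of two feature sets, in the range 0.0 to 1.0."""
    if not features_a and not features_b:
        return 1.0
    return len(features_a & features_b) / len(features_a | features_b)


def pairwise_similarity(apps):
    """Score every unordered pair of apps.

    `apps` maps an app id (for example a SHA1) to a set of features.
    Returns a list of (id_a, id_b, score) rows, one per pair.
    """
    return [
        (id_a, id_b, jaccard(apps[id_a], apps[id_b]))
        for id_a, id_b in combinations(sorted(apps), 2)
    ]
```

Each returned row corresponds to one cell of the spreadsheet-style similarity table: two applications and the score between them.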
Go from "pick one app and rescan corpus"
Pick one application – Rescan corpus
• Examine one app
• Find heuristic
• Rescan corpus
• Rinse, repeat ad infinitum
• Throw people at the problem
http://bit.ly/2a0zcZR
Decoding what you already have
• Pairwise similarity defines the relationships for us
• Dots represent unique (SHA1) applications
• Colors represent major versions of malware
• Applications sharing a color are within ~85% code similarity of each other
Clustering and intelligence
[Diagram: seven APK nodes. Nearest neighbor: 95% similar. Cluster 1: 85% similar. Cluster 2: 85% similar. Cluster 0: < 85% similar]
• APKs are nodes; the similarity between them forms the edges
• Clusters are neighborhoods
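The neighborhood idea can be sketched as connected components over similarity edges at or above 85%. This is an illustrative union-find version in plain Python, not the speaker's pipeline (which the deck says used NetworkX). Sending apps with no strong neighbor to cluster 0 is an assumption drawn from the diagram labels, and `cluster_by_similarity` is a hypothetical name.

```python
def cluster_by_similarity(app_ids, scored_pairs, threshold=0.85):
    """Group apps into clusters: connected components over edges >= threshold.

    Apps with no neighbour at or above the threshold land in cluster 0;
    other clusters are numbered from 1 in order of their smallest app id.
    """
    parent = {app: app for app in app_ids}

    def find(app):
        while parent[app] != app:
            parent[app] = parent[parent[app]]  # path halving
            app = parent[app]
        return app

    for id_a, id_b, score in scored_pairs:
        if score >= threshold:
            parent[find(id_a)] = find(id_b)

    members = {}
    for app in sorted(app_ids):
        members.setdefault(find(app), []).append(app)

    labels, next_label = {}, 1
    for group in members.values():
        if len(group) == 1:
            labels[group[0]] = 0
        else:
            for app in group:
                labels[app] = next_label
            next_label += 1
    return labels
```

Because components chain through neighbors, two apps can share a cluster without being 85% similar to each other directly, which is worth keeping in mind when reading the colors.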
Clustering and intelligence
Clustering versus heuristics
Evolution of malware over time
• By taking the clustering data and overlaying it with the packaged-at data, we can watch malware evolve over time.
• Color represents major version
• Time is a 4-month sliding window
• Shows iterations from malware writers
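A sketch of the 4-month sliding window, assuming each app carries a cluster label and a packaged-at timestamp reduced to a `datetime.date`. The function names, the one-month step, and the record layout are assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def add_months(day, months):
    """Return the first day of the month `months` after `day`'s month."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def sliding_window_counts(records, start, end, window_months=4):
    """Count apps per cluster in each window, sliding one month at a time.

    `records` is an iterable of (cluster_label, packaged_at) pairs, where
    packaged_at is a datetime.date. Windows are [window_start, window_end).
    Returns a list of (window_start, Counter) tuples.
    """
    records = list(records)
    frames = []
    window_start = add_months(start, 0)
    while window_start <= end:
        window_end = add_months(window_start, window_months)
        counts = Counter(
            cluster for cluster, packaged_at in records
            if window_start <= packaged_at < window_end
        )
        frames.append((window_start, counts))
        window_start = add_months(window_start, 1)
    return frames
```

Each frame can then be drawn as one step of an animation, with a cluster fading out as a newer major version takes over.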
Pairwise problems and options
• Comparing 3,500 applications against each other is 3,500 × 3,500 = 12,250,000 operations
• As you bring more applications in, expect to scale the EMR cluster or reduce n.
• You can overmatch on similarity – outlier issue
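The arithmetic behind the operation count, as a small sketch. The slide's figure matches a full n × n comparison; whether the actual job skipped symmetric duplicates is not stated, so the unique-pair count is shown only for comparison. `pairwise_cost` is a hypothetical name.

```python
def pairwise_cost(n):
    """Return (full n x n comparisons, unique unordered pairs) for n apps."""
    full_matrix = n * n
    unique_pairs = n * (n - 1) // 2
    return full_matrix, unique_pairs


print(pairwise_cost(3500))  # (12250000, 6123250)
print(pairwise_cost(7000))  # (49000000, 24496500)
```

Doubling the corpus roughly quadruples the work either way, which is why the options come down to scaling the cluster or reducing n.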
Tripping over the bar
• Pairwise similarity for 7k apps is about 5 GB.
• So is S3 (5 GB is the ceiling for a single PUT upload)
• Things go bad when you don’t respect the bucket size
• Troubleshooting CSV sizes is a thing
• Doesn’t work well on small applications
• Temporary files on your local machine that are 70 GB cause problems
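A back-of-the-envelope sketch for sizing the output before it is written. The row layout (two 40-character SHA1 hex ids, a score with four decimals, comma-separated) is an assumption; the talk does not describe the actual CSV format. All function names are hypothetical.

```python
def csv_row_bytes(id_len=40, score_decimals=4):
    """Bytes in one 'id_a,id_b,score' row (plus newline) with hex ids."""
    sample = f"{'a' * id_len},{'b' * id_len},{0.5:.{score_decimals}f}\n"
    return len(sample.encode("ascii"))


def estimate_csv_bytes(n_apps, row_bytes=None):
    """Estimate the size of a full n x n pairwise CSV without a header."""
    if row_bytes is None:
        row_bytes = csv_row_bytes()
    return n_apps * n_apps * row_bytes


def parts_needed(total_bytes, max_part_bytes):
    """How many files are needed to keep each one under a size cap."""
    return -(-total_bytes // max_part_bytes)  # ceiling division
```

Under these assumptions, 7,000 apps come to about 4.4 GB, the same ballpark as the figure on the slide, and `parts_needed` gives a quick way to plan splitting the output into smaller files.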
Knowledge
• I had never used NetworkX before ~2014
• I had no idea how to go from what we had into a decent format for visualizing this (GraphML).
• Almost no experience in graph theory before ~2014
• Gilad Lotan had a great PyCon talk which got me started. I still reference his talks.
• Gephi is a great shortcut for visualizing in 2D if you aren’t familiar with D3
• Seth Hardy gave tons of amazing feedback while I was learning
• Jack Urban proved that it was possible to track applications as a network
• The Gensim library is a great way to get started in doing comparisons of applications
• Lots of inspiration from the Defcon 22 OpenDNS talk (theirs is better)
Thank you.

Streamlining Python Development: A Guide to a Modern Project Setup
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Wrangle 2016: Malware Tracking at Scale

  • 1. Malware Tracking at Scale (© 2016 Cloudera, Inc. All rights reserved.)
  • 2. About me
      • Michael Bentley
      • Formerly Director of Research and Response @ Lookout
      • Currently working on data mining projects
      • KK6WCN
      • michael@setnorth.com
  • 3. Agenda
      • What we are trying to accomplish
      • How basic heuristics work
      • Where basic heuristics don’t work
      • Tracking with pairwise similarity and EMR
      • Visualizations to help extract more information
      • Mistakes and caveats
  • 4. What are we trying to accomplish
      • Searching for major versions of software (malware)
      • Find ways to detect it with simple heuristics
      • Find ways to track it
      • Dataset discovery
  • 5. Simple heuristics
      • Detect on static data
      • Detect on metadata created by the analysis stack
      • [Diagram labels: applications, analysis, acquisition; hashes, strings, who signed it / certificate]
  • 6. Simple heuristics - hashes
      • [Diagram labels: APK file, hashes, icon, Dex file]
  • 7. Simple heuristics - string detection
      • Nice ASCII string delimited by null bytes
      • Malicious class path
      • Byte code
      • Exact match in one or both directions of string
      • Ctrl + F
      • [Figure label: null byte]
  • 8. Simple heuristics - certificates
      • Same malware
      • Different certificates
  • 9. Where simple heuristics are good
      • Good for things that don’t change
      • Computationally cheap
      • About the same scenario for network (IDS) or application inspection (malware detection)
  • 10. Where it’s problematic
      • Anything with funding/making money
      • Malware created in Eastern Europe, Asia, Italy (Hacking Team)
      • Mass creation of certificates
      • Code taken from Stack Overflow
      • Anything with basic string obfuscation
      • Hunting for new major versions
  • 11. Enter pairwise similarity
      • You’re about to see a spreadsheet at a big data conference
      • http://gunshowcomic.com/648
  • 12. Application pairwise similarity
  • 13. Go from pick one app and rescan corpus
  • 14. Pick one application – Rescan corpus
      • Examine one app
      • Find heuristic
      • Rescan corpus
      • Rinse, repeat ad infinitum
      • Throw people at the problem
      • http://bit.ly/2a0zcZR
  • 15. Decoding what you already have
      • Pairwise similarity defines the relationships for us
      • Dots represent unique (SHA1) applications
      • Colors represent major versions of malware
      • Applications of the same color are within a ~85% match of code distance
  • 16. Clustering and intelligence
      • [Diagram: APK nodes linked to their nearest neighbor at 95% similar; Cluster 1 at 85% similar; Cluster 2 at 85% similar; Cluster 0 at < 85% similar]
      • APKs are nodes and edges
      • Clusters are neighborhoods
  • 17. Clustering and intelligence
  • 18. Clustering versus heuristics
  • 19. Evolution of malware over time
      • By overlaying the clustering data with the "packaged at" data, we can watch malware evolve over time
      • Color represents major version
      • Time is a 4-month sliding window
      • Shows iterations from malware writers
  • 20. Pairwise problems and options
      • Comparing 3,500 applications is 12,250,000 operations (3,500 × 3,500)
      • As you bring more applications in, expect to scale the EMR cluster or reduce n
      • You can overmatch on similarity – outlier issue
  • 21. Tripping over the bar
      • Pairwise similarity for 7k apps is about 5 GB
      • So is S3
      • Things go bad when you don’t respect the bucket size
      • Troubleshooting CSV sizes is a thing
      • Doesn’t work well on small applications
      • Temporary files on your local machine that are 70 GB cause problems
  • 22. Knowledge
      • I had never used NetworkX before ~2014
      • I had no idea how to go from what we had into a decent format for visualizing this (GraphML)
      • Almost no experience in graph theory before ~2014
      • Gilad Lotan had a great PyCon talk which got me started. I still reference his talks.
      • Gephi is a great shortcut for visualizing in 2D if you aren’t familiar with D3
      • Seth Hardy, who gave tons of amazing feedback while I was learning
      • Jack Urban, who proved that it was possible to track applications as a network
      • The Gensim library is a great way to get started comparing applications
      • Lots of inspiration from the Defcon 22 OpenDNS talk (theirs is better)
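
The pairwise similarity and ~85% clustering idea from slides 15, 16 and 20 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the pipeline used in the talk: the slides do not say how similarity was scored, so Jaccard similarity over sets of made-up feature tokens stands in for it, and the app names, token names and helper functions (jaccard, pairwise_similarity, cluster) are all hypothetical. Clusters here are simply connected components of the graph whose edges are pairs scoring at or above the threshold. Note that the 12,250,000 figure on slide 20 is 3,500 squared, the full matrix; scoring each unordered pair once, as below, takes n(n - 1)/2 comparisons, or 6,123,250 for 3,500 apps.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_similarity(features):
    """Score every unordered pair of apps once: n * (n - 1) / 2 comparisons."""
    return {
        (x, y): jaccard(features[x], features[y])
        for x, y in combinations(sorted(features), 2)
    }


def cluster(features, threshold=0.85):
    """Group apps joined by edges scoring >= threshold (connected components)."""
    parent = {app: app for app in features}

    def find(app):
        # Follow parent links to the root, shortening the path as we go.
        while parent[app] != app:
            parent[app] = parent[parent[app]]
            app = parent[app]
        return app

    for (x, y), score in pairwise_similarity(features).items():
        if score >= threshold:
            parent[find(x)] = find(y)

    groups = {}
    for app in features:
        groups.setdefault(find(app), set()).add(app)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical corpus: each app is reduced to a set of feature tokens.
base = {f"t{i}" for i in range(20)}
apps = {
    "app_a": set(base),
    "app_b": base | {"t20"},                  # one extra token
    "app_c": (base - {"t19"}) | {"x1", "x2"},  # one removed, two added
    "app_d": {f"u{i}" for i in range(20)},    # unrelated code
}

if __name__ == "__main__":
    for pair, score in pairwise_similarity(apps).items():
        print(pair, round(score, 3))
    print(cluster(apps))
```

In this toy corpus app_b and app_c score below 0.85 against each other, yet both land in the same cluster because each is within the threshold of app_a. That chaining is one way a threshold graph can group more than intended, so the threshold is worth tuning against known samples. A library such as NetworkX could replace the hand-rolled union-find and also export the graph for a tool like Gephi.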