Jakub Sanojca & Joāo Da Silva, Avast
Researcher Data Engineer
AI on Spark for Malware Analysis and Anomalous Threat Detection
Goal
Demonstrate how Avast leverages AI and big data to burn malware.
Agenda
• What Avast does
• Malware research
• Structured Streaming
• AI anomaly detection
• Demo
Thank you
• Big Data Systems
• AI team - especially Yura, Olga and Dmitry
• Threat researchers and analysts
Avast is dedicated to creating a world
that provides safety and privacy for all,
no matter who you are, where you are,
or how you connect.
Global reach
10#UnifiedDataAnalytics #SparkAISummit
Portfolio of security, privacy
and utility applications
World’s Largest Detection Network
• 300M+ new files monthly
• 10,000+ globally distributed servers
• 200B+ URLs
Training the Avast Machine Learning Engine
Purpose-built approach that takes < 12 hours to add
new features, train, and deploy into production
Malware classification
Data
● More than 500 features handcrafted by our experts from binary files
Task
● Classify files as clean, malware, or PUP (potentially unwanted program)
Two step ML Pipeline:
● Cluster data with custom k-means
● Classification inside the cluster is done
by Random Forest
Infrastructure: Underlying data lake - Burger
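The two-step pipeline above can be sketched in a few lines of plain Python. This is only an illustration of the routing idea, not Avast's implementation: the k-means here is textbook Lloyd's algorithm with naive seeding (the slide mentions a custom variant), the per-cluster classifier is a majority vote standing in for the Random Forest, and the feature vectors and labels are made up.

```python
from collections import Counter, defaultdict
from math import dist


def nearest(centroids, x):
    """Index of the centroid closest to feature vector x."""
    return min(range(len(centroids)), key=lambda i: dist(centroids[i], x))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for i, members in groups.items():
            centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit_two_step(points, labels, k):
    """Step 1: cluster the data. Step 2: fit one classifier per cluster.

    The per-cluster "classifier" is a majority vote, a stand-in for the
    Random Forest named on the slide.
    """
    centroids = kmeans(points, k)
    votes = defaultdict(Counter)
    for p, label in zip(points, labels):
        votes[nearest(centroids, p)][label] += 1
    models = {i: c.most_common(1)[0][0] for i, c in votes.items()}
    return centroids, models


def predict(centroids, models, x):
    """Route x to its cluster, then ask that cluster's model."""
    return models[nearest(centroids, x)]


# Made-up 2-D feature vectors; real inputs would have hundreds of features.
points = [(0.0, 0.1), (5.0, 5.1), (9.0, 0.1), (0.2, 0.0), (5.2, 4.9), (9.2, 0.0)]
labels = ["clean", "malware", "pup", "clean", "malware", "pup"]
centroids, models = fit_two_step(points, labels, k=3)
```

Calling `predict(centroids, models, (9.1, 0.2))` routes the sample to the cluster around `(9.1, 0.05)` and returns that cluster's label. The design point is that each per-cluster model only has to separate files that already look alike.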
Architecture: Malware classification
[Figure: pipeline diagram with stages Data, Features, Clustering, Training, Validation, Production; durations shown in the diagram: 3 h, 4.5 h, 24 h, 24 h, 24 h, 6 h]
● ~700 TB of binary files
● Patented tailor-made solution
Custom application
• optimised & performant
• takes months to develop
• not that easy to change
Spark
• slower
• easy to experiment with
• very fast development
Threat Detections Streaming
3 step threat approach
1. Identify - threat researcher
2. Block - operator
3. Analyze and automate - data / AI researcher + engineers
Time series of detections
• Thousands of detection time series
• Where should the operator focus?
Short response time is necessary
First idea - custom streaming app
• Python because of ML models
• Much of the code dealt with already solved problems
• POC written by researchers
• Gets the job done, but is not easy to maintain or experiment with
Adopted solution:
Spark Structured Streaming
Advantages of Structured Streaming for fast threat detection
• Unified processing engine
• End to end AI with multiple sinks
• Window aggregations and Watermarking out of the box
• Resilient streams
Structured Streaming Adoption
• Unbounded table
• Triggers
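>>> # Micro-batch trigger: start a new batch every 5 seconds
>>> writer = sdf.writeStream.trigger(processingTime='5 seconds')
>>> # One-time trigger: process all available data in one batch, then stop
>>> writer = sdf.writeStream.trigger(once=True)
>>> # Continuous trigger (experimental): checkpoint every 5 seconds
>>> writer = sdf.writeStream.trigger(continuous='5 seconds')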
• Micro Batch Processing vs Continuous processing
– org.apache.spark.sql.execution.streaming.MicroBatchExecution
– org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution (experimental)
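The "unbounded table" and micro-batch ideas above can be pictured with a toy model in plain Python. It is a conceptual sketch only and does not mirror Spark internals: the class, its methods and the detection names are hypothetical.

```python
from collections import Counter


class UnboundedTable:
    """Toy model: the stream is an input table that only ever grows."""

    def __init__(self):
        self.rows = []           # every row that has arrived so far
        self.processed = 0       # offset of the first row not yet processed
        self.result = Counter()  # running "result table": count per detection

    def append(self, *new_rows):
        """New events arrive: the input table gets longer."""
        self.rows.extend(new_rows)

    def run_micro_batch(self):
        """One trigger firing: process only rows that arrived since the last one."""
        batch = self.rows[self.processed:]
        self.result.update(batch)
        self.processed = len(self.rows)
        return len(batch)


table = UnboundedTable()
table.append("Trojan.A", "PUP.B")
table.run_micro_batch()   # first trigger sees two rows
table.append("Trojan.A")
table.run_micro_batch()   # second trigger sees only the new row
```

The query is written as if it ran over the whole table, while each trigger only touches the rows that arrived since the previous one. A processing-time trigger would call `run_micro_batch` on a timer, and a one-time trigger would call it once and stop.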
[Figure: before / after comparison]
AI driven anomaly detection on time series
How to quickly identify campaigns of malware and potentially
unwanted programs.
• Traditional approaches - find outliers
• Machine learning - predict and compare
– Neural networks - LSTMs vs CNNs
– Other - auto-regressive models etc.
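The "predict and compare" approach can be sketched without any ML library: predict the next point, then flag it when the observation is too far from the prediction. In this sketch the predictor is simply the mean of the last few points, standing in for an LSTM, CNN or auto-regressive model, and the window size, threshold and counts are made-up values.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, threshold=3.0):
    """Return indices where the observed value strays far from the prediction.

    The prediction is the mean of the previous `window` points, a stand-in
    for a learned model; the tolerance is `threshold` standard deviations
    of that same history.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        predicted = mean(history)
        spread = stdev(history) or 1.0  # avoid a zero tolerance on flat history
        if abs(series[i] - predicted) > threshold * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical detection counts per time bucket, with one burst.
counts = [10, 12, 11, 9, 10, 11, 10, 95, 12, 10]
flagged = find_anomalies(counts)
```

Here only the burst at index 7 is flagged. Swapping the mean for a trained model changes the `predicted` line and nothing else, which is what makes this scheme easy to experiment with.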
Threat anomaly detection: training
• Sequential
• Parallel! mapPartitions / pandas_udf
• Distributed - TensorFlowOnSpark
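A minimal sketch of the parallel option, assuming one independent model per detection time series: the training function is ordinary Python over an iterator of rows, and Spark's `mapPartitions` runs it on every partition. The "model" here is just a mean and standard deviation, and the column names `detection` and `count` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def train_partition(rows):
    """Fit one tiny model per detection series found in a partition.

    `rows` yields (detection, count) pairs. The "model" is the mean and
    standard deviation of the counts, a placeholder for a real network.
    """
    series = defaultdict(list)
    for detection, count in rows:
        series[detection].append(count)
    for detection, counts in series.items():
        yield detection, mean(counts), pstdev(counts)


def train_in_parallel(sdf):
    """Run train_partition on every partition of a Spark DataFrame.

    Assumes hypothetical columns `detection` and `count`. Repartitioning by
    `detection` puts all rows of one series into the same partition.
    """
    return (
        sdf.repartition("detection")
        .rdd.map(lambda row: (row["detection"], row["count"]))
        .mapPartitions(train_partition)
    )
```

Because `train_partition` knows nothing about Spark, it can be run sequentially on a plain list first and moved to the cluster unchanged.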
Threat anomaly detection: training
• pandas_udf for parallel predictions
• super easy to test on already stored data as batch job
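One way to picture the serving side is a scalar pandas UDF that turns a column of recent counts into a predicted next value. This is a sketch under assumptions: the predictor is a plain average standing in for a trained model, and the column name `history` (an array of doubles) is invented for the example.

```python
def make_predict_udf():
    """Build a scalar pandas UDF that predicts the next value of each series.

    Imports live inside the factory so the sketch loads without Spark or
    pandas installed.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def predict_next(history: pd.Series) -> pd.Series:
        # Each element is the array of recent counts for one series; the
        # average is a stand-in for calling a trained model's predict().
        return history.apply(lambda counts: float(sum(counts)) / len(counts))

    return predict_next


def add_predictions(df):
    """Add a `predicted` column; `df` may be a batch or a streaming DataFrame.

    The column name `history` is hypothetical.
    """
    predict_next = make_predict_udf()
    return df.withColumn("predicted", predict_next("history"))
```

Since `add_predictions` takes any DataFrame, the same function can be checked against stored data read with `spark.read` before it is attached to a stream read with `spark.readStream`.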
Threat anomaly detection: stream serving
Demo + Code Walkthrough
Challenges
• Multiple potential incompatibility surfaces
• Unexpected behavior / Unknowns
• Silent failures
Takeaways
• Easier collaboration between Science and Engineering teams
• An excellent toolbox to do anomaly detection in near real time
• Easy ML/AI/DL integration
• Parallelism
Questions?