
AI on Spark for Malware Analysis and Anomalous Threat Detection

At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases of Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with a trained model. Next we will show how we use Spark for analysis and clustering of malicious files and for large-scale experimentation to automatically process and handle changes in malware. Finally, we will compare Spark to other tools we have used for solving those problems.

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Jakub Sanojca & João Da Silva, Avast (Researcher, Data Engineer)
  3. AI on Spark for Malware Analysis and Anomalous Threat Detection. Jakub Sanojca & João Da Silva, Avast (Researcher, Data Engineer)
  4. Goal: Demonstrate how Avast leverages AI and big data to burn malware.
  6. Agenda • What Avast does • Malware research • Structured Streaming • AI anomaly detection • Demo
  7. Thank you
  8. Thank you • Big Data Systems • AI team - especially Yura, Olga and Dmitry • Threat researchers and analysts
  9. Avast is dedicated to creating a world that provides safety and privacy for all, no matter who you are, where you are, or how you connect.
  10. Global reach: Portfolio of security, privacy and utility applications #UnifiedDataAnalytics #SparkAISummit
  11. World’s Largest Detection Network: 300M+ new files monthly • 10,000+ globally distributed servers • 200B+ URLs
  12. Training the Avast Machine Learning Engine: Purpose-built approach that takes < 12 hours to add new features, train, and deploy into production
  13. Malware classification. Data: ● >500 handcrafted features from binary files, from our experts. Task: ● Classification into clean/malware/PUP files. Two-step ML pipeline: ● Cluster data with custom k-means ● Classification inside the cluster is done by Random Forest
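The two-step routing described on slide 13 can be sketched in plain Python. This is only an illustration of the control flow, not Avast's implementation: the centroids are assumed to be already computed by some k-means run, the `MajorityVote` class is a hypothetical placeholder standing in for the per-cluster Random Forest, and the two-dimensional toy features replace the >500 handcrafted ones.

```python
from collections import Counter, defaultdict
from math import dist


def nearest_centroid(x, centroids):
    """Return the index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))


class MajorityVote:
    """Hypothetical placeholder for the per-cluster Random Forest."""

    def fit(self, labels):
        # Remember the most frequent label seen in this cluster.
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label  # ignores x; a real classifier would use it


def fit_two_step(samples, labels, centroids):
    """Group training labels by cluster, then fit one model per cluster."""
    by_cluster = defaultdict(list)
    for x, y in zip(samples, labels):
        by_cluster[nearest_centroid(x, centroids)].append(y)
    return {c: MajorityVote().fit(ys) for c, ys in by_cluster.items()}


def predict_two_step(x, centroids, models):
    """Route x to its cluster, then ask that cluster's model for a label."""
    return models[nearest_centroid(x, centroids)].predict(x)


# Toy data: assumed centroids and labels, for illustration only.
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.1, 0.2), (0.3, 0.1), (9.8, 10.1), (10.2, 9.9), (9.9, 9.7)]
labels = ["clean", "clean", "malware", "malware", "pup"]
models = fit_two_step(samples, labels, centroids)
print(predict_two_step((0.2, 0.2), centroids, models))
print(predict_two_step((9.5, 10.0), centroids, models))
```

Training one classifier per cluster keeps each model focused on files that already look alike, which is presumably why the clustering step comes first. A cluster that receives no training samples has no model in this sketch and would need a fallback.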
  14. Infrastructure: Underlying data lake - Burger
  15. Architecture: Malware classification [pipeline diagram; stages shown: Data, Features, Clustering, Training, Validation, Production; durations shown: 3h, 4.5h, 24 h, 24 h, 24 h, 6 h] ● ~700TB of binary files ● patented tailor-made solution
  16. Custom application: • optimised & performant • takes months to develop • not that easy to change. Spark: • slower • easy to experiment with • very fast development
  17. Threat Detections Streaming
  18. 3 step threat approach: 1. Identify - threat researcher 2. Block - operator 3. Analyze and automate - data / AI researcher + engineers
  21. Time series of detections • Thousands of detection time series • Where should operator focus?
  23. Short response time is necessary
  25. First idea - custom streaming app • Python because of ML models
  26. First idea - custom streaming app • Python because of ML models • Big part of code about already solved problems
  27. First idea - custom streaming app • Python because of ML models • Big part of code about already solved problems • POC written by researchers
  28. First idea - custom streaming app • Python because of ML models • Big part of code about already solved problems • POC written by researchers • Gets job done, but not easy to maintain or experiment
  29. Adopted solution: Spark Structured Streaming
  30. Structured Streaming
  31. Advantages of Structured Streaming for fast threat detection
  32. Advantages of Structured Streaming • Unified processing engine
  33. Advantages of Structured Streaming • Unified processing engine • End to end AI with multiple sinks
  34. Advantages of Structured Streaming • Unified processing engine • End to end AI with multiple sinks • Window aggregations and Watermarking out of the box
  35. Advantages of Structured Streaming • Unified processing engine • End to end AI with multiple sinks • Window aggregations and Watermarking out of the box • Resilient streams
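To make "window aggregations and watermarking" from slides 34-35 concrete, here is a simplified plain-Python model of the idea. It is not Spark code and does not reproduce Spark's exact semantics (Spark advances the watermark per micro-batch, for example). The function name `windowed_counts`, the 60-second tumbling window, and the 120-second allowed delay are all assumptions chosen for illustration.

```python
from collections import Counter


def windowed_counts(events, window=60, delay=120):
    """Count events per (window_start, key), dropping events that arrive too late.

    events: iterable of (event_time_seconds, key) in arrival order.
    The watermark trails the largest event time seen so far by `delay` seconds;
    an event is dropped when its whole window lies at or before the watermark.
    """
    counts = Counter()
    max_seen = float("-inf")
    dropped = 0
    for ts, key in events:
        watermark = max_seen - delay
        start = ts - ts % window          # start of the tumbling window
        if start + window <= watermark:
            dropped += 1                  # window already finalized
        else:
            counts[(start, key)] += 1
        max_seen = max(max_seen, ts)
    return counts, dropped


# Hypothetical detections: (event time in seconds, detection name).
events = [(10, "a"), (70, "a"), (15, "a"), (400, "b"), (20, "a")]
counts, dropped = windowed_counts(events)
print(dict(counts), dropped)
```

The event at time 15 is late but still inside the allowed delay, so it is counted. The event at time 20 arrives after the stream has advanced to 400, so its window is considered closed and it is dropped. Having this bookkeeping built in is what the slide means by "out of the box".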
  36. Structured Streaming Adoption
  37. Structured Streaming Adoption • Unbounded table
  38. Structured Streaming Adoption • Unbounded table • Triggers
  39. Structured Streaming Adoption • Unbounded table • Triggers >>> writer = sdf.writeStream.trigger(processingTime='5 seconds')
  40. Structured Streaming Adoption • Unbounded table • Triggers >>> writer = sdf.writeStream.trigger(processingTime='5 seconds') >>> writer = sdf.writeStream.trigger(once=True)
  41. Structured Streaming Adoption • Unbounded table • Triggers >>> writer = sdf.writeStream.trigger(processingTime='5 seconds') >>> writer = sdf.writeStream.trigger(once=True) >>> writer = sdf.writeStream.trigger(continuous='5 seconds')
  42. Structured Streaming Adoption • Unbounded table • Triggers • Micro Batch Processing vs Continuous processing
  43. Structured Streaming Adoption • Unbounded table • Triggers • Micro Batch Processing vs Continuous processing – org.apache.spark.sql.execution.streaming.MicroBatchExecution
  44. Structured Streaming Adoption • Unbounded table • Triggers • Micro Batch Processing vs Continuous processing – org.apache.spark.sql.execution.streaming.MicroBatchExecution – org.apache.spark.sql.execution.streaming.ContinuousExecution (experimental)
  45. Structured Streaming Adoption • Unbounded table • Triggers • Micro Batch Processing vs Continuous processing
  46. Before
  47. Before
  48. Before / After
  50. AI driven anomaly detection on time series
  51. AI driven anomaly detection on time series: how to quickly identify campaigns of malware and potentially unwanted programs
  52. AI driven anomaly detection on time series: how to quickly identify campaigns of malware and potentially unwanted programs • Traditional approaches - find outliers
  53. AI driven anomaly detection on time series: how to quickly identify campaigns of malware and potentially unwanted programs • Traditional approaches - find outliers • Machine learning - predict and compare – Neural networks - LSTMs vs CNNs
  54. AI driven anomaly detection on time series: how to quickly identify campaigns of malware and potentially unwanted programs • Traditional approaches - find outliers • Machine learning - predict and compare – Neural networks - LSTMs vs CNNs – Other - auto-regressive models etc.
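The "predict and compare" approach from slides 53-54 can be illustrated with a minimal sketch: forecast the next point from recent history, then flag the point if the actual value is far from the forecast. The deck mentions LSTMs, CNNs and auto-regressive models as predictors; the moving average below is only a stand-in, and the names `flag_anomalies`, `lookback` and `k` are assumptions for illustration.

```python
from statistics import mean, pstdev


def moving_average(history):
    """Stand-in predictor; the talk uses neural or auto-regressive models."""
    return mean(history)


def flag_anomalies(series, predict=moving_average, lookback=5, k=3.0):
    """Return indices whose value deviates from the prediction by more than
    k standard deviations of the preceding `lookback` points."""
    flagged = []
    for t in range(lookback, len(series)):
        history = series[t - lookback:t]
        expected = predict(history)
        spread = pstdev(history) or 1.0   # avoid a zero threshold on flat history
        if abs(series[t] - expected) > k * spread:
            flagged.append(t)
    return flagged


# Hypothetical detection counts per time bucket, with one spike.
detections = [10, 11, 9, 10, 10, 11, 50, 10, 9]
print(flag_anomalies(detections))
```

Because `predict` is just a callable, a trained model can be swapped in without touching the comparison logic. Note that a spike inflates the spread of the following windows, so points right after an anomaly are judged more leniently in this sketch.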
  55. Threat anomaly detection: training • Sequential
  56. Threat anomaly detection: training • Sequential • Parallel! mapPartitions / pandas_udf
  57. Threat anomaly detection: training • Sequential • Parallel! • Distributed - TensorFlowOnSpark
  58. Threat anomaly detection: stream serving • pandas_udf for parallel predictions • super easy to test on already stored data as batch job
  59. Demo + Code Walkthrough
  60. Challenges • Multiple potential incompatibility surfaces • Unexpected behavior / Unknowns • Silent failures
  61. Takeaways • Easier collaboration between Science and Engineering teams • An excellent toolbox to do anomaly detection in near real time • Easy ML/AI/DL integration • Parallelism
  62. Questions? Jakub Sanojca & João Da Silva, Avast (Researcher, Data Engineer)
