
CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache Spark


Cybercrime is one of the greatest threats to every company in the world today and a major problem for mankind in general. The damage due to cybercrime is estimated to be around $6 trillion by 2021. Security professionals are struggling to cope with the threat, so powerful and easy-to-use tools are necessary to aid in this battle. For this purpose we created an anomaly detection framework focused on security that can identify anomalous access patterns. It is built on top of Apache Spark and can be applied in parallel over multiple tenants. This allows the model to be trained on the data of thousands of customers on a Databricks cluster in less than an hour. The model leverages proven techniques from recommendation engines to produce high-quality anomalies. We thoroughly evaluated the model's ability to identify actual anomalies, both by using synthetically generated data and by staging an actual attack and showing that the model clearly identifies the attack as anomalous behavior. We plan to open source this library as part of a cyber-ML toolkit we will be offering.


  1. 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  2. 2. Roy Levin, Microsoft CyberMLToolkit: Anomaly Detection as a Scalable Generic Service Over Apache Spark #UnifiedDataAnalytics #SparkAISummit
  3. 3. Session goals • Present an easy-to-use framework that produces cyber-security anomalies • Explain how recommendation systems are used to find anomalous resource access • Show how we evaluated the framework to demonstrate its usefulness
  4. 4. Agenda: Motivation; Formulation & Models; Scalability for Large Datasets; Evaluation; Summary
  5. 5. centralized cloud native Security Information & Event Management system Build Your Own ML (BYOML) 1. Log data from cloud resources 2. Process logs from Azure Databricks cluster 3. Author custom security analytics 5
  6. 6. General Anomaly Detector: Dataset; Fault detection; System health monitoring; Security incidents; … We would like to capture only security-related anomalies
  8. 8. anomalous access • Train and apply on a simple-to-construct dataset – Avoid writing and maintaining complex rules and logic – Avoid the need to analyze multiple complex datasets such as: § Org-charts § RBAC tables § Cloud architectures 8
  9. 9. ? 9
  10. 10. Motivation Formulation & Models Scalability for Large Datasets Evaluation Summary Agenda 10
  11. 11. • Given user & resource pair (u, r) • Provide an anomaly score of user u accessing resource r • If anomaly score is above some threshold then surface the event 11
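The thresholding step on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function name `surface_anomalies`, the example users and resources, and the threshold value are all hypothetical and do not come from the talk.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (user, resource)


def surface_anomalies(scores: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Return the (user, resource) pairs scoring strictly above the threshold,
    ordered from most to least anomalous."""
    flagged = [(pair, score) for pair, score in scores.items() if score > threshold]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in flagged]


# Hypothetical scores, for illustration only.
example_scores = {
    ("alice", "payroll-db"): 7.8,
    ("bob", "wiki"): -0.3,
    ("carol", "build-server"): 0.1,
}
flagged = surface_anomalies(example_scores, threshold=3.0)
print(flagged)
```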
  12. 12. The straightforward approach. But users access new resources quite often, so this alone is not good enough
  13. 13. Create a profile per user and per resource, and check whether an access deviates from that profile
  14. 14. Intuition: • Take a recommendation system and use it for anti-recommendations 14
  15. 15. Recommendation Engines 15
  16. 16. Movie Recommendations: Model Training Phase (- marks a missing rating)
        Movie                   Roy    Inbal  Hasan  Lior   Anat   Arnon
        1 The Godfather         -      4      -      5      -      -
        2 The Dark Knight       3      -      -      -      2      5
        3 Pulp Fiction          5      3      5      4      4      5
        4 40 Year Old Virgin    2      4      -      -      3      3
        5 Analyze That          3      5      4      -      4      -
        6 Anger Management      3      5      -      -      -      5
        7 Black Hawk Down       5      -      -      4      -      -
  17. 17. Movie Recommendations: Model Training Phase. The same rating matrix (users Roy, Inbal, Hasan, Lior, Anat, Arnon by seven movies) with missing ratings marked "?". Each movie x1 … xm has a latent feature vector (f1, f2, f3), e.g. Romance, Action, Comedy, and each user has a parameter vector θ over the same three features; all of these values are unknown ("?")
  18. 18. Movie Recommendations: Model Training Phase (continued). The same rating matrix, movie feature vectors (f1, f2, f3) and user parameter vectors θ
  19. 19. Movie Recommendations: Model Apply Phase. The same rating matrix, movie feature vectors (f1, f2, f3) and user parameter vectors θ
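To make the training and apply phases concrete, here is a minimal sketch of matrix factorization trained with stochastic gradient descent, in plain Python. It is not the algorithm used in the talk (which relies on the Surprise library and Spark CF); the function names, hyperparameters and initialization are assumptions chosen for illustration. The ratings are a subset of the movie table above.

```python
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def factorize(ratings, n_factors=3, epochs=300, lr=0.01, reg=0.05, seed=0):
    """Learn a parameter vector theta per user and a feature vector x per item
    so that theta[user] . x[item] approximates each observed rating."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    # Small positive starting values (an arbitrary choice for this sketch).
    theta = {u: [rng.uniform(0.0, 0.5) for _ in range(n_factors)] for u in users}
    x = {i: [rng.uniform(0.0, 0.5) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for (u, i), rating in ratings.items():
            err = rating - dot(theta[u], x[i])
            for k in range(n_factors):
                t, f = theta[u][k], x[i][k]
                theta[u][k] += lr * (err * f - reg * t)
                x[i][k] += lr * (err * t - reg * f)
    return theta, x


def mean_squared_error(theta, x, ratings):
    errors = [(r - dot(theta[u], x[i])) ** 2 for (u, i), r in ratings.items()]
    return sum(errors) / len(errors)


# A subset of the observed ratings from the movie table.
observed = {
    ("Inbal", "The Godfather"): 4, ("Lior", "The Godfather"): 5,
    ("Roy", "The Dark Knight"): 3, ("Anat", "The Dark Knight"): 2,
    ("Arnon", "The Dark Knight"): 5, ("Roy", "Pulp Fiction"): 5,
    ("Inbal", "Pulp Fiction"): 3, ("Lior", "Pulp Fiction"): 4,
}

theta, x = factorize(observed)
# Apply phase: predict a rating that was missing from the table.
predicted = dot(theta["Roy"], x["The Godfather"])
print(round(mean_squared_error(theta, x, observed), 3), round(predicted, 2))
```

Training fits only the observed cells; the apply phase then scores any user and item pair, including those marked "?" in the table.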
  20. 20. Back to Anomalous Resource Access 20
  21. 21. • Let us re-examine our data: – User-resource pairs with the number of times accessed • The standard collaborative filtering (CF) model assumes explicit item ratings, which raises some problems: – A rating is not really what we have in the input • Although more accesses by a user to a resource likely mean that the user should be allowed access – We do not really have negative rating indications either, i.e., there is no explicit indicator saying that a user should not have access to some resource • What we do have is missing access
  22. 22. Linear Scaling of access counts into ratings (columns user1 to user6). Access counts: resource1: 1200, 1500; resource2: 900, 301, 1; resource3: 1500, 599, 1, 902, 1205, 1500; resource4: 299, 1200, 895, 901; resource5: 601, 1500, 1200, 1203; resource6: 603, 1499, 1495; resource7: 1499, 1200. Scaled ratings: resource1: 9, 10; resource2: 8, 6, 5; resource3: 10, 7, 5, 8, 9, 10; resource4: 6, 9, 8, 8; resource5: 7, 10, 9, 9; resource6: 7, 10, 10; resource7: 10, 9
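The numbers on this slide are consistent with a linear map from the count range [1, 1500] onto the rating range [5, 10], followed by rounding. The target range is inferred from the slide's values rather than stated in the talk, so treat the defaults below as assumptions; `scale_access_count` is a hypothetical helper name.

```python
def scale_access_count(count, min_count, max_count, low=5.0, high=10.0):
    """Linearly map an access count from [min_count, max_count] onto [low, high].

    The default target range [5, 10] is inferred from the slide's numbers.
    """
    if max_count == min_count:
        return high  # degenerate case: every pair has the same count
    fraction = (count - min_count) / (max_count - min_count)
    return low + fraction * (high - low)


counts = [1, 301, 599, 900, 1200, 1500]  # access counts taken from the slide
ratings = [round(scale_access_count(c, 1, 1500)) for c in counts]
print(ratings)  # [5, 6, 7, 8, 9, 10]
```

Starting the scale at 5 rather than 1 leaves the low end of the rating range free for other uses.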
  23. 23. Random Negative Samples, starting from the scaled ratings (columns user1 to user6): resource1: 9, 10; resource2: 8, 6, 5; resource3: 10, 7, 5, 8, 9, 10; resource4: 6, 9, 8, 8; resource5: 7, 10, 9, 9; resource6: 7, 10, 10; resource7: 10, 9
  24. 24. Random Negative Samples, with a rating of 1 added at some empty cells (columns user1 to user6): resource1: 1, 9, 10; resource2: 8, 1, 6, 5; resource3: 10, 7, 5, 8, 9, 10; resource4: 6, 9, 1, 8, 8; resource5: 7, 10, 9, 9; resource6: 7, 10, 10; resource7: 10, 9, 1
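A minimal sketch of random negative sampling, assuming (as the slide's table suggests) that a low rating of 1 is assigned to randomly chosen user and resource pairs that have no observed access. The function name, the number of samples and the seed are hypothetical, and enumerating every unobserved pair as done here is only practical for small tenants.

```python
import random


def add_negative_samples(ratings, users, resources, n_samples,
                         negative_rating=1.0, seed=0):
    """Return a copy of `ratings` with up to `n_samples` unobserved
    (user, resource) pairs added at `negative_rating`."""
    rng = random.Random(seed)
    unobserved = sorted(
        (u, r) for u in users for r in resources if (u, r) not in ratings
    )
    chosen = rng.sample(unobserved, min(n_samples, len(unobserved)))
    augmented = dict(ratings)
    for pair in chosen:
        augmented[pair] = negative_rating
    return augmented


users = ["user1", "user2", "user3"]
resources = ["resource1", "resource2"]
observed = {("user1", "resource1"): 9.0, ("user2", "resource2"): 6.0}
augmented = add_negative_samples(observed, users, resources, n_samples=2)
print(augmented)
```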
  25. 25. Adjust for user and resource bias and create an anomaly score, starting from the ratings with negative samples (columns user1 to user6): resource1: 1, 9, 10; resource2: 8, 1, 6, 5; resource3: 10, 7, 5, 8, 9, 10; resource4: 6, 9, 1, 8, 8; resource5: 7, 10, 9, 9; resource6: 7, 10, 10; resource7: 10, 9, 1
  26. 26. Motivation Formulation & Models Scalability for Large Datasets Evaluation Summary Agenda 26
  27. 27. • Actually: we are given a tenant-id, user, resource triplet (tid, u, r) • Provide anomaly score of user u accessing resource r per-tenant • Note: access within each tenant is isolated • Goals: – Process tenants in parallel – Cope with data from large tenants 27
  28. 28. • Create a Pandas UDF (PUDF) that uses the Surprise Python library to run the CF algorithm locally on each worker node • The provided PUDF works on Pandas DataFrames that are created per group when apply is called • The method is applied as follows: – df.groupBy(tid_colname).apply(my_pudf) * SurPRISE: Simple Python RecommendatIon System Engine http://surpriselib.com/
  29. 29. • Problem: the data from some tenants may be too large to fit into the memory of a single worker node • Solution: before applying, count number of entries per-tenant – If number of entries can fit in-memory then apply PUDF method – If not, then apply Spark CF, per tenant, one-by-one 29
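The count-then-dispatch decision can be sketched in plain Python as follows. This is a local stand-in for the Spark logic, not the talk's implementation: the labels `"local_pudf"` and `"spark_cf"`, the function names and the row limit are hypothetical, and in practice the limit would depend on worker memory.

```python
from collections import defaultdict


def group_by_tenant(rows):
    """Group (tenant_id, user, resource) rows by tenant id."""
    groups = defaultdict(list)
    for tid, user, res in rows:
        groups[tid].append((user, res))
    return dict(groups)


def plan_training(rows, max_in_memory_rows):
    """Choose a training path per tenant based on its number of entries."""
    plan = {}
    for tid, entries in group_by_tenant(rows).items():
        if len(entries) <= max_in_memory_rows:
            plan[tid] = "local_pudf"  # fits in one worker: per-group Pandas UDF
        else:
            plan[tid] = "spark_cf"    # too large: distributed CF, tenant by tenant
    return plan


rows = [
    ("A", "u1", "r1"), ("A", "u2", "r1"),
    ("B", "u1", "r1"), ("B", "u1", "r2"), ("B", "u2", "r2"), ("B", "u3", "r3"),
]
plan = plan_training(rows, max_in_memory_rows=3)
print(plan)  # {'A': 'local_pudf', 'B': 'spark_cf'}
```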
  30. 30. • Training produces a model which is basically – A dataframe mapping (tenant-id, user) and (tenant-id, resource) pairs to their corresponding latent feature vectors • Applying the model requires: – Joining with respective user/resource to retrieve vectors – Applying a dot-product * Note: model can be applied with Structured Streaming 30
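Applying the model as described (look up both latent vectors, then take a dot product) can be sketched with dictionaries standing in for the joined dataframes. The vectors and names below are made up for illustration, and how the dot product is turned into the final anomaly score is not spelled out on this slide, so the sketch stops at the raw dot product.

```python
def score_access(events, user_vectors, resource_vectors):
    """For each (tenant_id, user, resource) event, look up the two latent
    vectors and return their dot product, or None if either is unknown."""
    scored = []
    for tid, user, res in events:
        u_vec = user_vectors.get((tid, user))
        r_vec = resource_vectors.get((tid, res))
        if u_vec is None or r_vec is None:
            scored.append((tid, user, res, None))  # unseen user or resource
        else:
            value = sum(a * b for a, b in zip(u_vec, r_vec))
            scored.append((tid, user, res, value))
    return scored


# Hypothetical latent vectors keyed by (tenant_id, user) and (tenant_id, resource).
user_vectors = {("t1", "alice"): [1.0, 2.0]}
resource_vectors = {("t1", "share-a"): [3.0, 0.5]}
events = [("t1", "alice", "share-a"), ("t1", "bob", "share-a")]
scored = score_access(events, user_vectors, resource_vectors)
print(scored)
```

Keying both lookups by tenant id keeps tenants isolated, so a user name that appears in two tenants never shares a vector.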
  31. 31. Motivation Formulation & Models Scalability for Large Datasets Evaluation Summary Agenda 31
  32. 32. Experiments for Azure Sentinel AI 1. Synthetic dataset 2. Actual file share data from large customer • Users accessing shared network files 32
  33. 33. For training 33
  34. 34. For testing: add cross-group access
  35. 35. Results: 100%, i.e., all 100 cross-group accesses receive the top-100 anomaly scores
  36. 36. Actual Attack Description. Diagram: file share SMB server; shares on Machine 1, Machine 2, Machine n. 58% of companies have over 100,000 folders open to everyone within the network (source: Varonis cybersecurity data security and analytics)
  37. 37. Algorithm Training shares Machine 1 shares Machine 2 shares Machine n 37
  38. 38. Testset (2 days after training) shares Machine 1 shares Machine 2 shares Machine n 38
  39. 39. Results: anomaly scores per dataset
        Dataset                Mean    Stddev  Min      Max    Count
        Entire test set        0.05    1.16    -19.21   8.07   3.8M
        Unseen valid access    -0.28   0.38    -1.2     1.18   410
        Restricted access      7.81    0.11    7.44     8.07   400
  40. 40. Motivation Formulation & Models Scalability for Large Datasets Evaluation Summary Agenda 40
  41. 41. https://github.com/Azure/Azure-Sentinel-BYOML
        from sentinel_ai.peer_anomaly.spark_collaborative_filtering import AccessAnomaly

        access_anomaly = AccessAnomaly(  # it is just an estimator
            tenant_colname, user_colname, res_colname, score_colname
        )
        anom_model = access_anomaly.fit(training_dataset_scored_triplets)
        scored_test_dataset_triplets = anom_model.transform(test_dataset_triplets)
        scored_test_dataset_triplets.show()
  42. 42. • Introduced an Access Anomaly Detection framework for cyber security and showed how it fits into the BYOML pillar of Azure Sentinel – an anti-recommendation is an access anomaly – the code has been open sourced • The framework provides a simple-to-use API allowing security analysts to surface access anomalies • Call to action: experiment with the framework, continue this line of research, suggest and add more algorithms
  43. 43. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT
