
Apache Spark for Cyber Security in an Enterprise Company

To understand and react to their security situation, many cyber security operations use security information and event management (SIEM) software. Using a traditional SIEM in a large company such as Hewlett Packard Enterprise is a challenge due to the increasing volume and rate of data. We present the solution used to reduce the data volume processed by the SIEM using Spark Streaming, and the results obtained in processing one of the largest data feeds in HPE: firewall logs. Testing SIEM rules the traditional way is a time-consuming process: usually it is necessary to wait one day to get results and statistics for one day of production data. An alternative approach to building a SIEM using Spark and other big data technologies will be drafted, and results of “fast forward” processing of production data snapshots will be presented. HPE is the target of sophisticated, well-crafted attacks, and the deployed cyber security tools are not able to detect all of them. A simple application for detecting malicious trending domains, built using Spark MLlib and trained on company-specific data, will be described. Takeaways: Spark Streaming can be used to pre-process cyber security data and reduce its volume for further processing. Spark MLlib can be used to add additional detection capability for specific use cases.

In this presentation, we will share how Hewlett Packard Enterprise has implemented Apache Spark to deal with three main cyber security use cases:

1) Using Spark to help a security information and event management (SIEM) system process an increasing amount of data
2) Using Spark to test SIEM rules by “fast forward” processing of production data snapshots
3) Implementing machine learning to add an additional detection capability

Apache Spark for Cyber Security in an Enterprise Company

  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Josef Niedermeier, HPE - Apache Spark for Cyber Security in an Enterprise Company #UnifiedDataAnalytics #SparkAISummit
  3. Agenda • Introduction • Challenges in Cyber Security • Using Spark to help process an increasing amount of data – Offloading current applications – Replacing current applications by Big Data technologies • Adding additional detection capabilities by Machine Learning – Machine Learning Introduction – Use Cases – High-level architecture – Lessons learned • Q&A
  4. Introduction - Team. (Diagram: Network Traffic Logs, Users Actions, Vulnerabilities, and Advanced Threat data flow into a Big Data Platform that produces Actionable Intelligence for the Global Cyber Security Fusion Center: Data Science Team, Risk and Governance, Cyber Security Operation Center.)
  5. Introduction - SIEM • SIEM - security information and event management • Security Event Manager (SEM): generates alerts based on predefined rules and input events • Security Information Manager (SIM): stores relevant cyber security data and allows querying to get context data. (Diagram: events pass through aggregation, filtering and enriching into the SEM and SIM; the SEM raises alerts and security analysts query the SIM for context.)
  6. Challenges in Cyber Security • Scalability and performance – Increasing amount of data: according to Gartner, 25K EPS is enterprise size, but in big organizations there are several 100K EPS. – Limited storage for historical data. – Long query response times. – IoT makes the situation even worse. • Quickly evolving requirements • Lack of qualified and skilled professionals
  7. Using Spark to help process an increasing amount of data
  8. Big Data Processing - Offloading current applications • offload of aggregation, filtering and enriching • offload of storage and querying. (Diagram: events pass through aggregation, filtering and enriching into the SIEM (SEM and SIM), which alerts security analysts; Big Data Storage with an API/UI serves query/context requests.)
  9. Big Data Processing - high level. (Diagram: NetFlow and syslog feeds enter through a NetFlow collector and a syslog collector into distributed batch and streaming processing (deduplication, filtering, aggregation, enriching), backed by HDFS, a columnar store and an in-memory data grid, with output to the SIEM.)
  10. Big Data Processing - Firewall logs aggregation
  11. Big Data Processing - Firewall logs aggregation. A custom-built Syslog Collector sends syslog events to Kafka; a custom-built highly available load balancer distributes syslog events across live collectors.
  12. Big Data Processing - Firewall logs aggregation. Firewall Aggregation (5 sec. streaming job) aggregates events using DStream.reduceByKey; DNS enrichment adds DNS names using DHCP and DNS logs.
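The reduce-by-key aggregation above can be sketched in plain Python (Spark-free, with a hypothetical event schema of src, dst, dport and bytes fields) to show what one 5-second micro-batch does: events sharing a key collapse into a single aggregated record.

```python
from collections import defaultdict

def aggregate_firewall_events(events):
    """Collapse firewall events sharing (src, dst, dport) into one
    record per key, summing counts and byte totals -- the same
    reduce-by-key idea used by the 5-second Spark Streaming job."""
    reduced = defaultdict(lambda: {"count": 0, "bytes": 0})
    for e in events:
        key = (e["src"], e["dst"], e["dport"])  # hypothetical event schema
        reduced[key]["count"] += 1
        reduced[key]["bytes"] += e["bytes"]
    return dict(reduced)

# one toy micro-batch: three raw events reduce to two aggregated records
batch = [
    {"src": "10.0.0.1", "dst": "8.8.8.8", "dport": 53, "bytes": 120},
    {"src": "10.0.0.1", "dst": "8.8.8.8", "dport": 53, "bytes": 80},
    {"src": "10.0.0.2", "dst": "1.1.1.1", "dport": 443, "bytes": 1500},
]
agg = aggregate_firewall_events(batch)
```

In Spark the same logic runs as `DStream.reduceByKey` over each micro-batch, which is what halves the event volume forwarded to the SIEM.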
  13. Big Data Processing - Firewall logs aggregation. SIEM Loader (5 sec. streaming job) sends aggregated events to the SIEM.
  14. Big Data Processing - Firewall logs aggregation. Columnar Store Loader (5 sec. streaming job) loads aggregated events into the Columnar Store; the Columnar Store offloads storage and querying.
  15. Big Data Processing - Firewall logs aggregation • Environment: inputs of 65,000 EPS and 32,000 EPS; 5 sec. micro-batches (Spark Streaming); 24 executors x 11 cores each on a non-dedicated, heavily utilized Hortonworks cluster • Results: the number of events is reduced by half; query times are reduced to seconds
  16. SIEM functionality using Big Data technology. Microservices (MS) based on Big Data technologies implement SIEM functionality • easy to add/modify functionality • design driven by users • easier integration with processes. (Diagram: events flow into orchestrated microservices backed by Big Data Storage, with an API/UI delivering alerts and query/context to security analysts.)
  17. SIEM functionality using Big Data technology • Rule development and testing similar to software testing • Similar process and tools (Jira, Git, etc.) • Tools: Spark, In-Memory Data Grid • Preliminary results: 15-20 minutes to test a rule on 24h of data (~2B events, 24 executors); linearly scalable. Workflow: Rule Development → Unit Testing → Fast Forward Testing with Production Sample → Production Deployment
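The fast-forward testing workflow can be illustrated with a toy rule engine: a rule is just a predicate over events, and replaying a stored production sample through it yields the alerts (and their statistics) in minutes rather than a day of wall-clock time. The rule and event fields below are hypothetical, not the production rule language.

```python
def run_rule(rule, events):
    """Replay a recorded event sample through a detection rule and
    collect the alerts it would have raised ('fast forward' testing)."""
    return [e for e in events if rule(e)]

# hypothetical rule: a burst of denied connections from one source
def denied_burst(event):
    return event["action"] == "deny" and event["count"] >= 100

# a tiny stand-in for a 24h production snapshot
sample = [
    {"src": "10.1.1.1", "action": "deny", "count": 250},
    {"src": "10.1.1.2", "action": "allow", "count": 500},
    {"src": "10.1.1.3", "action": "deny", "count": 12},
]
alerts = run_rule(denied_burst, sample)
```

At scale the same replay runs as a Spark job over the stored snapshot, which is why a 2B-event day can be tested in 15-20 minutes.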
  18. Adding additional detection capabilities by Machine Learning
  19. Machine Learning - Introduction • Labeled data - supervised learning: we can find a function f and its parameters that fits the training data and can be used for classification and regression • Unlabeled data - unsupervised learning: we can derive structure from data and find outliers. (Diagrams: x1 vs. x2 scatter plots illustrating supervised and unsupervised learning.)
  20. Machine Learning - Supervised. Training: finding a function and its parameters to fit the training data (Training Labeled Data + Training Algorithm → Model Parameters (hypothesis)). Actual classification/regression: New Data + Classification/Regression Algorithm → Classification/Regression Results
  21. Machine Learning - Example • f: if x2 > (p0 + p1 * x1) then O else X • finding parameters to minimize the number of wrongly classified data points (cost function). Parameters tried: p0 = 0.6, p1 = 0 (cost 3); p0 = 0.9, p1 = -0.9 (cost 2); p0 = 0.8, p1 = -0.7 (cost 0)
  22. Machine Learning - Example. Classification: if x2 > (0.8 - 0.7 * x1) then O else X, applied to new data to obtain classified new data
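The decision rule and cost function from the example slides translate directly into code. The fitted parameters (p0 = 0.8, p1 = -0.7) are the ones from the slide; the labeled training points below are hypothetical.

```python
def classify(x1, x2, p0=0.8, p1=-0.7):
    """The slide's decision rule: 'O' if x2 > p0 + p1 * x1, else 'X'."""
    return "O" if x2 > p0 + p1 * x1 else "X"

def cost(points, p0, p1):
    """Cost = number of wrongly classified training points."""
    return sum(1 for x1, x2, label in points
               if classify(x1, x2, p0, p1) != label)

# hypothetical labeled training data: (x1, x2, label)
points = [(0.9, 0.9, "O"), (0.5, 0.6, "O"),
          (0.1, 0.1, "X"), (0.2, 0.3, "X")]
```

Training is then a search over (p0, p1) that minimizes the cost; with the slide's parameters the cost on these points is 0.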
  23. Machine Learning - Terminology. Precision = True Positive / (True Positive + False Positive) = proportion of selected items that are relevant. Recall = True Positive / (True Positive + False Negative) = proportion of relevant items that were selected. Source: https://en.wikipedia.org/wiki/Precision_and_recall
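The two definitions are one-liners in code; plugging in the counts from the malware example on slide 26 (99 true positives, 99 false positives, 1 false negative) gives precision 0.5 and recall 0.99.

```python
def precision(tp, fp):
    """Proportion of selected (flagged) items that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Proportion of relevant items that were selected (flagged)."""
    return tp / (tp + fn)
```

Note that precision depends on the false positives while recall depends on the false negatives, which is why a detector can score well on one and poorly on the other.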
  24. Machine Learning - Challenges • Too many false positives: precision of ~99% can be too low • Data cleanliness: a wrong time on a device can be detected as an anomaly • Missing labeled data: hard to evaluate recall
  25. Machine Learning - Challenges. Is 99% precision good enough? An ML algorithm for detecting a specific malware infection has precision = 99% and recall = 99%. The infection is relatively rare: 1% of computers are infected. What is the probability that a computer is really infected if it is classified as infected? (99%, 91%, 50%, or 1%)
  26. Machine Learning - Challenges. Suppose there are 10,000 computers: 100 are infected, of which 99 are correctly classified as infected (true positives) and 1 is classified as not infected (false negative); 9,900 are clean, of which 99 are incorrectly classified as infected (false positives) and 9,801 are correctly classified as not infected (true negatives). 99 true positives and 99 false positives give 198 computers classified as infected, but only 99 are really infected, so the probability that a computer classified as infected is really infected is 50%. Using Bayes' theorem: P(infected | classified as infected) = P(classified as infected | infected) * P(infected) / P(classified as infected) = (0.99 * 0.01) / (0.99 * 0.01 + 0.01 * 0.99) = 0.5
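The Bayes computation on this slide generalizes to any base rate, which is exactly what makes low-prevalence detection hard. A minimal sketch (tpr is the 99% recall; the 1% false positive rate follows from the slide's figures):

```python
def p_infected_given_flagged(base_rate, tpr=0.99, fpr=0.01):
    """Bayes' theorem: P(infected | flagged) =
    tpr * P(infected) / (tpr * P(infected) + fpr * P(clean))."""
    return (tpr * base_rate) / (tpr * base_rate + fpr * (1 - base_rate))
```

A 1% base rate yields 0.5, matching the slide; at a 0.1% base rate the same classifier drops to roughly 9%, and at 0.01% to about 1%.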
  27. Machine Learning - Challenges • Usually a human should make the final assessment • Reasonable use cases: high ratio of "infection"; limited (selected) data. For a classifier with 99% precision and recall: with 1.00% of computers infected, 50% of those classified as infected really are; with 0.10%, only 9%; with 0.01%, only 1%
  28. Machine Learning and Spark • MLlib is Apache Spark's scalable machine learning library: ML algorithms and ML workflow utilities (data → features, evaluation, persistence, ...) • Several deep learning frameworks: Databricks - spark-deep-learning (Deep Learning Pipelines for Apache Spark); Yahoo - TensorFlowOnSpark; Intel - BigDL; ...
  29. Machine Learning Use Cases • Detect malicious URL: data source - web proxy log; features - entropy, number of special characters, path length, URL length, contains organization domain out of position, has been seen, ...; algorithms - Random Forest, Long Short-Term Memory • Generated (malicious) domain detection: data source - DNS log; feature - domain string; algorithm - Long Short-Term Memory • Classify server account activity: data source - Active Directory log; features - network distance, organization distance, time distance; algorithms - Naïve Bayes, Random Forest
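One of the listed features, string entropy, is easy to make concrete. A minimal Shannon-entropy feature extractor is sketched below; the exact formula used in the talk is not specified, so this is one common definition, not necessarily the production one.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy of a string in bits per character; high values
    suggest machine-generated or encoded content (e.g. DGA domains
    or obfuscated URL paths)."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

A repeated character scores 0 bits, while random-looking strings score higher, which is why entropy helps separate generated domains and URLs from human-chosen ones.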
  30. Machine Learning Use Cases • Detect command and control communication: data source - NetFlow data; features - duration of TCP/IP session, cardinality, octets/packet, etc.; algorithms - Naïve Bayes, Random Forest
  31. Machine Learning - Architecture (training). Spark MLlib batch job: training data → feature extractor → training algorithm → model parameters stored in HDFS
  32. Machine Learning - Architecture (classification). Spark MLlib batch or streaming job: new data → feature extractor → classification algorithm (using model parameters from HDFS) → classified data
  33. Machine Learning - Lessons Learned • Do not implement ML just to tick the "we are using ML" box • Have good use cases, including precision and recall requirements • Visualization can be more useful than ML in some cases • In most cases it is necessary to validate a detection by an analyst • Cyber security analysts appreciate reasoning (why the classifier decided something is malicious)
  34. DON’T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT
