Apache Spark vs Apache Spark: An On-Prem Comparison of Databricks and Open-Source Spark

Booz Allen is at the forefront of cyber innovation, and sometimes that means applying AI in an on-prem environment because of data sensitivity.

Published in: Data & Analytics

  1. 1. SPARK VS SPARK: AN ON-PREM COMPARISON OF DATABRICKS AND OPEN-SOURCE SPARK
       Justin Hoffman - Senior Lead Data Scientist at Booz Allen Hamilton
       In Collaboration with US Air Force
       In Collaboration with Databricks
       SPARK AI SUMMIT 2020
  2. 2. Networks are harder to secure than ever before, as defenders are increasingly overwhelmed by data and challenges
       • Rapidly expanding attack surface
       • Inundation of cyber tools
       • Attacks are more sophisticated
       • Cyber talent shortage
       The average intrusion is detected almost 200 days after the fact
  3. 3. In cybersecurity, speed is paramount
  4. 4. The Challenge: Go Fast... On-Premise?
       • COLLECT: Get and track data from the source
       • PROCESS: Give data the power of greater context
       • AGGREGATE: From disparate data sources, one version of the truth
       • EXPOSE: Abstract away complexities through a single interface
       Data sources: social media, newsfeeds, web crawlers, proprietary sources
       Data consumers: data science, analytics, business intelligence, reporting, visualizations, applications
       Across all stages: security, governance, provenance, lineage
       Output: insights
  5. 5. Solution: A Service-Oriented Architecture for Capability Deployment
       • Various Sensor Data Feeds
       • Data Broker
       • Normalization Engine
       • Enrichment Engine
       • Established Data Model with Automation: Fuses multiple data feeds and normalizes raw data to a common data model
       • Automated Threat Intelligence Enrichment: Accelerates investigations by automating the collection of valuable context before the data hits the SIEM
       • Suite of Crowd Sourced Analytics: Curates risk scores for nuanced adversary techniques that were previously undetectable using containerized AI systems
       • Security Orchestration, Automation, and Response: Integrated capabilities support the SOC by automating simple tasks when appropriate
       • Data Storage: Hosts normalized, enriched data for cyber operations
       • Historic Data Storage: Repository of historic data for compliance / retrospective analysis
       • Traditional Tooling, Custom Dashboards: Often prebuilt with analytic capabilities
       • Security Operations Center (SOC): Drives cyber hunt and defensive operations mission by analyzing data through existing tooling (COTS platforms and custom dashboards)
       BOOZ ALLEN HAMILTON - This document is intended solely for the client to whom it is addressed on the title slide.
  6. 6. Project Architecture: Focused on High Performance Computing
       • Various Sensor Data Feeds
       • Data Broker
       • Enrichment Engine
       • Custom Cyber AI Models: Identify IPs that are interesting
       • Data Storage: Hosts normalized, enriched data for cyber operations
       • Security Operations Center (SOC): Drives cyber hunt and defensive operations mission by analyzing data through existing tooling (COTS platforms and custom dashboards)
       HPC Specs
       • Master node: 1
       • Worker nodes: 6
       • Memory: 128 GB
       • Cores: 16
       • Gigabit connectivity
       • Resource manager (RM): YARN
       • Hive Metastore: MariaDB
       • DBIO Caching (DBR): enabled
  7. 7. Results: Spark Open Source vs Spark DBR*
       In Cloud / On Prem
       • SQL, read and count: DBR 34.4 s, OSS 158.5 s, gains (DBR) 4.6X
       • SQL, filtered count: DBR 1.7 s, OSS 72.5 s, gains (DBR) 42.65X
       Dataset: ~1 billion+ records, 1 TB+ in size
       *Databricks Runtime (DBR): https://databricks.com/glossary/what-is-databricks-runtime
  8. 8. Lessons Learned for Future On-Premise Installs
       • Spark DBR with Delta Lake performs almost 50X faster on complex joins for IPs of interest
       • When performing machine learning at scale, Spark DBR still provides performance gains in DGA classification
       • We isolated some worker node failures to the open-source Hadoop distribution, which would cause Spark not to complete. Switching distributions solved the issue.
       • Simple applications leveraging RDDs (Resilient Distributed Datasets) will not have high performance gains
       • Leveraging the Delta Lake format and Parquet with MariaDB provides performance optimizations
       • No cloud was leveraged in the making of this research
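
The gains reported on the Results slide (slide 7) can be checked from the two timing columns. The sketch below assumes, as the slide's numbers suggest, that "gain" is the OSS wall-clock time divided by the DBR wall-clock time. The dictionary layout and the function name `speedup` are illustrative choices, not part of the presentation.

```python
# Timings in seconds, copied from the Results slide.
timings = {
    "SQL read and count": {"dbr": 34.4, "oss": 158.5},
    "SQL filtered count": {"dbr": 1.7, "oss": 72.5},
}


def speedup(oss_seconds: float, dbr_seconds: float) -> float:
    """Return how many times faster DBR ran than OSS (assumed definition of 'gain')."""
    return oss_seconds / dbr_seconds


# Round to two decimals, matching the precision used on the slide.
gains = {
    name: round(speedup(t["oss"], t["dbr"]), 2)
    for name, t in timings.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain}X")
```

Under that definition the ratios come out at roughly 4.6X and 42.65X, which agrees with the figures on the slide.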

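The HPC specs on slide 6 could be expressed as Spark configuration properties along the following lines. This is a rough sketch, not the configuration used in the study: it assumes the 16 cores are per worker node and that one executor runs per worker, neither of which the slide states. The helper names `spark_conf` and `to_submit_flags` are hypothetical. `spark.master`, `spark.executor.instances`, and `spark.executor.cores` are standard Spark properties, and `spark.databricks.io.cache.enabled` is the Databricks Runtime setting for the DBIO cache; it has no effect on open-source Spark.

```python
def spark_conf(workers: int = 6, cores_per_worker: int = 16,
               dbio_cache: bool = True) -> dict:
    """Build illustrative Spark properties from the slide's HPC specs.

    Assumptions (not stated on the slide): one executor per worker node,
    and the core count is per node. Memory settings are left out because
    the slide does not say how the 128 GB is divided.
    """
    conf = {
        "spark.master": "yarn",  # resource manager listed on the slide
        "spark.executor.instances": str(workers),
        "spark.executor.cores": str(cores_per_worker),
    }
    if dbio_cache:
        # Databricks Runtime only.
        conf["spark.databricks.io.cache.enabled"] = "true"
    return conf


def to_submit_flags(conf: dict) -> list:
    """Render properties as spark-submit style '--conf key=value' flags."""
    return [f"--conf {key}={value}" for key, value in sorted(conf.items())]


if __name__ == "__main__":
    for flag in to_submit_flags(spark_conf()):
        print(flag)
```

In practice some cores and memory on each node would be reserved for the operating system and YARN daemons, so real executor sizing would be lower than the raw node specs.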