
Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm


Presented at Strata Hadoop World 2017, San Jose

Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.

Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.

Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.


  1. Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron
     Kevin Mao, Senior Data Engineer, Capital One (@KevinJokaiMao)
  2. About Me
     - B.S., Computer Science, University of Maryland, Baltimore County
     - M.S., Computer Science, George Mason University
     - Enterprise Data Services, Data Intelligence
     - Purple Rain Project
     - Huge Zelda fan!
  3. Agenda
     - Part 1: Motivation and Background
     - Part 2: Approach and Architecture
     - Part 3: Challenges
     - Part 4: Future Work
     - Part 5: Wrapping Up
  4. Part 1: Motivation and Background
  5. Capital One
     - 45,000 employees
     - 45 million customers
     - 26,000 EC2 instances
     - Credit cards
     - Traditional banking
     - Home/auto loans
     - Brokerage services
  6. The Problem
     - The ways in which adversaries can attack your systems are increasing:
       - The DNC hack involved convincing spear-phishing emails posing as Google password resets
       - Hollywood Presbyterian Medical Center paid $17,000 in Bitcoin to unlock a medical records system held hostage by ransomware
     - Organizations have to keep up by employing a more numerous and more diverse set of tools
     - Finding a way to use those tools effectively is difficult
  7. The Data
     - HTTP proxy logs
     - Email metadata
     - VPN logs
     - Firewall events
     - DNS
     - Syslogs (*nix, Windows)
     - Security endpoints
     - Threat intelligence
     - IDS events
     - Wireless access points
     - Mobile device management
     - And more...
     - ~40 distinct data feeds
     - ~5 billion events per day
     - ~75,000 peak events per second
     - ~5 TB per day
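As a quick sanity check, the volume figures on this slide are internally consistent: 5 billion events per day averages out to roughly 58,000 events per second, so a 75,000 events-per-second peak is only about 1.3x the daily mean.

```python
# Back-of-the-envelope check on the feed volumes quoted in the slide.
EVENTS_PER_DAY = 5_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"average events/sec: {avg_eps:,.0f}")   # ~57,870

PEAK_EPS = 75_000
print(f"peak-to-average ratio: {PEAK_EPS / avg_eps:.2f}x")  # ~1.30x
```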
  8. What We Started Out With
     - Enterprise SIEM (Security Information and Event Management) platform
       - Primary management tool for many years
       - Encountered stability issues while scaling out to 13 months of data retention
     - Splunk
       - Great UI experience
       - Scaling out to 13 months becomes prohibitively expensive
  9. Where Does That Leave Us?
     - We need a solution for security event and telemetry data that is diverse, voluminous, and fast-moving
     - Horizontally and linearly scalable
     - Platform and interface built for:
       - SOC analysts to quickly respond to incidents
       - Forensic investigators to analyze historical data and compile reports
       - Threat hunters to efficiently find vulnerabilities and malicious behavior
     - Affordable!
  10. Purple Rain
  11. Part 2: Approach and Architecture
  12. NiFi
     - Data routing, transformation, and distribution platform
     - Easy-to-use web UI
     - On-prem cluster: collects data from all local devices and flows it into the AWS cluster
       - 3 nodes, 20 CPU cores, 375 GB memory, 6 x 2 TB disks
     - AWS cluster: collects, preprocesses, and tags incoming data
       - 6 nodes, m4.4xlarge, 3 x 1 TB EBS volumes (gp2)
     - Individual data flows defined for each feed
  13. Kafka
     - Distributed messaging platform
       - Publish-subscribe model
       - Producer/consumer implementations across many languages
       - Support for stream processing and ingestion via Kafka Streams/Connect
     - Serves as the communication backbone for the infrastructure
     - 20 brokers: m4.xlarge, 6 x 250 GB EBS volumes (gp2)
     - Replication factor of 2
     - Partition count set to a multiple of the aggregate disk count
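The partition-count rule of thumb above can be made concrete: with 20 brokers holding 6 volumes each, the aggregate disk count is 120, so topic partition counts would be rounded up to multiples of 120 to spread partitions evenly across every disk. A minimal sketch, where the helper name is ours and only the cluster sizes come from the slide:

```python
import math

def partition_count(brokers: int, disks_per_broker: int, target: int) -> int:
    """Round a desired partition count up to the nearest multiple of the
    cluster's aggregate disk count, so partitions spread evenly over disks."""
    aggregate_disks = brokers * disks_per_broker
    return math.ceil(target / aggregate_disks) * aggregate_disks

# The deck's cluster: 20 brokers x 6 EBS volumes = 120 disks.
print(partition_count(brokers=20, disks_per_broker=6, target=200))  # -> 240
```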
  14. Storm
     - Distributed real-time stream computation system
     - Scales up by adding more worker nodes
     - Fault tolerant: when a node dies, the jobs that were on it are moved to another node
     - Support for topology isolation, micro-batching, and custom routing
     - Storm Nimbus/UI: m4.2xlarge
     - 45 Storm worker nodes: m4.2xlarge
     - 4 worker slots per node, each with 2 vCPUs and 8 GB memory
  15. Metron
     - Security analytics framework built on top of Storm
     - Consists of two sets of Storm topologies:
       - Parser topologies: parse raw data into a human-readable JSON format
       - Enrichment topologies: enrich parsed data with contextual information, then send it to the storage tier
     - Enrichment of incoming data streams with additional information:
       - Domain Generation Algorithm (DGA) scoring via a machine learning model
       - Active Directory user lookup
       - Geolocation/ASN data for external IP addresses
       - WHOIS lookup for unknown domain names
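To illustrate what an enrichment step does, the pattern is a keyed lookup that annotates each parsed event before it moves to storage. This is a schematic stand-in, not Metron's actual enrichment API, and the lookup table is a toy in-memory dict:

```python
# Schematic enrichment step: annotate parsed events with geolocation data.
# In the deck's pipeline this role is played by Metron enrichment topologies
# backed by real geolocation/ASN data sources.
GEO_TABLE = {
    "203.0.113.7": {"country": "US", "asn": "AS64500"},   # TEST-NET address
}

def enrich(event: dict) -> dict:
    geo = GEO_TABLE.get(event.get("dst_ip"), {})
    # Flatten enrichment fields under a prefix, "enrichments.*"-style.
    enriched = dict(event)
    for key, value in geo.items():
        enriched[f"enrichments.geo.{key}"] = value
    return enriched

print(enrich({"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7"}))
```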
  16. ElasticSearch
     - Distributed, RESTful search and analytics engine
       - Each data feed comprises its own set of daily indices
       - Each index is further subdivided into shards
     - Linearly scalable
     - Low-latency full-text search
     - 3 master nodes: m4.2xlarge
     - 100 data nodes: m4.4xlarge, 3 x 1 TB EBS volumes (gp2)
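Per-feed daily indices typically follow the conventional `<feed>-YYYY.MM.DD` naming scheme, which makes time-bounded queries and retention-based deletion cheap: a search only touches the indices in its window, and expiring a day means dropping one index. The exact name format below is an assumption; the deck does not specify it:

```python
from datetime import date, timedelta

def daily_index(feed: str, day: date) -> str:
    """Name of the daily index for one data feed, e.g. 'proxy-2017.03.14'."""
    return f"{feed}-{day:%Y.%m.%d}"

def indices_for_range(feed: str, start: date, days: int) -> list:
    """Index names covering a query window: a search can target only the
    indices it needs instead of the feed's whole history."""
    return [daily_index(feed, start + timedelta(days=i)) for i in range(days)]

print(indices_for_range("proxy", date(2017, 3, 14), 3))
# -> ['proxy-2017.03.14', 'proxy-2017.03.15', 'proxy-2017.03.16']
```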
  17. Kibana
     - Data visualization frontend for ElasticSearch
     - Alert management system
     - Cyber Threat Intelligence (CTI) repository for storing, tagging, and searching artifacts
     - Multiple open source and custom plugins:
       - Timelion
       - fermiumlabs/mathlion
       - prelert/kibana-swimlane-vis
       - sirensolutions/kibi
       - sirensolutions/sentinl
       - snuids/heatmap
       - chenryn/kbn_sankey_vis
       - And more...
  18. S3
     - Simple Storage Service: object storage service in the cloud
     - Compatible with processing engines like Spark and EMR
     - Data stored in two formats:
       - Raw data: used for replaying data through the pipeline and for meeting our obligations as a system of record for some feeds
       - Parsed data: stored in a columnar format (ORC) for batch processing
     - Everything in S3 is encrypted
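A common way to lay out such a bucket for batch engines is to partition object keys by feed and date, so that Spark or Hive can prune partitions instead of scanning everything. The key scheme below is an illustrative assumption, not taken from the deck:

```python
from datetime import datetime, timezone

def s3_key(tier: str, feed: str, ts: datetime) -> str:
    """Hive-style partitioned key; engines prune on the feed=/dt= segments."""
    return (f"{tier}/feed={feed}/dt={ts:%Y-%m-%d}/"
            f"events-{ts:%H%M%S}.orc")

ts = datetime(2017, 3, 14, 9, 30, 0, tzinfo=timezone.utc)
print(s3_key("parsed", "proxy", ts))
# -> parsed/feed=proxy/dt=2017-03-14/events-093000.orc
```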
  19. Monitoring
     - Zabbix agents collect system-level telemetry (CPU, memory, IOPS, disk %, etc.)
     - Ingestion-rate and message-volume metrics collected from NiFi, Kafka, Storm, and ElasticSearch
     - Most data stored in a separate ElasticSearch cluster
     - Grafana for visualization
     - ElastAlert for platform alerting
  20. Part 3: Challenges
  21. Format Wars
     - Ingested raw data comes in a variety of formats: CSV, JSON, XML, CEF
     - Sometimes the formats are poorly defined:
       - Windows syslogs are pretty-printed with tab indentation, but have no field delimiters
       - Various subtypes come in different formats
     - Upstream changes to the raw data format often propagate through our entire pipeline, eventually making the data in ElasticSearch unusable
     - Takeaway: format and serialize data as far upstream as possible
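The takeaway on this slide, serializing as far upstream as possible, amounts to converting every feed into one canonical record shape at the point of ingest, so an upstream format change breaks one parser instead of the whole pipeline. A minimal sketch using two of the formats named above; the field names are invented for illustration:

```python
import csv
import io
import json

def normalize(raw: str, fmt: str) -> dict:
    """Convert one raw record (CSV or JSON here) into a canonical dict.
    Doing this at ingest confines upstream format changes to one parser."""
    if fmt == "json":
        rec = json.loads(raw)
    elif fmt == "csv":
        # Illustrative fixed column order for a hypothetical proxy feed.
        fields = ["timestamp", "src_ip", "url"]
        rec = dict(zip(fields, next(csv.reader(io.StringIO(raw)))))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return {k: rec.get(k) for k in ("timestamp", "src_ip", "url")}

# Two wire formats, one canonical record shape:
print(normalize('{"timestamp": "t0", "src_ip": "10.0.0.5", "url": "/a"}', "json"))
print(normalize("t0,10.0.0.5,/a", "csv"))
```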
  22. Monitoring and Alerting
     - Platform-level telemetry should be stored with all the other data, instead of in a separate Zabbix subsystem
     - Collect more granular application-level data:
       - Most components expose metrics via JMX
       - Necessary to effectively troubleshoot performance bottlenecks
       - Useful for capacity planning
     - Logging data collection is a common problem among many teams at Capital One
     - Takeaway: reduce duplication of work by offering common monitoring infrastructure, or even Monitoring-as-a-Service
  23. Rehydration
     - EC2 instances with AMIs older than 60 days must be terminated (internal Capital One policy)
     - Spent a lot of time developing automation and orchestration to spin up a full cluster from scratch
     - How do you rehydrate a newly provisioned platform with data?
     - How do you avoid service interruption to the user?
     - Blue/green cluster deployment
     - Rolling rehydration every 30 days
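The 60-day policy driving this rehydration work reduces to a simple age predicate that a scheduler can evaluate against each instance's image creation date. The function below is illustrative, not Capital One's tooling; only the 60-day limit comes from the slide:

```python
from datetime import datetime, timedelta, timezone

MAX_AMI_AGE = timedelta(days=60)   # internal policy quoted in the deck

def must_terminate(ami_created: datetime, now: datetime) -> bool:
    """True when an instance's AMI has exceeded the 60-day age limit."""
    return now - ami_created > MAX_AMI_AGE

now = datetime(2017, 3, 14, tzinfo=timezone.utc)
print(must_terminate(datetime(2017, 1, 1, tzinfo=timezone.utc), now))   # True (72 days)
print(must_terminate(datetime(2017, 2, 1, tzinfo=timezone.utc), now))   # False (41 days)
```

Rolling rehydration every 30 days keeps the fleet comfortably inside the limit instead of racing the deadline.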
  24. Auditing
     - Internal audit: two internal audits of NPI/PCI handling and storage processes
     - OCC (Office of the Comptroller of the Currency): audit of data sources, networking, and archival of data
     - FRB (Federal Reserve Board):
       - IT risk management: alerts considered an authoritative source as part of the first line of defense
       - Resiliency: provide evidence of the ability to fail over within an acceptable window of time
  25. Handling Sensitive Data
     - Social Security numbers
     - Credit card info
     - Home/auto loans
     - Checking/savings account data
     - Trading data
     - Automated process to scan for PII/PCI data and scrub it from the raw data stream:
       - Secure raw data topics via encryption and access control
       - Streaming job to scrub raw feeds and produce into separate 'clean' topics
     - Backwards remediation process for data stored in HDFS/S3
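The scrubbing job described above is, at its core, pattern matching over the raw stream before records reach a 'clean' topic. The sketch below masks two obvious PII shapes; the regexes are deliberately simplistic illustrations, and a production scrubber would need far stronger detection (for instance, Luhn validation for card numbers):

```python
import re

# Simplistic illustrative patterns; real PII/PCI detection is much stricter.
SSN_RE  = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")

def scrub(record: str) -> str:
    """Mask SSN- and card-shaped tokens before the record is produced to a
    'clean' topic; the raw topic stays encrypted and access-controlled."""
    record = SSN_RE.sub("XXX-XX-XXXX", record)
    record = CARD_RE.sub("[CARD REDACTED]", record)
    return record

print(scrub("user ssn=123-45-6789 paid with 4111 1111 1111 1111"))
# -> user ssn=XXX-XX-XXXX paid with [CARD REDACTED]
```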
  26. Part 4: Future Work
  27. Schema Management
     - Authoritative service for clients to retrieve the schemas applied to datasets
     - Implementation is protocol-dependent:
       - Avro: Confluent Schema Registry
       - Protobuf: central GitHub repository
     - Streaming job to parse raw data and apply schemas to it prior to processing
       - Raw data that fails to fit its schema is diverted to an alternate Kafka topic
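The planned streaming job splits records by whether they fit the registered schema, with failures diverted to a dead-letter topic for inspection. A schematic version follows; the schema and topic names are invented, and the real design would use Avro with the Confluent Schema Registry, per the slide:

```python
# Schematic schema check with dead-letter routing. The schema and topic
# names below are invented for illustration.
SCHEMA = {"timestamp": str, "src_ip": str, "bytes": int}

def fits_schema(record: dict) -> bool:
    """Record has exactly the schema's fields, each of the right type."""
    return (set(record) == set(SCHEMA)
            and all(isinstance(record[k], t) for k, t in SCHEMA.items()))

def route(record: dict) -> str:
    """Return the Kafka topic this record should be produced to."""
    return "events.parsed" if fits_schema(record) else "events.deadletter"

print(route({"timestamp": "t0", "src_ip": "10.0.0.5", "bytes": 512}))  # events.parsed
print(route({"timestamp": "t0", "src_ip": "10.0.0.5"}))                # events.deadletter
```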
  28. Monitoring
     - Consolidate the monitoring stack; two candidate approaches:
       - Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana, and friends
       - Separate stacks for time-series numeric data and logging:
         - TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor)
         - ELK stack
     - Both approaches have tradeoffs
  29. Generalized Data Processing
     - Metron is really good for working in the infosec space, but does not generalize well
     - Exploring options for building a data platform that addresses multiple use cases:
       - Credit transactions
       - Credit fraud
       - Anti-money laundering
       - Legal
     - Focus on supporting machine learning
  30. Part 5: Wrapping Up
  31. Retrospective
     - Users (SOC analysts, threat hunters, etc.) are generally happy with the platform
     - Low query latency
     - Working to address concerns around data integrity (duplicates, loss, malformed records)
     - They want more data!
       - Bro
       - Silvertail
       - Phantom
  32. Q&A
  33. Thanks!
     @KevinJokaiMao
     We're hiring in SF, Chicago, and DC!
     Machine Learning Engineers, Software Engineers, Data Engineers, Data Scientists