Mitigating One Million Security Threats With Kafka and Spark With Arun Janarthnam | Current 2022
Citrix Analytics (Security), a user behavior analytics service, protects hundreds of companies from risks and threats posed by users. The service processes 3 billion events per day and can identify security threats in under a minute.
Kafka is the backbone of our real-time platform. It seamlessly glues together the numerous stages required for ETL, feature extraction, model training and serving, data access, etc., and enables us to develop new products faster.
In this session, we will talk about how, in the last 6 months, 7M risk indicators were triggered and 1M threat-mitigating actions were taken, and the integral role Kafka played in achieving that. We would also like to share some interesting ways Kafka is used at Citrix: how topics are auto-provisioned, how security is handled in a multi-tenant, public-facing "northbound" Kafka cluster, and the Kafka + Spark optimizations that reduced the cost of running hundreds of streaming jobs.
2. Who or what did we protect?
What kind of threats did we protect against?
4. Work used to be a place
Now it is anywhere in the world, anywhere on the internet:
Cloud | On-prem | External Apps
5. One digital workspace platform to empower secure hybrid work
Citrix DaaS & VDI | Workspace | Secure Private Access | ShareFile
* Not the full list of Citrix products
98% of Fortune 500
100M users in 100+ countries
Over 400,000 customers
9. Citrix Analytics Architecture
10+ Citrix products send ~40 different data streams into the Citrix Analytics Platform, which is built from a Data Platform, an ML Platform, and an App Platform. On top of it run Security Analytics, Performance Analytics, Networking Analytics, SPA / ZTNA, and Usage Analytics.
10. Data Ingestion
Citrix Products feed the Prod Data Lake via two paths: Streaming Ingestion and Batch Ingestion. Common Data Processing turns Raw Data into Processed Data:
✓ User Correlation
✓ Data Augmentation
✓ Data Transformation
User Correlation: the same user appears under different identifiers in different products, e.g. user1@citrix.com (email) in Workspace, but user1 (username) and aaaUser1 (AAA user name) in ShareFile. A streaming User Correlation Module, backed by Active Directory, resolves them to one distinct cas_user_name; a batch job (every 15 mins) maintains the User Lookup Table.
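The correlation step above can be sketched in plain Python. This is a minimal illustration, assuming hypothetical field names (the real pipeline runs it as a Spark stream joined against the 15-minute lookup table):

```python
# Sketch of user correlation: events from different products carry different
# identifiers (email, username, AAA user name); a lookup table built by the
# periodic batch job maps every alias to one canonical user. Field names
# here are illustrative, not the actual schema.

def build_lookup_table(directory_records):
    """Map every known alias of a user to a single canonical id."""
    lookup = {}
    for rec in directory_records:
        canonical = rec["email"]
        for alias in (rec["email"], rec["username"], rec["aaa_user_name"]):
            lookup[alias] = canonical
    return lookup

def correlate(event, lookup):
    """Attach the canonical user id to an event, whatever id it carries."""
    raw_id = event["user_id"]
    return {**event, "cas_user_name": lookup.get(raw_id, raw_id)}

directory = [{"email": "user1@citrix.com", "username": "user1",
              "aaa_user_name": "aaaUser1"}]
table = build_lookup_table(directory)
ws_event = correlate({"product": "Workspace", "user_id": "user1@citrix.com"}, table)
sf_event = correlate({"product": "ShareFile", "user_id": "aaaUser1"}, table)
# Both events now resolve to the same cas_user_name.
```
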
13. Data Ingestion
We can't update billions of already-ingested records when a user mapping changes. Instead, events are joined against the User Lookup Table on username at lookup time, which solves the duplicate-user issue; the lookup table itself is refreshed by the batch job every 15 mins.
14. Data Ingestion
Augmentation & Transformation
• IP to Location Enrichment
  • Accuracy & coverage are paramount
  • Handle "locale" city names
• Enrich URL data
• Normalize field names and values
• Data Quality checks
• Add licensing information
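The IP-to-location step with locale city-name handling can be sketched as follows. The network ranges and city aliases are made up for illustration; production would use a full geo database, where accuracy and coverage matter:

```python
import ipaddress

# Sketch of IP-to-location enrichment with locale-aware city normalization.
# GEO_DB entries and CITY_ALIASES below are illustrative examples only.

GEO_DB = [
    (ipaddress.ip_network("203.0.113.0/24"), {"city": "München", "country": "Germany"}),
    (ipaddress.ip_network("198.51.100.0/24"), {"city": "Bengaluru", "country": "India"}),
]

# Map locale spellings to one canonical city name.
CITY_ALIASES = {"München": "Munich", "Muenchen": "Munich"}

def enrich(event):
    """Attach city/country to an event based on its client IP."""
    ip = ipaddress.ip_address(event["client_ip"])
    for net, loc in GEO_DB:
        if ip in net:
            city = CITY_ALIASES.get(loc["city"], loc["city"])
            return {**event, "city": city, "country": loc["country"]}
    return {**event, "city": None, "country": None}

e = enrich({"client_ip": "203.0.113.42"})
```
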
16. A Platform to generate Insights
ML lifecycle: Data Collection → Data Discovery → Feature Engineering → Model Experimentation & Training → Model Evaluation → Model Deployment & Serving → Monitor & Improve.
• Data: Prod Data Lake and a Data Exploration Lake, with a Data Catalog.
• Development: Model Development (Databricks Notebooks & AutoML), Fast Data Exploration (Databricks SQL), Data Profiling & Quality (Spark & DB SQL).
• MLOps: Experiment Tracking, Model Registry, Testing Frameworks, CI/CD pipelines, Metadata Storage, Feature Store, Model Monitoring, Feedback API.
• Pipeline Management: Azure Data Factory, Databricks Scheduler, DLT.
• Execution & Serving: Streaming ML Jobs, Batch ML Jobs, Networking ML Jobs, Risk Score ML Job, Anomaly Detection Models, Model Serving on AKS.
• Goals: detect data drift, concept drift & anomalies; reduce false positives.
17. Frameworks are a dev's best friend
Apps built on the frameworks, all reading Processed Data from the Prod Data Lake and sharing the Offline/Online Feature Stores and a State Store:
• Batch App Framework: Risk Score, BOT Detection, WAF Violations (Citrix ADC)
• Streaming App Framework: Potential Data Exfiltration, Suspicious Logon
• Custom Indicator Framework: First Time Indicators, Excessive Operation Indicators
Framework Features:
• Avoid template code
• Feature Generation
• Feature Store
• State Management
• Custom Listeners
• Offset Monitoring
• Sample & Skeleton Apps
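The "avoid template code" idea can be sketched as a thin framework that owns the plumbing (offset tracking, listener hooks) while an app supplies only its transform. All names here are hypothetical, not the actual Citrix framework API:

```python
# Hypothetical sketch of a streaming app framework: app authors implement
# only business logic; the framework owns offset monitoring and listeners.

class StreamingAppFramework:
    def __init__(self, transform, listener=None):
        self.transform = transform   # app-specific logic
        self.listener = listener     # optional custom listener hook
        self.offsets = {}            # per-topic offset monitoring

    def process(self, topic, offset, event):
        self.offsets[topic] = offset          # framework tracks progress
        result = self.transform(event)        # app does the actual work
        if self.listener:
            self.listener(topic, offset, result)
        return result

# An "app" is just a transform function:
def risk_score_app(event):
    return {**event, "risk": min(100, event.get("failures", 0) * 10)}

app = StreamingAppFramework(risk_score_app)
out = app.process("gateway", 42, {"user": "u1", "failures": 3})
```
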
18. Custom Indicators (CI)
Customers want to create their own risk indicators, e.g. "Alert me when a user located in India launches a session from a new device."
A Security Admin defines a CI from templates (Condition + Trigger) for easy CI creation. The CI Controller creates structured streaming queries over Processed Data, backed by a State Store and a Meta Store: a zero-code Spark platform.
19. CI Design
Example CIs:
• CI 1: Geofence crossing. Data Source: Citrix Gateway. Condition: Event-Type = "VPN_AI" AND Country != "United States". Trigger: every time. (User has a successful authentication from outside their country of operation.)
• CI 2: Excessive authentication failure. Data Source: Citrix Gateway. Condition: Event-Type = "Authentication" AND Status-Code != "Successful login". Trigger: 3 times in 1 min.
• CI 3: First time access from new IP. Data Source: Citrix Gateway. Condition: Event-Type = "Authentication" AND Status-Code = "Successful login" AND Client-IP-Type != "private". Trigger: first time.
• CI 4: Geofence crossing. Data Source: ShareFile. Condition: Is-Employee = "False" AND Operation-Name = "Login" AND Country != "United States". Trigger: every time. (Login of a non-employee from outside the country of operation.)
Option 1 (developed ~3.5 years ago): one Spark Structured Streaming query per CI, each reading its source topic (Gateway or ShareFile).
• Not efficient, as the same data is read again and again.
• A pre-Spark-2.4 bug (SPARK-19185) caused issues when multiple Kafka consumers on the same executor read the same topic with different offsets.
Optimizing filter conditions: ONE Spark Structured Streaming query for CI 1, CI 2 & CI 3:
CASE
  WHEN ({condition for C1}) AND tenantId = '{tenantId}' THEN 'C1 UUID'
  WHEN ({condition for C2}) AND tenantId = '{tenantId}' THEN 'C2 UUID'
  WHEN ({condition for C3}) AND tenantId = '{tenantId}' THEN 'C3 UUID'
  ELSE null
END
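Assembling that single CASE expression from the registered CIs might look like the following sketch (CI ids and conditions are illustrative; in practice field names and values must be validated before interpolation, and there is a cap of about 1000 conditions per query):

```python
# Sketch: combine many per-CI filter conditions into ONE CASE expression so
# a single streaming query reads the topic once instead of once per CI.

def build_case_expression(cis):
    """cis: list of dicts with 'uuid', 'tenant_id', 'condition' (SQL snippet)."""
    branches = [
        f"WHEN ({ci['condition']}) AND tenantId = '{ci['tenant_id']}' "
        f"THEN '{ci['uuid']}'"
        for ci in cis
    ]
    return "CASE\n  " + "\n  ".join(branches) + "\n  ELSE null\nEND"

expr = build_case_expression([
    {"uuid": "ci-1", "tenant_id": "t1",
     "condition": "eventType = 'VPN_AI' AND country != 'United States'"},
    {"uuid": "ci-2", "tenant_id": "t1",
     "condition": "eventType = 'Authentication' AND status != 'Successful login'"},
])
# expr evaluates each row against every CI condition and emits the matching
# CI's UUID (or null), so one query serves all CIs.
```
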
• Broadcast trigger conditions for all CIs.
• Store relevant information for each CI + user in state.
• The logic to trigger an alert is coded in a custom mapGroupsWithState function.
First-time triggers:
• Load initial state into Spark memory to support "first-time" indicators.
• Shouldn't raise "first-time" alerts for newly onboarded customers.
• On-demand state load via Redis to reduce state size.
State Store:
• To support 1 billion states, 100 GB of HDFS space and 240 GB+ of local disk space are required.
• So we extended the local state store with RocksDBStateStoreProvider.
• Looking forward to Project Lightspeed, which, in addition to other streaming enhancements, has improved state management.
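The per-key trigger logic (e.g. "excessive authentication failure: 3 times in 1 min") can be sketched in pure Python. In the real pipeline Spark keeps this per-key state in the RocksDB-backed state store inside mapGroupsWithState; here a dict stands in, and the window/threshold values are just the example CI's:

```python
from collections import defaultdict

# Sketch of the stateful "N times in a window" trigger behind a CI.

WINDOW_SECONDS = 60   # "1 min" from CI 2
THRESHOLD = 3         # "3 times" from CI 2

state = defaultdict(list)  # key -> recent event timestamps (the "state store")

def update_state(key, timestamp):
    """Record an event for this key; return True when the CI should fire."""
    state[key].append(timestamp)
    # keep only timestamps inside the sliding window
    state[key] = [t for t in state[key] if timestamp - t < WINDOW_SECONDS]
    return len(state[key]) >= THRESHOLD

key = ("auth_failure", "user1")
fired = [update_state(key, t) for t in (0, 10, 20)]
# Third failure within the window trips the indicator.
```
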
Handling different triggers: filter events by conditions, then apply trigger conditions via Spark state operations against the RocksDB State Store; matches are written to the CI Hits topic.
• The read problem is solved by adding all conditions to a single SQL expression over the Gateway and ShareFile topics.
• If a condition is met, the corresponding CI's UUID is emitted.
• This DataFrame is later exploded with the relevant CI information.
• Max of 1000 conditions in one query.
20. CI Challenges
Adding new Custom Indicators or modifying existing CIs
• Both the create and the update CI operations result in a query restart.
• For now, this is not an issue. But as the number of CIs scales up, restart frequency will increase, which in turn will affect other CIs in the group and miss our SLOs.
• In the pipeline: use foreachBatch to pick up fresh CI details.
• Similarly, the new session window capability introduced in Spark 3.2 is promising. Some of the simple window-based CIs can be moved to session windows.
Stale state
• Since users can define varying time limits for "excessive trigger", the state timeout is set to a large value (days).
• States of CIs with smaller time limits can't be timed out in a timely manner.
• A state can only be accessed when an event with the corresponding key arrives (like event_type + user_name).
• So there is no easy way to access that user's state and delete it.
• One option is to send a dummy user event, access the user state, and delete it if its last "true" access was x months ago.
• This is not an issue now, but with the expected 5-10x growth it will affect our cost.
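The dummy-event clean-up option mentioned above can be sketched as follows. Since state is only reachable when an event for its key arrives, a synthetic event visits the state and evicts it if its last real access is too old. All names and the cutoff are illustrative:

```python
# Sketch of stale-state eviction via injected dummy events.

MAX_IDLE_SECONDS = 90 * 24 * 3600  # illustrative "x months" cutoff

def visit_state(state, key, event, now):
    """Called once per (key, event); dummy events only do housekeeping."""
    entry = state.get(key)
    if event.get("dummy"):
        # synthetic event: evict if the last *real* access is too old
        if entry and now - entry["last_real_access"] > MAX_IDLE_SECONDS:
            del state[key]
        return None
    # real event: refresh the access time
    state[key] = {"last_real_access": now}
    return entry

state = {}
visit_state(state, "user1", {"dummy": False}, now=0)
visit_state(state, "user1", {"dummy": True}, now=MAX_IDLE_SECONDS + 1)
# user1's state is gone: the dummy event found it idle past the cutoff.
```
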
21. Things to consider while creating streaming jobs on demand
• In a multi-tenant system, implement strong multi-tenancy checks. A rogue user can slip in a bogus tenant condition and gain access to other tenants' data.
• Only allow whitelisted field names, and validate user-supplied values.
• When combining multiple queries into one, even one query with wrong syntax will fail the whole group.
• Estimate the number of hits by running the condition on a database where the data is stored at rest. This helps both to estimate the memory requirements for a query and to inform users of the volume of alerts they will receive.
• While building initial state, load the state on demand. For example, don't load all user data during job start; there is a decent chance a good percentage of users will be dormant.
• Instead, keep the state in Redis or a similar fast DB and load it lazily.
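The field-whitelisting check could look like this sketch. The field list and the condition grammar it assumes are illustrative; the point is that a user-supplied condition can never name a field like tenantId, because the tenant filter is added by the platform, not taken from user input:

```python
import re

# Sketch: reject CI conditions that reference non-whitelisted field names.

ALLOWED_FIELDS = {"eventType", "country", "statusCode", "clientIpType"}

# Match identifiers that appear on the left of an = or != comparison.
FIELD_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*(?:=|!=)")

def validate_condition(condition):
    fields = set(FIELD_RE.findall(condition))
    bad = fields - ALLOWED_FIELDS
    if bad:
        raise ValueError(f"non-whitelisted fields: {sorted(bad)}")
    return True

ok = validate_condition("eventType = 'VPN_AI' AND country != 'United States'")
try:
    # a bogus tenant condition smuggled into a user-defined CI
    validate_condition("tenantId = 'other-tenant' AND country != 'US'")
    smuggled = True
except ValueError:
    smuggled = False
```
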
23. Let's look at a very recent "actual" attack
1. An engineer with a malware-infected laptop logs on.
2. Hacker 1 steals the credentials.
3. Hacker 1 sells the credentials.
4. Hacker 2 buys the credentials.
5. Hacker 2 runs an exhaustion attack.
6. Hacker 2 gets access.
24. Detecting security threats
How would Citrix have detected attacks of this type?
Configured Risk Indicators (computed from Processed Data flowing from Citrix Products through Streaming Ingestion): Suspicious Logon, Unusual Authentication Failures, Device with Blacklisted App, EPA Scan Failure, Unmanaged Device Detected, Impossible Travel, Potential Data Exfiltration. Some of these indicators are pre-configured for our customers.
• If Hacker 2 was in a faraway location, the Impossible Travel indicator would be triggered.
• It's highly probable that Hacker 2 was logging in from a new location, device and network (w.r.t. the employee), which would trigger the Suspicious Logon risk indicator.
• A security alert would be raised during the "exhaustion attack".
• The hacker might have avoided an EPA scan, but when the employee logged on, a blacklisted app or malware might have been detected.
• These risk indicators have a decent chance of being triggered.
< 1 minute detection time, via Spark Streaming & Custom Indicator Framework code.
25. Mitigating Threats
A Security Admin defines policies to handle threats; default policies are provided.
Actions:
• Log off, lock/disable the user
• Start session recording
• Reduce privilege or restrict access
• Enforce logon again with 2FA
• Notify via email
26. Mitigating Threats: From Risk Indicators to Actions
• Streaming/batch apps and Custom Indicators read Processed Data and produce Derived Data, generating security alerts.
• The Alert Listener Service decorates alerts.
• The Alert Processing Service takes actions on alerts as defined by customers and confirms the action result.
• A Security Admin uses the Policy UI and Policy Service to define policies: actions that need to be taken when certain alerts happen.
• Actions flow over a Service Bus to Citrix Products and Workspace, and out via webhooks.
Actions:
• Log off the user
• Start session recording
• Reduce privilege or restrict access
• Enforce logon again with 2FA
• Notify via email
27. Some metrics
• ~7 billion events per day; peak: 100k events/sec; expected growth: 10x in the next 2 years.
• Latency for the streaming path: ~1 min; latency for the batch path: ~15 mins.
• < 1 minute from ingestion through streaming/batch apps and Custom Indicators to emails & other notifications.
• Derived Data offered as a Data Mesh to micro services.
• In the last 6 months: 7 million risk indicators triggered.
• In the last 6 months: 1 million mitigating actions triggered.