Data correlation using PySpark and HDFS

An overview of Niara/IntroSpect's data correlation solution implemented using PySpark and HDFS.

  1. Data correlation using PySpark and HDFS
     John Conley, HPE
  2. Niara/IntroSpect
     ingest all kinds of data
     make it visible to analysts
     create actionable security intelligence
  3. Example: IP/user correlation
     MAC             Time
     01:AB:23:CD:45  12:34:56
     ↓ DHCP: 01:AB:23:CD:45 → 1.2.3.4 @ 11:15
     MAC             Time      Source IP
     01:AB:23:CD:45  12:34:56  1.2.3.4
     ↓ Active Directory: 1.2.3.4 → user1 @ 12:30
     MAC             Time      Source IP  Username
     01:AB:23:CD:45  12:34:56  1.2.3.4    user1
  4. Example: DNS correlation
     DNS server IP  Request  Time
     1.2.3.4        foo.com  12:34:56
     ↓ DNS log: 5.6.7.8 requested foo.com @ 12:35
     Source IP  Request  Time
     5.6.7.8    foo.com  12:34:56
  5. Terminology
     Data comes in as streams of records
     Some records provide bindings
     Some records are enriched by bindings
     Some do both
     The process of enriching we'll call correlation
  6. Complications
     Out-of-order and late-arriving data
       bindings might come after affected records
     Huge range of timescales
       some bindings last much longer than others
       some bindings are "timeless"
     High data volumes
       too many records to buffer for long
       too many bindings to keep in memory for long
  7. Temporal models
     1. current: bindings don't last very long, so correlation can be done in a small window
     2. keepalive: bindings last a long time, so a window of records correlates with a much bigger window of bindings
     3. timeless: bindings are static
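
To make the three models concrete, here is a small illustrative sketch (not from the slides) of how each model could choose the bindings time range for a batch of records; the function name and parameters are assumptions, not the framework's API.

```python
# Illustrative only: map a temporal model plus a record-batch time range to the
# bindings time range that has to be loaded.
def bindings_window(model, batch_start, batch_end, left=0, right=0):
    """Return (start, end) of the bindings range to load, in epoch seconds."""
    if model == "current":
        # Short-lived bindings: look only slightly around the record batch.
        return (batch_start - left, batch_end + right)
    if model == "keepalive":
        # Long-lived bindings (e.g. DHCP leases): look far to the left.
        return (batch_start - left, batch_end)
    if model == "timeless":
        # Static bindings: time is ignored entirely.
        return (None, None)
    raise ValueError("unknown temporal model: %s" % model)

# A DNS-style "current" correlation with 60 s on either side of a 5-minute batch:
print(bindings_window("current", 1600000000, 1600000300, left=60, right=60))
```
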
  8. Assumptions
     Use PySpark (1.4) on YARN for ETL
     Use RDD API
     Spark and HDFS only way to distribute data
     Can run as application on externally-managed Hadoop cluster
  9. First solution: Redis
     Idea
     Keep bindings in Redis DB
     Redis lookups to do correlation for each record
     This was the first implementation in the product
  10. First solution: Redis
      Design
      Master on driver node, slaves on worker nodes to support local reads (where possible)
      SortedSet to support temporal lookups
      Pipeline to mitigate network latency
      Periodic purge of old data to prevent OOM
      Lua scripts for atomic updates and purge
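
A minimal sketch of this design, assuming the redis-py client; the key naming, JSON payload, and zadd mapping signature (redis-py 3.x) are illustrative choices, not the product's actual schema.

```python
import json
import redis  # assumes the redis-py client

# Bindings for each lookup key live in a sorted set scored by timestamp, written
# through a pipeline; temporal lookup is a score-range query; purge trims old scores.
r = redis.Redis(host="driver-node", port=6379)

def write_bindings(bindings):
    """bindings: iterable of (lookup_key, timestamp, value_dict)."""
    pipe = r.pipeline()
    for key, ts, value in bindings:
        # redis-py 3.x signature: zadd(name, {member: score})
        pipe.zadd("bind:" + key, {json.dumps(value): ts})
    pipe.execute()

def lookup(key, ts, left=60, right=60):
    """Return bindings for `key` whose timestamp falls in [ts - left, ts + right]."""
    raw = r.zrangebyscore("bind:" + key, ts - left, ts + right)
    return [json.loads(v) for v in raw]

def purge(key, older_than_ts):
    """Drop bindings older than a cutoff to keep memory bounded."""
    r.zremrangebyscore("bind:" + key, "-inf", older_than_ts)
```
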
  11. First solution: Redis
      Issues
      Writes bottlenecked on master
      Limited support of out-of-order or late-arriving data without OOM
      Master/slave for local read not possible on externally-managed cluster
      Read timeouts during heavy write or purge
      (De-)serialization
  12. First solution: Redis
      Clustering?
      Redis Cluster, twemproxy, etc.
        sharding strategy?
        cross-shard queries
        Lua scripts
      RLEC
        sounds too good to be true! ($)
      Still need to maintain cluster, do purge, etc.
  13. Second solution: JOIN
      Idea
      rdd.join records RDD with bindings RDD
      Join based on lookup key(s)
      Then do temporal logic on joined values to do correlation
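
A minimal sketch of the join idea, assuming both RDDs are keyed by the lookup key (a MAC address here); record/binding shapes and the temporal rule are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="join-correlation-sketch")

# (mac, (record_ts, record_dict))
records = sc.parallelize([
    ("01:AB:23:CD:45", (45296, {"mac": "01:AB:23:CD:45"})),
])
# (mac, (binding_ts, source_ip))
bindings = sc.parallelize([
    ("01:AB:23:CD:45", (40500, "1.2.3.4")),
])

def apply_binding(joined):
    (record_ts, record), (binding_ts, ip) = joined
    # Temporal logic after the join: only accept bindings that precede the record.
    if binding_ts <= record_ts:
        record = dict(record, source_ip=ip)
    return record

correlated = records.join(bindings).values().map(apply_binding)
print(correlated.collect())
```
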
  14. Second solution: JOIN
      Issues
      PySpark (1.4) RDD join based on groupByKey
      Hotspotting problem with a popular key leads to bad performance and/or OOM
      Multiple correlations require multiple chained joins and a lot of shuffling
      Not straightforward how to translate to DataFrame/SQL join given complex temporal logic (but probably possible)
  15. Third time's the charm
      A practical approach
      Write the bindings to HDFS
      Read the relevant buckets on demand to do correlation
        Which buckets depends on temporal model
      Could split by join key(s), but that requires shuffling records for each correlation
      Like a broadcast join
  16. Extract bindings
      Configure what fields should go into bindings table
      For each record:
        Run a DAG of Transformers
        When appropriate, run Extractor to create Row
      Save resulting DataFrame to Parquet on HDFS
      Optional pluggable pre-aggregation logic
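
A rough sketch of this extraction step: the Transformer DAG is collapsed into one hypothetical extract_binding function, and the record shapes, field names, and HDFS path are assumptions rather than the product's real configuration.

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="extract-bindings-sketch")
sqlContext = SQLContext(sc)   # PySpark 1.4-era entry point

def extract_binding(record):
    """Extractor: turn a DHCP-style record into a binding Row, when applicable."""
    if record.get("event") == "dhcp_ack":
        return Row(mac=record["mac"], ip=record["ip"], ts=record["ts"])
    return None

records = sc.parallelize([
    {"event": "dhcp_ack", "mac": "01:AB:23:CD:45", "ip": "1.2.3.4", "ts": 40500},
    {"event": "http", "src": "1.2.3.4", "ts": 45296},   # provides no binding
])

bindings = records.map(extract_binding).filter(lambda row: row is not None)
df = sqlContext.createDataFrame(bindings)
# Bucketing by time happens elsewhere; here we just write one Parquet bucket.
df.write.mode("overwrite").parquet("hdfs:///bindings/dhcp/bucket=40000")
```
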
  17. Long-term aggregation
      Organize bindings and records in time buckets
      Aggregate buckets appropriate to the temporal model
        current: no aggregation necessary
        keepalive: push active bindings from bucket to bucket
          only push the most recent binding for each key
      Load same/adjacent binding bucket(s) when correlating a record bucket
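
One way the keepalive roll-forward could look, sketched in plain Python; the function and bucket layout are assumptions for illustration.

```python
# Carry the most recent binding per key from one time bucket into the next, so a
# record bucket only ever needs its own and the adjacent bindings bucket.
def roll_forward(prev_bucket, new_bindings):
    """
    prev_bucket: {key: (ts, value)} carried over from the previous bucket.
    new_bindings: iterable of (key, ts, value) extracted in the current bucket.
    Returns the aggregated bucket to write out and carry forward.
    """
    bucket = dict(prev_bucket)
    for key, ts, value in new_bindings:
        if key not in bucket or ts > bucket[key][0]:
            bucket[key] = (ts, value)   # keep only the most recent binding per key
    return bucket

b0 = roll_forward({}, [("01:AB:23:CD:45", 40500, "1.2.3.4")])
b1 = roll_forward(b0, [("01:AB:23:CD:45", 44100, "1.2.3.9")])
print(b1)   # {'01:AB:23:CD:45': (44100, '1.2.3.9')}
```
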
  18. Bindings (de-)serialization
      Broadcast variable?
      Creating and distributing broadcast variables in PySpark is expensive:
      1. Need to collect data to driver
      2. Then serialize data using pickle
      3. Then distribute using torrent protocol
      4. Then load and unpickle on worker side
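
For contrast, the broadcast-variable route this slide argues against might look like the sketch below; it is illustrative only, and every batch pays the collect, pickle, and torrent cost before any worker can look up a binding.

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-bindings-sketch")

# Pretend these bindings were just collected to the driver (step 1).
bindings_by_key = {"01:AB:23:CD:45": [(40500, "1.2.3.4")]}
bcast = sc.broadcast(bindings_by_key)   # steps 2-3: pickle, then torrent to workers

records = sc.parallelize([{"mac": "01:AB:23:CD:45", "ts": 45296}])
# Step 4: each worker unpickles the whole table before the first lookup.
enriched = records.map(lambda r: dict(r, ip_candidates=bcast.value.get(r["mac"], [])))
print(enriched.collect())
```
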
  19. Bindings (de-)serialization
      Custom HDFS-based solution
      1. Get DataFrame of relevant bindings buckets
      2. Sort by key and timestamp
      3. Serialize each binding using msgpack and base64
      4. Only serialize the values; keys known through config
      5. saveAsTextFile to store on HDFS
      6. Read back using snakebite in workers, deserialize
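
A hedged sketch of the per-line encoding in steps 3-5: serialize values only with msgpack, then wrap in base64 so every binding survives as a single text line for saveAsTextFile. The field order standing in for "keys known through config" is an assumption.

```python
import base64
import msgpack

BINDING_FIELDS = ["ts", "ip"]   # hypothetical config; field names are not serialized

def encode_binding(binding):
    """Dict with the configured fields -> one opaque text line."""
    values = [binding[f] for f in BINDING_FIELDS]
    return base64.b64encode(msgpack.packb(values)).decode("ascii")

def decode_binding(line):
    values = msgpack.unpackb(base64.b64decode(line), raw=False)
    return dict(zip(BINDING_FIELDS, values))

line = encode_binding({"ts": 40500, "ip": "1.2.3.4"})
print(decode_binding(line))   # {'ts': 40500, 'ip': '1.2.3.4'}

# Workers would read these lines back from HDFS (e.g. via snakebite) and run them
# through decode_binding(); the exact snakebite calls are omitted here.
```
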
  20. Correlation lifecycle
      Batch of records read into RDD
      Appropriate bindings read in mapPartitions
      For each record:
        Extract lookup data
        For each correlator:
          Look up binding(s) by key (dict) and timestamp (bisect)
          Apply lookup_update logic (config or custom)
        Apply record_update logic (config or custom)
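
A condensed, illustrative sketch of the per-partition lookup: bindings are held in a dict keyed by lookup key, with per-key timestamp lists kept sorted so bisect can find matches inside the configured window. The names and the tiny in-memory index are stand-ins for the real bucket reads.

```python
import bisect

def lookup_bindings(index, key, ts, left=60, right=60, limit=1):
    """index: {key: (sorted_ts_list, values_list)}; return up to `limit` values
    whose timestamp lies in [ts - left, ts + right]."""
    ts_list, values = index.get(key, ([], []))
    lo = bisect.bisect_left(ts_list, ts - left)
    hi = bisect.bisect_right(ts_list, ts + right)
    return values[lo:hi][-limit:]   # prefer the most recent matches

def correlate_partition(records):
    # In the real pipeline the relevant bindings bucket(s) would be read from HDFS
    # once per partition; a tiny in-memory index stands in for that here.
    index = {"01:AB:23:CD:45": ([40500], [{"ip": "1.2.3.4"}])}
    for record in records:
        matches = lookup_bindings(index, record.get("mac"), record["ts"],
                                  left=28800, right=0)
        if matches:
            record = dict(record, ip=matches[-1]["ip"])
        yield record

# Usage with an RDD (assuming `sc` is a SparkContext):
#   sc.parallelize(batch).mapPartitions(correlate_partition)
```
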
  21. Correlation config knobs
      Temporal
        current, keepalive, timeless
        delay_tolerance: how long do we wait for late-arriving data? (like a streaming watermark)
        left and right boundaries
        limit: number of bindings to retrieve
      Update logic
        default_value
        force_update
  22. Example configuration: DHCP correlation
      keepalive: long-lived leases
      [left, right] = [28800, 0]: 8-hour lease time
      limit = 1: take only the most recent binding
      force_update = False: don't override pre-existing IPs
  23. Example configuration: DNS correlation
      current: look for close-by logs
      [left, right] = [60, 60]: 1 min on either side
      limit = [1, 1]: one log from either side
      force_update = True: override source IP
      prefer_side = left: prefer the log whose ts < record ts
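
For readability, here are the two example configurations above rewritten as hypothetical Python dicts; the slides do not show the framework's actual config format.

```python
DHCP_CORRELATION = {
    "temporal_model": "keepalive",   # long-lived leases
    "left": 28800, "right": 0,       # look back up to the 8-hour lease time
    "limit": 1,                      # take only the most recent binding
    "force_update": False,           # don't override pre-existing IPs
}

DNS_CORRELATION = {
    "temporal_model": "current",     # look for close-by logs
    "left": 60, "right": 60,         # 1 minute on either side
    "limit": [1, 1],                 # one log from either side
    "force_update": True,            # override the source IP
    "prefer_side": "left",           # prefer a log ts earlier than the record ts
}
```
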
  24. Pluggable logic
      In addition to config, lots of places where custom Python code can be plugged in:
      BindingExtractor to get bindings from records
      Aggregator to pre-aggregate bindings
      LookupExtractor to get lookup keys from record
      LookupUpdater to update lookup keys after one correlation, before next
      RecordUpdater to update the record once all bindings have been obtained
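
The class names below come straight from the slide, but their method signatures are guesses, sketched only to show where custom Python code could slot into the pipeline.

```python
class BindingExtractor(object):
    def extract(self, record):
        """Return zero or more bindings derived from a record."""
        raise NotImplementedError

class Aggregator(object):
    def aggregate(self, bindings):
        """Optionally pre-aggregate bindings before they are written out."""
        return bindings

class LookupExtractor(object):
    def lookup_keys(self, record):
        """Return the key(s) used to search the bindings table for this record."""
        raise NotImplementedError

class LookupUpdater(object):
    def update(self, lookup, binding):
        """Adjust lookup keys after one correlation, before the next runs."""
        return lookup

class RecordUpdater(object):
    def update(self, record, bindings):
        """Fold all retrieved bindings back into the record."""
        return record
```
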
  25. Debugger
      Given a batch and a filter (e.g. record ID), re-runs correlation in 'local mode'
      Drops into ipdb debugger at each correlator
      Allows you to inspect the bindings and lookup fields, and step through the code
      Indispensable for debugging failed correlation
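
A bare-bones sketch of what such a debug hook might look like (an assumed shape, not the product's implementation), using ipdb.set_trace() to break before each correlator.

```python
import ipdb

def debug_correlation(batch, correlators, record_filter):
    """Re-run correlation locally for the records matching `record_filter`."""
    for record in batch:
        if not record_filter(record):
            continue
        for correlator in correlators:
            ipdb.set_trace()   # inspect bindings and lookup fields, then step
            record = correlator(record)
        print(record)

# Usage: debug_correlation(batch, correlators, lambda r: r["id"] == "rec-123")
```
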
  26. Status
      The good:
        This correlation framework has been alive and well in Niara/IntroSpect for a couple of years
        Has evolved to support multiple new data sources and correlations
        Has had decent stability and scalability
      To improve:
        Still needs better latency and throughput
        Needs to be easier to extend
  27. Some lessons learned
      Data correlation is a messy problem
        especially at scale
        especially with delayed/out-of-order data
      (Py)Spark is an exceptionally powerful and flexible platform, but you have to sweat the details
      Don't go too far either way on any tradeoff, e.g.:
        quick implementation vs. optimal design
        build it yourself vs. use state-of-the-art frameworks
  28. Thanks!
      Ideas welcome
