Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to collect, store and process terabytes of data from viruses"
2. Data engineering in cybersecurity: how
to collect, store and process terabytes
of data from viruses
Yaroslav Nedashkovsky, System Architect
SoftEleganceData
3. Agenda
1. Delphi by FDS
2. System Architecture
3. Data collection and processing
4. REST API
5. a-Gnostics — platform for analyzing petabytes of data
11. Data storage
A variety of data sources -> a variety of data stores
- Cassandra: 5 nodes, 2 TB each
- PostgreSQL (RDS): 1 node, 300 GB
- S3: 1.5 TB, plus 100 TB of historical data
- Elasticsearch: integration planned
12. Data storage - Cassandra
- 5 EC2 m4.4xlarge nodes: 16 vCPU, 64 GB RAM, EBS: 2 TB
- Replication factor: 3
- Compaction: LeveledCompactionStrategy
- phi_convict_threshold: 12 (recommended for EC2)
- Many tables with list columns holding values of a custom (user-defined) type
- Custom stress scripts (the cassandra-stress tool couldn't be used)
- Cassandra Cluster Manager (ccm) for local testing
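The "list column holding a custom type" modelling mentioned above can be sketched in CQL. The type and table names below (sample_meta, threat_event) are hypothetical illustrations, not the talk's actual schema; the DDL is held in Python strings so it stays self-contained:

```python
# Hedged sketch of a Cassandra schema with a list of a user-defined
# type (UDT). All names here are illustrative assumptions.

# A user-defined type for per-sample metadata.
CREATE_TYPE = """
CREATE TYPE IF NOT EXISTS sample_meta (
    sha256 text,
    family text,
    score  int
);
"""

# A table whose list column holds values of the custom type.
# Collections of UDTs must be declared frozen in Cassandra.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS threat_event (
    source_id  text,
    event_day  date,
    event_time timestamp,
    samples    list<frozen<sample_meta>>,
    PRIMARY KEY ((source_id, event_day), event_time)
);
"""

if __name__ == "__main__":
    print(CREATE_TYPE)
    print(CREATE_TABLE)
```

The composite partition key (source_id, event_day) is one common way to keep partitions bounded in time-series workloads; the talk does not specify its actual keys.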
13. Why Cassandra, and not something else?
- Main requirements:
• Linear scalability and high availability
• Multi-datacenter replication
• Good fit for our data structures
- Candidates: Cassandra, Riak, MongoDB, DynamoDB (spring 2017)
- Cassandra and Riak were selected for comparison
14. Cassandra vs Riak
Riak pros:
• Faster in read ops
• Easier cluster maintenance
Riak cons:
• Slower in "update" ops
• Heavier disk space usage
• Multi-datacenter replication
• Suddenly, Basho is dead?!
Cassandra pros:
• Faster in "update" ops
• Multi-datacenter replication
• CQL
Cassandra cons:
• Slower in read ops (can we live with this?)
• Cluster maintenance
Test, test, test before deploying to production!!!
15. Data storage - PostgreSQL
- 1 EC2 m4.4xlarge node: 16 vCPU, 64 GB RAM
- AWS RDS removes the headache of database management:
• Read Replicas
• Automated Backups
• Change instance type and storage capacity at runtime
• Monitoring
Before selecting RDS for production, review its limitations; it may not be the right choice for you!
16. PostgreSQL – table partitioning
- A single-node ("scale-in") approach to performance optimization
- Partitioning via table inheritance
- Partition elimination (constraint exclusion) at query time
- Easy to drop historical and otherwise unneeded data
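The inheritance-based partitioning above (the pre-PostgreSQL-10 approach, matching the talk's timeframe) can be sketched as a small DDL generator. The parent table name events and its columns are hypothetical, not from the talk:

```python
# Sketch: generate DDL for monthly partitions via table inheritance.
# Each child carries a CHECK constraint so the planner can skip it
# through constraint exclusion. All names are illustrative.
import datetime

PARENT_DDL = """
CREATE TABLE events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb
);
"""

def month_partition_ddl(year: int, month: int) -> str:
    """DDL for one monthly child partition of the hypothetical
    events table, covering [first of month, first of next month)."""
    start = datetime.date(year, month, 1)
    end = datetime.date(year + month // 12, month % 12 + 1, 1)
    name = f"events_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} (\n"
        f"    CHECK (created_at >= '{start}' AND created_at < '{end}')\n"
        f") INHERITS (events);\n"
    )

if __name__ == "__main__":
    print(PARENT_DDL)
    print(month_partition_ddl(2017, 12))
    print(month_partition_ddl(2018, 1))
```

Dropping a month of historical data is then a single DROP TABLE on the child, which is what makes the "easy drop" point above cheap compared to a bulk DELETE.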
18. AWS Kinesis
- Real-time platform for data streaming
- Stream consists of shards
- Scale input by splitting shards
- 1MB/second ingest rate (for one shard)
- 2MB/second egress rate (for one shard)
- KPL/KCL libraries for producing/consuming data
- In terms of Kafka:
Topic -> Stream
Partition -> Shard
ZooKeeper -> DynamoDB
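Since scaling is done by splitting shards, the per-shard limits above translate directly into a sizing calculation. A minimal sketch (the helper name is an assumption; the 1,000 records/s per-shard write cap is a documented Kinesis limit not shown on the slide):

```python
# Sketch: minimum Kinesis shard count for a given throughput,
# using the per-shard limits quoted on the slide.
import math

INGEST_MB_PER_SHARD = 1.0   # 1 MB/s write per shard
EGRESS_MB_PER_SHARD = 2.0   # 2 MB/s read per shard
RECORDS_PER_SHARD = 1000    # documented write cap, records/s per shard

def shards_needed(ingest_mb_s: float, egress_mb_s: float,
                  records_s: int = 0) -> int:
    """Minimum shard count satisfying ingest, egress, and
    (optionally) record-rate limits simultaneously."""
    return max(
        math.ceil(ingest_mb_s / INGEST_MB_PER_SHARD),
        math.ceil(egress_mb_s / EGRESS_MB_PER_SHARD),
        math.ceil(records_s / RECORDS_PER_SHARD) if records_s else 1,
    )

if __name__ == "__main__":
    # e.g. 3 MB/s in and 4 MB/s out -> max(3, 2) = 3 shards
    print(shards_needed(3.0, 4.0))
```

Whichever direction is the bottleneck decides the shard count, which is why consumers with fan-out (several applications reading the same stream) often force more shards than the ingest rate alone would.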
19. Spark Streaming
- 3 EC2 c5.2xlarge nodes: 8 vCPU, 16 GB RAM
- 2 streams, 4 shards per stream
- Deploy in standalone mode
- amazon-kinesis-data-generator for test data flow
Good practice:
- Total processing time should be less than the batch interval
- Balance the load: make the number of receivers (DStreams) a multiple of the number of executors
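The two good-practice rules above can be written down as simple health checks, which is how they are usually monitored in practice. A hedged sketch; the function names are illustrative, not from the talk's codebase:

```python
# Sketch of the two Spark Streaming rules of thumb as checks.

def is_stable(processing_time_s: float, batch_interval_s: float) -> bool:
    """A streaming job keeps up only if each micro-batch is processed
    faster than new batches arrive; otherwise batches queue up and
    scheduling delay grows without bound."""
    return processing_time_s < batch_interval_s

def is_balanced(num_receivers: int, num_executors: int) -> bool:
    """Receivers (one per input DStream) spread evenly over executors
    only when their count is a multiple of the executor count."""
    return num_receivers % num_executors == 0

if __name__ == "__main__":
    print(is_stable(8.0, 10.0))   # batch done in 8 s of a 10 s interval
    print(is_balanced(8, 4))      # 8 receivers over 4 executors
```

In the Spark UI the first rule shows up as "Total Delay" staying flat rather than climbing batch after batch.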
23. Structured Streaming
- Will it replace micro-batch streaming in the future?!
- Spark 2.3.0 introduced Continuous Processing
- Kafka and file sources are available for production use
- SPARK-18165: Kinesis support (currently available only to Databricks customers)