Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to collect, store and process terabytes of data from viruses"
2. Data engineering in cybersecurity: how
to collect, store and process terabytes
of data from viruses
Yaroslav Nedashkovsky, System Architect
SoftEleganceData
3. Agenda
1. Delphi by FDS
2. System Architecture
3. Data collection and processing
4. REST API
5. a-Gnostics — platform for analyzing petabytes of data
11. Data storage
A variety of data sources -> a variety of data stores
- Cassandra: 5 nodes, 2 TB each
- PostgreSQL (RDS): 1 node, 300 GB
- S3: 1.5 TB, plus 100 TB of historical data
- Elasticsearch: integration planned
12. Data storage - Cassandra
- 5 EC2 m4.4xlarge nodes: 16 vCPU, 64 GB RAM, EBS: 2 TB
- Replication factor: 3
- Compaction: LeveledCompactionStrategy
- phi_convict_threshold: 12 (recommended for EC2)
- Many tables with list columns holding values of a custom (user-defined) type
- Custom stress scripts (the cassandra-stress tool couldn't be used)
- Cassandra Cluster Manager (ccm) for local testing
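The "list column holding a custom type" modelling mentioned above can be sketched in CQL. The type and table names below (sample_meta, threat_event) are hypothetical illustrations, not the talk's actual schema; the DDL is held in Python strings so it stays self-contained:

```python
# Hedged sketch of a Cassandra schema with a list of a user-defined
# type (UDT). All names here are illustrative assumptions.

# A user-defined type for per-sample metadata.
CREATE_TYPE = """
CREATE TYPE IF NOT EXISTS sample_meta (
    sha256 text,
    family text,
    score  int
);
"""

# A table whose list column holds values of the custom type.
# Collections of UDTs must be declared frozen in Cassandra.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS threat_event (
    source_id  text,
    event_day  date,
    event_time timestamp,
    samples    list<frozen<sample_meta>>,
    PRIMARY KEY ((source_id, event_day), event_time)
);
"""

if __name__ == "__main__":
    print(CREATE_TYPE)
    print(CREATE_TABLE)
```

The composite partition key (source_id, event_day) is one common way to keep partitions bounded in time-series workloads; the talk does not specify its actual keys.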
13. Why Cassandra, and not something else?
- Main requirements:
• Linear scalability and high availability
• Multi-datacenter replication
• Good fit for our data structures
- Candidates: Cassandra, Riak, MongoDB, DynamoDB (spring 2017)
- Cassandra and Riak were selected for comparison
14. Cassandra vs Riak
Riak pros:
• Faster in read ops
• Easier cluster maintenance
Riak cons:
• Slower in "update" ops
• Heavier disk space usage
• Multi-datacenter replication
• Suddenly, Basho is dead?!
Cassandra pros:
• Faster in "update" ops
• Multi-datacenter replication
• CQL
Cassandra cons:
• Slower in read ops (can we live with this?)
• Cluster maintenance
Test, test, test before deploying to production!!!
15. Data storage - PostgreSQL
- 1 EC2 m4.4xlarge node: 16 vCPU, 64 GB RAM
- AWS RDS removes the headache of database management:
• Read Replicas
• Automated Backups
• Change instance type and storage capacity at runtime
• Monitoring
Before selecting RDS for production, review its limitations; it may not be the right choice for you!
16. PostgreSQL – table partitioning
- A single-node ("scale-in") approach to performance optimization
- Partitioning via table inheritance
- Partition elimination (constraint exclusion) at query time
- Easy to drop historical and otherwise unneeded data
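The inheritance-based partitioning above (the pre-PostgreSQL-10 approach, matching the talk's timeframe) can be sketched as a small DDL generator. The parent table name events and its columns are hypothetical, not from the talk:

```python
# Sketch: generate DDL for monthly partitions via table inheritance.
# Each child carries a CHECK constraint so the planner can skip it
# through constraint exclusion. All names are illustrative.
import datetime

PARENT_DDL = """
CREATE TABLE events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb
);
"""

def month_partition_ddl(year: int, month: int) -> str:
    """DDL for one monthly child partition of the hypothetical
    events table, covering [first of month, first of next month)."""
    start = datetime.date(year, month, 1)
    end = datetime.date(year + month // 12, month % 12 + 1, 1)
    name = f"events_{start:%Y_%m}"
    return (
        f"CREATE TABLE {name} (\n"
        f"    CHECK (created_at >= '{start}' AND created_at < '{end}')\n"
        f") INHERITS (events);\n"
    )

if __name__ == "__main__":
    print(PARENT_DDL)
    print(month_partition_ddl(2017, 12))
    print(month_partition_ddl(2018, 1))
```

Dropping a month of historical data is then a single DROP TABLE on the child, which is what makes the "easy drop" point above cheap compared to a bulk DELETE.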
18. AWS Kinesis
- Real-time platform for data streaming
- Stream consists of shards
- Scale input by splitting shards
- 1MB/second ingest rate (for one shard)
- 2MB/second egress rate (for one shard)
- KPL/KCL libraries for producing/consuming data
- In terms of Kafka:
Topic -> Stream
Partition -> Shard
ZooKeeper -> DynamoDB
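Since scaling is done by splitting shards, the per-shard limits above translate directly into a sizing calculation. A minimal sketch (the helper name is an assumption; the 1,000 records/s per-shard write cap is a documented Kinesis limit not shown on the slide):

```python
# Sketch: minimum Kinesis shard count for a given throughput,
# using the per-shard limits quoted on the slide.
import math

INGEST_MB_PER_SHARD = 1.0   # 1 MB/s write per shard
EGRESS_MB_PER_SHARD = 2.0   # 2 MB/s read per shard
RECORDS_PER_SHARD = 1000    # documented write cap, records/s per shard

def shards_needed(ingest_mb_s: float, egress_mb_s: float,
                  records_s: int = 0) -> int:
    """Minimum shard count satisfying ingest, egress, and
    (optionally) record-rate limits simultaneously."""
    return max(
        math.ceil(ingest_mb_s / INGEST_MB_PER_SHARD),
        math.ceil(egress_mb_s / EGRESS_MB_PER_SHARD),
        math.ceil(records_s / RECORDS_PER_SHARD) if records_s else 1,
    )

if __name__ == "__main__":
    # e.g. 3 MB/s in and 4 MB/s out -> max(3, 2) = 3 shards
    print(shards_needed(3.0, 4.0))
```

Whichever direction is the bottleneck decides the shard count, which is why consumers with fan-out (several applications reading the same stream) often force more shards than the ingest rate alone would.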
19. Spark Streaming
- 3 EC2 c5.2xlarge nodes: 8 vCPU, 16 GB RAM
- 2 streams, 4 shards per stream
- Deploy in standalone mode
- amazon-kinesis-data-generator for test data flow
Good practice:
- Total processing time should be less than the batch interval
- Balance the load: make the number of receivers (DStreams) a multiple of the number of executors
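The two good-practice rules above can be written down as simple health checks, which is how they are usually monitored in practice. A hedged sketch; the function names are illustrative, not from the talk's codebase:

```python
# Sketch of the two Spark Streaming rules of thumb as checks.

def is_stable(processing_time_s: float, batch_interval_s: float) -> bool:
    """A streaming job keeps up only if each micro-batch is processed
    faster than new batches arrive; otherwise batches queue up and
    scheduling delay grows without bound."""
    return processing_time_s < batch_interval_s

def is_balanced(num_receivers: int, num_executors: int) -> bool:
    """Receivers (one per input DStream) spread evenly over executors
    only when their count is a multiple of the executor count."""
    return num_receivers % num_executors == 0

if __name__ == "__main__":
    print(is_stable(8.0, 10.0))   # batch done in 8 s of a 10 s interval
    print(is_balanced(8, 4))      # 8 receivers over 4 executors
```

In the Spark UI the first rule shows up as "Total Delay" staying flat rather than climbing batch after batch.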
23. Structured Streaming
- Will it replace micro-batch streaming in the future?!
- Spark 2.3.0 introduced Continuous Processing
- Kafka and file sources are available for production use
- SPARK-18165: Kinesis support (currently available only to Databricks customers)