Data engineering in cybersecurity: how to collect, store and process terabytes of data from viruses
Yaroslav Nedashkovsky, System Architect
SoftEleganceData
Agenda
1. Delphi by FDS
2. System Architecture
3. Data collection and processing
4. REST API
5. a-Gnostics — platform for analyzing petabytes of data
1. Delphi by FDS
Challenges
- Variety of data sources
- Heterogeneous cloud infrastructure
- All sensitive data should be encrypted
2. System Architecture
DevOps
Continuous Integration
Data storage
Variety of data sources -> variety of data stores
- Cassandra: 5 nodes, 2 TB
- PostgreSQL (RDS): 1 node, 300 GB
- S3: 1.5 TB plus 100 TB of historical data
- Elasticsearch: integration planned
Data storage - Cassandra
- 5 EC2 m4.4xlarge nodes: 16 vCPU, 64 GB RAM, EBS: 2 TB
- Replication factor: 3
- Compaction: LeveledCompactionStrategy
- phi_convict_threshold: 12 (EC2)
- Many tables with list columns that hold values of a custom type
- Custom stress scripts (cassandra-stress tool couldn’t be used)
- Cassandra Cluster Manager (ccm) for local testing
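As a rough sanity check on the numbers above, the sketch below works out how much unique data such a cluster can hold and how many replicas may be down. It assumes the 2 TB of EBS is per node and that requests use QUORUM consistency; neither is stated in the talk, and the object and method names are illustrative.

```scala
// Back-of-the-envelope sizing for a replicated Cassandra cluster.
// Assumptions (not from the talk): 2 TB is per node, requests use QUORUM.
object ClusterSizing {
  /** Raw disk across the whole cluster, in TB. */
  def rawCapacityTb(nodes: Int, diskPerNodeTb: Double): Double =
    nodes * diskPerNodeTb

  /** Upper bound on unique data: every row is stored `replicationFactor` times. */
  def uniqueCapacityTb(nodes: Int, diskPerNodeTb: Double, replicationFactor: Int): Double =
    rawCapacityTb(nodes, diskPerNodeTb) / replicationFactor

  /** Replicas that must answer a QUORUM request: floor(RF / 2) + 1. */
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1

  /** Replicas of a row that may be down while QUORUM still succeeds. */
  def tolerableReplicaFailures(replicationFactor: Int): Int =
    replicationFactor - quorum(replicationFactor)
}
```

With 5 nodes, 2 TB each and a replication factor of 3, this gives 10 TB raw and roughly 3.3 TB of unique data, before leaving headroom for compaction.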
Why Cassandra and not something else?
- Main requirements:
• Linear scalability and high availability
• Multi-datacenter replication
• Fits our data structure
- Candidates: Cassandra, Riak, MongoDB, DynamoDB (spring 2017)
- Cassandra and Riak were selected for comparison
Cassandra vs Riak
Riak pros:
• Faster read ops
• Cluster maintenance
Riak cons:
• Slower "update" ops
• Disk space usage
• Multi-datacenter replication
• Suddenly: is Basho dead?!
Cassandra pros:
• Faster "update" ops
• Multi-datacenter replication
• CQL
Cassandra cons:
• Slower read ops (could we skip this?)
• Cluster maintenance
Test, test, test before deploying to production!!!
Data storage - PostgreSQL
- 1 EC2 m4.4xlarge node: 16 vCPU, 64 GB RAM
- AWS RDS takes the headache out of database management:
• Read Replicas
• Automated Backups
• Change EC2 instance and storage capacity at runtime
• Monitoring
Before selecting RDS for production, check its limitations; it may not be the right choice for you!
PostgreSQL – table partitioning
- Scale-in solution for performance optimization
- Partitioning via table inheritance
- Partition elimination (constraint exclusion) at query time
- Easy to drop historical and otherwise unneeded data
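The inheritance approach can be sketched as a small DDL generator. The parent table `drop_data` and the column `received_at` are hypothetical names chosen for illustration, and monthly partitions are an assumption; the talk does not give the schema or the partition key. The trigger or rule that routes inserts into the right child table is omitted.

```scala
import java.time.YearMonth

// Generates DDL for inheritance-based monthly partitions.
// Table and column names are hypothetical, not taken from the talk.
object PartitionDdl {
  val parentTable = "drop_data"
  val timeColumn = "received_at"

  /** Child table name such as drop_data_2018_03. */
  def childName(month: YearMonth): String =
    "%s_%04d_%02d".format(parentTable, month.getYear, month.getMonthValue)

  /** CREATE TABLE for one child; the CHECK constraint lets the planner skip it. */
  def createPartition(month: YearMonth): String = {
    val from = month.atDay(1)               // inclusive lower bound
    val to = month.plusMonths(1).atDay(1)   // exclusive upper bound
    s"""CREATE TABLE ${childName(month)} (
       |  CHECK ($timeColumn >= DATE '$from' AND $timeColumn < DATE '$to')
       |) INHERITS ($parentTable);""".stripMargin
  }

  /** Dropping a whole month is a single cheap statement instead of a bulk DELETE. */
  def dropPartition(month: YearMonth): String =
    s"DROP TABLE ${childName(month)};"
}
```

A query that filters on `received_at` then only touches the children whose CHECK constraints can match, provided constraint exclusion is enabled.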
3. Data collection and processing
AWS Kinesis
- Real-time platform for data streaming
- Stream consists of shards
- Scale input by splitting shards
- 1 MB/s ingest rate per shard
- 2 MB/s egress rate per shard
- KPL/KCL libraries for producing/consuming data
- In Kafka terms:
Topic -> Stream
Partition -> Shard
ZooKeeper -> DynamoDB
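The per-shard limits above translate directly into a shard count. The following is a minimal sketch of that arithmetic; the object and method names are illustrative, and it only considers byte throughput, not per-shard record-count limits.

```scala
// Shard count from the per-shard throughput limits quoted above.
object ShardSizing {
  val IngestMbPerSecPerShard = 1.0
  val EgressMbPerSecPerShard = 2.0

  /** Smallest shard count that satisfies both the write side and the read side. */
  def requiredShards(ingestMbPerSec: Double, egressMbPerSec: Double): Int = {
    val forIngest = math.ceil(ingestMbPerSec / IngestMbPerSecPerShard).toInt
    val forEgress = math.ceil(egressMbPerSec / EgressMbPerSecPerShard).toInt
    math.max(1, math.max(forIngest, forEgress))
  }
}
```

For example, `ShardSizing.requiredShards(3.5, 7.0)` evaluates to 4. Going the other way, a stream with 4 shards accepts at most about 4 MB/s of input.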
Spark Streaming
- 3 EC2 c5.2xlarge nodes: 8 vCPU, 16 GB RAM
- 2 streams, 4 shards per stream
- Deploy in standalone mode
- amazon-kinesis-data-generator for test data flow
Good practice:
- Total processing time is less than the batch interval
- Well-balanced load: the number of receivers (DStreams) is a multiple of the number of executors
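Both rules of thumb can be checked with simple arithmetic. The sketch below is only an illustration with made-up names; in a real job the processing times would come from the Spark UI or a streaming listener. It shows how batches that run longer than the interval pile up as scheduling delay.

```scala
// Illustrative checks for the two practices above; not part of the talk's code.
object BatchHealth {
  /** True when every observed batch finished within the batch interval. */
  def keepsUp(batchIntervalMs: Long, processingTimesMs: Seq[Long]): Boolean =
    processingTimesMs.forall(_ < batchIntervalMs)

  /** Scheduling delay left after the given batches: slow batches add to it, fast ones drain it. */
  def finalDelayMs(batchIntervalMs: Long, processingTimesMs: Seq[Long]): Long =
    processingTimesMs.foldLeft(0L) { (delay, t) =>
      math.max(0L, delay + t - batchIntervalMs)
    }

  /** True when receivers can be spread evenly over executors. */
  def evenlySpread(receivers: Int, executors: Int): Boolean =
    executors > 0 && receivers % executors == 0
}
```

With a 1000 ms interval and batches taking 800, 1200 and 1300 ms, the job ends up 500 ms behind; if batches keep running over the interval, the delay grows without bound.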
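Spark Streaming - KinesisInputDStream
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.streaming.kinesis.KinesisInputDStream

// Receiver-backed DStream over the Kinesis stream; each record is parsed by DataStreamParser.parse.
val kinesisDataStream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName(kinesisStreamName)
  .checkpointAppName(kinesisAppName)                        // also names the DynamoDB table the KCL uses for checkpoints
  .initialPositionInStream(InitialPositionInStream.LATEST)  // start from the newest records when no checkpoint exists
  .checkpointInterval(kinesisCheckpointInterval)
  .buildWithMessageHandler(DataStreamParser.parse)

// Hand every micro-batch to the sink logic.
kinesisDataStream.foreachRDD(rdd => DataStreamHandler.streamProcessor(rdd))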
Spark Streaming - StreamProcessor
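import com.maxmind.db.CHMCache
import com.maxmind.geoip2.DatabaseReader
import com.zaxxer.hikari.HikariDataSource
import org.apache.spark.rdd.RDD

// lazy: assuming these are members of the DataStreamHandler object,
// each executor JVM creates its own pool and GeoIP reader on first use.
lazy val hds = new HikariDataSource(hikariConfig)
lazy val geoIpReader = new DatabaseReader.Builder(geoIpFile).withCache(new CHMCache()).build()

def streamProcessor(rdd: RDD[DropDataRecord]): Unit = {
  rdd.foreachPartition { partition =>
    // One pooled connection and one batched INSERT per partition.
    val dropdataConnection = hds.getConnection()
    try {
      val statement = dropdataConnection.prepareStatement(insertStatement)
      partition.foreach { sinkHoleDropDataRecord =>
        ...
        val countryResponse = geoIpReader.country(inetAddress)
        ...
        statement.addBatch()
      }
      statement.executeBatch()
    } finally {
      dropdataConnection.close() // return the connection to the pool even if the batch fails
    }
  }
}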
Spark Streaming + Kinesis
Structured Streaming
- Will it replace batch (DStream) streaming in the future?!
- Spark 2.3.0 introduces Continuous Processing
- Kafka and file sources are available for production use
- SPARK-18165 – Kinesis support (currently available only to Databricks customers)
4. REST API
- Access to data and “value” is provided through REST endpoints
- A token (JWT) is generated for each client
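Per-client token generation could look roughly like the sketch below. It assumes HS256 (HMAC-SHA256 with a shared secret) and the `sub`/`exp` claims; the talk does not say which algorithm, claims or library are used, so every name here is illustrative. Only JDK classes are used.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import java.util.Base64
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import scala.util.Try

// Minimal HS256 JWT sketch; algorithm and claims are assumptions, not the talk's design.
object JwtSketch {
  private val encoder = Base64.getUrlEncoder.withoutPadding()

  private def b64(bytes: Array[Byte]): String = encoder.encodeToString(bytes)

  private def hmacSha256(secret: Array[Byte], data: String): Array[Byte] = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret, "HmacSHA256"))
    mac.doFinal(data.getBytes(UTF_8))
  }

  /** Builds header.payload.signature. clientId must not contain characters that need JSON escaping. */
  def issue(clientId: String, expiresAtEpochSec: Long, secret: Array[Byte]): String = {
    val header = """{"alg":"HS256","typ":"JWT"}"""
    val payload = s"""{"sub":"$clientId","exp":$expiresAtEpochSec}"""
    val signingInput = b64(header.getBytes(UTF_8)) + "." + b64(payload.getBytes(UTF_8))
    signingInput + "." + b64(hmacSha256(secret, signingInput))
  }

  /** Recomputes the signature and compares in constant time. Does not check `exp`. */
  def hasValidSignature(token: String, secret: Array[Byte]): Boolean =
    token.split('.') match {
      case Array(h, p, sig) =>
        val expected = hmacSha256(secret, h + "." + p)
        Try(Base64.getUrlDecoder.decode(sig)).toOption
          .exists(actual => MessageDigest.isEqual(expected, actual))
      case _ => false
    }
}
```

A REST endpoint would call `hasValidSignature` on the bearer token and additionally reject tokens whose `exp` claim is in the past; a production service would more likely rely on an established JWT library than on hand-rolled code like this.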
REST API – value for the customer
Next steps
- Stream visualization
- Add caching (ElastiCache)
- Elasticsearch integration
- AI, ML
5. a-Gnostics
COLLECT. ANALYZE. INSIGHT.
data@softelegance.com
www.a-gnostics.com

NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 

Recently uploaded (20)

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 

Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to collect, store and process terabytes of data from viruses"

  • 2. Data engineering in cybersecurity: how to collect, store and process terabytes of data from viruses Yaroslav Nedashkovsky, System Architect SoftEleganceData
  • 3. Agenda 1. Delphi by FDS 2. System Architecture 3. Collection and processing data 4. REST API 5. a-Gnostics — platform for analyzing petabytes of data
  • 6. Challenges - Variety of data sources - Heterogeneous cloud infrastructure - All sensitive data should be encrypted
  • 11. Data storage
        Variety of data sources -> variety of data storages
        - Cassandra, 5 nodes, 2 TB
        - PostgreSQL (RDS), 1 node, 300 GB
        - S3, 1.5 TB and 100 TB of historical data
        - Elasticsearch integration is planned
  • 12. Data storage - Cassandra
        - 5 EC2 m4.4xlarge nodes: 16 vCPU, 64 GB RAM, EBS: 2 TB
        - Replication factor: 3
        - Compaction: LeveledCompactionStrategy
        - phi_convict_threshold: 12 (EC2)
        - Many tables with list columns that hold values of a custom type
        - Custom stress scripts (the cassandra-stress tool couldn't be used)
        - Cassandra Cluster Manager (ccm) for local testing
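The table layout on this slide (list columns holding a custom type, leveled compaction, replication factor 3) could look roughly like the CQL below, held here as Scala strings. This is only a sketch: the keyspace, type, table, and column names and the datacenter name `dc1` are hypothetical and do not come from the talk.

```scala
// Sketch only: every identifier below is hypothetical, not the project's schema.
object CassandraSchemaSketch {
  val keyspace = "threat_data" // assumed keyspace name

  // Replication factor 3, as on the slide; "dc1" is an assumed datacenter name.
  val createKeyspace: String =
    s"""CREATE KEYSPACE IF NOT EXISTS $keyspace
       |WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};""".stripMargin

  // A user-defined type that will be stored inside a list column.
  val createType: String =
    s"""CREATE TYPE IF NOT EXISTS $keyspace.connection_event (
       |  observed_at timestamp,
       |  remote_ip inet,
       |  remote_port int
       |);""".stripMargin

  // A list of the custom type; a UDT used inside a collection must be frozen.
  val createTable: String =
    s"""CREATE TABLE IF NOT EXISTS $keyspace.sample_activity (
       |  sample_id uuid PRIMARY KEY,
       |  events list<frozen<connection_event>>
       |) WITH compaction = {'class': 'LeveledCompactionStrategy'};""".stripMargin

  val statements: Seq[String] = Seq(createKeyspace, createType, createTable)
}
```

The statements can be run in order against a local ccm cluster before trying them on EC2.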
  • 13. Why Cassandra, maybe something else?
        - Main requirements:
          • Linear scalability and high availability
          • Multi-datacenter replication
          • Fit for our data structure
        - Candidates: Cassandra, Riak, MongoDB, DynamoDB (spring 2017)
        - Cassandra and Riak were selected for comparison
  • 14. Cassandra vs Riak
        Riak plus:
          • Faster read ops
          • Cluster maintenance
        Riak minus:
          • Slower «update» ops
          • Disk space usage
          • Multi-datacenter replication
          • Suddenly - Basho is dead?!
        Cassandra plus:
          • Faster «update» ops
          • Multi-datacenter replication
          • CQL
        Cassandra minus:
          • Slower read ops (could we skip this?)
          • Cluster maintenance
        Test, test, test before deploying to production!!!
  • 15. Data storage - PostgreSQL
        - 1 EC2 m4.4xlarge node: 16 vCPU, 64 GB RAM
        - AWS RDS removes the headache of database management:
          • Read replicas
          • Automated backups
          • Change EC2 instance and storage capacity at runtime
          • Monitoring
        Before selecting RDS for production, look at its limitations; maybe it is not the right choice for you!!!
  • 16. PostgreSQL – table partitioning
        - Scale-in solution for performance optimization
        - Partitioning via table inheritance
        - Partition elimination during a query
        - Easy to drop historical and unneeded data
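Inheritance-based partitioning as listed on this slide can be sketched as a small DDL generator: one child table per month, each with a CHECK constraint on the timestamp so the planner can skip non-matching children. The parent table `drop_data`, the column `received_at`, and the monthly granularity are assumptions made for illustration.

```scala
import java.time.YearMonth

// Sketch only: table name, column name, and monthly granularity are assumed.
object PartitionDdlSketch {
  val parent = "drop_data"
  val tsColumn = "received_at"

  // e.g. drop_data_2018_03
  def childName(month: YearMonth): String =
    "%s_%04d_%02d".format(parent, month.getYear, month.getMonthValue)

  // Child table that inherits from the parent. The CHECK constraint is what
  // lets constraint exclusion eliminate the partition for other date ranges.
  def createChild(month: YearMonth): String = {
    val from = month.atDay(1)
    val to = month.plusMonths(1).atDay(1)
    s"""CREATE TABLE ${childName(month)} (
       |  CHECK ($tsColumn >= DATE '$from' AND $tsColumn < DATE '$to')
       |) INHERITS ($parent);""".stripMargin
  }

  // Dropping a month of history is one statement, with no row-by-row DELETE.
  def dropChild(month: YearMonth): String = s"DROP TABLE ${childName(month)};"
}
```

With this style of partitioning, rows still have to be routed to the right child, typically by inserting into the child directly or through a trigger on the parent; that part is not shown here.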
  • 17. 3. Collection and processing data
  • 18. AWS Kinesis
        - Real-time platform for data streaming
        - A stream consists of shards
        - Scale input by splitting shards
        - 1 MB/second ingest rate (per shard)
        - 2 MB/second egress rate (per shard)
        - KPL/KCL libraries for producing/consuming data
        - In Kafka terms:
          Topic -> Stream
          Partition -> Shard
          ZooKeeper -> DynamoDB
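The per-shard limits on this slide turn directly into a sizing rule: take the larger of the shard counts needed for ingest and for egress. A minimal sketch follows, using only the two throughput figures from the slide and ignoring any other per-shard limits; the object and function names are made up for illustration.

```scala
// Sketch only: sizes a stream from the two per-shard throughput limits above.
object ShardSizingSketch {
  val IngestMBPerShard = 1.0 // write limit per shard, from the slide
  val EgressMBPerShard = 2.0 // read limit per shard, from the slide

  def shardsNeeded(ingestMBps: Double, egressMBps: Double): Int = {
    require(ingestMBps >= 0 && egressMBps >= 0, "throughput must be non-negative")
    val forIngest = math.ceil(ingestMBps / IngestMBPerShard).toInt
    val forEgress = math.ceil(egressMBps / EgressMBPerShard).toInt
    // A stream always has at least one shard.
    math.max(1, math.max(forIngest, forEgress))
  }
}
```

For example, 3.5 MB/s in and 7 MB/s out needs 4 shards on both counts, while 1 MB/s in read by three consumers (6 MB/s out) is bound by egress and needs 3.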
  • 19. Spark Streaming
        - 3 EC2 c5.2xlarge nodes: 8 vCPU, 16 GB RAM
        - 2 streams, 4 shards per stream
        - Deployed in standalone mode
        - amazon-kinesis-data-generator for test data flow
        Good practice:
        - Total processing time is less than the batch interval
        - Well-balanced load: the number of receivers (DStreams) is a multiple of the number of executors
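The two good-practice rules on this slide can be written down as simple checks, for instance to run against numbers read off the Spark UI. This is an illustrative sketch, not part of the original system; the function names are hypothetical.

```scala
import scala.concurrent.duration._

// Sketch only: encodes the two rules of thumb from the slide as predicates.
object StreamingHealthSketch {
  // A job keeps up only if each batch finishes before the next one is due;
  // otherwise batches queue up and the scheduling delay grows.
  def keepsUp(processingTime: FiniteDuration, batchInterval: FiniteDuration): Boolean =
    processingTime < batchInterval

  // Receivers spread evenly when their count is a multiple of the executor count.
  def evenlyBalanced(receivers: Int, executors: Int): Boolean =
    executors > 0 && receivers % executors == 0
}
```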
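  • 20. Spark Streaming - KinesisInputDStream
        import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
        import org.apache.spark.streaming.kinesis.KinesisInputDStream

        // Build a DStream over the Kinesis stream; each record is parsed by DataStreamParser.parse
        val kinesisDataStream = KinesisInputDStream.builder
          .streamingContext(ssc)
          .streamName(kinesisStreamName)
          .checkpointAppName(kinesisAppName)
          .initialPositionInStream(InitialPositionInStream.LATEST)
          .checkpointInterval(kinesisCheckpointInterval)
          .buildWithMessageHandler(DataStreamParser.parse)

        // Hand every micro-batch to the processor
        kinesisDataStream.foreachRDD(rdd => DataStreamHandler.streamProcessor(rdd))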
  • 21. Spark Streaming - StreamProcessor
        // Lazy: the pool and the GeoIP reader are created on first use and then reused
        lazy val hds = new HikariDataSource(hikariConfig)
        lazy val geoIpReader = new DatabaseReader.Builder(geoIpFile).withCache(new CHMCache()).build()

        def streamProcessor(rdd: RDD[DropDataRecord]): Unit = {
          rdd.foreachPartition(partition => {
            // One pooled connection and one prepared statement per partition
            val dropdataConnection = hds.getConnection()
            try {
              val statement = dropdataConnection.prepareStatement(insertStatement)
              partition.foreach(sinkHoleDropDataRecord => {
                ...
                val countryResponse = geoIpReader.country(inetAddress)
                ...
                statement.addBatch()
              })
              statement.executeBatch()
            } finally {
              // Return the connection to the pool even if the batch fails
              dropdataConnection.close()
            }
          })
        }
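The slide's processor sends one JDBC batch per partition. A possible variation, shown here as a dependency-free sketch and not as the author's method, is to flush in fixed-size groups so a very large partition does not accumulate in a single batch. `FakeSink` is a stand-in for the JDBC statement, and the batch size is an assumed tuning parameter.

```scala
// Sketch only: a variation on per-partition batching with a stand-in sink.
object PartitionBatchingSketch {
  // Stand-in for a JDBC statement; counts flushes and rows written.
  final class FakeSink {
    var flushes = 0
    var written = 0
    def writeBatch(batch: Seq[String]): Unit = {
      flushes += 1
      written += batch.size
    }
  }

  // Walk the partition iterator once, flushing every `batchSize` records.
  // The last group may be smaller than `batchSize`.
  def writePartition(records: Iterator[String], sink: FakeSink, batchSize: Int): Unit = {
    require(batchSize > 0, "batchSize must be positive")
    records.grouped(batchSize).foreach(group => sink.writeBatch(group))
  }
}
```

In a real job the body of `writeBatch` would call `addBatch()` per record followed by one `executeBatch()`, inside the same `foreachPartition` block.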
  • 22. Spark Streaming + Kinesis
  • 23. Structured Streaming
        - Will it replace batch streaming in the future?!
        - Spark 2.3.0 introduces Continuous Processing
        - Kafka and File sources are available for production usage
        - SPARK-18165 – Kinesis support (currently available only to Databricks customers)
  • 25. REST API – value for the customer
        - Access to data and “value” is provided through REST endpoints
        - A token (JWT) is generated for each client
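Per-client token generation could be sketched as below using only the JDK. The talk does not say which signing algorithm or library was used, so HS256 with a shared secret is an assumption here, and the claims are passed in as a ready-made JSON string to avoid a JSON dependency. A production service would normally use a maintained JWT library and also validate claims such as expiry.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest
import java.util.Base64
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec
import scala.util.Try

// Sketch only: HS256 is an assumed algorithm; no claim validation is done.
object JwtSketch {
  private val encoder = Base64.getUrlEncoder.withoutPadding()

  private def b64(bytes: Array[Byte]): String = encoder.encodeToString(bytes)

  private def hmacSha256(secret: Array[Byte], data: String): Array[Byte] = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret, "HmacSHA256"))
    mac.doFinal(data.getBytes(UTF_8))
  }

  // token = base64url(header) . base64url(claims) . base64url(signature)
  def sign(claimsJson: String, secret: Array[Byte]): String = {
    val header = """{"alg":"HS256","typ":"JWT"}"""
    val signingInput = b64(header.getBytes(UTF_8)) + "." + b64(claimsJson.getBytes(UTF_8))
    signingInput + "." + b64(hmacSha256(secret, signingInput))
  }

  // Recompute the signature and compare in constant time.
  def verify(token: String, secret: Array[Byte]): Boolean =
    token.split('.') match {
      case Array(h, p, s) =>
        Try(Base64.getUrlDecoder.decode(s))
          .map(sig => MessageDigest.isEqual(hmacSha256(secret, h + "." + p), sig))
          .getOrElse(false)
      case _ => false
    }
}
```

A client would receive the result of `JwtSketch.sign` once and send it with each request, and the API would call `verify` before serving data.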
  • 26. Next steps
        - Stream visualization
        - Add caching (ElastiCache)
        - Elasticsearch integration
        - AI, ML