Cold Storage that isn’t glacial
Joshua Hollander, Principal Software Engineer
ProtectWise
1 / 41
Who are we and what do we do?
We record full fidelity Network Data
Every network conversation (NetFlow)
Full fidelity packet data (PCAP)
Searchable detailed protocol data for:
HTTP
DNS
DHCP
Files
Security Events (IDS)
and more...
We analyze all that data and detect threats others can't see
2 / 41
Network data piles up fast!
Over 300 C* servers in production
Over 1200 servers in EC2
Over 150TB in C*
About 90TB of SOLR indexes
100TB of cold storage data
2 PB of PCAP
And we are growing rapidly!
3 / 41
So what?
All those servers cost a lot of money
4 / 41
Right sizing our AWS bill
This is all time series data:
Lots of writes/reads to recent data
Some reads and very few writes to older data
So just move all that older, cold data, to cheaper storage.
How hard could that possibly be?
5 / 41
Problem 1:
Data distributed evenly across all these expensive servers
Using Size Tiered Compaction
Can't just migrate old SSTables to new cheap servers
Solution: use Date Tiered Compaction?
We update old data regularly
What SSTables can you safely migrate?
Result:
6 / 41
Problem 2: DSE/SOLR re-index takes forever!
We have nodes with up to 300GB of SOLR indexes
Bootstrapping a new node requires re-indexing after streaming
Re-index can take a week or more!!!
At that pace we simply cannot bootstrap new nodes
Solution: Time sharded clusters in application code
Fanout searches for large time windows
Assemble results in code (using same algorithm SOLR does)
Result:
¯\_(ツ)_/¯
7 / 41
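The fanout-and-merge idea above can be sketched like this. The shard registry, hit shape, and names are illustrative only, not our actual API; the point is selecting the time shards that overlap the query window and merging by score descending, the same ordering SOLR's distributed search applies.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float
    ts: int  # event timestamp (epoch seconds)

# Hypothetical shard registry: each time shard covers [start, end) epoch seconds.
SHARDS = {
    ("2016-01", 0, 100): [Hit("a", 2.0, 10), Hit("b", 1.0, 50)],
    ("2016-02", 100, 200): [Hit("c", 3.0, 150)],
    ("2016-03", 200, 300): [Hit("d", 0.5, 250)],
}

def fanout_search(window_start, window_end, limit=10):
    """Query every shard overlapping the window, then merge by score."""
    hits = []
    for (_, start, end), docs in SHARDS.items():
        if start < window_end and end > window_start:  # shard overlaps window
            hits.extend(h for h in docs if window_start <= h.ts < window_end)
    return sorted(hits, key=lambda h: -h.score)[:limit]
```

A query for [0, 200) fans out to the first two shards only, and the merged ordering is by relevance, not by shard.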
What did we gain?
We can now migrate older timeshards to cheaper servers!
However:
Cold data servers are still too expensive
DevOps time suck is massive
Product wants us to store even more data!
8 / 41
Throwing ideas at the wall
We have this giant data warehouse, let's use that
Response time is too slow: 10 seconds to pull single record by ID!
Complex ETL pipeline where latency measured in hours
Data model is different
Read only
Reliability, etc, etc
What about Elasticsearch, HBase, Hive/Parquet, MongoDB, etc, etc?
Wait. Hold on! This Parquet thing is interesting...
9 / 41
Parquet
Columnar
Projections are very efficient
Enables vectorized execution
Compressed
Using Run Length Encoding
Throw Snappy, LZO, or LZ4 on top of that
Schema
Files encode schema and other meta-data
Support exists for merging disparate schemas across files
10 / 41
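A pure-Python sketch of the run-length-encoding idea — Parquet's on-disk RLE/bit-packed hybrid format is more involved, but this is the intuition for why repetitive columns compress so well:

```python
def rle_encode(values):
    """Run-length encode a column: [(value, run_length), ...]."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([v, 1])  # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand the runs back into the original column."""
    return [v for v, n in runs for _ in range(n)]
```

A column of protocols like `["tcp", "tcp", "tcp", "tcp", "udp", "udp"]` collapses to two (value, count) pairs before the general-purpose codec (Snappy, LZO, LZ4) even sees it.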
Parquet: Some details
Row Group
Horizontal grouping of columns
Within each row group data is arranged by column in chunks
Column Chunk
Chunk of data for an individual column
Unit of parallelization for I/O
Page
Column chunks are divided up into pages for compression and encoding
11 / 41
So you have a nice file format...
Now what?
12 / 41
Need to get data out of Cassandra
Spark seems good for that
Need to put the data somewhere
S3 is really cheap and fairly well supported by Spark
Need to be able to query the data
Spark seems good for that too
13 / 41
So we are using Spark and S3...
Now what?
14 / 41
Lots and lots of files
Sizing parquet
Parquet docs recommend 1GB files for HDFS
For S3 the sweet spot appears to be 128 to 256MB
We have terabytes of files
Scans take forever!
15 / 41
Partitioning
Our queries are always filtered by:
1. Customer
2. Time
So we Partition by:
├── cid=X
| ├── year=2015
| └── year=2016
| └── month=0
| └── day=0
| └── hour=0
└── cid=Y
Spark understands and translates query filters to this folder structure
16 / 41
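Deriving that partition prefix for an event might look like the sketch below. The function name is hypothetical, and whether month/day/hour are zero-based (as in the tree above) or calendar values is just a layout choice; calendar values are used here.

```python
from datetime import datetime, timezone

def partition_path(cid, ts):
    """Build the cid/year/month/day/hour prefix for an event timestamp.

    Spark's partition discovery maps WHERE filters on cid/year/month/...
    directly onto this directory structure and prunes everything else.
    """
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"cid={cid}/year={t.year}/month={t.month}/day={t.day}/hour={t.hour}"
```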
Big Improvement!
Now a customer can query a time range quickly
17 / 41
Partitioning problems
Customers generally ask questions such as:
"Over the last 6 months, how many times did I see IP X using protocol Y?"
"When did IP X not use port 80 for HTTP?"
"Who keeps scanning server Z for open SSH ports?"
Queries would take minutes.
18 / 41
Queries spanning large time windows
select count(*) from events where ip = '192.168.0.1' and cid = 1 and year = 2016
├── cid=X
| ├── year=2015
| └── year=2016
| |── month=0
| | └── day=0
| | └── hour=0
| | └── 192.168.0.1_was_NOT_here.parquet
| └── month=1
| └── day=0
| └── hour=0
| └── 192.168.0.1_WAS_HERE.parquet
└── cid=Y
19 / 41
Problem #1
Requires listing out all the sub-dirs for large time ranges.
Remember S3 is not really a file system
Slow!
Problem #2
Pulling potentially thousands of files from S3.
Slow and Costly!
20 / 41
Solving problem #1
Put partition info and file listings in a db, à la Hive.
Why not just use Hive?
Still not fast enough
Also does not help with Problem #2
21 / 41
DSE/SOLR to the Rescue!
Store file meta data in SOLR
Efficiently skip elements of partition hierarchy!
select count(*) from events where month = 6
Avoids pulling all meta in Spark driver
1. Get partition counts and schema info from SOLR driver-side
2. Submit SOLR RDD to cluster
3. Run mapPartitions on SOLR RDD and turn into Parquet RDDs
As an optimization for small file sets we pull the SOLR rows driver side
22 / 41
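The driver-side pruning can be sketched with an in-memory stand-in for the SOLR file-metadata core — one row per Parquet file carrying its partition values and path. The rows and field names below are hypothetical; the point is that any level of the hierarchy the query doesn't constrain (year, cid, ...) is simply skipped.

```python
# Hypothetical stand-in for the SOLR file-metadata core.
FILES = [
    {"cid": 1, "year": 2015, "month": 6, "path": "cid=1/year=2015/month=6/f1.parquet"},
    {"cid": 1, "year": 2016, "month": 6, "path": "cid=1/year=2016/month=6/f2.parquet"},
    {"cid": 1, "year": 2016, "month": 7, "path": "cid=1/year=2016/month=7/f3.parquet"},
]

def files_for(**filters):
    """Return only the file paths matching the given partition filters,
    without listing any S3 directories."""
    return [f["path"] for f in FILES
            if all(f[k] == v for k, v in filters.items())]
```

A `month = 6` filter jumps straight to the matching files across every cid and year, which a directory walk cannot do.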
Boxitecture
23 / 41
Performance gains!
Source                Scan/Filter Time
SOLR                  < 100 milliseconds
Hive                  > 5 seconds
S3 directory listing  > 5 minutes!!!
24 / 41
Problem #1 Solved!
What about Problem #2?
25 / 41
Solving problem #2
Still need to pull potentially thousands of files to answer our query!
Can we partition differently?
Field         Cardinality                            Result
Protocol      Medium (9000)                          ❌
Port          High (65535)                           ❌❌
IP Addresses  Astronomically High (3.4 undecillion)  ❌❌❌
Nope! Nope! Nope!
26 / 41
Searching High Cardinality Data
Assumptions
1. Want to reduce # of files pulled for a given query
2. Cannot store all exact values in SOLR
3. We are okay with a few false positives
This sounds like a job for...
27 / 41
Bloom Filters!
28 / 41
Towards a "Searchable" Bloom Filter
Normal SOLR index looks vaguely like
Term         Doc IDs
192.168.0.1  1,2,3,5,8,13...
10.0.0.1     2,4,6,8...
8.8.8.8      1,2,3,4,5,6
Terms are going to grow out of control
If only we could constrain the terms to a reasonable number of values...
29 / 41
Terms as a "bloom filter"
What if our terms were the offsets of the Bloom Filter values?
Term  Doc IDs
0     1,2,3,5,8,13...
1     2,4,6,8...
2     1,2,3,4,5,6
3     1,2,3
...   ...
N     1,2,3,4,5...
30 / 41
Searchable Bloom Filters
Indexing
Field  Value        Indexed Values  Doc ID
ip     192.168.0.1  {0, 3, 5}       0
ip     10.0.0.1     {1, 2, 4}       1
ip     8.8.8.8      {0, 1, 4}       2
Index
Term  Doc IDs
0     0,2
1     1,2
2     1
3     0
4     1,2
5     0
Queries
Field  Query String    Actual Query
ip     ip:192.168.0.1  ip_bits:0 AND 3 AND 5
ip     ip:10.0.0.1     ip_bits:1 AND 2 AND 4
31 / 41
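A minimal sketch of the scheme. The hash choice, filter width M, and hash count K below are illustrative sizes, not what DSE/SOLR uses: each value hashes to K bit offsets, the offsets (not the raw values) become the indexed terms, and a query intersects the posting lists for its value's offsets — false positives possible, false negatives not.

```python
import hashlib

M, K = 64, 3  # filter width and number of hash functions (illustrative)

def bloom_bits(value):
    """The K bit offsets a value sets in an M-bit Bloom filter."""
    return sorted({int(hashlib.sha256(f"{value}:{i}".encode()).hexdigest(), 16) % M
                   for i in range(K)})

def index_doc(index, doc_id, value):
    """Index the value's bit offsets as terms pointing at doc_id."""
    for bit in bloom_bits(value):
        index.setdefault(bit, set()).add(doc_id)

def query(index, value):
    """Docs that *may* contain the value: those holding ALL of its bits."""
    postings = [index.get(bit, set()) for bit in bloom_bits(value)]
    return set.intersection(*postings) if postings else set()
```

The term vocabulary is capped at M entries no matter how many distinct IPs flow through, which is exactly the property the slide asks for.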
Problem #2 Solved!
Enormous filtering power
Relatively minimal cost in space and computation
32 / 41
Key Lookups
Need to retain this C* functionality
Spark/Parquet has no direct support
What partition would it choose?
The partition would have to be encoded in the key?!
Solution:
Our keys have time encoded in them
Enables us to generate the partition path containing the row
That was easy!
33 / 41
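A sketch of the idea with a hypothetical key format (customer id, timestamp, and a sequence number packed together — our real key layout differs): because the event time rides inside the key, a point lookup can compute the one partition directory that holds the row.

```python
import base64
import struct
from datetime import datetime, timezone

def make_key(cid, ts, seq):
    """Hypothetical row key with the event time encoded in it."""
    return base64.b16encode(struct.pack(">IQI", cid, ts, seq)).decode()

def partition_for_key(key):
    """Decode the time back out of the key and derive the partition path,
    so a single-row lookup opens exactly one partition's files."""
    cid, ts, _ = struct.unpack(">IQI", base64.b16decode(key))
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"cid={cid}/year={t.year}/month={t.month}/day={t.day}/hour={t.hour}"
```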
Other reasons to "customize"
Parquet has support for filter pushdown
Spark has support for Parquet filter pushdown, but...
Uses INT96 for Timestamp
No pushdown support: SPARK-11784
All our queries involve timestamps!
IP Addresses
Spark, Impala, Presto have no direct support
Use string or binary
Wanted to be able to push down CIDR range comparisons
Lack of pushdown for these leads to wasted I/O and GC pressure.
34 / 41
Archiving
Currently, when Time shard fills up:
1. Roll new hot time shard
2. Run Spark job to Archive data to S3
3. Swap out "warm" shard for cold storage (automagical)
4. Drop the "warm" shard
Not an ideal process, but deals with legacy requirements
TODO:
1. Stream data straight to cold storage
2. Materialize customer edits into hot storage
3. Merge hot and cold data at query time (already done)
35 / 41
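The query-time hot/cold merge in step 3 can be sketched as a keyed overlay: immutable cold rows form the base, and for any duplicated key the hot-storage copy (recent writes, customer edits) wins. The row shape and function name are hypothetical.

```python
def merge_hot_cold(hot_rows, cold_rows):
    """Overlay hot rows on cold rows by id; hot wins on conflicts.
    Results come back in timestamp order, as a time-series query expects."""
    merged = {r["id"]: r for r in cold_rows}
    merged.update({r["id"]: r for r in hot_rows})  # hot copy shadows cold
    return sorted(merged.values(), key=lambda r: r["ts"])
```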
What have we done so far?
1. Time sharded C* clusters with SOLR
2. Cheap speedy Cold storage based on S3 and Spark
3. A mechanism for archiving data to S3
36 / 41
That's cool, but...
How do we handle queries to 3 different stores?
C*
SOLR
Spark
Handle Timesharding and Functional Sharding?
37 / 41
Lots of Scala query DSL libraries:
Quill
Slick
Phantom
etc
AFAIK nobody supports:
1. Simultaneously querying heterogeneous data stores
2. Stitching together time series data from multiple stores
3. Managing sharding:
Configuration
Discovery
38 / 41
Enter Quaero:
Abstracts away data store differences
Query AST (Algebraic Data Type in Scala)
Command/Strategy pattern for easily plugging in new data stores
Deep understanding of sharding patterns
Handles merging of time/functional sharded data
Adding shards is configuration driven
Exposes a "typesafe" query DSL similar to Phantom or Rogue
Reduce/eliminate bespoke code for retrieving the same data from all 3 stores
39 / 41
Open Sourcing? Maybe!?
There is still a lot of work to be done
40 / 41
We are hiring!
Interwebs: Careers @ Protectwise
Twitter: @ProtectWise
41 / 41

The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Lightbend Fast Data Platform
Lightbend Fast Data PlatformLightbend Fast Data Platform
Lightbend Fast Data Platform
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 

More from DataStax

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?DataStax
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...DataStax
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsDataStax
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphDataStax
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyDataStax
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...DataStax
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache KafkaDataStax
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseDataStax
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0DataStax
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...DataStax
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesDataStax
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDataStax
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudDataStax
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceDataStax
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...DataStax
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...DataStax
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...DataStax
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)DataStax
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsDataStax
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingDataStax
 

More from DataStax (20)

Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?Is Your Enterprise Ready to Shine This Holiday Season?
Is Your Enterprise Ready to Shine This Holiday Season?
 
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
Designing Fault-Tolerant Applications with DataStax Enterprise and Apache Cas...
 
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid EnvironmentsRunning DataStax Enterprise in VMware Cloud and Hybrid Environments
Running DataStax Enterprise in VMware Cloud and Hybrid Environments
 
Best Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise GraphBest Practices for Getting to Production with DataStax Enterprise Graph
Best Practices for Getting to Production with DataStax Enterprise Graph
 
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step JourneyWebinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
Webinar | Data Management for Hybrid and Multi-Cloud: A Four-Step Journey
 
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...Webinar  |  How to Understand Apache Cassandra™ Performance Through Read/Writ...
Webinar | How to Understand Apache Cassandra™ Performance Through Read/Writ...
 
Webinar | Better Together: Apache Cassandra and Apache Kafka
Webinar  |  Better Together: Apache Cassandra and Apache KafkaWebinar  |  Better Together: Apache Cassandra and Apache Kafka
Webinar | Better Together: Apache Cassandra and Apache Kafka
 
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax EnterpriseTop 10 Best Practices for Apache Cassandra and DataStax Enterprise
Top 10 Best Practices for Apache Cassandra and DataStax Enterprise
 
Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0Introduction to Apache Cassandra™ + What’s New in 4.0
Introduction to Apache Cassandra™ + What’s New in 4.0
 
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
Webinar: How Active Everywhere Database Architecture Accelerates Hybrid Cloud...
 
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud RealitiesWebinar  |  Aligning GDPR Requirements with Today's Hybrid Cloud Realities
Webinar | Aligning GDPR Requirements with Today's Hybrid Cloud Realities
 
Designing a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for DummiesDesigning a Distributed Cloud Database for Dummies
Designing a Distributed Cloud Database for Dummies
 
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid CloudHow to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
How to Power Innovation with Geo-Distributed Data Management in Hybrid Cloud
 
How to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerceHow to Evaluate Cloud Databases for eCommerce
How to Evaluate Cloud Databases for eCommerce
 
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
Webinar: DataStax Enterprise 6: 10 Ways to Multiply the Power of Apache Cassa...
 
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
Webinar: DataStax and Microsoft Azure: Empowering the Right-Now Enterprise wi...
 
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
Webinar - Real-Time Customer Experience for the Right-Now Enterprise featurin...
 
Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)Datastax - The Architect's guide to customer experience (CX)
Datastax - The Architect's guide to customer experience (CX)
 
An Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking ApplicationsAn Operational Data Layer is Critical for Transformative Banking Applications
An Operational Data Layer is Critical for Transformative Banking Applications
 
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design ThinkingBecoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
Becoming a Customer-Centric Enterprise Via Real-Time Data and Design Thinking
 

Recently uploaded

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 

Recently uploaded (20)

why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 

Cold Storage That Isn't Glacial (Joshua Hollander, Protectwise) | Cassandra Summit 2016

  • 1. Cold Storage that isn’t glacial Joshua Hollander, Principal Software Engineer ProtectWise 1 / 41
  • 2. Who are we and what do we do? 2 / 41
  • 3. Who are we and what do we do? We record full fidelity Network Data 2 / 41
  • 4. Who are we and what do we do? We record full fidelity Network Data Every network conversation (NetFlow) 2 / 41
  • 5. Who are we and what do we do? We record full fidelity Network Data Every network conversation (NetFlow) Full fidelity packet data (PCAP) 2 / 41
  • 6. Who are we and what do we do? We record full fidelity Network Data Every network conversation (NetFlow) Full fidelity packet data (PCAP) Searchable detailed protocol data for: HTTP DNS DHCP Files Security Events (IDS) and more... 2 / 41
  • 7. Who are we and what do we do? We record full fidelity Network Data Every network conversation (NetFlow) Full fidelity packet data (PCAP) Searchable detailed protocol data for: HTTP DNS DHCP Files Security Events (IDS) and more... We analyze all that data and detect threats others can't see 2 / 41
  • 8. Network data piles up fast! 3 / 41
  • 9. Network data piles up fast! Over 300 C* servers in production Over 1200 servers in EC2 Over 150TB in C* About 90TB of SOLR indexes 100TB of cold storage data 2 PB of PCAP 3 / 41
  • 10. Network data piles up fast! Over 300 C* servers in production Over 1200 servers in EC2 Over 150TB in C* About 90TB of SOLR indexes 100TB of cold storage data 2 PB of PCAP And we are growing rapidly! 3 / 41
  • 12. So what? All those servers cost a lot of money 4 / 41
  • 13. Right sizing our AWS bill 5 / 41
  • 14. Right sizing our AWS bill This is all time series data: Lots of writes/reads to recent data Some reads and very few writes to older data 5 / 41
  • 15. Right sizing our AWS bill This is all time series data: Lots of writes/reads to recent data Some reads and very few writes to older data So just move all that older, cold data, to cheaper storage. How hard could that possibly be? 5 / 41
  • 16. Problem 1: Data distributed evenly across all these expensive servers Using Size Tiered Compaction Can't just migrate old SSTables to new cheap servers 6 / 41
  • 17. Problem 1: Data distributed evenly across all these expensive servers Using Size Tiered Compaction Can't just migrate old SSTables to new cheap servers Solution: use Date Tiered Compaction? We update old data regularly What SSTables can you safely migrate? Result: 6 / 41
  • 18. Problem 1: Data distributed evenly across all these expensive servers Using Size Tiered Compaction Can't just migrate old SSTables to new cheap servers Solution: use Date Tiered Compaction? We update old data regularly What SSTables can you safely migrate? Result: 6 / 41
  • 19. Problem 2: DSE/SOLR re-index takes forever! We have nodes with up to 300GB of SOLR indexes Bootstrapping a new node requires re-index after streaming Re-index can take a week or more!!! At that pace we simply cannot bootstrap new nodes 7 / 41
  • 20. Problem 2: DSE/SOLR re-index takes forever! We have nodes with up to 300GB of SOLR indexes Bootstrapping a new node requires re-index after streaming Re-index can take a week or more!!! At that pace we simply cannot bootstrap new nodes Solution: Time sharded clusters in application code Fanout searches for large time windows Assemble results in code (using same algorithm SOLR does) 7 / 41
  • 21. Problem 2: DSE/SOLR re-index takes forever! We have nodes with up to 300GB of SOLR indexes Bootstrapping a new node requires re-index after streaming Re-index can take a week or more!!! At that pace we simply cannot bootstrap new nodes Solution: Time sharded clusters in application code Fanout searches for large time windows Assemble results in code (using same algorithm SOLR does) Result: ¯\_(ツ)_/¯ 7 / 41
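The fanout approach on that slide can be sketched in a few lines. This is a hedged illustration, not ProtectWise's actual code: `search_shard` stands in for a real per-cluster SOLR query, and the merge step mirrors the idea of reassembling per-shard ranked results in application code, the way a distributed SOLR coordinator would.

```python
import heapq

def search_shard(shard, query):
    # Stand-in for a real per-time-shard SOLR query; returns
    # (score, doc_id) tuples sorted by descending score.
    return sorted(shard.get(query, []), reverse=True)

def fanout_search(shards, query, limit=10):
    # Query every time shard, then merge the already-sorted
    # per-shard streams and keep the global top-k.
    per_shard = [search_shard(s, query) for s in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc for _, doc in list(merged)[:limit]]

shard_2015 = {"ip:10.0.0.1": [(0.9, "a"), (0.4, "c")]}
shard_2016 = {"ip:10.0.0.1": [(0.7, "b")]}
print(fanout_search([shard_2015, shard_2016], "ip:10.0.0.1", limit=2))  # ['a', 'b']
```

The key property is that each shard can be bootstrapped, re-indexed, or migrated independently, at the cost of the application owning the result-merging logic.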
  • 22. What did we gain? 8 / 41
  • 23. What did we gain? We can now migrate older timeshards to cheaper servers! 8 / 41
  • 24. What did we gain? We can now migrate older timeshards to cheaper servers! However: Cold data servers are still too expensive DevOps time suck is massive Product wants us to store even more data! 8 / 41
  • 25. Throwing ideas at the wall 9 / 41
  • 26. Throwing ideas at the wall We have this giant data warehouse, let's use that Response time is too slow: 10 seconds to pull a single record by ID! Complex ETL pipeline where latency is measured in hours Data model is different Read only Reliability, etc, etc 9 / 41
  • 27. Throwing ideas at the wall We have this giant data warehouse, let's use that Response time is too slow: 10 seconds to pull a single record by ID! Complex ETL pipeline where latency is measured in hours Data model is different Read only Reliability, etc, etc What about Elasticsearch, HBase, Hive/Parquet, MongoDB, etc, etc? 9 / 41
  • 28. Throwing ideas at the wall We have this giant data warehouse, let's use that Response time is too slow: 10 seconds to pull a single record by ID! Complex ETL pipeline where latency is measured in hours Data model is different Read only Reliability, etc, etc What about Elasticsearch, HBase, Hive/Parquet, MongoDB, etc, etc? Wait. Hold on! This Parquet thing is interesting... 9 / 41
  • 30. Parquet Columnar Projections are very efficient Enables vectorized execution 10 / 41
  • 31. Parquet Columnar Projections are very efficient Enables vectorized execution Compressed Using Run Length Encoding Throw Snappy, LZO, or LZ4 on top of that 10 / 41
  • 32. Parquet Columnar Projections are very efficient Enables vectorized execution Compressed Using Run Length Encoding Throw Snappy, LZO, or LZ4 on top of that Schema Files encode schema and other meta-data Support exists for merging disparate schema amongst files 10 / 41
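Why run-length encoding pays off so well for columnar data can be shown with a toy example (this is an illustration of the principle, not Parquet's actual on-disk encoding): a low-cardinality or sorted column collapses into a handful of (value, run length) pairs.

```python
from itertools import groupby

def rle_encode(column):
    # Collapse consecutive repeats into (value, run_length) pairs.
    return [(v, len(list(g))) for v, g in groupby(column)]

def rle_decode(runs):
    # Expand (value, run_length) pairs back into the original column.
    return [v for v, n in runs for _ in range(n)]

protocols = ["http"] * 5 + ["dns"] * 3 + ["http"] * 2
runs = rle_encode(protocols)
print(runs)  # [('http', 5), ('dns', 3), ('http', 2)]
assert rle_decode(runs) == protocols
```

A protocol or port column in network-flow data is exactly this shape, which is why stacking a general-purpose codec like Snappy or LZ4 on top of RLE compresses so effectively.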
  • 33. Row Group Horizontal grouping of columns Within each row group data is arranged by column in chunks Column Chunk Chunk of data for an individual column Unit of parallelization for I/O Page Column chunks are divided up into pages for compression and encoding Parquet: Some details 11 / 41
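The row-group/column-chunk layout above is what makes projections cheap. Here is a hypothetical in-memory model (names are illustrative, not Parquet APIs): a file is a list of row groups, each holding one chunk per column, and a projection touches only the chunks for the requested columns.

```python
def write_file(rows, columns, row_group_size=2):
    # Split rows into row groups; within each group, store one
    # contiguous chunk of values per column.
    file = []
    for i in range(0, len(rows), row_group_size):
        group = rows[i:i + row_group_size]
        file.append({col: [r[col] for r in group] for col in columns})
    return file

def project(file, wanted):
    # Read only the column chunks for `wanted`; other chunks untouched.
    chunks_read = 0
    out = {col: [] for col in wanted}
    for row_group in file:
        for col in wanted:
            out[col].extend(row_group[col])
            chunks_read += 1
    return out, chunks_read

rows = [{"ip": "10.0.0.1", "port": 80, "proto": "http"},
        {"ip": "10.0.0.2", "port": 53, "proto": "dns"},
        {"ip": "10.0.0.1", "port": 22, "proto": "ssh"}]
f = write_file(rows, ["ip", "port", "proto"])
data, chunks = project(f, ["ip"])
print(chunks)  # 2 row groups x 1 column = 2 chunks read, not 6
```

In real Parquet the same idea plays out as byte ranges: a reader seeks directly to the column chunks it needs, and pages within each chunk are the units of encoding and compression.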
  • 34. So you have a nice file format... Now what? 12 / 41
  • 35. Need to get data out of Cassandra 13 / 41
  • 36. Need to get data out of Cassandra Spark seems good for that 13 / 41
  • 37. Need to get data out of Cassandra Spark seems good for that Need to put the data somewhere 13 / 41
  • 38. Need to get data out of Cassandra Spark seems good for that Need to put the data somewhere S3 is really cheap and fairly well supported by Spark 13 / 41
  • 39. Need to get data out of Cassandra Spark seems good for that Need to put the data somewhere S3 is really cheap and fairly well supported by Spark Need to be able to query the data 13 / 41
  • 40. Need to get data out of Cassandra Spark seems good for that Need to put the data somewhere S3 is really cheap and fairly well supported by Spark Need to be able to query the data Spark seems good for that too 13 / 41
  • 41. So we are using Spark and S3... Now what? 14 / 41
  • 42. Lots and lots of files Sizing parquet 15 / 41
  • 43. Lots and lots of files Sizing parquet Parquet docs recommend 1GB files for HDFS 15 / 41
  • 44. Lots and lots of files Sizing parquet Parquet docs recommend 1GB files for HDFS For S3 the sweet spot appears to be 128 to 256MB 15 / 41
  • 45. Lots and lots of files Sizing parquet Parquet docs recommend 1GB files for HDFS For S3 the sweet spot appears to be 128 to 256MB We have terabytes of files Scans take forever! 15 / 41
  • 46. Lots and lots of files Sizing parquet Parquet docs recommend 1GB files for HDFS For S3 the sweet spot appears to be 128 to 256MB We have terabytes of files Scans take forever! 15 / 41
  • 49. Partitioning
Our queries are always filtered by:
1. Customer
2. Time
So we Partition by:
├── cid=X
|   ├── year=2015
|   └── year=2016
|       └── month=0
|           └── day=0
|               └── hour=0
└── cid=Y
Spark understands and translates query filters to this folder structure
16 / 41
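The layout above amounts to a deterministic path per (customer, hour). A minimal Python path-builder illustrating it, assuming epoch-second timestamps and calendar-valued (not zero-based) month/day/hour components:

```python
from datetime import datetime, timezone

def partition_path(cid, ts):
    """Hive-style partition directory for a record (illustrative layout)."""
    t = datetime.fromtimestamp(ts, tz=timezone.utc)
    return (f"cid={cid}/year={t.year}/month={t.month}/"
            f"day={t.day}/hour={t.hour}")

# 1467331200 is 2016-07-01 00:00:00 UTC
print(partition_path(42, 1467331200))  # cid=42/year=2016/month=7/day=1/hour=0
```

Because every path component is `key=value`, Spark's partition discovery can turn a `WHERE cid = 42 AND year = 2016` filter into directory pruning instead of a data scan.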
  • 50. Big Improvement! Now a customer can query a time range quickly 17 / 41
  • 55. Partitioning problems
Customers generally ask questions such as:
  "Over the last 6 months, how many times did I see IP X using protocol Y?"
  "When did IP X not use port 80 for HTTP?"
  "Who keeps scanning server Z for open SSH ports?"
Queries would take minutes.
18 / 41
  • 57. Queries spanning large time windows
select count(*) from events
where ip = '192.168.0.1' and cid = 1 and year = 2016
├── cid=X
|   ├── year=2015
|   └── year=2016
|       ├── month=0
|       |   └── day=0
|       |       └── hour=0
|       |           └── 192.168.0.1_was_NOT_here.parquet
|       └── month=1
|           └── day=0
|               └── hour=0
|                   └── 192.168.0.1_WAS_HERE.parquet
└── cid=Y
19 / 41
  • 59. Problem #1
Requires listing out all the sub-dirs for large time ranges.
  Remember S3 is not really a file system
  Slow!
Problem #2
Pulling potentially thousands of files from S3.
  Slow and Costly!
20 / 41
  • 62. Solving problem #1
Put partition info and file listings in a db ala Hive.
Why not just use Hive?
  Still not fast enough
  Also does not help with Problem #2
21 / 41
  • 66. DSE/SOLR to the Rescue!
Store file meta data in SOLR
Efficiently skip elements of partition hierarchy!
  select count(*) from events where month = 6
Avoids pulling all meta in Spark driver
1. Get partition counts and schema info from SOLR driver-side
2. Submit SOLR RDD to cluster
3. Run mapPartitions on SOLR RDD and turn into Parquet RDDs
As an optimization for small file sets we pull the SOLR rows driver side
22 / 41
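The key idea of replacing directory listings with an indexed metadata store can be shown with a toy in-memory index standing in for SOLR (the record layout and function name here are illustrative, not the actual schema):

```python
# One metadata record per Parquet file. With partition values stored as
# fields, a query like "month = 6" selects files directly, skipping
# levels of the cid/year/month/day/hour tree entirely.
files = [
    {"cid": 1, "year": 2015, "month": 6, "path": "cid=1/year=2015/month=6/a.parquet"},
    {"cid": 1, "year": 2016, "month": 6, "path": "cid=1/year=2016/month=6/b.parquet"},
    {"cid": 1, "year": 2016, "month": 7, "path": "cid=1/year=2016/month=7/c.parquet"},
]

def files_for(**filters):
    """Return paths of files whose partition fields match all filters."""
    return [f["path"] for f in files
            if all(f[k] == v for k, v in filters.items())]

print(files_for(month=6))  # both June files, across years
```

Against S3 the difference is stark: one indexed lookup versus a recursive listing of every prefix in the time range.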
  • 68. Performance gains!
Source               | Scan/Filter Time
SOLR                 | < 100 milliseconds
Hive                 | > 5 seconds
S3 directory listing | > 5 minutes!!!
24 / 41
  • 70. Problem #1 Solved! What about Problem #2? 25 / 41
  • 73. Solving problem #2
Still need to pull potentially thousands of files to answer our query!
Can we partition differently?
Field        | Cardinality                           | Result
Protocol     | Medium (9000)                         | ❌
Port         | High (65535)                          | ❌❌
IP Addresses | Astronomically High (3.4 undecillion) | ❌❌❌
Nope! Nope! Nope!
26 / 41
  • 76. Searching High Cardinality Data
Assumptions
1. Want to reduce # of files pulled for a given query
2. Cannot store all exact values in SOLR
3. We are okay with a few false positives
This sounds like a job for...
27 / 41
  • 81. Towards a "Searchable" Bloom Filter
Normal SOLR index looks vaguely like:
Term        | Doc IDs
192.168.0.1 | 1,2,3,5,8,13...
10.0.0.1    | 2,4,6,8...
8.8.8.8     | 1,2,3,4,5,6
Terms are going to grow out of control
If only we could constrain to a reasonable number of values?
29 / 41
  • 85. Terms as a "bloom filter"
What if our terms were the offsets of the Bloom Filter values?
Term | Doc IDs
0    | 1,2,3,5,8,13...
1    | 2,4,6,8...
2    | 1,2,3,4,5,6
3    | 1,2,3
...  | ...
N    | 1,2,3,4,5...
30 / 41
  • 86. Searchable Bloom Filters
Index
Term | Doc IDs
0    | 0,2
1    | 1,2
2    | 1
3    | 0
4    | 1,2
5    | 0
Indexing
Field | Value       | Indexed Values | Doc ID
ip    | 192.168.0.1 | {0, 3, 5}      | 0
ip    | 10.0.0.1    | {1, 2, 4}      | 1
ip    | 8.8.8.8     | {0, 1, 4}      | 2
Queries
Field | Query String   | Actual Query
ip    | ip:192.168.0.1 | ip_bits:0 AND 3 AND 5
ip    | ip:10.0.0.1    | ip_bits:1 AND 2 AND 4
31 / 41
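The scheme above fits in a few lines: hash each value to a handful of bit offsets, index those offsets as terms, and answer a lookup by requiring all of the query's offsets. A minimal Python sketch (hash choice, filter size, and names are illustrative, not the production implementation):

```python
import hashlib

NUM_BITS = 64    # real filters are sized from expected cardinality
NUM_HASHES = 3   # and desired false-positive rate

def bloom_bits(value):
    """Bit offsets a value sets; these offsets become the indexed terms."""
    bits = set()
    for seed in range(NUM_HASHES):
        h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        bits.add(int.from_bytes(h[:4], "big") % NUM_BITS)
    return bits

# Index time: store bloom_bits(ip) per file.
index = {"file_a": bloom_bits("192.168.0.1"), "file_b": bloom_bits("10.0.0.1")}

def candidate_files(value):
    """A file may contain the value only if it indexed ALL query offsets.
    False positives are possible; false negatives are not."""
    q = bloom_bits(value)
    return [f for f, bits in index.items() if q <= bits]

print(candidate_files("192.168.0.1"))
```

The AND over offsets is exactly what SOLR's inverted index is built for, so the "membership test" runs server-side and returns only candidate files to pull from S3.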
  • 89. Problem #2 Solved!
Enormous filtering power
Relatively minimal cost in space and computation
32 / 41
  • 94. Key Lookups
Need to retain this C* functionality
Spark/Parquet has no direct support
  What partition would it choose?
  The partition would have to be encoded in the key?!
Solution:
  Our keys have time encoded in them
  Enables us to generate the partition path containing the row
That was easy!
33 / 41
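The key-to-partition trick can be sketched as follows, assuming a hypothetical `cid:epoch_millis:uuid` key layout (the real key format is not shown in the talk):

```python
from datetime import datetime, timezone

def key_to_path(key):
    """Derive the partition directory holding a row straight from its key,
    since the key carries the event time. No scan or lookup needed."""
    cid, millis, _ = key.split(":", 2)
    t = datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)
    return f"cid={cid}/year={t.year}/month={t.month}/day={t.day}/hour={t.hour}"

# 1467334800000 ms is 2016-07-01 01:00:00 UTC
print(key_to_path("42:1467334800000:ab12"))
```

That single computed prefix is why point lookups stay cheap: only the handful of files under one hour's directory ever need to be opened.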
  • 100. Other reasons to "customize"
Parquet has support for filter pushdown
Spark has support for Parquet filter pushdown, but...
  Uses INT96 for Timestamp
  No pushdown support: SPARK-11784
  All our queries involve timestamps!
IP Addresses
  Spark, Impala, Presto have no direct support
  Use string or binary
  Wanted to be able to push down CIDR range comparisons
Lack of pushdown for these leads to wasted I/O and GC pressure.
34 / 41
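The CIDR pushdown idea works because a CIDR block is a contiguous integer range: store IPs as fixed-width integers and `ip IN cidr` becomes an ordinary `lo <= ip <= hi` comparison that row-group min/max statistics can answer. A Python illustration of the mapping (a sketch of the idea, not Spark's pushdown API):

```python
import ipaddress

def cidr_bounds(cidr):
    """Map a CIDR block to its inclusive integer range [lo, hi]."""
    net = ipaddress.ip_network(cidr)
    return int(net.network_address), int(net.broadcast_address)

lo, hi = cidr_bounds("192.168.0.0/24")
ip = int(ipaddress.ip_address("192.168.0.17"))
print(lo <= ip <= hi)  # True: the IP falls inside the block
```

With that representation, a reader can skip any row group whose column statistics show its IP range does not overlap `[lo, hi]`, instead of decoding strings and comparing in memory.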
  • 104. Archiving
Currently, when Time shard fills up:
1. Roll new hot time shard
2. Run Spark job to Archive data to S3
3. Swap out "warm" shard for cold storage (automagical)
4. Drop the "warm" shard
Not an ideal process, but deals with legacy requirements
TODO:
1. Stream data straight to cold storage
2. Materialize customer edits in to hot storage
3. Merge hot and cold data at query time (already done)
35 / 41
  • 108. What have we done so far?
1. Time sharded C* clusters with SOLR
2. Cheap speedy Cold storage based on S3 and Spark
3. A mechanism for archiving data to S3
36 / 41
  • 112. That's cool, but...
How do we handle queries to 3 different stores?
  C*
  SOLR
  Spark
Handle Timesharding and Functional Sharding?
37 / 41
  • 116. Lots of Scala query DSL libraries:
  Quill
  Slick
  Phantom
  etc
AFAIK nobody supports:
1. Simultaneously querying heterogeneous data stores
2. Stitching together time series data from multiple stores
3. Managing sharding:
  Configuration
  Discovery
38 / 41
  • 120. Enter Quaero:
Abstracts away data store differences
  Query AST (Algebraic Data Type in Scala)
  Command/Strategy pattern for easily plugging in new data stores
Deep understanding of sharding patterns
  Handles merging of time/functional sharded data
  Adding shards is configuration driven
Exposes a "typesafe" query DSL similar to Phantom or Rogue
  Reduce/eliminate bespoke code for retrieving the same data from all 3 stores
39 / 41
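The AST idea can be shown generically: one immutable query tree, compiled separately per backend. A toy Python rendering (hypothetical node and function names; Quaero itself is a Scala ADT, and a real version would also target CQL and Spark):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Eq:
    """field == value leaf node."""
    field: str
    value: object

@dataclass(frozen=True)
class And:
    """Conjunction of two sub-queries."""
    left: object
    right: object

def to_solr(node):
    """Compile the backend-neutral tree into a SOLR query string."""
    if isinstance(node, Eq):
        return f"{node.field}:{node.value}"
    if isinstance(node, And):
        return f"({to_solr(node.left)} AND {to_solr(node.right)})"
    raise TypeError(f"unknown node: {node!r}")

q = And(Eq("cid", 1), Eq("month", 6))
print(to_solr(q))  # (cid:1 AND month:6)
```

Because each store gets its own compiler over the same tree, callers write the query once and the library decides, per shard and per time range, which backend runs it.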
  • 121. Open Sourcing? Maybe!? There is still a lot of work to be done 40 / 41
  • 122. We are hiring! Interwebs: Careers @ Protectwise Twitter: @ProtectWise 41 / 41