SlideShare a Scribd company logo
1 of 17
Data: How Does
It Work?
An overview of the Castle Global Analytics
Pipeline (CGAP™)
1. General overview
1. Our current solution
1. Numbers and comparisons
Raw data
aggregated
from multiple
sources
Extract,
transform,
and load
(ETL)
Persist
ingestible
data to
warehouse
Analyze
ingested
data for
insights
Exhibit A: Generic Data Pipeline
Raw data
aggregated
from multiple
sources
Extract,
transform,
and load
(ETL)
Persist
ingestible
data to
warehouse
Analyze
ingested
data for
insights
● Two main types of real-time events: client and server
● Challenges: client call site consistency, batching, QA
● Multiple persistent databases
● Real-time events temporarily stored in Kafka - revisit
● Databases always up to date, copied over at intervals
CGAP™ - How We Obtain Raw Data
Kafka Diagram
Raw data
aggregated
from multiple
sources
Extract,
transform,
and load
(ETL)
Persist
ingestible
data to
warehouse
Analyze
ingested
data for
insights
The ETL Job (Spark)
● Multiple batch jobs, one per kafka topic, run at intervals
● Concurrently read out of multiple kafka partitions
● Save offset state per job
● Parse, flatten, write JSON
● Takes care of compression and uploading to cloud storage
● Extra batch load jobs run for persistent relational DBs
Spark Diagram
Spark Diagram (cont.)
Raw data
aggregated
from multiple
sources
Extract,
transform,
and load
(ETL)
Persist
ingestible
data to
warehouse
Analyze
ingested
data for
insights
Exhibit A: Generic Data Pipeline
Data Warehousing (AWS Redshift)
● Load in data from AWS S3 to Redshift
● Two clusters, Kiwi/Plaza, 10 nodes total
● Scaling/upgrading requires some write downtime
● Support for full SQL joins, based on PostgreSQL 8.0
● Specialized proprietary hardware/software layers
● Scales up to petabytes
Raw data
aggregated
from multiple
sources
Extract,
transform,
and load
(ETL)
Persist
ingestible
data to
warehouse
Analyze
ingested
data for
insights
Exhibit A: Generic Data Pipeline
Business Intelligence
● Primarily using Mode as the UI to Redshift, so so cheap
● Allows for some visualizations from queries
Our Numbers
● Kiwi: half a billion events ingested daily
○ 200MM from clients, 280MM from server (170MM pushes)
○ Uncompressed size of JSON: 500GB/day
○ Compressed size, Spark output: 8GB/day
○ Compressed and indexed size, Redshift: 15GB/day
● Plaza: 14MM events per day, mostly server
○ Plaza loads: 4 hours in 53s vs. Kiwi 4 hours in 8 minutes
Our Numbers (cont.)
● Currently storing 1.2TB of compressed Redshift data
○ Roughly 2 months of full Kiwi data (14 billion rows)
○ Past 2 months, archive and coalesce to save space
● At Plaza’s rate, can store over two years on just 256GB
● Costs:
○ S3 storage long-term: $5000/year
○ Redshift, current status: $16,000/year
○ Machines for batch load: 1 MBP
○ Mode: $40/month
Thanks to...
● The Apache Foundation for making open source systems
● AWS for providing us their economies of scale
● You, for listening

More Related Content

What's hot

Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudShubham Tagra
 
Building tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsBuilding tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsRegunath B
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...ScyllaDB
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
Improve Presto Architectural Decisions with Shadow Cache
 Improve Presto Architectural Decisions with Shadow Cache Improve Presto Architectural Decisions with Shadow Cache
Improve Presto Architectural Decisions with Shadow CacheAlluxio, Inc.
 
Toronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKToronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKAndrew Trossman
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevAltinity Ltd
 
HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...
HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...
HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...HBaseCon
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkkbajda
 
InfluxDB Internals
InfluxDB InternalsInfluxDB Internals
InfluxDB InternalsInfluxData
 
Can the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtCan the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtAung Thu Rha Hein
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScyllaDB
 
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingClickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingVianney FOUCAULT
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
 
Monitoring with Clickhouse
Monitoring with ClickhouseMonitoring with Clickhouse
Monitoring with Clickhouseunicast
 
Scylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and ScyllaScylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and ScyllaScyllaDB
 
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times FasterScylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times FasterScyllaDB
 
Journey and evolution of Presto@Grab
Journey and evolution of Presto@GrabJourney and evolution of Presto@Grab
Journey and evolution of Presto@GrabShubham Tagra
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaValery Tkachenko
 

What's hot (20)

Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Building tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systemsBuilding tiered data stores using aesop to bridge sql and no sql systems
Building tiered data stores using aesop to bridge sql and no sql systems
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Improve Presto Architectural Decisions with Shadow Cache
 Improve Presto Architectural Decisions with Shadow Cache Improve Presto Architectural Decisions with Shadow Cache
Improve Presto Architectural Decisions with Shadow Cache
 
TiDB Introduction
TiDB IntroductionTiDB Introduction
TiDB Introduction
 
Toronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELKToronto High Scalability meetup - Scaling ELK
Toronto High Scalability meetup - Scaling ELK
 
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander ZaitsevWebinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
Webinar 2017. Supercharge your analytics with ClickHouse. Alexander Zaitsev
 
HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...
HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...
HBaseCon2017 Analyzing cryptocurrencies in real time with hBase, Kafka and St...
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 
InfluxDB Internals
InfluxDB InternalsInfluxDB Internals
InfluxDB Internals
 
Can the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtCan the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaught
 
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDBScylla Summit 2022: New AWS Instances Perfect for ScyllaDB
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB
 
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience SharingClickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
Clickhouse MeetUp@ContentSquare - ContentSquare's Experience Sharing
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
Monitoring with Clickhouse
Monitoring with ClickhouseMonitoring with Clickhouse
Monitoring with Clickhouse
 
Scylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and ScyllaScylla Summit 2016: Graph Processing with Titan and Scylla
Scylla Summit 2016: Graph Processing with Titan and Scylla
 
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times FasterScylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster
 
Journey and evolution of Presto@Grab
Journey and evolution of Presto@GrabJourney and evolution of Presto@Grab
Journey and evolution of Presto@Grab
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek Vavrusa
 

Similar to CGAPTM Data Pipeline: How We Obtain Raw Data and Insights

Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureYaroslav Tkachenko
 
Kafka streams decoupling with stores
Kafka streams decoupling with storesKafka streams decoupling with stores
Kafka streams decoupling with storesYoni Farin
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyHostedbyConfluent
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexDataWorks Summit
 
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017Thomas Weise
 
From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017Apache Apex
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven MicroservicesFabrizio Fortino
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Databricks
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkVinoth Chandar
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
How QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterHow QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterMariaDB plc
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
 
Real time data pipline with kafka streams
Real time data pipline with kafka streamsReal time data pipline with kafka streams
Real time data pipline with kafka streamsYoni Farin
 

Similar to CGAPTM Data Pipeline: How We Obtain Raw Data and Insights (20)

Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
 
Kafka streams decoupling with stores
Kafka streams decoupling with storesKafka streams decoupling with stores
Kafka streams decoupling with stores
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
From Batch to Streaming ET(L) with Apache Apex at Berlin Buzzwords 2017
 
From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
How QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it fasterHow QBerg scaled to store data longer, query it faster
How QBerg scaled to store data longer, query it faster
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
 
Galaxy Big Data with MariaDB
Galaxy Big Data with MariaDBGalaxy Big Data with MariaDB
Galaxy Big Data with MariaDB
 
Real time data pipline with kafka streams
Real time data pipline with kafka streamsReal time data pipline with kafka streams
Real time data pipline with kafka streams
 

CGAPTM Data Pipeline: How We Obtain Raw Data and Insights

  • 1. Data: How Does It Work? An overview of the Castle Global Analytics Pipeline (CGAP™)
  • 2. 1. General overview 1. Our current solution 1. Numbers and comparisons
  • 3. Raw data aggregated from multiple sources Extract, transform, and load (ETL) Persist ingestible data to warehouse Analyze ingested data for insights Exhibit A: Generic Data Pipeline
  • 4. Raw data aggregated from multiple sources Extract, transform, and load (ETL) Persist ingestible data to warehouse Analyze ingested data for insights
  • 5. ● Two main types of real-time events: client and server ● Challenges: client call site consistency, batching, QA ● Multiple persistent databases ● Real-time events temporarily stored in Kafka - revisit ● Databases always up to date, copied over at intervals CGAP™ - How We Obtain Raw Data
  • 7. Raw data aggregated from multiple sources Extract, transform, and load (ETL) Persist ingestible data to warehouse Analyze ingested data for insights
  • 8. The ETL Job (Spark) ● Multiple batch jobs, one per kafka topic, run at intervals ● Concurrently read out of multiple kafka partitions ● Save offset state per job ● Parse, flatten, write JSON ● Takes care of compression and uploading to cloud storage ● Extra batch load jobs run for persistent relational DBs
  • 11. Raw data aggregated from multiple sources Extract, transform, and load (ETL) Persist ingestible data to warehouse Analyze ingested data for insights Exhibit A: Generic Data Pipeline
  • 12. Data Warehousing (AWS Redshift) ● Load in data from AWS S3 to Redshift ● Two clusters, Kiwi/Plaza, 10 nodes total ● Scaling/upgrading requires some write downtime ● Support for full SQL joins, based on PostgreSQL 8.0 ● Specialized proprietary hardware/software layers ● Scales up to petabytes
  • 13. Raw data aggregated from multiple sources Extract, transform, and load (ETL) Persist ingestible data to warehouse Analyze ingested data for insights Exhibit A: Generic Data Pipeline
  • 14. Business Intelligence ● Primarily using Mode as the UI to Redshift, so so cheap ● Allows for some visualizations from queries
  • 15. Our Numbers ● Kiwi: half a billion events ingested daily ○ 200MM from clients, 280MM from server (170MM pushes) ○ Uncompressed size of JSON: 500GB/day ○ Compressed size, Spark output: 8GB/day ○ Compressed and indexed size, Redshift: 15GB/day ● Plaza: 14MM events per day, mostly server ○ Plaza loads: 4 hours in 53s vs. Kiwi 4 hours in 8 minutes
  • 16. Our Numbers (cont.) ● Currently storing 1.2TB of compressed Redshift data ○ Roughly 2 months of full Kiwi data (14 billion rows) ○ Past 2 months, archive and coalesce to save space ● At Plaza’s rate, can store over two years on just 256GB ● Costs: ○ S3 storage long-term: $5000/year ○ Redshift, current status: $16,000/year ○ Machines for batch load: 1 MBP ○ Mode: $40/month
  • 17. Thanks to... ● The Apache Foundation for making open source systems ● AWS for providing us their economies of scale ● You, for listening