5. CGAP™ - How We Obtain Raw Data
● Two main types of real-time events: client and server
● Challenges: client call site consistency, batching, QA
● Multiple persistent databases
● Real-time events temporarily stored in Kafka (revisited in the ETL slide; producer sketch below)
● Databases always up to date, copied over at intervals
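A minimal sketch of how a server-side event might be published to Kafka, assuming the kafka-python client; the broker address, topic name, and event fields are illustrative, not the deck's actual setup:

```python
# Sketch of a server-side event producer (kafka-python). The broker
# address, topic name, and payload fields are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One server event; client events would arrive batched from devices.
event = {"type": "push_sent", "user_id": 42, "ts": int(time.time())}
producer.send("server_events", event)  # hypothetical topic
producer.flush()  # block until the event is actually delivered
```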
8. The ETL Job (Spark)
● Multiple batch jobs, one per Kafka topic, run at intervals
● Concurrently read from multiple Kafka partitions
● Save offset state per job
● Parse, flatten, write JSON
● Takes care of compression and uploading to cloud storage (sketch below)
● Extra batch load jobs run for persistent relational DBs
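A rough sketch of what one such per-topic batch job could look like, assuming Spark's batch Kafka source and a gzip JSON sink on S3 (the original jobs likely used an earlier Spark API); the topic, schema, offsets, and bucket path are illustrative, and a real job would persist its offset state rather than hard-code it:

```python
# Sketch of one per-topic ETL batch: read a bounded slice of Kafka,
# parse/flatten the JSON payloads, write gzip-compressed JSON to S3.
# Topic, schema, offsets, and S3 path are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("etl-client-events").getOrCreate()

# Offsets saved by the previous run of this job (hypothetical values);
# a real job would load/store these per topic-partition.
starting = '{"client_events":{"0":1000,"1":1000}}'

raw = (spark.read
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker:9092")
       .option("subscribe", "client_events")
       .option("startingOffsets", starting)
       .option("endingOffsets", "latest")
       .load())

# Illustrative event schema; real events carry many more fields.
schema = StructType([
    StructField("type", StringType()),
    StructField("user_id", LongType()),
    StructField("ts", LongType()),
])

flat = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*"))  # flatten the nested struct into top-level columns

# Spark handles compression and the S3 upload in one step.
(flat.write
     .mode("append")
     .option("compression", "gzip")
     .json("s3a://our-bucket/events/client_events/"))
```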
12. Data Warehousing (AWS Redshift)
● Load data from AWS S3 into Redshift (COPY sketch after this slide)
● Two clusters, Kiwi/Plaza, 10 nodes total
● Scaling/upgrading requires some write downtime
● Support for full SQL joins, based on PostgreSQL 8.0
● Specialized proprietary hardware/software layers
● Scales up to petabytes
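The S3-to-Redshift load is typically a COPY statement issued against the cluster; a minimal sketch using psycopg2, where the cluster endpoint, credentials, table name, and S3 prefix are all hypothetical:

```python
# Sketch of loading Spark's gzip JSON output from S3 into Redshift via
# COPY. Endpoint, credentials, table, and S3 prefix are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="kiwi.example.redshift.amazonaws.com",  # hypothetical cluster
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY client_events
        FROM 's3://our-bucket/events/client_events/'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/redshift-copy'
        JSON 'auto' GZIP;
    """)
```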
15. Our Numbers
● Kiwi: half a billion events ingested daily
○ 200MM from clients, 280MM from server (170MM pushes)
○ Uncompressed size of JSON: 500GB/day
○ Compressed size, Spark output: 8GB/day
○ Compressed and indexed size, Redshift: 15GB/day
● Plaza: 14MM events per day, mostly server
○ Load speed: Plaza loads 4 hours of data in 53s vs. Kiwi's 8 minutes
16. Our Numbers (cont.)
● Currently storing 1.2TB of compressed Redshift data
○ Roughly 2 months of full Kiwi data (14 billion rows)
○ Data older than 2 months is archived and coalesced to save space
● At Plaza’s rate, can store over two years on just 256GB
● Costs:
○ S3 storage long-term: $5000/year
○ Redshift, current status: $16,000/year
○ Machines for batch load: 1 MacBook Pro
○ Mode: $40/month
17. Thanks to...
● The Apache Software Foundation for making open source systems
● AWS for providing us their economies of scale
● You, for listening