5. CGAP™ - How We Obtain Raw Data
● Two main types of real-time events: client and server
● Challenges: client call site consistency, batching, QA
● Multiple persistent databases
● Real-time events temporarily stored in Kafka (revisited in the ETL slide; producer sketch below)
● Databases always up to date, copied over at intervals
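A minimal sketch of how a server-side event might be published to Kafka, assuming the kafka-python client; the broker address, topic name, and event fields are illustrative, not the deck's actual setup:

```python
# Sketch of a server-side event producer (kafka-python). The broker
# address, topic name, and payload fields are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One server event; client events would arrive batched from devices.
event = {"type": "push_sent", "user_id": 42, "ts": int(time.time())}
producer.send("server_events", event)  # hypothetical topic
producer.flush()  # block until the event is actually delivered
```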
8. The ETL Job (Spark)
● Multiple batch jobs, one per Kafka topic, run at intervals
● Concurrently read from multiple Kafka partitions
● Save offset state per job
● Parse, flatten, write JSON
● Takes care of compression and uploading to cloud storage (sketch below)
● Extra batch load jobs run for persistent relational DBs
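A rough sketch of what one such per-topic batch job could look like, assuming Spark's batch Kafka source and a gzip JSON sink on S3 (the original jobs likely used an earlier Spark API); the topic, schema, offsets, and bucket path are illustrative, and a real job would persist its offset state rather than hard-code it:

```python
# Sketch of one per-topic ETL batch: read a bounded slice of Kafka,
# parse/flatten the JSON payloads, write gzip-compressed JSON to S3.
# Topic, schema, offsets, and S3 path are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("etl-client-events").getOrCreate()

# Offsets saved by the previous run of this job (hypothetical values);
# a real job would load/store these per topic-partition.
starting = '{"client_events":{"0":1000,"1":1000}}'

raw = (spark.read
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker:9092")
       .option("subscribe", "client_events")
       .option("startingOffsets", starting)
       .option("endingOffsets", "latest")
       .load())

# Illustrative event schema; real events carry many more fields.
schema = StructType([
    StructField("type", StringType()),
    StructField("user_id", LongType()),
    StructField("ts", LongType()),
])

flat = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*"))  # flatten the nested struct into top-level columns

# Spark handles compression and the S3 upload in one step.
(flat.write
     .mode("append")
     .option("compression", "gzip")
     .json("s3a://our-bucket/events/client_events/"))
```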
12. Data Warehousing (AWS Redshift)
● Load data from AWS S3 into Redshift (COPY sketch after this slide)
● Two clusters, Kiwi/Plaza, 10 nodes total
● Scaling/upgrading requires some write downtime
● Support for full SQL joins, based on PostgreSQL 8.0
● Specialized proprietary hardware/software layers
● Scales up to petabytes
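The S3-to-Redshift load is typically a COPY statement issued against the cluster; a minimal sketch using psycopg2, where the cluster endpoint, credentials, table name, and S3 prefix are all hypothetical:

```python
# Sketch of loading Spark's gzip JSON output from S3 into Redshift via
# COPY. Endpoint, credentials, table, and S3 prefix are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="kiwi.example.redshift.amazonaws.com",  # hypothetical cluster
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY client_events
        FROM 's3://our-bucket/events/client_events/'
        CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/redshift-copy'
        JSON 'auto' GZIP;
    """)
```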
15. Our Numbers
● Kiwi: half a billion events ingested daily
○ 200MM from clients, 280MM from server (170MM pushes)
○ Uncompressed size of JSON: 500GB/day
○ Compressed size, Spark output: 8GB/day
○ Compressed and indexed size, Redshift: 15GB/day
● Plaza: 14MM events per day, mostly server
○ Load speed: Plaza loads 4 hours of data in 53s vs. Kiwi's 8 minutes
16. Our Numbers (cont.)
● Currently storing 1.2TB of compressed Redshift data
○ Roughly 2 months of full Kiwi data (14 billion rows)
○ Data older than 2 months is archived and coalesced to save space
● At Plaza’s rate, can store over two years on just 256GB
● Costs:
○ S3 storage long-term: $5000/year
○ Redshift, current status: $16,000/year
○ Machines for batch load: 1 MacBook Pro
○ Mode: $40/month
17. Thanks to...
● The Apache Software Foundation for making open source systems
● AWS for providing us their economies of scale
● You, for listening