6. Ad exchange
Exchange: Want to show an ad to device 123?
Tapad: Sure, show this ad for $2 CPM (or: No thanks)
Exchange: Great, you won. Ad was displayed!
Exchange: How about device XYZ?
95th-percentile response time: ~30 ms
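The request/response loop above can be sketched as a bid handler working against a latency budget. This is a minimal Python sketch: the in-memory dict standing in for the device store, the "auto-intender" segment, and the handler shape are all illustrative assumptions; only the $2 CPM and the ~30 ms target come from the slide.

```python
import time

# Hypothetical in-memory store standing in for the Aerospike device lookup
DEVICE_PROFILES = {"123": {"segment": "auto-intender"}}

BID_TIMEOUT_MS = 30  # 95th-percentile response-time target from the slide

def handle_bid_request(device_id):
    start = time.monotonic()
    profile = DEVICE_PROFILES.get(device_id)   # one key-value read
    if profile is None:
        return {"bid": False}                  # "No thanks"
    elapsed_ms = (time.monotonic() - start) * 1000
    # In production the response must come back well inside the timeout
    return {"bid": True, "cpm": 2.0, "latency_ms": elapsed_ms}
```

Device 123 gets a $2 CPM bid; an unknown device gets a no-bid, mirroring the dialogue on the slide.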
13. Expiration and eviction
Old data expires automatically
Oldest data is evicted if the database is running out of space
This is desired behavior in ad-tech world
16. Tapad’s Aerospike Configuration
100% keys in memory
100% data in SSD storage
Replication factor 2
512-byte block size
Need lots of free space in memory and storage for defrag (high-water mark)
18. Migrations and partitions
Adding a new node requires data migration, which means degraded performance
Network partition may trigger some data migration
Didn’t realize this had animation, which is why this is out of order.
But, it actually shows the asynchronous nature of the system pretty well!
This number is bid-requests per second. A given request may trigger multiple reads because of following aliases.
Marketers understand delivery and response for a single channel
However, marketers don’t have data about consumer exposure and marketing impact across devices
Replication factor 2. If a node goes down, the data is still there.
XDR - our west coast bidders read from the west coast Aerospike cluster, and vice versa. Changes propagate in milliseconds.
Read and write performance scales linearly. Much easier with a key-value store than a relational database, but hey! it fits our data model perfectly.
This is desired behavior because new information is more actionable than old information. A month-old record with no recent activity may never be accessed again (cleared cookies). Chuck it and make room for new data.
Expiring old data also makes our ETL process faster: the old data is gone, so we don’t need to export it.
Evicting older data makes space for good data when we enter a very high-write situation. Lets the system continue operating.
Config - what Tapad’s deployment looks like. We don’t have as many disks as the setups from Aerospike’s benchmarks because we don’t actually need that much storage.
Migrations - what happens when you add a node to the cluster?
Eviction - a great feature, but it was counter-intuitive for us; we had an interesting scenario with it. More later.
Usage - as a software developer, what is it like using Aerospike?
Multiple terabytes of data in 3.3 billion objects
Each object is pretty small, about 200 bytes. More on this later in relation to block size, which is 512 bytes
Recommended high-water mark is 50% of memory and 50% of storage; Aerospike needs this headroom to defragment the data. We set it aggressively to 70% on both memory and storage to save money on hardware.
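The config numbers above support a back-of-envelope sizing sketch. This assumes Aerospike's documented 64-byte in-RAM primary-index entry per record, plus the figures from the slides (3.3 billion objects, 512-byte blocks, replication factor 2, 70% high-water mark); treat the results as rough cluster-wide estimates, not Tapad's actual hardware bill.

```python
OBJECTS = 3.3e9            # object count from the slides
INDEX_ENTRY_BYTES = 64     # Aerospike primary-index entry size in RAM
BLOCK_BYTES = 512          # write-block size from the config above
HIGH_WATER = 0.70          # Tapad's high-water mark (recommended: 0.50)
REPLICATION = 2            # replication factor from the config above

# 100% of keys in memory: index RAM across the cluster
index_ram_gb = OBJECTS * INDEX_ENTRY_BYTES * REPLICATION / 1e9

# 100% of data on SSD: each ~200-byte record still occupies a 512-byte block
raw_ssd_gb = OBJECTS * BLOCK_BYTES * REPLICATION / 1e9

# Provision extra space so defrag has room below the high-water mark
provisioned_ssd_gb = raw_ssd_gb / HIGH_WATER
```

With these assumptions the index alone needs roughly 420 GB of RAM cluster-wide, and the SSD tier needs to be provisioned well above the raw data size so defrag can keep running.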
It’s possible to send updates faster than the defragmenter can keep up, leading to out-of-disk-space errors even with 40% of the disk free. Oops. The solution was to set the high-water mark lower (to 50%), which evicted a bunch of older data and got us back in business.
Most of our records are above 128 bytes and below 256 bytes.
Block size is 512 bytes, which means we are often wasting 50% or more of each record’s allocated storage space.
Block size can be set to 128 bytes in versions 2.7 and 3.1. Looking forward to deploying this.
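The block-size waste is simple round-up arithmetic. This sketch uses a simplified model that ignores any per-record on-disk overhead; the 200-byte record size and the 512-byte vs. 128-byte block sizes come from the slides.

```python
def allocated_bytes(record_bytes, block_bytes):
    """Records round up to a whole number of blocks on SSD."""
    blocks = -(-record_bytes // block_bytes)   # ceiling division
    return blocks * block_bytes

def wasted_fraction(record_bytes, block_bytes):
    """Fraction of the allocation that the record does not use."""
    alloc = allocated_bytes(record_bytes, block_bytes)
    return (alloc - record_bytes) / alloc

# A ~200-byte record in a 512-byte block wastes ~61% of its allocation;
# in 128-byte blocks it takes two blocks and wastes only ~22%.
```

That gap is why moving to 128-byte blocks is attractive for records mostly in the 128-256 byte range.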
Adding a new node could mean a day (24 hours) of degraded performance.
This is tunable via migration speed: faster migration means the cluster is slower because of the many extra writes; the alternative is a slow migration.
New nodes should be homogeneous; a new node with more storage than the other nodes cannot make use of the extra capacity until all nodes have as much storage.
Go to next slide for the picture of the buckets.
Eviction takes place by splitting records into buckets based on TTL, and evicting randomly from lowest TTL (soonest to expire)
If records are not evenly distributed across the TTL buckets, chaos ensues: it appears that data is being evicted at random.
Solution was to change the long TTL to a short one, and refresh those devices with a regularly scheduled job.
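The bucket-based eviction described above can be sketched as follows. This is a simplified model, not Aerospike's actual implementation: the bucket count, TTL range, and data are illustrative.

```python
import random
from collections import defaultdict

def evict(records, n_to_evict, n_buckets=10, max_ttl=3600):
    """Group records into buckets by remaining TTL, then evict
    randomly from the soonest-to-expire buckets first."""
    width = max_ttl / n_buckets
    buckets = defaultdict(list)
    for key, ttl in records.items():
        buckets[int(ttl // width)].append(key)
    evicted = []
    for b in sorted(buckets):        # lowest-TTL bucket first
        pool = buckets[b]
        random.shuffle(pool)         # random *within* a bucket
        while pool and len(evicted) < n_to_evict:
            evicted.append(pool.pop())
        if len(evicted) >= n_to_evict:
            break
    return evicted
```

If nearly every record carries the same long TTL, they all land in one bucket, so eviction looks random across the whole dataset; that is the chaos described above, and the fix was shorter TTLs refreshed by a scheduled job.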
Hot key = 3 outstanding reads on the same key; you can’t get away from hot spots in data. The error surfaces in the client.
Multiple keys pointing to the same record could save us a lot of extra reads (250k bid requests per second vs. 350k reads per second; a lot of that difference is following ID aliases).
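The alias-following cost can be sketched like this. The record layout (an `alias_of` bin pointing at a canonical device key) and all the IDs here are hypothetical assumptions used only to illustrate why one bid request can trigger more than one read.

```python
# Hypothetical store: alias records point at a canonical device record,
# so resolving an incoming ID can cost two reads instead of one.
STORE = {
    "idfa:abc":   {"alias_of": "tapad:123"},      # mobile ID -> canonical
    "cookie:xyz": {"alias_of": "tapad:123"},      # browser cookie -> canonical
    "tapad:123":  {"segments": ["auto-intender"]},
}

def resolve(key, max_hops=3):
    """Follow alias records until a canonical record is found,
    counting how many key-value reads it took."""
    reads = 0
    while reads < max_hops:
        record = STORE.get(key)
        reads += 1
        if record is None or "alias_of" not in record:
            return record, reads
        key = record["alias_of"]
    return None, reads
```

Resolving `idfa:abc` costs two reads (alias, then canonical); if multiple keys could point directly at the same record, that would collapse to one.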