Aerospike at Tapad

Tap into the Secrets of Tapad's Success – How Tapad's ad-tech platform scales to hundreds of thousands of requests per second.

Published in: Technology
  • Didn’t realize this had animation, which is why this is out of order.
    But, it actually shows the asynchronous nature of the system pretty well!
  • This number is bid requests per second. A given request may trigger multiple reads because it has to follow ID aliases.
  • Marketers understand delivery and response for a single channel.
    However, marketers don’t have data about consumer exposure and marketing impact across devices.
  • Replication factor 2. If a node goes down, the data is still there.
    XDR - our West Coast bidders read from West Coast Aerospike, and vice versa. Changes propagate in milliseconds.
  • Read and write performance scales linearly. Much easier with a key-value store than a relational database, but hey! it fits our data model perfectly.
  • This is desired behavior because new information is more actionable than old information. A month-old record with no recent activity may never be accessed again (e.g. cleared cookies). Chuck it and make room for new data.
    Expiring old data also makes our ETL process faster: the old data is gone, so we don’t need to export it.
    Evicting older data makes space for good data when we enter a very high-write situation, which lets the system continue operating.
  • Config - what Tapad’s deployment looks like. We don’t have as many disks as the setups from Aerospike’s benchmarks because we don’t actually need that much storage.
    Migrations - what happens when you add a node to the cluster?
    Eviction - it’s a great feature for us, but it was counter-intuitive at first; we had an interesting scenario with this. More later.
    Usage - as a software developer, what is it like using Aerospike?
  • Multiple terabytes of data in 3.3 billion objects
    Each object is pretty small, about 200 bytes. More on this later in relation to block size, which is 512 bytes
  • Recommended high-water mark is 50% memory and 50% storage. Aerospike needs this headroom to defrag the data. We set it aggressively to 70% on memory and storage to save money on hardware.
    It’s possible to send updates at a rate faster than the defragger can keep up, leading to out-of-disk-space issues even when you have 40% of the disk free. Oops. The solution was to set the high-water mark lower (to 50%), which evicted a bunch of older data and got us back in business.
  • Most of our records are above 128 bytes and below 256 bytes.
    Block size is 512 bytes, which means we are often wasting at least 50% of the space each record occupies.
    Block size can be set to 128 bytes in versions 2.7 and 3.1. Looking forward to deploying this.
  • Adding a new node could mean a day (24 hours) of degraded performance.
    This is tunable based on speed of migration. Faster migration = the cluster is slower because of the many writes; the alternative is a slow migration.
    New nodes should be homogeneous; a new node with more storage than the other nodes cannot make use of it until all nodes have as much storage.
  • Go to next slide for the picture of the buckets.
  • Eviction takes place by splitting records into buckets based on TTL, and evicting randomly from the lowest-TTL bucket (soonest to expire).
    If records are not evenly distributed across the buckets, chaos ensues. It appears that data is being evicted randomly.
    Solution was to change the long TTL to a short one, and refresh those devices with a regularly scheduled job.
  • hot key = 3 outstanding reads on the same key; can’t get away from hot spots in data. Error is sent from the client.
  • Multiple keys pointing to the same record could save us a lot of extra reads (250k bid requests vs 350k reads - a lot of that is following ID aliases).
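
The eviction note above (records split into buckets by TTL, victims picked at random from the soonest-to-expire bucket) can be sketched roughly as follows. This is a simplified model for illustration, not Aerospike's actual implementation: the equal-width buckets, the bucket count, and the names `bucket_index` and `evict` are all assumptions.

```python
import random
from collections import defaultdict


def bucket_index(ttl, min_ttl, max_ttl, num_buckets):
    """Map a TTL onto one of num_buckets equal-width buckets (assumed scheme)."""
    if max_ttl == min_ttl:
        return 0
    width = (max_ttl - min_ttl) / num_buckets
    # Clamp so that ttl == max_ttl falls into the last bucket.
    return min(int((ttl - min_ttl) / width), num_buckets - 1)


def evict(records, to_evict, num_buckets=10, rng=random):
    """Return the set of evicted keys; records maps key -> TTL in seconds.

    Buckets are drained from the lowest TTL upward; within a bucket the
    victims are picked at random.
    """
    if not records or to_evict <= 0:
        return set()
    min_ttl, max_ttl = min(records.values()), max(records.values())
    buckets = defaultdict(list)
    for key, ttl in records.items():
        buckets[bucket_index(ttl, min_ttl, max_ttl, num_buckets)].append(key)

    evicted = set()
    for idx in sorted(buckets):
        remaining = to_evict - len(evicted)
        if remaining == 0:
            break
        keys = buckets[idx]
        evicted.update(rng.sample(keys, min(remaining, len(keys))))
    return evicted


# Hypothetical skewed data: 5 records expire within hours, 95 share a 30-day TTL.
DAY = 86_400
skewed = {f"short-{i}": 3_600 * (i + 1) for i in range(5)}
skewed.update({f"long-{i}": 30 * DAY for i in range(95)})
victims = evict(skewed, to_evict=20)
```

With TTLs skewed like this, the five short-lived records go first and the remaining 15 victims are drawn at random from 95 records that all land in the same bucket, which is consistent with the "appears random" symptom in the note. Spreading TTLs more evenly (the short TTL plus scheduled refresh mentioned above) keeps the low buckets populated with the records you actually want to lose first.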
  • Transcript

    • 1. Magic Scaling Sprinkles Experience operating a high-volume, low-latency ad buying platform at Tapad @tobym @TapadEng
    • 2. Who am I? Toby Matejovsky First engineer hired at Tapad 3+ years ago Scala developer @tobym
    • 3. What are we talking about? One of the key components that allows Tapad’s realtime ad-buying platform to hit 350,000 TPS.
    • 4. Outline • What Tapad does • Why Aerospike is a good fit • Operational experience • What’s next
    • 5. What Tapad Does (Real-time bidding) The Tapad Difference. A Unified View.
    • 6. Ad exchange Want to show an ad to device 123? Tapad Sure, show this ad for $2 CPM No thanks Great, you won. Ad was displayed! How about to device XYZ? 95% response time: ~30 ms
    • 7. Why Aerospike? Fast Safe Scale out Expiration/eviction
    • 8. Super fast key-value store 350,000 reads per second on 7 nodes 99% of reads are under 1 millisecond
    • 9. Safe Replication factor XDR (cross datacenter replication) SSD-backed
    • 10. Scale out Linear scalability, just add a node* *will revisit this during the next section
    • 11. Expiration and eviction Old data expires automatically Oldest data is evicted if the database is running out of space This is desired behavior in ad-tech world
    • 12. Operational experience with Aerospike at Tapad Configuration Migrations Eviction Usage
    • 13. Tapad’s Aerospike Configuration 100% keys in memory 100% data in SSD storage Replication factor 2 512-byte block size Need lots of free space in memory and storage for defrag (high-water mark)
    • 14. Migrations and partitions New node requires data migration, means degraded performance Network partition may trigger some data migration
    • 15. Eviction Awesome feature, not intuitive if objects’ TTLs are not nicely distributed
    • 16. Usage Blocking and non-blocking clients available LZ4-compressed protobuf Hot key error
    • 17. What’s next? Smaller minimum block size Replace Redis (UDFs) Multiple keys to reference the same record
    • 18. Thank You @tobym @TapadEng Toby Matejovsky, Director of Engineering toby@tapad.com @tobym
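
The storage cost of the 512-byte block size on slide 13, and the gain expected from the smaller block size on slide 17, can be estimated with a little arithmetic. The sketch below assumes that each record is padded up to a whole number of blocks and ignores any per-record overhead; the function names `stored_size` and `wasted_fraction` are hypothetical, and the 200-byte record size is the approximate figure given in the notes.

```python
import math


def stored_size(record_bytes, block_bytes):
    """Bytes occupied when a record is padded up to whole blocks (assumption)."""
    return math.ceil(record_bytes / block_bytes) * block_bytes


def wasted_fraction(record_bytes, block_bytes):
    """Fraction of the occupied space that is padding rather than data."""
    occupied = stored_size(record_bytes, block_bytes)
    return (occupied - record_bytes) / occupied


# Compare the current 512-byte block size with the 128-byte option.
for block in (512, 128):
    occupied = stored_size(200, block)
    waste = wasted_fraction(200, block)
    print(f"block={block}B occupied={occupied}B wasted={waste:.1%}")
```

Under these assumptions a 200-byte record occupies 512 bytes with 512-byte blocks (about 61% padding) and 256 bytes with 128-byte blocks (about 22% padding), so the smaller block size would roughly halve the space such a record takes.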
