Using Cassandra for RTB systems


Presented at the Tel-Aviv Cassandra Meetup, 2014.


  1. Real Time Bidding with Apache Cassandra
  2. RTB @ Kenshoo: Introducing RTB - Concepts - Architecture - Challenges
  3. Real Time Bidding (RTB)
     ● Real-time bidding is a dynamic auction process where each impression is bid for in (near) real time, as opposed to a static auction.
     ● Kenshoo is engaged in Facebook Exchange (FBX).
     ● In FBX, each bid has a lifetime of 120ms. All transactions have to complete within that period, and the winning ad is presented to the user.
     ● Kenshoo employs ad re-targeting, where search-engine campaigns are extended to the social network, giving a much higher ROI for our customers.
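
The 120ms budget shapes the bidder code: every lookup has to be deadline-aware. A minimal sketch of the idea, where lookup_segments and build_bid are hypothetical stand-ins for the cookie-to-segment lookup and the bid-decision-tree evaluation (not Kenshoo's actual code):

```python
import time

BID_DEADLINE_MS = 120  # FBX bid lifetime quoted on slide 3

def handle_bid_request(request, lookup_segments, build_bid):
    """Answer a bid request, or no-bid if the 120ms budget is nearly spent."""
    start = time.monotonic()
    # Hypothetical segment lookup, e.g. a Cassandra read (see slide 5).
    segments = lookup_segments(request.cookie_id)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    # Leave headroom for building and shipping the response over the network.
    if elapsed_ms > BID_DEADLINE_MS * 0.8:
        return None  # no-bid: too late to win the auction anyway
    return build_bid(request, segments)
```
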
  4. Flow (diagram: WebSite)
  5. RTB Logical Architecture (diagram: RTB Front, Opt Out, Bidder, Win, Error Pixel, Matcher, Cassandra cookie-to-segment(s) store, RTB Backend, bid decision trees, campaign metadata, RTB Brain, RTB Reporter)
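
The Matcher's hot path is the cookie-to-segment(s) read from Cassandra. A minimal sketch of what that lookup could look like, assuming the DataStax Python driver and an illustrative rtb.cookie_segments table (not Kenshoo's actual schema):

```python
from cassandra.cluster import Cluster

# Illustrative schema:
#   cookie_segments(cookie_id text PRIMARY KEY, segments set<text>)
cluster = Cluster(['10.0.0.10'])   # placeholder contact point
session = cluster.connect('rtb')   # placeholder keyspace name

lookup = session.prepare(
    "SELECT segments FROM cookie_segments WHERE cookie_id = ?")

def segments_for(cookie_id):
    """Return the segment set for a cookie, or an empty list if unknown."""
    row = session.execute(lookup, [cookie_id]).one()
    return row.segments if row else []
```
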
  6. RTB @ Kenshoo: Focus on RTB Cassandra - Architecture - Challenges
  7. Requirements
     ● Handle 25K+ requests within the 120ms bid time-frame, including network latencies.
     ● Ability to scale up to 1M requests per minute while keeping the current latency.
     ● Handle ~10K writes/second with low latency.
     ● Multi-DC configuration; all nodes must be synced in real time.
     ● Seamless operations: compactions and repairs.
     ● High security.
  8. C* Physical Architecture (diagram: App nodes in the (US) West and (US) East regions, linked over the Internet by GRE VPN to FBX WEST and FBX EAST)
  9. C* Cluster Information
     ● Cassandra version 1.2.6
     ● Oracle Java 7
     ● Manual tokens (vnodes are coming soon)
     ● Multi-DC configuration (NetworkTopologyStrategy)
     ● DC connectivity between VPCs via Linux GRE
     ● Amazon c3.2xlarge instance type
     ● Ubuntu 13.10 with EXT4
     ● SSD (ephemeral)
     ● The ring (diagram)
  10. C* Cluster Network Between Sites
     ● For security reasons we:
        ○ do not use EC2Snitch or EC2MultiRegionSnitch
        ○ connected the nodes via VPN (Linux GRE)
     ● Linux GRE is fast and reliable, and provides high throughput (~1Gb/s).
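
For illustration, a GRE tunnel like the one described can be brought up with the iproute2 tools; the sketch below drives them from Python, with placeholder addresses and subnets (the real endpoints are not in the deck):

```python
import subprocess

def run(cmd):
    """Run an iproute2 command, raising if it fails (must run as root)."""
    subprocess.check_call(cmd.split())

LOCAL_PUBLIC = "203.0.113.10"    # this node's public IP (placeholder)
REMOTE_PUBLIC = "198.51.100.20"  # peer node's public IP (placeholder)

# Point-to-point GRE tunnel to the other region; mirror local/remote there.
run(f"ip tunnel add gre-east mode gre local {LOCAL_PUBLIC} remote {REMOTE_PUBLIC} ttl 255")
run("ip link set gre-east up")
run("ip addr add 10.255.0.1/30 dev gre-east")  # tunnel endpoint address
run("ip route add 10.1.0.0/16 dev gre-east")   # route the remote DC's subnet
```
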
  11. C* Cluster Storage
     ● We started with Amazon EBS:
        ○ With a small number of nodes (up to 4), you want persistent storage, to avoid running repairs if you lose a node.
        ○ 4x EBS devices in a RAID10 configuration provide up to 1,000 IOPS with bursts of up to 2,000 IOPS.
        ○ Cheap in AWS.
     ● 8 nodes with ephemeral devices:
        ○ Lower risk: if you lose a node, recovery isn't as heavy on the whole cluster.
        ○ We used RAID0 (a setup sketch follows slide 12).
        ○ Higher performance (double that of EBS).
        ○ Free, bundled with the instances.
  12. C* Cluster Storage, continued
     ● 16 nodes with ephemeral devices:
        ○ When load became heavy we grew to 16 nodes.
        ○ Compactions and repairs harmed the cluster latency.
        ○ We had to use Provisioned IOPS devices for C* maintenance.
     ● C3 instance type with SSD:
        ○ Came just in time, providing ephemeral SSD storage.
        ○ Solved our performance problems and enabled seamless compactions and repairs.
        ○ Amazon currently has scarce deployment of this hardware and the nodes are not stable.
        ○ Not yet available in all regions.
        ○ C3 node deployment is not always a possibility due to AWS capacity issues.
        ○ Amazon promised to resolve the C3 issues next month.
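
As a rough illustration of the RAID0-over-ephemeral setup from slides 11-12, assuming the two ephemeral SSDs of a c3.2xlarge appear as /dev/xvdb and /dev/xvdc (device names vary by instance):

```python
import subprocess

# Stripe the two ephemeral SSDs into one RAID0 device, format EXT4 (as on
# slide 9), and mount it for Cassandra. Must run as root; destroys data.
DEVICES = ["/dev/xvdb", "/dev/xvdc"]  # placeholder device names

subprocess.check_call(
    ["mdadm", "--create", "/dev/md0", "--level=0",
     f"--raid-devices={len(DEVICES)}"] + DEVICES)
subprocess.check_call(["mkfs.ext4", "/dev/md0"])
subprocess.check_call(["mount", "/dev/md0", "/var/lib/cassandra"])
```
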
  13. C* Cluster Performance
  14. Monitoring
     ● We heavily rely on DataStax OpsCenter.
     ● We export the OpsCenter metrics for graphing.
     ● We wrote our own read/write speed test, run against a separate, dedicated keyspace on each node, to detect bottlenecks and problematic nodes (a sketch follows this slide).
     ● We sample the data separately from the application to determine whether a problem originates in C* or in the application.
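
A minimal sketch of such a per-node speed test, assuming the DataStax Python driver and a throwaway speed_test keyspace; the deck does not show the actual tool, and CREATE ... IF NOT EXISTS assumes a newer Cassandra than the 1.2.6 on slide 9:

```python
import time
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.10'])  # point at the node under test
session = cluster.connect()
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS speed_test WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")
session.set_keyspace('speed_test')
session.execute(
    "CREATE TABLE IF NOT EXISTS probe (id uuid PRIMARY KEY, payload text)")

write = session.prepare("INSERT INTO probe (id, payload) VALUES (?, ?)")
read = session.prepare("SELECT payload FROM probe WHERE id = ?")
ids = [uuid.uuid4() for _ in range(1000)]

start = time.time()
for i in ids:
    session.execute(write, [i, 'x' * 512])
print("avg write latency: %.2f ms" % ((time.time() - start) / len(ids) * 1000))

start = time.time()
for i in ids:
    session.execute(read, [i])
print("avg read latency: %.2f ms" % ((time.time() - start) / len(ids) * 1000))
```
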
  15. What have we learned
     ● Storage:
        ○ Use SSD:
           ■ It provides high and stable disk performance.
           ■ It neutralizes the effects of compaction and repair on the cluster.
           ■ Worth the money.
     ● Network:
        ○ Use the highest-bandwidth VPN possible.
        ○ GRE is great (it lacks encryption, but provides the best bandwidth).
     ● Maintenance (a scripting sketch follows this slide):
        ○ Run compact daily: it does miracles for performance under heavy load.
        ○ If you are not on SSD, disable Thrift on the node before running a compaction.
        ○ Do compactions in sequence, node by node.
        ○ On high-load systems, avoid repair as much as possible; it's better to decommission and re-add a node than to run repair!
        ○ If you have to repair, always use the "-pr" flag and, if possible, use the incremental repair option (requires heavy scripting).
     ● Monitoring:
        ○ Write a sampler and speed tester for each node to detect bottlenecks and the sources of performance issues.
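
The "heavy scripting" the maintenance bullets hint at can start as simply as serializing nodetool runs across the ring; a sketch, assuming SSH access and a hypothetical host list:

```python
import subprocess

NODES = ["cass-01", "cass-02", "cass-03"]  # hypothetical host names

for node in NODES:
    # nodetool compact triggers a major compaction on that node only;
    # blocking until it returns keeps the cluster to one compaction at a time.
    subprocess.check_call(["ssh", node, "nodetool", "compact"])

# The equivalent primary-range repair, if repair is unavoidable:
#     subprocess.check_call(["ssh", node, "nodetool", "repair", "-pr"])
```
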
  16. Thank you