Using Cassandra for RTB systems
Published: Tel-Aviv Cassandra 2014 Meetup presentation

Published in: Technology

1 Comment
  • Hello, I'm very interested in the bid decision trees, could you give more details please?



  • 1. Real Time Bidding with Apache Cassandra
  • 2. RTB @ Kenshoo: Introducing RTB - Concepts - Architecture - Challenges
  • 3. Real Time Bidding (RTB) ● Real-time bidding is a dynamic auction process where each impression is bid for in (near) real time, versus a static auction ● Kenshoo is engaged in Facebook Exchange (FBX) ● In FBX, each bid has a lifetime of 120ms. All transactions have to complete within that period, and the winning ad is presented to the user. ● Kenshoo employs ad re-targeting, where search engine campaigns are extended to the social network, thus giving a much higher ROI for our customers
  • 4. Flow (diagram: WebSite)
  • 5. RTB Logical Architecture (diagram): RTB Front (Opt Out, Bidder, Win, Error, Pixel Matcher); Cassandra (cookie-to-segment mapping); RTB Backend (bid decision trees, campaign metadata); RTB Brain; RTB Reporter
  • 6. RTB @ Kenshoo: Focus on RTB Cassandra - Architecture - Challenges
  • 7. Requirements
    ● Handle 25K+ requests within the 120ms bid time-frame, including network latencies
    ● Ability to scale up to 1M requests per minute while keeping the current latency
    ● Handle ~10K writes/second with low latency
    ● Multi-DC configuration; all nodes must be synced in real time
    ● Seamless operations: compactions and repairs
    ● High security
  • 8. C* Physical Architecture (diagram): app nodes in the (US) West and (US) East regions, connected over the Internet via GRE VPN to FBX West and FBX East
  • 9. C* Cluster Information
    ● Cassandra version 1.2.6
    ● Oracle Java 7
    ● Manual tokens (vnodes are coming soon)
    ● Multi-DC configuration with NetworkTopologyStrategy
    ● DC connectivity between VPCs via Linux GRE
    ● Amazon c3.2xlarge instance type
    ● Ubuntu 13.10 with ext4
    ● SSD (ephemeral)
    ● The Ring (diagram)
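The multi-DC setup above implies keyspaces replicated with NetworkTopologyStrategy. A minimal sketch of such a keyspace definition, valid for the CQL3 shipped with Cassandra 1.2.x; the keyspace name, data-center names, and replication factors are illustrative assumptions, not taken from the deck:

```sql
-- Illustrative only: keyspace name, DC names, and RFs are assumptions.
CREATE KEYSPACE rtb_segments
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_west': 3,   -- replicas kept in the West-region DC
    'us_east': 3    -- replicas kept in the East-region DC
  };
```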
  • 10. C* Cluster Network Between Sites
    ● For security reasons we:
      ○ Do not use EC2Snitch or EC2MultiRegionSnitch
      ○ Connected the nodes via VPN (Linux GRE)
    ● Linux GRE is fast, reliable, and provides high throughput (~1 Gb/s)
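A Linux GRE tunnel like the one described can be brought up with iproute2. A minimal sketch, assuming placeholder public and tunnel-internal addresses (the real endpoints are not in the deck); note that, as the slide says, GRE itself provides no encryption:

```shell
# On a West-region node; all addresses are placeholders.
ip tunnel add gre-east mode gre \
    remote 203.0.113.20 local 198.51.100.10 ttl 255   # public endpoints
ip addr add 10.10.0.1/30 dev gre-east                 # tunnel-internal address
ip link set gre-east up
```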
  • 11. C* Cluster Storage
    ● We started with Amazon EBS:
      ○ With a small number of nodes (up to 4), you want persistent storage, to avoid running repairs if you lose a node
      ○ 4x EBS devices in a RAID10 configuration provide up to 1,000 IOPS with bursts of up to 2,000 IOPS
      ○ Cheap in AWS
    ● 8 nodes with ephemeral devices:
      ○ Lower risk: if you lose a node, recovery isn't as heavy on the whole cluster
      ○ We used RAID0
      ○ Higher performance (double that of EBS)
      ○ Free, bundled with the instances
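Striping the ephemeral devices into RAID0, as described above, is typically done with mdadm. A hedged sketch, assuming two ephemeral devices at /dev/xvdb and /dev/xvdc (actual device names depend on the instance type):

```shell
# Stripe two ephemeral devices into RAID0 (no redundancy, maximum throughput)
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext4 /dev/md0                                    # ext4, as on slide 9
mount -o noatime /dev/md0 /var/lib/cassandra          # default C* data path
```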
  • 12. C* Cluster Storage (continued)
    ● 16 nodes with ephemeral devices:
      ○ When load became heavy, we grew to 16 nodes
      ○ Compactions and repairs harmed cluster latency
      ○ We had to use provisioned-IOPS devices for C* maintenance
    ● C3 instance type with SSD:
      ○ Came just in time, providing ephemeral SSD storage
      ○ Solved our performance problems and enabled seamless compactions and repairs
      ○ Amazon currently has scarce deployment of this hardware, and nodes are not stable
      ○ Not yet available in all regions
      ○ C3 node deployment is not always possible due to AWS capacity issues
      ○ Amazon promised to resolve the C3 issues next month
  • 13. C* Cluster Performance
  • 14. Monitoring
    ● We rely heavily on DataStax OpsCenter
    ● We export OpsCenter metrics for graphing
    ● We wrote our own read/write speed test against a separate, dedicated keyspace on each node to detect bottlenecks and problematic nodes
    ● We sample the data separately from the application to determine whether a problem originates in C* or in the application
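The per-node speed test could look roughly like the following sketch: it times an arbitrary read or write operation and reports latency percentiles. Everything here (the function name, the exact stats reported) is an assumption; the deck does not show the actual implementation:

```python
import time
import statistics

def speed_test(op, iterations=100):
    """Time `op` repeatedly and return latency stats in milliseconds.

    `op` stands in for a single read or write against the dedicated
    per-node test keyspace mentioned on the slide.
    """
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        op()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "min": latencies[0],
        "median": statistics.median(latencies),
        "p99": latencies[int(len(latencies) * 0.99) - 1],
        "max": latencies[-1],
    }
```

Running it with a real read callable (for example, a single-row SELECT through the client driver) on each node in turn makes slow nodes stand out.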
  • 15. What have we learned
    ● Storage:
      ○ Use SSD:
        ■ It provides high, stable disk performance
        ■ It neutralizes the effects of compactions and repairs on the cluster
        ■ It is worth the money
    ● Network:
      ○ Use the highest-bandwidth VPN possible
      ○ GRE is great (it lacks encryption, but provides the best bandwidth)
    ● Maintenance:
      ○ Run compaction daily: it does miracles for performance under heavy load
      ○ If you are not on SSD, disable Thrift on the node before running compaction
      ○ Run compactions in sequence, node by node
      ○ On high-load systems, avoid repair as much as possible; it is better to decommission and re-add a node than to run repair!
      ○ If you have to repair, always use the "-pr" flag and, if possible, the incremental repair option (requires heavy scripting)
    ● Monitoring:
      ○ Write a sampler and speed tester for each node to detect bottlenecks and the sources of performance issues
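The maintenance advice above can be sketched as a rolling script. Host names and SSH access are assumptions; the nodetool subcommands (disablethrift, compact, enablethrift, repair -pr) exist in the Cassandra 1.2 line:

```shell
# Hypothetical rolling maintenance: compact one node at a time, never in parallel.
for node in cass1 cass2 cass3; do
  ssh "$node" nodetool disablethrift   # only needed on non-SSD nodes
  ssh "$node" nodetool compact
  ssh "$node" nodetool enablethrift
done

# When repair is unavoidable, restrict each node to its primary ranges:
ssh cass1 nodetool repair -pr
```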
  • 16. Thank you