Apache Cassandra at Videoplaza — Stockholm Cassandra Users — September 2013

Josh Glover, Software Engineer at Videoplaza, will introduce you to the domain of video advertising and show how Videoplaza uses Apache Cassandra as part of a system that solves the difficult problem of allowing clients to analyse the performance of their advertising campaigns in real-time. Videoplaza needs to aggregate data for tens of thousands of combinations of dimensions and metrics for hundreds of clients from an incoming stream of thousands of requests per second, and do it fast enough so that clients can see trends as they happen.

  • OnLine Analytical Processing
  • Relational OnLine Analytical Processing
  • Multidimensional / Materialised OnLine Analytical Processing
  • Time: hours, months, years, etc.; Event: views, clicks; Device: iPhone, PC, PS3; Demography: age, gender, income, interests; Location
  • Create report template -> aggregate ID
  • Tracker publishes to a message broker (we use RabbitMQ). If you don’t know anything about messaging, definitely talk to me afterwards! For each event, you increment a bunch of counters (e.g. ad and event, ad and time and event, etc.); see the first sketch after these notes.
  • Upgrading to 1.2 this week!
  • 10 MB SSTables
  • Scala-style notation: key -> value. Explain the transaction ID here.
  • The adef (aggregate definition) tells us which combinations of dimensions are necessary; the dimensions repository contains the dimension values.
  • Read rows one at a time due to the Thrift maximum message size. After we upgrade, we can use the binary CQL client for better performance.
  • Java futures
  • Shard by hashing the values of all dimensions except time (see the second sketch after these notes).
  • We can replay data, but we can’t unplay it. There is no way to decrement a HyperLogLog counter.
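
The notes above say each tracked event increments a bunch of counters, one per dimension combination. Here is a minimal in-memory sketch of that fan-out, using the row-key shape from slide 27; the class, method names and counter store are hypothetical, and the real system routes increments through the flusher/merger pipeline rather than a local map.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the per-event counter fan-out described in the notes.
// Class and method names are hypothetical; row keys follow slide 27
// (aggregate definition ID | dimension values | time granularity).
public class EventFanOutSketch {

    private final Map<String, Long> counters = new HashMap<>();

    /** Called once per tracking event, e.g. adId = "ad1", eventId = 127 (view). */
    public void onEvent(String adId, int eventId, String hour, String day) {
        // hour = "2013-09-10.18", day = "2013-09-10" (column name formats from slide 29)
        increment("adef1|(" + adId + ":" + eventId + ")|hour", hour); // ad + event, hourly
        increment("adef1|(" + adId + ":" + eventId + ")|day", day);   // ad + event, daily
        // ...plus one increment per further combination the aggregate definitions ask for
    }

    private void increment(String rowKey, String column) {
        counters.merge(rowKey + "#" + column, 1L, Long::sum);
    }
}
```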
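The sharding note ("hash the values of all dimensions except time") could look roughly like this. The shard count of 32 is inferred from the live00 ... live31 / counter00 ... counter31 labels on the architecture slide; everything else is an assumption for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of "shard by hashing the values of all dimensions except time".
public class ShardingSketch {

    private static final int SHARD_COUNT = 32; // inferred from live00 ... live31 in the diagram

    /** Picks a shard from the non-time dimensions, so every hour of the same
     *  counter lands on the same shard. Dimension names are illustrative. */
    public static int shardFor(Map<String, String> dimensions) {
        // TreeMap gives a stable iteration order, so the hash is deterministic.
        StringBuilder key = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(dimensions).entrySet()) {
            if (!"time".equals(e.getKey())) {   // exclude the time dimension
                key.append(e.getKey()).append('=').append(e.getValue()).append('|');
            }
        }
        return Math.floorMod(key.toString().hashCode(), SHARD_COUNT);
    }
}
```

Hashing everything except time keeps all time buckets of one logical counter on one shard, which is what lets a single flusher own the whole row.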
  • Apache Cassandra at Videoplaza — Stockholm Cassandra Users — September 2013

    1. Karbon Insight: Realtime Reporting
    2. Introduction to ad serving: Video player, Ad player, Distributor, Tracker
    3. Event tracking • View (event ID 127) • Click (event ID 128) • and many more
    4. What do our customers want? • Any report they can dream up • Right away!
    5. Simple report: hour by ad and event
    6. Realtime reporting: multidimensional OLAP cube (Ad, Event, Time)
    7. ROLAP with star schema
    8. Disadvantages of ROLAP • Slow queries • Lots of joins • Expensive to scale • SQL limitations
    9. MOLAP to the rescue!
    10. What is a counter?
    11. You can’t always get what you want...
    12. Possible report dimensions • Time • Event • Ad • Device • Category • Location • Tag • Demography
    13. Many counters: 8 dimensions with an average size of 50 gives 50^8 counters! (39 trillion; worked out after the transcript)
    14. Average campaign length: 21 days (504 hours)
    15. Time flies like a banana: 21 days = 39 trillion counters; 42 days -> 78 trillion; 84 days -> 156 trillion; 365 days -> 677 trillion
    16. 5 years down the road: 3.39 quadrillion
    17. 3.39 quadrillion is a rather large number indeed: the number of stars in 7,500 galaxies like the Milky Way, or 15% of the surveyed universe!
    18. But you might just get what you need!
    19. Fake it till you can make it: don’t aggregate anything until they ask for it!
    20. • Time period • By hour • And ad • Views • Clicks
    21. Counter Storage
    22. Why Cassandra? • Fast writes • Linear scaling • Battle-hardened • (Relatively) simple operations • Great community!
    23. Architecture diagram: Tracker, Flusher, Aggregator, Merger, RabbitMQ, Cassandra; live00 ... live31, flush00 ... flush31, counter00 ... counter31
    24. Our setup • DataStax CE 1.1.9 • 18-node cluster • 1 datacentre
    25. Data model • 1 keyspace (RF: 3) • 1 column family • Leveled compaction
    26. Row keys: aggregate definition ID | dimension values | time granularity
    27. Example row keys: adef1|(ad1:127)|hour, adef1|(ad1:128)|hour, adef1|(ad2:127)|hour, ..., adef1|(ad5:128)|day
    28. Columns: time value -> counter, transaction ID -> id
    29. Example columns: 2013-09-10.18 -> 6348, 2013-09-10.19 -> 9784, txID -> 876219102
    30. Columns for rows with no time aggregation: total -> 6348, txID -> 876219102
    31. Reading counters (see the read sketch after the transcript)
    32. Build row key: adef1|(ad1:127)|hour
    33. Prepare query: keyspace.prepareQuery(columnFamily).getKey(rowKey)
    34. Column ranges: 2013-09-10.17 ... 2013-09-10.23
    35. Execute query asynchronously
    36. Get column value: the first byte is the counter type (long, double, HyperLogLog)
    37. Writing counters (see the write sketch after the transcript)
    38. Flush shards diagram: Flusher 1 (shards 00-08) ... Flusher 4 (shards 24-32) -> Cassandra
    39. Merge increment rows with a read cache; skip rows with the same transaction ID
    40. Write rows in mutation batches (of 400)
    41. Things we got wrong
    42. FAIL #1: Too many column families (each CF has 1 MB of heap overhead). Multi-tenancy FTW!
    43. FAIL #2: Manual operations (the CLI defaults to a replication factor of 1!). Tools and automation FTW!
    44. FAIL #3: No snapshots, so no way to undo data loading. Automated snapshots FTW!
    45. FAIL #4: Timezones meant post-processing of queried data. Store data in the customer’s timezone!
    46. 10 TB of data, 1,500 wps, 40,000 rps
    47. Q&A
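
For reference, the counter arithmetic from slides 13-16, assuming (as the slides do) that the total scales linearly with the length of the time dimension:

```latex
% 8 dimensions of average size 50 (slide 13):
\[
  50^{8} \approx 3.9 \times 10^{13} \quad \text{(39 trillion counters for a 21-day campaign)}
\]
% Scaling with campaign length (slides 15-16):
\[
  N(d\ \text{days}) \approx \frac{d}{21} \cdot 3.9 \times 10^{13},
  \qquad N(365) \approx 6.8 \times 10^{14},
  \qquad N(5\ \text{years}) \approx 3.4 \times 10^{15}.
\]
```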
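Read sketch (slides 31-36): the deck's prepareQuery/getKey snippet is the Astyanax client, so a minimal read of one row might look like the code below. The column family name, serializers and value encoding are assumptions; only the row key, the column range and the "first byte is the counter type" detail come from the slides.

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.serializers.StringSerializer;

import java.nio.ByteBuffer;

public class CounterReadSketch {

    // Single counter column family with string row keys and string column names
    // (the real schema is only hinted at in the slides).
    private static final ColumnFamily<String, String> CF_COUNTERS =
            new ColumnFamily<>("counters", StringSerializer.get(), StringSerializer.get());

    /** Reads hourly view counts for one ad on one evening, as on slides 32-36. */
    public static void readHourlyViews(Keyspace keyspace) throws Exception {
        // Slide 32: row key = aggregate definition ID | dimension values | time granularity
        String rowKey = "adef1|(ad1:127)|hour";

        // Slides 33-34: prepare the row query and restrict it to the column range
        // covering the hours we care about (column names are "yyyy-MM-dd.HH").
        ColumnList<String> columns = keyspace
                .prepareQuery(CF_COUNTERS)
                .getKey(rowKey)
                .withColumnRange("2013-09-10.17", "2013-09-10.23", false, 24)
                .execute()          // slide 35 executes asynchronously with Java futures instead
                .getResult();

        // Slide 36: the first byte of each value is the counter type
        // (long, double, HyperLogLog); the payload layout below is a guess.
        for (Column<String> column : columns) {
            ByteBuffer value = column.getByteBufferValue();
            byte counterType = value.get();
            long count = value.getLong();
            System.out.printf("%s -> %d (type %d)%n", column.getName(), count, counterType);
        }
    }
}
```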
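Write sketch (slides 37-40): a simplified guess at the merge-and-batch step. The Increment record, the shape of the read cache and the exact columns written are assumptions; the transaction-ID skip and the batch size of 400 come from the slides.

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CounterWriteSketch {

    private static final ColumnFamily<String, String> CF_COUNTERS =
            new ColumnFamily<>("counters", StringSerializer.get(), StringSerializer.get());
    private static final int BATCH_SIZE = 400; // slide 40: mutation batches of 400 rows

    /** One merged increment; a hypothetical shape, not Videoplaza's actual classes. */
    public record Increment(String rowKey, String timeColumn, long newValue, long txId) {}

    public static void writeCounters(Keyspace keyspace, List<Increment> increments)
            throws Exception {
        // Slide 39: a read cache of the last transaction ID applied per row lets the
        // merger skip rows it has already written, so replays stay idempotent.
        Map<String, Long> lastTxIdByRow = new HashMap<>();

        MutationBatch batch = keyspace.prepareMutationBatch();
        int rowsInBatch = 0;

        for (Increment inc : increments) {
            Long lastTxId = lastTxIdByRow.get(inc.rowKey());
            if (lastTxId != null && lastTxId == inc.txId()) {
                continue; // same transaction ID already applied: skip the row
            }
            batch.withRow(CF_COUNTERS, inc.rowKey())
                 .putColumn(inc.timeColumn(), inc.newValue()) // e.g. "2013-09-10.18" -> 6348;
                                                              // real values carry a leading type byte (slide 36)
                 .putColumn("txID", inc.txId());              // slide 28: transaction ID column
            lastTxIdByRow.put(inc.rowKey(), inc.txId());

            if (++rowsInBatch == BATCH_SIZE) {                // slide 40: flush every 400 rows
                batch.execute();
                batch = keyspace.prepareMutationBatch();
                rowsInBatch = 0;
            }
        }
        if (rowsInBatch > 0) {
            batch.execute();
        }
    }
}
```

Storing the transaction ID next to the counter value (slide 28) is what makes replays detectable: a row whose txID already matches the incoming batch can simply be skipped.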
