Karbon Insight: Realtime Reporting
Introduction to ad serving
Video player
Ad player
Distributor Tracker
Event tracking
•View (event ID 127)
•Click (event ID 128)
•and many more
What do our customers want?
•Any report they can dream up
•Right away!
Simple report: hour by ad and event
Realtime reporting
Multidimensional OLAP cube
Ad
Event
Time
ROLAP with star schema
Disadvantages of ROLAP
•Slow queries
•Lots of joins
•Expensive to scale
•SQL limitations
MOLAP to the rescue!
What is a counter?
You can’t always get what you want...
Possible report dimensions
•Time
•Event
•Ad
•Device
•Category
•Location
•Tag
•Demography
Many counters
8 dimensions
average size of 50
50^8 counters! (39 trillion)
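The 39 trillion figure is 50 multiplied by itself 8 times, one factor per dimension. A minimal sketch of that arithmetic, assuming every dimension has exactly the average number of values; the class and method names are illustrative and not taken from the talk:

```java
import java.math.BigInteger;

// Illustrative only: one counter per combination of dimension values.
class CounterCount {
    // Assumes each of `dimensions` dimensions has exactly `averageSize` values.
    static BigInteger combinations(int dimensions, int averageSize) {
        return BigInteger.valueOf(averageSize).pow(dimensions);
    }

    public static void main(String[] args) {
        System.out.println(combinations(8, 50)); // prints 39062500000000
    }
}
```

Adding a ninth dimension of the same size would multiply the count by another 50, which is why pre-aggregating everything does not scale.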
Average campaign length:
21 days
(504 hours)
Time flies like a banana
21 days = 39 trillion counters
42 days -> 78 trillion
84 days -> 156 trillion
365 days -> 677 trillion
5 years down the road
3.39 quadrillion
3.39 quadrillion is a rather large number indeed
Number of stars in 7500 galaxies like the Milky Way.
15% of the surveyed universe!
But you might just get what you need!
Fake it till you can make it
Don’t aggregate anything until they ask for it!
•Time period
•By hour
•And ad
•Views
•Clicks
Counter Storage
Why Cassandra?
•Fast writes
•Linear scaling
•Battle-hardened
•(Relatively) simple operations
•Great community!
Cassandra
Tracker
Flusher
Aggregator
Merger
live00 ... live31
RabbitMQ
flush00 ... flush31
counter00 ... counter31
Our setup
•DataStax CE 1.1.9
•18 node cluster
•1 datacentre
Data model
•1 keyspace (RF: 3)
•1 column family
•Leveled compaction
Row keys
aggregate definition ID
dimension values
time granularity
Example row keys
adef1|(ad1:127)|hour
adef1|(ad1:128)|hour
adef1|(ad2:127)|hour
...
adef1|(ad5:128)|day
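The example row keys join three parts with `|`: the aggregate definition ID, the dimension values, and the time granularity. A minimal sketch of building such a key, assuming the dimension values are joined with `:` inside parentheses as the examples suggest; `RowKeys.build` is a hypothetical helper, not the production code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that mirrors the row key layout shown in the slides.
class RowKeys {
    // Produces e.g. "adef1|(ad1:127)|hour" from its three components.
    static String build(String aggregateDefinitionId,
                        List<String> dimensionValues,
                        String granularity) {
        return aggregateDefinitionId
                + "|(" + String.join(":", dimensionValues) + ")|"
                + granularity;
    }

    public static void main(String[] args) {
        // Ad "ad1" combined with the View event (ID 127), aggregated by hour.
        System.out.println(build("adef1", Arrays.asList("ad1", "127"), "hour"));
    }
}
```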
Columns
time value -> counter
transaction ID -> id
Example columns
2013-09-10.18 -> 6348
2013-09-10.19 -> 9784
txID -> 876219102
Columns for rows with no time aggregation
total -> 6348
txID -> 876219102
Reading counters
Build row key
adef1|(ad1:127)|hour
Prepare query
keyspace
.prepareQuery(columnFamily)
.getKey(rowKey)
Column ranges
2013-09-10.17
...
2013-09-10.23
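Column names are hour stamps, so a report period maps to a contiguous range of columns. The sketch below lists the column names covered by such a range, assuming the `yyyy-MM-dd.HH` naming seen in the examples; against Cassandra one would normally pass only the first and last name as the range bounds, and `HourColumns` is an illustrative name:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Illustrative: enumerates the hourly column names in an inclusive range.
class HourColumns {
    static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyy-MM-dd.HH");

    static List<String> between(String firstHour, String lastHour) {
        LocalDateTime current = LocalDateTime.parse(firstHour, HOUR);
        LocalDateTime end = LocalDateTime.parse(lastHour, HOUR);
        List<String> names = new ArrayList<>();
        while (!current.isAfter(end)) {
            names.add(current.format(HOUR));
            current = current.plusHours(1);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(between("2013-09-10.17", "2013-09-10.23"));
    }
}
```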
Execute query asynchronously
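The speaker notes mention Java futures and reading rows one at a time. A rough sketch of that pattern using `CompletableFuture`, with an in-memory map standing in for Cassandra; the real code would go through the Cassandra client instead, and every name here is an assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Illustrative: one asynchronous read per row key, results collected in order.
class AsyncRowReads {
    // Stand-in for a single-row read; unknown rows count as 0.
    static long readRow(Map<String, Long> store, String rowKey) {
        return store.getOrDefault(rowKey, 0L);
    }

    static List<Long> readAll(Map<String, Long> store, List<String> rowKeys) {
        List<CompletableFuture<Long>> pending = new ArrayList<>();
        for (String rowKey : rowKeys) {
            // Start every read before waiting on any of them.
            pending.add(CompletableFuture.supplyAsync(() -> readRow(store, rowKey)));
        }
        List<Long> results = new ArrayList<>();
        for (CompletableFuture<Long> future : pending) {
            results.add(future.join()); // blocks until this read has finished
        }
        return results;
    }
}
```

Starting all reads first and joining afterwards lets the single-row requests overlap instead of running back to back.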
Get column value
First byte is counter type (long, double, HyperLogLog)
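A type-tagged value can be decoded by reading the first byte and then interpreting the rest accordingly. The sketch below covers the long and double cases only; the tag values 0 and 1 are invented for illustration (the slides do not state them), and HyperLogLog payloads are left out:

```java
import java.nio.ByteBuffer;

// Illustrative codec for a column value whose first byte names the counter type.
class CounterValues {
    // Hypothetical tag values; the real ones are not given in the slides.
    static final byte TYPE_LONG = 0;
    static final byte TYPE_DOUBLE = 1;

    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(1 + Long.BYTES).put(TYPE_LONG).putLong(value).array();
    }

    static byte[] encodeDouble(double value) {
        return ByteBuffer.allocate(1 + Double.BYTES).put(TYPE_DOUBLE).putDouble(value).array();
    }

    static Number decode(byte[] column) {
        ByteBuffer buffer = ByteBuffer.wrap(column);
        byte type = buffer.get(); // first byte is the counter type
        switch (type) {
            case TYPE_LONG:
                return buffer.getLong();
            case TYPE_DOUBLE:
                return buffer.getDouble();
            default:
                throw new IllegalArgumentException("Unsupported counter type: " + type);
        }
    }
}
```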
Writing counters
Flush shards
...
Flusher 1
shards 00-08
Flusher 4
shards 24-32
Cassandra
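According to the speaker notes, events are sharded by hashing the values of all dimensions except time. A minimal sketch of such a shard assignment, assuming 32 shards (matching the queue names live00 ... live31) and using Java's list hash as a stand-in for whatever hash function is really used:

```java
import java.util.List;

// Illustrative shard assignment; the real hash function is not given in the talk.
class Shards {
    static final int SHARD_COUNT = 32; // assumed from queue names live00 ... live31

    // Time is deliberately excluded so all hours of one counter share a shard.
    static int shardFor(List<String> nonTimeDimensionValues) {
        return Math.floorMod(nonTimeDimensionValues.hashCode(), SHARD_COUNT);
    }

    // Formats a queue name such as "flush07" for a shard number.
    static String queueName(String prefix, int shard) {
        return String.format("%s%02d", prefix, shard);
    }
}
```

`Math.floorMod` keeps the result non-negative even when the hash code is negative.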
Merge increment rows with read cache
Skip rows with the same transaction ID
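Storing the transaction ID next to the counter makes a replayed flush harmless: if the stored ID equals the incoming one, the increment has already been applied. One possible reading of that rule as code, with hypothetical names and a single long counter:

```java
// Illustrative: an immutable counter row carrying the last applied transaction ID.
class CounterRow {
    final long total;
    final long txId;

    CounterRow(long total, long txId) {
        this.total = total;
        this.txId = txId;
    }

    // Applies the increment unless this row already carries the same transaction ID.
    CounterRow merge(long increment, long incomingTxId) {
        if (incomingTxId == txId) {
            return this; // same transaction seen before: skip
        }
        return new CounterRow(total + increment, incomingTxId);
    }
}
```

Applying the same transaction twice leaves the total unchanged, so a failed batch can be retried without double counting.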
Write rows in mutation batches (of 400)
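Batching amounts to cutting the list of pending rows into chunks of at most 400 and sending each chunk as one mutation. A small sketch of the chunking step only, with the actual Cassandra write left out; `MutationBatches.partition` is an illustrative name:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative: splits pending rows into batches of at most `batchSize`.
class MutationBatches {
    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            batches.add(new ArrayList<>(rows.subList(start, end)));
        }
        return batches;
    }
}
```

With 1,000 pending rows and a batch size of 400 this yields batches of 400, 400 and 200 rows.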
Things we got wrong
Each CF has 1 MB of heap overhead
Too many column families
Multi-tenancy FTW!
FAIL #1
CLI defaults to replication factor of 1!
Manual operations
Tools and automation FTW!
FAIL #2
No way to undo data loading
No snapshots
Automated snapshots FTW!
FAIL #3
Post-processing of queried data
Timezones
Store data in customer timezone
FAIL #4
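Storing data in the customer's timezone means choosing the hourly column from the event's local time at write time, so no shifting is needed after the query. A sketch of that conversion, assuming the `yyyy-MM-dd.HH` column naming from the earlier examples; the class name and the choice of `Europe/Stockholm` are illustrative:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Illustrative: picks the hourly column name in the customer's own timezone.
class CustomerHours {
    static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyy-MM-dd.HH");

    static String columnFor(Instant eventTime, ZoneId customerZone) {
        return eventTime.atZone(customerZone).format(HOUR);
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2013-09-10T16:30:00Z");
        // Stockholm is at UTC+2 in September, so the event lands in the 18:00 hour.
        System.out.println(columnFor(event, ZoneId.of("Europe/Stockholm"))); // 2013-09-10.18
    }
}
```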
10 TB of data
1500 wps
40,000 rps
Q&A
Apache Cassandra at Videoplaza — Stockholm Cassandra Users — September 2013
Josh Glover, Software Engineer at Videoplaza, will introduce you to the domain of video advertising and show how Videoplaza uses Apache Cassandra as part of a system that solves the difficult problem of allowing clients to analyse the performance of their advertising campaigns in real-time. Videoplaza needs to aggregate data for tens of thousands of combinations of dimensions and metrics for hundreds of clients from an incoming stream of thousands of requests per second, and do it fast enough so that clients can see trends as they happen.

Published in: Technology, Design
Notes
  • OnLine Analytical Processing
  • Relational OnLine Analytical Processing
  • Multidimensional / Materialised OnLine Analytical Processing
  • Time: hours, months, years, etc. Event: views, clicks. Device: iPhone, PC, PS3. Demo: age, gender, income, interests. Location.
  • Create report template -> aggregate ID
  • Tracker publishes to message broker (we use RabbitMQ). If you don’t know anything about messaging, definitely talk to me afterwards! For each event, you will increment a bunch of counters (e.g. ad and event, ad and time and event, etc.)
  • Upgrading to 1.2 this week!
  • 10 MB SSTables
  • Scala-style: key -> value. Explain transaction ID here.
  • The adef tells us which combinations of dimensions are necessary. The dimensions repository contains the dimension values.
  • Read rows one at a time due to Thrift max message size. After we upgrade, we can use the binary CQL client for better performance.
  • Java futures
  • Shard by hashing values of all dimensions except time
  • We can replay data, but we can’t unplay it. No way to decrement a HyperLogLog counter.