Apache Cassandra at Videoplaza — Stockholm Cassandra Users — September 2013
 

Josh Glover, Software Engineer at Videoplaza, will introduce you to the domain of video advertising and show how Videoplaza uses Apache Cassandra as part of a system that solves the difficult problem of allowing clients to analyse the performance of their advertising campaigns in real-time. Videoplaza needs to aggregate data for tens of thousands of combinations of dimensions and metrics for hundreds of clients from an incoming stream of thousands of requests per second, and do it fast enough so that clients can see trends as they happen.

Speaker Notes

  • OnLine Analytical Processing (OLAP)
  • Relational OnLine Analytical Processing (ROLAP)
  • Multidimensional / Materialised OnLine Analytical Processing (MOLAP)
  • Time: hours, months, years, etc. Event: views, clicks. Device: iPhone, PC, PS3. Demography: age, gender, income, interests. Location.
  • Create report template -> aggregate definition ID (adef)
  • Tracker publishes to a message broker (we use RabbitMQ). If you don't know anything about messaging, definitely talk to me afterwards! For each event, you increment a bunch of counters (e.g. ad and event, ad and time and event, etc.); see the fan-out sketch after these notes.
  • Upgrading to 1.2 this week!
  • 10 MB SSTables
  • Columns shown Scala-style: key -> value. Explain transaction ID here.
  • The adef tells us which combinations of dimensions are necessary. The dimensions repository contains the dimension values.
  • Read rows one at a time due to the Thrift maximum message size. After we upgrade, we can use the binary CQL client for better performance.
  • Java futures
  • Shard by hashing the values of all dimensions except time; see the fan-out sketch after these notes.
  • We can replay data, but we can't unplay it. No way to decrement a HyperLogLog counter.
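
The fan-out and sharding notes above can be made concrete with a small sketch. This is a hypothetical illustration rather than Videoplaza's code: the class and method names, the map and list shapes, and the shard count of 32 (inferred from the live00 ... live31 queue names) are assumptions; only the row-key layout (aggregate definition ID | dimension values | time granularity) and the rule of hashing every dimension except time come from the talk.

    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of per-event counter fan-out and shard selection.
    // Only the row-key layout and the "hash everything except time" rule
    // come from the presentation; everything else is illustrative.
    public class CounterFanOut {

        static final int SHARD_COUNT = 32; // live00 ... live31 queues

        /** For one tracked event, build every row key the aggregate definition asks for. */
        static void fanOut(Map<String, String> dimensionValues,      // e.g. {ad=ad1, event=127}
                           List<List<String>> requiredCombinations,  // e.g. [[ad], [ad, event]]
                           String adefId,
                           String granularity) {
            for (List<String> combination : requiredCombinations) {
                StringBuilder dims = new StringBuilder("(");
                for (int i = 0; i < combination.size(); i++) {
                    if (i > 0) dims.append(':');
                    dims.append(dimensionValues.get(combination.get(i)));
                }
                dims.append(')');
                // Row key format from the slides: adefID|dimension values|time granularity
                String rowKey = adefId + "|" + dims + "|" + granularity;
                int shard = shardFor(dims.toString());
                // Publish an increment for rowKey to the "liveNN" queue for this shard;
                // the actual RabbitMQ publishing is omitted here.
                System.out.printf("increment %s via shard %02d%n", rowKey, shard);
            }
        }

        /** Shard by hashing the values of all dimensions except time. */
        static int shardFor(String nonTimeDimensionValues) {
            return Math.floorMod(nonTimeDimensionValues.hashCode(), SHARD_COUNT);
        }
    }

Keeping time out of the shard hash presumably means that every time bucket for a given dimension combination is handled by the same shard, so a single consumer owns each counter row.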

Presentation Transcript

  • Karbon Insight: Realtime Reporting
  • Introduction to ad serving: video player, ad player, distributor, tracker
  • Event tracking: view (event ID 127), click (event ID 128), and many more
  • What do our customers want? Any report they can dream up, right away!
  • Simple report: hour by ad and event
  • Realtime reporting: a multidimensional OLAP cube over ad, event, and time
  • ROLAP with star schema
  • Disadvantages of ROLAP: slow queries, lots of joins, expensive to scale, SQL limitations
  • MOLAP to the rescue!
  • What is a counter?
  • You can’t always get what you want...
  • Possible report dimensions: time, event, ad, device, category, location, tag, demography
  • Many counters: 8 dimensions with an average size of 50 means 50^8 counters! (39 trillion)
  • Average campaign length: 21 days (504 hours)
  • Time flies like a banana: 21 days = 39 trillion counters, 42 days -> 78 trillion, 84 days -> 156 trillion, 365 days -> 677 trillion
  • 5 years down the road: 3.39 quadrillion
  • 3.39 quadrillion is a rather large number indeed: the number of stars in 7,500 galaxies like the Milky Way, or 15% of the surveyed universe!
  • But you might just get what you need!
  • Fake it till you can make it: don't aggregate anything until they ask for it!
  • Time period • By hour • And ad • Views • Clicks
  • Counter Storage
  • Why Cassandra? Fast writes, linear scaling, battle-hardened, (relatively) simple operations, great community!
  • Architecture diagram: Tracker, Flusher, Aggregator, and Merger, connected by RabbitMQ queues (live00 ... live31, flush00 ... flush31, counter00 ... counter31), with Cassandra as the counter store
  • Our setup: DataStax CE 1.1.9, 18-node cluster, 1 datacentre
  • Data model: 1 keyspace (RF: 3), 1 column family, leveled compaction
  • Row keys: aggregate definition ID | dimension values | time granularity
  • Example row keys: adef1|(ad1:127)|hour, adef1|(ad1:128)|hour, adef1|(ad2:127)|hour, ..., adef1|(ad5:128)|day
  • Columns: time value -> counter, transaction ID -> id
  • Example columns: 2013-09-10.18 -> 6348, 2013-09-10.19 -> 9784, txID -> 876219102
  • Columns for rows with no time aggregation: total -> 6348, txID -> 876219102
  • Reading counters (a sketch of this read path follows the transcript)
  • Build row key: adef1|(ad1:127)|hour
  • Prepare query: keyspace.prepareQuery(columnFamily).getKey(rowKey)
  • Column ranges: 2013-09-10.17 ... 2013-09-10.23
  • Execute query asynchronously
  • Get column value: the first byte is the counter type (long, double, HyperLogLog)
  • Writing counters (a sketch of this write path follows the transcript)
  • Flush shards: Flusher 1 handles shards 00-08, ..., Flusher 4 handles shards 24-32, all writing to Cassandra
  • Merge increment rows with the read cache; skip rows with the same transaction ID
  • Write rows in mutation batches (of 400)
  • Things we got wrong
  • FAIL #1: too many column families. Each CF has 1 MB of heap overhead. Multi-tenancy FTW!
  • FAIL #2: manual operations. The CLI defaults to a replication factor of 1! Tools and automation FTW!
  • FAIL #3: no snapshots. No way to undo data loading. Automated snapshots FTW!
  • FAIL #4: timezones. Post-processing of queried data; store data in the customer's timezone instead.
  • 10 TB of data, 1,500 wps, 40,000 rps
  • Q&A
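
A minimal sketch of the read path from the "Reading counters" slides above, assuming the Astyanax Thrift client implied by the keyspace.prepareQuery(columnFamily).getKey(rowKey) snippet. The cluster, seed, keyspace, and column family names, and the type-tag value used for plain longs, are placeholders; the row-key format, the hour-bucket column range, the asynchronous execution, and the "first byte is the counter type" convention come from the slides.

    import java.nio.ByteBuffer;

    import com.google.common.util.concurrent.ListenableFuture;
    import com.netflix.astyanax.AstyanaxContext;
    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
    import com.netflix.astyanax.connectionpool.OperationResult;
    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
    import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
    import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
    import com.netflix.astyanax.model.Column;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.ColumnList;
    import com.netflix.astyanax.serializers.StringSerializer;
    import com.netflix.astyanax.thrift.ThriftFamilyFactory;

    public class CounterReadSketch {

        // Keyspace and CF names are placeholders; the talk only says "1 keyspace, 1 column family".
        static final ColumnFamily<String, String> COUNTERS =
                new ColumnFamily<>("counters", StringSerializer.get(), StringSerializer.get());

        public static void main(String[] args) throws Exception {
            AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                    .forCluster("ReportingCluster")                      // placeholder
                    .forKeyspace("reporting")                            // placeholder
                    .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                            .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE))
                    .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("pool")
                            .setPort(9160)
                            .setMaxConnsPerHost(4)
                            .setSeeds("127.0.0.1:9160"))                 // placeholder seed
                    .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                    .buildKeyspace(ThriftFamilyFactory.getInstance());
            context.start();
            Keyspace keyspace = context.getClient();

            // Build the row key: aggregate definition ID | dimension values | time granularity.
            String rowKey = "adef1|(ad1:127)|hour";

            // Prepare the query over a range of hour buckets and execute it asynchronously.
            ListenableFuture<OperationResult<ColumnList<String>>> future =
                    keyspace.prepareQuery(COUNTERS)
                            .getKey(rowKey)
                            .withColumnRange("2013-09-10.17", "2013-09-10.23", false, 100)
                            .executeAsync();

            ColumnList<String> columns = future.get().getResult();
            for (Column<String> column : columns) {
                byte[] raw = column.getByteArrayValue();
                byte counterType = raw[0]; // first byte encodes the counter type (long, double, HLL)
                if (counterType == 0) {    // the tag value for "long" is an assumption
                    long value = ByteBuffer.wrap(raw, 1, 8).getLong();
                    System.out.println(column.getName() + " -> " + value);
                }
            }

            context.shutdown();
        }
    }

Fetching one row per query, as the notes mention, keeps each response under the Thrift maximum message size; the notes also mention switching to the binary CQL client for better performance after the 1.2 upgrade.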
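
A similarly hedged sketch of the write path from the "Writing counters" slides: merge incoming increments against a read cache, skip rows whose transaction ID has already been applied, and flush in mutation batches of 400 rows. The cache shapes, the encoding helper, the tag value for longs, and the assumption that one transaction ID covers a whole flush message are illustrative; the batch size, the per-row txID column, and the skip-on-duplicate rule come from the slides.

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.MutationBatch;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.serializers.StringSerializer;

    // Hypothetical sketch of the merger's write path.
    public class CounterWriteSketch {

        static final int BATCH_SIZE = 400; // mutation batches of 400 rows, per the slides

        static final ColumnFamily<String, String> COUNTERS =
                new ColumnFamily<>("counters", StringSerializer.get(), StringSerializer.get());

        private final Keyspace keyspace;
        // Read cache: row key -> last applied transaction ID (the shape is an assumption).
        private final Map<String, Long> lastAppliedTxId = new HashMap<>();
        // Read cache: row key + column -> current counter value (the shape is an assumption).
        private final Map<String, Long> currentValues = new HashMap<>();

        CounterWriteSketch(Keyspace keyspace) {
            this.keyspace = keyspace;
        }

        /** Apply one flush message: increments grouped as rowKey -> (column -> delta). */
        void applyIncrements(long txId, Map<String, Map<String, Long>> incrementsByRow) throws Exception {
            MutationBatch batch = keyspace.prepareMutationBatch();
            int rowsInBatch = 0;

            for (Map.Entry<String, Map<String, Long>> row : incrementsByRow.entrySet()) {
                String rowKey = row.getKey();

                // Skip rows whose stored transaction ID matches: they were already applied,
                // which is what makes replaying data safe.
                Long applied = lastAppliedTxId.get(rowKey);
                if (applied != null && applied == txId) {
                    continue;
                }

                for (Map.Entry<String, Long> col : row.getValue().entrySet()) {
                    String cacheKey = rowKey + "#" + col.getKey();
                    long merged = currentValues.getOrDefault(cacheKey, 0L) + col.getValue();
                    currentValues.put(cacheKey, merged);
                    batch.withRow(COUNTERS, rowKey).putColumn(col.getKey(), encodeLong(merged), null);
                }
                batch.withRow(COUNTERS, rowKey).putColumn("txID", encodeLong(txId), null);
                lastAppliedTxId.put(rowKey, txId);

                if (++rowsInBatch == BATCH_SIZE) {
                    batch.execute();
                    batch = keyspace.prepareMutationBatch();
                    rowsInBatch = 0;
                }
            }
            if (rowsInBatch > 0) {
                batch.execute();
            }
        }

        /** Encode a long counter with a leading type byte (the tag value 0 is an assumption). */
        static byte[] encodeLong(long value) {
            return ByteBuffer.allocate(9).put((byte) 0).putLong(value).array();
        }
    }

Writing merged absolute values together with the row's transaction ID is what makes replaying data safe, matching the "we can replay data, but we can't unplay it" note above.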