Cassandra at Digby


  1. Cassandra at Digby
     Cody Koeninger (ckoeninger@digby.com)
  2. Localpoint Architecture
     ● Localpoint In-App SDK: Location Algorithm – Opt-in – Push – Rich Message – Message Management
     ● Localpoint Cloud Messaging: Identity – Attributes – Location History – Campaign History
     ● Campaign Management (Push, Triggered) – Mobile Offer Management – Campaign Reporting
     ● Location API: Accuracy, Power, Privacy Optimization – Geofence Management – Cross-OS, Cross-Device
     ● Analytics / Events Engine: Profiles – Visits – Dwell Time – Frequency – Occupancy
     ● Campaign API – Real-Time API (Publish/Subscribe) – CRM API – Analytics Engine API – Transaction Record Export
     ● Web Console: Create/Manage
     © 2013 Digby. CONFIDENTIAL
  3. Why Cassandra?
     ● Somewhat of a greenfield project: add market segmentation (aka “Profiles”) to our existing geolocation / messaging infrastructure
     ● Horizontal scalability
     ● Homogeneous deployment, less ops pain
     ● No pre-existing investment in Hadoop
     ● Data model matches our problem
  4. Devices
     ● Android and iOS mobile devices
     ● Unique ID
     ● Other parts of the codebase handle geolocation. Here we're concerned primarily with device as an ID
     ● ~Millions of devices
  5. Attributes
     ● Arbitrary key-value pairs associated to devices
     ● Defined by marketers and app developers
     ● String, boolean, integer, date
     ● Encrypted due to PII concerns
     ● e.g. birthdate: 1989-01-01, ownsPs3: true
     ● ~100 attributes
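The slides say only that attribute values are encrypted for PII reasons, not how. One plausible sketch, using the standard JCE AES/GCM cipher with a random IV prepended to the ciphertext (the class name, key handling, and wire format here are assumptions, not Digby's actual scheme):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical attribute-value encryption: AES/GCM with a random 12-byte IV
// stored in front of the ciphertext so each blob is self-contained.
public class AttributeCrypto {
    private static final SecureRandom RNG = new SecureRandom();

    public static SecretKey newKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            return kg.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static byte[] encrypt(SecretKey key, String value) {
        try {
            byte[] iv = new byte[12];
            RNG.nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] ct = c.doFinal(value.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static String decrypt(SecretKey key, byte[] blob) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(128, Arrays.copyOfRange(blob, 0, 12)));
            byte[] pt = c.doFinal(Arrays.copyOfRange(blob, 12, blob.length));
            return new String(pt, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that the random IV means two encryptions of the same value produce different blobs, which also means encrypted attributes cannot be compared server-side; profile evaluation has to happen after decryption.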
  6. Profiles
     ● Market segmentation on attributes of devices
     ● Boolean expressions comparing to a fixed value
     ● Combined via Boolean 'and', aka set intersection. No 'or'
     ● e.g. wantsPs4: birthdate >= 1978-01-01 && ownsPs3 == true && ownsPs4 == false
     ● May be defined long after attributes are defined
     ● ~100 profiles
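Because profiles are an 'and' of per-attribute comparisons with no 'or', membership testing is just "every clause matches". A minimal sketch in plain Java (class and method names are illustrative, not Digby's code; attributes are modeled as strings for brevity):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// A profile is a conjunction of clauses, each comparing one attribute
// to a fixed value. A device is in the profile iff every clause matches.
public class ProfileSketch {
    public record Clause(String attr, Predicate<String> test) {}

    // wantsPs4: birthdate >= 1978-01-01 && ownsPs3 == true && ownsPs4 == false
    public static final List<Clause> WANTS_PS4 = List.of(
            new Clause("birthdate", v -> v.compareTo("1978-01-01") >= 0),
            new Clause("ownsPs3", "true"::equals),
            new Clause("ownsPs4", "false"::equals));

    public static boolean inProfile(Map<String, String> attrs, List<Clause> profile) {
        // 'and' only, no 'or': every clause must match a present attribute
        return profile.stream().allMatch(
                c -> attrs.containsKey(c.attr()) && c.test().test(attrs.get(c.attr())));
    }

    public static void main(String[] args) {
        Map<String, String> device = Map.of(
                "birthdate", "1989-01-01", "ownsPs3", "true", "ownsPs4", "false");
        System.out.println(inProfile(device, WANTS_PS4)); // prints true
    }
}
```

With ~100 profiles and ~100 attributes, evaluating all profiles for one device this way is cheap, which is what makes the query plans on the later slides workable.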
  7. Data Modeling
     ● For nonrelational data stores, you need to know what your queries are before you store data
     ● Probably true of relational databases as well, but they let you get away with it
     ● Answering queries via primary key is ideal
     ● Cassandra has 2 parts to a primary key lookup: partitioning (by hash), then clustering (by order)
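The two-part lookup can be modeled in a few lines: hash the partition key to find who owns the data, then keep rows inside the partition in clustering order. A toy sketch (the hash is a stand-in for Murmur3 token ownership; all names here are illustrative):

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model of Cassandra's primary key: partition key -> hashed to a node,
// clustering key -> sort order of rows within the partition.
public class KeyModel {
    // stand-in for Murmur3 token ownership: which node owns this partition
    public static int nodeFor(String partitionKey, int numNodes) {
        return Math.floorMod(partitionKey.hashCode(), numNodes);
    }

    // rows per partition, ordered by clustering column (descending here,
    // matching "clustering order by (unixtime desc)" on a later slide)
    private final Map<String, TreeMap<Long, String>> partitions = new HashMap<>();

    public void insert(String partitionKey, long clusteringKey, String value) {
        partitions.computeIfAbsent(partitionKey,
                        k -> new TreeMap<>(Comparator.reverseOrder()))
                .put(clusteringKey, value);
    }

    // first row in clustering order, i.e. what "limit 1" would return
    public String first(String partitionKey) {
        TreeMap<Long, String> rows = partitions.get(partitionKey);
        return rows == null ? null : rows.firstEntry().getValue();
    }
}
```

The design consequence: a query is fast exactly when it names a full partition key and then reads along clustering order, which is why the following slides choose keys per query.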
  8. Use Case 1: Triggered Messaging
     ● When a device breaches a geofence, check to see if it is in a profile, then send a promotion
     ● e.g. if a device is near a store and is in the wantsPs4 profile, tell it there are Ps4s in stock
     ● Latency is important
     ● Query: Given a device, which profiles is it in?
  9. Use Case 2: Scheduled Messaging
     ● At some date and time, find all the devices in a given profile, and send them a promotion
     ● e.g. send all devices in the wantsPs4 profile a message telling them Ps4 is out of stock for months, but Xbox One is on sale cheap
     ● Throughput is more important than latency
     ● Query: Given a profile, which devices are in it?
  10. Use Case 3: Historical Analytics
     ● Marketers may want to analyze past data based on attributes that were known at that time, but not included in profiles at that time
     ● In other words, we need to know raw facts (attributes), not just derived conclusions (profile membership)
     ● Query: Given a device and time, what were the attributes for that device at that time?
  11. Brainstorming
     ● Need to answer 3 questions:
       ● given Device, get Profiles
       ● given Profile, get Devices
       ● given (Device, Time), get Attributes
  12. given (Device, Time), get Attributes

     create table attributes (
       brandCode ascii,
       deviceId ascii,
       unixtime bigint,
       attrs blob,
       primary key ((brandCode, deviceId), unixtime)
     ) with compact storage
       and clustering order by (unixtime desc)

     select attrs from attributes
     where brandCode = ? and deviceId = ? and unixtime <= ?
     limit 1
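The combination of "unixtime <= ?", descending clustering order, and "limit 1" returns the newest attribute snapshot at or before the requested time. A natural-order TreeMap's floorEntry gives the same answer, which makes the semantics easy to check in isolation (this class is a sketch for illustration, not production code):

```java
import java.util.Map;
import java.util.TreeMap;

// Models one (brandCode, deviceId) partition of the attributes table:
// snapshots keyed by unixtime, queried "as of" a point in time.
public class AsOfAttributes {
    private final TreeMap<Long, String> snapshots = new TreeMap<>();

    public void record(long unixtime, String attrs) {
        snapshots.put(unixtime, attrs);
    }

    // equivalent of: select attrs ... where unixtime <= ? limit 1
    // (with clustering order by unixtime desc)
    public String asOf(long unixtime) {
        Map.Entry<Long, String> e = snapshots.floorEntry(unixtime);
        return e == null ? null : e.getValue();
    }
}
```

This is what makes Use Case 3 work: old snapshots are never overwritten, so the table answers "what did we know about this device at time T" for any T.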
  13. given Device, get Profiles

     select attrs from attributes
     where brandCode = ? and deviceId = ?
     limit 1

     Then, in code, filter the (relatively small) set of profiles based on whether the attrs match each one
  14. given Profile, get Devices

     create table profile_devices (
       brandCode ascii,
       profileId bigint,
       deviceId ascii,
       primary key ((brandCode, profileId), deviceId)
     ) with compact storage

     select deviceId from profile_devices
     where brandCode = ? and profileId = ?
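profile_devices is a reverse index: one partition per (brand, profile), with deviceId as the clustering column, so the query is a single-partition scan that streams device IDs in sorted order. The slides don't spell out the write path, but the two tables imply that when a device's attributes change, its memberships are recomputed and this index updated. A sketch of that shape (names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

// Models the profile_devices reverse index: profile partition -> device IDs
// in clustering (sorted) order. setMembership stands in for the writes made
// when a device's recomputed profile memberships change.
public class ProfileIndex {
    private final Map<String, NavigableSet<String>> byProfile = new HashMap<>();

    public void setMembership(String profileKey, String deviceId, boolean in) {
        NavigableSet<String> devices =
                byProfile.computeIfAbsent(profileKey, k -> new TreeSet<>());
        if (in) devices.add(deviceId); else devices.remove(deviceId);
    }

    // equivalent of: select deviceId from profile_devices where ... ,
    // a full scan of one partition
    public NavigableSet<String> devicesIn(String profileKey) {
        return byProfile.getOrDefault(profileKey, new TreeSet<>());
    }
}
```

This denormalization trades extra writes for the throughput-oriented read in Use Case 2: scheduled messaging reads one partition instead of scanning every device.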
  15. Why Spark?
     ● Scala
     ● Distributed computing that will interop with Hadoop IO (and thus Cassandra), but doesn't depend on HDFS
     ● Approachable codebase (20kloc, vs 200kloc+)
     ● Interactive shell
     ● Fast to write, fast to run
  16. Why Spark?

     val file = spark.textFile("hdfs://...")
     file.flatMap(line => line.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
  17. Deployment
     ● http://spark.incubator.apache.org/docs/0.8.1/spark-standalone.html
     ● Spark worker processes on Cassandra storage nodes
       ● Gives data locality
     ● Spark master process on Cassandra monitoring machine
     ● Cluster start/stop done via ssh key from master
     ● Submit jobs to master url
     ● Consider pre-installing dependency jars on workers
     ● Must use exact same binary version of Scala throughout
  18. Spark / Cassandra Interop

     // from CassandraTest.scala in the Spark distro
     val casRdd = sc.newAPIHadoopRDD(job.getConfiguration(),
       classOf[ColumnFamilyInputFormat],
       classOf[ByteBuffer],
       classOf[SortedMap[ByteBuffer, IColumn]])

     // Let us first get all the paragraphs from the retrieved rows
     val paraRdd = casRdd.map {
       case (key, value) =>
         ByteBufferUtil.string(value.get(ByteBufferUtil.bytes("para")).value())
     }

     // Lets get the word count in paras
     val counts = paraRdd.flatMap(p => p.split(" ")).
       map(word => (word, 1)).
       reduceByKey(_ + _)

     counts.collect().foreach {
       case (word, count) => println(word + ":" + count)
     }
  19. Spark Resources
     ● Project homepage: http://spark.incubator.apache.org/
     ● AMP Camp tutorials: http://ampcamp.berkeley.edu/
     ● Introduction to Spark internals: http://www.youtube.com/watch?v=49Hr5xZyTEA
