Cassandra at Digby
Published in: Technology, Business

  • Short Script:
    “All of this is made possible by the advanced technology we’ve made available in the Digby Mobile Suite, an enterprise-grade and PCI certified SaaS platform that is our focus as a company. Our customer, using Digby Services or in self-implementation mode, can use each of these products, in blue, to support the building of applications. Each of them is modular and works with the others, all of them connected to the base platform and a collection of shared services and integration points. The Digby Mobile Console, as mentioned before, is the place where customers can manage the products they have deployed and access relevant analytics both within each product and across the entire solution.
    This Digby Mobile Suite allows for the deployment of powerful mobile websites and rich applications quickly, efficiently, and with less risk than any custom-built work. It handles cross-platform differences elegantly. And in a space that is constantly changing and innovating, each of these products has its own roadmap where we continue to handle any platform changes and bring innovations to market that make the products more powerful over time. Additionally, future products mean that customers can extend the capabilities of their mobile footprint even more widely, ensuring they are keeping pace with consumer expectations.”
  • Transcript

    • 1. Cassandra at Digby Cody Koeninger
    • 2. Localpoint Architecture [diagram; labels in extraction order] Localpoint In-App SDK; Location Algorithm – Opt-in – Push – Rich Message – Message Management; Localpoint Cloud; Messaging; Identity • Attributes • Location History • Campaign History; Campaign Management (Push, Triggered) – Mobile Offer Management – Campaign Reporting; Create/Manage Location; Location API; Accuracy, Power, Privacy Optimization – Geofence Management – Cross-OS, Cross-Device; Create/Manage Analytics / Events Engine Profiles; Campaign API; Real-Time API; Visits – Dwell Time – Frequency – Occupancy; Publish/Subscribe • CRM API; Analytics Engine API; Transaction Record Export; Web Console. © 2013 Digby. CONFIDENTIAL
    • 3. Why Cassandra?
      ● Somewhat of a greenfield project: add market segmentation (aka “Profiles”) to our existing geolocation / messaging infrastructure
      ● Horizontal scalability
      ● Homogeneous deployment, less ops pain
      ● No pre-existing investment in Hadoop
      ● Data model matches our problem
    • 4. Devices
      ● Android and iOS mobile devices
      ● Unique ID
      ● Other parts of the codebase handle geolocation. Here we're concerned primarily with device as an ID
      ● ~Millions of devices
    • 5. Attributes
      ● Arbitrary key-value pairs associated to devices
      ● Defined by marketers and app developers
      ● String, boolean, integer, date
      ● Encrypted due to PII concerns
      ● e.g. birthdate: 1989-01-01, ownsPs3: true
      ● ~100 attributes
    • 6. Profiles
      ● Market segmentation on attributes of devices
      ● Boolean expressions comparing to a fixed value
      ● Combined via Boolean 'and', aka set intersection. No 'or'
      ● e.g. wantsPs4: birthdate >= 1978-01-01 && ownsPs3 == true && ownsPs4 == false
      ● May be defined long after attributes are defined
      ● ~100 profiles
    • 7. Data Modeling
      ● For nonrelational data stores, you need to know what your queries are before you store data
      ● Probably true of relational databases as well, but they let you get away with it
      ● Answering queries via primary key is ideal
      ● Cassandra has 2 parts to a primary key lookup: partitioning (by hash), then clustering (by order)
    • 8. Use Case 1: Triggered Messaging
      ● When a device breaches a geofence, check to see if it is in a profile, then send a promotion
      ● e.g. device is near a store, and is in the wantsPs4 profile, tell it there are Ps4s in stock
      ● Latency is important
      ● Query: Given a device, which profiles is it in?
    • 9. Use Case 2: Scheduled Messaging
      ● At some date and time, find all the devices in a given profile, and send them a promotion
      ● e.g. send all devices in the wantsPs4 profile a message telling them Ps4 is out of stock for months, but Xbox One is on sale cheap
      ● Throughput is more important than latency
      ● Query: Given a profile, which devices are in it?
    • 10. Use Case 3: Historical Analytics
      ● Marketers may want to analyze past data based on attributes that were known at that time, but not included in profiles at that time
      ● In other words, we need to know raw facts (attributes), not just derived conclusions (profile membership)
      ● Query: Given a device and time, what were the attributes for that device at that time?
    • 11. Brainstorming
      ● Need to answer 3 questions:
        ● given Device, get Profiles
        ● given Profile, get Devices
        ● given (Device, Time), get Attributes
    • 12. given (Device, Time), get Attributes
        create table attributes (
          brandCode ascii,
          deviceId ascii,
          unixtime bigint,
          attrs blob,
          primary key ((brandCode, deviceId), unixtime)
        ) with compact storage
          and clustering order by (unixtime desc)

        select attrs from attributes
        where brandCode = ? and deviceId = ? and unixtime <= ?
        limit 1
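
The as-of query on slide 12 can be mimicked in memory, which may help show why the clustering order matters. This is a minimal sketch, not Digby's code: `AttributeStore`, `put`, and `asOf` are hypothetical names, a `Map` stands in for hash partitioning on (brandCode, deviceId), a `TreeMap` stands in for rows clustered by unixtime, and `attrs` is a plain string rather than an encrypted blob.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical in-memory stand-in for the `attributes` table.
object AttributeStore {
  // Partition key: (brandCode, deviceId). Cassandra hashes this to pick a node.
  type PartitionKey = (String, String)

  // Within a partition, rows stay sorted by the clustering column (unixtime).
  type Partition = TreeMap[Long, String]

  type Store = Map[PartitionKey, Partition]

  val empty: Store = Map.empty

  // Record one attribute snapshot for a device at a given time.
  def put(store: Store, brandCode: String, deviceId: String,
          unixtime: Long, attrs: String): Store = {
    val key = (brandCode, deviceId)
    val partition = store.getOrElse(key, TreeMap.empty[Long, String])
    store.updated(key, partition.updated(unixtime, attrs))
  }

  // Mirrors: where brandCode = ? and deviceId = ? and unixtime <= ? limit 1
  // Returns the newest snapshot taken at or before `at`, if any.
  def asOf(store: Store, brandCode: String, deviceId: String,
           at: Long): Option[String] =
    store.get((brandCode, deviceId)).flatMap { partition =>
      // Linear scan in ascending time order, kept simple for illustration;
      // a real store would seek within the sorted partition instead.
      val atOrBefore = partition.iterator.takeWhile { case (t, _) => t <= at }.toList
      atOrBefore.lastOption.map(_._2)
    }
}
```

With snapshots stored at times 100 and 200, `asOf(..., 150L)` returns the snapshot from time 100, which is the behaviour the historical-analytics use case needs.
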
    • 13. given Device, get Profiles
        select attrs from attributes
        where brandCode = ? and deviceId = ?
        limit 1
      Then, in code, filter the (relatively small) set of profiles based on whether attrs match it
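
The slides do not show the in-code filtering step. One way it might look is sketched below, under stated assumptions: every name (`ProfileMatch`, `Condition`, `Profile`, `profilesFor`) is hypothetical, attributes are assumed to be already decrypted into typed values, the operator set (`Eq`, `Ge`, `Le`) is a guess, and a missing attribute is treated as a failed condition.

```scala
import java.time.LocalDate

object ProfileMatch {
  // Attribute values, using the four types listed on slide 5.
  sealed trait Value
  final case class Str(v: String) extends Value
  final case class Bool(v: Boolean) extends Value
  final case class Num(v: Long) extends Value
  final case class Date(v: LocalDate) extends Value

  // Comparison operators allowed in a condition (an assumed set).
  sealed trait Op
  case object Eq extends Op
  case object Ge extends Op
  case object Le extends Op

  // One condition compares a named attribute to a fixed value.
  final case class Condition(attr: String, op: Op, fixed: Value)

  // A profile is the conjunction ('and') of its conditions; there is no 'or'.
  final case class Profile(name: String, conditions: List[Condition])

  // Order two values of the same type; None when the types differ.
  private def compare(a: Value, b: Value): Option[Int] = (a, b) match {
    case (Str(x), Str(y))   => Some(x.compareTo(y))
    case (Bool(x), Bool(y)) => Some(java.lang.Boolean.compare(x, y))
    case (Num(x), Num(y))   => Some(java.lang.Long.compare(x, y))
    case (Date(x), Date(y)) => Some(x.compareTo(y))
    case _                  => None
  }

  // A condition holds only if the attribute exists, has the same type as
  // the fixed value, and satisfies the operator.
  private def holds(c: Condition, attrs: Map[String, Value]): Boolean =
    attrs.get(c.attr).flatMap(compare(_, c.fixed)).exists { cmp =>
      c.op match {
        case Eq => cmp == 0
        case Ge => cmp >= 0
        case Le => cmp <= 0
      }
    }

  // Given one device's attributes, return the names of the profiles it is in.
  def profilesFor(attrs: Map[String, Value], profiles: List[Profile]): List[String] =
    profiles.filter(p => p.conditions.forall(holds(_, attrs))).map(_.name)
}
```

With roughly 100 profiles per brand, a linear pass like this is cheap per device, which suits the latency-sensitive triggered-messaging case.
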
    • 14. given Profile, get Devices
        create table profile_devices (
          brandCode ascii,
          profileId bigint,
          deviceId ascii,
          primary key ((brandCode, profileId), deviceId)
        ) with compact storage

        select deviceId from profile_devices
        where brandCode = ? and profileId = ?
    • 15. Why Spark?
      ● Scala
      ● Distributed computing that will interop with Hadoop IO (and thus Cassandra), but doesn't depend on HDFS
      ● Approachable codebase (20kloc, vs 200kloc+)
      ● Interactive shell
      ● Fast to write, fast to run
    • 16. Why Spark?
        // word count; `spark` is the SparkContext
        val file = spark.textFile("hdfs://...")
        file.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
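
To see what the word-count pipeline on slide 16 computes without a cluster, the same steps can be run on ordinary Scala collections. This is a sketch for illustration only: `WordCountLocal` and `wordCount` are hypothetical names, and `groupBy` plus a per-key sum stands in for Spark's `reduceByKey`, which the standard collections do not provide.

```scala
object WordCountLocal {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))    // one element per word
      .map(word => (word, 1))              // pair each word with a count of 1
      .groupBy { case (word, _) => word }  // gather the pairs for each word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // add up the 1s
}
```

The shape of the code is nearly identical to the Spark version, which is part of the "fast to write" argument: the distributed job reads like local collection code.
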
    • 17. Deployment
      ● Spark worker processes on Cassandra storage nodes
      ● Gives data locality
      ● Spark master process on Cassandra monitoring machine
      ● Cluster start/stop done via ssh key from master
      ● Submit jobs to master url
      ● Consider pre-installing dependency jars on workers
      ● Must use exact same binary version of Scala throughout
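    • 18. Spark / Cassandra Interop
        // from CassandraTest.scala in the Spark distro
        // excerpt: `sc` is the SparkContext and `job` is a Hadoop Job
        // configured for Cassandra input (setup not shown on the slide)
        import java.nio.ByteBuffer
        import java.util.SortedMap
        import org.apache.cassandra.db.IColumn
        import org.apache.cassandra.hadoop.ColumnFamilyInputFormat
        import org.apache.cassandra.utils.ByteBufferUtil

        val casRdd = sc.newAPIHadoopRDD(job.getConfiguration(),
          classOf[ColumnFamilyInputFormat],
          classOf[ByteBuffer],
          classOf[SortedMap[ByteBuffer, IColumn]])

        // Let us first get all the paragraphs from the retrieved rows
        val paraRdd = casRdd.map {
          case (key, value) =>
            ByteBufferUtil.string(value.get(ByteBufferUtil.bytes("para")).value())
        }

        // Let's get the word count in paras
        val counts = paraRdd.flatMap(p => p.split(" ")).
          map(word => (word, 1)).
          reduceByKey(_ + _)

        counts.collect().foreach {
          case (word, count) => println(word + ":" + count)
        }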
    • 19. Spark Resources
      ● Project homepage
      ● AMP Camp tutorials
      ● Introduction to Spark internals