NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"

Before we get into the heavy stuff,
let's imagine hacking around with C* for a bit...

You run a large video website
● CREATE TABLE videos (
    videoid uuid,
    videoname varchar,
    username varchar,
    description varchar,
    tags varchar,
    upload_date timestamp,
    PRIMARY KEY (videoid, videoname) );
● INSERT INTO videos (videoid, videoname, username, description, tags, upload_date)
    VALUES (99051fe9-6a9c-46c2-b949-38ef78858dd0, 'My funny cat', 'tcodd',
    'My cat likes to play the piano! So funny.', 'cats,piano,lol', '2012-06-01 08:00:00');

You have a bajillion users
● CREATE TABLE users (
    username varchar,
    firstname varchar,
    lastname varchar,
    email varchar,
    password varchar,
    created_date timestamp,
    PRIMARY KEY (username) );
● INSERT INTO users (username, firstname, lastname, email, password, created_date)
    VALUES ('tcodd', 'Ted', 'Codd', 'tcodd@relational.com',
    '5f4dcc3b5aa765d61d8327deb882cf99', '2011-06-01 08:00:00');

That's great! Then you ask
yourself...

● Can I slice a slice (a subquery)?
● Can I use advanced WHERE clauses?
● Can I union two slices server side?
● Can I join data from two tables without two
request/response round trips?
● What about procedures?
● Can I write functions or aggregation functions?

Let's look at the APIs we have

http://www.slideshare.net/aaronmorton/apachecon-nafeb2013

But none of those APIs do what I
want, and it seems simple
enough to do...

IntraVert joins the “party”
at the API layer

Why not just do it client side?
● Move processing close to data
– Idea borrowed from Hadoop
● Doing work close to the source can result in:
– Less network IO
– Less memory spent encoding/decoding 'throwaway' data
– New storage and access paradigms

Vert.x + Cassandra
● What is Vert.x?
– Distributed event bus which spans the server and
even penetrates into the client side for effortless
'real-time' web applications
● What are the cool features?
– Asynchronous
– Hot re-loadable modules
– Modules can be written in Groovy, Ruby, Java, JavaScript

http://vertx.io

Transport, payload, and
batching

HTTP Transport
● HTTP is easy to use on firewalled networks
● Easy to secure
● Easy to compress
● The de facto way to do everything anyway
● IntraVert attempts to limit round-trips
– The goal is not a terse binary format

JSON Payload
● Simple nested types like list, map, String
● Request is composed of N operations
● Each operation has a configurable timeout
● Again, IntraVert attempts to limit round-trips
– The goal is not a terse message format

Why not use lightning-fast transport
and serialization library X?
● X's language/code gen issues
● You probably cannot tcpdump X
● Net-admins don't like 90 jars for health checks
● IntraVert attempts to limit round-trips:
– Prepared statements
– Server side filtering
– Other cool stuff

Sample request and response
● Request
{"e": [
  {
    "type": "SETKEYSPACE",
    "op": { "keyspace": "myks" }
  }, {
    "type": "SETCOLUMNFAMILY",
    "op": { "columnfamily": "mycf" }
  }, {
    "type": "SLICE",
    "op": {
      "rowkey": "beers",
      "start": "Allagash",
      "end": "Sierra Nevada",
      "size": 9
    }
  }
]}
● Response
{
  "exception": null,
  "exceptionId": null,
  "opsRes": {
    "0": "OK",
    "1": "OK",
    "2": [{
      "name": "Founders",
      "value": "Breakfast Stout"
    }]
  }
}

Imagine your data looks like...
{ "rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel" }
{ "rowkey": "beers", "name": "Founders", "value": "Breakfast Stout" }
{ "rowkey": "beers", "name": "Dogfish Head", "value": "Hellhound IPA" }

Application requirement
● A user request asks which beers are
“Breakfast Stout”s
● Common “solutions”:
– Write a copy of the data sorted by type
– Request all the data and parse on client side

Using an IntraVert filter
● Send a function to the server
● Function is applied to subsequent get or slice
operations
● Only results of the filter are returned to the
client

Defining a filter: JavaScript
● Syntax to create a filter
{
  "type": "CREATEFILTER",
  "op": {
    "name": "stouts",
    "spec": "javascript",
    "value": "function(row) { if (row['value'] == 'Breakfast Stout') return row; else return null; }"
  }
},

Defining a filter: Groovy/Java

● We can define a Groovy closure or a Java filter
{
  "type": "CREATEFILTER",
  "op": {
    "name": "stouts",
    "spec": "groovy",
    "value": "{ row -> if (row['value'] == 'Breakfast Stout') return row else return null }"
  }
},
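
To make the filter contract concrete (return the row to keep it, return null to drop it), here is a minimal plain-Java sketch that simulates what the "stouts" filter does to the beers slice. It is not IntraVert code: the class name `FilterSketch`, the `applyFilter` helper, and the use of `java.util.function.Function` are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FilterSketch {
    /** Hypothetical stand-in for the "stouts" filter: return the row to keep it, null to drop it. */
    public static final Function<Map<String, String>, Map<String, String>> STOUTS =
            row -> "Breakfast Stout".equals(row.get("value")) ? row : null;

    /** Applies a filter to every column of a slice and keeps only the non-null results. */
    public static List<Map<String, String>> applyFilter(
            List<Map<String, String>> slice,
            Function<Map<String, String>, Map<String, String>> filter) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : slice) {
            Map<String, String> result = filter.apply(row);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    private static Map<String, String> column(String name, String value) {
        Map<String, String> col = new LinkedHashMap<>();
        col.put("name", name);
        col.put("value", value);
        return col;
    }

    /** The three "beers" columns from the slides. */
    public static List<Map<String, String>> sampleSlice() {
        List<Map<String, String>> slice = new ArrayList<>();
        slice.add(column("Allagash", "Allagash Tripel"));
        slice.add(column("Dogfish Head", "Hellhound IPA"));
        slice.add(column("Founders", "Breakfast Stout"));
        return slice;
    }

    public static void main(String[] args) {
        // Only the columns that survive the filter would travel back to the client.
        System.out.println(applyFilter(sampleSlice(), STOUTS));
    }
}
```

The point of running this on the server rather than the client is that the two dropped columns are never serialized or sent over the network.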

Common filter use cases
● Transform data
● Prune columns/rows like a where clause
● Extract data from complex fields (JSON, XML,
protobuf, etc.)

It's the cure for your “Redis envy”

Imagine your data looks like...
● { "row key": "1", name: "a", val... }
● { "row key": "1", name: "b", val... }
● { "row key": "4", name: "a", val... }
● { "row key": "4", name: "z", val... }

Application Requirements
● User wishes to intersect the column names of
two slices/queries
● Common “solutions”
– Pull all results to client and apply the intersection
there

Server Side MultiProcessor
● Send a class that implements the MultiProcessor
interface to the server
● public List<Map> multiProcess(Map<Integer,Object> input, Map params);
● Do one or more get/slice operations as input
● Invoke MultiProcessor on input

Multi-processor use cases
● Union of N slices
● Intersection of N slices
● Some “Join” scenarios
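
As an illustration of the intersection use case, here is a minimal sketch of a processor with the `multiProcess` signature shown above. The nested `MultiProcessor` interface is a local stand-in declared only so the example compiles on its own; it is not the real IntraVert class. The sketch also assumes that each entry of `input` is the `List<Map>` result of one slice and that every column map carries a "name" key, which the slides suggest but do not state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IntersectSketch {
    /** Local stand-in mirroring the slide's signature; not the real IntraVert interface. */
    public interface MultiProcessor {
        List<Map> multiProcess(Map<Integer, Object> input, Map params);
    }

    /** Keeps only the column names that appear in every input slice. */
    public static class IntersectNames implements MultiProcessor {
        @Override
        @SuppressWarnings("unchecked")
        public List<Map> multiProcess(Map<Integer, Object> input, Map params) {
            Set<Object> common = null;
            for (Object result : input.values()) {
                Set<Object> names = new LinkedHashSet<>();
                for (Map column : (List<Map>) result) { // assumed shape of a slice result
                    names.add(column.get("name"));
                }
                if (common == null) {
                    common = names;           // first slice seeds the candidate set
                } else {
                    common.retainAll(names);  // later slices narrow it down
                }
            }
            List<Map> out = new ArrayList<>();
            if (common != null) {
                for (Object name : common) {
                    Map<String, Object> column = new HashMap<>();
                    column.put("name", name);
                    out.add(column);
                }
            }
            return out;
        }
    }

    /** Builds a fake slice result containing one column per name. */
    public static List<Map> slice(String... names) {
        List<Map> columns = new ArrayList<>();
        for (String name : names) {
            Map<String, Object> column = new HashMap<>();
            column.put("name", name);
            columns.add(column);
        }
        return columns;
    }

    /** Row "1" has columns a and b, row "4" has columns a and z, as on the slide. */
    public static List<Map> demo() {
        Map<Integer, Object> input = new LinkedHashMap<>();
        input.put(0, slice("a", "b"));
        input.put(1, slice("a", "z"));
        return new IntersectNames().multiProcess(input, new HashMap<String, Object>());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A union would follow the same shape with `addAll` in place of `retainAll`.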

Fat client becomes
the 'Phat client'

Imagine you want to insert this data
● User wishes to record this event in multiple column
families
– 09/10/2011 11:12:13
– App=Yahoo
– Platform=iOS
– OS=4.3.4
– Device=iPad2,1
– Resolution=768x1024
– Events: videoPlayPercent=38, Taste=great

http://www.slideshare.net/charmalloc/jsteincassandranyc2011

Inserting the data
aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}

aggregateKeys(KEYSPACE \ "ByMonth") = month //201109
aggregateKeys(KEYSPACE \ "ByDay") = day //20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour //2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute //201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
      val (columnFamily, row) = tuple
      if (row != null && row.size > 0)
        rows add (columnFamily -> row has columnName inc) //increment the counter
    }
  }
}
ccAppPlatformOSVersionDeviceResolution(r)

http://www.slideshare.net/charmalloc/jsteincassandranyc2011

Solution
● Send the data once and compute the N
permutations on the server side
public void process(JsonObject request, JsonObject state, JsonObject response, EventBus eb) {
  // Parameters the client sent once with the request
  JsonObject params = request.getObject("mpparams");
  String uid = params.getString("userid");
  String fname = params.getString("fname");
  String lname = params.getString("lname");
  String city = params.getString("city");

  // One row mutation keyed by user id, with one column per field
  RowMutation rm = new RowMutation("myks", IntraService.byteBufferForObject(uid));
  QueryPath qp = new QueryPath("users", null, IntraService.byteBufferForObject("fname"));
  rm.add(qp, IntraService.byteBufferForObject(fname), System.nanoTime());
  QueryPath qp2 = new QueryPath("users", null, IntraService.byteBufferForObject("lname"));
  rm.add(qp2, IntraService.byteBufferForObject(lname), System.nanoTime());
  ...
  try {
    // "mutations" is presumably assembled in the part elided above
    StorageProxy.mutate(mutations, ConsistencyLevel.ONE);
    // Report success only when the write went through
    response.putString("status", "OK");
  } catch (WriteTimeoutException | UnavailableException | OverloadedException e) {
    e.printStackTrace();
    response.putString("status", "FAILED");
  }
}
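
For the event example, "computing the N permutations" largely comes down to deriving one row key per time bucket from a single event timestamp. The sketch below shows only that step, using `java.time` from the standard library. The class name `BucketSketch`, the `rowKeys` helper, and the key formats are assumptions inferred from the ByMonth/ByDay/ByHour/ByMinute naming above, not IntraVert or Cassandra APIs.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

public class BucketSketch {
    /** Returns one row key per bucket (column family name -> row key) for a single event. */
    public static Map<String, String> rowKeys(LocalDateTime eventTime) {
        // Assumed key layout: each bucket truncates the timestamp a little less.
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("ByMonth", "yyyyMM");
        patterns.put("ByDay", "yyyyMMdd");
        patterns.put("ByHour", "yyyyMMddHH");
        patterns.put("ByMinute", "yyyyMMddHHmm");

        Map<String, String> keys = new LinkedHashMap<>();
        for (Map.Entry<String, String> bucket : patterns.entrySet()) {
            DateTimeFormatter format = DateTimeFormatter.ofPattern(bucket.getValue());
            keys.put(bucket.getKey(), eventTime.format(format));
        }
        return keys;
    }

    public static void main(String[] args) {
        // The event time from the slide: 09/10/2011 11:12:13
        LocalDateTime eventTime = LocalDateTime.of(2011, 9, 10, 11, 12, 13);
        for (Map.Entry<String, String> key : rowKeys(eventTime).entrySet()) {
            System.out.println(key.getKey() + " -> " + key.getValue());
        }
    }
}
```

With this running server side, the client sends the event once and the server fans it out to one counter increment per bucket.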

IntraVert status
● Still pre-1.0
● Good docs
– https://github.com/zznate/intravert-ug/wiki/_pages
● Functionally equivalent to Thrift (most features
ported)
● CQL support
● Virgil (coming soon)
● HBase-like scanners (coming soon)

Hack at it

https://github.com/zznate/intravert-ug
