Before we get into the heavy stuff, let's imagine hacking around with C* for a bit...
You run a large video website
●   CREATE TABLE videos (
    videoid uuid,
    videoname varchar,
    username varchar,
    description varchar,
    tags varchar,
    upload_date timestamp,
    PRIMARY KEY (videoid,videoname) );
●   INSERT INTO videos (videoid, videoname, username,
    description, tags, upload_date) VALUES
    ('99051fe9-6a9c-46c2-b949-38ef78858dd0', 'My funny cat', 'ctodd',
    'My cat likes to play the piano! So funny.', 'cats,piano,lol',
    '2012-06-01 08:00:00');
You have a bajillion users
●   CREATE TABLE users (
    username varchar,
    firstname varchar,
    lastname varchar,
    email varchar,
    password varchar,
    created_date timestamp,
    PRIMARY KEY (username));
●   INSERT INTO users (username, firstname, lastname, email,
    password, created_date) VALUES ('tcodd', 'Ted', 'Codd',
    'tcodd@relational.com', '5f4dcc3b5aa765d61d8327deb882cf99',
    '2011-06-01 08:00:00');
You can query up a storm
●   SELECT firstname,lastname FROM users WHERE username='tcodd';
    firstname | lastname
    -----------+----------
          Ted |      Codd


●   SELECT * FROM videos
    WHERE videoid = 'b3a76c6b-7c7f-4af6-964f-803a9283c401' AND videoname > 'N';
    videoid                              | videoname               | description                                          | tags           | upload_date              | username
    b3a76c6b-7c7f-4af6-964f-803a9283c401 | Now my dog plays piano! | My dog learned to play the piano because of the cat. | dogs,piano,lol | 2012-08-30 16:50:00+0000 | ctodd
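The second query works because videoid is the partition key and videoname is the clustering column, so the rows under one videoid are kept sorted by videoname and a range over them is one contiguous slice. A rough in-memory sketch of that idea follows. It models the concept with a sorted list and a binary search, not Cassandra's storage engine, and the video names other than the two from the slides are made up.

```python
from bisect import bisect_right

# One partition: clustering values (videoname) kept in sorted order.
# Illustrative data only; two of the names are invented.
partition = sorted([
    "A day at the beach",
    "My funny cat",
    "Now my dog plays piano!",
    "Zebra crossing",
])

def slice_greater_than(names, start):
    """Return every name strictly greater than `start`, like videoname > 'N'."""
    return names[bisect_right(names, start):]

result = slice_greater_than(partition, "N")
```

Because the names are sorted, the slice is found with one binary search and one contiguous read, which is why this kind of WHERE clause is cheap.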
That's great! Then you ask yourself...
●   Can I slice a slice (or do a subquery)?
●   Can I do advanced WHERE clauses?
●   Can I union two slices server side?
●   Can I join data from two tables without two
    request/response round trips?
●   What about procedures?
●   Can I write functions or aggregation functions?
Let's look at the APIs we have




 http://www.slideshare.net/aaronmorton/apachecon-nafeb2013
But none of those APIs do what I want, and it seems simple enough to do...
IntraVert joins the “party” at the API layer
Why not just do it client side?
●   Move processing close to data
    –   Idea borrowed from Hadoop
●   Doing work close to the source can result in:
    –   Less network IO
    –   Less memory spent encoding/decoding 'throwaway'
        data
    –   New storage and access paradigms
Vert.x + Cassandra
●   What is Vert.x?
    –   Distributed event bus which spans the server and
        even penetrates into the client side for effortless
        'real-time' web applications
●   What are the cool features?
    –   Asynchronous
    –   Hot-reloadable modules
    –   Modules can be written in Groovy, Ruby, Java, or
        JavaScript

                           http://vertx.io
Transport, payload, and batching
HTTP Transport
●   HTTP is easy to use on firewalled networks
●   Easy to secure
●   Easy to compress
●   The de facto way to do everything anyway
●   IntraVert's goal is to limit round-trips
    –   Not to provide a terse binary format
JSON Payload
●   Simple nested types like list, map, String
●   Request is composed of N operations
●   Each operation has a configurable timeout
●   Again, IntraVert's goal is to limit round-trips
    –   Not to provide a terse message format
Why not use lightning-fast transport and serialization library X?
●   X's language/code-gen issues
●   You probably cannot tcpdump X
●   Net-admins don't like 90 jars for health checks
●   IntraVert limits round-trips in other ways:
    –   Prepared statements
    –   Server side filtering
    –   Other cool stuff
Sample request and response
Request:
{"e": [ {
     "type": "SETKEYSPACE",
     "op": { "keyspace": "myks" }
  }, {
     "type": "SETCOLUMNFAMILY",
     "op": { "columnfamily": "mycf" }
  }, {
     "type": "SLICE",
     "op": {
         "rowkey": "beers",
         "start": "Allagash",
         "end": "Sierra Nevada",
         "size": 9
} }]}
Response:
{
     "exception": null,
     "exceptionId": null,
     "opsRes": {
        "0": "OK",
        "1": "OK",
        "2": [{
             "name": "Founders",
             "value": "Breakfast Stout"
        }]
}}
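A client-side view of the same exchange, as a minimal sketch: it builds the three-operation request as one JSON document, wraps it in a single HTTP POST (prepared but not sent), and reads the slice result out of the sample response by operation index. The endpoint URL and the op() helper are assumptions made for illustration. Only the JSON shapes come from the slide.

```python
import json
import urllib.request

# Hypothetical endpoint: host, port, and path are assumptions for illustration.
ENDPOINT = "http://localhost:8080/intravert/intrareq-json"

def op(op_type, **fields):
    """Build one operation in the {"type": ..., "op": {...}} shape from the slide."""
    return {"type": op_type, "op": fields}

request_body = {"e": [
    op("SETKEYSPACE", keyspace="myks"),
    op("SETCOLUMNFAMILY", columnfamily="mycf"),
    op("SLICE", rowkey="beers", start="Allagash", end="Sierra Nevada", size=9),
]}

# Prepare (but do not send) a single HTTP POST carrying all three operations.
http_request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(request_body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Results come back keyed by each operation's position in the "e" list.
sample_response = json.loads(
    '{"exception": null, "exceptionId": null,'
    ' "opsRes": {"0": "OK", "1": "OK",'
    ' "2": [{"name": "Founders", "value": "Breakfast Stout"}]}}'
)
slice_rows = sample_response["opsRes"]["2"]
```

Three operations travel in one round trip, and the index into opsRes ("2" here) ties each result back to the operation that produced it.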
Server side filter
Imagine your data looks like...
{ "rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel" }
{ "rowkey": "beers", "name": "Founders", "value": "Breakfast Stout" }
{ "rowkey": "beers", "name": "Dogfish Head", "value": "Hellhound IPA" }
Application requirement
●   A user request needs to find which beers are
    “Breakfast Stout”
●   Common “solutions”:
    –   Write a copy of the data sorted by type
    –   Request all the data and parse on client side
Using an IntraVert filter
●   Send a function to the server
●   Function is applied to subsequent get or slice
    operations
●   Only results of the filter are returned to the
    client
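What the server does with such a function can be sketched locally. This is an illustration of the semantics described above (return the row to keep it, return nothing to drop it), not IntraVert's implementation. The apply_filter helper is hypothetical, and the rows are the three beer rows from the earlier slide.

```python
import json

rows = [
    {"rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel"},
    {"rowkey": "beers", "name": "Founders", "value": "Breakfast Stout"},
    {"rowkey": "beers", "name": "Dogfish Head", "value": "Hellhound IPA"},
]

def stouts(row):
    """Keep a row only when its value is 'Breakfast Stout'; None means drop it."""
    return row if row["value"] == "Breakfast Stout" else None

def apply_filter(filter_fn, sliced_rows):
    """Run the filter over each row and keep only the non-None results."""
    kept = (filter_fn(row) for row in sliced_rows)
    return [row for row in kept if row is not None]

filtered = apply_filter(stouts, rows)

# Bytes that would cross the network without and with the filter.
unfiltered_bytes = len(json.dumps(rows))
filtered_bytes = len(json.dumps(filtered))
```

With the filter running next to the data, only the one matching row is serialized and sent, so the payload shrinks along with the client's parsing work.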
Defining a filter: JavaScript
●   Syntax to create a filter
    {
        "type": "CREATEFILTER",
        "op": {
            "name": "stouts",
            "spec": "javascript",
            "value": "function(row) { if (row['value'] == 'Breakfast Stout') return row; else return null; }"
        }
    },
Defining a filter: Groovy/Java

●   We can define a Groovy closure or a Java filter
    {
        "type": "CREATEFILTER",
        "op": {
            "name": "stouts",
            "spec": "groovy",
            "value": "{ row -> if (row['value'] == 'Breakfast Stout') return row else return null }"
        }
    },
Filter flow
Common filter use cases
●   Transform data
●   Prune columns/rows like a where clause
●   Extract data from complex fields (JSON, XML,
    protobuf, etc.)
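Each of these use cases is a small row-in, row-out function. A sketch of one per bullet, following the same "return the row or nothing" convention as the stouts filter. The function names and the JSON-encoded sample value (including its abv field) are made up for illustration.

```python
import json

def uppercase_value(row):
    """Transform: return a copy of the row with its value upper-cased."""
    return {**row, "value": row["value"].upper()}

def only_name_and_value(row):
    """Prune: drop every field except name and value."""
    return {"name": row["name"], "value": row["value"]}

def extract_abv(row):
    """Extract: pull one field out of a JSON-encoded value; drop rows without it."""
    doc = json.loads(row["value"])
    if "abv" not in doc:
        return None
    return {"name": row["name"], "value": doc["abv"]}

row = {"rowkey": "beers", "name": "Founders", "value": "Breakfast Stout"}

# Hypothetical row whose value is itself a JSON document.
json_row = {
    "rowkey": "beers",
    "name": "Founders",
    "value": json.dumps({"beer": "Breakfast Stout", "abv": 8.3}),
}
```

The extract case is where running server side pays off most: the full encoded document never leaves the server, only the one field the client asked for.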
Some light relief
Server Side Multi-Processor
It's the cure for your “redis envy”
Imagine your data looks like...
●   { "row key": "1", name: "a", val... }
●   { "row key": "1", name: "b", val... }
●   { "row key": "4", name: "a", val... }
●   { "row key": "4", name: "z", val... }
Application Requirements
●   User wishes to intersect the column names of
    two slices/queries
●   Common “solutions”
    –   Pull all results to client and apply the intersection
        there
Server Side MultiProcessor
●   Send a class that implements MultiProcessor
    interface to server
●   public List<Map> multiProcess
    (Map<Integer,Object> input, Map params);
●   Do one or more get/slice operations as input
●   Invoke MultiProcessor on input
Multi-processor flow
Multi-processor use cases
●   Union N slices
●   Intersect N slices
●   Some “Join” scenarios
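The first two use cases can be sketched with the same input shape as the multiProcess signature: a map from operation index to that operation's result. This is a local Python illustration, not the Java interface itself. The function names and the column values are placeholders, since the slides elide the values.

```python
def intersect_column_names(inputs, params=None):
    """Return the rows of the first slice whose column name is in every other slice."""
    slices = [inputs[i] for i in sorted(inputs)]
    common = {row["name"] for row in slices[0]}
    for other in slices[1:]:
        common &= {row["name"] for row in other}
    return [row for row in slices[0] if row["name"] in common]

def union_column_names(inputs, params=None):
    """Return one row per distinct column name; the first slice that has it wins."""
    merged = {}
    for i in sorted(inputs):
        for row in inputs[i]:
            merged.setdefault(row["name"], row)
    return [merged[name] for name in sorted(merged)]

# Results of two earlier slice operations, keyed by operation index.
# Row key "1" has columns a and b; row key "4" has columns a and z.
inputs = {
    0: [{"name": "a", "value": "1a"}, {"name": "b", "value": "1b"}],
    1: [{"name": "a", "value": "4a"}, {"name": "z", "value": "4z"}],
}

intersection = intersect_column_names(inputs)
union = union_column_names(inputs)
```

Both slices and the set operation run in one request, so the client receives only the combined result instead of two full slices.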
Fat client becomes the 'Phat client'
Imagine you want to insert this data
●   User wishes to enter this event for multiple column
    families
    –   09/10/2011 11:12:13
    –   App=Yahoo
    –   Platform=iOS
    –   OS=4.3.4
    –   Device=iPad2,1
    –   Resolution=768x1024
    –   Events–videoPlayPercent=38–Taste=great

         http://www.slideshare.net/charmalloc/jsteincassandranyc2011
Inserting the data
aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
    c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}

aggregateKeys(KEYSPACE \ "ByMonth") = month //201109
aggregateKeys(KEYSPACE \ "ByDay") = day //20110910
aggregateKeys(KEYSPACE \ "ByHour") = hour //2011091012
aggregateKeys(KEYSPACE \ "ByMinute") = minute //201109101213

def r(columnName: String): Unit = {
    aggregateKeys.foreach { tuple: (ColumnFamily, String) =>
        val (columnFamily, row) = tuple
        if (row != null && row.size > 0)
            rows add (columnFamily -> row has columnName inc) //increment the counter
    }
}
ccAppPlatformOSVersionDeviceResolution(r)

                        http://www.slideshare.net/charmalloc/jsteincassandranyc2011
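The fan-out this client code performs can be sketched as follows: one event becomes one counter increment per time bucket, each in its own column family. The bucket names and key formats follow the snippet above. The counter_increments function, the event dictionary layout, and the '#' separator between values (the p() helper is not shown in the slides) are assumptions for illustration.

```python
from datetime import datetime

# Column family name -> strftime pattern used to build its row key.
BUCKETS = {
    "ByMonth": "%Y%m",
    "ByDay": "%Y%m%d",
    "ByHour": "%Y%m%d%H",
    "ByMinute": "%Y%m%d%H%M",
}

def counter_increments(event, dimensions):
    """Return one (column family, row key, column name) triple per time bucket."""
    ts = datetime.strptime(event["timestamp"], "%m/%d/%Y %H:%M:%S")
    # Assumed layout: dimension names joined by '+', then values joined by '#'.
    column = "+".join(dimensions) + "#" + "#".join(event[d] for d in dimensions)
    return [(cf, ts.strftime(pattern), column) for cf, pattern in BUCKETS.items()]

event = {
    "timestamp": "09/10/2011 11:12:13",
    "app": "Yahoo",
    "platform": "iOS",
    "osversion": "4.3.4",
    "device": "iPad2,1",
    "resolution": "768x1024",
}
dimensions = ["app", "platform", "osversion", "device", "resolution"]

increments = counter_increments(event, dimensions)
```

Every extra dimension combination multiplies the number of increments again, and a fat client has to send each of them over the wire itself.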
Solution
    ●   Send the data once and compute the N
        permutations on the server side
public void process(JsonObject request, JsonObject state, JsonObject response, EventBus eb) {
    // Parameters sent once by the client.
    JsonObject params = request.getObject("mpparams");
    String uid = params.getString("userid");
    String fname = params.getString("fname");
    String lname = params.getString("lname");
    String city = params.getString("city");

    // One mutation for the user's row, with one column per attribute.
    RowMutation rm = new RowMutation("myks", IntraService.byteBufferForObject(uid));
    QueryPath qp = new QueryPath("users", null, IntraService.byteBufferForObject("fname"));
    rm.add(qp, IntraService.byteBufferForObject(fname), System.nanoTime());
    QueryPath qp2 = new QueryPath("users", null, IntraService.byteBufferForObject("lname"));
    rm.add(qp2, IntraService.byteBufferForObject(lname), System.nanoTime());
    ...
    try {
        StorageProxy.mutate(mutations, ConsistencyLevel.ONE);
        // Report success only if the write actually went through.
        response.putString("status", "OK");
    } catch (WriteTimeoutException | UnavailableException | OverloadedException e) {
        e.printStackTrace();
        response.putString("status", "FAILED");
    }
}
Service Processor Flow
IntraVert status
●   Still pre 1.0
●   Good docs
    –   https://github.com/zznate/intravert-ug/wiki/_pages
●   Functionally equivalent to Thrift (most features
    ported)
●   CQL support
●   Virgil (coming soon)
●   HBase-like scanners (coming soon)
Hack at it




https://github.com/zznate/intravert-ug
Questions?

NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
