IntraVert: Server-side processing for Cassandra

    • Before we get into the heavy stuff, let's imagine hacking around with C* for a bit...
    • You run a large video website
        ● CREATE TABLE videos ( videoid uuid, videoname varchar, username varchar, description varchar, tags varchar, upload_date timestamp, PRIMARY KEY (videoid, videoname) );
        ● INSERT INTO videos (videoid, videoname, username, description, tags, upload_date) VALUES (99051fe9-6a9c-46c2-b949-38ef78858dd0, 'My funny cat', 'ctodd', 'My cat likes to play the piano! So funny.', 'cats,piano,lol', '2012-06-01 08:00:00');
    • You have a bajillion users
        ● CREATE TABLE users ( username varchar, firstname varchar, lastname varchar, email varchar, password varchar, created_date timestamp, PRIMARY KEY (username) );
        ● INSERT INTO users (username, firstname, lastname, email, password, created_date) VALUES ('tcodd', 'Ted', 'Codd', 'tcodd@relational.com', '5f4dcc3b5aa765d61d8327deb882cf99', '2011-06-01 08:00:00');
    • You can query up a storm
        ● SELECT firstname, lastname FROM users WHERE username = 'tcodd';
           firstname | lastname
          -----------+----------
           Ted       | Codd
        ● SELECT * FROM videos WHERE videoid = b3a76c6b-7c7f-4af6-964f-803a9283c401 AND videoname > 'N';
          videoid | videoname | description | tags | upload_date | username
          b3a76c6b-7c7f-4af6-964f-803a9283c401 | Now my dog plays piano! | My dog learned to play the piano because of the cat. | dogs,piano,lol | 2012-08-30 16:50:00+0000 | ctodd
    • That's great! Then you ask yourself...
    • ● Can I slice a slice (or sub-query)?● Can I do advanced WHERE clauses?● Can I union two slices server side?● Can I join data from two tables without two request/response round trips?● What about procedures?● Can I write functions or aggregation functions?
    • Let's look at the APIs we have http://www.slideshare.net/aaronmorton/apachecon-nafeb2013
    • But none of those APIs do what I want, and it seems simple enough to do...
    • Intravert joins the “party” at the API Layer
    • Why not just do it client side?● Move processing close to the data – Idea borrowed from Hadoop● Doing work close to the source can result in: – Less network IO – Less memory spent encoding/decoding throwaway data – New storage and access paradigms
    • Vert.x + Cassandra● What is Vert.x? – Distributed event bus which spans the server and even penetrates into the client side for effortless real-time web applications● What are the cool features? – Asynchronous – Hot-reloadable modules – Modules can be written in Groovy, Ruby, Java, JavaScript http://vertx.io
    • Transport, payload, and batching
    • HTTP Transport● HTTP is easy to use on firewalled networks● Easy to secure● Easy to compress● The de facto way to do everything anyway● IntraVert attempts to limit round trips – Not to provide a terse binary format
    • JSON Payload● Simple nested types like list, map, String● A request is composed of N operations● Each operation has a configurable timeout● Again, IntraVert attempts to limit round trips – Not to provide a terse message format
    • Why not use lightning-fast transport and serialization library X?● X's language/code-gen issues● You probably cannot tcpdump X● Net admins don't like 90 jars for health checks● IntraVert attempts to limit round trips: – Prepared statements – Server side filtering – Other cool stuff
    • Sample request and response
        ● Request:
          {"e": [
            { "type": "SETKEYSPACE", "op": { "keyspace": "myks" } },
            { "type": "SETCOLUMNFAMILY", "op": { "columnfamily": "mycf" } },
            { "type": "SLICE", "op": { "rowkey": "beers", "start": "Allagash", "end": "Sierra Nevada", "size": 9 } }
          ]}
        ● Response:
          { "exception": null, "exceptionId": null,
            "opsRes": { "0": "OK", "1": "OK", "2": [ { "name": "Founders", "value": "Breakfast Stout" } ] } }
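
The batching idea (N operations in one JSON request, one HTTP round trip) could be sketched from the client side roughly as follows. This is only an illustration in plain Java: the class `IntraRequestSketch`, its methods, and the endpoint URL in `main` are hypothetical placeholders, not part of IntraVert. Only the `e` / `type` / `op` layout comes from the sample request.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects operations and renders them as one request body.
class IntraRequestSketch {
    private final List<String> ops = new ArrayList<>();

    // Appends one operation; opJson is the already-serialized "op" object.
    IntraRequestSketch add(String type, String opJson) {
        ops.add("{\"type\":\"" + type + "\",\"op\":" + opJson + "}");
        return this;
    }

    // Renders all operations as a single {"e":[...]} payload.
    String toJson() {
        return "{\"e\":[" + String.join(",", ops) + "]}";
    }

    // The three operations from the sample request, batched into one body.
    static String beerSlice() {
        return new IntraRequestSketch()
            .add("SETKEYSPACE", "{\"keyspace\":\"myks\"}")
            .add("SETCOLUMNFAMILY", "{\"columnfamily\":\"mycf\"}")
            .add("SLICE", "{\"rowkey\":\"beers\",\"start\":\"Allagash\","
                + "\"end\":\"Sierra Nevada\",\"size\":9}")
            .toJson();
    }

    // Builds (but does not send) an HTTP POST carrying the whole batch.
    static HttpRequest post(String url, String body) {
        return HttpRequest.newBuilder(URI.create(url))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        String body = beerSlice();
        // Placeholder URL: substitute the address of your own server.
        HttpRequest req = post("http://localhost:8080/placeholder-endpoint", body);
        System.out.println(req.method() + " " + req.uri());
        System.out.println(body);
    }
}
```

Whatever the transport details, the point is that all three operations travel in one request instead of three.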
    • Server side filter
    • Imagine your data looks like...
          { "rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel" }
          { "rowkey": "beers", "name": "Founders", "value": "Breakfast Stout" }
          { "rowkey": "beers", "name": "DogfishHead", "value": "Hellhound IPA" }
    • Application requirement● A user request asks which beers are “Breakfast Stouts”● Common “solutions”: – Write a copy of the data sorted by type – Request all the data and parse it on the client side
    • Using an IntraVert filter● Send a function to the server● Function is applied to subsequent get or slice operations● Only results of the filter are returned to the client
    • Defining a filter JavaScript
        ● Syntax to create a filter
          { "type": "CREATEFILTER",
            "op": { "name": "stouts", "spec": "javascript",
                    "value": "function(row) { if (row['value'] == 'Breakfast Stout') return row; else return null; }" } },
    • Defining a filter Groovy/Java
        ● We can define a Groovy closure or Java filter
          { "type": "CREATEFILTER",
            "op": { "name": "stouts", "spec": "groovy",
                    "value": "{ row -> if (row['value'] == 'Breakfast Stout') return row; else return null }" } },
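
To make the filter semantics concrete, here is a small stand-alone simulation in plain Java: a function is applied to every column of a slice, and only non-null results are kept. It is a sketch of the behaviour described on the slides, not IntraVert's implementation; `FilterSketch`, `applyFilter`, and `STOUTS` are names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the server-side filter step.
class FilterSketch {
    // Same logic as the "stouts" filter: keep the row or return null to drop it.
    static final UnaryOperator<Map<String, String>> STOUTS =
        row -> "Breakfast Stout".equals(row.get("value")) ? row : null;

    // Applies the filter to each row and keeps only the non-null results.
    static List<Map<String, String>> applyFilter(
            List<Map<String, String>> rows,
            UnaryOperator<Map<String, String>> filter) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            Map<String, String> result = filter.apply(row);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    // Builds one column of the "beers" row in the shape used on the slides.
    static Map<String, String> column(String name, String value) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("rowkey", "beers");
        row.put("name", name);
        row.put("value", value);
        return row;
    }

    // The three sample columns from the "Imagine your data looks like..." slide.
    static List<Map<String, String>> beers() {
        return List.of(
            column("Allagash", "Allagash Tripel"),
            column("Founders", "Breakfast Stout"),
            column("DogfishHead", "Hellhound IPA"));
    }

    public static void main(String[] args) {
        // Only the filtered rows would travel back to the client.
        System.out.println(applyFilter(beers(), STOUTS));
    }
}
```

Because a filter may return a modified row rather than the original, the same mechanism covers both pruning and transformation.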
    • Filter flow
    • Common filter use cases● Transform data● Prune columns/rows like a where clause● Extract data from complex fields (json, xml, protobuf, etc)
    • Some light relief
    • Server Side Multi-Processor
    • It's the cure for your “Redis envy”
    • Imagine your data looks like...
          { "row key": "1", name: "a", val... }
          { "row key": "1", name: "b", val... }
          { "row key": "4", name: "a", val... }
          { "row key": "4", name: "z", val... }
    • Application Requirements● User wishes to intersect the column names of two slices/queries● Common “solutions” – Pull all results to client and apply the intersection there
    • Server Side MultiProcessor● Send a class that implements the MultiProcessor interface to the server● public List<Map> multiProcess(Map<Integer,Object> input, Map params);● Do one or more get/slice operations as input● Invoke the MultiProcessor on the input
    • Multi-processor flow
    • Multi-processor use cases● Union of N slices● Intersection of N slices● Some “join” scenarios
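
As an illustration of the intersection use case, the sketch below implements the `multiProcess` signature shown on the slide against a locally declared stand-in interface. It assumes, without confirmation from the slides, that `input` maps each operation id to that operation's slice result, a `List<Map>` whose entries carry a `"name"` key. `IntersectColumnNames` and `MultiProcessorSketch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Intersects the column names of all slices passed in as input.
class IntersectColumnNames implements MultiProcessorSketch {
    @Override
    @SuppressWarnings({"rawtypes", "unchecked"})
    public List<Map> multiProcess(Map<Integer, Object> input, Map params) {
        Set<Object> common = null;
        for (Object sliceResult : input.values()) {
            // Assumption: each input value is a list of column maps.
            Set<Object> names = new LinkedHashSet<>();
            for (Map column : (List<Map>) sliceResult) {
                names.add(column.get("name"));
            }
            if (common == null) {
                common = names;           // first slice seeds the result
            } else {
                common.retainAll(names);  // later slices narrow it down
            }
        }
        List<Map> out = new ArrayList<>();
        if (common != null) {
            for (Object name : common) {
                Map row = new HashMap();
                row.put("name", name);
                out.add(row);
            }
        }
        return out;
    }

    // Builds a one-entry column map for the demo data.
    static Map<String, Object> col(String name) {
        Map<String, Object> column = new HashMap<>();
        column.put("name", name);
        return column;
    }

    // Row "1" has columns a, b; row "4" has columns a, z (as on the slide).
    @SuppressWarnings("rawtypes")
    static List<Map> demo() {
        Map<Integer, Object> input = new LinkedHashMap<>();
        input.put(0, List.of(col("a"), col("b")));
        input.put(1, List.of(col("a"), col("z")));
        return new IntersectColumnNames().multiProcess(input, Map.of());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}

// Local stand-in mirroring the signature from the slide; not the real interface.
@SuppressWarnings("rawtypes")
interface MultiProcessorSketch {
    List<Map> multiProcess(Map<Integer, Object> input, Map params);
}
```

A union would follow the same shape with `addAll` in place of `retainAll`.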
    • Fat client becomes the Phat client
    • Imagine you want to insert this data● User wishes to enter this event for multiple column families – 09/10/2011 11:12:13 – App=Yahoo – Platform=iOS – OS=4.3.4 – Device=iPad2,1 – Resolution=768x1024 – Events – videoPlayPercent=38 – Taste=great http://www.slideshare.net/charmalloc/jsteincassandranyc2011
    • Inserting the data
          aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"
          def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
            c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
          }
          aggregateKeys(KEYSPACE "ByMonth") = month //201109
          aggregateKeys(KEYSPACE "ByDay") = day //20110910
          aggregateKeys(KEYSPACE "ByHour") = hour //2011091012
          aggregateKeys(KEYSPACE "ByMinute") = minute //201109101213
          def r(columnName: String): Unit = {
            aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
              val (columnFamily, row) = tuple
              if (row != null && row.size > 0)
                rows add (columnFamily -> row has columnName inc) //increment the counter
            }}
          }
          ccAppPlatformOSVersionDeviceResolution(r)
          http://www.slideshare.net/charmalloc/jsteincassandranyc2011
    • Solution
        ● Send the data once and compute the N permutations on the server side
          public void process(JsonObject request, JsonObject state, JsonObject response, EventBus eb) {
            JsonObject params = request.getObject("mpparams");
            String uid = params.getString("userid");
            String fname = params.getString("fname");
            String lname = params.getString("lname");
            String city = params.getString("city");
            RowMutation rm = new RowMutation("myks", IntraService.byteBufferForObject(uid));
            QueryPath qp = new QueryPath("users", null, IntraService.byteBufferForObject("fname"));
            rm.add(qp, IntraService.byteBufferForObject(fname), System.nanoTime());
            QueryPath qp2 = new QueryPath("users", null, IntraService.byteBufferForObject("lname"));
            rm.add(qp2, IntraService.byteBufferForObject(lname), System.nanoTime());
            ...
            try {
              StorageProxy.mutate(mutations, ConsistencyLevel.ONE);
              // Report success only if the mutation actually went through
              response.putString("status", "OK");
            } catch (WriteTimeoutException | UnavailableException | OverloadedException e) {
              e.printStackTrace();
              response.putString("status", "FAILED");
            }
          }
    • Service Processor Flow
    • IntraVert status● Still pre-1.0● Good docs – https://github.com/zznate/intravert-ug/wiki/_pages● Functionally equivalent to Thrift (most features ported)● CQL support● Virgil (coming soon)● HBase-like scanners (coming soon)
    • Hack at it https://github.com/zznate/intravert-ug
    • Questions?