
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"


The ColumnFamily data model and wide-row support provide the ability to store and access data efficiently in a de-normalized state. Recent enhancements for CQL's sparse tables and built-in indexing provide the capability to store data in a manner similar to that of relational databases. For many use cases hybrid approaches are needed, because complete de-normalization is appropriate for some access patterns whereas more structured data is appropriate for others. At times a single logical event becomes multiple insertions across multiple column families. Likewise, a user request might require several reads across different column families. This talk describes some of these scenarios and demonstrates how advanced operations such as multi-step procedures, filtering, intersection, and paging can be implemented client side or server side with the help of the IntraVert plugin.

Published in: Technology, Education

  1. Before we get into the heavy stuff, let's imagine hacking around with C* for a bit...
  2. You run a large video website
       ● CREATE TABLE videos (videoid uuid, videoname varchar, username varchar, description varchar, tags varchar, upload_date timestamp, PRIMARY KEY (videoid, videoname));
       ● INSERT INTO videos (videoid, videoname, username, description, tags, upload_date) VALUES (99051fe9-6a9c-46c2-b949-38ef78858dd0, 'My funny cat', 'ctodd', 'My cat likes to play the piano! So funny.', 'cats,piano,lol', '2012-06-01 08:00:00');
  3. You have a bajillion users
       ● CREATE TABLE users (username varchar, firstname varchar, lastname varchar, email varchar, password varchar, created_date timestamp, PRIMARY KEY (username));
       ● INSERT INTO users (username, firstname, lastname, email, password, created_date) VALUES ('tcodd', 'Ted', 'Codd', 'tcodd@relational.com', '5f4dcc3b5aa765d61d8327deb882cf99', '2011-06-01 08:00:00');
  4. You can query up a storm
       ● SELECT firstname, lastname FROM users WHERE username = 'tcodd';
           firstname | lastname
           -----------+----------
           Ted | Codd
       ● SELECT * FROM videos WHERE videoid = b3a76c6b-7c7f-4af6-964f-803a9283c401 AND videoname > 'N';
           videoid | videoname | description | tags | upload_date | username
           b3a76c6b-7c7f-4af6-964f-803a9283c401 | Now my dog plays piano! | My dog learned to play the piano because of the cat. | dogs,piano,lol | 2012-08-30 16:50:00+0000 | ctodd
  5. That's great! Then you ask yourself...
  6. Questions you ask yourself
       ● Can I slice a slice (or sub query)?
       ● Can I do advanced where clauses?
       ● Can I union two slices server side?
       ● Can I join data from two tables without two request/response round trips?
       ● What about procedures?
       ● Can I write functions or aggregation functions?
  7. Let's look at the APIs we have http://www.slideshare.net/aaronmorton/apachecon-nafeb2013
  8. But none of those APIs do what I want, and it seems simple enough to do...
  9. IntraVert joins the “party” at the API layer
  10. Why not just do it client side?
       ● Move processing close to data
           – Idea borrowed from Hadoop
       ● Doing work close to the source can result in:
           – Less network IO
           – Less memory spent encoding/decoding throwaway data
           – New storage and access paradigms
  11. Vert.x + Cassandra
       ● What is Vert.x?
           – Distributed event bus which spans the server and even penetrates into the client side for effortless real-time web applications
       ● What are the cool features?
           – Asynchronous
           – Hot re-loadable modules
           – Modules can be written in Groovy, Ruby, Java, JavaScript
       http://vertx.io
  12. Transport, payload, and batching
  13. HTTP Transport
       ● HTTP is easy to use on firewalled networks
       ● Easy to secure
       ● Easy to compress
       ● The de facto way to do everything anyway
       ● IntraVert attempts to limit round trips
           – Not to provide a terse binary format
  14. JSON Payload
       ● Simple nested types like list, map, String
       ● Request is composed of N operations
       ● Each operation has a configurable timeout
       ● Again, IntraVert attempts to limit round trips
           – Not to provide a terse message format
  15. Why not use lightning-fast transport and serialization library X?
       ● X's language/code gen issues
       ● You probably cannot TCP dump X
       ● Net-admins don't like 90 jars for health checks
       ● IntraVert attempts to limit round trips:
           – Prepared statements
           – Server side filtering
           – Other cool stuff
  16. Sample request and response
       ● Request: {"e": [ { "type": "SETKEYSPACE", "op": { "keyspace": "myks" } }, { "type": "SETCOLUMNFAMILY", "op": { "columnfamily": "mycf" } }, { "type": "SLICE", "op": { "rowkey": "beers", "start": "Allagash", "end": "Sierra Nevada", "size": 9 } } ]}
       ● Response: { "exception": null, "exceptionId": null, "opsRes": { "0": "OK", "1": "OK", "2": [ { "name": "Founders", "value": "Breakfast Stout" } ] } }
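The batching idea on slides 14 and 16 (one request carrying N operations, one response keyed by operation index) can be sketched from the client side as follows. This is a minimal illustration, not IntraVert client code: the helper names `op`, `build_request`, and `result_for` are hypothetical, and the HTTP call is left out because the deck does not give the endpoint.

```python
import json

def op(op_type, **fields):
    """Build one operation: a "type" plus an "op" map of arguments (shape taken from slide 16)."""
    return {"type": op_type, "op": fields}

def build_request(*operations):
    """Wrap the operations in the top-level "e" list, so all of them travel in one round trip."""
    return {"e": list(operations)}

def result_for(response, index):
    """Return the result of the operation at position `index`; "opsRes" keys are strings."""
    return response["opsRes"][str(index)]

# Three operations batched into a single request body (hypothetical helper names).
request = build_request(
    op("SETKEYSPACE", keyspace="myks"),
    op("SETCOLUMNFAMILY", columnfamily="mycf"),
    op("SLICE", rowkey="beers", start="Allagash", end="Sierra Nevada", size=9),
)
body = json.dumps(request)  # this string would be the HTTP payload

# The response from slide 16, parsed back into Python objects.
sample_response = json.loads(
    '{"exception": null, "exceptionId": null, '
    '"opsRes": {"0": "OK", "1": "OK", '
    '"2": [{"name": "Founders", "value": "Breakfast Stout"}]}}'
)
slice_result = result_for(sample_response, 2)
```

Keying results by position lets the client match each result to the operation that produced it without a second round trip.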
  17. Server side filter
  18. Imagine your data looks like...
       { "rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel" }
       { "rowkey": "beers", "name": "Founders", "value": "Breakfast Stout" }
       { "rowkey": "beers", "name": "DogfishHead", "value": "Hellhound IPA" }
  19. Application requirement
       ● The user's request needs to find which beers are “Breakfast Stout”(s)
       ● Common “solutions”:
           – Write a copy of the data sorted by type
           – Request all the data and parse on client side
  20. Using an IntraVert filter
       ● Send a function to the server
       ● Function is applied to subsequent get or slice operations
       ● Only results of the filter are returned to the client
  21. Defining a filter: JavaScript
       ● Syntax to create a filter
         { "type": "CREATEFILTER", "op": { "name": "stouts", "spec": "javascript", "value": "function(row) { if (row['value'] == 'Breakfast Stout') return row; else return null; }" } },
  22. Defining a filter: Groovy/Java
       ● We can define a Groovy closure or Java filter
         { "type": "CREATEFILTER", "op": { "name": "stouts", "spec": "groovy", "value": "{ row -> if (row['value'] == 'Breakfast Stout') return row else return null }" } },
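To make the filter contract from slides 20 to 22 concrete, here is a small local simulation: a filter receives one row and returns either a row (kept, possibly transformed) or nothing (dropped). It only models the semantics described on the slides; `apply_filter` is a hypothetical stand-in for what the server does and is not part of IntraVert.

```python
# The three rows from slide 18.
rows = [
    {"rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel"},
    {"rowkey": "beers", "name": "Founders", "value": "Breakfast Stout"},
    {"rowkey": "beers", "name": "DogfishHead", "value": "Hellhound IPA"},
]

def stouts(row):
    """Python equivalent of the "stouts" filter: keep the row or return None to drop it."""
    return row if row["value"] == "Breakfast Stout" else None

def uppercase_value(row):
    """A transforming filter: returns a modified copy instead of pruning."""
    return {**row, "value": row["value"].upper()}

def apply_filter(filter_fn, slice_rows):
    """Hypothetical model of the server side: run the filter over each row of a
    slice result and return only the rows the filter did not drop."""
    results = []
    for row in slice_rows:
        kept = filter_fn(row)
        if kept is not None:
            results.append(kept)
    return results

filtered = apply_filter(stouts, rows)
shouted = apply_filter(uppercase_value, rows)
```

With the filter running next to the data, only the matching row would cross the network, rather than the whole slice.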
  23. Filter flow
  24. Common filter use cases
       ● Transform data
       ● Prune columns/rows like a where clause
       ● Extract data from complex fields (JSON, XML, protobuf, etc.)
  25. Some light relief
  26. Server Side Multi-Processor
  27. It's the cure for your “redis envy”
  28. Imagine your data looks like...
       ● { "row key": "1", name: "a", val... }
       ● { "row key": "1", name: "b", val... }
       ● { "row key": "4", name: "a", val... }
       ● { "row key": "4", name: "z", val... }
  29. Application Requirements
       ● User wishes to intersect the column names of two slices/queries
       ● Common “solutions”
           – Pull all results to client and apply the intersection there
  30. Server Side MultiProcessor
       ● Send a class that implements the MultiProcessor interface to the server
       ● public List<Map> multiProcess(Map<Integer,Object> input, Map params);
       ● Do one or more get/slice operations as input
       ● Invoke MultiProcessor on input
  31. Multi-processor flow
  32. Multi-processor use cases
       ● Union of N slices
       ● Intersection of N slices
       ● Some “Join” scenarios
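The intersection and union use cases can be sketched as plain functions shaped like the `multiProcess(input, params)` signature from slide 30: the input maps an operation index to that operation's slice result. This is an illustrative Python analogue under that assumption, not the Java interface itself; the function names and the `{"name": ...}` result shape are hypothetical.

```python
def column_names(slice_rows):
    """Collect the column names found in one slice result."""
    return {col["name"] for col in slice_rows}

def intersect_column_names(inputs, params):
    """Return the column names present in every input slice.
    `inputs` maps operation index -> slice result; `params` is unused here."""
    slices = [inputs[i] for i in sorted(inputs)]
    if not slices:
        return []
    common = column_names(slices[0])
    for rows in slices[1:]:
        common &= column_names(rows)
    return [{"name": name} for name in sorted(common)]

def union_column_names(inputs, params):
    """Return every column name that appears in at least one input slice."""
    seen = set()
    for rows in inputs.values():
        seen |= column_names(rows)
    return [{"name": name} for name in sorted(seen)]

# The two rows from slide 28: row "1" has columns a and b, row "4" has a and z.
inputs = {
    0: [{"name": "a", "value": "..."}, {"name": "b", "value": "..."}],
    1: [{"name": "a", "value": "..."}, {"name": "z", "value": "..."}],
}
common = intersect_column_names(inputs, {})
everything = union_column_names(inputs, {})
```

Running this on the server means the client receives only the combined result instead of both full slices.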
  33. Fat client becomes the Phat client
  34. Imagine you want to insert this data
       ● User wishes to enter this event for multiple column families
           – 09/10/2011 11:12:13
           – App=Yahoo
           – Platform=iOS
           – OS=4.3.4
           – Device=iPad2,1
           – Resolution=768x1024
           – Events
               – videoPlayPercent=38
               – Taste=great
       http://www.slideshare.net/charmalloc/jsteincassandranyc2011
  35. Inserting the data
        aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"
        def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
          c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
        }
        aggregateKeys(KEYSPACE "ByMonth") = month   //201109
        aggregateKeys(KEYSPACE "ByDay") = day       //20110910
        aggregateKeys(KEYSPACE "ByHour") = hour     //2011091012
        aggregateKeys(KEYSPACE "ByMinute") = minute //201109101213
        def r(columnName: String): Unit = {
          aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
            val (columnFamily, row) = tuple
            if (row != null && row.size > 0)
              rows add (columnFamily -> row has columnName inc) //increment the counter
          } }
        }
        ccAppPlatformOSVersionDeviceResolution(r)
       http://www.slideshare.net/charmalloc/jsteincassandranyc2011
  36. Solution
       ● Send the data once and compute the N permutations on the server side
        public void process(JsonObject request, JsonObject state, JsonObject response, EventBus eb) {
          // Parameters sent once by the client
          JsonObject params = request.getObject("mpparams");
          String uid = params.getString("userid");
          String fname = params.getString("fname");
          String lname = params.getString("lname");
          String city = params.getString("city");
          // One row mutation keyed by user id, with one column per field
          RowMutation rm = new RowMutation("myks", IntraService.byteBufferForObject(uid));
          QueryPath qp = new QueryPath("users", null, IntraService.byteBufferForObject("fname"));
          rm.add(qp, IntraService.byteBufferForObject(fname), System.nanoTime());
          QueryPath qp2 = new QueryPath("users", null, IntraService.byteBufferForObject("lname"));
          rm.add(qp2, IntraService.byteBufferForObject(lname), System.nanoTime());
          ...
          try {
            StorageProxy.mutate(mutations, ConsistencyLevel.ONE);
            // Report success only if the write did not throw
            response.putString("status", "OK");
          } catch (WriteTimeoutException | UnavailableException | OverloadedException e) {
            e.printStackTrace();
            response.putString("status", "FAILED");
          }
        }
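The "send once, compute N permutations" idea can be illustrated with a small sketch that expands a single event into one counter increment per time bucket. Everything here is an assumption made for illustration: the bucket format strings, the `+` separator between values, and the function names are hypothetical and are not taken from IntraVert or from the cited Scala code.

```python
from datetime import datetime

# Hypothetical mapping of column family -> row key format (one row per time bucket).
BUCKET_FORMATS = {
    "ByMonth": "%Y%m",
    "ByDay": "%Y%m%d",
    "ByHour": "%Y%m%d%H",
    "ByMinute": "%Y%m%d%H%M",
}

def aggregate_column_name(event, fields):
    """Build a composite column name: the field labels, a '#', then the field values.
    Joining values with '+' is an assumption about the original helper."""
    label = "+".join(fields)
    values = "+".join(str(event[field]) for field in fields)
    return f"{label}#{values}"

def fan_out(event, when, fields):
    """Expand one event into (column family, row key, column name) counter increments."""
    column = aggregate_column_name(event, fields)
    return [(cf, when.strftime(fmt), column) for cf, fmt in BUCKET_FORMATS.items()]

# The event from slide 34, sent once by the client.
event = {
    "app": "Yahoo",
    "platform": "iOS",
    "osversion": "4.3.4",
    "device": "iPad2,1",
    "resolution": "768x1024",
}
when = datetime(2011, 9, 10, 11, 12, 13)
fields = ["app", "platform", "osversion", "device", "resolution"]
increments = fan_out(event, when, fields)
```

Doing this expansion on the server means the client payload stays the size of one event no matter how many column families are updated.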
  37. Service Processor Flow
  38. IntraVert status
       ● Still pre 1.0
       ● Good docs
           – https://github.com/zznate/intravert-ug/wiki/_pages
       ● Functionally equivalent to Thrift (most features ported)
       ● CQL support
       ● Virgil (coming soon)
       ● HBase-like scanners (coming soon)
  39. Hack at it: https://github.com/zznate/intravert-ug
  40. Questions?
