Cassandra + Hadoop = Brisk

An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.

Published in: Technology

  1. London
     Our sponsors: Acunu
  2. But first, a short back story…
  3. [Slide graphic: numbers 1 to 12]
  4. [Slide graphic: numbers 13 to 24, with the caption "GC HELL!"]
  5. [Slide graphic: numbers 25 to 36]
  7. Please volunteer if you would like to give a talk; Internet fame awaits
  8. My experience with Cassandra in production is positive
  9. Analytics is more difficult than it could be
  10. Welcome Brisk!
      Brisk combines Hadoop, Hive and Cassandra in a “distribution”
  11. In a nutshell
      CassandraFS as an HDFS-compatible layer; no namenode, no SPOF
  12. Can split the cluster for OLAP and OLTP workloads, scaling up either as required
  13. Demonstrating Brisk…
      Building an Ad Network!
  14. The plan:
      Simple data model: segment users into buckets
  15. System to put users in buckets via a pixel
  16. Real-time queries
  17. Analytics
      We Have Your Kidneys: the ad network for the paranoid generation
      Cookie-based identification
  18. API provides:
  19. Add user to a bucket (including ability to define expiry time)
  20. Get buckets a user belongs to
      Setup Brisk
      http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
      Step-by-step guide with pictures!
  21. Ubuntu 10.10 image with RAID 0 ephemeral disks
  22. Jairam has been bug-fixing some minor issues
  23. Data model
      CF = users
          [userUUID] [segmentID] = 1
      CF = segments
          [segmentID] [userUUID] = 1
  24. Data model
      create keyspace whyk
          with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
          and strategy_options = [{replication_factor:1}];
      create column family users
          with comparator = 'AsciiType'
          and rows_cached = 5000;
      create column family segments
          with comparator = 'AsciiType'
          and rows_cached = 5000;
  26. Our pixel
      http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>
      We’ll use Cassandra’s expiring columns feature
      PHP code (uses phpcassa)
      $pool = new ConnectionPool('whyk', array('localhost'));
      $users = new ColumnFamily($pool, 'users');
      $segments = new ColumnFamily($pool, 'segments');
      $users->insert(
          $userUuid,
          array($segment => 1),
          NULL, // default TS
          $expires
      );
      $segments->insert(
          $segment,
          array($userUuid => 1),
          NULL, // default TS
          $expires
      );
  27. Real-time access
      http://wehaveyourkidneys.com/show.php
      $pool = new ConnectionPool('whyk', array('localhost'));
      $users = new ColumnFamily($pool, 'users');
      // @todo this only gets first 100!
      $segments = $users->get($userUuid);
      header('Content-Type: application/json');
      echo json_encode(array_keys($segments));
  28. Analytics
      How many users in each segment?
      Launch Hive (very easy!)
      root@brisk-01:~# brisk hive
  29. CREATE EXTERNAL TABLE whyk.users
          (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");
      select segmentId, count(1) as total
      from whyk.users
      group by segmentId
      order by total desc;
  30. Summary
      http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/
  31. Real-time access + batch analytics
  32. Easy
      Easy to set up
      Easy to deploy mixed-mode clusters
      Easy to query (Hive)
  33. No Single Point of Failure
  34. Further reading…
      Installing the Brisk AMI
      http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
      Key advantages of Brisk, from Jonathan Ellis
      http://hackerne.ws/item?id=2528271
      Why I’m very excited about DataStax’s Brisk, by Nathan Milford
      http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
      The demo code on Github
      https://github.com/davegardnerisme/we-have-your-kidneys
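
The "Data model" and "Our pixel" slides store every membership twice, once under the user and once under the segment, and let Cassandra expire both columns after the requested number of seconds. The idea can be sketched in plain Python with dictionaries standing in for the two column families. This is only an illustration under stated assumptions: `SegmentStore`, its method names and the injectable `clock` are hypothetical and are not part of the demo code or of phpcassa.

```python
import time
from collections import defaultdict


class SegmentStore:
    """In-memory stand-in for the 'users' and 'segments' column families."""

    def __init__(self, clock=time.time):
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        # row key -> {column name -> expiry timestamp, or None for no expiry}
        self.users = defaultdict(dict)
        self.segments = defaultdict(dict)

    def add(self, user_uuid, segment, ttl_seconds=None):
        """Write both rows, mirroring the two inserts in the pixel code."""
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self.users[user_uuid][segment] = expires_at
        self.segments[segment][user_uuid] = expires_at

    def _live(self, row):
        """Return the column names in a row that have not expired yet."""
        now = self._clock()
        return sorted(name for name, exp in row.items() if exp is None or exp > now)

    def segments_for(self, user_uuid):
        """Real-time lookup: which buckets is this user in?"""
        return self._live(self.users.get(user_uuid, {}))

    def users_in(self, segment):
        """Reverse lookup: which users are in this bucket?"""
        return self._live(self.segments.get(segment, {}))
```

Writing both directions at insert time is what makes each lookup a single row read: no query ever has to scan one column family to answer a question about the other.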

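
The `@todo` on the "Real-time access" slide notes that the read only returns the first 100 columns. One common way around such a limit is to read the row in pages, restarting each slice at the last column already seen. The sketch below shows that loop against an in-memory sorted list. `get_slice` is a hypothetical stand-in for a single column-slice read, not a phpcassa call, and the page size of 100 is taken from the slide's comment.

```python
from bisect import bisect_left


def get_slice(sorted_columns, start="", count=100):
    """Hypothetical single slice read: up to `count` column names that are
    >= `start`, in comparator order (plain string order here)."""
    begin = bisect_left(sorted_columns, start)
    return sorted_columns[begin:begin + count]


def all_columns(sorted_columns, page_size=100):
    """Yield every column name in a row by issuing repeated slice reads."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2")
    start = ""
    first_page = True
    while True:
        page = get_slice(sorted_columns, start, page_size)
        # Slice starts are inclusive, so every page after the first begins
        # with the column that was already yielded; skip it.
        fresh = page if first_page else page[1:]
        yield from fresh
        if len(page) < page_size:
            return  # a short page means the row is exhausted
        start = page[-1]
        first_page = False
```

A page size of at least 2 is required because each follow-up page spends one slot on the repeated column.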
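
The Hive query on the "Analytics" slides counts users per segment and sorts by that count. For readers who do not know HiveQL, the same aggregation can be written in a few lines of Python over `(userUuid, segmentId, value)` tuples, the shape given by the `:key,:column,:value` mapping. The function name is hypothetical, and the tie-break on segment id is an addition for deterministic output; the Hive query does not specify an order for equal totals.

```python
from collections import Counter


def users_per_segment(rows):
    """rows: iterable of (user_uuid, segment_id, value) tuples, one per column.

    Returns (segment_id, total) pairs ordered by total, largest first.
    """
    counts = Counter(segment_id for _user, segment_id, _value in rows)
    # Sort by total descending; break ties by segment id (an assumption,
    # added only so the output order is stable).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For example, `users_per_segment([("u1", "cars", "1"), ("u2", "cars", "1"), ("u1", "golf", "1")])` returns `[("cars", 2), ("golf", 1)]`.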