Cassandra + Hadoop = Brisk

An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.

Published in: Technology

  • Started at Imagini; May 2010. New ad-targeting product! Lots of users. MySQL DB for profiles, MySQL-based server for events reporting. Profile DB cannot update rows, so we only insert; this means clients have to merge together all rows for a user on every read. MySQL DB has a habit of dying, requiring a repair and downtime; having 2 DBs managed to put off total death, but not for long.
  • Choosing Cassandra after some research: no single point of failure attractive, high write throughput attractive, linear scaling attractive. Welcome to GC hell! Start Cassandra London; like Alcoholics Anonymous, a support network.
  • Batch analytics; how? No Hive support, no support for streaming jar. Pig input reader. No output reader; require HDFS.
  • Keep up the meetups. Acunu generous at providing speakers; downside is hearing the sales pitch! 0.7 comes along; downside is that it is not compatible with 0.6 (Thrift interface changes). 0.8 comes along: CQL, counters. Brisk!
  • A summary
  • Some points about “distribution”. Some points about Cloudera and reaction.
  • Realtime + batch analytics combined. No single point of failure; we don’t need Hadoop’s namenode anymore. Cross-DC clusters.
  • No ads. No network. No publishers. Cool domain name.
  • User segments / buckets; have some ID, can expire (we’ll use Cassandra’s expiring columns). Real-time updates via a simple script (via CQL?). Real-time queries (is user in this segment? What segments is user X in?). Analytics (how many users in each segment? How many segments are users in on average? Std dev?).
  • Cassandra + Hadoop = Brisk

    1. London
        Our sponsors: Acunu
    2. But first, a short back story…
    3. [Figure: numbered grid, 1 to 12]
    4. [Figure: numbered grid, 13 to 24, labelled “GC HELL!”]
    5. [Figure: numbered grid, 25 to 36]
    6. [Figure: numbered grid, 25 to 36]
    7. Please volunteer if you would like to give a talk; Internet fame awaits
    8. My experience with Cassandra in production is positive
    9. Analytics is more difficult than it could be
    10. Welcome Brisk!
        Brisk combines Hadoop, Hive and Cassandra in a “distribution”
    11. In a nutshell: CassandraFS as an HDFS-compatible layer; no namenode, no SPOF
    12. Can split the cluster for OLAP and OLTP workloads, scaling up either as required
    13. Demonstrating Brisk… Building an Ad Network!
    14. The plan: simple data model (segment users into buckets)
    15. System to put users in buckets via a pixel
    16. Real-time queries
    17. Analytics
        We Have Your Kidneys: the ad network for the paranoid generation
        Cookie-based identification
    18. API provides:
    19. Add user to a bucket (including ability to define expiry time)
    20. Get buckets a user belongs to
        Setup Brisk: http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
        Step-by-step guide with pictures!
    21. Ubuntu 10.10 image with RAID 0 ephemeral disks
    22. Jairam has been bug-fixing some minor issues
    23. Data model
        CF = users
        [userUUID] [segmentID] = 1
        CF = segments
        [segmentID] [userUUID] = 1
    24. Data model
        create keyspace whyk
        ... with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
        ... and strategy_options = [{replication_factor:1}];
        create column family users
        ... with comparator = 'AsciiType'
        ... and rows_cached = 5000;
        create column family segments
        ... with comparator = 'AsciiType'
        ... and rows_cached = 5000;
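The two column families on the data model slide mirror each other: every membership is written once under the user and once under the segment, so both "which segments is user X in?" and "which users are in segment Y?" become single-row reads. A minimal in-memory sketch of that idea, in Python rather than against a real cluster, might look like the following. The `ColumnFamily` class, `add_user_to_segment` and the explicit `now` parameter are illustrative assumptions for this sketch, not part of Cassandra, phpcassa or the demo code.

```python
import time


class ColumnFamily:
    """In-memory stand-in (hypothetical) for a column family.

    Layout: row key -> {column name: (value, expires_at or None)}.
    """

    def __init__(self):
        self.rows = {}

    def insert(self, key, columns, ttl=None, now=None):
        # `now` is injectable so the expiry behaviour can be tested deterministically.
        now = time.time() if now is None else now
        expires_at = None if ttl is None else now + ttl
        row = self.rows.setdefault(key, {})
        for name, value in columns.items():
            row[name] = (value, expires_at)

    def get(self, key, now=None):
        # Return live columns only, sorted by column name.
        now = time.time() if now is None else now
        row = self.rows.get(key, {})
        return {
            name: value
            for name, (value, expires_at) in sorted(row.items())
            if expires_at is None or expires_at > now
        }


users = ColumnFamily()      # [userUUID][segmentID] = 1
segments = ColumnFamily()   # [segmentID][userUUID] = 1


def add_user_to_segment(user_uuid, segment_id, ttl=None, now=None):
    """Write the membership twice, once per lookup direction."""
    users.insert(user_uuid, {segment_id: 1}, ttl=ttl, now=now)
    segments.insert(segment_id, {user_uuid: 1}, ttl=ttl, now=now)
```

The cost of this layout is that every write happens twice and the two copies are not updated atomically; the benefit is that neither query needs a scan or a join.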
    26. Our pixel
        http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>
        We’ll use Cassandra’s expiring columns feature
        PHP code (uses phpcassa)
        $pool = new ConnectionPool('whyk', array('localhost'));
        $users = new ColumnFamily($pool, 'users');
        $segments = new ColumnFamily($pool, 'segments');
        $users->insert(
            $userUuid,
            array($segment => 1),
            NULL, // default TS
            $expires
        );
        $segments->insert(
            $segment,
            array($userUuid => 1),
            NULL, // default TS
            $expires
        );
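The slide does not show how `segment` and `expire` are read from the pixel URL before being passed to the inserts. One way that step could look, sketched here in Python with the standard library only, is below. The function name `parse_pixel_request`, the rule that a missing `expire` means "no expiry", and the rejection of zero are assumptions made for illustration; the demo code may behave differently.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: "alphaNumericCode" means ASCII letters and digits only.
SEGMENT_RE = re.compile(r"[A-Za-z0-9]+")
SECONDS_RE = re.compile(r"[0-9]+")


def parse_pixel_request(url):
    """Return (segment, ttl_seconds or None) for a pixel URL, or raise ValueError."""
    query = parse_qs(urlparse(url).query)

    segment = query.get("segment", [""])[0]
    if not SEGMENT_RE.fullmatch(segment):
        raise ValueError("segment must be a non-empty alphanumeric code")

    raw_expire = query.get("expire", [None])[0]
    if raw_expire is None:
        # Assumption: no expire parameter means the membership never expires.
        return segment, None
    if not SECONDS_RE.fullmatch(raw_expire) or int(raw_expire) == 0:
        raise ValueError("expire must be a positive whole number of seconds")
    return segment, int(raw_expire)
```

Validating before writing matters here because the segment code ends up as a column name in one column family and a row key in the other.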
    27. Real-time access
        http://wehaveyourkidneys.com/show.php
        $pool = new ConnectionPool('whyk', array('localhost'));
        $users = new ColumnFamily($pool, 'users');
        // @todo this only gets first 100!
        $segments = $users->get($userUuid);
        header('Content-Type: application/json');
        echo json_encode(array_keys($segments));
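The `@todo` on the real-time access slide notes that a single read returns only the first 100 columns, so a user in more than 100 segments would get a truncated answer. A common way around this is to page through the row: ask for a slice, then ask again starting from the last column seen. The sketch below shows that loop in Python against an in-memory row. `get_slice` is a hypothetical stand-in for a slice query with an inclusive start column; it is not the phpcassa API.

```python
def get_slice(row, column_start="", column_count=100):
    """Hypothetical slice query: up to column_count names >= column_start, sorted."""
    names = sorted(name for name in row if name >= column_start)
    return names[:column_count]


def iter_all_columns(row, page_size=100):
    """Yield every column name in the row, fetching page_size columns at a time."""
    if page_size < 2:
        # With an inclusive start column, a page of 1 could never advance.
        raise ValueError("page_size must be at least 2")
    start = ""
    first_page = True
    while True:
        page = get_slice(row, start, page_size)
        # The start column is inclusive, so every page after the first
        # begins with the last column of the previous page; skip it.
        fresh = page if first_page else page[1:]
        yield from fresh
        if len(page) < page_size:
            return  # a short page means the row is exhausted
        start = page[-1]
        first_page = False
```

For example, `list(iter_all_columns(row))` walks a 250-column row in three slice requests instead of stopping after the first.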
    28. Analytics
        How many users in each segment?
        Launch HIVE (very easy!)
        root@brisk-01:~# brisk hive
    29. CREATE EXTERNAL TABLE whyk.users
        (userUuid string, segmentId string, value string)
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
        WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value" );
        select segmentId, count(1) as total
        from whyk.users
        group by segmentId
        order by total desc;
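The Hive table maps each Cassandra column to one `(userUuid, segmentId, value)` row, and the query counts rows per segment. The speaker notes also ask how many segments users are in on average, and with what standard deviation. As a rough illustration of what those aggregates compute, here is the same logic in plain Python over a small list of tuples. The function names and sample shape are assumptions for this sketch, and the tie-break by segment name is an addition that the Hive query does not specify.

```python
from collections import Counter
from statistics import mean, pstdev


def users_per_segment(rows):
    """Mirror of: select segmentId, count(1) ... group by segmentId order by total desc."""
    counts = Counter(segment for _user, segment, _value in rows)
    # Sort by count descending; ties broken by segment name (not in the Hive query).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def segments_per_user_stats(rows):
    """Return (mean, population standard deviation) of segments per user."""
    per_user = Counter(user for user, _segment, _value in rows)
    sizes = list(per_user.values())
    if not sizes:
        return 0.0, 0.0
    return mean(sizes), pstdev(sizes)
```

Only users with at least one live segment appear in the input, so the average is over users who are in some segment, not over every cookie ever seen.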
    30. Summary
        http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/
    31. Real time access + Batch analytics
    32. Easy
        Easy to set up
        Easy to deploy mixed-mode clusters
        Easy to query (Hive)
    33. No Single Point of Failure
    34. Further reading…
        Installing the Brisk AMI
        http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
        Key advantages of Brisk, from Jonathan Ellis
        http://hackerne.ws/item?id=2528271
        Why I’m very excited about DataStax’s Brisk, by Nathan Milford
        http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
        The demo code on Github
        https://github.com/davegardnerisme/we-have-your-kidneys
