
Cassandra + Hadoop = Brisk

16,877 views

An introduction to DataStax's Brisk (a distribution of Cassandra, Hadoop and Hive). Includes a back story of my own experience with Cassandra plus a demo of Brisk built around a very simple ad-network-type application.

Published in: Technology

Speaker notes
  • Started at Imagini, May 2010. New ad-targeting product! Lots of users. MySQL DB for profiles, MySQL-based server for events reporting. Profile DB cannot update rows, so we only insert; this means clients have to merge together all rows for a user on every read. MySQL DB has a habit of dying, requiring a repair and downtime; having 2 DBs managed to put off total death, but not for long.
  • Chose Cassandra after some research; no single point of failure attractive, high write throughput attractive, linear scaling attractive. Welcome to GC hell! Started Cassandra London – like Alcoholics Anonymous; a support network.
  • Batch analytics; how? No Hive support, no support for streaming jar. Pig input reader, but no output reader; requires HDFS.
  • Keep up the meetups. Acunu generous at providing speakers; downside is hearing the sales pitch! 0.7 comes along; downside is it is not compatible with 0.6; Thrift interface changes. 0.8 comes along; CQL, counters. Brisk!
  • A summary
  • Some points about “distribution”. Some points about Cloudera and the reaction.
  • Real-time + batch analytics combined. No single point of failure; we don’t need Hadoop’s namenode anymore. Cross-DC clusters.
  • No ads, no network, no publishers. Cool domain name.
  • User segments / buckets; have some ID, can expire (we’ll use Cassandra’s expiring columns). Real-time updates via a simple script (via CQL?). Real-time queries (is user in this segment? What segments is user X in?). Analytics (how many users in each segment? How many segments are users in on average? Std dev?)
  • Transcript

    • 1. London
      Our sponsors:
      Acunu
    • 2. But first, a short back story…
    • 3. [Grid of numbers 1–12]
    • 4. [Grid of numbers 13–24; annotated “GC HELL!”]
    • 5.–6. [Grid of numbers 25–36]
    • 7. Please volunteer if you would like to give a talk; Internet fame awaits
    • 8.
      • My experience with Cassandra in production is positive
      • 9. Analytics is more difficult than it could be
      • 10. Welcome Brisk!
      • Brisk combines Hadoop, Hive and Cassandra in a “distribution”
    • 11. In a nutshell
      • CassandraFS as an HDFS-compatible layer; no namenode, no SPOF
      • 12. Can split cluster for OLAP and OLTP workloads, scaling up either as required
    • Demonstrating brisk…
      Building an Ad Network!
    • 13. Demonstrating brisk…
      Building an Ad Network!
    • 14. The plan:
      • Simple data model – segment users into buckets
      • 15. System to put users in buckets via a pixel
      • 16. Real-time queries
      • 17. Analytics
    • We Have Your Kidneys – the ad-network for the paranoid generation
      • Cookie-based identification
      • 18. API provides (a usage sketch follows this list):
      • 19. Add user to a bucket (including ability to define expiry time)
      • 20. Get buckets a user belongs to
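      As a quick illustration of how a publisher page might use this API (my addition; the segment code and expiry are example values, and the endpoint relies on the cookie-based identification above), the pixel could be embedded like so:

      // Hypothetical page snippet: drop the visiting user into the "sport"
      // segment for one day by embedding the add.php pixel
      $segment = 'sport';   // example alphanumeric segment code
      $expire  = 86400;     // seconds until the segment membership expires
      echo '<img src="http://wehaveyourkidneys.com/add.php'
         . '?segment=' . urlencode($segment)
         . '&expire=' . (int) $expire
         . '" width="1" height="1" alt="" />';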
    • Setup Brisk
      http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
      • Step-by-step guide with pictures!
      • 21. Ubuntu 10.10 image with RAID 0 ephemeral disks
      • 22. Jairam has been bug-fixing some minor issues
    • 23. Data model
      CF = users
      [userUUID] [segmentID] = 1
      CF = segments
      [segmentID] [userUUID] = 1
    • 24.–25. Data model
      create keyspace whyk
      ... with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
      ... and strategy_options = [{replication_factor:1}];
      create column family users
      ... with comparator = 'AsciiType'
      ... and rows_cached = 5000;
      create column family segments
      ... with comparator = 'AsciiType'
      ... and rows_cached = 5000;
    • 26. Our pixel
      http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>
      • We’ll use Cassandra’s expiring columns feature
    • PHP code – uses phpcassa
      $pool = new ConnectionPool('whyk', array('localhost'));
      $users = new ColumnFamily($pool, 'users');
      $segments = new ColumnFamily($pool, 'segments');
      $users->insert(
          $userUuid,
          array($segment => 1),
          NULL, // default TS
          $expires
      );
      $segments->insert(
          $segment,
          array($userUuid => 1),
          NULL, // default TS
          $expires
      );
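      The slide assumes $userUuid, $segment and $expires are already populated; a minimal sketch of that glue (my assumption, based on the add.php query parameters and the cookie-based identification mentioned earlier; the cookie name is hypothetical):

      // Read the segment code and TTL from the pixel's query string
      $segment = preg_replace('/[^a-zA-Z0-9]/', '', $_GET['segment']);
      $expires = (int) $_GET['expire'];   // seconds until the columns expire

      // Identify the user via a long-lived cookie, creating one if needed
      if (isset($_COOKIE['whyk_uuid'])) {
          $userUuid = $_COOKIE['whyk_uuid'];
      } else {
          $userUuid = bin2hex(openssl_random_pseudo_bytes(16)); // stand-in for a real UUID
          setcookie('whyk_uuid', $userUuid, time() + 10 * 365 * 86400);
      }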
    • 27. Real-time access
      http://wehaveyourkidneys.com/show.php
      $pool = new ConnectionPool('whyk', array('localhost'));
      $users = new ColumnFamily($pool, 'users');
      // @todo this only gets first 100!
      $segments = $users->get($userUuid);
      header('Content-Type: application/json');
      echo json_encode(array_keys($segments));
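      The @todo above flags that get() returns at most 100 columns by default. A rough paging sketch that could replace the single get() and echo above (my addition; it assumes the 0.8-era phpcassa ColumnFamily::get() argument order of key, columns, column_start, column_finish, column_reversed, column_count, and omits handling for users with no row):

      $allSegments = array();
      $start = '';
      $pageSize = 100;
      $first = true;
      while (true) {
          // fetch the next page of segment columns for this user
          $page = $users->get($userUuid, null, $start, '', false, $pageSize);
          $names = array_keys($page);
          if (!$first) {
              array_shift($names);   // first column repeats the previous page's last one
          }
          $allSegments = array_merge($allSegments, $names);
          if (count($page) < $pageSize) {
              break;                 // short page: nothing left to read
          }
          $keys = array_keys($page);
          $start = end($keys);       // resume from the last column seen
          $first = false;
      }
      header('Content-Type: application/json');
      echo json_encode($allSegments);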
    • 28. Analytics
      How many users in each segment?
      Launch HIVE (very easy!)
      root@brisk-01:~# brisk hive
    • 29. CREATE EXTERNAL TABLE whyk.users
      (userUuid string, segmentId string, value string)
      STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
      WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value" );
      select segmentId, count(1) as total
      from whyk.users
      group by segmentId
      order by total desc;
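      The speaker notes also ask how many segments users belong to on average; a sketch of that query over the same external table (my addition; it assumes the avg() and stddev_pop() aggregates are available in the Hive version bundled with Brisk):

      select avg(t.cnt) as avg_segments_per_user,
             stddev_pop(t.cnt) as stddev_segments_per_user
      from (
          select userUuid, count(1) as cnt
          from whyk.users
          group by userUuid
      ) t;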
    • 30. Summary
      http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/
    • 31. Real-time access +
      Batch analytics
    • 32. Easy
      Easy to set up
      Easy to deploy mixed-mode clusters
      Easy to query (Hive)
    • 33. No Single Point of Failure
    • 34. Further reading…
      Installing the Brisk AMI
      http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
      Key advantages of Brisk – from Jonathan Ellis
      http://hackerne.ws/item?id=2528271
      Why I’m very excited about DataStax’s Brisk – by Nathan Milford
      http://blog.milford.io/2011/04/why-i-am-very-excited-about-datastaxs-brisk/
      The demo code on Github
      https://github.com/davegardnerisme/we-have-your-kidneys