Cassandra London
Our sponsors: Acunu
But first, a short back story…
GC HELL!
Please volunteer if you would like to give a talk; Internet fame awaits.
• My experience with Cassandra in production is positive
• Analytics is more difficult than it could be
• Welcome, Brisk! Brisk combines Hadoop, Hive and Cassandra in a "distribution"
In a nutshell:
• CassandraFS as an HDFS-compatible layer; no namenode, no SPOF
• Can split the cluster for OLAP and OLTP workloads, scaling up either as required

Demonstrating Brisk… Building an Ad Network!
The plan:
• Simple data model: segment users into buckets
• System to put users into buckets via a pixel
• Real-time queries
• Analytics

We Have Your Kidneys
The ad network for the paranoid generation
• Cookie-based identification
The API provides:
• Add a user to a bucket (including the ability to define an expiry time)
• Get the buckets a user belongs to

Setup Brisk
http://www.datastax.com/docs/0.8/brisk/install_brisk_ami
Step-by-step guide with pictures!
• Ubuntu 10.10 image with RAID 0 ephemeral disks
• Jairam has been bug-fixing some minor issues
Data model
CF = users:    [userUUID][segmentID] = 1
CF = segments: [segmentID][userUUID] = 1
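The point of the dual index above is that every write lands in both column families, so both lookup directions are a single-row read. A minimal in-memory sketch of that pattern (plain dicts standing in for the two CFs; all names are illustrative, not part of the talk):

```python
# Toy model of the two column families: writes are mirrored into both
# indexes, so each read direction is one row lookup.
users = {}     # userUuid  -> {segmentId: 1}
segments = {}  # segmentId -> {userUuid: 1}

def add_user_to_segment(user_uuid, segment_id):
    # Mirror the write into both "column families".
    users.setdefault(user_uuid, {})[segment_id] = 1
    segments.setdefault(segment_id, {})[user_uuid] = 1

def segments_for_user(user_uuid):
    # Single-row read of the users CF.
    return sorted(users.get(user_uuid, {}))

def users_in_segment(segment_id):
    # Single-row read of the segments CF.
    return sorted(segments.get(segment_id, {}))

add_user_to_segment("u1", "sports")
add_user_to_segment("u1", "cars")
add_user_to_segment("u2", "sports")
print(segments_for_user("u1"))    # ['cars', 'sports']
print(users_in_segment("sports")) # ['u1', 'u2']
```

The cost is double writes, but in Cassandra writes are cheap and there are no server-side joins, so denormalising both directions up front is the idiomatic choice.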
Data model

create keyspace whyk
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = [{replication_factor:1}];

create column family users
    with comparator = 'AsciiType'
    and rows_cached = 5000;

create column family segments
    with comparator = 'AsciiType'
    and rows_cached = 5000;
Our pixel

http://wehaveyourkidneys.com/add.php?segment=<alphaNumericCode>&expire=<numberOfSeconds>

We'll use Cassandra's expiring columns feature.

PHP code (uses phpcassa):

$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
$segments = new ColumnFamily($pool, 'segments');

$users->insert($userUuid, array($segment => 1), NULL /* default TS */, $expires);
$segments->insert($segment, array($userUuid => 1), NULL /* default TS */, $expires);
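The final `$expires` argument is a per-column TTL: once it elapses, the column simply disappears from reads (and is later compacted away), which is what lets segment membership lapse with no cleanup job. A toy sketch of those semantics in plain Python (the class and its API are invented for illustration, not a real client):

```python
import time

class ExpiringColumns:
    """Toy model of Cassandra's expiring columns: each column carries a TTL."""

    def __init__(self):
        self._rows = {}  # key -> {column: (value, expires_at)}

    def insert(self, key, columns, ttl, now=None):
        now = time.time() if now is None else now
        row = self._rows.setdefault(key, {})
        for col, val in columns.items():
            # A TTL of None means the column never expires.
            row[col] = (val, None if ttl is None else now + ttl)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        row = self._rows.get(key, {})
        # Expired columns are invisible to reads.
        return {c: v for c, (v, exp) in row.items() if exp is None or exp > now}

cf = ExpiringColumns()
cf.insert("u1", {"sports": 1}, ttl=60, now=1000)
print(cf.get("u1", now=1030))  # {'sports': 1}  -- still live
print(cf.get("u1", now=1070))  # {}             -- expired after 60s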
Real-time access

http://wehaveyourkidneys.com/show.php

$pool = new ConnectionPool('whyk', array('localhost'));
$users = new ColumnFamily($pool, 'users');
// @todo this only gets the first 100 columns!
$segments = $users->get($userUuid);
header('Content-Type: application/json');
echo json_encode(array_keys($segments));
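The `@todo` flags a real bug: a plain `get()` returns only the first 100 columns by default, so a user in more than 100 segments would be silently truncated. The usual fix is to page through the row, restarting each bounded fetch at the last column seen. A sketch of that loop against a stand-in fetch function (not a live cluster; `fetch`'s signature mimics a client get but is an assumption here):

```python
def page_all_columns(fetch, key, page_size=100):
    """Collect every column of a wide row via repeated bounded gets.

    `fetch(key, column_start, column_count)` stands in for a client call and
    must return columns in comparator order.
    """
    result = {}
    start = ""
    while True:
        page = fetch(key, column_start=start, column_count=page_size)
        result.update(page)
        if len(page) < page_size:
            return result  # short page: we have reached the end of the row
        # The next page starts at the last column seen; the range bound is
        # inclusive, so that column comes back once more and the dict update
        # deduplicates it.
        start = max(page)

# Stand-in backend: one wide row with 250 columns.
ROW = {f"seg{i:03d}": 1 for i in range(250)}

def fake_fetch(key, column_start, column_count):
    cols = sorted(c for c in ROW if c >= column_start)
    return {c: ROW[c] for c in cols[:column_count]}

all_cols = page_all_columns(fake_fetch, "u1", page_size=100)
print(len(all_cols))  # 250
```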
Analytics

How many users are in each segment? Launch Hive (very easy!):

root@brisk-01:~# brisk hive

CREATE EXTERNAL TABLE whyk.users
    (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value");

SELECT segmentId, count(1) AS total
FROM whyk.users
GROUP BY segmentId
ORDER BY total DESC;
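For intuition about what the Hive job computes: each (userUuid, segmentId) column becomes one table row, and the query is a group-by count over segmentId, ordered descending. The same aggregation over a tiny hand-made sample in plain Python (sample data invented for illustration):

```python
from collections import Counter

# Sample (userUuid, segmentId) rows, as the Hive external table exposes them.
rows = [
    ("u1", "sports"), ("u1", "cars"),
    ("u2", "sports"), ("u3", "sports"),
]

# Equivalent of: SELECT segmentId, count(1) AS total
#                GROUP BY segmentId ORDER BY total DESC
totals = Counter(segment for _, segment in rows)
for segment, total in totals.most_common():
    print(segment, total)
# sports 3
# cars 1
```

The difference, of course, is that Hive runs this as a MapReduce job across the whole cluster, which is exactly what Brisk's OLAP side is for.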
Summary
http://www.flickr.com/photos/sovietuk/2956044892/sizes/o/in/photostream/

Cassandra + Hadoop = Brisk

Editor's Notes

  • #4 Started at Imagini, May 2010. New ad-targeting product! Lots of users. MySQL DB for profiles, MySQL-based server for events reporting. The profile DB cannot update rows, so we only insert; this means clients have to merge together all rows for a user on every read. The MySQL DB has a habit of dying, requiring a repair and downtime; having two DBs managed to put off total death, but not for long.
  • #5 Choosing Cassandra after some research; no single point of failure attractive, high write throughput attractive, linear scaling attractive. Welcome to GC hell! Start Cassandra London, like Alcoholics Anonymous: a support network.
  • #6 Batch analytics: how? No Hive support, no support for streaming jar. Pig input reader, but no output reader; requires HDFS.
  • #7 Keep up the meetups. Acunu generous at providing speakers; downside is hearing the sales pitch! 0.7 comes along; downside is it is not compatible with 0.6 (Thrift interface changes). 0.8 comes along: CQL, counters. Brisk!
  • #9 A summary
  • #10 Some points about "distribution". Some points about Cloudera and reaction.
  • #11 Real-time + batch analytics combined. No single point of failure; we don't need Hadoop's namenode anymore. Cross-DC clusters.
  • #14 No ads. No network. No publishers. Cool domain name.
  • #15 User segments / buckets: have some ID, can expire (we'll use Cassandra's expiring columns). Real-time updates via a simple script (via CQL?). Real-time queries (is a user in this segment? What segments is user X in?). Analytics (how many users in each segment? How many segments are users in on average? Std dev?).