RTB
- How often?
  10-30M auctions per day for mobile devices in Russia
  e.g. 10-30 GB of data per day
- What do we need?
  Be effective in showing ads
- They call it "big data & data mining"
We decided to use Cassandra & Hadoop for that
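To put the daily volumes in perspective, here is a back-of-the-envelope calculation. It is only a sketch under two assumptions that the slide does not state: the data figure is in gigabytes of 10^9 bytes, and the low and high ends of the two ranges correspond to each other.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(auctions_per_day):
    """Average auction rate, ignoring daily peaks."""
    return auctions_per_day / SECONDS_PER_DAY


def bytes_per_auction(bytes_per_day, auctions_per_day):
    """Average stored size of one auction record."""
    return bytes_per_day / auctions_per_day


# Assumption: 10M auctions pair with 10 GB, 30M auctions with 30 GB.
for auctions, gigabytes in [(10_000_000, 10), (30_000_000, 30)]:
    rate = per_second(auctions)
    size = bytes_per_auction(gigabytes * 10**9, auctions)
    print(f"{auctions:,} auctions/day: about {rate:.0f}/s, {size:.0f} bytes each")
```

Under these assumptions the average rate is a few hundred auctions per second, at roughly 1 KB per auction.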
1. Cassandra tokens
∆ = [T0, T4] - token range
ex. ∆ = [-2^63, +2^63 - 1] (Murmur3Partitioner)
Every time one writes (K,V) into Cassandra:
- ex. token(K) in [T2, T3]
- (K,V) will be put onto node 3 (if the replication factor is 1)
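The token-to-node lookup can be sketched as a binary search over the range boundaries. This is a simplified illustration, not Cassandra's implementation: it assumes four nodes with evenly spaced boundaries T0..T4, a replication factor of 1, and no wrap-around handling. The helper names `make_boundaries` and `node_for_token` are hypothetical.

```python
import bisect

MIN_TOKEN = -2**63       # T0
MAX_TOKEN = 2**63 - 1    # T4 in the four-node example


def make_boundaries(num_nodes):
    """Inner boundaries T1..T(n-1) of an evenly split ring (assumption)."""
    width = 2**64 // num_nodes
    return [MIN_TOKEN + i * width for i in range(1, num_nodes)]


def node_for_token(token, boundaries):
    """1-based node number; node i owns the range (T(i-1), T(i)]."""
    return bisect.bisect_left(boundaries, token) + 1


boundaries = make_boundaries(4)        # [T1, T2, T3] = [-2^62, 0, 2^62]
print(node_for_token(5, boundaries))   # a token between T2 and T3
```

With these boundaries, a key whose token falls between T2 and T3 lands on node 3, as in the slide's example.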
2. Cassandra Load Balancing
Partitioner generates tokens for your keys
E.g. it creates token(K)
Cassandra offers the following partitioners:
● Murmur3Partitioner (default): Uniformly distributes data across the cluster based on MurmurHash hash values.
● RandomPartitioner: Uniformly distributes data across the cluster based on MD5 hash values.
● ByteOrderedPartitioner: Keeps an ordered distribution of data lexically by key bytes.
The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.
http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
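The difference between hashed and byte-ordered placement can be illustrated with a small experiment. This is a sketch, not Cassandra code: `md5_token` is an MD5-based token in the spirit of RandomPartitioner, `byte_order_bucket` places keys by their first byte only, and the four equal-width node ranges and the `user00000`-style keys are made up for the example.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # assumption: four nodes with equal-width token ranges


def md5_token(key: bytes) -> int:
    """MD5-based token in [0, 2**127], in the spirit of RandomPartitioner."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def hashed_bucket(key: bytes) -> int:
    """Node index (0-based) when keys are placed by hashed token."""
    return min(md5_token(key) * NUM_NODES // 2**127, NUM_NODES - 1)


def byte_order_bucket(key: bytes) -> int:
    """Node index (0-based) when keys are placed by their first byte."""
    return key[0] * NUM_NODES // 256


keys = [f"user{i:05d}".encode() for i in range(10000)]
hashed = Counter(hashed_bucket(k) for k in keys)
ordered = Counter(byte_order_bucket(k) for k in keys)
print("hashed placement:    ", sorted(hashed.items()))
print("byte-order placement:", sorted(ordered.items()))
```

Keys that share a prefix spread evenly under hashing but pile onto one node under byte ordering, which is why the hashed partitioners are the usual choice.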
3. Cassandra indexes and knows
- Cassandra supports common data formats
  E.g. byte, string, long, double
- Cassandra supports secondary indexes
  E.g. you can select a subset of your data instead of reading it in bulk
- Cassandra knows how much data (how many records) is in every token range
Cassandra & Map-Reduce
Google says:
1. Cassandra is integrated with Map-Reduce
   http://wiki.apache.org/cassandra/HadoopSupport
2. It is outdated
3. It is used for Hadoop 1.0.3 or whatever version
This means: Please install hadoop+mr cluster yourself
Cassandra & Map-Reduce (we want)
1. Cloudera Hadoop Distribution (CDH4)
   Cloudera Manager installs your cluster in a couple of clicks
2. Up to date (Cassandra 1.1.x - 1.2.x)
Solution:
A) Take the Cassandra sources from http://cassandra.apache.org/download/
B) Take the package org.apache.cassandra.hadoop and recompile it against the dependencies from CDH4 & Cassandra [1.x]
And the jar is ready to go for your map-reduce jobs
1. Allocate your cluster
DataStax says:
To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node.
Why?
Because this: [figure not preserved]
works 100 times faster than this: [figure not preserved]
2. Number of map tasks
Job control parameter: InputSplitSize (default 65536)
Sets approximately how many rows one mapper will receive
Every map task has its own token range to read data from: [-2^63, +2^63 - 1] / number of map tasks
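The relationship between InputSplitSize, the number of map tasks, and the per-task token ranges can be sketched as follows. This follows the slide's simplified picture of dividing the full token range evenly; it is not how Cassandra actually computes splits, and the function names and the row estimate of one million are hypothetical.

```python
import math

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1
INPUT_SPLIT_SIZE = 65536  # default from the slide, in rows


def num_map_tasks(estimated_rows, split_size=INPUT_SPLIT_SIZE):
    """Roughly one map task per split_size rows, at least one task."""
    return max(1, math.ceil(estimated_rows / split_size))


def token_subranges(num_tasks):
    """Divide [MIN_TOKEN, MAX_TOKEN] into num_tasks inclusive sub-ranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in all
    bounds = [MIN_TOKEN + total * i // num_tasks for i in range(num_tasks + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_tasks)]


tasks = num_map_tasks(1_000_000)  # hypothetical row estimate
for start, end in token_subranges(tasks)[:3]:
    print(f"map task reads tokens [{start}, {end}]")
```

A smaller InputSplitSize gives more, shorter-lived map tasks; a larger one gives fewer tasks that each read a wider token range.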
3. How job reads the data
Job control parameter: RangeBatchSize (default: 4096)
Number of rows fetched per request, after applying your filters (primary & secondary indexes)
Cassandra does the filtering job on the server side
( [-2^63, +2^63 - 1] / number of map tasks )
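How a map task pages through its token range in batches can be sketched like this. It is a toy model, not the Cassandra client API: `fetch_batch` is a hypothetical stand-in for one server-side range query with a filter, rows are held in an in-memory list sorted by token, and tokens are assumed unique.

```python
RANGE_BATCH_SIZE = 4096  # default from the slide, in rows


def fetch_batch(rows, start_token, end_token, limit, predicate):
    """Hypothetical server call: up to `limit` matching rows in the range."""
    batch = []
    for token, value in rows:  # rows are sorted by token
        if start_token <= token <= end_token and predicate(value):
            batch.append((token, value))
            if len(batch) == limit:
                break
    return batch


def read_range(rows, start_token, end_token, predicate,
               batch_size=RANGE_BATCH_SIZE):
    """Read a whole token range batch by batch; returns (rows, requests)."""
    result, requests, cursor = [], 0, start_token
    while cursor <= end_token:
        batch = fetch_batch(rows, cursor, end_token, batch_size, predicate)
        requests += 1
        result.extend(batch)
        if len(batch) < batch_size:   # short batch: the range is exhausted
            break
        cursor = batch[-1][0] + 1     # resume right after the last token seen
    return result, requests


rows = [(token, token % 3) for token in range(10000)]  # made-up data
matched, requests = read_range(rows, 0, 9999, lambda v: v == 0, batch_size=1000)
print(len(matched), "rows in", requests, "requests")
```

Because the filter runs on the server side in this model, only matching rows count toward each batch, so a selective filter reduces the number of round trips.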
Conclusion
Pros:
1. Easy to manage (Cassandra cluster & cloudera manager is
2. Easy to index
3. Supports a query language & data types
Cons:
1. Scaling is extremely expensive (every node should run cassandra + hadoop)
2. Reading is very slow
3. Reading big amounts of data is impossible
Note: Netflix is using Cassandra to manage the data. But their map-reduce jobs are reading sstable files directly, avoiding Cassandra!
http://www.datastax.com/dev/blog/2012-in-review-performance