Cassandra & MapReduce


Some notes on Cassandra & MapReduce integration

Published in: Technology

  1. 1. Cassandra & MR and how we used it in
  2. 2. Agenda
     - Why do we need Cassandra & MapReduce
     - 3 notes about Cassandra
     - 3 notes about Cassandra + MapReduce
  3. 3. Real Time Bidding (RTB)
  4. 4. RTB
     - How often? 10-30M auctions per day for mobile devices in Russia, e.g. 10-30 GB of data per day
     - What do we need? To be effective in showing ads
     - They call it "big data & data mining"
     We decided to use Cassandra & Hadoop for that
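To put the RTB volumes above into per-second and per-auction terms, here is a minimal arithmetic sketch. It assumes that the low and high ends of the two ranges pair up (10M auctions with 10 GB, 30M with 30 GB) and that 1 GB means 10^9 bytes; the slide states neither, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def auctions_per_second(auctions_per_day):
    """Average auction rate, assuming traffic is spread evenly over the day."""
    return auctions_per_day / SECONDS_PER_DAY


def bytes_per_auction(bytes_per_day, auctions_per_day):
    """Average amount of data stored per auction."""
    return bytes_per_day / auctions_per_day


# Assumption: the low ends (10M auctions, 10 GB) and the high ends belong together.
low_rate = auctions_per_second(10_000_000)
high_rate = auctions_per_second(30_000_000)
record_size = bytes_per_auction(10 * 10**9, 10_000_000)

print(f"{low_rate:.0f} to {high_rate:.0f} auctions/s, about {record_size:.0f} bytes each")
```

Under these assumptions the average load is on the order of a few hundred auctions per second at roughly 1 KB per auction; real traffic peaks would be higher than the daily average.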
  5. 5. 1. Cassandra tokens
     ∆ = [T0, T4] is the token range, e.g. ∆ = [-2^63, +2^63 - 1]
     Every time one writes (K, V) into Cassandra:
     - e.g. token(K) falls in [T2, T3]
     - (K, V) will be put on node 3 (with replication factor 1)
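The token-to-node lookup described on this slide can be sketched as follows. This is a simplified model, not Cassandra's code: `token` is a hypothetical stand-in that truncates an MD5 digest to a signed 64-bit value (a real cluster would use its configured partitioner), `NODE_TOKENS` is an invented four-node ring, and replication is ignored.

```python
import bisect
import hashlib

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

# Hypothetical ring: node i owns the range (T[i-1], T[i]], with T0 = MIN_TOKEN.
NODE_TOKENS = [-2**62, 0, 2**62, MAX_TOKEN]  # T1, T2, T3, T4


def token(key):
    """Stand-in partitioner: map a bytes key to a signed 64-bit token.

    Assumption for illustration only; this is not Murmur3.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner(tok, node_tokens):
    """Return the 1-based index of the node whose range contains tok."""
    idx = bisect.bisect_left(node_tokens, tok)
    # A token above the last node token wraps around to the first node.
    return (idx % len(node_tokens)) + 1


key = b"user:42"
print(key, "->", token(key), "-> node", owner(token(key), NODE_TOKENS))
```

With this ring, any token in (T2, T3], i.e. between 1 and 2^62, resolves to node 3, which matches the example on the slide.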
  6. 6. 2. Cassandra Load Balancing
     The partitioner generates tokens for your keys, e.g. it creates token(K)
     Cassandra offers the following partitioners:
     ● Murmur3Partitioner (default): Uniformly distributes data across the cluster based on MurmurHash hash values.
     ● RandomPartitioner: Uniformly distributes data across the cluster based on MD5 hash values.
     ● ByteOrderedPartitioner: Keeps an ordered distribution of data lexically by key bytes.
     The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.
  7. 7. 3. Cassandra indexes and knows
     - Cassandra supports common data types, e.g. bytes, string, long, double
     - Cassandra supports secondary indexes, e.g. you can select a subset of your data instead of reading it in bulk
     - Cassandra knows how much data (how many records) is in every token range
  8. 8. Cassandra & Map-Reduce
     Google says:
     1. Cassandra is integrated with Map-Reduce
     2. It is outdated
     3. It is used for Hadoop 1.0.3 or whatever version
     This means: please install the Hadoop + MR cluster yourself
  9. 9. Cassandra & Map-Reduce (we want)
     1. Cloudera Hadoop Distribution (CDH4): Cloudera Manager installs your cluster in a couple of clicks
     2. Up to date (Cassandra 1.1.x - 1.2.x)
     Solution:
     A) Take Cassandra sources from
     Take the package org.apache.cassandra.hadoop and recompile it against the dependencies from CDH4 & Cassandra [1.x]
     The jar is then ready to go for your map-reduce jobs
  10. 10. 1. Allocate your cluster
      DataStax says: "To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node."
      Why?
  11. 11. Because this:
      works 100 times faster than this:
  12. 12. 2. Number of map tasks
      Job control parameter: InputSplitSize (default 65536)
      Estimates how much data (how many rows) one mapper will receive
      Every map task has its own token range to read data from: [-2^63, +2^63 - 1] / number of map tasks
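The relationship between InputSplitSize, the number of map tasks, and the per-task token ranges can be sketched like this. It is a deliberately simplified model: it assumes the total row count is known and that rows are spread evenly over the ring, so equal-width token ranges hold roughly equal row counts. The function names are hypothetical, and the actual input format may derive its splits differently (for instance from per-node size estimates).

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def estimate_map_tasks(total_rows, input_split_size=65536):
    """Number of map tasks if each one should receive about input_split_size rows."""
    return max(1, -(-total_rows // input_split_size))  # integer ceiling division


def split_token_range(num_tasks):
    """Cut [MIN_TOKEN, MAX_TOKEN] into num_tasks contiguous (start, end] ranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // num_tasks for i in range(num_tasks + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


tasks = estimate_map_tasks(1_000_000)
ranges = split_token_range(tasks)
print(tasks, "map tasks; first range:", ranges[0])
```

Lowering InputSplitSize in this model produces more, smaller token ranges and therefore more map tasks, which is the trade-off the parameter controls.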
  13. 13. 3. How the job reads the data
      Job control parameter: RangeBatchSize (default: 4096)
      The number of rows read in one batch, after your filters (primary & secondary indexes) are applied
      Cassandra does the filtering on the server side
      Each task reads its own range: [-2^63, +2^63 - 1] / number of map tasks
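A toy model of the batched read described on this slide: a map task walks its token range and receives rows in batches of RangeBatchSize, with the filter applied before rows are handed over. Everything here is an illustrative assumption, including the in-memory `rows` list standing in for a node's data and the `read_split` name; it does not call any Cassandra API.

```python
def read_split(rows, start, end, batch_size=4096, predicate=None):
    """Yield lists of (token, value) pairs with start < token <= end.

    rows      -- iterable of (token, value) pairs sorted by token (stand-in for a node)
    predicate -- optional filter on the value, applied "server side" before batching
    """
    batch = []
    for tok, value in rows:
        if not (start < tok <= end):
            continue  # outside this map task's token range
        if predicate is not None and not predicate(value):
            continue  # filtered out before it reaches the mapper
        batch.append((tok, value))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


rows = [(i, i) for i in range(10)]
batches = list(read_split(rows, 0, 9, batch_size=4, predicate=lambda v: v % 2 == 1))
print([len(b) for b in batches])
```

Because filtering happens before batching in this sketch, a selective filter means fewer batches reach the mapper, which is the benefit of pushing the filter to the server.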
  14. 14. Pros:
      1. Easy to manage (Cassandra cluster & Cloudera Manager is
      2. Easy to index
      3. Supports a query language & data types
      Cons:
      1. Scaling is extremely expensive (every node should run Cassandra + Hadoop)
      2. Reading is very slow
      3. Reading large amounts of data is impossible
      Note: Netflix uses Cassandra to manage the data, but their map-reduce jobs read the SSTable files directly, avoiding Cassandra!