Cassandra&map reduce

Some notes on Cassandra&MapReduce integration

Published in: Technology

  • 1. Cassandra & MR and how we used it in
  • 2. Agenda
      - Why do we need Cassandra & MapReduce
      - 3 notes about Cassandra
      - 3 notes about Cassandra + MapReduce
  • 3. Real Time Bidding (RTB)
  • 4. RTB
      - How often? 10-30M auctions per day for mobile devices in Russia, e.g. 10-30 Gb of data per day
      - What do we need? To be effective in showing ads
      - They call it "big data & data mining"
      We decided to use Cassandra & Hadoop for that.
  • 5. 1. Cassandra tokens
      ∆ = [T0, T4] is the token range, e.g. ∆ = [-2^63, +2^63]
      Every time one writes (K,V) into Cassandra:
      - e.g. token(K) falls in [T2, T3]
      - (K,V) is put onto node 3 (with a replication factor of 1)
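The token lookup on this slide can be sketched in a few lines. This is a minimal illustration, not Cassandra's code: it assumes four nodes with evenly spaced boundaries T0..T4 over [-2^63, +2^63], and that node i owns the tokens in (T(i-1), T(i)]. The names `BOUNDARIES` and `node_for_token` are made up for the example.

```python
from bisect import bisect_left

# Assumed layout: 4 nodes, evenly spaced boundaries T0..T4 (hypothetical values).
T0, T4 = -2**63, 2**63
NODES = 4
STEP = (T4 - T0) // NODES
BOUNDARIES = [T0 + i * STEP for i in range(NODES + 1)]  # [T0, T1, T2, T3, T4]


def node_for_token(token):
    """Return the 1-based node whose range (T[i-1], T[i]] contains the token."""
    if not T0 <= token <= T4:
        raise ValueError("token outside the ring")
    # bisect_left returns the index of the first boundary >= token;
    # max(1, ...) maps the lower edge T0 itself onto node 1.
    return max(1, bisect_left(BOUNDARIES, token))
```

With these boundaries T2 = 0 and T3 = 2^62, so any token in that interval, for example `node_for_token(1)`, lands on node 3, matching the slide.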
  • 6. 2. Cassandra load balancing
      The partitioner generates tokens for your keys, i.e. it computes token(K).
      Cassandra offers the following partitioners:
      ● Murmur3Partitioner (default): Uniformly distributes data across the cluster based on MurmurHash hash values.
      ● RandomPartitioner: Uniformly distributes data across the cluster based on MD5 hash values.
      ● ByteOrderedPartitioner: Keeps an ordered distribution of data lexically by key bytes.
      The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.
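To see why the hash-based partitioners balance load while the byte-ordered one may not, here is a rough sketch. It does not reproduce Cassandra's actual token functions: `hashed_token` uses the first 8 bytes of an MD5 digest purely as a stand-in hash, `byte_ordered_token` uses the first 8 key bytes, and `bucket` splits the token range into four equal hypothetical node ranges.

```python
import hashlib

RING = 2**64          # width of the assumed token range [-2^63, 2^63)
MIN_TOKEN = -2**63


def hashed_token(key: bytes) -> int:
    """Stand-in for a hash partitioner: signed 64-bit value from an MD5 digest."""
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big", signed=True)


def byte_ordered_token(key: bytes) -> int:
    """Stand-in for an order-preserving partitioner: first 8 key bytes, shifted."""
    return int.from_bytes(key[:8].ljust(8, b"\0"), "big") + MIN_TOKEN


def bucket(token: int, nodes: int = 4) -> int:
    """Map a token to one of `nodes` equal-width ranges (0-based)."""
    return (token - MIN_TOKEN) * nodes // RING


keys = [b"user%04d" % i for i in range(1, 1001)]
hashed_buckets = {bucket(hashed_token(k)) for k in keys}
ordered_buckets = {bucket(byte_ordered_token(k)) for k in keys}
```

In this sketch the keys share the prefix `user`, so the order-preserving tokens all fall into one range (`ordered_buckets` has a single element), while the hashed tokens spread over several ranges.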
  • 7. 3. Cassandra indexes and knows
      - Cassandra supports common data types, e.g. byte, string, long, double
      - Cassandra supports secondary indexes, e.g. you can select a subset of your data instead of reading it in bulk
      - Cassandra knows how much data (how many records) is in every token range
  • 8. Cassandra & Map-Reduce
      Google says:
      1. Cassandra is integrated with Map-Reduce
      2. It is outdated
      3. It is used for Hadoop 1.0.3 or whatever version
      This means: please install the Hadoop + MR cluster yourself
  • 9. Cassandra & Map-Reduce (what we want)
      1. Cloudera Hadoop Distribution (CDH4): Cloudera Manager installs your cluster in a couple of clicks
      2. Up to date (Cassandra 1.1.x - 1.2.x)
      Solution:
      A) Take the Cassandra sources, take the package org.apache.cassandra.hadoop and recompile it with the dependencies from CDH4 and Cassandra [1.x]
      The jar is then ready to go for your map-reduce jobs.
  • 10. 1. Allocate your cluster
      DataStax says:
      "To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node."
      Why?
  • 11. Because this: [image] works 100 times faster than this: [image]
  • 12. 2. Number of map tasks
      Job control parameter: InputSplitSize (default 65536)
      Estimates how much data one mapper will receive
      Every map task has its own token range to read data from: [-2^63, +2^63] / number of map tasks
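The arithmetic on this slide can be worked through as follows. This is a simplified sketch under stated assumptions: it treats InputSplitSize as a target row count per mapper and divides the full token range evenly, as the slide describes. The function `plan_splits` and the `estimated_rows` input are hypothetical names, not part of the Cassandra Hadoop package.

```python
import math


def plan_splits(estimated_rows, input_split_size=65536,
                min_token=-2**63, max_token=2**63):
    """Return a list of (start, end) token ranges, one per map task."""
    # Number of map tasks: enough splits that each holds about input_split_size rows.
    n = max(1, math.ceil(estimated_rows / input_split_size))
    width = max_token - min_token
    # Evenly spaced boundaries; integer arithmetic keeps the ends exact.
    bounds = [min_token + width * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


splits = plan_splits(1_000_000)
```

For one million estimated rows and the default split size this yields 16 map tasks, each covering 2^64 / 16 = 2^60 tokens. Lowering InputSplitSize gives more, smaller map tasks.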
  • 13. 3. How the job reads the data
      Job control parameter: RangeBatchSize (default: 4096)
      The number of rows fetched per read, after your filters (primary & secondary indexes) are applied
      Cassandra does the filtering on the server side
      ( [-2^63, +2^63] / number of map tasks )
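The batched read can be pictured with a small simulation. It is only a sketch: `fetch_page` stands in for one server round trip over an in-memory list of `(token, value)` pairs with unique, sorted tokens, and `read_range` plays the mapper that pages through its token range. Both names are invented for the example and are not Cassandra APIs.

```python
from bisect import bisect_right


def fetch_page(table, after_token, end_token, limit, predicate=None):
    """Simulated server call: up to `limit` matching rows with token in (after, end]."""
    tokens = [t for t, _ in table]
    i = bisect_right(tokens, after_token)
    page = []
    while i < len(table) and table[i][0] <= end_token and len(page) < limit:
        # Filtering happens here, on the "server", before rows are returned.
        if predicate is None or predicate(table[i][1]):
            page.append(table[i])
        i += 1
    return page


def read_range(table, start_token, end_token, range_batch_size=4096, predicate=None):
    """Simulated mapper: yield all matching rows of a token range, one batch at a time."""
    last = start_token
    while True:
        page = fetch_page(table, last, end_token, range_batch_size, predicate)
        yield from page
        if len(page) < range_batch_size:
            return  # a short page means the range is exhausted
        last = page[-1][0]  # resume after the last token seen


table = [(i, i) for i in range(10000)]
all_rows = list(read_range(table, -1, 9999))
even_rows = list(read_range(table, -1, 9999, predicate=lambda v: v % 2 == 0))
```

A larger batch size means fewer round trips per map task at the cost of more memory per response.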
  • 14. Pros:
      1. Easy to manage (Cassandra cluster & Cloudera Manager is
      2. Easy to index
      3. Supports a query language and data types
      Cons:
      1. Scaling is extremely expensive (every node should run Cassandra + Hadoop)
      2. Reading is very slow
      3. Reading large amounts of data is impossible
      Note: Netflix uses Cassandra to manage its data, but their map-reduce jobs read the SSTable files directly, bypassing Cassandra!