Cassandra & MapReduce

Some notes on Cassandra & MapReduce integration


Transcript

  • 1. Cassandra & MR and how we used it in
  • 2. Agenda
    - Why do we need Cassandra & MapReduce
    - 3 notes about Cassandra
    - 3 notes about Cassandra + MapReduce
  • 3. Real Time Bidding (RTB)
  • 4. RTB
    - How often? 10-30M auctions per day for mobile devices in Russia, e.g. 10-30 GB of data per day
    - What do we need? To be effective in showing ads
    - They call it "big data & data mining"
    We decided to use Cassandra & Hadoop for that
  • 5. 1. Cassandra tokens
    ∆ = [T0, T4] - token range, ex. ∆ = [-2^63, 2^63 - 1]
    Every time one writes (K,V) into Cassandra:
    - ex. token(K) in [T2, T3]
    - (K,V) will be put into node 3 (if the replication factor is 1)
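The token-to-node lookup on this slide can be sketched in a few lines. This is a simplified model, not Cassandra's implementation: the names `make_ring` and `node_for_token` are hypothetical, the ring is assumed to be split evenly, each node is assumed to own one contiguous range (no virtual nodes), and the replication factor is assumed to be 1.

```python
from bisect import bisect_left

# Token bounds of the Murmur3Partitioner example range (assumption: single ring, no vnodes).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def make_ring(node_count):
    """Return the upper bounds T1..Tn of an evenly split token range."""
    width = (MAX_TOKEN - MIN_TOKEN) // node_count
    bounds = [MIN_TOKEN + width * i for i in range(1, node_count)]
    bounds.append(MAX_TOKEN)  # the last node always ends at the top of the range
    return bounds


def node_for_token(bounds, token):
    """Node i (1-based) is assumed to own the tokens up to and including T[i]."""
    return bisect_left(bounds, token) + 1


ring = make_ring(4)            # T1..T4 for a 4-node cluster
print(node_for_token(ring, 0)) # a token between T2 and T3 lands on node 3
```

With four nodes, token 0 falls between T2 and T3, which matches the slide's example of a key being stored on node 3.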
  • 6. 2. Cassandra Load Balancing
    The partitioner generates tokens for your keys, e.g. it creates token(K)
    Cassandra offers the following partitioners:
    ● Murmur3Partitioner (default): Uniformly distributes data across the cluster based on MurmurHash hash values.
    ● RandomPartitioner: Uniformly distributes data across the cluster based on MD5 hash values.
    ● ByteOrderedPartitioner: Keeps an ordered distribution of data lexically by key bytes
    The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.
    http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
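To illustrate what "uniformly distributes data based on MD5 hash values" means, here is a small sketch. It models the RandomPartitioner token as the absolute value of the MD5 digest read as a signed big-endian integer, which is how that partitioner is commonly described; treat that detail, and the helper names `md5_token` and `bucket_counts`, as assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def md5_token(key: bytes) -> int:
    """Token modelled as abs(MD5 digest as a signed 128-bit integer)."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def bucket_counts(keys, node_count):
    """Count how many keys fall into each of node_count equal token ranges."""
    span = 2**127 + 1  # tokens lie in [0, 2**127]
    return Counter(md5_token(k) * node_count // span for k in keys)


keys = [f"user-{i}".encode() for i in range(10_000)]
for node, count in sorted(bucket_counts(keys, 4).items()):
    print(node, count)  # counts should be roughly equal across the 4 ranges
```

Even though the keys are sequential, hashing spreads them roughly evenly over the ranges; a ByteOrderedPartitioner would instead keep them in key order.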
  • 7. 3. Cassandra indexes and knows
    - Cassandra supports common data formats, e.g. byte, string, long, double
    - Cassandra supports secondary indexes, e.g. you can select only the data you need instead of reading it in bulk
    - Cassandra knows how much data (how many records) is in every token range
  • 8. Cassandra & Map-Reduce
    Google says:
    1. Cassandra is integrated with Map-Reduce: http://wiki.apache.org/cassandra/HadoopSupport
    2. It is outdated
    3. It is used for Hadoop 1.0.3 or whatever version
    This means: please install the Hadoop + MR cluster yourself
  • 9. Cassandra & Map-Reduce (we want)
    1. Cloudera Hadoop Distribution (CDH4): Cloudera Manager installs your cluster in a couple of clicks
    2. Up to date (Cassandra 1.1.x - 1.2.x)
    Solution:
    A) Take the Cassandra sources from http://cassandra.apache.org/download/
    B) Take the package org.apache.cassandra.hadoop and recompile it against the dependencies from CDH4 & Cassandra [1.x]
    The resulting JAR is ready to go for your map-reduce jobs
  • 10. 1. Allocate your cluster
    DataStax says: "To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node."
    Why?
  • 11. Because this: [diagram not preserved] works 100 times faster than this: [diagram not preserved]
  • 12. 2. Number of map tasks
    Job control parameter: InputSplitSize (default 65536)
    Estimates how much data (how many rows) one mapper will receive
    Every map task has its own token range to read data from: [-2^63, 2^63 - 1] / number of map tasks
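The relationship between InputSplitSize, the number of map tasks, and the per-task token range can be sketched as follows. This is a simplification under stated assumptions: the function name `plan_splits` is hypothetical, the row estimate is passed in by hand, and the range is divided evenly as the slide describes, whereas the real input format derives splits from per-node estimates.

```python
import math

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def plan_splits(estimated_rows, input_split_size=65536):
    """Return one inclusive (start, end) token range per map task."""
    # One map task per input_split_size rows, and at least one task overall.
    task_count = max(1, math.ceil(estimated_rows / input_split_size))
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in the full range
    splits = []
    start = MIN_TOKEN
    for i in range(1, task_count + 1):
        end = MIN_TOKEN + total * i // task_count - 1
        splits.append((start, end))
        start = end + 1  # next task begins right after this one ends
    return splits


splits = plan_splits(estimated_rows=1_000_000)
print(len(splits))   # number of map tasks for about one million rows
print(splits[0])     # token range handled by the first map task
```

Lowering InputSplitSize in this model produces more, smaller map tasks; raising it produces fewer tasks that each scan a wider token range.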
  • 13. 3. How the job reads the data
    Job control parameter: RangeBatchSize (default: 4096)
    The number of rows read per batch, after your filters (primary & secondary indexes) are applied
    Cassandra does the filtering on the server side, within the task's token range ([-2^63, 2^63 - 1] / number of map tasks)
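The batched read can be pictured with a small generator. This is only a model of the behaviour described on the slide: `read_split` is a hypothetical name, the in-memory `rows` list stands in for the data a node holds, and the `predicate` argument stands in for the index filters that Cassandra would evaluate on the server side.

```python
def read_split(rows, start_token, end_token, batch_size=4096, predicate=None):
    """Yield batches of at most batch_size matching (token, row) pairs."""
    batch = []
    for token, row in rows:
        if token < start_token or token > end_token:
            continue  # outside this map task's token range
        if predicate is not None and not predicate(row):
            continue  # filtered out before it ever reaches the mapper
        batch.append((token, row))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


rows = [(i, {"v": i}) for i in range(10)]
for batch in read_split(rows, 2, 8, batch_size=3, predicate=lambda r: r["v"] % 2 == 0):
    print([token for token, _ in batch])
```

The mapper only ever sees rows that passed the filter, delivered in chunks of at most the batch size.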
  • 14. Pros:
    1. Easy to manage (Cassandra cluster & Cloudera Manager is
    2. Easy to index
    3. Supports a query language & data types
    Cons:
    1. Scaling is extremely expensive (every node should run Cassandra + Hadoop)
    2. Reading is very slow
    3. Reading large amounts of data is impossible
    Note: Netflix uses Cassandra to manage the data. But their map-reduce jobs read sstable files directly, avoiding Cassandra!
    http://www.datastax.com/dev/blog/2012-in-review-performance
    Conclusion