Store and Process Big Data with Hadoop and Cassandra
Presentation Transcript

  • Store and Process Big Data with Hadoop and Cassandra, Apache BarCamp, by Deependra Ariyadewa, WSO2, Inc.
  • Store Data with Cassandra
    ● Project site: http://cassandra.apache.org/
    ● The latest release version is 1.0.7
    ● Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, and Ooyala
    ● Cassandra users:
    ● The largest known Cassandra cluster has over 300 TB of data across more than 400 machines
    ● Commercial support
  • Cassandra Deployment Architecture
    [ring diagram: row keys are hashed onto the token ring (hash(key1), hash(key2)); hash(key) determines placement order, and each row key maps to a set of columns: key => {(k,v),(k,v),(k,v)}]
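    The placement scheme in the diagram can be sketched in a few lines of Java. This is an illustration only, not Cassandra's actual code: RandomPartitioner hashes each row key with MD5 onto a token ring, and a key lands on the first node whose token is greater than or equal to the key's token, wrapping around the ring. The class and method names below are hypothetical.

        import java.math.BigInteger;
        import java.security.MessageDigest;
        import java.util.SortedMap;
        import java.util.TreeMap;

        // Illustrative sketch of ring placement: hash a row key with MD5
        // to a non-negative token, then pick the first node whose token
        // is >= the key's token (wrapping around to the first node).
        public class RingSketch {
            private final SortedMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();

            public void addNode(String node, BigInteger token) {
                ring.put(token, node);
            }

            public String nodeFor(String key) throws Exception {
                byte[] md5 = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
                BigInteger token = new BigInteger(1, md5); // non-negative token
                SortedMap<BigInteger, String> tail = ring.tailMap(token);
                // Wrap around the ring if the token is past the last node.
                return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
            }
        }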
  • How to Install Cassandra
    ● Download the artifact apache-cassandra-1.0.7-bin.tar.gz from the Apache Cassandra download page
    ● Extract:
      tar -xzvf apache-cassandra-1.0.7-bin.tar.gz
    ● Set up folder paths:
      mkdir -p /var/log/cassandra
      chown -R `whoami` /var/log/cassandra
      mkdir -p /var/lib/cassandra
      chown -R `whoami` /var/lib/cassandra
  • How to Configure Cassandra
    Main configuration file: $CASSANDRA_HOME/conf/cassandra.yaml
      cluster_name: Test Cluster
      seed_provider:
          - seeds: ""
      storage_port: 7000
      listen_address: localhost
      rpc_address: localhost
      rpc_port: 9160
  • Cassandra Clustering
      initial_token:
      partitioner: org.apache.cassandra.dht.RandomPartitioner
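    With RandomPartitioner, a common practice is to give each of the N nodes an initial_token of i * 2^127 / N so the token ring is divided evenly. A minimal sketch of that calculation (the class name and cluster size are placeholders):

        import java.math.BigInteger;

        // Computes evenly spaced initial_token values for RandomPartitioner,
        // whose token space runs from 0 to 2^127.
        public class TokenCalc {
            public static void main(String[] args) {
                int nodes = 4; // example cluster size
                BigInteger range = BigInteger.valueOf(2).pow(127);
                for (int i = 0; i < nodes; i++) {
                    BigInteger token = range.multiply(BigInteger.valueOf(i))
                                            .divide(BigInteger.valueOf(nodes));
                    System.out.println("node " + i + ": initial_token=" + token);
                }
            }
        }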
  • Cassandra DevOps
    $CASSANDRA_HOME/bin$ ./cassandra-cli --host localhost
    [default@unknown] show keyspaces;
    Keyspace: system:
      Replication Strategy: org.apache.cassandra.locator.LocalStrategy
      Durable Writes: true
        Options: [replication_factor:1]
      Column Families:
        ColumnFamily: HintsColumnFamily (Super) "hinted handoff data"
          Key Validation Class: org.apache.cassandra.db.marshal.BytesType
          Default column value validator: org.apache.cassandra.db.marshal.BytesType
          Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
          Row cache size / save period in seconds / keys to save : 0.0/0/all
          Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
          Key cache size / save period in seconds: 0.01/0
          GC grace seconds: 0
          Compaction min/max thresholds: 4/32
          Read repair chance: 0.0
          Replicate on write: true
          Bloom Filter FP chance: default
          Built indexes: []
          Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
  • Cassandra CLI
    [default@apache] create column family Location with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
    f04561a0-60ed-11e1-0000-242d50cf1fbf
    Waiting for schema agreement...
    ... schemas agree across the cluster
    [default@apache] set Location[00001][City]=Colombo;
    Value inserted.
    Elapsed time: 140 msec(s).
    [default@apache] list Location;
    Using default limit of 100
    -------------------
    RowKey: 00001
    => (column=City, value=Colombo, timestamp=1330311097464000)
    1 Row Returned.
    Elapsed time: 122 msec(s).
  • Store Data with Hector
    import me.prettyprint.cassandra.service.CassandraHostConfigurator;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.factory.HFactory;
    import java.util.HashMap;
    import java.util.Map;

    public class ExampleHelper {
        public static final String CLUSTER_NAME = "ClusterOne";
        public static final String USERNAME_KEY = "username";
        public static final String PASSWORD_KEY = "password";
        public static final String RPC_PORT = "9160";
        public static final String CSS_NODE0 = "localhost";
        public static final String CSS_NODE1 = "";
        public static final String CSS_NODE2 = "";

        public static Cluster createCluster(String username, String password) {
            Map<String, String> credentials = new HashMap<String, String>();
            credentials.put(USERNAME_KEY, username);
            credentials.put(PASSWORD_KEY, password);
            String hostList = CSS_NODE0 + ":" + RPC_PORT + ","
                    + CSS_NODE1 + ":" + RPC_PORT + ","
                    + CSS_NODE2 + ":" + RPC_PORT;
            return HFactory.createCluster(CLUSTER_NAME,
                    new CassandraHostConfigurator(hostList), credentials);
        }
    }
  • Store Data with Hector
    Create keyspace:
        KeyspaceDefinition definition = new ThriftKsDef(keyspaceName);
        cluster.addKeyspace(definition);
    Add column family:
        ColumnFamilyDefinition familyDefinition = new ThriftCfDef(keyspaceName, columnFamily);
        cluster.addColumnFamily(familyDefinition);
    Write data:
        Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());
        String columnValue = UUID.randomUUID().toString();
        mutator.insert(rowKey, columnFamily, HFactory.createStringColumn(columnName, columnValue));
    Read data:
        ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
        columnQuery.setColumnFamily(columnFamily).setKey(key).setName(columnName);
        QueryResult<HColumn<String, String>> result = columnQuery.execute();
        HColumn<String, String> hColumn = result.get();
        System.out.println("Column: " + hColumn.getName() + " Value : " + hColumn.getValue() + "\n");
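    Putting the snippets above together, a minimal end-to-end run might look like the sketch below. The keyspace, column family, and host are placeholders, and it assumes the Hector 1.0.x API used in this presentation:

        import me.prettyprint.cassandra.serializers.StringSerializer;
        import me.prettyprint.cassandra.service.ThriftCfDef;
        import me.prettyprint.cassandra.service.ThriftKsDef;
        import me.prettyprint.hector.api.Cluster;
        import me.prettyprint.hector.api.Keyspace;
        import me.prettyprint.hector.api.beans.HColumn;
        import me.prettyprint.hector.api.factory.HFactory;
        import me.prettyprint.hector.api.mutation.Mutator;
        import me.prettyprint.hector.api.query.ColumnQuery;
        import me.prettyprint.hector.api.query.QueryResult;

        // Hypothetical end-to-end run: create a keyspace and column family,
        // write one column, then read it back.
        public class HectorExample {
            public static void main(String[] args) {
                Cluster cluster = HFactory.getOrCreateCluster("ClusterOne", "localhost:9160");
                cluster.addKeyspace(new ThriftKsDef("MyKeyspace"), true);
                cluster.addColumnFamily(new ThriftCfDef("MyKeyspace", "Location"), true);

                Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);
                Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
                mutator.insert("00001", "Location", HFactory.createStringColumn("City", "Colombo"));

                ColumnQuery<String, String, String> query = HFactory.createStringColumnQuery(keyspace);
                query.setColumnFamily("Location").setKey("00001").setName("City");
                QueryResult<HColumn<String, String>> result = query.execute();
                System.out.println("City = " + result.get().getValue());
            }
        }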
  • Variable Consistency
    ● ANY: Wait until some replica has responded.
    ● ONE: Wait until one replica has responded.
    ● TWO: Wait until two replicas have responded.
    ● THREE: Wait until three replicas have responded.
    ● LOCAL_QUORUM: Wait for a quorum in the datacenter where the connection was established.
    ● EACH_QUORUM: Wait for a quorum in each datacenter.
    ● QUORUM: Wait for a quorum of replicas, regardless of datacenter.
    ● ALL: Block until all replicas have responded before returning to the client.
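    As a worked example of the quorum levels: a quorum is floor(RF/2) + 1 replicas, so with a replication factor of 3 a quorum is 2, and a QUORUM write plus a QUORUM read always overlap in at least one replica (R + W > RF), which is what gives strong consistency. A small illustrative sketch of the arithmetic:

        // Illustration of quorum arithmetic: quorum = floor(RF/2) + 1, so a
        // quorum write and a quorum read always share at least one replica.
        public class QuorumMath {
            static int quorum(int replicationFactor) {
                return replicationFactor / 2 + 1;
            }

            public static void main(String[] args) {
                for (int rf = 1; rf <= 5; rf++) {
                    int q = quorum(rf);
                    // Overlap guarantee holds when R + W > RF.
                    System.out.println("RF=" + rf + " quorum=" + q
                            + " overlap=" + (2 * q > rf));
                }
            }
        }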
  • Variable Consistency
    Create a customized consistency level:
        ConfigurableConsistencyLevel configurableConsistencyLevel = new ConfigurableConsistencyLevel();
        Map<String, HConsistencyLevel> clmap = new HashMap<String, HConsistencyLevel>();
        clmap.put("MyColumnFamily", HConsistencyLevel.ONE);
        configurableConsistencyLevel.setReadCfConsistencyLevels(clmap);
        configurableConsistencyLevel.setWriteCfConsistencyLevels(clmap);
        HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);
  • CQL
    Insert data with CQL:
        cqlsh> INSERT INTO Location (KEY, City) VALUES ('00001', 'Colombo');
    Retrieve data with CQL:
        cqlsh> SELECT * FROM Location WHERE KEY = '00001';
  • Apache Hadoop
    ● Project site: http://hadoop.apache.org/
    ● Latest version: 1.0.1
    ● Hadoop is in use at Amazon, Yahoo, Adobe, eBay, and Facebook
    ● Commercial support:
  • Hadoop Deployment Architecture
    [architecture diagram]
  • How to Install Hadoop
    ● Download the artifact hadoop-1.0.1.tar.gz from the Apache Hadoop download page
    ● Extract:
      tar -xzvf hadoop-1.0.1.tar.gz
    ● Copy and extract the installation on each data node:
      scp hadoop-1.0.1.tar.gz user@datanode01:/home/hadoop
    ● Start Hadoop:
      $HADOOP_HOME/bin/start-all.sh
  • Hadoop CLI - HDFS
    Format the NameNode:
        $HADOOP_HOME/bin/hadoop namenode -format
    File operations on HDFS:
        $HADOOP_HOME/bin/hadoop dfs -lsr /
        $HADOOP_HOME/bin/hadoop dfs -mkdir /users/deep/wso2
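    The same operations can also be driven from Java through the HDFS client API. The sketch below is illustrative; the NameNode URI and paths are placeholders, and it assumes the Hadoop 1.0 org.apache.hadoop.fs API:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Minimal sketch of the HDFS client API: create a directory, write
        // a file, and list a directory, mirroring the CLI commands above.
        public class HdfsExample {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                conf.set("fs.default.name", "hdfs://localhost:9000"); // placeholder NameNode URI
                FileSystem fs = FileSystem.get(conf);

                fs.mkdirs(new Path("/users/deep/wso2"));

                FSDataOutputStream out = fs.create(new Path("/users/deep/wso2/hello.txt"));
                out.writeUTF("hello hdfs");
                out.close();

                for (FileStatus status : fs.listStatus(new Path("/users/deep"))) {
                    System.out.println(status.getPath());
                }
                fs.close();
            }
        }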
  • MapReduce
    [MapReduce data-flow diagram; source:]
  • Simple MapReduce Job
    Mapper:
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
  • Simple MapReduce Job
    Reducer:
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
  • Simple MapReduce Job
    Job runner:
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  • High-Level MapReduce Interfaces
    ● Hive
    ● Pig
  • Q&A