Cassandra Hadoop Integration at HUG France by Piotr Kołaczkowski

  1. Handling realtime and analytic workloads in a single cluster with Hadoop and Cassandra. Piotr Kołaczkowski, pkolaczk@datastax.com, @pkolaczk
  2. Basic Cassandra + Hadoop Integration. Diagram: a Cassandra cluster (eight C* nodes) and a separate Hadoop cluster (NameNode & JobTracker, six DataNodes), linked by CFIF and CFOF.
  3. ColumnFamilyInputFormat. Sample column family (row key, then column name: value pairs):
       jim     age: 36  car: camaro  gender: M
       carol   age: 37  car: subaru
       johnny  age: 12  gender: M
       suzy    age: 10  gender: F
     Key: ByteBuffer (row key). Value: SortedMap<ByteBuffer, IColumn>, keyed by column name; each IColumn holds (column name, value, timestamp).
  4. ColumnFamilyInputFormat (same sample data). Input Key: jim. Input Value: age: 36, car: camaro, gender: M.
  5. ColumnFamilyInputFormat (same sample data). Input Key: carol. Input Value: age: 37, car: subaru.
  6. ColumnFamilyInputFormat (same sample data). Input Key: johnny. Input Value: age: 12, gender: M.
  7. ColumnFamilyInputFormat (same sample data). Input Key: suzy. Input Value: age: 10, gender: F.
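The per-row input walked through in slides 3 to 7 can be modelled in plain Java. This is only an illustrative sketch: it uses String keys and values in place of ByteBuffer and IColumn, needs no Cassandra or Hadoop libraries, and the names CfifRowModel, sampleData, columns, and mapCalls are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for the rows a mapper receives from ColumnFamilyInputFormat.
// Real jobs get ByteBuffer keys and SortedMap<ByteBuffer, IColumn> values.
class CfifRowModel {

    // The sample column family from the slides, in slide order.
    static Map<String, SortedMap<String, String>> sampleData() {
        Map<String, SortedMap<String, String>> rows = new LinkedHashMap<>();
        rows.put("jim", columns("age", "36", "car", "camaro", "gender", "M"));
        rows.put("carol", columns("age", "37", "car", "subaru"));
        rows.put("johnny", columns("age", "12", "gender", "M"));
        rows.put("suzy", columns("age", "10", "gender", "F"));
        return rows;
    }

    // Builds a column map sorted by column name from alternating name/value arguments.
    static SortedMap<String, String> columns(String... nameValuePairs) {
        SortedMap<String, String> cols = new TreeMap<>();
        for (int i = 0; i + 1 < nameValuePairs.length; i += 2) {
            cols.put(nameValuePairs[i], nameValuePairs[i + 1]);
        }
        return cols;
    }

    // One simulated map() call per row: the key is the row key, the value is the whole row.
    static List<String> mapCalls(Map<String, SortedMap<String, String>> rows) {
        List<String> calls = new ArrayList<>();
        for (Map.Entry<String, SortedMap<String, String>> row : rows.entrySet()) {
            calls.add(row.getKey() + " -> " + row.getValue());
        }
        return calls;
    }

    public static void main(String[] args) {
        for (String call : mapCalls(sampleData())) {
            System.out.println(call);
        }
    }
}
```

Rows with different column sets (carol has no gender, johnny has no car) simply yield maps of different sizes, so a mapper has to tolerate missing columns.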
  8. CFIF – Wide Row Support (same sample data). Input Key: jim. Input Value: age: 36.
  9. CFIF – Wide Row Support (same sample data). Input Key: jim. Input Value: car: camaro.
  10. CFIF – Wide Row Support (same sample data). Input Key: jim. Input Value: gender: M.
  11. CFIF – Wide Row Support (same sample data). Input Key: carol. Input Value: age: 37.
  12. CFIF – Wide Row Support (same sample data). Input Key: carol. Input Value: car: subaru.
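In wide-row mode the slides show one column per input record instead of one row per record. The following plain-Java sketch imitates that flattening; it is an illustration with String data, not the Cassandra API, and WideRowModel and wideRowCalls are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of wide-row input: each simulated map() call sees a single column.
class WideRowModel {

    static List<String> wideRowCalls(String rowKey, Map<String, String> columns) {
        List<String> calls = new ArrayList<>();
        // TreeMap iterates in column-name order, mirroring sorted column storage.
        for (Map.Entry<String, String> col : new TreeMap<>(columns).entrySet()) {
            calls.add(rowKey + " -> " + col.getKey() + ": " + col.getValue());
        }
        return calls;
    }

    public static void main(String[] args) {
        Map<String, String> jim = new LinkedHashMap<>();
        jim.put("age", "36");
        jim.put("car", "camaro");
        jim.put("gender", "M");
        for (String call : wideRowCalls("jim", jim)) {
            System.out.println(call);
        }
    }
}
```

Because the row key repeats on every call, a job that needs whole-row context in this mode has to reassemble it itself, for example by grouping on the key in the reducer.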
  13. CFIF – Cassandra Secondary Index Support
        // IndexExpression and IndexOperator: org.apache.cassandra.thrift; ConfigHelper: org.apache.cassandra.hadoop
        IndexExpression expr = new IndexExpression(
            ByteBufferUtil.bytes("car"),
            IndexOperator.EQ,
            ByteBufferUtil.bytes("subaru"));
        ConfigHelper.setInputRange(job.getConfiguration(), Arrays.asList(expr));
      Of the sample rows (jim, carol, johnny, suzy), only carol (age: 37, car: subaru) has car equal to subaru.
  14. ColumnFamilyOutputFormat
      ● Key: ByteBuffer (row key)
      ● Value: List<Mutation>
        – Mutation: insert or delete a column
      Diagram labels: ColumnFamilyRecordWriter, write, queue, client, thrift, Cassandra cluster (eight C* nodes).
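  15. CFOF – Creating Mutations
        // Classes: org.apache.cassandra.thrift.{Column, ColumnOrSuperColumn, Mutation}, org.apache.cassandra.utils.ByteBufferUtil
        ByteBuffer rowkey = ByteBufferUtil.bytes("carol");

        // Column to insert: name "age", value 37
        Column column = new Column();
        column.name = ByteBufferUtil.bytes("age");
        column.value = ByteBufferUtil.bytes(37);
        column.setTimestamp(System.currentTimeMillis()); // not on the slide; an inserted column needs a timestamp

        // Wrap the column in a Mutation and add it to the output list
        List<Mutation> mutations = new ArrayList<Mutation>();
        Mutation mutation = new Mutation();
        mutation.column_or_supercolumn = new ColumnOrSuperColumn();
        mutation.column_or_supercolumn.column = column;
        mutations.add(mutation);

        context.write(rowkey, mutations);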
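The byte buffers used for the row key, column name, and column value can be illustrated with the JDK alone. This is a sketch under the assumption that, as in Cassandra's ByteBufferUtil, a string is encoded as UTF-8 and an int as four big-endian bytes; the class ColumnBytes and its methods are hypothetical stand-ins, not the Cassandra API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-only stand-ins for the encoding helpers used on the slide.
class ColumnBytes {

    // Encodes a string as UTF-8 bytes.
    static ByteBuffer bytes(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes an int as 4 bytes; ByteBuffer is big-endian by default.
    static ByteBuffer bytes(int i) {
        return ByteBuffer.allocate(4).putInt(0, i);
    }

    // Decodes UTF-8 text without moving the caller's buffer position.
    static String string(ByteBuffer b) {
        return StandardCharsets.UTF_8.decode(b.duplicate()).toString();
    }

    // Reads a 4-byte int starting at the buffer's current position.
    static int toInt(ByteBuffer b) {
        return b.getInt(b.position());
    }

    public static void main(String[] args) {
        ByteBuffer name = bytes("age");
        ByteBuffer value = bytes(37);
        System.out.println(string(name) + " = " + toInt(value)
                + " (" + value.remaining() + " bytes)");
    }
}
```

The practical consequence is that a value written as an int must be read back as an int: decoding those four bytes as text would not give the string "37".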
  16. BulkOutputFormat. Diagram: BulkRecordWriter writes into a memory buffer, which is flushed as SSTable 1, SSTable 2, ..., SSTable N into the Hadoop temporary directory.
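The buffer-then-flush pattern in the BulkOutputFormat diagram can be sketched in plain Java. This is a toy model only: it keeps "SSTables" as in-memory sorted maps rather than files, the row-count threshold is an assumption made for illustration, and BulkWriteModel with its methods is hypothetical rather than part of Cassandra.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model of buffered bulk writing; it does not use Cassandra's SSTable writers.
class BulkWriteModel {
    private final int maxBufferedRows;
    private final SortedMap<String, String> buffer = new TreeMap<>();
    private final List<SortedMap<String, String>> flushed = new ArrayList<>();

    BulkWriteModel(int maxBufferedRows) {
        this.maxBufferedRows = maxBufferedRows;
    }

    // Buffers a row in memory and flushes once the buffer is full.
    void write(String rowKey, String value) {
        buffer.put(rowKey, value);
        if (buffer.size() >= maxBufferedRows) {
            flush();
        }
    }

    // Turns the current buffer into one immutable-by-convention sorted table.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushed.add(new TreeMap<>(buffer));
        buffer.clear();
    }

    List<SortedMap<String, String>> sstables() {
        return flushed;
    }

    public static void main(String[] args) {
        BulkWriteModel writer = new BulkWriteModel(2);
        writer.write("suzy", "age: 10");
        writer.write("jim", "age: 36");
        writer.write("carol", "age: 37");
        writer.flush(); // final flush, as a record writer would do on close
        System.out.println(writer.sstables());
    }
}
```

Each flushed table is sorted by key even though rows arrive unsorted, which is the property that lets a writer produce table files sequentially.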
  17. DataStax Enterprise: Cassandra and Hadoop in a Single Cluster
  18. Basic Features
      ● Single, simplified component
      ● Workload separation
      ● No SPOF
      ● Peer to peer
      ● JobTracker failover
      ● No additional Cassandra config
  19. System Administrators View
      Address          DC         Rack   Workload       Status  State   Load      Owns    Token
                                                                                          148873535527910577765226390751398592512
      101.202.204.101  Analytics  rack1  Analytics(JT)  Up      Normal  78,96 GB  12,50%  0
      101.202.204.102  Analytics  rack1  Analytics(TT)  Up      Normal  82,65 GB  12,50%  21267647932558653966460912964485513216
      101.202.204.103  Analytics  rack1  Analytics(TT)  Up      Normal  74,96 GB  12,50%  42535295865117307932921825928971026432
      101.202.204.104  Analytics  rack1  Analytics(TT)  Up      Normal  78,79 GB  12,50%  63802943797675961899382738893456539648
      101.202.204.105  Cassandra  rack1  Cassandra      Up      Normal  67,42 GB  12,50%  85070591730234615865843651857942052864
      101.202.204.106  Cassandra  rack1  Cassandra      Up      Normal  60,86 GB  12,50%  106338239662793269832304564822427566080
      101.202.204.107  Cassandra  rack1  Cassandra      Up      Normal  81,27 GB  12,50%  127605887595351923798765477786913079296
      101.202.204.108  Cassandra  rack1  Cassandra      Up      Normal  77,17 GB  12,50%  148873535527910577765226390751398592512
      Easy monitoring of your nodes, regardless of their workload type.
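The Token column in this listing is consistent with eight tokens spaced evenly over a ring of size 2^127, which also explains the 12,50% ownership per node. The sketch below reproduces those values; the ring size is an assumption (it matches Cassandra's RandomPartitioner, which the slide does not name), and TokenRing and evenTokens are hypothetical names.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Computes evenly spaced tokens: token(i) = i * 2^127 / nodeCount.
class TokenRing {
    // Assumed token space [0, 2^127), as used by RandomPartitioner.
    static final BigInteger RING_SIZE = BigInteger.valueOf(2).pow(127);

    static List<BigInteger> evenTokens(int nodeCount) {
        List<BigInteger> tokens = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            tokens.add(RING_SIZE
                    .multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(nodeCount)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per node for an 8-node ring like the one in the listing.
        for (BigInteger token : evenTokens(8)) {
            System.out.println(token);
        }
    }
}
```

With eight nodes each token is a multiple of 2^124, so every node owns exactly one eighth of the ring.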
  20. Wait, but where are my files? Diagram: Hadoop M/R on top of HDFS, next to Hadoop M/R on top of CFS inside the Cassandra server.
  21. Cassandra File System Properties
      ● Decentralized
      ● Replicated
      ● HDFS compatible
        – compatible with Hadoop filesystem utilities
        – allows for running M/R programs on DSE without any change
      ● Compressed
  22. CFS Architecture
  23. CFS Compaction
      ● Keeps track of deleted rows (blocks)
      ● When all blocks in an SSTable are removed, deletes the whole SSTable
      Diagram labels: Cassandra Storage; block 1 through block 10; ts 1 through ts 4; an X mark.
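The two bullets describe a simple bookkeeping rule, which can be sketched in plain Java. This is a toy model built only from the slide text: it assumes block ids are integers and that an SSTable is just a named set of live blocks; CfsCompactionModel and its methods are hypothetical and do not reflect the actual CFS implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model: track live blocks per SSTable, drop an SSTable once it has none.
class CfsCompactionModel {
    private final Map<String, Set<Integer>> liveBlocksBySSTable = new LinkedHashMap<>();

    void addSSTable(String name, Integer... blocks) {
        liveBlocksBySSTable.put(name, new HashSet<>(Arrays.asList(blocks)));
    }

    // Marks a block as deleted everywhere, then removes SSTables with no live blocks left.
    void deleteBlock(int block) {
        liveBlocksBySSTable.values().forEach(live -> live.remove(block));
        liveBlocksBySSTable.values().removeIf(Set::isEmpty);
    }

    Set<String> sstables() {
        return liveBlocksBySSTable.keySet();
    }

    public static void main(String[] args) {
        CfsCompactionModel storage = new CfsCompactionModel();
        storage.addSSTable("sstable-1", 1, 2);
        storage.addSSTable("sstable-2", 3);
        storage.deleteBlock(1);
        System.out.println(storage.sstables()); // both tables still hold live blocks
        storage.deleteBlock(2);
        System.out.println(storage.sstables()); // sstable-1 is gone
    }
}
```

Dropping a whole table when it is empty avoids rewriting large block data just to reclaim space, which is presumably the motivation for the rule on the slide.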
  24. Hive Integration
      ● CassandraHiveMetaStore
        – stores Hive database metadata in Cassandra
        – no need to run a separate RDBMS
      ● CassandraStorageHandler
        – allows for direct access to C* tables with CFIF and CFOF
  25. Hive Integration – Example
        CREATE EXTERNAL TABLE MyHiveTable(row_key string, col1 string, col2 string)
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
        TBLPROPERTIES ("cassandra.ks.name" = "MyCassandraKS");

        SELECT count(*) FROM MyHiveTable;

      Output:
        Total MapReduce jobs = 1
        Launching Job 1 out of 1
        Number of reduce tasks determined at compile time: 1
        In order to change the average load for a reducer (in bytes):
          set hive.exec.reducers.bytes.per.reducer=<number>
        In order to limit the maximum number of reducers:
          set hive.exec.reducers.max=<number>
        In order to set a constant number of reducers:
          set mapred.reduce.tasks=<number>
        Starting Job = job_201306041030_0001, Tracking URL = http://192.168.123.10:50030/jobdetails.jsp?jobid=job_201306041030_0001
        Kill Command = /usr/bin/dse hadoop job -Dmapred.job.tracker=192.168.123.10:8012 -kill job_201306041030_0001
        Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
        2013-06-04 15:11:54,573 Stage-1 map = 0%, reduce = 0%
        2013-06-04 15:11:58,622 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
        2013-06-04 15:11:59,691 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 1.04 sec
        ...
        2013-06-04 15:12:28,288 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
        2013-06-04 15:12:29,304 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
        2013-06-04 15:12:30,330 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
        2013-06-04 15:12:31,339 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 31.91 sec
        MapReduce Total cumulative CPU time: 31 seconds 910 msec
        Ended Job = job_201306041030_0001
        MapReduce Jobs Launched:
        Job 0: Map: 9 Reduce: 1 Cumulative CPU: 31.91 sec HDFS Read: 0 HDFS Write: 0 SUCCESS
        Total MapReduce CPU Time Spent: 31 seconds 910 msec
        OK
        1000000
        Time taken: 46.246 seconds
  26. Custom Column Mapping
        CREATE EXTERNAL TABLE Users(userid string, name string, email string, phone string)
        STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
        WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,user_name,primary_email,home_phone");

      Cassandra:  row key  user_name  primary_email  home_phone
      Hive:       userid   name       email          phone
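The mapping string pairs Hive columns with Cassandra columns by position, as the two-row table on the slide shows. A minimal plain-Java sketch of that pairing follows; it illustrates the idea only and is not how the storage handler is implemented, and ColumnMapping and hiveToCassandra are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: pairs Hive columns with cassandra.columns.mapping entries by position.
class ColumnMapping {

    static Map<String, String> hiveToCassandra(List<String> hiveColumns, String mapping) {
        String[] cassandraColumns = mapping.split(",");
        if (cassandraColumns.length != hiveColumns.size()) {
            throw new IllegalArgumentException(
                    "mapping has " + cassandraColumns.length + " entries but the table has "
                            + hiveColumns.size() + " columns");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < cassandraColumns.length; i++) {
            result.put(hiveColumns.get(i), cassandraColumns[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = hiveToCassandra(
                Arrays.asList("userid", "name", "email", "phone"),
                ":key,user_name,primary_email,home_phone");
        // ":key" stands for the Cassandra row key in the slide's mapping string.
        mapping.forEach((hive, cassandra) -> System.out.println(hive + " <- " + cassandra));
    }
}
```

Since the pairing is positional, reordering the Hive column list without reordering the mapping string would silently swap columns, so the length check above is only a partial safeguard.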
