CFS: Cassandra Backed Storage for Hadoop




Nick Bailey





Usage Rights

© All Rights Reserved

CFS: Cassandra Backed Storage for Hadoop: Presentation Transcript

  • CFS: Cassandra-backed storage for Hadoop
    Nick
  • Motivation (©2012 DataStax)
  • Help me Cassandra, you're my only hope
  • Cassandra
      • Distributed architecture
      • No SPOF
      • Scalable
      • Real time data
      • No ad-hoc query support
  • Cassandra, why can't you...
  • [...] the things Hadoop was built for.
  • Cassandra + Hadoop = <3
  • The Solution
      • InputFormat/OutputFormat
      • Unfortunately, still need a DFS
      • Run tasktrackers/datanodes locally
      • Data Locality FTW!
      • Run namenode/jobtracker somewhere
      • Since Cassandra 0.6 (the dark ages)
  • Ok, but what about these parts that suck...
  • Do not want...
      • Multiple hadoop stacks?
      • SPOF?
      • 3 JVMs?
  • CFS
  • Cassandra Data model in 1 minute
  • Column Families
      • Column Family ~= Table
      • Row Key + columns
      • Columns are sparse
  • Static - Users Column Family
      Row Key
      nickmbailey    password: *    name: Nick
      zznate         password: *    name: Nate    phone: 512-7777
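To make the "columns are sparse" point concrete, the Users column family can be modelled as a mapping from row key to a per-row dictionary of columns. This is only a toy in-memory sketch, not how Cassandra stores data; the `users` variable and the `get_column` helper are hypothetical names chosen for illustration.

```python
# Toy in-memory model of the Users column family from the slide.
# Each row key maps to its own dict of columns, so rows may carry
# different columns (only zznate has a phone column).
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}


def get_column(column_family, row_key, column, default=None):
    """Return one column of one row, or `default` if the row lacks it."""
    return column_family.get(row_key, {}).get(column, default)
```

A missing column is simply absent from the row rather than stored as an empty value.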
  • Secondary Indexes
      Select * from Users where name=Nick;
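One way to picture what a secondary index buys for the query on this slide: keep a reverse mapping from column value to row keys, so the lookup does not scan every row. The sketch below is an in-memory illustration under that assumption, not Cassandra's index implementation; `build_index` and `select_where` are hypothetical helpers.

```python
from collections import defaultdict

# Same toy Users data as on the slide.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}


def build_index(column_family, column):
    """Map each value of `column` to the set of row keys holding that value."""
    index = defaultdict(set)
    for row_key, columns in column_family.items():
        if column in columns:
            index[columns[column]].add(row_key)
    return index


def select_where(column_family, index, value):
    """Rough equivalent of: Select * from Users where name=<value>;"""
    return {key: column_family[key] for key in sorted(index.get(value, ()))}


name_index = build_index(users, "name")
```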
  • Dynamic - Friends
      Row Key
      nickmbailey    zznate:     thobbs:
      zznate         jbeiber:    thobbs:    steve_watt:
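In the dynamic layout the column names themselves carry the data (here, friend names) and the values are empty. A minimal sketch of that idea, again as a plain in-memory dictionary; `add_friend` and `friends_of` are hypothetical helpers, and returning names in sorted order is an assumption made for readability.

```python
# Toy model of the Friends column family: column names are the data,
# column values are left empty.
friends = {
    "nickmbailey": {"zznate": "", "thobbs": ""},
    "zznate": {"jbeiber": "", "thobbs": "", "steve_watt": ""},
}


def add_friend(column_family, user, friend):
    """Adding a friend is just adding one more column to the user's row."""
    column_family.setdefault(user, {})[friend] = ""


def friends_of(column_family, user):
    """Return the user's friends, i.e. the column names of their row."""
    return sorted(column_family.get(user, {}))
```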
  • So what about CFS...
  • Simple...
  • CF: inode
      • Essentially, namenode replacement
      • File metadata
  • CF: inode
      • Row Key = UUID
      • Allows for file renames
      • Secondary indexes for file browsing
      • Columns:
          Column
          filename       /home/nick/data.txt
          parent_path    /home/nick/
          attributes     nick:nick:777
          TimeUUID1      <block metadata>
          TimeUUID2      <block metadata>
          TimeUUID3      <block metadata>
          ...
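Keying each inode row by a UUID rather than by path is what makes renames cheap: only the `filename` and `parent_path` columns change, while the row key and the block columns stay put. The sketch below illustrates this with an in-memory dictionary. It is not the CFS code; `create_file`, `rename`, and `list_dir` are hypothetical helpers, and `list_dir` does a linear scan where the slides describe a secondary index.

```python
import uuid

inode = {}  # row key (UUID) -> columns, mirroring the slide's layout


def create_file(path, attributes):
    """Add an inode row for `path` and return its UUID row key."""
    parent, _, _name = path.rpartition("/")
    file_id = uuid.uuid4()
    inode[file_id] = {
        "filename": path,
        "parent_path": parent + "/",
        "attributes": attributes,
    }
    return file_id


def rename(file_id, new_path):
    """Rename by updating two columns; the row key and block columns are untouched."""
    parent, _, _name = new_path.rpartition("/")
    inode[file_id]["filename"] = new_path
    inode[file_id]["parent_path"] = parent + "/"


def list_dir(parent_path):
    """Browse a directory by parent_path (stand-in for a secondary index lookup)."""
    return sorted(
        row["filename"] for row in inode.values() if row["parent_path"] == parent_path
    )
```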
  • CF: sblocks
      • Essentially, datanode replacement
      • Stores actual contents of files
      • Each row is an hdfs block
      • Row Key = Block ID
          Column
          TimeUUID1      <compressed file data>
          TimeUUID2      <compressed file data>
          TimeUUID3      <compressed file data>
          ...
  • Writes
      • Write file metadata
      • Split into blocks
      • Still controlled by 'dfs.block.size'
      • also 'cfs.local.subblock.size'
      • Read in a block
      • split into sub blocks
      • Update inode, sblocks
      • rinse, repeat
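The write steps listed on this slide can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the CFS implementation: the `inode` and `sblocks` dictionaries stand in for the two column families, `block_size` and `subblock_size` stand in for 'dfs.block.size' and 'cfs.local.subblock.size', zlib is an arbitrary choice of compressor, and the shape of the block metadata is invented for the example.

```python
import uuid
import zlib


def split(data, size):
    """Cut `data` into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_file(inode, sblocks, path, data, block_size, subblock_size):
    """Write file metadata, then store the file block by block."""
    file_id = uuid.uuid4()
    inode[file_id] = {"filename": path}  # write file metadata first
    for block in split(data, block_size):  # split into blocks
        block_id = uuid.uuid4()
        sblocks[block_id] = {}
        for sub in split(block, subblock_size):  # split into sub blocks
            # One column per sub block: TimeUUID -> compressed data.
            sblocks[block_id][uuid.uuid1()] = zlib.compress(sub)
        # Update inode: one TimeUUID column per block (metadata shape is assumed).
        inode[file_id][uuid.uuid1()] = {"block_id": block_id, "length": len(block)}
    return file_id
```

Because Python dictionaries keep insertion order, reading the columns back in order reproduces the original byte stream.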
  • Reads
      • Check for file in inode
      • Determine appropriate blocks
      • Request blocks via thrift
      • If data is local...
      • ...get location on local filesystem
      • If data is remote...
      • ...get actual file content via thrift
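The read steps listed on this slide boil down to two decisions: which blocks cover the requested byte range, and whether each block is held locally or has to be fetched. A rough sketch of both, assuming block lengths are known from the inode metadata; `blocks_for_range` and `read_block` are hypothetical helpers, `local_paths` stands in for the node's knowledge of which blocks it holds, and `fetch_remote` stands in for the thrift call.

```python
def blocks_for_range(block_lengths, offset, length):
    """Return the indexes of blocks overlapping [offset, offset + length)."""
    hits, start = [], 0
    for index, size in enumerate(block_lengths):
        end = start + size
        if start < offset + length and end > offset:
            hits.append(index)
        start = end
    return hits


def read_block(block_id, local_paths, fetch_remote):
    """Local data: return its filesystem location. Remote data: return the content."""
    if block_id in local_paths:
        return ("local", local_paths[block_id])
    return ("remote", fetch_remote(block_id))
```

Returning a location for local data lets the caller read the bytes straight from disk instead of moving them over the network.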
  • What Else?
      • Current Implementation: 1.0.4
      • <property>
          <name>fs.cfs.impl</name>
          <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
        </property>
      • Supports HDFS append()
      • Immutability makes things easy
      • See the first incarnation
  • Want a job?
  • Questions?