CFS: Cassandra Backed Storage for Hadoop

Nick Bailey
@Nickmbailey
nick@datastax.com

Published in: Technology, Business
Transcript of "CFS: Cassandra Backed Storage for Hadoop"

  1. CFS: Cassandra-backed storage for Hadoop • Nick Bailey • @nickmbailey • nick@datastax.com
  2. Motivation (©2012 DataStax)
  3. Help me Cassandra, you’re my only hope
  4. Cassandra • Distributed architecture • No SPOF • Scalable • Real time data • No ad-hoc query support
  5. Cassandra, why can’t you...
  6. ...do the things Hadoop was built for.
  7. Cassandra + Hadoop = <3
  8. The Solution • InputFormat/OutputFormat • Unfortunately, still need a DFS • Run tasktrackers/datanodes locally • Data Locality FTW! • Run namenode/jobtracker somewhere • Since Cassandra 0.6 (the dark ages)
  9. Ok, but what about these parts that suck...
  10. Do not want... • Multiple Hadoop stacks? • SPOF? • 3 JVMs?
  11. CFS
  12. Cassandra data model in 1 minute
  13. Column Families • Column Family ~= Table • Row Key + columns • Columns are sparse
  14. Static - Users Column Family • Row key nickmbailey: password: *, name: Nick • Row key zznate: password: *, name: Nate, phone: 512-7777
  15. Secondary Indexes • Select * from Users where name=Nick;
  16. Dynamic - Friends • Row key nickmbailey: columns zznate, thobbs (no values shown) • Row key zznate: columns jbeiber, thobbs, steve_watt (no values shown)
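
The data model on slides 13 to 16 can be pictured with plain Python dictionaries. This is only an illustrative sketch: the dicts stand in for column families, the empty-string values in `friends` are an assumption (the slide shows no values), and `rows_where` is a hypothetical helper that scans every row, which is exactly the work a real secondary index avoids.

```python
# Plain dicts standing in for Cassandra column families (illustration only).
# Outer key = row key, inner dict = that row's columns. Rows are sparse:
# only "zznate" has a "phone" column.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}

# "Dynamic" column family: the column names themselves carry the data
# (friend names); the empty values are an assumption for this sketch.
friends = {
    "nickmbailey": {"zznate": "", "thobbs": ""},
    "zznate": {"jbeiber": "", "thobbs": "", "steve_watt": ""},
}


def rows_where(column_family, column, value):
    """Hypothetical helper: row keys whose `column` equals `value`.

    This is a full scan over the rows; a secondary index answers the
    same question without visiting every row.
    """
    return sorted(
        key for key, columns in column_family.items()
        if columns.get(column) == value
    )
```

Calling `rows_where(users, "name", "Nick")` plays the role of the `Select * from Users where name=Nick;` query on slide 15.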
  17. So what about CFS...
  18. Simple...
  19. (no text on this slide)
  20. CF: inode • Essentially, namenode replacement • File metadata
  21. (no text on this slide)
  22. CF: inode • Row Key = UUID • Allows for file renames • Secondary indexes for file browsing • Columns: filename = /home/nick/data.txt; parent_path = /home/nick/; attributes = nick:nick:777; TimeUUID1 = <block metadata>; TimeUUID2 = <block metadata>; TimeUUID3 = <block metadata>; ...
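
Why a UUID row key "allows for file renames" can be seen in a small sketch. Everything here is an assumption made for illustration: the in-memory `inode` dict, the function names, and the `blocks` list (the slide stores block metadata as TimeUUID-named columns instead). The point is only that a rename rewrites the path columns while the row key, and anything that refers to it, stays put.

```python
import uuid

# In-memory stand-in for the inode column family: row key (UUID) -> columns.
inode = {}


def create_file(path, attributes):
    """Add a metadata row under a fresh UUID row key and return the key."""
    parent = path.rpartition("/")[0] + "/"
    key = uuid.uuid4()
    inode[key] = {
        "filename": path,
        "parent_path": parent,
        "attributes": attributes,
        "blocks": [],  # simplification of the TimeUUID -> <block metadata> columns
    }
    return key


def rename(key, new_path):
    """Rename = update two columns; the row key and block list are untouched."""
    inode[key]["filename"] = new_path
    inode[key]["parent_path"] = new_path.rpartition("/")[0] + "/"


def list_dir(parent_path):
    """Directory listing by parent_path (a scan here; the slide uses a secondary index)."""
    return sorted(
        row["filename"] for row in inode.values()
        if row["parent_path"] == parent_path
    )
```

If the path itself were the row key, a rename would mean deleting one row and writing another; with a UUID key it is an in-place column update.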
  23. (no text on this slide)
  24. CF: sblocks • Essentially, datanode replacement • Stores actual contents of files • Each row is an hdfs block • Row Key = Block ID • Columns: TimeUUID1 = <compressed file data>; TimeUUID2 = <compressed file data>; TimeUUID3 = <compressed file data>; ...
  25. (no text on this slide)
  26. Writes • Write file metadata • Split into blocks • Still controlled by ‘dfs.block.size’ • also ‘cfs.local.subblock.size’ • Read in a block • split into sub blocks • Update inode, sblocks • rinse, repeat
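
The write loop on slide 26 can be sketched roughly as follows. This is not the CFS implementation: the dicts stand in for the inode and sblocks column families, `block_size` and `subblock_size` play the roles of ‘dfs.block.size’ and ‘cfs.local.subblock.size’, random UUIDs replace the TimeUUID column names, and zlib is an assumed codec (the slides only say the data is compressed).

```python
import uuid
import zlib


def chunks(data, size):
    """Split bytes into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_file(inode, sblocks, path, data, block_size, subblock_size):
    """Write metadata, then split data into blocks and blocks into sub blocks."""
    file_key = uuid.uuid4()
    inode[file_key] = {"filename": path, "blocks": []}  # write file metadata
    for block in chunks(data, block_size):               # read in a block
        block_id = uuid.uuid4()
        # One sblocks row per block; each sub block is stored compressed.
        sblocks[block_id] = [
            zlib.compress(sub) for sub in chunks(block, subblock_size)
        ]
        inode[file_key]["blocks"].append(block_id)       # update inode
    return file_key


def reassemble(inode, sblocks, file_key):
    """Read every sub block back in order (used here as a round-trip check)."""
    return b"".join(
        zlib.decompress(sub)
        for block_id in inode[file_key]["blocks"]
        for sub in sblocks[block_id]
    )
```

For example, 2,560 bytes written with `block_size=1000` and `subblock_size=300` produce three block rows, the first holding four sub blocks.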
  27. (no text on this slide)
  28. Reads • Check for file in inode • Determine appropriate blocks • Request blocks via thrift • If data is local... • ...get location on local filesystem • If data is remote... • ...get actual file content via thrift
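
The local/remote branch on slide 28 might look like this in outline. The names are hypothetical: `local_blocks` maps a block ID to a path on the local filesystem, and `fetch_remote` is any callable standing in for the Thrift request that returns file content. No real Thrift API is used here.

```python
def read_block(block_id, local_blocks, fetch_remote):
    """Return a block's bytes, preferring a local file over a remote fetch.

    local_blocks: dict of block id -> local filesystem path (assumed shape).
    fetch_remote: callable(block_id) -> bytes, a stand-in for the Thrift call.
    """
    path = local_blocks.get(block_id)
    if path is not None:
        # Data is local: only a location was needed, so read the file directly.
        with open(path, "rb") as handle:
            return handle.read()
    # Data is remote: the content itself has to come over the network.
    return fetch_remote(block_id)
```

The design point the slide makes is that a local read ships back only a file location, so the block bytes never pass through the RPC layer.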
  29. What Else? • Current Implementation: 1.0.4 • <property><name>fs.cfs.impl</name><value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value></property> • Supports HDFS append() • Immutability makes things easy • See the first incarnation • https://github.com/riptano/brisk
  30. Want a job? nick@datastax.com
  31. Questions?