Cassandra4Hadoop
Published on

May 2013: Cassandra presentation for the HadoopNJ user group by Edward Capriolo

Published in: Technology

Transcript

  • 1. Cassandra
  • 2. FUNdamentals Overview
  • 3. Main points
      • Structured log storage
      • Columns ordered by name inside key
      • Rows ordered by hash of row key *
      • Column family storage
      • Fully distributed peer-to-peer
      • Partitioned by row key
      • Dynamo consistency
  • 4. Structured log storage
      • No writes in place for you
      • JVM heap is reserved for memtables
      • Memtables are sorted
      • When memtables reach a specific size they are flushed to disk
          − Creates sstable file
          − Bloom filter file
          − Index file
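The memtable-then-flush cycle on this slide can be sketched in a few lines. This is a toy model, not Cassandra's implementation: the class name `ToyStore` and the entry-count `flush_threshold` are assumptions made for illustration, and the bloom filter and index files that accompany a real sstable are left out.

```python
from typing import Dict, List, Optional, Tuple


class ToyStore:
    """Toy log-structured store: writes land in memory, never in existing files."""

    def __init__(self, flush_threshold: int = 3) -> None:
        # Hypothetical limit counted in entries; a real memtable is sized in bytes.
        self.flush_threshold = flush_threshold
        self.memtable: Dict[str, str] = {}
        # Each flushed "sstable" is a sorted list that is never modified again.
        self.sstables: List[List[Tuple[str, str]]] = []

    def write(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        if not self.memtable:
            return
        # Sort on the way out and start a fresh memtable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        # Newest flushed table first, so later writes shadow earlier ones.
        for table in reversed(self.sstables):
            for name, value in table:
                if name == key:
                    return value
        return None
```

Because old files are never rewritten, an update is simply a newer entry that shadows the older one at read time.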
  • 5. Compaction
      • SSTables merged
      • Deleted columns physically removed
      • Two compaction strategies
          − Size-tiered
          − Leveled (LevelDB-style)
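A rough sketch of what merging sstables means: keep the newest version of each column and drop the ones whose newest version is a delete marker. The tuple layout `(name, timestamp, value)` and the use of `None` as a tombstone are assumptions for this example, and it ignores the grace period Cassandra waits before purging tombstones.

```python
TOMBSTONE = None  # assumed delete marker for this sketch


def compact(sstables):
    """Merge lists of (name, timestamp, value) into one sorted list.

    The newest timestamp wins for each column name; columns whose newest
    version is a tombstone are physically dropped from the output.
    """
    latest = {}
    for table in sstables:
        for name, timestamp, value in table:
            if name not in latest or timestamp > latest[name][0]:
                latest[name] = (timestamp, value)
    return [
        (name, timestamp, value)
        for name, (timestamp, value) in sorted(latest.items())
        if value is not TOMBSTONE
    ]


old_table = [("a", 1, "x"), ("b", 1, "y")]
new_table = [("a", 2, TOMBSTONE), ("c", 2, "z")]
merged = compact([old_table, new_table])
```

Here column `a` disappears from `merged` because its most recent write was a delete.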
  • 6. Commit logs
      • Every write/delete operation goes to the commit log
      • If a node shuts down with un-flushed memtables (every shutdown, really), replay the commit logs
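Replay can be pictured as re-applying the logged operations, in order, to an empty memtable. The log record shape `(op, key, value)` below is an assumption for illustration only, not Cassandra's on-disk format.

```python
def replay(commit_log):
    """Rebuild an in-memory table from an ordered list of (op, key, value) records."""
    memtable = {}
    for op, key, value in commit_log:
        if op == "set":
            memtable[key] = value
        elif op == "delete":
            # A delete is itself a write: it leaves a tombstone marker behind.
            memtable[key] = None
        else:
            raise ValueError(f"unknown operation: {op}")
    return memtable


log = [("set", "a", "1"), ("set", "b", "2"), ("delete", "a", None)]
recovered = replay(log)
```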
  • 7. Columns ordered inside key
      • Cassandra likes wide rows
          − Up to 2 billion columns per row
          − (but not really; that would be a 32GB row)
        set mystuff[ecapriolo][a]=1
        set mystuff[ecapriolo][b]=2
        set mystuff[ecapriolo][c]=3
        ...
        slice mystuff[ecapriolo] [b] [g]
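Because columns are kept sorted by name inside a row, a slice is a range lookup rather than a scan. A minimal sketch, assuming a row is modelled as a sorted list of `(name, value)` pairs and that both slice bounds are inclusive; the helper name `slice_columns` is made up for this example.

```python
from bisect import bisect_left, bisect_right


def slice_columns(row, start, finish):
    """Return the columns whose names fall in [start, finish].

    `row` must already be sorted by column name.
    """
    names = [name for name, _ in row]
    lo = bisect_left(names, start)    # first name >= start
    hi = bisect_right(names, finish)  # one past the last name <= finish
    return row[lo:hi]


row = [("a", 1), ("b", 2), ("c", 3), ("h", 8)]
result = slice_columns(row, "b", "g")
```

The bounds do not have to exist as column names: `g` is absent from the row, and the slice simply stops at the last name that sorts before it.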
  • 8. Rows ordered by hash of row key
      • All columns of row a1 are on the same node
      • But row a2 may live on a different node
      • Reduces hot spots
      • But there is no total ordering based on row keys
  • 9. Peer to Peer
      • Node list and token ranges are gossiped
      • Each node is responsible for local storage and requests
      • When a new node joins, it takes some token range away from other nodes
  • 10. Replication 3 (diagram: rows Ed, stacey, and bob each appear on three nodes)
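The hash partitioning, token ranges, and replication of the last three slides fit together roughly as follows. This is a simplified sketch: it hashes with MD5 into a 128-bit space, places replicas on the next nodes clockwise around the ring, and uses made-up node names and evenly spaced tokens. It is not Cassandra's actual partitioner or replication strategy code.

```python
import hashlib
from bisect import bisect_left


def token(row_key: str) -> int:
    """Hash a row key to a position on the ring (128-bit MD5 in this sketch)."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


def replicas(row_key, ring, replication_factor=3):
    """Pick the nodes that store a row.

    `ring` is a list of (token, node_name) sorted by token. The first replica
    is the first node whose token is >= the key's token, wrapping around;
    the remaining replicas are the next nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, token(row_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node ring with evenly spaced tokens.
ring = [(i * 2**128 // 4, f"node{i}") for i in range(4)]
owners = replicas("ecapriolo", ring)
```

All columns of one row go to the same set of nodes, while a different row key hashes to an unrelated position, which is why there is no useful ordering across row keys.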
  • 11. Dynamo consistency
      • Operations have a requested Consistency Level (CL)
          − ONE
          − QUORUM
      • CL nodes ack the operation before the user receives an ack
      • If an operation fails it is safe to retry *
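The arithmetic behind the consistency levels is small enough to write down. QUORUM is a majority of the replicas, and a read is guaranteed to see the latest acknowledged write when the read and write ack counts add up to more than the replication factor. The function names below are made up for this sketch, and only the two levels named on the slide are modelled.

```python
def quorum(replication_factor: int) -> int:
    """A majority of the replicas."""
    return replication_factor // 2 + 1


def acks_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must ack before the client gets its ack."""
    return {"ONE": 1, "QUORUM": quorum(replication_factor)}[level]


def read_sees_latest_write(read_level: str, write_level: str, replication_factor: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    needed = acks_required(read_level, replication_factor) + acks_required(
        write_level, replication_factor
    )
    return needed > replication_factor
```

With replication 3, QUORUM reads plus QUORUM writes need 2 + 2 = 4 acks, which is more than 3, so the sets overlap; ONE plus ONE needs only 2, so a read may hit a replica the write has not reached yet.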
  • 12. Fully distributed. The good
      • Highly available
      • Redundant
      • Fault tolerant
  • 13. Fully distributed! The bad
      • Locks
      • Counters
      • Tombstones
      • Consistency
  • 14. Hadoop
  • 15. Hadoop and Cassandra
      • ColumnFamilyInputFormat
          − Takes a ColumnFamily as input
          − Map(ByteBuffer key, SortedMap<ByteBuffer, Column> columns)
      • ColumnFamilyOutputFormat
          − Writes out to a column family
          − OutputFormat<ByteBuffer, List<Mutation>>
  • 16. Hadoop optimizations
      • Tasks run with locality if C* and Hadoop are on the same node
      • InputFormat can leverage C* secondary indexes
      • OutputFormat can use the bulk loader
          − C* writes are helluva fast anyway
  • 17. Hive and Cassandra
      • Hive support is similar to the HBase handler support
      • Create a Hive table specifying properties similar to those in MapReduce
        hive> CREATE EXTERNAL TABLE Users(userid string, name string, email string, phone string)
              STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH
  • 18. Other support out there
      • github.com/edwardcapriolo/hive-cassandra-udfs
          − Delete UDF
          − Composite splitter/builder UDFs
      • Not very hard to roll your own input format
          − OneRowInputFormat
          − ListOfRowsInputFormat
  • 19. Pig Cassandra
      • Nice support for Pig/Cassandra
      • Pygmalion library
      • But I don't use it
          − 'Cause I use Hive
          − You should as well
          − And get my book :)
  • 20. Comparison between C* and “other NoSQL”
      • I know you're talking about HBase :)
      • Cassandra does not store multiple versions of a column
          − Last update wins
          − Use a UUID as part of the column name instead
      • The row keys are not globally ordered *
          − Unless you are using ByteOrderedPartitioner (no one should use this)
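The "last update wins" rule and the UUID workaround can be contrasted in a short sketch. Both helpers are hypothetical and model a row as a plain dictionary. The first keeps a single timestamped value per column name. The second makes every write a new column by putting a time-based UUID into the column name, so older versions stay readable.

```python
import uuid


def write_last_update_wins(row, name, value, timestamp):
    """Keep one value per column name; the write with the highest timestamp wins."""
    current = row.get(name)
    if current is None or timestamp >= current[0]:
        row[name] = (timestamp, value)


def write_versioned(row, name, value):
    """Keep every version by making a time-based UUID part of the column name."""
    version = uuid.uuid1()
    row[(name, version)] = value
    return version


single = {}
write_last_update_wins(single, "email", "old@example.com", timestamp=1)
write_last_update_wins(single, "email", "new@example.com", timestamp=2)
write_last_update_wins(single, "email", "late@example.com", timestamp=1)  # stale, ignored

history = {}
first = write_versioned(history, "email", "old@example.com")
second = write_versioned(history, "email", "new@example.com")
```

After these calls `single` holds only the timestamp-2 value, while `history` holds both versions under distinct column names.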
  • 21. Comparison between C* and “other NoSQL”
      • Each C* replica actively serves reads & writes
      • Cassandra directly manages its storage
      • Shards are pre-defined tokens (no auto-split)
      • Qualifier/column name can NOT be null
  • 22. Key Performance tips
  • 23. Know your data
      • Design for the long tail scenarios
          − With design x our largest customer will have 10000000000000 columns in one row
      • How large will this column family be in 5 months?
      • What is the request rate?
      • How random is the read pattern?
  • 24. Understanding write-once files
      • Deletes are writes that get compacted away later
      • Can you optimize from blind writes?
      • What percent of your application is update/insert?
  • 25. Profiling / Dark Launch
      • Compression
      • Compaction strategy
  • 26. Metrics
      • Collect the JMX information
          − Column family
          − Caches
      • Set milestone alerts (traps)
  • 27. Hardware
      • Fast disk (you almost always want SSD)
      • RAM
          − Caches, bloom filters, young gen
      • CPU
          − Garbage collector, deserialization, and compaction need CPU to work
  • 28. Anti patterns
      • Using one row key as a queue
      • Doing N reads to satisfy a request
      • Read before write
      • Using collection support in place of wide rows
      • Encoding
  • 29. Questions?
