Bulk Loading Data into Cassandra

Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path.

In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.

Published in: Technology
1 Comment
  • Why should we use sstableloader vs. COPY vs. client inserts? Is there any discussion of the different ways to do this and their pros and cons, such as missed data?

Transcript

  • 1. Planet Cassandra 2014 Bulk-Loading Data into Cassandra Patricia Gorla @patriciagorla Cassandra Consultant www.thelastpickle.com
  • 2. About Us • Work with clients to deliver and improve Apache Cassandra services • Apache Cassandra committer, Datastax MVP, Hector maintainer, Apache Usergrid committer • Based in New Zealand & USA
  • 3. Why is bulk loading useful? • Performance tests
  • 4. Why is bulk loading useful? • Performance tests • Migrating historical data
  • 5. Why is bulk loading useful? • Performance tests • Migrating historical data • Changing topologies
  • 6. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 7. Cassandra Write Path write[0]
  • 8. Cassandra Write Path • Writes are written to both the commit log and the memtable. (Diagram: write[0], commitlog, memtable)
  • 9. Cassandra Write Path • Writes are written to both the commit log and the memtable. • The memtable is sorted. (Diagram: write[0], commitlog, memtable)
  • 10. Cassandra Write Path • The memtable is flushed out to sstables. (Diagram: write[0], commitlog, memtable, sstable[0], sstable[1], sstable[2])
  • 11. Cassandra Write Path • Compaction helps keep the read latency low. (Diagram: write[0], commitlog, memtable, sstable[0], sstable[1], sstable[2], sstable[n])
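The write path on slides 7 to 11 can be sketched as a toy model. This is only an illustration under loud assumptions: the class `ToyWritePath`, its `flushThreshold` knob, and the "later sstable wins" merge rule are hypothetical and are not Cassandra's real classes or conflict resolution (Cassandra compares write timestamps).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of the write path on the slides; NOT Cassandra's real implementation. */
final class ToyWritePath {
    // Append-only record of every write, in arrival order.
    private final List<String> commitLog = new ArrayList<>();
    // In-memory table, kept sorted by key.
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Flushed tables; each one is sorted and read-only once written.
    private final List<SortedMap<String, String>> sstables = new ArrayList<>();
    // Hypothetical knob: flush once the memtable holds this many keys.
    private final int flushThreshold;

    ToyWritePath(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** A write goes to both the commit log and the memtable. */
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    /** The sorted memtable is flushed out as a new sstable. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableSortedMap(memtable));
        memtable = new TreeMap<>();
    }

    /** Compaction merges sstables so that reads consult fewer of them. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (SortedMap<String, String> table : sstables) {
            merged.putAll(table); // assumption: a later flush overrides an earlier one
        }
        sstables.clear();
        if (!merged.isEmpty()) {
            sstables.add(Collections.unmodifiableSortedMap(merged));
        }
    }

    /** Reads check the memtable first, then sstables from newest to oldest. */
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }

    int commitLogSize() {
        return commitLog.size();
    }

    public static void main(String[] args) {
        ToyWritePath path = new ToyWritePath(2);
        path.write("b", "1");
        path.write("a", "2");
        path.write("a", "3");
        path.write("c", "4");
        System.out.println("sstables before compaction: " + path.sstableCount());
        path.compact();
        System.out.println("sstables after compaction: " + path.sstableCount());
    }
}
```

Bulk loading, the subject of the rest of the talk, amounts to producing the sorted, flushed tables directly instead of going through `write`.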
  • 12. Sorted String Tables
        mykeyspace-mycf-jb-1-CompressionInfo.db
        mykeyspace-mycf-jb-1-Data.db
        mykeyspace-mycf-jb-1-Filter.db
        mykeyspace-mycf-jb-1-Index.db
        mykeyspace-mycf-jb-1-Statistics.db
        mykeyspace-mycf-jb-1-Summary.db
        mykeyspace-mycf-jb-1-TOC.txt
  • 13. Sorted String Tables • mykeyspace-mycf-jb-1-Data.db: Contains all data needed to regenerate components
  • 14. Sorted String Tables • mykeyspace-mycf-jb-1-Index.db: Index of row keys
  • 15. Sorted String Tables • mykeyspace-mycf-jb-1-Summary.db: Index summary from Index.db file
  • 16. Sorted String Tables • mykeyspace-mycf-jb-1-Filter.db: Bloom filter over sstable
  • 17. Sorted String Tables • mykeyspace-mycf-jb-1-TOC.txt: Table of contents of all components
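The component file names on slides 12 to 17 follow a `keyspace-table-version-generation-Component` pattern. A minimal sketch of splitting such a name, assuming that reading of the fields (the slides do not spell it out) and assuming keyspace and table names contain no hyphen; the class `SSTableFileName` is hypothetical and not part of Cassandra:

```java
/** Hypothetical helper that splits a 2.0-era sstable component file name into its parts. */
final class SSTableFileName {
    final String keyspace;
    final String table;
    final String version;    // on-disk format version, e.g. "jb" in the slides
    final int generation;    // counter distinguishing sstables of one table
    final String component;  // e.g. "Data.db", "Index.db", "TOC.txt"

    private SSTableFileName(String keyspace, String table, String version,
                            int generation, String component) {
        this.keyspace = keyspace;
        this.table = table;
        this.version = version;
        this.generation = generation;
        this.component = component;
    }

    /** Parses names shaped like "mykeyspace-mycf-jb-1-Data.db". */
    static SSTableFileName parse(String fileName) {
        // Limit of 5 keeps any hyphen inside the component part intact.
        String[] parts = fileName.split("-", 5);
        if (parts.length != 5) {
            throw new IllegalArgumentException("Unexpected sstable file name: " + fileName);
        }
        return new SSTableFileName(parts[0], parts[1], parts[2],
                                   Integer.parseInt(parts[3]), parts[4]);
    }

    public static void main(String[] args) {
        SSTableFileName name = parse("mykeyspace-mycf-jb-1-Data.db");
        System.out.println(name.keyspace + " / " + name.table + " / " + name.version
                + " / " + name.generation + " / " + name.component);
    }
}
```

All components that share the same keyspace, table, version, and generation belong to one sstable.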
  • 18. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 19. Set up keyspace and column family
        create keyspace test
          with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
          and strategy_options = {replication_factor:1};

        create column family test
          with comparator = 'AsciiType'
          and default_validation_class = 'AsciiType'
          and key_validation_class = 'AsciiType';
  • 20. SStableGen.java
        AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
            directory,
            partitioner,
            keyspace,
            columnFamily,
            AsciiType.instance,
            null, // subcomparator for super columns
            size_per_sstable_mb
        );
  • 23. ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
        KeyGenerator keyGen = new KeyGenerator();
        long dataSize = 0;
        writer = new SSTableSimpleUnsortedWriter(…);

        while (dataSize < max_data_bytes) {
            // KeyGenerator is the speaker's helper; next() is an assumed method name
            ByteBuffer key = keyGen.next();
            writer.newRow(key);
            for (int j = 0; j < num_cols; j++) {
                ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
                ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
                // rewind the random source when fewer than 20 bytes remain
                if (randomBytes.remaining() < colValue.capacity()) {
                    randomBytes.position(0);
                }
                // fills the column value and advances randomBytes by 20 bytes
                randomBytes.get(colValue.array());
                writer.addColumn(colName, colValue, timestamp);
                // count the bytes written so the outer loop terminates
                dataSize += colValue.capacity();
            }
        }
        writer.close();
  • 24. Examining sstable output
        patricia@dev:~/../data$ ls -lh mykeyspace/mycf
        total 64
        -rw-r--r-- 1 patricia staff  43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
        -rw-r--r-- 1 patricia staff  79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db
        -rw-r--r-- 1 patricia staff  16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db
        -rw-r--r-- 1 patricia staff  36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db
        -rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db
        -rw-r--r-- 1 patricia staff  80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db
        -rw-r--r-- 1 patricia staff  79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt
  • 25. $ bin/sstableloader Keyspace1/ColFam1
        patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
        Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
        progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  • 29. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server
  • 30. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server • Throttle command
  • 31. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server • Throttle command • Parallelise processes
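The three tips on slides 29 to 31 (separate server, throttling, parallel processes) can be sketched as a small launcher meant to run on a machine outside the cluster. This is a rough illustration, not the speaker's tooling: the class `ParallelLoaderSketch`, the directory names, and the host list are hypothetical, and the `-t` throttle option (in Mbit/s) should be confirmed against `sstableloader --help` for your Cassandra version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical launcher: builds throttled sstableloader commands and runs them in parallel. */
final class ParallelLoaderSketch {

    /** Builds one command line; a throttle of 0 or less means "do not pass -t". */
    static List<String> loaderCommand(String sstableDir, String hosts, int throttleMbits) {
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/sstableloader");
        cmd.add("-d");
        cmd.add(hosts); // comma-separated initial hosts, as on the slides
        if (throttleMbits > 0) {
            cmd.add("-t"); // assumed throttle flag, value in Mbit/s
            cmd.add(Integer.toString(throttleMbits));
        }
        cmd.add(sstableDir); // directory shaped like <keyspace>/<column family>
        return cmd;
    }

    /** Runs every command as its own process, at most `parallelism` at a time. */
    static void runAll(List<List<String>> commands, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Integer>> exitCodes = new ArrayList<>();
            for (List<String> cmd : commands) {
                exitCodes.add(pool.submit(
                        () -> new ProcessBuilder(cmd).inheritIO().start().waitFor()));
            }
            for (Future<Integer> exitCode : exitCodes) {
                int code = exitCode.get(); // blocks until that loader has finished
                if (code != 0) {
                    throw new IllegalStateException("sstableloader exited with code " + code);
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Dry run only: print the commands instead of launching them.
        for (String dir : Arrays.asList("out0/mykeyspace/mycf", "out1/mykeyspace/mycf")) {
            System.out.println(String.join(" ", loaderCommand(dir, "cass1,cass2,cass3", 100)));
        }
    }
}
```

Keeping command construction separate from `runAll` makes it easy to print and review the commands before streaming anything to a live cluster.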
  • 32. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 33. // list of orders by user
        customerOrders = new SSTableSimpleUnsortedWriter(…);
        // orders by order id
        orders = new SSTableSimpleUnsortedWriter(…);

        // assume orders are in date order
        for (Order order : oldOrders) {
            customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
            customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId),
                                     ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

            // row key is the order id, matching the "orders by order id" comment above
            orders.newRow(ByteBufferUtil.bytes(order.orderId));
            orders.addColumn(ByteBufferUtil.bytes("customer_id"),
                             ByteBufferUtil.bytes(order.customerId), timestamp);
            orders.addColumn(ByteBufferUtil.bytes("date"),
                             ByteBufferUtil.bytes(order.date), timestamp);
            orders.addColumn(ByteBufferUtil.bytes("total"),
                             ByteBufferUtil.bytes(order.total), timestamp);
        }

        // closing the writers flushes any buffered rows to disk
        customerOrders.close();
        orders.close();
  • 36. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 37. $ bin/sstableloader Keyspace1/ColFam1
        patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3

        Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]

        progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  • 40. $ bin/sstableloader Keyspace1/ColFam1
        patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
        pending tasks: 30
        Active compaction remaining time : n/a
  • 41. • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 42. CQL: Keep schema consistent
        cqlsh> CREATE KEYSPACE "test" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
        cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY ) ;
  • 43. CQL3 Considerations • Uses CompositeType comparator
  • 44. Planet Cassandra 2014 Q&A Patricia Gorla @patriciagorla Cassandra Consultant www.thelastpickle.com
