Bulk Loading Data into Cassandra

  • 3,341 views
Uploaded on

Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path. …

Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path.

In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
No Downloads

Views

Total Views
3,341
On Slideshare
0
From Embeds
0
Number of Embeds
3

Actions

Shares
Downloads
82
Comments
0
Likes
5

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Planet Cassandra 2014 Bulk-Loading Data into Cassandra Patricia Gorla @patriciagorla Cassandra Consultant www.thelastpickle.com
  • 2. About Us • Work with clients to deliver and improve Apache Cassandra services • Apache Cassandra committer, Datastax MVP, Hector maintainer, Apache Usergrid committer • Based in New Zealand & USA
  • 3. Why is bulk loading useful? • Performance tests
  • 4. Why is bulk loading useful? • Performance tests • Migrating historical data
  • 5. Why is bulk loading useful? • Performance tests • Migrating historical data • Changing topologies
  • 6. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 7. Cassandra Write Path write[0]
  • 8. Cassandra Write Path • write[0] Writes written to both the commit log and memtable. commitlog memtable
  • 9. Cassandra Write Path • • write[0] Writes written to both the commit log and memtable. Memtable is sorted. commitlog memtable
  • 10. Cassandra Write Path • write[0] Memtable flushed out to sstables. commitlog memtable sstable[0] sstable[2] sstable[1]
  • 11. Cassandra Write Path • write[0] Compaction helps keep the read latency low. commitlog memtable sstable[0] sstable[2] sstable[1] sstable[n]
  • 12. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
  • 13. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Contains all data needed to regenerate components
  • 14. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Index of row keys
  • 15. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Index summary from Index.db file
  • 16. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Bloom filter over sstable
  • 17. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Table of contents of all components
  • 18. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 19. create keyspace test with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1}; ! create column family test with comparator = 'AsciiType' and default_validation_class = 'AsciiType' and key_validation_class = 'AsciiType'; Set up keyspace and column family
  • 20. AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, // subcomparator for super columns size_per_sstable_mb ); SStableGen.java
  • 21. AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, // subcomparator for super columns size_per_sstable_mb ); SStableGen.java
  • 22. AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, // subcomparator for super columns size_per_sstable_mb ); SStableGen.java
  • 23. ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024)); KeyGenerator keyGen = new KeyGenerator(); long dataSize = 0; writer = new SSTableSimpleUnsortedWriter(…); while (dataSize < max_data_bytes) { writer.newRow(key); for (int j=0; j<num_cols; j++) { ByteBuffer colName = ByteBufferUtil.bytes("col_" + j); ByteBuffer colValue = ByteBuffer.wrap(new byte[20]); randomBytes.get(colValue.array()); colValue.position(0); writer.addColumn(colName, colValue, timestamp); if (randomBytes.remaining() < colValue.limit()) { randomBytes.position(0); } else { randomBytes.position(randomBytes.position() + colValue.limit()); } } } }
  • 24. patricia@dev:~/../data$ total 64 -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia ls -lh mykeyspace/mycf staff staff staff staff staff staff staff 43B 79K 16B 36B 4.3K 80B 79B Feb Feb Feb Feb Feb Feb Feb 2 2 2 2 2 2 2 15:31 15:31 15:31 15:31 15:31 15:31 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Examining sstable output
  • 25. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  • 26. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  • 27. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  • 28. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  • 29. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server
  • 30. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server • Throttle command
  • 31. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server • Throttle command • Parallelise processes
  • 32. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 33. // list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); ! // assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } ! customerOrders.close() orders.close()
  • 34. // list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); ! // assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } ! customerOrders.close() orders.close()
  • 35. // list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); ! // assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } ! customerOrders.close() orders.close()
  • 36. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 37. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3 ! Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] ! progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  • 38. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3 ! Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] ! progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  • 39. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3 ! Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] ! progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  • 40. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats pending tasks: 30 Active compaction remaining time : n/a
  • 41. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  • 42. cqlsh> CREATE KEYSPACE "test" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; ! cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY ) ; CQL: Keep schema consistent
  • 43. CQL3 Considerations • Uses CompositeType comparator
  • 44. Planet Cassandra 2014 Q&A Patricia Gorla @patriciagorla Cassandra Consultant www.thelastpickle.com