0
Planet Cassandra 2014

Bulk-Loading Data into Cassandra
Patricia Gorla

@patriciagorla

Cassandra Consultant

www.thelastp...
About Us
•

Work with clients to deliver and improve
Apache Cassandra services


•

Apache Cassandra committer, Datastax
M...
Why is bulk loading useful?
•

Performance tests
Why is bulk loading useful?
•

Performance tests

•

Migrating historical data
Why is bulk loading useful?
•

Performance tests

•

Migrating historical data

•

Changing topologies
!

•

How Data is Stored

•

Case Studies
	 - Generating Dummy Data
	 - Backfilling Historical Data
	 - Changing Topologies...
Cassandra Write Path

write[0]
Cassandra Write Path
•

write[0]

Writes written to both the commit log and
memtable.

commitlog

memtable
Cassandra Write Path
•

•

write[0]

Writes written to both the commit log and
memtable.

Memtable is sorted.

commitlog

...
Cassandra Write Path
•

write[0]

Memtable flushed out to sstables.

commitlog

memtable

sstable[0]
sstable[2]
sstable[1]
Cassandra Write Path
•

write[0]

Compaction helps keep the read latency
low.

commitlog

memtable

sstable[0]
sstable[2]
...
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
m...
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
m...
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
m...
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
m...
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
m...
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db
mykeyspace-mycf-jb-1-Filter.db
m...
!

•

How Data is Stored

•

Case Studies
	 - Generating Dummy Data
	 - Backfilling Historical Data
	 - Changing Topologies...
create keyspace test
with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = {repli...
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
directory,
partitioner,
keyspace,
columnFamily,
Asci...
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
directory,
partitioner,
keyspace,
columnFamily,
Asci...
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
directory,
partitioner,
keyspace,
columnFamily,
Asci...
ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
KeyGenerator keyGen = new KeyGenerator();
long dataSize ...
patricia@dev:~/../data$
total 64
-rw-r--r-- 1 patricia
-rw-r--r-- 1 patricia
-rw-r--r-- 1 patricia
-rw-r--r-- 1 patricia
-...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Str...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Str...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Str...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Str...
$ bin/sstableloader Keyspace1/ColFam1
•

Run command on separate server
$ bin/sstableloader Keyspace1/ColFam1
•

Run command on separate server

•

Throttle command
$ bin/sstableloader Keyspace1/ColFam1
•

Run command on separate server

•

Throttle command

•

Parallelise processes
!

•

How Data is Stored

•

Case Studies
	 - Generating Dummy Data
	 - Backfilling Historical Data
	 - Changing Topologies...
// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableS...
// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableS...
// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableS...
!

•

How Data is Stored

•

Case Studies
	 - Generating Dummy Data
	 - Backfilling Historical Data
	 - Changing Topologies...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d 
cass1,cass2,...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d 
cass1,cass2,...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d 
cass1,cass2,...
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
pending tasks: 30
A...
!

•

How Data is Stored

•

Case Studies
	 - Generating Dummy Data
	 - Backfilling Historical Data
	 - Changing Topologies...
cqlsh> CREATE KEYSPACE "test"
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 };
!

cqlsh> CREATE C...
CQL3 Considerations
•

Uses CompositeType comparator
Planet Cassandra 2014

Q&A
Patricia Gorla

@patriciagorla

Cassandra Consultant

www.thelastpickle.com
Upcoming SlideShare
Loading in...5
×

Bulk Loading Data into Cassandra

5,930

Published on

Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path.

In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.

Published in: Technology
1 Comment
7 Likes
Statistics
Notes
  • Why we should use sstableloader vs copy vs client inserts. Is here any discussion on the different ways to do this and their pros/cons like missed data and so on?
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
5,930
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
131
Comments
1
Likes
7
Embeds 0
No embeds

No notes for slide

Transcript of "Bulk Loading Data into Cassandra"

  1. 1. Planet Cassandra 2014 Bulk-Loading Data into Cassandra Patricia Gorla @patriciagorla Cassandra Consultant www.thelastpickle.com
  2. 2. About Us • Work with clients to deliver and improve Apache Cassandra services • Apache Cassandra committer, Datastax MVP, Hector maintainer, Apache Usergrid committer • Based in New Zealand & USA
  3. 3. Why is bulk loading useful? • Performance tests
  4. 4. Why is bulk loading useful? • Performance tests • Migrating historical data
  5. 5. Why is bulk loading useful? • Performance tests • Migrating historical data • Changing topologies
  6. 6. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  7. 7. Cassandra Write Path write[0]
  8. 8. Cassandra Write Path • write[0] Writes written to both the commit log and memtable. commitlog memtable
  9. 9. Cassandra Write Path • • write[0] Writes written to both the commit log and memtable. Memtable is sorted. commitlog memtable
  10. 10. Cassandra Write Path • write[0] Memtable flushed out to sstables. commitlog memtable sstable[0] sstable[2] sstable[1]
  11. 11. Cassandra Write Path • write[0] Compaction helps keep the read latency low. commitlog memtable sstable[0] sstable[2] sstable[1] sstable[n]
  12. 12. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
  13. 13. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Contains all data needed to regenerate components
  14. 14. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Index of row keys
  15. 15. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Index summary from Index.db file
  16. 16. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Bloom filter over sstable
  17. 17. Sorted String Tables mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Table of contents of all components
  18. 18. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  19. 19. create keyspace test with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1}; ! create column family test with comparator = 'AsciiType' and default_validation_class = 'AsciiType' and key_validation_class = 'AsciiType'; Set up keyspace and column family
  20. 20. AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, // subcomparator for super columns size_per_sstable_mb ); SStableGen.java
  21. 21. AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, // subcomparator for super columns size_per_sstable_mb ); SStableGen.java
  22. 22. AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, // subcomparator for super columns size_per_sstable_mb ); SStableGen.java
  23. 23. ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024)); KeyGenerator keyGen = new KeyGenerator(); long dataSize = 0; writer = new SSTableSimpleUnsortedWriter(…); while (dataSize < max_data_bytes) { writer.newRow(key); for (int j=0; j<num_cols; j++) { ByteBuffer colName = ByteBufferUtil.bytes("col_" + j); ByteBuffer colValue = ByteBuffer.wrap(new byte[20]); randomBytes.get(colValue.array()); colValue.position(0); writer.addColumn(colName, colValue, timestamp); if (randomBytes.remaining() < colValue.limit()) { randomBytes.position(0); } else { randomBytes.position(randomBytes.position() + colValue.limit()); } } } }
  24. 24. patricia@dev:~/../data$ total 64 -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia -rw-r--r-- 1 patricia ls -lh mykeyspace/mycf staff staff staff staff staff staff staff 43B 79K 16B 36B 4.3K 80B 79B Feb Feb Feb Feb Feb Feb Feb 2 2 2 2 2 2 2 15:31 15:31 15:31 15:31 15:31 15:31 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt Examining sstable output
  25. 25. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  26. 26. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  27. 27. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  28. 28. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
  29. 29. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server
  30. 30. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server • Throttle command
  31. 31. $ bin/sstableloader Keyspace1/ColFam1 • Run command on separate server • Throttle command • Parallelise processes
  32. 32. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  33. 33. // list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); ! // assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } ! customerOrders.close() orders.close()
  34. 34. // list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); ! // assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } ! customerOrders.close() orders.close()
  35. 35. // list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); ! // assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } ! customerOrders.close() orders.close()
  36. 36. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  37. 37. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3 ! Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] ! progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  38. 38. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3 ! Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] ! progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  39. 39. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3 ! Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] ! progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
  40. 40. $ bin/sstableloader Keyspace1/ColFam1 patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats pending tasks: 30 Active compaction remaining time : n/a
  41. 41. ! • How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion
  42. 42. cqlsh> CREATE KEYSPACE "test" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; ! cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY ) ; CQL: Keep schema consistent
  43. 43. CQL3 Considerations • Uses CompositeType comparator
  44. 44. Planet Cassandra 2014 Q&A Patricia Gorla @patriciagorla Cassandra Consultant www.thelastpickle.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×