Bulk Loading into Cassandra
What are we talking about today?
• Problem statement
• Possible Solutions
– cqlsh COPY FROM
– Custom code using SSTable formatted files
– Java CQL INSERTs
• Test Results
• Unloading considerations
© 2015. All Rights Reserved.
The problem is simple…
Load a pile of files into Cassandra
• Where do the files start out?
– “On my Laptop/Server’s local file system”
• The focus today!
– “In HDFS (or another DFS)”
• Consider using Spark – not the topic today
– “In an NFS mount”
• Consider using Spark – not the topic today
The Options
• The “Front Door”
– cqlsh COPY FROM
– Java program loading via INSERT statements and executeAsync()
• Or the language of your choice: C/C++, Python, C#, etc.
• The “Side Door”
– Leverage “streaming” via sstableloader
– Need to create SSTables via Java and CQLSSTableWriter
• No other language choice
The Front Door: CQL INSERT
Cassandra Write Path
[Diagram: Coordinator → Commit Log and Memtable (synchronously) → SSTables (periodically)]
Cassandra Clients
• Load Balancing
– Prepared Statements
– Token-aware routing
– Round-robin
• Connections per Cassandra host
– The driver connects to every Cassandra host in the “local” data center
• Synchronous / Asynchronous Execution
– How many “in-flight queries”?
• Consistency Level
• Does not require all nodes to be online
– Standard Cassandra rules apply – hinted handoff, etc
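
The “in-flight queries” question can be illustrated with a minimal sketch. It uses only the Python standard library and simulates asynchronous execution with a thread pool; `fake_execute_async` and `load_rows` are hypothetical stand-ins, not part of any Cassandra driver. A real loader would pass the driver's own asynchronous execute call instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def load_rows(rows, execute_async, max_in_flight):
    """Submit every row, keeping at most max_in_flight requests pending."""
    permits = threading.Semaphore(max_in_flight)
    futures = []
    for row in rows:
        permits.acquire()  # blocks while the in-flight window is full
        future = execute_async(row)
        # Free one slot as soon as this request completes (success or failure).
        future.add_done_callback(lambda _f: permits.release())
        futures.append(future)
    for future in futures:
        future.result()  # re-raises the first failure, if any
    return len(futures)

# Hypothetical stand-in for a driver session: a thread pool that "writes" rows.
pool = ThreadPoolExecutor(max_workers=8)
written = []
written_lock = threading.Lock()

def fake_insert(row):
    with written_lock:
        written.append(row)

def fake_execute_async(row):
    return pool.submit(fake_insert, row)

loaded = load_rows(range(1000), fake_execute_async, max_in_flight=100)
pool.shutdown()
```

A larger window keeps the cluster busier but raises memory use and the risk of timeouts on the client, so the limit is a tuning knob rather than a fixed value.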
cqlsh COPY FROM
• Command-line CQL tool
• Ships with Cassandra
– Can be run from a client machine (versions must match)
• Built in Python
– In 2.1 leverages the Python driver
– No “token aware routing” yet
• Connects to only one coordinator
– Does not round-robin
• Executes CQL INSERTs asynchronously
Java client – e.g., cassandra-loader
(https://github.com/brianmhess/cassandra-loader)
• Java program leveraging the Java CQL driver
– Java driver (and others) provided by DataStax
• Connects to every node in the cluster
– Potentially multiple times per node
• Variety of driver options
– Load balancing – TokenAwarePolicy, DCAwareRoundRobinPolicy, etc
– Connections per host
– Consistency Level, etc
• Asynchronous execution
– Or Synchronous – e.g., for “DDL” operations
– Aside: cassandra-loader uses asynchronous execution (no DDL)
The Side Door: “Streaming”
Streaming – the Client
• A connection to each Cassandra node
– Along with token range information
• For each file
– Read records
– Determine which nodes own the token range for this record
– Send the record to those nodes
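
The “determine which nodes own the token range” step can be sketched as a lookup on a sorted token ring. This is an illustration under stated assumptions, not sstableloader's code: `token_for` uses an MD5-based stand-in rather than Cassandra's actual partitioner hash, the ring and node names are made up, and replicas are simply the next distinct nodes clockwise (no rack or data-center awareness).

```python
import bisect
import hashlib

def token_for(partition_key):
    """Stand-in hash to a signed 64-bit token (NOT Cassandra's partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def replicas_for(token, ring, replication_factor):
    """Return the nodes that own a token.

    ring is a list of (token, node) pairs sorted by token; each node owns
    the range (previous token, its own token], wrapping around the ring.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners

# Hypothetical three-node ring with small tokens for readability.
ring = [(-100, "node-a"), (0, "node-b"), (100, "node-c")]
owners_for_record = replicas_for(token_for("some-partition-key"), ring, 2)
```

The client sends each record to every node in the returned list, which is why it needs a connection to each node.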
Streaming – the Cluster
• Receive records from the client
– First write out to SSTable file
– Read file back in to create various Cassandra objects
• The “Primary Index” – in-memory index for “shortcuts”
• Any secondary indices defined on this table
• Any materialized views defined on this table (in 3.0)
– Move on to next SSTable file and repeat
Streaming
• Streaming requires all nodes to be online
– Because sstableloader will connect to each node
• Can “blacklist” nodes to skip
– sstableloader will not stream to those nodes
– Must know which nodes to skip up front – via nodetool status, say
– SSTables won’t be streamed later to offline nodes when they come online
• No “streaming hints”
• To get data to offline nodes, you must repair
• Streaming also requires SSTables to start with
– Use CQLSSTableWriter Java class to create SSTables
The test
• Delimited files
– Different size rows: 100 bytes, 1KB, 10KB, 1MB
– Same “schema”: 12-byte TEXT, 8-byte BIGINT, rest in a TEXT
• 12-byte TEXT is the partition key, BIGINT is unique and the clustering column
– Each file is 1GB – 20 files to load
• Larger rows mean fewer rows per file
– Parallel execution of commands is allowed – use all the cores
• Hardware/Software
– 8x i2.2xlarge nodes for Cassandra running DSE 4.7.3
– 1x r3.xlarge node as the client
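
The relationship between row size, rows per file, and the reported metrics can be worked through with a little arithmetic. The sketch below assumes decimal units (1GB = 10^9 bytes, 1KB = 1,000 bytes) and ignores delimiter overhead, so the counts are approximations; the helper names are illustrative.

```python
FILE_BYTES = 10**9   # assumed: 1GB per file, decimal units
NUM_FILES = 20
ROW_BYTES = {"100B": 100, "1KB": 1_000, "10KB": 10_000, "1MB": 1_000_000}

# Approximate row counts per file and per test.
rows_per_file = {label: FILE_BYTES // size for label, size in ROW_BYTES.items()}
total_rows = {label: count * NUM_FILES for label, count in rows_per_file.items()}

def data_rate_mb_per_s(duration_s):
    """Data rate for loading all files in duration_s seconds."""
    return NUM_FILES * FILE_BYTES / 1_000_000 / duration_s

def rows_per_s(label, duration_s):
    """Row throughput for the given row-size test and duration."""
    return total_rows[label] / duration_s
```

Under these assumptions every test loads the same 20GB, so duration and data rate are directly comparable across row sizes, while rows per second naturally falls as rows get larger.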
The Contenders
1. CQLSSTableWriter + sstableloader
– Wrote a Java program to take delimited files to SSTables
– Use command-line sstableloader to load
– Need to combine the times of both
2. cqlsh COPY FROM
– Use DSE 4.7.3 (not started) on the client
3. cassandra-loader
– https://github.com/brianmhess/cassandra-loader
– Java CQL driver client
Experiment Execution Details
• CQLSSTableWriter + sstableloader
– Leverage “make -j 8” to run 8 at a time
– Time CQLSSTableWriter and then time sstableloader
• cqlsh COPY FROM
– Leverage “make -j 2” to run 2 at a time
– Running more than 2 caused timeouts and errors
• cassandra-loader
– 8 threads (“-numThreads 8”)
– Unlogged batches of size 4 (“-batchSize 4”)
– 10000 queries in flight (“-numFutures 10000”)
– Exception for the 100-byte test
• 10 threads
• Unlogged batches of size 20
• 50000 queries in flight
Results
[Chart: Duration (s) by row size (100B, 1KB, 10KB, 1MB) for cassandra-loader, sstablewriter+sstableloader, and copy]
[Chart: Rows/s by row size (100B, 1KB, 10KB, 1MB) for cassandra-loader, sstablewriter+sstableloader, and copy]
[Chart: Data Rate (MB/s) by row size (100B, 1KB, 10KB, 1MB) for cassandra-loader, sstablewriter+sstableloader, and copy]
Observations
• Java executeAsync() was faster in all tests
– Except the 100-byte test, where it was a close second
– cassandra-loader means no custom code
• CQLSSTableWriter+sstableloader works better for smaller records
– Performance eroded as record size increased
– Custom Java program for each format
• cqlsh COPY FROM was never the winner
– Second place in the 10KB/row test
– Could not handle the 1MB/row test – ERROR
To Batch or Not To Batch
• Varying opinions on unlogged batches
• Batching puts more load on the coordinator
– The coordinator gets the list of INSERTs and executes each one
– Not all INSERTs will be “owned” by the coordinator
– Essentially, the client offloads work to the coordinator
• Batching means fewer queries to the cluster
– Since the INSERTs are bundled into one query
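
The trade-off can be made concrete with a small sketch that groups rows into fixed-size batches and counts both the requests sent and the distinct partitions each batch touches. It is a standard-library illustration only; the row layout and key names are hypothetical, and no Cassandra driver is involved.

```python
from itertools import islice

def batches(rows, batch_size):
    """Yield consecutive lists of up to batch_size rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk

# Hypothetical rows: (partition key, clustering value).
rows = [(f"key{i % 5}", i) for i in range(10)]

grouped = list(batches(rows, batch_size=4))
requests_sent = len(grouped)  # one query per batch
partitions_per_batch = [len({pk for pk, _ in chunk}) for chunk in grouped]
```

Fewer requests reach the cluster, but each batch here spans several partitions, so the coordinator has to fan the INSERTs out to whichever nodes own them. That is the extra coordinator load described above.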
Batch test
• 10 delimited files
– 10 BIGINT columns – one is partition key, one is clustering column
• Use cassandra-loader
– Vary the -batchSize argument
• Measure
– Throughput – Rows/sec
– Latency – 95th Percentile
Results
• Observation: Increasing batch size
– Increases throughput (to a point)
– Increases latency
• Neither is surprising…
[Chart: Rows/sec by batch size (1, 2, 4, 6, 8, 12, 16, 24, 32, 64, 128)]
[Chart: 95th Percentile Latency (ms) by batch size (1, 2, 4, 6, 8, 12, 16, 24, 32, 64, 128)]
Bulk Unloading
• The Problem
– Get all that data from Table X out to file(s)
• Refinement
– Where’s the data going?
• Local FS
• Distributed FS (e.g., HDFS) – Use Spark (or Hadoop, if you have to)?
• The “Front Door”
– CQL SELECT
• There is no “Side Door”
Parallel unload
• Split token range into pieces
– The splits must cover the whole ring without overlapping
• Cassandra drivers provide that
– Ideally, each split lies completely within one node
• So each extract can talk to just one Cassandra node
• Optimization step – not necessary
– Same approach as Spark and Hadoop (and others)
• Connection / Query per “split”
– Export to a different file
• Optimize paging size
– Reduce overhead for decompression
• Consistency Level
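
The split-and-query idea can be sketched as follows. The sketch assumes the Murmur3Partitioner token range of -2^63 to 2^63-1 and cuts it into equal pieces; it does not align splits to node ownership (the optional optimization above), and the table name `ks.table_x` and key column `pkey` are hypothetical.

```python
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1

def split_token_range(num_splits):
    """Return num_splits contiguous (start, end] ranges covering the ring."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def split_query(table, partition_key, start, end):
    """Build the SELECT for one split; the first split has no lower bound."""
    if start == MIN_TOKEN:
        return f"SELECT * FROM {table} WHERE token({partition_key}) <= {end}"
    return (f"SELECT * FROM {table} WHERE token({partition_key}) > {start} "
            f"AND token({partition_key}) <= {end}")

splits = split_token_range(8)
queries = [split_query("ks.table_x", "pkey", start, end) for start, end in splits]
```

Each query can then run on its own connection and write to its own output file, one per split.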
Available Tools
• cqlsh COPY TO
– Built into Cassandra command-line tool cqlsh
– Leverages the Python driver
– Recent improvements: CASSANDRA-9304
• Parallel export, etc
• cassandra-unloader
– Part of the https://github.com/brianmhess/cassandra-loader project
– Delimited file options, just like cassandra-loader
– Parallel export
Performance
• From the CASSANDRA-9304 ticket:
“A small benchmark was done on a table of 10M rows inside of a Vagrant box
with 8 cores. The table was created using the following command
`tools/bin/cassandra-stress write n=10M -rate threads=50`.
The original single proc version took about 30 minutes to export the table.
The multi proc version takes about 7 minutes.
Brian Hess's cassandra-unloader takes a little over 2 minutes.”
• Summary:
– Pre-9304 COPY TO: 30 minutes
– Post-9304 COPY TO: 7 minutes
– cassandra-unloader: 2 minutes
Summary
• Bulk Loading
– CQL asynchronous INSERTs are your best bet
• Simplicity, performance (almost always), configurability, low/no coding
– CQLSSTableWriter requires a custom Java application
– sstableloader requires all nodes to be online
• Operational consideration
• Batching
– Can improve throughput at the cost of latency
• Bulk Unloading
– Parallel export via splitting token range
– Use CQL, there is no “side door”
Thank you
