2. @chbatey
Cassandra London Needs You
• We are always looking for Cassandra speakers to share
their experience, and we have created a Speakers Program
full of benefits! If you are interested, please contact us for
details. Talk to Ale
10. Cassandra Data is Distributed By Token Range
Without vnodes
[Diagram: token ring with tokens 0 and 500 marked, shared by Node 1, Node 2, Node 3 and Node 4]
11. Cassandra Data is Distributed By Token Range
With vnodes
[Diagram: token ring with tokens 0 and 500 marked, shared by Node 1, Node 2, Node 3 and Node 4]
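The two ring diagrams can be sketched as a sorted map from the last token of each range to the node that owns it. This is a simplified illustration rather than Cassandra's implementation: the token space of 0 to 999, the range boundaries, and the `TokenRing` object with its `owner` method are all assumptions made for the example, and replication is ignored.

```scala
import scala.collection.immutable.TreeMap

object TokenRing {
  // Key = last token of a range (inclusive), value = owning node.
  // Assumption for illustration: tokens run from 0 to 999.
  // Without vnodes each node owns one contiguous range.
  val withoutVnodes: TreeMap[Int, String] = TreeMap(
    249 -> "Node 1", 499 -> "Node 2", 749 -> "Node 3", 999 -> "Node 4")

  // With vnodes each node owns several smaller ranges spread around the ring.
  val withVnodes: TreeMap[Int, String] = TreeMap(
    124 -> "Node 1", 249 -> "Node 2", 374 -> "Node 3", 499 -> "Node 4",
    624 -> "Node 1", 749 -> "Node 2", 874 -> "Node 3", 999 -> "Node 4")

  // The owner is the node at the first range end >= token;
  // tokens past the last entry wrap around to the first one.
  def owner(ring: TreeMap[Int, String], token: Int): String = {
    val it = ring.iteratorFrom(token)
    if (it.hasNext) it.next()._2 else ring.head._2
  }

  def main(args: Array[String]): Unit = {
    println(owner(withoutVnodes, 300))
    println(owner(withVnodes, 300))
  }
}
```

The same token can land on a different node in the two layouts, which is why the connector has to ask the cluster for the actual range ownership rather than assume it.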
12. Goals
• Spark partitions made up of token ranges on the same
node
• Tasks to be executed on workers co-located with that
node
• Same(ish) amount of data in each Spark partition
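The first and third goals can be sketched as a small grouping step: collect token ranges by owning node, then pack them into partitions of roughly equal estimated size. This is a minimal sketch under stated assumptions, not the connector's actual algorithm; `TokenRange`, `SparkPartition`, `plan` and the `estimatedRows` figures are hypothetical names and values introduced for illustration.

```scala
object PartitionPlanner {
  // Hypothetical types for illustration; not the connector's own classes.
  final case class TokenRange(start: Int, end: Int, node: String, estimatedRows: Long)

  final case class SparkPartition(index: Int, node: String, ranges: Vector[TokenRange]) {
    def estimatedRows: Long = ranges.map(_.estimatedRows).sum
  }

  // Group ranges by owning node, then greedily add ranges to the current
  // group until it holds at least `targetRows` estimated rows.
  def plan(ranges: Seq[TokenRange], targetRows: Long): Vector[SparkPartition] = {
    val groups: Seq[Vector[TokenRange]] =
      ranges.groupBy(_.node).toSeq.sortBy(_._1).flatMap { case (_, nodeRanges) =>
        nodeRanges.sortBy(_.start).foldLeft(Vector.empty[Vector[TokenRange]]) { (acc, r) =>
          acc.lastOption match {
            case Some(last) if last.map(_.estimatedRows).sum < targetRows =>
              acc.init :+ (last :+ r) // current group still has room
            case _ =>
              acc :+ Vector(r)        // start a new group on this node
          }
        }
      }
    // Every group holds ranges from a single node, so the head's node names the partition.
    groups.zipWithIndex.map { case (rs, i) => SparkPartition(i, rs.head.node, rs) }.toVector
  }
}
```

Because a group never mixes nodes, each resulting partition can be scheduled on a worker co-located with the node that holds all of its data.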
28. Data is Retrieved Using the DataStax Java Driver
spark.cassandra.input.fetch.size_in_rows 50
[Diagram: box labelled 4 holding token ranges 0-50 and 780-830, both on Node 1]
SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830
SELECT * FROM keyspace.table WHERE
token(pk) > 0 and token(pk) <= 50
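The queries on this slide follow one template per token range, with an exclusive lower bound and an inclusive upper bound. A minimal sketch of rendering them, assuming a hypothetical `rangeQuery` helper (the connector builds its statements internally, and this is not its code):

```scala
object TokenRangeQueries {
  // Hypothetical helper: renders the CQL for one token range.
  // The start bound is exclusive and the end bound inclusive, as on the slide.
  def rangeQuery(keyspace: String, table: String, partitionKey: String,
                 start: Long, end: Long): String =
    s"SELECT * FROM $keyspace.$table WHERE " +
      s"token($partitionKey) > $start AND token($partitionKey) <= $end"

  def main(args: Array[String]): Unit = {
    // The two token ranges shown on the slide.
    val ranges = Seq(780L -> 830L, 0L -> 50L)
    ranges.foreach { case (start, end) =>
      println(rangeQuery("keyspace", "table", "pk", start, end))
    }
  }
}
```

Using `>` on one side and `<=` on the other means adjacent ranges never read the same token twice.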
36. Data is Retrieved Using the DataStax Java Driver
spark.cassandra.input.fetch.size_in_rows 50
[Diagram: box labelled 4 holding token ranges 0-50 and 780-830, both on Node 1; four blocks labelled "50 CQL Rows" have been returned]
SELECT * FROM keyspace.table WHERE
token(pk) > 780 and token(pk) <= 830
SELECT * FROM keyspace.table WHERE
token(pk) > 0 and token(pk) <= 50
39. Data is Retrieved Using the DataStax Java Driver
spark.cassandra.input.page.row.size 50
[Diagram: box labelled 4 holding token ranges 0-50 and 780-830, both on Node 1; ten blocks labelled "50 CQL Rows" have been returned]
SELECT * FROM keyspace.table WHERE
token(pk) > 0 and token(pk) <= 50
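The blocks of 50 CQL rows illustrate paging: results for a token range arrive one page at a time instead of all at once. The following is a plain-Scala simulation of that behaviour, not driver code; `PagedFetch` and `pages` are hypothetical names, and an ordinary iterator stands in for the result set.

```scala
object PagedFetch {
  // Splits a stream of rows into pages of `fetchSize` rows. The split is lazy:
  // a page is only materialised when the consumer asks for it.
  def pages[A](rows: Iterator[A], fetchSize: Int): Iterator[Seq[A]] =
    rows.grouped(fetchSize)

  def main(args: Array[String]): Unit = {
    // Stand-in for the CQL rows of one token range.
    val rows = Iterator.range(0, 120).map(i => s"row-$i")
    pages(rows, fetchSize = 50).zipWithIndex.foreach { case (page, i) =>
      println(s"page $i: ${page.size} CQL rows")
    }
  }
}
```

A smaller page size keeps less data in memory per request at the cost of more round trips.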
42. Then we’re into Spark land
• Spark partitions are made up of C* partitions that exist
on the same node
• C* connector tells Spark which workers to use via
information from the C* driver
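The locality hint described above can be sketched as a lookup from a partition's replica hosts to the workers running on those hosts. In Spark, an RDD exposes this kind of hint through `getPreferredLocations`; the code below is only a stand-alone model of the idea, and `CassandraPartition`, `preferredWorkers` and the host addresses are hypothetical.

```scala
object Locality {
  // Hypothetical model: a Spark partition and the Cassandra hosts holding its data.
  final case class CassandraPartition(index: Int, replicaHosts: Set[String])

  // Workers running on one of the replica hosts are preferred.
  // An empty result means there is no co-located worker to prefer.
  def preferredWorkers(p: CassandraPartition, workers: Seq[String]): Seq[String] =
    workers.filter(p.replicaHosts.contains)

  def main(args: Array[String]): Unit = {
    val partition = CassandraPartition(4, Set("10.0.0.1"))
    println(preferredWorkers(partition, Seq("10.0.0.1", "10.0.0.2")))
  }
}
```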
44. The Spark Cassandra Connector saveToCassandra method can be called on almost all RDDs
rdd.saveToCassandra("Keyspace","Table")
[Diagram: an RDD whose elements 1 to 9 are distributed across Node 1, Node 2, Node 3 and Node 4]