Spark is a distributed execution framework that works well alongside data stores like Cassandra. It's a handy tool for many jobs, including ETL (extract, transform, and load). In this session, let me share with you some tips and tricks that I have learned through experience. I'm no oracle, but I can guarantee these tips will get you well down the path of pulling your relational data into Cassandra.
About the Speaker
Jim Hatcher Principal Architect, IHS Markit
Jim Hatcher is a software architect with a passion for data. He has spent most of his 20-year career working with relational databases, but he has been working with Big Data technologies such as Cassandra, Solr, and Spark for the last several years. He has supported systems with very large databases at companies like First Data, CyberSource, and Western Union. He is currently working at IHS, supporting an electronic parts database that tracks half a billion electronic parts using Cassandra.
Co-Organizer of Dallas Cassandra Meetup Group
Certified Apache Cassandra Developer
CQL COPY command - This is a pretty quick-and-dirty way of getting data from a text file into a C* table. The primary limiting factor is that the data in the text file has to match the schema of the table.
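As a rough sketch, a COPY invocation from cqlsh looks like the following. The keyspace, table, column names, and file path here are hypothetical; the CSV columns must line up with the columns you list.

```sql
-- Run inside cqlsh. Assumes a table my_ks.users whose columns
-- (id, first_name, last_name) match the CSV column order.
COPY my_ks.users (id, first_name, last_name)
FROM '/tmp/users.csv'
WITH HEADER = true AND DELIMITER = ',';
```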
Sqoop - This is a tool from the Hadoop ecosystem, but it works with C*, too. It's meant for pulling data to/from an RDBMS. It's pretty limited in the kinds of transformations you can do.
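For reference, a standard Sqoop import from an RDBMS looks roughly like this. The connection details and table name are hypothetical; note that Cassandra-specific Sqoop support comes through DataStax Enterprise's Sqoop integration, and the exact C*-related flags vary by DSE version.

```shell
# Sketch of a standard Sqoop import (hypothetical connection details).
# The -P flag prompts for the database password interactively.
sqoop import \
  --connect jdbc:mysql://dbhost/parts \
  --username etl_user -P \
  --table electronic_parts \
  --target-dir /staging/electronic_parts
```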
Write a Java program. It's pretty simple to write a Java program that reads from a text file and uses the CQL driver to write to C*. If you set the write consistency level to ANY and use the executeAsync() method, you can get it to run pretty darn fast.
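A minimal sketch of that approach, using the DataStax Java driver (3.x API), might look like the following. The contact point, keyspace, table, and CSV layout are all hypothetical, and a real loader would throttle its in-flight requests rather than buffering every future.

```java
// Sketch: bulk-load a CSV into C* with async writes at consistency ANY.
// Requires the DataStax Java driver on the classpath and a running cluster.
import com.datastax.driver.core.*;
import java.nio.file.*;
import java.util.*;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder()
                 .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {

            PreparedStatement ps = session.prepare(
                "INSERT INTO users (id, first_name, last_name) VALUES (?, ?, ?)");
            ps.setConsistencyLevel(ConsistencyLevel.ANY);

            List<ResultSetFuture> futures = new ArrayList<>();
            for (String line : Files.readAllLines(Paths.get("/tmp/users.csv"))) {
                String[] f = line.split(",");
                // executeAsync returns immediately without waiting for the write
                futures.add(session.executeAsync(
                    ps.bind(UUID.fromString(f[0]), f[1], f[2])));
            }
            // Block until all writes complete before shutting down
            for (ResultSetFuture fut : futures) {
                fut.getUninterruptibly();
            }
        }
    }
}
```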
Write a Spark program. This is a great option if you want to transform the schema of the source before writing to the C* destination. You can get the data from any number of sources (text files, RDBMS, etc.), use a map statement to transform the data into the right format, and then use the Spark Cassandra Connector to write the data to C*.
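The extract-transform-load flow described above can be sketched with the Spark DataFrame API and the Spark Cassandra Connector. The JDBC URL, credentials, column names, and keyspace/table names here are hypothetical stand-ins; the transform step is just one example of reshaping the source schema to match the C* table.

```java
// Sketch of a Spark ETL job: read from an RDBMS over JDBC, transform,
// and write to C* via the Spark Cassandra Connector (DataFrame API).
// Requires the connector and a JDBC driver on the Spark classpath.
import org.apache.spark.sql.*;
import static org.apache.spark.sql.functions.*;

public class PartsEtl {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("parts-etl")
            .config("spark.cassandra.connection.host", "127.0.0.1")
            .getOrCreate();

        // Extract: read the source table over JDBC
        Dataset<Row> src = spark.read()
            .format("jdbc")
            .option("url", "jdbc:mysql://dbhost/parts")
            .option("dbtable", "electronic_parts")
            .option("user", "etl_user")
            .option("password", "secret")
            .load();

        // Transform: reshape columns to match the Cassandra schema
        Dataset<Row> out = src
            .withColumn("part_number", upper(col("mfr_part_no")))
            .select("part_number", "manufacturer", "description");

        // Load: write to the destination C* table
        out.write()
            .format("org.apache.spark.sql.cassandra")
            .option("keyspace", "parts_ks")
            .option("table", "electronic_parts")
            .mode(SaveMode.Append)
            .save();

        spark.stop();
    }
}
```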