In Apache Cassandra Lunch #67, we discussed how to move data from open-source Cassandra to DataStax Astra using dsbulk/scylla migratory.
https://github.com/DataStax-Examples/dsbulk-to-astra/
Accompanying Blog: https://blog.anant.us/apache-cassandra-lunch-67-moving-data-from-cassandra-to-datastax-astra-with-dsbulk
Accompanying Youtube: https://youtu.be/0k7RBf5vi5M
Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Apache Cassandra Lunch #67: Moving Data from Cassandra to DataStax Astra
1. Version 1.0
Moving Data from Cassandra to DataStax Astra using DSBulk
An Anant Corporation Story.
2. Cassandra
● Apache Cassandra is an open-source, distributed NoSQL
database designed to handle large volumes of data
across many servers
● Cassandra clusters can be scaled by either
improving hardware on current nodes (vertical
scalability) or adding more nodes (horizontal
scalability)
○ Horizontal scalability is part of why Cassandra is
so powerful: cheap machines can be added to a
cluster to increase its capacity and throughput
● Note: Demo will use Open Source Cassandra
○ Works nearly identically with DSE Cassandra
3. DataStax Astra
● Astra website:
https://www.datastax.com/products/datastax-astra
● DataStax Astra is a fully managed, serverless database
built on Apache Cassandra, and is provided by
DataStax
● Some additional features:
○ Stargate APIs: Make it easy for developers to work with
data in a Cassandra-based database like Astra
without deep knowledge of CQL
○ Zero Lock-In: Deploy on AWS, GCP and Azure and still
maintain compatibility with open-source Cassandra
○ Global Scale: Data replication across multiple data
centers, availability zones, and multiple regions.
■ Additionally, allows a user to scale an Astra
database up to multiple petabytes of data without
impacting speed or performance
○ 80 GB of storage and 20 million read/write operations for
free every month
4. DSBulk
● DSBulk: DataStax Bulk Loader for Apache Cassandra is open-source software used to
load and unload CSV or JSON data into and out of supported databases
● Supported databases:
○ DataStax Astra cloud database
○ DataStax Enterprise (DSE) 4.7 and later
○ Open source Apache Cassandra 2.1 and later
● An introduction to DSBulk and its documentation can be found here:
https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkAbout.html
● GitHub repository for the DataStax DSBulk project: https://github.com/datastax/dsbulk
5. DSBulk cont...
● Commands that will be used in today’s presentation/demo:
○ dsbulk load
■ Loads data into a Cassandra/Astra database. When run without a configuration file, the
necessary parameters have to be passed on the command line (listed below).
○ dsbulk unload
■ Unloads data from a Cassandra/Astra database into a CSV or JSON file. When run without a
configuration file, the necessary parameters have to be passed on the command line as well.
○ dsbulk count
■ Counts the rows in a table of a Cassandra/Astra database, which is useful for checking loaded data.
● Parameters/flags that must be passed when running these commands without a configuration file:
○ -k: keyspace
○ -t: table
○ -b: path to the secure connect bundle (only necessary when connecting to Astra)
○ -u: username, -p: password (for the database)
■ Since an Astra update earlier this year, the ClientID/ClientSecret must be used instead of
username/password when connecting to Astra.
■ Can be omitted for a local Cassandra database that still uses its default authentication settings (default user/password: cassandra/cassandra)
○ -url: URL to pull the .CSV or .JSON file from when loading, or a local directory to write to when unloading
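As a rough illustration of how the flags above fit together, the sketch below assembles dsbulk argument lists in Python. It only builds and prints the command lines; it does not run dsbulk. The helper name dsbulk_command and all values passed to it (keyspace, table, URL, bundle path, credentials) are hypothetical placeholders, not values from the demo.

```python
import shlex
from typing import List, Optional


def dsbulk_command(
    operation: str,
    keyspace: str,
    table: str,
    url: Optional[str] = None,
    bundle: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> List[str]:
    """Assemble a dsbulk argument list from the flags listed on the slide.

    Hypothetical helper for illustration; it is not part of DSBulk.
    """
    if operation not in ("load", "unload", "count"):
        raise ValueError(f"unsupported dsbulk operation: {operation}")

    args = ["dsbulk", operation, "-k", keyspace, "-t", table]
    # count reads no file, so -url is only added for load and unload.
    if url is not None and operation in ("load", "unload"):
        args += ["-url", url]
    # -b is only needed when the target is Astra.
    if bundle is not None:
        args += ["-b", bundle]
    # For Astra, pass the ClientID as -u and the ClientSecret as -p.
    if username is not None:
        args += ["-u", username]
    if password is not None:
        args += ["-p", password]
    return args


# Example: load a CSV hosted online into a local Cassandra table.
local_load = dsbulk_command(
    "load", "demo_ks", "demo_table", url="https://example.com/data.csv"
)
print(shlex.join(local_load))

# Example: count rows in the same table on Astra (placeholder credentials).
astra_count = dsbulk_command(
    "count", "demo_ks", "demo_table",
    bundle="./secure-connect-bundle.zip",
    username="<ClientID>", password="<ClientSecret>",
)
print(shlex.join(astra_count))
```

Building the command as a list keeps each flag and value as a separate argument, so the same list could later be handed to a process runner without shell-quoting issues.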
6. Demo Project Slide
● Link to GitHub repo: https://github.com/DataStax-Examples/dsbulk-to-astra/
○ Demo is based on sample data from this GitHub repository
● Will be going through four main processes using dsbulk:
○ Loading a .csv hosted online into local Cassandra
○ Loading a .csv hosted online into Astra
○ Unloading from local Cassandra to a .csv file
○ Loading from a .csv file into Astra
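The four processes above can be sketched as dsbulk command lines. The Python below only prints them and does not run dsbulk. Every value (keyspace, table, CSV URL, export directory, bundle path, ClientID/ClientSecret) is a hypothetical placeholder to be replaced with real values, and the exact commands used in the demo may differ.

```python
import shlex

# Hypothetical placeholders; substitute real values for your environment.
KEYSPACE = "demo_ks"
TABLE = "demo_table"
CSV_URL = "https://example.com/sample.csv"
EXPORT_DIR = "./unloaded"
BUNDLE = "./secure-connect-bundle.zip"
CLIENT_ID = "<ClientID>"
CLIENT_SECRET = "<ClientSecret>"

# Flags shared by every step, and the extra flags needed only for Astra.
TARGET = ["-k", KEYSPACE, "-t", TABLE]
ASTRA = ["-b", BUNDLE, "-u", CLIENT_ID, "-p", CLIENT_SECRET]

steps = [
    ("Load a .csv hosted online into local Cassandra",
     ["dsbulk", "load", *TARGET, "-url", CSV_URL]),
    ("Load a .csv hosted online into Astra",
     ["dsbulk", "load", *TARGET, "-url", CSV_URL, *ASTRA]),
    ("Unload from local Cassandra to .csv files",
     ["dsbulk", "unload", *TARGET, "-url", EXPORT_DIR]),
    ("Load the unloaded .csv files into Astra",
     ["dsbulk", "load", *TARGET, "-url", EXPORT_DIR, *ASTRA]),
]

for title, argv in steps:
    print(f"# {title}")
    print(shlex.join(argv))
```

The only difference between the local and Astra variants is the extra -b/-u/-p group, which is why the last two steps together move a table from local Cassandra to Astra.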