M|18 Ingesting Data with the New Bulk Data Adapters
ColumnStore Bulk Data
David Thompson, VP Engineering
Jens Rowekamp, Engineer
Streamline and simplify the process of data ingestion
● Organizations need to make data available for analysis as soon as it arrives
● Enable machine learning results to be published and made accessible to business users through SQL-based tools
● Ease integration, whether with custom code or ETL tools
Bulk Data Adapters
● Applications can use the bulk data adapter SDK to collect and write data on demand
● No need to copy CSV files to a ColumnStore node
● Bypasses the SQL interface, parser, and optimizer
[Diagram: a Bulk Data Adapter writing through the Write API on each ColumnStore PM node]
1. For each row:
a. For each column, set the column value (setColumn)
b. Write the row (writeRow)
* Rows are buffered client side; 100,000 rows by default before being flushed
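A minimal sketch of this loop using the Python binding (the database, table, and two-column schema are hypothetical):

import pymcsapi

driver = pymcsapi.ColumnStoreDriver()
# createBulkInsert(database, table, mode, pm)
bulk = driver.createBulkInsert('mydb', 'mytable', 0, 0)
for row_id, name in [(1, 'alpha'), (2, 'beta')]:  # for each row
    bulk.setColumn(0, row_id)                     # for each column
    bulk.setColumn(1, name)
    bulk.writeRow()  # buffered; flushed every 100,000 rows by default
bulk.commit()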
● The API is C++11 based.
● Currently available on modern Linux distributions:
○ May port to Windows and Mac in future release.
● Other language bindings are implemented using SWIG, which generates efficient, nearly identical native implementations on top of the C++ library:
○ Java 8 (also providing Scala support).
○ Python 2 & 3
○ Other language bindings can be implemented in the future.
● The adapter assumes the existence of a ColumnStore.xml file in the system in
order to determine the system topology, hosts, and ports for the PM nodes.
● If you are running on a ColumnStore node, the adapter will work out of the box
● For a remote host, you will need to copy the ColumnStore.xml from a ColumnStore server to that host
● The adapter will need to be able to connect with the ProcMon (8800),
WriteEngine (8630), and DBRMController (8616) ports.
The following classes provide the core interface:
● ColumnStoreDriver : Entry point / connection management
● ColumnStoreBulkInsert: Per table interface for writing a transaction
● ColumnStoreSystemCatalog: Table metadata retrieval
The classes live under the following namespaces / packages:
● C++ - mcsapi::
● Java - com.mariadb.columnstore.api
● Python - pymcsapi
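As a short sketch, table metadata can be retrieved through the driver and system catalog using the Python binding ('mydb' and 'mytable' are hypothetical):

import pymcsapi

driver = pymcsapi.ColumnStoreDriver()
catalog = driver.getSystemCatalog()
table = catalog.getTable('mydb', 'mytable')
for i in range(table.getColumnCount()):
    col = table.getColumn(i)
    print(col.getColumnName(), col.getType())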
Core Classes - ColumnStoreDriver
● Entry point and factory class for creating:
○ ColumnStoreBulkInsert objects to allow bulk write of a single transaction for a single table
○ ColumnStoreSystemCatalog object to allow retrieval of table and column metadata
● Default constructor will look for ColumnStore.xml in:
○ $COLUMNSTORE_INSTALL_DIR/etc/ColumnStore.xml (for non-root installs).
● Alternatively, pass the path to ColumnStore.xml as a constructor argument to specify a non-standard location.
// Java
ColumnStoreDriver d1 = new ColumnStoreDriver();
ColumnStoreDriver d2 = new ColumnStoreDriver("/path/to/ColumnStore.xml");  // placeholder path

# Python
d1 = pymcsapi.ColumnStoreDriver()
d2 = pymcsapi.ColumnStoreDriver('/path/to/ColumnStore.xml')  # placeholder path

// C++
mcsapi::ColumnStoreDriver* d1 = new mcsapi::ColumnStoreDriver();
mcsapi::ColumnStoreDriver* d2 = new mcsapi::ColumnStoreDriver("/path/to/ColumnStore.xml");  // placeholder path
Core Classes - ColumnStoreBulkInsert
● Encapsulates bulk insert operations. Constructed for a single table and a single transaction.
● Multiple instances can be created for multiple drivers, but only one can be active per table per ColumnStore instance.
● Error handling is important: if you fail to commit or roll back, a ColumnStore table lock will be left behind, which must be released manually with the cleartablelock tool.
○ resetRow can be used to clear the current row if an error occurs and you want to commit the rows written so far.
● After completion, getSummary returns summary details such as the number of rows inserted and the execution time.
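A hedged sketch of the commit-or-rollback pattern in the Python binding (table and columns hypothetical; a generic except clause is used since the binding surfaces wrapped C++ errors):

import pymcsapi

driver = pymcsapi.ColumnStoreDriver()
bulk = driver.createBulkInsert('mydb', 'mytable', 0, 0)
try:
    bulk.setColumn(0, 1)
    bulk.setColumn(1, 'alpha')
    bulk.writeRow()
    bulk.commit()    # success: rows become visible
except Exception:
    bulk.rollback()  # failure: releases the table lock
    raise
summary = bulk.getSummary()
print(summary.getRowsInsertedCount(), summary.getExecutionTime())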
The bulk data adapters are designed to make it easier to build integrations and streaming use cases such as:
- Kafka or messaging integration
- Exposing data import via an API
- ETL tool adapters
- Custom ETL logic
MariaDB has introduced a few specific streaming adapters (MaxScale CDC and Kafka) and we plan to build more in the future. For further details, please attend tomorrow's session, "Real-time Analytics With The New Streaming Data Adapters".
Spark Connector
● Enables publishing of machine learning results from Spark DataFrames to ColumnStore (see the sketch after this list).
● Enable best of breed approach:
○ In memory machine learning algorithms in Spark
○ Publish results to ColumnStore for ease of consumption with SQL tools such as Tableau.
● Supports both Scala and Python notebooks.
● To pull data from ColumnStore into Spark, use the JDBC connector and Spark SQL to read data.
○ In the future we plan to add a bulk read API.
● Requires adding extra jar files to the Spark runtime configuration.
● Available as a Docker image for reference / easy evaluation.
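A hedged PySpark sketch of publishing a DataFrame through the connector's exporter and reading it back over JDBC (host, credentials, and the ml_results table are placeholder assumptions; the connector jars must already be on the Spark classpath):

from pyspark.sql import SparkSession
import columnStoreExporter  # provided by the ColumnStore Spark connector

spark = SparkSession.builder.appName("publish-to-columnstore").getOrCreate()
df = spark.createDataFrame([(1, 0.87), (2, 0.42)], ["id", "score"])

# Publish the DataFrame into an existing ColumnStore table
columnStoreExporter.export("mydb", "ml_results", df)

# Read back via the MariaDB JDBC driver and Spark SQL
readback = (spark.read.format("jdbc")
    .option("url", "jdbc:mariadb://um1:3306/mydb")
    .option("driver", "org.mariadb.jdbc.Driver")
    .option("dbtable", "ml_results")
    .option("user", "user")
    .option("password", "secret")
    .load())
readback.show()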
Spark Connector - Getting Started with Docker
git clone https://github.com/mariadb-corporation/mariadb-columnstore-docker.git
docker-compose up -d
In your browser, open http://localhost:8888 and enter 'mariadb' as the password to log in to the Jupyter notebook.