
M|18 Ingesting Data with the New Bulk Data Adapters



  1. ColumnStore Bulk Data Adapters. David Thompson, VP Engineering; Jens Rowekamp, Engineer.
  2. Streamline and simplify the process of data ingestion.
  3. Motivation
     ● Organizations need to make data available for analysis as soon as it arrives.
     ● Enable machine learning results to be published and made accessible to business users through SQL-based tools.
     ● Ease integration, whether with custom code or ETL tools.
  4. Bulk Data Adapters
     ● Applications can use the bulk data adapter SDK to collect and write data: on-demand data loading.
     ● No need to copy CSV files to a ColumnStore node: simpler.
     ● Bypasses the SQL interface, parser, and optimizer: faster writes.
     (Diagram: the application's bulk data adapter uses the Write API to write directly to the ColumnStore PM nodes, bypassing the MariaDB Server / ColumnStore UM.)
     Write loop:
     1. For each row:
        a. For each column: bulkInsert->setColumn
        b. bulkInsert->writeRow
     2. bulkInsert->commit
     * Rows are buffered, 100,000 rows by default.
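The write loop above can be sketched in Python using pymcsapi-style method names. `FakeBulkInsert` is an illustrative stand-in so the pattern runs without a ColumnStore installation; a real application would obtain the object from `driver.createBulkInsert(db, table, 0, 0)` instead.

```python
# Sketch of the slide's write loop. FakeBulkInsert only records calls the
# way a real ColumnStoreBulkInsert would receive them; it is not part of
# the API.

class FakeBulkInsert:
    """Stand-in that records setColumn/writeRow/commit calls."""
    def __init__(self):
        self.rows, self.current, self.committed = [], {}, False

    def setColumn(self, index, value):
        self.current[index] = value

    def writeRow(self):
        self.rows.append(self.current)
        self.current = {}

    def commit(self):
        self.committed = True

def load_rows(bulk_insert, rows):
    # 1. For each row:
    for row in rows:
        #   a. For each column: setColumn
        for i, value in enumerate(row):
            bulk_insert.setColumn(i, value)
        #   b. writeRow (rows are buffered, 100,000 by default)
        bulk_insert.writeRow()
    # 2. commit the transaction
    bulk_insert.commit()

b = FakeBulkInsert()
load_rows(b, [(1, "ABC"), (2, "DEF")])
```

With the real adapter, `load_rows` would be called with the object returned by `createBulkInsert` and the same sequence of calls would stream the rows to the PM nodes.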
  5. Language Bindings
     ● The API is C++11 based.
     ● Currently available on modern Linux distributions:
       ○ May be ported to Windows and Mac in a future release.
     ● Other language bindings are implemented using SWIG, which generates efficient, almost identical native implementations on top of the C++ library:
       ○ Java 8 (also providing Scala support).
       ○ Python 2 & 3.
       ○ Other language bindings can be implemented in the future.
  6. System Configuration
     ● The adapter assumes a ColumnStore.xml file exists on the system in order to determine the system topology, hosts, and ports for the PM nodes.
     ● If you are running on a ColumnStore node, the adapter will work immediately.
     ● For a remote host, you will need to copy the ColumnStore.xml from a server node.
     ● The adapter needs to be able to connect to the ProcMon (8800), WriteEngine (8630), and DBRMController (8616) ports.
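For a remote host, a quick connectivity pre-flight can save debugging time. The helper below is illustrative, not part of the adapter API; only the three port numbers come from the slide.

```python
import socket

# Ports the adapter must reach on each PM node, per the slide.
ADAPTER_PORTS = {"ProcMon": 8800, "WriteEngine": 8630, "DBRMController": 8616}

def check_ports(host, ports=ADAPTER_PORTS, timeout=2.0):
    """Return the subset of services that accept a TCP connection."""
    reachable = {}
    for name, port in ports.items():
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                reachable[name] = port
        except OSError:
            pass
    return reachable
```

Running `check_ports(pm_host)` from the remote machine before a bulk load confirms that firewalls allow all three required ports.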
  7. Core Classes
     The following classes provide the core interface:
     ● ColumnStoreDriver: entry point / connection management
     ● ColumnStoreBulkInsert: per-table interface for writing a transaction
     ● ColumnStoreSystemCatalog: table metadata retrieval
     Language namespaces:
     ● C++: mcsapi::
     ● Java: com.mariadb.columnstore.api
     ● Python: pymcsapi
  8. Core Classes - ColumnStoreDriver
     ● Entry point and factory class for creating:
       ○ ColumnStoreBulkInsert objects, which allow bulk writing of a single transaction to a single table.
       ○ ColumnStoreSystemCatalog objects, which allow retrieval of table and column metadata.
     ● The default constructor will look for ColumnStore.xml in:
       ○ $COLUMNSTORE_INSTALL_DIR/etc/ColumnStore.xml (for non-root installs).
       ○ /usr/local/mariadb/columnstore/etc/ColumnStore.xml
     ● Alternatively, pass the path to ColumnStore.xml as a constructor argument to specify a non-standard location.
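The constructor's lookup order can be sketched as a small pure-Python helper. The environment variable name and paths come from the slide; the helper itself (`find_columnstore_xml`) is illustrative, not part of the API.

```python
import os

# Default root-install location, per the slide.
DEFAULT_PATH = "/usr/local/mariadb/columnstore/etc/ColumnStore.xml"

def find_columnstore_xml(explicit_path=None, env=os.environ):
    """Mimic the driver's lookup order for ColumnStore.xml."""
    # 1. An explicit constructor argument overrides everything.
    if explicit_path:
        return explicit_path
    # 2. Non-root installs: $COLUMNSTORE_INSTALL_DIR/etc/ColumnStore.xml
    install_dir = env.get("COLUMNSTORE_INSTALL_DIR")
    if install_dir:
        return os.path.join(install_dir, "etc", "ColumnStore.xml")
    # 3. Fall back to the default root-install path.
    return DEFAULT_PATH
```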
  9. ColumnStoreDriver Examples
     Java:
       import com.mariadb.columnstore.api.*;
       ...
       ColumnStoreDriver d1, d2;
       d1 = new ColumnStoreDriver();
       d2 = new ColumnStoreDriver("/etc/cs2.xml");
     Python:
       import pymcsapi
       d1 = pymcsapi.ColumnStoreDriver()
       d2 = pymcsapi.ColumnStoreDriver("/etc/cs2.xml")
     C++:
       mcsapi::ColumnStoreDriver *d1, *d2;
       d1 = new mcsapi::ColumnStoreDriver();
       d2 = new mcsapi::ColumnStoreDriver("/etc/cs2.xml");
  10. Core Classes - ColumnStoreBulkInsert
     ● Encapsulates bulk insert operations. Constructed for a single table and transaction.
     ● Multiple instances can be created across multiple drivers, but only one can be active per table per ColumnStore instance.
     ● Error handling is important: if you fail to commit or roll back, a ColumnStore table lock will be left behind, which must be released manually with the cleartablelock command.
       ○ resetRow can be used to clear the current row if an error occurs and you want to commit the prior rows.
     ● After completion, getSummary returns summary details.
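The commit-or-rollback discipline above can be wrapped in a small helper. `FlakyBulkInsert` is a stand-in that fails on demand so the rollback path is exercised without a server; with pymcsapi, the real object comes from `createBulkInsert`.

```python
# Error-handling pattern from the slide: commit on success, roll back on
# any failure so no ColumnStore table lock is left behind.

class FlakyBulkInsert:
    """Stub that raises on a chosen row to exercise the rollback path."""
    def __init__(self, fail_on_row=None):
        self.fail_on_row = fail_on_row
        self.written = 0
        self.committed = self.rolled_back = False

    def setColumn(self, index, value):
        pass

    def writeRow(self):
        if self.written == self.fail_on_row:
            raise RuntimeError("simulated write failure")
        self.written += 1

    def commit(self):
        self.committed = True

    def rollback(self):
        self.rolled_back = True

def safe_load(bulk, rows):
    try:
        for row in rows:
            for i, value in enumerate(row):
                bulk.setColumn(i, value)
            bulk.writeRow()
        bulk.commit()
        return True
    except RuntimeError:
        # Without this, a table lock would remain and would have to be
        # cleared manually with cleartablelock.
        bulk.rollback()
        return False

ok = safe_load(FlakyBulkInsert(), [(1, "ABC")])
bad = FlakyBulkInsert(fail_on_row=0)
failed = safe_load(bad, [(1, "ABC")])
```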
  11. ColumnStoreBulkInsert Examples
     Java:
       import com.mariadb.columnstore.api.*;
       ...
       ColumnStoreDriver d;
       ColumnStoreBulkInsert b;
       d = new ColumnStoreDriver();
       try {
         b = d.createBulkInsert("test", "t1", (short)0, 0);
         b.setColumn(0, 1);
         b.setColumn(1, "ABC");
         b.writeRow();
         b.setColumn(0, 2);
         b.setColumn(1, "DEF");
         b.writeRow();
         b.commit();
       } catch (ColumnStoreException e) {
         b.rollback();
         ...
       }
     Python:
       import pymcsapi
       d = pymcsapi.ColumnStoreDriver()
       try:
           b = d.createBulkInsert("test", "t1", 0, 0)
           b.setColumn(0, 1)
           b.setColumn(1, "ABC")
           b.writeRow()
           b.setColumn(0, 2)
           b.setColumn(1, "DEF")
           b.writeRow()
           b.commit()
       except RuntimeError as err:
           b.rollback()
     C++:
       mcsapi::ColumnStoreDriver* d;
       mcsapi::ColumnStoreBulkInsert* b;
       d = new mcsapi::ColumnStoreDriver();
       try {
         b = d->createBulkInsert("test", "t1", 0, 0);
         b->setColumn(0, (uint32_t)1);
         b->setColumn(1, "ABC");
         b->writeRow();
         b->setColumn(0, (uint32_t)2);
         b->setColumn(1, "DEF");
         b->writeRow();
         b->commit();
       } catch (mcsapi::ColumnStoreError &e) {
         b->rollback();
         ...
       }
  12. Core Classes - ColumnStoreSystemCatalog
     ● Allows retrieval of ColumnStore table and column metadata, enabling generic implementations.
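As an illustration of such a generic implementation, the sketch below orders a record's values by the catalog's column positions instead of hard-coding them. The `Fake*` classes are stand-ins mimicking the catalog interfaces; `row_from_dict` is an illustrative helper, not part of the API.

```python
# Catalog-driven loading: column count and names come from the system
# catalog, so the same code works for any table.

class FakeColumn:
    def __init__(self, name):
        self._name = name

    def getColumnName(self):
        return self._name

class FakeTable:
    def __init__(self, column_names):
        self._cols = [FakeColumn(n) for n in column_names]

    def getColumnCount(self):
        return len(self._cols)

    def getColumn(self, i):
        return self._cols[i]

def row_from_dict(table, record):
    """Order a record's values by the table's catalog column order."""
    return [record[table.getColumn(i).getColumnName()]
            for i in range(table.getColumnCount())]

t = FakeTable(["id", "name"])
row = row_from_dict(t, {"name": "ABC", "id": 1})
```

With the real API, `t` would come from `driver.getSystemCatalog().getTable(schema, table)`, and the resulting list would feed the `setColumn`/`writeRow` loop.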
  13. ColumnStoreSystemCatalog Examples
     Java:
       import com.mariadb.columnstore.api.*;
       ...
       ColumnStoreDriver d;
       ColumnStoreSystemCatalog c;
       ColumnStoreSystemCatalogTable t;
       ColumnStoreSystemCatalogColumn c1, c2;
       d = new ColumnStoreDriver();
       c = d.getSystemCatalog();
       t = c.getTable("test", "t1");
       int t1_cols = t.getColumnCount();
       c1 = t.getColumn(0);
       String c1_name = c1.getColumnName();
       c2 = t.getColumn("area_code");
     Python:
       import pymcsapi
       d = pymcsapi.ColumnStoreDriver()
       c = d.getSystemCatalog()
       t = c.getTable("test", "t1")
       t1_cols = t.getColumnCount()
       c1 = t.getColumn(0)
       c1_name = c1.getColumnName()
       c2 = t.getColumn("area_code")
     C++:
       mcsapi::ColumnStoreDriver* d;
       d = new mcsapi::ColumnStoreDriver();
       mcsapi::ColumnStoreSystemCatalog& c = d->getSystemCatalog();
       mcsapi::ColumnStoreSystemCatalogTable& t = c.getTable("test", "t1");
       uint16_t t1_cols = t.getColumnCount();
       mcsapi::ColumnStoreSystemCatalogColumn& c1 = t.getColumn(0);
       std::string c1_name = c1.getColumnName();
       mcsapi::ColumnStoreSystemCatalogColumn& c2 = t.getColumn("area_code");
  14. Core Classes - Bulk Insert (class diagram)
     ColumnStoreDriver:
       char* getVersion()
       ColumnStoreBulkInsert* createBulkInsert(..)
       ColumnStoreSystemCatalog& getSystemCatalog()
     ColumnStoreBulkInsert:
       uint16_t getColumnCount()
       ColumnStoreBulkInsert* setColumn(uint16_t, const std::string& value, ..)
       ColumnStoreBulkInsert* setColumn(uint16_t, uint64_t, ..)
       ColumnStoreBulkInsert* writeRow()
       ColumnStoreBulkInsert* resetRow()
       void commit()
       void rollback()
       ColumnStoreSummary& getSummary()
       void setTruncateIsError(bool)
       void setBatchSize(uint32_t)
       bool isActive()
     ColumnStoreSummary:
       double getExecutionTime()
       uint64_t getRowsInsertedCount()
       uint64_t getTruncationCount()
       uint64_t getSaturatedCount()
       uint64_t getInvalidCount()
     Helper value types: ColumnStoreDateTime (constructor, bool set(..)) and ColumnStoreDecimal (constructor, bool set(..)).
  15. Core Classes - System Catalog (class diagram)
     ColumnStoreDriver:
       char* getVersion()
       ColumnStoreBulkInsert* createBulkInsert(..)
       ColumnStoreSystemCatalog& getSystemCatalog()
     ColumnStoreSystemCatalog:
       ColumnStoreSystemCatalogTable& getTable(const std::string& schemaName, const std::string& tableName)
     ColumnStoreSystemCatalogTable:
       const std::string& getSchemaName()
       const std::string& getTableName()
       uint32_t getOID()
       uint16_t getColumnCount()
       ColumnStoreSystemCatalogColumn& getColumn(const std::string&)
       ColumnStoreSystemCatalogColumn& getColumn(uint16_t)
     ColumnStoreSystemCatalogColumn:
       uint32_t getOID()
       const std::string& getColumnName()
       uint32_t getDictionaryOID()
       columnstore_data_types_t getType()
       uint32_t getWidth()
       uint32_t getPosition()
       const std::string& getDefaultValue()
       bool isAutoincrement()
       uint32_t getPrecision()
       uint32_t getScale()
       bool isNullable()
       uint8_t compressionType()
  16. Use Cases
     The bulk data adapters are designed to more easily enable integration and streaming use cases such as:
     - Kafka or messaging integration
     - Exposing data import via an API
     - ETL tool adapters
     - Custom ETL logic
     MariaDB has introduced specific streaming adapters (MaxScale CDC and Kafka) and plans to build more in the future. For further details, please attend tomorrow's session "Real-time Analytics With The New Streaming Data Adapters" at 8:40 am.
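A hypothetical shape for a messaging integration: consume records in micro-batches and commit one bulk-insert transaction per batch. Everything here (`stream_to_columnstore`, `make_bulk_insert`, `NullBulk`) is illustrative; with the real API, the factory would call `driver.createBulkInsert(db, table, 0, 0)`.

```python
# Micro-batched streaming sketch: one bulk-insert transaction per batch,
# with a final commit for any partial batch.

def stream_to_columnstore(messages, make_bulk_insert, batch_size=100000):
    batches = 0
    bulk, pending = make_bulk_insert(), 0
    for row in messages:
        for i, value in enumerate(row):
            bulk.setColumn(i, value)
        bulk.writeRow()
        pending += 1
        if pending >= batch_size:
            bulk.commit()          # one transaction per micro-batch
            batches += 1
            bulk, pending = make_bulk_insert(), 0
    if pending:
        bulk.commit()              # flush the final partial batch
        batches += 1
    return batches

class NullBulk:
    """No-op stand-in for ColumnStoreBulkInsert."""
    def setColumn(self, i, v):
        pass

    def writeRow(self):
        pass

    def commit(self):
        pass

n = stream_to_columnstore([(i, "x") for i in range(7)], NullBulk, batch_size=3)
```

In a Kafka consumer, `messages` would be the poll loop's records, and the batch size trades ingestion latency against transaction overhead.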
  17. Spark Connector
     ● Enables publishing of machine learning results from Spark DataFrames to ColumnStore.
     ● Enables a best-of-breed approach:
       ○ In-memory machine learning algorithms in Spark.
       ○ Publish results to ColumnStore for easy consumption by SQL tools such as Tableau.
     ● Supports both Scala and Python notebooks.
     ● To pull data from ColumnStore into Spark, use the JDBC connector and Spark SQL to read data.
       ○ A bulk read API is planned for a future release.
     ● Requires adding additional jar files to the Spark runtime configuration.
     ● Available as a Docker image for reference / easy evaluation.
  18. Spark Connector Demo / Example
  19. Spark Connector - Getting Started with Docker
     git clone
     cd mariadb-columnstore-docker/columnstore_jupyter
     docker-compose up -d
     In your browser, open http://localhost:8888 and enter 'mariadb' as the password to log in to the Jupyter notebook application.
  20. Thank you!