MariaDB experts explain how to stream data using MariaDB MaxScale, a database proxy that can vastly improve your server's transactional data processing without sacrificing scalability, security or speed. In this webinar, learn how to use MaxScale to convert data to JSON documents or Avro objects, and watch as MariaDB's senior software engineers do a live demo of how to use the Kafka producer.
Watch the webinar here: https://mariadb.com/resources/webinars/streaming-operational-data-mariadb-maxscale
2. What Is Real-Time Analytics?
How Real-Time Analytics Differs From Batch Analytics
3. Batch vs. Real-Time

Batch                       Real-Time
Data-oriented process       Time-oriented process
Scope is static             Scope is dynamic
Data is complete            Data is incremental
Output reflects input       Output reflects changes in input
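The contrast above can be made concrete with a small sketch (illustrative only, not from the slides): a batch job recomputes its output over a complete, static input, while a real-time consumer updates its output incrementally as each change arrives.

```python
def batch_total(all_rows):
    """Batch: scope is static, data is complete, output reflects input."""
    return sum(row["amount"] for row in all_rows)

class RunningTotal:
    """Real-time: scope is dynamic, data is incremental,
    and the output reflects changes in the input."""
    def __init__(self):
        self.total = 0

    def apply(self, change):
        self.total += change["amount"]
        return self.total

rows = [{"amount": 10}, {"amount": 5}, {"amount": -3}]
print(batch_total(rows))        # one pass over the full data set

rt = RunningTotal()
for change in rows:             # one update per incoming change
    print(rt.apply(change))
```

Both end at the same answer; the difference is that the incremental consumer has a usable output after every change, not only after the full data set is available.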
5. What Is Change Data Capture in MaxScale?
● Captures changes in committed data
○ MariaDB replication protocol awareness
● Stored as Apache Avro
○ Compact and efficient serialization format
● Simple data streaming service
○ Provides continuous data streams
6. What Does the CDC System Consist Of?
● Binlog replication relay (a.k.a. Binlog Server)
● Data conversion service
● CDC protocol
● Kafka producer
8. Binlog Events
● The master database sends events from its binlog files
● Events sent are a binary representation of the binlog file contents, with a header prepended
● Once all events have been sent, the master pauses until new events are ready to be sent
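The prepended header mentioned above follows the standard MariaDB/MySQL binlog event header layout: a fixed 19 bytes holding the timestamp, event type, server id, total event size, next event position, and flags, all little-endian. A minimal decoding sketch (the example values are invented):

```python
import struct

# 19-byte binlog event header: timestamp (4), event type (1),
# server_id (4), event size (4), next event position (4), flags (2),
# all little-endian, per the MariaDB/MySQL binlog format.
HEADER = struct.Struct("<IBIIIH")

def parse_event_header(data: bytes) -> dict:
    ts, ev_type, server_id, ev_size, next_pos, flags = HEADER.unpack(data[:19])
    return {"timestamp": ts, "type": ev_type, "server_id": server_id,
            "size": ev_size, "next_pos": next_pos, "flags": flags}

# Hand-built header for a hypothetical event, just to exercise the parser
raw = HEADER.pack(1700000000, 2, 3000, 45, 1045, 0)
print(parse_event_header(raw))
```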
10. Receiving Binlog Events
● MariaDB replication slave registration allows MaxScale to receive binlog events from the master
● Binlog events are stored in binlog files, the same way the master server stores them
● Row-based replication with the full row image is required on the master:
set global binlog_format='ROW';
set global binlog_row_image='full';
(Diagram: MariaDB Master Server → Replication Protocol → MaxScale Binlog Server, which writes binlog files such as mysql-bin.01045)
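For orientation, a rough sketch of what the MaxScale configuration for the binlog relay and the Avro conversion service could look like. Section names, credentials, server id and paths below are placeholders, and the exact router options vary by MaxScale version, so treat this as illustrative and consult the MaxScale documentation:

```ini
# Sketch only; names, credentials and paths are placeholders.
[Binlog-Service]
type=service
router=binlogrouter
router_options=server_id=4000,binlogdir=/var/lib/maxscale
user=maxuser
passwd=maxpwd

[Avro-Service]
type=service
router=avrorouter
source=Binlog-Service
```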
12. Apache Avro™
● A data serialization format
○ Consists of a file header and one or more data blocks
● Specifies an Object Container file format
● Efficient storage of high volume data
○ Schema always stored with data
○ Compact integer representation
○ Supports compression
● Easy to process in parallel due to how the data blocks are stored
● Tooling for Avro is readily available
○ Easy to extract and load into other systems
Source: http://avro.apache.org/
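The "compact integer representation" bullet refers to Avro's zig-zag variable-length encoding, in which small-magnitude integers (positive or negative) take a single byte. A pure-Python sketch of the encoding as defined in the Avro specification:

```python
def zigzag(n: int) -> int:
    """Map signed 64-bit ints to unsigned so small magnitudes stay small:
    0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ..."""
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    """Base-128 varint of the zig-zag value, low 7 bits first,
    high bit of each byte set while more bytes follow."""
    z = zigzag(n) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # continuation bit
        else:
            out.append(byte)
            return bytes(out)

print(encode_long(1).hex())    # 02
print(encode_long(-1).hex())   # 01
print(encode_long(64).hex())   # 8001
```

This is why a column full of small ids or counters compresses so well in an Avro container even before block-level compression is applied.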
13. Avro File Conversion
● Binlog files are converted to Avro file containers
○ one per database table
● On schema changes a new file sequence is created
● Tunable flow of events
(Diagram: the Avro converter reads mysql-bin.01045 and writes AVRO_file_001, AVRO_file_002, … for consumption by data warehouse platforms)
14. Avro Schema
{
"type": "record",
"namespace": "MaxScaleChangeDataSchema.avro",
"name": "ChangeRecord",
"fields": ...
}
• Defines how the data is stored
• Contains some static fields
• MaxScale records are always named ChangeRecord, in the MaxScaleChangeDataSchema.avro namespace
15. Avro Schema - Fields
"fields": [
{ "name": "domain", "type": "int" }, { "name": "server_id", "type": "int" },
{ "name": "sequence", "type": "int" }, { "name": "event_number", "type": "int" },
{ "name": "timestamp", "type": "int" },
{ "name": "event_type", "type": { "type": "enum", "name": "EVENT_TYPES",
"symbols": [ "insert", "update_before", "update_after", "delete" ] } },
… More fields …
]
• A list of field information, constructed from standard Avro data types
• MaxScale adds six default fields
○ Three GTID components (domain, server_id, sequence)
○ Event index inside the transaction
○ Event timestamp
○ Type of the captured event
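Because the per-table schema is plain JSON, the six metadata fields MaxScale prepends can be inspected with the standard library alone. A small sketch (the trailing "id"/"data" fields stand in for an example user table):

```python
import json

schema = json.loads("""
{
  "type": "record",
  "namespace": "MaxScaleChangeDataSchema.avro",
  "name": "ChangeRecord",
  "fields": [
    {"name": "domain", "type": "int"},
    {"name": "server_id", "type": "int"},
    {"name": "sequence", "type": "int"},
    {"name": "event_number", "type": "int"},
    {"name": "timestamp", "type": "int"},
    {"name": "event_type", "type": {"type": "enum", "name": "EVENT_TYPES",
        "symbols": ["insert", "update_before", "update_after", "delete"]}},
    {"name": "id", "type": "int"},
    {"name": "data", "type": "string"}
  ]
}
""")

# The first six fields are the metadata MaxScale adds to every record;
# domain, server_id and sequence together form the GTID.
metadata = [f["name"] for f in schema["fields"][:6]]
print(metadata)
```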
18. Data Streaming in MaxScale
• Provide real-time transactional data to a data lake for analytics
• Capture changed data from the binary log events
• Stream from MariaDB to CDC clients in real time
19. CDC Protocol
● Register as change data client
● Receive change data records
● Query last GTID
● Query change data record statistics
● One client receives an event stream for one table
(Diagram: CDC clients connect through the Change Data Listener Protocol)
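The client side of the handshake can be sketched as three plain-text messages written to the CDC listener's TCP port. The formats below follow the MaxScale CDC protocol documentation as I understand it (authentication is the hex encoding of "user:" followed by the SHA1 digest of the password, then REGISTER and REQUEST-DATA commands), but treat the exact wire format as an assumption and verify it against your MaxScale version:

```python
import hashlib
import uuid

def auth_message(user: str, password: str) -> bytes:
    """Hex encoding of user ':' SHA1(password), per the CDC protocol docs."""
    digest = hashlib.sha1(password.encode()).digest()
    return (user.encode() + b":" + digest).hex().encode()

def register_message(client_uuid: str, fmt: str = "JSON") -> bytes:
    """Register as a change data client, requesting JSON or AVRO output."""
    return f"REGISTER UUID={client_uuid}, TYPE={fmt}".encode()

def request_data_message(database: str, table: str) -> bytes:
    """Start the continuous change data stream for one table."""
    return f"REQUEST-DATA {database}.{table}".encode()

# These would be written, in order, to a socket connected to the listener:
print(auth_message("maxuser", "maxpwd")[:16])
print(register_message(str(uuid.uuid4())))
print(request_data_message("db1", "tbl1"))
```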
20. CDC Client
● Simple Python 3 command line client for the CDC protocol
● Continuous stream consumer
○ A building block for more complex systems
○ Outputs newline delimited JSON or raw Avro data
● Shipped as a part of MaxScale 2.0
21. CDC Client - Example Output
[alex@localhost ~]$ cdc.py --user maxuser --password maxpwd db1.tbl1
{"namespace": "MaxScaleChangeDataSchema.avro", "type": "record", "name": "ChangeRecord",
"fields": [{"name": "domain", "type": "int"}, {"name": "server_id", "type": "int"},
{"name": "sequence", "type": "int"}, {"name": "event_number", "type": "int"}, {"name":
"timestamp", "type": "int"}, {"name": "event_type", "type": {"type": "enum", "name":
"EVENT_TYPES", "symbols": ["insert", "update_before", "update_after", "delete"]}},
{"name": "id", "type": "int", "real_type": "int", "length": -1},
{"name": "data", "type": "string", "real_type": "varchar", "length": 255}]}
• Schema is sent first
• Events come after the schema
• New schema sent if the schema changes
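Because the client emits newline-delimited JSON, downstream processing needs nothing beyond a JSON parser. A sketch of a consumer that filters inserts out of the stream (the sample lines are invented, but follow the ChangeRecord field layout shown above; in practice the lines would arrive on stdin, piped from cdc.py):

```python
import json

sample_stream = [
    '{"domain": 0, "server_id": 1, "sequence": 5, "event_number": 1, '
    '"timestamp": 1464182335, "event_type": "insert", "id": 1, "data": "a"}',
    '{"domain": 0, "server_id": 1, "sequence": 6, "event_number": 1, '
    '"timestamp": 1464182336, "event_type": "delete", "id": 1, "data": "a"}',
]

inserts = []
for line in sample_stream:
    event = json.loads(line)
    if event["event_type"] == "insert":   # filter by captured event type
        inserts.append(event["id"])
print(inserts)
```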
29. MaxScale for Streaming Changes
The MaxScale solution provides:
● Easy replication setup from MariaDB database
● Integrated and configurable Avro file conversion
● Easy data streaming to compatible solutions
● Ready to use Python scripts