MariaDB experts explain how to stream data using MariaDB MaxScale, a database proxy that can vastly improve your server's transactional data processing without sacrificing scalability, security or speed. In this webinar, learn how to use MaxScale to convert data to JSON documents or Avro objects, and watch as MariaDB's senior software engineers do a live demo of how to use the Kafka producer.
Watch the webinar here: https://mariadb.com/resources/webinars/streaming-operational-data-mariadb-maxscale
2. What Is Real-Time Analytics?
How Real-Time Analytics Differs From Batch Analytics
3. Batch vs. Real-Time

Batch                       Real-Time
Data-oriented process       Time-oriented process
Scope is static             Scope is dynamic
Data is complete            Data is incremental
Output reflects input       Output reflects changes in input
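The contrast above can be made concrete with a small sketch (illustrative only, not from the slides): a batch job recomputes its output over a complete, static input, while a real-time consumer updates its output incrementally as each change arrives.

```python
def batch_total(all_rows):
    """Batch: scope is static, data is complete, output reflects input."""
    return sum(row["amount"] for row in all_rows)

class RunningTotal:
    """Real-time: scope is dynamic, data is incremental,
    and the output reflects changes in the input."""
    def __init__(self):
        self.total = 0

    def apply(self, change):
        self.total += change["amount"]
        return self.total

rows = [{"amount": 10}, {"amount": 5}, {"amount": -3}]
print(batch_total(rows))        # one pass over the full data set

rt = RunningTotal()
for change in rows:             # one update per incoming change
    print(rt.apply(change))
```

Both end at the same answer; the difference is that the incremental consumer has a usable output after every change, not only after the full data set is available.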
5. What Is Change Data Capture in MaxScale?
● Captures changes in committed data
○ MariaDB replication protocol awareness
● Stored as Apache Avro
○ Compact and efficient serialization format
● Simple data streaming service
○ Provides continuous data streams
6. What Does the CDC System Consist Of?
● Binlog replication relay (a.k.a. Binlog Server)
● Data conversion service
● CDC protocol
● Kafka producer
8. Binlog Events
● The master database sends events from its binlog files
● Events sent are a binary representation of the binlog file contents, with a header prepended
● Once all events have been sent, the master pauses until new events are ready to be sent
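The prepended header mentioned above follows the standard MariaDB/MySQL binlog event header layout: a fixed 19 bytes holding the timestamp, event type, server id, total event size, next event position, and flags, all little-endian. A minimal decoding sketch (the example values are invented):

```python
import struct

# 19-byte binlog event header: timestamp (4), event type (1),
# server_id (4), event size (4), next event position (4), flags (2),
# all little-endian, per the MariaDB/MySQL binlog format.
HEADER = struct.Struct("<IBIIIH")

def parse_event_header(data: bytes) -> dict:
    ts, ev_type, server_id, ev_size, next_pos, flags = HEADER.unpack(data[:19])
    return {"timestamp": ts, "type": ev_type, "server_id": server_id,
            "size": ev_size, "next_pos": next_pos, "flags": flags}

# Hand-built header for a hypothetical event, just to exercise the parser
raw = HEADER.pack(1700000000, 2, 3000, 45, 1045, 0)
print(parse_event_header(raw))
```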
10. Receiving Binlog Events
● MariaDB replication slave registration allows MaxScale to receive binlog events from the master
● Binlog events are stored in binlog files, the same way the master server stores them
● Row-based replication with the full row image is required on the master:
set global binlog_format='ROW';
set global binlog_row_image='full';
(Diagram: MariaDB Master Server → Replication Protocol → MaxScale Binlog Server, which writes binlog files such as mysql-bin.01045)
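For orientation, a rough sketch of what the MaxScale configuration for the binlog relay and the Avro conversion service could look like. Section names, credentials, server id and paths below are placeholders, and the exact router options vary by MaxScale version, so treat this as illustrative and consult the MaxScale documentation:

```ini
# Sketch only; names, credentials and paths are placeholders.
[Binlog-Service]
type=service
router=binlogrouter
router_options=server_id=4000,binlogdir=/var/lib/maxscale
user=maxuser
passwd=maxpwd

[Avro-Service]
type=service
router=avrorouter
source=Binlog-Service
```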
12. Apache Avro™
● A data serialization format
○ Consists of a file header and one or more data blocks
● Specifies an Object Container file format
● Efficient storage of high volume data
○ Schema always stored with data
○ Compact integer representation
○ Supports compression
● Easy to process in parallel due to how the data blocks are stored
● Tooling for Avro is readily available
○ Easy to extract and load into other systems
Source: http://avro.apache.org/
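The "compact integer representation" bullet refers to Avro's zig-zag variable-length encoding, in which small-magnitude integers (positive or negative) take a single byte. A pure-Python sketch of the encoding as defined in the Avro specification:

```python
def zigzag(n: int) -> int:
    """Map signed 64-bit ints to unsigned so small magnitudes stay small:
    0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ..."""
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    """Base-128 varint of the zig-zag value, low 7 bits first,
    high bit of each byte set while more bytes follow."""
    z = zigzag(n) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # continuation bit
        else:
            out.append(byte)
            return bytes(out)

print(encode_long(1).hex())    # 02
print(encode_long(-1).hex())   # 01
print(encode_long(64).hex())   # 8001
```

This is why a column full of small ids or counters compresses so well in an Avro container even before block-level compression is applied.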
13. Avro File Conversion
● Binlog files are converted to Avro file containers
○ one per database table
● On schema changes a new file sequence is created
● Tunable flow of events
(Diagram: the Avro converter reads mysql-bin.01045 and writes AVRO_file_001, AVRO_file_002, … for consumption by data warehouse platforms)
14. Avro Schema
{
"type": "record",
"namespace": "MaxScaleChangeDataSchema.avro",
"name": "ChangeRecord",
"fields": ...
}
• Defines how the data is stored
• Contains some static fields
• MaxScale records are always named ChangeRecord, in the MaxScaleChangeDataSchema.avro namespace
15. Avro Schema - Fields
"fields": [
{ "name": "domain", "type": "int" }, { "name": "server_id", "type": "int" },
{ "name": "sequence", "type": "int" }, { "name": "event_number", "type": "int" },
{ "name": "timestamp", "type": "int" },
{ "name": "event_type", "type": { "type": "enum", "name": "EVENT_TYPES",
"symbols": [ "insert", "update_before", "update_after", "delete" ] } },
… More fields …
]
• A list of field information, constructed from standard Avro data types
• MaxScale adds six default fields
○ Three GTID components (domain, server_id, sequence)
○ Event index inside the transaction
○ Event timestamp
○ Type of the captured event
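Because the per-table schema is plain JSON, the six metadata fields MaxScale prepends can be inspected with the standard library alone. A small sketch (the trailing "id"/"data" fields stand in for an example user table):

```python
import json

schema = json.loads("""
{
  "type": "record",
  "namespace": "MaxScaleChangeDataSchema.avro",
  "name": "ChangeRecord",
  "fields": [
    {"name": "domain", "type": "int"},
    {"name": "server_id", "type": "int"},
    {"name": "sequence", "type": "int"},
    {"name": "event_number", "type": "int"},
    {"name": "timestamp", "type": "int"},
    {"name": "event_type", "type": {"type": "enum", "name": "EVENT_TYPES",
        "symbols": ["insert", "update_before", "update_after", "delete"]}},
    {"name": "id", "type": "int"},
    {"name": "data", "type": "string"}
  ]
}
""")

# The first six fields are the metadata MaxScale adds to every record;
# domain, server_id and sequence together form the GTID.
metadata = [f["name"] for f in schema["fields"][:6]]
print(metadata)
```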
18. Data Streaming in MaxScale
• Provide real-time transactional data to a data lake for analytics
• Capture changed data from the binary log events
• Stream from MariaDB to CDC clients in real time
19. CDC Protocol
● Register as change data client
● Receive change data records
● Query last GTID
● Query change data record statistics
● One client receives an event stream for one table
(Diagram: CDC clients connect through the Change Data Listener Protocol)
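The client side of the handshake can be sketched as three plain-text messages written to the CDC listener's TCP port. The formats below follow the MaxScale CDC protocol documentation as I understand it (authentication is the hex encoding of "user:" followed by the SHA1 digest of the password, then REGISTER and REQUEST-DATA commands), but treat the exact wire format as an assumption and verify it against your MaxScale version:

```python
import hashlib
import uuid

def auth_message(user: str, password: str) -> bytes:
    """Hex encoding of user ':' SHA1(password), per the CDC protocol docs."""
    digest = hashlib.sha1(password.encode()).digest()
    return (user.encode() + b":" + digest).hex().encode()

def register_message(client_uuid: str, fmt: str = "JSON") -> bytes:
    """Register as a change data client, requesting JSON or AVRO output."""
    return f"REGISTER UUID={client_uuid}, TYPE={fmt}".encode()

def request_data_message(database: str, table: str) -> bytes:
    """Start the continuous change data stream for one table."""
    return f"REQUEST-DATA {database}.{table}".encode()

# These would be written, in order, to a socket connected to the listener:
print(auth_message("maxuser", "maxpwd")[:16])
print(register_message(str(uuid.uuid4())))
print(request_data_message("db1", "tbl1"))
```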
20. CDC Client
● Simple Python 3 command line client for the CDC protocol
● Continuous stream consumer
○ A building block for more complex systems
○ Outputs newline delimited JSON or raw Avro data
● Shipped as a part of MaxScale 2.0
21. CDC Client - Example Output
[alex@localhost ~]$ cdc.py --user maxuser --password maxpwd db1.tbl1
{"namespace": "MaxScaleChangeDataSchema.avro", "type": "record", "name": "ChangeRecord",
"fields": [{"name": "domain", "type": "int"}, {"name": "server_id", "type": "int"},
{"name": "sequence", "type": "int"}, {"name": "event_number", "type": "int"}, {"name":
"timestamp", "type": "int"}, {"name": "event_type", "type": {"type": "enum", "name":
"EVENT_TYPES", "symbols": ["insert", "update_before", "update_after", "delete"]}},
{"name": "id", "type": "int", "real_type": "int", "length": -1},
{"name": "data", "type": "string", "real_type": "varchar", "length": 255}]}
• Schema is sent first
• Events come after the schema
• New schema sent if the schema changes
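Because the client emits newline-delimited JSON, downstream processing needs nothing beyond a JSON parser. A sketch of a consumer that filters inserts out of the stream (the sample lines are invented, but follow the ChangeRecord field layout shown above; in practice the lines would arrive on stdin, piped from cdc.py):

```python
import json

sample_stream = [
    '{"domain": 0, "server_id": 1, "sequence": 5, "event_number": 1, '
    '"timestamp": 1464182335, "event_type": "insert", "id": 1, "data": "a"}',
    '{"domain": 0, "server_id": 1, "sequence": 6, "event_number": 1, '
    '"timestamp": 1464182336, "event_type": "delete", "id": 1, "data": "a"}',
]

inserts = []
for line in sample_stream:
    event = json.loads(line)
    if event["event_type"] == "insert":   # filter by captured event type
        inserts.append(event["id"])
print(inserts)
```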
29. MaxScale for Streaming Changes
The MaxScale solution provides:
● Easy replication setup from MariaDB database
● Integrated and configurable Avro file conversion
● Easy data streaming to compatible solutions
● Ready to use Python scripts