MySQL Cluster (NDB) Best Practices

Copyright 2012 Severalnines AB
Die Hard VII
Percona Live April 2017
Johan Andersson, CTO/Founder Severalnines
Presenter
johan@severalnines.com
MySQL Cluster (NDB) - Best Practices

Copyright 2012 Severalnines ABCopyright 2012 Severalnines AB
● Architecture and Design
● OS Tuning
● Stability Tuning
● Application design
● Identifying bottlenecks
● Tuning tricks
Agenda

● NDB Storage Engine was developed by Ericsson starting in 1996 as a PHD
thesis.
○ Focus on High Availability for Telecom requirements.
● Offered as a product by Alzato (company 100% owned by Ericsson) in
2001
● Acquired by MySQL in September 2003
○ New product called MySQL Cluster
○ SQL support (MySQL)
History

● High Availability
○ MySQL Cluster (NDB Cluster) is a STORAGE ENGINE offering 99.999%
availability.
● Scalability
○ Throughput up to 200M queries per second
○ Designed for MANY parallel and SHORT transactions
● High Performance
○ PK based queries, insertions, short index scans.
● Self-healing (node recovery and cluster recovery)
● Transactional
Core Architecture

● NDB Cluster consists of
○ 2 or more data nodes
■ Data Store / management
● Checkpointing / redo logging
■ Transactional management
■ Query execution
○ 2 management servers
■ Stores and manages the configuration
(config.ini) describing the cluster.
■ Not involved in query processing
■ Arbitration / Network partitioning
○ X number of API nodes
■ MySQL servers (SQL access)
■ Cluster/J (Java based access)
■ Node.js (JavaScript access)
■ NDBAPI (C++ access)
Core Architecture

● Sharded by design
● Data Nodes operates in pairs / shards
○ Partitions (Px)
■ Primary Partition
■ Secondary Partition
● Data is automatically partitioned on the
partitions(but can be user defined)
○ Hash of Primary Key determines which
node group.
○ Requests automatically routed.
● Linear Two-phase commit ensures that data
between two data nodes are synchronously
replicated.
● In case of two partitions
○ 50 % of data / partition
Core Architecture

● Node failure handling is transparent and handled
automatically.
○ An aborted transaction in the prepare state
must be retried.
● Failed node will automatically catch up missing
data
● If all node in a partition fails, then cluster will
shutdown
○ CAP – Theorem:
■ NDB Cluster prioritize data Consistency
over Availability when network
Partitioning happens.
● However, you can replicate to another cluster
asynchronously with conflict detection and
resolution.
Core Architecture

● Internally the Data Node has a number of “blocks” (modules) handling
different functionality. Here are the most important ones:
● TC - Transaction Coordinator
○ Transaction logic (Linear 2 Phase Commit Protocol)
● LDM – Local Data Manager
○ Data storage (reading writing checkpointing..)
○ One or more per data node
● Send
○ One or more threads sending data to other nodes
● Recv
○ One of more threads receiving data from other nodes.
Core Architecture

Core Architecture

● Usually: 1 TC for every 2 LDMs
● Read Heavy: 1 TC for every 4 LDM
● Write Heavy: 1 TC for every 1 LDM
● Fine level tuning / Die Hard VII:
○ ThreadConfig=ldm={count=32,cpubind=1,2…},main={cpubind=3},
tc={count=20,cpubind=…},send={count=8,cpubind=…},
recv={count=10,cpubind=…}
● ”Lazy” level tuning
○ MaxNoOfExecutionThreads=72
Core Architecture

● Key objective
○ Describe what nodes and where that are part of
the cluster
○ Resouce allocation (memory, buffers etc)
● Never place the management servers on the same
hosts as the data nodes
○ Availability may be jeopardized.
● Hardware:
○ Data nodes : 8 cores ore more and SSD
○ Management servers: no requirement
○ MySQL/API nodes: Fast CPUs, many cores, disk
not interesting (unless you binlog changes).
○ Fast network interconnect (alteast 1Gig-E, but
10Gig-E is preferable)
Simple config.ini

● Normal Operation
○ MinDiskWriteSpeed (10m) <—> MaxDiskWriteSpeed (20M)
● Node and System Restart
○ MaxDiskWriteSpeedOwnRestart (200M)
● When another Node doing restart
○ MaxDiskWriteSpeedOtherNodeRestart (50M)
Control Disk Writes

● Don’t be shy!
● LongMessageBufferMemory=128M
● RedoBuffer=64M (each LDM will use this)
● MaxNoOfConcurrentScans=500
Buffers

Node Restart Perf

LCP/REDO Performance

Per Fragment Reporting

Per Fragment Reporting (cont)

● Transactional
○ Read-committed isolation level
● By default data is stored in-memory (backed to disk with checkpoints and
redo logs)
○ Disk data tables (non-indexed data stored on disk) are supported and
can be used for logging tables e.g
● Foreign Key Support
○ Avoid if you can (performance impact)
● GIS / FULLTEXT search is not supported
Other

● Each MySQL Server stores locally (must be created once on each MySQL Server):
○ Views
○ Triggers
○ Routines
○ Events
○ GRANTS (can be replicated but it is painful)
● Write to any API/MySQL server concurrently
○ E.g Galera has weaknesses due to optimistic locking in high concurrency, multi
node writes.
● Add data nodes ONLINE to grow storage (max 48 data nodes)
● Add API nodes ONLINE to increase query performance capacity
● Cluster <-> Cluster asynchronous replication
Other

● Set TimeBetweenGlobalCheckpoints=1000
○ Lose 1 second of data if entire cluster fails
○ Similar as innodb_flush_log_at_trx_commit=2
● A single transaction will never run as fast as on InnoDb
○ Network latency is slower than local memory access latency
○ JOINs will never be as fast as on Innodb, despite being executed on the Data
nodes
● Aggregate functions (SUM, GROUP BY etc)
○ Executed on the API node / MySQL Server (if you use SQL)
● Painful if you have huge sorting / grouping / distinct queries
● Supports ALTER ONLINE TABLE to add columns / Indexes
○ Don’t drop columns or change data type à copying alter table . Heavy.
Compared to InnoDb

● High volume OLTP
● Real time analytics
● Ecommerce and financial trading with fraud detection
● Mobile and micro-payments
● Session management & caching
● Feed streaming, analysis and recommendations
● Content management and delivery
● Massively Multiplayer Online Games
● Communications and presence services
● Subscriber/user profile management and entitlements
Use Cases

● Assumptions:
○ Migrating existing MySQL database
● Create and apply the mysqldump of the schema
○ Create one file for schema only:
● --no-data
● Don’t dump the ‘mysql’ database
● Change ENGINE=InnoDb to ENGINE=NDB
● Load in the schema
● Create and apply the mysqldump for routines, views..
○ --no-data --routines --trigggers –events
○ Load in this dump on ALL mysql servers connected to the cluster
Migrating Data Into NDB Cluster

● Create and apply the mysqldump of the data
○ Create one file each database and table (or if you have many
databases, like 10-20, one file per database)
○ --extended-insert=false --complete-insert=true
● Load in the data dump files in parallel preferably from >1 MySQL Server
connected to the data nodes.
○ Will save you a lot of time
● Why --extended-insert=false ?
○ Same problem as in Galera really
○ MaxNoOfConcurrentOperations (in galera wsrep_max_ws_rows/size)
sets an upper bound how many operations that can be executed at
once.
Migrating Data Into NDB Cluster

● OS Tuning
● Stability Tuning
● Application Tuning
● Bottlenecks
Tuning

● Two options:
○ MaxNoOfExecutionThreads = N [ 0 < N <= 72 ]
○ http://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-ndbd-definition.html#n
dbparam-ndbmtd-maxnoofexecutionthreads
○ Simple and fast way to get started.
● ThreadConfig=ldm={count=1,cpubind=1,2},main={cpubind=3}
○ http://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-ndbd-definition.html#n
dbparam-ndbmtd-threadconfig
○ Allows more complex ‘rules’ specifying exactly on what core a NDB thread is
allowed to run.
OS Tuning – multi core

● Disable NUMA in /etc/grub.conf
○ numa=off
● echo ‘1’ > /proc/sys/vm/swappiness
○ echo ‘vm.swappiness=1’ >> /etc/sysctl.conf
● Bind data node threads to CPUs/cores
○ cat /proc/interrupts | grep eth
OS Tuning
cpu0 cpu1 cpu2 cpu3
44: 31 49432584 0 0 xen-dyn-event eth0
45: 1633715292 0 0 0 xen-dyn-event eth1
OK
AVOID!

● Fine tune transaction processing components of the data nodes
○ How many transaction coordinators per data node?
○ How many local data managers per data node?
● Local Data Managers (LDM
○ Non-primary key access requires more Local Data Managers
● Transaction Coordinators (TC)
○ Primary key requests requires more Transaction Coordinators
● Good rule: 1 TC for each four of LDMs
● Depends on the number of CPU cores are well:
○ 8 cores : 4 LDM, 1 TC
○ 32 cores: 16 LDMs, 4 TCs
Stability Tuning

Stability Tuning (cont)

● Run representative load
○ Does any NDB Thread run hot (close to or 100% cpu util on a single thread)?
● Increase with threads of that type (check ndb_out.log which type of thread it is )
○ Enable HyperThreading
● Can give 40% more
● Enable:
○ RealTimeScheduler=1
● Don’t forget:
○ NoOfFragmentLogParts=<no of LDMs>

● Tuning the REDO log is key
○ FragmentLogFileSize=256M
○ NoOfFragmentLogFiles=<4-6> X DataMemory in MB / 4 x
FragmentLogFileSize
● RedoBuffer=64M for a write busy system
● Disk based data:
○ SharedGlobalMemory=4096M
○ In the LOGFILE GROUP: undo_buffer_size=128M
● Or higher (max is 600M)

● Make sure you don’t have more “execution threads” than cores
● You want to have
○ Major page faults low
○ Involuntary context switches low
mysql> SELECT node_id, thr_no,thr_nm , os_ru_majflt, os_ru_nivcsw FROM
threadstat;
+---------+--------+--------+--------------+--------------+
| node_id | thr_no | thr_nm | os_ru_majflt | os_ru_nivcsw |
+---------+--------+--------+--------------+--------------+
| 3 | 0 | main | 1 | 541719 |
| 4 | 0 | main | 0 | 561769 |
+---------+--------+--------+--------------+--------------+
2 rows in set (0.01 sec)

● Define the most typical Use Cases
○ List all my friends, session management etc etc.
○ Optimize everything for the typical use case
● Engineer schema to cater for the Use Cases
● Keep it simple
○ Complex access patterns does not scale
○ Simple access patterns do ( Primay key and Partitioned Index Scans )
● Note! There is no parameter in config.ini that affects performance – only availability.
○ Everything is about the Schema and the Queries.
○ Tune the mysql servers (sort buffers etc) as you would for innodb.
Application Design

● PRIMARY KEY lookups are HASH lookup O(1)
● INDEX searches a T-tree and takes O(log n) time.
● JOINs are okay but are typically not as fast as on
INNODB
○ Data Locality is important
○ Network access may be expensive
Simple Access

● A lot of CPU is used on the data nodes
○ Probably a lot of large index scans and full table scans are used.
○ Check Slow query log or a query monitor
● A lot of CPU is used on the mysql servers
○ Probably a lot of GROUP BY/DISTINCT or aggregate functions.
● Hardly no CPU is used on either mysql or data nodes
○ Probably low load
○ Time is spent on network (a lot of “ping pong” to satisfy a request).
● System is running slow in general
○ Disks (io util), queries, swap (must never happen), network
Identifying Bottlenecks

● Slow query log
○ set global slow_query_log=1;
○ set global long_query_time=0.5;
○ set global log_queries_not_using_indexes=1;
● General log (if you don’t get enough info in the Slow Query Log)
○ Activate for a very short period of time (30-60seconds) – intrusive
○ Can fill up disk very fast – make sure you turn it off.
○ set global general_log=1;
● Performance Schema
● Use Severalnines ClusterControl
○ Includes a Cluster-wide Query Monitor.
○ Query frequency, EXPLAINs, lock time etc.
○ Performance Monitor and Manager.
Enable Logging

Setup

● By default, all index scans hit all data nodes
○ good if result set is big – you want as many CPUs as possible to help you.
● For smaller result sets (~a couple of hundred records) Partition Pruning is key for scalability.
● User-defined partitioning can help to improve equality index scans on part of a primary key.
Sharding
● All data belonging to a particular uid will be on the same partition.
○ Great locality!
● select * from user where uid=1;
○ Only one data node will be scanned (no matter how many nodes you have)
CREATE TABLE t1 (uid,
fid,
somedata,
PRIMARY KEY(uid, fid))
PARTITION BY KEY(uid);

Sharding

mysql> explain partitions select * from userdata u where u.uid=1G
id: 1
select_type: SIMPLE
table: u
partitions: p0,p1
type: ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 2699
Extra: NULL
With PARTITION BY KEY (UID)
mysql> explain partitions select * from userdata u where u.uid=1G
id: 1
select_type: SIMPLE
table: u
partitions: p0
type: ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 2699
Extra: NULL
Sharding – EXPLAIN PARTITIONS

● BLOB/TEXT columns are stored in an external hidden table.
○ First 256B are stored inline in main table
○ Reading a BLOB/TEXT requires two reads
● One for reading the Main table + reading from hidden table
● Change to VARBINARY/VARCHAR if:
○ Your BLOB/TEXTs can fit within an 14000 Bytes record
● (record size is currently 14000 Bytes)
○ Reading/writing VARCHAR/VARBINARY is less expensive
Note 1: BLOB/TEXT are also more expensive in Innodb as BLOB/TEXT data is not inlined
with the table. Thus, two disk seeks are needed to read a BLOB.
Note 2: Store images, movies etc outside the database on the filesystem.
Data Types
BLOBs/TEXTs vs VARBINARY/VARCHAR

● Ndb_cluster_connection_pool (in my.cnf) creates more connections from one
mysqld to the data nodes
○ Threads load balance on the connections gives less contention on mutex which
in turn gives increased scalabilty
○ Less MySQL Servers needed to drive load!
○ www.severalnines.com/cluster-configurator allows you to specify the
connection pool.
● Set ndb_cluster_connection_pool between 1 and 4
Ndb_cluster_connection_pool

NDB compared to Galera

NDB
● MGM, SQL, Data Nodes
○ NDB Storage Engine (“distributed
hash table”)
○ Migration needed
○ Primarily used with in-memory tables
● Automatic Sharding & User defined
partitioning
○ Great Read & Write scalability
○ Synchronous Replication within the
node group (2-phase commit)
● Pessimistic locking
Galera
● MySQL Server w/ “wsrep plugin”
○ InnoDB Engine
● Virtually Synchronous Multi-master Replication
○ “Dataset is the Master”
○ Global Transaction ID
○ Replication Performance depending on the
slowest node (RTT)
○ Limited Write scalability
● Cluster wide optimistic locking
○ First to commit wins, conflciting writesets
will be aborted.
○ Writeset Certification
○ “deadlocks” for “hotspots” tables

NDB
● Short transactions
● “Push-down joins” in data nodes
○ Aggregates done on the SQL node!
● Geographical Replication
○ Complex
○ Asynchronous Replication
○ No automatic replication channel
failover
○ Conflict detection & resolution
functions
Galera
● Moderate sized transactions
○ A writeset is processed as a single
memory-resident buffer
● Tolerate non-sequential auto-increment
values
● Geographical replication
○ Easy
○ No distinction between a local or
remote node
○ Set segment ID to group nodes
○ Adjust some network timeouts

NDB
● Network Partitioning
○ 1. One node from each node group?
○ 2. No other “groups” with with one data
node from each node group?
○ 3. Ask the arbitrator
● MGM (default) or any API node can be designated
as an arbitrator
○ MGM important
○ Use 2 in production
● Scaling - X number of steps
○ Config changes, Rolling Restart
○ Redistribute dataset, Reclaim free space
● Online Schema changes
○ Add Column
○ Add index
Galera
● Network Partitioning
○ “Primary Component”
○ Majority rules >50%
○ Use odd # of nodes
● garbd arbitrator
○ Even # of nodes
○ Replication Relay
● Scaling
○ Start new node w/ address of existing
node(s)
○ State Transfer Snapshot
● Config changes
● Two Schema Changes Methods
○ Total Order Isolation (default) blocking
○ Rolling Schema Upgrade non-blocking (take
the node out of the cluster first)

● MySQL Cluster 7.4.9
● Percona XtraDb Cluster 5.6.28
● Sysbench 0.5 (32 threads)
● Ubuntu 14.04
● Amazon AWS EC2
○ M3.2xlarge (8 vCPU, 30GB of RAM)
○ 1440 provisioned IOPS (disk should not be a factor)
Small Benchmark

Each box is a m3.2xlarge Amazon AWS Ubuntu 14.04 instance
Galera Setup

Galera 3 nodes - 1, 2, 3 sysbench

Each box is a m3.2xlarge Amazon AWS Ubuntu 14.04 instance
NDB Setup

NDB 2 data nodes , 2 mysqld, 1 and 2
sysbench

● MySQL Cluster 7.4.9
● Percona XtraDb Cluster 5.6.28
● Sysbench 0.5 (32 threads)
● Ubuntu 14.04
● Amazon AWS EC2
○ M3.2xlarge (8 vCPU, 30GB of RAM)
○ 1440 provisioned IOPS (disk should not be a factor)
Reconfigure

NDB 2 data nodes , 1-4 mysqld, 1-4 sysbench

Top (data node + mysqld node)an>

● NDB Cluster requires a strong network for maximum performance.
○ AWS EC2 network performance “high” for m3.2xlarge was not enough.
Network is limit here
Network Flattens Out

● NDB Cluster is a very sophisticated storage engine
○ Not as easy to use as Innodb
○ Predictable performance in mixed read/write environments with
contention
○ Variety of connectors (sql, connector/j, ndbapi (c++) etc )
● Network speed and latency are important aspects
● View the MySQL Serves as ”protocol” converters ( lightweight services) and
co-locate with application servers
Summary

Questions?

● MySQL Cluster Configurator
○ www.severalnines.com/getting-started
● MySQL Cluster Management + Monitoring
○ http://www.severalnines.com/product/clustercontrol
● MySQL Cluster Training Slides
○ http://www.severalnines.com/mysql%C2%AE-cluster-training-demand
● My Blog
○ johanandersson.blogspot.com
Resources

● Facebook
○ www.facebook.com/severalnines
● Twitter
○ @severalnines
● Linked in:
○ www.linkedin.com/company/severalnines
Keep in touch…

Logos to Copy and Paste
OURS TECH PARTNERS

MySQL Cluster (NDB) Best Practices

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to MySQL Cluster (NDB) Best Practices

Similar to MySQL Cluster (NDB) Best Practices (20)

More from Severalnines

More from Severalnines (20)

Recently uploaded

Recently uploaded (20)

MySQL Cluster (NDB) Best Practices