
M|18 Architectural Overview: MariaDB MaxScale


Published in: Data & Analytics

  1. 1. MaxScale Architecture Evolution Johan Wikman Lead Developer
  2. 2. Overview ● What is MaxScale ● Architecture ● Performance ● Summary
  3. 3. What is MaxScale
  4. 4. What is MaxScale ● Cluster Abstraction ○ Hides the complexity. ○ Load Balancer ○ High Availability ○ Easier Maintenance ● And more ○ Firewall ○ Data masking ○ Logging ○ Cache ○ ... Client MaxScale Master Slave Slave
  5. 5. Read Write Splitting ● Analyze statements ○ Send where appropriate Client MaxScale Master Slave Slave
  6. 6. Read Write Splitting ● Analyze statements ○ Send where appropriate ● Write statements to master Client MaxScale Master Slave Slave > INSERT INTO ...
  7. 7. Read Write Splitting ● Analyze statements ○ Send where appropriate ● Write statements to master ● Read statements to some slave Client MaxScale Master Slave Slave > SELECT * ...
  8. 8. Read Write Splitting ● Analyze statements ○ Send where appropriate ● Write statements to master ● Read statements to some slave ● Session statements to all servers Client MaxScale Master Slave Slave > SET autocommit
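The routing rule built up over the last four slides can be sketched as a tiny classifier. This is only an illustration under strong assumptions: it looks at the first keyword of the statement, whereas MaxScale parses the statement with its query classifier. The names `target_t` and `classify` are hypothetical and not part of MaxScale, and cases such as `SELECT ... FOR UPDATE` or reads inside a transaction are deliberately ignored.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical routing targets; these names are not MaxScale's. */
typedef enum { TARGET_MASTER, TARGET_SLAVE, TARGET_ALL } target_t;

/* Returns 1 if stmt starts with the upper-case keyword, compared
 * case-insensitively and followed by a non-identifier character. */
static int starts_with(const char *stmt, const char *keyword)
{
    size_t n = strlen(keyword);
    for (size_t i = 0; i < n; ++i) {
        if (toupper((unsigned char)stmt[i]) != keyword[i])
            return 0;
    }
    return !isalnum((unsigned char)stmt[n]) && stmt[n] != '_';
}

/* Naive first-keyword classification: session statements go to all
 * servers, reads to some slave, everything else to the master. */
target_t classify(const char *stmt)
{
    assert(stmt != NULL);
    while (isspace((unsigned char)*stmt))
        ++stmt;
    if (starts_with(stmt, "SET") || starts_with(stmt, "USE"))
        return TARGET_ALL;
    if (starts_with(stmt, "SELECT") || starts_with(stmt, "SHOW"))
        return TARGET_SLAVE;
    return TARGET_MASTER;
}
```

Anything the sketch does not recognise falls through to the master, which is the safe default for a statement that might write.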
  9. 9. Architecture
  10. 10. Static Architecture ● Module types: Protocol, Authenticator, Filter, Router, Query Classifier, Monitor ● Example modules: MariaDBClient, MySQLAuth, ..., DBFwfilter, ..., ReadWriteSplit, qc_sqlite, ..., MariaDBMon ● The modules plug into the MaxScale core through APIs ● Core ○ Threading ○ Logging ○ Plugin loading ○ Lifetime management ○ REST-API ○ Admin Functionality ○ etc.
  11. 11. Data Flow Client Protocol Filter Filter Router Protocol Monitor Query Classifier Servers Server State monitors updates uses MaxScale
  12. 12. Code ● MaxScale: 147 kloc ○ Core: 51 kloc ○ Modules: 96 kloc ■ Authenticators: 5 kloc ■ Filters: 27 kloc ■ Routers: 43 kloc ■ Monitors: 12 kloc ■ Protocols: 9 kloc ● For comparison: ○ MariaDB server: 2500 kloc
  13. 13. Threading Architecture ● MaxScale is essentially a router. ○ It receives SQL packets from numerous clients and dispatches them to one or more servers. ○ It waits for responses from one or more servers, and sends a response to the client. ○ The number of clients may be large. ● Basic alternatives: ○ One thread per client. ○ Asynchronous I/O and a fixed number of threads. ● The reason is lost in the mists of history, but MaxScale is implemented using the latter approach.
  14. 14. Asynchronous I/O in Principle ● Basically: setup(); while (true) { io_events events = wait_for_io(); handle_events(events); } ○ setup(): create some file descriptors, make them non-blocking, and add them to some waiting mechanism. ○ wait_for_io(): wait for something to happen to those file descriptors. ○ handle_events(): handle whatever happened. ● When there is no activity, the thread is idle. ● When something happens, the thread wakes up and handles the events. ○ May involve initiating asynchronous I/O whose result is later reported as an event. ● Once the event has been handled, the thread returns to waiting for events.
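As a concrete, hedged instance of the setup(), wait_for_io() and handle_events() steps, the sketch below runs a single pass of such a loop over a pipe, using Linux epoll as the waiting mechanism. The function name `run_event_loop_once` and the use of a pipe in place of a client socket are assumptions made for the demonstration. A real worker would loop until shutdown.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: one pass of the loop over a pipe.
 * Returns the number of bytes handled, or -1 on error. */
long run_event_loop_once(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    long handled = -1;
    int epoll_fd = epoll_create1(0);

    /* setup(): make the descriptor non-blocking and add it to the
     * waiting mechanism. */
    int flags = fcntl(fds[0], F_GETFL, 0);
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = fds[0];

    if (epoll_fd != -1 && flags != -1
        && fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != -1
        && epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fds[0], &ev) != -1
        && write(fds[1], "hello", 5) == 5) { /* simulate client activity */

        /* wait_for_io(): block until something happens (at most 1 s here). */
        struct epoll_event events[8];
        int n = epoll_wait(epoll_fd, events, 8, 1000);

        /* handle_events(): drain each ready descriptor until read()
         * reports that it would block. */
        handled = 0;
        for (int i = 0; i < n; ++i) {
            char buffer[64];
            ssize_t got;
            while ((got = read(events[i].data.fd, buffer, sizeof(buffer))) > 0)
                handled += got;
        }
    }

    if (epoll_fd != -1)
        close(epoll_fd);
    close(fds[0]);
    close(fds[1]);
    return handled;
}
```

Because the read end is non-blocking, the inner `read` loop ends with a failed read instead of stalling the thread, which is what allows one thread to serve many descriptors.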
  15. 15. So How do You Wait on Events? ● select ○ The original mechanism, been around since the beginning of time. Fixed size limit on the number of descriptors. O(N) ● poll ○ No limit on number of descriptors. O(N). ● epoll ○ More complex to set up. No limit on number of descriptors. All changes via system calls, i.e. thread safe. O(1). Epoll is not a better poll, it’s different.
  16. 16. MaxScale epoll Setup ● At startup, socket creation is triggered by the presence of listeners. [TheListener] type=listener service=TheService ... port=4009 epoll_fd = epoll_create(...); so = socket(...); ... listen(so); ... struct epoll_event ev; ev.events = events; ev.data = data; epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);
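A fuller version of the listener setup might look like the sketch below. It is not MaxScale's code: the function name `add_listener`, the choice of the loopback address and the decision to store the descriptor in `data.fd` are assumptions made for illustration.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: creates a non-blocking listening socket on
 * 127.0.0.1:port and registers it with epoll_fd.
 * Returns the socket, or -1 on error. Port 0 picks an ephemeral port. */
int add_listener(int epoll_fd, unsigned short port)
{
    int so = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (so == -1)
        return -1;

    int on = 1;
    setsockopt(so, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    struct epoll_event ev;
    ev.events = EPOLLIN; /* readable means a client is waiting to be accepted */
    ev.data.fd = so;

    if (bind(so, (struct sockaddr *)&addr, sizeof(addr)) == -1
        || listen(so, SOMAXCONN) == -1
        || epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev) == -1) {
        close(so);
        return -1;
    }
    return so;
}
```

A caller would create the epoll instance first, for example with `epoll_create1(0)`, and then call `add_listener(epoll_fd, 4009)` once per configured listener.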
  17. 17. Client Connection Client MaxScale while (!shutdown) { struct epoll_event events[MAX_EVENTS]; int nfds = epoll_wait(epoll_fd, events, ...); for (int i = 0; i < nfds; ++i) { struct epoll_event* event = &events[i]; handle_event(event); } }
  18. 18. Client Connection, cont’d void handle_event(struct epoll_event* event) { if (event->events & EPOLLIN) { if (descriptor was a listening socket) { handle_accept(event); } else { handle_read(event); } } if (event->events & ...) { ... } }
  19. 19. Handle Accept void handle_accept(struct epoll_event* event) { for (all servers in service) { int so; connect each server; struct epoll_event ev; ev.events = events; ev.data = data; epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev); } } [TheService] type=service router=readwritesplit servers=server1,server2 ...
  20. 20. Handle Read void handle_read(struct epoll_event* event) { char buffer[MAX_SIZE]; read(sd, buffer, sizeof(buffer)); /* Figure out what to do with the data: wait for more; authenticate; send to master, send to slaves, send to all; ... */ ... } ● When the servers reply, the response will be handled in a similar manner.
  21. 21. Binding Things Together [Diagram: Client, DCB, Session, DCB (1..*), Server; across the plugin boundary the Session is linked to an MXS_ROUTER_SESSION, here a RWSplitSession] ● A DCB is the representation of a connection/descriptor. ● A Session object ties together the client connection and all server connections associated with that client connection.
  22. 22. MaxScale 1.0 - 2.0 epoll_fd = epoll_create(...); Thread 1 while (!shutdown) { epoll_wait(epoll_fd, ...); ... } Thread 2 while (!shutdown) { epoll_wait(epoll_fd, ...); ... } Thread 3 while (!shutdown) { epoll_wait(epoll_fd, ...); ... }
  23. 23. Problems with One epoll Instance ● There are multiple socket descriptors for each client session. ○ One for the client connection. ○ One for every backend server. ● It is possible that an event on each of those is concurrently handled by as many threads. ○ A client has issued a request that has been sent to all servers. ○ A response arrives from each server at the same time as the client closes its connection. ● Session data ends up being manipulated by many threads concurrently.
  24. 24. Implications ● Lots of locks and locking was needed. ○ Primarily spinlocks, intended to be held for brief periods of time. ● Events for a socket may be reported to a thread while another thread was still handling earlier events for that same socket. ○ Event extraction and event handling had to be decoupled => locking. ● Very hard to be sure no deadlocks could occur. ● Very hard to be sure no races were possible. ● Very hard to program, as it was not always obvious what could and what could not occur concurrently. ● The locks started to hurt under high load and lots of clients.
  25. 25. MaxScale 2.1 Thread 1 epoll_fd = epoll_create(...); while (!shutdown) { epoll_wait(epoll_fd, ...); ... } Thread 2 epoll_fd = epoll_create(...); while (!shutdown) { epoll_wait(epoll_fd, ...); ... } Thread 3 epoll_fd = epoll_create(...); while (!shutdown) { epoll_wait(epoll_fd, ...); ... }
  26. 26. MaxScale 2.1 ● Each thread has an epoll instance of its own. ● When a client connects: ○ The thread that handles the client will also handle all communication with all backends on behalf of that client. ○ All descriptors belonging to a particular client session are only added to the epoll instance of the thread in question. ● Listening sockets are still an exception; they are added to the poll set of all threads. ○ After accepting, the client socket is then moved in a round-robin fashion to some thread. ● Huge impact on the performance.
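The round-robin hand-off of an accepted client socket can be sketched as follows. The type `worker_pool_t`, the functions `assign_client` and `pool_close`, and the atomic counter are all assumptions for illustration; the slides do not show how MaxScale 2.1 implements the distribution.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_WORKERS 16

/* Hypothetical pool: one epoll instance per worker thread. */
typedef struct {
    int epoll_fds[MAX_WORKERS];
    size_t n_workers;
    atomic_size_t next; /* shared counter, since any thread may accept */
} worker_pool_t;

/* Adds client_fd to the epoll instance of the next worker in turn.
 * Returns the index of the chosen worker, or -1 on error. */
int assign_client(worker_pool_t *pool, int client_fd)
{
    assert(pool->n_workers > 0 && pool->n_workers <= MAX_WORKERS);
    size_t i = atomic_fetch_add(&pool->next, 1) % pool->n_workers;

    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = client_fd;
    if (epoll_ctl(pool->epoll_fds[i], EPOLL_CTL_ADD, client_fd, &ev) == -1)
        return -1;
    return (int)i;
}

/* Closes every per-worker epoll instance. */
void pool_close(worker_pool_t *pool)
{
    for (size_t i = 0; i < pool->n_workers; ++i)
        close(pool->epoll_fds[i]);
}
```

Note that `assign_client` modifies another thread's epoll instance from the accepting thread. That is safe, because epoll changes go through system calls, but it is a form of coupling between threads.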
  27. 27. MaxScale 2.2 ● Remove the last traces of inter-thread communication. ● Basic problem: How to distribute new connections among existing threads? ● New connections should be distributed across different threads in a roughly even manner. ● All ports must be treated in the same way. ○ So a particular port cannot e.g. be permanently assigned to a specific thread.
  28. 28. epoll ● Two ways events can be triggered: ○ Edge-triggered, reported when something has happened. ○ Level-triggered, reported when something is available. Inactive Active Edge triggered Level triggered
  29. 29. Implication of edge/level triggered epoll 1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance. 2. A pipe writer writes 2 kB of data on the write side of the pipe. 3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor. 4. The pipe reader reads 1 kB of data from rfd. 5. A call to epoll_wait(2) is done. ● If rfd was added using EPOLLET (edge-triggered) then the call at 5 will hang. ● EPOLLET requires ○ Non-blocking descriptors. ○ Events can be waited for (epoll_wait) only after read or write return EAGAIN. Example straight from $ man epoll
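The five steps from the man page can be reproduced with a short program. The sketch below is an illustration, not MaxScale code, and the function name `second_wait_result` is hypothetical. It uses a timeout of 0 for the second `epoll_wait` so that the edge-triggered case returns 0 events instead of hanging.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo of steps 1 to 5. Pass 0 for level-triggered or
 * EPOLLET for edge-triggered registration. Returns the number of
 * events reported by the second epoll_wait, or -1 on error. */
int second_wait_result(uint32_t extra_flags)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    int result = -1;
    int epoll_fd = epoll_create1(0);
    char block[2048];
    memset(block, 'x', sizeof(block));

    struct epoll_event ev, out;
    ev.events = EPOLLIN | extra_flags;
    ev.data.fd = fds[0];

    if (epoll_fd != -1
        && epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fds[0], &ev) == 0 /* step 1 */
        && write(fds[1], block, 2048) == 2048                   /* step 2 */
        && epoll_wait(epoll_fd, &out, 1, 0) == 1                /* step 3 */
        && read(fds[0], block, 1024) == 1024) {                 /* step 4 */
        /* Step 5: 1 kB is still unread. Level-triggered reports it
         * again; edge-triggered stays silent until new data arrives. */
        result = epoll_wait(epoll_fd, &out, 1, 0);
    }

    if (epoll_fd != -1)
        close(epoll_fd);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

Calling `second_wait_result(0)` and `second_wait_result(EPOLLET)` side by side shows the difference: the level-triggered registration reports the descriptor again, while the edge-triggered one does not.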
  30. 30. Two Kinds of Descriptors ● Listening sockets that all threads should handle. ● Sockets related to a client session that only a particular thread should handle. ● What’s the problem with the listening sockets being in the epoll instance of each thread (as in MaxScale 2.1)? ○ The listening socket, too, must be non-blocking and added using EPOLLET. ○ A thread that returns from epoll_wait must call accept on the listening socket until it returns EWOULDBLOCK. ○ So, either we must accept that a thread suddenly may have to deal with a large number of clients (if there is a sudden surge) or a thread must be able to offload an accepted client socket to another thread.
  31. 31. What we Want ● No thread should need to accept more than one client at a time. ○ That is, EPOLLET cannot be used. ● We should not have to manipulate the epoll instance of a thread from outside the thread. ○ Listening sockets are a global resource while sockets related to a client session are thread local resources. ○ This also makes it easier to increase and decrease the number of threads at runtime.
  32. 32. But epoll instances can also be waited for. ● If an epoll file descriptor has events waiting, then it will indicate that as being readable. ● So, ○ if a file descriptor is added to an epoll instance, and ○ the descriptor of that epoll instance is added to a second epoll instance, then ○ when something happens to the file descriptor, a thread blocked in an epoll_wait call on the second epoll instance will return. ● If the thread now calls epoll_wait on the first epoll instance, it will return with the actual file descriptor on which some change has occurred.
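The nesting described above can be checked with a small sketch. Here a pipe stands in for a listening socket, `inner` plays the role of the first (shared) epoll instance and `outer` the second one that a thread blocks on. The function name `nested_epoll_demo` and these variable names are hypothetical.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo. Returns 1 if an event on the pipe is first seen
 * through the outer instance and then retrieved from the inner one,
 * 0 if not, and -1 on a setup error. */
int nested_epoll_demo(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    int result = -1;
    int inner = epoll_create1(0); /* first instance: holds the descriptor */
    int outer = epoll_create1(0); /* second instance: holds the first */

    struct epoll_event ev_fd, ev_inner, out;
    ev_fd.events = EPOLLIN;
    ev_fd.data.fd = fds[0];
    ev_inner.events = EPOLLIN;
    ev_inner.data.fd = inner;

    if (inner != -1 && outer != -1
        && epoll_ctl(inner, EPOLL_CTL_ADD, fds[0], &ev_fd) == 0
        && epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev_inner) == 0
        && write(fds[1], "x", 1) == 1) {
        /* The outer wait reports the inner instance as readable; a
         * zero-timeout wait on the inner one yields the descriptor. */
        result = epoll_wait(outer, &out, 1, 1000) == 1
                 && out.data.fd == inner
                 && epoll_wait(inner, &out, 1, 0) == 1
                 && out.data.fd == fds[0];
    }

    if (inner != -1)
        close(inner);
    if (outer != -1)
        close(outer);
    close(fds[0]);
    close(fds[1]);
    return result;
}
```

With several threads each waiting on their own outer instance, more than one may wake up for the same event, so the zero-timeout wait on the inner instance can legitimately return nothing.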
  33. 33. MaxScale 2.2 g_fd = epoll_create(...); void add_listening_socket(int sd) { struct epoll_event ev; ev.events = EPOLLIN; /* NOT EPOLLET */ epoll_ctl(g_fd, EPOLL_CTL_ADD, sd, &ev); } void add_client_socket(int l_fd, int sd) { struct epoll_event ev; ev.events = ... | EPOLLET; epoll_ctl(l_fd, EPOLL_CTL_ADD, sd, &ev); } Thread N l_fd = epoll_create(...); struct epoll_event ev; ev.events = EPOLLIN; /* NOT EPOLLET */ epoll_ctl(l_fd, EPOLL_CTL_ADD, g_fd, &ev); while (!shutdown) { epoll_wait(l_fd, ...); ... }
  34. 34. Client Connecting typedef void (*handler_t)(struct epoll_event*); while (!shutdown) { struct epoll_event events[MAX_EVENTS]; int nfds = epoll_wait(epoll_fd, events, ...); for (int i = 0; i < nfds; ++i) { struct epoll_event* event = &events[i]; handler_t handler = get_handler(event); handler(event); } } void handle_epoll_event(struct epoll_event* event) { struct epoll_event events[1]; int fd = get_fd(event); /* fd == g_fd */ if (epoll_wait(fd, events, 1, 0) == 1) { /* 0 timeout; another thread may already have taken the event */ struct epoll_event* ready = &events[0]; handler_t handler = get_handler(ready); handler(ready); } } void handle_accept_event(struct epoll_event* event) { int sd = get_fd(event); int cd; while ((cd = accept(sd, NULL, NULL)) != -1) { ... add_client_socket(l_fd, cd); } }
  35. 35. get_handler(event) and get_fd(event) ? typedef union epoll_data { void *ptr; int fd; uint32_t u32; uint64_t u64; } epoll_data_t; int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout); struct epoll_event { uint32_t events; epoll_data_t data; }; ● When adding a descriptor to an epoll instance you can associate a value. ● When something occurs you get that value back. ○ If you do not store the fd, you do not know what fd the event relates to. ○ If you store the fd, you cannot store anything else.
  36. 36. Storing More Context With an Event typedef uint32_t (*mxs_poll_handler_t)(struct mxs_poll_data* data, int wid, uint32_t events); typedef struct mxs_poll_data { mxs_poll_handler_t handler; /*< Handler for this particular kind of mxs_poll_data. */ } MXS_POLL_DATA; typedef struct dcb { MXS_POLL_DATA poll; int fd; ... } DCB; static uint32_t dcb_poll_handler(MXS_POLL_DATA *data, ...) { DCB *dcb = (DCB*)data; ... } DCB* create_dcb(...) { DCB* dcb = alloc_dcb(...); dcb->poll.handler = dcb_poll_handler; return dcb; } class Worker : private MXS_POLL_DATA { public: Worker() { MXS_POLL_DATA::handler = &Worker::epoll_handler; ... } static uint32_t epoll_handler(MXS_POLL_DATA* data, ...) { return ((Worker*)data)->handler(...); } int fd; };
  37. 37. Adding and Extracting Events void poll_add_fd(int fd, uint32_t events, MXS_POLL_DATA* pData) { struct epoll_event ev; ev.events = events; ev.data.ptr = pData; epoll_ctl(m_epoll_fd, EPOLL_CTL_ADD, fd, &ev); } DCB* dcb = ...; poll_add_fd(dcb->fd, ..., &dcb->poll); Worker* pWorker = ...; poll_add_fd(pWorker->fd, ..., pWorker); while (!should_shutdown) { struct epoll_event events[MAX_EVENTS]; int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1); for (int i = 0; i < n; ++i) { MXS_POLL_DATA* data = (MXS_POLL_DATA*)events[i].data.ptr; data->handler(data, ..., events[i].events); } } ● Each worker thread sits in this loop.
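The pattern on the last two slides, a handler pointer placed first in a struct and stored in the `ptr` member of the epoll data, can be condensed into a self-contained sketch. The names `poll_data_t`, `connection_t`, `connection_init`, `add_fd` and `dispatch_once` are hypothetical simplifications, not MaxScale's types, and the handler signature is reduced to two arguments.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

struct poll_data;
typedef uint32_t (*poll_handler_t)(struct poll_data *data, uint32_t events);

/* Common header; must be the first member of anything given to epoll. */
typedef struct poll_data {
    poll_handler_t handler;
} poll_data_t;

/* Hypothetical per-connection context, standing in for a DCB. */
typedef struct {
    poll_data_t poll; /* first member, so a poll_data_t* is also a connection_t* */
    int fd;
    long bytes_seen;
} connection_t;

static uint32_t connection_handler(poll_data_t *data, uint32_t events)
{
    connection_t *conn = (connection_t *)data;
    char buffer[64];
    if (events & EPOLLIN) {
        ssize_t got = read(conn->fd, buffer, sizeof(buffer));
        if (got > 0)
            conn->bytes_seen += got;
    }
    return events;
}

void connection_init(connection_t *conn, int fd)
{
    conn->poll.handler = connection_handler;
    conn->fd = fd;
    conn->bytes_seen = 0;
}

/* Registers fd and remembers the context pointer instead of the fd. */
int add_fd(int epoll_fd, int fd, uint32_t events, poll_data_t *data)
{
    struct epoll_event ev;
    ev.events = events;
    ev.data.ptr = data;
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);
}

/* One pass of the worker loop: each event carries its own handler. */
int dispatch_once(int epoll_fd, int timeout_ms)
{
    struct epoll_event events[8];
    int n = epoll_wait(epoll_fd, events, 8, timeout_ms);
    for (int i = 0; i < n; ++i) {
        poll_data_t *data = events[i].data.ptr;
        assert(data != NULL && data->handler != NULL);
        data->handler(data, events[i].events);
    }
    return n;
}
```

Since the context struct carries its own descriptor, storing the pointer loses nothing compared with storing the fd, and the loop never needs to know what kind of object it is dispatching to.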
  38. 38. Performance
  39. 39. MaxScale 2.0.5 (up to) Hardware: Two physical servers, 16 cores / 32 hyperthreads, 128GB RAM and an SSD drive, connected using GbE LAN. One runs MaxScale and sysbench, the other 4 MariaDB servers set up as one Master and 3 Slaves. Workload: OLTP read-only, 100 simple selects per iteration, no transaction boundaries. ● direct: Sysbench uses all servers directly in round-robin fashion. ● rcr: MaxScale readconnroute router. ● rws: MaxScale readwritesplit router.
  40. 40. MaxScale 2.1.0 ● The architectural change that allowed the removal of a large number of locks provided a dramatic improvement for readconnroute. ● No change for readwritesplit. ● With a small number of clients the newly introduced cache improved the performance; with a large number it had no impact.
  41. 41. Query Classification ● When read/write splitting, MaxScale must parse the statement. ○ Does it need to be sent to the master, to some slave or to all servers? ● The classification is done using a significantly modified parser from sqlite. ● In each thread, the parsing is done using a thread specific in-memory database (Thread 1: sqlite, Thread 2: sqlite). ● No shared data, so there should be no contention. ● However, sqlite was not built using the right flags, so there was serialization going on.
  42. 42. Data Collection ● While parsing a statement, a fair amount of information was collected. ○ What tables and columns are accessed. What functions are called. Etc. ● Allocating memory for that information did not come without a cost. ○ Basically only the firewall filter uses that information. ● Now no information is collected by default; a filter that is interested in that information must request it explicitly. qc_parse_result_t parse_result = qc_parse(stmt, QC_COLLECT_ALL);
  43. 43. Custom Parser ● Many routers and filters need to know whether a transaction is ongoing. ● Up until MaxScale 2.1.1 that implied that the statements had to be parsed using the query classifier. ● For MaxScale 2.1.2 we introduced a custom parser that only detects statements affecting the autocommit mode & transaction state. ○ Much faster than full parsing. ● In MaxScale 2.3 we will rely upon the server telling the autocommit mode & transaction state. ○ Implies that changes performed via prepared statements or functions will also be detected.
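To illustrate why a dedicated scanner is much cheaper than a full parse, here is a minimal sketch of a tracker that only looks for statements affecting the autocommit mode and transaction state. It is not the MaxScale 2.1.2 parser. The names `trx_state_t`, `trx_track` and `trx_is_open` are hypothetical, and the sketch ignores comments, `ROLLBACK TO SAVEPOINT`, implicit commits and multi-statement packets.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical session state tracked by the scanner. */
typedef struct {
    bool autocommit;
    bool in_trx;
} trx_state_t;

static const char *skip_ws(const char *s)
{
    while (isspace((unsigned char)*s))
        ++s;
    return s;
}

/* If s (after leading whitespace) starts with the upper-case word,
 * compared case-insensitively, returns a pointer just past it;
 * otherwise returns NULL. No word-boundary check is made. */
static const char *eat(const char *s, const char *word)
{
    s = skip_ws(s);
    for (; *word != '\0'; ++word, ++s) {
        if (toupper((unsigned char)*s) != *word)
            return NULL;
    }
    return s;
}

/* Updates the state from one statement; anything else is ignored. */
void trx_track(trx_state_t *st, const char *stmt)
{
    const char *p;
    assert(st != NULL && stmt != NULL);
    if (eat(stmt, "BEGIN") != NULL
        || ((p = eat(stmt, "START")) != NULL && eat(p, "TRANSACTION") != NULL)) {
        st->in_trx = true;
    } else if (eat(stmt, "COMMIT") != NULL || eat(stmt, "ROLLBACK") != NULL) {
        st->in_trx = false;
    } else if ((p = eat(stmt, "SET")) != NULL
               && (p = eat(p, "AUTOCOMMIT")) != NULL
               && (p = eat(p, "=")) != NULL) {
        p = skip_ws(p);
        if (*p == '1')
            st->autocommit = true;
        else if (*p == '0')
            st->autocommit = false;
    }
}

/* A transaction is considered ongoing if one was started explicitly
 * or if autocommit is off. */
bool trx_is_open(const trx_state_t *st)
{
    return st->in_trx || !st->autocommit;
}
```

A scanner of this kind cannot see changes made through prepared statements or stored functions, which is the limitation the server-reported state is meant to remove.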
  44. 44. MaxScale 2.1.3 versus 2.0.5
  45. 45. Cache ● The cache was introduced in 2.1.0 but the performance was less than satisfactory. ● Problem was caused by parsing. ○ The cache parsed all statements to detect non-cacheable statements. ○ E.g. SELECT CURRENT_DATE(); ● Added possibility to declare that all SELECT statements are cacheable. [TheCache] type=filter module=cache ... selects=assume_cacheable ● Huge impact on the performance.
  46. 46. MaxScale ReadWriteSplit ● In the best case the performance of MaxScale 2.1.3 is three times that of MaxScale 2.0.5, or eight times if caching is used.
  47. 47. The Importance of Early Customer Feedback ● MaxScale caches user information so that it can authenticate users. ● In MaxScale 2.2.0 the user database was shared between threads (Thread 1 and Thread 2 both use a single Users database). ● Worked fine when connection attempts were relatively rare and sessions were relatively long. ● If a user is not found, any thread may refresh the user data from the server and update the database. ● All access must use locks.
  48. 48. User Report ● With MaxScale 2.2.0 Beta a user reported that he got only 6000 qps. function event(thread_id) db_connect() rs = db_query("select 1;") db_disconnect() end ● The reason turned out to be thread contention in relation to the user database. ● We split it, so that each thread has its own user database (Thread 1: Users, Thread 2: Users). ● User feedback after the change: “I just tested the same case, got 361437 queries/second, I think it works for us”
  49. 49. What about MaxScale 2.2.2? ● No real difference, which is good, because 2.2 does more than 2.1. ○ E.g. must catch “SET SESSION SQL_MODE=ORACLE”
  50. 50. Summary
  51. 51. Summary ● MaxScale 0.1 -> 2.0 ○ One epoll instance that all worker threads wait on. ○ Any thread can handle anything. ○ Lots of locking needed, and lots of potential for hard-to-resolve races. ○ Performance problems. ● MaxScale 2.1 ○ One epoll instance per worker thread. ○ Any thread can accept, but must distribute the client socket to a particular thread. ○ All activity related to a particular session handled by one thread. ○ Significantly reduced need for locking and race risk effectively eliminated. ○ Good performance. ● MaxScale 2.2 ○ One epoll instance for “shared” descriptors (listening sockets). ○ One epoll instance per worker thread. ○ All activity related to a particular session handled by one thread. ○ Even less locking needed. ○ Good performance.
  52. 52. Where do We Go From Here? ● The architectural evolution of MaxScale can be summarized as: ○ Decrease the explicit coupling between the worker threads. ■ If that leads to duplicate work or increased memory usage, fine. ● We are likely to continue moving in that direction, so that we conceptually will end up running N “mini”-MaxScales in parallel, completely oblivious of each other. ● That would also make it easy to allow the starting and stopping of threads while MaxScale is running.