Auto Europe's ongoing journey with MariaDB and open source

MariaDB plc
“How Do You Solve A Problem Like Maria?”
AutoEurope’s Ongoing Journey with Open Source
Thomas J. Girsch
Lead System Architect, AutoEurope
What We’ll Cover
1. About AutoEurope (AE), About Me
2. The AE Environment Before MariaDB
3. Our Original Use Case for MariaDB: Content Management
4. Additional Use Cases: Microservices, Session Tracking
5. Wins, Challenges, Solutions and Open Issues
6. Our MariaDB Wish List
About AutoEurope
● We are the global leader in international car rental services and have been helping travelers find the perfect rental vehicle for their trips around the world for over 60 years
● We have roughly 600 employees worldwide, including approximately 250 call center employees, working in offices and call centers spread across three continents
● Through our car rental partners, we offer car rentals in more than 20,000 locations in over 180 countries worldwide
● Visit https://www.autoeurope.com/about-us/ to learn more
About Me
● Nearly three decades of IT experience
● That hurt to type; I’m getting old!
● Certified in IBM Informix and MongoDB
● Both are a bit dated and need updating
● Proficient in Unix/Linux administration
● Specialize in high availability, DR and OLTP performance
● NOT A PROGRAMMER!
● Outside of work, volunteer as Vice Chair of the Rail Passengers Association
AutoEurope AM (Ante Maria)
● IBM Informix for most critical systems
● Quoting and reservations
● Web session tracking
● Legacy EDS system
● Ancient SAP
● Microsoft SQL Server for many ancillary systems
● Content management (FarCry)
● Additional web session tracking
● Supplier interfaces
● MongoDB for application logging
● Also InfluxDB, REDIS and a little bit of very old MySQL
SQL Server at AE
● One single VM (NO REDUNDANCY!)
● If SQL Server goes down, our websites don’t respond
● Thankfully, this rarely happens
● If SQL Server gets bogged down, websites respond poorly
● This was happening a lot
● Why not add redundancy? Insufficient licensing
● Too expensive to add more
● Lacked in-house expertise
● Desire to reduce Microsoft / Windows footprint
IBM Informix at AE
● Heavily invested, with lots of licenses
● Production: dedicated physical servers in two data centers
● We’ll cover in more detail (with pictures!) in a bit
● Dev/QA/Test: small VMs
● Small “Shopping Engine” VMs
● Why not do everything here?
● Very expensive to add licenses
● Most third-party tools don’t support it
● Overkill for many applications
● Use the right tool for the job
Web Content Management at AE
● Manages content for nearly 70 websites
– Text of various pages
– Locations of images and other graphic/support files
– Translations to multiple other languages
● Resiliency is key! (No web = no bookings = no money)
– Must be redundant within the data center
– Must be redundant across data centers
● Workload distribution is a nice-to-have
Enter MariaDB
● Web team migrating to new content management platform, MuraCMS
● Initially started the migration on SQL Server, but we want to eliminate MS
● Migrating to IBM Informix not an option
● Mura supports two DB platforms
– Microsoft SQL Server
– MySQL
● MariaDB = MySQL – Oracle: MariaDB!
● But is MariaDB resilient enough?
MariaDB Professional Services
● AE had no in-house MySQL or MariaDB knowledge
● Needed to get up and running quickly
● Solution: MariaDB Contractor
– Recommend topology to fit our needs
– Do initial configuration of MariaDB
– Cross-train AE DBAs on MariaDB setup and basic operation
– Be available for future questions
Informix Replication
● “High-Availability” Data Replication, or HDR
– One primary (required), called PRI
– One “full” secondary (near-sync), called HDR
– “Shared Disk” secondaries, called SDS
– “Remote Standby Secondaries,” called RSS
● At AE we have 1xPRI, 1xHDR and 2xRSS (see next slide)
AE’s Informix Setup
[Diagram]
Portland (Main DC): P1 - Primary, P2 - Secondary
Brunswick (DR DC): B1 - DR, B2 - DR (delay)
Labeled in the diagram: Connection Managers
MariaDB Replication
● Traditional (“Master-Slave” or “Primary Target”)
– Simple to set up and administer
– One primary with one or more targets
– Failover is manual (auto-failover technically supported, but not yet recommended)
● Galera Cluster (Multi-Master*)
– Load-balancing and write-anywhere*
– Typically three MariaDB servers participate
– No “failover” per se, because every server is the master
– Requires very fast network connections among all participating nodes (usually 3)
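One way to see why three is the usual node count: a Galera cluster keeps accepting writes only while a majority of its nodes can still reach each other. The sketch below assumes every node carries equal weight; the function names are made up for illustration and are not part of any MariaDB or Galera API.

```python
def has_quorum(total_nodes: int, reachable_nodes: int) -> bool:
    """Return True when the reachable nodes form a strict majority."""
    return reachable_nodes * 2 > total_nodes


def tolerated_failures(total_nodes: int) -> int:
    """Largest number of nodes that can be lost while a majority remains."""
    return (total_nodes - 1) // 2
```

Under this assumption, two nodes tolerate no failures, three tolerate one, and a fourth node adds no extra tolerance over three.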
Galera at AE
[Diagram]
Portland (Main DC): “P1” Server (NOT USED), “P2” Server
Brunswick (DR DC): “B1” Server, “B2” Server
Labeled in the diagram: MaxScale VMs
MaxScale at AE
● Three MaxScale VMs
– Load balancing
– Redundancy
● Used ONLY to route connections
– Not filtering or using firewall features
– Slows things down a bit, because our apps are very “talky”
– A redirect mode would be nice
● Initially configured in “ReadConnRoute” mode
Migrating from SQL Server to MariaDB
● MariaDB consultant set up SymmetricDS
– Steep learning curve
– Allowed us to run in parallel, but lots of holes
– Still better than doing a lengthy outage w/manual migration
– Encoding issues were a big problem
Trouble in Paradise, Volume 1
● Some queries were very slow
– Started by enabling query cache, which helped, but there are big downsides
– Also enabled slow query log: VERY useful
– Used ANALYZE FORMAT=JSON to see query paths and fix issues
– Once the worst offenders were fixed, disabled the query cache
● Spoiler: MariaDB seems to have trouble recognizing (and ignoring) BAD indexes, and Mura indexes nearly every column
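Finding the "worst offenders" in a slow query log can be sketched roughly as below. This is a minimal illustration, not the tooling AE used. It assumes the common text log layout in which each entry has a `# Query_time: ... Rows_examined: ...` header line followed by the statement. The names `SlowQuery`, `parse_slow_log` and `worst_offenders` are hypothetical.

```python
import re
from typing import List, NamedTuple


class SlowQuery(NamedTuple):
    query_time: float
    rows_examined: int
    sql: str


# Header line that carries the timing figures for one log entry.
HEADER = re.compile(
    r"^# Query_time: (?P<qt>[\d.]+).*?Rows_examined: (?P<rows>\d+)"
)


def parse_slow_log(text: str) -> List[SlowQuery]:
    """Collect one SlowQuery per '# Query_time' header found in the log text."""
    raw = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            raw.append([float(match.group("qt")), int(match.group("rows")), []])
        elif not raw or line.startswith("#"):
            continue  # other comment headers, or text before the first entry
        elif line.startswith("SET timestamp=") or line.lower().startswith("use "):
            continue  # bookkeeping statements, not the slow query itself
        elif line.strip():
            raw[-1][2].append(line.strip())
    return [SlowQuery(qt, rows, " ".join(sql)) for qt, rows, sql in raw if sql]


def worst_offenders(text: str, limit: int = 10) -> List[SlowQuery]:
    """Return the slowest statements first."""
    ranked = sorted(parse_slow_log(text), key=lambda q: q.query_time, reverse=True)
    return ranked[:limit]
```

Each statement this surfaces is a candidate for prefixing with `ANALYZE FORMAT=JSON` to inspect the plan the server actually chose.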
Trouble in Paradise, Volume 2
● MariaDB servers would occasionally lock up (Very Bad)
● No web content = no web bookings = no money made
● MuraCMS is usually read-heavy but a “publish” operation is very write-heavy
– Problem was caused by deadlocking in Galera Cluster
– ReadConnRoute is bad with Galera, unless you can guarantee that multiple sessions won’t update the same table at the same time
– Solution: Use “ReadWriteSplit” mode
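The idea behind the fix can be shown with a toy router: every write goes to one designated node, so two sessions never update the same table on different Galera nodes at once, while reads are still spread across the other nodes. This is only a sketch of the concept and not how MaxScale is implemented; the real router also has to handle transactions, session state and more. `ToySplitRouter` and the node labels are hypothetical.

```python
import itertools
from typing import List

# Statements treated as reads in this toy model (an assumption for illustration).
READ_KEYWORDS = ("select", "show")


class ToySplitRouter:
    """Send all writes to a single node and rotate reads over the rest."""

    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.writer = nodes[0]
        # With a single node, that node serves reads as well.
        self._readers = itertools.cycle(nodes[1:] or nodes)

    def route(self, statement: str) -> str:
        words = statement.split()
        if words and words[0].lower() in READ_KEYWORDS:
            return next(self._readers)
        # Anything not recognised as a read is treated as a write.
        return self.writer
```

A write-heavy "publish" then lands entirely on the writer node, which avoids the cross-node conflicts that surfaced as deadlocks.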
Trouble in Paradise, Volume 3
● MaxScale was not playing nicely with JDBC connection pools
● Mystery disconnects: “Could not find master”
– The master hadn’t gone anywhere!
– The errors being reported were misleading
● Could work around by increasing max_sescmd_history; but why?
– Bug: MaxScale was not resetting when user context switched
– Fixed in 2.3.4
How’s it Working Now?
● All in all, extremely well
● During peak times, over 600 concurrent user sessions
● Millions of queries per day
● Very few performance complaints
– And when we get them, the problem is usually NOT the DB
– long_query_time is set to 1 second (very aggressive), and we only see about a dozen slow queries per day
● No unscheduled outages in months
Use Case #2: Text Search
● Complex searches of location information
● Currently being migrated from SQL Server
● Uses FULLTEXT indexes
● Very aggressive: token_size = 2 (default is 4)
● Because of the very small word size, need to set up a “stopwords” table
● Coexists with Mura on our Galera cluster
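The link between a small token size and the need for stopwords can be shown with a toy tokenizer. It is not MariaDB's full-text parser, only an approximation of the two knobs described above; `index_tokens` and the sample phrase are made up for illustration.

```python
import re
from typing import Iterable, List


def index_tokens(text: str, min_token_size: int, stopwords: Iterable[str] = ()) -> List[str]:
    """Return the words that would be indexed under the given settings."""
    blocked = {word.lower() for word in stopwords}
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_token_size and w not in blocked]


phrase = "Pick up at LA or NY airport in the US"

# A minimum of 4 drops short location codes such as "la" and "ny".
long_only = index_tokens(phrase, min_token_size=4)

# A minimum of 2 keeps them, but also lets in noise words,
# which is what the stopword list then has to filter out.
noisy = index_tokens(phrase, min_token_size=2)
filtered = index_tokens(phrase, min_token_size=2,
                        stopwords=["up", "at", "or", "in", "the"])
```

With the stopword list in place, the short location codes stay searchable while the two-letter filler words are kept out of the index.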
Use Case #3: Session Booking Information
● Stores huge JSON documents
– Web session history
– Information needed to turn a quote into a booking
● Extremely write-heavy
● Use PAGE_COMPRESSED and compressed communication
● Highly transient data (12 hour maximum useful life)
● For this, we use traditional replication on separate VMs (not the Galera nodes)
● The JSON can contain sensitive data, so we use at-rest encryption and SSL
● Important to disable dedup and compression on the storage side
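One plausible reading of the last point: once the database has already compressed (and encrypted) the data, the storage layer has little left to squeeze out and just spends CPU trying. The sketch below illustrates this with `zlib` standing in for whatever algorithm the server and storage actually use; the document layout and field names are invented for the example.

```python
import json
import random
import zlib


def make_session_document(seed: int, events: int = 200) -> bytes:
    """Build a made-up JSON session history with repeated keys and varied values."""
    rng = random.Random(seed)
    history = [
        {
            "step": i,
            "page": rng.choice(["search", "quote", "extras", "checkout"]),
            "rate_id": rng.randrange(10**9),
            "price": round(rng.uniform(20, 900), 2),
        }
        for i in range(events)
    ]
    return json.dumps({"session": seed, "history": history}).encode("utf-8")


def compression_report(document: bytes) -> dict:
    """Compare sizes after one and two passes of compression."""
    once = zlib.compress(document)
    twice = zlib.compress(once)  # stands in for storage-side compression
    return {
        "raw": len(document),
        "compressed_once": len(once),
        "compressed_twice": len(twice),
    }


report = compression_report(make_session_document(seed=1))
```

The first pass shrinks the JSON considerably because the keys repeat; the second pass typically gains little or nothing.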
Use Case #4: MicroServices
● Small databases that can be highly distributed
– Supplier information
– Locations and hours
– Reference data
– Holy grail: Rates and availability, to serve quotes
● Containerized via Kubernetes
● Ideally, “master” would live on Galera Cluster and use traditional replication to fan out to many “slaves” around the world and in the cloud.
Big Wins
● Performance for most apps meets or exceeds what we achieved with SQL Server
● Using Galera with MaxScale allows rolling upgrades of MariaDB with minimal or no user-facing downtime
● Intra- and Inter-Data Center redundancy
● Fully-supported, still costs a fraction of what same configuration would cost with IBM Informix or Microsoft SQL Server
● Virtually unlimited growth potential, thanks to virtualization and the cloud
Remaining Challenges
● MaxScale adds an expensive layer
– “Talky” apps + required additional network hop = higher latency
– Redirect mode would be nice, but would bypass most MaxScale features
● User security is a challenge
– By default, MaxScale obscures real host of origin for client; from MariaDB, client appears to just be coming in from the MaxScale VM
– MaxScale cannot currently handle multiple authentication methods on the same IP/port (so PAM requires a separate listener from traditional MariaDB users or UNIX users)
● I/O performance is a potential issue (no easy way to intelligently split I/O loads)
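The "talky apps + extra hop" arithmetic is easy to work through. The figures below are illustrative assumptions, not measurements from AE, and `request_latency_ms` is a hypothetical helper.

```python
def request_latency_ms(round_trips: int, db_round_trip_ms: float,
                       proxy_hop_ms: float = 0.0) -> float:
    """Total database wait for one page when every query pays the extra hop."""
    return round_trips * (db_round_trip_ms + proxy_hop_ms)


# Assumed numbers: a page that issues 200 small queries,
# 0.5 ms per direct round trip, 0.25 ms added by the proxy hop.
direct = request_latency_ms(200, 0.5)
via_proxy = request_latency_ms(200, 0.5, proxy_hop_ms=0.25)
```

With these assumed numbers the page waits 100 ms when connecting directly and 150 ms through the proxy. The per-query overhead is small, but it is multiplied by the number of round trips.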
Wish List
● MaxScale “redirect” mode (Didn’t I just say that? Yep, I did.)
● Easy way to require SSL for ALL users, rather than on a user-by-user basis
● Get session ID for other sessions
● Heterogeneous Replication!
● Non-blocking mariabackup with Galera
● Single-step restore
● More granularity on binary log rotation
This Could Be YOU!
THANK YOU!
Tom Girsch
tgirsch@autoeurope.com