Speaker
● My name is Jose Cores Finotto. I am an Engineering Manager in Infrastructure at GitLab.
● I have been a part of the team since
September 2018.
● Background in larger organizations, with deep experience in infrastructure, primarily in relational databases.
Agenda
● GitLab: Features and Numbers
● Scenario
● Goal
● Architecture REPMGR
● Architecture Patroni
● The Project
● Migration
● Conclusion
GitLab by the numbers
Company
- Incorporated in 2014
- 500+ employees
- 50+ countries
Broad adoption
- Millions of users
- 100,000+ organizations
- Over 550,000 paid users
- Open source model
- 2,200+ code contributors
- 10,000+ total contributors
Strong business
- ARR (Dec ‘18): $44M
- ARR Growth Rate: 177%
- Capital Raised: $158M
- Capital Spent: $26M
GitLab Values
● Collaboration: We work asynchronously with a fully remote workforce, and we use GitLab to build GitLab; there's an Issue and/or Merge Request for everything.
● Results: Track outcomes, not hours.
● Efficiency: Boring solutions win; complexity slows cycle time.
● Diversity: Remote-only tends toward global diversity, but we still have a ways to go. Hire those who add to culture, not those who fit with culture; we want cultural diversity instead of cultural conformity.
● Iteration: Minimum Viable Change (MVC): if the change is better than the existing solution, ship it.
● Transparency: Everything at GitLab is public by default: Strategy, Roadmap, Quarterly Goals, Handbook, and Issue Trackers.
GitLab.com
We have a hosted version of GitLab:
● Over 25 million daily git pull operations.
● Upwards of 3K web requests per second.
● More than 4K git requests per second.
● 650,000 git pushes a day.
Scenario
● We used REPMGR as our PostgreSQL high-availability solution.
● During the first half of 2018 we faced some failures.
● We hit issues during the execution of some failovers, including a possible split-brain.
● Some self-managed customers reported problems with REPMGR.
● We decided we needed to change the HA solution.
Goal
What is the appropriate solution on the market for our HA needs?
The GitLab methodology for choosing a technical solution has the following steps:
● Blueprints
● Design docs
● Project
Choosing a solution
Among the possible options, we considered:
● Stolon
● Upgrading REPMGR
● Patroni
OnGres helped us during the whole project, from the blueprint to the rollout to production.
Our choice was to implement Patroni.
Architecture REPMGR
REPMGR is described by its developers as “The Most Popular Replication Manager
for PostgreSQL”.
REPMGR wraps common operations:
● provisioning a new replica;
● offering commands and a daemon for:
○ manual switchovers;
○ automatic failover.
REPMGR writes its state and monitoring information to a PostgreSQL database.
Architecture REPMGR
[Diagram: the GitLab application connects through PgBouncer nodes (each running PgBouncer and a Consul agent) to a PostgreSQL cluster with a primary DB and two secondary DBs, where each database node runs PostgreSQL, repmgrd, and a Consul agent, alongside a three-node Consul cluster.]
Architecture Patroni
Patroni is a template for creating your own customized high-availability solution using Python.
It is based on a DCS (Distributed Configuration Store; in our case, Consul):
● The DCS provides very strong correctness guarantees.
● All nodes see the same information, all the time.
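As an illustration, each Patroni node is driven by a YAML configuration file; a minimal sketch against a Consul DCS might look like the following (host names, paths, and credentials here are placeholders, not our production values):

```yaml
scope: gitlab-cluster        # cluster name, shared by all nodes
name: db1                    # unique name of this node

consul:
  host: 127.0.0.1:8500       # local Consul agent acting as the DCS

bootstrap:
  dcs:
    ttl: 30                  # leader lock time-to-live, in seconds
    loop_wait: 10            # seconds between HA-loop runs
    retry_timeout: 10        # retry window for DCS/PostgreSQL operations

postgresql:
  data_dir: /var/lib/postgresql/data
  listen: 0.0.0.0:5432
  connect_address: db1.internal:5432
  authentication:
    replication:
      username: replicator
      password: secret
```

The `ttl` and `loop_wait` values govern how quickly a dead master is detected and a failover begins.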
How Patroni operates
● Nodes periodically write their position in the replication stream (their lag) to the DCS.
● The master node holds a lock that it must renew every few seconds.
● A leader election is triggered if the master fails to renew the lock.
● A load balancer handles read-only queries.
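The leader-lock mechanism can be sketched in Python, assuming a toy in-memory DCS with Consul-like TTL semantics (the class and function names are illustrative, not Patroni's actual API):

```python
# Toy sketch of Patroni's leader lock over a DCS. In a real cluster the
# DCS (Consul/etcd) makes acquire-or-renew atomic; here a dict suffices.

TTL = 30  # seconds the leader lock stays valid without renewal

class ToyDCS:
    """Stands in for Consul: a single leader key with an expiry time."""
    def __init__(self):
        self.leader = None     # name of the current leader, if any
        self.expires_at = 0.0  # time at which the lock lapses

    def acquire_or_renew(self, node, now):
        """Atomic in a real DCS: take the lock if free or expired, or renew it."""
        if self.leader == node or self.leader is None or now >= self.expires_at:
            self.leader = node
            self.expires_at = now + TTL
            return True
        return False  # another node still holds a live lock

def ha_loop_tick(dcs, nodes, now):
    """One HA-loop iteration: every node tries the lock; the holder is master."""
    for node in nodes:
        if dcs.acquire_or_renew(node, now):
            return node  # this node is (or remains) the leader
    return dcs.leader

dcs = ToyDCS()
leader = ha_loop_tick(dcs, ["db1", "db2", "db3"], now=0)        # "db1" takes the lock
leader_after_crash = ha_loop_tick(dcs, ["db2", "db3"], now=40)  # db1's lock has lapsed: "db2"
```

This is why the DCS's consistency guarantees matter: as long as acquire-or-renew is atomic, at most one node can believe it is the master at any time.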
Why Patroni
Main reasons to choose Patroni:
● It is based on a DCS, making it more reliable.
● It is widely used in the Postgres community.
● OnGres has hands-on experience with Patroni, bringing expertise to the project.
● It respects GitLab's software principles.
The Project
Create staging environment with Patroni → Understand the failover process → Automate the failover process + QA testing → Migration
Migration
● We executed a dry run to verify that all the pre-steps were OK.
● Since the dry run was successful, we decided to finish the migration.
● Our maintenance window was 1 hour long.
● On board for the migration meeting we had:
○ SRE / DBRE teams
○ OnGres
○ QA
○ Management
Points to optimize during the migration
Points to improve:
● We missed some file-descriptor configuration on the PgBouncers ("too many open files").
● One pre-check script did not work as expected in production.
● We exceeded the maintenance window by 45 minutes.
● We had to restart some agents for them to work properly.
● Some monitoring needed adjustments.
● We should have involved more on-call SREs.
Conclusions
Some post-project observations:
● We are satisfied with Patroni's behavior since the migration.
● Several manual and automated failovers have executed successfully.
● We have run into a few small issues since the deployment, but they were solved successfully and never posed a severe threat to the database.
● We are planning a Patroni upgrade to address those issues and remove their workarounds.