
Cloud TiDB Deep Dive



  1. Cloud TiDB Deep Dive (Ed Huang)
  2. Hi there!
     ● Ed Huang (@dxhuang)
     ● Co-founder & CTO @PingCAP
     ● Distributed systems engineer, open source advocate
     ● Living in Beijing / San Francisco Bay Area
  3. Agenda
     ● Goal: understand the Cloud TiDB architecture
     ● Time: ~45 minutes
     ● Outline:
       ○ Cloud TiDB Architecture
       ○ Key Components
         ■ The TidbCluster Custom Resource Definition (CRD)
         ■ tidb-controller-manager
         ■ tidb-scheduler
       ○ Implementations
         ■ Graceful Upgrade
         ■ Auto Failover
         ■ HA Scheduling
  4. Part I - Cloud TiDB Architecture
  5. Overview (architecture diagram): the TiDB Operator consists of the TiDB Controller Manager (the TiDB, PD, and TiKV controllers) and the TiDB Scheduler (a scheduler extender in front of kube-scheduler). It watches TidbCluster custom resources through the Kubernetes API Server, works alongside the built-in Controller Manager and Scheduler, and manages the TiDB Pods.
  6. Overview (deployment diagram): the TiDB Operator renders a TidbCluster into API objects (StatefulSets, plus Jobs and CronJobs for the Initializer and Monitoring), and the Kubernetes controllers and scheduler place the resulting PD, TiKV, and TiDB Pods across nodes (Node A, Node B, Node C).
  7. Details
     ● Kubernetes as the orchestration platform
     ● TiDB Operator injects TiDB's domain-specific orchestration logic into Kubernetes:
       ○ TidbCluster: the custom resource that declares the user's intent
       ○ tidb-controller-manager: a set of custom controllers that implement the intent declared in TidbCluster
       ○ tidb-scheduler: custom scheduling policies, e.g. HA (High Availability) scheduling for PD and TiKV
  8. Part II - Key Components
  9. Kubernetes Recap: Declarative API
     ● CRUD (Create, Read, Update, Delete) of API objects
     ● An API object is a "record of intent"
     ● spec: the desired state
     ● status: the current state
     ● A long-running API client called a controller reconciles the two states
     ● Example: desired replicas: 3, actual replicas: 2
       loop { if actual != desired { reconcile() } }
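The control loop on this slide can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python (the real controllers are written in Go); `reconcile` and `scale` are hypothetical names:

```python
def reconcile(desired, actual, scale):
    """One pass of a level-triggered control loop: diff the desired
    state against the actual state and act only when they differ."""
    if actual != desired:
        scale(desired)   # trigger an action to converge toward spec
        return True      # a reconciliation was needed
    return False         # already converged, nothing to do

# Toy cluster state: the user wants 3 replicas, only 2 are running.
state = {"replicas": 2}
changed = reconcile(3, state["replicas"], lambda n: state.update(replicas=n))
```

After one pass the actual state matches the desired state, and subsequent passes become no-ops; that idempotence is what makes the loop safe to run forever.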
  10. API resources for various intents
      ● Pod (Containers): I need co-scheduling
      ● Deployment / ReplicaSet: I have many Pod replicas / I want to manage deployments
      ● Service: I want to expose my service
      ● Ingress: I want to proxy my Pods
      ● Job / CronJob: I only run once... but periodically
      ● StatefulSet: I need state
      ● DaemonSet: I run as a daemon
      ● ConfigMap / Secret: I need configs... and they are sensitive
  11. Q: How do we tell Kubernetes "I want a TiDB cluster that has N replicas..."? 🤔
      A: With an API object for "TiDB cluster"
  12. Custom Resource Definition (CRD)
      ● Analogy: a class definition and a class instance (object) in OOP. The CRD is the definition; each TidbCluster object is an instance.
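The OOP analogy on this slide can be made concrete with a short Python sketch (hypothetical, simplified types; the real TidbCluster schema has many more fields and is defined as Go structs):

```python
from dataclasses import dataclass

# The "class definition" plays the role of the CRD: it only declares
# the shape of the resource, it does not create any cluster.
@dataclass
class ComponentSpec:
    image: str
    replicas: int

@dataclass
class TidbClusterSpec:
    pd: ComponentSpec
    tikv: ComponentSpec
    tidb: ComponentSpec

# The "class instance" plays the role of a TidbCluster object that a
# user submits to the API server to declare their intent.
demo = TidbClusterSpec(
    pd=ComponentSpec("pingcap/pd:v2.1.3", 3),
    tikv=ComponentSpec("pingcap/tikv:v2.1.3", 5),
    tidb=ComponentSpec("pingcap/tidb:v2.1.3", 2),
)
```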
  13. ...however, Kubernetes by itself cannot accomplish this intent 😅
      Enter tidb-controller-manager
  14. Control Loop: tidb-controller-manager continuously diffs the desired state against the actual state. If they do not match, it triggers actions to converge on the desired state. This process is also known as reconciliation.
  15. What is triggered, then?
      ● Create some Pods? Yes, of course we can create Pods to bring the TiDB cluster up!
      ● However, managing bare Pods is hard. Can we find something better?
      ● StatefulSet
      ● So we (mainly) CRUD StatefulSet objects to accomplish the intent declared in the TidbCluster object.
  16. Accomplish the intent: one StatefulSet is rendered per component.
      apiVersion: ...
      kind: TidbCluster
      metadata:
        name: demo
      spec:
        pd:
          image: pingcap/pd:v2.1.3
          replicas: 3
          ...
        tikv:
          image: pingcap/tikv:v2.1.3
          replicas: 5
          ...
        tidb:
          image: pingcap/tidb:v2.1.3
          replicas: 2
          ...
      → PD StatefulSet, TiKV StatefulSet, TiDB StatefulSet
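The rendering step on this slide (one StatefulSet per component of the TidbCluster spec) can be sketched as a pure function. This is a simplified Python illustration with hypothetical dict shapes, not the operator's actual Go code:

```python
def render_statefulsets(tc):
    """Render one StatefulSet-shaped object per component of a
    TidbCluster (simplified to name, image, and replicas)."""
    cluster = tc["metadata"]["name"]
    return {
        component: {
            "name": f"{cluster}-{component}",   # e.g. demo-pd
            "image": spec["image"],
            "replicas": spec["replicas"],
        }
        for component, spec in tc["spec"].items()
    }

demo = {
    "metadata": {"name": "demo"},
    "spec": {
        "pd":   {"image": "pingcap/pd:v2.1.3",   "replicas": 3},
        "tikv": {"image": "pingcap/tikv:v2.1.3", "replicas": 5},
        "tidb": {"image": "pingcap/tidb:v2.1.3", "replicas": 2},
    },
}
sets = render_statefulsets(demo)
```

Because the output is a deterministic function of the spec, re-running the render on an unchanged TidbCluster produces identical objects, which is what lets the controller apply it repeatedly without side effects.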
  17. TiDB Scheduler
      ● Moreover, PD / TiKV / TiDB Pods need an extra scheduling policy:
        ○ the Raft protocol requires a majority alive (>= n/2 + 1)
      ● An extended scheduler can intercept the scheduling process. However, sometimes we cannot change the configuration of kube-scheduler, e.g. on managed Kubernetes.
      ● The solution is to deploy our own scheduler, which reuses the kube-scheduler image.
  18. TiDB Scheduler (diagram): PD / TiKV / TiDB Pods are scheduled by tidb-scheduler, while other Pods keep the default kube-scheduler.
      apiVersion: apps/v1
      kind: StatefulSet
      ...
      spec:
        template:
          spec:
            schedulerName: tidb-scheduler
            containers: ...
  19. Scheduling (diagram): the TidbCluster controller renders the PD / TiKV / TiDB StatefulSets from the TidbCluster object; each Pod gets a PVC bound to a PV, and the TiDB Scheduler assigns the Pod to a node.
  20. Part III - Implementations
  21. Graceful Rolling Update
      ● Rolling updates cover:
        ○ the cluster version
        ○ the cluster configuration
      ● StatefulSet performs rolling updates out of the box
      ● Problem:
        ○ TiKV: evict Region leaders before stopping
        ○ PD: transfer the Raft leader before stopping
        ○ TiDB: switch the DDL owner before stopping
      ● tidb-controller-manager controls the update process by writing the `partition` field of the StatefulSet
  22. Graceful Rolling Update (TiKV Revision 1 → Revision 2, Pods TiKV-4 ... TiKV-0):
      if store.LeaderCount > 0 {
          evictLeader(store.ID)
          return
      }
      // decrement partition by 1 to process the next Pod
      setUpgradePartition(partition - 1)
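The pseudocode above can be expanded into a small runnable sketch. This is a Python illustration with hypothetical names (`upgrade_step`, `evict_leader`, `set_partition`), assuming the StatefulSet semantics that Pods with ordinal >= partition run the new revision; the real logic lives in the operator's Go code:

```python
def upgrade_step(stores, partition, evict_leader, set_partition):
    """One reconcile pass of the graceful rolling update.

    The Pod at ordinal partition-1 is the next to upgrade, but it is
    only allowed to restart once its TiKV store holds no Region
    leaders; until then we (re)request leader eviction and wait."""
    store = stores[partition - 1]          # next Pod to upgrade
    if store["leader_count"] > 0:
        evict_leader(store["id"])          # ask PD to move leaders away
        return partition                   # do not advance yet
    set_partition(partition - 1)           # safe: let this Pod restart
    return partition - 1

# Simulate a 5-store cluster; store 4 still holds 7 Region leaders.
evicted, state = [], {"partition": 5}
stores = [{"id": i, "leader_count": 0} for i in range(5)]
stores[4]["leader_count"] = 7

p = upgrade_step(stores, 5, evicted.append,
                 lambda n: state.update(partition=n))
# First pass: leaders remain on store 4, so eviction is requested
# and the partition stays put. Pretend PD finishes the eviction:
stores[4]["leader_count"] = 0
p = upgrade_step(stores, p, evicted.append,
                 lambda n: state.update(partition=n))
```

The key property is that the partition only moves after the precondition (no leaders on the store) holds, so a slow eviction simply results in repeated no-op passes rather than an unsafe restart.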
  23. Auto Failover (diagram: a three-instance TiKV cluster, TiKV-0 / TiKV-1 / TiKV-2)
  24. Auto Failover (diagram: the same TiKV-0 / TiKV-1 / TiKV-2 cluster, with one instance failing)
  25. Auto Failover
      ● How can we determine whether a TiKV instance is down?
        ○ According to the Pod status being "Unknown"?
          ■ What if it is just a temporary network partition?
      ● We have to combine Kubernetes' view and PD's view:
        ○ Kubernetes thinks the Pod is not running, e.g. an "Unknown" status when the node is down
        ○ PD thinks the store is down: the corresponding TiKV store status is "Down"
      ● Now the actual state is not only the status from the Kubernetes cluster, but also the status from the TiDB cluster (PD)!
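The two-view failure check described above reduces to a conjunction. A minimal Python sketch (hypothetical function name and status strings modeled on the slide; the real decision in the operator also involves timeouts and grace periods):

```python
def should_failover(pod_phase, pd_store_state):
    """Mark a TiKV instance as failed only when BOTH views agree:
    Kubernetes no longer sees the Pod running AND PD reports the
    store as Down. A transient network partition typically flips
    only one of the two views, so no failover is triggered."""
    k8s_down = pod_phase in ("Unknown", "Failed")
    pd_down = pd_store_state == "Down"
    return k8s_down and pd_down
```

Requiring agreement from both control planes trades a little failover latency for protection against acting on a one-sided, possibly stale, observation.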
  26. Auto Failover: the Operator collects status from both Kubernetes and PD.
      status:
        pd:
          failureMembers: ...
          leader: ...
          members: ...
        tikv:
          failureStores: ...
          stores: ...
          tombstoneStores: ...
  27. Auto Failover (diagram: TiKV-0 / TiKV-1 / TiKV-2 with local PVs localpv-a ... localpv-c bound via pvc-0 ... pvc-2; a replacement TiKV-3 with pvc-3 is created on localpv-d): the desired replica number of the StatefulSet is the desired TiKV replica number plus the number of failed instances.
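The replica arithmetic on this slide is simple but worth stating precisely. A one-line Python sketch (hypothetical function name; `failure_stores` mirrors the `failureStores` map in the TidbCluster status shown on the previous slide):

```python
def statefulset_replicas(desired, failure_stores):
    """The StatefulSet replica count is the user's desired TiKV
    replica count plus the number of recorded failed instances, so
    a replacement Pod (e.g. TiKV-3) is created without deleting the
    failed Pod or its PVC."""
    return desired + len(failure_stores)
```

Growing the StatefulSet instead of deleting the failed member means the failed instance's local volume is preserved for inspection or recovery while the cluster regains its full healthy replica count.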
  28. HA Scheduling (diagram: non-HA placement, PD-0 / PD-1 / PD-2 packed together, vs. HA scheduling, PD-0 / PD-1 / PD-2 spread across nodes)
  29. HA Scheduling
      ● Why is inter-pod (anti-)affinity insufficient? Consider 5 PD instances (PD-0 ... PD-4) spread over 3 nodes:
        ○ If we use 'required' (hard) anti-affinity, this topology is not allowed, even though it can tolerate one node failure.
        ○ If we use 'preferred' (soft) anti-affinity and, say, we have only 2 nodes, then we will get a non-HA topology.
  30. Extended Scheduler (diagram: kube-scheduler plus tidb-scheduler; the PD / TiKV / TiDB StatefulSet Pod template sets schedulerName: tidb-scheduler). The custom policy filters out any node that would run more than N / 2 PD instances if the new PD Pod were assigned to it.
  31. Extended Scheduler: PD-2 is waiting to be scheduled.
      Custom scheduling policy:
      ● Node-0 would have 2 Pods > (3 / 2), so filter it out
      ● The same goes for Node-1
      ● PD-2 is stuck in the Pending state due to the scheduling failure
      ● This policy avoids a potential PD cluster failure at the very beginning
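The filter from the last two slides can be sketched as a predicate over candidate nodes. This is a Python illustration with hypothetical names (`filter_nodes`, `existing` mapping Pod name to node); the actual tidb-scheduler implements this as a Go scheduler-extender filter endpoint:

```python
from collections import Counter

def filter_nodes(candidate_nodes, existing_pd_pods, total_replicas):
    """HA filter sketch: reject any node that would end up running
    more than total_replicas / 2 PD instances if the pending Pod
    were placed there, so no single node failure can break the
    Raft majority."""
    per_node = Counter(existing_pd_pods.values())  # node -> PD count
    limit = total_replicas // 2                    # floor(N / 2)
    return [node for node in candidate_nodes
            if per_node[node] + 1 <= limit]

# Slide 31's scenario: 3 PD replicas, only 2 nodes, 2 PDs placed.
existing = {"pd-0": "node-0", "pd-1": "node-1"}
feasible = filter_nodes(["node-0", "node-1"], existing, 3)  # empty
```

With an empty feasible list the Pod stays Pending, which matches the slide: the scheduler prefers to wait for a third node rather than create a topology where one node holds the PD majority.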
  32. Extending TiDB Operator
      ● Putting it together:
        ○ New or additional user intents:
          ■ define a new CRD or modify the TidbCluster CRD
        ○ Accomplishing new intents or refining current ones:
          ■ modify tidb-controller-manager
        ○ Custom scheduling requirements:
          ■ modify tidb-scheduler
  33. Thank You!