A Brief Guide to ClickHouse Replication
Sina - Gao Peng - May 2018
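Start with the simplest possible test
1 shard
2 replicas
How is this replica relationship actually implemented?
When you write only to the Distributed table,
and internal_replication = false,
the Distributed table itself writes the data to its underlying local tables on every replica.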
Using non-replicated MergeTree tables and duplicate data through Distributed table - we
call it "poor man's replication".
When you use it, you have to do many work by yourself - recovery, control of consistency,
etc.
I recommend to use real replication (Replicated) tables for almost all cases. But there are
notable exceptions:
- if you really hate ZooKeeper or maybe you afraid to have any piece of Java code in your
infrastructure;
- if you already have some data processing pipeline with other databases, that are already
replicated "by hand" and you want to just integrate ClickHouse in the same way;
- if you want your replicas to be as much independent as possible;
- if you want solution that is conceptually as simple as possible.
Replicated tables are not slower at insertion than plain MergeTree, if you measure
throughput with enough batch size. ZooKeeper synchronization only contributes to latency.
But for INSERTs, additional latency up to hundreds of ms usually doesn't matter.
Also Replicated tables are more heavy: for plain MergeTree tables ClickHouse will tolerate
many thousands of tables per server, and for ReplicatedMergeTree it's difficult (but usually
you should have few big tables).
The boss has spoken: just use Replicated tables.
"A config file that circulates widely in the community:
what does it actually mean?"
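Reading this config file
It configures 3 shards,
i.e. 3 groups of machines serve a query in parallel.
Only 3 machines actually serve any one query, not 6.
Nodes 11 and 12 can rely on the Distributed table for replication,
and can also fail over to each other internally.
The hard part is working out the replication relationship inside each shard:
which replication method to use, and what the pros and cons of each method are.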
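[Diagram: six nodes in three shards. Shard1 = nodes 11 and 12, Shard2 = nodes 13 and 14, Shard3 = nodes 15 and 16. Labels: Cluster 1, Cluster 2.]
Shard1/Shard2/Shard3 are shards:
the data is spread across different nodes to speed up queries.
The more machines, the faster the query. This relies on the Distributed table.
11/12 are replicas of each other, as are 13/14 and 15/16.
Replicas exist to protect the data.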
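There are 2 ways for 11 and 12 to back each other up:
Option 1:
11/12 use Replicated tables. Tables with the same ZK (ZooKeeper) path replicate automatically.
This is the more advanced option. It depends heavily on ZK, so prepare the ZK deployment properly.
Replication verifies the data and keeps the replicas consistent automatically.
We recommend 3 replica nodes here. You can then require that at least 2 nodes receive the data before a write counts as successful, which strengthens consistency.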
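Option 2:
11/12 are written through the Distributed table. This requires internal_replication: false, so that the Distributed table writes to every replica in the shard.
The ClickHouse developers call this 'poor man's replication'. You have to handle existing data, data migration and similar work yourself. Not recommended.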
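Recommended setup:
1. Set internal_replication to true, so that each insert goes to only one replica within a shard
(this applies when writing through the Distributed table).
2. Also enable table-level replication, so that whichever replica receives the write,
the data is synchronized to the other replicas (nodes).
3. With 11 and 12 as replicas of each other, if one goes down,
requests move automatically to the healthy node, so writes and reads are not affected.
(This setup is somewhat complex. Make sure you understand it before using it.)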
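The difference between the two write paths can be sketched with a small Python model. This is only an illustration of the behaviour described above, not ClickHouse code: the function names are hypothetical, and picking the first replica stands in for whichever healthy replica the server would actually choose.

```python
from typing import List


def distributed_insert_targets(replicas: List[str], internal_replication: bool) -> List[str]:
    """Replicas of ONE shard that the Distributed table writes to directly."""
    if internal_replication:
        # One replica receives the insert; the Replicated engine is expected
        # to copy it to the others. (Simplification: always the first replica.)
        return replicas[:1]
    # "Poor man's replication": the Distributed table writes to every replica itself.
    return list(replicas)


def final_copies(replicas: List[str], internal_replication: bool,
                 tables_are_replicated: bool) -> List[str]:
    """Replicas that hold the row once any table-level replication has run."""
    holders = set(distributed_insert_targets(replicas, internal_replication))
    if tables_are_replicated and holders:
        # Table-level replication spreads the data to all replicas of the shard.
        holders = set(replicas)
    return sorted(holders)


shard1 = ["11", "12"]
print(final_copies(shard1, internal_replication=True, tables_are_replicated=True))
print(final_copies(shard1, internal_replication=True, tables_are_replicated=False))
```

The second call shows why internal_replication = true only makes sense together with Replicated tables: without them, only one node ever holds the data.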
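What do the Replicated engine parameters actually mean?
About the Replicated engine
ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/hits', '{replica}', EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID), EventTime), 8192)
First argument: the ZK path. Tables that should replicate each other must use the same path.
Second argument: the replica name. It must differ between replicas.
The rest are the date (partitioning) column, the sampling expression, the primary key, and the index granularity.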
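{layer} in the official ClickHouse example indicates which layer a table belongs to. Tables in the same layer hold the same data.
{shard} is the number of the shard within a cluster.
Tables in the same layer and the same shard replicate each other.
ClickHouse does not recommend building very large clusters. Run one cluster per workload, and decide the number of shards yourself.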
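How the placeholders resolve can be illustrated with a short Python sketch. In ClickHouse the values come from the macros section of each server's configuration; the host names and macro values below are hypothetical, and Python's str.format is used only as a stand-in for the server's own substitution.

```python
def expand_macros(template: str, macros: dict) -> str:
    """Substitute {name} placeholders with per-server values."""
    return template.format(**macros)


ZK_PATH = "/clickhouse/tables/{layer}-{shard}/hits"
REPLICA = "{replica}"

# Hypothetical macro values for three servers.
servers = {
    "host-a": {"layer": "01", "shard": "01", "replica": "host-a"},
    "host-b": {"layer": "01", "shard": "01", "replica": "host-b"},
    "host-c": {"layer": "01", "shard": "02", "replica": "host-c"},
}

expanded = {
    host: (expand_macros(ZK_PATH, macros), expand_macros(REPLICA, macros))
    for host, macros in servers.items()
}

for host, (path, replica) in expanded.items():
    print(host, path, replica)
```

host-a and host-b end up with the same ZK path and different replica names, so they replicate each other. host-c gets a different path and therefore belongs to another shard.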
An example:
CREATE TABLE apm.access_msg ( date Date, clientip String, serverip String, domain String)
ENGINE = ReplicatedMergeTree('/clickhouse/cluster1/apm/access_msg/1', '1', date, (date, domain), 8192)
CREATE TABLE apm.access_msg ( date Date, clientip String, serverip String, domain String)
ENGINE = ReplicatedMergeTree('/clickhouse/cluster1/apm/access_msg/1', '2', date, (date, domain), 8192)
CREATE TABLE apm.access_msg ( date Date, clientip String, serverip String, domain String)
ENGINE = ReplicatedMergeTree('/clickhouse/cluster1/apm/access_msg/1', '3', date, (date, domain), 8192)
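Wherever these 3 tables live, they replicate each other's data because they share the same ZK path.
The 3 tables can sit on one machine, or be spread across 3 machines.
Spreading them across 3 machines is clearly the sensible choice. That is what 3 replicas means.
So how does this tie in with shards? The answer lies in how the ZK path is set.
Note the trailing 1 in the ZK path above: it identifies the shard.
(An example from our production environment)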
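The rule "same ZK path means same replica set, and replica names must be unique within it" can be checked with a few lines of Python. This is a sketch under stated assumptions: the host names and the fourth table (a second shard with path suffix /2) are hypothetical additions for illustration.

```python
from collections import defaultdict


def replica_sets(tables):
    """Group (host, zk_path, replica_name) tuples into sets that replicate each other."""
    groups = defaultdict(list)
    for host, zk_path, replica_name in tables:
        groups[zk_path].append((host, replica_name))
    # Within one ZK path every replica name must be unique.
    for zk_path, members in groups.items():
        names = [name for _, name in members]
        if len(names) != len(set(names)):
            raise ValueError(f"duplicate replica name under {zk_path}")
    return dict(groups)


tables = [
    ("host-1", "/clickhouse/cluster1/apm/access_msg/1", "1"),
    ("host-2", "/clickhouse/cluster1/apm/access_msg/1", "2"),
    ("host-3", "/clickhouse/cluster1/apm/access_msg/1", "3"),
    ("host-4", "/clickhouse/cluster1/apm/access_msg/2", "1"),  # hypothetical second shard
]

sets_by_path = replica_sets(tables)
for path, members in sets_by_path.items():
    print(path, members)
```

The first three tables form one replica set of size 3. The table under the path ending in /2 holds a different shard and does not exchange data with them.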
What does an ideal architecture look like?
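2 clusters of 4 nodes each: how do they replicate?
[Diagram: Cluster 1 with nodes 11, 12, 13, 14; Cluster 2 with nodes 21, 22, 23, 24]
cluster  replica  layer  shard
1x       11       1      1
1x       12       2      2
1x       13       3      3
1x       14       4      4
cluster  replica  layer  shard
2x       21       1      1
2x       22       2      1
2x       23       3      1
2x       24       4      1
Callouts on the tables: "machine name or node name"; "these two can be the same"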
[Diagram: Cluster 1 (nodes 11, 12, 13, 14), Cluster 2 (nodes 21, 22, 23, 24), Cluster 3 (nodes 31, 32, 33, 34). Callouts: a cluster; the first layer; a single node.]
Sensible usage: 3 clusters of N nodes each (N determines query performance)
Not clear? Let's break it down step by step.
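One cluster with 4 shards:
every query is shared by 4 machines (horizontal scaling).
Writes should be spread evenly across the 4 nodes through a load balancer (DNS, HAProxy, Nginx, etc.),
or written directly through the Distributed table (not recommended).
At this point query performance is bounded by the number of machines: the more machines, the faster the query.
But each quarter of the data exists in only one copy, which is not safe enough. So you want
replicas, which leads to the next architecture:
[Diagram: Cluster 1 with nodes 11, 12, 13, 14 behind an LB]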
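Compared with the previous architecture, every quarter of the data now has a 'twin',
i.e. a replica. Data is lost only if both twins of the same shard happen to go down together.
If that still does not feel safe enough, increase the number of replicas:
[Diagram: Cluster 1 (nodes 11, 12, 13, 14) and Cluster 2 (nodes 21, 22, 23, 24)]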
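With 3 replicas, the risk of data loss is already very low.
You can also set a majority-write (quorum) style parameter, so that the client is told
the write succeeded only after at least floor(3/2) + 1 = 2 of the 3 replicas have received the data.
[Diagram: Cluster 1 (nodes 11, 12, 13, 14), Cluster 2 (nodes 21, 22, 23, 24), Cluster 3 (nodes 31, 32, 33, 34)]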
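The majority arithmetic above, written out as a minimal Python sketch. ClickHouse exposes a setting named insert_quorum for this kind of guarantee; the functions below only model the arithmetic and are hypothetical helpers, not part of any ClickHouse API.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def write_acknowledged(acks: int, replica_count: int) -> bool:
    """True once enough replicas have received the data to report success."""
    return acks >= majority(replica_count)


for n in (2, 3, 5):
    print(n, "replicas -> quorum of", majority(n))
```

With 3 replicas the quorum is 2, so one replica can be down or slow without blocking inserts. With only 2 replicas the quorum is also 2, which means any single failure blocks quorum writes.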
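Do you actually run it this way?
The answer: no.
1. No machines.
2. Our workload can tolerate losing data.
3. The load on ZK is not small.
(That is not the main reason. Mainly, we are broke.)
Is there a way to do this on a shoestring?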
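If you have no machines (no money), do it like this:
4 machines, cross backup
1/2/3/4 are the 4 machines
A1 and A1' are the same data, replicated through ZK
machine  holds
1        A1, A4'
2        A2, A1'
3        A3, A2'
4        A4, A3'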
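If machine 2 goes down, machines 1, 3 and 4 remain.
The surviving data is A1, A2', A3, A3', A4, A4', which is still one complete data set.
However, queries cannot be served in this state. The only fix is to copy A2' from machine 3 onto a new machine, so availability is lost.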
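The failure case can be replayed with a small Python simulation of the cross-backup layout. The layout dictionary mirrors the 4-machine placement described above; the helper names are hypothetical and the sketch checks only which copies survive, not whether queries can be served.

```python
# Which copies each machine holds; a primed name is the ZK-replicated copy.
LAYOUT = {
    1: ["A1", "A4'"],
    2: ["A2", "A1'"],
    3: ["A3", "A2'"],
    4: ["A4", "A3'"],
}


def surviving_copies(layout, failed):
    """All copies left on the machines that are still up."""
    return sorted(c for m, copies in layout.items() if m not in failed for c in copies)


def lost_shards(layout, failed):
    """Shards with no surviving copy at all."""
    all_shards = {c.rstrip("'") for copies in layout.values() for c in copies}
    alive = {c.rstrip("'") for c in surviving_copies(layout, failed)}
    return sorted(all_shards - alive)


print(surviving_copies(LAYOUT, {2}))  # copies left after machine 2 fails
print(lost_shards(LAYOUT, {2}))       # no shard is lost
print(lost_shards(LAYOUT, {1, 2}))    # two neighbouring machines fail
```

Any single failure leaves a complete data set. Losing two machines that hold the two copies of the same shard, such as 1 and 2 for A1, loses that shard.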
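What? Too much hassle?
[Diagram: Cluster 1 (nodes 11, 12, 13, 14) plus one very large server used only for backup, not for queries]
ClickHouse Replicated tables place no physical restriction on where replicas live,
so you can set up one very large machine and replicate everything onto it.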
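Is all this trouble worth it just to save a few machines?
Perhaps, for non-technical reasons, it is.
This is data for statistics.
Does it really need such strict consistency?
If some of it is lost, so be it.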

ClickHouse Data Replication in 34 pages