SlideShare a Scribd company logo
1 of 17
Download to read offline
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
The Scrub and Repair in Jewel
Kefu Cai
Red Hat
October 18, 2015
The Scrub and Repair in Jewel
Kefu Cai
Red Hat
October 18, 2015
2015-10-18
The Scrub and Repair in Jewel
谢谢⼤家现在还醒着
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
The Scrub and Repair in Jewel
Kefu Chai
Red Hat
October 18, 2015
The Scrub and Repair in Jewel
Kefu Chai
Red Hat
October 18, 2015
2015-10-18
The Scrub and Repair in Jewel
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Outline
scrub 101
pain in scrub
scrub in jewel
Q & A
Outline
scrub 101
pain in scrub
scrub in jewel
Q & A
2015-10-18
The Scrub and Repair in Jewel
Outline
1. 接下来想和诸位聊⼀下 ceph ⾥⾯的 scrub 和 repair
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
$ sudo ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 17.1c1 is active+clean+inconsistent, acting [21,25,30]
2 scrub errors
$ sudo ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 17.1c1 is active+clean+inconsistent, acting [21,25,30]
2 scrub errors
2015-10-18
The Scrub and Repair in Jewel
有的时候需要检查 osd 的 log,才能知道具体是哪个 object
有了问题,进⽽⼿动解决。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
$ sudo ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 17.1c1 is active+clean+inconsistent, acting [21,25,30]
2 scrub errors
$ ceph pg repair 17.1c1
$ sudo ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 17.1c1 is active+clean+inconsistent, acting [21,25,30]
2 scrub errors
$ ceph pg repair 17.1c1
2015-10-18
The Scrub and Repair in Jewel
有的时候需要检查 osd 的 log,才能知道具体是哪个 object
有了问题,进⽽⼿动解决。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
scrub
why
▶ fix the data lost/corruption when we can
scrub
why
▶ fix the data lost/corruption when we can
2015-10-18
The Scrub and Repair in Jewel
scrub 101
scrub
由于硬盘出错导致的 EIO,使得数据不可读。
或者更可怕的是,数据出错,同时客户端没有觉察,
把错误的数据当作正确的数据处理, 从⽽造成灾难性的损失。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
scrub
when
▶ least impact to production
scrub
when
▶ least impact to production
2015-10-18
The Scrub and Repair in Jewel
scrub 101
scrub
1. 限定在 begin hour 和 end hour 之间
2. 在系统负载超过阈值的时候也不会贸然启动
3. 以 osd 为单位,同时进⾏的 scrub 数量是可以配置的
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
scrub
when
▶ least impact to production
▶ scheduled periodically
scrub
when
▶ least impact to production
▶ scheduled periodically
2015-10-18
The Scrub and Repair in Jewel
scrub 101
scrub
1. min/max interval 可配置
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
scrub
when
▶ least impact to production
▶ scheduled periodically
▶ distributed randomly in configured time range
scrub
when
▶ least impact to production
▶ scheduled periodically
▶ distributed randomly in configured time range
2015-10-18
The Scrub and Repair in Jewel
scrub 101
scrub
1. 试图缓解因为 max interval 导致的⼤量 scrub 任务堆积
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
scrub
how
▶ pg based and initialized by the primary
▶ scan in batch
scrub
how
▶ pg based and initialized by the primary
▶ scan in batch
2015-10-18
The Scrub and Repair in Jewel
scrub 101
scrub
1. 每次以 pg 为单位,由 primary 发起,要求所有的 peered 的
pg 参与。 分批次处理 pg ⾥⾯的所有 object,以 object 的
hash 为界。这就是所谓的 chunky scrub。把收集到的 object
信息,组织成⼀个 scrub map,同时也向 replica 请求相应的
scrub map。很明显,scrub map 会包含 object, xattr, omap
的信息, 以供交叉⽐较。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
scrub
how
▶ pg based and initialized by the primary
▶ scan in batch
▶ select authoritive copy
scrub
how
▶ pg based and initialized by the primary
▶ scan in batch
▶ select authoritive copy
2015-10-18
The Scrub and Repair in Jewel
scrub 101
scrub
1. 和⺴购⼀样,⼀个 object 也有好评差评之分。⽐如, object
对应 hash 读不出来了; omap 读不出, object
读出来的⼤⼩或者 digest 不⼀致; 干脆读取的时候出现 EIO。
都会导致⼀个 replica 失去成为权威拷⻉的机会。
只有通过这些考验的副本才能成为权威拷⻉。⽽出问题的情况,
不外乎读取失败,数据缺失,和数据不⼀致⼏种。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
scrub
how
▶ pg based and initialized by the primary
▶ scan in batch
▶ select authoritive copy
▶ write down the results
▶ fix it (later)
scrub
how
▶ pg based and initialized by the primary
▶ scan in batch
▶ select authoritive copy
▶ write down the results
▶ fix it (later)
2015-10-18
The Scrub and Repair in Jewel
scrub 101
scrub
1. 把 scrub 的结果通过 cluster log 发给 monitor 记录下来。
这些结果包括⼀些 metadata 本⾝不⼀致的情况,⽐如 head
和 snapdir 共存,有 clone, 但是 snapset ⾥⾯的 seq
是空的, 也有 snapset 本⾝出现不⼀致,⽐如说⾃⼰觉得有 3 个
snap,但是发现只有 2 个。
2. 对于 digest 缺失的情况,就直接⽤已知的 digest 重写所有的
replica。 如果数据本⾝不⼀致, 或者缺失,那么就把它放到
missing 列表⾥⾯去。 如果在以后处理 io
请求的时候,正好碰到了缺少的 object,就⽴即恢复它。
即随机选择⼀个 replica,发送 pull 请求, ⽤收到的 push
数据重写本地有错误,或者缺少的 object。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
pain in scrub
▶ ceph pg repair does not always work.
▶ scrub/repair is not programmable/scriptable.
pain in scrub
▶ ceph pg repair does not always work.
▶ scrub/repair is not programmable/scriptable.
2015-10-18
The Scrub and Repair in Jewel
pain in scrub
pain in scrub
现有的机制还⾮常简单,⽆法应对绝⼤多数情形。
有时候需要观察⽇志,删除出错的 replica。
没有实现更智能、更复杂的策略。
⽐如说少数服从多数的算法,虽然直觉,但是⺫前没有这个设计。
如果 librados 能为 scrub 提供更丰富的接⼝,
那么客户端就能灵活地制定策略,选择有效的副本,指定正确的数据,
甚⾄恢复 snapset/clone,当然这需要对 snapset 有⼀些理解。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
new APIs
query
▶ get_inconsistent_pools() => pool[]
▶ get_inconsistent_pgs() => pg[]
▶ get_inconsistent(pg) => epoch, inconsistent[]
new APIs
query
▶ get_inconsistent_pools() => pool[]
▶ get_inconsistent_pgs() => pg[]
▶ get_inconsistent(pg) => epoch, inconsistent[]
2015-10-18
The Scrub and Repair in Jewel
scrub in jewel
new APIs
在 scrub 过程中,osd
会把缺失或者不⼀致的信息收集起来。保存在临时 object ⾥⾯,供
scrub API 查询。在同⼀个 pg interval 的期间,可以根据这次 scrub
的结果 进⾏查询,读取,进⾏恢复的操作。 但是如果 interval
过了,scrub API 会返回 EAGAIN 。 inconsistent
会是⼀个带有版本号的 json,⽅便⽇后扩展。 是 osd 到 shard_info_t
的 map, shard_info_t 包含 replica 是否存在, omap_sha1,
data_sha1, size 等 metadata。
基于这些数据,⽤户可以实现各种策略,⽐如说投票。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
new APIs
read
▶ operate_on_shard(epoch, osd, shard, op)
new APIs
read
▶ operate_on_shard(epoch, osd, shard, op)
2015-10-18
The Scrub and Repair in Jewel
scrub in jewel
new APIs
其中,op 可能是 read 操作,这些操作允许⽤户检查数据。
从⽽选择有效的数据副本,同时跳过 OSD ⼀致性的检查。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
new APIs
choose
allows user to
▶ specify the good/bad shard/xattr/omap replica
▶ specify the metadata (snapset in particular)
new APIs
choose
allows user to
▶ specify the good/bad shard/xattr/omap replica
▶ specify the metadata (snapset in particular)
2015-10-18
The Scrub and Repair in Jewel
scrub in jewel
new APIs
前者本⾝就是数据,没有和其它对象有互相引⽤的关系。 但是
snapset 和 clone
有其⾃⾝的⼀致性问题。所以具体的接⼝还在讨论。
如果操作不当,可能会产⽣更多的不⼀致,有些可能是预料之外的。
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Q & A
Q & A
2015-10-18
The Scrub and Repair in Jewel
Q & A
Q & A
段经理在早上,说到 call for
action。在这⾥也希望得到⼤家的宝贵意⻅。

More Related Content

Viewers also liked

Ceph Day KL - Ceph Tiering with High Performance Archiecture
Ceph Day KL - Ceph Tiering with High Performance ArchiectureCeph Day KL - Ceph Tiering with High Performance Archiecture
Ceph Day KL - Ceph Tiering with High Performance ArchiectureCeph Community
 
Ceph Day Beijing: Optimizations on Ceph Cache Tiering
Ceph Day Beijing: Optimizations on Ceph Cache Tiering Ceph Day Beijing: Optimizations on Ceph Cache Tiering
Ceph Day Beijing: Optimizations on Ceph Cache Tiering Ceph Community
 
Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...
Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...
Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...Ceph Community
 
Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Community
 
Transforming the Ceph Integration Tests with OpenStack
Transforming the Ceph Integration Tests with OpenStack Transforming the Ceph Integration Tests with OpenStack
Transforming the Ceph Integration Tests with OpenStack Ceph Community
 
Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...
Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...
Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...Ceph Community
 
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Community
 
Ceph Day 2015 - Erasure Coding
Ceph Day 2015 - Erasure Coding Ceph Day 2015 - Erasure Coding
Ceph Day 2015 - Erasure Coding Ceph Community
 
Ceph Day LA: Ceph Ecosystem Update
Ceph Day LA: Ceph Ecosystem Update Ceph Day LA: Ceph Ecosystem Update
Ceph Day LA: Ceph Ecosystem Update Ceph Community
 
Ceph Day LA - RBD: A deep dive
Ceph Day LA - RBD: A deep dive Ceph Day LA - RBD: A deep dive
Ceph Day LA - RBD: A deep dive Ceph Community
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureCeph Community
 
Ceph Day New York 2014: Distributed OLAP queries in seconds using CephFS
Ceph Day New York 2014: Distributed OLAP queries in seconds using CephFSCeph Day New York 2014: Distributed OLAP queries in seconds using CephFS
Ceph Day New York 2014: Distributed OLAP queries in seconds using CephFSCeph Community
 
iSCSI Target Support for Ceph
iSCSI Target Support for Ceph iSCSI Target Support for Ceph
iSCSI Target Support for Ceph Ceph Community
 
Ceph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic CloudCeph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic CloudCeph Community
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Community
 
Ceph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Community
 
Ceph Day Beijing: Containers and Ceph
Ceph Day Beijing: Containers and Ceph Ceph Day Beijing: Containers and Ceph
Ceph Day Beijing: Containers and Ceph Ceph Community
 
Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration
Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration
Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration Ceph Community
 
Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture
Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture
Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture Ceph Community
 
Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client
Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client
Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client Ceph Community
 

Viewers also liked (20)

Ceph Day KL - Ceph Tiering with High Performance Archiecture
Ceph Day KL - Ceph Tiering with High Performance ArchiectureCeph Day KL - Ceph Tiering with High Performance Archiecture
Ceph Day KL - Ceph Tiering with High Performance Archiecture
 
Ceph Day Beijing: Optimizations on Ceph Cache Tiering
Ceph Day Beijing: Optimizations on Ceph Cache Tiering Ceph Day Beijing: Optimizations on Ceph Cache Tiering
Ceph Day Beijing: Optimizations on Ceph Cache Tiering
 
Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...
Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...
Ceph Day Shanghai - VSM (Virtual Storage Manager) - Simplify Ceph Management ...
 
Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph Ceph Day Shanghai - On the Productization Practice of Ceph
Ceph Day Shanghai - On the Productization Practice of Ceph
 
Transforming the Ceph Integration Tests with OpenStack
Transforming the Ceph Integration Tests with OpenStack Transforming the Ceph Integration Tests with OpenStack
Transforming the Ceph Integration Tests with OpenStack
 
Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...
Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...
Using Recently Published Ceph Reference Architectures to Select Your Ceph Con...
 
Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools Ceph Day Shanghai - Ceph Performance Tools
Ceph Day Shanghai - Ceph Performance Tools
 
Ceph Day 2015 - Erasure Coding
Ceph Day 2015 - Erasure Coding Ceph Day 2015 - Erasure Coding
Ceph Day 2015 - Erasure Coding
 
Ceph Day LA: Ceph Ecosystem Update
Ceph Day LA: Ceph Ecosystem Update Ceph Day LA: Ceph Ecosystem Update
Ceph Day LA: Ceph Ecosystem Update
 
Ceph Day LA - RBD: A deep dive
Ceph Day LA - RBD: A deep dive Ceph Day LA - RBD: A deep dive
Ceph Day LA - RBD: A deep dive
 
QCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference ArchitectureQCT Ceph Solution - Design Consideration and Reference Architecture
QCT Ceph Solution - Design Consideration and Reference Architecture
 
Ceph Day New York 2014: Distributed OLAP queries in seconds using CephFS
Ceph Day New York 2014: Distributed OLAP queries in seconds using CephFSCeph Day New York 2014: Distributed OLAP queries in seconds using CephFS
Ceph Day New York 2014: Distributed OLAP queries in seconds using CephFS
 
iSCSI Target Support for Ceph
iSCSI Target Support for Ceph iSCSI Target Support for Ceph
iSCSI Target Support for Ceph
 
Ceph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic CloudCeph Day Berlin: Scaling an Academic Cloud
Ceph Day Berlin: Scaling an Academic Cloud
 
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Day Beijing: Big Data Analytics on Ceph Object Store
Ceph Day Beijing: Big Data Analytics on Ceph Object Store
 
Ceph Day NYC: Developing With Librados
Ceph Day NYC: Developing With LibradosCeph Day NYC: Developing With Librados
Ceph Day NYC: Developing With Librados
 
Ceph Day Beijing: Containers and Ceph
Ceph Day Beijing: Containers and Ceph Ceph Day Beijing: Containers and Ceph
Ceph Day Beijing: Containers and Ceph
 
Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration
Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration
Ceph Day Beijing: Experience Sharing and OpenStack and Ceph Integration
 
Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture
Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture
Ceph Day New York 2014: Ceph and the Open Ethernet Drive Architecture
 
Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client
Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client
Ceph Day Beijing: Ceph-Dokan: A Native Windows Ceph Client
 

Similar to Ceph Day Shanghai - The Scrub and Repair in Jewel

My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册yiditushe
 
My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册yiditushe
 
尽管去做
尽管去做尽管去做
尽管去做hievan
 
Cite space中文手册
Cite space中文手册Cite space中文手册
Cite space中文手册cueb
 
51 cto下载 2010-ccna实验手册
51 cto下载 2010-ccna实验手册51 cto下载 2010-ccna实验手册
51 cto下载 2010-ccna实验手册poker mr
 
神经网络与深度学习
神经网络与深度学习神经网络与深度学习
神经网络与深度学习Xiaohu ZHU
 
尽管去做——无压工作的艺术
尽管去做——无压工作的艺术尽管去做——无压工作的艺术
尽管去做——无压工作的艺术Jim Liu
 
Ext 中文手册
Ext 中文手册Ext 中文手册
Ext 中文手册donotbeevil
 

Similar to Ceph Day Shanghai - The Scrub and Repair in Jewel (13)

My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册
 
My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册My Eclipse 6 Java Ee开发中文手册
My Eclipse 6 Java Ee开发中文手册
 
尽管去做
尽管去做尽管去做
尽管去做
 
Direct show
Direct showDirect show
Direct show
 
Ucd火花集
Ucd火花集Ucd火花集
Ucd火花集
 
Ucd火花集
Ucd火花集Ucd火花集
Ucd火花集
 
Cite space中文手册
Cite space中文手册Cite space中文手册
Cite space中文手册
 
51 cto下载 2010-ccna实验手册
51 cto下载 2010-ccna实验手册51 cto下载 2010-ccna实验手册
51 cto下载 2010-ccna实验手册
 
神经网络与深度学习
神经网络与深度学习神经网络与深度学习
神经网络与深度学习
 
尽管去做——无压工作的艺术
尽管去做——无压工作的艺术尽管去做——无压工作的艺术
尽管去做——无压工作的艺术
 
080620-16461915
080620-16461915080620-16461915
080620-16461915
 
Ext 中文手册
Ext 中文手册Ext 中文手册
Ext 中文手册
 
080620-16461915
080620-16461915080620-16461915
080620-16461915
 

Ceph Day Shanghai - The Scrub and Repair in Jewel

  • 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Scrub and Repair in Jewel Kefu Cai Red Hat October 18, 2015 The Scrub and Repair in Jewel Kefu Cai Red Hat October 18, 2015 2015-10-18 The Scrub and Repair in Jewel 谢谢⼤家现在还醒着
  • 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Scrub and Repair in Jewel Kefu Chai Red Hat October 18, 2015 The Scrub and Repair in Jewel Kefu Chai Red Hat October 18, 2015 2015-10-18 The Scrub and Repair in Jewel
  • 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outline scrub 101 pain in scrub scrub in jewel Q & A Outline scrub 101 pain in scrub scrub in jewel Q & A 2015-10-18 The Scrub and Repair in Jewel Outline 1. 接下来想和诸位聊⼀下 ceph ⾥⾯的 scrub 和 repair
  • 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . $ sudo ceph health detail HEALTH_ERR 1 pgs inconsistent; 2 scrub errors pg 17.1c1 is active+clean+inconsistent, acting [21,25,30] 2 scrub errors $ sudo ceph health detail HEALTH_ERR 1 pgs inconsistent; 2 scrub errors pg 17.1c1 is active+clean+inconsistent, acting [21,25,30] 2 scrub errors 2015-10-18 The Scrub and Repair in Jewel 有的时候需要检查 osd 的 log,才能知道具体是哪个 object 有了问题,进⽽⼿动解决。
  • 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . $ sudo ceph health detail HEALTH_ERR 1 pgs inconsistent; 2 scrub errors pg 17.1c1 is active+clean+inconsistent, acting [21,25,30] 2 scrub errors $ ceph pg repair 17.1c1 $ sudo ceph health detail HEALTH_ERR 1 pgs inconsistent; 2 scrub errors pg 17.1c1 is active+clean+inconsistent, acting [21,25,30] 2 scrub errors $ ceph pg repair 17.1c1 2015-10-18 The Scrub and Repair in Jewel 有的时候需要检查 osd 的 log,才能知道具体是哪个 object 有了问题,进⽽⼿动解决。
  • 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . scrub why ▶ fix the data lost/corruption when we can scrub why ▶ fix the data lost/corruption when we can 2015-10-18 The Scrub and Repair in Jewel scrub 101 scrub 由于硬盘出错导致的 EIO,使得数据不可读。 或者更可怕的是,数据出错,同时客户端没有觉察, 把错误的数据当作正确的数据处理, 从⽽造成灾难性的损失。
  • 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . scrub when ▶ least impact to production scrub when ▶ least impact to production 2015-10-18 The Scrub and Repair in Jewel scrub 101 scrub 1. 限定在 begin hour 和 end hour 之间 2. 在系统负载超过阈值的时候也不会贸然启动 3. 以 osd 为单位,同时进⾏的 scrub 数量是可以配置的
  • 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . scrub when ▶ least impact to production ▶ scheduled periodically scrub when ▶ least impact to production ▶ scheduled periodically 2015-10-18 The Scrub and Repair in Jewel scrub 101 scrub 1. min/max interval 可配置
  • 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . scrub when ▶ least impact to production ▶ scheduled periodically ▶ distributed randomly in configured time range scrub when ▶ least impact to production ▶ scheduled periodically ▶ distributed randomly in configured time range 2015-10-18 The Scrub and Repair in Jewel scrub 101 scrub 1. 试图缓解因为 max interval 导致的⼤量 scrub 任务堆积
  • 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . scrub how ▶ pg based and initialized by the primary ▶ scan in batch scrub how ▶ pg based and initialized by the primary ▶ scan in batch 2015-10-18 The Scrub and Repair in Jewel scrub 101 scrub 1. 每次以 pg 为单位,由 primary 发起,要求所有的 peered 的 pg 参与。 分批次处理 pg ⾥⾯的所有 object,以 object 的 hash 为界。这就是所谓的 chunky scrub。把收集到的 object 信息,组织成⼀个 scrub map,同时也向 replica 请求相应的 scrub map。很明显,scrub map 会包含 object, xattr, omap 的信息, 以供交叉⽐较。
  • 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . scrub how ▶ pg based and initialized by the primary ▶ scan in batch ▶ select authoritive copy scrub how ▶ pg based and initialized by the primary ▶ scan in batch ▶ select authoritive copy 2015-10-18 The Scrub and Repair in Jewel scrub 101 scrub 1. 和⺴购⼀样,⼀个 object 也有好评差评之分。⽐如, object 对应 hash 读不出来了; omap 读不出, object 读出来的⼤⼩或者 digest 不⼀致; 干脆读取的时候出现 EIO。 都会导致⼀个 replica 失去成为权威拷⻉的机会。 只有通过这些考验的副本才能成为权威拷⻉。⽽出问题的情况, 不外乎读取失败,数据缺失,和数据不⼀致⼏种。
  • 12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . scrub how ▶ pg based and initialized by the primary ▶ scan in batch ▶ select authoritive copy ▶ write down the results ▶ fix it (later) scrub how ▶ pg based and initialized by the primary ▶ scan in batch ▶ select authoritive copy ▶ write down the results ▶ fix it (later) 2015-10-18 The Scrub and Repair in Jewel scrub 101 scrub 1. 把 scrub 的结果通过 cluster log 发给 monitor 记录下来。 这些结果包括⼀些 metadata 本⾝不⼀致的情况,⽐如 head 和 snapdir 共存,有 clone, 但是 snapset ⾥⾯的 seq 是空的, 也有 snapset 本⾝出现不⼀致,⽐如说⾃⼰觉得有 3 个 snap,但是发现只有 2 个。 2. 对于 digest 缺失的情况,就直接⽤已知的 digest 重写所有的 replica。 如果数据本⾝不⼀致, 或者缺失,那么就把它放到 missing 列表⾥⾯去。 如果在以后处理 io 请求的时候,正好碰到了缺少的 object,就⽴即恢复它。 即随机选择⼀个 replica,发送 pull 请求, ⽤收到的 push 数据重写本地有错误,或者缺少的 object。
  • 13. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . pain in scrub ▶ ceph pg repair does not always work. ▶ scrub/repair is not programmable/scriptable. pain in scrub ▶ ceph pg repair does not always work. ▶ scrub/repair is not programmable/scriptable. 2015-10-18 The Scrub and Repair in Jewel pain in scrub pain in scrub 现有的机制还⾮常简单,⽆法应对绝⼤多数情形。 有时候需要观察⽇志,删除出错的 replica。 没有实现更智能、更复杂的策略。 ⽐如说少数服从多数的算法,虽然直觉,但是⺫前没有这个设计。 如果 librados 能为 scrub 提供更丰富的接⼝, 那么客户端就能灵活地制定策略,选择有效的副本,指定正确的数据, 甚⾄恢复 snapset/clone,当然这需要对 snapset 有⼀些理解。
  • 14. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . new APIs query ▶ get_inconsistent_pools() => pool[] ▶ get_inconsistent_pgs() => pg[] ▶ get_inconsistent(pg) => epoch, inconsistent[] new APIs query ▶ get_inconsistent_pools() => pool[] ▶ get_inconsistent_pgs() => pg[] ▶ get_inconsistent(pg) => epoch, inconsistent[] 2015-10-18 The Scrub and Repair in Jewel scrub in jewel new APIs 在 scrub 过程中,osd 会把缺失或者不⼀致的信息收集起来。保存在临时 object ⾥⾯,供 scrub API 查询。在同⼀个 pg interval 的期间,可以根据这次 scrub 的结果 进⾏查询,读取,进⾏恢复的操作。 但是如果 interval 过了,scrub API 会返回 EAGAIN 。 inconsistent 会是⼀个带有版本号的 json,⽅便⽇后扩展。 是 osd 到 shard_info_t 的 map, shard_info_t 包含 replica 是否存在, omap_sha1, data_sha1, size 等 metadata。 基于这些数据,⽤户可以实现各种策略,⽐如说投票。
  • 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . new APIs read ▶ operate_on_shard(epoch, osd, shard, op) new APIs read ▶ operate_on_shard(epoch, osd, shard, op) 2015-10-18 The Scrub and Repair in Jewel scrub in jewel new APIs 其中,op 可能是 read 操作,这些操作允许⽤户检查数据。 从⽽选择有效的数据副本,同时跳过 OSD ⼀致性的检查。
  • 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . new APIs choose allows user to ▶ specify the good/bad shard/xattr/omap replica ▶ specify the metadata (snapset in particular) new APIs choose allows user to ▶ specify the good/bad shard/xattr/omap replica ▶ specify the metadata (snapset in particular) 2015-10-18 The Scrub and Repair in Jewel scrub in jewel new APIs 前者本⾝就是数据,没有和其它对象有互相引⽤的关系。 但是 snapset 和 clone 有其⾃⾝的⼀致性问题。所以具体的接⼝还在讨论。 如果操作不当,可能会产⽣更多的不⼀致,有些可能是预料之外的。
  • 17. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Q & A Q & A 2015-10-18 The Scrub and Repair in Jewel Q & A Q & A 段经理在早上,说到 call for action。在这⾥也希望得到⼤家的宝贵意⻅。