We all agree that recurring operational tasks are time-consuming nuisances that should be eradicated through automation. Sometimes, however, they require careful coordination, hardware manipulation and, worst of all, human interaction.
Recently, we found that our code doesn’t really need to pass a Turing test in order to interact successfully with humans and convince them to take part in an automated process. In this talk I’ll describe how we automated disk replacement for our HDFS clusters, even though we have to communicate with the hosting provider by email, and how we keep the process from failing at scale.
3. A single hard disk is great
● Will store ALL your porn data
○ 2015: 10TB helium-filled HDD
● Cheap
○ 2015: $0.032 per gigabyte
● Will rarely fail
○ Unless it’s really inconvenient
8. Problem
1. Detect
2. Stop usage
3. Request replacement
4. <Physical replacement>
5. Initialize disk
6. Resume usage
No API for actual replacement
:,(
We want:
replace(server_id, disk_id)
is_replaced(server_id, disk_id)
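Since the provider offers no API, one way to get a replace(server_id, disk_id) call is to wrap the email request in a function. The sketch below is only an illustration of that idea: the addresses, subject line and message wording are hypothetical, and the actual sending is left to a caller-supplied `send` function (for example a thin wrapper around `smtplib`).

```python
from email.message import EmailMessage

# Hypothetical addresses; the talk does not name the real ones.
PE_GROUP = "pe-group@example.com"
PROVIDER_SUPPORT = "support@hosting-provider.example"


def build_replacement_request(server_id: str, disk_id: str) -> EmailMessage:
    """Compose the email asking the hosting provider to swap one disk."""
    msg = EmailMessage()
    msg["From"] = PE_GROUP
    msg["To"] = PROVIDER_SUPPORT
    msg["Subject"] = f"Disk replacement request: server {server_id}, disk {disk_id}"
    msg.set_content(
        "Please replace the failed disk described below.\n"
        f"Server: {server_id}\n"
        f"Disk: {disk_id}\n"
    )
    return msg


def replace(server_id: str, disk_id: str, send=None) -> EmailMessage:
    """Request a physical replacement; `send` is any callable taking the message."""
    msg = build_replacement_request(server_id, disk_id)
    if send is not None:
        send(msg)
    return msg
```

Keeping message construction separate from delivery makes the request easy to test without a mail server.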
9. is_replaced(server_id, disk_id)
● We have RAID adapters on all the servers
● They can emit the Disk Serial Number
is_replaced(server_id, disk_id):
    return STORED_SERIAL_NUMBERS(server_id, disk_id) != get_serial_number(server_id, disk_id)
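The pseudocode above could look roughly like this in Python. This is a sketch under assumptions: the stored serial numbers are modelled as a plain dictionary, `get_serial_number` is passed in as a callable that queries the RAID adapter, and a reading of `None` (empty slot or unreadable disk) is treated as "not replaced yet", which the slides do not specify.

```python
from typing import Callable, Dict, Optional, Tuple

DiskKey = Tuple[str, str]  # (server_id, disk_id)


def is_replaced(
    server_id: str,
    disk_id: str,
    stored_serials: Dict[DiskKey, str],
    get_serial_number: Callable[[str, str], Optional[str]],
) -> bool:
    """Return True once the slot reports a serial number different from the stored one."""
    current = get_serial_number(server_id, disk_id)
    if current is None:
        # Assumption: no readable serial means the swap is not finished.
        return False
    return stored_serials[(server_id, disk_id)] != current
```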
16. Disk replacement process
1. Detect
a. SMART / Vendor utility
b. Monitoring service API
2. Stop usage
a. Application API
b. Operating system (kill processes, umount, delete SCSI device)
3. Request replacement
a. replace(server_id, disk_id)
4. <Physical replacement>
a. while not is_replaced(server_id, disk_id): sleep(t)
5. Initialize disk
a. [RAID], Partition Table, Filesystem, Directories, Permissions
6. Resume usage
a. Application API
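Steps 2 to 6 of the process above can be sketched as a small driver. Everything here is illustrative: the step names in the `steps` dictionary are hypothetical hooks, and the timeout on the waiting loop is an added assumption (the slide only shows an unbounded `while ... sleep(t)`), included so that a request the provider never acts on does not block forever.

```python
import time
from typing import Callable, Dict


def wait_for_replacement(is_replaced: Callable[[], bool], poll_seconds: float,
                         timeout_seconds: float, sleep=time.sleep,
                         clock=time.monotonic) -> bool:
    """Poll until the disk is swapped; return False if the deadline passes first."""
    deadline = clock() + timeout_seconds
    while not is_replaced():
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    return True


def replace_disk(server_id: str, disk_id: str, steps: Dict[str, Callable],
                 poll_seconds: float = 300.0,
                 timeout_seconds: float = 7 * 24 * 3600.0,
                 sleep=time.sleep, clock=time.monotonic) -> bool:
    """Run steps 2-6 for a disk that detection (step 1) already flagged."""
    steps["stop_usage"](server_id, disk_id)           # 2. Stop usage
    steps["request_replacement"](server_id, disk_id)  # 3. Request replacement
    swapped = wait_for_replacement(                   # 4. <Physical replacement>
        lambda: steps["is_replaced"](server_id, disk_id),
        poll_seconds, timeout_seconds, sleep, clock,
    )
    if not swapped:
        return False  # disk stays out of service; a human should follow up
    steps["initialize_disk"](server_id, disk_id)      # 5. Initialize disk
    steps["resume_usage"](server_id, disk_id)         # 6. Resume usage
    return True
```

Passing `sleep` and `clock` as parameters keeps the loop testable without actually waiting.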
17. Safety is #1 priority!
● Check Pre-conditions
○ Under-replicated blocks
○ Missing blocks
● Single replacement per-cluster
● Handle special cases
○ Root device
○ User facing services
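The checks above can be expressed as a single gate that runs before any disk is taken out of service. This is a minimal sketch with assumed thresholds: it requires zero under-replicated and zero missing blocks, allows no other replacement in the same cluster, and hands root devices and user-facing hosts to a human. The slides say these special cases are handled, not how, so that policy is a guess.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusterState:
    under_replicated_blocks: int
    missing_blocks: int
    replacements_in_progress: int


def safe_to_replace(state: ClusterState, is_root_device: bool = False,
                    hosts_user_facing_service: bool = False) -> Tuple[bool, str]:
    """Return (ok, reason); ok is True only when every pre-condition holds."""
    if state.missing_blocks > 0:
        return False, "cluster has missing blocks"
    if state.under_replicated_blocks > 0:
        return False, "cluster has under-replicated blocks"
    if state.replacements_in_progress > 0:
        return False, "another replacement is already running in this cluster"
    if is_root_device:
        return False, "root device: needs manual handling"  # assumed policy
    if hosts_user_facing_service:
        return False, "user-facing service: needs manual handling"  # assumed policy
    return True, "ok"
```

Returning a reason alongside the boolean makes it easy to include the refusal cause in notifications and the audit log.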
18. Notifications & Audit
● All emails originate from PE group address
● CC to PE group and Slack
● Audit log sent to Elasticsearch
○ SOON: Kibana dashboard
19. Summary
● One disk GOOD
● Many disks BAD
● Avoid Fail at Scale
● No need to pass a Turing Test to interact with humans
replace(1.2.3.4, 0:1:2)