This is an invited industry presentation on cloud computing for NCSC2012 (China National Conference on Social Computing).
It summarizes what we learned while developing and operating an Infrastructure as a Service offering in a highly scalable manner. The service described runs inside the corporation as dogfood that engineers work with in their daily work.
5. Quick Stats
● 5,800 VMs provisioned in 2 months
● 700+ individual visitors per month
● 50,000+ requests to web services per single day
– Less than 40% of requests are sent by humans
6. Design for Failure
● “Failure is not an option, it's a requirement.”
● Things will crash
– Linux kernel panic
– Defunct processes
– File system suddenly becomes read-only
● Hardware simply fails, week after week
– Broken disks
– Flaws in CPUs
– Network adapter speed varies among 10/100/1000 Mbps
8. Flakiness
Nov 14 00:39:27 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Nov 14 00:39:35 r007x072 kernel: e1000e: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Nov 14 00:39:35 r007x072 kernel: e1000e 0000:1a:00.1: eth3: 10/100 speed: disabling TSO
Nov 14 00:39:35 r007x072 kernel: bonding: bond1: link status definitely up for interface eth3.
Nov 14 00:39:36 r007x072 kernel: e1000e: eth3 NIC Link is Down
Nov 14 00:39:36 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Analysis: Unqualified Network Cables
9. [root@r007x072 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:98:2a:4c
Slave queue ID: 0
Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
Permanent HW addr: 00:1b:21:98:2a:4d
Slave queue ID: 0
10. Keep It Simple and Robust
● “I have 4 letters for you: KISS (keep it simple and stupid).”
● Complex system === hazardous system
● Just enough fault tolerance
– Reboot the machine if it goes wrong
– Log out of the iSCSI session and log back in (see the sketch below)
– Mini toolkit to fix a broken DM (device mapper) table
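A minimal sketch of the “log out and log back in” fix, assuming an open-iscsi initiator; the target IQN and portal address are placeholders:
# find the flaky session first
iscsiadm -m session
# log out of the session and log back in (IQN and portal are placeholders)
iscsiadm -m node -T iqn.2012-01.com.example:storage.lun1 -p 10.10.1.50:3260 --logout
iscsiadm -m node -T iqn.2012-01.com.example:storage.lun1 -p 10.10.1.50:3260 --login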
11. Example: Stateless OS
● Mount the root partition in RAM
– Think about how you install Ubuntu or Fedora from live media
● Fix problems by reboot only (a sketch follows the df output)
[root@r009x090 ~]# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/live-rw  7.9G  1.5G  6.4G  19% /
tmpfs                 71G  4.0K   71G   1% /dev/shm
/dev/sda2            7.9G  1.4G  6.2G  18% /var/log
/dev/sda4            1.6T  183G  1.4T  12% /iaas/local-storage
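A rough illustration of the idea, not the deck's actual boot setup: keep the writable root in RAM so that a reboot always starts from a clean copy, while logs and bulky data stay on real disks (the image path is a placeholder):
# stage a scratch root filesystem in RAM (size matches the 7.9G root above)
mount -t tmpfs -o size=8G tmpfs /mnt/ramroot
rsync -a /iaas/images/base-rootfs/ /mnt/ramroot/
# persistent pieces stay on real disk partitions, exactly as the df output shows
mount /dev/sda2 /var/log
mount /dev/sda4 /iaas/local-storage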
12. P2P-based Socialized Communication
● Bots “talk” to each other
● Any of them can be re-run in seconds when things go wrong (see the respawn sketch below)
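One simple way to get the “re-run in seconds” property, shown only as an illustration; the deck's own bots are supervised by watch dogs, and "bot-worker" is a placeholder command:
# naive respawn loop: whenever the bot process exits, start it again
while true; do
    bot-worker || echo "bot exited, restarting" >&2
    sleep 2
done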
13. Robust Application
● A number of roles in the distributed system do their own jobs
– Bot, manager, watch dog, ZooKeeper, agent, HBase, Hadoop, etc.
[Diagram: regular bots, a watch dog, and a manager bot sit on top of ZooKeeper, HBase region servers, and HDFS data nodes]
14. Dedicated Network-accessible Services
● NTP (controversial in a VM, but good enough)
● ZooKeeper (see the zkCli.sh sketch below)
– Node presence
– Configuration data
– Leader election
● HBase: stores schema-less data
● Rsyslog: centralizes logs
● Web Service: accepts HTTP requests only
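A minimal sketch of how the first two ZooKeeper uses look with the stock zkCli.sh client; the server address and znode paths are assumptions, not the deck's real layout:
# ephemeral node: disappears automatically when the bot's session dies,
# which is all that node presence needs
zkCli.sh -server zk1.region-a:2181 create -e /bots/r007x072 "alive"
# regular node holding a piece of configuration data read by every bot
zkCli.sh -server zk1.region-a:2181 create /config/deploy-timeout "300"
# list the bots that are currently alive
zkCli.sh -server zk1.region-a:2181 ls /bots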
15. Scale-out Architecture for Growth
● Single namespace for the global infrastructure
– v525400ffffff.region-a.cloud.xx.ibm.com/service-foo
● Multi-region for geo-distribution
● Use caches when possible
● Share nothing; keep components autonomous
● Leader election (elect a new manager if the former one dies; a sketch follows this list)
● Collect metrics
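A sketch of leader election on top of ZooKeeper sequence nodes, again with placeholder paths and not necessarily the deck's own recipe: every manager candidate creates an ephemeral sequential znode, the candidate holding the lowest sequence number acts as leader, and when it dies its znode vanishes so the next-lowest takes over.
# each candidate registers itself; -s makes the name sequential, -e makes it ephemeral
zkCli.sh -server zk1.region-a:2181 create -s -e /election/manager- "r007x072"
# inspect the candidates; the lowest sequence number is the current leader
zkCli.sh -server zk1.region-a:2181 ls /election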
16. Requirements Grow and Shrink Faster than Hardware Purchasing
● “I need 200 large VMs this afternoon and will terminate all of them tomorrow.”
17. Storage Is Never Enough
● Workaround: recycle unused files (see the sketch below)
– Move rarely used virtual images out of the hot zone
– Set up SLAs to limit availability (provide redundancy only when necessary)
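A sketch of the “move cold images out of the hot zone” step; the image and cold-storage paths, the 30-day threshold, and the .img naming are assumptions for illustration:
# relocate virtual images that have not been read for 30 days
find /iaas/local-storage/images -maxdepth 1 -name '*.img' -atime +30 \
    -exec mv -v {} /iaas/cold-storage/ \;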
18. Metrics Collection Is Critical
● “Gathering, storing, and displaying metrics should be considered a mission-critical part of your infrastructure.”*
● Measure every performance tweak to confirm the boost (or catch the downgrade)
(* From chapter 3 of the book “Web Operations”)
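A minimal collection sketch, assuming the sysstat package and the centralized rsyslog mentioned earlier; the metric tags and the one-minute sampling window are arbitrary choices:
# sample CPU and per-interface network counters for one minute,
# then forward the summary lines through syslog to the central log store
sar -u 60 1 | tail -n 1 | logger -t metrics.cpu
sar -n DEV 60 1 | grep eth0 | tail -n 1 | logger -t metrics.net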
19. Example #1: Fix the Side Effect of the Leap Second
● The latest leap second occurred at the end of June 2012
– /var/log/messages grew far too fast
– Job distribution between bots took 10 times longer than usual
tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
# service ntpd stop; date -s "`date`"; service ntpd start
20. Example #2: Recycle Unused Resources
@zhukecdl Our analysis of your VM instance(s) shows that CPU utilization and network traffic in the past 48 hours have dropped below 2% and 10 MB.
Instance ID              CPU Time (s)  CPU Rate (%)  TX (MB)
r007.x072.17897.u51393   337.3         0.20          0
We would strongly urge you to consider recycling your instance(s) so that others can make use of these resources. If you do not contact the administrator before 2011-08-16 17:00+0800, the instance r007.x072.17897.u51393 will be recycled.
Regards,
21. Automated Operation (and More)
● Goals
– Upgrade all components daily
– One administrator per 1k systems
– No working overtime
● Tools
– Chef (Ruby; see the sketch below)
– SmartCloud portfolio
● Process
– Run benchmarks against the system every week
– Stay in the office until the build break is fixed
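A sketch of the daily-upgrade goal driven by Chef, assuming nodes are already registered and a role such as role:iaas-node exists (both names are placeholders):
# push the latest cookbooks to the Chef server
knife cookbook upload --all
# converge every registered node over ssh so it picks up today's changes
knife ssh 'role:iaas-node' 'sudo chef-client'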
22. Run Benchmarks Against the System Often
● Measure the effect of your performance tweaks
● Tools
– netperf
– Apache JMeter
Benchmarking the network infrastructure
# netserver
# netperf -H 10.10.1.97 -l 43200 -t TCP_CRR &
# netperf -H 9.123.127.227 -l 43200 -t TCP_CRR &
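For the web-service side, a JMeter run in non-GUI mode would pair naturally with the netperf lines above; the test plan and result file names are placeholders:
Benchmarking the web service
# jmeter -n -t deploy-vm.jmx -l results.jtl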
23. Infrastructure as Code
● Building network-accessible services
● Integrating these services
[root@beijing-mn03 ~]# virsh list
Id Name State
----------------------------------
1 hbm1 running
2 bj-jenkins running
3 hjt running
4 webservice-1 running
5 hnn2 running
6 bugzilla running
8 hslave07 running
9 hslave08 running
10 hslave09 running
11 hslave10 running
12 hslave11 running
13 ScannerSlackware running
24. Real-time Feedback by Tracing Logs
● Manager X: “I need a daily success-rate report on VM deployments from department Y, today.” (a report sketch follows)
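A sketch of how such a report could be pulled from the centralized rsyslog files; the log path, the message format, and the department tag are all assumptions:
# count deploy-VM attempts for department Y and compute today's success rate
grep 'deploy-vm' /var/log/messages | grep 'dept=Y' \
  | awk '/result=success/ {ok++} {total++} END {if (total) printf "success rate: %.1f%% (%d/%d)\n", 100*ok/total, ok, total}'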