This is an invited industry presentation on cloud computing for NCSC2012 (China National Conference on Social Computing).
It summarizes what we learned while developing and operating an Infrastructure as a Service offering in a highly scalable manner. The service described runs inside the corporation as dogfood that engineers work with in their daily work.
5. Quick Stats
● 5,800 VMs provisioned in 2 months
● 700+ individual visitors per month
● 50,000+ requests to web services per single day
– Less than 40% of requests are sent by humans
6. Design for Failure
● “Failure is not an option, it's a requirement.”
● Things will crash
– Linux kernel panic
– Defunct processes
– File system suddenly becomes read-only
● Hardware simply fails, week after week
– Broken disks
– Flaws in CPUs
– Network adapter speed varies among 10/100/1000 Mbps
8. Flakiness
Nov 14 00:39:27 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Nov 14 00:39:35 r007x072 kernel: e1000e: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Nov 14 00:39:35 r007x072 kernel: e1000e 0000:1a:00.1: eth3: 10/100 speed: disabling TSO
Nov 14 00:39:35 r007x072 kernel: bonding: bond1: link status definitely up for interface eth3.
Nov 14 00:39:36 r007x072 kernel: e1000e: eth3 NIC Link is Down
Nov 14 00:39:36 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Analysis: Unqualified Network Cables
9. [root@r007x072 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:98:2a:4c
Slave queue ID: 0
Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
Permanent HW addr: 00:1b:21:98:2a:4d
Slave queue ID: 0
10. Keep It Simple and Robust
● “I have 4 letters for you: KISS (keep it simple and stupid).”
● Complex system === hazardous system
● Just enough fault tolerance
– Reboot the machine if it goes wrong
– Log out of the iSCSI session and log back in (see the sketch below)
– Mini toolkit to fix a broken DM (device mapper) table
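A minimal sketch of the “log out and log back in” fix, assuming an open-iscsi initiator; the target IQN and portal address are placeholders:
# find the flaky session first
iscsiadm -m session
# log out of the session and log back in (IQN and portal are placeholders)
iscsiadm -m node -T iqn.2012-01.com.example:storage.lun1 -p 10.10.1.50:3260 --logout
iscsiadm -m node -T iqn.2012-01.com.example:storage.lun1 -p 10.10.1.50:3260 --login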
11. Example: Stateless OS
● Mount the root partition in RAM
– Think about how you install Ubuntu or Fedora from live media
● Fix problems by reboot only (a sketch follows the df output)
[root@r009x090 ~]# df -h
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/live-rw  7.9G  1.5G  6.4G  19% /
tmpfs                 71G  4.0K   71G   1% /dev/shm
/dev/sda2            7.9G  1.4G  6.2G  18% /var/log
/dev/sda4            1.6T  183G  1.4T  12% /iaas/local-storage
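A rough illustration of the idea, not the deck's actual boot setup: keep the writable root in RAM so that a reboot always starts from a clean copy, while logs and bulky data stay on real disks (the image path is a placeholder):
# stage a scratch root filesystem in RAM (size matches the 7.9G root above)
mount -t tmpfs -o size=8G tmpfs /mnt/ramroot
rsync -a /iaas/images/base-rootfs/ /mnt/ramroot/
# persistent pieces stay on real disk partitions, exactly as the df output shows
mount /dev/sda2 /var/log
mount /dev/sda4 /iaas/local-storage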
12. P2P-based Socialized Communication
● Bots “talk” to each other
● Any of them can be re-run in seconds when things go wrong (see the respawn sketch below)
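One simple way to get the “re-run in seconds” property, shown only as an illustration; the deck's own bots are supervised by watch dogs, and "bot-worker" is a placeholder command:
# naive respawn loop: whenever the bot process exits, start it again
while true; do
    bot-worker || echo "bot exited, restarting" >&2
    sleep 2
done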
13. Robust Application
● A number of roles in the distributed system do their own jobs
– Bot, manager, watch dog, ZooKeeper, agent, HBase, Hadoop, etc.
[Diagram: regular bots, a watch dog, and a manager bot sit on top of ZooKeeper, HBase region servers, and HDFS data nodes]
14. Dedicated Network-accessible Services
● NTP (controversial in a VM, but good enough)
● ZooKeeper (see the zkCli.sh sketch below)
– Node presence
– Configuration data
– Leader election
● HBase: stores schema-less data
● Rsyslog: centralizes logs
● Web Service: accepts HTTP requests only
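A minimal sketch of how the first two ZooKeeper uses look with the stock zkCli.sh client; the server address and znode paths are assumptions, not the deck's real layout:
# ephemeral node: disappears automatically when the bot's session dies,
# which is all that node presence needs
zkCli.sh -server zk1.region-a:2181 create -e /bots/r007x072 "alive"
# regular node holding a piece of configuration data read by every bot
zkCli.sh -server zk1.region-a:2181 create /config/deploy-timeout "300"
# list the bots that are currently alive
zkCli.sh -server zk1.region-a:2181 ls /bots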
15. Scale-out Architecture for Growth
● Single namespace for the global infrastructure
– v525400ffffff.region-a.cloud.xx.ibm.com/service-foo
● Multi-region for geo-distribution
● Use caches when possible
● Share nothing; keep components autonomous
● Leader election (elect a new manager if the former one dies; a sketch follows this list)
● Collect metrics
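A sketch of leader election on top of ZooKeeper sequence nodes, again with placeholder paths and not necessarily the deck's own recipe: every manager candidate creates an ephemeral sequential znode, the candidate holding the lowest sequence number acts as leader, and when it dies its znode vanishes so the next-lowest takes over.
# each candidate registers itself; -s makes the name sequential, -e makes it ephemeral
zkCli.sh -server zk1.region-a:2181 create -s -e /election/manager- "r007x072"
# inspect the candidates; the lowest sequence number is the current leader
zkCli.sh -server zk1.region-a:2181 ls /election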
16. Requirements Grow and Shrink Faster than Hardware Purchasing
● “I need 200 large VMs this afternoon and will terminate all of them tomorrow.”
17. Storage Is Never Enough
● Workaround: recycle unused files (see the sketch below)
– Move rarely used virtual images out of the hot zone
– Set up SLAs to limit availability (provide redundancy only when necessary)
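A sketch of the “move cold images out of the hot zone” step; the image and cold-storage paths, the 30-day threshold, and the .img naming are assumptions for illustration:
# relocate virtual images that have not been read for 30 days
find /iaas/local-storage/images -maxdepth 1 -name '*.img' -atime +30 \
    -exec mv -v {} /iaas/cold-storage/ \;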
18. Metrics Collection Is Critical
● “Gathering, storing, and displaying metrics should be considered a mission-critical part of your infrastructure.”*
● Measure every performance tweak to confirm the boost (or catch the downgrade)
(* From chapter 3 of the book “Web Operations”)
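A minimal collection sketch, assuming the sysstat package and the centralized rsyslog mentioned earlier; the metric tags and the one-minute sampling window are arbitrary choices:
# sample CPU and per-interface network counters for one minute,
# then forward the summary lines through syslog to the central log store
sar -u 60 1 | tail -n 1 | logger -t metrics.cpu
sar -n DEV 60 1 | grep eth0 | tail -n 1 | logger -t metrics.net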
19. Example #1: Fix the Side Effect of the Leap Second
● The latest leap second occurred at the end of June 2012
– /var/log/messages grew far too fast
– Job distribution between bots took 10 times longer than usual
tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
# service ntpd stop; date -s "`date`"; service ntpd start
20. Example #2: Recycle Unused Resources
@zhukecdl Our analysis of your VM instance(s) shows that CPU utilization and network traffic in the past 48 hours have dropped below 2% and 10 MB.
Instance ID              CPU Time (s)  CPU Rate (%)  TX (MB)
r007.x072.17897.u51393   337.3         0.20          0
We would strongly urge you to consider recycling your instance(s) so that others can make use of these resources. If you do not contact the administrator before 2011-08-16 17:00+0800, the instance r007.x072.17897.u51393 will be recycled.
Regards,
21. Automated Operation (and More)
● Goals
– Upgrade all components daily
– One administrator per 1k systems
– No working overtime
● Tools
– Chef (Ruby; see the sketch below)
– SmartCloud portfolio
● Process
– Run benchmarks against the system every week
– Stay in the office until the build break is fixed
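A sketch of the daily-upgrade goal driven by Chef, assuming nodes are already registered and a role such as role:iaas-node exists (both names are placeholders):
# push the latest cookbooks to the Chef server
knife cookbook upload --all
# converge every registered node over ssh so it picks up today's changes
knife ssh 'role:iaas-node' 'sudo chef-client'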
22. Run Benchmarks Against the System Often
● Measure the effect of your performance tweaks
● Tools
– netperf
– Apache JMeter
Benchmarking the network infrastructure
# netserver
# netperf -H 10.10.1.97 -l 43200 -t TCP_CRR &
# netperf -H 9.123.127.227 -l 43200 -t TCP_CRR &
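For the web-service side, a JMeter run in non-GUI mode would pair naturally with the netperf lines above; the test plan and result file names are placeholders:
Benchmarking the web service
# jmeter -n -t deploy-vm.jmx -l results.jtl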
23. Infrastructure as Code
● Building network-accessible services
● Integrating these services
[root@beijing-mn03 ~]# virsh list
Id Name State
----------------------------------
1 hbm1 running
2 bj-jenkins running
3 hjt running
4 webservice-1 running
5 hnn2 running
6 bugzilla running
8 hslave07 running
9 hslave08 running
10 hslave09 running
11 hslave10 running
12 hslave11 running
13 ScannerSlackware running
24. Real-time Feedback by Tracing Logs
● Manager X: “I need a daily success-rate report on VM deployments from department Y, today.” (a report sketch follows)
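A sketch of how such a report could be pulled from the centralized rsyslog files; the log path, the message format, and the department tag are all assumptions:
# count deploy-VM attempts for department Y and compute today's success rate
grep 'deploy-vm' /var/log/messages | grep 'dept=Y' \
  | awk '/result=success/ {ok++} {total++} END {if (total) printf "success rate: %.1f%% (%d/%d)\n", 100*ok/total, ok, total}'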