Practice and Challenges from Building Infrastructure-as-a-Service
朱可
zhukecdl@cn.ibm.com
Disclaimer
● Representing personal opinion only
IaaS in Our Development Lab
● Virtual machine
● Block storage
● Virtual machine template
● VLAN
● Static IP address
● Virtual desktop


    $ ./iaas-deploy-vms -i centos63 -n 100
The Machinery


                Node: 16 cores, 192 GB RAM, 1.6 TB




Rack: 20+ nodes, 2 rack switches
Quick Stats
● 5,800 VMs provisioned in 2 months
● 700+ individual visitors per month
● 50,000+ requests to web services per day
  – Less than 40% of requests are sent by humans
Design for Failure
● “Failure is not an option, it's a requirement.”
● Things will crash
  – Linux kernel panic
  – Defunct processes
  – File system suddenly becomes read-only
● Hardware breaks every week
  – Broken disks
  – Flaws in CPUs
  – Network adapter speed varies among 10/100/1000 Mbps
Events in Red: Failures
Flakiness
Nov 14 00:39:27 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Nov 14 00:39:35 r007x072 kernel: e1000e: eth3 NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Nov 14 00:39:35 r007x072 kernel: e1000e 0000:1a:00.1: eth3: 10/100 speed: disabling TSO
Nov 14 00:39:35 r007x072 kernel: bonding: bond1: link status definitely up for interface eth3.
Nov 14 00:39:36 r007x072 kernel: e1000e: eth3 NIC Link is Down
Nov 14 00:39:36 r007x072 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it




Analysis: unqualified network cables (note the Link Failure Count of 1627 on eth3 in the bonding status below)
[root@r007x072 ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: adaptive load balancing
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1b:21:98:2a:4c
Slave queue ID: 0

Slave Interface: eth3
MII Status: up
Link Failure Count: 1627
Permanent HW addr: 00:1b:21:98:2a:4d
Slave queue ID: 0
Keep It Simple and Robust
● “I have 4 letters for you: KISS (Keep it simple and stupid)”
● Complex system === hazardous system
● Just enough fault tolerance (a minimal recovery sketch follows)
  – Reboot the machine if it goes wrong
  – Log out of the iSCSI session and log in again
  – Mini toolkit to fix a broken DM (device mapper) table
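
A minimal sketch of what the last two recovery steps can look like; the
iSCSI target name, portal address, and device-mapper name below are
placeholders, not the actual toolkit:

    # Re-establish a stuck iSCSI session by logging out and back in
    # (iqn.example:target0 and 10.10.1.5:3260 are placeholder values)
    iscsiadm -m session
    iscsiadm -m node -T iqn.example:target0 -p 10.10.1.5:3260 --logout
    iscsiadm -m node -T iqn.example:target0 -p 10.10.1.5:3260 --login

    # Inspect device-mapper tables and drop a stale mapping
    # (broken-volume is a placeholder device name)
    dmsetup table
    dmsetup remove broken-volume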
Example: Stateless OS
● Mount the root partition in RAM
  – Think about how you install Ubuntu or Fedora (the installer runs from a RAM-backed root)
● Fix problems by reboot only

[root@r009x090 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/live-rw   7.9G  1.5G  6.4G  19% /
tmpfs                  71G  4.0K   71G   1% /dev/shm
/dev/sda2             7.9G  1.4G  6.2G  18% /var/log
/dev/sda4             1.6T  183G  1.4T  12% /iaas/local-storage
P2P-based Socialized Communication
● Bots “talk” to each other
● Any bot can be re-run in seconds when things go wrong
Robust Application
● A number of roles in the distributed system each do their own job
  – Bot, manager, watchdog, ZooKeeper, agent, HBase, Hadoop, etc.
(Diagram: regular bots, a watchdog, and a manager bot coordinate through ZooKeeper, backed by HBase region servers and HDFS data nodes; ZooKeeper image from http://zookeeper.apache.org/images/zookeeper_small.gif)
Dedicated Network-accessible Services
● NTP (controversial inside a VM, but good enough)
● ZooKeeper (see the zkCli sketch after this list)
  – Node presence
  – Configuration data
  – Leader election
● HBase: store schema-less data
● Rsyslog: centralize logs
● Web service: accepts HTTP requests only
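
A rough idea of how the ZooKeeper uses above look with the stock
zkCli.sh client; the server name, paths, and values are made up for
illustration:

    # Connect with the CLI shipped in the ZooKeeper distribution
    bin/zkCli.sh -server zk1:2181

    # Node presence: an ephemeral znode disappears when the bot's session dies
    create -e /bots/r007x072 "alive"
    ls /bots

    # Configuration data: a plain znode any bot can read
    create /config/image-store "nfs://storage01/images"
    get /config/image-store

    # Leader election: sequential ephemeral znodes; lowest sequence number leads
    create -s -e /election/bot- "r007x072"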
Scale-out Architecture for Growth
● Single namespace for the global infrastructure (a lookup sketch follows this list)
  – v525400ffffff.region-a.cloud.xx.ibm.com/service-foo
● Multi-region for geo-distribution
● Use caches when possible
● Share nothing; keep components autonomous
● Leader election (elect a new manager if the former one dies)
● Collect metrics
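
A hypothetical lookup through the single namespace, reusing the hostname
pattern from the bullet above (the resolved address and service response
are not real):

    # Resolve the per-VM hostname, then call its service over HTTP
    dig +short v525400ffffff.region-a.cloud.xx.ibm.com
    curl http://v525400ffffff.region-a.cloud.xx.ibm.com/service-foo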
Requirements Grow and Shrink Faster Than Hardware Can Be Purchased
● “I need 200 large VMs this afternoon and will terminate all of them tomorrow.”
Storage Is Never Enough
● Workaround: recycle unused files (a possible recycling pass is sketched below)
  – Move low-hit virtual images out of the hot zone
  – Set up an SLA to limit availability (provide redundancy only when necessary)
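
One possible shape of such a recycling pass, assuming images live under
the /iaas/local-storage mount shown earlier; the cold-storage path and
the 30-day threshold are assumptions:

    # Move template images that have not been read for 30 days out of the hot zone
    find /iaas/local-storage/images -type f -atime +30 \
        -exec mv -v {} /iaas/cold-storage/ \;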
Metrics Collection Is Critical
● “Gathering, storing, and displaying metrics should be considered a mission-critical part of your infrastructure.”*
● Measure every performance boost (or downgrade)

(* from chapter 3 of the book “Web Operations”)
Example #1: Fix the Side Effect of the Leap Second
● The latest leap second occurred at the end of June 2012
● Symptoms
  – /var/log/messages grows too fast
  – Job distribution between bots takes 10 times longer
  – tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
● Fix: stop ntpd, set the clock to its current value (which clears the kernel timer state the leap second left behind), and restart ntpd

  # service ntpd stop; date -s "`date`"; service ntpd start
Example #2: Recycle Unused Resources

@zhukecdl Our analysis of your VM instance(s) shows that
CPU utilization and network traffic in the past 48 hours
have dropped below 2% and 10 MB, respectively.

Instance ID                  CPU Time (s)   CPU Rate (%)   TX (MB)
r007.x072.17897.u51393       337.3          0.20           0

We strongly urge you to consider recycling your
instance(s) so that others can make use of these resources.

If you do not contact the administrator before 2011-08-16
17:00+0800, the instance r007.x072.17897.u51393 will be recycled.

Regards,
Automated Operation (and More)
● Goals
  – Upgrade all components daily
  – One administrator per 1,000 systems
  – No overtime
● Tools (a hypothetical Chef run is sketched below)
  – Chef (Ruby)
  – SmartCloud portfolio
● Process
  – Benchmark the system every week
  – Stay in the office until a build break is fixed
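
As a sketch of the Chef-driven piece, a daily upgrade could boil down to
something like this; the cookbook name and node search query are
hypothetical:

    # Publish the updated cookbook, then converge every compute node
    knife cookbook upload iaas-node
    knife ssh 'role:compute' 'sudo chef-client' -x deploy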
Run Benchmarks Against the System Often
● Measure the effect of your performance tweaks
● Tools
  – Netperf
  – Apache JMeter (a headless run is sketched below)



               Benchmarking network infrastructure
     # netserver
     # netperf -H 10.10.1.97 -l 43200 -t TCP_CRR &
     # netperf -H 9.123.127.227 -l 43200 -t TCP_CRR &
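
For the web-service side, JMeter can run the same way, headless, from a
scheduled job; the test-plan and result file names are placeholders:

    # Non-GUI JMeter run against the IaaS web service
    jmeter -n -t iaas-webservice.jmx -l results.jtl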
Infrastructure as Code
● Building network-accessible services
● Integrating these services
          [root@beijing-mn03 ~]# virsh list
           Id Name                 State
          ----------------------------------
            1 hbm1                 running
            2 bj-jenkins           running
            3 hjt                  running
            4 webservice-1         running
            5 hnn2                 running
            6 bugzilla             running
            8 hslave07             running
            9 hslave08             running
           10 hslave09             running
           11 hslave10             running
           12 hslave11             running
           13 ScannerSlackware     running
Real-time Feedback by Tracing Logs
● Manager X: “I need a daily success-rate report on VM deployments from department Y, today.”
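
With logs already centralized by rsyslog, that kind of report can in
principle be a one-liner; the log path and message format below are
invented for illustration:

    # Daily success rate of VM deployments for department Y (hypothetical log format)
    grep "deploy-vm dept=Y" /var/log/iaas/$(date +%F).log |
      awk '{ total++ } /result=success/ { ok++ }
           END { if (total) printf "success rate: %.1f%%\n", 100 * ok / total }'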
Visualize Traces Via Timeline
Summary
● Keep it simple and robust
● Scale-out architecture
● Automated operation
