5+ years of experience designing, setting up, testing & running production web systems in varied deployment environments
Experience setting up colocation IDCs with Active-Active DR sites for India’s No. 1 OTA
Experience working on public cloud platforms like AWS and setting up private cloud infrastructure
…Generation G : Gamification /engineer/
Tags: techie, open source enthusiast, engineer, geek, DevOps, web ops, security, Tripper (MMYT), Ex-Nextag-ian :)
Avoid unnecessary change by selecting a long-term supported distribution on which to base your platform.
◦ RHEL / CentOS
◦ Ubuntu LTS (Long Term Support)
◦ Debian Stable
My preference: RHEL / CentOS (Red Hat stability & yum wins)
Use your capacity model to drive the decision on how you build infrastructure: check SLAs & cost constraints.
◦ 100% dedicated hardware (self-managed / outsourced)
◦ 100% cloud (may consider AWS or Rackspace)
◦ Hybrid
Cloud success relies on “automating” key service management processes to optimize the run-time operation of /dynamic workloads/ in a shared-resource environment.
Split each service (/layer) out across its own set of servers for easier scale-out and management.
◦ Traffic management (both global & local traffic management)
◦ Application servers
◦ Data store servers
◦ Email services
+ Minimize distribution of state: keep the services that require storage (like data services) to a minimum, for ease of backups and management
Use redundant pairs (on devices/appliances), /HA/ & clustering or failover to ensure availability of service(s).
◦ Minimum downtime
◦ Application & services redundancy + load-balanced cluster on one site & DR too
◦ DB HA + data store (MySQL) backup and recovery
◦ Choose and implement the best-suited failover strategy
◦ Redundant network on each node (+ on server: Linux NIC bonding)
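As one illustration of a failover strategy, here is a minimal sketch of active-passive selection: walk the nodes in priority order and use the first one that passes a health check. The node names and the dict-based health lookup are hypothetical stand-ins; a real setup would probe the service and usually rely on a proven clustering tool rather than hand-rolled logic.

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Hypothetical node names; health is simulated with a dict instead of a real probe.
health = {"db-primary": False, "db-standby": True, "db-dr": True}
priority = ["db-primary", "db-standby", "db-dr"]

active = pick_active(priority, lambda node: health[node])
print(active)
```

Keeping the priority list explicit makes the failover order (primary, local standby, then DR) easy to review and test.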
◦ Dev, QA and staging platforms (both application & N/W platform) to prove application and configuration changes before they go live into production.
◦ Most live-site issues are due to the lack of a similar configuration environment / platform for Dev / QA / Staging testing.
◦ LAB environments: Performance/Stress LAB; Experimentation LAB (A/B or multivariate experiments) with live-traffic support
Virtualization is key here :) ...actually this is what is changing the world ...not the cloud !!
+ Selecting the right virtualization technology
Use network boot and installer tools, or templated provisioning, to build servers identically.
◦ PXE Boot + Kickstart
◦ VMware ESXi template / Citrix XenServer
◦ Amazon AMI (EC2)
◦ OpenNebula
Package management: YUM repositories (distribution + own)
Create your own repository servers for both packages and code.
Use configuration management tools to deploy configuration automatically from a central location.
◦ Puppet / Facter
◦ Chef
◦ CFEngine (Nova)
◦ RANCID (N/W devices)
Use a central service for identity and password management
◦ OpenLDAP
◦ Active Directory
◦ TACACS+ (N/W devices)
Have proper accounting/audit logging
Inventory management:
◦ Use Facter facts + CMDB-based inventory management
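A rough sketch of the facts-to-inventory idea: take per-host JSON fact dumps (for example, what `facter --json` prints) and keep only the fields the inventory needs. The selected fact names are the ones legacy Facter releases report and may differ in your version, and the sample host is made up; a real pipeline would push these rows into the CMDB.

```python
import json
from typing import Dict, List

# Facts kept in the inventory; names assume legacy Facter fact naming.
WANTED = ("hostname", "operatingsystem", "processorcount", "memorysize")


def inventory_row(facts_json: str) -> Dict[str, str]:
    """Reduce one host's JSON fact dump to the fields the inventory tracks."""
    facts = json.loads(facts_json)
    return {key: str(facts.get(key, "unknown")) for key in WANTED}


def build_inventory(fact_dumps: List[str]) -> Dict[str, Dict[str, str]]:
    """Index inventory rows by hostname."""
    rows = [inventory_row(dump) for dump in fact_dumps]
    return {row["hostname"]: row for row in rows}


# Made-up sample standing in for the output collected from one server.
sample = json.dumps({
    "hostname": "web01",
    "operatingsystem": "CentOS",
    "processorcount": "8",
    "memorysize": "15.6 GB",
    "kernel": "Linux",
})

inventory = build_inventory([sample])
print(inventory["web01"])
```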
◦ Version control: SVN / Git
◦ Use continuous integration and deployment tools to test and release software: Jenkins (Hudson) / Go, Capistrano / Fabric
◦ ...Deploy more frequently ...so as to build confidence in the whole system for change management
Capture as much data as possible, starting from site availability checks & external dependency checks and going on to much more detailed data.
Store time-series data for trend analysis, and alert when thresholds are breached.
◦ CPU / RAM / IO / network usage per server
◦ Application metrics
◦ Disk space usage
◦ Network bandwidth
◦ MySQL numbers
◦ ...etc
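To show why stored time-series data is useful for trend analysis, here is a small sketch that fits a straight line (ordinary least squares) to disk-usage samples and estimates how many days remain before a limit is hit. The sample values and the 90% limit are invented for illustration, and a linear fit is only one simple way to model growth.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(samples: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept over (x, y) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("all samples share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(samples: Sequence[Tuple[float, float]],
               limit: float) -> Optional[float]:
    """Days after the last sample until the trend line reaches limit.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    last_day = samples[-1][0]
    return max(0.0, (limit - intercept) / slope - last_day)


# Invented samples: (day number, disk usage in percent).
usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
print(days_until(usage, 90.0))
```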
So, the data source could be anything: DB, logs, SNMP, HTTP, etc.
+ Real-time reporting over it (dashboards)
+ Real-time data extraction
Tools to consider:
◦ Ganglia / Centreon / Nagios
◦ OpManager for URL monitoring
◦ Selenium RC based checks (functional tests), etc.
Alert on both minimum and maximum thresholds (OK, WARN, CRITICAL)!
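A minimal sketch of alerting on both minimum and maximum thresholds: a metric such as requests per second is suspicious when it drops too low (the site may be down) as well as when it climbs too high. The `Thresholds` class and the numbers are hypothetical and do not mirror any specific tool's configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Optional lower and upper bounds; None disables that bound."""
    warn_min: Optional[float] = None
    crit_min: Optional[float] = None
    warn_max: Optional[float] = None
    crit_max: Optional[float] = None


def classify(value: float, t: Thresholds) -> str:
    """Map a metric value to OK, WARN or CRITICAL, checking CRITICAL first."""
    if t.crit_min is not None and value <= t.crit_min:
        return "CRITICAL"
    if t.crit_max is not None and value >= t.crit_max:
        return "CRITICAL"
    if t.warn_min is not None and value <= t.warn_min:
        return "WARN"
    if t.warn_max is not None and value >= t.warn_max:
        return "WARN"
    return "OK"


# Hypothetical bounds for requests per second on one site.
rps = Thresholds(warn_min=50, crit_min=10, warn_max=800, crit_max=1000)

for sample in (5, 30, 400, 900, 1200):
    print(sample, classify(sample, rps))
```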
Continue to plan your resource requirements based on growth expectations, new features and performance targets.
Use data from:
◦ Your monitoring system!
◦ Business requirements
Continuously improve:
◦ Profile applications and reduce resource usage (DTrace)
◦ Review performance against the capacity model
◦ Feed a “Top 10” hitlist back to developers, e.g. slow queries
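One way to build such a "Top 10" hitlist, sketched under the assumption that slow-query entries have already been parsed into (query, seconds) pairs: sum the time per query and rank by the total. The sample entries are made up, and parsing the actual MySQL slow log is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_slow_queries(entries: Iterable[Tuple[str, float]],
                     limit: int = 10) -> List[Tuple[str, float]]:
    """Rank queries by total time spent, highest first."""
    totals: Dict[str, float] = defaultdict(float)
    for query, seconds in entries:
        totals[query] += seconds
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Made-up (query, seconds) pairs standing in for parsed slow-log entries.
entries = [
    ("SELECT * FROM bookings WHERE user_id = ?", 2.0),
    ("SELECT * FROM hotels WHERE city = ?", 0.5),
    ("SELECT * FROM bookings WHERE user_id = ?", 1.5),
]

for query, total in top_slow_queries(entries):
    print(f"{total:6.2f}s  {query}")
```

Ranking by total time rather than by the single slowest run surfaces queries that are only moderately slow but run very often.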
Varnish cache
◦ Reverse proxy, flexible configuration with inline C support
Nginx
◦ Event based / lightweight
◦ Runs more than 8% of the web
PHP-FPM
◦ Best FastCGI implementation available for PHP
MySQL server tuning / optimization
Caching: in-memory data store - Memcached / Redis
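To illustrate the caching idea, here is a sketch of the cache-aside pattern: look in the cache first and only hit the slow backend on a miss. An in-process dict with a TTL stands in for Memcached / Redis here, so the `TTLCache` class and the loader function are purely illustrative and are not the client API of either store.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache; a stand-in for an external in-memory store."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        """Return the cached value, or call loader on a miss or expiry."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = loader(key)
        self._store[key] = (now + self._ttl, value)
        return value


db_calls = []


def load_from_db(key: str) -> str:
    """Hypothetical slow lookup; records each call so misses are visible."""
    db_calls.append(key)
    return f"row-for-{key}"


cache = TTLCache(ttl_seconds=60.0)
cache.get_or_load("user:42", load_from_db)
cache.get_or_load("user:42", load_from_db)  # served from cache
print(len(db_calls))
```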
As a first exercise, have IT infrastructure & application threat modeling done along with a risk assessment, then ... consider having
◦ HIDS (OSSEC) / iptables
◦ WAF (Web Application Firewall)
◦ IPS (Intrusion Prevention System)
◦ Linux hardening
◦ DLP (Data Leakage Prevention)
◦ Data encryption considerations w.r.t. data classification
Security monitoring & attack detection
The key thing is to "enable continuous compliance" ...maybe PCI-DSS for an e-commerce site.
Diagnosing / troubleshooting and fixing production issues
Change management and delivery
Automate as much as possible, with centralized management of scripting etc.
Backup/restore: always do test drills for them
Don’t re-invent the wheel; go with proven and solid technologies when you can
Last :) Keep re-architecting the infrastructure (maybe small things) to optimize efficiency (every 6 months) ...learn from mistakes (yours / others' too :))
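A small sketch of one check a backup/restore test drill could automate: compare a checksum of the original data with a checksum of the restored copy. The file names are hypothetical and the file copy only stands in for a real backup and restore cycle; a database drill would also load the dump and run sanity queries.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: Path, restored: Path) -> bool:
    """True when the restored file is byte-identical to the original."""
    return sha256_of(original) == sha256_of(restored)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.sql"
    source.write_bytes(b"INSERT INTO t VALUES (1);\n")
    restored = Path(tmp) / "restored.sql"
    shutil.copyfile(source, restored)  # stand-in for backup followed by restore
    drill_ok = restore_matches(source, restored)

print(drill_ok)
```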
Questions, if any !!
Ping me on:
IRC /freenode/ : PiyushK ##infra-talk
Gtalk: piykumar
Twitter: @piykumar