Umeng Operations Infrastructure & Practice

Brief introduction to Umeng.com Operations Infrastructure & Practice.

---
updates: 03/05/2015
Thanks to @TerryWang (http://www.slideshare.net/terrywang), who helped correct some grammar errors.
Below is the original copy, feel free to comment:
https://docs.google.com/presentation/d/1d1MAR8SClZDf8gjCNPuOeu63Fd83T-mzzqnqcTboAoY/edit?usp=sharing

Published in: Internet

  1. Umeng Operations Infrastructure & Practice
     Wang Yuxi, Umeng
     w@umeng.com
  2. About Me
     ● Before 2014, the only ops engineer at Umeng
     ● Now, a core member of the ops team
     ● Technical generalist, responsible for the overall reliability and performance of Umeng
     ● ArchLinux user
     @Jasey_Wang | http://JaseyWang.Me
  3. Agenda
     ● About Umeng
     ● IDC
     ● Network
     ● Server
     ● Product
     ● On Giant’s Shoulders
     ● OS
     ● User Management
     ● Critical Infrastructure
     ● Package Management
     ● Code Deployment
     ● Configuration Management
     ● Monitoring
     ● Tuning
     ● Documentation
     ● Outage & Diagnose
     ● Security
     ● With Dev
     ● What We Are Doing Now
  4. About Umeng
     ● Founded in April 2010
     ● Incubated by Innovation Works
     ● $10 million raised from Matrix China
     ● Acquired by Alibaba
     ● Largest mobile app analytics platform in China
     ● 400K+ apps
     ● ~1B mobile devices
  5. IDC
     ● IDC
       o 3 + 1
     ● Rack
       o ~90
     ● Server
       o 800+
     ● Network device
       o 100+
  6. Network
     ● Bandwidth
       o 4Gbps+
       o BGP cost
     ● Internal network
       o 10G interconnection
       o Third network architecture
         ▪ Upgraded in Q2 2014
         ▪ Nexus 752
         ▪ Bonding
       o OOB issue
  7. Network (cont.)
  8. Server
     ● Before 2014
       o Dell (11G, 12G)
     ● Now
       o Dell, HP, Huawei, Inspur
     ● 10G NIC, enterprise SSD
     ● Power supply, hot-plug, redundant
     ● Hard drive, hot-plug
  9. Product
     ● Real-time analytics (thunder)
       o 150k req/s
       o ~5B logs/d
       o 100+ shards
     ● Batch processing system (iceberg)
       o ~300 2U nodes, 2T/3T, 7200 SAS
       o ~3T/d incremental data
       o 4P/5P usage
     ● Push, Social
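The real-time figures on this slide can be related with some quick arithmetic. The sketch below assumes that 150k req/s is a peak figure, that each request produces one log line, and that load spreads evenly over exactly 100 shards; none of these assumptions is stated on the slide.

```python
# Back-of-envelope arithmetic for the "thunder" figures.
# Assumptions (not from the slide): 150k req/s is a peak, one log line
# per request, and load is spread evenly over exactly 100 shards.

LOGS_PER_DAY = 5_000_000_000   # "~5B log/d"
PEAK_REQ_PER_SEC = 150_000     # "150k req/s"
SHARDS = 100                   # "100+ shards", taken as a lower bound

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(logs_per_day: int) -> float:
    """Average logs per second implied by a daily total."""
    return logs_per_day / SECONDS_PER_DAY


def per_shard(rate: float, shards: int) -> float:
    """Rate each shard sees if load is spread evenly."""
    return rate / shards


if __name__ == "__main__":
    avg = average_rate(LOGS_PER_DAY)
    print(f"average rate:   {avg:,.0f} logs/s")
    print(f"peak / average: {PEAK_REQ_PER_SEC / avg:.1f}x")
    print(f"peak per shard: {per_shard(PEAK_REQ_PER_SEC, SHARDS):,.0f} req/s")
```

Under these assumptions the daily total works out to roughly 58k logs/s on average, so the quoted 150k req/s would be about 2.6 times the average.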
  10. Product (cont.)
  11. On Giant’s Shoulders
     ● OSS
       o Nginx (Tengine)
       o Finagle, Thrift
       o Redis
       o Kafka
       o Storm
       o MongoDB
       o Hadoop & ecosystem
     ● Enterprise
       o Google Apps
       o GitHub Enterprise
       o Red Hat
       o New Relic
       o CDN
  12. OS
     ● Before 2013
       o Ubuntu 10.04/12.04
     ● Now
       o RedHat 6.2, 2.6.32-279 (80%)
       o professional technical support
     ● BIOS, RAID
       o automatic tools
       o done before delivery
     http://goo.gl/TyDEVR
  13. OS (cont.)
     ● OS template
       o ks & preseed (great pain)
       o partition (ext3/ext4, mount options)
       o unnecessary services (irqbalance, cpuspeed, netfilter, etc.)
       o sshd, monitoring agent
       o handy tools (nmap, tcpdump, htop, iftop, screen, etc.)
       o languages (Java/Scala, Python, Ruby)
     ● Custom init setup via Cobbler
     ● Added automatically by Zabbix
  14. User Management
     ● OpenVPN (multi path)
       o Incredibly stable for 3 years, ZERO outages
       o TCP vs UDP
     ● Public key
       o OK for a startup, quick & dirty
     ● IPA (identity, policy, audit (snoopy))
       o preferred
     ● A headache for us, for historical reasons
       o engineers enjoy the “free style”
       o so, the sooner the better
  15. Critical Infrastructure
     ● DNS
       o use IP, not hostname in your code
       o retry, timeout
     ● NTP
     ● Netfilter
       o disabled by default
       o conntrack
       o NAT server
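Where code cannot avoid resolving a name, the "retry, timeout" advice might look like the sketch below. The function name `resolve_with_retry`, its parameters, and the default values are illustrative assumptions, not something taken from the slides.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def resolve_with_retry(hostname, attempts=3, timeout=2.0, backoff=0.5,
                       resolver=socket.gethostbyname):
    """Resolve hostname, bounding each attempt by `timeout` seconds.

    `resolver` is injectable so the retry logic can be exercised
    without touching a real DNS server.
    """
    last_error = None
    for attempt in range(attempts):
        # The lookup runs in a worker thread so the caller can give up
        # on it after `timeout` seconds.
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(resolver, hostname)
        try:
            return future.result(timeout=timeout)
        except (FutureTimeout, OSError) as exc:
            # socket.gaierror is a subclass of OSError.
            last_error = exc
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
        finally:
            # A timed-out lookup cannot be cancelled; it is left to
            # finish in the background.
            pool.shutdown(wait=False)
    raise RuntimeError(
        f"could not resolve {hostname!r} after {attempts} attempts"
    ) from last_error


if __name__ == "__main__":
    print(resolve_with_retry("localhost"))
```

The caller gets either an address or a single clear failure after a bounded amount of time, instead of blocking on a slow or broken resolver.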
  16. Package Management
     ● Internal repo
       o sync periodically
       o GFWed issue :-(
     ● Really need to compile?
     ● Package manager
       o yum/apt
       o rpm/dpkg
       o how we use them
     ● One package principle
       o rpm
       o tgz
  17. Code Deployment
     ● Capistrano
       o Written in Ruby
       o Deploy any language
       o Easy to use
     ● Configuration management
       o dev use
       o ops
  18. Configuration Management
     ● 2011
       o tens of servers
       o free to use, mainly shell
     ● 2012 ~ 2013
       o just ME
       o Puppet is ok, learn some Ruby
       o tens of modules written by me
     ● Now
       o prerequisite
         ▪ team skill tree
         ▪ learning curve
       o Puppet
         ▪ obsolete in new IDC
         ▪ complex syntax, slow
       o Saltstack
         ▪ easy to pick up
         ▪ flexible & plain
         ▪ ansible as backup
       o Python/Ruby scripts, product level
  19. Monitoring
     ● Metrics, Metrics, Metrics!!!
     ● “All monitoring software evolves towards becoming an implementation of Nagios”
     http://goo.gl/PvBYky
  20. Monitoring (cont.)
     ● From top to bottom
       o customer perspective
       o business level (dau, etc.), critical, sensitive
       o application level (qps, latency, return code, exception)
       o system level (load, nic, cpu, memory)
         ▪ fork
         ▪ swap in/out
         ▪ nic speed/drops/errors
         ▪ tcp queue, retransmit
       o hardware level
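As one illustration of the system-level items, NIC drop and error counters on Linux can be read from `/proc/net/dev`. The sketch below parses that file's standard layout; the function name `parse_net_dev` and the `SAMPLE` data are made up for the example, and this is not a description of how Umeng's agents collect the counters.

```python
# Sketch: extract per-interface error/drop counters from /proc/net/dev.
# SAMPLE is invented data in the file's usual layout, used as a
# fallback on systems without /proc.

SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:  1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0: 987654   4321    2    7    0     0          0        12   123456    3210    1    3    0     0       0          0
"""


def parse_net_dev(text):
    """Return {interface: {rx_errs, rx_drop, tx_errs, tx_drop}}."""
    stats = {}
    for line in text.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        fields = rest.split()
        if len(fields) < 16:                # 8 receive + 8 transmit columns
            continue
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return stats


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE
    for iface, counters in parse_net_dev(text).items():
        print(iface, counters)
```

The counters are cumulative, so a monitoring agent would normally sample them periodically and graph the difference between samples.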
  21. Monitoring (cont.)
     ● Ideal
       o near-real time
       o flexible, 5s, 60s, 300s, 1800s
       o comparable by date/time
       o active/passive or just feed
     ● Dashboard (core metric)
     ● Before
       o Nagios/Munin (out of box)
     ● Now
       o Zabbix/Graphite
       o networkbench, alibench (user end)
       o New Relic
     ● Log
       o rsyslog
       o ELK
       o scripts
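The "flexible" intervals (5s, 60s, 300s, 1800s) suggest viewing the same metric at several resolutions. A minimal sketch of that idea follows, assuming raw samples arrive as `(unix_timestamp, value)` pairs and that averaging is an acceptable aggregate; the `rollup` helper is hypothetical, and tools such as Graphite provide their own retention and aggregation settings.

```python
from collections import defaultdict


def rollup(samples, interval):
    """Average (timestamp, value) samples into fixed-width buckets.

    Each bucket is keyed by its start time, i.e. the timestamp rounded
    down to a multiple of `interval` seconds.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % interval].append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]


if __name__ == "__main__":
    raw = [(0, 10), (5, 20), (55, 30), (60, 40), (65, 60)]  # 5s samples
    for interval in (60, 300, 1800):
        print(interval, rollup(raw, interval))
```

Because buckets are aligned to multiples of the interval, series rolled up at the same resolution line up and can be compared across dates.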
  22. Tuning
     ● From app level to system level
     ● App level, not covered here
     ● System level, takeaways for common use
     ● Don’t forget hardware (BIOS, RAID)
     ● Baseline comes first
     ● One modification at a time
     ● Never over-optimize
       o “it works”, then “it runs happily”
       o business driven
  23. Tuning (cont.)
     ● Don’t modify kernel parameters unless 100% sure
       o timestamp issue
       o ecn issue
     ● TCP related
     ● Ring buffer, interrupts, open files, etc.
     ● DB, watch out
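One way to stay cautious with kernel parameters is to record the current values before any change and diff them afterwards. The sketch below reads sysctls through `/proc/sys`; mapping the slide's "timestamp" and "ecn" items to `net.ipv4.tcp_timestamps` and `net.ipv4.tcp_ecn` is an assumption, and the helper names are hypothetical.

```python
import os

# Assumption: these are the sysctls behind the slide's "timestamp issue"
# and "ecn issue"; the slide does not name them.
WATCHED = ["net.ipv4.tcp_timestamps", "net.ipv4.tcp_ecn"]


def read_sysctl(name, root="/proc/sys"):
    """Read one sysctl value as a string, or None if it is unavailable."""
    path = os.path.join(root, *name.split("."))
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def snapshot(names, root="/proc/sys"):
    """Capture the current value of every named sysctl."""
    return {name: read_sysctl(name, root) for name in names}


def diff(before, after):
    """Return {name: (old, new)} for every value that changed."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        if before.get(name) != after.get(name):
            changed[name] = (before.get(name), after.get(name))
    return changed


if __name__ == "__main__":
    baseline = snapshot(WATCHED)
    print("baseline:", baseline)
    # ... apply exactly one change here, then compare ...
    print("changed:", diff(baseline, snapshot(WATCHED)))
```

The script only reads values; keeping the baseline next to the change record makes it easy to revert a parameter that turns out to cause trouble.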
  24. Documentation
     ● Routine
       o regular deploy & setup, weekly report
       o online standard, 100+ slides for engineers
       o ops sharing session every Thu
     ● Post-mortem
       o blameless
       o timeline & deadline
     ● GitHub Wiki & Google Docs
  25. Outage & Diagnose
     ● This year (2014)
       o SLA 99% ~ 99.9%
       o issues every week, mostly invisible to customers
     ● When the site is down
       o from bottom to top, or vice versa
       o a good bug is one that can be reproduced
       o tools are key
         ▪ system http://goo.gl/wrNLi7
         ▪ app
       o inform support & bd
       o share the technical background (http://blog.umeng.com/?cat=4)
     ● The network is unreliable, and it can break down
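To put the 99% ~ 99.9% SLA range in concrete terms, the allowed downtime can be computed directly. This is a small sketch that assumes a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_budget_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime allowed in a period at a given availability."""
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    for sla in (99.0, 99.9):
        hours = downtime_budget_hours(sla)
        print(f"{sla}% -> {hours:.2f} hours/year")
```

At 99% the yearly budget is about 87.6 hours (roughly 3.65 days); at 99.9% it shrinks to about 8.76 hours.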
  26. Security
     ● IP issue, a long, long history
       o public & private ip
       o port restricted, listen()
       o oob
     ● Test IDC
     ● UDP amplification
     ● Bash, SSL vulnerabilities
     ● DDoS
     ● Whitehat (WooYun, etc.)
     http://goo.gl/Q1SkXV
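One reading of "port restricted, listen()" is that a service should bind to a specific private address instead of every interface. The sketch below illustrates that reading; the interpretation and the helper name `listen_on` are assumptions, and the loopback address stands in for a private IP so the example runs anywhere.

```python
import socket


def listen_on(address, port=0, backlog=16):
    """Open a TCP listener bound to one specific local address.

    Binding to "0.0.0.0" would accept connections on every interface,
    including public ones; passing an explicit address limits exposure.
    Port 0 lets the OS pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((address, port))
    sock.listen(backlog)
    return sock


if __name__ == "__main__":
    # 127.0.0.1 stands in for a private address in this example.
    listener = listen_on("127.0.0.1")
    print("listening on", listener.getsockname())
    listener.close()
```

On a host with both public and private IPs, a service bound this way is not reachable from the public side even if a firewall rule is missing.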
  27. With Dev
     ● Tradeoff
       o less work for dev usually means a more reliable system
       o there will always be conflicts between ops & dev
         ▪ unless one of them gives in
         ▪ aggressive or mild, choose one
     ● Understand business logic
       o code talks
       o data talks
     http://goo.gl/Qwh6Ze
  28. What We Are Doing Now
     ● New IDCs, new beginning, great challenge
       o active - backup
       o active - active
     ● Transfer data from BJ to SH
     ● Env setup, stress test, benchmark
     ● Finally, switchover
     http://goo.gl/TMDnnS
  29. What We Are Doing Now (cont.)
     ● Private cloud
       o capex & opex
       o resource (hardware, software)
       o workforce
  30. End
     Q & A