A lot of modules to keep track of
• 25 main logic modules spanning 6 application servers (75 logic instances)
• 30 database instances spanning 9 database servers
• 45 servers in total
Different application platforms & networks
• Mobile operating systems: iOS, Android, S40, Windows Phone
• Mobile operators: Viettel, Vina, Mobifone, foreign mobile operators
• Network types: Wifi, 3G, EDGE, HSDPA, LTE, …
2 versions
1st version
• Scribe as log shipper: fast logging
• Hadoop as data storage: data replication
• Hive for ad hoc querying: acceptable query time on large datasets, distributed calculation
• You don't have real-time stats (they arrive 30 minutes later or more)
• You cannot insert logic mid-stream
• You don't have access to the actual log files (our system operators have to set up sync scripts to sync logs to our servers)
How could we improve?
• Keep your enemy close
• Start with what you have and slowly improve it
• Write to a localhost Scribe proxy, so the worker won't get stuck if the remote server goes down
Write an aggregator server using the Scribe interface
• Keep a local copy of the log
• Aggregate data and write it to an RRD database
• Perform trend analysis
• Chain logs from our log aggregator to Hadoop
Aggregated data doesn't have to be very precise
• Approximate data often works
• Data written into RRD will be normalized anyway
The local copy of the log doesn't have to persist instantly
• Cache data and flush to disk in chunks
Easy to extend once writes exceed local capacity
• Set up a small HDFS ring and write directly to it
RRD advantages
• Storage efficient & very easy to query
• Lots of plug-ins to display RRD data
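The "cache data and flush to disk in chunks" idea can be sketched roughly as follows. This is a minimal illustration, not the aggregator described in the slides: the class name `ChunkedLogWriter`, the `chunk_size` parameter, and the file path are all hypothetical.

```python
class ChunkedLogWriter:
    """Buffers log lines in memory and appends them to a file in chunks."""

    def __init__(self, path, chunk_size=1000):
        self.path = path
        self.chunk_size = chunk_size  # hypothetical threshold, tune per workload
        self._buffer = []

    def write(self, line):
        # Normalize to exactly one trailing newline per log line.
        self._buffer.append(line.rstrip("\n") + "\n")
        if len(self._buffer) >= self.chunk_size:
            self.flush()

    def flush(self):
        # Append the buffered lines in a single write, then clear the buffer.
        if not self._buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            f.writelines(self._buffer)
        self._buffer.clear()

    def pending(self):
        # Number of lines that have not reached the disk yet.
        return len(self._buffer)
```

The trade-off is the one the slides accept: lines still in the buffer are lost if the process dies before `flush()` runs, in exchange for far fewer disk writes.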
• Nagios for alerting
• Cacti for graphing charts
• Custom-built tools
Provide real-time aggregated information on multiple dimensions, both client & server side
• Total requests of each action
• Failure rate
• Average execution time
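The three per-action metrics listed above (total requests, failure rate, average execution time) could be accumulated along these lines. The sketch assumes each log event carries an action name, a success flag, and an elapsed time in milliseconds; the `ActionStats` class and its field names are hypothetical, not taken from the slides.

```python
from collections import defaultdict


class ActionStats:
    """Keeps running counters per action and derives the summary metrics."""

    def __init__(self):
        self._total = defaultdict(int)
        self._failed = defaultdict(int)
        self._time_ms = defaultdict(float)

    def record(self, action, ok, elapsed_ms):
        # One call per logged request.
        self._total[action] += 1
        if not ok:
            self._failed[action] += 1
        self._time_ms[action] += elapsed_ms

    def summary(self, action):
        total = self._total.get(action, 0)
        if total == 0:
            return {"total": 0, "failure_rate": 0.0, "avg_ms": 0.0}
        return {
            "total": total,
            "failure_rate": self._failed.get(action, 0) / total,
            "avg_ms": self._time_ms[action] / total,
        }
```

Because only counters and sums are stored, memory stays constant per action regardless of event volume, which suits approximate, near-real-time aggregation.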
Things happen when you least expect them
• Detect anomalous data based on trend analysis & anomaly detection
• Keep improving your algorithm
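The slides do not say which detection algorithm is used. As one possible illustration of trend-based detection, the sketch below flags a value that deviates from the rolling mean of recent samples by more than a chosen number of standard deviations. The class name, window size, warm-up length, and threshold are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class TrendAnomalyDetector:
    """Flags values that stray far from the rolling mean of recent samples."""

    def __init__(self, window=30, threshold=3.0, warmup=5):
        self.window = deque(maxlen=window)  # recent samples only
        self.threshold = threshold          # allowed deviation, in standard deviations
        self.warmup = warmup                # samples needed before judging

    def observe(self, value):
        """Return True if value looks anomalous, then record it in the window."""
        is_anomaly = False
        if len(self.window) >= self.warmup:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma == 0:
                # Perfectly flat history: any change counts as a deviation.
                is_anomaly = value != mu
            else:
                is_anomaly = abs(value - mu) / sigma > self.threshold
        self.window.append(value)
        return is_anomaly
```

Note that anomalous values are still appended to the window, so a sustained shift eventually becomes the new baseline. Whether that is desirable depends on the metric being watched.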
Reducing the number of alerts can help greatly
Server log, monitoring and QoS platform of a messaging app
Log, monitoring and
QoS platform of a large
Phan Huy Hoàng – Lead Engineer
Web Mobile – Zing
• 1/ Quality of Service Platform: Why we need to pay attention to QoS
• 2/ Logging: Handling lots of data
• 3/ Monitoring
• 4/ Questions & Answers
Quality of Service Platform
Why do we need to pay attention to QoS?
Why do we need QoS?
• How does your app actually work in real life?
• Which functions are users using?
Chat, social, search nearby…?
• Do those functions even work at all?
• If they do work, how well, and in which environments?
A few numbers
• Lots of data
• 3M users
• 28M chat messages/day
• 600M events/day