Server log, monitoring and QoS platform of a messaging app


  • A lot of modules to keep track of
      • 25 main logic modules spread across 6 application servers (75 logic instances)
      • 30 database instances spread across 9 database servers
      • 45 servers in total
  • Different application platforms & networks
      • Mobile operating systems: iOS, Android, S40, Windows Phone
      • Mobile operators: Viettel, Vina, Mobifone, foreign mobile operators
      • Network types: Wi-Fi, 3G, EDGE, HSDPA, LTE, …
  • 2 versions
  • 1st version
      • Scribe as log shipper: fast logging
      • Hadoop as data storage: data replication
      • Hive for ad hoc querying: acceptable query time on large datasets, distributed calculation
      • You don't get real-time stats (they arrive 30 minutes later or more)
      • You cannot insert logic mid-stream
      • You don't have access to the actual log files (our system operators have to set up sync scripts to sync logs to our servers)
  • How could we improve?
      • Keep your enemy close
      • Start with what you have and slowly improve it
  • Write to a Scribe proxy on localhost, so the worker won't get stuck if the remote server goes down
  • Write an aggregator server using the Scribe interface
      • Keep a local copy of the log
      • Aggregate data and write it to an RRD database
      • Perform trend analysis
      • Chain logs from our log aggregator to Hadoop
  • Aggregated data doesn't have to be very precise
      • Approximate data often works
      • Data written into RRD will be normalized anyway
  • The local copy of the log doesn't have to persist instantly
      • Cache data and flush to disk in chunks
  • Easy to extend once writes exceed local capacity
      • Set up a small HDFS ring and write directly to it
  • RRD advantages
      • Storage efficient & very easy to query
      • Lots of plug-ins to display RRD
  • Nagios for alerting
  • Cacti for graphing charts
  • Custom-built tools
      • Provide real-time aggregated information on multiple dimensions, both client & server side
      • Total requests of each action
      • Failure rate
      • Average execution time
  • Things happen when you least expect them
  • Detect anomalous data based on trends & anomaly detection
  • Keep improving your algorithm
  • Reducing the number of alerts can help greatly
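The aggregator idea in the notes above (keep a local copy of the log, flush it to disk in chunks, and track total requests, failure rate and average execution time per action) can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the `Aggregator` and `ActionStats` names, the tab-separated line format and the `chunk_size` default are all assumptions, and the Scribe and RRD integration is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ActionStats:
    """Running totals for one (side, action) pair."""
    total: int = 0
    failures: int = 0
    total_ms: float = 0.0

    @property
    def failure_rate(self) -> float:
        return self.failures / self.total if self.total else 0.0

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.total if self.total else 0.0


class Aggregator:
    """Buffers raw log lines and keeps per-action counters in memory."""

    def __init__(self, sink, chunk_size=1000):
        self.sink = sink              # any object with a write(str) method
        self.chunk_size = chunk_size  # lines to buffer before flushing
        self.buffer = []
        self.stats = defaultdict(ActionStats)

    def record(self, side, action, ok, elapsed_ms):
        # Hypothetical tab-separated line format, for illustration only.
        self.buffer.append(f"{side}\t{action}\t{int(ok)}\t{elapsed_ms}\n")
        stats = self.stats[(side, action)]
        stats.total += 1
        stats.failures += 0 if ok else 1
        stats.total_ms += elapsed_ms
        if len(self.buffer) >= self.chunk_size:
            self.flush()

    def flush(self):
        # Write the buffered chunk in one call; the local copy
        # does not need to persist instantly.
        if self.buffer:
            self.sink.write("".join(self.buffer))
            self.buffer.clear()
```

Keying the counters by `(side, action)` is one simple way to get the client-side and server-side dimensions mentioned in the notes; in a setup like the one described, the counters would be written to RRD periodically rather than kept in memory.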

    1. Log, monitoring and QoS platform of a large-scale service. Phan Huy Hoàng – Lead Engineer Web Mobile – Zing
    2. Agenda • 1/ Quality of Service Platform: Why we need to pay attention to QoS • 2/ Logging: Handling lots of data • 3/ Monitoring • 4/ Questions & Answers
    3. Quality of Service Platform: Why do we need to pay attention to QoS?
    4. Why do we need QoS? • How does your app actually work in real life? • Which functions are users using? Chat, social, search nearby…? • Do those functions even work at all? • If they do work, how well, and in which environment?
    5. A few numbers • Lots of data • 3M users • 28M chat messages/day • 600M events/day
    6. Too much data
    7. Logging: Handling lots of data
    8. Logging Flow v1
    9. Logging Flow v2
    10. Monitoring
    11. Charting • Live data • Trends data
    12. Dashboard • Simplify your life • Concentrate on drastic changes
    13. Anomalous data point detection • How to deal with stuff like this? • Too many data points deviating from the normal deviation should trigger an alert • You will get a lot of false positives
    14. What's next? • Sending alerts • By Zalo, SMS, email • Happy life
    15. Questions?
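
The rule on slide 13 (alert only when too many data points deviate from the normal range, to limit false positives) could look something like the sketch below. The slides do not say which algorithm is used, so this is an assumption: a rolling mean and standard deviation over recent normal points, with an alert fired once after a run of consecutive deviant points. The `AnomalyDetector` name and the `window`, `threshold` and `min_consecutive` parameters are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class AnomalyDetector:
    """Flags a run of points that deviate from a rolling baseline."""

    def __init__(self, window=30, threshold=3.0, min_consecutive=3):
        self.history = deque(maxlen=window)     # recent normal points
        self.threshold = threshold              # allowed deviation, in sigmas
        self.min_consecutive = min_consecutive  # deviant points needed to alert
        self.streak = 0

    def observe(self, value):
        """Return True exactly once per run of deviant points."""
        deviant = False
        if len(self.history) >= 2:
            baseline = list(self.history)
            mu = mean(baseline)
            sigma = stdev(baseline)
            if sigma > 0:
                deviant = abs(value - mu) > self.threshold * sigma
            else:
                deviant = value != mu
        if deviant:
            self.streak += 1
        else:
            self.streak = 0
            # Only normal points update the baseline.
            self.history.append(value)
        return self.streak == self.min_consecutive
```

Requiring several consecutive deviant points suppresses alerts for isolated spikes, and firing only once per run keeps the alert count down. The trade-off is that a permanent level shift is never absorbed into the baseline, so in practice the baseline would need a reset or a slower update path.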