
GOCON Autumn (Story of our own Monitoring Agent in golang)

Story of our own Monitoring Agent in golang

Published in: Software

  1. 1. Story of our own Monitoring Agent in golang @dxhuy LINE corp
  2. 2. Introduction • @dxhuy • Vietnamese • Building monitoring stack at LINE
  3. 3. My goal today • Join GoConference without lottery
  4. 4. My goal today • Show that this is not 100% true
  5. 5. Today's takeaways → Anatomy of a monitoring agent → How to design one → Challenges and lessons learned
  6. 6. Monitoring Agent !?
  7. 7. • A small application that runs on the host machine • Collects host machine metrics: request latency? MySQL load? Redis hit/miss rate? ... • Aggregates metrics (sum / avg / histogram ...) • Sends them to a collector server → alerts / charts ... • Examples: statsd / collectd / telegraf ...
  8. 8. Not a generic log transfer
  9. 9. Why not reuse existing technology? • Scale problem • We need to write our own stack • Diverse environment problem • Management problem • Development velocity problem
  10. 10. Let's start writing our own
  11. 11. Language
  12. 12. Features
  13. 13. • Modularity (for user) • Buffer (prevent data loss) • Management friendly (for admin)
  14. 14. Modularity • What is modularity? • Easy to add new metrics from the user's point of view • Pluggable
  15. 15. Modularity • How? • Input : get metric • Codec : understand metric • Output : send metric
  16. 16. // Metric is the central model for imonD.
          type Metric struct {
              ProtocolVersion ProtocolVer
              Name            string
              Val             Value
              TimeStamp       time.Time
              Fingerprint     Fingerprint
              Type            MetricType
              Labels          map[string]string
          }
  17. 17. Input Plugin design
  18. 18. Input Plugin design • Three important things: • Process model • Plugin model • Collecting model (push vs pull)
  19. 19. Process model: single process vs. multiple processes
  20. 20. Process model - Adv: easy management / maintenance - DisAdv: one bad plugin could affect the whole agent
  21. 21. Plugin model: same language vs. embedded language
  22. 22. Plugin model - Adv: simple model, better maintenance - DisAdv: each time a new plugin is added, the whole agent needs to be restarted
  23. 23. All plugins share the same interface:

          // InputPlugin represents an input plugin interface.
          type InputPlugin interface {
              Interval() config.Duration
              GracefulStop() error
              Name() string
              Type() InputType
          }

          // InputByte is implemented by inputs that return raw bytes plus the decoder for them.
          type InputByte interface {
              Decoder() codec.Decoder
              ReadBytesWithContext(ctx context.Context) ([]byte, error)
          }

          // InputMetrics is implemented by inputs that return already-decoded metrics.
          type InputMetrics interface {
              ReadMetricsWithContext(ctx context.Context) (model.Metrics, error)
          }
  24. 24. Collecting model: push vs. pull
  25. 25. Collecting model - Adv: less impact on middleware, simple model - DisAdv: the application needs to expose something to "pull" (HTTP endpoint / file / ...)
  26. 26. Pull sample: contact the server directly

          func (i *MemcachedInput) ReadMetricsWithContext(ctx context.Context) (model.Metrics, error) {
              ..............
              conn, err := net.DialTimeout("tcp", i.endpoint, i.timeout.Duration)
              if err != nil {
                  return nil, err
              }
              defer conn.Close()
              // Ask memcached for its statistics.
              _, err = conn.Write([]byte("stats\n"))
              if err != nil {
                  return nil, err
              }
              ..................
              scanner := bufio.NewScanner(conn)
              for scanner.Scan() {
                  text := scanner.Text()
                  if text == "END" {
                      break
                  }
                  // Split entries which look like: STAT time 1488291730
                  entries := strings.Split(text, " ")
                  if len(entries) == 3 {
                      v, err := strconv.ParseInt(entries[2], 10, 64)
                      if err != nil {
                          log.Debug("invalid value %s", entries[2])
                          continue
                      }
                      ms = append(ms, *model.NewMetric(
                          entries[1],
                          model.Value(float64(v)),
                          time.Now(),
                          model.GaugeType,
                      ))
                  }
              }
              ..........
              return ms, nil
          }
  27. 27. Codec Plugin / Output Plugin
  28. 28. Codec interface

          type Encoder interface {
              Encode(metrics model.Metrics) ([]byte, error)
              Name() string
          }

          type Decoder interface {
              Decode(input []byte) (model.Metrics, error)
              Name() string
          }
  29. 29. Output interface

          // OutputPlugin represents an output plugin interface.
          type OutputPlugin interface {
              WriteWithContext(ctx context.Context, metrics model.Metrics) error // for cancellable write
              Encoder() codec.Encoder
              Interval() config.Duration
              GracefulStop() error
              WalReader() wal.LogReader
              Name() string
          }
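To make the codec interface concrete, here is a small sketch of one possible codec that satisfies both the Encoder and Decoder shapes from slide 28 using JSON. The slides do not say which wire formats the agent supports, so `JSONCodec` and the simplified `Metric` / `Metrics` types are assumptions made for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metric and Metrics are simplified, hypothetical stand-ins for model.Metric / model.Metrics.
type Metric struct {
	Name   string            `json:"name"`
	Value  float64           `json:"value"`
	Labels map[string]string `json:"labels,omitempty"`
}

type Metrics []Metric

// Encoder and Decoder follow the shape of the codec interfaces in the slides.
type Encoder interface {
	Encode(metrics Metrics) ([]byte, error)
	Name() string
}

type Decoder interface {
	Decode(input []byte) (Metrics, error)
	Name() string
}

// JSONCodec is a hypothetical codec; it is not taken from the slides.
type JSONCodec struct{}

func (JSONCodec) Name() string { return "json" }

func (JSONCodec) Encode(ms Metrics) ([]byte, error) { return json.Marshal(ms) }

func (JSONCodec) Decode(input []byte) (Metrics, error) {
	var ms Metrics
	if err := json.Unmarshal(input, &ms); err != nil {
		return nil, fmt.Errorf("json codec: %w", err)
	}
	return ms, nil
}

// Compile-time checks that JSONCodec satisfies both interfaces.
var (
	_ Encoder = JSONCodec{}
	_ Decoder = JSONCodec{}
)

func main() {
	c := JSONCodec{}
	b, err := c.Encode(Metrics{{Name: "cpu", Value: 0.5}})
	if err != nil {
		panic(err)
	}
	ms, err := c.Decode(b)
	fmt.Println(string(b), len(ms), err)
}
```

Keeping the codec separate from input and output means the same output plugin can ship a different wire format by swapping only the Encoder it returns.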
  30. 30. Buffer design
  31. 31. Buffer design • Each output maintains its own offset i • The offset is updated when the output succeeds
  32. 32. Buffer design • Advantages • When an output fails, just roll back the index • Chunks are organized into segments (each segment ~1 GB) • To clean up, just delete old segments that have already been consumed by all outputs
  33. 33. Buffer design • Other concerns • Serialization: it's not hard to write your own serialization method (https://www.slideshare.net/dxhuy88/story-writing-byte-serializer-in-golang) • mmap vs file read: not much difference in our case; mmap index management is cumbersome to write because it has to work at 2^n addresses • Concurrent write vs synchronized write: synchronized write for data safety
  34. 34. Buffer design

          type LogReader interface {
              Read() (model.Metrics, error)
              Read1() (model.Metrics, error)
              CurrentOffset() int64
              SetOffset(int64) error
              Destroy() error
          }

          type LogWriter interface {
              Write(*model.Metrics) error
              LastOffset() int64
          }
  35. 35. Management friendly • Monitoring agents is f**king hard • Deploying agents at large scale is painful
  36. 36. Potential risks • Dying without anyone noticing • Excessive resource consumption • Buffer overflow • Dirty data • Resend storm
  37. 37. Resend storm is awful
  38. 38. How we solve those problems • Expose agent state as an HTTP endpoint • and monitor them all using Prometheus • Monitor everything • Aliveness / CPU / Memory / Output Lag • Use a circuit breaker / jittered resend to prevent resend storms
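  39. 39. Circuit breaker

          // Note: b.state is written from a background goroutine, so real code
          // should guard it with a mutex or an atomic value.
          func (b *AutoOpenBreaker) Close() {
              log.Info("close breaker for %v", b.autoOpenTime)
              b.state = CLOSE
              b.closeTime = time.Now()
              go b.autoOpen()
          }

          func (b *AutoOpenBreaker) open() {
              b.state = OPEN
          }

          func (b *AutoOpenBreaker) IsOpen() bool {
              return b.state == OPEN
          }

          // autoOpen reopens the breaker once autoOpenTime has elapsed.
          // time.After replaces time.Tick because only one wake-up is needed.
          func (b *AutoOpenBreaker) autoOpen() {
              <-time.After(b.autoOpenTime)
              log.Info("auto open breaker after %v", b.autoOpenTime)
              b.open()
          }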
  40. 40. Jitter

          func (i *Output) retry(left int, cancelCtx context.Context, f func() error) error {
              select {
              case <-cancelCtx.Done():
                  return fmt.Errorf("got cancelled")
              default:
                  // no-op
              }
              // jitter retry: sleep a random time below an exponentially growing, capped bound
              m := math.Min(capacity, float64(base*math.Pow(2.0, float64(maxRetry-left))))
              s := rand.Intn(int(m))
              log.Debug("retry sleep %d second", s)
              time.Sleep(time.Duration(s) * time.Second)
              // do some work ....
          }
  41. 41. Agent monitoring using Prometheus / Grafana
  42. 42. Export the agent's own metrics at http://host:port/agent_metrics
  43. 43. Admin page
  44. 44. Finally • Golang is awesome • Quick to prototype, works everywhere • Never, ever write your own agent • ... unless you have to • But it's fun because there are a lot of problems
  45. 45. We're hiring
