
Mirko Damiani - An Embedded soft real time distributed system in Go

An embedded system usually involves low-level languages like C and highly customized hardware. In this talk we will look at a use case of a soft real time system that was developed with a very different approach: it is written in Go. We will see the advantages of this choice, along with its limits.

Published in: Software

  1. 1. An Embedded Soft Real Time System in Go Mirko Damiani Software Engineer @ Develer LinuxLab 3-4 Dec 2018
  2. 2. Overview ● Soft real time systems ● Our hardware/software solution ● Go advantages for embedded ● Go optimizations
  3. 3. Soft Real Time System
  4. 4. ● Industrial machines ● Quality features ○ Color ○ Weight ○ Defects ○ Shape ○ ... ● Classification ○ Grouping items together, according to quality features Quality Classifier Photo by Kate Buckley on flickr
  5. 5. Industrial Applications Photo by SortingExpert on Wikipedia. Photo from prozesstechnik
  6. 6. Specs & Outline ● 100 lanes ● 20 items/sec per lane ● 2000 items/sec ● 10 exits per lane ● Industrial scale Photo by Chris Chadd from Pexels Feeder Sensor Exit Lane Rotary Encoder Ejector Classify
  7. 7. Need of Precision ● Items are eventually ejected ○ Precise timing of ejection ○ Precision of 250 us ○ Multiple exits ● Usually a real time OS is used ○ Higher determinism
  8. 8. Hardware Architecture
  9. 9. Our Machine Layout ● Boards ○ BL: Business Logic ○ IO: Input/Output ● Business Logic ○ Acquires data from sensors ○ Manages every lane ● Network traffic is heavy ○ Up to 250 sensors ○ Up to 2000 items per second ● Checkpoint ○ Trigger for classification ● Diagram labels: sensors, IO, BL (data in, data out), checkpoint, IO, ejectors, exit 1, exit 2
  10. 10. The Challenge Canonical Way ● RTOS kernel ● Custom hardware & boards ● CANBUS communication ● Single board ● Firmware C bare metal
  11. 11. The Challenge: Linux and Go Canonical Way ● RTOS kernel ● Custom hardware & boards ● CANBUS communication ● Single board ● Firmware C bare metal Our Solution ● GNU/Linux standard kernel ● Hardware standard components ● Ethernet based communication ● Distributed system ● Go language
  12. 12. Why Linux? GNU/Linux ● Real time processes ● Microprocessor boards ● No Safety Certifications ● Plenty of Drivers ● Separation of competences ● Debug on desktop with tools ● Many Languages & Libraries RTOS ● Tasks with priorities ● Microcontrollers (no MMU) ● Safety and Certifications ● Limited number of Drivers ● Single big application ● Debug on hardware boards ● Few Languages and Libraries
  13. 13. Network connections ● “BL” Single Business Logic board ○ Freescale i.MX 6, Quad Core ARM Cortex A9 @ 1.2 GHz ○ Performs the item classification for every lane ● “IO” Multiple Input/Output boards ○ Develboard Atmel, Single Core ARM @ 600 MHz ○ Digital inputs and outputs ● Multiple sensors ● Ethernet bus with standard switches/routers ● Diagram: star topology, BL and the IO boards connected through an Ethernet switch
  14. 14. Different topology with Linux Sw bridge BL IO IO IO BL IO IO IO Star topology Serial topology ● Simplified cabling ● Sw bridge: 15% CPU
  15. 15. Latency for Soft Real Time ● Kernel driver with a precision of 250 us ○ DMA + double buffering ○ Buffer has a duration of 100 ms ○ Actual precision of 66 us ● Queue of scheduled activations ○ User space software writes activations to the kernel driver ● Soft real time latency ○ 100 ms + queue management ~= 150 ms ○ System can’t react faster than 150 ms (e.g. change of speed)
  16. 16. Rotary encoder ● Lanes are physically bound ○ Multiple encoders in case of big machines ● Encoder steps ○ Square waves ○ 2000 steps/round ● Kernel driver ○ Parameters exported in sysfs A B Z
  17. 17. Linear Interpolation ● We cannot synchronize thousands of times per second over Ethernet ● Synchronization every 100 ms ● Linear interpolation ○ Encoder accelerations are “slow” because the encoder is bound to a mechanical transport ● Workaround for the lack of a real time protocol ● Plot: #step over time t, real curve vs interpolated curve
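
The interpolation between two synchronization points can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the function name `estimateStep` and its parameters are hypothetical and do not come from the talk, and the sketch assumes the encoder speed stays constant after the last synchronization message.

```go
package main

import (
	"fmt"
	"time"
)

// estimateStep estimates the current encoder step from the two most recent
// synchronization samples, assuming constant speed since the last one.
// (Hypothetical helper, not taken from the talk.)
//
//	prevStep, lastStep: step counters carried by the two samples
//	interval:           time between the two samples (about 100 ms in the talk)
//	elapsed:            time passed since the last sample
func estimateStep(prevStep, lastStep int64, interval, elapsed time.Duration) int64 {
	if interval <= 0 {
		// No usable speed estimate yet: fall back to the last known step.
		return lastStep
	}
	delta := lastStep - prevStep
	// Integer arithmetic avoids floating point rounding surprises.
	return lastStep + delta*int64(elapsed)/int64(interval)
}

func main() {
	// 400 steps were counted during the last 100 ms; 50 ms later the
	// estimate is 200 steps further along.
	got := estimateStep(0, 400, 100*time.Millisecond, 50*time.Millisecond)
	fmt.Println(got) // prints 600
}
```

Because the transport is mechanical and accelerates slowly, the error of such an estimate stays small within one 100 ms window, which is the argument the slide makes.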
  18. 18. BL/IO Clock Synchronization ● Activation messages are marked with a specific timestamp ○ We need to synchronize clocks ● Usually, NTP is used ○ Precision of ~milliseconds => Not enough for us ○ We need a precision of (at least) 250 us ● Precision Time Protocol IEEE 1588 (PTP) ● Two PTP timestamping models: Hardware or Software ○ Software timestamping: Kernel interrupt => Precision of ~microseconds ○ Hardware timestamping: Ethernet interface => Precision of ~nanoseconds ○ Develboard supports IEEE 1588, but software timestamping is enough
  19. 19. Go for Embedded
  20. 20. Basic Advantages ● Simple language, clients got used to it very quickly ● Simple documentation and maintenance ● Static binaries ● Large ecosystem of libraries ● Concurrent programming ● Easy cross compilation ○ Embedded (ARM) ○ Windows ○ Linux
  21. 21. Embedded: Go vs C++ ● Stack trace and data race analysis ○ Valgrind slows down performance ● Debug tools ○ Remote debugging and system analysis (gdb vs pprof) ● Linter and code analysis ○ Easier to integrate static analysis tools (e.g. golint, go vet) ● Tags (go build -tags …) ○ Useful for embedded apps and stubs ○ Cleaner approach compared to #ifdef
  22. 22. Fine tuning: Disassembly ● go tool objdump -s main.join -S <binary_name>
        func join(strings []string) string {
          0x8bae0 e59a1008 MOVW 0x8(R10), R1
          0x8bae4 e15d0001 CMP R1, R13
          0x8bae8 9a00001a B.LS 0x8bb58
          0x8baec e52de024 MOVW.W R14, -0x24(R13)
          0x8baf0 e3a00000 MOVW $0, R0
          0x8baf4 e3a01000 MOVW $0, R1
          0x8baf8 e3a02000 MOVW $0, R2
        for _, str := range strings {
          0x8bafc ea00000f B 0x8bb40
          0x8bb00 e58d0020 MOVW R0, 0x20(R13)
          0x8bb04 e59d3028 MOVW 0x28(R13), R3
          0x8bb08 e7934180 MOVW (R3)(R0<<3), R4
          0x8bb0c e0835180 ADD R0<<$3, R3, R5
          0x8bb10 e5955004 MOVW 0x4(R5), R5
      Source:
        package main

        import (
            "fmt"
            "os"
        )

        func join(strings []string) string {
            var ret string
            for _, str := range strings {
                ret += str
            }
            return ret
        }

        func main() {
            fmt.Println(join(os.Args[1:]))
        }
  23. 23. How do we perform tests on Embedded? 1. Unit tests 2. Full integration tests ○ Integration framework ○ Mocking board/instruments as Goroutines ○ Easier than C++ ○ Fast prototyping for tests 3. Continuous integration ○ The real embedded system was simulated on CircleCI
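
Mocking a board as a Goroutine can look roughly like the following sketch. Everything here is hypothetical: the line-based message format, `mockIOBoard`, `countReadings` and `runScenario` are illustrative names, and an in-memory `net.Pipe` stands in for the real Ethernet connection.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// mockIOBoard plays the role of an IO board: it writes n sensor readings
// on conn, one per line, then closes the connection.
// (Hypothetical text format, chosen only to keep the sketch short.)
func mockIOBoard(conn net.Conn, lane, n int) {
	defer conn.Close()
	for i := 0; i < n; i++ {
		fmt.Fprintf(conn, "lane=%d item=%d\n", lane, i)
	}
}

// countReadings is a stand-in for the code under test: it reads readings
// until the peer closes the connection and returns how many it saw.
func countReadings(conn net.Conn) int {
	defer conn.Close()
	sc := bufio.NewScanner(conn)
	count := 0
	for sc.Scan() {
		count++
	}
	return count
}

// runScenario wires a mock board to the code under test through an
// in-memory pipe, so no hardware or network is needed.
func runScenario(items int) int {
	boardSide, blSide := net.Pipe()
	go mockIOBoard(boardSide, 1, items)
	return countReadings(blSide)
}

func main() {
	fmt.Println(runScenario(5)) // prints 5
}
```

Since the mock is just a Goroutine, an integration test can start hundreds of them in one process, which is hard to do with comparable effort in a C++ test harness.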
  24. 24. ● Monitoring of performance ○ Metrics ○ Profiling ● Google pprof upstream version: ○ go get -u github.com/google/pprof ● Small CPU profile file => 10 minutes execution => just 185 KiB ○ Stand alone, no binary ○ Can read from both local file or over HTTP ○ pprof -http :8081 http://localhost:8080/debug/pprof/profile?seconds=30 Avoid Performance Regression
  25. 25. Hardware in the Loop ● Automatic performance monitoring ● We have a real hardware test bench ● We want to deploy our system directly to the test bench ● Results from the test bench are retrieved by CircleCI Repo CI Metrics Hardware
  26. 26. Remote Introspection via Browser ● Uncommon in embedded apps ● Expvar ○ Standard interface for public variables ○ Exports figures about the program ○ JSON format
        // At global scope
        var requestCount = expvar.NewInt("RequestCount")
        ...
        func myHandler(w http.ResponseWriter, r *http.Request) {
            requestCount.Add(1)
            ...
        }
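
A complete, runnable version of the expvar idea might look like this. It is a minimal sketch: the `/work` path and the helpers `newMux` and `get` are hypothetical, and the requests are made in-process with `httptest` so that the example needs no open port. `/debug/vars` is the path the expvar package uses by default.

```go
package main

import (
	"expvar"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// At global scope, as on the slide.
var requestCount = expvar.NewInt("RequestCount")

func myHandler(w http.ResponseWriter, r *http.Request) {
	requestCount.Add(1)
	fmt.Fprintln(w, "ok")
}

// newMux wires the application handler and the expvar JSON endpoint.
// "/work" is a hypothetical application path.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/work", myHandler)
	mux.Handle("/debug/vars", expvar.Handler())
	return mux
}

// get performs an in-process GET request and returns the response body.
func get(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	mux := newMux()
	get(mux, "/work")
	get(mux, "/work")
	fmt.Println("requests so far:", requestCount.Value())
	// The JSON document also contains the standard "cmdline" and
	// "memstats" variables next to "RequestCount".
	fmt.Println(get(mux, "/debug/vars"))
}
```

On a real board the same mux would be served with `http.ListenAndServe`, and the JSON could then be inspected from a browser on the desktop.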
  27. 27. Go Optimizations
  28. 28. Metrics ● Performance analysis ○ We don’t want performance regressions ○ Refactoring ○ Test suites don’t help ● “Tachymeter” library to monitor metrics ○ Low impact, samples are added to a circular buffer ○ Average, Standard Deviation, Percentiles, Min, Max, … ● Multiple outputs ○ Formatted string, JSON string, Histogram text and html ○ HTTP endpoint for remote analysis
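
The circular buffer idea behind such a metrics library can be sketched with the standard library alone. This is not the Tachymeter API: the `ring` type, its methods and the percentile rule are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// ring keeps the most recent samples in a fixed-size circular buffer, so
// adding a sample never allocates. (Hypothetical type, not Tachymeter.)
type ring struct {
	mu      sync.Mutex
	samples []time.Duration
	next    int  // index of the slot that will be overwritten next
	filled  bool // true once the buffer has wrapped around
}

func newRing(size int) *ring {
	return &ring{samples: make([]time.Duration, size)}
}

// add stores d, overwriting the oldest sample when the buffer is full.
func (r *ring) add(d time.Duration) {
	r.mu.Lock()
	r.samples[r.next] = d
	r.next++
	if r.next == len(r.samples) {
		r.next = 0
		r.filled = true
	}
	r.mu.Unlock()
}

// summary holds the figures computed over the samples currently stored.
type summary struct {
	Count              int
	Min, Max, Avg, P99 time.Duration
}

// calc copies the stored samples and computes the summary on the copy,
// so the lock is held only for the duration of the copy.
func (r *ring) calc() summary {
	r.mu.Lock()
	n := r.next
	if r.filled {
		n = len(r.samples)
	}
	sorted := append([]time.Duration(nil), r.samples[:n]...)
	r.mu.Unlock()
	if n == 0 {
		return summary{}
	}
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	var total time.Duration
	for _, d := range sorted {
		total += d
	}
	return summary{
		Count: n,
		Min:   sorted[0],
		Max:   sorted[n-1],
		Avg:   total / time.Duration(n),
		P99:   sorted[(n-1)*99/100], // simple index rule, an assumption
	}
}

func main() {
	r := newRing(4)
	for i := 1; i <= 6; i++ {
		r.add(time.Duration(i) * time.Millisecond)
	}
	// Only the last 4 samples (3ms..6ms) are still in the buffer.
	fmt.Printf("%+v\n", r.calc())
}
```

The fixed size bounds both memory use and the cost of computing a summary, which is what keeps the impact on the measured system low.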
  29. 29. Checkpoint Margin ● Average ○ Avg 2.301660948s ○ StdDev 176.75148ms ● Percentiles ○ P75 2.222552667s ○ P95 1.921699001s ○ P99 1.721095s ○ P999 1.575430001s ● Limits ○ Max 2.916016667s ○ Min 1.464427001s ● Diagram labels: sensors, checkpoint, margin, activation (2 minutes run)
  30. 30. How is Checkpoint Margin affected? ● I/O bound ● Reading packets from connections ● We need to read fairly from 250 tcp sockets eth/tcpBL S S S
  31. 31. Standard Network Loop ● One Goroutine per connection ○ 1. Read data from network ○ 2. Decode packets ○ 3. Send to main loop via channel ● chan packet ○ Sending one packet at a time to the main loop ● Can we do better? ● Diagram: TCP reads → concurrent Goroutines → chan packet → main loop
  32. 32. Batched Channel ● chan packet vs chan []packet ○ Sending one packet at a time is too slow ● Use a single channel write operation to send all packets received from a single TCP read ○ Minimizing channel writes is good ● Diagram: TCP reads → concurrent Goroutines → chan []packet → main loop
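
A reader Goroutine that batches packets per read could be sketched as follows. The fixed 4-byte framing, the `packet` type, `readBatches`, `drain` and the `chunkReader` used to simulate TCP reads are all assumptions for illustration; the talk does not show its actual packet format or code.

```go
package main

import (
	"fmt"
	"io"
)

type packet []byte

const packetSize = 4 // hypothetical fixed-size framing

// readBatches sends all complete packets obtained from one Read call with
// a single channel write, instead of one write per packet.
func readBatches(conn io.Reader, out chan<- []packet) error {
	buf := make([]byte, 64*packetSize)
	pending := 0 // bytes of an incomplete packet carried over
	for {
		n, err := conn.Read(buf[pending:])
		total := pending + n
		complete := total / packetSize
		if complete > 0 {
			batch := make([]packet, 0, complete)
			for i := 0; i < complete; i++ {
				p := make(packet, packetSize)
				copy(p, buf[i*packetSize:(i+1)*packetSize])
				batch = append(batch, p)
			}
			out <- batch // one channel write per network read
		}
		// Move the trailing partial packet to the front of the buffer.
		pending = copy(buf, buf[complete*packetSize:total])
		if err == io.EOF {
			return nil // a trailing partial packet is dropped
		}
		if err != nil {
			return err
		}
	}
}

// chunkReader simulates a TCP connection whose Read calls return
// arbitrary chunks that do not align with packet boundaries.
type chunkReader struct{ chunks [][]byte }

func (c *chunkReader) Read(p []byte) (int, error) {
	if len(c.chunks) == 0 {
		return 0, io.EOF
	}
	n := copy(p, c.chunks[0])
	c.chunks[0] = c.chunks[0][n:]
	if len(c.chunks[0]) == 0 {
		c.chunks = c.chunks[1:]
	}
	return n, nil
}

// drain runs one reader Goroutine and plays the role of the main loop.
func drain(conn io.Reader) ([][]packet, error) {
	out := make(chan []packet, 16) // buffered, as on the next slide
	errc := make(chan error, 1)
	go func() {
		errc <- readBatches(conn, out)
		close(out)
	}()
	var batches [][]packet
	for b := range out {
		batches = append(batches, b)
	}
	return batches, <-errc
}

func main() {
	// Two simulated reads; the third packet is split across them.
	conn := &chunkReader{chunks: [][]byte{[]byte("aaaabbbbcc"), []byte("ccdddd")}}
	batches, err := drain(conn)
	fmt.Println(len(batches), err)
}
```

With one Goroutine per connection, each of them would run `readBatches` against the same shared channel, so the main loop wakes up once per network read rather than once per packet.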
  33. 33. Number of Channel Writes ● Channel ○ Buffered ○ Slice of packets ● Writes per second ○ 2000 → 25000 w/s ● Total GC STW time ○ 2.28 → 11.50 s ● Plot (2 minutes run): checkpoint margin [s] and GC time [s] against channel writes per second [w/s]
  34. 34. Failed Test: Using a Mutex ● Goroutines will block on a mutex ○ High contention ○ Go scheduler is cooperative ● Deadline missed ○ Checkpoint event is delayed ● Conn.Read() duration, channel vs mutex ○ Min: 13 us vs 13 us ○ Max: 773 us vs 1.15 s ○ P99: 64 us vs 510 ms ● Activation margin, channel vs mutex ○ P99: 466 ms vs -1.13 s ● Diagram: TCP reads → concurrent Goroutines → mutex → main loop; with the mutex the checkpoint is delayed and the activation margin shrinks
  35. 35. Alternative: Using EPOLL ● The EPOLL syscall allows using a single Goroutine ● MultiReader Go interface ○ Reading from multiple connections ○ Monitoring of multiple file descriptors ● Drawbacks ○ It can’t be used on Windows ○ Cannot use net.Socket ○ Maintenance ● Diagram: TCP → Multi Read → chan []packet → main loop
        type MultiPacketReader interface {
            // TCP connection with framing
            Register(conn PacketConn)
            // Reads from one of the
            // registered connections
            ReadPackets(packets [][]byte) (n int, conn PacketConn, err error)
        }
  36. 36. CPU Usage: EPOLL VS Go ● 4 CPUs in total ○ Graph shows just one CPU (for simplicity) ● Go impl ○ CPU usage is higher... ○ … but more “uniform” ● EPOLL impl ○ CPU cores are switched more frequently ● EPOLL ● Go Time (2 minutes)
  37. 37. Conclusions Thanks mirko@develer.com ● Standard Linux OS and hardware ○ Faster development ○ Distributed system ● Testing and monitoring ○ Fast prototyping for tests ○ Profiling and metrics ○ Performance tests on real hardware ● Optimizations ○ Goroutines management ○ Packets reception ● Drawbacks ○ GC impact must be reduced ○ Mutex contention can be a problem ○ Network APIs are not flexible enough ● Go can be used for embedded apps!
