This document discusses monitoring microservices running in containers at scale. It introduces Sysdig, an open source tool that captures system events and provides visibility into containers, processes, files, and more through a curses UI. Sysdig addresses the challenges of getting monitoring data out of orchestrated applications running in containers across nodes and making sense of that data. It collects kernel-level data without adding dependencies to containers. Sysdig Cloud provides correlated monitoring data at scale. The document also introduces Falco, an anomaly detection system built on Sysdig that monitors for security threats through behavior rules.
3. Introducing Sysdig
• Capture system events, filter them, run useful scripts
• Lua scripting
• Open Source
• Nice curses UI
lsof
netstat
tcpdump
htop
ps
strace
4. and more
• track user activity
• top files/processes/connections by
• cpu
• bytes
• …
• logs
• containers
• tracers
• you name it, we track it
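The "top files/processes/connections by bytes" idea above boils down to summing a metric per key over a stream of captured events. The sketch below illustrates that aggregation only; it is not Sysdig's implementation, and the event dictionaries and their `proc` and `bytes` keys are hypothetical stand-ins for real captured I/O events.

```python
from collections import Counter

# Hypothetical, simplified I/O events; real captured events carry far more fields.
EVENTS = [
    {"proc": "nginx", "bytes": 512},
    {"proc": "mysqld", "bytes": 2048},
    {"proc": "nginx", "bytes": 1024},
    {"proc": "redis-server", "bytes": 256},
]


def top_by_bytes(events, key, n=3):
    """Sum the bytes per distinct value of `key` and return the n largest totals."""
    totals = Counter()
    for event in events:
        totals[event[key]] += event["bytes"]
    return totals.most_common(n)


if __name__ == "__main__":
    for name, total in top_by_bytes(EVENTS, "proc"):
        print(f"{total:>8} {name}")
```

In Sysdig itself, views of this kind come from bundled chisels such as `topprocs_cpu` and `topfiles_bytes`, run with `sysdig -c`.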
5. Design Goals
• Production-ready
• Simple
• lightweight
• Rich data
• Natural workflow
• Native support for containers
• Native support for
and more…
7. Containers are Great…
• Simple
• Scalable
• Isolated
• Service-oriented
• Elastic
• Flexible
• Separation of concerns
8. But Some Things Are Becoming More Complex
[Diagram: Legacy Monolithic App containing Cache, Webserver, and Database]
9. But Some Things Are Becoming More Complex
[Diagram: Container-based App made up of Service1, Service2, and Service3 on six Computing Nodes]
11. But Things Are Becoming More Complex
[Diagram: Container-based App made up of Service1, Service2, and Service3 on six Computing Nodes]
Two Problems
12. Problem #1: How Do We Get Data Out of These Guys?
[Diagram: Container-based App made up of Service1, Service2, and Service3 on six Computing Nodes]
• System
• Network
• Process
• JVM
• Response Time
• Requests
• Errors
13. Problem #2: How Do We Make Sense of the Data?
[Diagram: Container-based App made up of Service1, Service2, and Service3 on six Computing Nodes]
14. Complexity Calls for Great Monitoring
• Isolated
• Automated
• Orchestration-aware
• Simple
• Scalable
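One way to picture "orchestration-aware" monitoring is to roll per-container samples up to the service they belong to, regardless of which node each container landed on. The following is a minimal sketch under that assumption; the sample records and their `node`, `container`, `service`, and `cpu_pct` fields are hypothetical and do not reflect how Sysdig stores or labels data.

```python
from collections import defaultdict

# Hypothetical per-container samples collected from several nodes.
SAMPLES = [
    {"node": "node1", "container": "c1", "service": "Service1", "cpu_pct": 20.0},
    {"node": "node2", "container": "c2", "service": "Service1", "cpu_pct": 40.0},
    {"node": "node2", "container": "c3", "service": "Service2", "cpu_pct": 10.0},
    {"node": "node3", "container": "c4", "service": "Service3", "cpu_pct": 5.0},
]


def cpu_by_service(samples):
    """Return {service: (container_count, average cpu_pct)} across all nodes."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["service"]].append(sample["cpu_pct"])
    return {svc: (len(vals), sum(vals) / len(vals)) for svc, vals in grouped.items()}


if __name__ == "__main__":
    for service, (count, avg) in sorted(cpu_by_service(SAMPLES).items()):
        print(f"{service}: {count} container(s), avg CPU {avg:.1f}%")
```

Grouping by service rather than by node keeps the view stable when the orchestrator moves containers around.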
16. Complexity Also Calls for Great Troubleshooting
• What’s the network activity of my Marathon group?
• What’s using the CPU in the WordPress task?
• How the hell does my Mesos task work?!
• Where’s the bottleneck?
• What’s the response time of my login service?
• What transactions is my Redis service serving?
22. Monitoring Containers, Option 1
[Diagram: OS, Container1, Container2, Container3, Monitoring Agent]
• Not scalable
• Not composable
• Adds dependencies/size
• Kills the concept of one process per container
28. Sky cloud is the limit
• Correlate data
• Scale with your infrastructure
• Alerts, notifications, visualization tools
• Continuous data collection and retention from production systems
29. Sysdig Cloud
• Sysdig evolution for the cloud
• Preserve the premises
• production ready
• natural workflow
• ease of use
• 0 to low config needed
32. How About Security?
• Did someone log into one of our containers?
• Has something been installed in one of the containers?
• Have we been hacked?
• Were configuration files changed?
36. Rules Examples
rule: shell_in_container
desc: a shell running in a container
condition: container.id != host and proc.name = bash
output: "Shell running in container (user=%user.name container_id=%container.id container_name=%container.name shell=%proc.name parent=%proc.pname)"
priority: WARNING
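To make the rule's two moving parts concrete (a boolean condition over event fields, and an output template with `%field` placeholders), here is a minimal Python sketch. It is not Falco's engine: the flat event dictionary, the `render` and `check` helpers, and the `<NA>` fallback are all assumptions made for illustration.

```python
import re

# Output template copied from the shell_in_container rule.
OUTPUT = (
    "Shell running in container (user=%user.name "
    "container_id=%container.id container_name=%container.name "
    "shell=%proc.name parent=%proc.pname)"
)


def shell_in_container(event):
    """Mirror of the condition: container.id != host and proc.name = bash."""
    return event["container.id"] != "host" and event["proc.name"] == "bash"


def render(template, event):
    """Replace each %field placeholder with the event's value for that field."""
    pattern = r"%([a-z_]+(?:\.[a-z_]+)*)"
    return re.sub(pattern, lambda m: str(event.get(m.group(1), "<NA>")), template)


def check(event):
    """Return a formatted alert if the rule fires, otherwise None."""
    if shell_in_container(event):
        return "WARNING " + render(OUTPUT, event)
    return None


if __name__ == "__main__":
    # Hypothetical event, keyed by the field names used in the rule.
    print(check({
        "container.id": "3ad7b26ded6d",
        "container.name": "web",
        "proc.name": "bash",
        "proc.pname": "docker",
        "user.name": "root",
    }))
```

An event whose `container.id` is `host` does not fire the rule, so a shell started directly on the node produces no alert.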
37. Rules Examples
rule: mysqld_spawn_process
desc: mysqld spawning a new process after startup.
condition: spawn_process and proc.name = mysqld and not proc_is_new
output: "mysqld spawned new process after startup (user=%user.name command=%proc.cmdline file=%fd.name)"
priority: WARNING
38. Rules Examples
macro: open_connection
condition: syscall.type = connect and evt.dir = < and fd.sockfamily = ip
rule: system_binaries_network_activity
desc: any network connection initiated by system binaries that are not expected to send or receive any network traffic
condition: open_connection and proc.name in (ls, ps, mkdir, … )
output: Known system binary made network connection (user=%user.name command=%proc.cmdline connection=%fd.name)
priority: WARNING
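The macro in this example is a named, reusable condition that a rule combines with further checks. A rough Python analogue, assuming a hypothetical flat event dictionary, treats the macro as a predicate function and the rule as a second predicate that calls it. The set of binaries holds only the three names shown on the slide; the rest of the list is elided in the source and is left out here.

```python
# Only the binaries named on the slide; the remainder of the list is elided there.
SYSTEM_BINARIES = {"ls", "ps", "mkdir"}


def open_connection(event):
    """Macro: syscall.type = connect and evt.dir = < and fd.sockfamily = ip."""
    return (
        event.get("syscall.type") == "connect"
        and event.get("evt.dir") == "<"
        and event.get("fd.sockfamily") == "ip"
    )


def system_binaries_network_activity(event):
    """Rule: open_connection and proc.name in (ls, ps, mkdir, ...)."""
    return open_connection(event) and event.get("proc.name") in SYSTEM_BINARIES


if __name__ == "__main__":
    # Hypothetical event: `ls` completing an IP connect call.
    event = {
        "syscall.type": "connect",
        "evt.dir": "<",
        "fd.sockfamily": "ip",
        "proc.name": "ls",
    }
    print(system_binaries_network_activity(event))
```

Keeping the macro separate means other rules can reuse the same connection check without repeating its three field comparisons.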