Monitoring microservices: Docker, Mesos and Kubernetes visibility at scale
Me
Alessandro Gallotta
Software Engineer @sysdig
@alex_gallotta
@sysdig
Introducing Sysdig
• Capture system events, filter them, run useful scripts
• Lua scripting
• Open Source
• Nice curses UI
lsof
netstat
tcpdump
htop
ps
strace
and more
• track user activity
• top files/processes/connections by
• cpu
• bytes
• …
• logs
• containers
• tracers
• you name it, we track it
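As a rough illustration of the "top processes by bytes" idea in the list above, the following Python sketch sums a handful of I/O events per process and ranks the totals. The event tuples and the `top_by_bytes` helper are hypothetical, made up for this example; they are not sysdig's event format or API.

```python
from collections import Counter

# Hypothetical captured I/O events: (process name, bytes transferred).
# These values are invented for illustration, not real sysdig output.
EVENTS = [
    ("nginx", 1200),
    ("mysqld", 800),
    ("nginx", 300),
    ("redis-server", 150),
    ("mysqld", 50),
]


def top_by_bytes(events, n=2):
    """Return the n process names with the most bytes, largest first."""
    totals = Counter()
    for proc_name, nbytes in events:
        totals[proc_name] += nbytes
    return totals.most_common(n)


if __name__ == "__main__":
    for proc_name, total in top_by_bytes(EVENTS):
        print(f"{proc_name}: {total} bytes")
```

The same group-then-rank pattern applies to files or connections by swapping the grouping key.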
Design Goals
• Production-ready
• Simple
• Lightweight
• Rich data
• Natural workflow
• Native support for containers
• Native support for
and more…
Demo time
Containers are Great…
• Simple
• Scalable
• Isolated
• Service-oriented
• Elastic
• Flexible
• Separation of concerns
But Some Things Are Becoming More Complex
(Diagram: Legacy Monolithic App, with Cache, Webserver and Database)
But Some Things Are Becoming More Complex
(Diagram: Container-based App, with Service1, Service2, Service3 and six Computing Nodes)
Two Problems
Problem #1: How Do We Get Data Out of These Guys?
(Diagram: Container-based App, with Service1, Service2, Service3 and six Computing Nodes)
• System
• Network
• Process
• JVM
• Response Time
• Requests
• Errors
Problem #2: How Do We Make Sense of the Data?
(Diagram: Container-based App, with Service1, Service2, Service3 and six Computing Nodes)
Complexity Calls for Great Monitoring
• Isolated
• Automated
• Orchestration-aware
• Simple
• Scalable
The Orchestrated Version of This
Complexity Also Calls for Great Troubleshooting
What's the network activity of my Marathon group?
What's using the CPU in the WordPress task?
How the hell does my Mesos task work?!
Where's the bottleneck?
What's the response time of my login service?
What transactions is my Redis service serving?
How Do I Get Data Out of These Things: VMs
(Diagram: VM1, VM2 and VM3 on a Hypervisor)
Monitoring VMs, Option 1
(Diagram: VM1, VM2 and VM3 on a Hypervisor)
Hypervisor-level instrumentation, Amazon CloudWatch
Monitoring VMs, Option 2
(Diagram: VM1, VM2 and VM3 on a Hypervisor, with a Monitoring Agent)
Monitoring Containers
(Diagram: Container1, Container2 and Container3 on an OS)
Monitoring Containers, Option 1
(Diagram: Container1, Container2 and Container3 on an OS, with a Monitoring Agent)
• Not scalable
• Not composable
• Adds dependencies/size
• Kills the concept of one process per container
Monitoring Containers, Option 2
(Diagram: Container1, Container2 and Container3 on an OS)
Container runtime-level monitoring
Kernel-level instrumentation
Monitoring Containers, Option 3
(Diagram: Container1, Monitoring Container and Container2 on an OS)
Sysdig Data Collection
(Diagram: Container1 (Docker), Container2 (Docker), Container3 (LXC) and two Apps on one Kernel)
• Instrumentation through kernel module
• sysdig, running in a Docker container: capture and analysis
Sky cloud is the limit
• Correlate data
• Scale with your infrastructure
• Alerts, notifications, visualization tools
• Continuous data collection and retention from production systems
Sysdig Cloud
• Sysdig evolution for the cloud
• Preserve the premises
• production ready
• natural workflow
• ease of use
• 0 to low config needed
Out-of-the-box support
Demo time 2
How About Security?
Did someone log into one of our containers?
Has something been installed in one of the containers?
Have we been hacked?
Were configuration files changed?
Falco: an anomaly detection system built on top of the sysdig engine
Falco Architecture
(Diagram: Container1 (Docker), Container2 (rkt), Container3 (LXC) and two Apps on one Kernel; the Rule system runs in a Docker container)
• File activity
• Network Activity
• User Activity
• Process execution
• IPC
• …
Rules Examples
rule: shell_in_container
desc: a shell running in a container
condition: container.id != host and proc.name = bash
output: "Shell running in container (user=%user.name container_id=%container.id container_name=%container.name shell=%proc.name parent=%proc.pname)"
priority: WARNING
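To make the rule's logic concrete, here is a minimal Python sketch of how the `shell_in_container` condition could be checked against a stream of events. It is an illustration only and not Falco's engine: the event dictionaries, their values, and the `format_alert` helper are assumptions for this example, and the output template is shortened.

```python
# Hypothetical event records; the keys mirror the field names used in the
# rule above, but the dict layout is an assumption made for this sketch.
def shell_in_container(event):
    """Mimic: container.id != host and proc.name = bash."""
    return event["container.id"] != "host" and event["proc.name"] == "bash"


def format_alert(event):
    """Fill a simplified version of the rule's output template."""
    return (
        "Shell running in container "
        f"(user={event['user.name']} container_id={event['container.id']} "
        f"shell={event['proc.name']})"
    )


events = [
    {"container.id": "host", "proc.name": "bash", "user.name": "root"},
    {"container.id": "3f2a1b", "proc.name": "bash", "user.name": "root"},
    {"container.id": "3f2a1b", "proc.name": "nginx", "user.name": "www"},
]

# Only the second event is both inside a container and running bash.
alerts = [format_alert(e) for e in events if shell_in_container(e)]
for line in alerts:
    print(line)
```

A bash process on the host and a non-shell process in a container both pass silently; only the combination triggers the alert.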
Rules Examples
rule: mysqld_spawn_process
desc: mysqld spawning a new process after startup.
condition: spawn_process and proc.name = mysqld and not proc_is_new
output: "mysqld spawned new process after startup (user=%user.name command=%proc.cmdline file=%fd.name)"
priority: WARNING
Rules Examples
macro: open_connection
condition: syscall.type=connect and evt.dir=< and fd.sockfamily=ip
rule: system_binaries_network_activity
desc: any network connection initiated by system binaries that are not expected to send or receive any network traffic
condition: open_connection and proc.name in (ls, ps, mkdir, … )
output: "Known system binary made network connection (user=%user.name command=%proc.cmdline connection=%fd.name)"
priority: WARNING
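The macro in the example above is a named, reusable condition fragment. One way to picture it is as textual substitution into the rule's condition, which the Python sketch below shows. The `expand_macros` helper is hypothetical and is a simplified stand-in, not how Falco actually compiles rules; the process list is cut down to the three names visible on the slide.

```python
import re

# Macro condition copied from the slide; the expansion logic below is a
# simplified stand-in, not Falco's actual rule compiler.
MACROS = {
    "open_connection": "syscall.type=connect and evt.dir=< and fd.sockfamily=ip",
}


def expand_macros(condition, macros):
    """Replace each whole-word macro name with its parenthesized condition."""
    for name, body in macros.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        # A function replacement avoids any escape handling in the body text.
        condition = re.sub(pattern, lambda m, b=body: f"({b})", condition)
    return condition


rule_condition = "open_connection and proc.name in (ls, ps, mkdir)"
expanded = expand_macros(rule_condition, MACROS)
print(expanded)
```

Wrapping the macro body in parentheses keeps its `and` terms grouped when the surrounding rule adds operators of its own.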
Thank You!
www.sysdig.org
www.sysdig.org/falco
@alex_gallotta
@sysdig
github.com/draios
www.sysdig.com
