1. Data Science for Infrastructure:
Observe, Understand, Automate
Zain Asgar & Natalie Serrino
2. https://px.dev
Natalie Serrino (@nserrino)
Principal Engineer - TLM @ New Relic
Prior: Eng @ Observe, Eng @ Trifacta, Eng @ Intel
Zain Asgar (@zainasgar)
GM @ New Relic
Adjunct Professor of CS @ Stanford
Prior: Co-founder/CEO - Pixie Labs; Eng @ Google, Trifacta, NVIDIA
3. https://px.dev
We see observability as a data problem
- It’s easy for machines to generate GBs of data per second
- It’s hard to get complete coverage of applications, especially in distributed
environments
- It’s hard to make sure this data is relevant
- It’s hard to distill the data into something usable
4. https://px.dev
What we learned in the data space
- Collecting the right data is half the battle
- Simple models on relevant data usually outperform complex models on a
skewed/incomplete dataset
- Important to be able to audit and inspect your data pipelines
5. https://px.dev
How to do data-driven automation?
Gather raw data!
⏰ Most time is spent here
Need variety and depth in input data
Transform data into signal!
👀 Disproportionate emphasis
Can be a simple rule set or a statistical/ML model
Do something based on signal!
🤞 Ideally with limits + alerts
Huge possibilities here with the Kubernetes API
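The three stages (gather, transform, act) can be sketched in plain Python. This is only an illustration: the `Request` record, the hard-coded samples, the 500 ms threshold, and the `max_actions` limit are all assumptions made for the example, not part of Pixie or the talk's demos.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Request:
    """One raw observation. The fields are hypothetical."""
    pod: str
    latency_ms: float


def gather() -> List[Request]:
    # Stage 1: stand-in for a real collector; returns hard-coded samples.
    return [
        Request("cart-1", 120.0),
        Request("cart-1", 950.0),
        Request("cart-2", 80.0),
        Request("cart-2", 95.0),
    ]


def to_signal(requests: List[Request], threshold_ms: float = 500.0) -> List[str]:
    # Stage 2: a simple rule. Flag pods whose mean latency exceeds the threshold.
    totals: Dict[str, Tuple[int, float]] = {}
    for r in requests:
        count, total = totals.get(r.pod, (0, 0.0))
        totals[r.pod] = (count + 1, total + r.latency_ms)
    return sorted(
        pod for pod, (count, total) in totals.items() if total / count > threshold_ms
    )


def act(slow_pods: List[str], max_actions: int = 1) -> List[str]:
    # Stage 3: a bounded action. The limit caps how much one run can do.
    return [f"alert: {pod} is slow" for pod in slow_pods[:max_actions]]


if __name__ == "__main__":
    print(act(to_signal(gather())))
```

Keeping the action stage behind an explicit limit mirrors the "limits + alerts" note: a bad signal can then trigger at most a bounded number of actions per run.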
6. https://px.dev
How to do data-driven automation?
Gather raw data!
- Logs
- Application metrics
- Raw requests
Transform data into signal!
- Aggregates
- Anomaly detection
- Regex
- Machine learning models
Do something based on signal!
- Ping Slack/JIRA
- Scale deployment up/down
- Allocate more resources
7. https://px.dev
How to do data-driven automation?
Gather raw data!
- Logs
- Infrastructure utilization
- Application metrics
- Raw requests
- Application profiles
- Network connections
- Kubernetes state
Transform data into signal!
- Mostly data wrangling...
- Aggregates
- Anomaly detection
- Thresholds
- Regex/pattern-matching
- Linear regression
- Machine learning models
Do something based on signal!
- Ping Slack/JIRA
- Scale deployment up/down
- Restart pod/service
- Page someone
- Allocate more resources
- Roll back
- Disable/enable feature
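As one concrete instance of the "anomaly detection" and "thresholds" items, here is a minimal z-score check in Python. It is a generic statistical sketch, not the method used in the talk; the function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscore_anomalies(values: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Return indexes of values more than `threshold` std devs from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


if __name__ == "__main__":
    latencies_ms = [100.0] * 20 + [1000.0]
    print(zscore_anomalies(latencies_ms))
```

A rule this simple is easy to audit, which fits the point that simple models on relevant data are often enough.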
8. https://px.dev
We built Pixie to solve these problems
Auto-telemetry using eBPF
100% scriptable & API-driven
Kubernetes native
15. ● Valid
● Valid
● Built for data analysis and ML
import px
def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]
px.display(http_data())
PxL is an embedded DSL
16. import px
def http_data():
    df = px.DataFrame(table='http_events', start_time='-30s')
    df.pod = df.ctx['pod']
    return df[['pod', 'http_req_path', 'http_resp_latency_ns']]
px.display(http_data())
PxL specifies logical
flow of data
(declarative)
Pixie plans &
optimizes the
execution
Operator
Data
PxL is a dataflow language
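To illustrate what "declarative" means here, the toy Python class below records operators as a plan and only touches data when `execute` is called. It is not Pixie's planner and performs no optimization; the `Plan` class and its method names are hypothetical and exist only for this sketch.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


class Plan:
    """Records operators; nothing runs until execute() is called."""

    def __init__(self, ops: Tuple[Tuple[str, Any], ...] = ()):
        self.ops = tuple(ops)

    def filter(self, predicate: Callable[[Row], bool]) -> "Plan":
        # Returns a new plan with one more operator; no data is read yet.
        return Plan(self.ops + (("filter", predicate),))

    def select(self, *columns: str) -> "Plan":
        return Plan(self.ops + (("select", columns),))

    def execute(self, rows: List[Row]) -> List[Row]:
        # A real engine could reorder or fuse operators here before running them.
        for kind, arg in self.ops:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            elif kind == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return list(rows)


if __name__ == "__main__":
    plan = Plan().filter(lambda r: r["latency_ns"] > 100).select("pod")
    data = [{"pod": "a", "latency_ns": 50}, {"pod": "b", "latency_ns": 200}]
    print(plan.execute(data))
```

Because the script only describes the flow of data, the system that receives the plan is free to decide how and where to run it.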
20. https://px.dev
PxL provides an interface to work with data
It allows us to construct powerful, composable workflows.
The following demos demonstrate this capability:
1. Slack alert on SQL injection attacks
2. Auto-scale deployment by HTTP request throughput
25. https://px.dev
What is a SQL injection?
“SQL injection is a code injection technique used to attack
applications, in which malicious SQL statements are inserted into an
entry field for execution.”
26. https://px.dev
Example SQL injection
User accesses
http://foobar.com?user_id=123
Application executes
SELECT * from users where user_id=123
Malicious actor accesses
http://foobar.com?user_id=123 or 1=1
Application executes
SELECT * from users where user_id=123 or 1=1
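The example can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch: the `users` table contents and the helper names are made up for illustration, and the vulnerable function exists only to show how string concatenation lets the parameter change the query.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # In-memory database with two hypothetical users.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(123, "alice"), (456, "bob")]
    )
    return conn


def lookup_unsafe(conn: sqlite3.Connection, user_id_param: str):
    # Vulnerable: the raw parameter is pasted into the SQL text.
    query = "SELECT * FROM users WHERE user_id=" + user_id_param
    return conn.execute(query).fetchall()


def lookup_safe(conn: sqlite3.Connection, user_id_param):
    # Parameterized: the value is bound and never parsed as SQL.
    return conn.execute(
        "SELECT * FROM users WHERE user_id=?", (user_id_param,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(lookup_unsafe(conn, "123"))          # one row
    print(lookup_unsafe(conn, "123 or 1=1"))   # every row in the table
    print(lookup_safe(conn, "123 or 1=1"))     # no rows
```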
27. https://px.dev
How can we detect SQL injections?
💥 Rules 💥
- Parse query to detect prohibited syntax (e.g. unions)
- Regexes to detect prohibited syntax
💭 Complication: What if your app has a legitimate use of union?
💥 Machine learning 💥
- Train model on real world examples
- Can theoretically learn that certain uses of syntax are okay
💭 Complication: Where to get the dataset?
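A rule-based detector along the lines of the "Rules" option might look like the Python sketch below. The three patterns and their names are assumptions chosen for illustration; they are neither the rules used in the demo nor a complete defense.

```python
import re
from typing import List

# Hypothetical rule set: each name maps to a pattern for suspicious syntax.
RULES = {
    # "or 1=1"-style tautologies: the same number on both sides of "=".
    "tautology": re.compile(r"\bor\b\s+(\d+)\s*=\s*\1\b", re.IGNORECASE),
    # UNION-based extraction.
    "union_select": re.compile(r"\bunion\b\s+(all\s+)?select\b", re.IGNORECASE),
    # Comment markers often used to cut off the rest of a query.
    "comment": re.compile(r"--|/\*"),
}


def matched_rules(query: str) -> List[str]:
    """Return the sorted names of all rules that the query triggers."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(query))


if __name__ == "__main__":
    print(matched_rules("SELECT * from users where user_id=123"))
    print(matched_rules("SELECT * from users where user_id=123 or 1=1"))
    print(matched_rules("SELECT a FROM t UNION SELECT b FROM u"))
```

The last query shows the complication noted above: a legitimate `UNION` is flagged just like a malicious one, so rules of this kind need an allowlist or a second signal.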
30. https://px.dev
Slack Alert for SQL Injection Attacks
Gather raw data!
Collect raw SQL events
Transform data into signal!
Diagnose SQL injection events
Do something based on signal!
Generate alert about SQL injections
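The alerting step could be sketched as follows, assuming a Slack incoming webhook, which accepts a JSON body with a `text` field. The event fields (`pod`, `query`, `rule`), the function names, and the webhook URL are placeholders for this example, not the demo's actual code.

```python
import json
import urllib.request
from typing import Dict, List


def build_alert(events: List[Dict[str, str]]) -> Dict[str, str]:
    # One line per suspicious query. The event field names are hypothetical.
    lines = [
        f"- pod {e['pod']}: {e['query']} (rule: {e['rule']})" for e in events
    ]
    return {"text": "Possible SQL injection detected:\n" + "\n".join(lines)}


def send_alert(webhook_url: str, payload: Dict[str, str]) -> int:
    # POST the JSON payload to the webhook and return the HTTP status code.
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    payload = build_alert(
        [{"pod": "web-1", "query": "SELECT ... or 1=1", "rule": "tautology"}]
    )
    print(payload["text"])
    # send_alert("https://hooks.slack.com/services/<your-webhook-path>", payload)
```

Separating payload construction from delivery keeps the formatting testable without network access.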
32. https://px.dev
Autoscaling
💭 How do you know how many pods your deployment should
have?
💭 How do you know the amount of resources to provision for
those pods?
33. https://px.dev
Possible autoscaling metrics
- CPU, memory of pod
- Avg / p90 / p99 request latency
- Latency of downstream dependencies
- # of outbound connections
- Application-specific metrics
- ….. Many more …...
34. https://px.dev
K8s Autoscalers
- Both “Horizontal” and “Vertical” scaling
- Some built-in autoscaling metrics:
- Pod CPU
- Pod Memory
- Custom metrics API allows scaling on
custom metrics! 😎
https://github.com/kubernetes/metrics
Credit: kubernetes.io
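For horizontal scaling, the Kubernetes documentation describes the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The Python sketch below applies that formula with replica bounds; it omits the tolerance and stabilization behavior of the real autoscaler, and the default bounds are assumptions for the example.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale replicas in proportion to metric/target, clamped to the bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 2 pods each seeing 30 req/s against a target of 10 req/s per pod.
    print(desired_replicas(2, 30.0, 10.0))
```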
36. https://px.dev
Other tools supporting this demo
Custom metrics server adapted from this project:
github.com/kubernetes-sigs/custom-metrics-apiserver
👆 Check it out to build your own K8s metrics server!
HTTP load testing via Hey
https://github.com/rakyll/hey
38. https://px.dev
Autoscale deployment by HTTP request throughput
Gather raw data!
Collect raw HTTP requests
Transform data into signal!
Calculate HTTP req/s by pod
Do something based on signal!
Autoscale # of pods by HTTP req/s
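The "calculate HTTP req/s by pod" step amounts to counting requests per pod over a time window. A minimal Python sketch follows, assuming each raw event is a `(pod, timestamp_s)` pair already filtered to the window; the event shape and function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def requests_per_second(
    events: Iterable[Tuple[str, float]], window_s: float
) -> Dict[str, float]:
    """Count requests per pod and divide by the window length in seconds."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    counts = Counter(pod for pod, _ in events)
    return {pod: n / window_s for pod, n in counts.items()}


if __name__ == "__main__":
    events = [("a", 0.1), ("a", 1.2), ("a", 1.9), ("b", 0.4)]
    print(requests_per_second(events, window_s=2.0))
```

A custom metrics server can then expose the per-pod rate for the autoscaler to compare against its target.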
39. https://px.dev
We’d love to get your feedback
In these demos we showed some simple data workflows on Pixie.
- More details about SQL injection here: blog.px.dev/sql-injection
- More details about autoscaling: blog.px.dev/autoscaling-custom-k8s-metric
What’s next:
- We are working on XSS detection.
- We want to learn about more use cases. Find us on GitHub (pixie-io/pixie) or
Slack (slackin.px.dev).