Seattle 2018
@danielhochman / Engineer / Lyft
Instrumenting and Scaling Cloud-Native Databases with Envoy
Database outage
1. Disk I/O wait spikes briefly
2. Client opens more connections
3. Slowdown due to auth overhead of new connections
4. Client opens more connections
5. Hit max connection limit
Databases in the cloud
Instantly provision resilient, high-throughput infrastructure
No access to underlying VM and/or shared hardware
Limited access to telemetry
Limited access to configuration
Closed source or no ability to run custom binary
Cloud Native
Cloud native technologies empower organizations to build and run
scalable applications in modern, dynamic environments such as
public, private, and hybrid clouds.
Service Mesh topology
Service mesh
Edge
Discovery
Envoy Proxy is deployed at every hop
Instance topology
The application communicates locally with Envoy, which proxies all traffic
localhost:6001
localhost:6101
localhost:7000
…
(internal services)
(third-party services)
(cloud services)
and more!
Layer 3 / 4: Proxying TCP
Stats
cx_active
cx_connect_fail
cx_idle_timeout
cx_total
cx_tx_bytes_total
cx_rx_bytes_total
Other benefits
- DNS aware
- Load balancing: round robin, least request, ring hash, random, etc.
- Impose an idle timeout
- Health checking
- Access logging
localhost:7000
iot.us-east-1.amazonaws.com
174.217.14.202
174.217.14.234
Layer 5 / 6: Offloading SSL
Stats
handshakes
tls_session_reused
fail_verify_no_cert
fail_verify_ca_error
fail_verify_san
cipher.<cipher>
days_until_cert_expires
Other benefits
- Efficient
- Up-to-date and secure (TLS 1.3)
- SNI, cert pinning, session resumption, etc.
- Easier to upgrade
localhost:7000 172.217.14.202:443
Layer 7: Managing HTTP
Stats
cx_http1_total
cx_http2_total
cx_protocol_error
rq_2xx
rq_4xx
rq_5xx
rq_retry
rq_time_ms (hist)
rq_timeout
Other benefits
- Transparent upgrade from HTTP/1 to HTTP/2 (multiplexed)
- Manage request retries and timeouts
- Access logging
- Offload GZIP decompression
HTTP/1
HTTP/2
Statistics
TCP (L3/L4)
- cx_active
- cx_connect_fail
- cx_idle_timeout
- cx_total
- cx_tx_bytes_total
- cx_rx_bytes_total
- cx_length_ms (hist)
SSL (L5/L6)
- handshakes
- tls_session_reused
- fail_verify_no_cert
- fail_verify_ca_error
- fail_verify_san
- cipher.<cipher>
- days_until_cert_expires
HTTP (L7)
- cx_http1_total
- cx_http2_total
- cx_protocol_error
- rq_2xx
- rq_4xx
- rq_5xx
- rq_retry
- rq_time_ms (hist)
- rq_timeout
and more!
Dashboards
Live templating
or {% macro envoy_stats(origin, destination) %}
Observability
Homogeneous telemetry data makes it easier to observe and correlate behavior in large systems.
Observability
Libraries are heterogeneous!
SSL ciphers? Status code metrics? Retry?
import pynamodb
use Aws\DynamoDb\DynamoDbClient;
import "github.com/aws/dynamodb"
e.g.
&aws.Config{
    Endpoint: aws.String("http://localhost:8000"),
}
Envoy provides standard access logs, stats, alarms, retry, etc.
Layer 7: Beyond HTTP
Envoy supports three other database-specific L7 protocols today
DynamoDB
- Protocol: JSON over HTTP
- CloudWatch telemetry
  - min, avg, max latency
  - per-table capacity unit throughput
  - per-minute
- Benefits of Envoy:
  - Histogram of latency (percentiles)
  - Custom windowing of metrics
  - Per-host, per-zone, and per-cluster statistics
DynamoDB with codec
POST / HTTP/1.1
X-Amz-Target: DynamoDB_20120810.GetItem

{
  "TableName": "pets",
  "Key": {
    "Name": {"S": "Patty"}
  }
}
dynamodb.table.pets.GetItem.upstream_rq_time
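As a rough illustration of how a codec could turn the request above into that stat name, here is a minimal Python sketch. It is not Envoy's implementation; the function name `dynamodb_stat_name` and the exact name layout are assumptions modeled on the slide.

```python
import json


def dynamodb_stat_name(headers, body, suffix="upstream_rq_time"):
    """Build a stat name such as dynamodb.table.pets.GetItem.upstream_rq_time.

    Hypothetical helper: the layout mirrors the slide, not a documented API.
    """
    target = headers.get("X-Amz-Target", "")
    # The target looks like "DynamoDB_20120810.GetItem": API version, then operation.
    _, _, operation = target.partition(".")
    if not operation:
        return None
    try:
        table = json.loads(body).get("TableName")
    except (ValueError, TypeError, AttributeError):
        # Body is missing, not JSON, or not a JSON object.
        return None
    if not table:
        return None
    return f"dynamodb.table.{table}.{operation}.{suffix}"
```

Because the table and operation are read from the wire, every client library produces the same stat names without any per-language instrumentation.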
DynamoDB
What was the per-30s p99 for write requests from the
users-streamlistener canary to the pets table?
ts(
envoy.dynamodb.pets.PutItem.upstream_rq_time.p99,
window=30,
group=users-streamlistener,
canary=true,
)
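To make "per-30s p99" concrete, the sketch below computes a nearest-rank percentile per fixed window from raw latency samples. This is an assumption-laden illustration: Envoy and the metrics backend work from histograms rather than raw samples, and `windowed_percentile` is a hypothetical helper, not part of either.

```python
from collections import defaultdict


def windowed_percentile(samples, window_s=30, pct=99):
    """Return {window_start: percentile} for (unix_ts_seconds, latency_ms) samples.

    Uses the nearest-rank method with an integer percentile (1-100).
    """
    buckets = defaultdict(list)
    for ts, latency in samples:
        # Align each sample to the start of its window.
        buckets[int(ts // window_s) * window_s].append(latency)
    result = {}
    for start, values in sorted(buckets.items()):
        values.sort()
        # Integer ceiling of pct% of the sample count, at least 1.
        rank = max(1, (pct * len(values) + 99) // 100)
        result[start] = values[rank - 1]
    return result
```

CloudWatch's per-minute min, avg, and max cannot answer the question on this slide; a percentile over a caller-chosen window can.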
MongoDB
- Protocol: Binary JSON (BSON)
- Benefits of Envoy in TCP mode:
  - Per-host, per-cluster, per-zone network I/O
- Benefits of Envoy with Mongo codec:
  - Per-operation latency
  - Count size and number of documents
  - Count scattered gets in sharded cluster
How did the number of documents returned by queries change in us-east-1a after the 3pm deploy of my service?
MongoDB at scale
Help! My Mongo database is experiencing outages:
- Disk I/O wait spikes briefly
- Client opens more connections
- Slowdown due to auth overhead of new connections
- Open more connections
- Hit max connection limit
Envoy will rate limit new connections to apply backpressure so that query
times can recover.
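The backpressure idea can be sketched with a token bucket that admits new connections at a bounded rate. This is a local, single-process illustration under assumed parameters; the class name `ConnectionRateLimiter` is hypothetical, and the talk describes a global limit, which would need a shared rate limit service rather than this in-memory state.

```python
import time


class ConnectionRateLimiter:
    """Token bucket: allow at most `rate_per_s` new connections per second on
    average, with bursts of up to `burst` connections."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = float(rate_per_s)
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow_new_connection(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Rejecting here is the backpressure: the database is spared the
        # authentication cost of yet another connection while it recovers.
        return False
```

Existing connections are untouched, so in-flight queries can finish while the connection storm is throttled.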
MongoDB at scale
Help! I deleted an index. I read the code, but it was in a 3,000-line class.
The index was still in use and everything fell over until we could
recreate it.
Envoy will efficiently log all Mongo queries in JSON format so that a week
of logs can be audited for usage of the index's fields.
Have you tried the built-in query profiler?
Yes, it caused a serious outage because it's expensive and results in 3x
CPU usage.
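An audit of this kind could look roughly like the following Python sketch. It assumes one JSON object per log line with the query document under `message.query`; that layout and the helper names `referenced_keys` and `queries_touching` are assumptions for illustration, not a documented Envoy log schema.

```python
import json


def referenced_keys(value):
    """Yield every non-operator key found at any depth of a query document."""
    if isinstance(value, dict):
        for key, child in value.items():
            if not key.startswith("$"):
                yield key
            # Operators such as $set or $gt still wrap field names; look inside.
            yield from referenced_keys(child)
    elif isinstance(value, list):
        for item in value:
            yield from referenced_keys(item)


def queries_touching(log_lines, field):
    """Return the parsed log entries whose query mentions `field`."""
    hits = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # Skip lines that are not valid JSON.
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        query = message.get("query", {}) if isinstance(message, dict) else {}
        if field in set(referenced_keys(query)):
            hits.append(entry)
    return hits
```

Running this over a week of logs for each field of the index shows whether any query still depends on it before the index is dropped.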
MongoDB at scale
Envoy will:
- globally rate limit new connections
- efficiently log all Mongo queries
- track the number of queries with no timeout set
- parse the $comment field of a query so we can time and count queries of
individual application methods, log how many records they returned, etc.
… for applications in 3 different languages across 8 clusters.
… 6 months and several outages later ...
/var/log/envoy/mongo/0.log
{
  "time": "2018-10-13T21:17:08.483Z",
  "upstream_host": "172.18.3.19:27817",
  "message": {
    "opcode": "OP_QUERY",
    "query": {
      "findAndModify": "user",
      "query": {"_id": 903730},
      "update": {"$set": {"stats.rating": 4.9}},
      "$comment": "{\"hostname\": \"users-3ae3r\", \"httpUniqueId\": \"91aaaaaf-4c3d-9400-bcbf-c4aaaaaaadb7\", \"callingFunction\": \"users.UpdateRating\"}"
    }
  }
}
envoy.mongo.callsite.users.UpdateRating.reply_time_ms
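The mapping from a query's `$comment` to a per-callsite stat can be sketched as below. It assumes the comment holds a JSON string with a `callingFunction` key, as in the log entry above; `callsite_stat` is a hypothetical helper and the stat name follows the slide rather than Envoy's exact naming.

```python
import json


def callsite_stat(query, suffix="reply_time_ms"):
    """Derive a per-callsite stat name from a query's $comment, or None."""
    comment = query.get("$comment")
    if not isinstance(comment, str):
        return None
    try:
        meta = json.loads(comment)
    except ValueError:
        return None  # The comment is free text, not JSON.
    if not isinstance(meta, dict):
        return None
    callsite = meta.get("callingFunction")
    if not callsite:
        return None
    return f"envoy.mongo.callsite.{callsite}.{suffix}"
```

The application only has to attach the comment; timing, counting, and naming all happen in the proxy, which is why the same approach works across languages.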
Redis partitioning proxy
Consistent hashing + Redis protocol = Redis partitioning proxy
Redis at scale
localhost:6379
…
SET msg hello
INCR comm
MGET lyft hello
SET msg hello
GET hello
INCR comm
GET lyft
OK
1
nil
To the application, the proxy looks like a single instance of Redis.
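The routing behind this can be sketched with a small consistent-hash ring that picks a host per key and splits an MGET into per-host key groups. This is a generic illustration, not Envoy's hashing scheme; the host names, `HashRing`, and `split_mget` are all hypothetical.

```python
import bisect
import hashlib
from collections import defaultdict


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother key distribution."""

    def __init__(self, hosts, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def host_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


def split_mget(ring, keys):
    """Group the keys of one MGET by the host that owns each key."""
    per_host = defaultdict(list)
    for key in keys:
        per_host[ring.host_for(key)].append(key)
    return dict(per_host)
```

For example, `split_mget(HashRing(["redis-a:6379", "redis-b:6379"]), ["lyft", "hello"])` yields one GET-style lookup group per host, and the proxy would merge the replies back into a single response in the original key order.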
Approaches
TCP
HTTP
…
Bump-in-the-wire vs. fully routing
Future codecs
Roadmap
- More codecs
- Full L7 capability vs bump-in-the-wire
- Better integration of tracing
- More fault injection coverage
- Role-based access control
Thanks!
@danielhochman

More Related Content

What's hot

ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...Altinity Ltd
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectDatabricks
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Spark Summit
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDBMike Dirolf
 
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...Daniel Hochman
 
Redis cluster
Redis clusterRedis cluster
Redis clusteriammutex
 
What is new in PostgreSQL 14?
What is new in PostgreSQL 14?What is new in PostgreSQL 14?
What is new in PostgreSQL 14?Mydbops
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraScyllaDB
 
Shipping Data from Postgres to Clickhouse, by Murat Kabilov, Adjust
Shipping Data from Postgres to Clickhouse, by Murat Kabilov, AdjustShipping Data from Postgres to Clickhouse, by Murat Kabilov, Adjust
Shipping Data from Postgres to Clickhouse, by Murat Kabilov, AdjustAltinity Ltd
 
Vitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL GiantVitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL GiantMatt Lord
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentationIlias Okacha
 
Rate limits and Performance
Rate limits and PerformanceRate limits and Performance
Rate limits and Performancesupergigas
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Databricks
 

What's hot (20)

ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
ClickHouse Data Warehouse 101: The First Billion Rows, by Alexander Zaitsev a...
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
ClickHouse Keeper
ClickHouse KeeperClickHouse Keeper
ClickHouse Keeper
 
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
Geospatial Indexing at Scale: The 15 Million QPS Redis Architecture Powering ...
 
Redis cluster
Redis clusterRedis cluster
Redis cluster
 
What is new in PostgreSQL 14?
What is new in PostgreSQL 14?What is new in PostgreSQL 14?
What is new in PostgreSQL 14?
 
Apache Airflow
Apache AirflowApache Airflow
Apache Airflow
 
Lightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache CassandraLightweight Transactions in Scylla versus Apache Cassandra
Lightweight Transactions in Scylla versus Apache Cassandra
 
Shipping Data from Postgres to Clickhouse, by Murat Kabilov, Adjust
Shipping Data from Postgres to Clickhouse, by Murat Kabilov, AdjustShipping Data from Postgres to Clickhouse, by Murat Kabilov, Adjust
Shipping Data from Postgres to Clickhouse, by Murat Kabilov, Adjust
 
Vitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL GiantVitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL Giant
 
Open ebs 101
Open ebs 101Open ebs 101
Open ebs 101
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
 
Rate limits and Performance
Rate limits and PerformanceRate limits and Performance
Rate limits and Performance
 
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
Smartsheet’s Transition to Snowflake and Databricks: The Why and Immediate Im...
 

Similar to Instrumenting and Scaling Databases with Envoy

Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Kevin Mao
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 
Migrating the elastic stack to the cloud, or application logging @ travix
 Migrating the elastic stack to the cloud, or application logging @ travix Migrating the elastic stack to the cloud, or application logging @ travix
Migrating the elastic stack to the cloud, or application logging @ travixRuslan Lutsenko
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsCisco Canada
 
How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.Renzo Tomà
 
Big Data, Mob Scale.
Big Data, Mob Scale.Big Data, Mob Scale.
Big Data, Mob Scale.darach
 
Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)jaxLondonConference
 
MongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data StreamsMongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data StreamsMongoDB
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Canada
 
Writing New Relic Plugins: NSQ
Writing New Relic Plugins: NSQWriting New Relic Plugins: NSQ
Writing New Relic Plugins: NSQlxfontes
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) Surendar S
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13DECK36
 
Addressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh IntegrationAddressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh IntegrationThomasGraf42
 
Log aggregation and analysis
Log aggregation and analysisLog aggregation and analysis
Log aggregation and analysisDhaval Mehta
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudScott Miao
 

Similar to Instrumenting and Scaling Databases with Envoy (20)

Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Migrating the elastic stack to the cloud, or application logging @ travix
 Migrating the elastic stack to the cloud, or application logging @ travix Migrating the elastic stack to the cloud, or application logging @ travix
Migrating the elastic stack to the cloud, or application logging @ travix
 
Model-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data AnalyticsModel-driven Telemetry: The Foundation of Big Data Analytics
Model-driven Telemetry: The Foundation of Big Data Analytics
 
How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.How bol.com makes sense of its logs, using the Elastic technology stack.
How bol.com makes sense of its logs, using the Elastic technology stack.
 
Big Data, Mob Scale.
Big Data, Mob Scale.Big Data, Mob Scale.
Big Data, Mob Scale.
 
Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)Big Events, Mob Scale - Darach Ennis (Push Technology)
Big Events, Mob Scale - Darach Ennis (Push Technology)
 
MongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data StreamsMongoDB World 2018: MongoDB for High Volume Time Series Data Streams
MongoDB World 2018: MongoDB for High Volume Time Series Data Streams
 
Enterprise Data Lakes
Enterprise Data LakesEnterprise Data Lakes
Enterprise Data Lakes
 
Cisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven TelemetryCisco Connect Toronto 2017 - Model-driven Telemetry
Cisco Connect Toronto 2017 - Model-driven Telemetry
 
Cisco project ideas
Cisco   project ideasCisco   project ideas
Cisco project ideas
 
Writing New Relic Plugins: NSQ
Writing New Relic Plugins: NSQWriting New Relic Plugins: NSQ
Writing New Relic Plugins: NSQ
 
Bigdata meetup dwarak_realtime_score_app
Bigdata meetup dwarak_realtime_score_appBigdata meetup dwarak_realtime_score_app
Bigdata meetup dwarak_realtime_score_app
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13
 
Addressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh IntegrationAddressing Network Operator Challenges in YANG push Data Mesh Integration
Addressing Network Operator Challenges in YANG push Data Mesh Integration
 
Log aggregation and analysis
Log aggregation and analysisLog aggregation and analysis
Log aggregation and analysis
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 

Recently uploaded

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Recently uploaded (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

Instrumenting and Scaling Databases with Envoy

  • 1. Seattle 2018 @danielhochman / Engineer / Lyft Instrumenting and Scaling Cloud-Native Databases with Envoy
  • 2. Seattle 2018 Database outage 1. Disk I/O wait spikes briefly 2. Client opens more connections 3. Slowdown due to auth overhead of new connections 4. Client opens more connections 5. Hit max connection limit
  • 3. Seattle 2018 Databases in the cloud Instantly provision resilient, high-throughput infrastructure No access to underlying VM and/or shared hardware Limited access to telemetry Limited access to configuration Closed source or no ability to run custom binary
  • 4. Cloud Native: Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds.
  • 5. Service Mesh topology (diagram labels: Service mesh, Edge, Discovery): Envoy Proxy is deployed at every hop.
  • 6. Instance topology: The application communicates locally with Envoy, which proxies all traffic. Local listeners: localhost:6001, localhost:6101, localhost:7000, … Destinations: internal services, third-party services, cloud services, and more!
  • 7. Layer 3 / 4: Proxying TCP. Stats: cx_active, cx_connect_fail, cx_idle_timeout, cx_total, cx_tx_bytes_total, cx_rx_bytes_total. Other benefits: DNS aware; load balancing (round robin, least request, ring hash, random, etc.); impose an idle timeout; health checking; access logging. Diagram labels: localhost:7000, iot.us-east-1.amazonaws.com, 174.217.14.202, 174.217.14.234.
  • 8. Layer 5 / 6: Offloading SSL. Stats: handshakes, tls_session_reused, fail_verify_no_cert, fail_verify_ca_error, fail_verify_san, cipher.<cipher>, days_until_cert_expires. Other benefits: efficient; up-to-date and secure (TLS 1.3); SNI, cert pinning, session resumption, etc.; easier to upgrade. Diagram labels: localhost:7000, 172.217.14.202:443.
  • 9. Layer 7: Managing HTTP. Stats: cx_http1_total, cx_http2_total, cx_protocol_error, rq_2xx, rq_4xx, rq_5xx, rq_retry, rq_time_ms (hist), rq_timeout. Other benefits: transparent upgrade from HTTP/1 to HTTP/2 (multiplexed); manage request retries and timeouts; access logging; offload GZIP decompression. Diagram labels: HTTP/1, HTTP/2.
  • 10. Statistics. TCP (L3/L4): cx_active, cx_connect_fail, cx_idle_timeout, cx_total, cx_tx_bytes_total, cx_rx_bytes_total, cx_length_ms (hist). SSL (L5/L6): handshakes, tls_session_reused, fail_verify_no_cert, fail_verify_ca_error, fail_verify_san, cipher.<cipher>, days_until_cert_expires. HTTP (L7): cx_http1_total, cx_http2_total, cx_protocol_error, rq_2xx, rq_4xx, rq_5xx, rq_retry, rq_time_ms (hist), rq_timeout. And more!
  • 11. Dashboards: Live templating or {% macro envoy_stats(origin, destination) %}
  • 12. Observability: Homogeneous telemetry data makes it easier to observe and correlate behavior in large systems.
  • 13. Observability: Libraries are heterogeneous! SSL ciphers? Status code metrics? Retry? Client examples: import pynamodb (Python); use Aws\DynamoDb\DynamoDbClient; (PHP); import "github.com/aws/dynamodb" (Go), e.g. &aws.Config{ Endpoint: aws.String("http://localhost:8000") }. Envoy provides standard access logs, stats, alarms, retry, etc.
  • 14. Layer 7: Beyond HTTP. Envoy supports three other database-specific L7 protocols today.
  • 15. DynamoDB. Protocol: JSON over HTTP. CloudWatch telemetry: min, avg, max latency; per-table capacity unit throughput; per-minute. Benefits of Envoy: histogram of latency (percentiles); custom windowing of metrics; per-host, per-zone, and per-cluster statistics.
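To illustrate why a latency histogram adds information beyond min, avg, and max, here is a minimal sketch. The sample latencies are synthetic and the `percentile` helper is a hypothetical nearest-rank implementation, not how Envoy or CloudWatch computes its values.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # p is expected to be an integer percentage, which keeps the rank arithmetic exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data (an assumption for illustration): 97 fast requests and 3 slow ones.
latencies_ms = [4] * 97 + [250, 300, 900]

summary = {
    "min": min(latencies_ms),
    "avg": mean(latencies_ms),
    "max": max(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

In this synthetic sample the average sits far from what a typical request sees, while the p50 and p99 values describe the typical case and the tail separately.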
  • 17. DynamoDB with codec. Request: POST / HTTP/1.1, header X-Amz-Target: DynamoDB_20120810.GetItem, body { "TableName": "pets", "Key": { "Name": {"S": "Patty"} } }. Resulting stat: dynamodb.table.pets.GetItem.upstream_rq_time
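The mapping from the request on this slide to its stat name can be sketched as follows. This is an illustration of the idea only: `dynamodb_stat_name` is a hypothetical helper, and the real Envoy filter is written in C++ and handles many more cases.

```python
import json


def dynamodb_stat_name(headers, body):
    """Build a per-table, per-operation stat name from a DynamoDB HTTP request.

    Hypothetical helper: returns None when the operation or table cannot be determined.
    """
    target = headers.get("X-Amz-Target", "")
    # The header has the form "<api version>.<operation>", e.g. "DynamoDB_20120810.GetItem".
    _, _, operation = target.partition(".")
    if not operation:
        return None
    try:
        table = json.loads(body).get("TableName")
    except (ValueError, AttributeError):
        # Body is not JSON, or is JSON but not an object.
        return None
    if not table:
        return None
    return f"dynamodb.table.{table}.{operation}.upstream_rq_time"


request_headers = {"X-Amz-Target": "DynamoDB_20120810.GetItem"}
request_body = '{"TableName": "pets", "Key": {"Name": {"S": "Patty"}}}'
print(dynamodb_stat_name(request_headers, request_body))
```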
  • 18. DynamoDB: What was the per-30s p99 for write requests from the users-streamlistener canary to the pets table? ts( envoy.dynamodb.pets.PutItem.upstream_rq_time.p99, window=30, group=users-streamlistener, canary=true, )
  • 19. MongoDB. Protocol: Binary JSON (BSON). Benefits of Envoy in TCP mode: per-host, per-cluster, per-zone network I/O. Benefits of Envoy with Mongo codec: per-operation latency; count size and number of documents; count scattered gets in sharded cluster. Example question: How did the number of documents returned by queries change in us-east-1a after the 3pm deploy of my service?
  • 20. MongoDB at scale. Help! My Mongo database is experiencing outages: (1) Disk I/O wait spikes briefly. (2) Client opens more connections. (3) Slowdown due to auth overhead of new connections. (4) Client opens more connections. (5) Hit max connection limit. Envoy will rate limit new connections to apply backpressure so that query times can recover.
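The backpressure idea on this slide can be sketched with a token bucket that gates new connections. This is a local, single-process illustration under assumed parameters; the class name, rate, and burst values are hypothetical, and the talk describes Envoy doing this with a global rate limit rather than with this code.

```python
import time


class ConnectionRateLimiter:
    """Token bucket: allows short bursts of new connections, then a steady rate."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock  # injectable so the behavior can be tested deterministically
        self.last = clock()

    def allow_new_connection(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or delay the connection attempt


# Usage with a fake clock: 3 connections may burst, then 2 per second are admitted.
fake_now = [0.0]
limiter = ConnectionRateLimiter(rate_per_sec=2, burst=3, clock=lambda: fake_now[0])
first_wave = [limiter.allow_new_connection() for _ in range(5)]
fake_now[0] = 1.0
second_wave = [limiter.allow_new_connection() for _ in range(3)]
print(first_wave, second_wave)
```

Rejecting the excess attempts breaks the feedback loop from the outage sequence: clients cannot pile on new connections while the database is still paying the auth cost of the previous ones.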
  • 21. MongoDB at scale. "Help! I deleted an index. I read the code but it was in a 3,000-line class. The index was still in use and everything fell over until we could recreate it." Envoy will efficiently log all Mongo queries in JSON format so that a week of logs can be audited for usage of the index's fields. "Have you tried the built-in query profiler?" "Yes, it caused a serious outage because it's expensive and results in 3x CPU usage."
  • 22. MongoDB at scale. Envoy will: globally rate limit new connections; efficiently log all Mongo queries; track the number of queries with no timeout set; parse the $comment field of a query so we can time and count queries of individual application methods, log how many records they returned, etc. … for applications in 3 different languages across 8 clusters. … 6 months and several outages later ...
  • 23. /var/log/envoy/mongo/0.log: { "time": "2018-10-13T21:17:08.483Z", "upstream_host": "172.18.3.19:27817", "message": { "opcode": "OP_QUERY", "query": { "findAndModify": "user", "query": {"_id": 903730}, "update": {"$set": {"stats.rating": 4.9}}, "$comment": "{ \"hostname\": \"users-3ae3r\", \"httpUniqueId\": \"91aaaaaf-4c3d-9400-bcbf-c4aaaaaaadb7\", \"callingFunction\": \"users.UpdateRating\" }" } } } Resulting stat: envoy.mongo.callsite.users.UpdateRating.reply_time_ms
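How a `$comment` field might be turned into a per-callsite stat name can be sketched as below. The helper `callsite_stat_name` is hypothetical; it assumes the comment is a JSON string carrying a `callingFunction` key, as in the log entry on this slide, and is not Envoy's actual parsing code.

```python
import json


def callsite_stat_name(query_doc):
    """Derive a per-callsite timing stat name from a Mongo query's $comment field.

    Returns None when the comment is missing, is not JSON, or has no callingFunction.
    """
    comment = query_doc.get("$comment")
    if not isinstance(comment, str):
        return None
    try:
        meta = json.loads(comment)
    except ValueError:
        return None
    if not isinstance(meta, dict):
        return None
    callsite = meta.get("callingFunction")
    if not callsite:
        return None
    return f"envoy.mongo.callsite.{callsite}.reply_time_ms"


# The application embeds its calling context in the query as a JSON string.
query = {
    "findAndModify": "user",
    "query": {"_id": 903730},
    "update": {"$set": {"stats.rating": 4.9}},
    "$comment": json.dumps(
        {"hostname": "users-3ae3r", "callingFunction": "users.UpdateRating"}
    ),
}
print(callsite_stat_name(query))
```

Because the context travels inside the query itself, the same approach works for applications in any language, as long as each one writes the comment in the agreed format.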
  • 24. Redis partitioning proxy = Redis protocol + consistent hashing
  • 25. Redis at scale. Application sends to localhost:6379: SET msg hello; INCR comm; MGET lyft hello. Commands forwarded to the Redis instances: SET msg hello; GET hello; INCR comm; GET lyft. Replies: OK, 1, nil. To the application, the proxy looks like a single instance of Redis.
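A partitioning proxy of this kind can be sketched with a consistent hash ring that picks an upstream host per key and splits a multi-key command by host. This is a minimal illustration: the host names, the virtual-node count, and the use of SHA-256 are assumptions, not the hashing scheme Envoy uses.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, hosts, vnodes=100):
        if not hosts:
            raise ValueError("at least one host is required")
        # Each host is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{host}#{i}"), host) for host in hosts for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def host_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


def split_by_host(ring, keys):
    """Group the keys of a multi-key command (such as MGET) by owning host."""
    per_host = {}
    for key in keys:
        per_host.setdefault(ring.host_for(key), []).append(key)
    return per_host


# Hypothetical upstream hosts.
hosts = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]
ring = HashRing(hosts)
print(split_by_host(ring, ["lyft", "hello"]))
```

The reason for consistent hashing rather than `hash(key) % len(hosts)` is that removing or adding a host only moves the keys adjacent to that host's ring points, so most keys keep their placement.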
  • 28. Roadmap: more codecs; full L7 capability vs bump-in-the-wire; better integration of tracing; more fault injection coverage; role-based access control.