Monitoring error logs at Databricks
Josh Rosen
April 12, 2017
How we process logs
[Architecture diagram: Log4J raw logs from customer and Databricks services land in Amazon S3 and Kinesis; basic pre-processing converts the raw logs into Parquet, which feeds the analysis described in this talk and, ultimately, alerts, reports, and dashboards.]
Goal: monitor services’ logs for errors
• Search service logs for error messages to discover
issues and determine their scope/impact:
– Which customers are impacted by an error?
– How frequently is that error occurring?
– Does the error only affect certain versions of our
software?
Challenges
• Data structure: our logs have less structure than our
metrics
• Data volume: we ingest over 10 terabytes of logs per day
• Signal vs. noise: an alerting system isn't very useful if it has
frequent false alarms for benign/known errors.
Solution: normalize, deduplicate & filter
• Normalize: replace constants in logs (numbers, IP
addresses, customer names) with placeholders.
• Deduplicate: Store (count, version,
set(customers), example) instead of raw logs.
• Filter: Use patterns to (conditionally) ignore known
errors or to surface only new errors (errors that
appeared for the first time).
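As a toy illustration of the normalize + deduplicate steps (the messages and the tiny normalizer below are made up; the real pipeline follows in the rest of this talk):

// Toy normalizer: replace AWS instance ids and numbers with placeholders.
def normalize(msg: String): String =
  msg.replaceAll("i-[0-9a-f]+", "<AWS-INSTANCE-ID>").replaceAll("[0-9]+", "<NUM>")

// Two raw errors from different customers with the same underlying cause...
val raw = Seq(
  ("shard-fooCorp", "2.39", "Timeout after 1200 seconds while setting up instance i-00cf6d76e44d64ed7"),
  ("shard-barCorp", "2.39", "Timeout after 900 seconds while setting up instance i-0123456789abcdef0"))

// ...collapse into one (count, version, set(customers), example) record per
// (version, normalized pattern) instead of storing every raw message.
val deduped = raw
  .groupBy { case (_, version, msg) => (version, normalize(msg)) }
  .map { case ((version, _), rows) =>
    (rows.size, version, rows.map(_._1).toSet, rows.head._3)
  }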
High-level overview of the pipeline:
[Pipeline diagram: raw logs are joined with service version info to produce logs with versions, then flow through fast normalize → deduplicate/aggregate → slow normalize → final aggregation. The aggregated errors go to storage (for historical analysis); error suppression patterns and historic data filter the non-suppressed errors down to new/interesting errors, which feed alerts, reports, and dashboards.]
CREATE TABLE allDeduplicatedErrors (
  normalizedErrorDetail STRING,  -- error pattern (with placeholders)
  rawErrorDetail STRING,         -- example of a raw error (before normalization)
  numOccurrences BIGINT,         -- store counts instead of individual messages
  serviceVersion STRING,         -- separate counts per service version, so we can compare relative error frequencies
  affectedShards ARRAY<STRING>,  -- list of affected customers
  className STRING,              -- name of the Java class producing the log
  date STRING,
  service STRING
)
USING parquet
PARTITIONED BY (date, service)   -- partition to support data skipping at query time
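As a hypothetical example of how this schema gets used (the date and service values below are illustrative), the per-version counts plus partition pruning make it cheap to compare error frequencies across versions for a single day:

// Compare how often each normalized error occurs in each service version for
// one day; `date` and `service` are partition columns, so only the relevant
// Parquet partitions get scanned.
val errorsByVersion = sql("""
  SELECT serviceVersion, normalizedErrorDetail, sum(numOccurrences) AS occurrences
  FROM allDeduplicatedErrors
  WHERE date = '2017-04-12' AND service = 'driver'
  GROUP BY serviceVersion, normalizedErrorDetail
  ORDER BY occurrences DESC
""")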
Enriching logs with version info
• Our services don’t record version information in each log message.
• We can use service uptime logs to build a
(serviceInstance, time range) -> version
mapping, then join against this mapping to enrich logs with version info.
Enriching logs with version info
val serviceVersions = sql(s"""
SELECT
tags.shardName AS customer,
tags.projectName AS service,
instanceId AS instanceId,
cast(from_unixtime(min(timestamp / 1000)) AS timestamp) AS min_ts,
cast(from_unixtime(max(timestamp / 1000)) AS timestamp) AS max_ts,
tags.branchName AS branchName
FROM serviceUptimeUsageLogs
GROUP BY tags.branchName, tags.projectName, instanceId, tags.shardName
""")
serviceVersions.createOrReplaceTempView("serviceVersions")
Enriching logs with version info
SELECT
cast(1 AS long) AS cnt, serviceErrors.*, branchName
FROM serviceErrors, serviceVersions
WHERE
serviceErrors.shardName = serviceVersions.customer
AND serviceErrors.service = serviceVersions.service
AND serviceErrors.instanceId = serviceVersions.instanceId
AND cast(concat(getArgument('date'), ' ', serviceErrors.time) AS timestamp)
>= serviceVersions.min_ts
AND cast(concat(getArgument('date'), ' ', serviceErrors.time) AS timestamp)
<= serviceVersions.max_ts
Fast normalization
• Log volume is huge, so we want to perform cheap normalization to quickly
cut down the log volume before applying more expensive normalization.
• At this stage, we:
– Truncate huge log messages
– Strip out information which prefixes the log message (timestamps, metadata)
– Drop certain high-frequency errors that we don’t want to analyze using this
pipeline (e.g. OOMs, which are analyzed separately)
• This can be expressed easily in SQL using built-in functions.
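For example, a minimal sketch of this stage (the table and column names, the prefix format, and the OOM filter shown here are illustrative assumptions, not our exact pipeline):

val fastNormalized = sql("""
  SELECT
    -- Truncate huge messages so later regex-based normalization stays cheap.
    substring(
      -- Strip a leading timestamp / log-level prefix, e.g. '2017-04-12 10:15:30 ERROR '.
      regexp_replace(message, '^[0-9-]+ [0-9:,.]+ [A-Z]+ ', ''),
      1, 10000) AS message,
    shardName, service, instanceId, time
  FROM rawServiceErrors
  -- Drop high-frequency errors analyzed by a separate pipeline (e.g. OOMs).
  WHERE message NOT LIKE '%java.lang.OutOfMemoryError%'
""")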
Expensive normalization
• Use a UDF (user-defined function) which applies a list of regexes, in
increasing order of generality, to replace variable data with placeholders.
val regexes = Seq(
  "https://[^ ]+" -> "https://<URL>",
  "http://[^ ]+" -> "http://<URL>",
  "\\[((root|tenant|op|parent)=[^ ]+ ?){1,5}\\]" -> "<RPC-TRACING-INFO>",
  [...]
  "(?<![a-zA-Z])[0-9]+\\.[0-9]+" -> "<NUM>", // floating point numbers
  "(?<![a-zA-Z])[0-9]+" -> "<NUM>"
)
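The slides don't show the UDF body itself; a minimal sketch of one possible implementation, using the regexes list above and the normalizeError name from the assertion below (the real implementation may differ):

import org.apache.spark.sql.functions.udf

// Apply each (pattern -> placeholder) rule in order, so the more specific
// rules (URLs, RPC tracing info, ...) run before the generic <NUM> rules
// at the end of the list.
def normalizeError(message: String): String =
  regexes.foldLeft(message) { case (msg, (pattern, placeholder)) =>
    msg.replaceAll(pattern, placeholder)
  }

// Registered as a Spark SQL UDF so normalization can run inside the pipeline's queries.
val normalizeErrorUdf = udf(normalizeError _)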
Applying the normalizer to a sample string:
assert(normalizeError("1 1.23 1.2.3.4") == "<NUM> <NUM> <IP-ADDRESS>")
Example: raw log
(workerEnvId=default-worker-env)[tenant=0 root=ElasticJobRun-68af12a6e31cd8e7
parent=InstanceManager-3db732f5476993f5 op=InstanceManager-3db732f5476993f6]:
Exception while trying to launch new instance (req =
NewInstanceRequest(r3.2xlarge,worker,branch-2.39-304-9912f549,shard-fooCorp,PendingInstance{attributes=AwsInstanceAttributes(instance_type_id: "r3.2xlarge"
memory_mb: 62464
num_cores: 8
[...]
com.databricks.backend.aws.util.InstanceSetupTimeoutException: Timeout after
1200 seconds while setting up instance i-00cf6d76e44d64ed7: Instance is not running.
at [..stacktrace..]
Example: variables to normalize
[Same log as above, with the variable parts highlighted: the RPC tracing block, the instance type (r3.2xlarge, which appears twice), the branch name (branch-2.39-304-9912f549), the shard/customer name (shard-fooCorp), the memory and core counts, the timeout value (1200), and the AWS instance id (i-00cf6d76e44d64ed7).]
Example: normalized log
(workerEnvId=default-worker-env) <RPC-TRACING-INFO>: Exception while trying to
launch new instance (req =
NewInstanceRequest(<INSTANCE-TYPE-ID>,worker,<BRANCH-NAME>,<SHARD-NAME>,PendingInstance{attributes=AwsInstanceAttributes(instance_type_id: "<INSTANCE-TYPE-ID>"
memory_mb: <NUM>
num_cores: <NUM>
[...]
com.databricks.backend.aws.util.InstanceSetupTimeoutException: Timeout after
<NUM> seconds while setting up instance <AWS-INSTANCE-ID>: Instance is not running.
at [..stacktrace..]
Signal vs. noise in log monitoring
• Even though we’ve deduplicated our errors, many of the same error
categories occur day-to-day.
• We don’t want to have to wade through lots of errors that we already know
about in order to find new errors.
• Once we’ve fixed a bug which causes an error, it would be useful to
suppress that error in log messages from the known-buggy versions.
Filtering known errors
Error suppression patterns:
• Mark an error pattern as known and suppress it.
• If the error is expected to be fixed, supply a fix version.
• If an error recurs in a version where we expect it to be fixed
(actualVersion >= fixVersion), then we do not suppress the
error and we show the new occurrences in reports.
Filtering known errors
case class KnownError(
  service: String,
  className: String,
  errorDetailPattern: String,
  fixVersion: String)

// Unconditionally hide this error (no fix version):
KnownError(
  "driver",
  "TaskSetManager",
  "Task <NUM> in stage <NUM> failed <NUM> times; aborting job%",
  null)

// Error reports will only include occurrences from version 2.0.2+:
KnownError(
  "driver", "LiveListenerBus", "%ConcurrentModificationException%", "2.0.2")
Filtering known errors
SELECT
  [...]
FROM allDeduplicatedErrors [...]
WHERE
  NOT EXISTS (
    SELECT *
    FROM knownErrors
    WHERE
      knownErrors.service = allDeduplicatedErrors.service
      AND knownErrors.className = allDeduplicatedErrors.className
      AND allDeduplicatedErrors.normalizedErrorDetail LIKE knownErrors.errorDetailPattern
      AND (fixVersion IS NULL OR
           isHigherVersionThan(fixVersion, allDeduplicatedErrors.serviceVersion))
  )
GROUP BY [...]
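isHigherVersionThan is a UDF whose implementation isn't shown in this talk; a rough sketch under the assumption of dotted numeric version strings like "2.0.2" (our real version strings and comparison logic may differ):

// Returns true if `left` is a strictly higher version than `right`,
// comparing dotted numeric components (so "2.0.10" > "2.0.2").
def isHigherVersionThan(left: String, right: String): Boolean = {
  val l = left.split("\\.").map(_.toInt)
  val r = right.split("\\.").map(_.toInt)
  l.zipAll(r, 0, 0)
    .find { case (a, b) => a != b }
    .exists { case (a, b) => a > b }
}

// Register the function so the SQL query above can call it
// (assumes a SparkSession available as `spark`, as in a Databricks notebook).
spark.udf.register("isHigherVersionThan", isHigherVersionThan _)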
End result:
• High-signal dashboards and reports of only “new” errors.
– Reports have surfaced several rarely-occurring but important errors
– Example: alerted on unexpected failure mode in third-party library
• Normalized and aggregated error logs enable fast analysis / investigation.
• Fast processing pipeline means we can quickly re-process historical raw
logs in case we want to normalize or aggregate by different criteria.
Thank you
joshrosen@databricks.com
@jshrsn