1. The number of known solutions is very large: roughly five thousand.
How informative is each error code? An error code is informative if its presence in an incoming incident
narrows down the incident’s possible solutions, i.e., reduces our uncertainty about the incident’s solution.
We can quantify the notion of solution “uncertainty” using entropy, a metric from information theory.
Entropy is a quantitative measure of uncertainty and disorder. Solution entropy measures the uncertainty
of an incoming incident’s solution.
$\text{solution entropy} := -\sum_i p_i \log p_i$, where $p_i$ is the proportion of incidents that have solution $i$.
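As a minimal sketch of the entropy definition above, the following computes solution entropy from a list of per-incident solution labels. The function name, the sample labels, and the choice of logarithm base are assumptions for illustration; the poster does not state which base was used.

```python
import math
from collections import Counter


def solution_entropy(solutions, base=math.e):
    """Shannon entropy of the empirical distribution of solution labels.

    `solutions` holds one solution label per incident. The log base is an
    assumption: the source does not say which base it uses.
    """
    counts = Counter(solutions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one incident")
    # p_i is the proportion of incidents that have solution i.
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())


# Hypothetical solution labels, one per incident.
uniform = ["S1", "S2", "S3", "S4"]   # four equally likely solutions
skewed = ["S1", "S1", "S1", "S2"]    # one dominant solution
single = ["S1", "S1", "S1", "S1"]    # no uncertainty at all

print(solution_entropy(uniform, base=2))
print(solution_entropy(skewed, base=2))
print(solution_entropy(single, base=2))
```

Entropy is highest when solutions are evenly spread and drops to zero when every incident shares one solution, which is why a lower value means less uncertainty about an incoming incident's solution.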
Definitions:
• Baseline solution entropy (6.35): Solution entropy of all incidents (see top graph).
• Error type solution entropy: Solution entropy of incidents containing that error type.
• Error number solution entropy: Solution entropy of incidents containing that error type and error
number.
• Nearly all error type solution entropies are significantly lower than the baseline solution entropy,
indicating the presence of these error types significantly reduces solution uncertainty.
• Most error number solution entropies are significantly lower than their respective error type solution
entropies, indicating these error numbers carry additional information.
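The comparison between baseline, error type, and error number solution entropies can be sketched as follows. The incident rows, the error type "HypoType", and the solution labels are invented for illustration and are not taken from the real data; the natural logarithm is likewise an assumption.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (natural log, an assumption) of a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def grouped_entropy(rows, key):
    """Solution entropy of the incidents in each group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])  # row[2] is the solution label
    return {k: entropy(v) for k, v in groups.items()}


# Hypothetical incidents: (error type, error number, solution).
incidents = [
    ("CMICCore", "11", "S1"),
    ("CMICCore", "11", "S1"),
    ("CMICCore", "12", "S2"),
    ("CMICCore", "12", "S2"),
    ("HypoType", "3", "S3"),
    ("HypoType", "3", "S4"),
    ("HypoType", "4", "S3"),
    ("HypoType", "4", "S4"),
]

baseline = entropy([row[2] for row in incidents])
by_type = grouped_entropy(incidents, key=lambda r: r[0])
by_number = grouped_entropy(incidents, key=lambda r: (r[0], r[1]))

print(f"baseline: {baseline:.3f}")
for error_type, h in by_type.items():
    print(f"type {error_type}: {h:.3f}")
for (error_type, number), h in by_number.items():
    print(f"code {error_type}: {number}: {h:.3f}")
```

In this toy data both error types lower the entropy relative to the baseline. Only the first type's error numbers lower it further; the second type's error numbers leave the entropy unchanged, so they carry no additional information.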
Alert descriptions tend to be messy and are in a format that does not allow
easy discovery and extraction of error codes:
<EventMessage>Apr 29 16:04:19 cmic CMICCore[9327]: DEGRADED: CMICCore: 11 #Failed over
for CMIC 1, 132, 1, 11 (39.84.32.11). This CMIC has taken over its
duties.</EventMessage><Subsystem>ServerMgmt</Subsystem><TrackingID>EBAY18-
CMICFailover</TrackingID><P
1. Parse out monograms/bigrams preceding numbers from alerts.
2. Manually determine which monograms/bigrams are potential error codes.
Stop words and noise are omitted.
3. Create and test regular expression patterns that extract these error codes.
4. Use the regular expression patterns to extract error codes from alerts.
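The four steps above can be sketched with Python's re module. This is an illustrative approximation, not the project's actual patterns: the regular expressions, the function names, and the manually chosen list of error types are assumptions, and only monograms are handled, for brevity.

```python
import re
from collections import Counter

# Alert text taken from the example above, joined onto one line.
ALERT = (
    "Apr 29 16:04:19 cmic CMICCore[9327]: DEGRADED: CMICCore: 11 #Failed over "
    "for CMIC 1, 132, 1, 11 (39.84.32.11). This CMIC has taken over its duties."
)

# Step 1: a word followed, within a few non-word characters, by a number.
CANDIDATE = re.compile(r"\b([A-Za-z]\w*)\W{1,3}(\d+)\b")


def candidate_monograms(alerts):
    """Count the words that directly precede a number across all alerts."""
    counts = Counter()
    for text in alerts:
        counts.update(word for word, _ in CANDIDATE.findall(text))
    return counts


# Step 2 (manual): keep the candidates that look like error types.
# "Apr" is noise and "CMIC 1" is a device reference, so both are dropped here.
ERROR_TYPES = ["CMICCore"]

# Step 3: a pattern for "<error type>: <error number>". Requiring the colon
# skips "CMICCore[9327]", where the number appears to be a process ID.
ERROR_CODE = re.compile(
    r"\b(" + "|".join(map(re.escape, ERROR_TYPES)) + r"):\s*(\d+)\b"
)


def extract_error_codes(text):
    """Step 4: return (error type, error number) pairs found in an alert."""
    return ERROR_CODE.findall(text)


candidates = candidate_monograms([ALERT])
codes = extract_error_codes(ALERT)
print(candidates)
print(codes)
```

Keeping the candidate search (step 1) separate from the extraction pattern (step 3) mirrors the workflow: the broad pattern surfaces candidates for manual review, and the narrow pattern extracts only the approved error types.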
• Identified 26 summary alert error types and 33 MPP alert error types from
~30,000 incidents from ~2,000 customer sites from July 2017 to July 2019.
Teradata's Global Support Organization (GSO)
strives to improve customer experience by quickly
and effectively resolving customers' technical issues
(i.e. incidents).
SAM is short for Service Analytics Machine, GSO's
initiative to automate certain aspects of customer
support using AI and machine learning. SAM
analyzes telemetry and currently serves as a
recommendation engine that recommends solutions
to known customer issues.
SAM’s Impact on Time to Resolution of Known Issues
An incident occurs when a customer runs into
something wrong or unexpected with one of
Teradata's products.
For example, a customer may not be able to use a
product because the product crashes, is
unresponsive, or is slow.
AIC (Automatic Incident Creation) proactively
monitors telemetry and automatically creates an
incident when the AIC system detects a significant
cluster of anomalies/errors.
Alerts contain useful telemetry for SAM to analyze,
such as:
• Customer ID
• Timestamps
• Type of incident/alert
• Alert description
• Version numbers
• Backtraces
• Error codes
Feature engineering: Identify, extract, and profile
informative error codes for SAM to use as features
to learn on, hence improving SAM’s predictive
power.
• Identified, extracted, and profiled informative
error codes for SAM to use as features to
learn on, leading to improved predictive
power.
• Increased insight around error codes.
• Potentially increase the number of incidents
fingerprinted if error codes appear in incidents
not covered by SAM’s existing models.
Teradata can leverage analytics and machine
learning to identify and extract informative
features, e.g. error codes, from noisy
telemetry.
• Incorporate error code features into SAM.
• Communicate findings with the broader
product support community to share insight
around error codes.
• Fully automate the process of identifying and
extracting informative error codes from
telemetry.
• Python (pandas, scikit-learn, NumPy, SciPy,
matplotlib, seaborn, pytest, re)
• Teradata Vantage
• SQL
• Jupyter Notebooks
• Git/GitHub
• Jenkins
• JIRA
• Getting accustomed to the software-heavy
aspect of building and maintaining a
machine learning model in production.
• Deriving value from messy, cryptic text data.
• This internship confirmed my passion for
pursuing data science as a career.
• Data science is a team effort; real data
science problems are complex and require
the collaboration of a cross-functional team.
• Manager: Brandon Quach
• Mentor: Chris Smith
• Director: Jenny Wang
• Product Owner: Brian Hutchins
• SAM data scientists: Brandon Quach, Chris
Smith, Jiacong Li, Andrew Washington
• The entire SAM team
[Figure: AIC pipeline. Telemetry -> MPP Alerts -> Summary Alerts -> Incident -> SAM -> Recommended Solution.
Rule sets shown in the figure and the questions they answer:
• MPP Alert Rules: Is the telemetry anomalous?
• PSR Rules: Is a group of MPP alerts problematic?
• AIC Admin Rules: Is a summary alert serious enough to warrant an incident?]
MPP alerts bundle into a summary alert. Summary alerts compose an incident.
Error code = error type + error number, e.g., "CMICCore: 11" (error type: CMICCore, error number: 11).
August 19 2019
Example of an informative error type with uninformative error numbers (the error
numbers tend to be unique, suggesting the error numbers are arbitrary process IDs):
Example of an informative error type with uninformative error numbers (the error
number entropies are scarcely lower than the error type entropy):
Examples of an informative error type with informative error numbers:
Noisy Telemetry
Analytics-Driven Feature Extraction
Informative Features
Predictive Power