Thoughts on Technical Issues and Resolution CL24_1656.pdf
1.
Thoughts on Technical Issues and Resolution
Jesse Austin
Flight System Systems Engineer
04/03/2023
This document has been reviewed and determined not
to contain export controlled technical data.
2.
jpl.nasa.gov
About me (college)
4/7/2024
• CU Boulder Electrical and Computer Engineering
• B.S. 2014, M.S. 2016
• Started aerospace projects through Colorado Space Grant Consortium (COSGC)
• RockSat-X, RockOn, PolarCube, and more!
Image: RockSat-X Terrier-Improved Malemute at NASA Wallops Flight Facility (WFF)
Image: RockSat-X Integration and Testing at WFF
Image: RockSat-X X-HD Payload Test
Image: RockSat-X X-HD Flight video still
3.
About me (JPL career)
• First job out of college at NASA Jet Propulsion Laboratory for Mars 2020!
• M2020 Roles:
• Motor Control Subsystem Integration and Test Engineer
• Motor Control Flight System Systems Engineer
• Mechanisms Operations Engineer
• Currently the Lead Motor Control Flight System Systems Engineer for the Sample Retrieval Lander
Image: JPL Motor Control Subsystem Actuator Testbed
Image: JPL Vehicle Systems Testbed
Image: JPL Engineering Operations Mechanism Chair on M20 landing
Image: Mars Sample Return Program concept image
4.
What you will hear today
… maybe learn about
• Overview of technical issues and the JPL issue tracking process.
• My personal beliefs on issue resolution mindset and common pitfalls.
• Some example issues that I find interesting.
Disclaimer!
The views and advice presented here are my personal opinions.
They are influenced by Mars planetary project work and need to be
adapted to your specific use cases.
Nothing here is the rule for success.
Follow guidelines provided to you by your place of work.
5.
Example issues
… we all have problems!
• In general, anything unexpected when using the system (an anomaly).
• Can be system design, behavior, or performance related:
• Software reset (assert)
• Software reports an error
• Algorithm computes incorrect value
• System powered off unexpectedly
• Mechanical yield
• Instrument pointing to incorrect location
• Electrical noise on sensor corrupts data
• System attitude exceeds defined limit
• Communication loss
• Can be user or procedure error:
• Tester left test lock in place before attempting to deploy mechanism
• Damaged circuitry while installing into mechanical chassis.
• Incorrect command issued resulting in undesired system state.
Issues are not limited to flight hardware/software.
Ground systems, tools, etc. can also have mission-critical issues.
6.
When do we find these issues?
• In what phases of a project are issues typically discovered?
• Design?
• Assembly and Test (V&V Program)?
• Flight Operations?
7.
When do we find these issues?
• In what phases of a project are issues typically discovered?
• Assembly and Test
• Flight Operations
• Note that Flight and Test venues may show/hide problems differently depending on the fidelity of the venue and the environment.
• Example: electromagnetic interference (EMI) may present itself differently in a flight configuration with flight electrical harnessing vs. a testbed configuration, which may have harnessing that is much longer, separated, or not shielded the same way.
Finding problems as early as possible is typically much better for implementation, cost, and schedule!
8.
Issues in Test vs. Operations
Test (Ground)
• Typically easier to investigate.
• Unlikely to be time critical.
• More readily observable.
• Can add external non-flight devices to measure things (instrumentation) such as an oscilloscope.
• Likely have higher data rates and more diagnostic evidence available.
• Can use your senses
• Did you see, hear, smell, feel something?
Flight (Operations)
• Can be difficult to investigate.
• May be time critical.
• Only have as much visibility as designed into the system.
• Existing sensors, software diagnostics, imagery, etc.
• May need to use a ground Engineering Model (EM) for comparison.
• Possible lower data rate or access.
Tip: Design diagnostics to be observable during in-flight anomalies! Ask yourself “what data would I need access to in order to understand issues with my system?”
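One way to act on this tip is to keep a small, bounded history of recent diagnostic samples that can be retrieved after an anomaly. The sketch below is only an illustration under stated assumptions: the class `DiagnosticLog`, its methods, and the channel name `motor_current_a` are hypothetical and do not describe any JPL flight software.

```python
from collections import deque


class DiagnosticLog:
    """Bounded history of recent samples (hypothetical illustration)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry once it is full.
        self._samples = deque(maxlen=capacity)

    def record(self, time_s, channel, value):
        self._samples.append((time_s, channel, value))

    def snapshot(self, channel=None):
        """Return retained samples, optionally for one channel only."""
        return [s for s in self._samples if channel is None or s[1] == channel]


# Made-up data: four samples recorded into a log that retains three.
log = DiagnosticLog(capacity=3)
for t, amps in enumerate([0.10, 0.11, 0.95, 0.12]):
    log.record(float(t), "motor_current_a", amps)

recent = log.snapshot("motor_current_a")
```

The capacity is the design decision here: it trades memory and downlink volume against how far back an investigator can look after something unexpected happens.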
9.
Where are we most likely to find issues in a system?
Figure (system block diagram):
• Blocks: Main Computer, Camera Instrument, Power System, Pointing Mechanism, Science Objective
• Connections: High Rate Data Bus, Low Rate Data Bus, Low Power Bus, High Power Bus, Camera View
• Labels: Software (x3), Firmware (x2), ???
10.
Importance of early testing and finding issues.
Figure (chart over the timeline to operations, phases 1 to 4): cost/impact to fix a problem, test fidelity, and new issues found (hopefully).
1 Initial testing, interface checkouts. The most time to resolve issues.
2 Testing is high enough fidelity that many issues can be uncovered. More complex issues may be observable. Finding and fixing issues is the goal here.
3 Testing venue and process stabilize; key interface, behavior, and performance issues start to decline. Time to operations is limited, and the cost/impact of fixing an issue increases.
4 Mission nears flight/operations. Any issues here are very expensive and possibly risky to resolve depending on impact.
“We have 7 years to do a 9-year job”
11.
So, you found a problem... Now what?
• 1a. Human safety
• Priority is human health and safety. Always.
• Hardware can be replaced, people cannot.
• 1b. Hardware safety
• If human safety is not a concern, protect the hardware under test.
• As needed and appropriate for the scenario: stop actions, Emergency Power Off (EPO), ensure no further damage can occur.
• 2. ???
12.
So, you found a problem... Now what?
• 1a. Human safety
• Priority is human health and safety. Always.
• Hardware can be replaced, people cannot.
• 1b. Hardware safety
• If human safety is not a concern, protect the hardware under test.
• As needed and appropriate for the scenario: stop actions, Emergency Power Off (EPO), ensure no further damage can occur.
• 2. Document what occurred.
• 3. Stop and think about appropriate next steps.
13.
JPL problem reporting process
• JPL issue reporting will include documenting the following:
• Description of Incident
• What happened, where, when, data artifacts from the time of issue.
• Verification and Analysis
• Re-create the problem if possible/safe.
• Investigation into the issue will ideally determine a root cause. Re-testing with diagnostic instrumentation or analysis of the design can be used to verify the observed issue.
• Corrective Action
• Decision on how to resolve the problem will be agreed upon.
• Fix: implement a fix for the problem (such as a software update)
• Use As Is (UAI): Do not fix the problem.
• May add additional operational workaround such as a Flight Rule (FR) or handling constraint or requirement waiver.
• Testing
• Depending on the corrective action, testing will take place to verify the fix or the workaround as acceptable.
• Closure
This document has been reviewed and determined not to contain export controlled technical
data.
13
Tracking problems with user assigned tickets is a good way to make sure work gets done and is not missed/forgotten.
For example, GitHub tickets are used to track flight ops tool issues.
At JPL we assign an engineer to work the problems (Stuckee).
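The reporting steps on this slide map naturally onto a ticket record. The following is a minimal sketch, not JPL's actual problem reporting tool or schema: the class and field names are hypothetical, and the closure rule in `ready_for_closure` is an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class CorrectiveAction(Enum):
    FIX = "Fix"
    USE_AS_IS = "Use As Is (UAI)"


@dataclass
class ProblemReport:
    description: str                      # what happened, where, when
    assignee: str                         # engineer assigned to work the problem
    artifacts: List[str] = field(default_factory=list)    # pointers to data and procedures
    root_cause: Optional[str] = None      # result of verification and analysis
    corrective_action: Optional[CorrectiveAction] = None
    workarounds: List[str] = field(default_factory=list)  # e.g. flight rule, waiver
    verified_by_test: bool = False

    def ready_for_closure(self) -> bool:
        # Assumed rule for this sketch: a corrective action has been agreed
        # and testing has verified the fix or the workaround.
        return self.corrective_action is not None and self.verified_by_test


# Illustrative values only.
report = ProblemReport(
    description="Reported angle in TLM-0001 was 1.2 rad; operators expected 1.5 rad.",
    assignee="assigned engineer",
    artifacts=["Procedure P12345, step 11-7"],
)
report.corrective_action = CorrectiveAction.USE_AS_IS
report.workarounds.append("Flight Rule restricting the affected operation")
```

Keeping the assignee and the artifact pointers as required parts of the record reflects the point of the slide: every problem has an owner, and the evidence stays attached to it.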
14.
Tips for documentation
• When initially describing the problem, do not place blame on any specific person or area of the design.
• Just list the facts and the timeline as needed.
• Wrong:
• “Ricky incorrectly sent the POWER_OFF command at the wrong time because they do not know what they are doing”
• “The calculated angle reported is incorrect probably due to flight software code implementing the wrong algorithm”
• Right:
• “The test conductor sent the POWER_OFF command erroneously at step 11-7 of Procedure P12345, prior to observing the intended state change listed in step 11-6, resulting in a warning message ERR_123 Invalid power state request (code 0xABC3).”
• “The reported angle in telemetry field TLM-0001 was 1.2 Radians, where operators expected 1.5 Radians. Flight software is expected to calculate the angle based on additional fields X and Y with equation Z.”
• Always provide relevant pointers to test documentation and raw files if they are available.
• Goal is to have relevant info immediately available for Subject Matter Experts (SME), leads/managers, or future reviewers to find.
15.
My philosophies for issue investigation
• Always be willing to admit your mistakes.
• Trust but verify (especially your own assumptions).
• “Ice in the veins”
• Finding problems is probably part of your job…
• Fishbone Diagram
• What can cause the observed issue? What other observables would there be? What data can we look at or generate to work down possible problem areas?
• Have documentation at the ready and do not be afraid of technical specifications or source code.
• Pitfalls
• Action bias
• Asking for help too soon
• Asking for help too late
Fishbone diagram
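A fishbone diagram can also be kept as plain data, which makes it easy to record what evidence each candidate cause would produce and which branches have been worked down. This is a rough sketch with hypothetical categories and causes, loosely based on the incorrect-angle example from the documentation tips; it is not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    category: str        # branch ("bone") of the diagram
    description: str
    observable: str      # evidence we would expect if this cause were real
    ruled_out: bool = False


def open_causes(causes):
    """Group the causes that have not been ruled out by category."""
    grouped = {}
    for cause in causes:
        if not cause.ruled_out:
            grouped.setdefault(cause.category, []).append(cause.description)
    return grouped


# Hypothetical candidate causes for an incorrect reported angle.
causes = [
    Cause("Software", "Wrong equation implemented", "Error repeats for the same inputs"),
    Cause("Hardware", "Sensor noise corrupts input", "Error varies from run to run"),
    Cause("Procedure", "Wrong command sent", "Command log differs from procedure"),
]

# Example: the command log matched the procedure, so that branch is closed.
causes[2].ruled_out = True
remaining = open_causes(causes)
```

Writing down the expected observable for each cause before looking at data is what turns the diagram into a test plan rather than a list of guesses.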
16.
Mobility First Drive
https://www.nasa.gov/feature/jpl/nasas-mars-2020-rover-completes-its-first-drive
System designed for a different environment than it is tested in…
17.
Ingenuity Helicopter Deploy in System Thermal Test (STT)
https://mars.nasa.gov/resources/25792/ingenuitys-complete-deployment/
Human errors and the importance of having confidence in your expectations
19.
Motor Controller Current Sensor Corruption
Image source: https://www.jpl.nasa.gov/edu/news/2019/8/29/mars-rover-engineer-built-career-from-nasa-jpl-intern
Issues can be very complex and difficult to observe… Fixes can be risky or inopportune
20.
Motor Controller Current Sensor Corruption
Figure (plot of motor current and voltage, PWM = 26):
• Traces: Hardware Raw Current, Hardware Filtered Current, Software Filtered Current, Raw Motor Voltage
• HW current filter is affected by “bad” PWM 26
• Fixed: Software filtered current successful at rejecting corruption in the HW. 500 mA motor current error avoided.
• Note: mcs cmotor estimate is filtered differently than MCFSW algorithm
No need to understand this plot!
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9843596
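The slide does not say how the software filter works. As a general illustration of how software can reject isolated corrupted samples, here is a sliding median filter; this is an assumed stand-in technique, not the MCFSW algorithm, and the sample values are made up.

```python
from statistics import median


def median_filter(samples, window=3):
    """Sliding median; window should be odd. Edges use a shorter window."""
    half = window // 2
    filtered = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        filtered.append(median(samples[lo:hi]))
    return filtered


# Made-up current samples in amps with one corrupted reading of 0.70 A.
raw = [0.20, 0.21, 0.70, 0.22, 0.21]
clean = median_filter(raw)
```

A median discards a single outlier entirely, whereas an averaging filter would smear part of the spike into the neighboring samples. The cost is added delay and computation, which matters in a control loop.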
21.
What we can do in the design/test phase to help minimize any major problems later in the mission…
• Keep design refined and simple from the get-go. More complexity = more problems.
• Reference past designs.
• Design Principles
• JPL tracks lessons learned as Design Principles that are imposed on future missions. Failure to adhere to these principles requires an approved waiver with good rationale and agreement amongst technical experts.
• Design with margin that we can use later
• Mechanical design mass and volume margin, torque limits
• Software timing
• Memory resource needs
• Test to the environments we expect to be in (with margin)
• Hot/cold, vacuum, vibration, shock, EMI/EMC, power conditions, etc.
• Test at various levels of build up with various Engineering Model and Flight units.
• Subsystems, partial flight systems, full flight systems
• Accept/reject/resolve Test As You Fly Exceptions (TAYFE)
• Track and document our test plans, identifying holes in the testing that may not provide good coverage to find system level issues.
• Fault Tree Analysis (FTA), Failure Modes and Effects Analysis (FMEA), and assessing Fault Containment Regions (FCR)
• Implement fault protections in hardware and software as needed to mitigate or meet an acceptable level of risk for each of these areas.
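Margin tracking for resources such as mass, software timing, and memory can be reduced to a simple recurring check. The sketch below assumes one common convention, margin = (allocation - current estimate) / allocation; projects define margin in different ways, and the resource names, numbers, and 20% threshold are hypothetical.

```python
def margin_fraction(allocation, estimate):
    """Unused fraction of an allocation (one common convention)."""
    if allocation <= 0:
        raise ValueError("allocation must be positive")
    return (allocation - estimate) / allocation


# Hypothetical resources: (allocation, current estimate)
resources = {
    "mass_kg": (10.0, 7.0),
    "control_loop_time_ms": (8.0, 7.0),
    "memory_kib": (512.0, 256.0),
}

REQUIRED_MARGIN = 0.20  # hypothetical policy threshold

# Names of resources whose remaining margin is below the threshold.
low_margin = [
    name
    for name, (allocation, estimate) in resources.items()
    if margin_fraction(allocation, estimate) < REQUIRED_MARGIN
]
```

Running a check like this at each design review makes margin erosion visible early, while there is still time to trade it against other parts of the design.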
22.
Last bits of advice
• Have an expectation of what you will observe before you take an action.
• It is okay to not know the answer! We all get better over time, and more experience will naturally make you more adept at finding root cause for issues.
• Understand that not every issue needs to be fixed.
• Ask: what are we trying to do, and how does this problem affect it? Can we accept the issue?
• Is this resolution a “make play” or a “make better”?
• Risk vs. Consequence
• Reflect on how you performed after major anomaly investigations.
• What did you do right? What do you wish you had done better?