Using Formal Techniques with TLA+ for ISO 26262 Functional Safety Verification — Full Version
A short version of this piece appeared on LinkedIn.
This is the full version, with the actual TLA+ specifications, the ISO
26262 V-cycle integration, and the tooling notes that wouldn’t fit
there.
Finding Design Bugs Before You Write Code
ISO 26262 Part 6 Table 7 recommends formal methods for ASIL-C/D
software design verification. Most teams interpret that as “run
Polyspace” and move on. That covers half the requirement. The other half
— design verification — is where the harder bugs live.
Static analysis catches coding defects, not design flaws. When your
redundancy manager has an unhandled state, Polyspace won’t find it. When
your mode-transition logic has an unreachable safe state, MISRA
compliance doesn’t help. These are architectural bugs that exist in the
design before a single line of C is written.
I’ve been working with TLA+ to close this gap. TLA+ is a formal
specification language created by Leslie Lamport (LaTeX, the Turing
Award, etc.). It’s used at AWS and Azure to verify distributed systems,
and the same principles apply directly to safety-critical embedded
systems.
Below is a worked example you can copy, run, and adapt.
The Problem: Dual-Channel Redundancy Arbitration
Consider a dual-channel redundancy controller — the kind found in any
drive-by-wire, brake-by-wire, or steer-by-wire system. Two independent
processing channels run simultaneously and cross-check each other. Each
may process slightly different data, but they must deliver identical
results. If outputs diverge, a fault is detected.
The safety requirement reflects the architecture: both channels must be ACTIVE and in agreement for the system’s output to be trusted. Any divergence indicates a detectable fault.
That sounds straightforward. Let’s specify it formally and see what
happens.
Model 1 — The Naive Design
```tla
---------------- MODULE DualChannelV1 ----------------
EXTENDS Naturals, TLC

VARIABLES channel_a, channel_b, outputs_match

States == {"ACTIVE", "FAILED"}

Init ==
    /\ channel_a = "ACTIVE"
    /\ channel_b = "ACTIVE"
    /\ outputs_match = TRUE

ChannelDivergence ==
    /\ channel_a = "ACTIVE" /\ channel_b = "ACTIVE"
    /\ outputs_match = TRUE
    /\ outputs_match' = FALSE
    /\ UNCHANGED <<channel_a, channel_b>>

ChannelAFails ==
    /\ channel_a = "ACTIVE"
    /\ channel_a' = "FAILED"
    /\ UNCHANGED <<channel_b, outputs_match>>

ChannelBFails ==
    /\ channel_b = "ACTIVE"
    /\ channel_b' = "FAILED"
    /\ UNCHANGED <<channel_a, outputs_match>>

Next == ChannelDivergence \/ ChannelAFails \/ ChannelBFails

Spec == Init /\ [][Next]_<<channel_a, channel_b, outputs_match>>

\* Safety: both channels must be ACTIVE and agreeing
SafetyInvariant ==
    channel_a = "ACTIVE" /\ channel_b = "ACTIVE" => outputs_match
====================================================
```
A code review would probably approve this. It captures the intuitive
understanding:
- Two channels, each `ACTIVE` or `FAILED`
- Cross-check via `outputs_match` (`TRUE` = agree, `FALSE` = divergence)
- Both start `ACTIVE` with matching outputs
- Hardware failures can take a channel down independently
Run TLC against the safety invariant.
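TLC reads the properties to check from a small configuration file next to the module. A minimal one — call it `DualChannelV1.cfg`; the filename is your choice, the keywords are standard TLC configuration syntax — looks like this:

```
SPECIFICATION Spec
INVARIANT SafetyInvariant
```

You can run it from the Toolbox, or headless with `tla2tools.jar` on the command line.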
Bug #1 — Found in Two Steps
```
Error: Invariant SafetyInvariant is violated.

State 1: channel_a = "ACTIVE", channel_b = "ACTIVE", outputs_match = TRUE
State 2: channel_a = "ACTIVE", channel_b = "ACTIVE", outputs_match = FALSE
```
The model checker found a critical safety violation in two steps. The
cross-check detected divergence, but the design has no
response. Both channels remain ACTIVE while disagreeing —
neither can be trusted, but the system continues to drive actuators as
if everything were fine.
This is the classic specification gap: the design defined how to
detect faults but not how to handle
them. In a real system, this state produces undefined actuator behaviour
— potentially sending conflicting commands to safety-critical
components.
A code review would not have caught this. Polyspace would not have
caught this. Unit tests wouldn’t have caught this — the bug isn’t in any
single line of code. The system worked exactly as specified. The
specification was wrong.
Model 2 — Handling Divergence
The fix must define a response to divergence. With only two channels, there is a fundamental problem: when the outputs disagree, we don’t know which channel is faulty. The cross-check tells us they disagree, not who is correct.

There is no tiebreaker. The honest answer in a dual-channel architecture is: on divergence, transition both channels to a SAFE state and demand external intervention.
```tla
---------------- MODULE DualChannelV2 ----------------
EXTENDS Naturals, TLC

VARIABLES channel_a, channel_b, outputs_match, system_state

States == {"ACTIVE", "FAILED", "SAFE"}
SystemStates == {"NORMAL", "DEGRADED", "FAIL_SAFE"}

Init ==
    /\ channel_a = "ACTIVE"
    /\ channel_b = "ACTIVE"
    /\ outputs_match = TRUE
    /\ system_state = "NORMAL"

\* Divergence triggers system-wide fail-safe
HandleDivergence ==
    /\ outputs_match = FALSE
    /\ system_state \in {"NORMAL", "DEGRADED"}
    /\ system_state' = "FAIL_SAFE"
    /\ channel_a' = "SAFE"
    /\ channel_b' = "SAFE"
    /\ UNCHANGED outputs_match

\* (other transitions unchanged)

SafetyInvariant ==
    /\ (channel_a = "ACTIVE" /\ channel_b = "ACTIVE") => outputs_match
    /\ (~outputs_match) => (system_state = "FAIL_SAFE" \/ ENABLED HandleDivergence)
====================================================
```
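The “other transitions unchanged” comment elides the carried-over V1 actions and the plumbing. A reconstruction of how they plug together — my sketch, not the article’s original module — makes the carry-over explicit: the failure actions (and `ChannelDivergence`, which likewise gains `/\ UNCHANGED system_state`) leave `system_state` untouched.

```tla
ChannelAFails ==
    /\ channel_a = "ACTIVE"
    /\ channel_a' = "FAILED"
    /\ UNCHANGED <<channel_b, outputs_match, system_state>>

ChannelBFails ==
    /\ channel_b = "ACTIVE"
    /\ channel_b' = "FAILED"
    /\ UNCHANGED <<channel_a, outputs_match, system_state>>

Next == ChannelDivergence \/ ChannelAFails \/ ChannelBFails \/ HandleDivergence

Spec == Init /\ [][Next]_<<channel_a, channel_b, outputs_match, system_state>>
```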
Run TLC again. Now the divergence path is handled — every divergent
state is followed by a transition to FAIL_SAFE. But Model 2
has its own bug.
Bug #2 — The Single-Channel-Failure Race
```
Error: Invariant SafetyInvariant is violated.

State 1: a=ACTIVE, b=ACTIVE, match=TRUE,  sys=NORMAL
State 2: a=ACTIVE, b=FAILED, match=TRUE,  sys=NORMAL  <-- still NORMAL!
State 3: a=ACTIVE, b=FAILED, match=FALSE, sys=NORMAL  <-- divergence with one dead channel
```
When channel B suffers a hardware fault, channel A keeps running but loses its cross-check partner. The system stays in NORMAL because the outputs still “match” (B is FAILED and contributing nothing). One physical disturbance later, A produces a faulty output and there is no working cross-check left to detect it.
The fix: when a channel fails for any reason, the system must
transition to DEGRADED and require a graceful shutdown —
not continue in NORMAL pretending everything is fine.
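In TLA+ terms, the fix couples the failure actions to the system mode. A sketch using the V2 variable names — my reconstruction, not a verified module — with a strengthened invariant that pins the requirement down:

```tla
ChannelBFails ==
    /\ channel_b = "ACTIVE"
    /\ channel_b' = "FAILED"
    /\ system_state' = "DEGRADED"  \* a lone survivor has no cross-check partner
    /\ UNCHANGED <<channel_a, outputs_match>>

\* A failed channel means the system may no longer claim NORMAL operation
DegradeInvariant ==
    (channel_a = "FAILED" \/ channel_b = "FAILED") => system_state # "NORMAL"
```

`ChannelAFails` gets the symmetric change.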
This is the second architectural bug found in two
more model-checker steps. None of these would have been caught by
code-level review alone.
What Goes Into Your Safety Case
For ISO 26262 Part 6 §9.4.4 (“Verification of software design”),
Table 7 lists formal verification (1f) as highly
recommended at ASIL-C and ASIL-D. The work product expected is
a verification specification that demonstrates the design satisfies the
software safety requirements.
A TLA+ analysis like the one above produces three artefacts that drop
directly into your safety case:
- The specification itself — `DualChannelV2.tla` is human-readable mathematics describing the safety architecture. Reviewers can read it.
- The model-checking report — TLC output showing what state space was explored, which invariants held, and what counterexamples were found and resolved. This is verification evidence.
- The traceability link — each TLA+ invariant maps to one or more software safety requirements (SSRs). The `SafetyInvariant` above traces to the SSR that says “the system shall transition to a safe state on detection of channel divergence within the FTTI.”
Your assessor will ask: what evidence shows the design verifies
the SSRs? The TLC report is part of that evidence.
Integrating TLA+ Into the V-Cycle
TLA+ is a left-side activity in the V-model — it
lives between architecture design and detailed design. Concretely:
| V-cycle stage | TLA+ activity |
|---|---|
| Item definition | Identify safety goals; surface the architecture sketch |
| Functional safety concept | Write top-level TLA+ spec (states, transitions, safety invariants) |
| Technical safety concept | Refine TLA+ spec with channel/component decomposition |
| Software architectural design | TLA+ model checking against SSRs — main effort here |
| Software unit design | TLA+ specs become acceptance tests for unit behaviour |
| Software unit verification | Runtime monitors derived from TLA+ invariants |
| Software integration | Counterexample traces become integration test cases |
Done well, TLA+ pays for itself twice: once during architecture
(catches design bugs cheaply) and again during verification (every
counterexample is a free integration test).
Practical Tooling Notes
A few things that matter when you actually run this in
production:
- Use the TLA+ Toolbox for the first project — the built-in TLC integration, error trace browser, and PlusCal translator save days. Once your team is comfortable, the VS Code extension (`vscode-tlaplus`) is faster for iterative work.
- Constrain state explosion early. Use `CONSTANT` declarations for things like channel count, mode count, and fault counts, and give them small bounds. Without bounds, TLC will happily try to enumerate infinite states.
- Symmetry sets can shrink the state space by orders of magnitude. If your channels are symmetric, define a set such as `Perms == Permutations(Channels)` in the module and declare it as a `SYMMETRY` set in the TLC configuration.
- Liveness vs. safety — most ISO 26262 work cares about safety properties (always; invariants). Liveness (eventually) matters when you need to prove the system reaches the safe state within a bound. Use the `<>[]` and `[]<>` operators sparingly; they’re slower to check.
- PlusCal is worth learning if your team’s mental model is closer to algorithms than to mathematics. It transpiles to TLA+, runs on TLC, and gives you sequential-program syntax (`while`, `if`, etc.) for free.
- Counterexample traces are the deliverable. A TLA+ project that produces no counterexamples is suspect — either the spec is too loose to fail anything, or you’ve missed a property. Aim to find at least one bug per model. The traces are the evidence that the work mattered.
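As a concrete sketch of the constants-plus-symmetry advice (the names `Channels` and `Perms` are illustrative, not from the models above): in the module,

```tla
EXTENDS TLC
CONSTANT Channels
Perms == Permutations(Channels)
```

and in the `.cfg`:

```
CONSTANT Channels = {c1, c2}
SYMMETRY Perms
```

TLC then treats states that differ only by swapping `c1` and `c2` as the same state, and explores each equivalence class once.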
What TLA+ Doesn’t Do
To set expectations:
- It doesn’t verify your code. TLA+ verifies the design. The implementation must still be checked against the spec — by inspection, by static analysis, by simulation, by test.
- It doesn’t verify timing. TLA+ models discrete state transitions. For hard real-time properties (FTTI compliance, jitter bounds), you need a different tool (Uppaal, model-based timing analysis, or measurement on target hardware).
- It doesn’t verify hardware faults. TLA+ assumes the model of hardware faults you give it. If your fault model is wrong (e.g., you assume independent failures but the hardware has common-cause coupling), TLA+ verifies the wrong system.
These limits are why ISO 26262 calls for multiple
complementary methods, not formal verification alone. TLA+ is
one tool — necessary but not sufficient — that addresses the design-flaw
gap that other methods leave open.
Resources to Get Started
Books
- Practical TLA+ by Hillel Wayne — the best introduction
- Specifying Systems by Leslie Lamport — the canonical reference (free PDF online)

Tools
- TLA+ Toolbox (official IDE, includes the TLC model checker)
- VS Code TLA+ extension for lighter-weight editing
- PlusCal translator (bundled with the Toolbox)

Free
- learntla.com — Hillel Wayne’s online tutorial
- Leslie Lamport’s video course on the official TLA+ site
The Bottom Line
ISO 26262 recommends formal methods because design bugs in
safety-critical systems can present hazards that kill people or damage
property.
Static analysis is necessary but not sufficient. It catches coding
defects in the implementation of your design. It cannot catch flaws in
the design itself.
TLA+ is one tool — not the only tool — that addresses this gap. For
safety-critical state machines where unhandled states, missing
transitions, and edge cases matter, it’s worth the investment.
The unhandled divergence in our example would have shipped. It would
have passed code review, static analysis, and probably most testing. It
would have failed in the field, under exactly the conditions it was
designed to detect.
Formal methods found it in two steps.
Need help applying TLA+ to your ISO 26262 program — or want a
second opinion on an existing safety case? Get in touch.
This article is also published on Zenodo (DOI:
10.5281/zenodo.19349165) for citation in formal documentation.
#FunctionalSafety #ISO26262 #FormalMethods #TLAplus #AutomotiveSafety
#ASIL #SafetyCritical #SystemsEngineering #Verification #Automotive