Using Formal Techniques with TLA+ for ISO 26262 Functional Safety Verification — Full Version

A short version of this piece appeared on LinkedIn.
This is the full version, with the actual TLA+ specifications, the ISO
26262 V-cycle integration, and the tooling notes that wouldn’t fit
there.


Finding Design Bugs Before You Write Code

ISO 26262 Part 6 Table 7 recommends formal methods for ASIL-C/D
software design verification. Most teams interpret that as “run
Polyspace” and move on. But that covers only the code-level half of
verification; design verification, the part Table 7 actually targets,
is where the harder bugs live.

Static analysis catches coding defects, not design flaws. When your
redundancy manager has an unhandled state, Polyspace won’t find it. When
your mode-transition logic has an unreachable safe state, MISRA
compliance doesn’t help. These are architectural bugs that exist in the
design before a single line of C is written.

I’ve been working with TLA+ to close this gap. TLA+ is a formal
specification language created by Leslie Lamport, of LaTeX and Turing
Award fame. It’s used at AWS and Microsoft Azure to verify distributed
systems, and the same principles apply directly to safety-critical
embedded systems.

Below is a worked example you can copy, run, and adapt.


The Problem: Dual-Channel Redundancy Arbitration

Consider a dual-channel redundancy controller — the kind found in any
drive-by-wire, brake-by-wire, or steer-by-wire system. Two independent
processing channels run simultaneously and cross-check each other. Each
may process slightly different data, but they must deliver identical
results. If outputs diverge, a fault is detected.

The safety requirement reflects the architecture: both channels must be
ACTIVE and agreeing for the system’s output to be trusted. Any
divergence indicates a detectable fault.

That sounds straightforward. Let’s specify it formally and see what
happens.


Model 1 — The Naive Design

---------------- MODULE DualChannelV1 ----------------
EXTENDS Naturals, TLC

VARIABLES channel_a, channel_b, outputs_match

States == {"ACTIVE", "FAILED"}

\* Type correctness (checkable alongside SafetyInvariant)
TypeInvariant ==
  /\ channel_a \in States
  /\ channel_b \in States
  /\ outputs_match \in BOOLEAN

Init ==
  /\ channel_a = "ACTIVE"
  /\ channel_b = "ACTIVE"
  /\ outputs_match = TRUE

ChannelDivergence ==
  /\ channel_a = "ACTIVE" /\ channel_b = "ACTIVE"
  /\ outputs_match = TRUE
  /\ outputs_match' = FALSE
  /\ UNCHANGED <<channel_a, channel_b>>

ChannelAFails ==
  /\ channel_a = "ACTIVE"
  /\ channel_a' = "FAILED"
  /\ UNCHANGED <<channel_b, outputs_match>>

ChannelBFails ==
  /\ channel_b = "ACTIVE"
  /\ channel_b' = "FAILED"
  /\ UNCHANGED <<channel_a, outputs_match>>

Next == ChannelDivergence \/ ChannelAFails \/ ChannelBFails

Spec == Init /\ [][Next]_<<channel_a, channel_b, outputs_match>>

\* Safety: both channels must be ACTIVE and agreeing
SafetyInvariant ==
  channel_a = "ACTIVE" /\ channel_b = "ACTIVE" => outputs_match
====================================================

A code review would probably approve this. It captures the intuitive
understanding:

  • Two channels, each ACTIVE or FAILED
  • Cross-check via outputs_match (TRUE = agree, FALSE =
    divergence)
  • Both start ACTIVE with matching outputs
  • Hardware failures can take a channel down independently

Run TLC against the safety invariant.
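TLC also needs a configuration file telling it which formula is the
specification and which are the invariants. A minimal one for this
model, saved next to the module as DualChannelV1.cfg (filename
assumed), is:

SPECIFICATION Spec
INVARIANT SafetyInvariant

From the command line, java -jar tla2tools.jar DualChannelV1.tla picks
up the matching .cfg automatically; in the Toolbox, the model editor
generates the equivalent for you.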


Bug #1 — Found in Two Steps

Error: Invariant SafetyInvariant is violated.

State 1:  channel_a = "ACTIVE", channel_b = "ACTIVE", outputs_match = TRUE
State 2:  channel_a = "ACTIVE", channel_b = "ACTIVE", outputs_match = FALSE

The model checker found a critical safety violation in two steps. The
cross-check detected divergence, but the design has no response. Both
channels remain ACTIVE while disagreeing; neither can be trusted, yet
the system continues to drive actuators as if everything were fine.

This is the classic specification gap: the design defined how to
detect faults but not how to handle
them. In a real system, this state produces undefined actuator behaviour
— potentially sending conflicting commands to safety-critical
components.

A code review would not have caught this. Polyspace would not have
caught this. Unit tests wouldn’t have caught this — the bug isn’t in any
single line of code. The system worked exactly as specified. The
specification was wrong.


Model 2 — Handling Divergence

The fix has to handle the divergence. With only two
channels
, we have a fundamental problem: when outputs disagree,
we don’t know which channel is faulty. The cross-check tells us they
disagree, not who is correct.

There is no tiebreaker. The honest answer in a dual-channel
architecture is: on divergence, transition both channels to a
SAFE state and demand external intervention.

---------------- MODULE DualChannelV2 ----------------
EXTENDS Naturals, TLC

VARIABLES channel_a, channel_b, outputs_match, system_state

States == {"ACTIVE", "FAILED", "SAFE"}
SystemStates == {"NORMAL", "DEGRADED", "FAIL_SAFE"}

Init ==
  /\ channel_a = "ACTIVE"
  /\ channel_b = "ACTIVE"
  /\ outputs_match = TRUE
  /\ system_state = "NORMAL"

\* Divergence triggers system-wide fail-safe
HandleDivergence ==
  /\ outputs_match = FALSE
  /\ system_state \in {"NORMAL", "DEGRADED"}
  /\ system_state' = "FAIL_SAFE"
  /\ channel_a' = "SAFE"
  /\ channel_b' = "SAFE"
  /\ UNCHANGED outputs_match

\* (other transitions unchanged)

\* V1's conjunct is gone: with HandleDivergence in the model,
\* divergence is a transient, handled condition, not a forbidden state.
SafetyInvariant ==
  /\ (~outputs_match) => (system_state = "FAIL_SAFE" \/ ENABLED HandleDivergence)
  /\ (channel_a = "FAILED" \/ channel_b = "FAILED") => system_state # "NORMAL"
====================================================

Run TLC again. Now the divergence path is handled — every divergent
state is followed by a transition to FAIL_SAFE. But Model 2
has its own bug.


Bug #2 — The Single-Channel-Failure Race

Error: Invariant SafetyInvariant is violated.

State 1:  a=ACTIVE, b=ACTIVE, match=TRUE,  sys=NORMAL
State 2:  a=ACTIVE, b=FAILED, match=TRUE,  sys=NORMAL    <-- one channel dead, system still NORMAL!

When channel B fails as a hardware fault, channel A keeps running but
loses its cross-check partner. The system stays in NORMAL because the
outputs still “match” (B is FAILED and contributes nothing). One
physical disturbance later, A can drift from correct behaviour with no
working cross-check left to catch it.

The fix: when a channel fails for any reason, the system must
transition to DEGRADED and require a graceful shutdown —
not continue in NORMAL pretending everything is fine.

This is the second architectural bug found in two
more model-checker steps. None of these would have been caught by
code-level review alone.
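The corrected failure actions are a small change to the V2 module. A
sketch, with the rest of the module unchanged:

ChannelAFails ==
  /\ channel_a = "ACTIVE"
  /\ channel_a' = "FAILED"
  /\ system_state' = "DEGRADED"
  /\ UNCHANGED <<channel_b, outputs_match>>

\* ChannelBFails is symmetric.

Re-run TLC with deadlock checking disabled (the -deadlock flag), since
the model now intentionally halts once it reaches FAIL_SAFE; the
single-channel-failure trace above no longer violates the invariant.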


What Goes Into Your Safety Case

For ISO 26262 Part 6 §9.4.4 (“Verification of software design”),
Table 7 lists formal verification (1f) as highly recommended at ASIL-C
and ASIL-D. The expected work product is a verification specification
demonstrating that the design satisfies the software safety
requirements.

A TLA+ analysis like the one above produces three artefacts that drop
directly into your safety case:

  1. The specification itself — DualChannelV2.tla is
    human-readable mathematics describing the safety architecture.
    Reviewers can read it.
  2. The model-checking report — TLC output showing what
    state space was explored, what invariants held, and what counterexamples
    were found and resolved. This is verification evidence.
  3. The traceability link — each TLA+ invariant maps to
    one or more software safety requirements (SSRs). The
    SafetyInvariant above traces to the SSR that says “the
    system shall transition to a safe state on detection of channel
    divergence within FTTI.”

Your assessor will ask what evidence shows that the design satisfies
the SSRs. The TLC report is part of that evidence.


Integrating TLA+ Into the V-Cycle

TLA+ is a left-side activity in the V-model — it
lives between architecture design and detailed design. Concretely:

  V-cycle stage                  TLA+ activity
  Item definition                Identify safety goals; surface the architecture sketch
  Functional safety concept     Write top-level TLA+ spec (states, transitions, safety invariants)
  Technical safety concept      Refine TLA+ spec with channel/component decomposition
  Software architectural design  TLA+ model checking against SSRs (the main effort is here)
  Software unit design           TLA+ specs become acceptance tests for unit behaviour
  Software unit verification     Runtime monitors derived from TLA+ invariants
  Software integration           Counterexample traces become integration test cases

Done well, TLA+ pays for itself twice: once during architecture
(catches design bugs cheaply) and again during verification (every
counterexample is a free integration test).
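The “runtime monitors” row is more mechanical than it sounds: a state
invariant translates almost line-for-line into a checking function. A
sketch in C for the V1 SafetyInvariant (type and function names are
illustrative, not from any standard):

```c
#include <stdbool.h>

/* Channel states mirroring the TLA+ States set (illustrative names) */
typedef enum { CH_ACTIVE, CH_FAILED, CH_SAFE } channel_state_t;

/* Mirrors: (channel_a = "ACTIVE" /\ channel_b = "ACTIVE") => outputs_match */
bool safety_invariant_holds(channel_state_t a, channel_state_t b,
                            bool outputs_match)
{
    if (a == CH_ACTIVE && b == CH_ACTIVE) {
        return outputs_match;   /* both active: outputs must agree */
    }
    return true;                /* antecedent false: vacuously true */
}
```

Called once per control cycle, such a monitor turns a design-time
invariant into a runtime plausibility check feeding the fault-reaction
logic.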


Practical Tooling Notes

A few things that matter when you actually run this in
production:

  • Use the TLA+ Toolbox for the first project — the
    built-in TLC integration, error trace browser, and PlusCal translator
    save days. Once your team is comfortable, the VS Code TLA+ extension
    (alygin.vscode-tlaplus) is faster for iterative work.
  • Constrain state explosion early. Set
    CONSTANT declarations for things like channel count, mode
    count, fault counts. Without bounds, TLC will happily try to enumerate
    infinite states.
  • Symmetry sets can shrink the state space by orders
    of magnitude. If your channels are symmetric, define
    Symm == Permutations(Channels) in the module (Permutations lives in
    the TLC standard module) and reference it with
    SYMMETRY Symm in the .cfg file.
  • Liveness vs. safety — most ISO 26262 work cares
    about safety properties (always, invariants). Liveness
    (eventually) matters when you need to prove the system
    reaches the safe state within a bound. Use the <>[]
    and []<> operators sparingly; they’re slower to
    check.
  • PlusCal is worth learning if your team’s mental
    model is closer to algorithms than to mathematics. It transpiles to
    TLA+, runs on TLC, and gives you sequential-program syntax
    (while, if, etc.) for free.
  • Counterexample traces are the deliverable. A TLA+
    project that produces no counterexamples is suspect — either
    the spec is too loose to fail anything, or you’ve missed a property. Aim
    to find at least one bug per model. They’re the evidence that the work
    mattered.
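Pulling the constant-bounding and symmetry bullets together, a
parameterised model might declare (names illustrative):

\* In the module:
EXTENDS Naturals, TLC
CONSTANT Channels
Symm == Permutations(Channels)

\* In the .cfg file:
CONSTANT Channels = {c1, c2}
SYMMETRY Symm
SPECIFICATION Spec
INVARIANT SafetyInvariant

The model values c1 and c2 bound the state space; the symmetry
declaration tells TLC that states differing only by a channel
permutation are equivalent.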

What TLA+ Doesn’t Do

To set expectations:

  • It doesn’t verify your code. TLA+ verifies the
    design. The implementation must still be checked against the spec — by
    inspection, by static analysis, by simulation, by test.
  • It doesn’t verify timing. TLA+ models discrete
    state transitions. For hard real-time properties (FTTI compliance,
    jitter bounds), you need a different tool (Uppaal, model-based timing
    analysis, or measurement on target hardware).
  • It doesn’t verify hardware faults. TLA+ assumes the
    model of hardware faults you give it. If your fault model is wrong
    (e.g., you assume independent failures but the hardware has a
    common-cause coupling), TLA+ verifies the wrong system.

These limits are why ISO 26262 calls for multiple complementary
methods, not formal verification alone. TLA+ is one tool — necessary
but not sufficient — that addresses the design-flaw gap that other
methods leave open.


Resources to Get Started

Books
  • Practical TLA+ by Hillel Wayne — the best introduction
  • Specifying Systems by Leslie Lamport — the canonical reference (free PDF online)

Tools
  • TLA+ Toolbox — the official IDE, includes the TLC model checker
  • VS Code TLA+ extension — lighter-weight editing
  • PlusCal translator — bundled with the Toolbox

Free
  • learntla.com — Hillel Wayne’s online tutorial
  • Leslie Lamport’s video course on the official TLA+ site


The Bottom Line

ISO 26262 recommends formal methods because design bugs in
safety-critical systems can present hazards that kill people or damage
property.

Static analysis is necessary but not sufficient. It catches coding
defects in the implementation of your design. It cannot catch flaws in
the design itself.

TLA+ is one tool — not the only tool — that addresses this gap. For
safety-critical state machines where unhandled states, missing
transitions, and edge cases matter, it’s worth the investment.

The unhandled divergence in our example would have shipped. It would
have passed code review, static analysis, and probably most testing. It
would have failed in the field, under exactly the conditions it was
designed to detect.

Formal methods found it in two steps.


Need help applying TLA+ to your ISO 26262 program — or want a
second opinion on an existing safety case?
Get in touch.

This article is also published on Zenodo (DOI:
10.5281/zenodo.19349165) for citation in formal documentation.

#FunctionalSafety #ISO26262 #FormalMethods #TLAplus #AutomotiveSafety
#ASIL #SafetyCritical #SystemsEngineering #Verification #Automotive