[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[tlaplus] Specifying a multimodal AI orchestrator in TLA+ before implementation



Hi everyone,


We wanted to share a recent production use case applying TLA+ to the domain of AI orchestration. We built a multimodal AI orchestrator by first specifying the architecture, and then writing the code to align with the spec before deploying to Google Cloud.


We built a system that generates multimodal content based on user context and private uploaded images, and conducts related searches across internal databases and the web. This requires coordinating multiple external AI services (Vision, LLM, Image/Video Generation, and Search). The underlying AI models are inherently non-deterministic and suffer from unpredictable latency spikes. If a video generation node times out, or a search API returns a malformed payload, it can trigger asynchronous race conditions, retry loops, and partially complete states where a user is billed for an asset that never delivers.


Standard integration testing couldn’t adequately cover the state space created by these asynchronous intersections.


We modeled the orchestration kernel in TLA+. We didn’t model the AI model logic itself; instead, we considered the AI agents as non-deterministic nodes that could return any value from a set of defined outcomes (e.g., Success, Timeout).

This let us verify exactly how our orchestrator would handle the AI’s unpredictable behavior before the implementation phase even began.

We wrote two separate specifications: one for the backend orchestration kernel (~58,000 states explored) and one for the frontend lifecycle (~30,000 states explored) - to make sure the user journey is consistent across environments.


The model checker identified several critical logic flaws that would not have surfaced until production:


After three months deployed on GCP, we’ve had zero logic failures or state-tracking errors. Zero logic failures or state-tracking errors. During one live incident, a misconfigured secondary search service went down — the verified architecture bypassed it and delivered the primary result without system failure.


TLC Checking stats

tlaproj-states-3retries.png


A privacy invariant snippet

tlaproj-privacy-invariant.png



We welcome questions and feedback.


--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+unsubscribe@xxxxxxxxxxxxxxxx.
To view this discussion visit https://groups.google.com/d/msgid/tlaplus/f2ceeb73-d4d5-4480-9ac8-51b5d969be75n%40googlegroups.com.