MAINTENANCE 5.0
AI-Agentic Orchestration for Heavy Industrial Maintenance
An R&D demonstration project: autonomous AI agents managing the complete maintenance lifecycle of high-capital industrial assets.

The Problem Space
Heavy industrial maintenance — the kind required for high-capital assets like rolling stock, large machinery, or aviation components — is operationally complex. A single asset arriving for service triggers dozens of decisions: which subsystems need intervention, which workshops handle which components, how to allocate scarce technicians and spare parts, when each component is ready to be reassembled, and how to schedule everything against priority and due-date pressure. Most facilities still run these decisions through manual schedulers and paper-driven workflows. The cost is real — delayed asset return-to-service, idle workshop bays, parts shortages discovered too late, and senior staff spending their time on coordination rather than judgment.
We built Maintenance 5.0 to demonstrate what a production-grade alternative looks like: an autonomous AI agent system that handles the full lifecycle — intake, decomposition, routing, scheduling, parts allocation, and reassembly — with no human coordinator in the loop.
What We Built
Maintenance 5.0 is a workflow engine driven by ten specialized AI agents, each owning a specific decision domain:
- A Coordinator agent plans each asset intake, deciding which subsystems require intervention and how to route them.
- Six Workshop agents — one per specialized shop type — receive batched work orders and produce maintenance plans covering operations, parts, technician assignments, and duration estimates.
- A Spare Parts agent allocates inventory across batches of work orders, reasoning about scarcity and priority simultaneously.
- A Pool Allocation agent assigns serviceable replacement components from the rotable pool while originals are under maintenance.
- An Outgoing Coordinator agent decides where to re-mount completed components, finding waiting assets that need them.
The agents coordinate through a PostgreSQL-backed workflow engine that persists state at every suspension point. When a real-world event arrives — parts received, maintenance finished, bay freed — the workflow resumes from the exact step where it was suspended, even if the server restarted in between.
Technical Highlights
Typed agent contracts via the submit-tool pattern. Every agent ends each turn by calling exactly one terminal "submit" tool whose input schema defines the agent's decision contract. The workflow extracts the agent's output from that call and validates it against the schema, which avoids the brittleness of parsing free-form model text. Agents that fail to call the submit tool are retried up to three times with explicit reminders. This is the single biggest architectural choice that makes the system reliable in production rather than fragile in a demo.
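As an illustration only, the submit-tool loop could look roughly like the sketch below. It is dependency-free: the AgentTurn type, the submit_plan tool name, the WorkshopPlan shape, and the hand-rolled parsePlan validator (standing in for a Zod schema) are all assumptions made for this example, not the project's actual code.

```typescript
// Dependency-free sketch; every name here is hypothetical.
type ToolCall = { name: string; input: unknown };
type AgentTurn = (prompt: string) => Promise<ToolCall[]>;

interface WorkshopPlan {
  workOrderId: string;
  durationHours: number;
  technicianIds: string[];
}

// Hand-rolled validator standing in for a schema library such as Zod.
function parsePlan(input: unknown): WorkshopPlan | null {
  if (typeof input !== "object" || input === null) return null;
  const raw = input as Record<string, unknown>;
  const workOrderId = raw["workOrderId"];
  const durationHours = raw["durationHours"];
  const technicianIds = raw["technicianIds"];
  if (typeof workOrderId !== "string") return null;
  if (typeof durationHours !== "number" || !(durationHours > 0)) return null;
  if (!Array.isArray(technicianIds)) return null;
  if (!technicianIds.every((t) => typeof t === "string")) return null;
  return { workOrderId, durationHours, technicianIds: technicianIds as string[] };
}

// One initial attempt plus up to maxRetries retries, each with an explicit reminder.
async function runWithSubmit(
  agent: AgentTurn,
  prompt: string,
  maxRetries = 3,
): Promise<WorkshopPlan> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const calls = await agent(currentPrompt);
    const submits = calls.filter((c) => c.name === "submit_plan");
    // Accept only a turn that contains exactly one schema-valid submit call.
    const only = submits.length === 1 ? submits[0] : undefined;
    const plan = only ? parsePlan(only.input) : null;
    if (plan) return plan;
    currentPrompt = `${prompt}\n\nReminder: finish by calling submit_plan exactly once with a valid plan.`;
  }
  throw new Error(`no valid submit_plan call after ${maxRetries} retries`);
}
```

The point of the design is that the workflow never inspects free text: a turn either yields one validated object or counts as a failed attempt.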
Crash-safe workflow persistence. Workflows persist state to PostgreSQL at every suspension point. The run ID is stored against the relevant business entity, so any inbound event can locate and resume the correct run — even after server restarts or hours-to-days-long suspensions waiting for real-world signals.
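A minimal sketch of the resume-by-entity idea follows. The in-memory RunStore stands in for the PostgreSQL tables, and the step names, event names, and the work-order key are hypothetical. The property being illustrated is that the event handler holds no state of its own, so it behaves the same before and after a process restart as long as the store survives.

```typescript
// In-memory stand-in for the persisted workflow tables; all names are hypothetical.
type Step = "awaiting_parts" | "in_maintenance" | "done";
type EventType = "parts_received" | "maintenance_finished";

interface RunSnapshot {
  runId: string;
  workOrderId: string; // the business entity the run is stored against
  step: Step;
}

class RunStore {
  private runs = new Map<string, RunSnapshot>(); // runId -> snapshot
  private runByEntity = new Map<string, string>(); // workOrderId -> runId

  save(snapshot: RunSnapshot): void {
    this.runs.set(snapshot.runId, { ...snapshot });
    this.runByEntity.set(snapshot.workOrderId, snapshot.runId);
  }

  findByEntity(workOrderId: string): RunSnapshot | undefined {
    const runId = this.runByEntity.get(workOrderId);
    return runId === undefined ? undefined : this.runs.get(runId);
  }
}

// Pure transition function: an event only advances the step it belongs to.
function nextStep(step: Step, event: EventType): Step {
  if (step === "awaiting_parts" && event === "parts_received") return "in_maintenance";
  if (step === "in_maintenance" && event === "maintenance_finished") return "done";
  return step; // event does not apply here; the run stays suspended
}

// Locate the run through the business entity, advance it, persist the new suspension point.
function handleEvent(store: RunStore, workOrderId: string, event: EventType): Step {
  const snapshot = store.findByEntity(workOrderId);
  if (!snapshot) throw new Error(`no run stored for work order ${workOrderId}`);
  const updated: RunSnapshot = { ...snapshot, step: nextStep(snapshot.step, event) };
  store.save(updated);
  return updated.step;
}
```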
Batched agent calls with debouncing. Work orders arriving within a short window are batched into a single agent call. This reduces LLM cost and latency, and lets the agent reason across the whole batch: for example, it can allocate scarce parts in priority order across all pending work orders at once rather than greedily, first come, first served.
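One way to sketch the debounce-then-batch behavior is shown below. DebouncedBatcher and its parameters are hypothetical names for illustration. Each new item restarts the timer, and the whole pending set is handed to a single callback, which in a real system would be the agent call.

```typescript
// Hypothetical sketch of trailing-edge debouncing with batch delivery.
class DebouncedBatcher<T> {
  private pending: T[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;
  private windowMs: number;
  private onBatch: (items: T[]) => void;

  constructor(windowMs: number, onBatch: (items: T[]) => void) {
    this.windowMs = windowMs;
    this.onBatch = onBatch;
  }

  // Queue an item and restart the quiet-period timer.
  add(item: T): void {
    this.pending.push(item);
    if (this.timer !== undefined) clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.windowMs);
  }

  // Deliver everything queued so far as one batch; a no-op when nothing is pending.
  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.onBatch(batch);
  }
}
```

A pure debounce can postpone delivery indefinitely under a steady stream of arrivals, so a production version would probably also cap the total wait.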
Atomic concurrent resource claims. When multiple workflows race for the same scarce resource, such as a serviceable pool component or a workshop bay, each claim runs in a transaction that locks its row with SELECT ... FOR UPDATE SKIP LOCKED, so no resource is allocated twice. Workflows that lose the race re-suspend cleanly and retry.
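The claim itself can be expressed as a single statement, sketched here under assumed names: the pool_components table, its columns, and the Queryable interface are hypothetical, with Queryable shaped so that a node-postgres client or pool would satisfy it. The inner SELECT skips rows another transaction has already locked, so two concurrent callers never receive the same component.

```typescript
// Hypothetical schema and helper names; only the SQL locking clause is the technique itself.
interface Queryable {
  query(text: string, values?: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// Lock one free row, skipping rows locked by concurrent claimers, and mark it allocated.
const CLAIM_SQL = `
  UPDATE pool_components
     SET status = 'allocated', allocated_to = $2
   WHERE id = (
     SELECT id
       FROM pool_components
      WHERE part_number = $1 AND status = 'serviceable'
      ORDER BY id
      LIMIT 1
      FOR UPDATE SKIP LOCKED
   )
  RETURNING id`;

// Returns the claimed component id, or null when the caller lost the race
// or nothing is available, in which case the workflow would re-suspend.
async function claimPoolComponent(
  db: Queryable,
  partNumber: string,
  workOrderId: string,
): Promise<string | null> {
  const { rows } = await db.query(CLAIM_SQL, [partNumber, workOrderId]);
  const first = rows[0];
  return first ? String(first["id"]) : null;
}
```

Because the statement is atomic, no application-level lock is needed. A null result means "nothing claimable right now", not an error.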
Deterministic post-validation of LLM output. The Coordinator agent's structured output is reconciled deterministically against the input intervention list before persistence — guarding against hallucinated, omitted, or duplicated entries even when the LLM responds unexpectedly.
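A reconciliation step of this kind might be sketched as follows. The RoutedItem shape and the field names are assumptions for illustration, and input IDs are assumed to be unique. The function classifies every output entry without calling the model again.

```typescript
// Hypothetical shapes; the technique is a plain set comparison against the known input.
interface RoutedItem {
  interventionId: string;
  workshop: string;
}

interface Reconciled {
  accepted: RoutedItem[]; // first occurrence of each expected ID
  hallucinated: string[]; // IDs the model invented
  duplicated: string[]; // expected IDs that appeared more than once
  omitted: string[]; // expected IDs the model never returned
}

function reconcile(inputIds: string[], output: RoutedItem[]): Reconciled {
  const expected = new Set(inputIds);
  const seen = new Set<string>();
  const result: Reconciled = { accepted: [], hallucinated: [], duplicated: [], omitted: [] };

  for (const item of output) {
    if (!expected.has(item.interventionId)) {
      result.hallucinated.push(item.interventionId);
    } else if (seen.has(item.interventionId)) {
      result.duplicated.push(item.interventionId);
    } else {
      seen.add(item.interventionId);
      result.accepted.push(item);
    }
  }

  result.omitted = inputIds.filter((id) => !seen.has(id));
  return result;
}
```

Only the accepted entries would be persisted. What to do with the omitted ones (retry, default routing, or escalation) is a policy decision this sketch leaves open.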
Cost control via context caching. Agent system instructions and tool schemas are cached with the Gemini context-caching API per agent. Only the per-call prompt varies. Repeat-call latency and cost drop significantly across long-running operations.
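The per-agent bookkeeping around such a cache might look like the sketch below. It deliberately avoids any real SDK: createRemoteCache is a hypothetical function standing in for the provider's cache-creation call, and the registry assumes each agent's instructions and tool schemas stay fixed for the life of the process.

```typescript
// Provider-agnostic sketch; CreateRemoteCache is a hypothetical stand-in for the SDK call.
interface CachedPrefix {
  systemInstruction: string;
  toolSchemasJson: string;
}

interface CacheHandle {
  name: string; // identifier passed with each later request
  expiresAt: number; // epoch milliseconds
}

type CreateRemoteCache = (prefix: CachedPrefix, ttlMs: number) => Promise<CacheHandle>;

class AgentCacheRegistry {
  private handles = new Map<string, CacheHandle>();
  private create: CreateRemoteCache;
  private ttlMs: number;

  constructor(create: CreateRemoteCache, ttlMs: number) {
    this.create = create;
    this.ttlMs = ttlMs;
  }

  // Returns a live handle for the agent, creating one only on first use or after expiry.
  async handleFor(agentId: string, prefix: CachedPrefix, now: number = Date.now()): Promise<CacheHandle> {
    const existing = this.handles.get(agentId);
    if (existing && existing.expiresAt > now) return existing;
    const fresh = await this.create(prefix, this.ttlMs);
    this.handles.set(agentId, fresh);
    return fresh;
  }
}
```

Each agent call would then send only its per-call prompt plus the handle name. Concurrent first calls for the same agent could create two caches in this sketch, so a real version would need to deduplicate in-flight creations.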
Live operations dashboard. A React + Hono dashboard renders the live state of the facility — assets, workshops, bay occupancy, pool inventory — and animates component flows in real time as the system orchestrates them. The dashboard is also where operators trigger manual events for shop-floor integration.
End-to-end simulation harness. The system ships with a deterministic simulation driver that exercises the full lifecycle without requiring shop-floor input. Simulations run in compressed or instant time, fire intakes per scenario specification, and produce structured run reports including agent call counts, LLM cost estimates, peak workshop occupancy, and automated stall detection.
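As a rough sketch of what automated stall detection could mean in a simulation, the function below flags a run that has been suspended longer than a threshold in simulated time while the scenario holds no future event able to resume it. This definition, along with the SuspendedRun and ScheduledEvent shapes, is an assumption made for illustration.

```typescript
// Hypothetical shapes; times are simulated-clock milliseconds.
interface SuspendedRun {
  runId: string;
  suspendedAt: number;
  awaitingEvent: string; // event type that would resume this run
}

interface ScheduledEvent {
  runId: string;
  type: string;
  fireAt: number;
}

// A run is reported as stalled when it has waited longer than maxIdleMs
// and no matching event is still scheduled at or after the current time.
function detectStalls(
  runs: SuspendedRun[],
  scheduled: ScheduledEvent[],
  now: number,
  maxIdleMs: number,
): string[] {
  return runs
    .filter((run) => {
      const idleFor = now - run.suspendedAt;
      const canStillResume = scheduled.some(
        (e) => e.runId === run.runId && e.type === run.awaitingEvent && e.fireAt >= now,
      );
      return idleFor > maxIdleMs && !canStillResume;
    })
    .map((run) => run.runId);
}
```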
Stack
TypeScript (strict mode) on Node.js. Mastra framework for agent orchestration. Google Gemini 2.5 Flash via @ai-sdk/google. PostgreSQL with Drizzle ORM. Hono for the dashboard HTTP layer. React for the frontend. Zod for runtime validation. The model is configured in a single line, so the LLM provider can be swapped without touching code elsewhere.
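A single-line model configuration with @ai-sdk/google could look something like this. The file name and the agentModel export are hypothetical; agents would import agentModel rather than naming a provider themselves.

```typescript
// config/model.ts (hypothetical file name)
import { google } from "@ai-sdk/google";

// The one line to edit when changing the model; every agent imports this constant.
export const agentModel = google("gemini-2.5-flash");
```

Swapping to another provider would then mean changing only this module's import and this line.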
Status & What It Demonstrates
Maintenance 5.0 is an internal R&D demonstration of our agentic systems capabilities for industrial maintenance and adjacent use cases. It runs end-to-end in simulation, handling multi-asset concurrent intakes, parts shortages, bay contention, and stall detection. The architectural patterns developed here — typed agent contracts, crash-safe workflow persistence, batched agent reasoning, atomic concurrent claims — transfer directly to client engagements wherever reliability, observability, and cost control matter more than demo polish.
If you're building an agentic system that needs to survive contact with production — not just look good in a screen recording — we should talk.