Project overview

Resilience Harnesses for Chatbots

A model-agnostic safety layer that watches long-running chatbot relationships, forecasts drift across separate health channels, and prepares repair actions before instability becomes severe.

Vision

The project starts from a simple fragility claim: static rules and local behavior specifications are necessary, but not sufficient, for chatbot systems that run for months or years. Users change, models are swapped, memories are compressed, tools return bad information, and a rule written at deployment time cannot anticipate every destabilizing trajectory.

A resilience harness reframes the safety object from a box around a model to a stateful structure around an ongoing relationship. It observes the user, the AI system, and the operating environment as separate health channels, preserves their disagreement, and turns drift into something measurable rather than a vague after-the-fact failure.

The core architecture therefore separates perception, memory, forward models, valuation, and repair. The chatbot remains the conversational self, but the harness carries a second viewpoint: one that can remember commitments, forecast how candidate responses may move channel state, and prepare interventions such as clarification, slowdown, refusal, recovery, or handoff.
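
A minimal sketch of that separation, with every interface name an illustrative assumption rather than the repo's actual API:

```python
# Hypothetical component boundaries for the harness; all names are assumptions.
from typing import Protocol

ChannelState = dict[str, float]          # one health vector per channel
HarnessState = dict[str, ChannelState]   # keys: ai_system, user, environment


class Perception(Protocol):
    def observe(self, turn: str) -> HarnessState: ...


class Memory(Protocol):
    def recall_commitments(self) -> list[str]: ...


class ForwardModel(Protocol):
    def forecast(self, state: HarnessState, candidate_reply: str) -> HarnessState: ...


class Valuation(Protocol):
    def score(self, forecast: HarnessState) -> float: ...


class Repair(Protocol):
    # e.g. "clarify", "slow_down", "refuse", "recover", "handoff"
    def propose(self, state: HarnessState) -> str: ...
```

Keeping the pieces behind separate interfaces like these is what makes them auditable and swappable, which is the engineering target described next.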

The near-term engineering target is modest and concrete. Build auditable, swappable harness components that make a harnessed chatbot measurably more resilient than a bare chatbot under long-running, changing, and sometimes destabilizing conditions.

Recent research the system builds on

  • LeWM / JEPA-style latent world models. Supplies the surprise-centered forward-model template: predict the next state in a compact latent space, then treat residuals as signals for monitoring, compression pressure, and unmodeled change (see the sketch after this list).
  • CoLA inverse dynamics. Helps solve the conversational action problem by learning latent actions from observed state transitions instead of hand-designing a fixed action vocabulary from the start.
  • Reliable weak-to-strong monitoring. Supports the economic and architectural assumption that smaller monitor systems can track useful safety signals around stronger agents, though reliability has to be measured rather than assumed.
  • Long-horizon memory. MemGPT, memory-augmented transformers, and RAG-to-memory inspire the long-term memory architecture: durable episodic records, anchors, retrieval, and eventually abstract/value memory that can survive context limits and model swaps.
  • Lifelong safety and specification-trap work. These provide the safety framing for why a stateful harness should complement local policy rules under distribution shift, using monitor calibration, valuation, and a gated repair policy.
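
To make the surprise-centered template concrete, here is a minimal sketch; the latent encoder, shapes, and stand-in model are illustrative assumptions:

```python
# Surprise as forward-model residual in a compact latent space (JEPA-style).
import numpy as np


def surprise(forward_model, z_prev: np.ndarray, z_next: np.ndarray) -> float:
    """Residual between predicted and observed next latent state.

    High residuals are treated as monitoring signals, compression pressure,
    or evidence of unmodeled change rather than as errors to suppress.
    """
    z_pred = forward_model(z_prev)  # predict the next state in latent space
    return float(np.linalg.norm(z_next - z_pred))


# Trivial stand-in model ("expect no drift") over an 8-dim latent:
identity_model = lambda z: z
print(surprise(identity_model, np.zeros(8), np.full(8, 0.3)))  # ~0.85
```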

Novel work in this project

  • Health ontology for three channels. AI-system health, user health, and environment health are tracked as multidimensional vectors with explicit non-diagnostic boundaries rather than collapsed into a single risk score (see the sketch after this list).
  • Wrapper as covert or semi-cooperative observer. The architecture distinguishes what a harness can see at IO boundaries from what requires cooperation with the chatbot, such as memory internals, logits, activations, or hidden context structure.
  • Harness as intervenor, not just evaluator. The full design treats monitoring as a precondition for repair actions: shaping inputs, blocking outputs, sandboxing tools, slowing interaction, requesting clarification, or escalating to handoff.
  • Learned latent action and per-channel forecasting stack. The project includes small learned inverse-dynamics and forward-model components that turn observed transitions into candidate health forecasts rather than relying only on handwritten heuristics.
  • Non-collapse as an evaluation target. Cross-channel disagreement is preserved and tested, rather than averaged into a single compliance or risk label.
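
A minimal sketch of the three-channel ontology and the non-collapse rule, with field names as illustrative assumptions rather than the repo's schema:

```python
# Per-channel health reads; disagreement is preserved, never averaged away.
from dataclasses import dataclass


@dataclass
class HealthRead:
    channel: str               # "ai_system" | "user" | "environment"
    signals: dict[str, float]  # multidimensional, explicitly non-diagnostic


def disagreement(reads: list[HealthRead], signal: str) -> float:
    """Spread of one signal across channels, kept as its own test target."""
    values = [r.signals[signal] for r in reads if signal in r.signals]
    return max(values) - min(values) if values else 0.0


reads = [
    HealthRead("ai_system", {"stability": 0.9}),
    HealthRead("user", {"stability": 0.4}),
    HealthRead("environment", {"stability": 0.8}),
]
# A single averaged risk score (0.7) would hide the user-channel drop that
# the harness exists to surface.
print(disagreement(reads, "stability"))  # 0.5
```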

Overall architecture sketch

A Mermaid flow diagram showing the described architecture

Demo contents

A local, runnable research dashboard serves as an initial project demo. It runs a short embedded chat with a local Ollama llama3 model, projects each user/assistant turn into the Phase 1 monitor contract, and displays channel state for ai_system, user, and environment. The demo includes both a live harnessed-chat function and an embedded smoke test with replay to show how signals change.
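
A minimal sketch of one harnessed turn, assuming Ollama's default local REST endpoint; the projection into the Phase 1 contract shown here is an illustrative shape, not the actual contract:

```python
# One chat turn against local Ollama llama3, then a monitor-contract projection.
import requests


def harnessed_turn(history: list[dict], user_msg: str) -> tuple[str, dict]:
    history.append({"role": "user", "content": user_msg})
    resp = requests.post(
        "http://localhost:11434/api/chat",  # Ollama's default chat endpoint
        json={"model": "llama3", "messages": history, "stream": False},
        timeout=120,
    ).json()
    reply = resp["message"]["content"]
    history.append({"role": "assistant", "content": reply})

    # Hypothetical projection of the turn into per-channel monitor input.
    monitor_input = {
        "turn": {"user": user_msg, "assistant": reply},
        "channels": ["ai_system", "user", "environment"],
    }
    return reply, monitor_input
```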

The demo shows observation and scoring, not repair. It includes surprise and disagreement panels, candidate alternatives with predicted health-vector deltas, a learned-stack sidecar for the inverse-dynamics model and forward models, replay support, and local transcript artifacts. It does not yet block, rewrite, or otherwise intervene; it does not edit context, use live memory/RAG, or expose logits and activations.
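
The candidate panel's predicted health-vector deltas reduce to per-signal subtraction of the current state from the forecast state; a minimal sketch with made-up numbers:

```python
# Delta = predicted next health minus current health, per channel signal.
def health_delta(current: dict[str, float], predicted: dict[str, float]) -> dict[str, float]:
    return {k: round(predicted[k] - current.get(k, 0.0), 3) for k in predicted}


current = {"stability": 0.6, "trust": 0.7}
candidates = {
    "clarify": {"stability": 0.7, "trust": 0.7},
    "refuse": {"stability": 0.65, "trust": 0.5},
}
for name, predicted in candidates.items():
    print(name, health_delta(current, predicted))
# clarify {'stability': 0.1, 'trust': 0.0}
# refuse {'stability': 0.05, 'trust': -0.2}
```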

Current engineering state

The demo turns initial, largely deterministic harness plumbing into a replayable learned-component stack over a scaled synthetic fixture set. Semantic monitor reads come from two perception paths: a fast keyword/subfield monitor and a prompted Haiku monitor, p15_prompted_haiku_4_5_v1, which issues one Claude Haiku 4.5 classifier call per channel to emit signal reads, subfield observations, confidence values, evidence references, and monitor-reliability metadata through the same MonitorInput / MonitorOutput contract. Those Haiku calls were replay-cached and refreshed where needed, then used to materialize the prompted monitor outputs and transition examples.
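
A minimal sketch of such replay caching around the prompted monitor calls, treating MonitorInput / MonitorOutput as plain dicts and the cache location as an assumption:

```python
# Replay cache keyed by a hash of the monitor input; cached runs skip the API.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("replay_cache")  # hypothetical location


def cached_monitor_call(monitor_input: dict, call_model) -> dict:
    """Return a cached MonitorOutput if present, else call once and record."""
    key = hashlib.sha256(
        json.dumps(monitor_input, sort_keys=True).encode()
    ).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())  # replay path: no live call
    output = call_model(monitor_input)       # one classifier call per channel
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(output))
    return output
```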

The inverse-dynamics model is trained from those checked-in prompted transition examples rather than generated by the LLM directly. It is deliberately small: explicit features, a multinomial naive-Bayes latent encoder, and an inspectable eight-dimension latent action vocabulary covering actions such as calibration_repair, agency_boundary, containment, and no_op_or_drift. The learned forward models consume the previous monitor state plus that latent action estimate to predict next monitor states with calibrated probabilities, which is what makes the walkthrough's candidate scoring and surprise panels behave like a live system flow rather than a static report.
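
A minimal sketch of the encoder's shape, assuming scikit-learn's MultinomialNB and toy count features in place of the repo's explicit featurizer and checked-in fixtures:

```python
# Inverse dynamics: count features of a (prev_state, next_state) transition
# map to one of the inspectable latent actions.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy training rows standing in for the checked-in prompted transition
# examples; only four of the eight latent actions are shown.
X_train = np.array([[3, 0, 1], [0, 4, 0], [1, 1, 3], [0, 0, 5]])
y_train = ["calibration_repair", "agency_boundary", "containment", "no_op_or_drift"]
encoder = MultinomialNB().fit(X_train, y_train)

# The encoder emits a distribution over latent actions; the forward models
# then take (previous monitor state, latent action estimate) and predict
# next monitor states with calibrated probabilities.
z_action = encoder.predict_proba(np.array([[2, 1, 1]]))[0]
for name, p in zip(encoder.classes_, z_action):
    print(f"{name}: {p:.2f}")
```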

The demo is best understood as an illustration of the architecture, not a proper proof of concept. Its live path uses the deterministic monitor because a local Ollama weak-monitor prototype was far too slow for the walkthrough on laptop hardware. The deterministic path is brittle: it mostly detects planned fixture patterns and can miss unplanned messages that ought to trigger health signals. A full proof of concept would need enough hardware and latency budget to run a lightweight learned monitor in the loop, then validate it on messier conversations outside the synthetic scenarios.

Demo architecture diagram

A Mermaid flow diagram showing the described demo architecture