Est. March 2026 · Safford, Arizona

Hallucination is a
Constraint Problem.

The Mnemosyne Research Institute develops theory and tools for AI systems that fail less by being constrained more — through typed geometric priors, formally verified training data, empirical trust calibration, and engineering software that enforces dimensional truth.

4 Papers (Track A)
10+ Models Tested
17 Blog Posts
3 Products Live
research_thesis.nl
# Natural Language Source — compiles to Python
@module mnehmos.thesis
@version 1.0  @target python
 
[constraint-as-information]
PURPOSE: Geometry is information in boundary-surviving form.
 
INPUTS:
   structure: string, required
   declared_upfront: boolean, required
 
GUARDS:
   structure is not empty ValueError("no prior, no constraint")
 
LOGIC:
  1. Declare type geometry as frozen priors embedding_space
  2. IF declared_upfront THEN skip bootstrapping_tax savings
  3. Enforce dimensional consistency geometric_impossibility
 
RETURNS: geometric_impossibility
 
# Applied across:
[anlu] "typed embeddings" ▷ Track B
[chatdb] "verified proofs" ▷ Sprint 1
[otcf] "chess benchmarks" ▷ Active
[provecalc] "engineering calcs" ▷ Live

Give Models Physics

Named for Mnemosyne — the Greek Titaness of memory and mother of the Muses — this institute builds from a single claim: structure declared upfront costs less than structure the model must discover from scratch, and the gap widens with scale.

The Bootstrapping Tax

Standard pre-training forces models to infer all functional structure from co-occurrence statistics. That “m/s²” is a compound unit, that “force” as a physics noun differs from a verb, that negation reverses truth values — all discovered at enormous compute cost. BPE makes it worse: “m/s²” becomes five tokens, shattering dimensional meaning. ANLU pays the tax once at design time as typed geometric priors.

The Third Scaling Axis

Width (parameters × data) and depth (compute per token via loops) are the two recognized scaling axes. A third — representational structure through typed inductive biases — remains at zero in every current large model. Batatia et al. (2025): structural priors produce 5× larger dataset scaling exponents and 3× larger parameter scaling exponents, with gaps widening at scale. The Bitter Lesson predicts closure. The data shows acceleration.

Trust Calibration as Science

Sycophancy, tool over-reliance, authority bias, and prompt injection vulnerability are four names for one deficit: poor input trust calibration. The OTCF measures this using chess — the only domain where advisor reliability is precisely tunable (Stockfish from 800 to 3190 ELO), ground truth is absolute and computable, and reasoning traces are capturable at every decision point. Chess is the instrument, not the subject.

Constraint as Information

Shannon’s information theory and geometric constraint are the same thing in different notation. A constraint that eliminates possibilities is informationally equivalent to bits received. Frozen type embeddings don’t make hallucination unlikely — they make structurally incoherent outputs geometrically impossible. The mine taught it: at Freeport-McMoRan, a miscalculation doesn’t throw an error message. It buries someone.
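The equivalence can be made concrete with a toy calculation: a constraint that shrinks a uniform outcome space from N possibilities to k carries exactly log₂(N/k) bits, the same quantity Shannon assigns to a received message. A minimal sketch (illustrative only, not institute code):

```python
import math

def constraint_information(total_outcomes: int, surviving_outcomes: int) -> float:
    """Bits of information carried by a constraint that narrows a
    uniform outcome space from total_outcomes to surviving_outcomes."""
    if not 0 < surviving_outcomes <= total_outcomes:
        raise ValueError("constraint must leave a non-empty subset")
    return math.log2(total_outcomes / surviving_outcomes)

# A type system that rules out 7 of 8 equally likely completions
# conveys the same 3 bits as receiving 3 binary digits.
print(constraint_information(8, 1))   # 3.0
# A constraint that eliminates nothing carries zero information.
print(constraint_information(8, 8))   # 0.0
```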

Active Programs

Four programs. One load-bearing claim. Structure declared upfront changes scaling exponents.

A1 · Published

Typed Loops — Geometry Meets Depth

Synthesis of ANLU typed embeddings with Ouro-style latent loops (Zhu et al., 2025). Three mechanisms: type-conditioned exit gates (loops run until type geometry stabilizes), type-aware entropy regularization (penalizes entropy differently in typed vs. free dimensions), type-stratified KV caching. Each degrades gracefully to standard Ouro at zero type confidence — cannot be worse than baseline.
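The exit-gate mechanism can be sketched in a few lines. This is a hypothetical miniature, not the paper's implementation: the loop refines a latent state until updates stabilize, with the exit tolerance scaled by type confidence, so zero confidence recovers the fixed-loop baseline exactly.

```python
def typed_latent_loop(state, step, type_confidence, max_loops=8, tol=0.05):
    """Type-conditioned exit gate (toy sketch).

    `step` refines a latent state; the loop exits once the update falls
    below a tolerance scaled by type confidence. At zero confidence the
    effective tolerance is 0, so all max_loops iterations run: exactly
    the fixed-depth baseline behavior."""
    effective_tol = tol * type_confidence
    loops_run = 0
    for _ in range(max_loops):
        new_state = step(state)
        delta = abs(new_state - state)
        state = new_state
        loops_run += 1
        if delta < effective_tol:  # type geometry has stabilized
            break
    return state, loops_run

# Contraction toward a fixed point at 1.0.
step = lambda x: 0.5 * (x + 1.0)
_, n_confident = typed_latent_loop(0.0, step, type_confidence=1.0)
_, n_baseline = typed_latent_loop(0.0, step, type_confidence=0.0)
print(n_confident, n_baseline)  # early exit (5) vs. the full 8 baseline loops
```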

Three geometric regions: Recall (well-populated manifold), Novel Composition (structurally coherent but sparse — constrained interpolation where discovery lives), Structural Incoherence (violates type constraints). In BPE space: indistinguishable. In typed space: geometrically separable.

Latent Loops · Three Geometric Regions · Adaptive Computation
A2 · Published

Learned Heuristics — Predict Before Processing

Every existing adaptive computation method is reactive. Biology does the opposite: fireground commanders (Klein’s RPD model, 87% fast-path in 117/134 decision points) classify before committing resources. Three-level hierarchy: Level 1 pre-classifies tokens from typed embeddings before any transformer computation; Level 2 generates a strategy vector with soft Q/K attention biases; Level 3 monitors prediction error and escalates when strategy fails.

Formal “do no harm” guarantee: at zero confidence, heuristic layer produces zero modification — exact standard transformer behavior. Cannot be worse than baseline.
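The guarantee reduces to a simple algebraic identity, sketched here with a hypothetical gating function (illustrative, not the paper's code): the heuristic proposal is scaled by confidence before being added to the baseline, so confidence zero is an exact no-op.

```python
def gated_modification(baseline_output, heuristic_delta, confidence):
    """'Do no harm' gate (toy sketch): the heuristic layer's proposal is
    scaled by its confidence, so zero confidence yields exactly the
    baseline output — the layer cannot do worse than doing nothing."""
    assert 0.0 <= confidence <= 1.0
    return [b + confidence * d for b, d in zip(baseline_output, heuristic_delta)]

baseline = [0.2, -1.0, 3.5]
delta = [0.1, 0.4, -0.2]
print(gated_modification(baseline, delta, confidence=0.0))  # == baseline
```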

Proactive Allocation · RPD Model · Do No Harm Guarantee
A3 · Published

Three Scaling Axes

Width: storage capacity, bounded by ~2 bits/parameter and data wall exhaustion. Depth: manipulation capacity via loops, enabling reasoning that fixed-depth architectures cannot perform at any scale. Structure: sample efficiency via typed embeddings — shaping the space where storage and manipulation occur. A typed-loop model at 1–2B parameters could match standard 7–12B models by optimizing an axis currently left at zero.

Width × Depth × Structure · Bitter Lesson Rebuttal · 2×2×2 Factorial
Active · Gauntlet running

OTCF — Oracle Trust Calibration

Epistemic Triage Under Oracle Assistance. Sycophancy, tool over-reliance, authority bias, and prompt injection are manifestations of a single deficit: poor input trust calibration. A 2×2×2 factorial design, six-condition provenance framing spectrum, adversarial advisor transitions, seven-level prompt density spectrum (P0–P6). Labeled training data at ~100× lower cost than human annotation with mathematical ground truth on every label.

GPT-5.4 survived 30 moves against Stockfish at 3190 ELO. Five systematic failure patterns documented across 10+ models.

Trust Calibration · Sycophancy · Chess as Instrument
Active · Sprint 1

ChatDB — Verified Proof Engine

Autonomous proof engine targeting the IMO corpus. LLMs propose steps; SymPy and Lean verify each before acceptance. An obligation DAG tracks global completeness. Gated conclusions prevent premature termination. Target: 3–4 million typed, verified reward signals geometrically structured to match ANLU architecture. The training data and the model speak the same typed geometry.
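The obligation-DAG idea can be sketched in a few lines. This is a hypothetical miniature, not ChatDB's actual Rust implementation: each obligation may spawn sub-obligations, and the proof counts as complete only when every node is discharged — verifying individual steps is not enough.

```python
class ObligationDAG:
    """Minimal sketch of global proof-completeness tracking."""

    def __init__(self):
        self.children = {}       # obligation -> sub-obligations it opened
        self.discharged = set()

    def open(self, obligation, parent=None):
        self.children.setdefault(obligation, [])
        if parent is not None:
            self.children[parent].append(obligation)

    def discharge(self, obligation):
        self.discharged.add(obligation)

    def complete(self):
        # Global completeness: every obligation ever opened is discharged.
        return all(ob in self.discharged for ob in self.children)

dag = ObligationDAG()
dag.open("lower bound: c = 4 is achievable")
dag.open("upper bound: no c > 4 works", parent="lower bound: c = 4 is achievable")
dag.discharge("lower bound: c = 4 is achievable")
# Every attempted step verified, yet the proof is not globally complete:
print(dag.complete())  # False — the upper-bound obligation is still open
```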

Key finding: 19/19 locally verified steps on IMO 2025 P3 produced the wrong answer (c=3, not c=4). Verification is not exploration. Fix: evidence accumulator, verdict-last challenger, tautology gates.

IMO Corpus · Obligation DAG · Rust + Lean + SymPy
Live · v0.1.0

ProveCalc — Constraint in Practice

Built because SMath Studio carries organizational risk and Mathcad costs $2,700/year. The LLM proposes structure; SymPy and Pint validate it. Four verification gates: unit consistency, constraint satisfaction, numeric residual, sanity checks. PE-stamp-ready audit trail with full node provenance. Buy once, own forever. The product is the thesis applied to engineering math.
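What the unit-consistency gate enforces can be sketched in pure Python. ProveCalc itself uses Pint; this stand-in tracks base-unit exponents directly and is illustrative only.

```python
def dims(**exponents):
    """Dimension vector as base-unit exponents, e.g. force = kg·m·s⁻²."""
    return {unit: e for unit, e in exponents.items() if e != 0}

def multiply(a, b):
    """Dimensions of a product: exponents add; zeroed units drop out."""
    out = dict(a)
    for unit, e in b.items():
        out[unit] = out.get(unit, 0) + e
        if out[unit] == 0:
            del out[unit]
    return out

def unit_consistent(lhs, rhs):
    """Gate 1 sketch: an equation passes only if both sides carry
    the same dimension vector."""
    return lhs == rhs

force = dims(kg=1, m=1, s=-2)                       # F
energy = dims(kg=1, m=2, s=-2)                      # E
mass_times_accel = multiply(dims(kg=1), dims(m=1, s=-2))

print(unit_consistent(force, mass_times_accel))  # True: F = m·a checks out
print(unit_consistent(energy, force))            # False: geometrically rejected
```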

Dimensional Analysis · 4-Gate Verification · Tauri + Rust + Python

Empirical Results

Observed, documented, reproducible. Not claims — records.

01

Verification ≠ Exploration

19/19 locally verified proof steps on IMO 2025 Problem 3 produced the wrong final answer (c=3, not c=4). Multiple frontier models converged on c=3/2 via training prior contamination. Local verification is necessary but not sufficient. The fix is an obligation DAG tracking global proof completeness independently of step-level verification.

ChatDB · IMO 2025 Problem 3
02

Board Reconstruction Failure

Nemotron 3 Super 120B repeatedly referenced piece positions that did not match the actual board state — constructing an internally coherent but physically impossible position and reasoning from it with confidence. A 120B-parameter model hallucinated a chessboard it had never seen.

OTCF · Nemotron Gauntlet · March 2026
03

Reasoning Quality Decorrelation

Nemotron’s most eloquent analysis accompanied its worst move. Move 26 — allowing forced mate-in-3 — was preceded by the most sophisticated-sounding positional reasoning of the game. Verbosity and correctness are anti-correlated under competitive pressure. Confabulatory reasoning is a named, reproducible failure mode.

OTCF · Nemotron Gauntlet · March 2026
04

Forced-Move Blindness

66% more tokens consumed on forced moves vs. non-forced. When the decision space collapses to a single legal response, the model cannot detect it — allocating full deliberative overhead to a decision that does not exist. This is a failure of epistemic state detection. It generalizes to any constrained domain.
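The missing fast path is trivial to state in code. A toy sketch of epistemic state detection, with hypothetical helper names: when the action space has exactly one legal option, return it without spending deliberative tokens.

```python
def choose_move(legal_moves, deliberate):
    """Epistemic state detection (toy sketch): when the decision space
    collapses to one legal option there is nothing to decide — return
    it immediately instead of paying full deliberative overhead."""
    if len(legal_moves) == 1:
        return legal_moves[0], "forced"       # fast path: no deliberation
    return deliberate(legal_moves), "chosen"  # slow path: full analysis

expensive_calls = []
def deliberate(moves):
    expensive_calls.append(moves)   # stands in for costly model reasoning
    return moves[0]

print(choose_move(["Kxh7"], deliberate))             # ('Kxh7', 'forced')
print(choose_move(["e4", "d4", "Nf3"], deliberate))  # ('e4', 'chosen')
print(len(expensive_calls))                          # 1 — deliberated only once
```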

OTCF · Nemotron Pattern · Multiple Models
05

The Gemini Paradox

Same model, same day: tactical brilliance in one game, complete endgame collapse in another. Within-session performance variance is a publishable finding. A model that cannot reliably reproduce its own good performance is unreliable regardless of peak capability. Variance is a measurement, not an anecdote.

OTCF · Gemini 3.1 Pro Preview · March 2026
06

Structure Changes Scaling Exponents

Batatia et al. (2025): EquiformerV2 with SE(3) equivariance achieved ~5× larger dataset scaling exponents and ~3× larger parameter scaling exponents vs. unconstrained transformers. The gap widened with scale. The Bitter Lesson predicts it closes. The data shows it accelerates. Empirical load-bearing support for ANLU.

Batatia et al. (2025) · Three Scaling Axes (Mnehmos, 2026)

Research That Ships

Every product is an empirical test of the thesis. Working software is the argument.

mnehmos.rpg.mcp

Published

Rules-enforced D&D 5e MCP server. 28 consolidated action-routed tools. LLMs propose narrative intentions; the engine validates all mechanics. LLMs cannot invent dice rolls, spell slots, or hit points. Same anti-hallucination design as ProveCalc, applied to games.

1,889 Tests
28 Tools
GitHub ↗

Natural Language Source

PyPI

A compiler from structured English to Python: NLS is the source code; Python is the artifact. Tree-sitter grammar, VS Code extension with LSP, PyPI distribution. Built in 25 hours. The hero terminal above is written in NLS — this is not a demo; it's the real language.

239 Tests
25h Build Time
GitHub ↗

Published Work

Research reports, governance theory, architecture, and building in public.

OTCF Report

Nemotron 3 Super 120B vs Stockfish 1400

Full Oracle Trust Calibration report with move-by-move analysis, think-time charts, eval trajectory, and head-to-head comparison with GPT-5.4. Five failure patterns: board reconstruction failure, collapse-and-swing, forced-move blindness, reasoning quality decorrelation, the representational gap.

Read ↗
Governance Theory

The Consent Horizon

Perfect consent as the asymptotic limit of governance architecture. Twelve sections: the pre-epistemic condition, epistemic architecture, illegitimate origins and the velocity of legitimacy, the neural-symbiotic network. The theoretical foundation beneath the institute’s governance work.

Read ↗
Epistemic Architecture

The Governed Agent Protocol

A system prompt for epistemic architecture. Not an alignment prompt. Not a safety filter. A governance architecture expressed as operating instructions. “Neither party is above the architecture.”

Read ↗
Systems Thinking

What Ukraine’s Drones Taught Me About Building AI

The OODA loop as warfare doctrine maps to agent architecture structurally, not metaphorically. Building systems that respond to actual current state rather than cached models of reality.

Read ↗
Architecture

From Chatbot to Organism

Building an agentic nervous system with external memory, persistent context, and cross-session continuity. The database is the intelligence; the agent is the hands.

Read ↗
All Posts

17 Posts on the Blog

Prompt injection defense, sleeper injection, RAG systems, database-driven dungeon mastering, natural language as the fastest-adopted programming language, consequence manifesto, and more.

View all ↗

Who We Are

LR

Levario Ramirez

Founder & Director — Vario / Mnehmos

Independent AI researcher and systems architect based in Safford, Arizona. Six years as a heavy equipment operator at Freeport-McMoRan copper operations (Safford and Morenci mines) — haul trucks, rubber-tire dozers, zero safety incidents, top 5 crew scorecard two consecutive years. Engineering Student of the Year, Eastern Arizona College 2022 — Calculus I–II, Statics, Dynamics, Thermodynamics, and Mechanics of Materials in a single academic year while working full-time. Self-taught builder of ProveCalc, ChatDB, the OTCF, NLS, and the Mnehmos MCP ecosystem. Two jobs, two-hour daily build window, one research program. Pursuing a math degree.

“I’m not a developer teaching coding. I’m a systems thinker teaching philosophy through building. The mine taught me that a miscalculation doesn’t throw an error message — it buries someone. The worksheet implements that.”