The Mnemosyne Research Institute develops theory and tools for AI systems that fail less by being constrained more — through typed geometric priors, formally verified training data, empirical trust calibration, and engineering software that enforces dimensional truth.
Named for Mnemosyne — the Greek Titaness of memory and mother of the Muses — this institute builds from a single claim: structure declared upfront costs less than structure the model must discover from scratch, and the gap widens with scale.
Standard pre-training forces models to infer all functional structure from co-occurrence statistics. That “m/s²” is a compound unit, that “force” as a physics noun differs from a verb, that negation reverses truth values — all discovered at enormous compute cost. BPE makes it worse: “m/s²” becomes five tokens, shattering dimensional meaning. ANLU pays the tax once at design time as typed geometric priors.
Two scaling axes are recognized: width (parameters × data) and depth (compute per token via loops). A third axis — representational structure through typed inductive biases — remains at zero in every current large model. Batatia et al. (2025): structural priors produce 5× larger dataset scaling exponents and 3× larger parameter scaling exponents, with the gap widening at scale. The Bitter Lesson predicts closure. The data shows acceleration.
Sycophancy, tool over-reliance, authority bias, and prompt injection vulnerability are four names for one deficit: poor input trust calibration. The OTCF measures this using chess — the only domain where advisor reliability is precisely tunable (Stockfish from 800 to 3190 Elo), ground truth is absolute and computable, and reasoning traces are capturable at every decision point. Chess is the instrument, not the subject.
Shannon’s information theory and geometric constraint are the same thing in different notation. A constraint that eliminates possibilities is informationally equivalent to bits received. Frozen type embeddings don’t make hallucination unlikely — they make structurally incoherent outputs geometrically impossible. The mine taught the stakes: at Freeport-McMoRan, a miscalculation doesn’t throw an error message. It buries someone.
Four programs. One load-bearing claim. Structure declared upfront changes scaling exponents.
Two coordinated changes to pre-training. First: replace BPE tokenization with whole-word, unit-preserving segmentation — compound units like m/s² are single tokens; polysemous words receive distinct context-dependent embeddings. Second: add functional role type vectors to each token embedding, organized in a six-layer hierarchy: closed class (L1) → open class (L2) → rhetorical structure (L3) → quantitative STEM substrate (L4) → domain-specific semantics (L5) → cognitive/agent types (L6). Higher layers override lower. “force” as a physics noun is QUANTITY, not NOUN.
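A minimal sketch of the override rule, assuming the six-layer hierarchy above; the names and data layout are illustrative, not the production embedding code:

```python
# Minimal sketch of the six-layer type hierarchy and its override rule.
# Layer names follow the description above; everything else is illustrative.
from dataclasses import dataclass

# Hierarchy levels, lowest to highest precedence.
L1_CLOSED, L2_OPEN, L3_RHETORIC, L4_STEM, L5_DOMAIN, L6_AGENT = range(1, 7)

@dataclass
class TypedToken:
    surface: str   # whole-word, unit-preserving: "m/s²" stays one token
    types: dict    # level -> type label, e.g. {L2_OPEN: "NOUN"}

    def effective_type(self) -> str:
        # Higher layers override lower: physics "force" resolves to
        # QUANTITY (L4) even though it also carries NOUN (L2).
        return self.types[max(self.types)]

force = TypedToken("force", {L2_OPEN: "NOUN", L4_STEM: "QUANTITY"})
accel = TypedToken("m/s²", {L4_STEM: "UNIT"})
assert force.effective_type() == "QUANTITY"
assert accel.effective_type() == "UNIT"
```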
Synthesis of ANLU typed embeddings with Ouro-style latent loops (Zhu et al., 2025). Three mechanisms: type-conditioned exit gates (loops run until type geometry stabilizes), type-aware entropy regularization (penalizes entropy differently in typed vs. free dimensions), type-stratified KV caching. Each degrades gracefully to standard Ouro at zero type confidence — cannot be worse than baseline.
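The first mechanism, sketched. The function and its arguments are hypothetical; the one property the sketch preserves is the degradation guarantee — at zero type confidence the gate never fires and the loop runs to its fixed depth, i.e. standard Ouro:

```python
import torch

def typed_exit_loop(h, step_fn, type_mask, type_conf, max_loops=8, tol=1e-2):
    """Type-conditioned exit gate (sketch). At type_conf == 0 the gate
    never fires, so the loop runs max_loops: fixed-depth Ouro baseline."""
    for _ in range(max_loops):
        h_next = step_fn(h)
        # Exit when the *typed* sub-geometry stabilizes; free dims may still move.
        typed_delta = torch.norm((h_next - h) * type_mask).item()
        h = h_next
        if typed_delta < tol * type_conf:  # zero confidence => never exits early
            break
    return h

h = typed_exit_loop(
    torch.zeros(16),
    lambda x: 0.5 * x + 0.1,   # toy contraction standing in for a loop block
    type_mask=torch.ones(16),
    type_conf=0.9,
)
```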
Every existing adaptive computation method is reactive. Biology does the opposite: fireground commanders (Klein’s RPD model: fast-path decisions at 117 of 134 decision points, 87%) classify before committing resources. Three-level hierarchy: Level 1 pre-classifies tokens from typed embeddings before any transformer computation; Level 2 generates a strategy vector with soft Q/K attention biases; Level 3 monitors prediction error and escalates when strategy fails.
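The classify-first shape of the hierarchy, reduced to a sketch. Names and thresholds are hypothetical, and Level 2’s strategy vector is simplified to a plain loop budget:

```python
# Sketch of proactive triage: classify before spending compute.
def triage(token_types, prediction_error, shallow=1, deep=8, escalate_at=0.5):
    budget = deep if "QUANTITY" in token_types else shallow  # L1+L2: pre-classify, pick strategy
    if prediction_error() > escalate_at:                     # L3: escalate when strategy fails
        budget = deep
    return budget

assert triage({"NOUN", "QUANTITY"}, lambda: 0.1) == 8  # typed signal -> deep path up front
assert triage({"NOUN"}, lambda: 0.1) == 1              # cheap path when nothing demands more
assert triage({"NOUN"}, lambda: 0.9) == 8              # monitored error forces escalation
```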
Width: storage capacity, bounded by ~2 bits/parameter and data-wall exhaustion. Depth: manipulation capacity via loops, enabling reasoning that fixed-depth architectures cannot perform at any scale. Structure: sample efficiency via typed embeddings — shaping the space where storage and manipulation occur. A typed-loop model at 1–2B parameters could match standard 7–12B models by optimizing an axis currently left at zero.
Epistemic Triage Under Oracle Assistance. Sycophancy, tool over-reliance, authority bias, and prompt injection are manifestations of a single deficit: poor input trust calibration. The design: a 2×2×2 factorial, a six-condition provenance-framing spectrum, adversarial advisor transitions, and a seven-level prompt-density spectrum (P0–P6). The byproduct: labeled training data at ~100× lower cost than human annotation, with mathematical ground truth on every label.
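The shape of the condition grid, sketched. The source specifies only the design’s dimensions; the factor names below are placeholders, not the framework’s actual variables:

```python
from itertools import product

# Sketch of the OTCF condition grid. 2x2x2 factorial, crossed with six
# provenance framings and seven prompt-density levels (P0-P6).
FACTORS = {
    "advisor_elo":     [800, 3190],                    # Stockfish endpoints
    "advisor_framing": ["neutral", "authoritative"],   # placeholder factor
    "adversarial":     [False, True],                  # advisor transitions
}
PROVENANCE = [f"provenance_{i}" for i in range(6)]     # six framing conditions
DENSITY = [f"P{i}" for i in range(7)]                  # P0-P6

conditions = [
    {**dict(zip(FACTORS, cell)), "provenance": p, "density": d}
    for cell in product(*FACTORS.values())
    for p in PROVENANCE
    for d in DENSITY
]
assert len(conditions) == 2 * 2 * 2 * 6 * 7  # 336 cells, each with computable ground truth
```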
Autonomous proof engine targeting the IMO corpus. LLMs propose steps; SymPy and Lean verify each before acceptance. An obligation DAG tracks global completeness. Gated conclusions prevent premature termination. Target: 3–4 million typed, verified reward signals, geometrically structured to match the ANLU architecture. The training data and the model speak the same typed geometry.
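The obligation DAG’s global gate, sketched as a minimal class; the proposer/verifier loop around it is omitted, and the names are illustrative rather than the engine’s code:

```python
# Sketch of the obligation DAG: steps are verified locally (SymPy / Lean),
# but the conclusion is gated on *global* completeness.
class ObligationDAG:
    def __init__(self):
        self.children = {}        # obligation id -> ids it depends on
        self.discharged = set()

    def add(self, oid, deps=()):
        self.children[oid] = set(deps)

    def discharge(self, oid):
        # A locally verified step may only discharge an obligation whose
        # dependencies are already discharged.
        if self.children[oid] <= self.discharged:
            self.discharged.add(oid)

    def complete(self):
        # Gated conclusion: no final answer until every obligation closes.
        return self.discharged == set(self.children)

dag = ObligationDAG()
dag.add("show c = 4", deps=["lower bound", "upper bound"])
dag.add("lower bound")
dag.add("upper bound")
dag.discharge("lower bound")
dag.discharge("show c = 4")   # refused: "upper bound" is still open
assert not dag.complete()     # verified steps alone do not make a theorem
```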
Built because SMath Studio carries organizational risk and Mathcad costs $2,700/year. The LLM proposes structure; SymPy + Pint validates. Four verification gates: unit consistency, constraint satisfaction, numeric residual, sanity checks. PE-stamp-ready audit trail with full node provenance. Buy once, own forever. The product is the thesis applied to engineering math.
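Two of the four gates, sketched with Pint’s actual API; the constraint-satisfaction and sanity gates, and the worksheet/provenance plumbing, are omitted:

```python
# Sketch of ProveCalc's unit-consistency and numeric-residual gates.
import pint

ureg = pint.UnitRegistry()

def unit_gate(lhs, rhs):
    # Gate 1: both sides must share a dimensionality, e.g. F and m*a in [force].
    return lhs.dimensionality == rhs.dimensionality

def residual_gate(lhs, rhs, tol=1e-9):
    # Gate 3: numeric residual after conversion to a common unit.
    return abs((lhs - rhs.to(lhs.units)).magnitude) < tol

m = 12.0 * ureg.kilogram
a = 9.81 * ureg.meter / ureg.second**2
F = 117.72 * ureg.newton

assert unit_gate(F, m * a)       # newton == kg*m/s**2, dimensionally
assert residual_gate(F, m * a)   # |117.72 N - m*a| below tolerance
```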
Observed, documented, reproducible. Not claims — records.
19/19 locally verified proof steps on IMO 2025 Problem 3 produced the wrong final answer (c=3, not c=4). Multiple frontier models converged on c=3/2 via training prior contamination. Local verification is necessary but not sufficient. The fix is an obligation DAG tracking global proof completeness independently of step-level verification.
ChatDB · IMO 2025 Problem 3
Nemotron 3 Super 120B repeatedly referenced piece positions that did not match the actual board state — constructing an internally coherent but physically impossible position and reasoning from it with confidence. A 120B-parameter model hallucinated a chessboard it had never seen.
OTCF · Nemotron Gauntlet · March 2026
Nemotron’s most eloquent analysis accompanied its worst move. Move 26 — allowing forced mate-in-3 — was preceded by the most sophisticated-sounding positional reasoning of the game. Verbosity and correctness are anti-correlated under competitive pressure. Confabulatory reasoning is a named, reproducible failure mode.
OTCF · Nemotron Gauntlet · March 2026
66% more tokens consumed on forced moves vs. non-forced. When the decision space collapses to a single legal response, the model cannot detect it — allocating full deliberative overhead to a decision that does not exist. This is a failure of epistemic state detection. It generalizes to any constrained domain.
OTCF · Nemotron Pattern · Multiple Models
Same model, same day: tactical brilliance in one game, complete endgame collapse in another. Within-session performance variance is a publishable finding. A model that cannot reliably reproduce its own good performance is unreliable regardless of peak capability. Variance is a measurement, not an anecdote.
OTCF · Gemini 3.1 Pro Preview · March 2026
Batatia et al. (2025): EquiformerV2 with SE(3) equivariance achieved ~5× larger dataset scaling exponents and ~3× larger parameter scaling exponents vs. unconstrained transformers. The gap widened with scale. The Bitter Lesson predicts it closes. The data shows it accelerates. Empirical load-bearing support for ANLU.
Batatia et al. (2025) · Three Scaling Axes (Mnehmos, 2026)
Every product is an empirical test of the thesis. Working software is the argument.
Engineering calculations you can prove. Built in 36 days because SMath carries organizational risk and Mathcad costs $2,700/year. The LLM proposes; SymPy + Pint validates every claim. Four verification gates. PE-stamp-ready audit trail. Perpetual desktop license.
“Five-year math: Mathcad costs $13,500. ProveCalc is a one-time payment. Buy once, own forever. No rent economy.”
Rules-enforced D&D 5e MCP server. 28 consolidated action-routed tools. LLMs propose narrative intentions; the engine validates all mechanics. LLMs cannot invent dice rolls, spell slots, or hit points. Same anti-hallucination design as ProveCalc, applied to games.
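The validation boundary in one function, sketched; the real server routes mechanics through its 28 consolidated tools, and the names here are illustrative:

```python
import secrets

def roll(notation: str) -> int:
    """Engine-side dice (sketch). A model may *request* "3d6+2"; only the
    engine produces the number, so rolls cannot be invented in narration."""
    dice, _, bonus = notation.partition("+")
    count, sides = (int(x) for x in dice.split("d"))
    return sum(secrets.randbelow(sides) + 1 for _ in range(count)) + int(bonus or 0)

result = roll("3d6+2")    # validated mechanics; the LLM only narrates `result`
assert 5 <= result <= 20
```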
Compiler: structured English to Python. NLS is the source code; Python is the artifact. Tree-sitter grammar, VS Code extension with LSP, PyPI distribution. Built in 25 hours. The hero terminal above is written in NLS — this is not a demo, it’s the real language.
Research reports, governance theory, architecture, and building in public.
Full Oracle Trust Calibration report with move-by-move analysis, think-time charts, eval trajectory, and head-to-head comparison with GPT-5.4. Five failure patterns: board reconstruction failure, collapse-and-swing, forced-move blindness, reasoning quality decorrelation, the representational gap.
Governance Theory
Perfect consent as the asymptotic limit of governance architecture. Twelve sections: the pre-epistemic condition, epistemic architecture, illegitimate origins and the velocity of legitimacy, the neural-symbiotic network. The theoretical foundation beneath the institute’s governance work.
Epistemic Architecture
A system prompt for epistemic architecture. Not an alignment prompt. Not a safety filter. A governance architecture expressed as operating instructions. “Neither party is above the architecture.”
Systems Thinking
The OODA loop of warfare doctrine maps onto agent architecture structurally, not metaphorically. Building systems that respond to actual current state rather than to cached models of reality.
Architecture
Building an agentic nervous system with external memory, persistent context, and cross-session continuity. The database is the intelligence; the agent is the hands.
All Posts
Prompt injection defense, sleeper injection, RAG systems, database-driven dungeon mastering, natural language as the fastest-adopted programming language, the consequence manifesto, and more.
Founder & Director — Vario / Mnehmos
Independent AI researcher and systems architect based in Safford, Arizona. Six years as a heavy equipment operator at Freeport-McMoRan copper operations (Safford and Morenci mines) — haul trucks, rubber-tire dozers, zero safety incidents, top 5 crew scorecard two consecutive years. Engineering Student of the Year, Eastern Arizona College 2022 — Calculus I–II, Statics, Dynamics, Thermodynamics, and Mechanics of Materials in a single academic year while working full-time. Self-taught builder of ProveCalc, ChatDB, the OTCF, NLS, and the Mnehmos MCP ecosystem. Two jobs, two-hour daily build window, one research program. Pursuing a math degree.
“I’m not a developer teaching coding. I’m a systems thinker teaching philosophy through building. The mine taught me that a miscalculation doesn’t throw an error message — it buries someone. The worksheet implements that.”
No university. No VC. No grant overhead. Research stays independent and open. Overhead is three AI subscriptions and a DoorDash shift. Your support covers compute, infrastructure, and the hours between the cafeteria and midnight.
Research partnerships or institutional collaboration? research@themnemosyneresearchinstitute.com