In search of falsehood

March 5th 2026

Following up on the previous post where we used Opus 4.6 to prove things in Rocq and Lean4, we turned the tables: what if we asked it to break things instead? With the help of expert tips, we pointed Opus 4.6 at the kernels of Rocq and Lean4 (and some of Lean’s non-official kernels) and asked it to find soundness bugs – bugs that allow deriving a proof of False.

A proof of False is arguably the worst possible bug in a proof assistant: from False you can prove anything, and the guarantees provided by formal verification vanish entirely.

The results:

	Proofs of `False`	Other bugs
Rocq (official)	1, 2, 3, 4, 5, 6, 7	8, 9, 10
Lean4 (official)	0	1, 2, 3, 4
nanoda (Lean, non-official)	1, 2	0
lean4lean (Lean, non-official)	1	0

Important: (a) we arguably got much more precise expert tips along the way on potential bugs in Rocq than Lean, (b) bug 1 in the official Lean4 kernel could, in theory, be exploited into a proof of False.

Setup

The setup is essentially the same as in my previous post:

Opus 4.6 on a Max plan (these experiments consumed 150% of my weekly limit over ~72h of compute)
CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 claude --dangerously-skip-permissions on a VM (4 vCPUs, 16 GB RAM)
Prompt:

Launch an agent team having an expert mathematician, an expert computer scientist, an expert at formal verification coq/rocq/lean4, and a devil’s advocate. Your goal is to find a proof of false in…
For both Rocq and Lean, we gave a list of previous soundness bugs and some expert tips as to where bugs would be likely to be found (arguably, we got more precise tips in Rocq than Lean)

Notes

Many thanks to the Rocq and Lean teams who welcomed this experiment and were very responsive on Zulip (Rocq’s and Lean’s) and GitHub (all the Rocq soundness bugs are now fixed and the Lean ones have been either acknowledged or fixed).
In Rocq, the proofs of false exploited relatively orthogonal soundness bugs (mainly in the guard checker and module system). Historically, soundness bugs have been found typically once a year.
In Lean’s official kernel, bug 1, which is just an unsigned -> size_t truncation issue, could in theory be exploited into a proof of false but would require quite some compute power (the bug involves a structure with 2^32 fields). This truncation issue could have been detected using the -Wconversion g++ flag.¹
In Lean’s official kernel, bug 2 hints at a known family of incompleteness bugs in the Lean kernel, meaning that the kernel will reject some correct statements, e.g.:

example (A : Sort u) (B : Sort v) (C : Sort w)
    : Sort (imax v (imax u w)) := A → B → C
-- Error: A → B → C has sort Sort (imax u v w), expected Sort (imax v u w)

Which is accepted in Rocq:

Set Universe Polymorphism.
Definition test@{u v w}
    (x : Type@{max(u, v, w)}) : Type@{max(v, u, w)} := x.  (* Accepted *)

This made me wonder: are the completeness bugs in the official Lean kernel simply left unfixed to avoid the risk of introducing soundness bugs in the process?

This Opus 4.6 setup tended to be too conservative in its bug analyses: it rarely raised false alarms (it even had its own false alarm checker, carefully examining bug claims from sub-agents), and in some instances classified bugs as not exploitable that experts were later able to exploit.
Potential future work: search for bugs in the official Lean kernel with more hints from its developers, search for bugs in Isabelle/HOL, HOL Light, Agda (agda --safe), and in tools that translate proofs between proof assistants.
Arguably, it used to require a PhD in type theory or equivalent to look for bugs in proof assistant kernels. This experiment demonstrates Claude’s ability to autonomously find bugs in this conceptually complex environment, with the help of expert tips (many thanks to Yannick Forster, Pierre-Marie Pédrot, Gaëtan Gilbert, Yann Leray, Joachim Breitner, Robin Arnez, Alex Meiburg and Arthur Adjedj, as well as all participants on the Lean and Rocq Zulip) and with occasional encouragements (“great findings Opus, please keep going”) from a non-expert (me).

This gives a general attack strategy for LLMs: compile using all flags and try to exploit all warnings.↩

Tristan Stérin

Computer Scientist

In search of falsehood

Setup

Notes