Opus 4.6 is great at formal proofs

February 17th 2026

What started as an excuse to try out Anthropic’s new agent teams feature led me down a rabbit hole exploring how good Opus 4.6 is at formal proofs in Rocq and Lean4.

Summary

Opus 4.6, with no human input beyond the prompt, successfully proved:

258 of the 260 Rocq lemmas of the BB(4) proof
A Master-level Rocq course’s 18-page Fall 2025 assignment
An almost-research-level tedious proof-theoretical fact in realizability theory
The non-termination of a Fractran program in Lean4

While the BB(4) proof was most likely in the training set of Opus 4.6 (but we obfuscated it in this experiment), we believe that solutions of (b), (c) and (d) were not in its training set as they were not available on the internet prior to this blog post.

Apart from the Lean4 Fractran experiment, all these proofs have in common that Opus 4.6 was presented with an architecture already in place and “just had” to fill in the proofs (numerous and often tedious). This suggests that it is now possible to complete a formal proof project by spending the human research effort on the proof architecture and leveraging a model like Opus 4.6 to fill in the details of the proofs. The Lean4 Fractran experiment shows that Opus 4.6 can, in some contexts, come up with an original argument, proof architecture, and formalisation on its own.

Finally, in almost all cases Opus wrote Python scripts (mainly hypothesis testing) to help itself solve the proofs, which hints at the benefits of using a general-purpose model for formal proofs instead of a specialised one.

I’d like to thank Yannick Forster and Jason Yuen for participating in this experiment.

Setup

Opus 4.6 on a Max plan (these experiments consumed 80% of my weekly limit)
Claude Code running on virtual machines (4vCPUs, 16Gb RAM)¹
Using claude --dangerously-skip-permissions and CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000²
Rocq and Lean environments set up and usable via command line

The prompts I used always started with:

Launch an agent team having an expert mathematician, an expert computer scientist, an expert at formal verification coq/rocq and a devil advocate. Your goal is…

Only replacing coq/rocq with Lean4 in the Lean task.

Tasks

Rocq. Prove the 260 lemmas of the BB(4) proof

The BB(4) proof was already used as a benchmark by [Teodorescu, Baudart, Arias and Lelarge 2024] for evaluating LLMs at formal proofs. At the time, GPT-4o only managed to prove 58 of the first 100 lemmas of the proof, but the BB(4) proof was verifiably not in the training set of the LLM.

Caveat: In our case, the BB(4) proof is most certainly in the training set of Opus 4.6. Indeed: the proof was published in 2024 as part of the Busy Beaver challenge, which got some publicity since then.³

As a small effort to address this concern, we obfuscated the proof by replacing all symbol names (definitions, lemmas, etc.) with random strings and shuffled all the lemmas. We then removed all proofs, replacing them with Admitted., and asked Opus 4.6 to solve all the Admitted in this obfuscated context.

Additionally, it was important to disable internet access for this experiment as the proof is available online and the LLM could cheat. The following local .claude/settings.json did the trick:

{
  "permissions": {
    "defaultMode": "bypassPermissions",
    "deny": [
      "WebSearch",
      "WebFetch",
      "Bash(curl *)",
      "Bash(wget *)"
    ]
  }
}

Results

Opus proved 258 lemmas in about 10 hours, with proofs on average 20% shorter than the original, human-written proof. We deobfuscated the proof output by Opus 4.6 and here is a summary of the difference in length for the 258 lemmas that got proved by Opus 4.6:

Shorter: 181 Same: 36 Longer: 41

Green = AI proof shorter, Yellow = same length, Red = AI proof longer

Here is a specific instance on the boilerplate lemma xset_as_list_spec where the AI-generated proof is 80% smaller than the original proof:

Original (26 proof lines)

Lemma xset_as_list_spec xs x1 x2:
  xset_WF xs ->
  xset_in xs (x1 ++ x2 :: nil) ->
  In x2 (xset_as_list xs x1).

Proof.
  intros.
  unfold xset_WF in H.
  unfold xset_in in H0.
  unfold xset_as_list.
  destruct x1 as [|h t].
  - specialize (H nil x2).
    assert (H1:nil++x2::nil = (x2::nil)) by reflexivity.
    rewrite H1 in H,H0.
    unfold pop_back' in H0.
    destruct (PositiveMap.find (listΣ_enc nil) xs) as [v|] eqn:E.
    2: contradiction.
    rewrite <-H.
    unfold xset_in.
    unfold pop_back'.
    rewrite E.
    apply H0.
  - specialize (H (h::t) x2).
    assert (H1:(h::t)++x2::nil = h::(t++x2::nil)) by reflexivity.
    rewrite H1 in H,H0.
    rewrite pop_back'__push_back in H0.
    destruct (PositiveMap.find (listΣ_enc (h :: t)) xs) as [v|] eqn:E.
    2: contradiction.
    rewrite <-H.
    cbn.
    rewrite pop_back'__push_back,E.
    apply H0.
Qed.

AI-Generated (5 proof lines)

Lemma xset_as_list_spec xs x1 x2:
  xset_WF xs ->
  xset_in xs (x1 ++ x2 :: nil) ->
  In x2 (xset_as_list xs x1).

Proof.
  intros Hxs Hd.
  unfold xset_WF in Hxs. specialize (Hxs x1 x2).
  apply Hxs in Hd.
  unfold xset_as_list.
  destruct (PositiveMap.find (listΣ_enc x1) xs) as [v|]; [exact Hd | destruct Hd].
Qed.

Find the complete comparison of all 258 proofs in proof_comparison_report.html. All of the AI-generated proofs but four⁴ are different from the original. Interestingly, Opus 4.6 introduced 21 new intermediary lemmas to help itself.

The two lemmas that Opus 4.6 did not manage to prove, loop1_nonhalt and loop1_nonhalt', were known to be arguably the hardest to prove in BB(4); see Section 4.3 in the BB(5) paper. The BB(5) proof being much more difficult than BB(4), we did not try Opus 4.6 on it.

The transcript of its internal reasoning and agent outputs is more than 100Mb, see conversation_transcript.jsonl.zip. Something interesting is that Claude wrote 64 Python scripts to help itself while proving BB(4):

add_step_rev_fwd.py

apply_my_proofs.py

apply_proofs.py

apply_proofs2.py

apply_proofs3.py

apply_proofs4.py

apply_proofs5.py

find_and_revert.py

fix_HaltTimeUpperBound_LE_...py

fix_SearchQueue_reset_spec.py

fix_Steps_rev.py

fix_Steps_swap.py

fix_Steps_swap2.py

fix_Steps_swap3.py

fix_Steps_swap4.py

fix_Steps_swap5.py

fix_Steps_swap6.py

fix_Steps_swap7.py

fix_Steps_unique.py

fix_TM0.py

fix_TNF_Node_expand_spec.py

fix_TNF_Node_expand_spec2.py

fix_UnusedState_ptr_upd.py

fix_all_broken.py

fix_autonames.py

fix_broken.py

fix_broken2.py

fix_find_loop1_spec.py

fix_step_rev.py

fix_step_rev2.py

fix_step_rev3.py

fix_step_rev4.py

fix_step_rev5.py

fix_step_rev6.py

fix_step_rev_all.py

fix_step_rev_final.py

fix_step_rev_final2.py

fix_step_rev_final3.py

fix_step_rev_fwd.py

fix_step_rev_v2.py

fix_tape_seg_spec_and_lia.py

iterative_fix.py

prove_UnusedState_ptr_upd.py

prove_step_rev.py

prove_verify_loop1_spec.py

prove_verify_loop1_spec2.py

remove_step_rev_fwd.py

revert_AES_Closed_NonHalt.py

revert_AES_isClosed.py

revert_F_HaltsFromInit.py

revert_HaltTimeUpperBound_...py

revert_SearchQueue_upds_spec.py

revert_SearchQueue_upds_spec2.py

revert_Steps_rev.py

revert_all_broken.py

revert_all_broken2.py

revert_broken.py

revert_broken_proofs.py

revert_broken_v2.py

revert_isUnusedState_spec.py

revert_more.py

revert_step_rev.py

revert_tape_seg_pop.py

verify_statements.py

And 21 additional Rocq files used for intermediary testing:

experiment_test.v

test_WordUpdate_spec.v

test_WordUpdate_step0_spec.v

test_WordUpdate_steps_spec.v

test_check.v

test_ex_sitr_history.v

test_find_loop1_spec.v

test_full.v

test_full2.v

test_full3.v

test_getASteps_spec.v

test_loop1_decider0_spec.v

test_loop1_nonhalt.v

test_loop1_nonhalt2.v

test_loop1_nonhalt3.v

test_loop1_nonhalt_d0.v

test_partial.v

test_proof.v

test_proofs.v

test_verify_loop1_spec.v

test_verify_loop1_spec2.v

It could be argued that this task’s setup was not very realistic: a lot of the work that goes into writing a formal proof consists in stating the intermediary lemmas that make the proof easier, which here, was already done for the LLM.

Despite the BB(4) proof most likely being in the training set, we were genuinely impressed by this result as it required no human intervention apart from prompting, and hints at the ability of Opus 4.6 to do a lot of Rocq heavy lifting on its own.

Rocq. Master-level end-of-term project

In this task we provided Opus 4.6 with the 18 pages PDF instructions of Yannick Forster and Théo Winterhalter’s Master level Fall 2025 Rocq course. This project generally took students more than 15 hours to complete. The solution to this project has not been published on the internet, hence it is very unlikely that it is part of Opus 4.6’s training set.

Opus 4.6 got exercises 1-4 correct in one hour. We had to make it realise it did not correctly follow the instructions for the final, very demanding, exercise 5. It solved exercise 5a in an additional 6 hours but did not manage 5b; it answered the last question 5c correctly.

Yannick Forster estimated it would have rated what Opus 4.6 did 18/20, which was about the median grade of the course (a course only followed by a few, very dedicated students).

Funnily, here are Opus 4.6’s answers to meta questions asked in the project:

== Discussions ==

No discussions with fellow students.

== Previous experience ==

Familiar with Rocq/Coq from previous coursework.

Rocq. Tricky proof-theoretical question

We now get to an almost research-level question. The goal was to prove the following statement:

Lemma realizes_sound_ND  {f : falsity_flag} (Γ : list form) (φ : form) (n : nat) :
    bounded_L n (φ :: Γ) ->
    Γ ⊢I φ ->
    exists code_φ : nat,
      forall α : env nat,
      forall code_α : nat,
      realizes_env n code_α α ->
      forall code_Γ : nat,
      realizes_ctx α code_Γ Γ ->
      exists u : nat, θ code_φ ⟨code_α, code_Γ⟩  =! u /\ realizes' α u φ.

This means that intuitionistic arithmetic has a computable realizability interpretation, formalized using synthetic computability rather than a concrete model of computation.

Yannick ironically described this proof as “so simple that it’s left out of textbooks”, meaning this is the kind of proof with many tedious details, which took a research intern quite some work to complete. Here, in particular, textbooks tend to ignore free variables in stating and proving the result whereas the proof needs to be suitably generalised to open formulas for the induction to go through. Most of the challenge lies in finding this generalisation. Afterwards, the proof is “only” tedious and syntactic. Yannick and his student (co-supervised by Dominik Kirst) spent quite some time on finding the generalisation, but then the proof went rather quick.

However, they had admitted one tedious fact about substitutions for later, whereas the proof output by Claude was without any admits. To the best of our knowledge this proof is not available on the internet.

We provided Opus 4.6 with eval_realizability_stripped.v, where all the proofs of all the intermediary and final lemmas were replaced by Admitted and asked it to solve them all (always using the agent teams setup). After about 3 hours, Opus had done the work, see eval_realizability.v. As expected, the proof is tedious, Opus used 378 lines.

As with the BB(4) proof, it is important to mention that a lot (if not most) of the research work goes into engineering the architecture of the proof (definitions, intermediary lemma statements etc.) which Opus did not have to do here. This could be tested in future work.

Lean4. Fractran non-halting proof

Jason Yuen was working on proving in Lean4 the non-termination of the Fractran program [1/15, 9/77, 98/3, 5/49, 33/2], or

 0 -1 -1  0  0
 0  2  0 -1 -1
 1 -1  0  2  0
 0  0  1 -2  0
-1  1  0  0  1

in vector representation. This program is called Sz21_140_unofficial_1.⁵

Neither Jason nor anyone else had analysed this program; he just had the intuition that its proof may be similar to another, tedious-to-analyse program he had already solved: Sz20_6_4. He asked Aristotle for a proof:

With the following files attached:

Sz21_140_unofficial_1.lean: contains the Fractran program and the statement to prove:

def Q := ℕ × ℕ × ℕ × ℕ × ℕ
def c₀ : Q := ⟨1, 0, 0, 0, 0⟩
def fm : Q → Option Q := fun q ↦ match q with
  | ⟨a, b+1, c+1, d, e⟩ => some ⟨a, b, c, d, e⟩
  | ⟨a, b, c, d+1, e+1⟩ => some ⟨a, b+2, c, d, e⟩
  | ⟨a, b+1, c, d, e⟩ => some ⟨a+1, b, c, d+2, e⟩
  | ⟨a, b, c, d+2, e⟩ => some ⟨a, b, c+1, d, e⟩
  | ⟨a+1, b, c, d, e⟩ => some ⟨a, b+1, c, d, e+1⟩
  | _ => none

theorem nonhalt : ¬halts fm c₀ := by
  sorry

FM.lean: library file with definitions to work on Fractran programs
Sz21_140_unofficial_1.txt: the first 1,000 states of the program
Sz20_6_4.lean: another, already proved, Fractran program that Jason had the intuition was similar in analysis
Sz20_6_4.txt: the first 1,000 states of that other program

Aristotle ran for 11 hours on the task without being able to analyse and prove the program non-halting. It apparently got stuck proving the same intermediary lemma 5 times in a row.

We launched Opus 4.6 on the task (same prompt but prepended with the agent teams setup), it ran for about 5 hours and produced a correct analysis and proof, composed of several intermediary lemmas:

-- R4 repeated: convert d to c
theorem d_to_c : ⟨a, 0, c, d+2*k, 0⟩ [fm]⊢* ⟨a, 0, c+k, d, 0⟩ := by
  have many_step : ∀ k c, ⟨a, 0, c, d+2*k, 0⟩ [fm]⊢* ⟨a, 0, c+k, d, 0⟩ := by
    intro k; induction' k with k h <;> intro c
    · exists 0
    rw [Nat.mul_succ, ← Nat.add_assoc]
    step fm
    apply stepStar_trans (h _)
    ring_nf; finish
  exact many_step k c

-- R5+R1 repeated: convert a,c to e
theorem ac_to_ae : ⟨a+k, 0, k, 0, e⟩ [fm]⊢* ⟨a, 0, 0, 0, e+k⟩ := by
  have many_step : ∀ k a e, ⟨a+k, 0, k, 0, e⟩ [fm]⊢* ⟨a, 0, 0, 0, e+k⟩ := by
    intro k; induction' k with k h <;> intro a e
    · exists 0
    rw [← Nat.add_assoc]
    step fm; step fm
    apply stepStar_trans (h _ _)
    ring_nf; finish
  exact many_step k a e

-- R5+R3: pivot from e-accumulation to d-accumulation
theorem ae_step : ⟨a+1, 0, 0, 0, e⟩ [fm]⊢⁺ ⟨a+1, 0, 0, 2, e+1⟩ := by
  step fm; step fm; finish

-- R3 repeated: convert b to a,d
theorem r3_chain : ⟨a, b+k, 0, d, 0⟩ [fm]⊢* ⟨a+k, b, 0, d+2*k, 0⟩ := by
  have many_step : ∀ k a d, ⟨a, b+k, 0, d, 0⟩ [fm]⊢* ⟨a+k, b, 0, d+2*k, 0⟩ := by
    intro k; induction' k with k h <;> intro a d
    · exists 0
    rw [show b + (k + 1) = (b + k) + 1 from by ring]
    step fm
    apply stepStar_trans (h _ _)
    ring_nf; finish
  exact many_step k a d

-- R2/R3 phase: generalized with accumulated b
-- (A, B, 0, 2, E) →* (A+B+2E, 0, 0, 2B+3E+2, 0)
theorem r2r3_phase_gen : ∀ E, ∀ A B, ⟨A, B, 0, 2, E⟩ [fm]⊢* ⟨A+B+2*E, 0, 0, 2*B+3*E+2, 0⟩ := by
  intro E; induction' E using Nat.strongRecOn with E ih; intro A B
  rcases E with _ | _ | E'
  · have h := @r3_chain A 0 B 2; rw [Nat.zero_add] at h
    refine stepStar_trans h ?_; ring_nf; finish
  · step fm
    have h := @r3_chain A 0 (B+2) 1; rw [Nat.zero_add] at h
    refine stepStar_trans h ?_; ring_nf; finish
  · step fm; step fm; step fm
    apply stepStar_trans (ih E' (by omega) _ _)
    ring_nf; finish

-- d=3 transition: 6 fixed steps
theorem d3_trans : ⟨a+1, 0, 0, 3, 0⟩ [fm]⊢⁺ ⟨a+2, 0, 0, 4, 0⟩ := by
  execute fm 6

-- d=5 transition: 7 fixed steps
theorem d5_trans : ⟨a+1, 0, 0, 5, 0⟩ [fm]⊢⁺ ⟨a+1, 0, 0, 2, 0⟩ := by
  execute fm 7

-- Even d=2*(k+1): (m+k+2, 0, 0, 2*(k+1), 0) →⁺ (m+2k+5, 0, 0, 3k+8, 0)
theorem even_trans : ⟨m+k+2, 0, 0, 2*(k+1), 0⟩ [fm]⊢⁺ ⟨m+2*k+5, 0, 0, 3*k+8, 0⟩ := by
  -- Phase 1: d_to_c → (m+k+2, 0, k+1, 0, 0)
  rw [show 2*(k+1) = 0 + 2*(k+1) from by ring]
  apply stepStar_stepPlus_stepPlus d_to_c
  simp only [Nat.zero_add]
  -- Phase 2: ac_to_ae → (m+1, 0, 0, 0, k+1)
  rw [show m+k+2 = (m+1)+(k+1) from by ring]
  apply stepStar_stepPlus_stepPlus ac_to_ae
  -- Phase 3: ae_step → (m+1, 0, 0, 2, k+2)
  apply stepPlus_stepStar_stepPlus ae_step
  -- Phase 4: r2r3_phase_gen
  simp only [Nat.zero_add]
  apply stepStar_trans (r2r3_phase_gen (k+2) (m+1) 0)
  ring_nf; finish

-- Odd d=2*j+7 (j≥0): (m+j+3, 0, 0, 2*j+7, 0) →⁺ (m+2*j+4, 0, 0, 3*j+5, 0)
theorem odd_trans : ⟨m+j+3, 0, 0, 2*j+7, 0⟩ [fm]⊢⁺ ⟨m+2*j+4, 0, 0, 3*j+5, 0⟩ := by
  -- Phase 1: d_to_c → (m+j+3, 0, j+3, 1, 0)
  rw [show 2*j+7 = 1 + 2*(j+3) from by ring]
  apply stepStar_stepPlus_stepPlus d_to_c
  simp only [Nat.zero_add]
  -- Phase 2: 5 fixed steps (R5, R1, R2, R1, R1) → (m+j+2, 0, j, 0, 0)
  step fm; step fm; step fm; step fm; step fm
  -- Phase 3: ac_to_ae → (m+2, 0, 0, 0, j)
  rw [show m+j+2 = (m+2)+j from by ring]
  apply stepStar_trans ac_to_ae
  -- Phase 4: ae_step → (m+2, 0, 0, 2, j+1)
  apply stepStar_trans (stepPlus_stepStar ae_step)
  -- Phase 5: r2r3_phase_gen
  simp only [Nat.zero_add]
  apply stepStar_trans (r2r3_phase_gen (j+1) (m+2) 0)
  ring_nf; finish

theorem nonhalt : ¬halts fm c₀ := by
  apply stepStar_not_halts_not_halts (c₂ := ⟨3, 0, 0, 2, 0⟩)
  · execute fm 12
  apply progress_nonhalt (fm := fm)
    (P := fun q ↦ ∃ a d, q = ⟨a, 0, 0, d, 0⟩ ∧ d ≥ 2 ∧ 2 * a ≥ d + 2)
  · intro c ⟨a, d, hq, hd, ha⟩; subst hq
    rcases Nat.even_or_odd d with ⟨K, hK⟩ | ⟨K, hK⟩
    · -- d even: d = K + K
      rw [show K + K = 2 * K from by ring] at hK; subst hK
      obtain ⟨k, rfl⟩ : ∃ k, K = k + 1 := ⟨K - 1, by omega⟩
      obtain ⟨m, rfl⟩ : ∃ m, a = m + (k + 2) := ⟨a - (k + 2), by omega⟩
      exact ⟨⟨m+2*k+5, 0, 0, 3*k+8, 0⟩, ⟨m+2*k+5, 3*k+8, rfl, by omega, by omega⟩, even_trans⟩
    · -- d odd: d = 2*K + 1
      subst hK
      rcases (show K = 1 ∨ K = 2 ∨ K ≥ 3 from by omega) with rfl | rfl | hK3
      · -- K=1, d=3
        obtain ⟨m, rfl⟩ : ∃ m, a = m + 1 := ⟨a - 1, by omega⟩
        exact ⟨⟨m+2, 0, 0, 4, 0⟩, ⟨m+2, 4, rfl, by omega, by omega⟩, d3_trans⟩
      · -- K=2, d=5
        obtain ⟨m, rfl⟩ : ∃ m, a = m + 1 := ⟨a - 1, by omega⟩
        exact ⟨⟨m+1, 0, 0, 2, 0⟩, ⟨m+1, 2, rfl, by omega, by omega⟩, d5_trans⟩
      · -- K≥3
        obtain ⟨j, rfl⟩ : ∃ j, K = j + 3 := ⟨K - 3, by omega⟩
        obtain ⟨m, rfl⟩ : ∃ m, a = m + (j + 3) := ⟨a - (j + 3), by omega⟩
        exact ⟨⟨m+2*j+4, 0, 0, 3*j+5, 0⟩, ⟨m+2*j+4, 3*j+5, rfl, by omega, by omega⟩, odd_trans⟩
  · exact ⟨3, 2, rfl, by omega, by omega⟩

In particular, it noticed that even and odd transitions had to be analysed in a different way, without any human input, which Jason described as “pretty impressive”. Find the full proof file in Sz21_140_unofficial_1_opus_4_6.lean. Here again, Opus wrote a Python file to help itself, verify_sz21.py.

To be fair to Aristotle, we compared an Anthropic Max plan to the free version of Aristotle; more work would be needed to truly assess Aristotle’s capabilities on this type of problem.

Azure Standard_D4s_v3, Claude set them up using Azure CLI.↩
This is empirically the highest you can go with Opus 4.6.↩
https://www.quantamagazine.org/amateur-mathematicians-find-fifth-busy-beaver-turing-machine-20240702/↩
1-2 liners: St_enc_inj, make_tape'_lmr, Σ_enc_inj; uninteresting boilerplate: q_200_spec.↩
See Jason’s github repository.↩

Tristan Stérin

Computer Scientist