IMProofBench

An AI Benchmark for Mathematical Reasoning

IMProofBench is a benchmark designed to evaluate AI systems' ability to generate rigorous mathematical proofs at research level. Unlike benchmarks that focus on questions with unique (e.g. numerical) answers, we test whether AI can produce complete, logically sound arguments that would satisfy the standards of peer-reviewed mathematics journals. The benchmark maintains a private problem set to ensure unbiased evaluation and prevent overfitting.

Team: Johannes Schmitt, Tim Gehrunger, Jeremy Feusi (ETH Zurich)
Whitepaper: link

Pilot Project Goals (Summer 2025)

What Makes a Good Problem?

Required Characteristics

What to Avoid

Quick Problem Primer

Take 5 minutes to brainstorm! Think about:

Example thought process:
"In my last paper, I needed to show that a certain intersection number was zero. The calculation looked routine but required understanding how different strata interact in the compactification. This could make a good problem..."

Problem Templates

To give you some ideas, here is a list of templates of questions in algebraic geometry that could be suitable for the benchmark:

Intersection Theory

Main: Let X be [variety]. Compute the class of [specific cycle] in the Chow ring A*(X).
Subquestions: What is the degree of this class? Does it vanish in A²(X)?
Main: For the moduli space M of [objects], compute a closed formula for the intersection number ∫_M α₁ ∪ α₂ ∪ ... ∪ αₙ.
Subquestions: What is this number for specific values of parameters?

Enumerative Geometry

Main: How many [geometric objects] in [ambient space] satisfy [conditions]?
Subquestions: What is this count for low-degree cases? Is the answer always positive?
Main: Compute the Gromov-Witten invariant ⟨τ_{a₁}(γ₁),...,τ_{aₙ}(γₙ)⟩_{g,β} for [specific data].
Subquestions: What is the value when all aᵢ = 0? When g = 0?

Classification Problems

Main: Classify all [objects] with [property]. Give explicit representatives for each isomorphism class.
Subquestions: How many classes are there? Which have additional property P?

Cohomology Calculations

Main: What is the rank of the cohomology group Hᵏ(M) for [variety/moduli space M]?
Subquestions: What is dim H⁰(M)? Is Hⁿ(M) = 0 for n > d?
Main: Compute H^i(X, L) where X = [variety] and L = [line bundle].
Subquestions: What is h⁰(X, L)? Does Serre duality hold?

Full Example Problems

Example 1: Stable Graphs

Main question: Find a closed formula for the number N(g) of stable graphs of genus g with no legs and precisely 3 edges, for all g ≥ 2.

Subquestions:

Example 2: Permutation Representations

Main question: Let G be a finite group. Is the functor Perm: G-sets → Rep_ℂ(G) sending X to its permutation representation fully faithful? Prove or provide a counterexample.

Subquestions:

How to Submit

During the brainstorming session:

  1. Join our collaborative Overleaf project: link
  2. Create your problem file using the provided template
  3. When ready for AI testing, notify in chat so that organizers can run it on several frontier AI models
  4. Review AI responses and refine your problem as needed

For submission outside of an in-person session:

Join the Session

Question Brainstorming Session

When: Tuesday, 3rd March, 15:00 - 16:00
Where: Les Diablerets, Hotel Les Sources, Salle Glacier
What to bring: Laptop

Resources: