IMProofBench is a benchmark designed to evaluate AI systems' ability to generate rigorous mathematical proofs at research level.
Unlike benchmarks that focus on questions with unique (e.g. numerical) answers, we test whether AI can produce complete, logically sound arguments that would satisfy the standards of peer-reviewed mathematics journals.
The benchmark maintains a private problem set to ensure unbiased evaluation and prevent overfitting.
Take 5 minutes to brainstorm! Think about:
- A tricky calculation from your recent work that required a clever insight
- An "obvious" statement that actually needs a non-trivial proof
- A self-contained lemma that came up in a research project
- An oral exam question for an advanced course that you would ask an excellent student
- A classification problem with a finite but non-obvious answer
Example thought process:
"In my last paper, I needed to show that a certain intersection number was zero. The calculation looked routine but required understanding how different strata interact in the compactification. This could make a good problem..."
To give you some ideas, here is a list of templates of questions in algebraic geometry that could be suitable for the benchmark:
Intersection Theory
Main: Let X be [variety]. Compute the class of [specific cycle] in the Chow ring A*(X).
Subquestions: What is the degree of this class? Does it vanish in A²(X)?
Main: For the moduli space M of [objects], compute a closed formula for the intersection number ∫_M α₁ ∪ α₂ ∪ ... ∪ αₙ.
Subquestions: What is this number for specific values of parameters?
Enumerative Geometry
Main: How many [geometric objects] in [ambient space] satisfy [conditions]?
Subquestions: What is this count for low-degree cases? Is the answer always positive?
Main: Compute the Gromov-Witten invariant ⟨τ_{a₁}(γ₁),...,τ_{aₙ}(γₙ)⟩_{g,β} for [specific data].
Subquestions: What is the value when all aᵢ = 0? When g = 0?
Classification Problems
Main: Classify all [objects] with [property]. Give explicit representatives for each isomorphism class.
Subquestions: How many classes are there? Which have additional property P?
Cohomology Calculations
Main: What is the rank of the cohomology group Hᵏ(M) for [variety/moduli space M]?
Subquestions: What is dim H⁰(M)? Is Hⁿ(M) = 0 for n > d?
Main: Compute H^i(X, L) where X = [variety] and L = [line bundle].
Subquestions: What is h⁰(X, L)? Does Serre duality hold?
Example 1: Stable Graphs
Main question: Find a closed formula for the number N(g) of stable graphs of genus g with no legs and precisely 3 edges, for all g ≥ 2.
Subquestions:
- What is N(3)?
- What is N(8)?
- What is N(1000)?
Example 2: Permutation Representations
Main question: Let G be a finite group. Is the functor Perm: G-sets → Rep_ℂ(G) sending X to its permutation representation fully faithful? Prove or provide a counterexample.
Subquestions:
- Is the statement true for all finite groups?
- Is the statement true for all finite cyclic groups?
- Is the statement true for all finite abelian groups?
Online Submission Process
- Create an account on our submission website: improofbench.math.ethz.ch
- Create a new question using our online interface
- Write your problem following the guidelines and templates provided above
- Test against AI model - Once your text is ready, click the test button to run your problem against the latest frontier AI model (free of charge)
- Review and refine based on AI responses to ensure appropriate difficulty
Brainstorming Sessions
Join our collaborative brainstorming sessions to develop problems with peer feedback:
- Format: Short primer on effective question generation, followed by work on new questions alone or in small groups, with live feedback and discussions
- What to bring: Laptop to access the submission website
- What we'll do: Think of interesting questions following the guidelines above, get real-time feedback from peers, and refine problems collaboratively