IMProofBench is a benchmark designed to evaluate AI systems' ability to generate rigorous mathematical proofs at research level.
Unlike benchmarks that focus on questions with unique (e.g. numerical) answers, we test whether AI can produce complete, logically sound arguments that would satisfy the standards of peer-reviewed mathematics journals.
The benchmark maintains a private problem set to ensure unbiased evaluation and prevent overfitting.
Take 5 minutes to brainstorm! Think about:
- A tricky calculation from your recent work that required a clever insight
- An "obvious" statement that actually needs a non-trivial proof
- A self-contained lemma that came up in a research project
- An oral exam question for an advanced course that you would ask an excellent student
- A classification problem with a finite but non-obvious answer
Example thought process:
"In my last paper, I needed to show that a certain intersection number was zero. The calculation looked routine but required understanding how different strata interact in the compactification. This could make a good problem..."
To give you some ideas, here is a list of templates of questions in algebraic geometry that could be suitable for the benchmark:
Intersection Theory
Main: Let X be [variety]. Compute the class of [specific cycle] in the Chow ring A*(X).
Subquestions: What is the degree of this class? Does it vanish in A²(X)?
Main: For the moduli space M of [objects], compute a closed formula for the intersection number ∫_M α₁ ∪ α₂ ∪ ... ∪ αₙ.
Subquestions: What is this number for specific values of parameters?
Enumerative Geometry
Main: How many [geometric objects] in [ambient space] satisfy [conditions]?
Subquestions: What is this count for low-degree cases? Is the answer always positive?
Main: Compute the Gromov-Witten invariant ⟨τ_{a₁}(γ₁),...,τ_{aₙ}(γₙ)⟩_{g,β} for [specific data].
Subquestions: What is the value when all aᵢ = 0? When g = 0?
Classification Problems
Main: Classify all [objects] with [property]. Give explicit representatives for each isomorphism class.
Subquestions: How many classes are there? Which have additional property P?
Cohomology Calculations
Main: What is the rank of the cohomology group Hᵏ(M) for [variety/moduli space M]?
Subquestions: What is dim H⁰(M)? Is Hⁿ(M) = 0 for n > d?
Main: Compute H^i(X, L) where X = [variety] and L = [line bundle].
Subquestions: What is h⁰(X, L)? Does Serre duality hold?
Example 1: Stable Graphs
Main question: Find a closed formula for the number N(g) of stable graphs of genus g with no legs and precisely 3 edges, for all g ≥ 2.
Subquestions:
- What is N(3)?
- What is N(8)?
- What is N(1000)?
Example 2: Permutation Representations
Main question: Let G be a finite group. Is the functor Perm: G-sets → Rep_ℂ(G) sending X to its permutation representation fully faithful? Prove or provide a counterexample.
Subquestions:
- Is the statement true for all finite groups?
- Is the statement true for all finite cyclic groups?
- Is the statement true for all finite abelian groups?
During the brainstorming session:
- Join our collaborative Overleaf project: link
- Create your problem file using the provided template
- When ready for AI testing, notify in chat so that organizers can run it on several frontier AI models
- Review AI responses and refine your problem as needed
For submission outside of an in-person session:
- Fill out the submission form: [link to be provided]
- Include complete problem statement, solution sketch, and grading rubric
- You'll receive confirmation and can track your contribution
Question Brainstorming Session
When: Tuesday, 3rd March, 15:00 - 16:00
Where: Les Diablerets, Hotel Les Sources, Salle Glacier
What to bring: Laptop
Resources:
- Overleaf project: link
- LaTeX submission template: link
- Submission form: [link]