IMProofBench - AI Benchmark for Mathematical Reasoning

IMProofBench is a benchmark designed to evaluate AI systems' ability to generate rigorous mathematical proofs at research level. Unlike benchmarks that focus on questions with unique (e.g. numerical) answers, we test whether AI can produce complete, logically sound arguments that would satisfy the standards of peer-reviewed mathematics journals. The benchmark maintains a private problem set to ensure unbiased evaluation and prevent overfitting.

Pilot Project Goals (Summer 2025)

Collect 25-50 high-quality problems by end of July
Test against frontier AI models (ChatGPT o3, Claude Opus 4, Gemini 2.5 Pro, etc.)
Submit proof-of-concept paper to ICLR 2026; all contributors have the option to become co-authors
Focus initially on algebraic geometry and related fields

What Makes a Good Problem?

Required Characteristics

PhD-level or above difficulty: Suitable for qualifying exams, research papers, or advanced seminars
Requires genuine mathematical insight: Not solvable by routine application of known algorithms
Clear proof-based main question: The answer should be a complete mathematical argument, not just a number
Generates 2-3 unique-answer subquestions: For automated evaluation (e.g., "Is the statement true for n=5?", "What is the rank of this group?")

What to Avoid

Problems solvable by pattern matching or lucky guessing
Standard textbook exercises (even from graduate texts)
Purely computational problems that Mathematica/SageMath can solve directly

Quick Problem Primer

Take 5 minutes to brainstorm! Think about:

A tricky calculation from your recent work that required a clever insight
An "obvious" statement that actually needs a non-trivial proof
A self-contained lemma that came up in a research project
An oral exam question for an advanced course that you would ask an excellent student
A classification problem with a finite but non-obvious answer

Example thought process:
"In my last paper, I needed to show that a certain intersection number was zero. The calculation looked routine but required understanding how different strata interact in the compactification. This could make a good problem..."

Problem Templates

To give you some ideas, here is a list of templates of questions in algebraic geometry that could be suitable for the benchmark:

Intersection Theory

Main: Let X be [variety]. Compute the class of [specific cycle] in the Chow ring A*(X).
Subquestions: What is the degree of this class? Does it vanish in A²(X)?

Main: For the moduli space M of [objects], compute a closed formula for the intersection number ∫_M α₁ ∪ α₂ ∪ ... ∪ αₙ.
Subquestions: What is this number for specific values of parameters?

Enumerative Geometry

Main: How many [geometric objects] in [ambient space] satisfy [conditions]?
Subquestions: What is this count for low-degree cases? Is the answer always positive?

Main: Compute the Gromov-Witten invariant ⟨τ_{a₁}(γ₁),...,τ_{aₙ}(γₙ)⟩_{g,β} for [specific data].
Subquestions: What is the value when all aᵢ = 0? When g = 0?

Classification Problems

Main: Classify all [objects] with [property]. Give explicit representatives for each isomorphism class.
Subquestions: How many classes are there? Which have additional property P?

Cohomology Calculations

Main: What is the rank of the cohomology group Hᵏ(M) for [variety/moduli space M]?
Subquestions: What is dim H⁰(M)? Is Hⁿ(M) = 0 for n > d?

Main: Compute H^i(X, L) where X = [variety] and L = [line bundle].
Subquestions: What is h⁰(X, L)? Does Serre duality hold?

Full Example Problems

Example 1: Stable Graphs

Main question: Find a closed formula for the number N(g) of stable graphs of genus g with no legs and precisely 3 edges, for all g ≥ 2.

Subquestions:

What is N(3)?
What is N(8)?
What is N(1000)?

Example 2: Permutation Representations

Main question: Let G be a finite group. Is the functor Perm: G-sets → Rep_ℂ(G) sending X to its permutation representation fully faithful? Prove or provide a counterexample.

Subquestions:

Is the statement true for all finite groups?
Is the statement true for all finite cyclic groups?
Is the statement true for all finite abelian groups?

How to Submit

Online Submission Process

Create an account on our submission website: improofbench.math.ethz.ch
Create a new question using our online interface
Write your problem following the guidelines and templates provided above
Test against AI model - Once your text is ready, click the test button to run your problem against the latest frontier AI model (free of charge)
Review and refine based on AI responses to ensure appropriate difficulty

Brainstorming Sessions

Join our collaborative brainstorming sessions to develop problems with peer feedback:

Format: Short primer on effective question generation, followed by work on new questions alone or in small groups, with live feedback and discussions
What to bring: Laptop to access the submission website
What we'll do: Think of interesting questions following the guidelines above, get real-time feedback from peers, and refine problems collaboratively