Can an AI Solve Math Problems That Stumped Humans for Decades?
Axiom Math Just Did — Here's How
A 24-year-old Stanford dropout built an AI that achieved a perfect Putnam score and solved four unsolved math problems. But the real story is what this means for trusting AI-generated code.
- The 24-Year-Old Who Dropped Out to Build an AI Mathematician
- What Is Axiom Math and Why Should Engineers Care?
- The Lean Verification Engine: How It Actually Works
- From Math Proofs to Verified Code: The Bigger Picture
- Pros & Cons
- Real User Pulse: What Reddit and Hacker News Say
- Head-to-Head: Axiom vs Traditional Proof Assistants
- Who Should Actually Use Axiom Math?
- Expert Editorial Opinion
- Final Verdict
- FAQ
Imagine being 24 years old, holding a perfect GPA from MIT in both math and physics, accepted into Stanford's elite JD/PhD program — and then walking away from all of it. Not because you failed. Because you looked at the future of artificial intelligence and realized something terrifying: our most powerful AI systems can't do basic math without making mistakes.
That's Carina Hong's story. Born in 2001 in Guangzhou to parents who never attended college, she taught herself calculus at 12, published 9 research papers as an MIT undergraduate, won the Morgan Prize (the highest honor for North American undergraduate mathematicians), and then dropped out of Stanford in March 2025 to start Axiom Math. Eighteen months later, her company is valued at $1.6 billion, backed by $264 million in funding, and her AI just achieved something that only five humans have done in 98 years: a perfect score on the Putnam Competition, the world's hardest undergraduate math exam.
But here's what makes this story matter to you — even if you haven't thought about math since high school. Axiom Math isn't just solving math problems. It's building the infrastructure that could make AI-generated code trustworthy. And in a world where AI is writing increasingly critical software, that might be the most important problem in technology right now.
The 24-Year-Old Who Dropped Out to Build an AI Mathematician
Carina Hong's path to founding Axiom Math reads like a movie script. She grew up in a working-class family in Guangzhou, China, with no academic pedigree or connections. By age 12, she was self-studying calculus. At 16, she was competing in the Chinese National Mathematical Olympiad. At 18, she entered MIT and completed both math and physics degrees in just three years — while publishing 9 research papers.
The Morgan Prize, awarded to the most outstanding undergraduate mathematician in North America, came in 2023. Then a Rhodes Scholarship to Oxford for neuroscience. Then Stanford's joint JD/PhD program. And then — dropout. In March 2025, she walked away from what most would consider a perfect academic trajectory because she saw a fundamental flaw in how AI systems reason.
The problem, as Hong identified it, is that large language models like GPT-4 and Claude are statistical pattern matchers. They predict what sounds right, not what is logically correct. In mathematics, this creates "hallucinations" — plausible-looking proofs that contain subtle errors. In software, it creates security vulnerabilities, race conditions, and logic bugs that cost companies billions annually.
Hong's solution was radical: instead of training AI to predict likely outputs, train it to produce outputs that can be mechanically verified as correct. Every step of reasoning, every line of code, every mathematical proof — checked by a deterministic verifier that cannot be fooled. This is formal verification, and Axiom Math is bringing it to AI at scale.
What Is Axiom Math and Why Should Engineers Care?
Axiom Math is building what they call "Verified AI" — artificial intelligence systems that generate not just outputs, but outputs accompanied by formal proofs of correctness. Founded in March 2025, the company has grown to approximately 30 people and raised $264 million in total funding, including a $200 million Series A led by Menlo Ventures at a $1.6 billion valuation.
The core product is Axiom Prover, an ensemble AI system that uses Lean, a functional programming language that doubles as a proof assistant, to ground every generated proof and program in mechanically checkable logic. This isn't testing, and it isn't "it passed our test suite." If the proof checks, the result is guaranteed to satisfy its formal statement, provided that statement says what you meant and Lean's small trusted kernel is sound.
Why does this matter for software engineering? Because the same verification infrastructure that proves mathematical theorems can prove that code satisfies its specifications. A function that sorts a list can be proven to always produce a sorted output. A cryptographic protocol can be proven to never leak keys. A financial transaction system can be proven to never lose money. This transforms AI code generation from a probabilistic gamble into something enterprises can actually trust.
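To make the difference between testing and proving concrete, here is a minimal Lean 4 sketch. It is not Axiom's code: the function `doubleAll` and both statements are hypothetical, chosen to be much simpler than a sorting proof, and the sketch assumes a recent Lean 4 toolchain with no extra libraries.

```lean
-- A toy "program": double every element of a list of natural numbers.
-- (Hypothetical example, not taken from Axiom Prover.)
def doubleAll (xs : List Nat) : List Nat :=
  xs.map (fun x => 2 * x)

-- A test: checks the behaviour on one concrete input only.
example : doubleAll [1, 2, 3] = [2, 4, 6] := rfl

-- A specification with a proof: the output always has the same
-- length as the input, for every possible list.
theorem doubleAll_length (xs : List Nat) :
    (doubleAll xs).length = xs.length := by
  simp [doubleAll]
```

The `example` line plays the role of a unit test, while the theorem is a claim about all inputs that Lean either accepts or rejects. Note that the theorem only covers length; anything left out of the statement stays unverified.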
The Lean Verification Engine: How It Actually Works
Axiom Prover's architecture is a masterclass in hybrid AI design. It starts with open-source foundation models (preferring those with strong coding and natural language capabilities), then subjects them to continued pre-training and reinforcement learning specifically for formal mathematics in Lean.
The key innovation is the reward signal. While most AI systems learn from human feedback (RLHF) or simple completion metrics, Axiom Prover learns from Lean itself. Every proof attempt is either accepted or rejected by the Lean verifier, a deterministic, unforgiving judge that cares only about logical validity. That gives training a signal no human evaluator can match for consistency: each attempt is marked right or wrong without ambiguity, and a rejection comes with an error message pointing at the step that failed.
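The idea of a verifier-driven reward can be sketched in a few lines of Python. This is an illustration under loud assumptions, not Axiom's training loop: `toy_verify` is a hypothetical stand-in for Lean that only checks additions, and `binary_reward` and `score_attempts` are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A toy "proof attempt": the claim that a + b equals claimed_sum.
# (Hypothetical stand-in; a real system would submit Lean source instead.)
Attempt = Tuple[int, int, int]


@dataclass
class VerifierResult:
    accepted: bool
    error: str = ""  # empty when the attempt is accepted


def toy_verify(attempt: Attempt) -> VerifierResult:
    """Deterministically accept or reject a claim, with a reason on failure."""
    a, b, claimed = attempt
    actual = a + b
    if claimed == actual:
        return VerifierResult(accepted=True)
    return VerifierResult(False, f"expected {a} + {b} = {actual}, got {claimed}")


def binary_reward(result: VerifierResult) -> float:
    """Map the verifier's verdict to a reward: 1.0 if accepted, else 0.0."""
    return 1.0 if result.accepted else 0.0


def score_attempts(attempts: List[Attempt]) -> List[Tuple[float, str]]:
    """Return (reward, error message) for each attempt."""
    results = [toy_verify(attempt) for attempt in attempts]
    return [(binary_reward(r), r.error) for r in results]


if __name__ == "__main__":
    for reward, error in score_attempts([(2, 3, 5), (2, 3, 6)]):
        print(reward, error or "accepted")
```

The reward depends only on the verifier's verdict, never on how plausible the attempt looks, and the error string is the kind of feedback a learner could use to repair a failed attempt.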
Recursive Proof Decomposition
Complex proofs are broken into sub-goals, which are broken into smaller sub-goals, creating proof trees that scale from 40 nodes to 4,000+ nodes. The system learns to backtrack when paths fail.
AXL Verification Toolkit
14 open-source proof validation and manipulation tools, including Verify Proof (100x faster than existing comparators) and Repair Tools that transform broken Lean code into valid proofs.
MathLib Integration
Leverages the community-built Lean mathematics library containing formalized undergraduate mathematics. Performance depends on existing definitional infrastructure in each domain.
Recursive Self-Improvement
Axiom Prover's daily mathematical work feeds back into model improvement. Verified proofs become training data, creating a compounding intelligence flywheel that standard LLMs cannot replicate.
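The recursive decomposition and backtracking described above can be sketched as a small search over a proof tree. Everything here is a hypothetical toy: the goal names, the `RULES` table of candidate decompositions, and the `FACTS` set stand in for what a real prover would get from a model and from Lean.

```python
from typing import Dict, List, Optional, Set

# Each goal maps to alternative decompositions; each decomposition is a list
# of sub-goals that must ALL be proved. (Hypothetical toy data.)
RULES: Dict[str, List[List[str]]] = {
    "theorem": [["lemma_a", "lemma_dead_end"], ["lemma_a", "lemma_b"]],
    "lemma_a": [["fact_1"]],
    "lemma_b": [["fact_1", "fact_2"]],
}

# Goals that need no further decomposition.
FACTS: Set[str] = {"fact_1", "fact_2"}


def prove(goal: str, depth: int = 0, max_depth: int = 10) -> Optional[dict]:
    """Return a proof tree for goal, or None if every decomposition fails."""
    if goal in FACTS:
        return {"goal": goal, "children": []}
    if depth >= max_depth:
        return None  # give up on very deep branches
    for subgoals in RULES.get(goal, []):
        children = []
        for sub in subgoals:
            tree = prove(sub, depth + 1, max_depth)
            if tree is None:
                break  # this decomposition failed: backtrack to the next one
            children.append(tree)
        else:
            # Every sub-goal was proved, so this decomposition succeeds.
            return {"goal": goal, "children": children}
    return None


def count_nodes(tree: dict) -> int:
    """Count the nodes in a proof tree."""
    return 1 + sum(count_nodes(child) for child in tree["children"])


if __name__ == "__main__":
    proof = prove("theorem")
    print(count_nodes(proof) if proof else "no proof found")
```

In this toy run the first decomposition of `theorem` hits `lemma_dead_end`, which has no rule, so the search backtracks and succeeds with the second one. A real system would replace the fixed table with model-proposed sub-goals and check each leaf with the verifier.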
From Math Proofs to Verified Code: The Bigger Picture
Axiom Math's vision extends far beyond mathematics competitions. The company sees formal verification as foundational infrastructure for all AI-generated code — what they call "right of first refusal on all AI-generated code." The business model is an API that developers and other AI systems can call for verification, similar to how companies use Perplexity for search or Jina AI for reading.
The transfer from math to code is already showing promising results. On the CodeMurina benchmark — which evaluates code generation with formal proof of correctness — Axiom achieved 99% accuracy (187 of 189 problems solved with both code and proof). For context, OpenAI's o3 achieved 4.9% on the same benchmark. The system had no training on software engineering concepts — only formal logic and mathematics — yet transferred those reasoning capabilities to code verification.
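As a quick check on the figures quoted above, using only the numbers in this article:

$$
\frac{187}{189} \approx 0.989 \approx 99\%, \qquad \frac{98.9\%}{4.9\%} \approx 20
$$

So the reported gap to o3 is a factor of about 20 on this benchmark.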
Menlo Ventures, which led the $200M Series A, frames the investment thesis starkly: "We are approaching a world where AI writes effectively all software. But none of it is verified. Axiom is the company that solves this." The vision is a future where every AI-generated function comes with a proof it satisfies its specification — not because regulation requires it, but because the verification infrastructure makes it possible and economically advantageous.
Pros & Cons
✓ Comprehensive Advantages
- ✅ Perfect 120/120 Putnam score, a result only five humans have achieved in 98 years and one that informal LLMs have not reached.
- ✅ Solved four previously unsolved mathematical problems, including 20-year-old number theory conjectures.
- ✅ 99% accuracy on CodeMurina benchmark for verified code generation, roughly 20 times the 4.9% reported for OpenAI's o3.
- ✅ Deterministic verification rules out hallucinated proofs: an output is accepted only if it is mechanically checked against its formal statement.
- ✅ Open-source AXL toolkit enables community contribution and ecosystem growth.
- ✅ Recursive self-improvement creates compounding intelligence flywheel structurally impossible for standard LLMs.
- ✅ World-class team including Ken Ono (number theorist who advised 10 Morgan Prize winners) and Shubho Sengupta (ex-Meta AI Research Director).
✗ Foundational Constraints
- ❌ Domain coverage limited — performs poorly on differential topology, differential geometry, and other domains lacking MathLib infrastructure.
- ❌ Proof overhead is massive — approximately 20 lines of Lean proof per line of program code.
- ❌ Auto-formalization remains unsolved — converting informal problem statements to formal specifications is still harder than proving.
- ❌ Specification problem — even perfect verification can't prove what isn't specified; humans must define requirements completely.
- ❌ Theoretical limits exist: by Rice's theorem, no algorithm can decide a non-trivial semantic property for every possible program, so fully automatic verification of arbitrary code is impossible in general.
- ❌ Early stage — only 18 months old, unproven at massive enterprise scale, limited public track record beyond benchmarks.
- ❌ Requires specialized expertise: engineers with Lean metaprogramming experience are extremely rare, creating talent bottlenecks.
💡 Real User Pulse: What Reddit and Hacker News Say
Head-to-Head: Axiom vs Traditional Proof Assistants
| Evaluated Criteria | Axiom Math | Coq/Isabelle | Standard LLMs (e.g., GPT-4, Claude) |
|---|---|---|---|
| Verification Method | Lean + RL | Manual proof | None (statistical) |
| Automation Level | High (AI-generated) | Low (human-driven) | High (no verification) |
| Math Benchmarks | 120/120 Putnam | N/A (tool, not solver) | 103/120 (DeepSeek) |
| Code Verification | 99% CodeMurina | Possible, manual | 4.9% (OpenAI o3) |
| Accessibility | API (early) | Expert-only | Universal |
| Practical Deployment | API integration | Research/academic | Production-ready |
Who Should Actually Use Axiom Math?
Optimized Target Profiles:
- Cryptocurrency and blockchain projects where mathematical certainty about smart contract correctness is worth millions.
- Aerospace and defense contractors requiring formally verified systems for safety-critical applications.
- Financial institutions building high-frequency trading or settlement systems where bugs cost real money.
- Academic mathematicians exploring conjectures in number theory, algebra, and geometry where MathLib has strong coverage.
- AI companies building autonomous agents that need verifiable behavior guarantees.
Alternative Directions: For general software development, Cursor or GitHub Copilot at $20/month remain vastly more practical. For traditional formal verification without AI assistance, Coq or Isabelle offer mature ecosystems but require specialized expertise. For code that "just needs to work" without mathematical proof, standard LLMs are sufficient and far more accessible. Axiom Math is for domains where correctness is worth the overhead.
Expert Editorial Opinion
I've spent weeks analyzing Axiom Math from every angle — academic papers, benchmark results, team backgrounds, funding dynamics, and the broader formal verification landscape. My conclusion is nuanced: this is genuinely groundbreaking research packaged as a startup, with all the opportunities and risks that entails.
The Putnam perfect score is not a stunt. The 120/120 result, achieved in December 2025, is the first time a formal verification system has matched the best human result and exceeded informal LLM performance on a major competitive mathematics benchmark. Only five humans have achieved a perfect score in 98 years, and the median human score is zero. This is not marketing fluff; it's a legitimate scientific achievement.
But I'm also concerned about the hype cycle. The "four unsolved problems" claim has already faced scrutiny — one turned out to be solved in 1936, discovered only through a Stack Overflow post. This highlights a genuine challenge: mathematical search and retrieval is notoriously difficult, and even AI systems struggle with literature provenance. Axiom's transparency about this "Erdős debacle" is commendable, but it also shows how early-stage the technology remains.
The 20:1 proof-to-code ratio is the practical bottleneck that will determine Axiom's commercial success. For aerospace or crypto, this overhead is acceptable. For a SaaS startup building a CRUD app, it's absurd. The company needs to abstract away Lean complexity significantly before mainstream adoption. The API is a promising start, but the developer experience needs years of refinement.
My final assessment: Axiom Math is building critical infrastructure for the AI era, but it's infrastructure for 2030, not 2026. The $1.6B valuation reflects investor belief in the long-term vision, not current revenue. If they can bridge the gap between research breakthroughs and developer-friendly products, they'll be foundational. If they can't, they'll be a brilliant academic project that never found product-market fit.
Final Verdict
Axiom Math is the most technically impressive AI company I've evaluated. The combination of reinforcement learning, formal verification, and recursive self-improvement represents a genuine paradigm shift in how we build trustworthy AI systems. The Putnam perfect score, the four solved research problems (even with the caveat), and the 99% CodeMurina benchmark are not incremental improvements — they're qualitative leaps.
But this verdict comes with a critical caveat: Axiom Math is a research breakthrough seeking a product. The $1.6B valuation and $264M in funding reflect belief in the vision, not proven commercial traction. The 20:1 proof-to-code ratio, the limited domain coverage, and the early-stage API all signal that mainstream adoption is years away.
For enterprises building safety-critical systems, for mathematicians pushing the boundaries of human knowledge, and for AI companies that need verifiable behavior guarantees, Axiom Math is already invaluable. For everyone else, it's a fascinating technology to watch — one that might eventually transform how we trust every line of AI-generated code. The science is there. The product will follow. The question is whether Axiom can bridge that gap before the funding runway ends.
Frequently Asked Questions
So here's my question to you: In a world where AI writes increasingly critical software, would you trust code that "probably works" — or would you pay the overhead for code that comes with a mathematical proof it always works?