OpenAI Researcher Claims Safety Evaluations Ignore Test-Time Compute Scaling Risk

No Priors Podcast · Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown · June 26, 2026

"The preparedness frameworks and responsible scaling policies, they don't really account for the amount of test-time compute. They just say, okay, well, what's the capability of the model? The problem is we're in a world now where the capability of the model is a function of how much money you put into it, basically. If you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10. If you give it a budget of $10 million, it can do even more. And so at what budget should you evaluate these models? The policies that exist today don't really address that question."

Noam Brown, an AI researcher at OpenAI, said that current AI safety frameworks from major labs do not account for test-time compute scaling when evaluating dangerous capabilities. He argues that what a model can accomplish depends heavily on its inference budget, which can range from $10 to $10 million, so an evaluation run at a single fixed budget can be misleading. This gap means models could have latent dangerous capabilities that are never tested at a sufficiently large budget.

About this episode

On this episode of No Priors, host Sarah Guo interviews Noam Brown, an AI researcher at OpenAI who pioneered inference-time scaling and reasoning approaches. Brown argues forcefully that the AI industry has a broken model evaluation system that fails to account for test-time compute, making benchmark comparisons misleading and safety evaluations inadequate. He reveals that modern models like GPT-4.5 can think productively for weeks before performance plateaus on complex tasks, unlike earlier models that peaked quickly, creating a fundamental problem: the capability of a model is now a function of how much money you spend on inference, from $10 to $10 million.

Brown discloses that OpenAI used an unreleased internal model to disprove the Erdős unit distance conjecture, a longstanding mathematics problem, at minimal cost, and that GPT-4.5 could likely do the same with proper scaffolding for $1,000 to $100,000. He warns that current AI safety frameworks from all major labs don't account for this scaling dynamic, meaning models could have dangerous latent capabilities that aren't being tested at sufficient compute budgets.

Brown also reveals OpenAI deliberately discourages internal researchers from using advanced models to solve open scientific problems, preferring to focus on developing more capable models for public release. He remains skeptical of overnight intelligence explosion scenarios, arguing time itself is the fundamental bottleneck because models require extended compute periods to reach peak capability. The conversation covers Brown's personal use of models for tasks from poker solver development to tax advice, his belief that models still lack research taste, and his call for the industry to abandon single-number benchmark grids in favor of performance plotted against inference budget.

Key takeaways

Brown said that current AI safety evaluation frameworks don't account for test-time compute scaling, creating potential blind spots for dangerous capabilities at high inference budgets.
Modern AI models keep improving over weeks of compute time before plateauing, far longer than earlier models, making full capability assessment impossible within typical release cycles.
OpenAI used an unreleased internal model to disprove the Erdős unit distance conjecture at minimal cost, demonstrating significant latent mathematical capability.
Brown argues that single-number benchmark comparisons are misleading because they don't control for inference compute, which makes GPT-4.5 appear only marginally better than GPT-4.4.
OpenAI deliberately discourages internal researchers from using advanced models to solve open scientific problems, focusing instead on developing more capable models for public release.
Brown remains skeptical of overnight intelligence explosion scenarios, arguing time is a fundamental bottleneck because models require extended periods to reach peak capability.
He disclosed that GPT-4.5 and similar models likely have significant unexplored capability because researchers haven't tested them at very large inference budgets like $100,000 per task.
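
Brown's call to report performance against inference budget, rather than as a single number, can be sketched roughly as follows. This is an illustrative Python sketch and not anything described in the episode: the function name `cheapest_budget_reaching`, the 0.5 threshold, and the scores in `example_curve` are all hypothetical placeholders, not measured results.

```python
from typing import Dict, Optional


def cheapest_budget_reaching(curve: Dict[float, float],
                             threshold: float) -> Optional[float]:
    """Return the smallest tested budget (USD) whose score meets the
    threshold, or None if no tested budget reaches it."""
    for budget in sorted(curve):
        if curve[budget] >= threshold:
            return budget
    return None


# Hypothetical budget -> score pairs, invented purely for illustration.
example_curve = {10: 0.20, 1_000: 0.45, 100_000: 0.70}

# A hypothetical capability threshold that a policy might care about.
print(cheapest_budget_reaching(example_curve, threshold=0.5))
```

With these placeholder numbers, an evaluation capped at $1,000 would report the threshold as unmet, while one run at $100,000 would report it as met. That difference is the gap Brown describes.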
