
OpenAI Researcher Warns Current AI Safety Evaluations Miss Critical Capability Scaling

No Priors Podcast · Really Big Test-Time Compute in AI Changes Benchmarks, Safety and Research with OpenAI Research Scientist Noam Brown · June 29, 2026
"The preparedness frameworks and responsible scaling policies, they don't really account for the amount of test-time compute. They just say, okay, well, what's the capability of the model? The problem is we're in a world now where the capability of the model is a function of how much money you put into it, basically. If you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10. If you give it a budget of $10 million, it can do even more."
Noam Brown, an OpenAI researcher, said that existing AI safety frameworks, developed during the ChatGPT era, fail to account for test-time compute scaling: the same model can reach dramatically different capability levels depending on its inference budget. This creates a blind spot in safety evaluations, because dangerous capabilities could emerge at high budgets that standard testing at lower budgets would not detect.

About this episode

On this episode of No Priors, host Sarah Guo interviews Noam Brown, an OpenAI researcher who pioneered inference-time scaling techniques, about the broken state of AI model evaluations and the implications of large-scale test-time compute. Brown argues that current benchmarking practices ignore the fact that a modern model's capabilities are a function of inference budget rather than a fixed property of the model, which makes comparisons misleading and safety evaluations inadequate. He said that existing responsible scaling policies and preparedness frameworks, developed during the ChatGPT era, do not specify how much test-time compute to allocate when evaluating dangerous capabilities, a critical blind spot given that a model can perform dramatically differently at a $10 budget than at a $10 million one.

Brown said that an internal OpenAI model recently disproved the Erdős unit distance conjecture at minimal cost, and that publicly available models such as GPT-5.5 contain significant unexplored capabilities because the rapid release cycle means nobody runs a model long enough to discover its limits. He also said OpenAI is deliberately discouraging internal researchers from solving open problems in mathematics and physics so that they focus on building more capable models faster.

The conversation also explored recursive self-improvement. Brown argued against fears of an overnight intelligence explosion, because large-scale test-time compute creates a time bottleneck. He noted that current models lack research taste and cannot yet fully replace researchers, though they dramatically accelerate certain tasks such as code optimization. Brown predicted that within 6 to 12 months models will be capable of completing PhD-level work zero-shot, and he emphasized the need for evaluation practices that plot performance against inference budget rather than reporting single benchmark scores.
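To make the idea of reporting performance against inference budget concrete, here is a minimal sketch in Python. It is not a method described in the episode. The `EvalPoint` type, the function names, and every number in it are hypothetical, chosen only to mirror the $10, $10,000, and $10 million budgets Brown mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPoint:
    budget_usd: float  # inference spend allowed per task (hypothetical field)
    pass_rate: float   # fraction of tasks solved at that budget (hypothetical field)


def capability_curve(points):
    """Return the points sorted by budget, so the result reads as a curve."""
    return sorted(points, key=lambda p: p.budget_usd)


def min_budget_crossing(points, threshold):
    """Smallest tested budget whose pass rate meets the threshold, or None."""
    for p in capability_curve(points):
        if p.pass_rate >= threshold:
            return p.budget_usd
    return None


# Made-up numbers for illustration only; not measurements of any real model.
results = [
    EvalPoint(10_000, 0.42),
    EvalPoint(10, 0.05),
    EvalPoint(10_000_000, 0.81),
]

# Report the whole curve instead of a single benchmark score.
for p in capability_curve(results):
    print(f"${p.budget_usd:>12,.0f}  pass rate {p.pass_rate:.0%}")

# Lowest tested budget at which a (hypothetical) 40% threshold is reached.
print(min_budget_crossing(results, threshold=0.40))
```

Under these assumptions, a safety evaluation would ask at which budget a capability threshold is first crossed, not whether a single fixed-budget run crosses it. A `None` result would mean only that no tested budget reached the threshold, not that the capability is absent at higher budgets.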
