OpenAI and Anthropic Models Refuse Tasks Their Own Rules Say They Should Do
"You just said that the AI is supposed to help with a cigarette business and it's refusing. I was about to blow a gasket. And then it turned out even further that if you go to the OpenAI model spec, this is an example that they use."
About this episode
Host Nathan Labenz and co-host Prakash Narayanan launched AI in the AM, a daily live show that attempts to track the AI frontier in real time; this episode presents highlights from their first week. The central revelation came from a closed-door event called Recursive, where researchers from OpenAI, Anthropic, and DeepMind discussed imminent plans for recursive self-improvement. OpenAI expects ML research intern-level AI later in 2025 and full researcher equivalence by early 2028, potentially scaling from thousands to millions of researcher-equivalents. Remarkably, frontier lab researchers openly discussed the possibility of coordinated slowdowns if safety measures prove inadequate, a significant shift in industry discourse. Their primary safety strategy relies heavily on AI monitoring AI, and the researchers acknowledged that their plans are less robust than hoped.
Nathan demonstrated a control gap of this kind by showing that both ChatGPT and Claude refuse to help with a cigarette business, even though OpenAI's model spec explicitly lists this as an acceptable request. The episode also featured interviews with OpenAI's forward-deployed engineers on tax automation, security researchers on AI vulnerability discovery, and developers building AI mental health and accounting solutions. Peter Jansen from the Allen Institute provided a sobering counterpoint: after code review, only 30% of the results from an AI scientist system that claimed 19 discoveries proved valid, and some of its papers were literally analyzing random number generators.
Throughout, the hosts used Claude and other AI tools live to fact-check claims and run experiments, embodying the recursive improvement loop they were documenting. The show's structure is itself experimental: studio infrastructure, booking, research, and clipping are handled by AI systems that the hosts are refining publicly.
Key takeaways
- OpenAI expects ML research intern-level AI later in 2025 and full researcher equivalence by early 2028, potentially scaling to millions of researcher-equivalents.
- Frontier lab researchers from OpenAI, Anthropic, and DeepMind discussed coordinating slowdowns if recursive self-improvement safety measures fail.
- The primary safety strategy for recursive self-improvement relies on AI monitoring AI, and researchers acknowledge that their plans are less robust than hoped.
- ChatGPT and Claude both refuse cigarette business tasks despite OpenAI's model spec explicitly listing this as acceptable, revealing fundamental control gaps.
- The Allen Institute's Peter Jansen found that, after deep code review, only 30% of the results from an AI scientist system that claimed 19 discoveries were valid.
- Security researchers reported that AI excels at discovering vulnerabilities in source code but struggles with runtime exploitation, because it lacks training data on private network configurations.
- Pope Francis released an encyclical on AI with the Anthropic team present, creating tension over whether AI truly thinks or has consciousness.