Anthropic Sets 2027 Goal for Understanding AI's Inner Workings
Anthropic CEO Dario Amodei is pushing for greater transparency in AI. In a recent essay, he highlighted how little researchers understand about the internal workings of the world's leading AI models. He set an ambitious goal for Anthropic: to reliably detect most AI model problems by 2027.
The Urgency of Understanding AI
Amodei stressed the importance of this research in his essay, "The Urgency of Interpretability." While Anthropic has made initial progress in tracing AI decision-making, he acknowledges the need for significant further research, especially as AI models become more powerful.
"I am very concerned about deploying such systems without a better handle on interpretability," Amodei wrote. "These systems will be central to the economy, technology, and national security. Their potential autonomy makes it unacceptable for humanity to be ignorant of how they work."
This lack of understanding is an industry-wide issue. Even as performance improves, as with OpenAI's new reasoning models, unexpected behaviors such as increased "hallucinations" emerge, and the reasons remain unknown.
Amodei points out that even for simple tasks, such as summarizing a document, researchers cannot explain why an AI model chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate.
Growing AI, Not Building It
Anthropic co-founder Chris Olah describes AI models as being "grown more than they are built." Researchers have found ways to make models more intelligent, but they don't fully understand why those methods work.
Amodei warns of the dangers of reaching Artificial General Intelligence (AGI) without understanding how these models operate. He previously predicted AGI could arrive by 2026 or 2027, but believes a full understanding of AI's inner workings is much further off.
"Brain Scans" for AI
Anthropic's long-term vision involves conducting "brain scans" or "MRIs" of advanced AI models. These analyses would identify potential issues like tendencies to lie, seek power, or other vulnerabilities. While this could take 5-10 years, Amodei believes it's essential for safely testing and deploying future AI models.
Early Breakthroughs and Investments
Anthropic has achieved some breakthroughs in understanding AI. They've developed methods to trace an AI model's thinking pathways through what they call "circuits." While they've only identified a few, they estimate millions exist within AI models.
The company is actively investing in interpretability research, including its first investment in a startup focused on this area. While currently viewed as a safety measure, Amodei believes explaining AI's reasoning could offer future commercial advantages.
A Call for Collaboration and Regulation
Amodei urges OpenAI and Google DeepMind to increase their own interpretability research efforts. He also advocates for "light-touch" government regulation to encourage this research, such as requiring companies to disclose their safety and security practices, and suggests export controls on chips sold to China to reduce the risk of an out-of-control global AI race.
Anthropic's consistent focus on AI safety sets it apart from other major labs. Its support for California's AI safety bill, SB 1047, further demonstrates its commitment to responsible AI development.
Anthropic is championing an industry-wide effort to prioritize understanding AI models, not just increasing their capabilities.