A New Frontier in AI Reasoning
Anthropic announced the release of Claude 4.5 on Sunday, claiming the new AI model achieves human-level performance on a range of complex reasoning benchmarks that have long served as milestones for artificial intelligence research. The release represents a significant leap from the previous Claude 4 model and has immediately sparked intense debate within the AI research community about the nature of machine reasoning and the pace of progress toward artificial general intelligence.
The company revealed that Claude 4.5 scores 92% on a newly developed graduate-level reasoning benchmark called GLAR (Graduate-Level Analytical Reasoning), compared to an average score of 89% among human PhD holders and 78% for the previous Claude 4 model. The benchmark includes complex multi-step problems in mathematics, logic, scientific reasoning, legal analysis, and ethical dilemmas.
Key Capabilities
Claude 4.5 introduces several new capabilities that distinguish it from previous models:
- Extended reasoning chains: The model can maintain coherent reasoning over sequences of 50+ logical steps, enabling it to tackle problems that require sustained analytical thinking
- Self-correction: Claude 4.5 demonstrates improved ability to identify errors in its own reasoning and correct them without external prompting
- Uncertainty quantification: The model provides calibrated confidence scores for its outputs, allowing users to gauge the reliability of its responses
- Real-time learning: A new feature called "Contextual Adaptation" allows the model to improve its performance during extended conversations by building on earlier exchanges
- Multimodal reasoning: Enhanced ability to reason across text, images, code, and mathematical notation simultaneously
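The "calibrated confidence scores" mentioned above refer to a standard, measurable property: a model is well calibrated when, among answers it labels 80% confident, roughly 80% are actually correct. One common way to quantify this is expected calibration error (ECE). The sketch below is illustrative only — the data is made up, and nothing here reflects how Anthropic computes or exposes confidence scores:

```python
# Expected calibration error (ECE): bin predictions by stated confidence,
# then compare each bin's average confidence to its empirical accuracy.
# Illustrative sketch — not Anthropic's methodology.

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over (confidence, correct) pairs; lower means better calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bin's calibration gap by its share of the predictions.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# A model that says 0.8 but is right only half the time has a gap of 0.3.
print(expected_calibration_error([0.8, 0.8], [True, False]))  # → 0.3
```

A score near zero means the model's stated confidence can be taken at face value, which is what makes such scores useful for gauging reliability.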
"Claude 4.5 represents a qualitative shift in what AI systems can do. We're not just seeing incremental improvement — we're seeing the emergence of reasoning capabilities that are genuinely comparable to expert human performance," said Dario Amodei, CEO of Anthropic.
Benchmark Performance
Anthropic released detailed benchmark results alongside the model launch, showing impressive performance across a range of evaluations. On the MATH benchmark, Claude 4.5 achieves 96.8% accuracy, surpassing the previous state of the art by four percentage points. On the MMLU professional-level subset, it scores 93.5%, and on coding benchmarks including SWE-bench, it resolves 68% of real-world GitHub issues autonomously.
Perhaps most notably, on the ARC-AGI benchmark — designed to test genuine reasoning ability rather than pattern matching — Claude 4.5 achieves a score of 85%, the highest ever recorded by any AI system and significantly above the 50% threshold that benchmark creator Francois Chollet has described as evidence of true reasoning capability.
Safety and Alignment
Anthropic has long positioned itself as the "safety-first" AI company, and Claude 4.5's release is accompanied by an extensive safety report. The company says the model underwent 18 months of alignment training and red-teaming before release. New safety features include improved refusal of harmful requests, better handling of ambiguous ethical scenarios, and enhanced privacy protections.
The model also introduces what Anthropic calls "Constitutional AI 2.0," an updated version of the company's alignment methodology that incorporates feedback from over 1,000 domain experts across ethics, law, medicine, and other sensitive fields. The system is designed to ensure that the model's increased capabilities do not come at the expense of safety.
Industry Reaction
The release has generated immediate reactions across the AI industry. OpenAI acknowledged the achievement while noting that benchmark performance does not necessarily translate to real-world utility. Google DeepMind researchers expressed interest in the technical details while questioning some of the benchmark methodologies. Meta's AI team, which recently released its own Llama 4 family of models, described the release as "impressive but not unexpected given the trajectory of the field."
Independent AI researchers have been more varied in their assessments. Some have praised the benchmark results as genuine progress, while others caution that performance on standardized tests may not reflect the kind of flexible, creative reasoning that characterizes human intelligence. The debate over whether AI systems truly "reason" or merely simulate reasoning through sophisticated pattern matching continues to divide the research community.
Availability and Pricing
Claude 4.5 is available immediately through Anthropic's API with pricing starting at $15 per million input tokens and $75 per million output tokens. A consumer version is accessible through the Claude.ai website and mobile apps. Enterprise customers with existing contracts will receive automatic upgrades with no price increase through the end of their current billing cycle.
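At those rates, per-request costs are straightforward to estimate. The token counts in the example below are illustrative, not figures from Anthropic:

```python
# Estimate API cost at the listed Claude 4.5 rates:
# $15 per million input tokens, $75 per million output tokens.

INPUT_RATE_PER_M = 15.00   # USD per 1,000,000 input tokens
OUTPUT_RATE_PER_M = 75.00  # USD per 1,000,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a single request."""
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# Example: 2,000 input tokens and 1,000 output tokens (illustrative).
print(f"${estimate_cost(2_000, 1_000):.4f}")  # 0.03 + 0.075 → $0.1050
```

Because output tokens cost five times as much as input tokens here, long generated responses dominate the bill much faster than long prompts do.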
The release of Claude 4.5 intensifies what has become the most competitive period in AI history, with multiple companies releasing increasingly capable models at an accelerating pace. The question is no longer whether AI can match human reasoning on specific tasks, but how quickly it will exceed it — and what that means for society.