Breaking news

OpenAI Releases GDPval Benchmark To Gauge AI Performance Against Human Experts

New Benchmark Sheds Light on AI’s Capabilities

OpenAI has unveiled GDPval, a new benchmark designed to evaluate its AI models against human professionals across a broad spectrum of industries. This initiative represents a critical step in understanding how far today’s AI is from matching or surpassing the work quality of experts in sectors such as healthcare, finance, manufacturing, and government.

Methodology and Industry Scope

The GDPval benchmark focuses on nine major industries contributing to America’s gross domestic product and tests AI performance in 44 distinct occupations—from software engineering to nursing and journalism. In its initial version, GDPval-v0, industry professionals compared reports generated by AI models with those produced by their human counterparts. For instance, investment bankers were tasked with evaluating competitor landscape analyses for the last-mile delivery industry, ensuring that the assessment reflects real-world complexity.

Comparative Performance: AI Advances and Limitations

Results indicate promising progress: OpenAI’s GPT-5-high, an enhanced iteration of its flagship model, achieved a 40.6% win rate in head-to-head comparisons with industry veterans. Notably, Anthropic’s Claude Opus 4.1 scored nearly 49% on the same criteria. However, OpenAI acknowledges that these models are not yet positioned to replace human labor entirely, as the current iteration of GDPval covers only a narrow slice of actual job responsibilities.

Expert Insights and Future Directions

In a discussion with TechCrunch, OpenAI’s chief economist, Dr. Aaron Chatterji, noted that the benchmark’s favorable outcomes suggest professionals may soon delegate routine tasks to AI. This, he argued, will free up valuable time for higher-impact work. OpenAI researcher Tejal Patwardhan also expressed optimism, emphasizing the significant performance leap from GPT-4o’s 13.7% score to nearly triple that figure with GPT-5.

Benchmarking And The Road To Comprehensive AI Evaluation

While GDPval represents an early milestone, it aligns with a broader effort among Silicon Valley titans to create robust testing frameworks, such as AIME 2025 and GPQA Diamond, that better quantify AI proficiency for real-world applications. OpenAI plans to expand GDPval to encompass more industries and interactive workflows, aiming to bolster its claims about AI’s growing economic value.

As the benchmark evolves, GDPval could play an instrumental role in the ongoing debate around artificial general intelligence, highlighting the potential and limitations of AI models poised to reshape the modern workforce.

Anthropic Unveils Advanced Cybersecurity AI Through Project Glasswing

Anthropic has introduced Claude Mythos Preview, an artificial intelligence model designed to identify vulnerabilities in software. The release forms part of the company’s Project Glasswing initiative, focused on strengthening cybersecurity as threats continue to evolve.

Innovative Cyber Capabilities

Claude Mythos Preview identifies complex software flaws that are often difficult to detect using traditional methods. In one case, the model uncovered a 27-year-old vulnerability in OpenBSD, an operating system widely known for its security standards. Access to the model is currently restricted. Anthropic said the limitation is intended to reduce the risk of misuse and ensure the technology is applied in defensive contexts.

Strategic Industry Collaborations

Major technology companies, including Apple, Google, Microsoft, Nvidia and Amazon Web Services, joined as early partners in Project Glasswing. More than 40 additional companies, including CrowdStrike and Palo Alto Networks, are working with Anthropic to integrate the model into their cybersecurity systems.

Balancing Innovation With Caution

Anthropic’s Dianne Penn said in a CNBC interview that the launch followed an extensive internal review. The company is also working with U.S. agencies, including the Cybersecurity and Infrastructure Security Agency and the Center for AI Standards and Innovation, to align deployment with safety requirements. CEO Dario Amodei said the company is focused on balancing defensive benefits with potential risks linked to advanced AI systems.

Expanding AI Infrastructure Security

Anthropic has allocated up to $100 million in usage credits for selected partners. The program is aimed at testing the model across proprietary and open-source systems. Early access is focused on companies managing critical infrastructure, as Anthropic evaluates broader deployment scenarios.

Outlook

Project Glasswing reflects a shift toward AI-driven cybersecurity tools designed to identify vulnerabilities earlier in the development cycle. Adoption will depend on how effectively companies balance improved detection capabilities with the risks associated with advanced AI systems.
