A new benchmark study from AI training company Mercor reveals that current AI agents still struggle with real-world professional tasks, particularly in consulting, legal, and banking environments. According to the research, AI models acting as autonomous agents successfully completed less than 25 % of consulting tasks on their first attempt, and even after multiple tries, success rates reached only about 40 %. The tasks were designed by experienced professionals to simulate complex management consulting work requiring sustained planning, nuanced judgement, and multi-step problem solving.
Despite these sobering results, Mercor’s CEO Brendan Foody remains optimistic about the long-term trajectory of AI performance. He noted that more advanced models, such as GPT-5.2 and Anthropic’s Opus 4.6, showed substantial improvements over earlier versions, with the share of tasks completed correctly rising from just a few percent to nearly a third. Foody likened current AI agents to “interns,” capable of basic research and data analysis but still needing significant human oversight for complex consulting work, and predicted that performance could reach around 50 % by the end of the year.
A central weakness identified in the benchmark is how agents handle multi-domain reasoning and cross-tool coordination. Unlike humans, who intuitively navigate email, file systems, and multiple applications to gather context, AI agents often misjudge where to find key information or how to integrate data from different sources. As tasks demand more sustained effort and planning, such as developing a strategy based on market penetration scores, agent performance drops sharply, suggesting that current systems excel at single-tool tasks but falter in environments that mimic real consulting workflows.
Still, Foody argues that rapid improvement and investment in training will drive agents closer to professional-level performance. Mercor, which is valued at around $10 billion and employs tens of thousands of human contractors to train models, plans to expand its benchmark to evaluate entire professional service value chains, potentially challenging traditional consulting firms. While AI agents aren’t yet ready to replace consultants end to end, Foody believes continued progress could disrupt lower-level consulting roles and shift the industry toward more agent-augmented work.