A global team of researchers has developed a new exam designed specifically to measure the capabilities and limitations of modern artificial intelligence systems. The test, called “Humanity’s Last Exam” (HLE), was created because many existing AI benchmarks have become too easy for advanced AI models. As these systems improved, they began scoring over 90% on traditional tests, leaving researchers little room to distinguish how advanced the systems really are.
The new exam is far more demanding than earlier benchmarks. It contains about 2,500 highly challenging questions spanning a wide range of subjects, including mathematics, the natural sciences, the humanities, ancient languages, and other specialized academic fields. The questions are designed to require deep reasoning and expert knowledge, pushing AI systems beyond simple pattern recognition or recall of memorized information.
To build the exam, nearly 1,000 researchers and subject experts from around the world collaborated to create difficult questions that even advanced AI models would struggle to answer. The questions were carefully reviewed and selected through multiple stages to ensure their difficulty and accuracy. The aim was to produce a rigorous benchmark capable of revealing the real strengths and weaknesses of modern AI systems.
Early results show that today’s most powerful AI models still perform poorly on the test, highlighting the gap between current AI capabilities and human expert reasoning. Researchers believe the exam will serve as an important tool for tracking progress in artificial intelligence and for understanding how close AI is to achieving more advanced, human-like reasoning.