On Jan. 28, the journal Nature published an article titled "A benchmark of expert-level academic questions to assess AI capabilities." Long was a lead co-author and helped direct the project.
Publication in the journal, which is more than 150 years old and has an acceptance rate of around 8%, is considered a significant achievement in any scientist's career.
"This marks a major milestone after five years of pursuing AI research, driven by a desire to contribute meaningfully and create global impact," Long shared.
The Case Western Reserve University alumnus is currently an AI safety research engineer at the Center for AI Safety (CAIS), an organization led by Dan Hendrycks, an advisor to Elon Musk.
Phan Nguyen Hoang Long. *Photo: Courtesy of subject.*
The article presents the results of the Humanity’s Last Exam (HLE) project, a benchmark designed to evaluate the knowledge and reasoning abilities of large language models (LLMs) like ChatGPT, Gemini, and Grok at research and expert levels.
HLE goes beyond typical multiple-choice questions. It comprises 2,500 in-depth questions across 100 fields, including mathematics, natural sciences, and humanities. Over 1,000 professors and experts from 500 leading universities and research organizations worldwide, such as Stanford, Harvard, Princeton, MIT, and Oxford, contributed to the benchmark.
The project originated from an idea by billionaire Elon Musk and has been a collaborative effort between CAIS and Scale AI, an AI startup founded by Alexandr Wang, one of the world's youngest self-made billionaires. Wang also serves as an advisor to the project and leads Meta's superintelligence lab.
The New York Times once commented that HLE is so challenging that "when AI passes it, we must be wary." Indeed, companies like DeepMind, OpenAI, and xAI use it as a crucial metric when launching AI models. In July 2025, xAI used HLE in developing Grok 4. Elon Musk described the test as "extremely difficult" during the Grok 4 launch livestream.
According to Hoang Long, HLE creates a common reference point for policymakers. This provides them with a basis for discussing AI development, potential risks, and appropriate regulatory policies.
The young researcher stated that he will continue to pursue AI safety, believing it to be a critical factor determining technology's impact on society.
Nature is a multidisciplinary scientific journal that has published groundbreaking research since 1869. Submissions must meet criteria for novelty, significant scientific merit, and robust methodology, while also appealing to a broad scientific community.
Khanh Linh
