NMR-Challenge for LLMs: Evaluating Chemical Reasoning in Humans and AI

Rights

Attribution 4.0 International

Abstract

Nuclear Magnetic Resonance (NMR) structure determination is an important problem in education, industry, and research. Solving NMR spectra requires expert knowledge, critical thinking, and careful evaluation of multiple features of spectral data. This study explores the capabilities of large language models (LLMs) for solving NMR spectral tasks. We selected 115 problems from NMR-Challenge.com, a website used by students practicing NMR structure elucidation that has collected more than 1 million human responses, and developed a plain-text problem format for evaluating LLM reasoning in this domain. We evaluated 7 LLMs (GPT-4o, GPT-4o-mini, o1, o1-mini, o3-mini, Claude-3.5 Sonnet, and Gemini-2.0-Flash) with 5 prompts designed to elicit chain-of-thought reasoning in different ways, in particular by providing background NMR chemistry knowledge, a reasoning strategy, or both. Newer models trained to emphasize reasoning performed better, and increasing reasoning effort led to modest improvements, but neither the choice of prompt nor the sampling temperature had an effect. To characterize human performance on these problems, we also evaluated undergraduate organic chemistry students in a controlled setting and analyzed answer statistics from global submissions to NMR-Challenge.com. The top-performing students surpassed smaller models like GPT-4o by 24%, 33%, and 29% on the Easy, Moderate, and Hard sets, respectively. However, reasoning models like o1 exceeded student performance by 13%, 14%, and 19%, respectively. Patterns in the mistakes show that LLM errors resemble those typically made by humans, for instance, incorrect positioning of substituents on benzene and incorrect orientation of carboxyl groups in esters. However, LLMs still "think" differently from humans, in some cases providing answers that no human submitted via the website. This work also illustrates how NMR spectral problems can be used to benchmark LLMs on reasoning-heavy tasks in chemistry, though for this particular set of problems, current LLMs already exceed undergraduate student performance.