NMR-Challenge for LLMs: Evaluating Chemical Reasoning in Humans and AI

dc.contributor.authorSharlin, Samiha
dc.contributor.authorAgbere, Fariha
dc.contributor.authorIshimwe, Kevin
dc.contributor.authorOsifová, Zuzana
dc.contributor.authorSocha, Ondřej
dc.contributor.authorDračínský, Martin
dc.contributor.authorJosephson, Tyler R.
dc.date.accessioned2025-08-28T16:11:34Z
dc.date.issued2025-08-13
dc.description.abstractNuclear Magnetic Resonance (NMR) structure determination is an important problem in education, industry, and research. Solving NMR spectra requires expert knowledge, critical thinking, and careful evaluation of multiple features of spectral data. This study explores the capabilities of large language models (LLMs) for solving NMR spectral tasks. We selected 115 problems from NMR-Challenge.com, a website used by students practicing NMR structure elucidation that has collected more than 1 million human responses, and developed a plain text problem format for evaluating LLM reasoning in this domain. We evaluated 7 LLMs (GPT-4o, GPT-4o-mini, o1, o1-mini, o3-mini, Claude-3.5 Sonnet, and Gemini-2.0-Flash) using 5 prompts designed to spur chain-of-thought reasoning in different ways, in particular comparing the influence of providing background NMR chemistry knowledge, a reasoning strategy, or both. Newer models trained to emphasize reasoning performed better, and increasing reasoning effort led to modest improvements, but neither prompt choice nor temperature had an effect. We also evaluated undergraduate organic chemistry students in a controlled setting and analyzed answer submission statistics from global submissions to NMR-Challenge.com to characterize human performance on these problems. The top-performing students surpassed smaller models like GPT-4o by 24%, 33%, and 29% on the Easy, Moderate, and Hard sets, respectively. However, reasoning models like o1 exceeded student performance by 13%, 14%, and 19%, respectively. Patterns of mistakes reveal that LLM errors resemble those typically made by humans, for instance, incorrect positioning of substituents on benzene rings and incorrect orientation of carboxyl groups in esters. However, LLMs still "think" differently from humans, in some cases providing answers that no human had submitted via the website. This work also illustrates how NMR spectral problems can be used to benchmark LLMs on reasoning-heavy tasks in chemistry, though for this particular problem set, current LLMs already exceed undergraduate student performance.
dc.description.sponsorshipThis work was supported by the Department of Energy's Bioenergy Technologies Office.
dc.description.urihttps://chemrxiv.org/engage/chemrxiv/article-details/689cccf6728bf9025e4831c3
dc.format.extent16 pages
dc.genrejournal articles
dc.genrepreprints
dc.identifierdoi:10.13016/m2lt62-suma
dc.identifier.urihttps://doi.org/10.26434/chemrxiv-2025-x8h36-v2
dc.identifier.urihttp://hdl.handle.net/11603/40096
dc.language.isoen
dc.relation.isAvailableAtThe University of Maryland, Baltimore County (UMBC)
dc.relation.ispartofUMBC Chemical, Biochemical & Environmental Engineering Department
dc.relation.ispartofUMBC Student Collection
dc.relation.ispartofUMBC Faculty Collection
dc.relation.ispartofUMBC Computer Science and Electrical Engineering Department
dc.rightsAttribution 4.0 International
dc.rights.urihttps://creativecommons.org/licenses/by/4.0/
dc.subjectLarge Language Models
dc.subjectUMBC ATOMS Lab
dc.subjectReasoning
dc.titleNMR-Challenge for LLMs: Evaluating Chemical Reasoning in Humans and AI
dc.typeText
dcterms.creatorhttps://orcid.org/0000-0002-6379-9206
dcterms.creatorhttps://orcid.org/0000-0002-0100-0227
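
The abstract describes a plain text problem format for posing NMR structure-elucidation problems to LLMs and grading their answers against reference structures. As a rough illustration only, the Python sketch below (using RDKit) shows one way such a problem could be rendered and an answer scored; the problem layout, the use of SMILES strings for answers, and every function name here are assumptions for illustration, not the authors' actual format or pipeline.

    # Hypothetical sketch: posing a plain-text 1H NMR problem and grading a
    # SMILES answer. The layout and SMILES-based grading are assumptions,
    # not the format used in the paper.
    from rdkit import Chem

    def format_problem(formula, shifts):
        """Render a 1H NMR problem as plain text (hypothetical layout)."""
        lines = [f"Molecular formula: {formula}",
                 "1H NMR signals (shift in ppm, multiplicity, integration):"]
        for ppm, mult, integration in shifts:
            lines.append(f"  {ppm:.2f} ppm, {mult}, {integration}H")
        lines.append("Propose the structure as a SMILES string.")
        return "\n".join(lines)

    def is_correct(predicted, reference):
        """Compare structures via canonical SMILES, so equivalent notations match."""
        pred_mol = Chem.MolFromSmiles(predicted)
        ref_mol = Chem.MolFromSmiles(reference)
        if pred_mol is None or ref_mol is None:
            return False  # an unparsable answer counts as wrong
        return Chem.MolToSmiles(pred_mol) == Chem.MolToSmiles(ref_mol)

    # Example: ethyl acetate, a classic easy problem
    print(format_problem("C4H8O2", [(4.12, "q", 2), (2.04, "s", 3), (1.26, "t", 3)]))
    print(is_correct("O=C(C)OCC", "CCOC(C)=O"))  # True: same molecule, different notation

Comparing canonical SMILES rather than raw strings means a chemically correct answer is accepted regardless of how the model writes it out.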

Files

Original bundle

Name: nmrchallengeforllmsevaluatingchemicalreasoninginhumansandai.pdf
Size: 3.57 MB
Format: Adobe Portable Document Format

Name: supportinginformation.pdf
Size: 2.2 MB
Format: Adobe Portable Document Format