ChemLLM: A Chemical Large Language Model

May 20, 2024·

Di Zhang

Wei Liu

Qian Tan

Jingdan Chen

Hang Yan

Yuliang YAN

Jiatong Li

Weiran Huang

Xiangyu Yue

Dongzhan Zhou

Shufei Zhang

Mao Su

Hansen Zhong

Yuqiang Li

Wanli Ouyang

· 0 min read

PDF

Abstract

Large language models (LLMs) have made impressive progress in chemistry applications, including molecular property prediction, molecular generation, experimental protocol design, etc. However, the community lacks a dialogue-based model specifically designed for chemistry. The challenge arises from the fact that most chemical data and scientific knowledge are primarily stored in structured databases, and the direct use of these structured data compromises the model’s ability to maintain coherent dialogue. To tackle this issue, we develop a novel template-based instruction construction method that transforms structured knowledge into plain dialogue, making it suitable for language model training. By leveraging this approach, we develop ChemLLM, the first large language model dedicated to chemistry, capable of performing various tasks across chemical disciplines with smooth dialogue interaction. ChemLLM beats GPT-3.5 on all three principal tasks in chemistry, i.e., name conversion, molecular caption, and reaction prediction, and surpasses GPT-4 on two of them. Remarkably, ChemLLM also shows exceptional adaptability to related mathematical and physical tasks despite being trained mainly on chemical-centric corpora. Furthermore, ChemLLM demonstrates proficiency in specialized NLP tasks within chemistry, such as literature translation and cheminformatic programming. ChemLLM opens up a new avenue for exploration within chemical studies, while our method of integrating structured chemical knowledge into dialogue systems sets a new frontier for developing LLMs across various scientific fields. Codes, Datasets, and Model …

Type

ArXiv

Last updated on May 20, 2024

AI for Science; Large Language Models

Authors

Yuliang YAN

PhD Student

Hallucination detection for generative large language models by bayesian sequential estimation Dec 20, 2023 →