Transferable diversity – a data-driven representation of chemical space

03 October 2023, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Transferability, especially in the context of model generalization, is a paradigm of all scientific disciplines. However, the rapid advancement of machine-learned model development threatens this paradigm, as it can be difficult to understand how transferability is embedded (or missed) in complex models. While transferability in general chemistry machine learning should benefit from diverse training data, a rigorous understanding of transferability together with its interplay with chemical representation remains an open problem. We introduce a transferability framework and apply it to a controllable data-driven model for developing density functional approximations (DFAs), an indispensable tool in everyday chemistry research. We reveal that human intuition introduces chemical biases that can hamper the transferability of data-driven DFAs, and we identify strategies for their elimination. We then show that uncritical use of large training sets can actually hinder the transferability of DFAs, in contradiction to typical “more is more” expectations. Finally, our transferability framework yields transferable diversity, a cornerstone principle for data curation for developing general-purpose machine learning models in chemistry.

Keywords

transferable diversity
Density functional theory
data
