Data-driven imputation of miscibility of aqueous solutions via graph-regularized logistic matrix factorization

21 August 2023, Version 2
This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Aqueous two-phase systems (ATPSs) may form upon mixing two solutions of independently water-soluble compounds. Many separation, purification, and extraction processes rely on ATPSs. Predicting the miscibility of solutions can accelerate, and reduce the cost of, the discovery of new ATPSs for these applications. Whereas previous machine learning approaches to ATPS prediction used physicochemical properties of each solute as descriptors, in this work, we show how to impute missing miscibility outcomes directly from an incomplete collection of pairwise miscibility experiments. We use graph-regularized logistic matrix factorization (GR-LMF) to learn a latent vector for each solution from (i) the observed entries in the pairwise miscibility matrix and (ii) a graph (nodes: solutes; edges: shared general solute category, i.e., polymer, surfactant, salt, or protein). For an experimental dataset of the pairwise miscibility of 68 solutions from Peacock et al. [ACS Appl. Mater. Interfaces 2021, 13, 11449--11460], we find that GR-LMF predicts missing (im)miscibility outcomes of pairs of solutions more accurately than ordinary logistic matrix factorization and random forest classifiers that use physicochemical features of the solutes. GR-LMF obviates the need for features of the solutions/solutes to impute missing miscibility outcomes, but it cannot predict the miscibility of a new solution without some observations of its miscibility with other solutions in the training data set.
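To make the method described in the abstract concrete, the sketch below shows one way graph-regularized logistic matrix factorization could be implemented for a partially observed, symmetric binary miscibility matrix. This is an illustrative reimplementation, not the authors' code: the symmetric parameterization (sigmoid of u_i·u_j plus per-solution biases), the graph-Laplacian regularizer, the function names, and the hyperparameter values are all assumptions made for the example.

```python
# Minimal sketch of graph-regularized logistic matrix factorization (GR-LMF)
# for imputing a partially observed, symmetric binary miscibility matrix.
# Hypothetical parameterization: P_ij = sigmoid(u_i . u_j + b_i + b_j), with
# an L2 penalty on the latent vectors and a graph-Laplacian penalty that pulls
# together latent vectors of solutes in the same category.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_grlmf(M, mask, A, k=3, lam=0.1, gamma=0.1, lr=0.05, n_iter=2000, seed=0):
    """M: n x n binary matrix (1 = miscible, 0 = immiscible); entries outside
    `mask` are ignored. mask: symmetric boolean n x n, True where observed.
    A: n x n adjacency of the solute-category graph (1 if same category).
    Returns latent vectors U (n x k) and biases b (n,)."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    U = 0.1 * rng.standard_normal((n, k))
    b = np.zeros(n)
    L = np.diag(A.sum(axis=1)) - A            # graph Laplacian
    for _ in range(n_iter):
        logits = U @ U.T + b[:, None] + b[None, :]
        P = sigmoid(logits)
        E = np.where(mask, P - M, 0.0)        # residual on observed entries only
        # Gradients of: cross-entropy on observed entries
        #               + lam * ||U||_F^2 + gamma * tr(U^T L U)
        grad_U = 2.0 * E @ U + 2.0 * lam * U + 2.0 * gamma * (L @ U)
        grad_b = 2.0 * E.sum(axis=1)
        U -= lr * grad_U
        b -= lr * grad_b
    return U, b

def impute(U, b):
    """Imputed probability-of-miscibility matrix from the learned factors."""
    return sigmoid(U @ U.T + b[:, None] + b[None, :])
```

Setting gamma to zero recovers ordinary logistic matrix factorization, which is the comparison discussed in the abstract and in the Supporting Information; the latent dimension, learning rate, and regularization weights here are placeholders rather than the values used in the paper.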

Supplementary materials

Supporting Information (PDF): Photograph of an example ATPS experiment; complete miscibility matrix; fraction of immiscible solutions by category; example loss function minimizations; hyperparameter space; the imputed miscibility matrix; distribution of predictions; visualization and 3D plots of the learned latent vectors; visualization of the latent space with $\gamma = 0$; feature importance for the RF model; F1, accuracy, precision, and recall performance metrics for the models.

