Biological and Medicinal Chemistry

Assessment of the generalization abilities of machine-learning scoring functions for structure-based virtual screening


  • Hui Zhu, Tsinghua University & National Institute of Biological Sciences, Beijing
  • Jincai Yang, National Institute of Biological Sciences, Beijing
  • Niu Huang, National Institute of Biological Sciences, Beijing & Tsinghua University


In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein-ligand atomic interactions. By focusing on the local domains of ligand-binding pockets, we developed a standardized pocket Pfam-based clustering (Pfam-cluster) approach to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). We then evaluated 11 typical MLSFs using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV). Surprisingly, all of the tested models performed progressively worse from Random-CV to Seq-CV to Pfam-CV, and none showed satisfactory generalization capacity. Our interpretability analysis suggested that MLSF predictions on novel targets depended on features related to the buried solvent-accessible surface area (SASA) of the complex structures, with larger binding affinities predicted for complexes with larger protein-ligand interfaces. By combining buried SASA-related features with target-specific patterns shared only among structurally similar compounds in the same cluster, random forest (RF)-Score attained good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious about the features learned by MLSFs.
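The difference between Random-CV and a cluster-based split such as Pfam-CV can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy complex IDs, the cluster labels, and the greedy fold-balancing rule are all assumptions made for illustration. The only idea taken from the abstract is that, in a cluster-based split, complexes from the same cluster never appear in both training and test folds.

```python
import random
from collections import defaultdict


def random_folds(items, k=3, seed=0):
    """Random-CV: shuffle complexes and deal them into k folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cluster_folds(cluster_of, k=3):
    """Cluster-based CV: every cluster is placed whole into one fold."""
    members = defaultdict(list)
    for item, cluster in cluster_of.items():
        members[cluster].append(item)
    folds = [[] for _ in range(k)]
    # Greedy balancing (an assumption): largest clusters first,
    # each one going into the currently smallest fold.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[cluster])
    return folds


def clusters_spanning_folds(folds, cluster_of):
    """Return clusters whose members occur in more than one fold."""
    seen = defaultdict(set)
    for idx, fold in enumerate(folds):
        for item in fold:
            seen[cluster_of[item]].add(idx)
    return {c for c, idxs in seen.items() if len(idxs) > 1}


# Hypothetical toy data: complex ID -> pocket cluster label.
cluster_of = {
    "c01": "kinase", "c02": "kinase", "c03": "kinase", "c04": "kinase",
    "c05": "protease", "c06": "protease", "c07": "protease",
    "c08": "nuclear_receptor", "c09": "nuclear_receptor",
    "c10": "bromodomain",
}

rand = random_folds(cluster_of, k=3, seed=0)
clustered = cluster_folds(cluster_of, k=3)

print("clusters split across folds (random): ",
      clusters_spanning_folds(rand, cluster_of))
print("clusters split across folds (clustered):",
      clusters_spanning_folds(clustered, cluster_of))
```

In the clustered split no cluster is shared between folds, so a model tested on one fold has seen no complexes from that fold's clusters during training. A random split typically scatters each cluster across folds, which is the kind of overlap that can inflate Random-CV scores.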

Version notes

Added citation information (title with a list of authors and their affiliations) to the SI file.



Supplementary material

Supplementary_figures.docx
supplementary figures
Supplementary_tables.xlsx
supplementary tables

Supplementary weblinks

Scripts for the MLSF generalization ability benchmark
The complete Pfam-cluster approach, 3-fold dataset split, and SHAP analysis processes are available via this weblink. All other data are also available upon request.