J. Chem. Inf. Model., 2013, 53 (8), 1990-2000
DOI: 10.1021/ci400213d
The dimethyl sulfoxide (DMSO) solubility data from Enamine and two UCB Pharma compound collections were analyzed using eight machine learning methods and 12 descriptor sets. The data sets were highly imbalanced, containing only 1.7-5.8% nonsoluble compounds. The methods were compared by the enrichment of the libraries in soluble molecules achieved within the 10% most reliable predictions. The highest accuracies were obtained with a C4.5 classification decision tree, random forest, and associative neural networks. The performance of the developed models was estimated on the individual data sets and on their combinations. On average, the developed models gave a 2-fold decrease in the number of nonsoluble compounds among all compounds predicted to be soluble in DMSO. A 4- to 9-fold enrichment was observed, however, when only the 10% most reliable predictions were considered. Structural features that make compounds soluble or nonsoluble in DMSO were also identified.
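
The fold decrease and enrichment figures quoted above can be illustrated with a small sketch. The abstract does not give the exact formula, so the definition used here is an assumption: the baseline fraction of nonsoluble compounds divided by the fraction of nonsoluble compounds among those predicted soluble, optionally restricted to the most confident share of those predictions. The function name `fold_decrease` and its parameters are hypothetical and do not come from the paper.

```python
def fold_decrease(is_nonsoluble, predicted_soluble, confidence, top_fraction=1.0):
    """Baseline nonsoluble rate divided by the rate among selected compounds.

    Assumed definition (not taken from the paper): compounds predicted
    soluble are ranked by confidence, the top `top_fraction` of them is
    kept, and the nonsoluble rate in that subset is compared with the
    nonsoluble rate of the whole library.
    """
    n = len(is_nonsoluble)
    baseline = sum(is_nonsoluble) / n

    # Indices of compounds predicted soluble, most confident first.
    picked = sorted(
        (i for i in range(n) if predicted_soluble[i]),
        key=lambda i: confidence[i],
        reverse=True,
    )
    if not picked:
        raise ValueError("no compound was predicted soluble")

    # Keep at least one compound from the requested top share.
    keep = max(1, round(top_fraction * len(picked)))
    picked = picked[:keep]

    rate = sum(is_nonsoluble[i] for i in picked) / len(picked)
    return float("inf") if rate == 0 else baseline / rate


# Toy library: 20 compounds, the first 4 are nonsoluble (20% baseline).
labels = [1, 1, 1, 1] + [0] * 16
# Hypothetical model output: two of the nonsoluble compounds are caught.
pred_soluble = [False, False] + [True] * 18
# Hypothetical confidences, increasing with the compound index.
conf = [i / 20 for i in range(20)]

all_predictions = fold_decrease(labels, pred_soluble, conf)
most_reliable = fold_decrease(labels, pred_soluble, conf, top_fraction=0.5)
```

In this toy case the two missed nonsoluble compounds carry the lowest confidence, so restricting the selection to the most reliable half removes them entirely. This mirrors, qualitatively only, the stronger enrichment reported for the 10% most reliable predictions.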

Tetko I. V.; Novotarskyi S.; Sushko I.; Ivanov V.; Petrenko A. E.; Dieden R.; Lebon F.; Mathieu B. 
