Predicting small molecule bioactivity with machine intelligence

Poster

Strategic Research Themes

Digital Transformation (Strategic Research Themes)

Publication Details

Author list: Nutaya Pravalphruekul, Teeraphan Laomettachit, Monrudee Liangruksa, Teerasit Termsaithong, Supanida Piyayotai

Publication year: 2023

Languages: English-Great Britain (EN-GB)

Abstract

Activity prediction is a necessary step in drug discovery for identifying new molecules with therapeutic potential. Consequently, numerous efforts are under way to develop prediction tools and methods that could speed up this process. Most methods rely on proxy indicators, such as similarity to known compounds with confirmed activity or the presence of certain pharmacophores. In this work, we aimed to demonstrate a workflow for predicting the bioactivity of small molecules, whether unlabelled or already labelled, from assigning the probable class label to estimating the activity score. To this end, we compiled a list of small-molecule drugs and bioactive hits/leads from PubChem, counted the molecules in each group, and calculated the degree of internal diversity among them as the basis for data selection. We then optimised machine learning models for two main tasks: classification and regression. We first selected 10 target classes: ADORA2, ALOX, TACR, CHRM1, ADORA1, CYP19A1, DRD2, PTGS, REN, and F2. Note that ADORA1 and ADORA2 were treated as separate classes, following the MDDR grouping.
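As an illustration of the internal diversity step, the sketch below computes one common diversity measure: one minus the mean pairwise Tanimoto similarity of the fingerprints in a group. The abstract does not state which measure or fingerprint was used, so the Tanimoto coefficient, the representation of fingerprints as sets of on-bit indices, and the function names `tanimoto` and `internal_diversity` are all assumptions made for this example.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def internal_diversity(fingerprints):
    """Return 1 - mean pairwise Tanimoto similarity for a group.

    Assumption: this is only one possible definition of internal
    diversity; the source does not specify the measure it used.
    """
    fps = [frozenset(fp) for fp in fingerprints]
    if len(fps) < 2:
        return 0.0  # a single molecule has no pairwise diversity
    pairs = list(combinations(fps, 2))
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Hypothetical toy group: three fingerprints given as on-bit indices.
toy_group = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
toy_diversity = internal_diversity(toy_group)
```

Under this definition, a value near 0 indicates a group of near-identical structures, and a value near 1 indicates a structurally varied group.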

For the classification task, we randomly picked 2,000 molecules from each class and split the data into train and test sets at a 4:1 ratio. We then trained multiclass prediction models with various machine learning algorithms, including support vector machine (SVM), Naïve Bayes, Random Forest (RF), XGBoost, and multilayer perceptron (MLP). Performance did not vary significantly among the algorithms tested. Nonetheless, on the whole, RF slightly outperformed the others in terms of accuracy, F1, and precision (86.4, 86.3, and 86.3, respectively). We then trained the regression models using 11,232 inhibitors, each associated with a single IC50 value. We log-transformed the IC50 values and appended a 4-bit binary code to the end of each molecular fingerprint to represent the respective class. Comparing Elastic Net, Ridge Regressor, Lasso Regressor, Gradient Boosting Regressor (GBR), Decision Tree, Support Vector Regressor (SVR), Random Forest Regressor (RFR), and Neural Network (NN), we found that SVR, RFR, GBR, and NN performed comparably and better than the rest (MAE = 0.51-0.54, R² = 0.64-0.67, MSE = 0.49-0.53). Among these four, NN required the shortest runtime (41 s vs 90-460 s). Additionally, we incorporated common rules to flag structures with undesirable characteristics, such as violations of Lipinski's rule of five (Ro5), and filtered out compounds with probable toxicophoric groups.
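The regression input described above can be sketched as follows: a log-transformed IC50 value as the target, and a fingerprint with a 4-bit class code appended as the feature vector. The abstract does not give the logarithm base, the IC50 units, or the mapping from class to code, so the base-10 logarithm, the ordering in `TARGET_CLASSES`, and the helper names `class_code`, `build_features`, and `log_ic50` are assumptions made for illustration only.

```python
import math

# Assumption: classes are indexed in the order listed in the abstract.
TARGET_CLASSES = ["ADORA2", "ALOX", "TACR", "CHRM1", "ADORA1",
                  "CYP19A1", "DRD2", "PTGS", "REN", "F2"]


def class_code(target, n_bits=4):
    """Encode a class index as a fixed-width binary code (MSB first).

    Four bits can represent 16 values, enough for the 10 classes.
    """
    index = TARGET_CLASSES.index(target)
    return [(index >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def build_features(fingerprint, target):
    """Append the class code to the end of a fingerprint bit vector."""
    return list(fingerprint) + class_code(target)


def log_ic50(ic50):
    """Log-transform an IC50 value (base 10 is an assumption)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return math.log10(ic50)


# Hypothetical example: a very short fingerprint for an ALOX inhibitor.
features = build_features([1, 0, 1], "ALOX")
target_value = log_ic50(1000.0)
```

Appending the class code lets a single regression model be trained across all target classes, instead of fitting one model per class.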
