Sema Yöşili (T), Mehmet Çalıseki (E), Damla Pehlivan (R)

Introduction

Chemistry has long been one of the most information-rich disciplines. Most information in chemistry concerns the two- and three-dimensional (2D and 3D) structures of molecules. The sheer volume of this information has led to the development of cheminformatics, which Gasteiger, one of the founders of the field, defines as “the application of informatics methods to solve chemical problems”. As he stated, the new modes of access to chemical information that cheminformatics provides profoundly affect the methods of chemical research. The structure of a molecule is a fundamental factor in determining its physical, chemical, and biological properties. Cheminformatics uses techniques such as data mining, graph theory, multivariate statistics, and machine learning to associate these properties with structure [1].

Graph Theory

A chemical graph, also known as a ‘molecular graph’ or ‘structural graph’, is a mathematical structure defined by the ordered pair G = (V, E), where V is a set of vertices (atoms) and E is a set of edges (bonds) connecting them. Because chemical graph theory rests entirely on graphical representations of chemical structures, these representations carry the information needed to model and understand the biological behavior of those structures. Typically, chemical graphs assign values to edges and vertices to indicate bond lengths and atomic properties [2].
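The definition G = (V, E) can be made concrete with a small example. The following is a minimal sketch, assuming a heavy-atom representation of ethanol with implicit hydrogens; the dictionary layout and the helper names `neighbors` and `degree` are illustrative choices, not part of any cheminformatics library.

```python
# Heavy-atom molecular graph of ethanol (CH3-CH2-OH); hydrogens are left implicit.
# Vertices V (atoms): atom index -> element symbol
atoms = {0: "C", 1: "C", 2: "O"}

# Edges E (bonds): unordered atom pair -> bond order (1 = single bond)
bonds = {frozenset((0, 1)): 1, frozenset((1, 2)): 1}


def neighbors(atom_index):
    """Return the sorted indices of atoms bonded to atom_index."""
    found = set()
    for pair in bonds:
        if atom_index in pair:
            found.update(pair - {atom_index})
    return sorted(found)


def degree(atom_index):
    """Number of bonds (edges) incident to the atom (vertex)."""
    return len(neighbors(atom_index))


for index, element in atoms.items():
    print(index, element, "bonded to", neighbors(index))
```

Storing each bond as a `frozenset` keeps the edge undirected, so the pair (0, 1) and the pair (1, 0) refer to the same bond. Further properties, such as bond lengths, could be attached as additional values on the same keys.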

Machine Learning

Machine learning is currently one of the most important and rapidly developing areas of computer-aided drug discovery (CADD). Unlike physical models, which rely on explicit physical equations such as those of quantum chemistry or molecular dynamics simulations, machine learning approaches identify mathematical relationships among empirical observations of small molecules. They use pattern recognition algorithms to predict the chemical, biological, and physical properties of new compounds. Compared to physical models, machine learning techniques are more efficient and scale easily to large datasets without requiring extensive computational resources [2].

Figure 1. Machine learning approaches in cheminformatics [2].

Data Mining

Given the volume, heterogeneity, and complexity of compound activity data, clearly defined data selection criteria and improved data mining concepts are required. Data mining techniques are also used in virtual screening. Ligand-based virtual screening aims to identify new active compounds based on known active reference molecules.

The main purpose is to identify structurally different compounds that have similar activity to the references. For this purpose, similarity-based virtual screening techniques are applied. Since virtual screening aims to identify new active compounds, it is usually not performed on biologically annotated databases such as ChEMBL, but rather on compound databases such as ZINC, which currently contains approximately 35 million small molecules that are typically not biologically annotated [3].
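Similarity-based screening can be sketched in a few lines. The example below is a minimal illustration, assuming that each compound is described by a binary fingerprint stored as the set of its "on" bit positions and that similarity is measured with the Tanimoto coefficient, a common choice that the text above does not itself prescribe. The compound names, fingerprints, and the 0.5 threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def screen(reference, library, threshold=0.5):
    """Keep library compounds at or above the threshold, most similar first."""
    scored = [(name, tanimoto(reference, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints: each set lists the bit positions that are switched on.
reference_fp = {1, 4, 7, 9, 12}
library_fps = {
    "cmpd_A": {1, 4, 7, 9, 13},   # shares 4 bits with the reference
    "cmpd_B": {2, 3, 5},          # shares no bits with the reference
    "cmpd_C": {1, 4, 7, 9, 12},   # identical to the reference
}

for name, score in screen(reference_fp, library_fps):
    print(f"{name}: {score:.2f}")
```

In practice the threshold is a trade-off: a high value returns close analogues of the reference, while a lower value admits the structurally different compounds that this kind of screening is meant to find.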

High-throughput Screening Analysis

High-throughput screening (HTS) can cover several million compounds; therefore, an efficient computational process is required to analyze the resulting data. Beyond the analysis of screening results, cheminformatics is also important in choosing which compounds to screen. A large-scale HTS campaign begins with a primary screen in which 1-2 million compounds are tested. Hits are then followed up in a dose-response assay to confirm that the modulation follows an appropriate sigmoidal curve. Hits are also tested in a so-called orthogonal assay, in which a mechanistically different screening technology is used to eliminate compounds that interfere with the screening technology itself. The whole process depends heavily on cheminformatics input, starting with the choice of compounds to screen [4].
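The sigmoidal dose-response curve mentioned above is often described with a four-parameter logistic (Hill) model. The sketch below assumes that model, which the source does not name explicitly; the parameter values and the concentration series are hypothetical.

```python
def hill_response(concentration, bottom, top, ec50, hill_slope):
    """Four-parameter logistic (Hill) response at a concentration greater than zero."""
    return bottom + (top - bottom) / (1.0 + (ec50 / concentration) ** hill_slope)


# Hypothetical assay: response rises from 0 % to 100 % with EC50 = 1 uM.
concentrations_um = [0.01, 0.1, 1.0, 10.0, 100.0]
for conc in concentrations_um:
    response = hill_response(conc, bottom=0.0, top=100.0, ec50=1.0, hill_slope=1.0)
    print(f"{conc:>7.2f} uM -> {response:5.1f} % response")
```

At a concentration equal to `ec50` the response lies halfway between `bottom` and `top`, which is one simple check that a fitted curve has the expected shape.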

Quantitative Structure-Activity Relationships (QSAR)

Quantitative Structure-Activity Relationships (QSAR) is a method for predicting the biophysical properties of small molecules; today it is widely used in companies and public services. Nonetheless, like any scientific method, it faces growing demands, especially given its possible role in assessing the safety of new chemicals [5].

Machine Learning Methods in QSAR

Machine learning techniques can be broadly classified as supervised or unsupervised. In supervised learning, labels are assigned to the training data, and once trained, the model can predict labels for new data inputs. Supervised machine learning models include multiple regression analysis, k-nearest neighbor (kNN), Bayesian probabilistic learning, support vector machines (SVMs), random forest, and neural networks [2].
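The supervised workflow described above (labeled training data, then prediction for new inputs) can be illustrated with k-nearest neighbor, one of the listed models. This is a minimal sketch; the two-dimensional descriptor vectors and the "active"/"inactive" labels are hypothetical, and `knn_predict` is an illustrative helper rather than a library function.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest training samples."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, pre-scaled descriptor vectors with known activity labels.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

print(knn_predict(train_x, train_y, (0.82, 0.88)))
print(knn_predict(train_x, train_y, (0.12, 0.18)))
```

An odd value of `k` avoids tied votes when there are two classes.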

Unsupervised machine learning techniques learn patterns underlying molecular properties directly from unlabeled data. A special case of supervised learning is semi-supervised learning, in which a small amount of labeled data is mixed with unlabeled data during training to improve learning accuracy when modeling a small and unbalanced dataset. Unsupervised methods include dimensionality reduction techniques such as principal component analysis (PCA) and independent component analysis (ICA), as well as various supervised methods that can be adapted to support unsupervised learning, such as SVMs, probabilistic graphical models, and neural networks [2].
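As an illustration of dimensionality reduction on unlabeled data, the following sketch estimates the first principal component of a small descriptor table. It uses power iteration on the sample covariance matrix as a simple stand-in for a full eigendecomposition; the descriptor values are hypothetical and `first_principal_component` is an illustrative helper.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the first principal component by power iteration on the covariance matrix.

    Returns the unit-length direction and the projection (score) of each centered row.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    scores = [sum(r[j] * vector[j] for j in range(d)) for r in centered]
    return vector, scores


# Hypothetical table of two strongly correlated descriptors (second is about twice the first).
descriptors = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

direction, scores = first_principal_component(descriptors)
print("direction:", [round(value, 3) for value in direction])
print("scores:", [round(value, 3) for value in scores])
```

Because the two descriptors are almost perfectly correlated, a single score per compound captures nearly all of the variation, which is the sense in which PCA reduces dimensionality.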

Types of Methods

Multiple Regression Analysis: A statistical procedure for finding the relationship between a dependent variable and two or more independent variables [6].

K-Nearest Neighbor: An instance-based learning method in which a sample is classified by majority vote among its k nearest neighbors, where k is a positive integer [7].

Bayesian Probabilistic Learning: A probabilistic approach that uses Bayes’ rule to predict class membership, assuming the features are independent [8].

Random Forest: An ensemble-based classification technique that combines multiple decision trees through majority voting [9].

Neural Networks and Deep Learning: A model-based learning method that learns from input data using layers of neurons: an input layer, multiple hidden layers, and an output layer [10].

Support Vector Machine (SVM): A statistical method that maps data into a high-dimensional space using a nonlinear kernel and defines a hyperplane there that maximizes the separation between classes [11].

Principal Component Analysis (PCA): A statistical method that uses an orthogonal transformation to convert a set of correlated properties into new uncorrelated variables called principal components [12].

Independent Component Analysis (ICA): A statistical method that separates a multivariate signal into statistically independent additive components [13].
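To make the first entry in the list above concrete, the following sketch fits a multiple regression model of the kind used in QSAR, relating an activity value to two molecular descriptors. It is a minimal illustration that solves the ordinary least squares normal equations by Gauss-Jordan elimination; the descriptor values and activities are synthetic, generated from activity = 1 + 2·x1 − 0.5·x2, and the function names are illustrative.

```python
def fit_multiple_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_p].
    """
    rows = [[1.0] + list(row) for row in features]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda row_index: abs(aug[row_index][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [value / pivot_value for value in aug[col]]
        for other in range(p):
            if other != col:
                factor = aug[other][col]
                aug[other] = [a - factor * b for a, b in zip(aug[other], aug[col])]
    return [aug[i][p] for i in range(p)]


def predict(coefficients, row):
    """Apply the fitted model to one descriptor vector."""
    return coefficients[0] + sum(c * x for c, x in zip(coefficients[1:], row))


# Synthetic data: two hypothetical descriptors per compound and a noise-free activity.
features = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
targets = [2.0, 4.5, 5.0, 7.5, 8.5]

coefficients = fit_multiple_regression(features, targets)
print("coefficients:", [round(c, 6) for c in coefficients])
print("prediction for (6, 2):", round(predict(coefficients, (6, 2)), 6))
```

With real assay data the activities contain noise, so the recovered coefficients would only approximate the underlying relationship, and descriptors that are nearly collinear would make the normal equations ill-conditioned.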

Acknowledgment

Finally, we would like to thank the author and the editor of the Turkish version for their contributions. The review article is available in Turkish on www.bioinforange.com.

https://www.bioinforange.com/bioinforeviews/biyoinformatik/temel-bilgiler/ilac-kesfi-calismalarinda-kemoinformatik-ve-makine-ogrenmesi/

References

1. Willett P. (2016). Special Issue: Chemoinformatics. Molecules (Basel, Switzerland), 21(4), 535. https://doi.org/10.3390/molecules21040535

2. Lo, Y. C., Rensi, S. E., Torng, W., & Altman, R. B. (2018). Machine learning in chemoinformatics and drug discovery. Drug discovery today, 23(8), 1538–1546. https://doi.org/10.1016/j.drudis.2018.05.010

3. Bajorath J. (2017). Compound Data Mining for Drug Discovery. Methods in molecular biology (Clifton, N.J.), 1526, 247–256. https://doi.org/10.1007/978-1-4939-6613-4_14

4. Chen, H., Kogej, T., & Engkvist, O. (2018). Cheminformatics in Drug Discovery, an Industrial Perspective. Molecular informatics, 37(9-10), e1800041. https://doi.org/10.1002/minf.201800041

5. Gini G. (2018). QSAR: What Else?. Methods in molecular biology (Clifton, N.J.), 1800, 79–105. https://doi.org/10.1007/978-1-4939-7899-1_3

6. Marill, K.A. (2004) Advanced statistics: linear regression, part II: multiple linear regression. Acad. Emerg. Med. 11, 94–102

7. Sahigara, F. et al. (2013) Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions. J. Cheminformatics 5, 27

8. Hert, J. et al. (2006) New methods for ligand-based virtual screening: use of data fusion and machine learning to enhance the effectiveness of similarity searching. J. Chem. Inf. Model. 46, 462–470

9. Svetnik, V. et al. (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 43, 1947–1958

10. LeCun, Y. et al. (2015) Deep learning. Nature 521, 436–444

11. Noble, W.S. (2006) What is a support vector machine? Nat. Biotechnol. 24, 1565–1567

12. Akella, L.B. and DeCaprio, D. (2010) Cheminformatics approaches to analyze diversity in compound screening libraries. Curr. Opin. Chem. Biol. 14, 325–330

13. Hyvarinen, A. and Oja, E. (2000) Independent component analysis: algorithms and applications. Neural Netw. 13, 411–430

Bioinfocodes 2021 All Rights Reserved - Mehmet Çalıseki