Fundamentals
Introduction
Molecular similarity permeates much of our understanding and rationalization of chemistry. This has become especially evident in the current era of data-intensive chemical research, where similarity measures serve as the backbone of many supervised and unsupervised Machine Learning (ML) procedures.
Similarity is essential to human cognition because it lets us generalize features across a category and sort objects into sets according to a characteristic they share. Whether knowledge about one object can be inferred from a similar one depends on the attribute being compared and on how that attribute relates to the property of interest. Similarity is therefore partly subjective: it depends on the particular trait studied. These ideas are closely related to the similarity-property principle, widely applied in medicinal chemistry and in other scientific areas, which states that “similar structural features give rise to similar biological properties/activities”.
In chemistry, similarity refers to the functionalities, structures, composition, spatial arrangement, biological activity, and physicochemical properties shared by different chemical compounds, biological systems, macromolecular complexes, and other entities. Similarity has become a cornerstone of cheminformatics, which makes it of great interest to chemists and pharmacists, and it also appears in various other domains (see the following sections).
Molecular similarity (also called chemical similarity or chemical structure similarity) is a fundamental concept in cheminformatics. It plays an important role in computational methods for predicting the properties of chemical compounds and for designing chemicals with desired properties. The underlying assumption in these methods is that structurally similar molecules are likely to have similar biological and physicochemical properties (commonly called the similarity principle). The concept is easy to understand, but there is no single mathematical definition of molecular similarity that everyone agrees on. As a result, a very large number of methods have been proposed to quantify it, and the choice among them depends on the application.
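To make the idea of quantifying similarity concrete, the sketch below computes the Tanimoto (Jaccard) coefficient, one commonly used measure among many, on sets of structural features. The feature sets and the names mol_x, mol_y, and mol_z are hypothetical and hand-assigned for illustration; in practice a cheminformatics toolkit would derive such features from the structure.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient: shared features / all distinct features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical, hand-assigned feature sets (not computed by any toolkit).
mol_x = {"aromatic_ring", "ester", "tertiary_amine", "primary_amine"}
mol_y = {"aromatic_ring", "ester", "tertiary_amine", "bicyclic_ring"}
mol_z = {"alkyl_chain"}

sim_xy = tanimoto(mol_x, mol_y)  # 3 shared features out of 5 distinct ones
sim_xz = tanimoto(mol_x, mol_z)  # no shared features

print(f"similarity(x, y) = {sim_xy:.2f}")
print(f"similarity(x, z) = {sim_xz:.2f}")
```

A different choice of features, or a different coefficient, would give different numbers for the same pair of molecules, which is one reason no single definition of molecular similarity exists.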
Similar-Structure, Similar-Property Principle
The Similar-Structure, Similar-Property Principle is the fundamental assertion that similar molecules will tend to exhibit similar properties. These properties can be physical (e.g., boiling point) or biological (e.g., activity against a target).
Example 1: Hexane and heptane should have similar boiling points and water solubility.
Example 2: Cocaine and procaine are both local anesthetics.
Quantitative Structure-Property Relationships (QSPR) and Quantitative Structure-Activity Relationships (QSAR) use statistical models to relate a set of predictor values to a response variable. Molecules are described using a set of descriptors, and mathematical relationships are then developed to explain observed properties. In QSPR and QSAR, physico-chemical properties or theoretical descriptors of chemicals are used to predict either a physical property (QSPR) or a biological outcome (QSAR).
In either case, a set of molecules with known properties or activities is used as a training set from which a statistical model is derived. An external test set is then used to validate the model. The test set consists of other molecules with known properties that were excluded from the training set. After the model is validated, it can be used to predict properties or activities of molecules outside both sets. One caveat: new molecules cannot be too different from the ones used to build and validate the model, or the predictions become unreliable.
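The workflow above can be sketched with a deliberately tiny QSPR model. This is a minimal illustration under stated assumptions, not a recommended modeling procedure: the only descriptor is the number of carbon atoms in an n-alkane, the model is a straight line fitted by least squares, the boiling points are approximate rounded values, and the range check in in_domain is a crude stand-in for a real applicability-domain test. All function and variable names are hypothetical.

```python
# Approximate normal boiling points (deg C) of n-alkanes, keyed by carbon count.
# Rounded values used only for illustration; consult a reference for real work.
BOILING_POINTS = {1: -161.5, 5: 36.1, 6: 68.7, 7: 98.4, 8: 125.7, 9: 150.8}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Training set: pentane, hexane, octane, nonane.
train_x = [5, 6, 8, 9]
train_y = [BOILING_POINTS[n] for n in train_x]
slope, intercept = fit_line(train_x, train_y)


def predict(n_carbons):
    """Predicted boiling point (deg C) from the fitted line."""
    return slope * n_carbons + intercept


def in_domain(n_carbons):
    """Crude check: is the descriptor inside the range seen during training?"""
    return min(train_x) <= n_carbons <= max(train_x)


# External test molecule (heptane) was excluded from the training set.
test_error = abs(predict(7) - BOILING_POINTS[7])

# Methane lies far outside the training range, so the model extrapolates.
far_error = abs(predict(1) - BOILING_POINTS[1])

print(f"heptane: in domain = {in_domain(7)}, abs. error = {test_error:.1f} deg C")
print(f"methane: in domain = {in_domain(1)}, abs. error = {far_error:.1f} deg C")
```

Heptane resembles the training molecules and is predicted with a small error, while methane falls outside the range the model has seen and is predicted poorly, which illustrates the caveat about molecules that differ too much from the training data.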
Molecular Descriptors
If we want to develop a computational model to predict properties, we need to describe molecules in ways that can be tied to a biological or physical property. There are many ways to represent organic molecules.