InChI

Introduction

Since your earliest chemistry courses, you have encountered many ways to represent chemicals. A few of them are listed here.

Trivial names (aspirin)
Systematic names (2-acetyloxybenzoic acid)
Formula (C₉H₈O₄)
Images

Now we will look at several other ways to represent chemicals, especially connection tables and line notation.

Chemical Representation for Cheminformatics

Very often, data and information about chemical compounds refer directly to molecular structure (for example, a 2D structural formula or 3D atomic coordinates for a particular conformation of a compound), or are tied to a molecular structure (for example, physical properties of a compound, which you identify by its structural formula). The notion of indexing, ordering, searching, and retrieving information using molecular structures originated within the domain of modern chemistry.

Almost all chemists are engaged in communication tasks for recording, searching, visualizing, and publishing molecular structures. Most forms of chemical representation were developed with these uses in mind. Cheminformatics involves storing, finding, and analyzing these structures using the data processing power of computers to match chemical compounds with bibliographic publications, measured properties, synthetic procedures, spectra, and computational studies. To do this work, computers must use chemical representation to identify, exchange, and validate information about chemical compounds.

In order for (human) chemists to trust the insights of cheminformatics, it is important to understand the way computers store and analyze chemical structure, the methods used by computer programs, and the results they produce. Therefore, cheminformatics depends on the use of representations of molecular structures and related data that are understandable both to human scientists and to machine algorithms.

Formulation of chemical structure data

Interacting with a machine is a form of communication. How does communication between chemists differ from communication between a chemist and a machine? In cheminformatics, you are working within a system governed by strict rules that are explicitly defined. If you know the rules, you can make the system work for you. If you do not know the rules of a particular form of representation, sometimes features designed to satisfy requirements in one context will appear as errors in another context.

If one chemist recommended to another that a reaction be carried out using “chloroform” as the solvent, this would generally be a successful exercise in communication. For all practical purposes, the word is understood by all chemists and is unambiguous. However, since “chloroform” is a so-called trivial name, there is no rule for converting it into the chemical structure it represents, and a machine cannot participate in this exchange of information unless it has been explicitly told which chemical structure the word represents, in a format the machine can work with.

A more descriptive way to communicate the composition of chloroform is its chemical formula, CHCl₃. A computer program could apply the basic rules of molecular structure to determine that the substance has 5 atoms: 1 carbon, 1 hydrogen, and 3 chlorine. Assembling these into a molecule with bonds can be based on valence rules, which identify 4 of the atoms as normally monovalent and one as normally tetravalent. It is fairly straightforward to write an algorithm that joins the atoms in the most obvious way, which also turns out to be correct.
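The reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the valence table `TYPICAL_VALENCE` and the helper names `parse_formula` and `assemble_star` are invented for this example, and the assembly step handles only the simplest case of one multivalent atom surrounded by monovalent atoms.

```python
import re

# Assumed "normal" valences for a few elements (illustrative, not exhaustive).
TYPICAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "Cl": 1}

def parse_formula(formula):
    """Expand a simple formula into a list of atoms, e.g. 'CHCl3' -> C, H, Cl, Cl, Cl."""
    atoms = []
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        atoms.extend([symbol] * (int(count) if count else 1))
    return atoms

def assemble_star(atoms):
    """Bond every monovalent atom to the single multivalent atom, if valences allow."""
    centers = [i for i, symbol in enumerate(atoms) if TYPICAL_VALENCE[symbol] > 1]
    if len(centers) != 1:
        raise ValueError("this sketch only handles exactly one multivalent atom")
    center = centers[0]
    others = [i for i in range(len(atoms)) if i != center]
    if len(others) != TYPICAL_VALENCE[atoms[center]]:
        raise ValueError("the valence of the central atom is not satisfied")
    return [(center, i) for i in others]

atoms = parse_formula("CHCl3")
bonds = assemble_star(atoms)
print(atoms)  # list of element symbols
print(bonds)  # list of (atom index, atom index) pairs
```

For chloroform the carbon is the only atom with a valence above one, so the only sensible assembly is a carbon bonded to the four remaining atoms. The same function deliberately refuses formulas with more than one multivalent atom, which is where the ambiguities discussed next begin.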

Beyond such tiny and simple molecules, difficulties soon arise. Some of these ambiguities affect human chemists the same way they affect machines. Consider the molecular formula C₃H₆O, which is associated with multiple reasonable structures, including a ketone, an aldehyde, a cyclic alcohol, oxygenated alkenes, and cyclic ethers, one of which exists as two enantiomers:

Figure 2.1.1: Different ways of drawing C₃H₆O (image credit: Evan Hepler-Smith)

Ambiguous representations can refer to more than one chemical entity. This is true of most chemical names when they are used non-systematically: “octane”, for example, is often used as a common term for all saturated hydrocarbons with eight carbon atoms rather than systematically for the straight-chain isomer only. Empirical and molecular formulas are also often ambiguous.

In an unambiguous representation system, each name or formula refers to exactly one chemical entity, usually in a way that allows you to draw a structural formula. However, each chemical entity may be represented by more than one name or formula. A canonical form is a completely unique representation within a system. For example, “diethyl ketone” and “3-pentanone” are unambiguous names: each represents one and only one compound. However, since they represent the same compound, they are not unique names. Within the IUPAC preferred name system (see below), “3-pentanone” is a canonical name: a unique and unambiguous representation of this compound.
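The distinction between ambiguous, unambiguous, and unique names can be made concrete with a small lookup table. The sketch below is purely illustrative: the table `NAME_TO_STRUCTURES`, its structure labels, and the helper functions are hypothetical, and the list of structures for the loose use of “octane” is deliberately truncated.

```python
# Hypothetical table: each name maps to the set of structures it can denote.
# The structure labels are plain strings chosen for this example only.
NAME_TO_STRUCTURES = {
    "octane": {"n-octane", "2-methylheptane", "2,2,4-trimethylpentane"},  # loose usage, truncated
    "diethyl ketone": {"CH3CH2COCH2CH3"},
    "3-pentanone": {"CH3CH2COCH2CH3"},
}

def is_unambiguous(name):
    """A name is unambiguous if it denotes exactly one structure."""
    return len(NAME_TO_STRUCTURES[name]) == 1

def names_for(structure):
    """All unambiguous names in the table that denote the given structure."""
    return sorted(n for n, s in NAME_TO_STRUCTURES.items() if s == {structure})

def is_unique(name):
    """A name is unique if it is unambiguous and no other name denotes its structure."""
    if not is_unambiguous(name):
        return False
    (structure,) = NAME_TO_STRUCTURES[name]
    return names_for(structure) == [name]

print(is_unambiguous("octane"))       # ambiguous in this table
print(is_unambiguous("3-pentanone"))  # unambiguous
print(is_unique("3-pentanone"))       # not unique: "diethyl ketone" names the same structure
```

A canonical naming system amounts to adding one more rule on top of this table: for every structure, pick exactly one of the names returned by `names_for` and always use it.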

Note that, since canonical names are necessarily canonical within a system, they may not work correctly if you are interested in structural information that the system does not address, or if you lack structural information that the system requires. For example, within a system that does not address stereochemistry, the different enantiomers of a chiral compound will have the same “canonical” representation. Within a system that requires the specification of stereochemistry, on the other hand, a choice will have to be made between stereospecific canonical representations. If you are working with a racemic mixture or a compound of unknown stereo configuration, this can lead to misrepresentation and misunderstanding.

A chemical structure representation contains two types of information: explicit and implicit. Explicit information is that which is represented directly in a data structure and should at minimum contain what would not otherwise be known, such as the specific atom of a carbon skeleton to which a substituent is attached. Implicit information is what you (or a computer) can figure out from a data structure, given some knowledge of general principles and some work.

In general, data structures that contain less explicit information are simpler and more compact, but require more computation to extract chemical conclusions. Data structures that contain more explicit information take up more space and have a greater risk of containing inconsistencies, but can be analyzed more quickly in a wider variety of ways.
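As a small illustration of this trade-off, a structure can store only the non-hydrogen atoms and their bonds explicitly and leave the hydrogens implicit, to be computed when needed. The sketch below assumes a simple table of normal valences (`TYPICAL_VALENCE`) and neutral atoms; the names are invented for this example.

```python
# Assumed normal valences for neutral atoms (illustrative only).
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2, "Cl": 1}

# Explicit information: heavy atoms and the bonds between them (ethanol, C-C-O).
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]  # (atom index, atom index, bond order)

def implicit_hydrogens(atoms, bonds):
    """Derive the implicit hydrogen count on each atom from its unused valence."""
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [TYPICAL_VALENCE[symbol] - u for symbol, u in zip(atoms, used)]

h_counts = implicit_hydrogens(atoms, bonds)
print(h_counts)       # hydrogens on each heavy atom
print(sum(h_counts))  # total hydrogens in the molecule
```

The compact form stores three atoms and two bonds instead of nine atoms and eight bonds, at the cost of a calculation (and an assumption about valence) every time the hydrogens are needed.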

To automate the functions of chemical data, the data structure must be systematically defined and consistently applied. These definitions form part of what constitutes explicit information that an algorithm can easily identify and analyze. The balance of the level of explicit information can also affect the ambiguity of a system and the ability to accurately exchange chemical structures between systems. These are especially important considerations for operations that span a significant part of the corpus of reported chemical compounds (well over 100 million), beyond the scale at which human validation of results is possible.

Chemical structure representation

Structural formula

In general, the most effective way to communicate with another chemist about the structure of a compound is to draw its structural formula. A structural formula is any formula that indicates the connectivity of a compound, that is, which of its atoms are joined to each other by covalent bonds. Unfortunately, structural formulas are most useful for small molecules, since they become complex as the size of the molecule increases. Moreover, a computer does not “see” a formula the way a human does, but rather “reads” it as a form of data. We will look at two data structures that computers can “read”: connection tables and line notations.

Systematic names

Systematic names describe the structural formula of compounds. If you know the rules and the vocabulary, you should be able to write a name from a structural formula and vice versa. Chemists have developed several ways to translate formulas into names, so it is almost always possible to write more than one systematic name for a given compound.

IUPAC (International Union of Pure and Applied Chemistry) nomenclature is a well-known international system of chemical names that is generally systematic but flexible, allowing the use of certain well-established trivial names.

Since systematic IUPAC names are made according to formalized rules, in principle they could be used by both humans and computers.

However, IUPAC names are often quite difficult for chemists to read, let alone write, and the rules are not canonical, which gives rise to numerous different options for naming each compound.

IUPAC has introduced further rules to determine canonical Preferred IUPAC Names (PINs), which are oriented toward making systematic names more easily machine-readable.

Semantic technologies further enable the systematic classification and organization of scientific terms, including descriptions of chemical structures, such as those provided by ChEBI (Chemical Entities of Biological Interest). ChEBI describes small molecular entities based on the nomenclature, symbolism, and terminology endorsed by IUPAC and the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). This dataset is highly curated both by human experts and by machine processes, can be searched openly and programmatically, and includes full references to original authoritative sources.

Graphical Visualizations

Cheminformatics leverages the mathematical discipline of graph theory when representing and comparing chemical structures. A graph represents pairwise relationships between objects: each object is a node (also called a vertex or point), and each connection between two nodes is an edge (also called a line). In chemistry the atoms are the vertices and the bonds are the edges. You also rely on graph theory when you use Google Maps to choose a route between two cities, where the cities are the vertices and the roads connecting them are the edges.
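A minimal sketch of this idea in Python stores a graph as an adjacency list and finds a route between two vertices with a breadth-first search. The example graph is the non-hydrogen skeleton of acetone; the function names and the atom numbering are assumptions made for this illustration.

```python
from collections import deque

def build_adjacency(n_vertices, edges):
    """Turn a list of undirected edges into an adjacency list."""
    adjacency = {v: [] for v in range(n_vertices)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the vertices on one shortest path, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:
                path.append(v)
                v = previous[v]
            return path[::-1]
        for w in adjacency[v]:
            if w not in previous:
                previous[w] = v
                queue.append(w)
    return None

# Acetone skeleton: atom 0 = C, 1 = C (carbonyl), 2 = O, 3 = C.
symbols = ["C", "C", "O", "C"]
adjacency = build_adjacency(len(symbols), [(0, 1), (1, 2), (1, 3)])
path = shortest_path(adjacency, 0, 3)
print(path)                        # atom indices along the route
print([symbols[i] for i in path])  # the same route as element symbols
```

The same two functions work unchanged if the vertices are cities and the edges are roads, which is the point of treating molecules as graphs: general graph algorithms become available for chemical questions.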

Figure 2.1.2: A map of the bridges of Königsberg (left; credit: Maksim, Wikimedia Commons), the Königsberg bridge problem drawn as a graph (center; credit: Riojajar, Wikimedia Commons), and some simple molecules and their graphs (right; credit: Office of Naval Research, Technical Report No. 41, An Introduction to Graph Theory, D. H. Rouvray).

In 1736 Leonhard Euler laid the foundations of graph theory when he tackled the Königsberg bridge problem: could one take a walk through the city of Königsberg that crossed each of its seven bridges exactly once? He proved that it could not be done. Mathematically, Euler treated the landmasses as the nodes and the bridges as the edges joining them. In 1878 the mathematician James Joseph Sylvester introduced the concept of the chemicograph in his Nature article “Chemistry and Algebra”, and the same year he published the chemicographs of Figure 2.1.3 in the American Journal of Mathematics (volume 1, no. 1) article “On an application of the new atomic theory to the graphical representation of the invariants and covariants of binary quantics, with three appendices”. In Sylvester’s chemicographs, the atoms became the nodes and the covalent bonds the edges; note that a double or triple bond was treated as two or three edges connecting the same pair of nodes (atoms). The relationship between these graphs and the Lewis dot structures covered in high school and first-year chemistry is easy to see, but as we will see, computers can handle structures far more complicated than we can draw on paper.

Figure 2.1.3: The first eleven of forty-five chemicographs from Sylvester’s 1878 article “On an application of the new atomic theory to the graphical representation of the invariants and covariants of binary quantics, with three appendices”.
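Euler's result for the Königsberg bridges can be checked with a short calculation. The sketch below uses the standard criterion that a connected graph has a walk crossing every edge exactly once only if zero or two of its vertices have an odd number of edges. The landmass labels A to D are assumptions for this example (A is the central island), and the code assumes the graph is connected rather than testing for it. Parallel bridges are simply repeated edges, the same device Sylvester used for double and triple bonds.

```python
from collections import Counter

# Seven bridges between four landmasses; repeated pairs are parallel bridges.
bridges = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

def odd_degree_vertices(edges):
    """Return the vertices touched by an odd number of edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)

def has_euler_walk(edges):
    """Euler's criterion; assumes the graph is connected."""
    return len(odd_degree_vertices(edges)) in (0, 2)

print(odd_degree_vertices(bridges))  # landmasses with an odd number of bridges
print(has_euler_walk(bridges))       # whether the walk is possible
```

All four landmasses have an odd number of bridges, so the criterion fails and no such walk exists.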

One of the advantages of graph theory is that it can be used to determine whether two graphs have a one-to-one mapping of nodes and edges, that is, whether they are isomorphic (structurally identical). Likewise, if a subgraph of one graph is isomorphic to a subgraph of another, those parts are identical. Although this introductory course will not delve deeply into graph theory, it is important to understand the basic data structures used by graph-theory-based algorithms, because we will use those algorithms.

Chemical graphs on computers

Connection tables

A connection table does for computers what systematic nomenclature does for human chemists: it organizes the structural information defined in a molecular graph into a machine-readable form. The difference is that computers can read, sort, search, and group connection tables much faster than humans can work with systematic names or any other type of formula or notation. Connection tables basically provide information about the atoms in a molecule, where the bonds are, and what types of bonds there are. They are treated in more depth in section 2.2, and there are many types of structural data files that use connection tables (section 2.5). In addition to connection tables, other common forms of machine-readable representations are graphical visualizations, line notations, and other descriptive forms such as nomenclature.
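To make the idea concrete, here is a minimal sketch of a connection table for chloroform as two Python lists, one of atoms and one of bonds. The layout, the class names `Atom` and `Bond`, and the 1-based atom numbering are assumptions made for this example; they do not reproduce any particular structure file format.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    index: int    # position of the atom in the table (1-based here)
    element: str  # element symbol

@dataclass
class Bond:
    begin: int    # index of the first atom
    end: int      # index of the second atom
    order: int    # 1 = single, 2 = double, 3 = triple

# Atom block and bond block for chloroform.
atoms = [Atom(1, "C"), Atom(2, "H"), Atom(3, "Cl"), Atom(4, "Cl"), Atom(5, "Cl")]
bonds = [Bond(1, 2, 1), Bond(1, 3, 1), Bond(1, 4, 1), Bond(1, 5, 1)]

def neighbors(atom_index, bonds):
    """List the atoms bonded to the given atom by scanning the bond block."""
    result = []
    for bond in bonds:
        if bond.begin == atom_index:
            result.append(bond.end)
        elif bond.end == atom_index:
            result.append(bond.begin)
    return result

print(neighbors(1, bonds))  # atoms attached to the carbon
print(neighbors(3, bonds))  # atoms attached to the first chlorine
```

Everything a program needs to rebuild the molecular graph is in these two lists: which atoms exist, which pairs are bonded, and with what bond order.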

Chemists usually think of chemical structure in 2D, even though molecules exist in 3D physical space. Most chemical data systems offer 2D and 3D visualizations that human chemists can use in research and analysis. The 2D coordinates stored in a connection table can be used to infer and display chemical information, including the basic structural formula and additional information, such as the E/Z geometry of alkene-like double bonds, the cis/trans isomerism of ligands in a square planar metal complex, or the relative positions of substituents on a cyclic alkane. The 2D representations are designed to mimic the experience of drawing structural formulas on paper. Humans often convert these electronic drawings into image files for use in publications and presentations, but these image files (JPEG, GIF, PNG, ...) are no longer directly connected to the chemical data and are therefore not machine-readable.
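As an illustration of how geometry can be inferred from stored 2D coordinates, the sketch below decides whether two substituents lie on the same side of a double bond by comparing the signs of two cross products. The coordinates and function names are invented for this example, and the result is only a same-side or opposite-side answer: assigning a formal E or Z label would additionally require ranking the substituents by priority, which is not attempted here.

```python
def cross(u, v):
    """z-component of the cross product of two 2D vectors."""
    return u[0] * v[1] - u[1] * v[0]

def same_side(a, b, sub_a, sub_b):
    """True if sub_a (attached to a) and sub_b (attached to b) lie on the
    same side of the double bond a=b, judged from 2D coordinates."""
    axis = (b[0] - a[0], b[1] - a[1])
    side_a = cross(axis, (sub_a[0] - a[0], sub_a[1] - a[1]))
    side_b = cross(axis, (sub_b[0] - b[0], sub_b[1] - b[1]))
    if side_a == 0 or side_b == 0:
        raise ValueError("a substituent lies on the bond axis; geometry is undefined")
    return (side_a > 0) == (side_b > 0)

# Hypothetical drawing coordinates for the two double-bonded carbons
# and one substituent on each.
a, b = (0.0, 0.0), (1.3, 0.0)
sub_a = (-0.7, 1.0)

print(same_side(a, b, sub_a, (2.0, 1.0)))   # both substituents drawn above the bond
print(same_side(a, b, sub_a, (2.0, -1.0)))  # substituents drawn on opposite sides
```

This only works while the coordinates are still attached to the atoms. Once the drawing has been exported to an image file, the program has pixels rather than atom positions, and this kind of inference is no longer possible.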

The 3D coordinates (x, y, z) can also be stored for each atom and used to display the conformation of a molecule. These coordinates can be determined experimentally (usually by X-ray crystallography) or calculated (using force fields, quantum chemistry, molecular dynamics, or composite models such as docking). Understanding the actual shape of a molecule, whether in solution, in vacuum, or at the binding site of a protein, opens up a completely new domain of computational chemistry. Most molecules have a certain flexibility, and even if a given conformation is the most stable, there are often several competing forms to take into account. Knowing how a particular set of coordinates was determined is crucial for using them intelligently for cheminformatics purposes.

Line notations

Line notations represent chemical structures as a linear string of symbolic characters that can be interpreted using sets of systematic rules; they are treated in section 2.3. A line notation can be thought of as a nomenclature for computers: just as a human can read an IUPAC name and draw the molecule, a computer can “read” a line notation (or a connection table) and build the molecule. Many forms of line notation are readable by both machines and humans.
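The following sketch shows what “reading” a line notation can look like. It parses a deliberately tiny, SMILES-like subset: a handful of element symbols, `=` and `#` for double and triple bonds, and parentheses for branches. It is a toy under stated assumptions, not a SMILES parser: rings, charges, aromaticity, and stereochemistry are not handled, and the function name is invented for this example.

```python
def parse_toy_line_notation(text):
    """Parse a tiny SMILES-like subset into (atoms, bonds).

    Supported: C, N, O, F, Cl, Br; '=' and '#' bond symbols; parentheses
    for branches. Anything else raises ValueError.
    """
    atoms, bonds = [], []
    previous = None       # index of the atom the next atom will bond to
    branch_stack = []     # atoms to return to when a branch closes
    pending_order = 1     # bond order for the next bond
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            branch_stack.append(previous)
            i += 1
        elif ch == ")":
            previous = branch_stack.pop()
            i += 1
        elif ch in "=#":
            pending_order = 2 if ch == "=" else 3
            i += 1
        else:
            symbol = text[i:i + 2] if text[i:i + 2] in ("Cl", "Br") else ch
            if symbol not in ("C", "N", "O", "F", "Cl", "Br"):
                raise ValueError(f"unsupported token at position {i}: {ch!r}")
            atoms.append(symbol)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, pending_order))
            previous = current
            pending_order = 1
            i += len(symbol)
    return atoms, bonds

acetone = parse_toy_line_notation("CC(=O)C")
chloroform = parse_toy_line_notation("ClC(Cl)Cl")
print(acetone)
print(chloroform)
```

The output is an atom list and a bond list, which is exactly the information a connection table holds. The string is more compact because the bonds between neighboring characters, the single bond orders, and the hydrogens are all implicit.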

Line notation is widely used in cheminformatics because:

Many computational processes work more efficiently on data structured as linear strings than on data structured as tables.
Line notations can be reasonably readable for the human chemists who design functions with these tools.

Linear representations are especially well suited to many identification and characterization functions, such as determining:

whether molecules are the same;
how similar they are, according to some metric;
whether a molecular entity is a substructure of another;
whether two molecules are related by a specific transformation;
what happens when molecules are cut into pieces and grafted together in different positions.

In these and other cheminformatics applications, line notations have key advantages for speed and automation, especially when you need to handle a large number of structures (for example, when searching a large database).
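The first question in the list above, whether two molecules are the same, shows why a canonical linear string is so convenient: once every structure has exactly one string, identity becomes a plain string comparison. The sketch below builds such a string by brute force, trying every atom numbering and keeping the smallest encoding. The function name and the output format are invented for this example, and because the number of numberings grows factorially with the number of atoms, the approach is usable only for very small graphs. Practical canonical notations rely on much more efficient algorithms.

```python
from itertools import permutations

def canonical_string(atoms, bonds):
    """Brute-force canonical form: the smallest encoding over all atom orderings.

    atoms: list of element symbols; bonds: list of (index, index, order).
    """
    best = None
    for order in permutations(range(len(atoms))):
        position = {old: new for new, old in enumerate(order)}
        symbols = tuple(atoms[old] for old in order)
        edges = tuple(sorted(
            (min(position[a], position[b]), max(position[a], position[b]), bond_order)
            for a, b, bond_order in bonds
        ))
        candidate = (symbols, edges)
        if best is None or candidate < best:
            best = candidate
    symbols, edges = best
    return " ".join(symbols) + " | " + " ".join(f"{a}-{b}:{o}" for a, b, o in edges)

# Ethanol skeleton entered with two different atom numberings,
# and dimethyl ether, which has the same atoms but different connectivity.
ethanol_1 = canonical_string(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_2 = canonical_string(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
ether = canonical_string(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(ethanol_1 == ethanol_2)  # same molecule, different input numbering
print(ethanol_1 == ether)      # same formula, different molecule
```

With canonical strings in hand, checking whether a structure is already in a database of millions of compounds reduces to a lookup on the string, with no graph comparison needed at search time.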

Examples of line notations include the Wiswesser Line Formula Notation (WLN), the Sybyl Line Notation (SLN), and the Representation Of Structure Diagram Arranged Linearly (ROSDAL). Currently, the most widely used line notations are the Simplified Molecular Input Line Entry System (SMILES) and the IUPAC International Chemical Identifier (InChI). In this class we will focus on the SMILES and InChI line notations.

Resources

Representing Small Molecules on Computers
