Large chemical databases are expected to store and search information on millions of molecules, which can occupy terabytes of storage.
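There are two principal techniques for representing chemical structures in digital databases: connection tables, which list atoms and bonds together with their attributes, and linear string notations such as SMILES.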
These approaches have been refined to allow representation of stereochemical differences and charges as well as special kinds of bonding such as those seen in organo-metallic compounds. The principal advantage of a computer representation is the possibility for increased storage and fast, flexible search.
There is no single definition of molecular similarity; however, the concept may be defined according to the application and is often described as the inverse of a measure of distance in descriptor space. For instance, two molecules might be considered more similar to each other than to other molecules if the difference in their molecular weights is smaller. A variety of other measures could be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean measures and non-Euclidean measures depending on whether the triangle inequality holds.
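The idea of similarity as an inverse of distance in descriptor space can be sketched as follows. This is a minimal illustration, not a standard cheminformatics measure: the descriptor vectors are hypothetical, pre-scaled values, and the mapping `1 / (1 + d)` is just one convenient way to turn a distance into a similarity.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity(a, b):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


# Hypothetical, pre-scaled descriptor vectors (illustrative values only).
mol_a = [0.20, 0.50, 0.10]
mol_b = [0.25, 0.45, 0.10]
mol_c = [0.90, 0.10, 0.80]

# mol_a is closer to mol_b than to mol_c, so it is "more similar" to mol_b.
closer_to_b = similarity(mol_a, mol_b) > similarity(mol_a, mol_c)
```

In practice the descriptors would need to be scaled to comparable ranges before combining them, otherwise a single large-valued descriptor such as molecular weight would dominate the distance.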
Chemicals in a database may be clustered into groups of 'similar' molecules. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may be either empirically determined or computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm, which groups molecules on the basis of shared nearest neighbours.
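A simplified version of Jarvis-Patrick clustering might look like the sketch below. It assumes molecules are already encoded as numeric descriptor vectors and uses plain Euclidean distance; the parameter names `k` and `k_min` and the sample points are illustrative choices, and production implementations differ in their details.

```python
import math


def jarvis_patrick(points, k, k_min):
    """Cluster points with a simple shared-nearest-neighbour rule.

    Two points are merged when each appears in the other's list of k
    nearest neighbours and the two lists share at least k_min members.
    Returns one cluster label per point.
    """
    n = len(points)

    # k nearest neighbours of each point (excluding the point itself).
    neighbours = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        neighbours.append(set(others[:k]))

    # Union-find structure used to merge points into clusters.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= k_min:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]


# Two well-separated groups of hypothetical 2-D descriptor vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = jarvis_patrick(points, k=3, k_min=2)
```

The pairwise loop is quadratic in the number of molecules, so this form is only suitable for small sets.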
In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox), which can in turn be inferred semiautomatically from similar combinations of physico-chemical descriptors using QSAR methods.
Registration systems usually enforce uniqueness of the chemicals represented in the database through the use of unique representations. By applying rules of precedence when generating string notations, one can obtain unique ('canonical') string representations such as 'canonical SMILES'. Some registration systems, such as the CAS system, use algorithms that generate unique hash codes to achieve the same objective.
A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific (known) mixture, or racemic. Each of these would be considered a different record in a chemical registry system.
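The combination of unique keys and explicit stereo status could be sketched as below. This is an illustration under stated assumptions: the input string is taken to be already canonical (canonicalisation itself is not shown), SHA-256 is an arbitrary stand-in rather than the hash used by any real registry, and the `Registry` and `StereoStatus` names are hypothetical.

```python
import hashlib
from enum import Enum


class StereoStatus(Enum):
    """Hypothetical set of stereo states a registrar must choose from."""
    UNKNOWN = "unknown"
    KNOWN_MIXTURE = "known mixture"
    RACEMIC = "racemic"
    SPECIFIED = "specified"


class Registry:
    """Toy registry keyed on canonical structure plus stereo status."""

    def __init__(self):
        self._records = {}

    @staticmethod
    def _key(canonical_structure, stereo):
        # The stereo status is part of the key, so the same structure with
        # a different status becomes a different record.
        payload = f"{canonical_structure}|{stereo.value}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def register(self, canonical_structure, stereo):
        """Return (registry_key, is_new) for the submitted structure."""
        key = self._key(canonical_structure, stereo)
        is_new = key not in self._records
        if is_new:
            self._records[key] = (canonical_structure, stereo)
        return key, is_new


reg = Registry()
# The SMILES string below is assumed to be canonical already.
k1, new1 = reg.register("CC(N)C(=O)O", StereoStatus.UNKNOWN)
k2, new2 = reg.register("CC(N)C(=O)O", StereoStatus.RACEMIC)
k3, new3 = reg.register("CC(N)C(=O)O", StereoStatus.UNKNOWN)
```

Here the first two submissions create separate records because their stereo status differs, while the third is recognised as a duplicate of the first.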
Registration systems also preprocess molecules so that trivial differences, such as differences in halogen ions, are not treated as distinct chemicals.
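One possible reading of this preprocessing is the removal of halide counter-ions before comparison. The sketch below assumes structures arrive as dot-separated SMILES fragments and that the ions to ignore are the four listed; both the ion list and the function name are illustrative. It also ignores the edge case where ring-closure digits link fragments across a dot.

```python
# Assumed list of ions to ignore; a real registry's rules would be broader.
HALIDE_IONS = {"[F-]", "[Cl-]", "[Br-]", "[I-]"}


def strip_halide_ions(smiles):
    """Drop halide-ion fragments from a dot-separated SMILES string."""
    fragments = smiles.split(".")
    kept = [f for f in fragments if f not in HALIDE_IONS]
    # If every fragment was a halide ion, return the input unchanged.
    return ".".join(kept) if kept else smiles


# The chloride and bromide salts reduce to the same parent structure.
chloride_salt = strip_halide_ions("C[NH3+].[Cl-]")
bromide_salt = strip_halide_ions("C[NH3+].[Br-]")
```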
The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations.
There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is OpenBabel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle- and PostgreSQL-based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions; for example, a query could search for records having a benzene ring in their structure, represented as a SMILES string in a SMILESCOL column.
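The general idea of extending SQL with a chemical search condition can be imitated with a user-defined function. The sketch below uses SQLite from the Python standard library as a stand-in for an Oracle or PostgreSQL cartridge; the table name `CHEMTABLE` and the function `CONTAINS_SUBSTRUCTURE` are hypothetical, and the matching is a plain substring test, which is not a real substructure search (a cartridge would match molecular graphs).

```python
import sqlite3


def contains_substructure(smiles, pattern):
    """Crude stand-in for substructure search: substring match only."""
    return int(pattern in smiles)


conn = sqlite3.connect(":memory:")
# Register the Python function so that it can be called from SQL.
conn.create_function("CONTAINS_SUBSTRUCTURE", 2, contains_substructure)

conn.execute("CREATE TABLE CHEMTABLE (NAME TEXT, SMILESCOL TEXT)")
conn.executemany(
    "INSERT INTO CHEMTABLE VALUES (?, ?)",
    [("toluene", "Cc1ccccc1"), ("ethanol", "CCO"), ("phenol", "Oc1ccccc1")],
)

# Find records whose SMILES string contains an aromatic six-membered ring
# written as 'c1ccccc1'.
rows = conn.execute(
    "SELECT NAME FROM CHEMTABLE "
    "WHERE CONTAINS_SUBSTRUCTURE(SMILESCOL, 'c1ccccc1') ORDER BY NAME"
).fetchall()
benzene_hits = [name for (name,) in rows]
conn.close()
```

A substring test misses the same ring written with different ring-closure digits or atom ordering, which is why real systems rely on graph-based matching.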
Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC. Work is under way to establish a unique IUPAC standard (see InChI).