Chemistry

Molecular

MolecularGraphEncoder

class molgraph.chemistry.MolecularGraphEncoder(molgraph.chemistry.BaseMolecularGraphEncoder)

Molecular graph encoder, encoding molecular graphs as GraphTensor.

Parameters

atom_encoder (Featurizer, Tokenizer) – The atom encoder to use.
bond_encoder (Featurizer, Tokenizer, None) – The bond encoder to use. Default to None.
molecule_from_string_fn (callable) – A function that produces an RDKit molecule object from some input, e.g. SMILES, InChI or SDFs. Default to chemistry.molecule_from_string.
positional_encoding_dim (int, None) – The dimension of the positional encoding. If None, positional encoding will not be used. Default to 16.
self_loops (bool) – Whether self loops should be added to the molecular graph. Default to False.
auxiliary_encoders (dict[str, callable], None) – Additional encoders used to compute additional fields for the molecular graph. The outer dimension of each encoder's output should match the outer dimension of the output of either the atom encoder or the bond encoder. Default to None.
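
As a rough illustration of what self_loops changes, the sketch below appends one edge from each atom to itself to a directed edge list. It is a plain-Python stand-in, not MolGraph's implementation; the helper name add_self_loops and the position of the added edges in the list are assumptions.

```python
def add_self_loops(num_nodes, edge_src, edge_dst):
    """Return copies of the edge lists with one (i, i) edge appended per node."""
    loops = list(range(num_nodes))
    return list(edge_src) + loops, list(edge_dst) + loops


# Propane ('CCC'): atoms 0-1-2, with each bond stored once per direction.
src, dst = add_self_loops(3, [0, 1, 1, 2], [1, 0, 2, 1])
print(list(zip(src, dst)))
```

A self loop has no underlying RDKit bond, which is why the bond encoder is handed None for such edges.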

Examples:

Generate a molecular graph with featurizers:

>>> # Define atom featurizer (to produce numerical encoding of atoms)
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define bond featurizer (to produce numerical encoding of bonds)
>>> bond_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.BondType(),
...     # ...
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=atom_featurizer,
...     bond_encoder=bond_featurizer,
...     positional_encoding_dim=10,
...     self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6, 119), dtype=float32>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8, 4), dtype=float32>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)
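
The shapes above follow from how several molecules are packed into one graph: 'CCC' and 'CCO' each have three heavy atoms and two bonds, giving six nodes and, if every bond is stored once per direction, eight edges. The sketch below reproduces that bookkeeping in plain Python. The helper merge_graphs and the hand-written bond lists are illustrative assumptions, and the edge ordering produced by MolGraph may differ.

```python
def merge_graphs(molecules):
    """Pack (num_atoms, bonds) pairs into one disjoint graph.

    Each undirected bond (i, j) becomes two directed edges, and atom
    indices are shifted by the number of atoms already packed.
    """
    sizes, edge_src, edge_dst = [], [], []
    offset = 0
    for num_atoms, bonds in molecules:
        sizes.append(num_atoms)
        for i, j in bonds:
            edge_src += [i + offset, j + offset]
            edge_dst += [j + offset, i + offset]
        offset += num_atoms
    return sizes, edge_src, edge_dst


# 'CCC' and 'CCO': three heavy atoms in a chain, two bonds each.
molecules = [(3, [(0, 1), (1, 2)]), (3, [(0, 1), (1, 2)])]
sizes, edge_src, edge_dst = merge_graphs(molecules)
print(sizes)          # [3, 3]
print(len(edge_src))  # 8
print(edge_src)       # [0, 1, 1, 2, 3, 4, 4, 5]
```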

Generate a molecular graph with tokenizers:

>>> # Define atom tokenizer (to produce token encoding of atoms)
>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define bond tokenizer (to produce token encoding of bonds)
>>> bond_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.BondType(),
...     # ...
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=atom_tokenizer,
...     bond_encoder=bond_tokenizer,
...     positional_encoding_dim=10,
...     self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6,), dtype=string>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8,), dtype=string>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)

Obtain numerical encodings of atoms (node_feature) and bonds (edge_feature) with the EmbeddingLookup layer. This is only necessary when tokenizers are used to compute node_feature and edge_feature:

>>> # Define atom tokenizer (to produce token encoding of atoms)
>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...    molgraph.chemistry.features.Symbol(),
...    molgraph.chemistry.features.Hybridization()
... ])
>>> # Define bond tokenizer (to produce token encoding of bonds)
>>> bond_tokenizer = molgraph.chemistry.Tokenizer([
...    molgraph.chemistry.features.BondType(),
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...    atom_encoder=atom_tokenizer,
...    bond_encoder=bond_tokenizer,
...    positional_encoding_dim=10,
...    self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> # Define embedding layers
>>> node_embedding = molgraph.layers.EmbeddingLookup(
...    feature='node_feature', output_dim=16)
>>> edge_embedding = molgraph.layers.EmbeddingLookup(
...    feature='edge_feature', output_dim=8)
>>> # Adapt embedding layers
>>> node_embedding.adapt(graph_tensor)
>>> edge_embedding.adapt(graph_tensor)
>>> # Build model
>>> model = tf.keras.Sequential([
...    node_embedding,
...    edge_embedding,
... ])
>>> # Pass GraphTensor to model
>>> graph_tensor = model(graph_tensor)
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6, 16), dtype=float32>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8, 8), dtype=float32>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)
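
Conceptually, adapting an embedding layer amounts to collecting the distinct tokens into a vocabulary, and calling it amounts to replacing each token with the row of a weight table at that token's index. The sketch below shows the idea in plain Python with random, untrained weights. It is not the implementation of molgraph.layers.EmbeddingLookup; the class TinyEmbeddingLookup, the choice to reserve index 0 for unseen tokens, and the weight initialisation are all assumptions.

```python
import random


class TinyEmbeddingLookup:
    """Toy token-to-vector lookup (hypothetical, for illustration only)."""

    def __init__(self, output_dim, seed=0):
        self.output_dim = output_dim
        self.rng = random.Random(seed)
        self.vocabulary = {}
        self.table = []

    def adapt(self, tokens):
        # Index 0 is reserved for tokens not seen during adapt().
        self.vocabulary = {t: i + 1 for i, t in enumerate(sorted(set(tokens)))}
        self.table = [
            [self.rng.uniform(-0.05, 0.05) for _ in range(self.output_dim)]
            for _ in range(len(self.vocabulary) + 1)
        ]

    def __call__(self, tokens):
        return [self.table[self.vocabulary.get(t, 0)] for t in tokens]


# Five sp3 carbons and one sp3 oxygen, as in 'CCC' and 'CCO'.
tokens = ['Sym:C|Hyb:SP3'] * 5 + ['Sym:O|Hyb:SP3']
layer = TinyEmbeddingLookup(output_dim=16)
layer.adapt(tokens)
embedded = layer(tokens)
print(len(embedded), len(embedded[0]))  # 6 16
```

Identical tokens share one row of the table, so the five carbon atoms receive the same vector until the graph layers that follow mix in neighbourhood information.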

__call__(inputs, *, processes=None, device='/cpu:0', **kwargs)

Generates a molecular graph, namely GraphTensor, from a molecule. Depending on molecule_from_string_fn the molecule(s) could be represented as SMILES, InChI or SDF files, etc.

Parameters

inputs (str, list[str], Chem.Mol, list[Chem.Mol]) – Molecules to be encoded as molecular graphs. Can either be a single molecule, or a list of molecules. In either case, a single GraphTensor will be obtained. The GraphTensor has the flexibility to encode either a single molecule or multiple molecules.
processes (int, None) – The number of worker processes to use. If None os.cpu_count() is used. Default to None.
device (str) – Which device to use for generating the molecular graph. Default to ‘/cpu:0’.
**kwargs – Any extra (keyword) arguments that may be used by the derived class. E.g., MolecularGraphEncoder passes index_dtype to specify the dtype of node (atom) indices.

Returns

A single GraphTensor representing the input molecule(s).

Return type

GraphTensor

MolecularGraphEncoder3D

class molgraph.chemistry.MolecularGraphEncoder3D(molgraph.chemistry.BaseMolecularGraphEncoder)

Distance geometric molecular graph encoder, encoding molecular graphs as GraphTensor.

Parameters

atom_encoder (Featurizer, Tokenizer) – The atom encoder to use.
molecule_from_string_fn (callable) – A function that produces an RDKit molecule object from some input, e.g. SMILES, InChI or SDFs. Default to chemistry.molecule_from_string.
conformer_generator (ConformerGenerator, callable, None) – A conformer generator which produces a conformer of a given molecule, if a conformer does not exist. Default to None.
edge_radius (int, None) – The order of neighbors to consider for the distance geometry. If None, all atom pairs will be considered. Default to None.
coulomb (bool) – Whether coulomb values should be computed from the distances, and the associated atomic charges of the atom pairs. Default to True.

Examples:

>>> # Define atom featurizer (to produce numerical encoding of atoms)
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define conformer generator.
>>> conformer_generator = molgraph.chemistry.ConformerGenerator()
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder3D(
...     atom_encoder=atom_featurizer,
...     conformer_generator=conformer_generator,
...     edge_radius=None,
...     coulomb=False,
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> # The main difference between the 2D and 3D encoder is
>>> # the edge_feature field. Here, with coulomb=False, it contains
>>> # interatomic distances; with coulomb=True, it contains coulomb
>>> # values, which mimic electrostatic interactions between nuclei
>>> graph_tensor.edge_feature
<tf.Tensor: shape=(12, 1), dtype=float32, numpy=
array([[1.525636 ],
       [2.5192354],
       [1.525636 ],
       [1.5256361],
       [1.5256361],
       [2.5192354],
       [1.5208266],
       [2.3878794],
       [1.5208266],
       [1.3999726],
       [1.3999726],
       [2.3878794]], dtype=float32)>
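
The sketch below shows how a pairwise distance and a Coulomb-style value could be derived from conformer coordinates. It assumes the standard Coulomb-matrix off-diagonal form Z_i * Z_j / d_ij, where Z is the atomic number and d_ij the Euclidean distance; MolGraph's exact formula and units may differ. The helper pair_features is hypothetical and the coordinates are made up.

```python
import math


def pair_features(coords, atomic_numbers, coulomb=True):
    """Return {(i, j): value} for all ordered atom pairs with i != j.

    The value is the Euclidean distance d_ij, or Z_i * Z_j / d_ij
    when coulomb=True (an assumed, textbook form).
    """
    features = {}
    for i, ri in enumerate(coords):
        for j, rj in enumerate(coords):
            if i == j:
                continue
            distance = math.dist(ri, rj)
            if coulomb:
                features[(i, j)] = atomic_numbers[i] * atomic_numbers[j] / distance
            else:
                features[(i, j)] = distance
    return features


# Made-up coordinates (in angstrom) for a C-C-O chain; not a real conformer.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.4, 0.0)]
atomic_numbers = [6, 6, 8]
distances = pair_features(coords, atomic_numbers, coulomb=False)
coulombs = pair_features(coords, atomic_numbers, coulomb=True)
print(len(distances))     # 6 ordered pairs for three atoms
print(distances[(0, 1)])  # 1.5
print(coulombs[(0, 1)])   # 24.0
```

Three atoms give six ordered pairs, which matches the twelve edges obtained above for two three-atom molecules with edge_radius=None.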

__call__(inputs, *, processes=None, device='/cpu:0', **kwargs)

Generates a molecular graph, namely GraphTensor, from a molecule. Depending on molecule_from_string_fn the molecule(s) could be represented as SMILES, InChI or SDF files, etc.

Parameters

inputs (str, list[str], Chem.Mol, list[Chem.Mol]) – Molecules to be encoded as molecular graphs. Can either be a single molecule, or a list of molecules. In either case, a single GraphTensor will be obtained. The GraphTensor has the flexibility to encode either a single molecule or multiple molecules.
processes (int, None) – The number of worker processes to use. If None os.cpu_count() is used. Default to None.
device (str) – Which device to use for generating the molecular graph. Default to ‘/cpu:0’.
**kwargs – Any extra (keyword) arguments that may be used by the derived class. E.g., MolecularGraphEncoder passes index_dtype to specify the dtype of node (atom) indices.

Returns

A single GraphTensor representing the input molecule(s).

Return type

GraphTensor

ConformerGenerator

class molgraph.chemistry.ConformerGenerator

Conformer generator to generate molecular conformers.

Parameters

num_conformer_candidates (int, str, None) – Number of conformers to generate, from which the conformer with the lowest energy will be selected. If ‘auto’, the number of conformers will depend on the size of the molecule. Default to ‘auto’.
embedding_method (str) – The embedding method to use. Either of ‘ETDG’, ‘ETKDG’, ‘ETKDGv2’, ‘ETKDGv3’, ‘srETKDGv3’ or ‘KDG’. Default to ‘ETKDGv2’.
force_field_method (str) – The force field method to use. Either of ‘MMFF’, ‘MMFF94’, ‘MMFF94s’ or ‘UFF’. Default to ‘UFF’.
max_iter (int, str, None) – Maximum number of iterations for generating a conformer. If ‘auto’, the number of iterations will depend on the size of the molecule. Default to ‘auto’.
keep_hydrogens (bool) – Whether to keep the hydrogens of the selected conformer. Default to False.

__call__(molecule): Call self as a function.

property available_embedding_methods: Available embedding methods for conformer generation.

property available_force_field_methods: Available force field methods for conformer optimization.

Atomic

Featurizer

class molgraph.chemistry.Featurizer

Atomic featurizer.

Parameters

features (list[Feature]) – List of atomic features.
output_dtype (str, np.dtype) – The output dtype.

Examples:

Atom featurizer:

>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'N'},
...         ordinal=False,
...         oov_size=1
...     ),
...     molgraph.chemistry.features.Hybridization(
...         allowable_set={'SP', 'SP2', 'SP3'},
...         ordinal=False,
...         oov_size=1
...     )
... ])
>>> # Obtain an Atom
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_atom = rdkit_mol.GetAtomWithIdx(0)
>>> # Encode Atom as a numerical vector
>>> atom_featurizer(rdkit_atom)
array([0., 1., 0., 0., 0., 0., 1.], dtype=float32)

Bond featurizer:

>>> bond_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.BondType(
...         allowable_set={'SINGLE', 'DOUBLE'},
...         ordinal=False,
...         oov_size=1
...     ),
... ])
>>> # Obtain a Bond
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_bond = rdkit_mol.GetBondWithIdx(0)
>>> # Encode Bond as a numerical vector
>>> bond_featurizer(rdkit_bond)
array([0., 0., 1.], dtype=float32)
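
The two outputs above can be reproduced by hand: each feature is one-hot encoded against its allowable set, with the OOV bins placed in front, and the per-feature vectors are concatenated. The sketch below does this in plain Python. Sorting the allowable set to fix the order is an assumption that happens to match the outputs shown, and sending every unknown value to the first OOV bin is a simplification; one_hot is a hypothetical helper, not a MolGraph function.

```python
def one_hot(value, allowable_set, oov_size=0):
    """One-hot encode `value`, with `oov_size` out-of-vocabulary bins in front."""
    vocabulary = sorted(allowable_set)
    vector = [0.0] * (oov_size + len(vocabulary))
    if value in vocabulary:
        vector[oov_size + vocabulary.index(value)] = 1.0
    elif oov_size > 0:
        vector[0] = 1.0  # simplification: all unknown values share the first OOV bin
    return vector


# An sp3 carbon: symbol vector followed by hybridization vector.
atom = one_hot('C', {'C', 'N'}, oov_size=1) + one_hot('SP3', {'SP', 'SP2', 'SP3'}, oov_size=1)
# A single bond.
bond = one_hot('SINGLE', {'SINGLE', 'DOUBLE'}, oov_size=1)
print(atom)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(bond)  # [0.0, 0.0, 1.0]
```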

__call__(inputs, *args, **kwargs)

Featurizes RDKit atom(s) or bond(s).

Parameters: inputs (list, rdkit.Chem.Atom, rdkit.Chem.Bond) – Either a single RDKit atom, a single RDKit bond, a list of RDKit atoms, or a list of RDKit bonds. If a molecule has no bonds, the list of bonds will be empty. If a bond is a self loop, it will be represented as None.
Returns: numerical encodings of atom(s) or bond(s).
Return type: np.ndarray

Tokenizer

class molgraph.chemistry.Tokenizer

Atomic tokenizer.

Parameters: features (list[Feature]) – List of atomic (atom or bond) features.

Example:

>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'N'},           # irrelevant
...         ordinal=False,                      # irrelevant
...         oov_size=0                          # irrelevant
...     ),
...     molgraph.chemistry.features.Hybridization(
...         allowable_set={'SP', 'SP2', 'SP3'},
...     )
... ])
>>> # Obtain an Atom
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_atom = rdkit_mol.GetAtomWithIdx(0)
>>> # Encode Atom as a token
>>> atom_tokenizer(rdkit_atom)
array('Sym:C|Hyb:SP3', dtype='<U13')
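
Judging from the output above, a token appears to be the per-feature values joined with '|', each tagged with a short prefix. The sketch below rebuilds that string from hand-written (prefix, value) pairs. How MolGraph derives the prefixes is not shown here, so the pairs and the helper tokenize are assumptions for illustration.

```python
def tokenize(feature_values):
    """Join (prefix, value) pairs into one token such as 'Sym:C|Hyb:SP3'."""
    return '|'.join(f'{prefix}:{value}' for prefix, value in feature_values)


token = tokenize([('Sym', 'C'), ('Hyb', 'SP3')])
print(token)  # Sym:C|Hyb:SP3
```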

__call__(inputs, *args, **kwargs)

Tokenizes RDKit atom(s) or bond(s).

Parameters: inputs (list, rdkit.Chem.Atom, rdkit.Chem.Bond) – Either a single RDKit atom, a single RDKit bond, a list of RDKit atoms, or a list of RDKit bonds. If a molecule has no bonds, the list of bonds will be empty. If a bond is a self loop, it will be represented as None.
Returns: Token encoding of atom(s) or bond(s).
Return type: np.ndarray

Features

class molgraph.chemistry.Feature

Atomic feature.

Defines an atom or bond feature, which can subsequently be passed to a chemistry.Featurizer or chemistry.Tokenizer to be featurized or tokenized, respectively.

Arguments (see below) will only take effect when Feature is wrapped in an Encoding, which occurs automatically inside chemistry.Featurizer or chemistry.Tokenizer.

Parameters

allowable_set (set, list, tuple, None) – A set of features that will be considered. If None, default set will be used, if it exists. Default to None.
ordinal (bool) – Whether to encode the feature as an ordinal vector. Only relevant if allowable_set exists and the feature is passed to a Featurizer. Default to False.
oov_size (int) – The number of bins allotted to OOV (out-of-vocabulary) features. Only relevant if allowable_set exists. Default to 0.
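
To illustrate the difference that ordinal makes: a one-hot vector marks only the matching position, whereas an ordinal (sometimes called thermometer) vector marks every position up to and including it, which preserves order for features such as atom degree. The sketch below uses that common convention as an assumption; the exact layout produced by MolGraph may differ, and encode is a hypothetical helper.

```python
def encode(value, allowable_values, ordinal=False):
    """Encode `value` against an ordered list of allowable values."""
    index = allowable_values.index(value)
    if ordinal:
        # Thermometer style: ones up to and including the matching position.
        return [1.0 if k <= index else 0.0 for k in range(len(allowable_values))]
    # One-hot style: a single one at the matching position.
    return [1.0 if k == index else 0.0 for k in range(len(allowable_values))]


degrees = [0, 1, 2, 3, 4]
print(encode(2, degrees))                # [0.0, 0.0, 1.0, 0.0, 0.0]
print(encode(2, degrees, ordinal=True))  # [1.0, 1.0, 1.0, 0.0, 0.0]
```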

Examples:

Pass features to chemistry.Featurizer to encode atom(s)

>>> # Obtain RDKit atoms, via the RDKit API
>>> atoms = rdkit.Chem.MolFromSmiles('COO').GetAtoms()
>>> # Build an atom featurizer, from Featurizer
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'O'}, oov_size=1),     # specify param
...     molgraph.chemistry.features.HydrogenAcceptor() # use default param
... ])
>>> # Compute numerical encoding of atoms. (OOV bin is prepended.)
>>> atom_featurizer(atoms)
array([[0., 1., 0., 0.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]], dtype=float32)

Create custom features by subclassing chemistry.Feature

>>> class MySymbolFeature(molgraph.chemistry.Feature):
...     def __call__(self, atom: rdkit.Chem.Atom) -> str:
...         return atom.GetSymbol()
>>> # Obtain RDKit atoms, via the RDKit API
>>> atoms = rdkit.Chem.MolFromSmiles('COO').GetAtoms()
>>> # Build an atom featurizer, from Featurizer
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     MySymbolFeature(allowable_set={'C', 'O'}, oov_size=1),
...     molgraph.chemistry.features.HydrogenAcceptor()
... ])
>>> # Compute numerical encoding of atoms. (OOV bin is prepended.)
>>> atom_featurizer(atoms)
array([[0., 1., 0., 0.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]], dtype=float32)

Utilize feature factories chemistry.atom_features and chemistry.bond_features

>>> molgraph.chemistry.bond_features.registered_features
['bond_type',
 'conjugated',
 'rotatable',
 'stereo']
>>> molgraph.chemistry.atom_features.registered_features 
['symbol',
 'hybridization',
 'cip_code',
 'chiral_center',
 'formal_charge',
 'total_num_hs',
 'total_valence',
 'num_radical_electrons',
 'degree',
 'aromatic',
 'hetero',
 'hydrogen_donor',
 'hydrogen_acceptor',
 'ring_size',
 'ring',
 'crippen_log_p_contribution',
 'crippen_molar_refractivity_contribution',
 'tpsa_contribution',
 'labute_asa_contribution',
 'gasteiger_charge']
>>> molgraph.chemistry.atom_features.get('cip_code') 
CIPCode(allowable_set={None, 'R', 'S'}, ordinal=False, oov_size=0)
>>> molgraph.chemistry.atom_features.get('cip_code', ordinal=True) 
CIPCode(allowable_set={None, 'R', 'S'}, ordinal=True, oov_size=0)
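
The factory behaviour shown above (listing registered names and building a feature by name with keyword overrides) resembles a plain registry pattern. The sketch below is a minimal version of that pattern, written from scratch. The classes FeatureFactory and FormalCharge are hypothetical stand-ins and do not reflect how MolGraph implements its factories.

```python
class FeatureFactory:
    """Minimal name-to-class registry with keyword overrides on lookup."""

    def __init__(self):
        self._features = {}

    def register(self, name):
        # Used as a class decorator: stores the class under `name`.
        def decorator(cls):
            self._features[name] = cls
            return cls
        return decorator

    @property
    def registered_features(self):
        return list(self._features)

    def get(self, name, **kwargs):
        # Instantiate the registered class, forwarding any overrides.
        return self._features[name](**kwargs)


atom_features = FeatureFactory()


@atom_features.register('formal_charge')
class FormalCharge:
    def __init__(self, ordinal=False, oov_size=0):
        self.ordinal = ordinal
        self.oov_size = oov_size


feature = atom_features.get('formal_charge', ordinal=True)
print(atom_features.registered_features)  # ['formal_charge']
print(feature.ordinal)                    # True
```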

abstract __call__(inputs): Obtain feature for RDKit atom(s) or bond(s).

class molgraph.chemistry.features.Symbol(Feature)

Atom feature.

__call__(atom)