Chemistry

Molecular

MolecularGraphEncoder

class molgraph.chemistry.MolecularGraphEncoder(molgraph.chemistry.BaseMolecularGraphEncoder)[source]

Molecular graph encoder, encoding molecular graphs as GraphTensor.

Parameters
  • atom_encoder (Featurizer, Tokenizer) – The atom encoder to use.

  • bond_encoder (Featurizer, Tokenizer, None) – The bond encoder to use. Default to None.

  • molecule_from_string_fn (callable) – A function that produces an RDKit molecule object from some input, e.g. SMILES, InChI or SDFs. Default to chemistry.molecule_from_string.

  • positional_encoding_dim (int, None) – The dimension of the positional encoding. If None, positional encoding will not be used. Default to 16.

  • self_loops (bool) – Whether self loops should be added to the molecular graph. Default to False.

  • auxiliary_encoders (dict[str, callable], None) – Additional encoders used to compute additional fields for the molecular graph. The outer dimension of the outputs of these encoders should match the outer dimension of the output of either the atom encoder or the bond encoder. Default to None.

Examples:

Generate a molecular graph with featurizers:

>>> # Define atom featurizer (to produce numerical encoding of atoms)
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define bond featurizer (to produce numerical encoding of bonds)
>>> bond_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.BondType(),
...     # ...
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=atom_featurizer,
...     bond_encoder=bond_featurizer,
...     positional_encoding_dim=10,
...     self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6, 119), dtype=float32>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8, 4), dtype=float32>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)

Generate a molecular graph with tokenizers:

>>> # Define atom tokenizer (to produce token encoding of atoms)
>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define bond tokenizer (to produce token encoding of bonds)
>>> bond_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.BondType(),
...     # ...
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=atom_tokenizer,
...     bond_encoder=bond_tokenizer,
...     positional_encoding_dim=10,
...     self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6,), dtype=string>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8,), dtype=string>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)

Obtain numerical encodings of atoms (node_feature) and bonds (edge_feature) with the EmbeddingLookup layer. This is only necessary when tokenizers are used to compute node_feature and edge_feature:

>>> # Define atom tokenizer (to produce token encoding of atoms)
>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...    molgraph.chemistry.features.Symbol(),
...    molgraph.chemistry.features.Hybridization()
... ])
>>> # Define bond tokenizer (to produce token encoding of bonds)
>>> bond_tokenizer = molgraph.chemistry.Tokenizer([
...    molgraph.chemistry.features.BondType(),
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...    atom_encoder=atom_tokenizer,
...    bond_encoder=bond_tokenizer,
...    positional_encoding_dim=10,
...    self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> # Define embedding layers
>>> node_embedding = molgraph.layers.EmbeddingLookup(
...    feature='node_feature', output_dim=16)
>>> edge_embedding = molgraph.layers.EmbeddingLookup(
...    feature='edge_feature', output_dim=8)
>>> # Adapt embedding layers
>>> node_embedding.adapt(graph_tensor)
>>> edge_embedding.adapt(graph_tensor)
>>> # Build model
>>> model = tf.keras.Sequential([
...    node_embedding,
...    edge_embedding,
... ])
>>> # Pass GraphTensor to model
>>> graph_tensor = model(graph_tensor)
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6, 16), dtype=float32>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8, 8), dtype=float32>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)
__call__(inputs, *, processes=None, device='/cpu:0', **kwargs)[source]

Generates a molecular graph, namely GraphTensor, from a molecule. Depending on molecule_from_string_fn the molecule(s) could be represented as SMILES, InChI or SDF files, etc.

Parameters
  • inputs (str, list[str], Chem.Mol, list[Chem.Mol]) – Molecules to be encoded as molecular graphs. Can either be a single molecule, or a list of molecules. In either case, a single GraphTensor will be obtained. The GraphTensor has the flexibility to encode either a single molecule or multiple molecules.

  • processes (int, None) – The number of worker processes to use. If None, os.cpu_count() is used. Default to None.

  • device (str) – Which device to use for generating the molecular graph. Default to ‘/cpu:0’.

  • **kwargs – Any extra (keyword) arguments that may be used by the derived class. E.g., MolecularGraphEncoder passes index_dtype to specify the dtype of node (atom) indices.

Returns

A single GraphTensor representing the input molecule(s).

Return type

GraphTensor
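
The way several molecules end up in one GraphTensor can be pictured as a disjoint union of graphs. The sketch below is plain Python and not molgraph's implementation: `merge_graphs` and the hard-coded bond lists are hypothetical, and the edge ordering is an assumption. It only reproduces the bookkeeping behind `sizes`, `edge_src` and `edge_dst` for an input such as ['CCC', 'CCO'].

```python
# Illustrative only: merge per-molecule graphs into one disjoint graph.
# `merge_graphs` is a hypothetical helper, not part of molgraph.

def merge_graphs(graphs):
    """Each graph is (num_atoms, bonds), with bonds as (i, j) index pairs."""
    sizes, edge_src, edge_dst = [], [], []
    offset = 0
    for num_atoms, bonds in graphs:
        sizes.append(num_atoms)
        for i, j in bonds:
            # Every bond contributes two directed edges.
            edge_src += [i + offset, j + offset]
            edge_dst += [j + offset, i + offset]
        # Shift the node indices of the next molecule.
        offset += num_atoms
    return sizes, edge_src, edge_dst


# Heavy-atom skeletons of 'CCC' and 'CCO': three atoms, two bonds each.
propane = (3, [(0, 1), (1, 2)])
ethanol = (3, [(0, 1), (1, 2)])

sizes, edge_src, edge_dst = merge_graphs([propane, ethanol])
print(sizes)      # [3, 3]
print(edge_src)   # [0, 1, 1, 2, 3, 4, 4, 5]
print(edge_dst)   # [1, 0, 2, 1, 4, 3, 5, 4]
```

Six nodes and eight directed edges match the tensor shapes shown in the examples for this class.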

MolecularGraphEncoder3D

class molgraph.chemistry.MolecularGraphEncoder3D(molgraph.chemistry.BaseMolecularGraphEncoder)[source]

Distance geometric molecular graph encoder, encoding molecular graphs as GraphTensor.

Parameters
  • atom_encoder (Featurizer, Tokenizer) – The atom encoder to use.

  • molecule_from_string_fn (callable) – A function that produces an RDKit molecule object from some input, e.g. SMILES, InChI or SDFs. Default to chemistry.molecule_from_string.

  • conformer_generator (ConformerGenerator, callable, None) – A conformer generator which produces a conformer of a given molecule, if a conformer does not exist. Default to None.

  • edge_radius (int, None) – The order of neighbors to consider for the distance geometry. If None, all atom pairs will be considered. Default to None.

  • coulomb (bool) – Whether coulomb values should be computed from the distances, and the associated atomic charges of the atom pairs. Default to True.
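
The roles of edge_radius and coulomb can be sketched in plain Python. This is not the library's code: the helper names are hypothetical, "order of neighbors" is assumed to mean the number of bonds between two atoms, and the coulomb value is assumed to follow the common form Z_i * Z_j / d_ij. The exact formula and units used by molgraph are not stated here.

```python
# Illustrative only: atom pairs within `edge_radius` bonds, plus a
# Coulomb-style value Z_i * Z_j / d_ij. Helper names are hypothetical.
from collections import deque


def hop_distances(num_atoms, bonds, start):
    """Breadth-first search: number of bonds from `start` to each atom."""
    neighbors = {i: [] for i in range(num_atoms)}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nxt in neighbors[atom]:
            if nxt not in hops:
                hops[nxt] = hops[atom] + 1
                queue.append(nxt)
    return hops


def atom_pairs(num_atoms, bonds, edge_radius=None):
    """Directed (i, j) pairs; None keeps every pair of distinct atoms."""
    pairs = []
    for i in range(num_atoms):
        hops = hop_distances(num_atoms, bonds, i)
        for j in range(num_atoms):
            if i == j:
                continue
            within = edge_radius is None or hops.get(j, float('inf')) <= edge_radius
            if within:
                pairs.append((i, j))
    return pairs


def coulomb_value(z_i, z_j, distance):
    """Product of nuclear charges divided by the interatomic distance."""
    return z_i * z_j / distance


bonds = [(0, 1), (1, 2)]                # heavy-atom skeleton of 'CCC'
print(len(atom_pairs(3, bonds, None)))  # 6
print(len(atom_pairs(3, bonds, 1)))     # 4
print(coulomb_value(6, 6, 1.5))         # 24.0
```

With edge_radius=None, a three-atom molecule yields six directed pairs, which would be consistent with the 12 edge rows obtained for two such molecules in the example for this class.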

Examples:

>>> # Define atom featurizer (to produce numerical encoding of atoms)
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define conformer generator.
>>> conformer_generator = molgraph.chemistry.ConformerGenerator()
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder3D(
...     atom_encoder=atom_featurizer,
...     conformer_generator=conformer_generator,
...     edge_radius=None,
...     coulomb=False,
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> # The main difference between the 2d and 3d encoder is
>>> # the edge_feature field. Here, since coulomb=False, it contains
>>> # interatomic distances; with coulomb=True it would contain coulomb
>>> # values, which mimic electrostatic interactions between nuclei
>>> graph_tensor.edge_feature
<tf.Tensor: shape=(12, 1), dtype=float32, numpy=
array([[1.525636 ],
       [2.5192354],
       [1.525636 ],
       [1.5256361],
       [1.5256361],
       [2.5192354],
       [1.5208266],
       [2.3878794],
       [1.5208266],
       [1.3999726],
       [1.3999726],
       [2.3878794]], dtype=float32)>
__call__(inputs, *, processes=None, device='/cpu:0', **kwargs)[source]

Generates a molecular graph, namely GraphTensor, from a molecule. Depending on molecule_from_string_fn the molecule(s) could be represented as SMILES, InChI or SDF files, etc.

Parameters
  • inputs (str, list[str], Chem.Mol, list[Chem.Mol]) – Molecules to be encoded as molecular graphs. Can either be a single molecule, or a list of molecules. In either case, a single GraphTensor will be obtained. The GraphTensor has the flexibility to encode either a single molecule or multiple molecules.

  • processes (int, None) – The number of worker processes to use. If None, os.cpu_count() is used. Default to None.

  • device (str) – Which device to use for generating the molecular graph. Default to ‘/cpu:0’.

  • **kwargs – Any extra (keyword) arguments that may be used by the derived class. E.g., MolecularGraphEncoder passes index_dtype to specify the dtype of node (atom) indices.

Returns

A single GraphTensor representing the input molecule(s).

Return type

GraphTensor

ConformerGenerator

class molgraph.chemistry.ConformerGenerator[source]

Conformer generator to generate molecular conformers.

Parameters
  • num_conformer_candidates (int, str, None) – Number of conformers to generate, from which the conformer with the lowest energy will be selected. If ‘auto’, the number of conformers will depend on the size of the molecule. Default to ‘auto’.

  • embedding_method (str) – The embedding method to use. One of ‘ETDG’, ‘ETKDG’, ‘ETKDGv2’, ‘ETKDGv3’, ‘srETKDGv3’ or ‘KDG’. Default to ‘ETKDGv2’.

  • force_field_method (str) – The force field method to use. One of ‘MMFF’, ‘MMFF94’, ‘MMFF94s’ or ‘UFF’. Default to ‘UFF’.

  • max_iter (int, str, None) – Maximum number of iterations for generating a conformer. If ‘auto’, the number of iterations will depend on the size of the molecule. Default to ‘auto’.

  • keep_hydrogens (bool) – Whether to keep the hydrogens of the selected conformer. Default to False.

__call__(molecule)[source]

Call self as a function.

property available_embedding_methods

Available embedding methods for conformer generation.

property available_force_field_methods

Available force field methods for conformer optimization.

Atomic

Featurizer

class molgraph.chemistry.Featurizer[source]

Atomic featurizer.

Parameters
  • features (list[Feature]) – List of atomic features.

  • output_dtype (str, np.dtype) – The output dtype.

Examples:

Atom featurizer:

>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'N'},
...         ordinal=False,
...         oov_size=1
...     ),
...     molgraph.chemistry.features.Hybridization(
...         allowable_set={'SP', 'SP2', 'SP3'},
...         ordinal=False,
...         oov_size=1
...     )
... ])
>>> # Obtain an Atom
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_atom = rdkit_mol.GetAtomWithIdx(0)
>>> # Encode Atom as a numerical vector
>>> atom_featurizer(rdkit_atom)
array([0., 1., 0., 0., 0., 0., 1.], dtype=float32)

Bond featurizer:

>>> bond_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.BondType(
...         allowable_set={'SINGLE', 'DOUBLE'},
...         ordinal=False,
...         oov_size=1
...     ),
... ])
>>> # Obtain a Bond
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_bond = rdkit_mol.GetBondWithIdx(0)
>>> # Encode Bond as a numerical vector
>>> bond_featurizer(rdkit_bond)
array([0., 0., 1.], dtype=float32)
__call__(inputs, *args, **kwargs)[source]

Featurizes RDKit atom(s) or bond(s).

Parameters

inputs (list, rdkit.Chem.Atom, rdkit.Chem.Bond) – Either a single RDKit atom, a single RDKit bond, a list of RDKit atoms, or a list of RDKit bonds. If a molecule has no bonds, the list of bonds is empty. A bond that is a self loop is represented as None.

Returns

Numerical encodings of atom(s) or bond(s).

Return type

np.ndarray
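
The vectors in the examples above can be reproduced with a small one-hot routine. This is a sketch in plain Python, not the Featurizer's implementation: `one_hot` and `featurize` are hypothetical helpers, sorting the allowable set is an assumption inferred from the example outputs, and so is the choice of the first OOV bin for unknown values.

```python
# Illustrative only: one-hot encoding with out-of-vocabulary (OOV) bins.
# `one_hot` and `featurize` are hypothetical helpers, not molgraph functions.

def one_hot(value, allowable_set, oov_size=0):
    """OOV bins come first, then one slot per sorted vocabulary entry."""
    vocabulary = sorted(allowable_set)
    encoding = [0.0] * (oov_size + len(vocabulary))
    if value in vocabulary:
        encoding[oov_size + vocabulary.index(value)] = 1.0
    elif oov_size > 0:
        # Assumption: unknown values land in the first OOV bin.
        encoding[0] = 1.0
    return encoding


def featurize(values, allowable_sets, oov_size=1):
    """Concatenate the one-hot encodings of several features."""
    encoding = []
    for value, allowable_set in zip(values, allowable_sets):
        encoding += one_hot(value, allowable_set, oov_size)
    return encoding


# A carbon atom with SP3 hybridization, as in the atom featurizer example.
print(featurize(['C', 'SP3'], [{'C', 'N'}, {'SP', 'SP2', 'SP3'}]))
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(one_hot('SINGLE', {'SINGLE', 'DOUBLE'}, oov_size=1))
# [0.0, 0.0, 1.0]
print(one_hot('F', {'C', 'N'}, oov_size=1))
# [1.0, 0.0, 0.0]
```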

Tokenizer

class molgraph.chemistry.Tokenizer[source]

Atomic tokenizer.

Parameters

features (list[Feature]) – List of atomic (atom or bond) features.

Example:

>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'N'},           # irrelevant
...         ordinal=False,                      # irrelevant
...         oov_size=0                          # irrelevant
...     ),
...     molgraph.chemistry.features.Hybridization(
...         allowable_set={'SP', 'SP2', 'SP3'},
...     )
... ])
>>> # Obtain an Atom
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_atom = rdkit_mol.GetAtomWithIdx(0)
>>> # Encode Atom as a token
>>> atom_tokenizer(rdkit_atom)
array('Sym:C|Hyb:SP3', dtype='<U13')
__call__(inputs, *args, **kwargs)[source]

Tokenizes RDKit atom(s) or bond(s).

Parameters

inputs (list, rdkit.Chem.Atom, rdkit.Chem.Bond) – Either a single RDKit atom, a single RDKit bond, a list of RDKit atoms, or a list of RDKit bonds. If a molecule has no bonds, the list of bonds is empty. A bond that is a self loop is represented as None.

Returns

Token encoding of atom(s) or bond(s).

Return type

np.ndarray
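
The token in the example above has a simple shape: one 'prefix:value' part per feature, joined by '|'. The sketch below mimics that shape in plain Python. `tokenize` is a hypothetical helper, the prefixes are copied from the example output, and the integer vocabulary at the end is only an assumption about how tokens might later be indexed before an embedding lookup.

```python
# Illustrative only: build a token from (prefix, value) pairs.
# The 'Sym' and 'Hyb' prefixes are copied from the example output;
# `tokenize` itself is a hypothetical helper.

def tokenize(named_values):
    """Join 'prefix:value' parts with '|' into one token string."""
    return '|'.join(f'{prefix}:{value}' for prefix, value in named_values)


token = tokenize([('Sym', 'C'), ('Hyb', 'SP3')])
print(token)  # Sym:C|Hyb:SP3

# Assumption: tokens are later mapped to integer ids (then to embeddings).
tokens = sorted({token, 'Sym:O|Hyb:SP3'})
vocabulary = {tok: idx for idx, tok in enumerate(tokens)}
print(vocabulary[token])  # 0
```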

Features

class molgraph.chemistry.Feature[source]

Atomic feature.

Defines an atom or bond feature, which can subsequently be passed to a chemistry.Featurizer or chemistry.Tokenizer to be featurized or tokenized, respectively.

Arguments (see below) will only take effect when Feature is wrapped in an Encoding, which occurs automatically inside chemistry.Featurizer or chemistry.Tokenizer.

Parameters
  • allowable_set (set, list, tuple, None) – A set of features that will be considered. If None, default set will be used, if it exists. Default to None.

  • ordinal (bool) – Whether to encode the feature as an ordinal vector. Only relevant if allowable_set exists and the feature is passed to a Featurizer. Default to False.

  • oov_size (int) – The number of bins allotted to out-of-vocabulary (OOV) features. Only relevant if allowable_set exists. Default to 0.

Examples:

Pass features to chemistry.Featurizer to encode atom(s)

>>> # Obtain RDKit atoms, via the RDKit API
>>> atoms = rdkit.Chem.MolFromSmiles('COO').GetAtoms()
>>> # Build an atom featurizer, from Featurizer
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'O'}, oov_size=1),     # specify param
...     molgraph.chemistry.features.HydrogenAcceptor() # use default param
... ])
>>> # Compute numerical encoding of atoms. (OOV bin is prepended.)
>>> atom_featurizer(atoms)
array([[0., 1., 0., 0.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]], dtype=float32)

Create custom features by subclassing chemistry.Feature

>>> class MySymbolFeature(molgraph.chemistry.Feature):
...     def __call__(self, atom: rdkit.Chem.Atom) -> str:
...         return atom.GetSymbol()
>>> # Obtain RDKit atoms, via the RDKit API
>>> atoms = rdkit.Chem.MolFromSmiles('COO').GetAtoms()
>>> # Build an atom featurizer, from Featurizer
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     MySymbolFeature(allowable_set={'C', 'O'}, oov_size=1),
...     molgraph.chemistry.features.HydrogenAcceptor()
... ])
>>> # Compute numerical encoding of atoms. (OOV bin is prepended.)
>>> atom_featurizer(atoms)
array([[0., 1., 0., 0.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]], dtype=float32)

Utilize feature factories chemistry.atom_features and chemistry.bond_features

>>> molgraph.chemistry.bond_features.registered_features
['bond_type',
 'conjugated',
 'rotatable',
 'stereo']
>>> molgraph.chemistry.atom_features.registered_features 
['symbol',
 'hybridization',
 'cip_code',
 'chiral_center',
 'formal_charge',
 'total_num_hs',
 'total_valence',
 'num_radical_electrons',
 'degree',
 'aromatic',
 'hetero',
 'hydrogen_donor',
 'hydrogen_acceptor',
 'ring_size',
 'ring',
 'crippen_log_p_contribution',
 'crippen_molar_refractivity_contribution',
 'tpsa_contribution',
 'labute_asa_contribution',
 'gasteiger_charge']
>>> molgraph.chemistry.atom_features.get('cip_code') 
CIPCode(allowable_set={None, 'R', 'S'}, ordinal=False, oov_size=0)
>>> molgraph.chemistry.atom_features.get('cip_code', ordinal=True) 
CIPCode(allowable_set={None, 'R', 'S'}, ordinal=True, oov_size=0)
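
The factory behaviour shown above (list registered names, get an instance by name, override defaults with keyword arguments) follows a common registry pattern. The sketch below illustrates that pattern in plain Python. `FeatureFactory` and the `CIPCode` stand-in are hypothetical and are not molgraph's classes.

```python
# Illustrative only: a name-to-class registry with keyword overrides.
# `FeatureFactory` and `CIPCode` here are stand-ins, not molgraph classes.

class CIPCode:
    def __init__(self, allowable_set=None, ordinal=False, oov_size=0):
        self.allowable_set = allowable_set or {None, 'R', 'S'}
        self.ordinal = ordinal
        self.oov_size = oov_size


class FeatureFactory:
    def __init__(self):
        self._features = {}

    def register(self, name, feature_cls):
        self._features[name] = feature_cls

    @property
    def registered_features(self):
        return list(self._features)

    def get(self, name, **kwargs):
        # Instantiate the registered class, forwarding any overrides.
        return self._features[name](**kwargs)


atom_features = FeatureFactory()
atom_features.register('cip_code', CIPCode)

print(atom_features.registered_features)                    # ['cip_code']
print(atom_features.get('cip_code').ordinal)                # False
print(atom_features.get('cip_code', ordinal=True).ordinal)  # True
```
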
abstract __call__(inputs)[source]

Obtain feature for RDKit atom(s) or bond(s).

class molgraph.chemistry.features.Symbol(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.Hybridization(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.CIPCode(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.ChiralCenter(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.FormalCharge(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.TotalNumHs(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.TotalValence(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.NumRadicalElectrons(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.Degree(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.Aromatic(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.Hetero(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.HydrogenDonor(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.HydrogenAcceptor(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.RingSize(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.Ring(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.CrippenLogPContribution(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.CrippenMolarRefractivityContribution(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.TPSAContribution(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.LabuteASAContribution(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.GasteigerCharge(Feature)[source]

Atom feature.

__call__(atom)[source]

Transforms an rdkit.Chem.Atom to a feature.

Parameters

atom (rdkit.Chem.Atom) – The input to be transformed to a feature.

class molgraph.chemistry.features.BondType(Feature)[source]

Bond feature.

__call__(bond)[source]

Transforms an rdkit.Chem.Bond to a feature.

Parameters

bond (rdkit.Chem.Bond) – The input to be transformed to a feature.

class molgraph.chemistry.features.Conjugated(Feature)[source]

Bond feature.

__call__(bond)[source]

Transforms an rdkit.Chem.Bond to a feature.

Parameters

bond (rdkit.Chem.Bond) – The input to be transformed to a feature.

class molgraph.chemistry.features.Rotatable(Feature)[source]

Bond feature.

__call__(bond)[source]

Transforms an rdkit.Chem.Bond to a feature.

Parameters

bond (rdkit.Chem.Bond) – The input to be transformed to a feature.

class molgraph.chemistry.features.Stereo(Feature)[source]

Bond feature.

__call__(bond)[source]

Transforms an rdkit.Chem.Bond to a feature.

Parameters

bond (rdkit.Chem.Bond) – The input to be transformed to a feature.

Chemistry ops

molgraph.chemistry.ops.molecule_from_string(molecule, catch_errors=True)[source]

Generates an RDKit molecule object from a SMILES string, InChI string, or SDF string. If an RDKit molecule object is passed, it is returned as is.

TF records

molgraph.chemistry.benchmark.tf_records.write(path, data, encoder=None, num_files=None, num_processes=None, **kwargs)[source]

Writes TF records.

Example:

>>> x = ['CC', 'CCC', 'CCCC']
>>> y = [ 5.2,  7.4,   8.1]
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=molgraph.chemistry.Featurizer([
...         molgraph.chemistry.features.Symbol(),
...         molgraph.chemistry.features.Hybridization(),
...     ])
... )
>>> molgraph.chemistry.tf_records.write( 
...     path='/tmp/dummy_records/',
...     data={'x': x, 'y': y},
...     encoder=encoder
... )

Important: If the current environment runs on a GPU by default, please use the writer context manager instead:

>>> with molgraph.chemistry.tf_records.writer('/tmp/dummy_records/') as writer:
...
...     # In contrast to the previous example, let's obtain the GraphTensor
...     # instances outside the write function. Note: this would cause
...     # issues if run on a GPU (namely, without the writer context manager)
...
...     x1 = GraphTensor(
...         node_feature=tf.constant([[1.]]),
...         edge_src=tf.constant([], dtype=tf.int64),
...         edge_dst=tf.constant([], dtype=tf.int64))
...
...     x2 = GraphTensor(
...         node_feature=tf.constant([[1.], [2.]]),
...         edge_src=tf.constant([0, 1], dtype=tf.int64),
...         edge_dst=tf.constant([1, 0], dtype=tf.int64))
...
...     x3 = GraphTensor(
...         node_feature=tf.constant([[1.], [2.], [3.]]),
...         edge_src=tf.constant([0, 1, 2], dtype=tf.int64),
...         edge_dst=tf.constant([1, 2, 0], dtype=tf.int64))
...
...     # Should not specify path, num_files or num_processes; encoder is
...     # optional: here not needed as we already obtained the graph tensors.
...     writer.write( 
...         data={'x': [x1, x2, x3], 'y': [0., 1., 2.]},
...         encoder=None # encoder not needed as graph tensors are passed
...     )
...
>>> # load tf records as tf.data.Dataset
>>> ds = molgraph.chemistry.tf_records.load( 
...     '/tmp/dummy_records/'
... )
Parameters
  • path (str) – The path to write TF records to (save path). Should not include file name. File names are automatically determined.

  • data (dict) – The data to be written as TF records. The keys of the dict are the names of the data fields, and the values are the corresponding field values. E.g., {'x': ['CC', 'CCO'], 'y': [4.1, 2.4]}. The encoder will be applied to the mandatory data['x'] field.

  • encoder (MolecularGraphEncoder, MolecularGraphEncoder3D, None) – The encoder to be applied to data['x']. The encoder transforms the string (or rdkit.Chem.Mol) representations of molecules into a GraphTensor. If None, it is assumed that data['x'] already contains GraphTensor instances in a list: [gt_1, gt_2, …, gt_n]. Default to None.

  • num_files (int, None) – The number of TF record files to write to. If None, num_files will be set to num_processes. Default to None.

  • num_processes (int, None) – The number of worker processes to use. If None, multiprocessing.cpu_count() will be used. Using multiple worker processes significantly speeds up writing of TF records. If num_files < num_processes, only num_files processes will be used. Default to None.

Returns

None
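
The defaults described for num_files and num_processes can be summarised in a few lines. The sketch below is plain Python and not the library's code: `resolve_counts` and `shard` are hypothetical helpers, and the round-robin split is an assumption about how data might be divided among files.

```python
# Illustrative only: resolve worker/file counts and split data into shards.
# `resolve_counts` and `shard` are hypothetical helpers, not molgraph functions.
import multiprocessing


def resolve_counts(num_files=None, num_processes=None):
    """Apply the defaults described for `num_files` and `num_processes`."""
    if num_processes is None:
        num_processes = multiprocessing.cpu_count()
    if num_files is None:
        num_files = num_processes
    # Never use more processes than there are files to write.
    return num_files, min(num_files, num_processes)


def shard(items, num_files):
    """Round-robin split; the real splitting strategy is an assumption."""
    return [items[i::num_files] for i in range(num_files)]


print(resolve_counts(num_files=2, num_processes=8))   # (2, 2)
print(resolve_counts(num_files=16, num_processes=4))  # (16, 4)
print(shard(['CC', 'CCC', 'CCCC'], 2))                # [['CC', 'CCCC'], ['CCC']]
```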

molgraph.chemistry.benchmark.tf_records.load(path, extract_tuple=None, shuffle_tf_records=False)[source]

Loads TF records.

Example:

>>> ds = molgraph.chemistry.tf_records.load( 
...     path='/tmp/dummy_records/', # extract_tuple=('x', 'y')
... )
>>> ds = ds.shuffle(3) 
>>> ds = ds.batch(2) 
>>> ds = ds.prefetch(-1) 
>>> for batch in ds.take(1): 
...     print(batch['x'])
Parameters
  • path (str) – Path to TF record files (excluding file names).

  • extract_tuple (list[str], tuple[str], None) – Optionally specify what fields to extract. If None, returned TF dataset will produce dictionaries (corresponding to inputs passed to write). If not None, tuples will be produced. Default to None.

  • shuffle_tf_records (bool) – Whether the TF record files should be shuffled. Recommended to be set to True when loading a training dataset. Default to False.

Returns

A TF dataset ready to be passed to GNN models.

Return type

tf.data.Dataset
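
The effect of extract_tuple on each dataset element can be pictured with ordinary Python values. This sketch does not use TensorFlow and is not the loader's implementation; `extract` is a hypothetical helper that only shows the dictionary-versus-tuple distinction described above.

```python
# Illustrative only: dictionaries versus tuples, as selected by `extract_tuple`.
# `extract` is a hypothetical helper, not a molgraph function.

def extract(example, extract_tuple=None):
    """Return the example unchanged, or the chosen fields as a tuple."""
    if extract_tuple is None:
        return example
    return tuple(example[name] for name in extract_tuple)


example = {'x': 'CCO', 'y': 2.4}
print(extract(example))              # {'x': 'CCO', 'y': 2.4}
print(extract(example, ('x', 'y')))  # ('CCO', 2.4)
```

The tuple form is typically the convenient one when a dataset of (inputs, targets) pairs is passed directly to a Keras model's fit method.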

Datasets

class molgraph.chemistry.benchmark.datasets.DatasetFactory[source]

A dataset factory.

Obtain a dataset with ease:

>>> esol_dataset = molgraph.chemistry.datasets.get('esol') 

List available (registered) datasets:

>>> molgraph.chemistry.datasets.registered_datasets 
['smrt',
 'qm7',
 'qm8',
 'qm9',
 'pcba',
 'muv',
 'hiv',
 'bace',
 'bbbp',
 'tox21',
 'toxcast',
 'sider',
 'clintox',
 'esol',
 'freesolv',
 'lipophilicity']
get(name, **config)[source]

Get dataset, by name.

Parameters
  • name (str) – Name of the dataset (see registered_datasets).

  • **config – Overwrite the default configuration by passing new values as keyword arguments. E.g., datasets.get('esol', splitter=None) to avoid splitting the dataset.

Returns

A Dataset or nested Dataset.

get_config(name)[source]

Get config of dataset, by name.

Parameters

name (str) – Name of the dataset (see registered_datasets).

Returns

The configuration (dict) of the specified dataset.

property registered_datasets

Lists all registered datasets.