Chemistry
Molecular
MolecularGraphEncoder
- class molgraph.chemistry.MolecularGraphEncoder(molgraph.chemistry.BaseMolecularGraphEncoder)[source]
Molecular graph encoder, encoding molecular graphs as GraphTensor.
- Parameters
atom_encoder (Featurizer, Tokenizer) – The atom encoder to use.
bond_encoder (Featurizer, Tokenizer, None) – The bond encoder to use. Default to None.
molecule_from_string_fn (callable) – A function that produces an RDKit molecule object from some input, e.g. SMILES, InChI or SDFs. Default to chemistry.molecule_from_string.
positional_encoding_dim (int, None) – The dimension of the positional encoding. If None, positional encoding will not be used. Default to 16.
self_loops (bool) – Whether self loops should be added to the molecular graph. Default to False.
auxiliary_encoders (dict[str, callable], None) – Additional encoders used to compute additional fields for the molecular graph. The outer dimension of each encoder's output should match the outer dimension of the output of either the atom encoder or the bond encoder. Default to None.
Examples:
Generate a molecular graph with featurizers:
>>> import molgraph
>>> # Define atom featurizer (to produce numerical encoding of atoms)
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define bond featurizer (to produce numerical encoding of bonds)
>>> bond_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.BondType(),
...     # ...
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=atom_featurizer,
...     bond_encoder=bond_featurizer,
...     positional_encoding_dim=10,
...     self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6, 119), dtype=float32>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8, 4), dtype=float32>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)
Generate a molecular graph with tokenizers:
>>> # Define atom tokenizer (to produce token encoding of atoms)
>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define bond tokenizer (to produce token encoding of bonds)
>>> bond_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.BondType(),
...     # ...
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=atom_tokenizer,
...     bond_encoder=bond_tokenizer,
...     positional_encoding_dim=10,
...     self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6,), dtype=string>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8,), dtype=string>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)
Obtain numerical encodings of atoms (node_feature) and bonds (edge_feature) with the EmbeddingLookup layer. This is only necessary when tokenizers are used to compute node_feature and edge_feature:
>>> import tensorflow as tf
>>> # Define atom tokenizer (to produce token encoding of atoms)
>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
... ])
>>> # Define bond tokenizer (to produce token encoding of bonds)
>>> bond_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.BondType(),
... ])
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=atom_tokenizer,
...     bond_encoder=bond_tokenizer,
...     positional_encoding_dim=10,
...     self_loops=False
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> # Define embedding layers
>>> node_embedding = molgraph.layers.EmbeddingLookup(
...     feature='node_feature', output_dim=16)
>>> edge_embedding = molgraph.layers.EmbeddingLookup(
...     feature='edge_feature', output_dim=8)
>>> # Adapt embedding layers
>>> node_embedding.adapt(graph_tensor)
>>> edge_embedding.adapt(graph_tensor)
>>> # Build model
>>> model = tf.keras.Sequential([
...     node_embedding,
...     edge_embedding,
... ])
>>> # Pass GraphTensor to model
>>> graph_tensor = model(graph_tensor)
>>> graph_tensor
GraphTensor(
  sizes=<tf.Tensor: shape=(2,), dtype=int32>,
  node_feature=<tf.Tensor: shape=(6, 16), dtype=float32>,
  edge_src=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_dst=<tf.Tensor: shape=(8,), dtype=int32>,
  edge_feature=<tf.Tensor: shape=(8, 8), dtype=float32>,
  node_position=<tf.Tensor: shape=(6, 10), dtype=float32>)
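The shapes reported in the outputs above (sizes of shape (2,), six nodes, eight edges) are consistent with both molecules being packed into one disjoint graph. The following pure-Python sketch illustrates that idea under two assumptions: each bond is stored as two directed edges, and node indices of later molecules are offset by the number of preceding nodes. The function pack_graphs is hypothetical and is not molgraph's implementation; the library's edge ordering may differ.

```python
from typing import List, Tuple


def pack_graphs(graphs: List[Tuple[int, List[Tuple[int, int]]]]):
    """Packs several small graphs into one disjoint graph.

    Each input graph is (num_nodes, bonds), where bonds are undirected
    (i, j) pairs using node indices local to that graph.
    """
    sizes, edge_src, edge_dst = [], [], []
    offset = 0  # global index of the first node of the current graph
    for num_nodes, bonds in graphs:
        sizes.append(num_nodes)
        for i, j in bonds:
            # Assumption: every undirected bond becomes two directed edges.
            edge_src += [offset + i, offset + j]
            edge_dst += [offset + j, offset + i]
        offset += num_nodes
    return sizes, edge_src, edge_dst


# 'CCC' and 'CCO': three heavy atoms and two bonds each.
propane = (3, [(0, 1), (1, 2)])
ethanol = (3, [(0, 1), (1, 2)])
sizes, edge_src, edge_dst = pack_graphs([propane, ethanol])
print(sizes)          # [3, 3]
print(len(edge_src))  # 8
```

The sizes list is what allows the packed graph to be split back into individual molecules.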
- __call__(inputs, *, processes=None, device='/cpu:0', **kwargs)[source]
Generates a molecular graph, namely GraphTensor, from a molecule. Depending on molecule_from_string_fn the molecule(s) could be represented as SMILES, InChI or SDF files, etc.
- Parameters
inputs (str, list[str], Chem.Mol, list[Chem.Mol]) – Molecules to be encoded as molecular graphs. Can either be a single molecule, or a list of molecules. In either case, a single GraphTensor will be obtained. The GraphTensor has the flexibility to encode either a single molecule or multiple molecules.
processes (int, None) – The number of worker processes to use. If None, os.cpu_count() is used. Default to None.
device (str) – Which device to use for generating the molecular graph. Default to ‘/cpu:0’.
**kwargs – Any extra (keyword) arguments that may be used by the derived class. E.g., MolecularGraphEncoder passes index_dtype to specify the dtype of node (atom) indices.
- Returns
A single GraphTensor representing the input molecule(s).
- Return type
GraphTensor
MolecularGraphEncoder3D
- class molgraph.chemistry.MolecularGraphEncoder3D(molgraph.chemistry.BaseMolecularGraphEncoder)[source]
Distance geometric molecular graph encoder, encoding molecular graphs as GraphTensor.
- Parameters
atom_encoder (Featurizer, Tokenizer) – The atom encoder to use.
molecule_from_string_fn (callable) – A function that produces an RDKit molecule object from some input, e.g. SMILES, InChI or SDFs. Default to chemistry.molecule_from_string.
conformer_generator (ConformerGenerator, callable, None) – A conformer generator which produces a conformer of a given molecule, if a conformer does not exist. Default to None.
edge_radius (int, None) – The order of neighbors to consider for the distance geometry. If None, all atom pairs will be considered. Default to None.
coulomb (bool) – Whether coulomb values should be computed from the distances, and the associated atomic charges of the atom pairs. Default to True.
Examples:
>>> import molgraph
>>> # Define atom featurizer (to produce numerical encoding of atoms)
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(),
...     molgraph.chemistry.features.Hybridization()
...     # ...
... ])
>>> # Define conformer generator
>>> conformer_generator = molgraph.chemistry.ConformerGenerator()
>>> # Define molecular graph encoder
>>> encoder = molgraph.chemistry.MolecularGraphEncoder3D(
...     atom_encoder=atom_featurizer,
...     conformer_generator=conformer_generator,
...     edge_radius=None,
...     coulomb=False,
... )
>>> # Encode two molecules as a GraphTensor
>>> graph_tensor = encoder(['CCC', 'CCO'])
>>> # The main difference between the 2d and 3d encoder is
>>> # the edge_feature field. Here, it contains coulomb values,
>>> # which mimic electrostatic interactions between nuclei
>>> graph_tensor.edge_feature
<tf.Tensor: shape=(12, 1), dtype=float32, numpy=
array([[1.525636 ],
       [2.5192354],
       [1.525636 ],
       [1.5256361],
       [1.5256361],
       [2.5192354],
       [1.5208266],
       [2.3878794],
       [1.5208266],
       [1.3999726],
       [1.3999726],
       [2.3878794]], dtype=float32)>
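The twelve rows in the output above match what one would expect when edge_radius is None and every ordered atom pair is considered: two molecules, each with three heavy atoms and two partners per atom. The sketch below illustrates this in pure Python. The helper all_pair_edges and the coordinates are hypothetical, and the Coulomb form Z_i * Z_j / d_ij is an assumption borrowed from the off-diagonal entries of the Coulomb matrix; it is not necessarily the exact expression the encoder uses.

```python
import math
from itertools import permutations


def all_pair_edges(coords, charges=None):
    """Returns (src, dst, value) for every ordered atom pair i != j.

    value is the Euclidean distance, or Z_i * Z_j / distance when
    nuclear charges are supplied (assumed Coulomb form).
    """
    edges = []
    for i, j in permutations(range(len(coords)), 2):
        distance = math.dist(coords[i], coords[j])
        if charges is None:
            value = distance
        else:
            value = charges[i] * charges[j] / distance
        edges.append((i, j, value))
    return edges


# Hypothetical coordinates (in angstrom) for three atoms; not a real conformer.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0)]
distances = all_pair_edges(coords)
coulomb = all_pair_edges(coords, charges=[6, 6, 8])
print(len(distances))  # 6 ordered pairs for 3 atoms
```

Each unordered pair appears twice, once per direction, which is why the values in such an edge list come in matching pairs.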
- __call__(inputs, *, processes=None, device='/cpu:0', **kwargs)[source]
Generates a molecular graph, namely GraphTensor, from a molecule. Depending on molecule_from_string_fn the molecule(s) could be represented as SMILES, InChI or SDF files, etc.
- Parameters
inputs (str, list[str], Chem.Mol, list[Chem.Mol]) – Molecules to be encoded as molecular graphs. Can either be a single molecule, or a list of molecules. In either case, a single GraphTensor will be obtained. The GraphTensor has the flexibility to encode either a single molecule or multiple molecules.
processes (int, None) – The number of worker processes to use. If None, os.cpu_count() is used. Default to None.
device (str) – Which device to use for generating the molecular graph. Default to ‘/cpu:0’.
**kwargs – Any extra (keyword) arguments that may be used by the derived class. E.g., MolecularGraphEncoder passes index_dtype to specify the dtype of node (atom) indices.
- Returns
A single GraphTensor representing the input molecule(s).
- Return type
GraphTensor
ConformerGenerator
- class molgraph.chemistry.ConformerGenerator[source]
Conformer generator to generate molecular conformers.
- Parameters
num_conformer_candidates (int, str, None) – Number of conformers to generate, from which the conformer with the lowest energy will be selected. If ‘auto’, the number of conformers will depend on the size of the molecule. Default to ‘auto’.
embedding_method (str) – The embedding method to use. Either of ‘ETDG’, ‘ETKDG’, ‘ETKDGv2’, ‘ETKDGv3’, ‘srETKDGv3’ or ‘KDG’. Default to ‘ETKDGv2’.
force_field_method (str) – The force field method to use. Either of ‘MMFF’, ‘MMFF94’, ‘MMFF94s’ or ‘UFF’. Default to ‘UFF’.
max_iter (int, str, None) – Maximum number of iterations for generating a conformer. If ‘auto’, the number of iterations will depend on the size of the molecule. Default to ‘auto’.
keep_hydrogens (bool) – Whether to keep the hydrogens of the selected conformer. Default to False.
- property available_embedding_methods
Available embedding methods for conformer generation.
- property available_force_field_methods
Available force field methods for conformer optimization.
Atomic
Featurizer
- class molgraph.chemistry.Featurizer[source]
Atomic featurizer.
- Parameters
features (list[Feature]) – List of atomic features.
output_dtype (str, np.dtype) – The output dtype.
Examples:
Atom featurizer:
>>> import molgraph
>>> import rdkit.Chem
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'N'},
...         ordinal=False,
...         oov_size=1
...     ),
...     molgraph.chemistry.features.Hybridization(
...         allowable_set={'SP', 'SP2', 'SP3'},
...         ordinal=False,
...         oov_size=1
...     )
... ])
>>> # Obtain an Atom
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_atom = rdkit_mol.GetAtomWithIdx(0)
>>> # Encode Atom as a numerical vector
>>> atom_featurizer(rdkit_atom)
array([0., 1., 0., 0., 0., 0., 1.], dtype=float32)
Bond featurizer:
>>> bond_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.BondType(
...         allowable_set={'SINGLE', 'DOUBLE'},
...         ordinal=False,
...         oov_size=1
...     ),
... ])
>>> # Obtain a Bond
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_bond = rdkit_mol.GetBondWithIdx(0)
>>> # Encode Bond as a numerical vector
>>> bond_featurizer(rdkit_bond)
array([0., 0., 1.], dtype=float32)
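The vectors in the two examples above can be reproduced by hand with a small one-hot helper. This is a sketch, not the library's encoding code: it assumes the allowable set is sorted, that the out-of-vocabulary (OOV) bins come first, that unseen values are hashed into an OOV bin, and that an unseen value with oov_size=0 yields an all-zero vector. The function one_hot is hypothetical.

```python
import zlib


def one_hot(value, allowable_set, oov_size=0):
    """One-hot encodes `value`, with `oov_size` OOV bins placed first."""
    vocabulary = sorted(allowable_set)  # assumed ordering
    vector = [0.0] * (oov_size + len(vocabulary))
    if value in vocabulary:
        vector[oov_size + vocabulary.index(value)] = 1.0
    elif oov_size > 0:
        # Assumed rule: hash unseen values into one of the OOV bins.
        vector[zlib.crc32(str(value).encode()) % oov_size] = 1.0
    return vector


# Carbon atom with SP3 hybridization: concatenate the per-feature vectors.
atom = (one_hot('C', {'C', 'N'}, oov_size=1)
        + one_hot('SP3', {'SP', 'SP2', 'SP3'}, oov_size=1))
# Single bond.
bond = one_hot('SINGLE', {'SINGLE', 'DOUBLE'}, oov_size=1)
print(atom)  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(bond)  # [0.0, 0.0, 1.0]
```

With a single OOV bin, any symbol outside the allowable set, such as 'O' here, sets the first position instead.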
- __call__(inputs, *args, **kwargs)[source]
Featurizes RDKit atom(s) or bond(s).
- Parameters
inputs (list, rdkit.Chem.Atom, rdkit.Chem.Bond) – Either a single RDKit atom, a single RDKit bond, a list of RDKit atoms, or a list of RDKit bonds. If bonds do not exist for a given molecule, the list of bonds will be an empty list. If a bond is a self loop, the bond will be represented as None.
- Returns
Numerical encodings of atom(s) or bond(s).
- Return type
np.ndarray
Tokenizer
- class molgraph.chemistry.Tokenizer[source]
Atomic tokenizer.
- Parameters
features (list[Feature]) – List of atomic (atom or bond) features.
Example:
>>> import molgraph
>>> import rdkit.Chem
>>> atom_tokenizer = molgraph.chemistry.Tokenizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'N'},  # irrelevant
...         ordinal=False,             # irrelevant
...         oov_size=0                 # irrelevant
...     ),
...     molgraph.chemistry.features.Hybridization(
...         allowable_set={'SP', 'SP2', 'SP3'},
...     )
... ])
>>> # Obtain an Atom
>>> rdkit_mol = rdkit.Chem.MolFromSmiles('CC')
>>> rdkit_atom = rdkit_mol.GetAtomWithIdx(0)
>>> # Encode Atom as a token
>>> atom_tokenizer(rdkit_atom)
array('Sym:C|Hyb:SP3', dtype='<U13')
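Because a token such as 'Sym:C|Hyb:SP3' is a string, it has to be mapped to an integer id before it can index an embedding table. The sketch below mirrors the token format shown in the output above and builds a simple vocabulary. The helpers tokenize and build_vocabulary are hypothetical, and reserving id 0 for unseen tokens is an assumption; this is not how the tokenizer or an embedding lookup layer is implemented in the library.

```python
def tokenize(named_values):
    """Joins (prefix, value) pairs into one token, e.g. 'Sym:C|Hyb:SP3'."""
    return '|'.join(f'{prefix}:{value}' for prefix, value in named_values)


def build_vocabulary(tokens):
    """Maps each distinct token to an integer id, starting at 1.

    Assumption: id 0 is reserved for tokens not seen when building
    the vocabulary.
    """
    distinct = sorted(set(tokens))
    return {token: index for index, token in enumerate(distinct, start=1)}


# Tokens for the three heavy atoms of 'CCO'.
tokens = [
    tokenize([('Sym', 'C'), ('Hyb', 'SP3')]),
    tokenize([('Sym', 'C'), ('Hyb', 'SP3')]),
    tokenize([('Sym', 'O'), ('Hyb', 'SP3')]),
]
vocabulary = build_vocabulary(tokens)
ids = [vocabulary.get(token, 0) for token in tokens]
print(tokens[0])  # Sym:C|Hyb:SP3
print(ids)        # [1, 1, 2]
```

Each id would then select one row of a trainable embedding matrix, which is what turns string-valued features into float vectors.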
- __call__(inputs, *args, **kwargs)[source]
Tokenizes RDKit atom(s) or bond(s).
- Parameters
inputs (list, rdkit.Chem.Atom, rdkit.Chem.Bond) – Either a single RDKit atom, a single RDKit bond, a list of RDKit atoms, or a list of RDKit bonds. If bonds do not exist for a given molecule, the list of bonds will be an empty list. If a bond is a self loop, the bond will be represented as None.
- Returns
Token encoding of atom(s) or bond(s).
- Return type
np.ndarray
Features
- class molgraph.chemistry.Feature[source]
Atomic feature.
Defines an atom or bond feature, which can subsequently be passed to a chemistry.Featurizer or chemistry.Tokenizer to be featurized or tokenized respectively.
Arguments (see below) will only take effect when Feature is wrapped in an Encoding, which occurs automatically inside chemistry.Featurizer or chemistry.Tokenizer.
- Parameters
allowable_set (set, list, tuple, None) – A set of features that will be considered. If None, default set will be used, if it exists. Default to None.
ordinal (bool) – Whether to encode the feature as an ordinal vector. Only relevant if allowable_set exists and the feature is passed to a chemistry.Featurizer. Default to False.
oov_size (int) – The number of bins allotted to out-of-vocabulary (OOV) features. Only relevant if allowable_set exists. Default to 0.
Examples:
Pass features to chemistry.Featurizer to encode atom(s):
>>> import molgraph
>>> import rdkit.Chem
>>> # Obtain RDKit atoms, via the RDKit API
>>> atoms = rdkit.Chem.MolFromSmiles('COO').GetAtoms()
>>> # Build an atom featurizer, from Featurizer
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     molgraph.chemistry.features.Symbol(
...         allowable_set={'C', 'O'}, oov_size=1),      # specify param
...     molgraph.chemistry.features.HydrogenAcceptor()  # use default param
... ])
>>> # Compute numerical encoding of atoms. (OOV bin is prepended.)
>>> atom_featurizer(atoms)
array([[0., 1., 0., 0.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]], dtype=float32)
Create custom features by subclassing chemistry.Feature:
>>> class MySymbolFeature(molgraph.chemistry.Feature):
...     def __call__(self, atom: rdkit.Chem.Atom) -> str:
...         return atom.GetSymbol()
>>> # Obtain RDKit atoms, via the RDKit API
>>> atoms = rdkit.Chem.MolFromSmiles('COO').GetAtoms()
>>> # Build an atom featurizer, from Featurizer
>>> atom_featurizer = molgraph.chemistry.Featurizer([
...     MySymbolFeature(allowable_set={'C', 'O'}, oov_size=1),
...     molgraph.chemistry.features.HydrogenAcceptor()
... ])
>>> # Compute numerical encoding of atoms. (OOV bin is prepended.)
>>> atom_featurizer(atoms)
array([[0., 1., 0., 0.],
       [0., 0., 1., 1.],
       [0., 0., 1., 1.]], dtype=float32)
Utilize feature factories chemistry.atom_features and chemistry.bond_features:
>>> molgraph.chemistry.bond_features.registered_features
['bond_type', 'conjugated', 'rotatable', 'stereo']
>>> molgraph.chemistry.atom_features.registered_features
['symbol', 'hybridization', 'cip_code', 'chiral_center',
 'formal_charge', 'total_num_hs', 'total_valence',
 'num_radical_electrons', 'degree', 'aromatic', 'hetero',
 'hydrogen_donor', 'hydrogen_acceptor', 'ring_size', 'ring',
 'crippen_log_p_contribution',
 'crippen_molar_refractivity_contribution',
 'tpsa_contribution', 'labute_asa_contribution',
 'gasteiger_charge']
>>> molgraph.chemistry.atom_features.get('cip_code')
CIPCode(allowable_set={None, 'R', 'S'}, ordinal=False, oov_size=0)
>>> molgraph.chemistry.atom_features.get('cip_code', ordinal=True)
CIPCode(allowable_set={None, 'R', 'S'}, ordinal=True, oov_size=0)
- class molgraph.chemistry.features.CrippenMolarRefractivityContribution(Feature)[source]
Atom feature.
Chemistry ops
TF records
- molgraph.chemistry.benchmark.tf_records.write(path, data, encoder=None, num_files=None, num_processes=None, **kwargs)[source]
Writes TF records.
Example:
>>> import molgraph
>>> x = ['CC', 'CCC', 'CCCC']
>>> y = [ 5.2,   7.4,    8.1]
>>> encoder = molgraph.chemistry.MolecularGraphEncoder(
...     atom_encoder=molgraph.chemistry.Featurizer([
...         molgraph.chemistry.features.Symbol(),
...         molgraph.chemistry.features.Hybridization(),
...     ])
... )
>>> molgraph.chemistry.tf_records.write(
...     path='/tmp/dummy_records/',
...     data={'x': x, 'y': y},
...     encoder=encoder
... )
Important: If the current environment runs on a GPU by default, please use the writer context manager instead:
>>> import tensorflow as tf
>>> from molgraph import GraphTensor
>>> with molgraph.chemistry.tf_records.writer('/tmp/dummy_records/') as writer:
...
...     # In contrast to the previous example, let's obtain the GraphTensor
...     # instances outside the write function. Note: this would cause
...     # issues if run on a GPU (namely, without the writer context manager)
...
...     x1 = GraphTensor(
...         node_feature=tf.constant([[1.]]),
...         edge_src=tf.constant([], dtype=tf.int64),
...         edge_dst=tf.constant([], dtype=tf.int64))
...
...     x2 = GraphTensor(
...         node_feature=tf.constant([[1.], [2.]]),
...         edge_src=tf.constant([0, 1], dtype=tf.int64),
...         edge_dst=tf.constant([1, 0], dtype=tf.int64))
...
...     x3 = GraphTensor(
...         node_feature=tf.constant([[1.], [2.], [3.]]),
...         edge_src=tf.constant([0, 1, 2], dtype=tf.int64),
...         edge_dst=tf.constant([1, 2, 0], dtype=tf.int64))
...
...     # Should not specify path, num_files or num_processes; encoder is
...     # optional: here not needed as we already obtained the graph tensors.
...     writer.write(
...         data={'x': [x1, x2, x3], 'y': [0., 1., 2.]},
...         encoder=None  # encoder not needed as graph tensors are passed
...     )
...
>>> # Load TF records as tf.data.Dataset
>>> ds = molgraph.chemistry.tf_records.load(
...     '/tmp/dummy_records/'
... )
- Parameters
path (str) – The path to write TF records to (save path). Should not include file name. File names are automatically determined.
data (dict) – The data to be written as TF records. The keys of the data (dict) are the names of the data fields, while the values are the actual values (of the fields). E.g., {'x': ['CC', 'CCO'], 'y': [4.1, 2.4]}. The encoder will be applied to the mandatory data['x'] field.
encoder (MolecularGraphEncoder, MolecularGraphEncoder3D, None) – The encoder to be applied to data['x']. The encoder transforms the string (or rdkit.Chem.Mol) representations of molecules into a GraphTensor. If None, it is assumed that data['x'] already contains GraphTensor instances in a list: [gt_1, gt_2, …, gt_n]. Default to None.
num_files (int, None) – The number of TF record files to write to. If None, num_files will be set to num_processes. Default to None.
num_processes (int, None) – The number of worker processes to use. If None, multiprocessing.cpu_count() will be used. Using multiple worker processes significantly speeds up writing of TF records. If num_files < num_processes, only num_files processes will be used. Default to None.
- Returns
None
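The interaction between num_files and num_processes described in the parameter list can be summarized in a few lines. The sketch below follows the stated rules only; the helpers resolve_workers and shard are hypothetical, and the round-robin split of the data across files is an assumption rather than the writer's documented strategy.

```python
import multiprocessing


def resolve_workers(num_files=None, num_processes=None):
    """Returns (num_files, processes actually used) per the rules above."""
    if num_processes is None:
        num_processes = multiprocessing.cpu_count()
    if num_files is None:
        num_files = num_processes
    # If num_files < num_processes, only num_files processes are used.
    return num_files, min(num_files, num_processes)


def shard(items, num_files):
    """Assumed round-robin split of items into num_files shards."""
    return [items[i::num_files] for i in range(num_files)]


print(resolve_workers(num_files=2, num_processes=8))  # (2, 2)
print(shard(['CC', 'CCC', 'CCCC'], 2))  # [['CC', 'CCCC'], ['CCC']]
```

For small datasets, setting num_files explicitly avoids producing many nearly empty record files on machines with many cores.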
- molgraph.chemistry.benchmark.tf_records.load(path, extract_tuple=None, shuffle_tf_records=False)[source]
Loads TF records.
Example:
>>> ds = molgraph.chemistry.tf_records.load(
...     path='/tmp/dummy_records/',
...     # extract_tuple=('x', 'y')
... )
>>> ds = ds.shuffle(3)
>>> ds = ds.batch(2)
>>> ds = ds.prefetch(-1)
>>> for batch in ds.take(1):
...     print(batch['x'])
- Parameters
path (str) – Path to TF record files (excluding file names).
extract_tuple (list[str], tuple[str], None) – Optionally specify what fields to extract. If None, the returned TF dataset will produce dictionaries (corresponding to the data passed to write). If not None, tuples will be produced. Default to None.
shuffle_tf_records (bool) – Whether TF record files should be shuffled. Recommended to be set to True when loading a training dataset. Default to False.
- Returns
A TF dataset ready to be passed to GNN models.
- Return type
tf.data.Dataset
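The effect of extract_tuple on the shape of each dataset element can be pictured with plain Python values. This is only an illustration of the dictionary versus tuple behaviour described above; the helper extract and the placeholder record are hypothetical and no TF dataset is involved.

```python
def extract(record, extract_tuple=None):
    """Returns the record as-is, or a tuple of the requested fields."""
    if extract_tuple is None:
        return record
    return tuple(record[key] for key in extract_tuple)


# Placeholder record; in practice 'x' would hold a graph, not a string.
record = {'x': 'graph-0', 'y': 5.2}
print(extract(record))                            # {'x': 'graph-0', 'y': 5.2}
print(extract(record, extract_tuple=('x', 'y')))  # ('graph-0', 5.2)
```

A tuple of (inputs, targets) is the form that Keras training loops typically expect, which is a common reason to request tuples.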
Datasets
- class molgraph.chemistry.benchmark.datasets.DatasetFactory[source]
A dataset factory.
Obtain a dataset with ease:
>>> esol_dataset = molgraph.chemistry.datasets.get('esol')
List available (registered) datasets:
>>> molgraph.chemistry.datasets.registered_datasets
['smrt', 'qm7', 'qm8', 'qm9', 'pcba', 'muv', 'hiv', 'bace',
 'bbbp', 'tox21', 'toxcast', 'sider', 'clintox', 'esol',
 'freesolv', 'lipophilicity']
- get(name, **config)[source]
Get dataset, by name.
- Parameters
name (str) – Name of the dataset (see registered_datasets).
**config – Overwrite the default configuration by specifying new configuration values as keyword arguments. E.g., dataset.get('esol', splitter=None) to avoid splitting the dataset.
- Returns
A Dataset or nested Dataset.
- get_config(name)[source]
Get config of dataset, by name.
- Parameters
name (str) – Name of the dataset (see registered_datasets).
- Returns
The configuration (dict) of the specified dataset.
- property registered_datasets
Lists all registered datasets.
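The way keyword arguments passed to get override a dataset's default configuration can be sketched with a small registry. The class DatasetRegistry, its register method, and the configuration values 'random' and 'regression' are hypothetical and only serve the illustration; the splitter key is taken from the example above. This is not the factory's actual implementation.

```python
class DatasetRegistry:
    """Minimal name-to-config registry with per-call overrides."""

    def __init__(self):
        self._configs = {}

    def register(self, name, **default_config):
        self._configs[name] = dict(default_config)

    @property
    def registered_datasets(self):
        return list(self._configs)

    def get_config(self, name):
        # Return a copy so callers cannot mutate the stored defaults.
        return dict(self._configs[name])

    def get(self, name, **config):
        # Keyword arguments take precedence over the defaults.
        resolved = {**self._configs[name], **config}
        # A real factory would download, parse and split the data here.
        return resolved


registry = DatasetRegistry()
registry.register('esol', splitter='random', task='regression')
print(registry.get('esol', splitter=None))
# {'splitter': None, 'task': 'regression'}
print(registry.get_config('esol'))
# {'splitter': 'random', 'task': 'regression'}
```

Merging overrides into a copy keeps the registered defaults intact, so a later call without overrides still sees the original configuration.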