Loading and Featurising Molecular Data
In this notebook, we will use GAUCHE’s data loaders to quickly and easily load and preprocess molecular property and yield prediction datasets.
Molecular Property Prediction
The MolPropLoader class provides a range of useful helper functions for loading and featurising molecular property prediction datasets. It comes with a number of built-in datasets that you can use to test your models:
Photoswitch
: The task is to predict the values of the E isomer π − π∗ transition wavelength for 392 photoswitch molecules.
ESOL
: The task is to predict the logarithmic aqueous solubility values for 1128 organic small molecules.
FreeSolv
: The task is to predict the hydration free energy values for 642 organic small molecules.
Lipophilicity
: The task is to predict the octanol/water distribution coefficients for 4200 organic small molecules.
You can load them by calling the load_benchmark function with the corresponding dataset name. Alternatively, you can load your own dataset by calling read_csv(path, smiles_column, label_column) with the path to your dataset and the names of the columns containing the SMILES strings and labels.
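As a minimal sketch of the custom-dataset route, the snippet below writes a tiny CSV file in the layout that read_csv is described as expecting (one SMILES column, one label column) and wraps the GAUCHE call in a helper. The column names "smiles" and "label", the file name, the placeholder label values and the helper load_custom_dataset are all assumptions made for illustration; only MolPropLoader and read_csv(path, smiles_column, label_column) come from the text above.

```python
import csv
import os
import tempfile

# Toy dataset. The column names and label values are placeholders chosen
# for illustration only; they are not measurements.
rows = [
    {"smiles": "CCO", "label": 0.1},
    {"smiles": "c1ccccc1", "label": 0.2},
    {"smiles": "CC(=O)O", "label": 0.3},
]

# Write the CSV to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "my_dataset.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles", "label"])
    writer.writeheader()
    writer.writerows(rows)


def load_custom_dataset(csv_path):
    """Hypothetical helper: load the CSV with GAUCHE (requires gauche)."""
    # Imported lazily so the CSV-writing part runs without GAUCHE installed.
    from gauche.dataloader import MolPropLoader

    loader = MolPropLoader()
    loader.read_csv(csv_path, "smiles", "label")
    return loader
```

With GAUCHE installed, calling load_custom_dataset(path) should leave the SMILES strings in loader.features and the labels in loader.labels, as with the benchmark datasets.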
[1]:
from gauche.dataloader import MolPropLoader
# load a benchmark dataset
loader = MolPropLoader()
loader.load_benchmark("Photoswitch")
Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]
To turn validation off, use dataloader.read_csv(..., validate=False).
As you can see, the dataloader automatically runs a validation function that filters out invalid SMILES strings and non-numeric labels. The valid and canonicalised SMILES strings and labels are now stored in the loader.features and loader.labels attributes.
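To make the label part of this validation concrete, here is a rough sketch of what such a filter could look like in plain Python. It is not GAUCHE's implementation: the function name drop_invalid_labels is hypothetical, and the SMILES validity check and canonicalisation (which would need a cheminformatics toolkit) are deliberately left out.

```python
import math


def drop_invalid_labels(smiles, labels):
    """Keep only (SMILES, label) pairs whose label is a finite number.

    Illustrative sketch only: returns the kept SMILES, the kept labels
    as floats, and the indices of the entries that were dropped.
    """
    kept_smiles, kept_labels, dropped = [], [], []
    for index, (smi, label) in enumerate(zip(smiles, labels)):
        try:
            value = float(label)
        except (TypeError, ValueError):
            # Non-numeric entries such as "n/a" or None count as invalid.
            value = math.nan
        if math.isfinite(value):
            kept_smiles.append(smi)
            kept_labels.append(value)
        else:
            dropped.append(index)
    return kept_smiles, kept_labels, dropped


smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
labels = [1.0, float("nan"), "n/a"]  # placeholder labels, not real data
kept_smiles, kept_labels, dropped = drop_invalid_labels(smiles, labels)
```

Here the second and third entries are dropped (a NaN label and a non-numeric string), which mirrors the "Found 13 invalid labels ... at indices ..." message printed above.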
[2]:
display(loader.features[:5])
display(loader.labels[:5])
['Cn1nnc(N=Nc2ccccc2)n1',
'Cn1cnc(N=Nc2ccccc2)n1',
'Cn1ccc(N=Nc2ccccc2)n1',
'Cc1cn(C)nc1N=Nc1ccccc1',
'Cn1cc(N=Nc2ccccc2)cn1']
array([[310.],
[310.],
[320.],
[325.],
[328.]])
We can now use the loader.featurize function to featurise the molecules. These featurisers are simply functions that take a list of SMILES strings and return a list of feature vectors. GAUCHE comes with a number of built-in featurisers that you can use:
ecfp_fingerprints
: Extended Connectivity Fingerprints (ECFP) that encode all circular substructures up to a certain diameter.
fragments
: A featuriser that encodes the presence of a number of predefined RDKit fragments.
ecfp_fragprints
: A combination of ecfp_fingerprints and fragments.
molecular_graphs
: A featuriser that encodes the molecular graph as a graph of atoms and bonds.
bag_of_smiles
: A featuriser that encodes the SMILES strings as a bag of characters.
bag_of_selfies
: A featuriser that encodes the SMILES strings as a bag of SELFIES characters.
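To illustrate the bag-of-characters idea from the list above, the following sketch counts how often each character occurs in each SMILES string. It only shows the general technique: the function names are hypothetical, and GAUCHE's built-in bag_of_smiles may tokenise and count differently.

```python
from collections import Counter


def build_vocabulary(smiles):
    """Return the sorted list of distinct characters across all strings."""
    return sorted({ch for s in smiles for ch in s})


def bag_of_characters(smiles):
    """Map a list of SMILES strings to a list of character-count vectors.

    Illustrative sketch: one vector per string, one entry per character
    in the vocabulary built from the same list of strings.
    """
    vocabulary = build_vocabulary(smiles)
    features = []
    for s in smiles:
        counts = Counter(s)
        features.append([counts[ch] for ch in vocabulary])
    return features


example = ["CCO", "C=O"]
vocabulary = build_vocabulary(example)  # ['=', 'C', 'O']
features = bag_of_characters(example)   # [[0, 2, 1], [1, 1, 1]]
```

Because bag_of_characters takes a list of SMILES strings and returns a list of feature vectors, it has the same shape as the featurisers described above.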
When calling the loader.featurize function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For example, we can specify the diameter of the ECFP fingerprints or the maximum number of fragments to encode. For a full list of keyword arguments, please refer to the documentation.
[3]:
loader.featurize("ecfp_fingerprints")
loader.features[:5]
[3]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
We can also pass any custom featuriser that maps a list of SMILES strings to a list of feature vectors. For example, we can just return the length of the SMILES strings as a feature vector:
[4]:
# load dataset again to undo featurisation
loader = MolPropLoader()
loader.load_benchmark("Photoswitch")
# define custom featurisation function
def smiles_length(smiles):
    return [len(s) for s in smiles]
loader.featurize(smiles_length)
loader.features[:5]
Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]
To turn validation off, use dataloader.read_csv(..., validate=False).
[4]:
[21, 21, 21, 22, 21]
This was all we needed to do to load and featurise our dataset. The featurised molecules are now stored in the loader.features attribute and can be passed to the GP models.
Reaction Yield Prediction
The ReactionLoader class provides a range of useful helper functions for loading and featurising reaction yield prediction datasets. The reaction data can be provided as either multiple SMILES columns or a single reaction SMARTS column. It comes with a number of built-in datasets that you can use to test your models:
DreherDoyle
: Data from "Predicting reaction performance in C–N cross-coupling using machine learning" (Science, 2018), provided as multiple SMILES columns. The task is to predict the yields for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings.
DreherDoyleRXN
: The DreherDoyle dataset as a single reaction SMARTS column.
Suzuki-Miyaura
: Data from "A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow" (Science, 2018). The task is to predict the yields for 5760 Pd-catalysed Suzuki-Miyaura C-C cross-couplings.
Suzuki-MiyauraRXN
: The Suzuki-Miyaura dataset as a single reaction SMARTS column.
You can load them by calling the load_benchmark function with the corresponding dataset name. Alternatively, you can load your own dataset by calling read_csv(path, reactant_column, label_column) with the path to your dataset, the reactant column(s) and the name of your label column. The reactant_column argument can either be a single reaction SMARTS column or a list of SMILES columns.
[5]:
from gauche.dataloader import ReactionLoader
# load a benchmark dataset
loader = ReactionLoader()
loader.load_benchmark("DreherDoyleRXN")
display(loader.features[:5])
loader.labels[:5]
0 Clc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...
1 Brc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...
2 CCc1ccc(I)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2cc...
3 FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#4...
4 COc1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2c...
Name: rxn, dtype: object
[5]:
array([[70.41045785],
[11.06445724],
[10.22354965],
[20.0833829 ],
[ 0.49266271]])
We can now use the loader.featurize function to featurise the SMILES/SMARTS. GAUCHE comes with a number of built-in featurisers that you can use:
ohe
: A one-hot encoding that specifies which of the components in the different reactant and reagent categories is present. In the Buchwald-Hartwig example, the OHE would describe which of the aryl halides, Buchwald ligands, bases and additives are used in the reaction.
drfp
: The differential reaction fingerprint, constructed by taking the symmetric difference of the sets containing the molecular substructures on both sides of the reaction arrow. Reagents are added to the reactants. (Only works for reaction SMARTS.)
rxnfp
: A data-driven reaction fingerprint using Transformer models such as BERT, trained in a supervised or an unsupervised fashion on reaction SMILES. (Only works for reaction SMARTS.)
bag_of_smiles
: A bag-of-characters representation of the reaction SMARTS. (Only works for reaction SMARTS.)
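The one-hot encoding described in the list above can be sketched in a few lines of plain Python. This is an illustration of the general idea rather than GAUCHE's ohe implementation: the function name, the category names "aryl_halide" and "base", and the labels "base_A" and "base_B" are hypothetical; the two aryl halide SMILES are taken from the dataset preview above.

```python
def one_hot_encode(reactions, categories):
    """One-hot encode reaction components, category by category.

    Illustrative sketch: `reactions` is a list of dicts mapping a category
    name to the component used, and the result concatenates one one-hot
    block per category in the order given by `categories`.
    """
    # Collect the distinct components seen in each category.
    levels = {c: sorted({r[c] for r in reactions}) for c in categories}
    vectors = []
    for r in reactions:
        vec = []
        for c in categories:
            vec.extend(1 if r[c] == level else 0 for level in levels[c])
        vectors.append(vec)
    return levels, vectors


reactions = [
    {"aryl_halide": "Clc1ccccn1", "base": "base_A"},
    {"aryl_halide": "Brc1ccccn1", "base": "base_B"},
    {"aryl_halide": "Clc1ccccn1", "base": "base_B"},
]
levels, vectors = one_hot_encode(reactions, ["aryl_halide", "base"])
# vectors == [[0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1]]
```

Each vector has one entry per distinct component, so its length grows with the number of different aryl halides, ligands, bases and additives in the dataset.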
When calling the loader.featurize function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For a full list of keyword arguments, please refer to the documentation.
If the drfp requirement is not satisfied, you can install it by running
!pip install drfp
in a notebook cell before continuing.
[7]:
loader.featurize("drfp")
loader.features[:5]
[7]:
array([[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.]])
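The symmetric-difference idea behind the differential reaction fingerprint can be illustrated with a toy example. This is only a sketch under strong assumptions: the substructure labels are hypothetical strings rather than substructures extracted from real molecules, and the hashing and folding scheme (SHA-256 into 16 bits) is chosen for illustration and is not the one used by the drfp package.

```python
import hashlib


def toy_difference_fingerprint(reactant_subs, product_subs, n_bits=16):
    """Toy fingerprint from the symmetric difference of two substructure sets.

    Substructures present on only one side of the reaction arrow are
    hashed and folded into a fixed-length binary vector.
    """
    difference = set(reactant_subs) ^ set(product_subs)
    bits = [0] * n_bits
    for sub in difference:
        # hashlib gives a hash that is stable across Python sessions.
        digest = hashlib.sha256(sub.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return difference, bits


# Hypothetical substructure labels for the two sides of a reaction.
reactant_side = {"c-Cl", "c-N", "N-H"}
product_side = {"c-N", "c-N-c"}
difference, bits = toy_difference_fingerprint(reactant_side, product_side)
# "c-N" appears on both sides, so it does not contribute to the fingerprint.
```

Substructures that are unchanged by the reaction cancel out, so the fingerprint only reflects what differs between the two sides of the reaction arrow.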
We can again pass any custom featuriser that maps a list of SMILES strings or reaction SMARTS strings to a list of feature vectors. For example, we can take the length of the reaction SMARTS string:
[8]:
# load dataset again to undo featurisation
loader = ReactionLoader()
loader.load_benchmark("DreherDoyleRXN")
# define custom featurisation function
def smiles_length(smiles):
    return [len(s) for s in smiles]
loader.featurize(smiles_length)
loader.features[:5]
[8]:
[274, 277, 212, 207, 257]
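A custom featuriser can also return several values per reaction. The sketch below is a hypothetical example that counts the dot-separated components on each side of the reaction arrow and appends the string length. It assumes the reaction strings use ">>" as the arrow with no separate agents field, and the example reaction is made up for illustration rather than taken from the benchmark.

```python
def reaction_component_counts(reactions):
    """Map reaction strings to [n_left_components, n_right_components, length].

    Illustrative sketch: assumes each string has the form
    "reactants>>products" with components separated by ".".
    """
    features = []
    for rxn in reactions:
        left, _, right = rxn.partition(">>")
        n_left = len(left.split(".")) if left else 0
        n_right = len(right.split(".")) if right else 0
        features.append([n_left, n_right, len(rxn)])
    return features


# Made-up example: an esterification written as a reaction string.
example_reactions = ["CCO.CC(=O)O>>CC(=O)OCC.O"]
example_features = reaction_component_counts(example_reactions)
```

Like smiles_length, this function takes a list of strings and returns a list of feature vectors, so it has the form that loader.featurize is described as accepting.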