Loading and Featurising Molecular Data

In this notebook, we will use GAUCHE's data loaders to quickly and easily load and preprocess molecular property and reaction yield prediction datasets.

Molecular Property Prediction

The MolPropLoader class provides a range of useful helper functions for loading and featurising molecular property prediction datasets. It comes with a number of built-in datasets that you can use to test your models:

  • Photoswitch: The task is to predict the values of the E isomer π − π∗ transition wavelength for 392 photoswitch molecules.

  • ESOL: The task is to predict the logarithmic aqueous solubility values for 1128 organic small molecules.

  • FreeSolv: The task is to predict the hydration free energy values for 642 organic small molecules.

  • Lipophilicity: The task is to predict the octanol/water distribution coefficients for 4200 organic small molecules.

You can load them by calling the load_benchmark function with the corresponding dataset name. Alternatively, you can load your own dataset by calling read_csv(path, smiles_column, label_column) with the path to your dataset and the names of the columns containing the SMILES strings and the labels.
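For a custom dataset, the CSV file only needs one column of SMILES strings and one column of numeric labels. The sketch below prepares such a file with the standard library. The file name, the column names `smiles` and `label`, and the label values are hypothetical placeholders, and the final call is left as a comment that simply follows the read_csv(path, smiles_column, label_column) signature quoted above.

```python
import csv
import os
import tempfile

# Hypothetical column names; the label values are placeholders, not measurements.
rows = [
    {"smiles": "CCO", "label": 0.0},
    {"smiles": "c1ccccc1", "label": 1.0},
]

# Write the dataset to a temporary CSV file with a header row.
path = os.path.join(tempfile.mkdtemp(), "my_dataset.csv")
with open(path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["smiles", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm the layout a loader would see.
with open(path, newline="") as handle:
    loaded = list(csv.DictReader(handle))
print(loaded[0]["smiles"])  # CCO

# With GAUCHE installed, the file could then be loaded as described above:
# loader = MolPropLoader()
# loader.read_csv(path, "smiles", "label")
```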

[1]:
from gauche.dataloader import MolPropLoader

# load a benchmark dataset
loader = MolPropLoader()
loader.load_benchmark("Photoswitch")
Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]
To turn validation off, use dataloader.read_csv(..., validate=False).

As you can see, the dataloader automatically runs a validation function that filters out invalid SMILES strings and non-numeric labels. The valid and canonicalised SMILES strings and labels are now stored in the loader.features and loader.labels attributes.

[2]:
display(loader.features[:5])
display(loader.labels[:5])
['Cn1nnc(N=Nc2ccccc2)n1',
 'Cn1cnc(N=Nc2ccccc2)n1',
 'Cn1ccc(N=Nc2ccccc2)n1',
 'Cc1cn(C)nc1N=Nc1ccccc1',
 'Cn1cc(N=Nc2ccccc2)cn1']
array([[310.],
       [310.],
       [320.],
       [325.],
       [328.]])
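To make the label validation step more concrete, here is a minimal sketch of the idea in plain Python. It is not GAUCHE's implementation: the helper name `filter_valid_labels` is hypothetical, the SMILES validity check is omitted, and the NaN label is inserted artificially to mimic the missing values reported above.

```python
import math

def filter_valid_labels(smiles, labels):
    """Keep only the (SMILES, label) pairs whose label is a finite number."""
    kept_smiles, kept_labels, dropped = [], [], []
    for index, (smi, label) in enumerate(zip(smiles, labels)):
        try:
            value = float(label)
        except (TypeError, ValueError):
            # Non-numeric labels are treated like missing values.
            value = float("nan")
        if math.isfinite(value):
            kept_smiles.append(smi)
            kept_labels.append(value)
        else:
            dropped.append(index)
    return kept_smiles, kept_labels, dropped

smiles = [
    "Cn1nnc(N=Nc2ccccc2)n1",
    "Cn1cnc(N=Nc2ccccc2)n1",
    "Cn1ccc(N=Nc2ccccc2)n1",
]
labels = [310.0, float("nan"), 320.0]  # the NaN is artificial

valid_smiles, valid_labels, dropped = filter_valid_labels(smiles, labels)
print(dropped)       # [1]
print(valid_labels)  # [310.0, 320.0]
```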

We can now use the loader.featurize function to featurise the molecules. Featurisers are simply functions that take a list of SMILES strings and return a list of feature vectors. GAUCHE comes with a number of built-in featurisers that you can use:

  • ecfp_fingerprints: Extended Connectivity Fingerprints (ECFP) that encode all circular substructures up to a certain diameter.

  • fragments: A featuriser that encodes the presence of a number of predefined rdkit fragments.

  • ecfp_fragprints: A combination of ecfp_fingerprints and fragments.

  • molecular_graphs: A featuriser that encodes the molecular graph as a graph of atoms and bonds.

  • bag_of_smiles: A featuriser that encodes the SMILES strings as a bag of characters.

  • bag_of_selfies: A featuriser that encodes the SMILES strings as a bag of SELFIES characters.
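To illustrate what a bag-of-characters representation looks like, here is a simplified sketch in plain Python. The function name `bag_of_characters` is hypothetical, and the built-in bag_of_smiles featuriser may tokenise the strings differently; the sketch only shows the general idea of counting how often each character of a shared vocabulary occurs in each SMILES string.

```python
from collections import Counter

def bag_of_characters(smiles_list):
    """Return a sorted character vocabulary and one count vector per string."""
    vocabulary = sorted({char for smi in smiles_list for char in smi})
    vectors = []
    for smi in smiles_list:
        counts = Counter(smi)
        # Counter returns 0 for characters that do not occur in this string.
        vectors.append([counts[char] for char in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_characters(["CCO", "CN"])
print(vocabulary)  # ['C', 'N', 'O']
print(vectors)     # [[2, 0, 1], [1, 1, 0]]
```

Every molecule is mapped to a vector of the same length, which is what allows the result to be used as a feature matrix.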

When calling the loader.featurize function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For example, we can specify the diameter of the ECFP fingerprints or the maximum number of fragments to encode. For a full list of keyword arguments, please refer to the documentation.

[3]:
loader.featurize("ecfp_fingerprints")
loader.features[:5]
[3]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

We can also pass any custom featuriser that maps a list of SMILES strings to a list of feature vectors. For example, we can just return the length of the SMILES strings as a feature vector:

[4]:
# load dataset again to undo featurisation
loader = MolPropLoader()
loader.load_benchmark("Photoswitch")

# define custom featurisation function
def smiles_length(smiles):
    return [len(s) for s in smiles]

loader.featurize(smiles_length)
loader.features[:5]
Found 13 invalid labels [nan nan nan nan nan nan nan nan nan nan nan nan nan] at indices [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 158]
To turn validation off, use dataloader.read_csv(..., validate=False).
[4]:
[21, 21, 21, 22, 21]

This was all we needed to do to load and featurise our dataset. The featurised molecules are now stored in the loader.features attribute and can be passed to the GP models.

Reaction Yield Prediction

The ReactionLoader class provides a range of useful helper functions for loading and featurising reaction yield prediction datasets. The reaction data can be provided as either multiple SMILES columns or a single reaction SMARTS column. It also comes with built-in benchmark datasets that you can use to test your models, such as the DreherDoyleRXN dataset loaded below.

You can load them by calling the load_benchmark function with the corresponding dataset name. Alternatively, you can load your own dataset by calling read_csv(path, reactant_column, label_column) with the path to your dataset and the names of your reactant and label columns. The reactant_column argument can either be a single reaction SMARTS column or a list of SMILES columns.

[5]:
from gauche.dataloader import ReactionLoader

# load a benchmark dataset
loader = ReactionLoader()
loader.load_benchmark("DreherDoyleRXN")

display(loader.features[:5])
loader.labels[:5]
0    Clc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...
1    Brc1ccccn1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2ccccc...
2    CCc1ccc(I)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2cc...
3    FC(F)(F)c1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#4...
4    COc1ccc(Cl)cc1.Cc1ccc(N)cc1.O=S(=O)(O[#46]1c2c...
Name: rxn, dtype: object
[5]:
array([[70.41045785],
       [11.06445724],
       [10.22354965],
       [20.0833829 ],
       [ 0.49266271]])

We can now use the loader.featurize function to featurise the SMILES/SMARTS. GAUCHE comes with a number of built-in featurisers that you can use:

  • ohe: A one-hot encoding that specifies which of the components in the different reactant and reagent categories is present. In the Buchwald-Hartwig example, the OHE would describe which of the aryl halides, Buchwald ligands, bases and additives are used in the reaction.

  • drfp: The differential reaction fingerprint; constructed by taking the symmetric difference of the sets containing the molecular substructures on both sides of the reaction arrow. Reagents are added to the reactants. (Only works for reaction SMARTS).

  • rxnfp: A data-driven reaction fingerprint using Transformer models such as BERT and trained in a supervised or an unsupervised fashion on reaction SMILES. (Only works for reaction SMARTS).

  • bag_of_smiles: A bag of characters representation of the reaction SMARTS. (Only works for reaction SMARTS).
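The one-hot encoding described in the list above can be sketched in a few lines of plain Python. This is an illustration rather than GAUCHE's implementation: the function name `one_hot_encode`, the category names, and the base names `base_A` and `base_B` are hypothetical placeholders, while the two aryl halide SMILES are taken from the reactions printed earlier.

```python
def one_hot_encode(reactions):
    """Encode each reaction as concatenated one-hot blocks, one per category."""
    categories = sorted(reactions[0])
    # Collect the distinct components seen in every category.
    levels = {cat: sorted({rxn[cat] for rxn in reactions}) for cat in categories}
    encoded = []
    for rxn in reactions:
        row = []
        for cat in categories:
            row.extend(1 if rxn[cat] == level else 0 for level in levels[cat])
        encoded.append(row)
    return levels, encoded

reactions = [
    {"aryl_halide": "Clc1ccccn1", "base": "base_A"},
    {"aryl_halide": "Brc1ccccn1", "base": "base_A"},
    {"aryl_halide": "Clc1ccccn1", "base": "base_B"},
]

levels, encoded = one_hot_encode(reactions)
print(levels["aryl_halide"])  # ['Brc1ccccn1', 'Clc1ccccn1']
print(encoded)                # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
```

Each row contains exactly one 1 per category, so the encoding identifies which components were used without describing their chemical structure.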

When calling the loader.featurize function, we can additionally specify a range of keyword arguments that are passed to the featuriser. For a full list of keyword arguments, please refer to the documentation.

If the drfp requirement is not satisfied, you can run

!pip install drfp

in the next cell.

[7]:
loader.featurize("drfp")
loader.features[:5]
[7]:
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])
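The symmetric difference behind drfp can be illustrated with a toy example. The sketch below is not the drfp package: the substructure labels are written by hand instead of being extracted from molecules, and the helper name `fold_to_bits` and the 32-bit length are assumptions chosen for readability. It only shows that substructures present on both sides of the reaction cancel out, and that the remaining ones are hashed into a fixed-length bit vector.

```python
import hashlib

def fold_to_bits(items, n_bits=32):
    """Hash each item to a position in a fixed-length binary vector."""
    bits = [0] * n_bits
    for item in items:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits

# Hand-written toy substructure labels for the two sides of a reaction.
reactant_side = {"aryl-Cl", "N-H", "aromatic ring"}
product_side = {"aryl-N", "aromatic ring"}

# Substructures that appear on exactly one side of the reaction arrow.
changed = reactant_side ^ product_side
print(sorted(changed))  # ['N-H', 'aryl-Cl', 'aryl-N']

fingerprint = fold_to_bits(changed)
print(len(fingerprint))  # 32
```

The unchanged aromatic ring does not contribute to the fingerprint, so the vector focuses on what the reaction actually changes.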

We can again pass any custom featuriser that maps a list of SMILES or reaction SMARTS strings to a list of feature vectors. For example, we can take the length of the reaction SMARTS string:

[8]:
# load dataset again to undo featurisation
loader = ReactionLoader()
loader.load_benchmark("DreherDoyleRXN")

# define custom featurisation function
def smiles_length(smiles):
    return [len(s) for s in smiles]

loader.featurize(smiles_length)
loader.features[:5]
[8]:
[274, 277, 212, 207, 257]