gauche.dataloader#

Dataloader#

Abstract class implementing the data loading, data splitting, type validation and feature extraction functionalities.

Molecular Properties#

Subclass of the abstract data loader class for molecular property prediction datasets.

class gauche.dataloader.molprop_loader.MolPropLoader[source]#

Data loader class for molecular property prediction datasets with a single regression target. Expects input to be a csv file with one column for SMILES strings and one column for labels. Contains methods to validate the dataset and to transform the SMILES strings into different molecular representations.

__init__()[source]#

featurize(representation: str | Callable, **kwargs) → None[source]#

Transforms SMILES into the specified molecular representation.

Parameters:

representation (str or Callable) – the desired molecular representation, one of [ecfp_fingerprints, fragments, ecfp_fragprints, molecular_graphs, bag_of_smiles, bag_of_selfies, mqn] or a callable that takes a list of SMILES strings as input and returns the desired featurization.
kwargs (dict) – additional keyword arguments for the representation function
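As an illustration of the callable option, the following is a minimal sketch of a custom featurizer that takes a list of SMILES strings and returns one feature vector per string. The vocabulary `VOCAB` and the function name `char_count_featurizer` are hypothetical, and the nested-list return type is an assumption, since the container type the loader expects from a callable is not specified here.

```python
from collections import Counter
from typing import List

# Hypothetical fixed vocabulary of SMILES characters, chosen for illustration only.
VOCAB = ["C", "N", "O", "c", "n", "o", "=", "#", "(", ")", "1", "2"]


def char_count_featurizer(smiles: List[str]) -> List[List[int]]:
    """Map each SMILES string to a vector of character counts over VOCAB."""
    features = []
    for s in smiles:
        counts = Counter(s)
        features.append([counts.get(ch, 0) for ch in VOCAB])
    return features


example = char_count_featurizer(["CCO", "c1ccccc1"])
print(example)

# With GAUCHE installed and a dataset already loaded, the callable would be
# passed in place of a representation name:
#     loader.featurize(char_count_featurizer)
```

A character-count vector is far cruder than the built-in representations; it is used here only because it needs no third-party dependencies.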

load_benchmark(benchmark: str, path=None) → None[source]#

Loads one of the existing benchmark datasets from the data directory.

Parameters:

benchmark (str) – the benchmark dataset to be loaded, one of [Photoswitch, ESOL, FreeSolv, Lipophilicity].
path (str) – the path to the directory that contains the dataset, defaults to the data directory of the project if None

read_csv(path: str, smiles_column: str, label_column: str, validate: bool = True) → None[source]#

Loads a dataset from a .csv file. The file must contain the two specified columns with the SMILES strings and labels.

Parameters:

path (str) – path to the csv file
smiles_column (str) – name of the column containing the SMILES strings
label_column (str) – name of the column containing the labels
validate (bool) – whether to validate the loaded data
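The expected file layout can be sketched as follows. The column names `smiles` and `label` and the label values are placeholders invented for the example, not real measurements. The call into GAUCHE is wrapped in a function so that the rest of the snippet runs without the library installed.

```python
import csv
import os
import tempfile

# Placeholder rows: hypothetical column names and made-up label values.
rows = [
    {"smiles": "CCO", "label": "0.1"},
    {"smiles": "c1ccccc1", "label": "0.2"},
]

csv_path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Confirm that the file has the two columns read_csv will be asked for.
with open(csv_path, newline="") as f:
    header = next(csv.reader(f))
print(header)


def load_with_gauche(path: str):
    """Load the file with MolPropLoader; requires GAUCHE to be installed."""
    from gauche.dataloader.molprop_loader import MolPropLoader

    loader = MolPropLoader()
    loader.read_csv(path, smiles_column="smiles", label_column="label", validate=True)
    return loader
```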

validate(drop: bool | None = True, canonicalize: bool | None = True) → None[source]#

Utility function to validate a read-in dataset of SMILES strings and labels by checking that all SMILES strings can be converted to RDKit molecules and that all labels are numeric and not NaN. Optionally drops all invalid entries and makes the remaining SMILES strings canonical (both enabled by default).

Parameters:

drop (bool) – whether to drop invalid entries
canonicalize (bool) – whether to make the SMILES strings canonical
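The two checks described above can be sketched in plain Python. This is an illustration of the idea and not the loader's implementation: the helper names are hypothetical, and the SMILES check is a stand-in predicate so that the snippet runs without RDKit. With RDKit available, a check along the lines of `Chem.MolFromSmiles(s) is not None` would take its place.

```python
import math
from typing import Callable, List, Tuple


def is_valid_label(value) -> bool:
    """Return True if value parses as a float and is not NaN."""
    try:
        return not math.isnan(float(value))
    except (TypeError, ValueError):
        return False


def filter_valid(
    smiles: List[str],
    labels: list,
    smiles_ok: Callable[[str], bool],
) -> Tuple[List[str], List[float]]:
    """Keep only rows whose SMILES passes smiles_ok and whose label is numeric."""
    kept = [
        (s, float(y))
        for s, y in zip(smiles, labels)
        if smiles_ok(s) and is_valid_label(y)
    ]
    return [s for s, _ in kept], [y for _, y in kept]


def non_empty(s: str) -> bool:
    """Stand-in SMILES check (assumption): accepts any non-empty string."""
    return isinstance(s, str) and len(s) > 0


clean_smiles, clean_labels = filter_valid(
    ["CCO", "", "CCN", "CCC"], ["1.5", "2.0", "nan", "abc"], non_empty
)
print(clean_smiles, clean_labels)
```

Only the first row survives: the second has an empty SMILES string, the third a NaN label, and the fourth a non-numeric label.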

Reaction Loader#

Subclass of the abstract data loader class for reaction yield prediction datasets.

class gauche.dataloader.reaction_loader.ReactionLoader[source]#

Data loader class for reaction yield prediction datasets with a single regression target. Expects input to be a csv file with either multiple SMILES columns or a single reaction SMARTS column. Contains methods to validate the dataset and to transform the SMILES/SMARTS strings into different molecular representations.

__init__()[source]#

featurize(representation: str | Callable, **kwargs)[source]#

Transforms reactions into the specified representation.

Parameters:

representation (str or Callable) – the desired reaction representation, one of [ohe, rxnfp, drfp, bag_of_smiles] or a callable that takes a list of SMILES strings as input and returns the desired featurization.
kwargs (dict) – additional keyword arguments for the representation function

load_benchmark(benchmark: str, path=None) → None[source]#

Loads features and labels from one of the included benchmark datasets and feeds them into the DataLoader.

Parameters:

benchmark (str) – the benchmark dataset to be loaded, one of [DreherDoyle, SuzukiMiyaura, DreherDoyleRXN, SuzukiMiyauraRXN]. The RXN suffix denotes that the csv file contains reaction SMILES in a dedicated column.

read_csv(path: str, reactant_column: str | List[str], label_column: str, validate: bool = True) → None[source]#

Loads a dataset from a .csv file. Reactants must be provided as either multiple SMILES columns or a single reaction SMARTS column.

Parameters:

path (str) – path to the csv file
reactant_column (str or List[str]) – name of the column(s) containing the reactants
label_column (str) – name of the column containing the labels
validate (bool) – whether to validate the loaded data
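The two accepted layouts, and the matching form of the `reactant_column` argument, can be sketched as below. All column names, reaction strings, and yield values are placeholders invented for the example, and `reactant_column_for` is a hypothetical helper, not part of GAUCHE.

```python
import csv
import io

# Layout A: one SMILES column per reaction component (hypothetical names).
multi_column_csv = """aryl_halide,amine,yield
Brc1ccccc1,CN,10.0
Clc1ccccc1,CCN,20.0
"""

# Layout B: a single column holding the whole reaction string.
single_column_csv = """rxn,yield
Brc1ccccc1.CN>>CNc1ccccc1,10.0
Clc1ccccc1.CCN>>CCNc1ccccc1,20.0
"""


def reactant_column_for(csv_text: str, label_column: str):
    """Return a str for a single reactant column, else a list of column names."""
    header = next(csv.reader(io.StringIO(csv_text)))
    reactant_columns = [name for name in header if name != label_column]
    return reactant_columns[0] if len(reactant_columns) == 1 else reactant_columns


arg_a = reactant_column_for(multi_column_csv, "yield")
arg_b = reactant_column_for(single_column_csv, "yield")
print(arg_a, arg_b)

# With GAUCHE installed and the text saved to disk, either value would be
# passed as reactant_column:
#     loader = ReactionLoader()
#     loader.read_csv("reactions.csv", reactant_column=arg_a, label_column="yield")
```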

validate(drop: bool | None = True, canonicalize: bool | None = True)[source]#

Utility function to validate a read-in dataset of reaction representations and reaction yield labels. Checks that all SMILES/reaction SMARTS strings can be converted to RDKit molecules/reactions and that all labels are numeric and not NaN. Optionally drops all invalid entries and makes the remaining SMILES/SMARTS strings canonical (both enabled by default).

Parameters:

drop (bool) – whether to drop invalid entries
canonicalize (bool) – whether to make the SMILES/SMARTS strings canonical

Utils#

Utility functions for molecular data

gauche.dataloader.data_utils.transform_data(X_train: array, y_train: array, X_test: array, y_test: array, use_pca: bool | None = False, n_components: int | None = 10) → tuple[source]#

Apply feature scaling and, if use_pca is set, PCA dimensionality reduction to the data. Returns the standardised (and, with PCA, low-dimensional) train and test sets together with the scaler object for the target values.

Parameters:

X_train (np.array) – training set features
y_train (np.array) – training set targets
X_test (np.array) – test set features
y_test (np.array) – test set targets
use_pca (bool) – whether to use PCA for dimensionality reduction
n_components (int) – number of principal components to retain

Returns:

X_train_scaled, y_train_scaled, X_test_scaled, y_test_scaled, y_scaler
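The role of the returned target scaler can be illustrated with a dependency-free sketch of standardisation. It assumes that the scaling statistics are computed from the training set only and then reused for the test set, which the description above does not state explicitly. The helper names are hypothetical and the numbers are arbitrary.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def fit_scaler(values: List[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of the training values."""
    return fmean(values), pstdev(values)


def scale(values: List[float], mean: float, std: float) -> List[float]:
    """Standardise values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


def unscale(values: List[float], mean: float, std: float) -> List[float]:
    """Map standardised values back to the original units."""
    return [v * std + mean for v in values]


y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_test = [3.0, 6.0]

mean, std = fit_scaler(y_train)  # statistics come from the training set only
y_train_scaled = scale(y_train, mean, std)
y_test_scaled = scale(y_test, mean, std)

# Predictions made in scaled space are mapped back the same way.
restored = unscale(y_test_scaled, mean, std)
print(restored)
```

If the returned y_scaler is a scikit-learn style scaler (an assumption), its `inverse_transform` method would play the part of `unscale` when converting model predictions back to the original units.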