gauche.dataloader#
Dataloader#
Abstract class implementing the data loading, data splitting, type validation and feature extraction functionalities.
Molecular Properties#
Subclass of the abstract data loader class for molecular property prediction datasets.
- class gauche.dataloader.molprop_loader.MolPropLoader[source]#
Data loader class for molecular property prediction datasets with a single regression target. Expects input to be a csv file with one column for SMILES strings and one column for labels. Contains methods to validate the dataset and to transform the SMILES strings into different molecular representations.
- featurize(representation: str | Callable, **kwargs) None [source]#
Transforms SMILES into the specified molecular representation.
- Parameters:
representation (str or Callable) – the desired molecular representation, one of [ecfp_fingerprints, fragments, ecfp_fragprints, molecular_graphs, bag_of_smiles, bag_of_selfies, mqn] or a callable that takes a list of SMILES strings as input and returns the desired featurization.
kwargs (dict) – additional keyword arguments for the representation function
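As a minimal sketch of the callable option, the function below takes a list of SMILES strings and returns one feature vector per string. The name `count_symbols` and the `ATOM_SYMBOLS` vocabulary are hypothetical and not part of GAUCHE. The counting is deliberately crude (for instance, `Cl` also counts as a carbon), and the return type a real model expects may differ (for example, an array rather than nested lists).

```python
from typing import List

# Hypothetical vocabulary chosen for illustration; not part of GAUCHE.
ATOM_SYMBOLS = ["C", "N", "O", "F", "S"]


def count_symbols(smiles: List[str]) -> List[List[int]]:
    """Map each SMILES string to a vector of (case-insensitive) symbol counts."""
    return [[s.upper().count(sym) for sym in ATOM_SYMBOLS] for s in smiles]


features = count_symbols(["CCO", "c1ccccc1", "CC(=O)N"])
print(features)

# Assumed usage with a loader that already holds SMILES strings:
# loader.featurize(count_symbols)
```

Passing a callable keeps the loader independent of any particular featurization library, at the cost of the caller being responsible for the output shape.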
- load_benchmark(benchmark: str, path=None) None [source]#
Loads one of the existing benchmark datasets from the data directory.
- read_csv(path: str, smiles_column: str, label_column: str, validate: bool = True) None [source]#
Loads a dataset from a .csv file. The file must contain the two specified columns with the SMILES strings and labels.
- validate(drop: bool | None = True, canonicalize: bool | None = True) None [source]#
Utility function to validate a read-in dataset of SMILES strings and labels by checking that all SMILES strings can be converted to rdkit molecules and that all labels are numeric and not NaNs. Optionally drops all invalid entries and makes the remaining SMILES strings canonical (default).
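The label half of this check can be sketched without RDKit. The helpers `is_valid_label` and `drop_invalid` below are hypothetical names used for illustration only; they are not GAUCHE functions, and the SMILES parsing and canonicalization steps are left out.

```python
import math
from typing import List, Sequence, Tuple


def is_valid_label(value) -> bool:
    """Return True if the value converts to a float that is not NaN."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return False
    return not math.isnan(number)


def drop_invalid(smiles: Sequence[str], labels: Sequence) -> Tuple[List[str], List[float]]:
    """Keep only the (SMILES, label) pairs whose label passes the check."""
    kept = [(s, float(y)) for s, y in zip(smiles, labels) if is_valid_label(y)]
    return [s for s, _ in kept], [y for _, y in kept]


clean_smiles, clean_labels = drop_invalid(
    ["CCO", "CCN", "CCC", "CCF"],
    ["1.5", "nan", "abc", 2],
)
print(clean_smiles, clean_labels)
```

Dropping whole pairs rather than labels alone keeps the SMILES strings and labels aligned, which is what a `drop=True` style option would need to guarantee.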
Reaction Loader#
Subclass of the abstract data loader class for reaction yield prediction datasets.
- class gauche.dataloader.reaction_loader.ReactionLoader[source]#
Data loader class for reaction yield prediction datasets with a single regression target. Expects input to be a csv file with either multiple SMILES columns or a single reaction SMARTS column. Contains methods to validate the dataset and to transform the SMILES/SMARTS strings into different molecular representations.
- featurize(representation: str | Callable, **kwargs)[source]#
Transforms reactions into the specified representation.
- load_benchmark(benchmark: str, path=None) None [source]#
Loads features and labels from one of the included benchmark datasets and feeds them into the DataLoader.
- Parameters:
benchmark (str) – the benchmark dataset to be loaded, one of [DreherDoyle, SuzukiMiyaura, DreherDoyleRXN, SuzukiMiyauraRXN]. The RXN suffix denotes that the csv file contains reaction SMILES in a dedicated column.
- read_csv(path: str, reactant_column: str | List[str], label_column: str, validate: bool = True) None [source]#
Loads a dataset from a .csv file. Reactants must be provided as either multiple SMILES columns or a single reaction SMARTS column.
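The two accepted layouts can be illustrated with a small stand-alone reader. This is a sketch under stated assumptions, not the GAUCHE implementation: the column names (`ligand`, `base`, `rxn`, `yield`), the example rows, and the helpers `as_column_list` and `read_reactions` are all hypothetical.

```python
import csv
import io
from typing import List, Tuple, Union

# Hypothetical data: one SMILES column per reaction component.
MULTI_COLUMN_CSV = """ligand,base,yield
CCP,CCO,12.5
CCN,CO,48.0
"""

# Hypothetical data: the whole reaction in a single dedicated column.
SINGLE_COLUMN_CSV = """rxn,yield
CCO.CC(=O)O>>CC(=O)OCC,75.0
"""


def as_column_list(reactant_column: Union[str, List[str]]) -> List[str]:
    """Normalise a `str | List[str]` argument to a list of column names."""
    return [reactant_column] if isinstance(reactant_column, str) else list(reactant_column)


def read_reactions(
    text: str, reactant_column: Union[str, List[str]], label_column: str
) -> Tuple[List[List[str]], List[float]]:
    """Collect the reactant entries and the numeric label of every row."""
    columns = as_column_list(reactant_column)
    reactants, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        reactants.append([row[c] for c in columns])
        labels.append(float(row[label_column]))
    return reactants, labels


multi_reactants, multi_labels = read_reactions(MULTI_COLUMN_CSV, ["ligand", "base"], "yield")
single_reactants, single_labels = read_reactions(SINGLE_COLUMN_CSV, "rxn", "yield")
print(multi_reactants, multi_labels)
print(single_reactants, single_labels)
```

Normalising the argument up front lets the rest of the reader treat both layouts identically.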
- validate(drop: bool | None = True, canonicalize: bool | None = True)[source]#
Utility function to validate a read-in dataset of reaction representations and reaction yield labels. Checks that all SMILES/reaction SMARTS strings can be converted to rdkit molecules/reactions and that all labels are numeric and not NaNs. Optionally drops all invalid entries and makes the remaining SMILES/SMARTS strings canonical (default).
Utils#
Utility functions for molecular data
- gauche.dataloader.data_utils.transform_data(X_train: array, y_train: array, X_test: array, y_test: array, use_pca: bool | None = False, n_components: int | None = 10) tuple [source]#
Apply feature scaling and, if use_pca is set, dimensionality reduction to the data. Return the standardised and low-dimensional train and test sets together with the scaler object for the target values.
- Parameters:
- Returns:
X_train_scaled, y_train_scaled, X_test_scaled, y_test_scaled, y_scaler
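One reason to return `y_scaler` is that predictions made in the scaled space need to be mapped back to the original units. The sketch below shows that round trip with a toy scaler. It assumes zero-mean, unit-variance scaling fitted on the training targets only, which this page does not confirm, and `SimpleScaler` is a hypothetical stand-in rather than the object `transform_data` actually returns.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


class SimpleScaler:
    """Toy zero-mean, unit-variance scaler for a 1D sequence of targets."""

    def fit(self, values: Sequence[float]) -> "SimpleScaler":
        self.mean = fmean(values)
        # Fall back to 1.0 so constant targets do not cause division by zero.
        self.std = pstdev(values) or 1.0
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values: Sequence[float]) -> List[float]:
        return [v * self.std + self.mean for v in values]


y_train = [1.0, 2.0, 3.0]
y_test = [4.0]

# Fit on the training targets only, then reuse the same statistics for the test set.
y_scaler = SimpleScaler().fit(y_train)
y_train_scaled = y_scaler.transform(y_train)
y_test_scaled = y_scaler.transform(y_test)

# Map scaled values (e.g. model predictions) back to the original units.
restored = y_scaler.inverse_transform(y_test_scaled)
print(restored)
```

Fitting on the training split alone avoids leaking test-set statistics into the preprocessing step.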