gauche.representations

Fingerprint Representations

Contains methods to generate fingerprint representations of molecules, chemical reactions and proteins.

gauche.representations.fingerprints.drfp(reaction_smiles: List[str], nBits: int | None = 2048) → ndarray

https://github.com/reymond-group/drfp

Builds reaction representation as binary DRFP fingerprints.

Parameters:
  • reaction_smiles (list) – list of reaction SMILES

  • nBits (int) – int giving the bit vector length of the fingerprint. Default is 2048

Returns:

array of shape [len(reaction_smiles), nBits] with drfp featurised reactions

gauche.representations.fingerprints.ecfp_fingerprints(smiles: List[str], bond_radius: int | None = 3, nBits: int | None = 2048) → ndarray

Builds molecular representation as binary ECFP fingerprints.

Parameters:
  • smiles (list) – list of molecular SMILES

  • bond_radius (int) – int giving the bond radius for Morgan fingerprints. Default is 3

  • nBits (int) – int giving the bit vector length for Morgan fingerprints. Default is 2048

Returns:

array of shape [len(smiles), nBits] with ecfp featurised molecules
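
The role of nBits can be illustrated with a minimal sketch of the hash-and-fold step that fixed-length binary fingerprints rely on. The string identifiers and the helper fold_features are hypothetical stand-ins: a real ECFP implementation derives its identifiers from circular atom environments up to bond_radius, which this sketch does not attempt.

```python
import hashlib


def fold_features(features, n_bits=2048):
    """Map hypothetical substructure identifiers onto a fixed-length bit vector."""
    bits = [0] * n_bits
    for feature in features:
        # Hash the identifier to a stable integer, then fold it into range.
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_bits
        bits[index] = 1  # binary fingerprint: presence only, no counts
    return bits


# Hypothetical identifiers standing in for hashed atom environments.
features = ["C", "C-O", "C-C-O"]
bits = fold_features(features, n_bits=16)
```

A smaller n_bits makes collisions more likely, where two different features set the same bit.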

gauche.representations.fingerprints.fragments(smiles: List[str]) → ndarray

Builds molecular representation as a vector of fragment counts.

Parameters:

smiles (list) – list of molecular SMILES

Returns:

array of shape [len(smiles), 85] with fragment featurised molecules

gauche.representations.fingerprints.mqn_features(smiles: List[str]) → ndarray

Builds molecular representation as a vector of Molecular Quantum Numbers.

Parameters:

smiles (list) – list of molecular SMILES

Returns:

array of mqn featurised molecules

gauche.representations.fingerprints.one_hot(df: DataFrame) → ndarray

Builds reaction representation as a bit vector which indicates whether a certain condition, reagent, reactant etc. is present in the reaction.

Parameters:

df (pandas DataFrame) – pandas DataFrame with columns representing different parameters of the reaction (e.g. reactants, reagents, conditions).

Returns:

array of shape [len(df), sum(unique values for different columns in df)] with one-hot encoding of reactions
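
The output width described above can be illustrated with a minimal pure-Python sketch of one-hot encoding. Rows are plain dictionaries standing in for DataFrame rows, and one_hot_rows, the column names and the example values are all hypothetical. This is not gauche's implementation, and its column ordering may differ.

```python
def one_hot_rows(rows, columns):
    """One-hot encode categorical columns; returns (vectors, categories)."""
    # Sorted unique values per column define the bit positions.
    categories = {col: sorted({row[col] for row in rows}) for col in columns}
    encoded = []
    for row in rows:
        vector = []
        for col in columns:
            # One bit per unique value; exactly one bit is set per column.
            vector.extend(1 if row[col] == value else 0
                          for value in categories[col])
        encoded.append(vector)
    return encoded, categories


# Hypothetical reaction conditions, one dictionary per reaction.
rows = [
    {"catalyst": "Pd", "solvent": "THF"},
    {"catalyst": "Ni", "solvent": "THF"},
    {"catalyst": "Pd", "solvent": "DMF"},
]
encoded, categories = one_hot_rows(rows, ["catalyst", "solvent"])
```

With two unique catalysts and two unique solvents, each of the three rows becomes a vector of length 2 + 2 = 4.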

gauche.representations.fingerprints.rxnfp(reaction_smiles: List[str]) → ndarray

https://rxn4chemistry.github.io/rxnfp/

Builds reaction representation as continuous RXNFP fingerprints.

Parameters:

reaction_smiles (list) – list of reaction SMILES

Returns:

array of shape [len(reaction_smiles), 256] with rxnfp featurised reactions

Graph Representations

Contains methods to generate graph representations of molecules, chemical reactions and proteins.

gauche.representations.graphs.molecular_graphs(smiles: List[str], graphein_config: bool | None = None) → List[Graph]

Converts a list of SMILES strings into molecular graphs using the featurisation utilities of graphein.

Parameters:
  • smiles (list) – list of molecular SMILES

  • graphein_config (graphein/config/graphein_config) – graphein configuration object

Returns:

list of molecular graphs

String Representations

Contains methods to generate string representations of molecules, chemical reactions and proteins.

gauche.representations.strings.bag_of_characters(strings: List[str], max_ngram: int | None = 5, selfies: bool | None = False) → ndarray

Featurises any string representation (molecules/chemical reactions/proteins) into a bag of characters (boc) representation.

Parameters:
  • strings (list) – list of molecular strings

  • max_ngram (int) – maximum length of ngrams to be considered

  • selfies (bool) – when using molecular SMILES, optionally convert them into SELFIES

Returns:

array of shape [len(strings), n_features] with bag of characters featurised molecules
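
A minimal sketch of the idea, assuming the representation counts every character n-gram of length 1 to max_ngram and uses one column per distinct n-gram. The helpers char_ngrams and bag_of_ngrams are hypothetical; the actual function may tokenise or order features differently, and the optional SELFIES conversion is not shown.

```python
from collections import Counter


def char_ngrams(text, max_ngram):
    """All contiguous character n-grams of length 1..max_ngram."""
    return [text[i:i + n]
            for n in range(1, max_ngram + 1)
            for i in range(len(text) - n + 1)]


def bag_of_ngrams(strings, max_ngram=5):
    """Count n-grams per string; returns (count matrix, vocabulary)."""
    counts = [Counter(char_ngrams(s, max_ngram)) for s in strings]
    # The vocabulary is the sorted union of n-grams seen in any string.
    vocabulary = sorted(set().union(*counts))
    # Counter returns 0 for n-grams that a string does not contain.
    matrix = [[c[gram] for gram in vocabulary] for c in counts]
    return matrix, vocabulary


# Two SMILES strings (ethanol and methanol) with n-grams up to length 2.
matrix, vocabulary = bag_of_ngrams(["CCO", "CO"], max_ngram=2)
```

Here n_features is the vocabulary size, so it depends on the input strings and grows with max_ngram.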