kinoml.features.protein

Featurizers that mostly concern protein-based models

Module Contents

kinoml.features.protein.logger
class kinoml.features.protein.SingleProteinFeaturizer(**kwargs)

Bases: kinoml.features.core.ParallelBaseFeaturizer

Provides a minimally useful ._supports() method for all Protein-like featurizers.

_COMPATIBLE_PROTEIN_TYPES = ()
_supports(system: Union[kinoml.core.systems.ProteinSystem, kinoml.core.systems.ProteinLigandComplex]) → bool

Check that exactly one protein is present in the System.
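The check described above can be sketched roughly as follows. This is only an illustration: `FakeProtein`, `FakeLigand`, `FakeSystem` and `supports` are hypothetical stand-ins, not kinoml classes, and the real `_supports()` may be implemented differently (for example in how it uses `_COMPATIBLE_PROTEIN_TYPES`).

```python
from dataclasses import dataclass, field


@dataclass
class FakeProtein:
    """Hypothetical stand-in for a protein component."""
    name: str = "protein"


@dataclass
class FakeLigand:
    """Hypothetical stand-in for a ligand component."""
    name: str = "ligand"


@dataclass
class FakeSystem:
    """Hypothetical stand-in for a System holding several components."""
    components: list = field(default_factory=list)


def supports(system, compatible_types=(FakeProtein,)):
    """Return True only if the system holds exactly one compatible protein."""
    proteins = [c for c in system.components if isinstance(c, compatible_types)]
    return len(proteins) == 1
```

Under these assumptions, a system with one protein and one ligand passes the check, while a ligand-only system or a system with two proteins does not.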

class kinoml.features.protein.AminoAcidCompositionFeaturizer(**kwargs)

Bases: SingleProteinFeaturizer

Featurizes the protein using the composition of the residues in the binding site.

_counter
_featurize_one(system: Union[kinoml.core.systems.ProteinSystem, kinoml.core.systems.ProteinLigandComplex]) → Union[numpy.array, None]

Featurizes a protein using the residue count in the sequence.

Parameters

system (ProteinSystem or ProteinLigandComplex) – The System to be featurized.

Returns

The count of amino acids in the binding site.

Return type

np.array or None
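As a rough illustration of a residue-count feature, the sketch below counts each amino acid in a one-letter sequence and returns the counts in a fixed order. The `AMINO_ACIDS` alphabet and the `composition` helper are assumptions made for this example; the sketch returns a plain list, whereas the documented method returns an `np.array` (or `None`).

```python
from collections import Counter

# Assumption: the 20 standard amino acids in one-letter code, sorted alphabetically.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Count each standard amino acid, reported in a fixed alphabetical order."""
    counts = Counter(sequence)
    # Counter returns 0 for residues that never occur, so the vector
    # always has one entry per letter in AMINO_ACIDS.
    return [counts[aa] for aa in AMINO_ACIDS]
```

Keeping the order fixed means every protein yields a vector of the same length, so the features of different systems can be stacked directly. Letters outside `AMINO_ACIDS` are silently ignored in this sketch.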

class kinoml.features.protein.OneHotEncodedSequenceFeaturizer(sequence_type: str = 'full', **kwargs)

Bases: kinoml.features.core.BaseOneHotEncodingFeaturizer, SingleProteinFeaturizer

Featurizes the sequence of the protein to a one hot encoding.

ALPHABET
_retrieve_sequence(system: Union[kinoml.core.systems.ProteinSystem, kinoml.core.systems.ProteinLigandComplex]) → str

Implement in your component-specific subclass!
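For orientation, a one-hot encoding of a retrieved sequence could look like the sketch below. The `ALPHABET` value and the `one_hot_encode` helper here are assumptions for illustration only; the actual alphabet, array type and array orientation used by kinoml may differ.

```python
# Assumption: an alphabet of the 20 standard amino acids in one-letter code.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence, alphabet=ALPHABET):
    """Encode a sequence as one row per residue and one column per symbol."""
    index = {char: i for i, char in enumerate(alphabet)}
    matrix = []
    for char in sequence:
        row = [0] * len(alphabet)
        # A character outside the alphabet raises KeyError in this sketch.
        row[index[char]] = 1
        matrix.append(row)
    return matrix
```

Each row contains exactly one 1, at the column of the residue's position in the alphabet.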

class kinoml.features.protein.OEProteinStructureFeaturizer(**kwargs)

Bases: kinoml.features.core.OEBaseModelingFeaturizer, SingleProteinFeaturizer

Given systems with exactly one protein, prepare the protein structure by:

  • modeling missing loops with OESpruce according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute of the protein component (see below); missing sequences at the N- and C-termini are not modeled

  • building missing side chains

  • modeling substitutions, deletions and insertions with OESpruce if a uniprot_id or sequence attribute is provided for the protein component; if an alteration cannot be modeled, the corresponding mismatch in the structure is deleted

  • removing everything but protein and water

  • protonation at pH 7.4

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit='OpenEye', and must give access to a molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

  • name: A string specifying the name of the protein; it will be used for generating the output file name.

  • chain_id: A string specifying which chain should be used.

  • alternate_location: A string specifying which alternate location should be used.

  • expo_id: A string specifying a ligand bound to the protein of interest. This is especially useful if multiple proteins are found in one PDB structure.

  • uniprot_id: A string specifying the UniProt ID used to fetch the amino acid sequence from UniProt; that sequence is then used for modeling the protein. This supersedes the sequence information given in the PDB header.

  • sequence: A string specifying the amino acid sequence in one-letter code that should be used when modeling the protein. This supersedes a given uniprot_id and the sequence information given in the PDB header.
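The precedence among the three sequence sources described above (sequence over uniprot_id over the PDB header) can be summarized in a short sketch. The `choose_sequence_source` function is a hypothetical helper written for this illustration and is not part of kinoml.

```python
def choose_sequence_source(sequence=None, uniprot_id=None):
    """Report which sequence source would win under the documented precedence."""
    if sequence is not None:
        # An explicit sequence supersedes both uniprot_id and the PDB header.
        return "sequence"
    if uniprot_id is not None:
        # A UniProt ID supersedes the PDB header only.
        return "uniprot_id"
    # With neither attribute set, the PDB header sequence is used.
    return "pdb_header"
```

For example, a protein component that sets both attributes would be modeled from its sequence attribute, and the uniprot_id would be ignored for that purpose.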

Parameters
  • loop_db (str) – The path to the loop database used by OESpruce to model missing loops.

  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
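How use_multiprocessing and n_processes might interact can be sketched as below. The `resolve_n_processes` helper is hypothetical, and falling back to a single process when multiprocessing is disabled is an assumption of this sketch rather than documented kinoml behavior.

```python
import os


def resolve_n_processes(use_multiprocessing=True, n_processes=None):
    """Return the number of worker processes implied by the two options."""
    if not use_multiprocessing:
        # Assumption: without multiprocessing, work runs in a single process.
        return 1
    if n_processes is None:
        # Documented default: the number of available CPUs.
        # os.cpu_count() can return None, so fall back to 1 in that case.
        return os.cpu_count() or 1
    return n_processes
```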

_featurize_one(system: kinoml.core.systems.ProteinSystem) → Union[Universe, None]

Prepare a protein structure.

Parameters

system (ProteinSystem) – A system object holding a protein component.

Returns

An MDAnalysis universe of the featurized system. None if no design unit was found.

Return type

Universe or None