kinoml.features.core
¶
Featurizers can transform a kinoml.core.systems.System
object and produce
new representations of the molecular entities and their associated measurements.
Module Contents¶
- kinoml.features.core.logger¶
- class kinoml.features.core.BaseFeaturizer¶
Abstract Featurizer class.
- property name¶
- _SUPPORTED_TYPES = ()¶
- featurize(systems: List[kinoml.core.systems.System], keep=True) List[kinoml.core.systems.System] ¶
Given some systems (compatible with
_SUPPORTED_TYPES
), apply the featurization scheme implemented in this class. First,
self.supports()
will check whether the systems are compatible with the featurization scheme. We assume all of them are equal, so only the first one will be checked. Then, the Systems are passed to self._featurize
to handle the actual leg-work.
- Parameters
systems (list of System) – This is the collection of System objects that will be transformed.
keep (bool, optional=True) – Whether to store the current featurizer in the
system.featurizations
dictionary with its own key (self.name
), in addition to last
.
- Returns
systems – The same systems that were passed in. The returned Systems will have an extra entry in the
.featurizations
dictionary, containing the featurized object (either a new System or an array-like object) under a key named after .name
.
- Return type
list of System
- __call__(*args, **kwargs)¶
You can also call the instance directly. This forwards to
.featurize()
.
- _pre_featurize(systems: List[kinoml.core.systems.System]) None ¶
Run before featurizing all systems. Redefine this method if needed.
- Parameters
systems (list of System) – This is the collection of System objects that will be transformed.
- _featurize(systems: List[kinoml.core.systems.System]) List[object] ¶
Featurize all system objects in a serial fashion as defined in
._featurize_one()
.
- Parameters
systems (list of System) – This is the collection of System objects that will be transformed.
- Returns
features
- Return type
list of System or array-like
- abstract _featurize_one(system: kinoml.core.systems.System) object ¶
Implement this method to do the actual leg-work for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.
- _post_featurize(systems: List[kinoml.core.systems.System], features: List, keep: bool = True) List[kinoml.core.systems.System] ¶
Run after featurizing all systems. Systems with a feature of None will be removed and listed in a log file in the current working directory. You shouldn’t need to redefine this method.
- Parameters
systems (list of System) – The systems being featurized
features (list) – The features returned by
self._featurize
keep (bool, optional=True) – Whether to store the current featurizer in the
system.featurizations
dictionary with its own key (self.name
), in addition to last
.
- Returns
filtered_systems – The same systems as passed, but with
.featurizations
extended with the calculated features in two entries: the featurizer name and last
. Systems with a feature of None will be removed.
- Return type
list of System
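The bookkeeping described for _post_featurize can be sketched in plain Python. This is only an illustration under stated assumptions, not KinoML's implementation: `ToySystem` and `post_featurize` are hypothetical stand-ins, and the log file listing the removed systems is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    """Hypothetical stand-in for a System; only the featurizations dict matters here."""

    name: str
    featurizations: dict = field(default_factory=dict)


def post_featurize(systems, features, name, keep=True):
    """Store each feature under 'last' (and under `name` if keep); drop None features."""
    filtered = []
    for system, feature in zip(systems, features):
        if feature is None:
            continue  # featurization failed, so this system is removed
        system.featurizations["last"] = feature
        if keep:
            system.featurizations[name] = feature
        filtered.append(system)
    return filtered


systems = [ToySystem("a"), ToySystem("b"), ToySystem("c")]
features = [[1, 0], None, [0, 1]]
kept = post_featurize(systems, features, name="ToyFeaturizer")
kept_names = [system.name for system in kept]
```

With `keep=False`, only the `last` entry would be written, so a later featurizer would overwrite it.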
- supports(*systems: kinoml.core.systems.System, raise_errors: bool = True) bool ¶
Check if these systems are supported by this featurizer.
Do NOT reimplement in subclass. Check
._supports()
instead.
- Parameters
systems (list of System) – Systems to be checked (by type, contained attributes, etc)
raise_errors (bool, optional=True) – if True, raise ValueError if errors were found
- Returns
True if all systems are compatible, False otherwise
- Return type
bool
- Raises
ValueError – If ._supports() fails and raise_errors is True.
- _supports(system: kinoml.core.systems.System) bool ¶
This is the private method that actually tests for compatibility between a single system and the current featurizer.
This is the method you should reimplement in your subclass.
- Parameters
system (System) – The system that will be checked
- Returns
True if compatible, False otherwise
- Return type
bool
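The split between the public supports() and the overridable _supports() might look roughly like the sketch below. The classes are hypothetical and use plain strings as "systems"; the type check against _SUPPORTED_TYPES is an assumption about how the public method behaves, not a copy of the library code.

```python
class ToyFeaturizer:
    """Hypothetical featurizer illustrating the supports()/_supports() split."""

    _SUPPORTED_TYPES = (str,)  # toy choice: "systems" are plain strings here

    def supports(self, *systems, raise_errors=True):
        """Public check: loops over the systems and delegates to _supports()."""
        for system in systems:
            if not (isinstance(system, self._SUPPORTED_TYPES) and self._supports(system)):
                if raise_errors:
                    raise ValueError(f"{system!r} is not supported")
                return False
        return True

    def _supports(self, system):
        """Private per-system check: the hook that subclasses override."""
        return True


class NonEmptyFeaturizer(ToyFeaturizer):
    """Subclass that only customizes the private hook."""

    def _supports(self, system):
        return len(system) > 0


featurizer = NonEmptyFeaturizer()
ok = featurizer.supports("ACDE", "KLMN")
rejected = featurizer.supports("ACDE", "", raise_errors=False)
```

Keeping the loop and error handling in one public method means subclasses only have to answer a yes/no question about a single system.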
- __repr__()¶
Return repr(self).
- class kinoml.features.core.ParallelBaseFeaturizer(use_multiprocessing: bool = True, n_processes: Union[int, None] = None, chunksize: Union[int, None] = None, dask_client=None, **kwargs)¶
Bases:
BaseFeaturizer
Abstract Featurizer class with support for multiprocessing.
- Parameters
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.
dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.
- _SUPPORTED_TYPES = ()¶
- __getstate__()¶
Only preserve object fields that are serializable.
- __setstate__(state)¶
Only preserve object fields that are serializable.
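As a rough illustration of why these two hooks exist, the sketch below drops a field that cannot be pickled and rebuilds it on restore. The class and its `_lock` field are hypothetical; which fields the real featurizer filters out is not stated here.

```python
import pickle
import threading


class ToyParallelFeaturizer:
    """Hypothetical class holding one field that cannot be pickled."""

    def __init__(self, n_processes=2):
        self.n_processes = n_processes
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def __getstate__(self):
        # Copy the instance dict and leave out the unpicklable field.
        state = self.__dict__.copy()
        state.pop("_lock", None)
        return state

    def __setstate__(self, state):
        # Restore the picklable fields, then rebuild the dropped one.
        self.__dict__.update(state)
        self._lock = threading.Lock()


featurizer = ToyParallelFeaturizer(n_processes=4)
state = featurizer.__getstate__()
payload = pickle.dumps(state)  # the reduced state is plain data, so this succeeds

# Rebuild an instance from the state, as unpickling would do.
clone = ToyParallelFeaturizer.__new__(ToyParallelFeaturizer)
clone.__setstate__(pickle.loads(payload))
```

The pickle module calls both hooks automatically when an instance is sent to a worker process; they are invoked by hand here only to make each step visible.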
- _featurize(systems: List[kinoml.core.systems.System]) List[object] ¶
Featurize all system objects in a parallel fashion as defined in
._featurize_one()
.
- Parameters
systems (list of System) – This is the collection of System objects that will be transformed.
- Returns
features
- Return type
list of System or array-like
- class kinoml.features.core.Pipeline(featurizers: List[BaseFeaturizer], shortname=None, **kwargs)¶
Bases:
BaseFeaturizer
Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y).
- Parameters
featurizers (iterable of BaseFeaturizer) – Featurizers to stack. They must be compatible with each other!
Note
While
Pipeline
is a subclass of BaseFeaturizer
, it should be considered a special case of such. It indeed shares the same API but the implementation details of ._featurize()
are slightly different. It acts as a wrapper around individual Featurizer
objects.
- property name¶
- property shortname¶
- _featurize(systems: List[kinoml.core.systems.System], keep: bool = True) List[object] ¶
Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y) and store the features in the systems.
- Parameters
systems (list of System) – This is the collection of System objects that will be transformed
keep (bool, optional=True) – Whether to store the current featurizer in the
system.featurizations
dictionary with its own key (self.name
), in addition to last
.
- Returns
features
- Return type
list of System or array-like
- supports(*systems: kinoml.core.systems.System, raise_errors: bool = False) bool ¶
Check if these systems are supported by all featurizers.
- Parameters
systems (list of System) – systems to be checked (by type, contained attributes, etc)
raise_errors (bool, optional=False) – If True, raise
ValueError
- Returns
True if all systems are compatible with all featurizers, False otherwise
- Return type
bool
- Raises
ValueError – If f.supports() fails and raise_errors is True.
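The sequential chaining that Pipeline describes (featurizer A returns X, which featurizer B turns into Y) can be sketched with ordinary callables. The helper `run_pipeline` and the two steps are hypothetical; real featurizers also record their output in each System, which this sketch leaves out.

```python
from functools import reduce


def run_pipeline(steps, value):
    """Feed `value` through each callable in `steps`, left to right."""
    return reduce(lambda current, step: step(current), steps, value)


def step_a(sequence):
    """Hypothetical featurizer A: map a string to its character codes."""
    return [ord(char) for char in sequence]


def step_b(codes):
    """Hypothetical featurizer B: reduce the codes to a single number."""
    return sum(codes)


result = run_pipeline([step_a, step_b], "AB")
```

The order matters: each step must accept whatever the previous one returns, which is why the featurizers in a pipeline have to be compatible with each other.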
- class kinoml.features.core.Concatenated(featurizers: List[BaseFeaturizer], axis: int = 1, **kwargs)¶
Bases:
Pipeline
Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).
- Parameters
featurizers (list of BaseFeaturizer) – These should take a System or array, but return only arrays so they can be concatenated. Note that the arrays must have the same number of dimensions. If that is not the case, you will need to reshape one of them using
CallableFeaturizer
and a lambda function that relies on np.reshape
or similar.
axis (int, optional=1) – On which axis to concatenate. By default, it will concatenate on axis
1
, which means that the features in each pipeline will be concatenated.
Notes
This Featurizer may be removed in the future, since it can be replaced by TupleOfArrays.
- _featurize(systems: List[kinoml.core.systems.System], keep=True) numpy.ndarray ¶
Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).
- Parameters
systems (list of System or array-like) – The Systems (or arrays) to be featurized.
keep (bool, optional=True) – Whether to store the current featurizer in the
system.featurizations
dictionary with its own key (self.name
), in addition to last
.
- Returns
Concatenated arrays along specified
axis
.
- Return type
np.ndarray
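The requirement that all arrays share the same number of dimensions, and the effect of axis=1, can be seen in a small NumPy sketch. The two feature arrays are made-up values standing in for the outputs of two featurizers.

```python
import numpy as np

# Hypothetical outputs of two featurizers for the same three systems.
features_a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # shape (3, 2)
features_b = np.array([0.1, 0.2, 0.3])  # shape (3,)

# Both arrays need the same number of dimensions, so reshape B to (3, 1).
features_b_2d = np.reshape(features_b, (-1, 1))

# axis=1 joins the feature columns, keeping one row per system.
combined = np.concatenate([features_a, features_b_2d], axis=1)
```

Passing `features_b` without the reshape would raise a ValueError, because NumPy refuses to concatenate a 1-D array with a 2-D one.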
- class kinoml.features.core.TupleOfArrays(*args, **kwargs)¶
Bases:
Pipeline
Given a list of featurizers, apply them serially and return the result directly as a flattened tuple of the arrays, for each system. For example, given one system, if featurizer A returns X and featurizer B returns Y, Z, the output is the tuple (X, Y, Z).
The final result will be a tuple of tuples.
- _featurize(systems: List[kinoml.core.systems.System], keep: bool = True) List ¶
Given a list of featurizers, apply them serially and build a flat tuple out of the results.
- Parameters
systems (list of System or array-like) – The Systems (or arrays) to be featurized.
keep (bool, optional=True) – Whether to store the current featurizer in the
system.featurizations
dictionary with its own key (self.name
), in addition to last
.
- Returns
If the last featurizer is returning a single array, the shape of the object will be (N_systems,). If the last featurizer returns more than one array, it will be (N_systems, M_returned_objects).
- Return type
tuple (of tuples) of array-like
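One way to picture the flattening is the sketch below, where plain lists stand in for arrays. `flatten_results` is a hypothetical helper, and treating tuple outputs as "several arrays" is an assumption made for this illustration.

```python
def flatten_results(results):
    """Build one flat tuple from the per-featurizer results of a single system."""
    flat = []
    for result in results:
        if isinstance(result, tuple):
            flat.extend(result)  # this featurizer returned several objects
        else:
            flat.append(result)  # this featurizer returned a single object
    return tuple(flat)


# Hypothetical case: featurizer A returns X; featurizer B returns (Y, Z).
per_system = [
    flatten_results([[1, 2], ([3], [4, 5])]),
    flatten_results([[6, 7], ([8], [9, 0])]),
]
all_systems = tuple(per_system)  # one inner tuple per system
```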
- class kinoml.features.core.BaseOneHotEncodingFeaturizer(dictionary: dict = None, **kwargs)¶
Bases:
ParallelBaseFeaturizer
Base class for Featurizers concerning one hot encoding.
- ALPHABET¶
- _featurize_one(system: Union[kinoml.core.systems.LigandSystem, kinoml.core.systems.ProteinLigandComplex]) Union[numpy.ndarray, None] ¶
One hot encode one system.
- Parameters
system (LigandSystem or ProteinLigandComplex) – The System to be featurized.
- Return type
array or None
- abstract _retrieve_sequence(system: kinoml.core.systems.System)¶
Implement in your component-specific subclass!
- static one_hot_encode(sequence: Iterable, dictionary: dict | Sequence) numpy.ndarray ¶
One-hot encode a sequence of characters, given a dictionary.
- Parameters
sequence (Iterable) –
dictionary (dict or sequence-like) – Mapping of each character to their position in the alphabet. If a sequence-like is given, it will be enumerated into a dict.
- Returns
One-hot encoded matrix with shape
(len(dictionary), len(sequence))
- Return type
array-like
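A minimal sketch of one-hot encoding with the documented output shape, (len(dictionary), len(sequence)), is shown below. It is written from the description above rather than taken from the library, and it assumes every character of the sequence is present in the dictionary.

```python
import numpy as np


def one_hot_encode(sequence, dictionary):
    """Return a matrix with one row per alphabet entry and one column per character."""
    if not isinstance(dictionary, dict):
        # A sequence-like alphabet is enumerated into a character -> index mapping.
        dictionary = {char: index for index, char in enumerate(dictionary)}
    encoded = np.zeros((len(dictionary), len(sequence)), dtype=int)
    for position, char in enumerate(sequence):
        encoded[dictionary[char], position] = 1
    return encoded


matrix = one_hot_encode("CAB", "ABC")
```

Each column contains exactly one 1, placed in the row of the character found at that position in the sequence.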
- class kinoml.features.core.PadFeaturizer(shape: Iterable[int] = 'auto', key: Hashable = 'last', pad_with: int = 0, **kwargs)¶
Bases:
ParallelBaseFeaturizer
Pads features of a given system to a desired size or length.
This class wraps
numpy.pad
with mode=constant
, auto-calculating the needed additions to match the requested shape.
- Parameters
shape (tuple of int, or "auto") – The desired size of the transformed features. If “auto”, shape will be estimated from the Dataset passed at runtime so it matches the largest observed.
key (hashable) – element to retrieve from
System.featurizations
pad_with (int) – value to fill the array-like features with
- _pre_featurize(systems) None ¶
Compute the largest shape in the input arrays and store in shape attribute.
- Parameters
systems (list of System) –
- _featurize_one(system: kinoml.core.systems.System) numpy.ndarray ¶
- Parameters
system (System or array-like) – The System (or array) to be featurized.
options (dict) – Must contain a key
shape
with the expected final shape of the systems.
- Return type
array
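The padding behaviour can be sketched with numpy.pad directly. `pad_to_shape` is a hypothetical helper; padding only at the end of each axis is an assumption of this sketch, and the "auto" shape is computed here as the largest size seen along each axis.

```python
import numpy as np


def pad_to_shape(array, shape, pad_with=0):
    """Pad `array` with a constant so that it reaches `shape`."""
    # For each axis, add the missing amount after the existing values.
    pad_width = [(0, target - current) for current, target in zip(array.shape, shape)]
    return np.pad(array, pad_width, mode="constant", constant_values=pad_with)


arrays = [np.ones((2, 3)), np.ones((2, 5))]

# "auto": take the largest size observed along each axis.
auto_shape = tuple(max(sizes) for sizes in zip(*(a.shape for a in arrays)))
padded = [pad_to_shape(a, auto_shape) for a in arrays]
```

After padding, all arrays share one shape and can be stacked into a single batch.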
- class kinoml.features.core.HashFeaturizer(getter: Callable[[kinoml.core.systems.System], str] = None, normalize=True, **kwargs)¶
Bases:
BaseFeaturizer
Hash an attribute of the protein, such as the name or id.
- Parameters
getter (callable, optional) – A function or lambda that takes a System and returns a string to be hashed. Default value will return whatever
system.featurizations["last"]
contains, as a string.
normalize (bool, default=True) – Normalizes the hash to obtain a value in the unit interval.
- static _getter(system)¶
- _featurize_one(system: kinoml.core.systems.System) numpy.ndarray ¶
Featurizes a component using the hash of the chosen attribute.
- Parameters
system (System) – The System to be featurized.
- Returns
Sha256’d attribute
- Return type
array
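Hashing a string and mapping the digest to the unit interval could look like the sketch below. Dividing by 2**256 is an assumption about how the normalization might work, not a statement about the library's exact formula, and `hash_to_unit_interval` is a hypothetical name.

```python
import hashlib


def hash_to_unit_interval(text):
    """Hash `text` with SHA-256 and scale the digest to a float between 0 and 1."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # A SHA-256 digest is a 256-bit integer, so 2**256 is its upper bound.
    return int(digest, 16) / 2**256


value = hash_to_unit_interval("ABL1")
other = hash_to_unit_interval("EGFR")
```

The same input always yields the same value, so the hash can serve as a stable numeric identifier, although it carries no chemical or biological meaning.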
- class kinoml.features.core.NullFeaturizer(**kwargs)¶
Bases:
ParallelBaseFeaturizer
Abstract Featurizer class with support for multiprocessing.
- Parameters
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.
dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.
- _featurize(systems: Iterable[kinoml.core.systems.System], keep: bool = None) object ¶
Featurize all system objects in a parallel fashion as defined in
._featurize_one()
.
- Parameters
systems (list of System) – This is the collection of System objects that will be transformed.
- Returns
features
- Return type
list of System or array-like
- class kinoml.features.core.CallableFeaturizer(func: Callable[[System], System | np.array] | str = None, **kwargs)¶
Bases:
BaseFeaturizer
Apply an arbitrary callable to a System.
- Parameters
func (callable or str or None) – Must take a System and return a System or array. If
str
it will be eval
’d into a callable. If None, the default callable will return system.featurizations["last"]
for each system.
- static _default_func(system)¶
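The three accepted forms of func (a callable, a string, or None) can be illustrated with a small resolver. `resolve_callable` is hypothetical, and a SimpleNamespace stands in for a System.

```python
from types import SimpleNamespace


def resolve_callable(func=None):
    """Turn None, a string, or a callable into a callable that takes a system."""
    if func is None:
        # Default: hand back whatever the previous featurizer stored.
        return lambda system: system.featurizations["last"]
    if isinstance(func, str):
        return eval(func)  # only evaluate strings from a trusted source
    return func


# Hypothetical system carrying the output of an earlier featurizer.
system = SimpleNamespace(featurizations={"last": [1, 2, 3]})

default_result = resolve_callable()(system)
doubled = resolve_callable("lambda s: [2 * x for x in s.featurizations['last']]")(system)
```

Accepting a string makes it possible to describe a featurizer in a text configuration, at the cost of running eval on that text.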
- class kinoml.features.core.ClearFeaturizations(keys=('last',), style='keep', **kwargs)¶
Bases:
BaseFeaturizer
Remove keys from the
.featurizations
dictionary in each System
object. By default, it will remove all keys that are not last
.
- Parameters
keys (tuple of str, optional=("last",)) – Which keys to keep or remove, depending on
style
.
style (str, optional="keep") – Whether to
keep
or remove
the entries passed as keys
.
- _featurize_one(system: kinoml.core.systems.System) kinoml.core.systems.System ¶
Implement this method to do the actual leg-work for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.
- _post_featurize(systems: Iterable[kinoml.core.systems.System], features: Iterable[System | np.array], keep: bool = True) Iterable[kinoml.core.systems.System] ¶
Bypass the automated population of the
.featurizations
dict in each System
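The effect of the keys and style parameters of ClearFeaturizations can be sketched on a plain dictionary. `clear_featurizations` is a hypothetical helper that returns a new dict; whether the real class edits the dictionary in place is not stated here.

```python
def clear_featurizations(featurizations, keys=("last",), style="keep"):
    """Return a copy of `featurizations` that keeps or removes the given keys."""
    if style == "keep":
        return {key: value for key, value in featurizations.items() if key in keys}
    if style == "remove":
        return {key: value for key, value in featurizations.items() if key not in keys}
    raise ValueError("style must be 'keep' or 'remove'")


featurizations = {"last": [0.5], "FeaturizerA": [1, 0], "FeaturizerB": [0.5]}

only_last = clear_featurizations(featurizations)
without_a = clear_featurizations(featurizations, keys=("FeaturizerA",), style="remove")
```

With the defaults, only the `last` entry survives, which matches the default behaviour described for the class.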
- class kinoml.features.core.OEBaseModelingFeaturizer(loop_db: Union[str, None] = None, cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, **kwargs)¶
Bases:
ParallelBaseFeaturizer
This abstract class defines several methods that use functionality from the OpenEye toolkit for molecular modeling. Featurizers that subclass OEBaseModelingFeaturizer need to implement at least the _featurize_one method.
- Parameters
loop_db (str) – The path to the loop database used by OESpruce to model missing loops.
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
- _read_protein_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) Union[oechem.OEGraphMol, None] ¶
Returns the protein structure of the given protein object as an OpenEye molecule.
- Parameters
protein (Protein or KLIFSKinase) – The protein object.
- Returns
The protein structure as an OpenEye molecule, or None.
- Return type
oechem.OEGraphMol or None
- Raises
ValueError – If the wrong toolkit was used during initialization of the protein object.
- _get_design_unit(structure: openeye.oechem.OEMolBase, chain_id: Union[str, None], alternate_location: Union[str, None], has_ligand: bool, ligand_name: Union[str, None], model_loops_and_caps: bool) Union[openeye.oechem.OEDesignUnit, None] ¶
Get an OpenEye design unit based on the given input.
- Parameters
structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to prepare.
chain_id (str or None) – The chain ID of interest.
alternate_location (str or None) – The alternate location of interest.
has_ligand (bool) – If design unit generation should consider ligands. If True, design units will be only generated for protein ligand complexes. If False, design units will not consider co-crystallized ligands.
ligand_name (str or None) – The ligand expo ID bound to the protein of interest. Design units will be filtered to contain the respective ligand.
model_loops_and_caps (bool) – If loops and caps should be modeled.
- Returns
design_unit – The design unit or None if no design unit was found.
- Return type
oechem.OEDesignUnit or None
- static _get_components(design_unit: openeye.oechem.OEDesignUnit, chain_id: Union[str, None]) Tuple[oechem.OEGraphMol(), oechem.OEGraphMol(), oechem.OEGraphMol()] ¶
Get protein, solvent and ligand components from an OpenEye design unit.
- Parameters
design_unit (oechem.OEDesignUnit) – The OpenEye design unit to extract components from.
chain_id (str or None) – The chain ID of interest.
- Returns
components – OpenEye molecules holding protein, solvent and ligand.
- Return type
tuple of oechem.OEGraphMol, oechem.OEGraphMol and oechem.OEGraphMol
- _process_protein(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1, ligand: Union[oechem.OEMolBase, None] = None) oechem.OEMolBase ¶
Process a protein structure according to the given amino acid sequence.
- Parameters
protein_structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to process.
amino_acid_sequence (str) – The amino acid sequence with associated metadata.
first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.
ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule that should be checked for heavy atom clashes with built insertions.
- Returns
An OpenEye molecule holding the processed protein structure.
- Return type
oechem.OEMolBase
- static _get_protein_residue_numbers(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1) List[int] ¶
Get the residue numbers of a protein structure according to given amino acid sequence.
- Parameters
protein_structure (oechem.OEMolBase) – The kinase domain structure.
amino_acid_sequence (core.sequences.AminoAcidSequence) – The template amino acid sequence.
first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.
- Returns
residue_number – A list of residue numbers according to the given amino acid sequence in the same order as the residues in the given protein structure.
- Return type
list of int
- _assemble_components(protein: openeye.oechem.OEMolBase, solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None] = None) openeye.oechem.OEMolBase ¶
Assemble components of a solvated protein-ligand complex into a single OpenEye molecule.
- Parameters
protein (oechem.OEMolBase) – An OpenEye molecule holding the protein of interest.
solvent (oechem.OEMolBase) – An OpenEye molecule holding the solvent of interest.
ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule holding the ligand of interest if given.
- Returns
assembled_components – An OpenEye molecule holding protein, solvent and ligand if given.
- Return type
oechem.OEMolBase
- static _remove_clashing_water(solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None], protein: openeye.oechem.OEMolBase) openeye.oechem.OEGraphMol ¶
Remove water molecules clashing with a ligand or newly modeled protein residues.
- Parameters
solvent (oechem.OEGraphMol) – An OpenEye molecule holding the water molecules.
ligand (oechem.OEGraphMol or None) – An OpenEye molecule holding the ligand or None.
protein (oechem.OEGraphMol) – An OpenEye molecule holding the protein.
- Returns
An OpenEye molecule holding water molecules not clashing with the ligand or newly modeled protein residues.
- Return type
oechem.OEGraphMol
- _update_pdb_header(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: [str, None] = None, other_pdb_header_info: Union[None, Iterable[Tuple[str, str]]] = None) openeye.oechem.OEMolBase ¶
Stores information about Featurizer, protein and ligand in the PDB header COMPND section in the given OpenEye molecule.
- Parameters
structure (oechem.OEMolBase) – An OpenEye molecule.
protein_name (str) – The name of the protein.
ligand_name (str or None, default=None) – The name of the ligand if present.
other_pdb_header_info (None or iterable of tuple of str) – Tuples with information that should be saved in the PDB header. Each tuple consists of two strings, i.e., the PDB header section (e.g. COMPND) and the respective information.
- Returns
The OpenEye molecule containing the updated PDB header.
- Return type
oechem.OEMolBase
- _write_results(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: Union[str, None] = None) pathlib.Path ¶
Write the results from the Featurizer and retrieve the paths to protein or complex if a ligand is present.
- Parameters
structure (oechem.OEMolBase) – The OpenEye molecule holding the featurized system.
protein_name (str) – The name of the protein.
ligand_name (str or None, default=None) – The name of the ligand if present.
- Returns
Path to prepared protein or complex if ligand is present.
- Return type
Path