`kinoml.features.core`¶

Featurizers can transform a kinoml.core.systems.System object and produce new representations of the molecular entities and their associated measurements.

Module Contents¶

kinoml.features.core.logger¶

class kinoml.features.core.BaseFeaturizer¶

Abstract Featurizer class.

property name¶

_SUPPORTED_TYPES = ()¶

featurize(systems: List[kinoml.core.systems.System], keep=True) → List[kinoml.core.systems.System]¶

Given some systems (compatible with _SUPPORTED_TYPES), apply the featurization scheme implemented in this class.

First, self.supports() will check whether the systems are compatible with the featurization scheme. We assume all of them are equal, so only the first one will be checked. Then, the Systems are passed to self._featurize to handle the actual leg-work.

Parameters

systems (list of System) – This is the collection of System objects that will be transformed.
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

systems – The same systems that were passed in. The returned Systems will have an extra entry in the .featurizations dictionary, containing the featurized object (either a new System or an array-like object) under a key named after .name.

Return type

list of System

__call__(*args, **kwargs)¶: You can also call the instance directly. This forwards to .featurize().
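The call order described above can be sketched without importing kinoml. `MiniFeaturizer` and `LengthFeaturizer` are hypothetical stand-ins, not the library's classes; systems are plain strings here, and the sketch returns the raw features instead of storing them in `.featurizations`.

```python
from typing import Any, List


class MiniFeaturizer:
    """Hypothetical stand-in that mirrors the documented call order."""

    _SUPPORTED_TYPES = (str,)

    def supports(self, *systems: Any) -> bool:
        # Simplification: compatibility is checked by type only.
        if not all(isinstance(s, self._SUPPORTED_TYPES) for s in systems):
            raise ValueError("unsupported system type")
        return True

    def featurize(self, systems: List[Any]) -> List[Any]:
        self.supports(systems[0])  # only the first system is checked
        self._pre_featurize(systems)
        return self._featurize(systems)

    def __call__(self, *args, **kwargs):
        # Calling the instance forwards to featurize().
        return self.featurize(*args, **kwargs)

    def _pre_featurize(self, systems: List[Any]) -> None:
        pass  # optional hook, redefined only if needed

    def _featurize(self, systems: List[Any]) -> List[Any]:
        return [self._featurize_one(s) for s in systems]

    def _featurize_one(self, system: Any) -> Any:
        raise NotImplementedError


class LengthFeaturizer(MiniFeaturizer):
    def _featurize_one(self, system: str) -> int:
        return len(system)


features = LengthFeaturizer()(["MKV", "ACDE"])
```

A subclass only has to provide `_featurize_one`; the base class drives the rest.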

_pre_featurize(systems: List[kinoml.core.systems.System]) → None¶

Run before featurizing all systems. Redefine this method if needed.

Parameters: systems (list of System) – This is the collection of System objects that will be transformed.

_featurize(systems: List[kinoml.core.systems.System]) → List[object]¶

Featurize all system objects in a serial fashion as defined in ._featurize_one().

Parameters: systems (list of System) – This is the collection of System objects that will be transformed.
Returns: features
Return type: list of System or array-like

abstract _featurize_one(system: kinoml.core.systems.System) → object¶

Implement this method to do the actual leg-work for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.

Parameters: system (System) – The System to be featurized.
Return type: System or array-like

_post_featurize(systems: List[kinoml.core.systems.System], features: List, keep: bool = True) → List[kinoml.core.systems.System]¶

Run after featurizing all systems. Systems with a feature of None will be removed and listed in a log file in the current working directory. You shouldn’t need to redefine this method.

Parameters

systems (list of System) – The systems being featurized
features (list) – The features returned by self._featurize
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

filtered_systems – The same systems as passed, but with .featurizations extended with the calculated features in two entries: the featurizer name and last. Systems with a feature of None will be removed.

Return type

list of System
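As a rough illustration of the bookkeeping described for this step, the sketch below stores each feature under `last` (and under the featurizer name when `keep` is true) and drops systems whose feature is `None`. `FakeSystem` and `post_featurize` are hypothetical names, and the log file the library writes is omitted.

```python
from typing import Any, Dict, List


class FakeSystem:
    """Hypothetical stand-in for a System; only .featurizations matters here."""

    def __init__(self, label: str):
        self.label = label
        self.featurizations: Dict[str, Any] = {}


def post_featurize(systems: List[FakeSystem], features: List[Any],
                   name: str, keep: bool = True) -> List[FakeSystem]:
    filtered = []
    for system, feature in zip(systems, features):
        if feature is None:
            continue  # failed featurization: the system is dropped
        system.featurizations["last"] = feature
        if keep:
            system.featurizations[name] = feature
        filtered.append(system)
    return filtered


systems = [FakeSystem("a"), FakeSystem("b"), FakeSystem("c")]
kept = post_featurize(systems, [1.0, None, 3.0], name="MyFeaturizer")
```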

supports(*systems: kinoml.core.systems.System, raise_errors: bool = True) → bool¶

Check if these systems are supported by this featurizer.

Do NOT reimplement in subclass. Check ._supports() instead.

Parameters

systems (list of System) – Systems to be checked (by type, contained attributes, etc)
raise_errors (bool, optional=True) – if True, raise ValueError if errors were found

Returns

True if all systems are compatible, False otherwise

Return type

bool

Raises

ValueError – If ._supports() fails and raise_errors is True.

_supports(system: kinoml.core.systems.System) → bool¶

This is the private method that actually tests for compatibility between a single system and the current featurizer.

This is the method you should reimplement in your subclass.

Parameters: system (System) – The system that will be checked
Return type: True if compatible, False otherwise

__repr__()¶: Return repr(self).

class kinoml.features.core.ParallelBaseFeaturizer(use_multiprocessing: bool = True, n_processes: Union[int, None] = None, chunksize: Union[int, None] = None, dask_client=None, **kwargs)¶

Bases: BaseFeaturizer

Abstract Featurizer class with support for multiprocessing.

Parameters

use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.
dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.

_SUPPORTED_TYPES = ()¶

__getstate__()¶: Only preserve object fields that are serializable.

__setstate__(state)¶: Only preserve object fields that are serializable.
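One way to keep only serializable fields when an object is pickled (for example before it is sent to worker processes) is to test each attribute with `pickle.dumps`. This is a sketch of the general technique, not necessarily how kinoml implements it; `Worker` and its attributes are hypothetical.

```python
import pickle
import threading


class Worker:
    """Hypothetical class holding one field that cannot be pickled."""

    def __init__(self):
        self.n_processes = 4
        self.lock = threading.Lock()  # stands in for e.g. a live client handle

    def __getstate__(self):
        state = {}
        for key, value in self.__dict__.items():
            try:
                pickle.dumps(value)
            except Exception:
                continue  # drop fields that fail to serialize
            state[key] = value
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)


restored = pickle.loads(pickle.dumps(Worker()))
```

After the round trip, `restored` keeps `n_processes` but has no `lock` attribute.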

_featurize(systems: List[kinoml.core.systems.System]) → List[object]¶

Featurize all system objects in a parallel fashion as defined in ._featurize_one().

Parameters: systems (list of System) – This is the collection of System objects that will be transformed.
Returns: features
Return type: list of System or array-like

class kinoml.features.core.Pipeline(featurizers: List[BaseFeaturizer], shortname=None, **kwargs)¶

Bases: BaseFeaturizer

Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y).

Parameters: featurizers (iterable of BaseFeaturizer) – Featurizers to stack. They must be compatible with each other!

Note

While Pipeline is a subclass of BaseFeaturizer, it should be considered a special case of such. It indeed shares the same API but the implementation details of ._featurize() are slightly different. It acts as a wrapper around individual Featurizer objects.
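The chaining idea (A returns X, B takes X and returns Y) can be sketched with plain functions. `run_pipeline`, `to_lengths` and `to_squares` are hypothetical; the real class also handles the storage of results in each System.

```python
from typing import Callable, List, Sequence


def run_pipeline(steps: Sequence[Callable[[List], List]], systems: List) -> List:
    """Feed the output of each step into the next one."""
    data = systems
    for step in steps:
        data = step(data)
    return data


def to_lengths(sequences: List[str]) -> List[int]:
    return [len(s) for s in sequences]


def to_squares(numbers: List[int]) -> List[int]:
    return [n * n for n in numbers]


result = run_pipeline([to_lengths, to_squares], ["MKV", "ACDE"])
```

The steps are only compatible because `to_squares` accepts what `to_lengths` returns, which is the constraint the Parameters entry refers to.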

property name¶

property shortname¶

_featurize(systems: List[kinoml.core.systems.System], keep: bool = True) → List[object]¶

Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y) and store the features in the systems.

Parameters

systems (list of System) – This is the collection of System objects that will be transformed
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

features

Return type

list of System or array-like

supports(*systems: kinoml.core.systems.System, raise_errors: bool = False) → bool¶

Check if these systems are supported by all featurizers.

Parameters

systems (list of System) – systems to be checked (by type, contained attributes, etc)
raise_errors (bool, optional=False) – If True, raise ValueError

Returns

True if all systems are compatible with all featurizers, False otherwise

Return type

bool

Raises

ValueError – If f.supports() fails and raise_errors is True.

class kinoml.features.core.Concatenated(featurizers: List[BaseFeaturizer], axis: int = 1, **kwargs)¶

Bases: Pipeline

Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).

Parameters

featurizers (list of BaseFeaturizer) – These should take a System or array, but return only arrays so they can be concatenated. Note that the arrays must have the same number of dimensions. If that is not the case, you will need to reshape one of them using CallableFeaturizer and a lambda function that relies on np.reshape or similar.
axis (int, optional=1) – On which axis to concatenate. By default, it will concatenate on axis 1, which means that the features in each pipeline will be concatenated.

Notes

This Featurizer may be removed in the future, since it can be replaced by TupleOfArrays.
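The dimensionality requirement can be illustrated with NumPy alone. The sketch assumes each featurizer yields an array of shape (n_systems, n_features); the array names are made up for the example.

```python
import numpy as np

x = np.zeros((3, 4))  # featurizer A: 3 systems, 4 features each
y = np.ones((3, 2))   # featurizer B: 3 systems, 2 features each

# Same number of dimensions, so concatenating on axis 1 joins the features.
xy = np.concatenate([x, y], axis=1)

# A 1D array must be reshaped to 2D before it can be joined on axis 1.
z = np.arange(3)
xyz = np.concatenate([xy, z.reshape(-1, 1)], axis=1)
```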

_featurize(systems: List[kinoml.core.systems.System], keep=True) → numpy.ndarray¶

Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).

Parameters

systems (list of System or array-like) – The Systems (or arrays) to be featurized.
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

Concatenated arrays along specified axis.

Return type

np.ndarray

class kinoml.features.core.TupleOfArrays(*args, **kwargs)¶

Bases: Pipeline

Given a list of featurizers, apply them serially and return the result directly as a flattened tuple of the arrays, for each system. For example, given one system, if featurizer A returns X and featurizer B returns (Y, Z), the output is the tuple (X, Y, Z).

The final result will be a tuple of tuples.

_featurize(systems: List[kinoml.core.systems.System], keep: bool = True) → List¶

Given a list of featurizers, apply them serially and build a flat tuple out of the results.

Parameters

systems (list of System or array-like) – The Systems (or arrays) to be featurized.
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

If the last featurizer is returning a single array, the shape of the object will be (N_systems,). If the last featurizer returns more than one array, it will be (N_systems, M_returned_objects).

Return type

tuple (of tuples) of array-like
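The flattening for a single system can be sketched as follows. `flatten_outputs` is a hypothetical helper, not part of kinoml; it only shows how a mix of single arrays and tuples of arrays ends up as one flat tuple.

```python
import numpy as np


def flatten_outputs(outputs):
    """Flatten featurizer outputs (arrays or tuples of arrays) into one tuple."""
    flat = []
    for out in outputs:
        if isinstance(out, tuple):
            flat.extend(out)
        else:
            flat.append(out)
    return tuple(flat)


x = np.zeros(3)                     # featurizer A returns X
y, z = np.ones(2), np.full(4, 2.0)  # featurizer B returns (Y, Z)
per_system = flatten_outputs([x, (y, z)])
```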

class kinoml.features.core.BaseOneHotEncodingFeaturizer(dictionary: dict = None, **kwargs)¶

Bases: ParallelBaseFeaturizer

Base class for Featurizers concerning one hot encoding.

ALPHABET¶

_featurize_one(system: Union[kinoml.core.systems.LigandSystem, kinoml.core.systems.ProteinLigandComplex]) → Union[numpy.ndarray, None]¶

One hot encode one system.

Parameters: system (LigandSystem or ProteinLigandComplex) – The System to be featurized.
Return type: array or None

abstract _retrieve_sequence(system: kinoml.core.systems.System)¶: Implement in your component-specific subclass!

static one_hot_encode(sequence: Iterable, dictionary: dict | Sequence) → numpy.ndarray¶

One-hot encode a sequence of characters, given a dictionary.

Parameters

sequence (Iterable) –
dictionary (dict or sequence-like) – Mapping of each character to its position in the alphabet. If a sequence-like is given, it will be enumerated into a dict.

Returns

One-hot encoded matrix with shape (len(dictionary), len(sequence))

Return type

array-like
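A standalone sketch of the documented behaviour, written with NumPy and not taken from the kinoml source: the matrix has one row per alphabet entry and one column per sequence position.

```python
from collections.abc import Mapping

import numpy as np


def one_hot_encode(sequence, dictionary):
    # A sequence-like alphabet is enumerated into a character -> index mapping.
    if not isinstance(dictionary, Mapping):
        dictionary = {char: i for i, char in enumerate(dictionary)}
    chars = list(sequence)
    matrix = np.zeros((len(dictionary), len(chars)))
    for column, char in enumerate(chars):
        matrix[dictionary[char], column] = 1
    return matrix


encoded = one_hot_encode("CAB", "ABC")
```

Here `encoded` has shape (3, 3), and each column contains a single 1 in the row of the corresponding character.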

class kinoml.features.core.PadFeaturizer(shape: Iterable[int] = 'auto', key: Hashable = 'last', pad_with: int = 0, **kwargs)¶

Bases: ParallelBaseFeaturizer

Pads features of a given system to a desired size or length.

This class wraps numpy.pad with mode=constant, auto-calculating the needed additions to match the requested shape.

Parameters

shape (tuple of int, or "auto") – The desired size of the transformed features. If “auto”, shape will be estimated from the Dataset passed at runtime so it matches the largest observed.
key (hashable) – element to retrieve from System.featurizations
pad_with (int) – value to fill the array-like features with
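The padding arithmetic can be sketched with `numpy.pad` directly. `pad_to_shape` is a hypothetical helper, and the sketch assumes that padding is appended at the end of each axis; the `auto_shape` line mimics the "auto" mode by taking the largest size observed per axis.

```python
import numpy as np


def pad_to_shape(array, shape, pad_with=0):
    # Amount to append on each axis so the array reaches the target shape.
    pad_width = [(0, target - current) for current, target in zip(array.shape, shape)]
    return np.pad(array, pad_width, mode="constant", constant_values=pad_with)


arrays = [np.ones((2, 3)), np.ones((4, 1))]

# "auto": the largest size seen along each axis across all inputs.
auto_shape = tuple(max(sizes) for sizes in zip(*(a.shape for a in arrays)))
padded = [pad_to_shape(a, auto_shape) for a in arrays]
```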

_get_array(system_or_array: System | np.ndarray) → numpy.ndarray¶

_pre_featurize(systems) → None¶

Compute the largest shape in the input arrays and store in shape attribute.

Parameters: systems (list of System) –

_featurize_one(system: kinoml.core.systems.System) → numpy.ndarray¶

Parameters

system (System or array-like) – The System (or array) to be featurized.
options (dict) – Must contain a key shape with the expected final shape of the systems.

Return type

array

class kinoml.features.core.HashFeaturizer(getter: Callable[[kinoml.core.systems.System], str] = None, normalize=True, **kwargs)¶

Bases: BaseFeaturizer

Hash an attribute of the protein, such as the name or id.

Parameters

getter (callable, optional) – A function or lambda that takes a System and returns a string to be hashed. Default value will return whatever system.featurizations["last"] contains, as a string
normalize (bool, default=True) – Normalizes the hash to obtain a value in the unit interval

static _getter(system)¶

_featurize_one(system: kinoml.core.systems.System) → numpy.ndarray¶

Featurizes a component using the hash of the chosen attribute.

Parameters: system (System) – The System to be featurized.
Returns: Sha256’d attribute
Return type: array
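A minimal sketch of hashing a string into the unit interval with SHA-256. The normalization used here (dividing the digest by 2**256) is an assumption chosen for illustration; the library may normalize differently. `hash_to_unit_interval` is a hypothetical name.

```python
import hashlib


def hash_to_unit_interval(text: str) -> float:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Assumed normalization: a 256-bit integer divided by 2**256.
    return int(digest, 16) / 2**256


value = hash_to_unit_interval("ABL1")
```

The mapping is deterministic, so the same attribute always produces the same feature value.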

class kinoml.features.core.NullFeaturizer(**kwargs)¶

Bases: ParallelBaseFeaturizer

Abstract Featurizer class with support for multiprocessing.

Parameters

use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.
dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.

_featurize(systems: Iterable[kinoml.core.systems.System], keep: bool = None) → object¶

Featurize all system objects in a parallel fashion as defined in ._featurize_one().

Parameters: systems (list of System) – This is the collection of System objects that will be transformed.
Returns: features
Return type: list of System or array-like

class kinoml.features.core.CallableFeaturizer(func: Callable[[System], System | np.array] | str = None, **kwargs)¶

Bases: BaseFeaturizer

Apply an arbitrary callable to a System.

Parameters: func (callable or str or None) – Must take a System and return a System or array. If str it will be eval’d into a callable. If None, the default callable will return system.featurizations["last"] for each system.
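The three accepted forms of `func` can be sketched as below. `resolve_callable` and `FakeSystem` are hypothetical; note that `eval` on a string is only safe when the string comes from a trusted source.

```python
def resolve_callable(func=None):
    """Turn None, a string, or a callable into a callable."""
    if func is None:
        # Default: return whatever the previous featurizer stored.
        return lambda system: system.featurizations["last"]
    if isinstance(func, str):
        return eval(func)  # only for trusted input
    return func


class FakeSystem:
    """Hypothetical stand-in exposing a .featurizations dictionary."""

    featurizations = {"last": [1, 2, 3]}


default = resolve_callable()
from_string = resolve_callable("lambda s: len(s.featurizations['last'])")
```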

static _default_func(system)¶

_featurize_one(system: System | np.ndarray) → numpy.ndarray¶

Parameters

system (System or array-like) – The System (or array) to be featurized.
options (dict) – Unused

Return type

array-like

class kinoml.features.core.ClearFeaturizations(keys=('last',), style='keep', **kwargs)¶

Bases: BaseFeaturizer

Remove keys from the .featurizations dictionary in each System object. By default, it will remove all keys that are not last.

Parameters

keys (tuple of str, optional=("last",)) – Which keys to keep or remove, depending on style.
style (str, optional="keep") – Whether to keep or remove the entries passed as keys.
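The effect of `keys` and `style` on a featurizations dictionary can be sketched with a plain dict. `clear_featurizations` is a hypothetical function, and the `"remove"` value for `style` is assumed from the description above.

```python
def clear_featurizations(featurizations, keys=("last",), style="keep"):
    if style == "keep":
        return {k: v for k, v in featurizations.items() if k in keys}
    if style == "remove":
        return {k: v for k, v in featurizations.items() if k not in keys}
    raise ValueError(f"unknown style: {style!r}")


stored = {"last": 3, "OneHot": 1, "Pad": 2}
kept = clear_featurizations(stored)
removed = clear_featurizations(stored, keys=("Pad",), style="remove")
```

With the defaults, only the `last` entry survives.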

_featurize_one(system: kinoml.core.systems.System) → kinoml.core.systems.System¶

Implement this method to do the actual leg-work for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.

Parameters: system (System) – The System to be featurized.
Return type: System or array-like

_post_featurize(systems: Iterable[kinoml.core.systems.System], features: Iterable[System | np.array], keep: bool = True) → Iterable[kinoml.core.systems.System]¶: Bypass the automated population of the .featurizations dict in each System

class kinoml.features.core.OEBaseModelingFeaturizer(loop_db: Union[str, None] = None, cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, **kwargs)¶

Bases: ParallelBaseFeaturizer

This abstract class defines several methods that use functionality from the OpenEye toolkit for molecular modeling. Featurizers that subclass OEBaseModelingFeaturizer need to implement at least the _featurize_one method.

Parameters

loop_db (str) – The path to the loop database used by OESpruce to model missing loops.
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

_read_protein_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) → Union[oechem.OEGraphMol, None]¶

Returns the protein structure of the given protein object as OpenEye molecule.

Parameters: protein (Protein or KLIFSKinase) – The protein object.
Returns: The protein structure as OpenEye molecule or None.
Return type: oechem.OEGraphMol or None
Raises: ValueError – If wrong toolkit was used during initialization of the protein object.

_get_design_unit(structure: openeye.oechem.OEMolBase, chain_id: Union[str, None], alternate_location: Union[str, None], has_ligand: bool, ligand_name: Union[str, None], model_loops_and_caps: bool) → Union[openeye.oechem.OEDesignUnit, None]¶

Get an OpenEye design unit based on the given input.

Parameters

structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to prepare.
chain_id (str or None) – The chain ID of interest.
alternate_location (str or None) – The alternate location of interest.
has_ligand (bool) – If design unit generation should consider ligands. If True, design units will be only generated for protein ligand complexes. If False, design units will not consider co-crystallized ligands.
ligand_name (str or None) – The ligand expo ID bound to the protein of interest. Design units will be filtered to contain the respective ligand.
model_loops_and_caps (bool) – If loops and caps should be modeled.

Returns

design_unit – The design unit or None if no design unit was found.

Return type

oechem.OEDesignUnit or None

static _get_components(design_unit: openeye.oechem.OEDesignUnit, chain_id: Union[str, None]) → Tuple[oechem.OEGraphMol(), oechem.OEGraphMol(), oechem.OEGraphMol()]¶

Get protein, solvent and ligand components from an OpenEye design unit.

Parameters

design_unit (oechem.OEDesignUnit) – The OpenEye design unit to extract components from.
chain_id (str or None) – The chain ID of interest.

Returns

components – OpenEye molecules holding protein, solvent and ligand.

Return type

tuple of oechem.OEGraphMol, oechem.OEGraphMol and oechem.OEGraphMol

_process_protein(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1, ligand: Union[oechem.OEMolBase, None] = None) → oechem.OEMolBase¶

Process a protein structure according to the given amino acid sequence.

Parameters

protein_structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to process.
amino_acid_sequence (str) – The amino acid sequence with associated metadata.
first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.
ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule that should be checked for heavy atom clashes with built insertions.

Returns

An OpenEye molecule holding the processed protein structure.

Return type

oechem.OEMolBase

static _get_protein_residue_numbers(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1) → List[int]¶

Get the residue numbers of a protein structure according to given amino acid sequence.

Parameters

protein_structure (oechem.OEMolBase) – The kinase domain structure.
amino_acid_sequence (core.sequences.AminoAcidSequence) – The template amino acid sequence.
first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.

Returns

residue_number – A list of residue numbers according to the given amino acid sequence in the same order as the residues in the given protein structure.

Return type

list of int

_assemble_components(protein: openeye.oechem.OEMolBase, solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None] = None) → openeye.oechem.OEMolBase¶

Assemble components of a solvated protein-ligand complex into a single OpenEye molecule.

Parameters

protein (oechem.OEMolBase) – An OpenEye molecule holding the protein of interest.
solvent (oechem.OEMolBase) – An OpenEye molecule holding the solvent of interest.
ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule holding the ligand of interest if given.

Returns

assembled_components – An OpenEye molecule holding protein, solvent and ligand if given.

Return type

oechem.OEMolBase

static _remove_clashing_water(solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None], protein: openeye.oechem.OEMolBase) → openeye.oechem.OEGraphMol¶

Remove water molecules clashing with a ligand or newly modeled protein residues.

Parameters

solvent (oechem.OEGraphMol) – An OpenEye molecule holding the water molecules.
ligand (oechem.OEGraphMol or None) – An OpenEye molecule holding the ligand or None.
protein (oechem.OEGraphMol) – An OpenEye molecule holding the protein.

Returns

An OpenEye molecule holding water molecules not clashing with the ligand or newly modeled protein residues.

Return type

oechem.OEGraphMol

_update_pdb_header(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: [str, None] = None, other_pdb_header_info: Union[None, Iterable[Tuple[str, str]]] = None) → openeye.oechem.OEMolBase¶

Stores information about Featurizer, protein and ligand in the PDB header COMPND section in the given OpenEye molecule.

Parameters

structure (oechem.OEMolBase) – An OpenEye molecule.
protein_name (str) – The name of the protein.
ligand_name (str or None, default=None) – The name of the ligand if present.
other_pdb_header_info (None or iterable of tuple of str) – Tuples with information that should be saved in the PDB header. Each tuple consists of two strings, i.e., the PDB header section (e.g. COMPND) and the respective information.

Returns

The OpenEye molecule containing the updated PDB header.

Return type

oechem.OEMolBase

_write_results(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: Union[str, None] = None) → pathlib.Path¶

Write the results from the Featurizer and retrieve the paths to protein or complex if a ligand is present.

Parameters

structure (oechem.OEMolBase) – The OpenEye molecule holding the featurized system.
protein_name (str) – The name of the protein.
ligand_name (str or None, default=None) – The name of the ligand if present.

Returns

Path to prepared protein or complex if ligand is present.

Return type

Path
