kinoml.features.core

Featurizers can transform a kinoml.core.systems.System object and produce new representations of the molecular entities and their associated measurements.

Module Contents

kinoml.features.core.logger
class kinoml.features.core.BaseFeaturizer

Abstract Featurizer class.

property name
_SUPPORTED_TYPES = ()
featurize(systems: List[kinoml.core.systems.System], keep=True) → List[kinoml.core.systems.System]

Given some systems (compatible with _SUPPORTED_TYPES), apply the featurization scheme implemented in this class.

First, self.supports() checks whether the systems are compatible with the featurization scheme. All systems are assumed to be of the same kind, so only the first one is checked. Then, the Systems are passed to self._featurize, which handles the actual legwork.

Parameters
  • systems (list of System) – This is the collection of System objects that will be transformed.

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

systems – The same systems that were passed in. The returned Systems will have an extra entry in the .featurizations dictionary, containing the featurized object (either a new System or an array-like object) under a key named after .name.

Return type

list of System
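
The flow described above (check support, run the pre-hook, featurize, then store the results) can be sketched with stand-in classes. This is a simplified illustration and not KinoML's actual implementation: ToySystem, ToyFeaturizer and Doubler are hypothetical names, the support check accepts everything, and the log file written for failed systems is omitted.

```python
from abc import ABC, abstractmethod


class ToySystem:
    """Hypothetical stand-in for a System: a value plus a featurizations dict."""

    def __init__(self, value):
        self.value = value
        self.featurizations = {}


class ToyFeaturizer(ABC):
    """Sketch of the featurize() flow; not KinoML's code."""

    @property
    def name(self):
        return self.__class__.__name__

    def featurize(self, systems, keep=True):
        self.supports(systems[0])        # only the first system is checked
        self._pre_featurize(systems)     # optional hook
        features = self._featurize(systems)
        return self._post_featurize(systems, features, keep=keep)

    def supports(self, *systems):
        return True                      # assumption: accept everything

    def _pre_featurize(self, systems):
        pass

    def _featurize(self, systems):
        return [self._featurize_one(s) for s in systems]

    @abstractmethod
    def _featurize_one(self, system):
        ...

    def _post_featurize(self, systems, features, keep=True):
        kept = []
        for system, feature in zip(systems, features):
            if feature is None:          # systems without a feature are dropped
                continue
            system.featurizations["last"] = feature
            if keep:
                system.featurizations[self.name] = feature
            kept.append(system)
        return kept


class Doubler(ToyFeaturizer):
    def _featurize_one(self, system):
        return None if system.value is None else system.value * 2


systems = Doubler().featurize([ToySystem(1), ToySystem(None), ToySystem(3)])
print([s.featurizations for s in systems])
```

In this sketch the second system yields None and is filtered out, so two systems come back, each with a "last" entry and an entry named after the featurizer.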

__call__(*args, **kwargs)

You can also call the instance directly. This forwards to .featurize().

_pre_featurize(systems: List[kinoml.core.systems.System]) → None

Run before featurizing all systems. Redefine this method if needed.

Parameters

systems (list of System) – This is the collection of System objects that will be transformed.

_featurize(systems: List[kinoml.core.systems.System]) → List[object]

Featurize all system objects in a serial fashion as defined in ._featurize_one().

Parameters

systems (list of System) – This is the collection of System objects that will be transformed.

Returns

features

Return type

list of System or array-like

abstract _featurize_one(system: kinoml.core.systems.System) → object

Implement this method to do the actual legwork for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.

Parameters

system (System) – The System to be featurized.

Return type

System or array-like

_post_featurize(systems: List[kinoml.core.systems.System], features: List, keep: bool = True) → List[kinoml.core.systems.System]

Run after featurizing all systems. Systems with a feature of None will be removed and listed in a log file in the current working directory. You shouldn’t need to redefine this method.

Parameters
  • systems (list of System) – The systems being featurized

  • features (list) – The features returned by self._featurize

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

filtered_systems – The same systems as passed, but with .featurizations extended with the calculated features in two entries: the featurizer name and last. Systems with a feature of None will be removed.

Return type

list of System

supports(*systems: kinoml.core.systems.System, raise_errors: bool = True) → bool

Check if these systems are supported by this featurizer.

Do NOT reimplement in subclass. Check ._supports() instead.

Parameters
  • systems (list of System) – Systems to be checked (by type, contained attributes, etc)

  • raise_errors (bool, optional=True) – if True, raise ValueError if errors were found

Returns

True if all systems are compatible, False otherwise

Return type

bool

Raises

ValueError – If ._supports() fails and raise_errors is True.

_supports(system: kinoml.core.systems.System) → bool

This is the private method that actually tests for compatibility between a single system and the current featurizer.

This is the method you should reimplement in your subclass.

Parameters

system (System) – The system that will be checked

Returns

True if compatible, False otherwise

Return type

bool
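
The split between the public supports() and the overridable _supports() can be illustrated as follows. This is a sketch under assumptions, not KinoML's code: ToyFeaturizer and NonEmptyOnly are hypothetical, plain strings stand in for System objects, and the base _supports() is assumed to be a simple isinstance check against _SUPPORTED_TYPES.

```python
class ToyFeaturizer:
    _SUPPORTED_TYPES = (str,)   # hypothetical: pretend systems are strings

    def supports(self, *systems, raise_errors=True):
        """Public check; subclasses customise _supports() instead."""
        for system in systems:
            if not self._supports(system):
                if raise_errors:
                    raise ValueError(
                        f"{type(self).__name__} does not support {system!r}"
                    )
                return False
        return True

    def _supports(self, system):
        # Assumed default: a type check against _SUPPORTED_TYPES
        return isinstance(system, self._SUPPORTED_TYPES)


class NonEmptyOnly(ToyFeaturizer):
    def _supports(self, system):
        # Extend the base type check with an extra requirement
        return super()._supports(system) and len(system) > 0


f = NonEmptyOnly()
ok = f.supports("ABC", "DEF")
rejected = f.supports("ABC", "", raise_errors=False)
try:
    f.supports(42)
    raised = False
except ValueError:
    raised = True
```

Only _supports() is overridden in the subclass; the error handling driven by raise_errors stays in one place.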

__repr__()

Return repr(self).

class kinoml.features.core.ParallelBaseFeaturizer(use_multiprocessing: bool = True, n_processes: Union[int, None] = None, chunksize: Union[int, None] = None, dask_client=None, **kwargs)

Bases: BaseFeaturizer

Abstract Featurizer class with support for multiprocessing.

Parameters
  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

  • chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.

  • dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.

_SUPPORTED_TYPES = ()
__getstate__()

Only preserve object fields that are serializable.

__setstate__(state)

Only preserve object fields that are serializable.
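
The idea behind these two hooks is that objects sent to worker processes must be picklable, so fields that cannot be serialized are left out of the pickled state. A minimal sketch, assuming a hypothetical class with one unpicklable attribute (a threading.Lock standing in for something like a client handle); which fields KinoML actually drops is not stated here.

```python
import pickle
import threading


class ToyParallelFeaturizer:
    """Hypothetical class holding one field that cannot be pickled."""

    def __init__(self, n_processes=2):
        self.n_processes = n_processes
        self.lock = threading.Lock()   # stand-in for an unpicklable handle

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("lock", None)        # drop the unpicklable field
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = None               # assumption: not restored after unpickling


clone = pickle.loads(pickle.dumps(ToyParallelFeaturizer(n_processes=4)))
print(clone.n_processes, clone.lock)
```

Without __getstate__, pickling this object would fail because lock objects cannot be pickled.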

_featurize(systems: List[kinoml.core.systems.System]) → List[object]

Featurize all system objects in a parallel fashion as defined in ._featurize_one().

Parameters

systems (list of System) – This is the collection of System objects that will be transformed.

Returns

features

Return type

list of System or array-like

class kinoml.features.core.Pipeline(featurizers: List[BaseFeaturizer], shortname=None, **kwargs)

Bases: BaseFeaturizer

Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y).

Parameters

featurizers (iterable of BaseFeaturizer) – Featurizers to stack. They must be compatible with each other!

Note

While Pipeline is a subclass of BaseFeaturizer, it should be considered a special case. It shares the same API, but the implementation details of ._featurize() are slightly different: it acts as a wrapper around individual Featurizer objects.

property name
property shortname
_featurize(systems: List[kinoml.core.systems.System], keep: bool = True) → List[object]

Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y) and store the features in the systems.

Parameters
  • systems (list of System) – This is the collection of System objects that will be transformed

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

features

Return type

list of System or array-like
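
The chaining behaviour (A returns X, X is taken by B, which returns Y) can be sketched with plain callables in place of featurizers. ToyPipeline and the two steps are hypothetical; the real Pipeline also stores results in each System's .featurizations dictionary, which this sketch leaves out.

```python
class ToyPipeline:
    """Sketch of sequential chaining; each step is a plain callable here."""

    def __init__(self, steps):
        self.steps = list(steps)

    def featurize(self, items):
        for step in self.steps:
            # The output of one step becomes the input of the next
            items = [step(item) for item in items]
        return items


# Hypothetical steps: A splits a string into tokens, B counts the tokens
pipeline = ToyPipeline([str.split, len])
result = pipeline.featurize(["a b c", "d e"])
print(result)
```

The steps must be compatible with each other: here the second step only works because the first one returns something with a length.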

supports(*systems: kinoml.core.systems.System, raise_errors: bool = False) → bool

Check if these systems are supported by all featurizers.

Parameters
  • systems (list of System) – systems to be checked (by type, contained attributes, etc)

  • raise_errors (bool, optional=False) – If True, raise ValueError

Returns

True if all systems are compatible with all featurizers, False otherwise

Return type

bool

Raises

ValueError – If f.supports() fails and raise_errors is True.

class kinoml.features.core.Concatenated(featurizers: List[BaseFeaturizer], axis: int = 1, **kwargs)

Bases: Pipeline

Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).

Parameters
  • featurizers (list of BaseFeaturizer) – These should take a System or array, but return only arrays so they can be concatenated. Note that the arrays must have the same number of dimensions. If that is not the case, you will need to reshape one of them using CallableFeaturizer and a lambda function that relies on np.reshape or similar.

  • axis (int, optional=1) – On which axis to concatenate. By default, it will concatenate on axis 1, which means that the features in each pipeline will be concatenated.

Notes

This Featurizer may be removed in the future, since it can be replaced by TupleOfArrays.

_featurize(systems: List[kinoml.core.systems.System], keep=True) → numpy.ndarray

Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).

Parameters
  • systems (list of System or array-like) – The Systems (or arrays) to be featurized.

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

Concatenated arrays along specified axis.

Return type

np.ndarray
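
The "A returns X, B returns Y, the output is XY" behaviour can be illustrated with plain lists. This is a dependency-free sketch, not the class's implementation: concatenate_features and both featurizer functions are hypothetical, strings stand in for systems, and joining each system's feature lists end to end is meant to mirror concatenation along the feature axis.

```python
def concatenate_features(systems, featurizers):
    """For each system, join the 1-D outputs of every featurizer end to end."""
    rows = []
    for system in systems:
        row = []
        for featurizer in featurizers:
            row.extend(featurizer(system))
        rows.append(row)
    return rows


# Hypothetical featurizers operating on strings
def length_feature(s):
    return [len(s)]


def vowel_counts(s):
    return [s.count("a"), s.count("e")]


xy = concatenate_features(["banana", "beet"], [length_feature, vowel_counts])
print(xy)
```

Each row has one column from the first featurizer followed by two from the second, so every system ends up with three features.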

class kinoml.features.core.TupleOfArrays(*args, **kwargs)

Bases: Pipeline

Given a list of featurizers, apply them serially and return the result directly as a flattened tuple of the arrays, for each system (e.g. given one system, featurizer A returns X, and featurizer B returns Y, Z; the output is a tuple of X, Y, Z).

The final result will be a tuple of tuples.

_featurize(systems: List[kinoml.core.systems.System], keep: bool = True) → List

Given a list of featurizers, apply them serially and build a flat tuple out of the results.

Parameters
  • systems (list of System or array-like) – The Systems (or arrays) to be featurized.

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

If the last featurizer is returning a single array, the shape of the object will be (N_systems,). If the last featurizer returns more than one array, it will be (N_systems, M_returned_objects).

Return type

tuple (of tuples) of array-like
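
The flattening described for this class can be sketched as follows. The helper tuple_of_arrays and the two featurizer functions are hypothetical, with plain lists standing in for arrays; the sketch assumes a featurizer signals "several arrays" by returning a tuple.

```python
def tuple_of_arrays(systems, featurizers):
    """Build one flat tuple of outputs per system, then a tuple of those."""
    results = []
    for system in systems:
        flat = []
        for featurizer in featurizers:
            out = featurizer(system)
            if isinstance(out, tuple):
                flat.extend(out)   # several arrays: splice them in
            else:
                flat.append(out)   # single array: keep as one element
        results.append(tuple(flat))
    return tuple(results)


def feat_a(s):                     # returns one "array", X
    return [len(s)]


def feat_b(s):                     # returns two "arrays", Y and Z
    return ([s.count("a")], [s.upper()])


out = tuple_of_arrays(["aba"], [feat_a, feat_b])
print(out)
```

For one system this gives a single inner tuple (X, Y, Z) wrapped in an outer tuple, matching the "tuple of tuples" description.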

class kinoml.features.core.BaseOneHotEncodingFeaturizer(dictionary: dict = None, **kwargs)

Bases: ParallelBaseFeaturizer

Base class for Featurizers concerning one hot encoding.

ALPHABET
_featurize_one(system: Union[kinoml.core.systems.LigandSystem, kinoml.core.systems.ProteinLigandComplex]) → Union[numpy.ndarray, None]

One hot encode one system.

Parameters

system (LigandSystem or ProteinLigandComplex) – The System to be featurized.

Return type

array or None

abstract _retrieve_sequence(system: kinoml.core.systems.System)

Implement in your component-specific subclass!

static one_hot_encode(sequence: Iterable, dictionary: dict | Sequence) → numpy.ndarray

One-hot encode a sequence of characters, given a dictionary.

Parameters
  • sequence (Iterable) – The sequence of characters to encode.

  • dictionary (dict or sequence-like) – Mapping of each character to its position in the alphabet. If a sequence-like is given, it will be enumerated into a dict.

Returns

One-hot encoded matrix with shape (len(dictionary), len(sequence))

Return type

array-like
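
A plain-list sketch of the encoding described above, producing a matrix with len(dictionary) rows and len(sequence) columns. It re-implements the idea for illustration only and returns nested lists rather than a NumPy array.

```python
def one_hot_encode(sequence, dictionary):
    """Return a len(dictionary) x len(sequence) matrix of 0s and 1s."""
    if not isinstance(dictionary, dict):
        # A sequence-like alphabet is enumerated into a character -> row mapping
        dictionary = {char: i for i, char in enumerate(dictionary)}
    sequence = list(sequence)
    matrix = [[0] * len(sequence) for _ in range(len(dictionary))]
    for column, char in enumerate(sequence):
        matrix[dictionary[char]][column] = 1
    return matrix


encoded = one_hot_encode("CAB", "ABC")
for row in encoded:
    print(row)
```

Each column contains exactly one 1, placed in the row assigned to that character by the alphabet.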

class kinoml.features.core.PadFeaturizer(shape: Iterable[int] = 'auto', key: Hashable = 'last', pad_with: int = 0, **kwargs)

Bases: ParallelBaseFeaturizer

Pads features of a given system to a desired size or length.

This class wraps numpy.pad with mode=constant, auto-calculating the needed additions to match the requested shape.

Parameters
  • shape (tuple of int, or "auto") – The desired size of the transformed features. If “auto”, shape will be estimated from the Dataset passed at runtime so it matches the largest observed.

  • key (hashable) – element to retrieve from System.featurizations

  • pad_with (int) – value to fill the array-like features with

_get_array(system_or_array: System | np.ndarray) → numpy.ndarray
_pre_featurize(systems) → None

Compute the largest shape in the input arrays and store in shape attribute.

Parameters

systems (list of System) – The systems whose feature arrays will be inspected.

_featurize_one(system: kinoml.core.systems.System) → numpy.ndarray
Parameters
  • system (System or array-like) – The System (or array) to be featurized.

  • options (dict) – Must contain a key shape with the expected final shape of the systems.

Return type

array
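
The "auto" behaviour (find the largest shape first, then pad everything to it with a constant) can be sketched for one-dimensional data using plain lists. The helpers pad_to and pad_all are hypothetical, and appending the padding at the end is an assumption; the class itself wraps numpy.pad and handles multi-dimensional shapes.

```python
def pad_to(array, length, pad_with=0):
    """Right-pad a 1-D list with a constant until it reaches length."""
    return list(array) + [pad_with] * (length - len(array))


def pad_all(arrays, shape="auto", pad_with=0):
    if shape == "auto":
        # Mirrors the pre-featurize step: use the largest observed length
        shape = max(len(a) for a in arrays)
    return [pad_to(a, shape, pad_with) for a in arrays]


padded = pad_all([[1, 2, 3], [4], [5, 6]])
print(padded)
```

After padding, all feature lists share one length, which is what downstream code that stacks them into a single array typically needs.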

class kinoml.features.core.HashFeaturizer(getter: Callable[[kinoml.core.systems.System], str] = None, normalize=True, **kwargs)

Bases: BaseFeaturizer

Hash an attribute of the protein, such as the name or id.

Parameters
  • getter (callable, optional) – A function or lambda that takes a System and returns a string to be hashed. Default value will return whatever system.featurizations["last"] contains, as a string

  • normalize (bool, default=True) – Normalizes the hash to obtain a value in the unit interval

static _getter(system)
_featurize_one(system: kinoml.core.systems.System) → numpy.ndarray

Featurizes a component using the hash of the chosen attribute.

Parameters

system (System) – The System to be featurized.

Returns

Sha256’d attribute

Return type

array
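
Hashing a string attribute with SHA-256 and mapping it to the unit interval can be sketched with the standard library. The function hash_feature is hypothetical, and dividing by 2**256 is only one plausible normalization; the exact scheme used by the class is not documented here.

```python
import hashlib


def hash_feature(text, normalize=True):
    """Hash a string with SHA-256 and optionally scale it into [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    value = int(digest, 16)        # 256-bit integer
    if normalize:
        return value / 2**256      # assumption: scale by the digest range
    return value


print(hash_feature("EGFR"))
print(hash_feature("EGFR", normalize=False))
```

The same input always yields the same value, so the hash acts as a deterministic numeric identifier rather than a chemically meaningful feature.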

class kinoml.features.core.NullFeaturizer(**kwargs)

Bases: ParallelBaseFeaturizer

Abstract Featurizer class with support for multiprocessing.

Parameters
  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

  • chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.

  • dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.

_featurize(systems: Iterable[kinoml.core.systems.System], keep: bool = None) → object

Featurize all system objects in a parallel fashion as defined in ._featurize_one().

Parameters

systems (list of System) – This is the collection of System objects that will be transformed.

Returns

features

Return type

list of System or array-like

class kinoml.features.core.CallableFeaturizer(func: Callable[[System], System | np.array] | str = None, **kwargs)

Bases: BaseFeaturizer

Apply an arbitrary callable to a System.

Parameters

func (callable or str or None) – Must take a System and return a System or array. If str it will be eval’d into a callable. If None, the default callable will return system.featurizations["last"] for each system.

static _default_func(system)
_featurize_one(system: System | np.ndarray) → numpy.ndarray
Parameters
  • system (System or array-like) – The System (or array) to be featurized.

  • options (dict) – Unused

Return type

array-like
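
The three accepted forms of func (a callable, a string that is eval'd, or None) can be sketched as below. ToyCallableFeaturizer is a hypothetical stand-in, and a plain dict replaces the System object. Since strings are passed to eval, they should only come from trusted sources.

```python
class ToyCallableFeaturizer:
    """Sketch of resolving func from a callable, a string, or None."""

    def __init__(self, func=None):
        if func is None:
            func = self._default_func
        elif isinstance(func, str):
            func = eval(func)      # only safe with trusted input
        self.func = func

    @staticmethod
    def _default_func(system):
        # Default: return whatever the previous featurizer stored
        return system["featurizations"]["last"]

    def featurize_one(self, system):
        return self.func(system)


system = {"featurizations": {"last": [1, 2, 3]}}
default_out = ToyCallableFeaturizer().featurize_one(system)
lambda_out = ToyCallableFeaturizer(
    lambda s: sum(s["featurizations"]["last"])
).featurize_one(system)
string_out = ToyCallableFeaturizer(
    "lambda s: len(s['featurizations']['last'])"
).featurize_one(system)
print(default_out, lambda_out, string_out)
```

Accepting a string makes the featurizer configurable from text files, at the cost of evaluating arbitrary code.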

class kinoml.features.core.ClearFeaturizations(keys=('last',), style='keep', **kwargs)

Bases: BaseFeaturizer

Remove keys from the .featurizations dictionary in each System object. By default, it will remove all keys that are not last.

Parameters
  • keys (tuple of str, optional=("last",)) – Which keys to keep or remove, depending on style.

  • style (str, optional="keep") – Whether to keep or remove the entries passed as keys.
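
The effect of keys and style on a featurizations dictionary can be sketched with a small function. The helper clear_featurizations is hypothetical, and "remove" as the second accepted style value is an assumption based on the description above.

```python
def clear_featurizations(featurizations, keys=("last",), style="keep"):
    """Return a filtered copy of a featurizations dict."""
    if style == "keep":
        return {k: v for k, v in featurizations.items() if k in keys}
    if style == "remove":          # assumed name of the alternative style
        return {k: v for k, v in featurizations.items() if k not in keys}
    raise ValueError("style must be 'keep' or 'remove'")


feats = {"last": [1, 2], "OneHot": [0, 1], "Pad": [1, 2, 0]}
kept = clear_featurizations(feats)
removed = clear_featurizations(feats, keys=("Pad",), style="remove")
print(kept, removed)
```

With the defaults only the "last" entry survives, which matches the behaviour described for this class.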

_featurize_one(system: kinoml.core.systems.System) → kinoml.core.systems.System

Implement this method to do the actual legwork for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.

Parameters

system (System) – The System to be featurized.

Return type

System or array-like

_post_featurize(systems: Iterable[kinoml.core.systems.System], features: Iterable[System | np.array], keep: bool = True) → Iterable[kinoml.core.systems.System]

Bypass the automated population of the .featurizations dict in each System.

class kinoml.features.core.OEBaseModelingFeaturizer(loop_db: Union[str, None] = None, cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, **kwargs)

Bases: ParallelBaseFeaturizer

This abstract class defines several methods that use functionality from the OpenEye toolkit for molecular modeling. Featurizers that subclass OEBaseModelingFeaturizer need to implement at least the _featurize_one method.

Parameters
  • loop_db (str) – The path to the loop database used by OESpruce to model missing loops.

  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

_read_protein_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) → Union[oechem.OEGraphMol, None]

Returns the protein structure of the given protein object as an OpenEye molecule.

Parameters

protein (Protein or KLIFSKinase) – The protein object.

Returns

The protein structure as an OpenEye molecule, or None.

Return type

oechem.OEGraphMol or None

Raises

ValueError – If wrong toolkit was used during initialization of the protein object.

_get_design_unit(structure: openeye.oechem.OEMolBase, chain_id: Union[str, None], alternate_location: Union[str, None], has_ligand: bool, ligand_name: Union[str, None], model_loops_and_caps: bool) → Union[openeye.oechem.OEDesignUnit, None]

Get an OpenEye design unit based on the given input.

Parameters
  • structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to prepare.

  • chain_id (str or None) – The chain ID of interest.

  • alternate_location (str or None) – The alternate location of interest.

  • has_ligand (bool) – Whether design unit generation should consider ligands. If True, design units will only be generated for protein-ligand complexes. If False, design units will not consider co-crystallized ligands.

  • ligand_name (str or None) – The ligand expo ID bound to the protein of interest. Design units will be filtered to contain the respective ligand.

  • model_loops_and_caps (bool) – Whether loops and caps should be modeled.

Returns

design_unit – The design unit or None if no design unit was found.

Return type

oechem.OEDesignUnit or None

static _get_components(design_unit: openeye.oechem.OEDesignUnit, chain_id: Union[str, None]) → Tuple[oechem.OEGraphMol(), oechem.OEGraphMol(), oechem.OEGraphMol()]

Get protein, solvent and ligand components from an OpenEye design unit.

Parameters
  • design_unit (oechem.OEDesignUnit) – The OpenEye design unit to extract components from.

  • chain_id (str or None) – The chain ID of interest.

Returns

components – OpenEye molecules holding protein, solvent and ligand.

Return type

tuple of oechem.OEGraphMol, oechem.OEGraphMol and oechem.OEGraphMol

_process_protein(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1, ligand: Union[oechem.OEMolBase, None] = None) → oechem.OEMolBase

Process a protein structure according to the given amino acid sequence.

Parameters
  • protein_structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to process.

  • amino_acid_sequence (str) – The amino acid sequence with associated metadata.

  • first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.

  • ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule that should be checked for heavy atom clashes with built insertions.

Returns

An OpenEye molecule holding the processed protein structure.

Return type

oechem.OEMolBase

static _get_protein_residue_numbers(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1) → List[int]

Get the residue numbers of a protein structure according to given amino acid sequence.

Parameters
  • protein_structure (oechem.OEMolBase) – The kinase domain structure.

  • amino_acid_sequence (core.sequences.AminoAcidSequence) – The template amino acid sequence.

  • first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.

Returns

residue_number – A list of residue numbers according to the given amino acid sequence in the same order as the residues in the given protein structure.

Return type

list of int

_assemble_components(protein: openeye.oechem.OEMolBase, solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None] = None) → openeye.oechem.OEMolBase

Assemble components of a solvated protein-ligand complex into a single OpenEye molecule.

Parameters
  • protein (oechem.OEMolBase) – An OpenEye molecule holding the protein of interest.

  • solvent (oechem.OEMolBase) – An OpenEye molecule holding the solvent of interest.

  • ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule holding the ligand of interest if given.

Returns

assembled_components – An OpenEye molecule holding protein, solvent and ligand if given.

Return type

oechem.OEMolBase

static _remove_clashing_water(solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None], protein: openeye.oechem.OEMolBase) → openeye.oechem.OEGraphMol

Remove water molecules clashing with a ligand or newly modeled protein residues.

Parameters
  • solvent (oechem.OEGraphMol) – An OpenEye molecule holding the water molecules.

  • ligand (oechem.OEGraphMol or None) – An OpenEye molecule holding the ligand or None.

  • protein (oechem.OEGraphMol) – An OpenEye molecule holding the protein.

Returns

An OpenEye molecule holding water molecules not clashing with the ligand or newly modeled protein residues.

Return type

oechem.OEGraphMol

_update_pdb_header(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: [str, None] = None, other_pdb_header_info: Union[None, Iterable[Tuple[str, str]]] = None) → openeye.oechem.OEMolBase

Stores information about Featurizer, protein and ligand in the PDB header COMPND section in the given OpenEye molecule.

Parameters
  • structure (oechem.OEMolBase) – An OpenEye molecule.

  • protein_name (str) – The name of the protein.

  • ligand_name (str or None, default=None) – The name of the ligand if present.

  • other_pdb_header_info (None or iterable of tuple of str) – Tuples with information that should be saved in the PDB header. Each tuple consists of two strings, i.e., the PDB header section (e.g. COMPND) and the respective information.

Returns

The OpenEye molecule containing the updated PDB header.

Return type

oechem.OEMolBase

_write_results(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: Union[str, None] = None) → pathlib.Path

Write the results from the Featurizer and return the path to the prepared protein, or to the complex if a ligand is present.

Parameters
  • structure (oechem.OEMolBase) – The OpenEye molecule holding the featurized system.

  • protein_name (str) – The name of the protein.

  • ligand_name (str or None, default=None) – The name of the ligand if present.

Returns

Path to prepared protein or complex if ligand is present.

Return type

Path