kinoml.features.core

Featurizers can transform a kinoml.core.systems.System object and produce new representations of the molecular entities and their associated measurements.

Module Contents

kinoml.features.core.logger
class kinoml.features.core.BaseFeaturizer

Abstract Featurizer class.

_SUPPORTED_TYPES
featurize(systems: List[kinoml.core.systems.System], keep=True) List[kinoml.core.systems.System]

Given some systems (compatible with _SUPPORTED_TYPES), apply the featurization scheme implemented in this class.

First, self.supports() will check whether the systems are compatible with the featurization scheme. All systems are assumed to be of the same kind, so only the first one is checked. Then, the Systems are passed to self._featurize to handle the actual leg-work.

Parameters:
  • systems (list of System) – This is the collection of System objects that will be transformed.

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns:

systems – The same systems that were passed in. The returned Systems will have an extra entry in the .featurizations dictionary, containing the featurized object (either a new System or an array-like object) under a key named after .name.

Return type:

list of System
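
The call order described above can be sketched with a toy stand-in. This is not KinoML's implementation: ToySystem, ToyFeaturizer and UpperCase are hypothetical names, and using the class name as the featurizer name is an assumption. Only the sequence of steps (support check on the first system, featurization, storage under the featurizer name and last) follows the text.

```python
class ToySystem:
    """Hypothetical stand-in for a System: a payload plus a featurizations dict."""

    def __init__(self, payload):
        self.payload = payload
        self.featurizations = {}


class ToyFeaturizer:
    """Hypothetical base class mirroring the documented call order."""

    @property
    def name(self):
        # Assumption: the featurizer is named after its class.
        return self.__class__.__name__

    def supports(self, system):
        if not isinstance(system, ToySystem):
            raise ValueError(f"{self.name} does not support {type(system).__name__}")
        return True

    def featurize(self, systems, keep=True):
        # Only the first system is checked; the others are assumed to be alike.
        self.supports(systems[0])
        features = [self._featurize_one(system) for system in systems]
        for system, feature in zip(systems, features):
            system.featurizations["last"] = feature
            if keep:
                system.featurizations[self.name] = feature
        return systems

    def __call__(self, *args, **kwargs):
        # Calling the instance forwards to featurize().
        return self.featurize(*args, **kwargs)

    def _featurize_one(self, system):
        raise NotImplementedError


class UpperCase(ToyFeaturizer):
    def _featurize_one(self, system):
        return system.payload.upper()


systems = UpperCase()([ToySystem("abc"), ToySystem("def")])
```

After the call, each toy system carries the same feature under two keys: last and UpperCase.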

__call__(*args, **kwargs)

You can also call the instance directly. This forwards to .featurize().

_pre_featurize(systems: List[kinoml.core.systems.System]) None

Run before featurizing all systems. Redefine this method if needed.

Parameters:

systems (list of System) – This is the collection of System objects that will be transformed.

_featurize(systems: List[kinoml.core.systems.System]) List[object]

Featurize all system objects in a serial fashion as defined in ._featurize_one().

Parameters:

systems (list of System) – This is the collection of System objects that will be transformed.

Returns:

features

Return type:

list of System or array-like

abstract _featurize_one(system: kinoml.core.systems.System) object

Implement this method to do the actual leg-work for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.

Parameters:

system (System) – The System to be featurized.

Return type:

System or array-like

_post_featurize(systems: List[kinoml.core.systems.System], features: List, keep: bool = True) List[kinoml.core.systems.System]

Run after featurizing all systems. Systems with a feature of None will be removed and listed in a log file in the current working directory. You shouldn’t need to redefine this method.

Parameters:
  • systems (list of System) – The systems being featurized

  • features (list) – The features returned by self._featurize

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns:

filtered_systems – The same systems as passed, but with .featurizations extended with the calculated features in two entries: the featurizer name and last. Systems with a feature of None will be removed.

Return type:

list of System
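
A minimal sketch of the filtering behaviour described above, using plain dicts in place of System objects. The helper post_featurize and the dict layout are hypothetical, and the log file that the documented method writes is left out.

```python
def post_featurize(systems, features, name, keep=True):
    """Hypothetical helper: attach features and drop systems whose feature is None."""
    filtered = []
    for system, feature in zip(systems, features):
        if feature is None:
            # The documented method also lists these systems in a log file; omitted here.
            continue
        system["featurizations"]["last"] = feature
        if keep:
            system["featurizations"][name] = feature
        filtered.append(system)
    return filtered


systems = [{"id": i, "featurizations": {}} for i in range(3)]
kept = post_featurize(systems, [1.0, None, 3.0], name="MyFeaturizer")
```

The system at index 1 is dropped because its feature is None; the remaining two carry the feature under both last and the featurizer name.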

supports(*systems: kinoml.core.systems.System, raise_errors: bool = True) bool

Check if these systems are supported by this featurizer.

Do NOT reimplement this method in a subclass. Override ._supports() instead.

Parameters:
  • systems (list of System) – Systems to be checked (by type, contained attributes, etc)

  • raise_errors (bool, optional=True) – if True, raise ValueError if errors were found

Returns:

True if all systems are compatible, False otherwise

Return type:

bool

Raises:

ValueError – if ._supports() fails and raise_errors is True

_supports(system: kinoml.core.systems.System) bool

This is the private method that actually tests for compatibility between a single system and the current featurizer.

This is the method you should reimplement in your subclass.

Parameters:

system (System) – The system that will be checked

Return type:

bool (True if compatible, False otherwise)
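
The split between the public supports() and the overridable _supports() can be illustrated as follows. This is a sketch under assumptions, not the library code: ToyFeaturizer and StringsOnly are hypothetical classes, and plain Python values stand in for System objects.

```python
class ToyFeaturizer:
    """Hypothetical base class: supports() loops, _supports() decides."""

    def supports(self, *systems, raise_errors=True):
        for system in systems:
            if not self._supports(system):
                if raise_errors:
                    raise ValueError(f"{system!r} is not supported")
                return False
        return True

    def _supports(self, system):
        # Subclasses override this single-system check.
        return True


class StringsOnly(ToyFeaturizer):
    def _supports(self, system):
        return isinstance(system, str)


featurizer = StringsOnly()
ok = featurizer.supports("a", "b")
not_ok = featurizer.supports("a", 1, raise_errors=False)
```

Subclasses only change the per-system rule; the looping and error handling stay in one place.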

property name
__repr__()
class kinoml.features.core.ParallelBaseFeaturizer(use_multiprocessing: bool = True, n_processes: int | None = None, chunksize: int | None = None, dask_client=None, **kwargs)

Bases: BaseFeaturizer

Abstract Featurizer class with support for multiprocessing.

Parameters:
  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

  • chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.

  • dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.

_SUPPORTED_TYPES
use_multiprocessing = True
n_processes = None
chunksize = None
dask_client = None
__getstate__()

Only preserve object fields that are serializable.

__setstate__(state)

Only preserve object fields that are serializable.
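
One way to keep only serializable fields is to test each attribute with pickle and drop the ones that fail. The sketch below assumes that strategy; how KinoML actually decides what is serializable is not stated here. ToyFeaturizer, _is_picklable and the handle attribute are hypothetical, with a threading.Lock standing in for an unpicklable resource.

```python
import pickle
import threading


def _is_picklable(value):
    """Return True if pickle can serialize the value."""
    try:
        pickle.dumps(value)
    except Exception:
        return False
    return True


class ToyFeaturizer:
    def __init__(self):
        self.n_processes = 4
        self.handle = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Keep only the attributes that survive pickling.
        return {k: v for k, v in self.__dict__.items() if _is_picklable(v)}

    def __setstate__(self, state):
        self.__dict__.update(state)


original = ToyFeaturizer()
state = original.__getstate__()
restored = ToyFeaturizer.__new__(ToyFeaturizer)
restored.__setstate__(state)
```

The restored object keeps n_processes but has no handle attribute, which is what lets such an object travel to worker processes.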

_featurize(systems: List[kinoml.core.systems.System]) List[object]

Featurize all system objects in a parallel fashion as defined in ._featurize_one().

Parameters:

systems (list of System) – This is the collection of System objects that will be transformed.

Returns:

features

Return type:

list of System or array-like

class kinoml.features.core.Pipeline(featurizers: List[BaseFeaturizer], shortname=None, **kwargs)

Bases: BaseFeaturizer

Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y).

Parameters:

featurizers (iterable of BaseFeaturizer) – Featurizers to stack. They must be compatible with each other!

Note

While Pipeline is a subclass of BaseFeaturizer, it should be considered a special case of such. It indeed shares the same API but the implementation details of ._featurize() are slightly different. It acts as a wrapper around individual Featurizer objects.

featurizers
_shortname = None
_featurize(systems: List[kinoml.core.systems.System], keep: bool = True) List[object]

Given a list of featurizers, apply them sequentially on the systems (e.g. featurizer A returns X, and X is taken by featurizer B, which returns Y) and store the features in the systems.

Parameters:
  • systems (list of System) – This is the collection of System objects that will be transformed

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns:

features

Return type:

list of System or array-like
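
The chaining idea (A returns X, B takes X and returns Y) reduces to a simple loop. The sketch below uses plain functions rather than featurizer objects; run_pipeline, tokenize and count_tokens are hypothetical names chosen for illustration.

```python
def run_pipeline(featurizers, value):
    """Feed the output of each step into the next one."""
    for featurizer in featurizers:
        value = featurizer(value)
    return value


def tokenize(text):
    return text.split()


def count_tokens(tokens):
    return len(tokens)


result = run_pipeline([tokenize, count_tokens], "ATP binding site")
```

The steps must be compatible with each other: count_tokens only works because tokenize returns a list.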

supports(*systems: kinoml.core.systems.System, raise_errors: bool = False) bool

Check if these systems are supported by all featurizers.

Parameters:
  • systems (list of System) – systems to be checked (by type, contained attributes, etc)

  • raise_errors (bool, optional=False) – If True, raise ValueError

Returns:

True if all systems are compatible with all featurizers, False otherwise

Return type:

bool

Raises:

ValueError – if f.supports() fails and raise_errors is True

property name
property shortname
class kinoml.features.core.Concatenated(featurizers: List[BaseFeaturizer], axis: int = 1, **kwargs)

Bases: Pipeline

Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).

Parameters:
  • featurizers (list of BaseFeaturizer) – These should take a System or array, but return only arrays so they can be concatenated. Note that the arrays must have the same number of dimensions. If that is not the case, you will need to reshape one of them using CallableFeaturizer and a lambda function that relies on np.reshape or similar.

  • axis (int, optional=1) – On which axis to concatenate. By default, it will concatenate on axis 1, which means that the features in each pipeline will be concatenated.

Notes

This Featurizer may be removed in the future, since it can be replaced by TupleOfArrays.

axis = 1
_featurize(systems: List[kinoml.core.systems.System], keep=True) numpy.ndarray

Given a list of featurizers, apply them serially and concatenate the result (e.g. featurizer A returns X, and featurizer B returns Y; the output is XY).

Parameters:
  • systems (list of System or array-like) – The Systems (or arrays) to be featurized.

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns:

Concatenated arrays along specified axis.

Return type:

np.ndarray
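
To make the effect of axis concrete, here is a dependency-free sketch on nested lists, assuming each row holds the features of one system. In practice numpy.concatenate does this job; concatenate_2d is a hypothetical helper limited to 2D inputs.

```python
from itertools import chain


def concatenate_2d(blocks, axis=1):
    """Concatenate 2D nested lists along axis 0 (rows) or axis 1 (columns)."""
    if axis == 0:
        return [list(row) for block in blocks for row in block]
    if axis == 1:
        return [list(chain.from_iterable(rows)) for rows in zip(*blocks)]
    raise ValueError("this sketch only handles axis 0 or 1")


X = [[1, 2], [3, 4]]  # featurizer A: two systems, two features each
Y = [[9], [8]]        # featurizer B: two systems, one feature each
XY = concatenate_2d([X, Y], axis=1)
```

With axis=1, every system keeps its own row and gains the features from both featurizers.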

class kinoml.features.core.TupleOfArrays(*args, **kwargs)

Bases: Pipeline

Given a list of featurizers, apply them serially and return the result directly as a flattened tuple of the arrays, for each system (e.g. given one system, featurizer A returns X, and featurizer B returns Y, Z; the output is a tuple of X, Y, Z).

The final result will be a tuple of tuples.

_featurize(systems: List[kinoml.core.systems.System], keep: bool = True) List

Given a list of featurizers, apply them serially and build a flat tuple out of the results.

Parameters:
  • systems (list of System or array-like) – The Systems (or arrays) to be featurized.

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns:

If the last featurizer is returning a single array, the shape of the object will be (N_systems,). If the last featurizer returns more than one array, it will be (N_systems, M_returned_objects).

Return type:

tuple (of tuples) of array-like
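
The flattening step for a single system can be sketched as below. flatten_results is a hypothetical helper, and treating only tuple results as "several arrays" is an assumption made for this illustration.

```python
def flatten_results(*results):
    """Build one flat tuple from per-featurizer results.

    A result that is itself a tuple is unpacked; anything else is kept as one item.
    """
    flat = []
    for result in results:
        if isinstance(result, tuple):
            flat.extend(result)
        else:
            flat.append(result)
    return tuple(flat)


X = [1, 2]              # featurizer A returns one array
Y, Z = [3], [4, 5]      # featurizer B returns two arrays
per_system = flatten_results(X, (Y, Z))
all_systems = (per_system, per_system)  # one inner tuple per system
```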

class kinoml.features.core.BaseOneHotEncodingFeaturizer(dictionary: dict = None, **kwargs)

Bases: ParallelBaseFeaturizer

Base class for Featurizers concerning one hot encoding.

ALPHABET = None
dictionary = None
_featurize_one(system: kinoml.core.systems.LigandSystem | kinoml.core.systems.ProteinLigandComplex) numpy.ndarray | None

One hot encode one system.

Parameters:

system (LigandSystem or ProteinLigandComplex) – The System to be featurized.

Return type:

array or None

abstract _retrieve_sequence(system: kinoml.core.systems.System)

Implement in your component-specific subclass!

static one_hot_encode(sequence: Iterable, dictionary: dict | Sequence) numpy.ndarray

One-hot encode a sequence of characters, given a dictionary.

Parameters:
  • sequence (Iterable)

  • dictionary (dict or sequence-like) – Mapping of each character to its position in the alphabet. If a sequence-like is given, it will be enumerated into a dict.

Returns:

One-hot encoded matrix with shape (len(dictionary), len(sequence))

Return type:

array-like
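
A minimal sketch of one-hot encoding with the documented output shape (len(dictionary), len(sequence)). It returns nested lists instead of a NumPy array to stay dependency-free, and it is an illustration rather than the library's own code.

```python
def one_hot_encode(sequence, dictionary):
    """One-hot encode a sequence; rows follow the alphabet, columns follow the sequence."""
    if not isinstance(dictionary, dict):
        # A sequence-like alphabet is enumerated into a character -> index mapping.
        dictionary = {char: index for index, char in enumerate(dictionary)}
    sequence = list(sequence)
    matrix = [[0] * len(sequence) for _ in range(len(dictionary))]
    for column, char in enumerate(sequence):
        matrix[dictionary[char]][column] = 1
    return matrix


encoded = one_hot_encode("ACCA", "ACGT")
```

Each column contains exactly one 1, placed in the row of the corresponding character.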

class kinoml.features.core.PadFeaturizer(shape: Iterable[int] = 'auto', key: Hashable = 'last', pad_with: int = 0, **kwargs)

Bases: ParallelBaseFeaturizer

Pads features of a given system to a desired size or length.

This class wraps numpy.pad with mode=constant, auto-calculating the needed additions to match the requested shape.

Parameters:
  • shape (tuple of int, or "auto") – The desired size of the transformed features. If “auto”, shape will be estimated from the Dataset passed at runtime so it matches the largest observed.

  • key (hashable) – element to retrieve from System.featurizations

  • pad_with (int) – value to fill the array-like features with

shape = 'auto'
key = 'last'
pad_with = 0
_get_array(system_or_array: kinoml.core.systems.System | numpy.ndarray) numpy.ndarray
_pre_featurize(systems) None

Compute the largest shape in the input arrays and store in shape attribute.

Parameters:

systems (list of System)

_featurize_one(system: kinoml.core.systems.System) numpy.ndarray
Parameters:
  • system (System or array-like) – The System (or array) to be featurized.

  • options (dict) – Must contain a key shape with the expected final shape of the systems.

Return type:

array
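
The shape arithmetic behind the padding can be sketched without NumPy. Both helpers below are hypothetical, and padding only at the end of each axis is an assumption; the resulting widths have the form expected by the pad_width argument of numpy.pad.

```python
def largest_shape(shapes):
    """Element-wise maximum over shapes, as an 'auto' shape would require."""
    return tuple(max(dims) for dims in zip(*shapes))


def pad_widths(current, target):
    """Per-axis (before, after) padding needed to grow `current` to `target`.

    Assumption: padding is appended at the end of each axis.
    """
    return tuple((0, t - c) for c, t in zip(current, target))


target = largest_shape([(4, 10), (4, 7), (4, 12)])
widths = pad_widths((4, 7), target)
# These widths could then be passed on, for example:
# numpy.pad(array, widths, mode="constant", constant_values=pad_with)
```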

class kinoml.features.core.HashFeaturizer(getter: Callable[[kinoml.core.systems.System], str] = None, normalize=True, **kwargs)

Bases: BaseFeaturizer

Hash an attribute of the protein, such as the name or id.

Parameters:
  • getter (callable, optional) – A function or lambda that takes a System and returns a string to be hashed. Default value will return whatever system.featurizations["last"] contains, as a string

  • normalize (bool, default=True) – Normalizes the hash to obtain a value in the unit interval

getter
normalize = True
denominator = 115792089237316195423570985008687907853269984665640564039457584007913129639936
static _getter(system)
_featurize_one(system: kinoml.core.systems.System) numpy.ndarray

Featurizes a component using the hash of the chosen attribute.

Parameters:

system (System) – The System to be featurized.

Returns:

SHA-256 hash of the attribute

Return type:

array
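
The denominator listed above equals 2**256, the number of possible SHA-256 digests, so dividing a digest by it yields a value in the unit interval. A minimal sketch of that idea follows; hash_string is a hypothetical helper that returns a plain float rather than an array.

```python
import hashlib

DENOMINATOR = 2 ** 256  # number of distinct SHA-256 digests


def hash_string(text, normalize=True):
    """Hash a string with SHA-256 and optionally scale it by 2**256."""
    digest = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)
    return digest / DENOMINATOR if normalize else digest


value = hash_string("ABL1")
raw = hash_string("ABL1", normalize=False)
```

The same input always produces the same value, which is what makes the hash usable as a feature.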

class kinoml.features.core.NullFeaturizer(**kwargs)

Bases: ParallelBaseFeaturizer

Abstract Featurizer class with support for multiprocessing.

Parameters:
  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

  • chunksize (int, optional=None) – See https://stackoverflow.com/a/54032744/3407590.

  • dask_client (dask.distributed.Client or None, default=None) – A dask client to manage multiprocessing. If given, the use_multiprocessing, chunksize and n_processes attributes are ignored.

_featurize(systems: Iterable[kinoml.core.systems.System], keep: bool = None) object

Featurize all system objects in a parallel fashion as defined in ._featurize_one().

Parameters:

systems (list of System) – This is the collection of System objects that will be transformed.

Returns:

features

Return type:

list of System or array-like

class kinoml.features.core.CallableFeaturizer(func: Callable[[kinoml.core.systems.System], kinoml.core.systems.System | numpy.array] | str = None, **kwargs)

Bases: BaseFeaturizer

Apply an arbitrary callable to a System.

Parameters:

func (callable or str or None) – Must take a System and return a System or array. If str it will be eval’d into a callable. If None, the default callable will return system.featurizations["last"] for each system.

callable = None
static _default_func(system)
_featurize_one(system: kinoml.core.systems.System | numpy.ndarray) numpy.ndarray
Parameters:
  • system (System or array-like) – The System (or array) to be featurized.

  • options (dict) – Unused

Return type:

array-like
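
The three accepted forms of func (callable, string, None) can be resolved as in the sketch below. resolve_callable is a hypothetical helper and a dict stands in for a System; note that eval should only ever see trusted strings.

```python
def resolve_callable(func=None):
    """Turn None, a string, or a callable into a callable."""
    if func is None:
        # Default: return whatever is stored under "last".
        return lambda system: system["featurizations"]["last"]
    if isinstance(func, str):
        func = eval(func)  # only safe for trusted input
    if not callable(func):
        raise TypeError("func must be callable, a string, or None")
    return func


system = {"featurizations": {"last": [1, 2, 3]}}
default = resolve_callable()
doubled = resolve_callable("lambda s: [2 * x for x in s['featurizations']['last']]")
```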

class kinoml.features.core.ClearFeaturizations(keys=('last',), style='keep', **kwargs)

Bases: BaseFeaturizer

Remove keys from the .featurizations dictionary in each System object. By default, it will remove all keys that are not last.

Parameters:
  • keys (tuple of str, optional=("last",)) – Which keys to keep or remove, depending on style.

  • style (str, optional="keep") – Whether to keep or remove the entries passed as keys.

keys = ('last',)
style = 'keep'
_featurize_one(system: kinoml.core.systems.System) kinoml.core.systems.System

Implement this method to do the actual leg-work for self.featurize(). It takes a single System object and returns either a new System object or an array-like object.

Parameters:

system (System) – The System to be featurized.

Return type:

System or array-like

_post_featurize(systems: Iterable[kinoml.core.systems.System], features: Iterable[kinoml.core.systems.System | numpy.array], keep: bool = True) Iterable[kinoml.core.systems.System]

Bypass the automated population of the .featurizations dict in each System.
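
The effect of keys and style on a featurizations dictionary can be sketched as follows. clear_featurizations is a hypothetical helper, and "remove" as the alternative style value is an assumption based on the parameter description.

```python
def clear_featurizations(featurizations, keys=("last",), style="keep"):
    """Return a copy of the dict with entries kept or removed according to style."""
    if style == "keep":
        return {k: v for k, v in featurizations.items() if k in keys}
    if style == "remove":
        return {k: v for k, v in featurizations.items() if k not in keys}
    raise ValueError(f"unknown style: {style!r}")


featurizations = {"last": [0.5], "A": [1], "B": [2]}
kept = clear_featurizations(featurizations)
removed = clear_featurizations(featurizations, keys=("A",), style="remove")
```

With the defaults, everything except last is discarded.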

class kinoml.features.core.OEBaseModelingFeaturizer(loop_db: str | None = None, cache_dir: str | pathlib.Path | None = None, output_dir: str | pathlib.Path | None = None, **kwargs)

Bases: ParallelBaseFeaturizer

This abstract class defines several methods that use functionality from the OpenEye toolkit for molecular modeling. Featurizers that subclass OEBaseModelingFeaturizer need to implement at least the _featurize_one method.

Parameters:
  • loop_db (str) – The path to the loop database used by OESpruce to model missing loops.

  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

loop_db = None
cache_dir
output_dir = None
_read_protein_structure(protein: kinoml.core.proteins.Protein | kinoml.core.proteins.KLIFSKinase) oechem.OEGraphMol | None

Returns the protein structure of the given protein object as OpenEye molecule.

Parameters:

protein (Protein or KLIFSKinase) – The protein object.

Returns:

The protein structure as OpenEye molecule or None.

Return type:

oechem.OEGraphMol or None

Raises:

ValueError – If wrong toolkit was used during initialization of the protein object.

_get_design_unit(structure: openeye.oechem.OEMolBase, chain_id: str | None, alternate_location: str | None, has_ligand: bool, ligand_name: str | None, model_loops_and_caps: bool) openeye.oechem.OEDesignUnit | None

Get an OpenEye design unit based on the given input.

Parameters:
  • structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to prepare.

  • chain_id (str or None) – The chain ID of interest.

  • alternate_location (str or None) – The alternate location of interest.

  • has_ligand (bool) – Whether design unit generation should consider ligands. If True, design units will only be generated for protein-ligand complexes. If False, design units will not consider co-crystallized ligands.

  • ligand_name (str or None) – The ligand expo ID bound to the protein of interest. Design units will be filtered to contain the respective ligand.

  • model_loops_and_caps (bool) – Whether loops and caps should be modeled.

Returns:

design_unit – The design unit or None if no design unit was found.

Return type:

oechem.OEDesignUnit or None

static _get_components(design_unit: openeye.oechem.OEDesignUnit, chain_id: str | None) Tuple[oechem.OEGraphMol(), oechem.OEGraphMol(), oechem.OEGraphMol()]

Get protein, solvent and ligand components from an OpenEye design unit.

Parameters:
  • design_unit (oechem.OEDesignUnit) – The OpenEye design unit to extract components from.

  • chain_id (str or None) – The chain ID of interest.

Returns:

components – OpenEye molecules holding protein, solvent and ligand.

Return type:

tuple of oechem.OEGraphMol, oechem.OEGraphMol and oechem.OEGraphMol

_process_protein(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1, ligand: oechem.OEMolBase | None = None) oechem.OEMolBase

Process a protein structure according to the given amino acid sequence.

Parameters:
  • protein_structure (oechem.OEMolBase) – An OpenEye molecule holding the protein structure to process.

  • amino_acid_sequence (str) – The amino acid sequence with associated metadata.

  • first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.

  • ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule that should be checked for heavy atom clashes with built insertions.

Returns:

An OpenEye molecule holding the processed protein structure.

Return type:

oechem.OEMolBase

static _get_protein_residue_numbers(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1) List[int]

Get the residue numbers of a protein structure according to given amino acid sequence.

Parameters:
  • protein_structure (oechem.OEMolBase) – The kinase domain structure.

  • amino_acid_sequence (core.sequences.AminoAcidSequence) – The template amino acid sequence.

  • first_id (int, default=1) – The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.

Returns:

residue_number – A list of residue numbers according to the given amino acid sequence in the same order as the residues in the given protein structure.

Return type:

list of int

_assemble_components(protein: openeye.oechem.OEMolBase, solvent: openeye.oechem.OEMolBase, ligand: openeye.oechem.OEMolBase | None = None) openeye.oechem.OEMolBase

Assemble components of a solvated protein-ligand complex into a single OpenEye molecule.

Parameters:
  • protein (oechem.OEMolBase) – An OpenEye molecule holding the protein of interest.

  • solvent (oechem.OEMolBase) – An OpenEye molecule holding the solvent of interest.

  • ligand (oechem.OEMolBase or None, default=None) – An OpenEye molecule holding the ligand of interest if given.

Returns:

assembled_components – An OpenEye molecule holding protein, solvent and ligand if given.

Return type:

oechem.OEMolBase

static _remove_clashing_water(solvent: openeye.oechem.OEMolBase, ligand: openeye.oechem.OEMolBase | None, protein: openeye.oechem.OEMolBase) openeye.oechem.OEGraphMol

Remove water molecules clashing with a ligand or newly modeled protein residues.

Parameters:
  • solvent (oechem.OEGraphMol) – An OpenEye molecule holding the water molecules.

  • ligand (oechem.OEGraphMol or None) – An OpenEye molecule holding the ligand or None.

  • protein (oechem.OEGraphMol) – An OpenEye molecule holding the protein.

Returns:

An OpenEye molecule holding water molecules not clashing with the ligand or newly modeled protein residues.

Return type:

oechem.OEGraphMol

_update_pdb_header(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: [str, None] = None, other_pdb_header_info: None | Iterable[Tuple[str, str]] = None) openeye.oechem.OEMolBase

Stores information about Featurizer, protein and ligand in the PDB header COMPND section in the given OpenEye molecule.

Parameters:
  • structure (oechem.OEMolBase) – An OpenEye molecule.

  • protein_name (str) – The name of the protein.

  • ligand_name (str or None, default=None) – The name of the ligand if present.

  • other_pdb_header_info (None or iterable of tuple of str) – Tuples with information that should be saved in the PDB header. Each tuple consists of two strings, i.e., the PDB header section (e.g. COMPND) and the respective information.

Returns:

The OpenEye molecule containing the updated PDB header.

Return type:

oechem.OEMolBase

_write_results(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: str | None = None) pathlib.Path

Write the results from the Featurizer and retrieve the paths to protein or complex if a ligand is present.

Parameters:
  • structure (oechem.OEMolBase) – The OpenEye molecule holding the featurized system.

  • protein_name (str) – The name of the protein.

  • ligand_name (str or None, default=None) – The name of the ligand if present.

Returns:

Path to prepared protein or complex if ligand is present.

Return type:

Path