kinoml.features.core
====================

.. py:module:: kinoml.features.core

.. autoapi-nested-parse::

   Featurizers can transform a ``kinoml.core.systems.System`` object and produce
   new representations of the molecular entities and their associated measurements.


Module Contents
---------------

.. py:data:: logger

.. py:class:: BaseFeaturizer

   Abstract Featurizer class.

   .. py:attribute:: _SUPPORTED_TYPES

   .. py:method:: featurize(systems: List[kinoml.core.systems.System], keep=True) -> List[kinoml.core.systems.System]

      Given some systems (compatible with ``_SUPPORTED_TYPES``), apply
      the featurization scheme implemented in this class.

      First, ``self.supports()`` checks whether the systems are compatible
      with the featurization scheme. All systems are assumed to be of the
      same type, so only the first one is checked. Then, the Systems are
      passed to ``self._featurize`` to handle the actual leg-work.

      :param systems: This is the collection of System objects that will be transformed.
      :type systems: list of System
      :param keep: Whether to store the current featurizer in the ``system.featurizations``
                   dictionary with its own key (``self.name``), in addition to ``last``.
      :type keep: bool, optional=True

      :returns: **systems** -- The same systems that were passed in.
                The returned Systems will have an extra entry in the ``.featurizations``
                dictionary, containing the featurized object (either a new System
                or an array-like object) under a key named after ``.name``.
      :rtype: list of System

   .. py:method:: __call__(*args, **kwargs)

      You can also call the instance directly. This forwards to
      ``.featurize()``.

   .. py:method:: _pre_featurize(systems: List[kinoml.core.systems.System]) -> None

      Run before featurizing all systems. Redefine this method if needed.

      :param systems: This is the collection of System objects that will be transformed.
      :type systems: list of System

   .. py:method:: _featurize(systems: List[kinoml.core.systems.System]) -> List[object]

      Featurize all system objects in a serial fashion as defined in ``._featurize_one()``.

      :param systems: This is the collection of System objects that will be transformed.
      :type systems: list of System

      :returns: **features**
      :rtype: list of System or array-like

   .. py:method:: _featurize_one(system: kinoml.core.systems.System) -> object
      :abstractmethod:

      Implement this method to do the actual leg-work for ``self.featurize()``.
      It takes a single System object and returns either a new System object
      or an array-like object.

      :param system: The System to be featurized.
      :type system: System

      :rtype: System or array-like

   .. py:method:: _post_featurize(systems: List[kinoml.core.systems.System], features: List, keep: bool = True) -> List[kinoml.core.systems.System]

      Run after featurizing all systems. Systems with a feature of None will be
      removed and listed in a log file in the current working directory. You
      shouldn't need to redefine this method.

      :param systems: The systems being featurized
      :type systems: list of System
      :param features: The features returned by ``self._featurize``
      :type features: list
      :param keep: Whether to store the current featurizer in the ``system.featurizations``
                   dictionary with its own key (``self.name``), in addition to ``last``.
      :type keep: bool, optional=True

      :returns: **filtered_systems** -- The same systems as passed, but with ``.featurizations``
                extended with the calculated features in two entries: the featurizer
                name and ``last``. Systems with a feature of None will be removed.
      :rtype: list of System

   .. py:method:: supports(*systems: kinoml.core.systems.System, raise_errors: bool = True) -> bool

      Check if these systems are supported by this featurizer.

      Do NOT reimplement in subclass. Reimplement ``._supports()`` instead.

      :param systems: Systems to be checked (by type, contained attributes, etc)
      :type systems: list of System
      :param raise_errors: If True, raise ``ValueError`` if errors were found
      :type raise_errors: bool, optional=True

      :returns: True if all systems are compatible, False otherwise
      :rtype: bool

      :raises ValueError: If ``._supports()`` fails and ``raise_errors`` is ``True``.

   .. py:method:: _supports(system: kinoml.core.systems.System) -> bool

      This is the private method that actually tests
      for compatibility between a single system and
      the current featurizer.

      This is the method you should reimplement in your
      subclass.

      :param system: The system that will be checked
      :type system: System

      :returns: True if compatible, False otherwise
      :rtype: bool

   .. py:property:: name

   .. py:method:: __repr__()


.. py:class:: ParallelBaseFeaturizer(use_multiprocessing: bool = True, n_processes: Union[int, None] = None, chunksize: Union[int, None] = None, dask_client=None, **kwargs)

   Bases: :py:obj:`BaseFeaturizer`

   Abstract Featurizer class with support for multiprocessing.

   :param use_multiprocessing: Whether to use multiprocessing.
   :type use_multiprocessing: bool, default=True
   :param n_processes: How many processes to use in case of multiprocessing.
                       Defaults to number of available CPUs.
   :type n_processes: int or None, default=None
   :param chunksize: See https://stackoverflow.com/a/54032744/3407590.
   :type chunksize: int, optional=None
   :param dask_client: A dask client to manage multiprocessing. If given, the
                       ``use_multiprocessing``, ``chunksize`` and ``n_processes``
                       attributes are ignored.
   :type dask_client: dask.distributed.Client or None, default=None

   .. py:attribute:: _SUPPORTED_TYPES

   .. py:attribute:: use_multiprocessing
      :value: True

   .. py:attribute:: n_processes
      :value: None

   .. py:attribute:: chunksize
      :value: None

   .. py:attribute:: dask_client
      :value: None

   .. py:method:: __getstate__()

      Only preserve object fields that are serializable.

   .. py:method:: __setstate__(state)

      Only preserve object fields that are serializable.

   .. py:method:: _featurize(systems: List[kinoml.core.systems.System]) -> List[object]

      Featurize all system objects in a parallel fashion as defined in ``._featurize_one()``.

      :param systems: This is the collection of System objects that will be transformed.
      :type systems: list of System

      :returns: **features**
      :rtype: list of System or array-like


.. py:class:: Pipeline(featurizers: List[BaseFeaturizer], shortname=None, **kwargs)

   Bases: :py:obj:`BaseFeaturizer`

   Given a list of featurizers, apply them sequentially
   on the systems (e.g. featurizer A returns X, and X is
   taken by featurizer B, which returns Y).

   :param featurizers: Featurizers to stack. They must be compatible with
                       each other!
   :type featurizers: iterable of BaseFeaturizer

   .. note::

      While ``Pipeline`` is a subclass of ``BaseFeaturizer``,
      it should be considered a special case of such. It shares
      the same API, but the implementation details of ``._featurize()``
      are slightly different. It acts as a wrapper around individual
      ``Featurizer`` objects.

   .. py:attribute:: featurizers

   .. py:attribute:: _shortname
      :value: None

   .. py:method:: _featurize(systems: List[kinoml.core.systems.System], keep: bool = True) -> List[object]

      Given a list of featurizers, apply them sequentially
      on the systems (e.g. featurizer A returns X, and X is
      taken by featurizer B, which returns Y) and store the
      features in the systems.

      :param systems: This is the collection of System objects that will be transformed
      :type systems: list of System
      :param keep: Whether to store the current featurizer in the ``system.featurizations``
                   dictionary with its own key (``self.name``), in addition to ``last``.
      :type keep: bool, optional=True

      :returns: **features**
      :rtype: list of System or array-like

   .. py:method:: supports(*systems: kinoml.core.systems.System, raise_errors: bool = False) -> bool

      Check if these systems are supported by all featurizers.

      :param systems: Systems to be checked (by type, contained attributes, etc)
      :type systems: list of System
      :param raise_errors: If True, raise ``ValueError``
      :type raise_errors: bool, optional=False

      :returns: True if all systems are compatible with all featurizers, False otherwise
      :rtype: bool

      :raises ValueError: If ``f.supports()`` fails and ``raise_errors`` is ``True``.

   .. py:property:: name

   .. py:property:: shortname


.. py:class:: Concatenated(featurizers: List[BaseFeaturizer], axis: int = 1, **kwargs)

   Bases: :py:obj:`Pipeline`

   Given a list of featurizers, apply them serially and concatenate
   the result (e.g. featurizer A returns X, and featurizer B returns Y;
   the output is XY).

   :param featurizers: These should take a System or array, but return only arrays
                       so they can be concatenated. Note that the arrays must
                       have the same number of dimensions. If that is not the case,
                       you will need to reshape one of them using ``CallableFeaturizer``
                       and a lambda function that relies on ``np.reshape`` or similar.
   :type featurizers: list of BaseFeaturizer
   :param axis: On which axis to concatenate. By default, it will concatenate
                on axis ``1``, which means that the features in each pipeline
                will be concatenated.
   :type axis: int, optional=1

   .. admonition:: Notes

      This featurizer may be removed in the future, since it can be
      replaced by ``TupleOfArrays``.

   .. py:attribute:: axis
      :value: 1

   .. py:method:: _featurize(systems: List[kinoml.core.systems.System], keep=True) -> numpy.ndarray

      Given a list of featurizers, apply them serially and concatenate
      the result (e.g. featurizer A returns X, and featurizer B returns Y;
      the output is XY).

      :param systems: The Systems (or arrays) to be featurized.
      :type systems: list of System or array-like
      :param keep: Whether to store the current featurizer in the ``system.featurizations``
                   dictionary with its own key (``self.name``), in addition to ``last``.
      :type keep: bool, optional=True

      :returns: Concatenated arrays along specified ``axis``.
      :rtype: np.ndarray


.. py:class:: TupleOfArrays(*args, **kwargs)

   Bases: :py:obj:`Pipeline`

   Given a list of featurizers, apply them serially and return
   the result directly as a flattened tuple of the arrays, for each system.
   E.g., given one system, if featurizer A returns X and featurizer B returns
   Y, Z, the output is a tuple of (X, Y, Z). The final result will be
   a tuple of tuples.

   .. py:method:: _featurize(systems: List[kinoml.core.systems.System], keep: bool = True) -> List

      Given a list of featurizers, apply them serially and build a
      flat tuple out of the results.

      :param systems: The Systems (or arrays) to be featurized.
      :type systems: list of System or array-like
      :param keep: Whether to store the current featurizer in the ``system.featurizations``
                   dictionary with its own key (``self.name``), in addition to ``last``.
      :type keep: bool, optional=True

      :returns: If the last featurizer is returning a single array,
                the shape of the object will be (N_systems,). If
                the last featurizer returns more than one array,
                it will be (N_systems, M_returned_objects).
      :rtype: tuple (of tuples) of array-like


.. py:class:: BaseOneHotEncodingFeaturizer(dictionary: dict = None, **kwargs)

   Bases: :py:obj:`ParallelBaseFeaturizer`

   Base class for Featurizers concerning one-hot encoding.

   .. py:attribute:: ALPHABET
      :value: None

   .. py:attribute:: dictionary
      :value: None

   .. py:method:: _featurize_one(system: Union[kinoml.core.systems.LigandSystem, kinoml.core.systems.ProteinLigandComplex]) -> Union[numpy.ndarray, None]

      One-hot encode one system.

      :param system: The System to be featurized.
      :type system: LigandSystem or ProteinLigandComplex

      :rtype: array or None

   .. py:method:: _retrieve_sequence(system: kinoml.core.systems.System)
      :abstractmethod:

      Implement in your component-specific subclass!

   .. py:method:: one_hot_encode(sequence: Iterable, dictionary: dict | Sequence) -> numpy.ndarray
      :staticmethod:

      One-hot encode a sequence of characters, given a dictionary.

      :param sequence: The sequence of characters to encode.
      :type sequence: Iterable
      :param dictionary: Mapping of each character to their position in the alphabet. If
                         a sequence-like is given, it will be enumerated into a dict.
      :type dictionary: dict or sequence-like

      :returns: One-hot encoded matrix with shape ``(len(dictionary), len(sequence))``
      :rtype: array-like


.. py:class:: PadFeaturizer(shape: Iterable[int] = 'auto', key: Hashable = 'last', pad_with: int = 0, **kwargs)

   Bases: :py:obj:`ParallelBaseFeaturizer`

   Pads features of a given system to a desired size or length.

   This class wraps ``numpy.pad`` with ``mode="constant"``, auto-calculating
   the needed additions to match the requested shape.

   :param shape: The desired size of the transformed features. If "auto", shape
                 will be estimated from the Dataset passed at runtime so it matches
                 the largest observed.
   :type shape: tuple of int, or "auto"
   :param key: Element to retrieve from ``System.featurizations``
   :type key: hashable
   :param pad_with: Value to fill the array-like features with
   :type pad_with: int

   .. py:attribute:: shape
      :value: 'auto'

   .. py:attribute:: key
      :value: 'last'

   .. py:attribute:: pad_with
      :value: 0

   .. py:method:: _get_array(system_or_array: kinoml.core.systems.System | numpy.ndarray) -> numpy.ndarray

   .. py:method:: _pre_featurize(systems) -> None

      Compute the largest shape in the input arrays and store it in the ``shape`` attribute.

      :param systems:
      :type systems: list of System

   .. py:method:: _featurize_one(system: kinoml.core.systems.System) -> numpy.ndarray

      :param system: The System (or array) to be featurized.
      :type system: System or array-like
      :param options: Must contain a key ``shape`` with the expected final shape of the systems.
      :type options: dict

      :rtype: array


.. py:class:: HashFeaturizer(getter: Callable[[kinoml.core.systems.System], str] = None, normalize=True, **kwargs)

   Bases: :py:obj:`BaseFeaturizer`

   Hash an attribute of the protein, such as the name or id.

   :param getter: A function or lambda that takes a System and returns a string
                  to be hashed. By default, it returns whatever
                  ``system.featurizations["last"]`` contains, as a string.
   :type getter: callable, optional
   :param normalize: Normalizes the hash to obtain a value in the unit interval
   :type normalize: bool, default=True

   .. py:attribute:: getter

   .. py:attribute:: normalize
      :value: True

   .. py:attribute:: denominator
      :value: 115792089237316195423570985008687907853269984665640564039457584007913129639936

   .. py:method:: _getter(system)
      :staticmethod:

   .. py:method:: _featurize_one(system: kinoml.core.systems.System) -> numpy.ndarray

      Featurizes a component using the hash of the chosen attribute.

      :param system: The System to be featurized.
      :type system: System

      :returns: Sha256'd attribute
      :rtype: array


.. py:class:: NullFeaturizer(**kwargs)

   Bases: :py:obj:`ParallelBaseFeaturizer`

   Abstract Featurizer class with support for multiprocessing.

   :param use_multiprocessing: Whether to use multiprocessing.
   :type use_multiprocessing: bool, default=True
   :param n_processes: How many processes to use in case of multiprocessing.
                       Defaults to number of available CPUs.
   :type n_processes: int or None, default=None
   :param chunksize: See https://stackoverflow.com/a/54032744/3407590.
   :type chunksize: int, optional=None
   :param dask_client: A dask client to manage multiprocessing. If given, the
                       ``use_multiprocessing``, ``chunksize`` and ``n_processes``
                       attributes are ignored.
   :type dask_client: dask.distributed.Client or None, default=None

   .. py:method:: _featurize(systems: Iterable[kinoml.core.systems.System], keep: bool = None) -> object

      Featurize all system objects in a parallel fashion as defined in ``._featurize_one()``.

      :param systems: This is the collection of System objects that will be transformed.
      :type systems: list of System

      :returns: **features**
      :rtype: list of System or array-like


.. py:class:: CallableFeaturizer(func: Callable[[kinoml.core.systems.System], kinoml.core.systems.System | numpy.array] | str = None, **kwargs)

   Bases: :py:obj:`BaseFeaturizer`

   Apply an arbitrary callable to a System.

   :param func: Must take a System and return a System or array. If ``str``,
                it will be ``eval``'d into a callable. If None, the default
                callable will return ``system.featurizations["last"]`` for
                each system.
   :type func: callable or str or None

   .. py:attribute:: callable
      :value: None

   .. py:method:: _default_func(system)
      :staticmethod:

   .. py:method:: _featurize_one(system: kinoml.core.systems.System | numpy.ndarray) -> numpy.ndarray

      :param system: The System (or array) to be featurized.
      :type system: System or array-like
      :param options: Unused
      :type options: dict

      :rtype: array-like


.. py:class:: ClearFeaturizations(keys=('last', ), style='keep', **kwargs)

   Bases: :py:obj:`BaseFeaturizer`

   Remove keys from the ``.featurizations`` dictionary in each
   ``System`` object. By default, it will remove all keys
   that are not ``last``.

   :param keys: Which keys to keep or remove, depending on ``style``.
   :type keys: tuple of str, optional=("last",)
   :param style: Whether to ``keep`` or ``remove`` the entries passed as ``keys``.
   :type style: str, optional="keep"

   .. py:attribute:: keys
      :value: ('last',)

   .. py:attribute:: style
      :value: 'keep'

   .. py:method:: _featurize_one(system: kinoml.core.systems.System) -> kinoml.core.systems.System

      Implement this method to do the actual leg-work for ``self.featurize()``.
      It takes a single System object and returns either a new System object
      or an array-like object.

      :param system: The System to be featurized.
      :type system: System

      :rtype: System or array-like

   .. py:method:: _post_featurize(systems: Iterable[kinoml.core.systems.System], features: Iterable[kinoml.core.systems.System | numpy.array], keep: bool = True) -> Iterable[kinoml.core.systems.System]

      Bypass the automated population of the ``.featurizations`` dict
      in each System.


.. py:class:: OEBaseModelingFeaturizer(loop_db: Union[str, None] = None, cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, **kwargs)

   Bases: :py:obj:`ParallelBaseFeaturizer`

   This abstract class defines several methods that use functionality from the OpenEye toolkit
   for molecular modeling. Featurizers that subclass ``OEBaseModelingFeaturizer`` need to
   implement at least the ``_featurize_one`` method.

   :param loop_db: The path to the loop database used by OESpruce to model missing loops.
   :type loop_db: str
   :param cache_dir: Path to directory used for saving intermediate files. If None, the default
                     location provided by ``appdirs.user_cache_dir()`` will be used.
   :type cache_dir: str, Path or None, default=None
   :param output_dir: Path to directory used for saving output files. If None, output structures
                      will not be saved.
   :type output_dir: str, Path or None, default=None

   .. py:attribute:: loop_db
      :value: None

   .. py:attribute:: cache_dir

   .. py:attribute:: output_dir
      :value: None

   .. py:method:: _read_protein_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) -> Union[oechem.OEGraphMol, None]

      Returns the protein structure of the given protein object as an OpenEye molecule.

      :param protein: The protein object.
      :type protein: Protein or KLIFSKinase

      :returns: The protein structure as an OpenEye molecule or None.
      :rtype: oechem.OEGraphMol or None

      :raises ValueError: If wrong toolkit was used during initialization of the protein object.

   .. py:method:: _get_design_unit(structure: openeye.oechem.OEMolBase, chain_id: Union[str, None], alternate_location: Union[str, None], has_ligand: bool, ligand_name: Union[str, None], model_loops_and_caps: bool) -> Union[openeye.oechem.OEDesignUnit, None]

      Get an OpenEye design unit based on the given input.

      :param structure: An OpenEye molecule holding the protein structure to prepare.
      :type structure: oechem.OEMolBase
      :param chain_id: The chain ID of interest.
      :type chain_id: str or None
      :param alternate_location: The alternate location of interest.
      :type alternate_location: str or None
      :param has_ligand: If design unit generation should consider ligands. If True, design units
                         will be only generated for protein ligand complexes. If False, design
                         units will not consider co-crystallized ligands.
      :type has_ligand: bool
      :param ligand_name: The ligand expo ID bound to the protein of interest. Design units will
                          be filtered to contain the respective ligand.
      :type ligand_name: str or None
      :param model_loops_and_caps: If loops and caps should be modeled.
      :type model_loops_and_caps: bool

      :returns: **design_unit** -- The design unit or None if no design unit was found.
      :rtype: oechem.OEDesignUnit or None

   .. py:method:: _get_components(design_unit: openeye.oechem.OEDesignUnit, chain_id: Union[str, None]) -> Tuple[oechem.OEGraphMol(), oechem.OEGraphMol(), oechem.OEGraphMol()]
      :staticmethod:

      Get protein, solvent and ligand components from an OpenEye design unit.

      :param design_unit: The OpenEye design unit to extract components from.
      :type design_unit: oechem.OEDesignUnit
      :param chain_id: The chain ID of interest.
      :type chain_id: str or None

      :returns: **components** -- OpenEye molecules holding protein, solvent and ligand.
      :rtype: tuple of oechem.OEGraphMol, oechem.OEGraphMol and oechem.OEGraphMol

   .. py:method:: _process_protein(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1, ligand: Union[oechem.OEMolBase, None] = None) -> oechem.OEMolBase

      Process a protein structure according to the given amino acid sequence.

      :param protein_structure: An OpenEye molecule holding the protein structure to process.
      :type protein_structure: oechem.OEMolBase
      :param amino_acid_sequence: The amino acid sequence with associated metadata.
      :type amino_acid_sequence: str
      :param first_id: The ID of the first amino acid in the given sequence, e.g. if only a part of a protein was expressed and used in experiment.
      :type first_id: int, default=1
      :param ligand: An OpenEye molecule that should be checked for heavy atom clashes with
                     built insertions.
      :type ligand: oechem.OEMolBase or None, default=None

      :returns: An OpenEye molecule holding the processed protein structure.
      :rtype: oechem.OEMolBase

   .. py:method:: _get_protein_residue_numbers(protein_structure: oechem.OEMolBase, amino_acid_sequence: str, first_id: int = 1) -> List[int]
      :staticmethod:

      Get the residue numbers of a protein structure according to given amino acid sequence.

      :param protein_structure: The kinase domain structure.
      :type protein_structure: oechem.OEMolBase
      :param amino_acid_sequence: The template amino acid sequence.
      :type amino_acid_sequence: core.sequences.AminoAcidSequence
      :param first_id: The ID of the first amino acid in the given sequence, e.g. if only a part
                       of a protein was expressed and used in experiment.
      :type first_id: int, default=1

      :returns: **residue_number** -- A list of residue numbers according to the given amino acid
                sequence in the same order as the residues in the given protein structure.
      :rtype: list of int

   .. py:method:: _assemble_components(protein: openeye.oechem.OEMolBase, solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None] = None) -> openeye.oechem.OEMolBase

      Assemble components of a solvated protein-ligand complex into a single OpenEye molecule.

      :param protein: An OpenEye molecule holding the protein of interest.
      :type protein: oechem.OEMolBase
      :param solvent: An OpenEye molecule holding the solvent of interest.
      :type solvent: oechem.OEMolBase
      :param ligand: An OpenEye molecule holding the ligand of interest if given.
      :type ligand: oechem.OEMolBase or None, default=None

      :returns: **assembled_components** -- An OpenEye molecule holding protein, solvent and
                ligand if given.
      :rtype: oechem.OEMolBase

   .. py:method:: _remove_clashing_water(solvent: openeye.oechem.OEMolBase, ligand: Union[openeye.oechem.OEMolBase, None], protein: openeye.oechem.OEMolBase) -> openeye.oechem.OEGraphMol
      :staticmethod:

      Remove water molecules clashing with a ligand or newly modeled protein residues.

      :param solvent: An OpenEye molecule holding the water molecules.
      :type solvent: oechem.OEGraphMol
      :param ligand: An OpenEye molecule holding the ligand or None.
      :type ligand: oechem.OEGraphMol or None
      :param protein: An OpenEye molecule holding the protein.
      :type protein: oechem.OEGraphMol

      :returns: An OpenEye molecule holding water molecules not clashing with the ligand or
                newly modeled protein residues.
      :rtype: oechem.OEGraphMol

   .. py:method:: _update_pdb_header(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: [str, None] = None, other_pdb_header_info: Union[None, Iterable[Tuple[str, str]]] = None) -> openeye.oechem.OEMolBase

      Stores information about Featurizer, protein and ligand in the PDB header COMPND section
      in the given OpenEye molecule.

      :param structure: An OpenEye molecule.
      :type structure: oechem.OEMolBase
      :param protein_name: The name of the protein.
      :type protein_name: str
      :param ligand_name: The name of the ligand if present.
      :type ligand_name: str or None, default=None
      :param other_pdb_header_info: Tuples with information that should be saved in the PDB
                                    header. Each tuple consists of two strings, i.e., the PDB
                                    header section (e.g. COMPND) and the respective information.
      :type other_pdb_header_info: None or iterable of tuple of str

      :returns: The OpenEye molecule containing the updated PDB header.
      :rtype: oechem.OEMolBase

   .. py:method:: _write_results(structure: openeye.oechem.OEMolBase, protein_name: str, ligand_name: Union[str, None] = None) -> pathlib.Path

      Write the results from the Featurizer and retrieve the paths to protein or complex if a
      ligand is present.

      :param structure: The OpenEye molecule holding the featurized system.
      :type structure: oechem.OEMolBase
      :param protein_name: The name of the protein.
      :type protein_name: str
      :param ligand_name: The name of the ligand if present.
      :type ligand_name: str or None, default=None

      :returns: Path to prepared protein or complex if ligand is present.
      :rtype: Path
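
The workflow documented for ``BaseFeaturizer`` (``supports()`` checks the first system, ``_featurize_one()`` does the per-system work, and post-processing drops systems whose feature is ``None`` while filling ``.featurizations``) can be illustrated with a small, dependency-free sketch. It is not the kinoml implementation: ``ToySystem`` and ``ToyOneHotFeaturizer`` are hypothetical stand-ins, the one-hot matrix is a plain list of lists rather than a NumPy array, and the log file written for removed systems is omitted.

```python
class ToySystem:
    """Hypothetical stand-in for kinoml.core.systems.System."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.featurizations = {}


def one_hot_encode(sequence, dictionary):
    """Return a matrix of shape (len(dictionary), len(sequence))."""
    # A sequence-like alphabet is enumerated into a {character: index} dict.
    if not isinstance(dictionary, dict):
        dictionary = {char: index for index, char in enumerate(dictionary)}
    matrix = [[0] * len(sequence) for _ in dictionary]
    for column, char in enumerate(sequence):
        matrix[dictionary[char]][column] = 1
    return matrix


class ToyOneHotFeaturizer:
    """Illustrative featurizer mirroring the documented method names."""

    ALPHABET = "ACGT"  # assumption: a toy alphabet for this example
    name = "ToyOneHotFeaturizer"

    def _supports(self, system):
        return isinstance(system, ToySystem)

    def supports(self, *systems, raise_errors=True):
        ok = all(self._supports(system) for system in systems)
        if not ok and raise_errors:
            raise ValueError("Some systems are not supported")
        return ok

    def _featurize_one(self, system):
        # Returning None marks the system for removal in post-processing.
        if any(char not in self.ALPHABET for char in system.sequence):
            return None
        return one_hot_encode(system.sequence, self.ALPHABET)

    def featurize(self, systems, keep=True):
        # Only the first system is checked, as described for BaseFeaturizer.
        self.supports(systems[0])
        features = [self._featurize_one(system) for system in systems]
        filtered = []
        for system, feature in zip(systems, features):
            if feature is None:
                continue
            system.featurizations["last"] = feature
            if keep:
                system.featurizations[self.name] = feature
            filtered.append(system)
        return filtered


systems = [ToySystem("ACGT"), ToySystem("AXG")]
result = ToyOneHotFeaturizer().featurize(systems)
```

Here the second toy system contains a character outside the alphabet, so its feature is ``None`` and it is filtered out of ``result``; the first system keeps its matrix under both ``last`` and the featurizer name.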