kinoml.features.complexes¶

Featurizers that can only be applied to ProteinLigandComplex systems or subclasses thereof.

Module Contents¶

kinoml.features.complexes.logger¶

class kinoml.features.complexes.SingleLigandProteinComplexFeaturizer(**kwargs)¶

Bases: kinoml.features.core.ParallelBaseFeaturizer

Provides a minimally useful ._supports() method for all ProteinLigandComplex-like featurizers.

_COMPATIBLE_PROTEIN_TYPES¶

_COMPATIBLE_LIGAND_TYPES¶

_supports(system: kinoml.core.systems.ProteinLigandComplex) → bool¶: Check that exactly one protein and one ligand are present in the System
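
The check described above can be sketched as follows. This is only an illustration under stated assumptions: `FakeProtein`, `FakeLigand`, `FakeSystem` and `supports` are hypothetical stand-ins, not the KinoML classes, and the real method may inspect the system differently.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for KinoML components; not the real classes.
@dataclass
class FakeProtein:
    name: str = "protein"


@dataclass
class FakeLigand:
    name: str = "ligand"


@dataclass
class FakeSystem:
    components: List[object] = field(default_factory=list)


def supports(system: FakeSystem) -> bool:
    """Return True only if exactly one protein and one ligand are present."""
    n_proteins = sum(isinstance(c, FakeProtein) for c in system.components)
    n_ligands = sum(isinstance(c, FakeLigand) for c in system.components)
    return n_proteins == 1 and n_ligands == 1
```

A system with two ligands, or with no ligand at all, would be rejected by this sketch.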

class kinoml.features.complexes.MostSimilarPDBLigandFeaturizer(similarity_metric: str = 'fingerprint', cache_dir: str | pathlib.Path | None = None, **kwargs)¶

Bases: SingleLigandProteinComplexFeaturizer

Find the most similar co-crystallized ligand in the PDB according to a given SMILES and UniProt ID.

The protein component of each system must be a core.proteins.Protein or a subclass thereof, and must be initialized with a uniprot_id parameter.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES.

Parameters:

similarity_metric (str, default="fingerprint") – The similarity metric to use to detect the structure with the most similar ligand [“fingerprint”, “mcs”, “openeye_shape”, “schrodinger_shape”].
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

Note

The toolkit [‘MDAnalysis’ or ‘OpenEye’] specified in the protein object initialization should fit the required toolkit when subsequently applying the OEDockingFeaturizer or SCHRODINGERDockingFeaturizer.

_SUPPORTED_TYPES¶

_SUPPORTED_SIMILARITY_METRICS = ('fingerprint', 'mcs', 'openeye_shape', 'schrodinger_shape')¶

similarity_metric = 'fingerprint'¶

cache_dir¶

_pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) → None¶: Check that SCHRODINGER variable exists.

_check_schrodinger()¶: Check that SCHRODINGER variable exists.

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) → kinoml.core.systems.ProteinLigandComplex | None¶

Find a PDB entry with a protein of the given UniProt ID and with the most similar co-crystallized ligand.

Parameters:: system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
Returns:: The same system, but with additional protein attributes, i.e. pdb_id, chain_id and expo_id. None if no suitable PDB entry was found.
Return type:: ProteinLigandComplex or None

_post_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex], features: List[kinoml.core.systems.ProteinLigandComplex], keep: bool = True) → List[kinoml.core.systems.ProteinLigandComplex]¶

Run after featurizing all systems. Original systems will be replaced with systems returned by the featurizer. Systems that were not successfully featurized will be removed and listed in a log file in the current working directory.

Parameters:

systems (list of ProteinLigandComplex) – The systems being featurized.
features (list of ProteinLigandComplex) – The features returned by self._featurize, i.e. new systems.
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns:

The new systems with .featurizations extended with the calculated features in two entries: the featurizer name and last.

Return type:

list of ProteinLigandComplex
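
The replace-and-filter behavior described for this method can be sketched in plain Python. The function name `drop_failed` is hypothetical and strings stand in for system objects; the real method additionally updates `.featurizations` and writes the failures to a log file, which this sketch omits.

```python
from typing import List, Optional, Tuple


def drop_failed(
    systems: List[str], features: List[Optional[str]]
) -> Tuple[List[str], List[str]]:
    """Split results into successfully featurized systems and failures.

    A feature of None marks a failed featurization. Successful features
    (here: the new systems) replace the original systems.
    """
    kept, failed = [], []
    for system, feature in zip(systems, features):
        if feature is None:
            failed.append(system)  # would be listed in the log file
        else:
            kept.append(feature)  # new system replaces the original one
    return kept, failed
```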

_get_pdb_ligand_entities(uniprot_id: str) → pandas.DataFrame | None¶

Get PDB ligand entities bound to protein structures of the given UniProt ID. Only X-ray structures will be considered. If a ligand is co-crystallized in multiple PDB structures, the ligand entity from the structure with the lowest resolution value will be returned.

Parameters:: uniprot_id (str) – The UniProt ID of the protein of interest.
Returns:: A DataFrame with columns ligand_entity, pdb_id, non_polymer_id, chain_id, expo_id and resolution. None if no suitable ligand entities were found.
Return type:: pd.DataFrame or None
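
A minimal sketch of the de-duplication step, assuming that "lowest resolution" means the smallest numeric resolution value. The function name `best_entity_per_ligand` is hypothetical, and plain dictionaries are used instead of a DataFrame to keep the example dependency-free.

```python
from typing import Dict, List


def best_entity_per_ligand(entities: List[Dict]) -> List[Dict]:
    """Keep, for each expo_id, the entity with the smallest resolution value."""
    best: Dict[str, Dict] = {}
    for entity in entities:
        current = best.get(entity["expo_id"])
        if current is None or entity["resolution"] < current["resolution"]:
            best[entity["expo_id"]] = entity
    return list(best.values())
```

Entities that carry the dummy resolution of 99.9 would only be kept by this sketch if no better-resolved entity exists for the same ligand.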

static _add_ligand_entity_info(pdb_ligand_entities: pd.DataFrame) → pd.DataFrame¶

Add chain and expo ID information to the PDB ligand entities dataframe.

Parameters:: pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with a column named ligand_entity. This column must contain strings in the format ‘4YNE_3’, i.e. the third non polymer entity of PDB entry 4YNE.
Returns:: The same PDB ligand entities dataframe but with additional columns named chain_id and expo_id. PDB ligand entities without such information are removed.
Return type:: pd.DataFrame
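
The ligand_entity format described above can be taken apart with a simple split. The helper name `split_ligand_entity` is hypothetical and only illustrates the string format; it is not part of KinoML.

```python
from typing import Tuple


def split_ligand_entity(ligand_entity: str) -> Tuple[str, int]:
    """Split '4YNE_3' into the PDB ID '4YNE' and the entity number 3."""
    pdb_id, entity_number = ligand_entity.rsplit("_", 1)
    return pdb_id, int(entity_number)
```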

static _add_pdb_resolution(pdb_ligand_entities: pd.DataFrame) → pd.DataFrame¶

Add resolution information to the PDB ligand entities dataframe.

Parameters:: pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with a column named pdb_id. This column must contain strings in the format ‘4YNE’, i.e. PDB entry 4YNE.
Returns:: The same PDB ligand entities dataframe but with an additional column named resolution. PDB ligand entities without such information will get a dummy resolution of 99.9.
Return type:: pd.DataFrame

_get_most_similar_pdb_ligand_entity(pdb_ligand_entities: pd.DataFrame, smiles: str) → Tuple[str, str, str]¶

Get the PDB ligand that is most similar to the given SMILES.

Parameters:: pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id and expo_id.
Returns:: The PDB, chain and expo ID of the most similar ligand.
Return type:: tuple of str

static _by_fingerprint(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) → pandas.DataFrame¶

Get the PDB ligands that are most similar to the given SMILES according to Morgan Fingerprints.

Parameters:

pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 0.87 and the max_similarity_cutoff is set to 0.1, all ligands will be returned with a similarity of 0.77 or higher.

Returns:

The most similar ligands.

Return type:

pd.DataFrame
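
The effect of max_similarity_cutoff can be sketched as below. The sketch covers only the selection step, assumes the similarities are already computed, and uses a hypothetical function name and a plain dictionary instead of a DataFrame.

```python
from typing import Dict


def select_by_similarity(
    similarities: Dict[str, float], max_similarity_cutoff: float = 0.0
) -> Dict[str, float]:
    """Keep entries whose similarity is within the cutoff of the best score."""
    if not similarities:
        return {}
    threshold = max(similarities.values()) - max_similarity_cutoff
    return {key: value for key, value in similarities.items() if value >= threshold}
```

With the default cutoff of 0.0 only the entries sharing the highest similarity are kept; larger cutoffs widen the selection.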

static _by_mcs(pdb_ligand_entities: pd.DataFrame, smiles: str, max_bonds_cutoff: float = 0.0) → pd.DataFrame¶

Get the PDB ligands that are most similar to the given SMILES according to the number of bonds in the maximum common substructures.

Parameters:

pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_bonds_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected number of MCS bonds and the possible maximum of MCS bonds. The possible maximum number is calculated from the number of bonds in the given smiles. If the possible maximum number is 35, the highest number of detected mcs bonds is 20 and the max_bonds_cutoff is 0.1, all ligands will be returned with a number of MCS bonds of 16.5 (20 - (35 * 0.1)) or higher.

Returns:

The most similar ligands.

Return type:

pd.DataFrame
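
In contrast to the fingerprint case, the threshold here is scaled by the number of bonds in the query molecule. A minimal sketch of that arithmetic, with hypothetical names and precomputed MCS bond counts:

```python
from typing import Dict


def select_by_mcs_bonds(
    mcs_bonds: Dict[str, int], query_bonds: int, max_bonds_cutoff: float = 0.0
) -> Dict[str, int]:
    """Keep ligands whose MCS bond count reaches the scaled threshold.

    threshold = highest detected MCS bond count - query_bonds * max_bonds_cutoff
    """
    if not mcs_bonds:
        return {}
    threshold = max(mcs_bonds.values()) - query_bonds * max_bonds_cutoff
    return {key: n for key, n in mcs_bonds.items() if n >= threshold}
```

For a query with 35 bonds, a best hit of 20 MCS bonds and a cutoff of 0.1, the threshold is about 16.5, so ligands with 17 or more MCS bonds are kept.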

_by_schrodinger_shape(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) → pandas.DataFrame¶

Get the PDB ligands that are most similar to the given SMILES according to SCHRODINGER shape_screen.

Parameters:

pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 0.87 and the max_similarity_cutoff is set to 0.1, all ligands will be returned with a similarity of 0.77 or higher.

Returns:

The most similar ligands.

Return type:

pd.DataFrame

_by_openeye_shape(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) → pandas.DataFrame¶

Get the PDB ligands that are most similar to the given SMILES according to OpenEye’s TanimotoCombo score.

Parameters:

pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 1.31 and the max_similarity_cutoff is set to 0.2, all ligands will be returned with a similarity of 1.11 or higher.

Returns:

The most similar ligands.

Return type:

pd.DataFrame

class kinoml.features.complexes.KLIFSConformationTemplatesFeaturizer(**kwargs)¶

Bases: MostSimilarPDBLigandFeaturizer

Find suitable kinase templates for modeling a kinase:inhibitor complex in different KLIFS conformations.

The protein component of each system must be a core.proteins.KLIFSKinase, and must be initialized with a uniprot_id or kinase_klifs_id parameter.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES.

Parameters:

similarity_metric (str, default="fingerprint") – The similarity metric to use to detect the structures with similar ligands [“fingerprint”, “mcs”, “openeye_shape”, “schrodinger_shape”].
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

_COMPATIBLE_PROTEIN_TYPES¶

_pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) → None¶: Check SCHRODINGER variable and fetch KLIFS data.

_create_klifs_structure_db()¶: Fetch structure data from KLIFS.

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) → pandas.DataFrame¶

Find PDB entries for different KLIFS conformations with a similar co-crystallized ligand and a similar pocket sequence.

Parameters:: system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
Returns:: A dataframe with columns for dfg, ac_helix, pdb_id, chain_id, expo_id, ligand_similarity and sequence_similarity.
Return type:: DataFrame

static _filter_structures(structures: pd.DataFrame) → pd.DataFrame¶

Filter KLIFS entries for the presence of exactly one orthosteric ligand and determined KLIFS conformation, and remove duplicates.

Parameters:: structures (DataFrame) – The KLIFS entries to filter, need to contain the columns ligand.expo_id, structure.pdb_id, structure.dfg, structure.ac_helix, structure.qualityscore, structure.resolution, structure.chain and structure.alternate_model.
Returns:: The filtered KLIFS entries.
Return type:: DataFrame

_get_most_similar_klifs_ligand_entity(structures: pd.DataFrame, smiles: str, klifs_sequence: str) → Tuple[str, str, str, str, str]¶

Get the KLIFS entry that is most similar to the given SMILES and KLIFS pocket sequence.

Parameters:: structures (pd.DataFrame) – The KLIFS entries dataframe with columns named structure.pdb_id, structure.chain, structure.expo_id, smiles and structure.pocket.
Returns:: The PDB ID, chain ID, expo ID, ligand similarity and pocket similarity of the KLIFS entry with the most similar ligand and KLIFS pocket sequence.
Return type:: tuple of str

static _by_klifs_sequence(klifs_structures: pd.DataFrame, reference_klifs_sequence: str, max_similarity_cutoff: float = 0.0) → pd.DataFrame¶

Get the KLIFS entries that are most similar to the given pocket sequence.

Parameters:

klifs_structures (pd.DataFrame) – The KLIFS entries dataframe with a column named structure.pocket.
reference_klifs_sequence (str) – The sequence for calculating the similarity.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar sequences based on the highest detected sequence similarity and the possible maximum of sequence similarity. The possible maximum sequence similarity is aligning the reference sequence to itself. If the possible maximum sequence similarity is 450, the highest detected sequence similarity is 320 and the max_similarity_cutoff is 0.1, all entries will be returned with a sequence similarity of 275 (320 - (450 * 0.1)) or higher.

Returns:

The KLIFS entries with the most similar pocket sequences.

Return type:

pd.DataFrame

_post_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex], features: List, keep: bool = True) → List[kinoml.core.systems.ProteinLigandComplex]¶

Run after featurizing all systems. Systems with a feature of None will be removed and listed in a log file in the current working directory. You shouldn’t need to redefine this method.

Parameters:

systems (list of System) – The systems being featurized
features (list) – The features returned by self._featurize
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns:

filtered_systems – The same systems as passed, but with .featurizations extended with the calculated features in two entries: the featurizer name and last. Systems with a feature of None will be removed.

Return type:

list of System

class kinoml.features.complexes.OEComplexFeaturizer(**kwargs)¶

Bases: kinoml.features.core.OEBaseModelingFeaturizer, SingleLigandProteinComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the complex structure by:

modeling missing loops with OESpruce according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

building missing side chains

modeling substitutions, deletions and insertions with OESpruce if a uniprot_id or sequence attribute is provided for the protein component; if an alteration cannot be modeled, the corresponding mismatch in the structure will be deleted

removing everything but protein, water and ligand of interest

protonation at pH 7.4

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’OpenEye’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

name: A string specifying the name of the protein, will be used for generating the output file name.

chain_id: A string specifying which chain should be used.

alternate_location: A string specifying which alternate location should be used.

expo_id: A string specifying the ligand of interest. This is especially useful if multiple ligands are present in a PDB structure.

uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.
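
The precedence between the three sequence sources can be summarized in a few lines. The function `sequence_source` is a hypothetical helper that only mirrors the rules stated above; it does not fetch or model anything.

```python
from typing import Optional


def sequence_source(
    sequence: Optional[str] = None, uniprot_id: Optional[str] = None
) -> str:
    """Return which source determines the sequence used for modeling."""
    if sequence is not None:
        return "sequence"  # supersedes uniprot_id and the PDB header
    if uniprot_id is not None:
        return "uniprot_id"  # supersedes the PDB header
    return "pdb_header"
```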

The ligand component of each system must be a core.components.BaseLigand or a subclass thereof. The ligand component can have the following optional attributes:

name: A string specifying the name of the ligand, will be used for generating the output file name.

Parameters:

loop_db (str) – The path to the loop database used by OESpruce to model missing loops.
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

Note

If the ligand of interest is covalently bonded to the protein, the covalent bond will be broken. This may lead to the transformation of the ligand into a radical.

_SUPPORTED_TYPES¶

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) → Universe | None¶

Prepare a protein structure.

Parameters:: system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
Returns:: An MDAnalysis universe of the featurized system. None if no design unit was found.
Return type:: Universe or None

class kinoml.features.complexes.OEDockingFeaturizer(method: str = 'Posit', pKa_norm: bool = True, **kwargs)¶

Bases: kinoml.features.core.OEBaseModelingFeaturizer, SingleLigandProteinComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the structure and dock the ligand into the prepared protein structure with one of OpenEye’s docking algorithms:

modeling missing loops with OESpruce according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

building missing side chains

modeling substitutions, deletions and insertions with OESpruce if a uniprot_id or sequence attribute is provided for the protein component; if an alteration cannot be modeled, the corresponding mismatch in the structure will be deleted

removing everything but protein, water and ligand of interest

protonation at pH 7.4

perform docking

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’OpenEye’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

name: A string specifying the name of the protein, will be used for generating the output file name.

chain_id: A string specifying which chain should be used.

alternate_location: A string specifying which alternate location should be used.

expo_id: A string specifying a ligand bound to the protein of interest. This is especially useful if multiple proteins are found in one PDB structure.

uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.

pocket_resids: List of integers specifying the residues in the binding pocket of interest. This attribute is required if docking with Fred into an apo structure.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES. Additionally, the ligand component can have the following optional attributes:

name: A string specifying the name of the ligand, will be used for generating the output file name.

Parameters:

method (str, default="Posit") – The docking method to use [“Fred”, “Hybrid”, “Posit”].
loop_db (str) – The path to the loop database used by OESpruce to model missing loops.
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
pKa_norm (bool, default=True) – Assign the predominant ionization state of the molecules to dock at pH ~7.4. If False, the ionization state of the input molecules will be conserved.

method = 'Posit'¶

pKa_norm = True¶

_SUPPORTED_TYPES¶

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) → Universe | None¶

Prepare a protein structure and dock a ligand using the OpenEye docking method selected via method.

Parameters:: system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
Returns:: An MDAnalysis universe of the featurized system. None if no design unit or docking pose was found.
Return type:: Universe or None

static _store_docking_score(structure: Universe, docking_pose: openeye.oechem.OEGraphMol)¶

Store the docking score from OpenEye docking in the MDAnalysis universe._topology. If the Posit probability is available it will be stored as well. They cannot be stored in the universe object directly, because they will be lost during multiprocessing/pickling.

Parameters:

structure (Universe) – The docked structure as MDAnalysis universe.
docking_pose (oechem.OEGraphMol) – The docking pose.

class kinoml.features.complexes.SCHRODINGERComplexFeaturizer(cache_dir: str | pathlib.Path | None = None, output_dir: str | pathlib.Path | None = None, max_retry: int = 3, build_loops: bool = True, **kwargs)¶

Bases: SingleLigandProteinComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the complex structure by:

modeling missing loops with Prime according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

building missing side chains

modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component: alterations are first deleted and the intended sequence is subsequently modeled with Prime; if an alteration cannot be modeled, a corresponding deletion will remain

removing everything but protein, water and ligand of interest

protonation at pH 7.4

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’MDAnalysis’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

name: A string specifying the name of the protein, will be used for generating the output file name.

chain_id: A string specifying which chain should be used.

alternate_location: A string specifying which alternate location should be used.

expo_id: A string specifying the ligand of interest. This is especially useful if multiple ligands are present in a PDB structure.

uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.

The ligand component of each system must be a core.components.BaseLigand or a subclass thereof. The ligand component can have the following optional attributes:

name: A string specifying the name of the ligand, will be used for generating the output file name.

Parameters:

cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
max_retry (int, default=3) – The maximal number of attempts to try running the prepwizard step.
build_loops (bool, default=True) – Whether missing loops should be built. This is also needed to model mutations.
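
The role of max_retry can be illustrated with a generic retry loop. This is a sketch under assumptions, not the featurizer's code: `run_with_retry` is a hypothetical helper, and it assumes that a failed attempt is signaled by a return value of None.

```python
from typing import Callable, Optional


def run_with_retry(
    step: Callable[[], Optional[str]], max_retry: int = 3
) -> Optional[str]:
    """Call step up to max_retry times and return the first non-None result."""
    for _attempt in range(max_retry):
        result = step()
        if result is not None:
            return result
    return None  # all attempts failed
```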

cache_dir¶

output_dir = None¶

max_retry = 3¶

build_loops = True¶

_SUPPORTED_TYPES¶

_pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) → None¶: Check that SCHRODINGER variable exists.

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) → Universe | None¶

Prepare a protein structure.

Parameters:: system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
Returns:: An MDAnalysis universe of the featurized system or None if not successful.
Return type:: Universe or None

static _system_to_name(system: kinoml.core.systems.ProteinLigandComplex) → str¶

Get a name of the system based on attributes of the protein and ligand component.

Parameters:: system (ProteinLigandComplex) – The system with protein and ligand component.
Returns:: A descriptive name of the system
Return type:: str

_prepare_structure(protein: kinoml.core.proteins.Protein | kinoml.core.proteins.KLIFSKinase) → pathlib.Path | None¶

Prepare the structure with SCHRODINGER’s prepwizard.

Parameters:: protein (Path) – The path to the input structure file in PDB format.
Returns:: The path to the prepared structure if successful.
Return type:: Path or None

_read_protein_structure(protein: kinoml.core.proteins.Protein | kinoml.core.proteins.KLIFSKinase) → Universe | None¶

Returns the protein structure of the given protein object as MDAnalysis universe.

Parameters:: protein (Protein or KLIFSKinase) – The protein object.
Returns:: The protein structure as MDAnalysis universe or None.
Return type:: Universe or None
Raises:: ValueError – If wrong toolkit was used during initialization of the protein object.

_preprocess_structure(pdb_path: str | pathlib.Path, chain_id: str | None, alternate_location: str | None, expo_id: str | None, sequence: str) → pathlib.Path¶

Pre-process a structure for SCHRODINGER’s prepwizard with the following steps:

select chain of interest
select alternate location of interest
remove all ligands but ligand of interest
remove expression tags
delete protein alterations differing from given sequence
renumber protein residues according to the given sequence

Parameters:

pdb_path (str or Path) – Path to the structure file in PDB format.
chain_id (str or None) – The chain ID of interest.
alternate_location (str or None) – The alternate location of interest.
expo_id (str or None) – The resname of the ligand of interest.
sequence (str) – The amino acid sequence of the protein.

Returns:

The path to the cleaned structure.

Return type:

Path

static _postprocess_structure(prepared_structure: Universe, protein: [kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) → Universe¶

Post-process a structure prepared with SCHRODINGER’s prepwizard with the following steps:

select the chain of interest
select the alternate location of interest
remove all ligands but the ligands of interest
update residue identifiers, e.g. atom indices, chain ID, residue IDs of non-protein

Parameters:

prepared_structure (Universe) – The structure prepared by SCHRODINGER’s prepwizard.
protein (Protein or KLIFSKinase) – The protein component of the system.

Returns:

The post-processed structure.

Return type:

Universe

class kinoml.features.complexes.SCHRODINGERDockingFeaturizer(cache_dir: str | pathlib.Path | None = None, output_dir: str | pathlib.Path | None = None, max_retry: int = 3, build_loops: bool = True, shape_restrain: bool = True, **kwargs)¶

Bases: SCHRODINGERComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the structure and dock the ligand into its binding site identified by a co-crystallized ligand. The following steps will be performed:

modeling missing loops with Prime according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

building missing side chains

modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component: alterations are first deleted and the intended sequence is subsequently modeled with Prime; if an alteration cannot be modeled, a corresponding deletion will remain

removing everything but protein, water and ligand of interest

protonation at pH 7.4

docking a ligand

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’MDAnalysis’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

name: A string specifying the name of the protein, will be used for generating the output file name.

chain_id: A string specifying which chain should be used.

alternate_location: A string specifying which alternate location should be used.

expo_id: A string specifying a ligand bound to the protein of interest. This is especially useful if multiple proteins are found in one PDB structure.

uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES. Additionally, the ligand component can have the following optional attributes:

name: A string specifying the name of the ligand, will be used for generating the output file name and as molecule title in the docking pose SDF file.

macrocycle: A bool specifying whether the ligand should be sampled as a macrocycle during docking. Docking will fail if SCHRODINGER does not consider the ligand a macrocycle.

Parameters:

cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
max_retry (int, default=3) – The maximal number of attempts to try running the prepwizard and docking steps.
build_loops (bool, default=True) – Whether missing loops should be built. This is also needed to model mutations.
shape_restrain (bool, default=True) – Whether the docking should be performed with a shape restraint based on the co-crystallized ligand.

shape_restrain = True¶

_SUPPORTED_TYPES¶

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) → Universe | None¶

Prepare a protein structure and dock a ligand.

Parameters:: system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
Returns:: An MDAnalysis universe of the featurized system or None if not successful.
Return type:: Universe or None

_dock_molecule(prepared_structure_path: pathlib.Path, system: kinoml.core.systems.ProteinLigandComplex, system_name: str) → pathlib.Path | None¶

Dock the molecule into the protein with SCHRODINGER’s Glide.

Parameters:

prepared_structure_path (Path) – A prepared protein structure, ready for docking.
system (ProteinLigandComplex) – The system that is being featurized.
system_name (str) – A descriptive name of the system.

Returns:

The path to the generated docking pose, None if not successful.

Return type:

Path or None

static _replace_ligand(pdb_path: pathlib.Path, docking_pose_sdf_path: pathlib.Path) → Universe¶

Replace the ligand in a PDB file with a ligand in an SDF file.

Parameters:

pdb_path (Path) – Path to the PDB file of the protein ligand complex.
docking_pose_sdf_path (Path) – Path to the molecule in SDF format that should be added to the structure.

Returns:

The structure with replaced ligand.

Return type:

Universe

static _store_docking_score(structure: Universe, docking_pose_path: pathlib.Path)¶

Store the docking score from SCHRODINGER docking in the MDAnalysis universe._topology. It cannot be stored in the universe object directly, because it would be lost during multiprocessing/pickling.

Parameters:

structure (Universe) – The docked structure as MDAnalysis universe.
docking_pose_path (Path) – The path to the docking pose.

_write_complex_mae(prepared_structure: Universe, docking_pose_path: pathlib.Path, complex_path_mae: pathlib.Path)¶

Write the new docked structure in MAE format.

Parameters:

prepared_structure (Universe) – The prepared structure containing the docked ligand with resname LIG.
docking_pose_path (Path) – The prepared docking pose including correct bonding information.
complex_path_mae (Path) – The path for the output file in MAE format.