kinoml.features.complexes¶
Featurizers that can only get applied to ProteinLigandComplexes or subclasses thereof
Module Contents¶
- kinoml.features.complexes.logger¶
- class kinoml.features.complexes.SingleLigandProteinComplexFeaturizer(**kwargs)¶
Bases:
kinoml.features.core.ParallelBaseFeaturizer
Provides a minimally useful ._supports() method for all ProteinLigandComplex-like featurizers.
- _COMPATIBLE_PROTEIN_TYPES = ()¶
- _COMPATIBLE_LIGAND_TYPES = ()¶
- _supports(system: Union[kinoml.core.systems.ProteinLigandComplex]) bool ¶
Check that exactly one protein and one ligand are present in the System.
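The check above can be sketched as follows. This is only an illustration under stated assumptions, not KinoML's implementation: `Protein`, `Ligand` and `System` are hypothetical stand-ins for the real classes, and the two module-level tuples play the role of `_COMPATIBLE_PROTEIN_TYPES` and `_COMPATIBLE_LIGAND_TYPES`.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for KinoML's component and system classes.
class Protein:
    pass


class Ligand:
    pass


@dataclass
class System:
    components: List[object] = field(default_factory=list)


# Assumed analogues of _COMPATIBLE_PROTEIN_TYPES / _COMPATIBLE_LIGAND_TYPES.
COMPATIBLE_PROTEIN_TYPES = (Protein,)
COMPATIBLE_LIGAND_TYPES = (Ligand,)


def supports(system: System) -> bool:
    """Return True only if the system holds exactly one protein and one ligand."""
    n_proteins = sum(isinstance(c, COMPATIBLE_PROTEIN_TYPES) for c in system.components)
    n_ligands = sum(isinstance(c, COMPATIBLE_LIGAND_TYPES) for c in system.components)
    return n_proteins == 1 and n_ligands == 1
```

Subclasses would narrow the two tuples to the component types they can actually handle.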
- class kinoml.features.complexes.MostSimilarPDBLigandFeaturizer(similarity_metric: str = 'fingerprint', cache_dir: Union[str, pathlib.Path, None] = None, **kwargs)¶
Bases:
SingleLigandProteinComplexFeaturizer
Find the most similar co-crystallized ligand in the PDB according to a given SMILES and UniProt ID.
The protein component of each system must be a core.proteins.Protein or a subclass thereof, and must be initialized with a uniprot_id parameter.
The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES.
- Parameters
similarity_metric (str, default="fingerprint") – The similarity metric to use to detect the structure with the most similar ligand [“fingerprint”, “mcs”, “openeye_shape”, “schrodinger_shape”].
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
Note
The toolkit [‘MDAnalysis’ or ‘OpenEye’] specified in the protein object initialization should fit the required toolkit when subsequently applying the OEDockingFeaturizer or SCHRODINGERDockingFeaturizer.
- _SUPPORTED_TYPES = ()¶
- _SUPPORTED_SIMILARITY_METRICS = ('fingerprint', 'mcs', 'openeye_shape', 'schrodinger_shape')¶
- _pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) None ¶
Check that SCHRODINGER variable exists.
- _check_schrodinger()¶
Check that SCHRODINGER variable exists.
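A check of this kind might look like the sketch below. The function name, the optional `environ` argument and the choice to raise a `RuntimeError` are assumptions made for illustration; the documentation above does not say how the real method reacts to a missing variable.

```python
import os
from typing import Mapping, Optional


def check_schrodinger(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of the SCHRODINGER variable or raise if it is not set."""
    # Allow a custom mapping so the check can be exercised without touching os.environ.
    environ = os.environ if environ is None else environ
    path = environ.get("SCHRODINGER")
    if not path:
        raise RuntimeError("The SCHRODINGER environment variable is not set.")
    return path
```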
- _featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[kinoml.core.systems.ProteinLigandComplex, None] ¶
Find a PDB entry with a protein of the given UniProt ID and with the most similar co-crystallized ligand.
- Parameters
system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
- Returns
The same system, but with additional protein attributes, i.e. pdb_id, chain_id and expo_id. None if no suitable PDB entry was found.
- Return type
ProteinLigandComplex or None
- _post_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex], features: List[kinoml.core.systems.ProteinLigandComplex], keep: bool = True) List[kinoml.core.systems.ProteinLigandComplex] ¶
Run after featurizing all systems. Original systems will be replaced with systems returned by the featurizer. Systems that were not successfully featurized will be removed and listed in a log file in the current working directory.
- Parameters
systems (list of ProteinLigandComplex) – The systems being featurized.
features (list of ProteinLigandComplex) – The features returned by self._featurize, i.e. new systems.
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.
- Returns
The new systems with .featurizations extended with the calculated features in two entries: the featurizer name and last.
- Return type
list of ProteinLigandComplex
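The bookkeeping described for this method can be illustrated with a small sketch. Everything here is hypothetical: the `System` class is a stand-in, the stored feature is assumed to be the `(pdb_id, chain_id, expo_id)` tuple, and failed systems are returned as a list instead of being written to a log file.

```python
from typing import Dict, List, Optional, Tuple


class System:
    """Hypothetical stand-in for a ProteinLigandComplex."""

    def __init__(self, name: str, pdb_id: Optional[str] = None,
                 chain_id: Optional[str] = None, expo_id: Optional[str] = None):
        self.name = name
        self.pdb_id = pdb_id
        self.chain_id = chain_id
        self.expo_id = expo_id
        self.featurizations: Dict[str, object] = {}


def post_featurize(systems: List[System], features: List[Optional[System]],
                   featurizer_name: str, keep: bool = True) -> Tuple[List[System], List[str]]:
    """Replace systems by the returned systems and drop those that failed (None)."""
    new_systems, failed = [], []
    for original, new_system in zip(systems, features):
        if new_system is None:
            failed.append(original.name)  # the real method lists these in a log file
            continue
        feature = (new_system.pdb_id, new_system.chain_id, new_system.expo_id)
        new_system.featurizations["last"] = feature
        if keep:
            new_system.featurizations[featurizer_name] = feature
        new_systems.append(new_system)
    return new_systems, failed
```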
- _get_pdb_ligand_entities(uniprot_id: str) Union[pandas.DataFrame, None] ¶
Get PDB ligand entities bound to protein structures of the given UniProt ID. Only X-ray structures will be considered. If a ligand is co-crystallized with multiple PDB structures the ligand entity with the lowest resolution will be returned.
- Parameters
uniprot_id (str) – The UniProt ID of the protein of interest.
- Returns
A DataFrame with columns ligand_entity, pdb_id, non_polymer_id, chain_id, expo_id and resolution. None if no suitable ligand entities were found.
- Return type
pd.DataFrame or None
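The de-duplication step can be sketched without pandas as shown below. The sketch assumes that "lowest resolution" refers to the numerically smallest resolution value, and it uses plain dictionaries with made-up entries in place of DataFrame rows.

```python
from typing import Dict, List


def keep_best_resolved(entities: List[dict]) -> List[dict]:
    """Keep, per expo_id, the entry with the numerically lowest resolution."""
    best: Dict[str, dict] = {}
    for entity in entities:
        current = best.get(entity["expo_id"])
        if current is None or entity["resolution"] < current["resolution"]:
            best[entity["expo_id"]] = entity
    return list(best.values())


# Made-up example entries: ligand AAA occurs in two structures.
entities = [
    {"expo_id": "AAA", "pdb_id": "1ABC", "resolution": 2.5},
    {"expo_id": "AAA", "pdb_id": "2DEF", "resolution": 1.8},
    {"expo_id": "BBB", "pdb_id": "1ABC", "resolution": 2.5},
]
unique_entities = keep_best_resolved(entities)
```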
- static _add_ligand_entity_info(pdb_ligand_entities: pd.DataFrame) pd.DataFrame ¶
Add chain and expo ID information to the PDB ligand entities dataframe.
- Parameters
pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with a column named ligand_entity. This column must contain strings in the format ‘4YNE_3’, i.e. the third non polymer entity of PDB entry 4YNE.
- Returns
The same PDB ligand entities dataframe but with additional columns named chain_id and expo_id. PDB ligand entities without such information are removed.
- Return type
pd.DataFrame
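As a small illustration of the ligand_entity format, an identifier such as '4YNE_3' can be split into its PDB ID and non-polymer entity number. The helper name is hypothetical, and the validation assumes four-character PDB IDs.

```python
from typing import Tuple


def split_ligand_entity(ligand_entity: str) -> Tuple[str, int]:
    """Split an identifier such as '4YNE_3' into PDB ID and non-polymer entity number."""
    pdb_id, _, entity_number = ligand_entity.partition("_")
    # Assumption: PDB IDs have four characters and the entity number is an integer.
    if len(pdb_id) != 4 or not entity_number.isdigit():
        raise ValueError(f"Unexpected ligand entity format: {ligand_entity!r}")
    return pdb_id, int(entity_number)
```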
- static _add_pdb_resolution(pdb_ligand_entities: pd.DataFrame) pd.DataFrame ¶
Add resolution information to the PDB ligand entities dataframe.
- Parameters
pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with a column named pdb_id. This column must contain strings in the format ‘4YNE’, i.e. PDB entry 4YNE.
- Returns
The same PDB ligand entities dataframe but with an additional column named resolution. PDB ligand entities without such information will get a dummy resolution of 99.9.
- Return type
pd.DataFrame
- _get_most_similar_pdb_ligand_entity(pdb_ligand_entities: pd.DataFrame, smiles: str) Tuple[str, str, str] ¶
Get the PDB ligand that is most similar to the given SMILES.
- Parameters
pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id and expo_id.
- Returns
The PDB, chain and expo ID of the most similar ligand.
- Return type
tuple of str
- static _by_fingerprint(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) pandas.DataFrame ¶
Get the PDB ligands that are most similar to the given SMILES according to Morgan Fingerprints.
- Parameters
pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 0.87 and the max_similarity_cutoff is set to 0.1, all ligands will be returned with a similarity of 0.77 or higher.
- Returns
The most similar ligands.
- Return type
pd.DataFrame
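The effect of max_similarity_cutoff can be sketched independently of the fingerprint calculation. The function below is a hypothetical helper that takes precomputed similarities keyed by an arbitrary ligand label; it is not the library's code.

```python
from typing import Dict


def select_by_similarity(similarities: Dict[str, float],
                         max_similarity_cutoff: float = 0.0) -> Dict[str, float]:
    """Keep entries whose similarity lies within the cutoff of the highest similarity."""
    if not similarities:
        return {}
    threshold = max(similarities.values()) - max_similarity_cutoff
    return {key: value for key, value in similarities.items() if value >= threshold}
```

With the default cutoff of 0.0, only the ligands sharing the highest similarity remain.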
- static _by_mcs(pdb_ligand_entities: pd.DataFrame, smiles: str, max_bonds_cutoff: float = 0.0) pd.DataFrame ¶
Get the PDB ligands that are most similar to the given SMILES according to the number of bonds in the maximum common substructures.
- Parameters
pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_bonds_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected number of MCS bonds and the possible maximum of MCS bonds. The possible maximum number is calculated from the number of bonds in the given smiles. If the possible maximum number is 35, the highest number of detected mcs bonds is 20 and the max_bonds_cutoff is 0.1, all ligands will be returned with a number of MCS bonds of 16.5 (20 - (35 * 0.1)) or higher.
- Returns
The most similar ligands.
- Return type
pd.DataFrame
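The threshold arithmetic for max_bonds_cutoff differs from the fingerprint case because the cutoff is scaled by the number of bonds in the query molecule. A minimal sketch, assuming the MCS bond counts were computed elsewhere and using a hypothetical helper name:

```python
from typing import Dict


def select_by_mcs_bonds(mcs_bonds: Dict[str, int], query_bonds: int,
                        max_bonds_cutoff: float = 0.0) -> Dict[str, int]:
    """Keep entries within `query_bonds * max_bonds_cutoff` bonds of the best MCS."""
    if not mcs_bonds:
        return {}
    # E.g. 20 - (35 * 0.1) = 16.5 for the numbers used in the parameter description.
    threshold = max(mcs_bonds.values()) - query_bonds * max_bonds_cutoff
    return {key: bonds for key, bonds in mcs_bonds.items() if bonds >= threshold}
```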
- _by_schrodinger_shape(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) pandas.DataFrame ¶
Get the PDB ligands that are most similar to the given SMILES according to SCHRODINGER shape_screen.
- Parameters
pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 0.87 and the max_similarity_cutoff is set to 0.1, all ligands will be returned with a similarity of 0.77 or higher.
- Returns
The most similar ligands.
- Return type
pd.DataFrame
- _by_openeye_shape(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) pandas.DataFrame ¶
Get the PDB ligands that are most similar to the given SMILES according to OpenEye’s TanimotoCombo score.
- Parameters
pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.
smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 1.31 and the max_similarity_cutoff is set to 0.2, all ligands will be returned with a similarity of 1.11 or higher.
- Returns
The most similar ligands.
- Return type
pd.DataFrame
- class kinoml.features.complexes.KLIFSConformationTemplatesFeaturizer(**kwargs)¶
Bases:
MostSimilarPDBLigandFeaturizer
Find suitable kinase templates for modeling a kinase:inhibitor complex in different KLIFS conformations.
The protein component of each system must be a core.proteins.KLIFSKinase, and must be initialized with a uniprot_id or kinase_klifs_id parameter.
The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES.
- Parameters
similarity_metric (str, default="fingerprint") – The similarity metric to use to detect the structures with similar ligands [“fingerprint”, “mcs”, “openeye_shape”, “schrodinger_shape”].
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
- _COMPATIBLE_PROTEIN_TYPES = ()¶
- _pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) None ¶
Check SCHRODINGER variable and fetch KLIFS data.
- _create_klifs_structure_db()¶
Fetch structure data from KLIFS.
- _featurize_one(system: kinoml.core.systems.ProteinLigandComplex) pandas.DataFrame ¶
Find PDB entries for different KLIFS conformations with a similar co-crystallized ligand and a similar pocket sequence.
- Parameters
system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
- Returns
A dataframe with columns for dfg, ac_helix, pdb_id, chain_id, expo_id, ligand_similarity and sequence_similarity.
- Return type
DataFrame
- static _filter_structures(structures: pd.DataFrame) pd.DataFrame ¶
Filter KLIFS entries for the presence of exactly one orthosteric ligand and determined KLIFS conformation, and remove duplicates.
- Parameters
structures (DataFrame) – The KLIFS entries to filter, need to contain the columns ligand.expo_id, structure.pdb_id, structure.dfg, structure.ac_helix, structure.qualityscore, structure.resolution, structure.chain and structure.alternate_model.
- Returns
The filtered KLIFS entries.
- Return type
DataFrame
- _get_most_similar_klifs_ligand_entity(structures: pd.DataFrame, smiles: str, klifs_sequence: str) Tuple[str, str, str, str, str] ¶
Get the KLIFS entry that is most similar to the given SMILES and KLIFS pocket sequence.
- Parameters
structures (pd.DataFrame) – The KLIFS entries dataframe with columns named structure.pdb_id, structure.chain, structure.expo_id, smiles and structure.pocket.
- Returns
The PDB ID, chain ID, expo ID, ligand similarity and pocket similarity of the KLIFS entry with the most similar ligand and KLIFS pocket sequence.
- Return type
tuple of str
- static _by_klifs_sequence(klifs_structures: pd.DataFrame, reference_klifs_sequence: str, max_similarity_cutoff: float = 0.0) pd.DataFrame ¶
Get the KLIFS entries that are most similar to the given pocket sequence.
- Parameters
klifs_structures (pd.DataFrame) – The KLIFS entries dataframe with a column named structure.pocket.
reference_klifs_sequence (str) – The sequence for calculating the similarity.
max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar sequences based on the highest detected sequence similarity and the possible maximum of sequence similarity. The possible maximum sequence similarity is aligning the reference sequence to itself. If the possible maximum sequence similarity is 450, the highest detected sequence similarity is 320 and the max_similarity_cutoff is 0.1, all entries will be returned with a sequence similarity of 275 (320 - (450 * 0.1)) or higher.
- Returns
The KLIFS entries with the most similar pocket sequences.
- Return type
pd.DataFrame
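The selection logic can be sketched with a toy scoring function. Counting identical positions is an assumption made purely for illustration; the scoring scheme actually used for pocket sequences is not described here. The point of the sketch is that the possible maximum is obtained by scoring the reference sequence against itself.

```python
from typing import Dict


def identity_score(seq_a: str, seq_b: str) -> int:
    """Toy similarity: number of identical positions in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Pocket sequences must have the same length.")
    return sum(a == b for a, b in zip(seq_a, seq_b))


def select_by_pocket(pockets: Dict[str, str], reference: str,
                     max_similarity_cutoff: float = 0.0) -> Dict[str, int]:
    """Keep pockets whose score lies within the scaled cutoff of the best score."""
    scores = {key: identity_score(reference, seq) for key, seq in pockets.items()}
    if not scores:
        return {}
    max_possible = identity_score(reference, reference)  # reference aligned to itself
    threshold = max(scores.values()) - max_possible * max_similarity_cutoff
    return {key: score for key, score in scores.items() if score >= threshold}
```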
- _post_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex], features: List, keep: bool = True) List[kinoml.core.systems.ProteinLigandComplex] ¶
Run after featurizing all systems. Systems with a feature of None will be removed and listed in a log file in the current working directory. You shouldn’t need to redefine this method.
- Parameters
systems (list of System) – The systems being featurized
features (list) – The features returned by self._featurize
keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.
- Returns
filtered_systems – The same systems as passed, but with .featurizations extended with the calculated features in two entries: the featurizer name and last. Systems with a feature of None will be removed.
- Return type
systems
- class kinoml.features.complexes.OEComplexFeaturizer(**kwargs)¶
Bases:
kinoml.features.core.OEBaseModelingFeaturizer, SingleLigandProteinComplexFeaturizer
Given systems with exactly one protein and one ligand, prepare the complex structure by:
modeling missing loops with OESpruce according to the PDB header unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled
building missing side chains
modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component; alterations will be modeled with OESpruce, and if an alteration cannot be modeled, the corresponding mismatch in the structure will be deleted
removing everything but protein, water and ligand of interest
protonation at pH 7.4
The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’OpenEye’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:
name: A string specifying the name of the protein, will be used for generating the output file name.
chain_id: A string specifying which chain should be used.
alternate_location: A string specifying which alternate location should be used.
expo_id: A string specifying the ligand of interest. This is especially useful if multiple ligands are present in a PDB structure.
uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.
sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.
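The precedence among the three sequence sources described above can be sketched as follows. The function and its `fetch_uniprot_sequence` callback are hypothetical; in particular, no real UniProt request is made here.

```python
from typing import Callable, Optional


def resolve_sequence(pdb_header_sequence: str,
                     uniprot_id: Optional[str] = None,
                     sequence: Optional[str] = None,
                     fetch_uniprot_sequence: Optional[Callable[[str], str]] = None) -> str:
    """Pick the modeling sequence: sequence > uniprot_id > PDB header."""
    if sequence:
        return sequence  # an explicit sequence supersedes everything else
    if uniprot_id:
        if fetch_uniprot_sequence is None:
            raise ValueError("A fetch function is required to resolve a uniprot_id.")
        return fetch_uniprot_sequence(uniprot_id)  # supersedes the PDB header
    return pdb_header_sequence
```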
The ligand component of each system must be a core.components.BaseLigand or a subclass thereof. The ligand component can have the following optional attributes:
name: A string specifying the name of the ligand, will be used for generating the output file name.
- Parameters
loop_db (str) – The path to the loop database used by OESpruce to model missing loops.
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
Note
If the ligand of interest is covalently bonded to the protein, the covalent bond will be broken. This may lead to the transformation of the ligand into a radical.
- _SUPPORTED_TYPES = ()¶
- _featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None] ¶
Prepare a protein structure.
- Parameters
system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
- Returns
An MDAnalysis universe of the featurized system. None if no design unit was found.
- Return type
Universe or None
- class kinoml.features.complexes.OEDockingFeaturizer(method: str = 'Posit', pKa_norm: bool = True, **kwargs)¶
Bases:
kinoml.features.core.OEBaseModelingFeaturizer, SingleLigandProteinComplexFeaturizer
Given systems with exactly one protein and one ligand, prepare the structure and dock the ligand into the prepared protein structure with one of OpenEye’s docking algorithms:
modeling missing loops with OESpruce according to the PDB header unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled
building missing side chains
modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component; alterations will be modeled with OESpruce, and if an alteration cannot be modeled, the corresponding mismatch in the structure will be deleted
removing everything but protein, water and ligand of interest
protonation at pH 7.4
perform docking
The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’OpenEye’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:
name: A string specifying the name of the protein, will be used for generating the output file name.
chain_id: A string specifying which chain should be used.
alternate_location: A string specifying which alternate location should be used.
expo_id: A string specifying a ligand bound to the protein of interest. This is especially useful if multiple proteins are found in one PDB structure.
uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.
sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.
pocket_resids: List of integers specifying the residues in the binding pocket of interest. This attribute is required if docking with Fred into an apo structure.
The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES. Additionally, the ligand component can have the following optional attributes:
name: A string specifying the name of the ligand, will be used for generating the output file name.
- Parameters
method (str, default="Posit") – The docking method to use [“Fred”, “Hybrid”, “Posit”].
loop_db (str) – The path to the loop database used by OESpruce to model missing loops.
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
pKa_norm (bool, default=True) – Assign the predominant ionization state of the molecules to dock at pH ~7.4. If False, the ionization state of the input molecules will be conserved.
- _SUPPORTED_TYPES = ()¶
- _featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None] ¶
Prepare a protein structure and dock a ligand using the selected OpenEye docking method (Fred, Hybrid or Posit).
- Parameters
system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
- Returns
An MDAnalysis universe of the featurized system. None if no design unit or docking pose was found.
- Return type
Universe or None
- static _store_docking_score(structure: Universe, docking_pose: openeye.oechem.OEGraphMol)¶
Store the docking score from OpenEye docking in the MDAnalysis universe._topology. If the Posit probability is available it will be stored as well. They cannot be stored in the universe object directly, because they will be lost during multiprocessing/pickling.
- Parameters
structure (Universe) – The docked structure as MDAnalysis universe.
docking_pose (oechem.OEGraphMol) – The docking pose.
- class kinoml.features.complexes.SCHRODINGERComplexFeaturizer(cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, max_retry: int = 3, build_loops: bool = True, **kwargs)¶
Bases:
SingleLigandProteinComplexFeaturizer
Given systems with exactly one protein and one ligand, prepare the complex structure by:
modeling missing loops with Prime according to the PDB header unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled
building missing side chains
modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component; alterations will first be deleted and the intended sequence subsequently modeled with Prime, and if an alteration cannot be modeled, the corresponding deletion will remain
removing everything but protein, water and ligand of interest
protonation at pH 7.4
The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’MDAnalysis’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:
name: A string specifying the name of the protein, will be used for generating the output file name.
chain_id: A string specifying which chain should be used.
alternate_location: A string specifying which alternate location should be used.
expo_id: A string specifying the ligand of interest. This is especially useful if multiple ligands are present in a PDB structure.
uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.
sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.
The ligand component of each system must be a core.components.BaseLigand or a subclass thereof. The ligand component can have the following optional attributes:
name: A string specifying the name of the ligand, will be used for generating the output file name.
- Parameters
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
max_retry (int, default=3) – The maximum number of attempts to run the prepwizard step.
build_loops (bool, default=True) – Whether missing loops should be built. This is also needed to model mutations.
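The role of max_retry can be illustrated with a generic retry loop. This is a sketch under assumptions: `run_with_retries` and the convention that a failed attempt returns None are hypothetical, and no SCHRODINGER tool is called.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], Optional[T]], max_retry: int = 3) -> Optional[T]:
    """Call `step` up to `max_retry` times and return its first non-None result."""
    for _ in range(max_retry):
        result = step()
        if result is not None:
            return result
    return None  # every attempt failed
```

Returning None after the last failed attempt mirrors the "Path or None" style return values documented for the preparation steps.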
- _SUPPORTED_TYPES = ()¶
- _pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) None ¶
Check that SCHRODINGER variable exists.
- _featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None] ¶
Prepare a protein structure.
- Parameters
system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
- Returns
An MDAnalysis universe of the featurized system or None if not successful.
- Return type
Universe or None
- static _system_to_name(system: kinoml.core.systems.ProteinLigandComplex) str ¶
Get a name of the system based on attributes of the protein and ligand component.
- Parameters
system (ProteinLigandComplex) – The system with protein and ligand component.
- Returns
A descriptive name of the system
- Return type
str
- _prepare_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) Union[pathlib.Path, None] ¶
Prepare the structure with SCHRODINGER’s prepwizard.
- Parameters
protein (Path) – The path to the input structure file in PDB format.
- Returns
The path to the prepared structure if successful.
- Return type
Path or None
- _read_protein_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) Union[Universe, None] ¶
Returns the protein structure of the given protein object as MDAnalysis universe.
- Parameters
protein (Protein or KLIFSKinase) – The protein object.
- Returns
The protein structure as MDAnalysis universe or None.
- Return type
Universe or None
- Raises
ValueError – If wrong toolkit was used during initialization of the protein object.
- _preprocess_structure(pdb_path: Union[str, pathlib.Path], chain_id: Union[str, None], alternate_location: Union[str, None], expo_id: Union[str, None], sequence: str) pathlib.Path ¶
Pre-process a structure for SCHRODINGER’s prepwizard with the following steps:
select chain of interest
select alternate location of interest
remove all ligands but ligand of interest
remove expression tags
delete protein alterations differing from given sequence
renumber protein residues according to the given sequence
- Parameters
pdb_path (str or Path) – Path to the structure file in PDB format.
chain_id (str or None) – The chain ID of interest.
alternate_location (str or None) – The alternate location of interest.
expo_id (str or None) – The resname of the ligand of interest.
sequence (str) – The amino acid sequence of the protein.
- Returns
The path to the cleaned structure.
- Return type
Path
- static _postprocess_structure(prepared_structure: Universe, protein: [kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) Universe ¶
Post-process a structure prepared with SCHRODINGER’s prepwizard with the following steps:
select the chain of interest
select the alternate location of interest
remove all ligands but the ligands of interest
update residue identifiers, e.g. atom indices, chain ID, residue IDs of non-protein
- Parameters
prepared_structure (Universe) – The structure prepared by SCHRODINGER’s prepwizard.
protein (Protein or KLIFSKinase) – The protein component of the system.
- Returns
The post-processed structure.
- Return type
Universe
- class kinoml.features.complexes.SCHRODINGERDockingFeaturizer(cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, max_retry: int = 3, build_loops: bool = True, shape_restrain: bool = True, **kwargs)¶
Bases:
SCHRODINGERComplexFeaturizer
Given systems with exactly one protein and one ligand, prepare the structure and dock the ligand into its binding site identified by a co-crystallized ligand. The following steps will be performed:
modeling missing loops with Prime according to the PDB header unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled
building missing side chains
modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component; alterations will first be deleted and the intended sequence subsequently modeled with Prime, and if an alteration cannot be modeled, the corresponding deletion will remain
removing everything but protein, water and ligand of interest
protonation at pH 7.4
docking a ligand
The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’MDAnalysis’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:
name: A string specifying the name of the protein, will be used for generating the output file name.
chain_id: A string specifying which chain should be used.
alternate_location: A string specifying which alternate location should be used.
expo_id: A string specifying a ligand bound to the protein of interest. This is especially useful if multiple proteins are found in one PDB structure.
uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.
sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.
The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES. Additionally, the ligand component can have the following optional attributes:
name: A string specifying the name of the ligand, will be used for generating the output file name and as molecule title in the docking pose SDF file.
macrocycle: A bool specifying if the ligand should be sampled as a macrocycle during docking. Docking will fail if SCHRODINGER does not consider the ligand a macrocycle.
- Parameters
cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.
output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.
use_multiprocessing (bool, default=True) – Whether to use multiprocessing.
n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.
max_retry (int, default=3) – The maximum number of attempts to run the prepwizard and docking steps.
build_loops (bool, default=True) – Whether missing loops should be built. This is also needed to model mutations.
shape_restrain (bool, default=True) – Whether the docking should be performed with a shape restraint based on the co-crystallized ligand.
- _SUPPORTED_TYPES = ()¶
- _featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None] ¶
Prepare a protein structure and dock a ligand.
- Parameters
system (ProteinLigandComplex) – A system object holding a protein and a ligand component.
- Returns
An MDAnalysis universe of the featurized system or None if not successful.
- Return type
Universe or None
- _dock_molecule(prepared_structure_path: pathlib.Path, system: kinoml.core.systems.ProteinLigandComplex, system_name: str) Union[pathlib.Path, None] ¶
Dock the molecule into the protein with SCHRODINGER’s Glide.
- Parameters
prepared_structure_path (Path) – A prepared protein structure, ready for docking.
system (ProteinLigandComplex) – The system that is being featurized.
system_name (str) – A descriptive name of the system.
- Returns
The path to the generated docking pose, None if not successful.
- Return type
Path or None
- static _replace_ligand(pdb_path: pathlib.Path, docking_pose_sdf_path: pathlib.Path) Universe ¶
Replace the ligand in a PDB file with a ligand in an SDF file.
- Parameters
pdb_path (Path) – Path to the PDB file of the protein ligand complex.
docking_pose_sdf_path (Path) – Path to the molecule in SDF format that should be added to the structure.
- Returns
The structure with replaced ligand.
- Return type
Universe
- static _store_docking_score(structure: Universe, docking_pose_path: pathlib.Path)¶
Store the docking score from SCHRODINGER docking in the MDAnalysis universe._topology. It cannot be stored in the universe object directly, because it would be lost during multiprocessing/pickling.
- Parameters
structure (Universe) – The docked structure as MDAnalysis universe.
docking_pose_path (Path) – The path to the docking pose.
- _write_complex_mae(prepared_structure: Universe, docking_pose_path: pathlib.Path, complex_path_mae: pathlib.Path)¶
Write the new docked structure in MAE format.
- Parameters
prepared_structure (Universe) – The prepared structure containing the docked ligand with resname LIG.
docking_pose_path (Path) – The prepared docking pose including correct bonding information.
complex_path_mae (Path) – The path for the output file in MAE format.