kinoml.features.complexes

Featurizers that can only be applied to ProteinLigandComplexes or subclasses thereof.

Module Contents

kinoml.features.complexes.logger
class kinoml.features.complexes.SingleLigandProteinComplexFeaturizer(**kwargs)

Bases: kinoml.features.core.ParallelBaseFeaturizer

Provides a minimally useful ._supports() method for all ProteinLigandComplex-like featurizers.

_COMPATIBLE_PROTEIN_TYPES = ()
_COMPATIBLE_LIGAND_TYPES = ()
_supports(system: Union[kinoml.core.systems.ProteinLigandComplex]) bool

Check that exactly one protein and one ligand are present in the System.
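The check described above can be illustrated with a minimal sketch. The classes ToyProtein, ToyLigand and ToySystem and the function supports are hypothetical stand-ins, not part of KinoML; the actual method may compare against _COMPATIBLE_PROTEIN_TYPES and _COMPATIBLE_LIGAND_TYPES instead.

```python
class ToyProtein:
    """Hypothetical stand-in for a protein component."""

    def __init__(self, name):
        self.name = name


class ToyLigand:
    """Hypothetical stand-in for a ligand component."""

    def __init__(self, name):
        self.name = name


class ToySystem:
    """Hypothetical stand-in for a ProteinLigandComplex."""

    def __init__(self, components):
        self.components = list(components)


def supports(system):
    """Return True only if the system holds exactly one protein and one ligand."""
    n_proteins = sum(isinstance(c, ToyProtein) for c in system.components)
    n_ligands = sum(isinstance(c, ToyLigand) for c in system.components)
    return n_proteins == 1 and n_ligands == 1
```

A system with two ligands, or with no protein, would be rejected by this check before any featurization is attempted.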

class kinoml.features.complexes.MostSimilarPDBLigandFeaturizer(similarity_metric: str = 'fingerprint', cache_dir: Union[str, pathlib.Path, None] = None, **kwargs)

Bases: SingleLigandProteinComplexFeaturizer

Find the most similar co-crystallized ligand in the PDB according to a given SMILES and UniProt ID.

The protein component of each system must be a core.proteins.Protein or a subclass thereof, and must be initialized with a uniprot_id parameter.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES.

Parameters
  • similarity_metric (str, default="fingerprint") – The similarity metric to use to detect the structure with the most similar ligand [“fingerprint”, “mcs”, “openeye_shape”, “schrodinger_shape”].

  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

Note

The toolkit [‘MDAnalysis’ or ‘OpenEye’] specified when initializing the protein object should match the toolkit required by the OEDockingFeaturizer or SCHRODINGERDockingFeaturizer that is applied afterwards.

_SUPPORTED_TYPES = ()
_SUPPORTED_SIMILARITY_METRICS = ('fingerprint', 'mcs', 'openeye_shape', 'schrodinger_shape')
_pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) None

Check that SCHRODINGER variable exists.

_check_schrodinger()

Check that SCHRODINGER variable exists.
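A check of this kind could look like the following sketch. The function name find_schrodinger, the injectable environ argument and the choice of RuntimeError are assumptions made for illustration; how the actual method reports a missing variable is not stated here.

```python
import os


def find_schrodinger(environ=None):
    """Return the value of the SCHRODINGER environment variable.

    `environ` is a hypothetical injection point that makes the check easy to
    test; by default the real process environment is used.
    """
    env = os.environ if environ is None else environ
    path = env.get("SCHRODINGER")
    if not path:
        # Assumption: fail early instead of failing later in a subprocess call.
        raise RuntimeError("The SCHRODINGER environment variable is not set.")
    return path
```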

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[kinoml.core.systems.ProteinLigandComplex, None]

Find a PDB entry with a protein of the given UniProt ID and with the most similar co-crystallized ligand.

Parameters

system (ProteinLigandComplex) – A system object holding a protein and a ligand component.

Returns

The same system, but with additional protein attributes, i.e. pdb_id, chain_id and expo_id. None if no suitable PDB entry was found.

Return type

ProteinLigandComplex or None

_post_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex], features: List[kinoml.core.systems.ProteinLigandComplex], keep: bool = True) List[kinoml.core.systems.ProteinLigandComplex]

Run after featurizing all systems. Original systems will be replaced with systems returned by the featurizer. Systems that were not successfully featurized will be removed and listed in a log file in the current working directory.

Parameters
  • systems (list of ProteinLigandComplex) – The systems being featurized.

  • features (list of ProteinLigandComplex) – The features returned by self._featurize, i.e. new systems.

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

The new systems with .featurizations extended with the calculated features in two entries: the featurizer name and last.

Return type

list of ProteinLigandComplex
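The replace-and-filter behavior described for this method can be sketched as follows. The function replace_and_filter is hypothetical and works on arbitrary objects; it leaves out the log file and the .featurizations bookkeeping.

```python
def replace_and_filter(systems, features):
    """Pair each original system with its featurization result.

    Returns (kept, failed): `kept` holds the new systems returned by the
    featurizer, `failed` holds the original systems whose result was None.
    """
    kept, failed = [], []
    for original, new_system in zip(systems, features):
        if new_system is None:
            failed.append(original)  # would be listed in the log file
        else:
            kept.append(new_system)  # replaces the original system
    return kept, failed
```

Returning the failed systems separately keeps the decision of how to log them (here, a file in the current working directory) out of the filtering logic.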

_get_pdb_ligand_entities(uniprot_id: str) Union[pandas.DataFrame, None]

Get PDB ligand entities bound to protein structures of the given UniProt ID. Only X-ray structures will be considered. If a ligand is co-crystallized in multiple PDB structures, the ligand entity with the lowest resolution value (i.e. the best-resolved structure) will be returned.

Parameters

uniprot_id (str) – The UniProt ID of the protein of interest.

Returns

A DataFrame with columns ligand_entity, pdb_id, non_polymer_id, chain_id, expo_id and resolution. None if no suitable ligand entities were found.

Return type

pd.DataFrame or None
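The deduplication rule above, one entity per ligand with the lowest resolution value, can be sketched without pandas. The function best_entity_per_ligand and the example records are made up for illustration; the actual method operates on a DataFrame.

```python
def best_entity_per_ligand(entities):
    """Keep, for every expo_id, the entity with the lowest resolution value.

    `entities` is a list of dicts with at least the keys "expo_id" and
    "resolution" (in Angstrom; lower is better).
    """
    best = {}
    for entity in entities:
        current = best.get(entity["expo_id"])
        if current is None or entity["resolution"] < current["resolution"]:
            best[entity["expo_id"]] = entity
    return list(best.values())


# Invented example records, not real PDB data.
example = [
    {"pdb_id": "1AAA", "expo_id": "LG1", "resolution": 2.4},
    {"pdb_id": "2BBB", "expo_id": "LG1", "resolution": 1.9},
    {"pdb_id": "3CCC", "expo_id": "LG2", "resolution": 3.0},
]
selected = best_entity_per_ligand(example)
```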

static _add_ligand_entity_info(pdb_ligand_entities: pd.DataFrame) pd.DataFrame

Add chain and expo ID information to the PDB ligand entities dataframe.

Parameters

pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with a column named ligand_entity. This column must contain strings in the format ‘4YNE_3’, i.e. the third non polymer entity of PDB entry 4YNE.

Returns

The same PDB ligand entities dataframe but with additional columns named chain_id and expo_id. PDB ligand entities without such information are removed.

Return type

pd.DataFrame
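The ligand_entity format described above can be taken apart with a small helper. This is a sketch: split_ligand_entity is a hypothetical name, and looking up the chain and expo IDs for the resulting pair (which the actual method does) is not shown.

```python
def split_ligand_entity(ligand_entity):
    """Split a string such as '4YNE_3' into the PDB ID and the entity number."""
    pdb_id, separator, entity_number = ligand_entity.partition("_")
    if not pdb_id or not separator or not entity_number.isdigit():
        raise ValueError(f"Unexpected ligand entity format: {ligand_entity!r}")
    return pdb_id, int(entity_number)
```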

static _add_pdb_resolution(pdb_ligand_entities: pd.DataFrame) pd.DataFrame

Add resolution information to the PDB ligand entities dataframe.

Parameters

pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with a column named pdb_id. This column must contain strings in the format ‘4YNE’, i.e. PDB entry 4YNE.

Returns

The same PDB ligand entities dataframe but with an additional column named resolution. PDB ligand entities without such information will get a dummy resolution of 99.9.

Return type

pd.DataFrame
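The dummy-resolution fallback can be sketched as below. The function add_resolution and the plain-dict lookup are assumptions for illustration; only the value 99.9 comes from the description above.

```python
DUMMY_RESOLUTION = 99.9  # fallback value named in the description above


def add_resolution(entities, resolutions):
    """Return copies of the entities with a "resolution" key added.

    `resolutions` maps PDB IDs to resolution values; entries that are missing
    or None fall back to the dummy resolution.
    """
    result = []
    for entity in entities:
        value = resolutions.get(entity["pdb_id"])
        if value is None:
            value = DUMMY_RESOLUTION
        result.append({**entity, "resolution": value})
    return result
```

A large dummy value means that entries without resolution information are ranked last whenever entities are later sorted or deduplicated by ascending resolution.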

_get_most_similar_pdb_ligand_entity(pdb_ligand_entities: pd.DataFrame, smiles: str) Tuple[str, str, str]

Get the PDB ligand that is most similar to the given SMILES.

Parameters

pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id and expo_id.

Returns

The PDB, chain and expo ID of the most similar ligand.

Return type

tuple of str

static _by_fingerprint(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) pandas.DataFrame

Get the PDB ligands that are most similar to the given SMILES according to Morgan Fingerprints.

Parameters
  • pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.

  • smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.

  • max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 0.87 and the max_similarity_cutoff is set to 0.1, all ligands will be returned with a similarity of 0.77 or higher.

Returns

The most similar ligands.

Return type

pd.DataFrame
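The meaning of max_similarity_cutoff can be illustrated with precomputed scores. In this sketch the function select_most_similar is hypothetical and takes a plain list of similarity values; computing the Morgan fingerprint similarities themselves is not shown.

```python
def select_most_similar(similarities, max_similarity_cutoff=0.0):
    """Return the indices of all entries within the cutoff of the best score."""
    if not similarities:
        return []
    threshold = max(similarities) - max_similarity_cutoff
    return [i for i, score in enumerate(similarities) if score >= threshold]
```

With the default cutoff of 0.0 only the top-scoring ligand (or ligands, in case of ties) is returned; a larger cutoff widens the selection relative to the best hit rather than applying an absolute similarity threshold.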

static _by_mcs(pdb_ligand_entities: pd.DataFrame, smiles: str, max_bonds_cutoff: float = 0.0) pd.DataFrame

Get the PDB ligands that are most similar to the given SMILES according to the number of bonds in the maximum common substructures.

Parameters
  • pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.

  • smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.

  • max_bonds_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected number of MCS bonds and the possible maximum of MCS bonds. The possible maximum number is calculated from the number of bonds in the given smiles. If the possible maximum number is 35, the highest number of detected mcs bonds is 20 and the max_bonds_cutoff is 0.1, all ligands will be returned with a number of MCS bonds of 16.5 (20 - (35 * 0.1)) or higher.

Returns

The most similar ligands.

Return type

pd.DataFrame
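Unlike the fingerprint cutoff, max_bonds_cutoff is scaled by the number of bonds in the query molecule. The arithmetic from the example above can be sketched as follows; select_by_mcs_bonds is a hypothetical helper that assumes the MCS bond counts have already been computed.

```python
def select_by_mcs_bonds(mcs_bonds, n_query_bonds, max_bonds_cutoff=0.0):
    """Return the indices of ligands whose MCS bond count is close to the best.

    The threshold is the highest detected number of MCS bonds minus the
    cutoff fraction of the possible maximum (the bonds in the query molecule).
    """
    if not mcs_bonds:
        return []
    threshold = max(mcs_bonds) - n_query_bonds * max_bonds_cutoff
    return [i for i, n_bonds in enumerate(mcs_bonds) if n_bonds >= threshold]
```

For a query with 35 bonds, a best hit of 20 MCS bonds and a cutoff of 0.1, the threshold is 20 - 3.5 = 16.5, so ligands with 17 or more MCS bonds are kept.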

_by_schrodinger_shape(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) pandas.DataFrame

Get the PDB ligands that are most similar to the given SMILES according to SCHRODINGER shape_screen.

Parameters
  • pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.

  • smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.

  • max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 0.87 and the max_similarity_cutoff is set to 0.1, all ligands will be returned with a similarity of 0.77 or higher.

Returns

The most similar ligands.

Return type

pd.DataFrame

_by_openeye_shape(pdb_ligand_entities: pandas.DataFrame, smiles: str, max_similarity_cutoff: float = 0.0) pandas.DataFrame

Get the PDB ligands that are most similar to the given SMILES according to OpenEye’s TanimotoCombo score.

Parameters
  • pdb_ligand_entities (pd.DataFrame) – The PDB ligand entities dataframe with columns named pdb_id, chain_id, expo_id and smiles.

  • smiles (str) – The SMILES representation of the molecule to search for similar PDB ligands.

  • max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar ligands based on the highest detected similarity. If the highest detected similarity is 1.31 and the max_similarity_cutoff is set to 0.2, all ligands will be returned with a similarity of 1.11 or higher.

Returns

The most similar ligands.

Return type

pd.DataFrame

class kinoml.features.complexes.KLIFSConformationTemplatesFeaturizer(**kwargs)

Bases: MostSimilarPDBLigandFeaturizer

Find suitable kinase templates for modeling a kinase:inhibitor complex in different KLIFS conformations.

The protein component of each system must be a core.proteins.KLIFSKinase, and must be initialized with a uniprot_id or kinase_klifs_id parameter.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES.

Parameters
  • similarity_metric (str, default="fingerprint") – The similarity metric to use to detect the structures with similar ligands [“fingerprint”, “mcs”, “openeye_shape”, “schrodinger_shape”].

  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

_COMPATIBLE_PROTEIN_TYPES = ()
_pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) None

Check SCHRODINGER variable and fetch KLIFS data.

_create_klifs_structure_db()

Fetch structure data from KLIFS.

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) pandas.DataFrame

Find PDB entries for different KLIFS conformations with a similar co-crystallized ligand and a similar pocket sequence.

Parameters

system (ProteinLigandComplex) – A system object holding a protein and a ligand component.

Returns

A dataframe with columns for dfg, ac_helix, pdb_id, chain_id, expo_id, ligand_similarity and sequence_similarity.

Return type

DataFrame

static _filter_structures(structures: pd.DataFrame) pd.DataFrame

Filter KLIFS entries for the presence of exactly one orthosteric ligand and determined KLIFS conformation, and remove duplicates.

Parameters

structures (DataFrame) – The KLIFS entries to filter. Must contain the columns ligand.expo_id, structure.pdb_id, structure.dfg, structure.ac_helix, structure.qualityscore, structure.resolution, structure.chain and structure.alternate_model.

Returns

The filtered KLIFS entries.

Return type

DataFrame

_get_most_similar_klifs_ligand_entity(structures: pd.DataFrame, smiles: str, klifs_sequence: str) Tuple[str, str, str, str, str]

Get the KLIFS entry that is most similar to the given SMILES and KLIFS pocket sequence.

Parameters

structures (pd.DataFrame) – The KLIFS entries dataframe with columns named structure.pdb_id, structure.chain, structure.expo_id, smiles and structure.pocket.

Returns

The PDB ID, chain ID, expo ID, ligand similarity and pocket similarity of the KLIFS entry with the most similar ligand and KLIFS pocket sequence.

Return type

tuple of str

static _by_klifs_sequence(klifs_structures: pd.DataFrame, reference_klifs_sequence: str, max_similarity_cutoff: float = 0.0) pd.DataFrame

Get the KLIFS entries that are most similar to the given pocket sequence.

Parameters
  • klifs_structures (pd.DataFrame) – The KLIFS entries dataframe with a column named structure.pocket.

  • reference_klifs_sequence (str) – The sequence for calculating the similarity.

  • max_similarity_cutoff (float, default=0.0) – The cutoff to use for selecting similar sequences based on the highest detected sequence similarity and the possible maximum of sequence similarity. The possible maximum sequence similarity is aligning the reference sequence to itself. If the possible maximum sequence similarity is 450, the highest detected sequence similarity is 320 and the max_similarity_cutoff is 0.1, all entries will be returned with a sequence similarity of 275 (320 - (450 * 0.1)) or higher.

Returns

The KLIFS entries with the most similar pocket sequences.

Return type

pd.DataFrame
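The role of the self-alignment score as the possible maximum can be sketched with a toy scoring scheme. Everything here is an assumption for illustration: pocket_score and select_similar_pockets are hypothetical names, sequences are compared position by position, and the match/mismatch scores stand in for whatever substitution scoring the actual method uses.

```python
def pocket_score(seq_a, seq_b, match=5, mismatch=-1):
    """Toy position-wise similarity of two equally long pocket sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Pocket sequences must have the same length.")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


def select_similar_pockets(pockets, reference, max_similarity_cutoff=0.0):
    """Return the indices of pockets scoring within the cutoff of the best hit."""
    if not pockets:
        return []
    # The possible maximum is the score of the reference against itself.
    max_possible = pocket_score(reference, reference)
    scores = [pocket_score(reference, pocket) for pocket in pockets]
    threshold = max(scores) - max_possible * max_similarity_cutoff
    return [i for i, score in enumerate(scores) if score >= threshold]
```

Scaling the cutoff by the self-alignment score makes the same cutoff value comparable across reference sequences whose absolute scores differ.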

_post_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex], features: List, keep: bool = True) List[kinoml.core.systems.ProteinLigandComplex]

Run after featurizing all systems. Systems with a feature of None will be removed and listed in a log file in the current working directory. You shouldn’t need to redefine this method.

Parameters
  • systems (list of System) – The systems being featurized

  • features (list) – The features returned by self._featurize

  • keep (bool, optional=True) – Whether to store the current featurizer in the system.featurizations dictionary with its own key (self.name), in addition to last.

Returns

filtered_systems – The same systems as passed, but with .featurizations extended with the calculated features in two entries: the featurizer name and last. Systems with a feature of None will be removed.

Return type

systems

class kinoml.features.complexes.OEComplexFeaturizer(**kwargs)

Bases: kinoml.features.core.OEBaseModelingFeaturizer, SingleLigandProteinComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the complex structure by:

  • modeling missing loops with OESpruce according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

  • building missing side chains

  • modeling substitutions, deletions and insertions with OESpruce if a uniprot_id or sequence attribute is provided for the protein component; if an alteration cannot be modeled, the corresponding mismatch in the structure is deleted

  • removing everything but protein, water and ligand of interest

  • protonation at pH 7.4

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’OpenEye’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

  • name: A string specifying the name of the protein, will be used for generating the output file name.

  • chain_id: A string specifying which chain should be used.

  • alternate_location: A string specifying which alternate location should be used.

  • expo_id: A string specifying the ligand of interest. This is especially useful if multiple ligands are present in a PDB structure.

  • uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

  • sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.
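The precedence among these sequence sources (sequence over uniprot_id over the PDB header) can be sketched as follows. The function resolve_target_sequence and the fetch_uniprot_sequence callback are hypothetical; the callback stands in for the UniProt lookup so that the sketch needs no network access.

```python
def resolve_target_sequence(
    sequence=None,
    uniprot_id=None,
    header_sequence=None,
    fetch_uniprot_sequence=None,
):
    """Return (sequence, source) following the precedence described above."""
    if sequence:
        # An explicit sequence supersedes everything else.
        return sequence, "sequence"
    if uniprot_id:
        if fetch_uniprot_sequence is None:
            raise ValueError("A fetch function is required to resolve a UniProt ID.")
        # A UniProt ID supersedes the PDB header information.
        return fetch_uniprot_sequence(uniprot_id), "uniprot_id"
    # Fall back to the sequence information from the PDB header.
    return header_sequence, "pdb_header"
```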

The ligand component of each system must be a core.components.BaseLigand or a subclass thereof. The ligand component can have the following optional attributes:

  • name: A string specifying the name of the ligand, will be used for generating the output file name.

Parameters
  • loop_db (str) – The path to the loop database used by OESpruce to model missing loops.

  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

Note

If the ligand of interest is covalently bonded to the protein, the covalent bond will be broken. This may lead to the transformation of the ligand into a radical.

_SUPPORTED_TYPES = ()
_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None]

Prepare a protein structure.

Parameters

system (ProteinLigandComplex) – A system object holding a protein and a ligand component.

Returns

An MDAnalysis universe of the featurized system. None if no design unit was found.

Return type

Universe or None

class kinoml.features.complexes.OEDockingFeaturizer(method: str = 'Posit', pKa_norm: bool = True, **kwargs)

Bases: kinoml.features.core.OEBaseModelingFeaturizer, SingleLigandProteinComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the structure and dock the ligand into the prepared protein structure with one of OpenEye’s docking algorithms:

  • modeling missing loops with OESpruce according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

  • building missing side chains

  • modeling substitutions, deletions and insertions with OESpruce if a uniprot_id or sequence attribute is provided for the protein component; if an alteration cannot be modeled, the corresponding mismatch in the structure is deleted

  • removing everything but protein, water and ligand of interest

  • protonation at pH 7.4

  • perform docking

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’OpenEye’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

  • name: A string specifying the name of the protein, will be used for generating the output file name.

  • chain_id: A string specifying which chain should be used.

  • alternate_location: A string specifying which alternate location should be used.

  • expo_id: A string specifying a ligand bound to the protein of interest. This is especially useful if multiple proteins are found in one PDB structure.

  • uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

  • sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.

  • pocket_resids: List of integers specifying the residues in the binding pocket of interest. This attribute is required if docking with Fred into an apo structure.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES. Additionally, the ligand component can have the following optional attributes:

  • name: A string specifying the name of the ligand, will be used for generating the output file name.

Parameters
  • method (str, default="Posit") – The docking method to use [“Fred”, “Hybrid”, “Posit”].

  • loop_db (str) – The path to the loop database used by OESpruce to model missing loops.

  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

  • pKa_norm (bool, default=True) – Assign the predominant ionization state of the molecules to dock at pH ~7.4. If False, the ionization state of the input molecules will be conserved.

_SUPPORTED_TYPES = ()
_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None]

Prepare a protein structure and dock a ligand using the selected OpenEye docking method.

Parameters

system (ProteinLigandComplex) – A system object holding a protein and a ligand component.

Returns

An MDAnalysis universe of the featurized system. None if no design unit or docking pose was found.

Return type

Universe or None

static _store_docking_score(structure: Universe, docking_pose: openeye.oechem.OEGraphMol)

Store the docking score from OpenEye docking in the MDAnalysis universe._topology. If the Posit probability is available it will be stored as well. They cannot be stored in the universe object directly, because they will be lost during multiprocessing/pickling.

Parameters
  • structure (Universe) – The docked structure as MDAnalysis universe.

  • docking_pose (oechem.OEGraphMol) – The docking pose.

class kinoml.features.complexes.SCHRODINGERComplexFeaturizer(cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, max_retry: int = 3, build_loops: bool = True, **kwargs)

Bases: SingleLigandProteinComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the complex structure by:

  • modeling missing loops with Prime according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

  • building missing side chains

  • modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component: alterations are first deleted and the intended sequence is subsequently modeled with Prime; if an alteration cannot be modeled, the corresponding deletion remains

  • removing everything but protein, water and ligand of interest

  • protonation at pH 7.4

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’MDAnalysis’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

  • name: A string specifying the name of the protein, will be used for generating the output file name.

  • chain_id: A string specifying which chain should be used.

  • alternate_location: A string specifying which alternate location should be used.

  • expo_id: A string specifying the ligand of interest. This is especially useful if multiple ligands are present in a PDB structure.

  • uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

  • sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.

The ligand component of each system must be a core.components.BaseLigand or a subclass thereof. The ligand component can have the following optional attributes:

  • name: A string specifying the name of the ligand, will be used for generating the output file name.

Parameters
  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

  • max_retry (int, default=3) – The maximum number of attempts to run the prepwizard step.

  • build_loops (bool, default=True) – Whether missing loops should be built. This is also required to model mutations.

_SUPPORTED_TYPES = ()
_pre_featurize(systems: List[kinoml.core.systems.ProteinLigandComplex]) None

Check that SCHRODINGER variable exists.

_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None]

Prepare a protein structure.

Parameters

system (ProteinLigandComplex) – A system object holding a protein and a ligand component.

Returns

An MDAnalysis universe of the featurized system or None if not successful.

Return type

Universe or None

static _system_to_name(system: kinoml.core.systems.ProteinLigandComplex) str

Get a name of the system based on attributes of the protein and ligand component.

Parameters

system (ProteinLigandComplex) – The system with protein and ligand component.

Returns

A descriptive name of the system

Return type

str

_prepare_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) Union[pathlib.Path, None]

Prepare the structure with SCHRODINGER’s prepwizard.

Parameters

protein (Protein or KLIFSKinase) – The protein object whose structure will be prepared.

Returns

The path to the prepared structure if successful.

Return type

Path or None

_read_protein_structure(protein: Union[kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) Union[Universe, None]

Returns the protein structure of the given protein object as MDAnalysis universe.

Parameters

protein (Protein or KLIFSKinase) – The protein object.

Returns

The protein structure as MDAnalysis universe or None.

Return type

Universe or None

Raises

ValueError – If wrong toolkit was used during initialization of the protein object.

_preprocess_structure(pdb_path: Union[str, pathlib.Path], chain_id: Union[str, None], alternate_location: Union[str, None], expo_id: Union[str, None], sequence: str) pathlib.Path

Pre-process a structure for SCHRODINGER’s prepwizard with the following steps:
  • select chain of interest

  • select alternate location of interest

  • remove all ligands but ligand of interest

  • remove expression tags

  • delete protein alterations differing from given sequence

  • renumber protein residues according to the given sequence

Parameters
  • pdb_path (str or Path) – Path to the structure file in PDB format.

  • chain_id (str or None) – The chain ID of interest.

  • alternate_location (str or None) – The alternate location of interest.

  • expo_id (str or None) – The resname of the ligand of interest.

  • sequence (str) – The amino acid sequence of the protein.

Returns

The path to the cleaned structure.

Return type

Path

static _postprocess_structure(prepared_structure: Universe, protein: [kinoml.core.proteins.Protein, kinoml.core.proteins.KLIFSKinase]) Universe

Post-process a structure prepared with SCHRODINGER’s prepwizard with the following steps:
  • select the chain of interest

  • select the alternate location of interest

  • remove all ligands but the ligands of interest

  • update residue identifiers, e.g. atom indices, chain ID, residue IDs of non-protein residues

Parameters
  • prepared_structure (Universe) – The structure prepared by SCHRODINGER’s prepwizard.

  • protein (Protein or KLIFSKinase) – The protein component of the system.

Returns

The post-processed structure.

Return type

Universe

class kinoml.features.complexes.SCHRODINGERDockingFeaturizer(cache_dir: Union[str, pathlib.Path, None] = None, output_dir: Union[str, pathlib.Path, None] = None, max_retry: int = 3, build_loops: bool = True, shape_restrain: bool = True, **kwargs)

Bases: SCHRODINGERComplexFeaturizer

Given systems with exactly one protein and one ligand, prepare the structure and dock the ligand into its binding site identified by a co-crystallized ligand. The following steps will be performed:

  • modeling missing loops with Prime according to the PDB header, unless a custom sequence is specified via the uniprot_id or sequence attribute in the protein component (see below); missing sequences at N- and C-termini are not modeled

  • building missing side chains

  • modeling substitutions, deletions and insertions if a uniprot_id or sequence attribute is provided for the protein component: alterations are first deleted and the intended sequence is subsequently modeled with Prime; if an alteration cannot be modeled, the corresponding deletion remains

  • removing everything but protein, water and ligand of interest

  • protonation at pH 7.4

  • docking a ligand

The protein component of each system must be a core.proteins.Protein or a subclass thereof, must be initialized with toolkit=’MDAnalysis’ and give access to the molecular structure, e.g. via a pdb_id. Additionally, the protein component can have the following optional attributes to customize the protein modeling:

  • name: A string specifying the name of the protein, will be used for generating the output file name.

  • chain_id: A string specifying which chain should be used.

  • alternate_location: A string specifying which alternate location should be used.

  • expo_id: A string specifying a ligand bound to the protein of interest. This is especially useful if multiple proteins are found in one PDB structure.

  • uniprot_id: A string specifying the UniProt ID that will be used to fetch the amino acid sequence from UniProt, which will be used for modeling the protein. This will supersede the sequence information given in the PDB header.

  • sequence: A string specifying the amino acid sequence in one-letter-codes that should be used during modeling the protein. This will supersede a given uniprot_id and the sequence information given in the PDB header.

The ligand component of each system must be a core.ligands.Ligand or a subclass thereof and give access to the molecular structure, e.g. via a SMILES. Additionally, the ligand component can have the following optional attributes:

  • name: A string specifying the name of the ligand, will be used for generating the output file name and as molecule title in the docking pose SDF file.

  • macrocycle: A bool specifying if the ligand should be sampled as a macrocycle during docking. Docking will fail if SCHRODINGER does not consider the ligand a macrocycle.

Parameters
  • cache_dir (str, Path or None, default=None) – Path to directory used for saving intermediate files. If None, default location provided by appdirs.user_cache_dir() will be used.

  • output_dir (str, Path or None, default=None) – Path to directory used for saving output files. If None, output structures will not be saved.

  • use_multiprocessing (bool, default=True) – Whether to use multiprocessing.

  • n_processes (int or None, default=None) – How many processes to use in case of multiprocessing. Defaults to number of available CPUs.

  • max_retry (int, default=3) – The maximum number of attempts to run the prepwizard and docking steps.

  • build_loops (bool, default=True) – Whether missing loops should be built. This is also required to model mutations.

  • shape_restrain (bool, default=True) – Whether the docking should be performed with a shape restraint based on the co-crystallized ligand.

_SUPPORTED_TYPES = ()
_featurize_one(system: kinoml.core.systems.ProteinLigandComplex) Union[Universe, None]

Prepare a protein structure and dock a ligand.

Parameters

system (ProteinLigandComplex) – A system object holding a protein and a ligand component.

Returns

An MDAnalysis universe of the featurized system or None if not successful.

Return type

Universe or None

_dock_molecule(prepared_structure_path: pathlib.Path, system: kinoml.core.systems.ProteinLigandComplex, system_name: str) Union[pathlib.Path, None]

Dock the molecule into the protein with SCHRODINGER’s Glide.

Parameters
  • prepared_structure_path (Path) – A prepared protein structure, ready for docking.

  • system (ProteinLigandComplex) – The system that is being featurized.

  • system_name (str) – A descriptive name of the system.

Returns

The path to the generated docking pose, None if not successful.

Return type

Path or None

static _replace_ligand(pdb_path: pathlib.Path, docking_pose_sdf_path: pathlib.Path) Universe

Replace the ligand in a PDB file with a ligand in an SDF file.

Parameters
  • pdb_path (Path) – Path to the PDB file of the protein ligand complex.

  • docking_pose_sdf_path (Path) – Path to the molecule in SDF format that should be added to the structure.

Returns

The structure with replaced ligand.

Return type

Universe

static _store_docking_score(structure: Universe, docking_pose_path: pathlib.Path)

Store the docking score from SCHRODINGER docking in the MDAnalysis universe._topology. The score cannot be stored in the universe object directly, because it would be lost during multiprocessing/pickling.

Parameters
  • structure (Universe) – The docked structure as MDAnalysis universe.

  • docking_pose_path (Path) – The path to the docking pose.

_write_complex_mae(prepared_structure: Universe, docking_pose_path: pathlib.Path, complex_path_mae: pathlib.Path)

Write the new docked structure in MAE format.

Parameters
  • prepared_structure (Universe) – The prepared structure containing the docked ligand with resname LIG.

  • docking_pose_path (Path) – The prepared docking pose including correct bonding information.

  • complex_path_mae (Path) – The path for the output file in MAE format.