OpenEye Structural Featurizer

This notebook introduces structural modeling featurizers that use the OpenEye toolkits to prepare protein structures and to dock small molecules into their binding sites.

Note: All structural featurizers fetch data and/or perform expensive computations. Fetched data (e.g. PDB structures) and intermediate results (e.g. a prepared protein structure) are therefore stored in a cache directory, which speeds up calculations when the same or similar systems are featurized repeatedly. The cache directory can be set via the cache_dir parameter and defaults to the location given by user_cache_dir from appdirs. After updating KinoML, consider deleting the cache directory; otherwise, intermediate results produced by the previous version will be read from the cache and may affect your results.
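
Clearing the cache after an upgrade can be scripted. The following is a minimal sketch using only the standard library; `clear_cache` is a hypothetical helper name, not part of KinoML, and the path passed to it has to be whatever was used as cache_dir (or the default location reported by user_cache_dir). The demonstration runs on a throwaway directory rather than a real cache.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_dir):
    """Delete everything inside cache_dir but keep the directory itself.

    Hypothetical helper, not part of KinoML. Returns the number of
    top-level entries that were removed.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove sub-directories recursively
        else:
            entry.unlink()  # remove plain files and symlinks
        removed += 1
    return removed


# demonstration on a temporary directory standing in for the cache
demo_cache = Path(tempfile.mkdtemp())
(demo_cache / "4f8o.pdb").write_text("placeholder")
(demo_cache / "prepared").mkdir()
n_removed = clear_cache(demo_cache)
remaining = list(demo_cache.iterdir())
```

Keeping the directory itself and removing only its contents means a later featurizer run can write to the same location without recreating it.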

[1]:
%%capture --no-display
from importlib import resources
import inspect
from pathlib import Path

from appdirs import user_cache_dir

from kinoml.core.ligands import Ligand
from kinoml.core.proteins import Protein, KLIFSKinase
from kinoml.core.systems import ProteinSystem, ProteinLigandComplex
from kinoml.features.core import Pipeline
from kinoml.features.protein import OEProteinStructureFeaturizer
from kinoml.features.complexes import (
    OEComplexFeaturizer,
    OEDockingFeaturizer,
    MostSimilarPDBLigandFeaturizer,
    KLIFSConformationTemplatesFeaturizer,
)

OEProteinStructureFeaturizer

All OpenEye featurizers come with an extensive docstring explaining their capabilities and requirements.

[2]:
print(inspect.getdoc(OEProteinStructureFeaturizer))
Given systems with exactly one protein, prepare the protein structure by:

 - modeling missing loops with OESpruce according to the PDB header unless
   a custom sequence is specified via the `uniprot_id` or `sequence`
   attribute in the protein component (see below), missing sequences at
   N- and C-termini are not modeled
 - building missing side chains
 - substitutions, deletions and insertions, if a `uniprot_id` or `sequence`
   attribute is provided for the protein component alteration will be
   modeled with OESpruce, if an alteration could not be modeled, the
   corresponding mismatch in the structure will be deleted
 - removing everything but protein and water
 - protonation at pH 7.4

The protein component of each system must be a `core.proteins.Protein`
or a subclass thereof, must be initialized with toolkit='OpenEye' and
give access to a molecular structure, e.g. via a pdb_id. Additionally,
the protein component can have the following optional attributes to
customize the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
    generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
    should be used.
 - `expo_id`: A string specifying a ligand bound to the protein of
   interest. This is especially useful if multiple proteins are found in
   one PDB structure.
 - `uniprot_id`: A string specifying the UniProt ID that will be used to
   fetch the amino acid sequence from UniProt, which will be used for
   modeling the protein. This will supersede the sequence information
   given in the PDB header.
 - `sequence`: A  string specifying the amino acid sequence in
   one-letter-codes that should be used during modeling the protein. This
   will supersede a given `uniprot_id` and the sequence information given
   in the PDB header.

Parameters
----------
loop_db: str
    The path to the loop database used by OESpruce to model missing loops.
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
output_dir: str, Path or None, default=None
    Path to directory used for saving output files. If None, output
    structures will not be saved.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.

In general, these featurizers will work with a minimal amount of information, e.g. just a PDB ID. However, it is recommended to be as explicit as possible when defining the systems to featurize. For example, if a given PDB entry has multiple chains and ligands and none is specified, the featurizer has to guess which chain and ligand are of interest.

[3]:
# collect systems to featurize, i.e. prepare the protein structure
systems = []
[4]:
# unspecific definition of the system, only via PDB ID
# modeling will be performed according to the sequence stored in the PDB header
protein = Protein(pdb_id="4f8o", name="PsaA")
system = ProteinSystem(components=[protein])
systems.append(system)
[5]:
# more specific definition of the system, protein of chain A co-crystallized with ligand AES and
# alternate location B, modeling will be performed according to the sequence of the given
# UniProt ID
protein = Protein.from_pdb(pdb_id="4f8o", name="PsaA")
protein.uniprot_id = "P31522"
protein.chain_id = "A"
protein.alternate_location = "B"
protein.expo_id = "AES"
system = ProteinSystem(components=[protein])
systems.append(system)
[6]:
# use a protein structure from file
with resources.path("kinoml.data.proteins", "4f8o_edit.pdb") as structure_path:
    pass
protein = Protein.from_file(file_path=structure_path, name="PsaA")
protein.uniprot_id = "P31522"
system = ProteinSystem(components=[protein])
systems.append(system)
[7]:
with resources.path("kinoml.data.proteins", "kinoml_tests_4f8o_spruce.loop_db") as loop_db:
    pass
featurizer = OEProteinStructureFeaturizer(
    loop_db=loop_db,
    output_dir=user_cache_dir() + "/protein",
    use_multiprocessing=False,
)
[8]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[8]:
[<ProteinSystem with 1 components (<Protein name=PsaA>)>,
 <ProteinSystem with 1 components (<Protein name=PsaA>)>,
 <ProteinSystem with 1 components (<Protein name=PsaA>)>]

The featurizer returns the list of successfully featurized systems; each result is stored as an MDAnalysis universe in the system's featurizations attribute. Systems that failed are filtered out. To inspect failures, enable logging messages via:

import logging
logging.basicConfig(level=logging.DEBUG)
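
Setting the root logger to DEBUG also enables debug output from every other library. A more targeted variant is sketched below. It assumes that KinoML creates its loggers with `logging.getLogger(__name__)`, so that they all live below a logger named "kinoml"; this logger name is an assumption and is not confirmed by this notebook.

```python
import logging

# keep third-party libraries quiet: only warnings and errors by default
logging.basicConfig(
    level=logging.WARNING,
    format="%(levelname)s %(name)s: %(message)s",
)

# Assumption: KinoML modules log under the "kinoml" namespace.
kinoml_logger = logging.getLogger("kinoml")
kinoml_logger.setLevel(logging.DEBUG)

# child loggers without an own level inherit DEBUG from "kinoml"
child_level = logging.getLogger("kinoml.features").getEffectiveLevel()
```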
[9]:
systems[0]
[9]:
<ProteinSystem with 1 components (<Protein name=PsaA>)>
[10]:
systems[0].featurizations["last"]
[10]:
<Universe with 2381 atoms>

If an output_dir was provided, the prepared structure is saved in PDB and OEB format.

[11]:
for path in sorted(Path(user_cache_dir() + "/protein").glob("*")):
    print(path.name)
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.oeb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.pdb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.oeb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.pdb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.oeb
kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.pdb
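
The file names encode the featurizer, the protein name, the PDB ID and any chain or alternate location that was specified. As a small sketch, the listing can be grouped by prepared structure to check that every structure was written in both formats; `group_by_structure` is a hypothetical helper and the file names are copied from the output above.

```python
from collections import defaultdict
from pathlib import Path

# file names copied from the directory listing
file_names = [
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.oeb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_chainA_altlocB_protein.pdb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.oeb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_edit_protein.pdb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.oeb",
    "kinoml_OEProteinStructureFeaturizer_PsaA_4f8o_protein.pdb",
]


def group_by_structure(names):
    """Map each file stem to the sorted list of formats it was saved in."""
    groups = defaultdict(list)
    for name in names:
        path = Path(name)
        groups[path.stem].append(path.suffix.lstrip("."))
    return {stem: sorted(formats) for stem, formats in groups.items()}


grouped = group_by_structure(file_names)
for stem, formats in grouped.items():
    print(stem, formats)
```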

OEComplexFeaturizer

[12]:
print(inspect.getdoc(OEComplexFeaturizer))
Given systems with exactly one protein and one ligand, prepare the complex
structure by:

 - modeling missing loops with OESpruce according to the PDB header unless
   a custom sequence is specified via the `uniprot_id` or `sequence`
   attribute in the protein component (see below), missing sequences at
   N- and C-termini are not modeled
 - building missing side chains
 - substitutions, deletions and insertions, if a `uniprot_id` or `sequence`
   attribute is provided for the protein component alteration will be
   modeled with OESpruce, if an alteration could not be modeled, the
   corresponding mismatch in the structure will be deleted
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='OpenEye' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
   should be used.
 - `expo_id`: A string specifying the ligand of interest. This is
   especially useful if multiple ligands are present in a PDB structure.
 - `uniprot_id`: A string specifying the UniProt ID that will be used to
   fetch the amino acid sequence from UniProt, which will be used for
   modeling the protein. This will supersede the sequence information
   given in the PDB header.
 - `sequence`: A string specifying the amino acid sequence in
   one-letter-codes that should be used during modeling the protein. This
   will supersede a given `uniprot_id` and the sequence information given
   in the PDB header.

The ligand component of each system must be a `core.components.BaseLigand`
or a subclass thereof. The ligand component can have the following
optional attributes:

 - `name`: A string specifying the name of the ligand, will be used for
   generating the output file name.

Parameters
----------
loop_db: str
    The path to the loop database used by OESpruce to model missing loops.
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
output_dir: str, Path or None, default=None
    Path to directory used for saving output files. If None, output
    structures will not be saved.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.

Note
----
If the ligand of interest is covalently bonded to the protein, the
covalent bond will be broken. This may lead to the transformation of the
ligand into a radical.
[13]:
systems = []
[14]:
protein = Protein(pdb_id="4f8o", name="PsaA")
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[15]:
protein = Protein.from_pdb(pdb_id="4f8o", name="PsaA")
protein.uniprot_id = "P31522"
protein.chain_id = "A"
protein.alternate_location = "B"
protein.expo_id = "AES"
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[16]:
featurizer = OEComplexFeaturizer(
    output_dir=user_cache_dir() + "/complex",
    use_multiprocessing=False,
)
[17]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[17]:
[<ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>,
 <ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>]

If an output_dir was provided, the prepared structure is saved in PDB and OEB format. The prepared ligand is additionally saved in SDF format.

[18]:
for path in sorted(Path(user_cache_dir() + "/complex").glob("*")):
    print(path.name)
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_complex.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_complex.pdb
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_ligand.sdf
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_protein.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_AEBSF_protein.pdb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_complex.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_complex.pdb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_ligand.sdf
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_protein.oeb
kinoml_OEComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_protein.pdb

OEDockingFeaturizer

The OEDockingFeaturizer supports three docking methods:

  • Fred: standard docking protocol

  • Hybrid: docking biased by the co-crystallized ligand

  • Posit: docking whose bias depends on the similarity to the co-crystallized ligand
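
Which inputs a system needs depends on the chosen method. The sketch below encodes this as a small pre-flight check. `check_docking_inputs` is a hypothetical helper, not part of KinoML. The rule for Fred follows the OEDockingFeaturizer docstring (pocket_resids is required when docking into an apo structure); that Hybrid and Posit cannot run without a co-crystallized ligand is an assumption derived from the descriptions of these methods.

```python
def check_docking_inputs(method, has_cocrystallized_ligand, pocket_resids=None):
    """Return a list of problems found for a planned docking run.

    Hypothetical helper for illustration; an empty list means that no
    problem was detected.
    """
    if method not in ("Fred", "Hybrid", "Posit"):
        return [f"unknown method {method!r}"]
    problems = []
    if method == "Fred":
        # Fred needs a binding site definition: a bound ligand or residue IDs
        if not has_cocrystallized_ligand and not pocket_resids:
            problems.append("Fred on an apo structure needs pocket_resids")
    elif not has_cocrystallized_ligand:
        # Assumption: Hybrid and Posit rely on a co-crystallized ligand
        problems.append(f"{method} needs a co-crystallized ligand")
    return problems


ok_fred = check_docking_inputs("Fred", False, pocket_resids=[516, 517, 521])
bad_fred = check_docking_inputs("Fred", False)
bad_hybrid = check_docking_inputs("Hybrid", False)
ok_posit = check_docking_inputs("Posit", True)
```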

[19]:
print(inspect.getdoc(OEDockingFeaturizer))
Given systems with exactly one protein and one ligand, prepare the
structure and dock the ligand into the prepared protein structure with
one of OpenEye's docking algorithms:

 - modeling missing loops with OESpruce according to the PDB header unless
   a custom sequence is specified via the `uniprot_id` or `sequence`
   attribute in the protein component (see below), missing sequences at
   N- and C-termini are not modeled
 - building missing side chains
 - substitutions, deletions and insertions, if a `uniprot_id` or `sequence`
   attribute is provided for the protein component alteration will be
   modeled with OESpruce, if an alteration could not be modeled, the
   corresponding mismatch in the structure will be deleted
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4
 - perform docking

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='OpenEye' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
   should be used.
 - `expo_id`: A string specifying a ligand bound to the protein of
   interest. This is especially useful if multiple proteins are found in
   one PDB structure.
 - `uniprot_id`: A string specifying the UniProt ID that will be used to
   fetch the amino acid sequence from UniProt, which will be used for
   modeling the protein. This will supersede the sequence information
   given in the PDB header.
 - `sequence`: A string specifying the amino acid sequence in
   one-letter-codes that should be used during modeling the protein. This
   will supersede a given `uniprot_id` and the sequence information given
   in the PDB header.
 - `pocket_resids`: List of integers specifying the residues in the
   binding pocket of interest. This attribute is required if docking with
   Fred into an apo structure.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES. Additionally, the ligand component can have the following optional
attributes:

 - `name`: A string specifying the name of the ligand, will be used for
   generating the output file name.

Parameters
----------
method: str, default="Posit"
    The docking method to use ["Fred", "Hybrid", "Posit"].
loop_db: str
    The path to the loop database used by OESpruce to model missing loops.
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
output_dir: str, Path or None, default=None
    Path to directory used for saving output files. If None, output
    structures will not be saved.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.
pKa_norm: bool, default=True
    Assign the predominant ionization state of the molecules to dock at pH
    ~7.4. If False, the ionization state of the input molecules will be
    conserved.

Fred

[20]:
systems = []
[21]:
# define the binding site for docking via co-crystallized ligand
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[22]:
# define the binding site for docking via residue IDs
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.pocket_resids = [
    516, 517, 521, 524, 542, 544, 573, 589, 590, 591, 592, 595, 596, 654, 655, 656, 657, 667, 668
]
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib_2")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[23]:
featurizer = OEDockingFeaturizer(
    output_dir=user_cache_dir() + "/Fred",
    method="Fred",
    use_multiprocessing=False,
)
[24]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Warning: Failed to find residue GLY-7.
Warning: Failed to find residue ALA-6.
Warning: Failed to find residue MET-5.
DPI: 0.12, RFree: 0.22, Resolution: 2.02
Processing BU # 1 with title: HIGH AFFINITY NERVE GROWTH FACTOR RECEPTOR, chains A, alt: A
Warning: For residue ARG 702   A 1   removing clashing solvent molecule HOH 908   A 2
Warning: There was a problem building some missing pieces, built as much as was possible
Processing BU # 2 with title: HIGH AFFINITY NERVE GROWTH FACTOR RECEPTOR, chains A, alt: B
Warning: For residue ARG 702   A 1   removing clashing solvent molecule HOH 908   A 2
Warning: There was a problem building some missing pieces, built as much as was possible
[24]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>,
 <ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib_2>)>]

Docking scores are stored in the returned MDAnalysis universe.

[25]:
[system.featurizations["last"]._topology.docking_score for system in systems]
[25]:
[-17.801493, -3.960361]
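
When several systems were docked, the scores can be used to rank them. The sketch below works on the two scores printed above and assumes that lower (more negative) scores indicate better predicted poses, which is the convention of OpenEye's Chemgauss scoring functions; verify this for the scoring function in use.

```python
# (ligand name, docking score) pairs copied from the Fred run above
scores = [
    ("larotrectinib", -17.801493),    # pocket defined via co-crystallized ligand
    ("larotrectinib_2", -3.960361),   # pocket defined via residue IDs
]

# Assumption: a lower score means a better pose
ranked = sorted(scores, key=lambda item: item[1])
best_name, best_score = ranked[0]
print(best_name, best_score)
```

In this run, the system whose binding site was defined via the co-crystallized ligand received the markedly better score.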

Hybrid

[26]:
systems = []
[27]:
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[28]:
featurizer = OEDockingFeaturizer(
    output_dir=user_cache_dir() + "/Hybrid",
    method="Hybrid"
)
[29]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[29]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

Posit

[30]:
systems = []
[31]:
protein = Protein(pdb_id="4yne", name="NTRK1")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[32]:
featurizer = OEDockingFeaturizer(
    output_dir=user_cache_dir() + "/Posit",
    method="Posit"
)
[33]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[33]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

Besides the docking score, the Posit probability is also stored in the returned MDAnalysis universe.

[34]:
systems[0].featurizations["last"]._topology.posit_probability
[34]:
0.5

MostSimilarPDBLigandFeaturizer

Manually specifying the most suitable PDB structure to dock into is not practical for a larger set of ligands. Hence, the MostSimilarPDBLigandFeaturizer was implemented, which can find the most suitable structure for docking in the PDB based on ligand similarity. The user can choose one of the following similarity metrics:

  • Fingerprint

  • Maximum common substructure (MCS)

  • OpenEye’s shape

  • Schrodinger’s shape
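
To illustrate the idea behind fingerprint-based selection, the sketch below compares a query ligand with candidate ligands using the Tanimoto coefficient on sets of "on" bits. This is a toy example in plain Python: the bit sets and ligand names are made up, and whether KinoML uses exactly this coefficient and which fingerprint type it uses are assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(union)


# made-up fingerprints: the query and three co-crystallized ligands
query = {1, 4, 7, 9, 15}
candidates = {
    "lig_A": {1, 4, 7, 9, 20},
    "lig_B": {2, 3, 15},
    "lig_C": {1, 4, 7, 9, 15},
}

scores = {name: tanimoto(query, bits) for name, bits in candidates.items()}
most_similar = max(scores, key=scores.get)
print(most_similar, scores[most_similar])
```

The structure containing the highest-scoring ligand would then be the one selected for docking.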

[35]:
print(inspect.getdoc(MostSimilarPDBLigandFeaturizer))
Find the most similar co-crystallized ligand in the PDB according to a
given SMILES and UniProt ID.

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, and must be initialized with a `uniprot_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structure with the most
    similar ligand ["fingerprint", "mcs", "openeye_shape",
    "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.

Note
----
The toolkit ['MDAnalysis' or 'OpenEye'] specified in the protein object
initialization should fit the required toolkit when subsequently applying
the OEDockingFeaturizer or SCHRODINGERDockingFeaturizer.

Maximum common substructure (MCS)

[36]:
systems = []
[37]:
protein = Protein(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[38]:
featurizer = MostSimilarPDBLigandFeaturizer(
    similarity_metric="mcs"
)
[39]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
[39]:
('4YNE', 'A', '4EK')
5.51 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Fingerprint

[40]:
featurizer = MostSimilarPDBLigandFeaturizer(
    similarity_metric="fingerprint"
)
[41]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
[41]:
('4YNE', 'A', '4EK')
5.04 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

OpenEye’s shape

[42]:
featurizer = MostSimilarPDBLigandFeaturizer(
    similarity_metric="openeye_shape"
)
[43]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
[43]:
('4YNE', 'A', '4EK')
4min 25s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

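Shape-based similarity is clearly the slowest option (about 4.5 min here versus about 5 s for the fingerprint and substructure searches), but in many cases it is the most accurate one.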

Pipeline of MostSimilarPDBLigandFeaturizer and OEDockingFeaturizer

The MostSimilarPDBLigandFeaturizer can be joined with the OEDockingFeaturizer into a Pipeline featurizer.
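
Conceptually, a pipeline passes the systems returned by one featurizer on to the next. The sketch below shows this chaining with toy stand-ins; `ToyPipeline`, `PickTemplate` and `Dock` are hypothetical classes for illustration and do not reproduce KinoML's Pipeline implementation.

```python
class ToyPipeline:
    """Apply featurizers one after another (illustration only)."""

    def __init__(self, featurizers):
        self.featurizers = list(featurizers)

    def featurize(self, systems):
        for featurizer in self.featurizers:
            # the output of one step is the input of the next
            systems = featurizer.featurize(systems)
        return systems


class PickTemplate:
    """Stand-in for a template search: annotates each system with a PDB ID."""

    def featurize(self, systems):
        for system in systems:
            system["pdb_id"] = "4YNE"  # placeholder result
        return systems


class Dock:
    """Stand-in for docking: only systems with a template can be docked."""

    def featurize(self, systems):
        docked = [system for system in systems if "pdb_id" in system]
        for system in docked:
            system["docked"] = True
        return docked


pipeline = ToyPipeline([PickTemplate(), Dock()])
systems = pipeline.featurize([{"name": "NTRK1"}])
print(systems)
```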

[44]:
systems = []
[45]:
protein = Protein(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[46]:
featurizer = Pipeline([
    MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint"),
    OEDockingFeaturizer(output_dir=user_cache_dir() + "/docking_pipeline", method="Posit"),
])
[47]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Warning: Failed to find residue GLY-7.
Warning: Failed to find residue ALA-6.
Warning: Failed to find residue MET-5.
DPI: 0.12, RFree: 0.22, Resolution: 2.02
Processing BU # 1 with title: HIGH AFFINITY NERVE GROWTH FACTOR RECEPTOR, chains A, alt: A
Warning: For residue ARG 702   A 1   removing clashing solvent molecule HOH 908   A 2
Warning: There was a problem building some missing pieces, built as much as was possible
   Falling back to charging protein with OEMMFF94Charges
Processing BU # 2 with title: HIGH AFFINITY NERVE GROWTH FACTOR RECEPTOR, chains A, alt: B
Warning: For residue ARG 702   A 1   removing clashing solvent molecule HOH 908   A 2
Warning: There was a problem building some missing pieces, built as much as was possible
   Falling back to charging protein with OEMMFF94Charges
Warning: No BioAssembly transforms found, using input molecule as biounit: HIGH AFFINITY NERVE GROWTH FACTOR RECEPTOR(A)altA
Warning: Iridium - Structure: HIGH AFFINITY NERVE GROWTH FACTOR RECEPTOR(A)altA has no REMARK data
Processing BU # 1 with title: HIGH AFFINITY NERVE GROWTH FACTOR RECEPTOR(A)altA, chains A
Warning: There was a problem building some missing pieces, built as much as was possible
[47]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]
[48]:
systems[0].featurizations
[48]:
{'last': <Universe with 4783 atoms>,
 'Pipeline([MostSimilarPDBLigandFeaturizer, OEDockingFeaturizer])': <Universe with 4783 atoms>}

KLIFSConformationTemplatesFeaturizer

The KLIFSConformationTemplatesFeaturizer searches for suitable templates to model a kinase:ligand complex in different conformations. The templates are selected based on ligand and sequence similarity.

[49]:
print(inspect.getdoc(KLIFSConformationTemplatesFeaturizer))
Find suitable kinase templates for modeling a kinase:inhibitor complex in
different KLIFS conformations.

The protein component of each system must be a `core.proteins.KLIFSKinase`,
and must be initialized with a `uniprot_id` or `kinase_klifs_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structures with similar
    ligands ["fingerprint", "mcs", "openeye_shape", "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.
[50]:
systems = []
[51]:
protein = KLIFSKinase(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[52]:
featurizer = KLIFSConformationTemplatesFeaturizer(
    similarity_metric="fingerprint"
)
[53]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[53]:
[<ProteinLigandComplex with 2 components (<KLIFSKinase name=NTRK1>, <Ligand name=larotrectinib>)>]
[54]:
systems[0].featurizations["last"]
[54]:
        dfg ac_helix pdb_id chain_id expo_id  ligand_similarity  pocket_similarity
0        in       in   4yne        A     4EK           0.568047              443.0
1        in      out   6tfp        A     N6Z           0.534031              215.0
2       out       in   4pmp        A     31W           0.482759              443.0
3  out-like       in   6brj        A     VX6           0.521739              279.0
4  out-like      out   3aqv        A     TAK           0.435754              171.0
5       out      out   5jfv        A     6K1           0.491620              422.0
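
The returned table can be filtered like any other tabular data. The sketch below uses plain Python dictionaries with values copied from the output above (a subset of the columns) to pick the template with the most similar ligand and to list the templates in the DFG-in conformation. The actual featurization may well be a pandas DataFrame, in which case the equivalent DataFrame operations apply.

```python
# rows copied from the template table (subset of columns)
templates = [
    {"dfg": "in", "ac_helix": "in", "pdb_id": "4yne", "ligand_similarity": 0.568047},
    {"dfg": "in", "ac_helix": "out", "pdb_id": "6tfp", "ligand_similarity": 0.534031},
    {"dfg": "out", "ac_helix": "in", "pdb_id": "4pmp", "ligand_similarity": 0.482759},
    {"dfg": "out-like", "ac_helix": "in", "pdb_id": "6brj", "ligand_similarity": 0.521739},
    {"dfg": "out-like", "ac_helix": "out", "pdb_id": "3aqv", "ligand_similarity": 0.435754},
    {"dfg": "out", "ac_helix": "out", "pdb_id": "5jfv", "ligand_similarity": 0.491620},
]

# template whose co-crystallized ligand is most similar to the query ligand
best = max(templates, key=lambda row: row["ligand_similarity"])

# all templates in the DFG-in conformation
dfg_in = [row["pdb_id"] for row in templates if row["dfg"] == "in"]
print(best["pdb_id"], dfg_in)
```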