Schrodinger Structural Featurizer¶
This notebook introduces structural modeling featurizers using molecular modeling capabilities from the Schrodinger Suite to prepare protein structures and to dock small molecules into their binding sites.
Note: All structural featurizers fetch data and/or do expensive computations. Hence, fetched data (e.g. PDB structures) and intermediate results (e.g. a prepared protein structure) are stored in a cache directory to speed up calculations when featurizing the same or similar systems multiple times. The cache directory can be specified via the cache_dir
parameter, but also has a default (user_cache_dir
from appdirs
). In case you update your KinoML version you should consider
deleting the cache directory. Otherwise, you may get results from the former KinoML version, since the intermediate results will be taken from cache.
[1]:
%%capture --no-display
from importlib import resources
import inspect
from pathlib import Path
from appdirs import user_cache_dir
from kinoml.core.ligands import Ligand
from kinoml.core.proteins import Protein, KLIFSKinase
from kinoml.core.systems import ProteinLigandComplex
from kinoml.features.core import Pipeline
from kinoml.features.complexes import (
SCHRODINGERComplexFeaturizer,
SCHRODINGERDockingFeaturizer,
MostSimilarPDBLigandFeaturizer,
KLIFSConformationTemplatesFeaturizer,
)
SCHRODINGERComplexFeaturizer¶
All Schrodinger Featurizers come with an extensive doc string explaining the capabilities and requirements.
[2]:
print(inspect.getdoc(SCHRODINGERComplexFeaturizer))
Given systems with exactly one protein and one ligand, prepare the complex
structure by:
- modeling missing loops with Prime according to the PDB header unless
a custom sequence is specified via the `uniprot_id` or `sequence`
attribute in the protein component (see below), missing sequences at
N- and C-termini are not modeled
- building missing side chains
- substitutions, deletions and insertions, if a `uniprot_id` or `sequence`
attribute is provided for the protein component alteration will be first
deleted and subsequently the intended sequence modeled with Prime, if
an alteration could not be modeled, a corresponding deletion will remain
- removing everything but protein, water and ligand of interest
- protonation at pH 7.4
The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='MDAnalysis' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:
- `name`: A string specifying the name of the protein, will be used for
generating the output file name.
- `chain_id`: A string specifying which chain should be used.
- `alternate_location`: A string specifying which alternate location
should be used.
- `expo_id`: A string specifying the ligand of interest. This is
especially useful if multiple ligands are present in a PDB structure.
- `uniprot_id`: A string specifying the UniProt ID that will be used to
fetch the amino acid sequence from UniProt, which will be used for
modeling the protein. This will supersede the sequence information
given in the PDB header.
- `sequence`: A string specifying the amino acid sequence in
one-letter-codes that should be used during modeling the protein. This
will supersede a given `uniprot_id` and the sequence information given
in the PDB header.
The ligand component of each system must be a `core.components.BaseLigand`
or a subclass thereof. The ligand component can have the following
optional attributes:
- `name`: A string specifying the name of the ligand, will be used for
generating the output file name.
Parameters
----------
cache_dir: str, Path or None, default=None
Path to directory used for saving intermediate files. If None, default
location provided by `appdirs.user_cache_dir()` will be used.
output_dir: str, Path or None, default=None
Path to directory used for saving output files. If None, output
structures will not be saved.
use_multiprocessing : bool, default=True
If multiprocessing to use.
n_processes : int or None, default=None
How many processes to use in case of multiprocessing. Defaults to
number of available CPUs.
max_retry: int, default=3
The maximal number of attempts to try running the prepwizard step.
build_loops: bool, default=True
If missing loops shell be built. Is also needed to model mutations.
In general these featurizers will work with a minimal amount of information, e.g. just a PDB ID. However, it is recommended to be explicit as possible when defining the systems to featurize. For example, if a given PDB entry has multiple chains and ligands, the featurizer will have to guess which chain and ligand is of interest if not explicitly stated.
[3]:
# collect systems to featurize, i.e. prepare the protein structure
systems = []
[4]:
# unspecifc definition of the system, only via PDB ID
# modeling will be performed according to the sequence stored in the PDB Header
protein = Protein(pdb_id="4f8o", name="PsaA", toolkit="MDAnalysis")
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[5]:
# more specific definition of the system, protein of chain A co-crystallized with ligand AES and
# alternate location B, modeling will be performed according to the sequence of the given
# UniProt ID
protein = Protein.from_pdb(pdb_id="4f8o", name="PsaA", toolkit="MDAnalysis")
protein.uniprot_id = "P31522"
protein.chain_id = "A"
protein.alternate_location = "B"
protein.expo_id = "AES"
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
[6]:
featurizer = SCHRODINGERComplexFeaturizer(output_dir="output/complex")
The featurizers will return the featurized systems as an MDAnalysis universe. Systems that failed will be filtered out. In case one is interested in failures, one can enable logging messages via:
import logging
logging.basicConfig(level=logging.DEBUG)
[7]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[7]:
[<ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>,
<ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>]
[8]:
systems[0].featurizations["last"]
[8]:
<Universe with 2506 atoms>
If an output_dir
was provided, the prepared structure is saved in PDB format.
[9]:
for path in sorted(Path("output/complex").glob("*")):
print(path.name)
kinoml_SCHRODINGERComplexFeaturizer_PsaA_4f8o_AEBSF_complex.pdb
kinoml_SCHRODINGERComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_complex.pdb
SCHRODINGERDockingFeaturizer¶
Docking can be performed with and without shape restrain to the co-crystallized ligand. Moreover, the protein structure for docking must contain a co-crystallized ligand, which is required for the pocket definition.
[10]:
print(inspect.getdoc(SCHRODINGERDockingFeaturizer))
Given systems with exactly one protein and one ligand, prepare the
structure dock the ligand into its binding site identified by a
co-crystallized ligand. The following steps will be performed:
- modeling missing loops with Prime according to the PDB header unless
a custom sequence is specified via the `uniprot_id` or `sequence`
attribute in the protein component (see below), missing sequences at
N- and C-termini are not modeled
- building missing side chains
- substitutions, deletions and insertions, if a `uniprot_id` or `sequence`
attribute is provided for the protein component alteration will be first
deleted and subsequently the intended sequence modeled with Prime, if
an alteration could not be modeled, a corresponding deletion will remain
- removing everything but protein, water and ligand of interest
- protonation at pH 7.4
- docking a ligand
The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='MDAnalysis' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:
- `name`: A string specifying the name of the protein, will be used for
generating the output file name.
- `chain_id`: A string specifying which chain should be used.
- `alternate_location`: A string specifying which alternate location
should be used.
- `expo_id`: A string specifying a ligand bound to the protein of
interest. This is especially useful if multiple proteins are found in
one PDB structure.
- `uniprot_id`: A string specifying the UniProt ID that will be used to
fetch the amino acid sequence from UniProt, which will be used for
modeling the protein. This will supersede the sequence information
given in the PDB header.
- `sequence`: A string specifying the amino acid sequence in
one-letter-codes that should be used during modeling the protein. This
will supersede a given `uniprot_id` and the sequence information given
in the PDB header.
The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES. Additionally, the ligand component can have the following optional
attributes:
- `name`: A string specifying the name of the ligand, will be used for
generating the output file name and as molecule title in the docking
pose SDF file.
- `macrocycle`: A bool specifying if the ligand shell be sampled as a
macrocycle during docking. Docking will fail, if SCHRDODINGER does not
consider the ligand a macrocycle.
Parameters
----------
cache_dir: str, Path or None, default=None
Path to directory used for saving intermediate files. If None, default
location provided by `appdirs.user_cache_dir()` will be used.
output_dir: str, Path or None, default=None
Path to directory used for saving output files. If None, output
structures will not be saved.
use_multiprocessing : bool, default=True
If multiprocessing to use.
n_processes : int or None, default=None
How many processes to use in case of multiprocessing. Defaults to
number of available CPUs.
max_retry: int, default=3
The maximal number of attempts to try running the prepwizard and
docking steps.
build_loops: bool, default=True
If missing loops shell be built. Is also needed to model mutations.
shape_restrain: bool, default=True
If the docking shell be performed with shape restrain based on the
co-crystallized ligand.
Without shape restrain¶
[11]:
systems = []
[12]:
protein = Protein(pdb_id="4yne", uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[13]:
featurizer = SCHRODINGERDockingFeaturizer(
output_dir="output/docking_without_shape_restrain",
shape_restrain=False
)
[14]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Converted file: /scratch/lsftmp/6525607.tmpdir/tmpq7th966w.mae
Removing previous job files...
JobId: lu04-0-6262a247
ExitStatus: finished
Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_without_shape_restrain/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae
[14]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]
Docking scores are stored in the returned MDAnalysis universe.
[15]:
systems[0].featurizations["last"]._topology.docking_score
[15]:
-10.1766
If an output_dir
was provided, the prepared structure is saved in PDB and MAE format, the prepared ligand is additionally saved in SDF format.
[16]:
for path in sorted(Path("output/docking_without_shape_restrain").glob("*")):
print(path.name)
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.pdb
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_ligand.sdf
With shape restrain¶
[17]:
systems = []
[18]:
protein = Protein(pdb_id="4yne", uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[19]:
featurizer = SCHRODINGERDockingFeaturizer(
output_dir="output/docking_with_shape_restrain",
shape_restrain=True,
)
[20]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Converted file: /scratch/lsftmp/6525607.tmpdir/tmp195pmphc.mae
Removing previous job files...
JobId: lu04-0-6262a33d
ExitStatus: finished
Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_with_shape_restrain/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae
[20]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]
MostSimilarPDBLigandFeaturizer¶
Manually specifying the most suitable PDB structure to dock into is not practical for a larger set of ligands. Hence, the MostSimilarPDBLigandFeaturizer
was implemented, wich can find the most suitable structure for docking in the PDB based on ligand similarity. The user can choose from one the following similarity metrics:
Fingerprint
Most common substructure
OpenEye’s shape
Schrodinger’s shape
[21]:
print(inspect.getdoc(MostSimilarPDBLigandFeaturizer))
Find the most similar co-crystallized ligand in the PDB according to a
given SMILES and UniProt ID.
The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, and must be initialized with a `uniprot_id` parameter.
The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.
Parameters
----------
similarity_metric: str, default="fingerprint"
The similarity metric to use to detect the structure with the most
similar ligand ["fingerprint", "mcs", "openeye_shape",
"schrodinger_shape"].
cache_dir: str, Path or None, default=None
Path to directory used for saving intermediate files. If None, default
location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
If multiprocessing to use.
n_processes : int or None, default=None
How many processes to use in case of multiprocessing. Defaults to
number of available CPUs.
Note
----
The toolkit ['MDAnalysis' or 'OpenEye'] specified in the protein object
initialization should fit the required toolkit when subsequently applying
the OEDockingFeaturizer or SCHRODINGERDockingFeaturizer.
Most common substructure¶
[22]:
systems = []
[23]:
protein = Protein(uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[24]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="mcs")
[25]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
[25]:
('4YNE', 'A', '4EK')
12.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Fingerprint¶
[26]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint")
[27]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
[27]:
('4YNE', 'A', '4EK')
6.91 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Schrodinger’s shape¶
[28]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="schrodinger_shape")
[29]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_NUMERIC = "C",
LC_TIME = "C",
LANG = "C.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
JobId: lu04-0-6262a42c
[29]:
('4YPS', 'A', '4F6')
36.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Using shape is the slowest option, but in many cases the most accurate one.
Pipeline of MostSimilarPDBLigandFeaturizer and SCHRODINGERDockingFeaturizer¶
The MostSimilarPDBLigandFeaturizer
can be joined with the SCHRODINGERDockingFeaturizer
into a Pipeline
featurizer.
[30]:
systems = []
[31]:
protein = Protein(uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[32]:
featurizer = Pipeline([
MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint"),
SCHRODINGERDockingFeaturizer(output_dir="output/docking_pipeline"),
])
[33]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Converted file: /scratch/lsftmp/6525607.tmpdir/tmp5xoq3zd9.mae
Removing previous job files...
JobId: lu04-0-6262a474
ExitStatus: finished
Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_pipeline/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4YNE_chainA_larotrectinib_complex.mae
[33]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]
[34]:
systems[0].featurizations
[34]:
{'last': <Universe with 4964 atoms>,
'Pipeline([MostSimilarPDBLigandFeaturizer, SCHRODINGERDockingFeaturizer])': <Universe with 4964 atoms>}
KLIFSConformationTemplatesFeaturizer¶
The KLIFSConformationTemplatesFeaturizer
searches for suitable templates to model a kinase:ligand complex in different conformations. The templates are selected based on ligand and sequence similarity.
[35]:
print(inspect.getdoc(KLIFSConformationTemplatesFeaturizer))
Find suitable kinase templates for modeling a kinase:inhibitor complex in
different KLIFS conformations.
The protein component of each system must be a `core.proteins.KLIFSKinase`,
and must be initialized with a `uniprot_id` or `kinase_klifs_id` parameter.
The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.
Parameters
----------
similarity_metric: str, default="fingerprint"
The similarity metric to use to detect the structures with similar
ligands ["fingerprint", "mcs", "openeye_shape", "schrodinger_shape"].
cache_dir: str, Path or None, default=None
Path to directory used for saving intermediate files. If None, default
location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
If multiprocessing to use.
n_processes : int or None, default=None
How many processes to use in case of multiprocessing. Defaults to
number of available CPUs.
[36]:
systems = []
[37]:
protein = KLIFSKinase(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[38]:
featurizer = KLIFSConformationTemplatesFeaturizer(
similarity_metric="fingerprint"
)
[39]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[39]:
[<ProteinLigandComplex with 2 components (<KLIFSKinase name=NTRK1>, <Ligand name=larotrectinib>)>]
[40]:
systems[0].featurizations["last"]
[40]:
dfg | ac_helix | pdb_id | chain_id | expo_id | ligand_similarity | pocket_similarity | |
---|---|---|---|---|---|---|---|
0 | in | in | 4yne | A | 4EK | 0.568047 | 443.0 |
1 | in | out | 6tfp | A | N6Z | 0.534031 | 215.0 |
2 | out | in | 4pmp | A | 31W | 0.482759 | 443.0 |
3 | out-like | in | 6brj | A | VX6 | 0.521739 | 279.0 |
4 | out-like | out | 3aqv | A | TAK | 0.435754 | 171.0 |
5 | out | out | 5jfv | A | 6K1 | 0.491620 | 422.0 |