Schrodinger Structural Featurizer

This notebook introduces structural modeling featurizers using molecular modeling capabilities from the Schrodinger Suite to prepare protein structures and to dock small molecules into their binding sites.

Note: All structural featurizers fetch data and/or do expensive computations. Hence, fetched data (e.g. PDB structures) and intermediate results (e.g. a prepared protein structure) are stored in a cache directory to speed up calculations when featurizing the same or similar systems multiple times. The cache directory can be specified via the cache_dir parameter, but also has a default (user_cache_dir from appdirs). In case you update your KinoML version you should consider deleting the cache directory. Otherwise, you may get results from the former KinoML version, since the intermediate results will be taken from cache.

[1]:
%%capture --no-display
from importlib import resources
import inspect
from pathlib import Path

from appdirs import user_cache_dir

from kinoml.core.ligands import Ligand
from kinoml.core.proteins import Protein, KLIFSKinase
from kinoml.core.systems import ProteinLigandComplex
from kinoml.features.core import Pipeline
from kinoml.features.complexes import (
    SCHRODINGERComplexFeaturizer,
    SCHRODINGERDockingFeaturizer,
    MostSimilarPDBLigandFeaturizer,
    KLIFSConformationTemplatesFeaturizer,
)

SCHRODINGERComplexFeaturizer

All Schrodinger Featurizers come with an extensive doc string explaining the capabilities and requirements.

[2]:
print(inspect.getdoc(SCHRODINGERComplexFeaturizer))
Given systems with exactly one protein and one ligand, prepare the complex
structure by:

 - modeling missing loops with Prime according to the PDB header unless
   a custom sequence is specified via the `uniprot_id` or `sequence`
   attribute in the protein component (see below), missing sequences at
   N- and C-termini are not modeled
 - building missing side chains
 - substitutions, deletions and insertions, if a `uniprot_id` or `sequence`
   attribute is provided for the protein component alteration will be first
   deleted and subsequently the intended sequence modeled with Prime, if
   an alteration could not be modeled, a corresponding deletion will remain
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='MDAnalysis' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
   should be used.
 - `expo_id`: A string specifying the ligand of interest. This is
   especially useful if multiple ligands are present in a PDB structure.
 - `uniprot_id`: A string specifying the UniProt ID that will be used to
   fetch the amino acid sequence from UniProt, which will be used for
   modeling the protein. This will supersede the sequence information
   given in the PDB header.
 - `sequence`: A string specifying the amino acid sequence in
   one-letter-codes that should be used during modeling the protein. This
   will supersede a given `uniprot_id` and the sequence information given
   in the PDB header.

The ligand component of each system must be a `core.components.BaseLigand`
or a subclass thereof. The ligand component can have the following
optional attributes:

 - `name`: A string specifying the name of the ligand, will be used for
   generating the output file name.

Parameters
----------
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
output_dir: str, Path or None, default=None
    Path to directory used for saving output files. If None, output
    structures will not be saved.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.
max_retry: int, default=3
    The maximal number of attempts to try running the prepwizard step.
build_loops: bool, default=True
    If missing loops shell be built. Is also needed to model mutations.

In general these featurizers will work with a minimal amount of information, e.g. just a PDB ID. However, it is recommended to be explicit as possible when defining the systems to featurize. For example, if a given PDB entry has multiple chains and ligands, the featurizer will have to guess which chain and ligand is of interest if not explicitly stated.

[3]:
# collect systems to featurize, i.e. prepare the protein structure
systems = []
[4]:
# unspecifc definition of the system, only via PDB ID
# modeling will be performed according to the sequence stored in the PDB Header
protein = Protein(pdb_id="4f8o", name="PsaA", toolkit="MDAnalysis")
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[5]:
# more specific definition of the system, protein of chain A co-crystallized with ligand AES and
# alternate location B, modeling will be performed according to the sequence of the given
# UniProt ID
protein = Protein.from_pdb(pdb_id="4f8o", name="PsaA", toolkit="MDAnalysis")
protein.uniprot_id = "P31522"
protein.chain_id = "A"
protein.alternate_location = "B"
protein.expo_id = "AES"
ligand = Ligand(name="AEBSF")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
[6]:
featurizer = SCHRODINGERComplexFeaturizer(output_dir="output/complex")

The featurizers will return the featurized systems as an MDAnalysis universe. Systems that failed will be filtered out. In case one is interested in failures, one can enable logging messages via:

import logging
logging.basicConfig(level=logging.DEBUG)
[7]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[7]:
[<ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>,
 <ProteinLigandComplex with 2 components (<Protein name=PsaA>, <Ligand name=AEBSF>)>]
[8]:
systems[0].featurizations["last"]
[8]:
<Universe with 2506 atoms>

If an output_dir was provided, the prepared structure is saved in PDB format.

[9]:
for path in sorted(Path("output/complex").glob("*")):
    print(path.name)
kinoml_SCHRODINGERComplexFeaturizer_PsaA_4f8o_AEBSF_complex.pdb
kinoml_SCHRODINGERComplexFeaturizer_PsaA_4f8o_chainA_altlocB_AEBSF_complex.pdb

SCHRODINGERDockingFeaturizer

Docking can be performed with and without shape restrain to the co-crystallized ligand. Moreover, the protein structure for docking must contain a co-crystallized ligand, which is required for the pocket definition.

[10]:
print(inspect.getdoc(SCHRODINGERDockingFeaturizer))
Given systems with exactly one protein and one ligand, prepare the
structure dock the ligand into its binding site identified by a
co-crystallized ligand. The following steps will be performed:

 - modeling missing loops with Prime according to the PDB header unless
   a custom sequence is specified via the `uniprot_id` or `sequence`
   attribute in the protein component (see below), missing sequences at
   N- and C-termini are not modeled
 - building missing side chains
 - substitutions, deletions and insertions, if a `uniprot_id` or `sequence`
   attribute is provided for the protein component alteration will be first
   deleted and subsequently the intended sequence modeled with Prime, if
   an alteration could not be modeled, a corresponding deletion will remain
 - removing everything but protein, water and ligand of interest
 - protonation at pH 7.4
 - docking a ligand

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, must be initialized with toolkit='MDAnalysis' and give
access to the molecular structure, e.g. via a pdb_id. Additionally, the
protein component can have the following optional attributes to customize
the protein modeling:

 - `name`: A string specifying the name of the protein, will be used for
   generating the output file name.
 - `chain_id`: A string specifying which chain should be used.
 - `alternate_location`: A string specifying which alternate location
   should be used.
 - `expo_id`: A string specifying a ligand bound to the protein of
   interest. This is especially useful if multiple proteins are found in
   one PDB structure.
 - `uniprot_id`: A string specifying the UniProt ID that will be used to
   fetch the amino acid sequence from UniProt, which will be used for
   modeling the protein. This will supersede the sequence information
   given in the PDB header.
 - `sequence`: A string specifying the amino acid sequence in
   one-letter-codes that should be used during modeling the protein. This
   will supersede a given `uniprot_id` and the sequence information given
   in the PDB header.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES. Additionally, the ligand component can have the following optional
attributes:

 - `name`: A string specifying the name of the ligand, will be used for
   generating the output file name and as molecule title in the docking
   pose SDF file.
 - `macrocycle`: A bool specifying if the ligand shell be sampled as a
   macrocycle during docking. Docking will fail, if SCHRDODINGER does not
   consider the ligand a macrocycle.

Parameters
----------
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
output_dir: str, Path or None, default=None
    Path to directory used for saving output files. If None, output
    structures will not be saved.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.
max_retry: int, default=3
    The maximal number of attempts to try running the prepwizard and
    docking steps.
build_loops: bool, default=True
    If missing loops shell be built. Is also needed to model mutations.
shape_restrain: bool, default=True
    If the docking shell be performed with shape restrain based on the
    co-crystallized ligand.

Without shape restrain

[11]:
systems = []
[12]:
protein = Protein(pdb_id="4yne", uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[13]:
featurizer = SCHRODINGERDockingFeaturizer(
    output_dir="output/docking_without_shape_restrain",
    shape_restrain=False
)
[14]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Converted file: /scratch/lsftmp/6525607.tmpdir/tmpq7th966w.mae
Removing previous job files...
JobId: lu04-0-6262a247
ExitStatus: finished
Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_without_shape_restrain/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae
[14]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

Docking scores are stored in the returned MDAnalysis universe.

[15]:
systems[0].featurizations["last"]._topology.docking_score
[15]:
-10.1766

If an output_dir was provided, the prepared structure is saved in PDB and MAE format, the prepared ligand is additionally saved in SDF format.

[16]:
for path in sorted(Path("output/docking_without_shape_restrain").glob("*")):
    print(path.name)
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.pdb
kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_ligand.sdf

With shape restrain

[17]:
systems = []
[18]:
protein = Protein(pdb_id="4yne", uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
protein.expo_id = "4EK"
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[19]:
featurizer = SCHRODINGERDockingFeaturizer(
    output_dir="output/docking_with_shape_restrain",
    shape_restrain=True,
)
[20]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Converted file: /scratch/lsftmp/6525607.tmpdir/tmp195pmphc.mae
Removing previous job files...
JobId: lu04-0-6262a33d
ExitStatus: finished
Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_with_shape_restrain/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4yne_larotrectinib_complex.mae
[20]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]

MostSimilarPDBLigandFeaturizer

Manually specifying the most suitable PDB structure to dock into is not practical for a larger set of ligands. Hence, the MostSimilarPDBLigandFeaturizer was implemented, wich can find the most suitable structure for docking in the PDB based on ligand similarity. The user can choose from one the following similarity metrics:

  • Fingerprint

  • Most common substructure

  • OpenEye’s shape

  • Schrodinger’s shape

[21]:
print(inspect.getdoc(MostSimilarPDBLigandFeaturizer))
Find the most similar co-crystallized ligand in the PDB according to a
given SMILES and UniProt ID.

The protein component of each system must be a `core.proteins.Protein` or
a subclass thereof, and must be initialized with a `uniprot_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structure with the most
    similar ligand ["fingerprint", "mcs", "openeye_shape",
    "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.

Note
----
The toolkit ['MDAnalysis' or 'OpenEye'] specified in the protein object
initialization should fit the required toolkit when subsequently applying
the OEDockingFeaturizer or SCHRODINGERDockingFeaturizer.

Most common substructure

[22]:
systems = []
[23]:
protein = Protein(uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[24]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="mcs")
[25]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
[25]:
('4YNE', 'A', '4EK')
12.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Fingerprint

[26]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint")
[27]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
[27]:
('4YNE', 'A', '4EK')
6.91 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Schrodinger’s shape

[28]:
featurizer = MostSimilarPDBLigandFeaturizer(similarity_metric="schrodinger_shape")
[29]:
%%timeit -n 1 -r 1
%%capture --no-display
systems = featurizer.featurize(systems)
systems[0].protein.pdb_id, systems[0].protein.chain_id, systems[0].protein.expo_id
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LC_NUMERIC = "C",
        LC_TIME = "C",
        LANG = "C.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
JobId: lu04-0-6262a42c
[29]:
('4YPS', 'A', '4F6')
36.7 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Using shape is the slowest option, but in many cases the most accurate one.

Pipeline of MostSimilarPDBLigandFeaturizer and SCHRODINGERDockingFeaturizer

The MostSimilarPDBLigandFeaturizer can be joined with the SCHRODINGERDockingFeaturizer into a Pipeline featurizer.

[30]:
systems = []
[31]:
protein = Protein(uniprot_id="P04629", name="NTRK1", toolkit="MDAnalysis")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[32]:
featurizer = Pipeline([
    MostSimilarPDBLigandFeaturizer(similarity_metric="fingerprint"),
    SCHRODINGERDockingFeaturizer(output_dir="output/docking_pipeline"),
])
[33]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
Converted file: /scratch/lsftmp/6525607.tmpdir/tmp5xoq3zd9.mae
Removing previous job files...
JobId: lu04-0-6262a474
ExitStatus: finished
Converted file: /lila/data/chodera/shallerd/projects/schrodinger/kinoml/examples/output/docking_pipeline/kinoml_SCHRODINGERDockingFeaturizer_NTRK1_4YNE_chainA_larotrectinib_complex.mae
[33]:
[<ProteinLigandComplex with 2 components (<Protein name=NTRK1>, <Ligand name=larotrectinib>)>]
[34]:
systems[0].featurizations
[34]:
{'last': <Universe with 4964 atoms>,
 'Pipeline([MostSimilarPDBLigandFeaturizer, SCHRODINGERDockingFeaturizer])': <Universe with 4964 atoms>}

KLIFSConformationTemplatesFeaturizer

The KLIFSConformationTemplatesFeaturizer searches for suitable templates to model a kinase:ligand complex in different conformations. The templates are selected based on ligand and sequence similarity.

[35]:
print(inspect.getdoc(KLIFSConformationTemplatesFeaturizer))
Find suitable kinase templates for modeling a kinase:inhibitor complex in
different KLIFS conformations.

The protein component of each system must be a `core.proteins.KLIFSKinase`,
and must be initialized with a `uniprot_id` or `kinase_klifs_id` parameter.

The ligand component of each system must be a `core.ligands.Ligand` or a
subclass thereof and give access to the molecular structure, e.g. via a
SMILES.

Parameters
----------
similarity_metric: str, default="fingerprint"
    The similarity metric to use to detect the structures with similar
    ligands ["fingerprint", "mcs", "openeye_shape", "schrodinger_shape"].
cache_dir: str, Path or None, default=None
    Path to directory used for saving intermediate files. If None, default
    location provided by `appdirs.user_cache_dir()` will be used.
use_multiprocessing : bool, default=True
    If multiprocessing to use.
n_processes : int or None, default=None
    How many processes to use in case of multiprocessing. Defaults to
    number of available CPUs.
[36]:
systems = []
[37]:
protein = KLIFSKinase(uniprot_id="P04629", name="NTRK1")
ligand = Ligand(smiles="C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F", name="larotrectinib")
system = ProteinLigandComplex(components=[protein, ligand])
systems.append(system)
[38]:
featurizer = KLIFSConformationTemplatesFeaturizer(
    similarity_metric="fingerprint"
)
[39]:
%%capture --no-display
systems = featurizer.featurize(systems)
systems
[39]:
[<ProteinLigandComplex with 2 components (<KLIFSKinase name=NTRK1>, <Ligand name=larotrectinib>)>]
[40]:
systems[0].featurizations["last"]
[40]:
dfg ac_helix pdb_id chain_id expo_id ligand_similarity pocket_similarity
0 in in 4yne A 4EK 0.568047 443.0
1 in out 6tfp A N6Z 0.534031 215.0
2 out in 4pmp A 31W 0.482759 443.0
3 out-like in 6brj A VX6 0.521739 279.0
4 out-like out 3aqv A TAK 0.435754 171.0
5 out out 5jfv A 6K1 0.491620 422.0