KinoML object model¶
The KinoML object model provides access to binding affinity data in the context of machine learning for small molecule drug discovery (Fig. 1). The DatasetProvider is the central object for storing all relevant information of a dataset. It is essentially a list of Measurement objects, each of which contains the measured values (singlicate or replicates) associated with a System plus the experimental AssayConditions. A System is a list of MolecularComponent objects, usually a Protein and a Ligand. Featurizers use the MolecularComponents as input to represent the System in different formats for machine learning tasks, e.g. a Ligand as a molecular fingerprint.
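The nesting described above can be pictured with a few plain dataclasses. This is only a rough sketch of the containment relationships: the Toy* classes are hypothetical stand-ins invented for illustration and are not the KinoML classes, whose attributes and constructors may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyComponent:
    """Hypothetical stand-in for a molecular component (ligand or protein)."""
    name: str
    kind: str  # "ligand" or "protein"


@dataclass
class ToySystem:
    """A system is a list of molecular components."""
    components: List[ToyComponent]


@dataclass
class ToyMeasurement:
    """Measured values tied to one system and its assay conditions."""
    values: List[float]
    system: ToySystem
    conditions: Dict[str, float] = field(default_factory=dict)


@dataclass
class ToyDataset:
    """A dataset is essentially a list of measurements."""
    measurements: List[ToyMeasurement]

    @property
    def systems(self) -> List[ToySystem]:
        # One system per measurement, in the same order.
        return [m.system for m in self.measurements]


ligand = ToyComponent(name="propane", kind="ligand")
protein = ToyComponent(name="NTRK1", kind="protein")
measurement = ToyMeasurement(
    values=[10.0],
    system=ToySystem(components=[ligand, protein]),
    conditions={"pH": 7.0},
)
dataset = ToyDataset(measurements=[measurement])
print(len(dataset.systems), [c.name for c in dataset.systems[0].components])
```

Keeping the system inside the measurement means that removing a measurement also removes its system, which is convenient when data points are filtered later on.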
KinoML focuses on protein kinases, but the architecture is applicable to protein targets in general. When writing your own KinoML objects, it is recommended to move computationally expensive tasks to the Featurizer level, which is capable of multi-processing. For example, Protein objects can be initialized with nothing but a UniProt ID. The amino acid sequence is fetched when the Protein's sequence attribute is accessed for the first time. Thus, one can quickly generate many Protein objects, and the more time-consuming sequence fetching is done with multi-processing during featurization.
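The lazy pattern described in this paragraph can be sketched with a cached property. The LazyProtein class, the FAKE_DATABASE dictionary and the TOY_ID entry are hypothetical and exist only for this illustration; the sketch is not how KinoML implements its Protein object, and a thread pool is used here merely as a lightweight stand-in for multi-processing.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import cached_property

# Hypothetical in-memory stand-in for a remote sequence resource.
# "MLRGGRRGQL" are the first ten residues printed for P04629 in this document;
# the TOY_ID entry is made up.
FAKE_DATABASE = {"P04629": "MLRGGRRGQL", "TOY_ID": "ACDEFGHI"}
fetch_log = []  # records every simulated fetch


class LazyProtein:
    """Stores only an identifier; the sequence is fetched on first access."""

    def __init__(self, uniprot_id, name=""):
        self.uniprot_id = uniprot_id
        self.name = name

    @cached_property
    def sequence(self):
        fetch_log.append(self.uniprot_id)  # the expensive step would go here
        return FAKE_DATABASE[self.uniprot_id]


# Object creation is cheap: nothing has been fetched yet.
proteins = [LazyProtein("P04629", name="NTRK1"), LazyProtein("TOY_ID", name="toy")]
print("fetches after creation:", len(fetch_log))

# The deferred work can then be spread over several workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    lengths = list(pool.map(lambda p: len(p.sequence), proteins))

# A second access is served from the cache and triggers no further fetch.
_ = proteins[0].sequence
print("sequence lengths:", lengths)
print("fetches in total:", len(fetch_log))
```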
The following sections introduce the different KinoML objects, including code examples.
Molecular components¶
Molecular components like ligands and proteins store molecular representations, a name, and additional metadata that may be important for working with the data and for provenance.
Ligands¶
Ligand objects store information about the molecular structure of a ligand, usually a small molecule with certain activity for a target.
The Ligand object is based on the OpenFF-Toolkit Molecule object, which can be accessed via the molecule attribute. This also allows the use of OpenFF-Toolkit Molecule methods, including conversion to other toolkits, e.g. RDKit and OpenEye. The Ligand object can be initialized directly from SMILES or from a file, in which case the input is interpreted immediately, or lazily from SMILES, in which case the input is not interpreted until it is needed.
[1]:
from openff.toolkit.utils.exceptions import SMILESParseError
from kinoml.core.ligands import Ligand
[2]:
# initialize a Ligand from SMILES, the molecule will be directly interpreted
ligand = Ligand.from_smiles("CCC", name="propane")
print(type(ligand))
print(type(ligand.molecule))
print(type(ligand.molecule.to_rdkit()))
print(ligand.molecule.to_smiles(explicit_hydrogens=False))
<class 'kinoml.core.ligands.Ligand'>
<class 'openff.toolkit.topology.molecule.Molecule'>
<class 'rdkit.Chem.rdchem.Mol'>
CCC
[3]:
# erroneous input will raise errors during initialization
try:
    ligand = Ligand.from_smiles("XXX", name="wrong_smiles")
    print("Success!")
except SMILESParseError:
    print("Failed!")
Failed!
Warning: Problem parsing SMILES:
Warning: XXX
Warning: ^
[4]:
# Ligands can also be lazily initialized via SMILES
# here the interpretation is done when calling the molecule attribute for the first time
ligand = Ligand(smiles="CCC", name="propane")
print(type(ligand.molecule))
<class 'openff.toolkit.topology.molecule.Molecule'>
[5]:
# this makes the object generation faster
# but will result in interpretation errors later, e.g. during a featurization step
# hence featurizers need to detect and remove those systems
ligand = Ligand(smiles="XXX", name="wrong_smiles")
print("Ligand lazily initialized!")
try:
    print(type(ligand.molecule))
    print("Success!")
except SMILESParseError:
    print("Failed!")
Ligand lazily initialized!
Failed!
Warning: Problem parsing SMILES:
Warning: XXX
Warning: ^
Proteins¶
Protein objects store information about the molecular structure of a protein, e.g. the target of a small molecule inhibitor.
KinoML provides two different Protein objects, i.e. Protein (applicable to all proteins) and KLIFSKinase (allows access to information from the protein kinase-specific KLIFS database). Similar to Ligand, protein objects can be initialized directly or lazily.
Again, the molecular structure is accessible via the molecule attribute. However, both protein objects support two toolkits, i.e. MDAnalysis and OpenEye, which can be specified via the toolkit argument. A conversion from one toolkit to the other after initialization is currently not possible, but is likely not needed anyway.
Another important attribute of proteins is their sequence. Depending on the featurizer used, a molecular structure may not actually be required, for example in the case of OneHotEncoding of the sequence. Hence, you can also initialize Protein and KLIFSKinase using sequence identifiers only, e.g. UniProt ID or NCBI ID. This is always done lazily, so the sequences will be fetched from the respective resource on the first access of the sequence attribute. Protein and KLIFSKinase inherit their sequence-related functionality from the AminoAcidSequence object in kinoml.core.sequences, which allows for further customization of sequences, e.g. mutations. For more details, have a look at the AminoAcidSequence class in the respective section of the KinoML API documentation.
[6]:
from kinoml.core.proteins import Protein, KLIFSKinase
[7]:
# initialize from PDB ID with different toolkits
protein = Protein.from_pdb("4yne", name="NTRK1")
protein2 = Protein.from_pdb("4yne", name="NTRK1", toolkit="MDAnalysis")
print(type(protein.molecule))
print(type(protein2.molecule))
protein2
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
<class 'openeye.oechem.OEGraphMol'>
<class 'MDAnalysis.core.universe.Universe'>
[7]:
<Protein name=NTRK1>
[8]:
# initialize lazily via PDB ID
protein = Protein(pdb_id="4yne", name="NTRK1")
print(type(protein.molecule))
<class 'openeye.oechem.OEGraphMol'>
[9]:
# note there is no sequence yet, since no UniProt ID was given
print(len(protein.sequence))
# but one could get it from the protein structure if needed
0
[10]:
# initialize with sequence from UniProt
protein = Protein(uniprot_id="P04629", name="NTRK1")
print(protein.sequence[:10])
# initialize with sequence from UniProt and custom mutations
protein = Protein(uniprot_id="P04629", name="NTRK1", metadata={"mutations": "R3A"})
print(protein.sequence[:10])
print(type(protein.molecule)) # a molecule is not available
MLRGGRRGQL
MLAGGRRGQL
<class 'NoneType'>
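The effect of the "R3A" mutation in the output above (arginine at position 3 replaced by alanine) can be reproduced with a few lines of plain Python. The apply_substitution helper below is a hypothetical illustration of how such a substitution string can be read, assuming 1-based positions; it is not the AminoAcidSequence implementation.

```python
import re


def apply_substitution(sequence: str, mutation: str) -> str:
    """Apply one substitution such as "R3A" (1-based position) to a sequence."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Cannot parse mutation {mutation!r}")
    wild_type, position, mutant = match.group(1), int(match.group(2)), match.group(3)
    index = position - 1  # convert to 0-based indexing
    if not 0 <= index < len(sequence):
        raise ValueError(f"Position {position} is outside the sequence")
    if sequence[index] != wild_type:
        raise ValueError(
            f"Expected {wild_type} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + mutant + sequence[index + 1:]


# First ten residues of P04629 as printed above
print(apply_substitution("MLRGGRRGQL", "R3A"))  # MLAGGRRGQL
```

Checking the wild-type residue before substituting catches mutation strings that refer to a different isoform or numbering scheme.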
[11]:
# get the kinase KLIFS pocket sequence via different identifiers (lazy)
kinase = KLIFSKinase(uniprot_id="P04629", name="NTRK1")
print(kinase.kinase_klifs_sequence)
kinase = KLIFSKinase(ncbi_id="NP_001007793", name="NTRK1")
print(kinase.kinase_klifs_sequence)
kinase = KLIFSKinase(kinase_klifs_id=480, name="NTRK1")
print(kinase.kinase_klifs_sequence)
WELGEGAFGKVFLVAVKALDFQREAELLTMLQQHIVRFFGVLMVFEYMRHGDLNRFLRSYLAGLHFVHRDLATRNCLVIGDFGMS
WELGEGAFGKVFLVAVKALDFQREAELLTMLQQHIVRFFGVLMVFEYMRHGDLNRFLRSYLAGLHFVHRDLATRNCLVIGDFGMS
WELGEGAFGKVFLVAVKALDFQREAELLTMLQQHIVRFFGVLMVFEYMRHGDLNRFLRSYLAGLHFVHRDLATRNCLVIGDFGMS
Systems¶
Systems store all molecular components for a given activity data point. A system may contain only a Ligand in the case of purely ligand-based featurization, but it can also contain a Protein. The available system objects are LigandSystem, ProteinSystem and ProteinLigandComplex.
[12]:
from kinoml.core.systems import LigandSystem, ProteinSystem, ProteinLigandComplex
[13]:
ligand = Ligand(smiles="CCC", name="propane")
protein = Protein(uniprot_id="P04629", name="NTRK1")
[14]:
system = LigandSystem(components=[ligand])
system
[14]:
<LigandSystem with 1 components (<Ligand name=propane>)>
[15]:
system = ProteinSystem(components=[protein])
system
[15]:
<ProteinSystem with 1 components (<Protein name=NTRK1>)>
[16]:
system = ProteinLigandComplex(components=[ligand, protein])
system
[16]:
<ProteinLigandComplex with 2 components (<Ligand name=propane>, <Protein name=NTRK1>)>
Featurizers¶
Featurizers ingest Systems to compute features, e.g. for machine learning tasks. Systems that fail during featurization, e.g. because of erroneous SMILES, will be removed. Featurizations are stored in each system for later usage.
[17]:
from kinoml.features.ligand import MorganFingerprintFeaturizer
[18]:
# generate systems with lazily initialized ligands
systems = [
    LigandSystem(components=[Ligand(smiles=smiles, name=str(i))])
    for i, smiles in enumerate(["C", "?", "CC", "CCC"])
]
systems
[18]:
[<LigandSystem with 1 components (<Ligand name=0>)>,
<LigandSystem with 1 components (<Ligand name=1>)>,
<LigandSystem with 1 components (<Ligand name=2>)>,
<LigandSystem with 1 components (<Ligand name=3>)>]
[19]:
# the featurization will lead to interpretation of the given SMILES for the first time
# failing systems will not be returned
featurizer = MorganFingerprintFeaturizer()
systems = featurizer.featurize(systems)
systems
Warning: Problem parsing SMILES:
Warning: ?
Warning: ^
[19]:
[<LigandSystem with 1 components (<Ligand name=0>)>,
<LigandSystem with 1 components (<Ligand name=2>)>,
<LigandSystem with 1 components (<Ligand name=3>)>]
[20]:
# featurizations are stored in each system as a dict
# the most recent featurization is additionally stored under the "last" key
systems[0].featurizations
[20]:
{'last': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0]),
'MorganFingerprintFeaturizer': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0])}
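The behaviour shown in this section (failing systems are dropped, results are stored under the featurizer name and under "last") can be mimicked with a small toy featurizer. Everything in the sketch below is hypothetical: ToySystem and CarbonCountFeaturizer are invented for illustration, the validity check is deliberately crude, and the real KinoML featurizers work differently internally.

```python
class ToySystem:
    """Hypothetical stand-in for a system holding a single SMILES string."""

    def __init__(self, smiles):
        self.smiles = smiles
        self.featurizations = {}


class CarbonCountFeaturizer:
    """Toy featurizer: counts "C" characters and rejects non-alphabetic input."""

    def _featurize_one(self, system):
        if not system.smiles.isalpha():  # crude stand-in for SMILES parsing
            raise ValueError(f"Cannot interpret {system.smiles!r}")
        return system.smiles.count("C")

    def featurize(self, systems):
        kept = []
        for system in systems:
            try:
                features = self._featurize_one(system)
            except ValueError:
                continue  # failing systems are not returned
            # store under the featurizer name and additionally under "last"
            system.featurizations[type(self).__name__] = features
            system.featurizations["last"] = features
            kept.append(system)
        return kept


systems = [ToySystem(smiles) for smiles in ["C", "?", "CC", "CCC"]]
systems = CarbonCountFeaturizer().featurize(systems)
print([system.smiles for system in systems])
print(systems[0].featurizations)
```

Storing the result under a fixed "last" key lets downstream code pick up the most recent features without knowing which featurizer produced them.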
Measurements¶
Measurements combine information for a given activity data point, i.e. System, AssayConditions and activity values. Currently available Measurement objects are PercentageDisplacementMeasurement, pIC50Measurement, pKiMeasurement and pKdMeasurement.
[21]:
from kinoml.core.conditions import AssayConditions
from kinoml.core.measurements import PercentageDisplacementMeasurement
[22]:
ligand = Ligand(smiles="CCC", name="propane")
protein = Protein(uniprot_id="P04629", name="NTRK1")
measurement = PercentageDisplacementMeasurement(
    10,
    conditions=AssayConditions(pH=7.0),
    system=ProteinLigandComplex(components=[ligand, protein]),
)
measurement
[22]:
<PercentageDisplacementMeasurement values=[10] conditions=<AssayConditions pH=7.0> system=<ProteinLigandComplex with 2 components (<Ligand name=propane>, <Protein name=NTRK1>)>>
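For readers unfamiliar with the p-scaled measurement types listed above: pIC50, pKi and pKd are conventionally the negative decadic logarithm of the respective quantity in mol/L. The helper below is a hypothetical convenience function illustrating this standard definition; it makes no claim about how KinoML stores or converts values internally.

```python
import math


def to_p_scale(value_nanomolar: float) -> float:
    """Convert a concentration in nM (e.g. an IC50) to its p-scaled value."""
    if value_nanomolar <= 0:
        raise ValueError("Concentration must be positive")
    molar = value_nanomolar * 1e-9  # nM -> mol/L
    return -math.log10(molar)


# An IC50 of 100 nM corresponds to 1e-7 mol/L
print(round(to_p_scale(100.0), 2))  # 7.0
```

On this scale, higher values mean more potent compounds, and a difference of one unit corresponds to a tenfold change in concentration.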
DatasetProviders¶
DatasetProviders are essentially lists of Measurements, which can be used for machine learning experiments. Featurizers can be passed to allow featurization of all available Systems. Currently, KinoML ships with DatasetProviders for the PKIS2 and ChEMBL datasets, allowing quick experiment design.
[23]:
from kinoml.datasets.chembl import ChEMBLDatasetProvider
from kinoml.datasets.pkis2 import PKIS2DatasetProvider
[24]:
# load data points given by the PKIS2 publication (https://doi.org/10.1371/journal.pone.0181585)
pkis2 = PKIS2DatasetProvider.from_source()
print(pkis2)
<PKIS2DatasetProvider with 261870 PercentageDisplacementMeasurement measurements and 261870 systems (Ligand=640, KLIFSKinase=406)>
[25]:
# load curated ChEMBL data points available at https://github.com/openkinome/kinodata
# here the more general "Protein" object will be used instead of the default "KLIFSKinase"
# also protein objects will be initialized with the MDAnalysis toolkit
chembl = ChEMBLDatasetProvider.from_source(
    path_or_url="https://github.com/openkinome/datascripts/releases/download/v0.3/activities-chembl29_v0.3.zip",
    measurement_types=("pIC50", "pKi", "pKd"),
    protein_type="Protein",
    toolkit="MDAnalysis",
)
chembl
[25]:
<ChEMBLDatasetProvider with 190469 measurements (pIC50Measurement=160703, pKiMeasurement=15653, pKdMeasurement=14113), and 188032 systems (Protein=462, Ligand=115207)>
[26]:
# loading a smaller sample allows rapid testing
# loading now with default "KLIFSKinase" protein object
chembl = ChEMBLDatasetProvider.from_source(
    path_or_url="https://github.com/openkinome/datascripts/releases/download/v0.3/activities-chembl29_v0.3.zip",
    measurement_types=["pKi"],
    sample=100,
)
chembl
[26]:
<ChEMBLDatasetProvider with 100 measurements (pKiMeasurement=100), and 100 systems (KLIFSKinase=38, Ligand=100)>
[27]:
%%capture --no-display
# the cell magic above hides warnings
# all systems will be successfully featurized
chembl.featurize(MorganFingerprintFeaturizer())
chembl
[27]:
<ChEMBLDatasetProvider with 100 measurements (pKiMeasurement=100), and 100 systems (KLIFSKinase=38, Ligand=100)>
[28]:
from kinoml.features.protein import OneHotEncodedSequenceFeaturizer
[29]:
# not all systems may be featurizable; those that are not will be removed, e.g. erroneous SMILES
# here certain ChEMBL data points are for kinases that are not available in KLIFS
chembl.featurize(OneHotEncodedSequenceFeaturizer(sequence_type="klifs_kinase"))
chembl
There were 3 systems that could not be featurized!
[29]:
<ChEMBLDatasetProvider with 97 measurements (pKiMeasurement=97), and 97 systems (KLIFSKinase=37, Ligand=97)>
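To give an idea of what a one-hot encoding of a sequence looks like, here is a minimal pure-Python sketch. The alphabet, its ordering, the positions-by-residues layout and the all-zero rows for unknown characters (such as gaps) are assumptions made for this illustration; the output of OneHotEncodedSequenceFeaturizer may be organized differently.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence, alphabet=AMINO_ACIDS):
    """Return one row per residue with a single 1 at the residue's alphabet index."""
    index = {residue: i for i, residue in enumerate(alphabet)}
    matrix = []
    for residue in sequence:
        row = [0] * len(alphabet)
        if residue in index:  # unknown characters (e.g. gaps) stay all-zero
            row[index[residue]] = 1
        matrix.append(row)
    return matrix


# First six residues of the KLIFS pocket sequence printed earlier for NTRK1
encoded = one_hot_encode("WELGEG")
print(len(encoded), len(encoded[0]))  # number of residues, alphabet size
print(encoded[0].index(1))  # alphabet index of "W"
```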