kinoml.datasets.torch_datasets¶
Helper classes to convert between DatasetProvider objects and Dataset-like objects native to the PyTorch ecosystem
Module Contents¶
- class kinoml.datasets.torch_datasets.PrefeaturizedTorchDataset(systems, measurements, observation_model: callable = _null_observation_model)¶
Bases:
torch.utils.data.Dataset
Exposes the X, y (systems and measurements, respectively) arrays exported by DatasetProvider using the API expected by Torch DataLoaders.
- Parameters:
systems (array-like) – X vectors, as exported from featurized systems in DatasetProvider
measurements (array-like) – y vectors, as exported from the measurement values contained in a DatasetProvider
observation_model (callable, optional) – A function that adapts the predicted y to the observed y values. Useful to combine measurement types in the same model, if they are mathematically related. Normally provided by the Measurement type class.
- device = 'cuda'¶
- systems¶
- measurements¶
- observation_model¶
- __getitem__(index)¶
- __len__()¶
- as_dataloader(**kwargs)¶
Build a PyTorch DataLoader view of this Dataset
- estimate_input_size() → int¶
Estimate the input size for a model, using the first dimension of the X vector shape.
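For orientation, a minimal usage sketch (not taken from the library docs), assuming featurized arrays are already at hand; the array contents below are placeholders:

import numpy as np
from kinoml.datasets.torch_datasets import PrefeaturizedTorchDataset

# Placeholder arrays standing in for DatasetProvider exports:
# 100 featurized systems with 512 features each, one measurement per system.
X = np.random.rand(100, 512).astype("float32")
y = np.random.rand(100, 1).astype("float32")

dataset = PrefeaturizedTorchDataset(X, y)
print(dataset.estimate_input_size())  # first dimension of the X vector shape
loader = dataset.as_dataloader(batch_size=16, shuffle=True)  # kwargs go to DataLoader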
- class kinoml.datasets.torch_datasets.TorchDataset(systems, measurements, featurizer, observation_model: callable = _null_observation_model)¶
Bases:
PrefeaturizedTorchDataset
Same purpose as PrefeaturizedTorchDataset, but instead of taking arrays in, it takes the non-featurized System and Measurement objects, and applies a featurizer on the fly upon access (e.g. during training).
- Parameters:
systems (list of kinoml.core.systems.System)
measurements (list of kinoml.core.measurements.BaseMeasurement)
featurizer (callable) – A function that takes a System and returns an array-like object.
observation_model (callable, optional) – A function that adapts the predicted y to the observed y values. Useful to combine measurement types in the same model, if they are mathematically related. Normally provided by the Measurement type class.
- featurizer¶
- estimate_input_size()¶
Estimate the input size for a model, using the first dimension of the X vector shape.
- __getitem__(index)¶
In this case, the DatasetProvider is passing System objects that will be featurized (and memoized) upon access only.
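A sketch of the on-the-fly variant, assuming systems, measurements and my_featurizer already exist (all three names are illustrative):

from kinoml.datasets.torch_datasets import TorchDataset

# `my_featurizer` must take a System and return an array-like object.
dataset = TorchDataset(systems, measurements, featurizer=my_featurizer)

# Featurization is applied (and memoized) the first time an item is
# accessed, so this first access pays the featurization cost:
item = dataset[0]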
- class kinoml.datasets.torch_datasets.XyTorchDataset(X, y, indices=None)¶
Bases:
torch.utils.data.Dataset
Simple Torch Dataset adaptor where X and y are homogeneous tensors. All systems have the same shape.
- Parameters:
X (arraylike) – Featurized systems
y (arraylike) – Measurement values associated with each featurized system
indices (dict of array selectors) – It will only accept train, train/test or train/test/val keys.
- data_X¶
- data_y¶
- indices¶
- classmethod from_npz(path)¶
Load X and y arrays from an NPZ file on disk. The file must expose at least two keys: X and y. It can also contain three more: idx_train, idx_test and idx_val, which correspond to the indices of the training, test and validation subsets.
- Parameters:
path (str) – Path to an NPZ file with the keys listed above.
- __getitem__(index)¶
- __len__()¶
- input_size()¶
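A brief sketch of the NPZ round-trip, assuming a file with the keys described above; "dataset.npz" is a placeholder path:

from kinoml.datasets.torch_datasets import XyTorchDataset

dataset = XyTorchDataset.from_npz("dataset.npz")  # needs at least X and y keys
print(len(dataset), dataset.input_size())
if dataset.indices:  # populated only if idx_train / idx_test / idx_val were present
    train_idx = dataset.indices["train"]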
- class kinoml.datasets.torch_datasets.MultiXTorchDataset(dict_of_arrays, indices=None)¶
Bases:
torch.utils.data.Dataset
This class is able to load NPZ files into a torch.Dataset compliant object.
It assumes the following things. If each system is characterized with a single tensor:
- The X tensors are all of the same shape. In that case, the NPZ file only has a single X key, preloaded and accessible via .data_X. When queried, it returns a view to the torch.tensor object.
- The X tensors have different shapes. In that case, the NPZ keys follow the X_s{int} syntax. When queried, it returns a list of torch.tensor objects.
If each system is characterized with more than one tensor:
- The NPZ keys follow the X_s{int}_a{int} syntax. When queried, it returns a list of tuples of torch.tensor objects.
No matter the structure of X, y is assumed to be a homogeneous tensor, and it will always be returned as a view to the underlying torch.tensor object.
Additionally, the NPZ file might contain idx_train, idx_test (and idx_val) arrays, specifying indices for the train / test / validation split. If provided, they will be stored under an .indices dict.
- Parameters:
dict_of_arrays (dict of np.ndarray) – See above.
indices (dict of np.ndarray)
Notes
This object is better paired with the output of
DatasetProvider.to_dict_of_arrays.
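As an illustration of the key syntax above, a hand-built dict for two systems with two tensors each; shapes and values are placeholders:

import numpy as np
from kinoml.datasets.torch_datasets import MultiXTorchDataset

dict_of_arrays = {
    # X_s{system index}_a{array index}, as described above
    "X_s0_a0": np.random.rand(30, 4),
    "X_s0_a1": np.random.rand(7),
    "X_s1_a0": np.random.rand(28, 4),
    "X_s1_a1": np.random.rand(7),
    # y is a homogeneous tensor, one row per system
    "y": np.array([[0.5], [1.2]]),
}
dataset = MultiXTorchDataset(dict_of_arrays)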
- _data¶
- data_y¶
- shape_X¶
- shape_y¶
- indices¶
- _fast_key_access¶
- _is_npz = None¶
- classmethod from_npz(path, lazy=True, close_filehandle=False)¶
Load from a single NPZ file. If lazy=True, this can be very slow for large numbers of arrays.
- Parameters:
path (str) – Path to the NPZ file
lazy (bool, optional=True) – Whether to let Numpy load arrays on demand, upon access (True) or preload everything in memory (False)
close_filehandle (bool, optional=False) – Whether to close the NPZ filehandle after reading some metadata. This will enable parallelism without preloading everything, but each access will suffer the overhead of opening the NPZ file again!
Note
NPZ files cannot be read in parallel (you'll see CRC32 errors and others). If you want to use DataLoader(..., num_workers=2) or above, you'll need to either:
A) preload everything with lazy=False. This will use more RAM and incur an initial waiting time.
B) use close_filehandle=True. This will incur a penalty upon each access, because the NPZ file needs to be reloaded each time.
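A sketch of both options against a placeholder NPZ path:

from torch.utils.data import DataLoader
from kinoml.datasets.torch_datasets import MultiXTorchDataset

# Option A: preload all arrays into RAM; parallel workers are then safe.
ds = MultiXTorchDataset.from_npz("provider.npz", lazy=False)

# Option B: stay lazy but close the filehandle; each access reopens the file.
# ds = MultiXTorchDataset.from_npz("provider.npz", close_filehandle=True)

loader = DataLoader(ds, batch_size=32, num_workers=2)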
- _getitem_multi_X(accessor)¶
Note: This method might scale poorly and can end up being a bottleneck! Most of the time is spent accessing the NPZ file on disk, though.
Some timings:
>>> ds = MultiXTorchDataset.from_npz("ChEMBLDatasetProvider.npz")
>>> %timeit _ = ds[0:2]
2.91 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit _ = ds[0:4]
5.59 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit _ = ds[0:8]
11.4 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit _ = ds[0:16]
22.7 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit _ = ds[0:32]
44.7 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit _ = ds[0:64]
87 ms ± 2.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit _ = ds[0:128]
171 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- _getitem_single_X(index)¶
- __getitem__(index)¶
- _shape_X()¶
- is_single_X()¶
- _str_keys_to_nested_dict(keys)¶
- static _key_to_ints(key: str) → List[int]¶
NPZ keys are formatted with this syntax: {X|y}_{1-character str}{int}_{1-character str}{int}. We split by underscores and extract the ints into a list.
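An illustrative re-implementation of that parsing step (a sketch of the idea, not necessarily the library's exact code):

from typing import List

def key_to_ints(key: str) -> List[int]:
    # "X_s0_a1" -> [0, 1]: drop the {X|y} prefix, then strip the single
    # leading character from each remaining field and keep the integer.
    return [int(field[1:]) for field in key.split("_")[1:]]

assert key_to_ints("X_s0_a1") == [0, 1]
assert key_to_ints("X_s12") == [12]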
- __len__()¶
- class kinoml.datasets.torch_datasets.AwkwardArrayDataset(data)¶
Bases:
torch.utils.data.Dataset
Loads an Awkward array of Records.
The structure of the array dimensions needs to be:
List of systems
---- X1
---- X2
---- ...
---- Xn
---- y
However, X1…Xn, y are accessed by positional index, as a string.
So, to get all the X1 vectors for all systems, you’d do:
X1 = data["0"]
X2 = data["1"]
Since y is always the last one, you can use the data.fields list:
y = data[data.fields[-1]]
This is essentially what __getitem__ is doing for you.
It will try to consolidate tensors whenever possible, as long as they have the same shape. If they do not, then you'll get a list of tensors instead.
If this is the case, make sure to provide a suitable collate_fn function for the corresponding DataLoader! More info: https://pytorch.org/docs/stable/data.html#dataloader-collate-fn
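A minimal collate_fn sketch for that ragged case, assuming each dataset item is an (X-list, y) pair as described above; the function name is illustrative:

import torch
from torch.utils.data import DataLoader

def ragged_collate_fn(batch):
    # batch is a list of (xs, y) items where xs is a list of tensors whose
    # shapes may differ between systems: keep X ragged, stack y.
    xs, ys = zip(*batch)
    return list(xs), torch.stack([torch.as_tensor(y) for y in ys])

# loader = DataLoader(awk, batch_size=32, collate_fn=ragged_collate_fn)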
Notes
With several tensors per system, but all of the same shape, it is faster:
>>> awk = AwkwardArrayDataset.from_parquet("same_shape.parquet")
>>> %timeit _ = awk[:50]
2.38 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> awk = AwkwardArrayDataset.from_parquet("different_shape.parquet")
>>> %timeit _ = awk[:50]
9.32 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is probably due to the Awkward->NumPy->Torch conversions that need to happen for each different-shape sub-tensor. Look in __getitem__ for bottlenecks.
- data¶
- __len__()¶
- __getitem__(index)¶
- __repr__()¶
- __str__()¶
- classmethod from_parquet(path, **kwargs)¶
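A brief usage sketch; "provider.parquet" is a placeholder for a file exported by the matching DatasetProvider, and the unpacking assumes __getitem__ returns the X tensors and y as described above:

from kinoml.datasets.torch_datasets import AwkwardArrayDataset

awk = AwkwardArrayDataset.from_parquet("provider.parquet")
print(len(awk))
X, y = awk[:8]  # consolidated tensors when shapes match, lists of tensors otherwise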
- kinoml.datasets.torch_datasets._accessor_to_indices(accessor, full_size)¶