kinoml.datasets.core¶
Base classes for DatasetProvider-like objects.
Module Contents¶
- kinoml.datasets.core.logger¶
- class kinoml.datasets.core.BaseDatasetProvider¶
Bases: object
API specification for dataset providers.
- abstract property systems¶
- abstract property measurement_type¶
- abstract property conditions¶
- abstract classmethod from_source(path_or_url=None, **kwargs)¶
Parse CSV/raw files to object model.
- abstract observation_model(backend='pytorch')¶
- abstract measurements_as_array(reduce=np.mean)¶
- abstract measurements_by_group()¶
- abstract featurize(*featurizers: Iterable[kinoml.features.core.BaseFeaturizer])¶
- abstract featurized_systems(key='last')¶
- abstract to_dataframe(*args, **kwargs)¶
- abstract to_pytorch(**kwargs)¶
- abstract to_tensorflow(*args, **kwargs)¶
- abstract to_numpy(*args, **kwargs)¶
- class kinoml.datasets.core.DatasetProvider(measurements: Iterable[kinoml.core.measurements.BaseMeasurement], metadata: dict = None)¶
Bases: BaseDatasetProvider
Base object for all DatasetProvider classes.
- Parameters
measurements (list of BaseMeasurement) – A DatasetProvider holds a list of kinoml.core.measurements.BaseMeasurement objects (or any of its subclasses). They must all be of the same type!
metadata (dict) – Extra information for provenance.
Note
All measurements must be of the same type! If they are not, consider using MultiDatasetProvider instead.
- property systems¶
- property measurement_type¶
- property conditions: set¶
- _raw_data¶
- __len__()¶
- __getitem__(subscript)¶
- __repr__() → str¶
Return repr(self).
- abstract classmethod from_source(path_or_url=None, **kwargs)¶
Parse a CSV/raw file into the object model. This method is responsible for generating the objects for self.measurements, if relevant. Additional kwargs will be passed to __init__.
You must define this in your subclass.
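A minimal sketch of what such a subclass might look like. SimpleCSVProvider, the "system"/"value" column names, and the tuple-shaped measurement records are all invented for illustration; a real subclass would build kinoml.core.measurements.BaseMeasurement objects instead.

```python
import csv

class SimpleCSVProvider:
    """Hypothetical stand-in mimicking the from_source contract."""

    def __init__(self, measurements, metadata=None):
        self.measurements = measurements
        self.metadata = metadata or {}

    @classmethod
    def from_source(cls, path_or_url=None, **kwargs):
        # Parse each CSV row into a (system, value) record; additional
        # kwargs are forwarded to __init__, as the docstring requires.
        with open(path_or_url, newline="") as handle:
            rows = [(row["system"], float(row["value"]))
                    for row in csv.DictReader(handle)]
        return cls(measurements=rows, **kwargs)
```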
- featurize(featurizer: kinoml.features.core.BaseFeaturizer)¶
Given a collection of kinoml.features.core.BaseFeaturizer objects, apply them to the systems present in self.measurements.
- Parameters
featurizer (BaseFeaturizer) – Featurization scheme that will be applied to the systems, in a stacked way.
Note
- TODO:
Will the systems be properly featurized with Dask?
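A dependency-free sketch of the "stacked" application described above: each featurizer runs in sequence and the final result lands under the "last" key of the system's featurizations dict (the key name follows the featurized_systems(key='last') API below; the System stand-in and the plain-callable featurizers are assumptions for illustration).

```python
class System:
    """Toy stand-in for a kinoml system with a featurizations dict."""

    def __init__(self, name):
        self.name = name
        self.featurizations = {}

def featurize_systems(systems, featurizers):
    for system in systems:
        value = system.name
        for featurizer in featurizers:  # applied in a stacked way
            value = featurizer(value)
        # the final result is retrievable under the "last" key
        system.featurizations["last"] = value
    return systems
```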
- _post_featurize(featurizer: kinoml.features.core.BaseFeaturizer)¶
Remove measurements whose systems were not successfully featurized.
- Parameters
featurizer (BaseFeaturizer) – The used featurizer.
- featurized_systems(key='last', clear_after=False)¶
Return the key featurized objects from all systems.
- abstract _to_dataset(style='pytorch')¶
Generate a clean <style>.data.Dataset object for further steps in the pipeline (model building, etc).
Warning
This step is lossy because the resulting objects will no longer hold chemical data. Operations depending on such information must be performed first.
Examples
>>> provider = DatasetProvider()
>>> provider.featurize()  # optional
>>> splitter = TimeSplitter()
>>> split_indices = splitter.split(provider.data)
>>> dataset = provider.to_dataset("pytorch")  # .featurize() under the hood
>>> X_train, X_test, y_train, y_test = train_test_split(dataset, split_indices)
- to_dataframe(*args, **kwargs)¶
Generates a pandas.DataFrame containing information on the systems and their measurements.
- Return type
pandas.DataFrame
- to_pytorch(featurizer=None, **kwargs)¶
Export dataset to a PyTorch-compatible object, via adapters found in kinoml.torch_datasets.
- to_xgboost(**kwargs)¶
Export dataset to a DMatrix object, native to the XGBoost framework.
- abstract to_tensorflow(*args, **kwargs)¶
- to_numpy(featurization_key='last', y_dtype='float32', **kwargs)¶
Export dataset to a tuple of two Numpy arrays of the same shape:
X: the featurized systems
y: the measurement values (must all be of the same measurement type)
- Parameters
featurization_key (str, optional="last") – Which featurization present in the systems will be taken to build the X array. Usually last, as provided by a Pipeline object.
y_dtype (np.dtype or str, optional="float32") – Coerce the y array to this dtype.
kwargs (optional) – Dict that will be forwarded to .measurements_as_array, which will build the y array.
- Returns
X, y
- Return type
2-tuple of np.array
Note
This exporter assumes that each System is featurized as a single tensor with homogeneous shape throughout the system collection. If this does not hold true for your current featurization scheme, consider using .to_dict_of_arrays instead.
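A plain-Python sketch of the X/y export contract: lists stand in for numpy arrays so the example stays dependency-free, and the dict-shaped measurement records are an assumption for illustration.

```python
def to_xy(measurements, featurization_key="last"):
    """Hypothetical stand-in for the to_numpy export contract."""
    X = [m["system"]["featurizations"][featurization_key]
         for m in measurements]
    y = [m["value"] for m in measurements]
    # Mirror the homogeneity assumption from the Note above: every
    # feature row must share one shape (here: length) to stack into X.
    if len({len(row) for row in X}) > 1:
        raise ValueError("heterogeneous shapes; use to_dict_of_arrays")
    return X, y
```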
- to_dict_of_arrays(featurization_key='last', y_dtype='float32', _initial_system_index=0) → dict¶
Export dataset to a dict-like object, compatible with DictOfArrays and NPZ files.
The idea is to provide unique keys for each system and their features, following the syntax X_s{int}_v{int}.
This object is useful when the features for each system have different shapes and/or dimensionality and cannot be concatenated in a single homogeneous array.
- Parameters
featurization_key (Hashable, optional="last") – Which key to access in each System.featurizations dict.
y_dtype (np.dtype or str, optional="float32") – Which dtype to use for the y array.
_initial_system_index (int, optional=0) – PRIVATE. Start counting systems in X_s{int} with this value.
- Returns
A dictionary that maps str keys to array-like objects. Depending on the featurization scheme, keys can be:
All systems are featurized as an array and they share the same shape -> X, y
All N systems are featurized as an array but they do NOT share the same shape -> X_s0_, X_s1_, ..., X_sN_
All N systems are featurized as an M-tuple of arrays (shape irrelevant) -> X_s0_a0_, X_s0_a1_, X_s1_a0_, X_s1_a1_, ..., X_sN_aM_
- Return type
dict[str, array]
Note
The X keys have a trailing underscore on purpose. Otherwise, filtering keys out of the dictionary by system index can be deceivingly error-prone. For example, filtering for the first system (s1) with key.startswith("X_s1") will also select X_s1, X_s10, X_s11… Hence, we filter with X_s{int}_.
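The Note above can be demonstrated with a few lines of plain Python; the helper name is ours, but the key syntax follows the docs.

```python
def keys_for_system(keys, index):
    # Filter with the full "X_s{int}_" prefix, trailing underscore
    # included, so "X_s1_" does not also match "X_s10_" or "X_s11_".
    prefix = f"X_s{index}_"
    return [k for k in keys if k.startswith(prefix)]

# Twelve systems, one feature array each, plus the y array.
keys = [f"X_s{i}_v0" for i in range(12)] + ["y"]
```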
- to_awkward(featurization_key='last', y_dtype='float32', clear_after=False)¶
Creates an awkward array out of the featurized systems and the associated measurements.
- Return type
awkward array
Notes
Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.
Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they’re not.
- observation_model(**kwargs)¶
Draft implementation of a modular observation model, based on individual contributions from different measurement types.
- loss_adapter(**kwargs)¶
Observation model plus loss function, wrapped in a single callable. Return types are backend-dependent.
- measurements_as_array(reduce=np.mean, dtype='float32')¶
- split_by_groups() → dict¶
If a kinoml.datasets.groups class has been applied to this instance, this method will create more DatasetProvider instances, one per group.
- Returns
Maps group key to sub-datasets
- Return type
dict
- classmethod _download_to_cache_or_retrieve(path_or_url) → str¶
Helper function to either download files to the user cache, or retrieve an already cached copy.
- Parameters
path_or_url (str or Path-like) – File path or URL pointing to the required file
- Returns
If the provided argument is a file path, the same path is returned right away. If it was a URL, the (downloaded) cached file path is returned.
- Return type
str
- class kinoml.datasets.core.MultiDatasetProvider(measurements: Iterable[kinoml.core.measurements.BaseMeasurement], metadata: dict = None)¶
Bases: DatasetProvider
Adapter class that is able to expose a DatasetProvider-like interface to a collection of Measurements of different types.
The different types are split into individual DatasetProvider objects, stored under .providers.
The rest of the API works around that list to provide similar functionality to the original, single-type DatasetProvider, but in plural.
- Parameters
measurements (list of BaseMeasurement) – A MultiDatasetProvider holds a list of kinoml.core.measurements.BaseMeasurement objects (or any of its subclasses). Unlike DatasetProvider, the measurements here can be of different types, but they will be grouped together into different sub-datasets.
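The type-splitting behaviour described above can be sketched in a few lines; the grouping key and the toy measurement classes in the test are assumptions, since the real implementation works on BaseMeasurement subclasses.

```python
from collections import defaultdict

def split_by_type(measurements):
    # One bucket per measurement type: each bucket would become its
    # own single-type DatasetProvider under .providers.
    groups = defaultdict(list)
    for measurement in measurements:
        groups[type(measurement).__name__].append(measurement)
    return dict(groups)
```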
- property measurements¶
Flattened list of all measurements present across all providers.
Use .indices_by_provider() to obtain the corresponding slices for each provider.
- _post_featurize(featurizer: kinoml.features.core.BaseFeaturizer)¶
Remove measurements whose systems were not successfully featurized.
- Parameters
featurizer (BaseFeaturizer) – The used featurizer.
- observation_models(**kwargs)¶
List of observation models present in this dataset, one per provider (measurement type)
- loss_adapters(**kwargs)¶
List of loss adapters (observation model plus loss function) present in this dataset, one per provider (measurement type)
- abstract observation_model(**kwargs)¶
Draft implementation of a modular observation model, based on individual contributions from different measurement types.
- abstract loss_adapter(**kwargs)¶
Observation model plus loss function, wrapped in a single callable. Return types are backend-dependent.
- indices_by_provider() → dict¶
Return a dict mapping each provider type to their correlative indices in a hypothetically concatenated dataset.
For example, if a MultiDatasetProvider contains 50 measurements of type A, and 25 measurements of type B, this would return {"A": slice(0, 50), "B": slice(50, 75)}.
Note
slice objects can be passed directly to item access syntax, like list[slice(a, b)].
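The slice bookkeeping from the example above can be reimplemented as a small sketch; the function name mirrors the API, but the counts-based input is an assumption for illustration.

```python
def indices_by_provider(counts):
    # Build correlative slices into a hypothetically concatenated
    # dataset, one slice per provider, in insertion order.
    slices, start = {}, 0
    for name, count in counts.items():
        slices[name] = slice(start, start + count)
        start += count
    return slices
```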
- to_dataframe(*args, **kwargs)¶
Concatenate all the providers into a single DataFrame for easier visualization.
Check DatasetProvider.to_dataframe() for more details.
- to_numpy(**kwargs)¶
List of Numpy-native arrays, as generated by each provider.to_numpy(...) method. Check the DatasetProvider.to_numpy docstring for more details.
- to_pytorch(**kwargs)¶
List of PyTorch-compatible objects, as generated by each provider.to_pytorch(...) method. Check the DatasetProvider.to_pytorch docstring for more details.
- to_xgboost(**kwargs)¶
List of DMatrix objects, as generated by each provider.to_xgboost(...) method. Check the DatasetProvider.to_xgboost docstring for more details.
- to_dict_of_arrays(**kwargs) → dict¶
Will generate a dictionary of str: np.ndarray. System indices will be accumulated across providers.
- to_awkward(**kwargs)¶
See DatasetProvider.to_awkward(). X and y will be concatenated along axis=0 (one provider after another).
- __repr__() → str¶
Return repr(self).