kinoml.datasets.core

Base classes for DatasetProvider-like objects

Module Contents

kinoml.datasets.core.logger
class kinoml.datasets.core.BaseDatasetProvider

Bases: object

API specification for dataset providers

abstract property systems
abstract property measurement_type
abstract property conditions
abstract classmethod from_source(path_or_url=None, **kwargs)

Parse CSV/raw files to object model.

abstract observation_model(backend='pytorch')
abstract measurements_as_array(reduce=np.mean)
abstract measurements_by_group()
abstract featurize(*featurizers: Iterable[kinoml.features.core.BaseFeaturizer])
abstract featurized_systems(key='last')
abstract to_dataframe(*args, **kwargs)
abstract to_pytorch(**kwargs)
abstract to_tensorflow(*args, **kwargs)
abstract to_numpy(*args, **kwargs)
class kinoml.datasets.core.DatasetProvider(measurements: Iterable[kinoml.core.measurements.BaseMeasurement], metadata: dict = None)

Bases: BaseDatasetProvider

Base object for all DatasetProvider classes.

Parameters
  • measurements (list of BaseMeasurement) – A DatasetProvider holds a list of kinoml.core.measurements.BaseMeasurement objects (or any of its subclasses). They must be of the same type!

  • metadata (dict) – Extra information for provenance.

Note

All measurements must be of the same type! If they are not, consider using MultiDatasetProvider instead.
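The same-type invariant above can be sketched with a small type check; the class names below are illustrative stand-ins, not the actual kinoml API.

```python
# Hypothetical sketch of the "same measurement type" invariant a
# DatasetProvider enforces; these classes are stand-ins, not kinoml's.
class BaseMeasurement:
    def __init__(self, values):
        self.values = values

class IC50Measurement(BaseMeasurement): ...
class KdMeasurement(BaseMeasurement): ...

def check_single_type(measurements):
    """Return the shared measurement type, or raise if types are mixed."""
    types = {type(m) for m in measurements}
    if len(types) > 1:
        raise TypeError(
            f"Mixed measurement types {types!r}; use MultiDatasetProvider instead."
        )
    return types.pop()
```

A mixed list is the cue to reach for MultiDatasetProvider instead.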

property systems
property measurement_type
property conditions: set
_raw_data
__len__()
__getitem__(subscript)
__repr__() str

Return repr(self).

abstract classmethod from_source(path_or_url=None, **kwargs)

Parse a CSV/raw file into the object model. This method is responsible for generating the objects in self.measurements, if relevant. Additional kwargs will be passed to __init__.

You must define this in your subclass.
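A subclass implementation of from_source typically parses rows into measurement objects. The sketch below uses the standard csv module and a hypothetical Measurement class with assumed "system" and "value" columns; it is not kinoml's actual parser.

```python
# Hypothetical from_source-style parser: CSV text -> measurement objects.
# The Measurement class and the column names are illustrative assumptions.
import csv
import io

class Measurement:
    def __init__(self, system, value):
        self.system, self.value = system, float(value)

def from_source(text):
    """Parse CSV text with 'system,value' columns into Measurement objects."""
    reader = csv.DictReader(io.StringIO(text))
    return [Measurement(row["system"], row["value"]) for row in reader]
```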

featurize(featurizer: kinoml.features.core.BaseFeaturizer)

Apply the given kinoml.features.core.BaseFeaturizer to the systems present in self.measurements.

Parameters

featurizer (BaseFeaturizer) – Featurization scheme that will be applied to the systems, in a stacked way.

Note

TODO:
  • Will the systems be properly featurized with Dask?

_post_featurize(featurizer: kinoml.features.core.BaseFeaturizer)

Remove measurements whose systems were not successfully featurized.

Parameters

featurizer (BaseFeaturizer) – The used featurizer.

featurized_systems(key='last', clear_after=False)

Return the key featurized objects from all systems.

abstract _to_dataset(style='pytorch')

Generate a clean <style>.data.Dataset object for further steps in the pipeline (model building, etc).

Warning

This step is lossy because the resulting objects will no longer hold chemical data. Operations depending on such information must be performed first.

Examples

>>> provider = DatasetProvider()
>>> provider.featurize()  # optional
>>> splitter = TimeSplitter()
>>> split_indices = splitter.split(provider.data)
>>> dataset = provider.to_dataset("pytorch")  # .featurize() under the hood
>>> X_train, X_test, y_train, y_test = train_test_split(dataset, split_indices)
to_dataframe(*args, **kwargs)

Generates a pandas.DataFrame containing information on the systems and their measurements.

Return type

pandas.DataFrame

to_pytorch(featurizer=None, **kwargs)

Export dataset to a PyTorch-compatible object, via adapters found in kinoml.torch_datasets.

to_xgboost(**kwargs)

Export dataset to a DMatrix object, native to the XGBoost framework.

abstract to_tensorflow(*args, **kwargs)
to_numpy(featurization_key='last', y_dtype='float32', **kwargs)

Export dataset to a tuple of two Numpy arrays of same shape:

  • X: the featurized systems

  • y: the measurements values (must be the same measurement type)

Parameters
  • featurization_key (str, optional="last") – Which featurization present in the systems will be taken to build the X array. Usually last, as provided by a Pipeline object.

  • y_dtype (np.dtype or str, optional="float32") – Coerce the y array to this dtype.

  • kwargs (optional) – Dict that will be forwarded to .measurements_as_array, which will build the y array.

Returns

X, y

Return type

2-tuple of np.array

Note

This exporter assumes that each System is featurized as a single tensor with homogeneous shape throughout the system collection. If this does not hold true for your current featurization scheme, consider using .to_dict_of_arrays instead.
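The X/y contract described above can be sketched in a few lines of NumPy. The function name and signature below are illustrative, not kinoml's implementation; it assumes each system's feature tensor shares one shape, and reduces replicate measurement values into a scalar y entry.

```python
# Minimal sketch of the to_numpy contract: stack homogeneous per-system
# feature tensors into X, reduce replicate values into y (names assumed).
import numpy as np

def to_numpy(features, values, reduce=np.mean, y_dtype="float32"):
    X = np.stack(features)                              # (n_systems, *feature_shape)
    y = np.asarray([reduce(v) for v in values], dtype=y_dtype)
    return X, y
```

If the per-system tensors had different shapes, np.stack would raise, which is exactly the case .to_dict_of_arrays is meant to handle.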

to_dict_of_arrays(featurization_key='last', y_dtype='float32', _initial_system_index=0) dict

Export dataset to a dict-like object, compatible with DictOfArrays and NPZ files.

The idea is to provide unique keys for each system and their features, following the syntax X_s{int}_v{int}.

This object is useful when the features for each system have different shapes and/or dimensionality and cannot be concatenated into a single homogeneous array.

Parameters
  • featurization_key (Hashable, optional="last") – Which key to access in each System.featurizations dict

  • y_dtype (np.dtype or str, optional="float32") – Which kind of dtype to use for the y array

  • _initial_system_index (int, optional=0) – PRIVATE. Start counting systems in X_s{int} with this value.

Returns

A dictionary that maps str keys to array-like objects. Depending on the featurization scheme, keys can be:

  1. All systems are featurized as an array and they share the same shape -> X, y

  2. All N systems are featurized as an array but they do NOT share the same shape -> X_s0_, X_s1_, ..., X_sN_

  3. All N systems are featurized as a M-tuple of arrays (shape irrelevant) -> X_s0_a0_, X_s0_a1_, X_s1_a0_, X_s1_a1_, ..., X_sN_aM_

Return type

dict[str, array]

Note

The X keys have a trailing underscore on purpose. Without it, filtering keys out of the dictionary by system index can silently match the wrong entries. For example, filtering for system s1 with key.startswith("X_s1") will also select X_s10, X_s11… Hence, we filter with X_s{int}_.
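The pitfall is easy to demonstrate with plain string prefixes; the keys below follow the X_s{int}_ scheme described above, with illustrative contents.

```python
# Why the trailing underscore matters: prefix filtering on "X_s1" alone
# also matches systems 10, 11, ... (keys and values are illustrative).
def keys_for_system(arrays_dict, system_index):
    """Select every array belonging to one system by its index."""
    prefix = f"X_s{system_index}_"
    return sorted(k for k in arrays_dict if k.startswith(prefix))

arrays = {f"X_s{i}_": [i] for i in range(12)}
# Naive filtering without the underscore over-selects:
naive = sorted(k for k in arrays if k.startswith("X_s1"))
```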

to_awkward(featurization_key='last', y_dtype='float32', clear_after=False)

Creates an awkward array out of the featurized systems and the associated measurements.

Return type

awkward array

Notes

Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.

Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they’re not.

observation_model(**kwargs)

Draft implementation of a modular observation model, based on individual contributions from different measurement types.

loss_adapter(**kwargs)

Observation model plus loss function, wrapped in a single callable. Return types are backend-dependent.

measurements_as_array(reduce=np.mean, dtype='float32')
split_by_groups() dict

If a kinoml.datasets.groups class has been applied to this instance, this method will create more DatasetProvider instances, one per group.

Returns

Maps group key to sub-datasets

Return type

dict
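The grouping behavior can be sketched with a plain dictionary partition; the measurements below are dicts with an assumed "group" key, standing in for measurements annotated by a kinoml.datasets.groups class.

```python
# Sketch of split_by_groups: partition measurements into one sub-collection
# per group key. The "group" key on each record is an assumption.
from collections import defaultdict

def split_by_groups(measurements):
    """Map each group key to the measurements assigned to it."""
    groups = defaultdict(list)
    for m in measurements:
        groups[m["group"]].append(m)
    return dict(groups)
```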

classmethod _download_to_cache_or_retrieve(path_or_url) str

Helper function to either download files to the user cache, or retrieve an already cached copy.

Parameters

path_or_url (str or Path-like) – File path or URL pointing to the required file

Returns

If the provided argument is a local file, the same path is returned right away. If it was a URL, the path to the (downloaded) cached file is returned.

Return type

str
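A helper with this shape can be sketched as follows; the cache location and URL-hashing scheme are assumptions, not kinoml's actual cache layout.

```python
# Sketch of a path-or-URL cache helper: local paths pass through, URLs are
# downloaded once into a cache directory keyed by a hash of the URL.
import hashlib
import urllib.request
from pathlib import Path

def download_to_cache_or_retrieve(path_or_url, cache_dir="~/.cache/demo"):
    path = Path(path_or_url).expanduser()
    if path.is_file():            # already a local file: return it as-is
        return str(path)
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)
    cached = cache / hashlib.sha256(str(path_or_url).encode()).hexdigest()
    if not cached.exists():       # download only on a cache miss
        urllib.request.urlretrieve(path_or_url, cached)
    return str(cached)
```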

class kinoml.datasets.core.MultiDatasetProvider(measurements: Iterable[kinoml.core.measurements.BaseMeasurement], metadata: dict = None)

Bases: DatasetProvider

Adapter class that is able to expose a DatasetProvider-like interface to a collection of Measurements of different types.

The different types are split into individual DatasetProvider objects, stored under .providers.

The rest of the API works around that list to provide similar functionality as the original, single-type DatasetProvider, but in plural.

Parameters

measurements (list of BaseMeasurement) – A MultiDatasetProvider holds a list of kinoml.core.measurements.BaseMeasurement objects (or any of its subclasses). Unlike DatasetProvider, the measurements here can be of different types, but they will be grouped together in different sub-datasets.

property measurements

Flattened list of all measurements present across all providers.

Use .indices_by_provider() to obtain the corresponding slices to each provider.

_post_featurize(featurizer: kinoml.features.core.BaseFeaturizer)

Remove measurements whose systems were not successfully featurized.

Parameters

featurizer (BaseFeaturizer) – The used featurizer.

observation_models(**kwargs)

List of observation models present in this dataset, one per provider (measurement type)

loss_adapters(**kwargs)

List of loss adapters present in this dataset, one per provider (measurement type)

abstract observation_model(**kwargs)

Draft implementation of a modular observation model, based on individual contributions from different measurement types.

abstract loss_adapter(**kwargs)

Observation model plus loss function, wrapped in a single callable. Return types are backend-dependent.

indices_by_provider() dict

Return a dict mapping each provider type to the slice of indices its measurements occupy in a hypothetically concatenated dataset.

For example, if a MultiDatasetProvider contains 50 measurements of type A, and 25 measurements of type B, this would return {"A": slice(0, 50), "B": slice(50, 75)}.

Note

slice objects can be passed directly to item access syntax, like list[slice(a, b)].
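The slice bookkeeping described above reduces to a running offset over per-type counts; the function below is an illustrative sketch, not kinoml's implementation.

```python
# Sketch of indices_by_provider: per-type counts -> contiguous slices in
# the hypothetically concatenated dataset (names are assumptions).
def indices_by_provider(counts):
    """counts: dict mapping measurement type name -> number of measurements."""
    slices, start = {}, 0
    for name, n in counts.items():
        slices[name] = slice(start, start + n)
        start += n
    return slices
```

As the docstring notes, the resulting slice objects can be used directly in item access, e.g. measurements[slices["B"]].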

to_dataframe(*args, **kwargs)

Concatenate all the providers into a single DataFrame for easier visualization.

Check DatasetProvider.to_dataframe() for more details.

to_numpy(**kwargs)

List of Numpy-native arrays, as generated by each provider.to_numpy(...) method. Check DatasetProvider.to_numpy docstring for more details.

to_pytorch(**kwargs)

List of PyTorch-compatible objects, as generated by each provider.to_pytorch(...) method. Check DatasetProvider.to_pytorch docstring for more details.

to_xgboost(**kwargs)

List of DMatrix objects, as generated by each provider.to_xgboost(...) method. Check DatasetProvider.to_xgboost docstring for more details.

to_dict_of_arrays(**kwargs) dict

Generates a dictionary of str: np.ndarray. System indices will be accumulated across providers.

to_awkward(**kwargs)

See DatasetProvider.to_awkward(). X and y will be concatenated along axis=0 (one provider after another)

__repr__() str

Return repr(self).