kinoml.datasets.core

Base classes for DatasetProvider-like objects

Module Contents

kinoml.datasets.core.logger
class kinoml.datasets.core.BaseDatasetProvider

Bases: object

API specification for dataset providers

abstract property systems
abstract property measurement_type
abstract property conditions
abstract classmethod from_source(path_or_url=None, **kwargs)

Parse CSV/raw files to object model.

abstract observation_model(backend='pytorch')
abstract measurements_as_array(reduce=np.mean)
abstract measurements_by_group()
abstract featurize(*featurizers: Iterable[kinoml.features.core.BaseFeaturizer])
abstract featurized_systems(key='last')
abstract to_dataframe(*args, **kwargs)
abstract to_pytorch(**kwargs)
abstract to_tensorflow(*args, **kwargs)
abstract to_numpy(*args, **kwargs)
class kinoml.datasets.core.DatasetProvider(measurements: Iterable[kinoml.core.measurements.BaseMeasurement], metadata: dict = None)

Bases: BaseDatasetProvider

Base object for all DatasetProvider classes.

Parameters
  • measurements (list of BaseMeasurement) – A DatasetProvider holds a list of kinoml.core.measurements.BaseMeasurement objects (or any of its subclasses). They must be of the same type!

  • metadata (dict) – Extra information for provenance.

Note

All measurements must be of the same type! If they are not, consider using MultiDatasetProvider instead.
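The same-type invariant above can be sketched with a small type check; the class names below are illustrative stand-ins, not the actual kinoml API.

```python
# Hypothetical sketch of the "same measurement type" invariant a
# DatasetProvider enforces; these classes are stand-ins, not kinoml's.
class BaseMeasurement:
    def __init__(self, values):
        self.values = values

class IC50Measurement(BaseMeasurement): ...
class KdMeasurement(BaseMeasurement): ...

def check_single_type(measurements):
    """Return the shared measurement type, or raise if types are mixed."""
    types = {type(m) for m in measurements}
    if len(types) > 1:
        raise TypeError(
            f"Mixed measurement types {types!r}; use MultiDatasetProvider instead."
        )
    return types.pop()
```

A mixed list is the cue to reach for MultiDatasetProvider instead.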

property systems
property measurement_type
property conditions: set
_raw_data
__len__()
__getitem__(subscript)
__repr__() str

Return repr(self).

abstract classmethod from_source(path_or_url=None, **kwargs)

Parse a CSV/raw file into the object model. This method is responsible for generating the objects in self.measurements, if relevant. Additional kwargs will be passed to __init__.

You must define this in your subclass.
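A subclass implementation of from_source typically parses rows into measurement objects. The sketch below uses the standard csv module and a hypothetical Measurement class with assumed "system" and "value" columns; it is not kinoml's actual parser.

```python
# Hypothetical from_source-style parser: CSV text -> measurement objects.
# The Measurement class and the column names are illustrative assumptions.
import csv
import io

class Measurement:
    def __init__(self, system, value):
        self.system, self.value = system, float(value)

def from_source(text):
    """Parse CSV text with 'system,value' columns into Measurement objects."""
    reader = csv.DictReader(io.StringIO(text))
    return [Measurement(row["system"], row["value"]) for row in reader]
```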

featurize(featurizer: kinoml.features.core.BaseFeaturizer)

Apply the given kinoml.features.core.BaseFeaturizer to the systems present in self.measurements.

Parameters

featurizer (BaseFeaturizer) – Featurization scheme that will be applied to the systems, in a stacked way.

Note

TODO:
  • Will the systems be properly featurized with Dask?

_post_featurize(featurizer: kinoml.features.core.BaseFeaturizer)

Remove measurements whose systems were not successfully featurized.

Parameters

featurizer (BaseFeaturizer) – The used featurizer.

featurized_systems(key='last', clear_after=False)

Return the key featurized objects from all systems.

abstract _to_dataset(style='pytorch')

Generate a clean <style>.data.Dataset object for further steps in the pipeline (model building, etc).

Warning

This step is lossy because the resulting objects will no longer hold chemical data. Operations depending on such information must be performed first.

Examples

>>> provider = DatasetProvider()
>>> provider.featurize()  # optional
>>> splitter = TimeSplitter()
>>> split_indices = splitter.split(provider.data)
>>> dataset = provider.to_dataset("pytorch")  # .featurize() under the hood
>>> X_train, X_test, y_train, y_test = train_test_split(dataset, split_indices)
to_dataframe(*args, **kwargs)

Generates a pandas.DataFrame containing information on the systems and their measurements.

Return type

pandas.DataFrame

to_pytorch(featurizer=None, **kwargs)

Export dataset to a PyTorch-compatible object, via adapters found in kinoml.torch_datasets.

to_xgboost(**kwargs)

Export dataset to a DMatrix object, native to the XGBoost framework.

abstract to_tensorflow(*args, **kwargs)
to_numpy(featurization_key='last', y_dtype='float32', **kwargs)

Export dataset to a tuple of two Numpy arrays of same shape:

  • X: the featurized systems

  • y: the measurements values (must be the same measurement type)

Parameters
  • featurization_key (str, optional="last") – Which featurization present in the systems will be taken to build the X array. Usually last, as provided by a Pipeline object.

  • y_dtype (np.dtype or str, optional="float32") – Coerce the y array to this dtype.

  • kwargs (optional) – Dict that will be forwarded to .measurements_as_array, which will build the y array.

Returns

X, y

Return type

2-tuple of np.array

Note

This exporter assumes that each System is featurized as a single tensor with homogeneous shape throughout the system collection. If this does not hold true for your current featurization scheme, consider using .to_dict_of_arrays instead.
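The X/y contract described above can be sketched in a few lines of NumPy. The function name and signature below are illustrative, not kinoml's implementation; it assumes each system's feature tensor shares one shape, and reduces replicate measurement values into a scalar y entry.

```python
# Minimal sketch of the to_numpy contract: stack homogeneous per-system
# feature tensors into X, reduce replicate values into y (names assumed).
import numpy as np

def to_numpy(features, values, reduce=np.mean, y_dtype="float32"):
    X = np.stack(features)                              # (n_systems, *feature_shape)
    y = np.asarray([reduce(v) for v in values], dtype=y_dtype)
    return X, y
```

If the per-system tensors had different shapes, np.stack would raise, which is exactly the case .to_dict_of_arrays is meant to handle.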

to_dict_of_arrays(featurization_key='last', y_dtype='float32', _initial_system_index=0) dict

Export dataset to a dict-like object, compatible with DictOfArrays and NPZ files.

The idea is to provide unique keys for each system and their features, following the syntax X_s{int}_v{int}.

This object is useful when the features for each system have different shapes and/or dimensionality and cannot be concatenated into a single homogeneous array.

Parameters
  • featurization_key (Hashable, optional="last") – Which key to access in each System.featurizations dict

  • y_dtype (np.dtype or str, optional="float32") – Which kind of dtype to use for the y array

  • _initial_system_index (int, optional=0) – PRIVATE. Start counting systems in X_s{int} with this value.

Returns

A dictionary that maps str keys to array-like objects. Depending on the featurization scheme, keys can be:

  1. All systems are featurized as an array and they share the same shape -> X, y

  2. All N systems are featurized as an array but they do NOT share the same shape -> X_s0_, X_s1_, ..., X_sN_

  3. All N systems are featurized as a M-tuple of arrays (shape irrelevant) -> X_s0_a0_, X_s0_a1_, X_s1_a0_, X_s1_a1_, ..., X_sN_aM_

Return type

dict[str, array]

Note

The X keys have a trailing underscore on purpose. Without it, filtering keys out of the dictionary by system index can silently match the wrong entries. For example, filtering for system s1 with key.startswith("X_s1") will also select X_s10, X_s11… Hence, we filter with X_s{int}_.
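The pitfall is easy to demonstrate with plain string prefixes; the keys below follow the X_s{int}_ scheme described above, with illustrative contents.

```python
# Why the trailing underscore matters: prefix filtering on "X_s1" alone
# also matches systems 10, 11, ... (keys and values are illustrative).
def keys_for_system(arrays_dict, system_index):
    """Select every array belonging to one system by its index."""
    prefix = f"X_s{system_index}_"
    return sorted(k for k in arrays_dict if k.startswith(prefix))

arrays = {f"X_s{i}_": [i] for i in range(12)}
# Naive filtering without the underscore over-selects:
naive = sorted(k for k in arrays if k.startswith("X_s1"))
```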

to_awkward(featurization_key='last', y_dtype='float32', clear_after=False)

Creates an awkward array out of the featurized systems and the associated measurements.

Return type

awkward array

Notes

Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.

Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they’re not.

observation_model(**kwargs)

Draft implementation of a modular observation model, based on individual contributions from different measurement types.

loss_adapter(**kwargs)

Observation model plus loss function, wrapped in a single callable. Return types are backend-dependent.

measurements_as_array(reduce=np.mean, dtype='float32')
split_by_groups() dict

If a kinoml.datasets.groups class has been applied to this instance, this method will create more DatasetProvider instances, one per group.

Returns

Maps group key to sub-datasets

Return type

dict
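The grouping behavior can be sketched with a plain dictionary partition; the measurements below are dicts with an assumed "group" key, standing in for measurements annotated by a kinoml.datasets.groups class.

```python
# Sketch of split_by_groups: partition measurements into one sub-collection
# per group key. The "group" key on each record is an assumption.
from collections import defaultdict

def split_by_groups(measurements):
    """Map each group key to the measurements assigned to it."""
    groups = defaultdict(list)
    for m in measurements:
        groups[m["group"]].append(m)
    return dict(groups)
```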

classmethod _download_to_cache_or_retrieve(path_or_url) str

Helper function to either download files to the user cache, or retrieve an already cached copy.

Parameters

path_or_url (str or Path-like) – File path or URL pointing to the required file

Returns

If the provided argument is a local file, the same path is returned right away. If it was a URL, the path to the (downloaded) cached file is returned.

Return type

str
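A helper with this shape can be sketched as follows; the cache location and URL-hashing scheme are assumptions, not kinoml's actual cache layout.

```python
# Sketch of a path-or-URL cache helper: local paths pass through, URLs are
# downloaded once into a cache directory keyed by a hash of the URL.
import hashlib
import urllib.request
from pathlib import Path

def download_to_cache_or_retrieve(path_or_url, cache_dir="~/.cache/demo"):
    path = Path(path_or_url).expanduser()
    if path.is_file():            # already a local file: return it as-is
        return str(path)
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)
    cached = cache / hashlib.sha256(str(path_or_url).encode()).hexdigest()
    if not cached.exists():       # download only on a cache miss
        urllib.request.urlretrieve(path_or_url, cached)
    return str(cached)
```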

class kinoml.datasets.core.MultiDatasetProvider(measurements: Iterable[kinoml.core.measurements.BaseMeasurement], metadata: dict = None)

Bases: DatasetProvider

Adapter class that is able to expose a DatasetProvider-like interface to a collection of Measurements of different types.

The different types are split into individual DatasetProvider objects, stored under .providers.

The rest of the API works around that list to provide similar functionality as the original, single-type DatasetProvider, but in plural.

Parameters

measurements (list of BaseMeasurement) – A MultiDatasetProvider holds a list of kinoml.core.measurements.BaseMeasurement objects (or any of its subclasses). Unlike DatasetProvider, the measurements here can be of different types, but they will be grouped together in different sub-datasets.

property measurements

Flattened list of all measurements present across all providers.

Use .indices_by_provider() to obtain the corresponding slices to each provider.

_post_featurize(featurizer: kinoml.features.core.BaseFeaturizer)

Remove measurements whose systems were not successfully featurized.

Parameters

featurizer (BaseFeaturizer) – The used featurizer.

observation_models(**kwargs)

List of observation models present in this dataset, one per provider (measurement type)

loss_adapters(**kwargs)

List of loss adapters present in this dataset, one per provider (measurement type)

abstract observation_model(**kwargs)

Draft implementation of a modular observation model, based on individual contributions from different measurement types.

abstract loss_adapter(**kwargs)

Observation model plus loss function, wrapped in a single callable. Return types are backend-dependent.

indices_by_provider() dict

Return a dict mapping each provider type to the slice of indices its measurements occupy in a hypothetically concatenated dataset.

For example, if a MultiDatasetProvider contains 50 measurements of type A, and 25 measurements of type B, this would return {"A": slice(0, 50), "B": slice(50, 75)}.

Note

slice objects can be passed directly to item access syntax, like list[slice(a, b)].
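The slice bookkeeping described above reduces to a running offset over per-type counts; the function below is an illustrative sketch, not kinoml's implementation.

```python
# Sketch of indices_by_provider: per-type counts -> contiguous slices in
# the hypothetically concatenated dataset (names are assumptions).
def indices_by_provider(counts):
    """counts: dict mapping measurement type name -> number of measurements."""
    slices, start = {}, 0
    for name, n in counts.items():
        slices[name] = slice(start, start + n)
        start += n
    return slices
```

As the docstring notes, the resulting slice objects can be used directly in item access, e.g. measurements[slices["B"]].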

to_dataframe(*args, **kwargs)

Concatenate all the providers into a single DataFrame for easier visualization.

Check DatasetProvider.to_dataframe() for more details.

to_numpy(**kwargs)

List of Numpy-native arrays, as generated by each provider.to_numpy(...) method. Check DatasetProvider.to_numpy docstring for more details.

to_pytorch(**kwargs)

List of PyTorch-compatible objects, as generated by each provider.to_pytorch(...) method. Check DatasetProvider.to_pytorch docstring for more details.

to_xgboost(**kwargs)

List of DMatrix objects, as generated by each provider.to_xgboost(...) method. Check DatasetProvider.to_xgboost docstring for more details.

to_dict_of_arrays(**kwargs) dict

Generates a dictionary of str: np.ndarray. System indices will be accumulated across providers.

to_awkward(**kwargs)

See DatasetProvider.to_awkward(). X and y will be concatenated along axis=0 (one provider after another)

__repr__() str

Return repr(self).