DAta RepresenTativeness

DART uses hyperdimensional computing to quantitatively assess the distribution match of sample metadata (attributes) between different populations (or subpopulations).

Getting Started

Required input

The only input required to use DART is a pandas DataFrame containing a dataset’s sample information. The DataFrame should be organized such that every row represents a distinct sample and each column represents an attribute.

Important

The current version of DART does not have a way to resolve samples with missing information. All samples missing information for one or more attributes will be removed from the dataset prior to encoding.

Note

The measures of dataset similarity used in DART are based on attribute distribution and do not depend on the number of samples. Therefore, the two datasets shown in the table below would be considered equivalent. Reducing the number of samples in datasets to the minimum needed to accurately represent the dataset distribution can improve processing time for extremely large datasets.

Full dataset

Population

Disease Status

A

Positive

A

Positive

B

Negative

B

Negative

B

Negative

B

Negative

Reduced dataset

Population

Disease Status

A

Positive

B

Negative

B

Negative

Constructing a dataset

Simply provide the DataFrame containing sample attribute information to the DART.Dataset class to create a Dataset with default attribute configurations. Specifying custom configurations is not currently supported, but is planned for a future version of DART.

Running comparisons

Comparisons can be run with a constructed Dataset using Dataset.compare(). The first two arguments (criteria1 and criteria2) are dictionaries used to define the two groups that will be compared. For example, the attribute distributionsof B and A samples could be compared using Dataset.compare({"Population":"B"}, {"Population":"A"}). See compare() for optional arguments.

Contents