Add new modalities to the package

How to Add a New Dataset

This tutorial guides you through the steps to add a new dataset to the framework. Datasets are loaded per domain label: a separate dataset object is created for each domain. The domain labels to use are specified on the command line or in the experiment definition (--tr_d).

Within each such dataset (one per domain), individual observations have the form (X, Y), where X is an image (or a feature vector, etc.) and Y is its associated label. In general, the label Y is a separate concept from the domain label.

Follow the steps below to set up and configure your own dataset for the experiments.

Step 1: Create Dataset File

Navigate to the dset folder in your project directory.
Create a new Python file named dset_<name>.py, replacing <name> with the name of your dataset.

Step 2: Define the Dataset Class

Inside your new file, you will need to define a dataset class with necessary methods, along the lines of:

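class DsetYourClass:
    def __init__(self, domain, args, list_transforms=None):
        # Initialization code here, e.g., collect the paths of all
        # images that belong to the given domain
        self.domain = domain
        self.args = args
        self.list_transforms = list_transforms

    def __len__(self):
        # Number of observations in this domain; needed by len(dset),
        # e.g., when splitting into training and validation sets
        raise NotImplementedError

    def __getitem__(self, index):
        # Code to load an image and its associated labels
        # Returns:
        # - image: the loaded image
        # - image_label: the label associated with the image
        # - conditional_label: a condition associated with the image
        # - image_id: the identifier for the image
        return image, image_label, conditional_label, image_id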
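As a concrete illustration of the skeleton above, here is a minimal in-memory sketch. The class name `DsetToy`, the `samples` argument, and the data are hypothetical; a real dataset would load images from disk and would typically subclass `torch.utils.data.Dataset`. The sketch is plain Python so that it runs without extra dependencies, and it also implements `__len__`, which calls such as `len(dset)` rely on.

```python
class DsetToy:
    """Toy per-domain dataset (hypothetical; for illustration only)."""

    def __init__(self, domain, samples):
        # `domain` is the index of the domain this object represents.
        # `samples` is a list of (domain, image, image_label, conditional_label)
        # tuples; only the entries that match `domain` are kept.
        self.domain = domain
        self.items = [s[1:] for s in samples if s[0] == domain]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        image, image_label, conditional_label = self.items[index]
        # The position within the domain doubles as the image identifier here.
        image_id = f"domain{self.domain}_{index}"
        return image, image_label, conditional_label, image_id


# Nested lists stand in for pixel arrays.
samples = [
    (0, [[0.0, 0.1]], 3, "scanner_a"),
    (1, [[0.5, 0.6]], 7, "scanner_b"),
    (0, [[0.9, 1.0]], 3, "scanner_b"),
]
dset_domain0 = DsetToy(domain=0, samples=samples)
```

Note that the image label (3 or 7) and the domain index (0 or 1) are stored separately, mirroring the distinction between Y and the domain label.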

Step 3: Implement `__getitem__`

The __getitem__ method is a key part of the dataset class, which should be implemented to load and return the necessary data:

image: This should load the actual image from the dataset.
image_label: Load the label (not the domain label) that is associated with the image.
conditional_label: If your dataset includes conditions (for example, any additionally annotated label), they can be returned here; one option is to read them from a CSV file.
image_id: Useful for tracking which image is being processed, especially in debugging or complex data handling scenarios.
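One way to attach conditional labels is a lookup table read from a CSV file. The sketch below assumes a hypothetical file layout with the columns `image_id` and `condition`; the column names and the helper `load_conditions` are illustrative and not part of the package.

```python
import csv
import io


def load_conditions(csv_file):
    """Map each image_id to its condition, given an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return {row["image_id"]: row["condition"] for row in reader}


# An in-memory file stands in for a CSV file on disk.
fake_csv = io.StringIO("image_id,condition\nimg_001,scanner_a\nimg_002,scanner_b\n")
conditions = load_conditions(fake_csv)

# Inside __getitem__, the condition could then be looked up per image,
# with a fallback value for images that have no annotation.
conditional_label = conditions.get("img_002", "unknown")
```

Building the dictionary once in `__init__` and only doing the lookup in `__getitem__` avoids re-reading the file for every sample.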

Step 4: Create a Task Class That Uses Your Dataset Class

Navigate to the tasks folder in your project directory.
Create a task file specific to your dataset. This file will manage loading one domain at a time and will be passed to the experiment.
The task_<name>.py file should contain a task class along the following lines:

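from torch.utils.data import random_split
from torchvision import transforms

# NodeTaskDictCluster, ImSize, and mk_dummy_label_list_str also need to be
# imported from the framework; DsetYourClass is the dataset class from Step 2.


class NodeTaskTemplate(NodeTaskDictCluster):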
    """Basic template for tasks where categories are considered 'domains'
    """

    @property
    def list_str_y(self):
        """
        This task has no conventional labels; categories are treated as domains.
        """
        return mk_dummy_label_list_str("label_prefix", number_of_domains)

    @property
    def isize(self):
        """
        :return: Image size object storing image channels, height, width.
        """
        return ImSize(channels, height, width)

    def get_list_domains(self):
        """
        Get list of domain names.
        :return: List of domain names.
        """
        return mk_dummy_label_list_str("domain_prefix", number_of_domains)

    def get_dset_by_domain(self, args, na_domain, split=False):
        """
        Get a dataset by domain.
        :param args: Command line arguments.
        :param na_domain: Domain name.
        :param split: Whether to perform a train/validation split.
        :return: Training dataset, Validation dataset.
        """
        ratio_split = float(args.split) if split else False
        trans = [transforms.Resize((desired_height, desired_width)), transforms.ToTensor()]
        ind_global = self.get_list_domains().index(na_domain)
        dset = DsetYourClass(domain=ind_global, args=args, list_transforms=trans)

        if ratio_split > 0:
            train_len = int(len(dset) * ratio_split)
            val_len = len(dset) - train_len
            train_set, val_set = random_split(dset, [train_len, val_len])
        else:
            train_set = dset
            val_set = dset
        return train_set, val_set
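The split arithmetic and the domain lookup in `get_dset_by_domain` can be checked in isolation. The helper `split_lengths` and the domain names below are hypothetical; the helper only mirrors the length computation above, while the actual partitioning is done by `random_split`.

```python
def split_lengths(n_samples, ratio_split):
    """Return (train_len, val_len) for a dataset with n_samples items."""
    train_len = int(n_samples * ratio_split)  # rounds down
    val_len = n_samples - train_len  # the remainder goes to validation
    return train_len, val_len


list_domains = ["domain_0", "domain_1", "domain_2"]  # made-up domain names
ind_global = list_domains.index("domain_1")  # position of the requested domain
train_len, val_len = split_lengths(101, 0.75)
```

Because the validation length is computed as the remainder, the two lengths always sum to the dataset size, which `random_split` requires when it is given integer lengths.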

Step 5: Add the New Task to the Chain in the `TaskChainNodeGetter` Class

After defining your task class, you will need to integrate it into the processing chain. This is typically done in the zoo_task.py file where multiple tasks are chained together for sequential processing.

Here’s how to add your new task to the chain:

Open zoo_task.py and locate the TaskChainNodeGetter class.
Add your NodeTaskTemplate to the existing chain as shown below:

chain = NodeTaskTemplate(succ=chain)
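The `succ=chain` idiom builds a linked chain in which each node either handles a request itself or forwards it to its successor. The following is a generic, self-contained sketch of that pattern; the class `Node`, its `handle` method, and the task names are hypothetical stand-ins and do not reproduce the package's actual classes.

```python
class Node:
    """One link in a chain of responsibility (illustrative stand-in)."""

    def __init__(self, name, succ=None):
        self.name = name
        self.succ = succ  # the node that was built before this one, or None

    def handle(self, request):
        # Answer if the request matches this node; otherwise pass it on.
        if request == self.name:
            return self
        if self.succ is None:
            raise LookupError(f"no node handles {request!r}")
        return self.succ.handle(request)


chain = Node("task_a")
chain = Node("task_b", succ=chain)
chain = Node("task_new", succ=chain)  # the newly added task sits at the head

found = chain.handle("task_a")  # forwarded twice before it is handled
```

In this pattern, adding a task is a one-line change: the new node wraps the existing chain, and requests it does not recognize still reach the older nodes.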

Conclusion

With the dataset class defined in dset_<name>.py, imported into task_<name>.py, and the task added to the chain in zoo_task.py, your new dataset is ready to be used in your project.

Adding a New Model to the Domid Python Package

This tutorial will guide you through the steps to add a new model file to the models submodule in the domid Python package.

Step 1: Create the Model File and Define the Model Class

Navigate to the models directory in the domid codebase and create a file named model_<name>.py. In this file, you will construct the new model, define the loss functions to be optimized, and configure any necessary clustering layers.

The layers of the model are defined in the compos submodule, which already contains fully connected and convolutional VAEs (Variational AutoEncoders) and AEs (AutoEncoders). These components can be used as building blocks for your model.

Create a class for your model by extending a base model class from domid. Typically, models extend a common base class such as AModelCluster (defined in a_model_cluster.py), which provides some default functionality, and are wrapped within a mk_model function:

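# Import path assumed from the file name a_model_cluster.py in the models directory
from domid.models.a_model_cluster import AModelCluster


def mk_model(parent_class=AModelCluster):
    class CustomModel(parent_class):
        def __init__(self, arg1, arg2, **kwargs):  # placeholder arguments
            super(CustomModel, self).__init__()
            # Model initialization and layer definitions
            model = None  # placeholder: build the network here, e.g., from compos components
            self.model = model

        def _inference(self, x):
            # ...
            raise NotImplementedError

        def infer_d_v_2(self, x, inject_domain):
            # ...
            raise NotImplementedError

        def _cal_reconstruction_loss_helper(self, x, y):
            # ...
            raise NotImplementedError

        # Implement any additional methods necessary for your model
        def _cal_loss_(self, x, y):
            # ...
            raise NotImplementedError

    return CustomModel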
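The `mk_model` wrapper is a class factory: it builds the model class on top of whichever parent class is passed in. A plain-Python sketch of that pattern follows, using hypothetical stand-in classes (`BaseA`, `BaseB`) and a made-up hyperparameter (`zd_dim`) instead of the real base model.

```python
class BaseA:
    def describe(self):
        return "base A"


class BaseB:
    def describe(self):
        return "base B"


def mk_model(parent_class=BaseA):
    """Return a model class that inherits from parent_class."""

    class CustomModel(parent_class):
        def __init__(self, zd_dim):
            super().__init__()
            self.zd_dim = zd_dim  # hypothetical hyperparameter

        def describe(self):
            return f"custom model on {super().describe()}"

    return CustomModel


model_a = mk_model()(zd_dim=8)  # default parent class
model_b = mk_model(BaseB)(zd_dim=8)  # same model code, different parent class
```

The benefit of this design is that the same model definition can be combined with different parent classes without duplicating its code.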

Step 2: Implement a Trainer if Needed

When integrating your model into the domid package, you have the option to utilize an existing trainer from the package or define a new trainer that caters to the specific needs of your model. Below are details on both approaches.

Using an Existing Trainer

domid includes several generic trainers that are designed to work with a variety of models. One example is trainer_cluster.py, which is compatible with the VaDE and DEC models.

Defining a New Trainer

If the existing trainers do not meet the specific requirements of your model, you may need to define a new trainer. This involves:

Creating a Trainer Class: Define a class in Python that encapsulates all aspects of training your model. This includes initializing the model, running the training loops, handling validation, and potentially testing.

class CustomTrainer:
    def __init__(self, model, optimizer, loss_fn, device):
        self.model = model.to(device)
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.device = device

    def tr_epoch(self, epoch_number):
        # Runs one epoch of the experiment; for more details, see any of
        # the existing trainers.
        raise NotImplementedError
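To make the control flow of `tr_epoch` concrete, here is a dependency-free sketch that fits a single weight by gradient descent. Everything in it (`ToyModel`, `ToyTrainer`, the learning rate, the data) is hypothetical; a real trainer would instead use PyTorch tensors, an optimizer, and the model's own loss method.

```python
class ToyModel:
    """Predicts y = w * x with a single trainable weight."""

    def __init__(self):
        self.w = 0.0

    def loss_and_grad(self, x, y):
        err = self.w * x - y
        return err ** 2, 2 * err * x  # squared error and its derivative w.r.t. w


class ToyTrainer:
    def __init__(self, model, loader, lr=0.05):
        self.model = model
        self.loader = loader  # a list of (x, y) pairs
        self.lr = lr

    def tr_epoch(self, epoch_number):
        """Run one pass over the data and return the mean loss."""
        total = 0.0
        for x, y in self.loader:
            loss, grad = self.model.loss_and_grad(x, y)
            self.model.w -= self.lr * grad  # plain gradient-descent step
            total += loss
        return total / len(self.loader)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2 * x
trainer = ToyTrainer(ToyModel(), data)
losses = [trainer.tr_epoch(epoch) for epoch in range(20)]
```

Returning the mean loss from `tr_epoch` makes it easy for the calling loop to log progress or to decide when to stop training.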