Using Tasks API

An rnaglib.Task object packages everything you need to train a model for a particular biological problem.

The key components of a Task you will use are:

  • A dedicated dataset for the task

  • Train/validation/test dataloaders

  • Model evaluator

When you instantiate the task, the task either calls the process() and split() methods to compute the necessary data or you have already run this before and the result was stored in the root directory and loading should be instantaneous.

Once loading is complete you only need to select a tensor representation (e.g. graph, voxel, point cloud) and encode the underlying dataset with it and then iterate through the train loader. Note that whenever you update the task’s dataset you should call set_loaders() so that changes in the dataset are reflected in the data served by the loaders:

from rnaglib.tasks import ChemicalModification
from rnaglib.transforms import GraphRepresentation

ta = ChemicalModification(root='cm')
ta.dataset.add_representation(GraphRepresentation(framework='pyg'))
ta.set_loaders()

for batch in ta.train_loader:
    pred = ta.dummy_model(batch['graph'])
    ...


 metrics = ta.evaluate(ta.dummy_model)

Once you have completed training you can pass your model to the task’s evaluate() method which will return a dictionary of metrics and performance values.

Note

Each task provides a dummy_model variable which you can use for testing out the task. It simply returns a random prediction of the appropriate shape.

Building Custom Tasks

If you would like to propose a new prediction task for the machine learning community. You just have to implement a few methos in a subclass of the``Task`` class.

An instance of the Task class packages the following attributes:

  • dataset: full collection of RNAs to use in the task.

  • splitter: method for partitioning the dataset into train, validation, and test subsets.

  • target_vars: method for setting and encoding input and target variables.

  • evaluate: method which accepts a model and returns performance metrics.

  • metadata: this is a simple (optional) dictionary that holds useful info about the task (e.g. task type, number of classes, etc.)

Once the task processing is complete, all task data is dumped into root which is a path passed to the task init method.

Here is a minimal template for a custom task:

from rnaglib.tasks import Task
from rnaglib.data_loading import RNADataset
from rnaglib.splitters import Splitter

class MyTask(Task):

    def __init__(self, root):
        super().__init__(root)

    def process(self) -> RNADataset:
        # build the task's dataset
        # ...
        pass

    @property
    def default_splitter() -> Splitter:
        # return a splitter object to build train/val/test splits
        # ...
        pass

    def get_task_vars() -> FeaturesComputer:
        # computes the task's default input and target variables
        # managed by creating a FeaturesComputer object
        # ...
        pass

    def init_metadata(self):
        return {'task_name': 'my task'}

In this tutorial we will walk through the steps to create a task with the aim of predicting for each residue, whether or not it will be chemically modified, and a more advanced example we will build the task of predicting the Rfam classification of an RNA.

Types of tasks

Tasks can operate at the residue, edge, and whole RNA level. Biolerplate for evaluation and loading would be affected depending on the choice of level. For that reason we create sub-classes of the Task clas which you can use to avoid re-coding such things.

Since chemical modifications are applied to residues, Let’s build a residue-level binary classification task.:

from rnaglib.tasks import ResidueClassificationTask

class ChemicalModification(ResidueClasificationTask):
    ....

1. Create the task’s dataset

Each task needs to define which RNAs to use. Typically this involves filtering a whole dataset of available RNAs by certain attributes to retain only the ones that contain certain annotations or pass certain criteria (e.g. size, origin, resolution, etc.).

You are free to do this in any way you like as long as after Task.process() is called, a list of .json graphs storing the RNA annotations the task needs is dumped into {root}/dataset.

To make things easier you can take advantage of the rnaglib.Tranforms library which provides funcionality for manipulating datasets of RNAs.

Let’s define a Task.process() method which builds a dataset with a single criterion:

  • Only keep RNAs that contain at least one chemically modified residue

The Transforms library provides a filter which checks that an RNA’s residues are of a desired value.

from rnaglib.data_loading import RNADataset
from rnaglib.tasks import ResidueClassificationTask
from rnaglib.transforms import ResidueAttributeFilter
from rnaglib.transforms import PDBIDNameTransform

class ChemicalModification(ResidueClasificationTask):
    def process(self) -> RNADataset:
        # grab a full set of available RNAs
        rnas = RNADataset()

        filter = ResidueAttributeFilter(attribute='is_modified',
                                        val_checker=lambda val: val == True
                                        )

        rnas = filter(rnas)

        rnas = PDBIDNameTransform()(rnas)
        dataset = RNADataset(rnas=[r["rna"] for r in rnas])
        return dataset

        pass

Applying the filter gives us a new list containing only the RNAs that passed the filter. The last thing we need to do is assign a name value to each RNA so that they can be properly managed by the RNADataset. We assign the PDBID as the name of each item in our dataset using the PDBIDNameTransform.

Now we just create a new RNADataset object using the reduced list. The dataset object requires a list and not a generator so we just unroll before passing it.

That’s it now you just return the new RNADataset object.

2. Set the task’s variables

Apart from the RNAs themselves, the task needs to know which variables are relevant. In particular we need to set the prediction target. Additionally we can set some default input features, which are always provided. The user can always add more input features if he/she desires by manipulating task.dataset.features_computer but at the minimum we need to define target variables.:

from rnaglib.data_loading import RNADataset
from rnaglib.tasks import ResidueClassificationTask
from rnaglib.transforms import ResidueAttributeFilter
from rnaglib.transforms import PDBIDNameTransform
from rnaglib.transforms import FeaturesComputer

class ChemicalModification(ResidueClasificationTask):
    def process(self) -> RNADataset:
        ...
        pass

    def get_task_vars(self) -> FeaturesComputer:
        return FeaturesComputer(nt_features=['nt_code'], nt_targets=['is_modified'])

Here we simply have a nucleotide level target so we pass the 'is_modified' attribute to the FeaturesComputer object. This will take care of selecting the residue when encoding the RNA into tensor form. In addition we provide the nucleotide identity ('nt_code') as a default input feature.

3. Train/val/test splits

The last necessary step is to define the train, validation and test subsets of the whole dataset. Once these are set, the task’s boilerplate will take care of generating the appropriate loaders.

To set the splits, you implement the default_splitter() method which returns a Splitter object. A Splitter object is simply a callable which accepts a dataset and returns three lists of indices representing the train, validation and test subsets.

You can select from the library of implemented splitters of implement your own.

For this example, we will split the RNAs by structural similarity using RNA-align.:

from rnaglib.data_loading import RNADataset
from rnaglib.tasks import ResidueClassificationTask

from rnaglib.transforms import ResidueAttributeFilter
from rnaglib.transforms import PDBIDNameTransform
from rnaglib.transforms import FeaturesComputer

from rnaglib.splitters import Splitter, RNAalignSplitter

class ChemicalModification(ResidueClasificationTask):
    def process(self) -> RNADataset:
        ...
        pass

    def get_task_vars(self) -> FeaturesComputer:
        return FeaturesComputer(nt_features=['nt_code'], nt_targets=['is_modified'])

    @property
    def default_splitter(self) -> Splitter
        return RNAalignSplitter(similarity_threshold=0.6)

Now our splits will guarantee a maximum structural similarity of 0.6 between them.

Check out the Splitter class for a quick guide on how to create your own splitters.

Note that this is only setting the default method to use for splitting the dataset. If a user wants to try a different splitter it can be pased to the task’s init.

That’s it! Your task is now fully defined and can be used in model training and evaluation.

Here is the ful task implementation:

from rnaglib.data_loading import RNADataset
from rnaglib.tasks import ResidueClassificationTask
from rnaglib.transforms import FeaturesComputer
from rnaglib.transforms import ResidueAttributeFilter
from rnaglib.transforms import PDBIDNameTransform
from rnaglib.splitters import Splitter, RNAalignSplitter


class ChemicalModification(ResidueClassificationTask):
    """Residue-level binary classification task to predict whether or not a given
    residue is chemically modified.
    """

    target_var = "is_modified"

    def __init__(self, root, splitter=None, **kwargs):
        super().__init__(root=root, splitter=splitter, **kwargs)

    def get_task_vars(self):
        return FeaturesComputer(nt_targets=self.target_var)

    def process(self):
        rnas = ResidueAttributeFilter(
            attribute=self.target_var, value_checker=lambda val: val == True
        )(RNADataset(debug=self.debug))
        rnas = PDBIDNameTransform()(rnas)
        dataset = RNADataset(rnas=[r["rna"] for r in rnas])
        return dataset

    def default_splitter(self) -> Splitter:
        return RNAalignSplitter(similarity_threshold=0.6)

Metadata

Each task holds a metadata attribute which is a simple dictionary holding useful information about the task (e.g. number of classes, task type, name, description). You can modify this during task setup and it is saved to disk once the task is built.

Task saving and loading

Once the task is completely built (dataset and splits), the task class automatically calls its write() method which dumps to the root directory all the information necessary to skip processing if the task is re-loaded.

Your root directory will look something like:

my_root/
    train_idx.txt
    val_idx.txt
    test_idx.txt
    task_id.txt
    metadata.json
    dataset/
        1abc.json
        2xzy.json
        ...

The task folder contains 3 .txt files with the indices for each split. The metadata.json file stores any additional information relevant to the task, the task_id.txt file holds a unique identifier for the task which is built by hashing all the RNAs and splits so that if anything about the task changes the ID will be different, and bfinally the dataset/ folder holds .json files which can be loaded into RNA dicts and used to re-instantiate the task.

Customize Splitting

We provide some pre-defined splitters for sequence and structure-based splitting. If you have other criteria for splitting you can subclass the Splitter class. All you have to do is implement the __call__() method which takes a dataset and returns three lists of indices:

class Splitter:
    def __init__(self, split_train=0.7, split_valid=0.15, split_test=0.15):
        assert sum([split_train, split_valid, split_test]) == 1, "Splits don't sum to 1."
        self.split_train = split_train
        self.split_valid = split_valid
        self.split_test = split_test
        pass

    def __call__(self, dataset):
        return None, None, None

The __call__(self, dataset) method returns three lists of indices from the given dataset object.

The splitter can be initiated with the desired proportions of the dataset for each subset.