rnaglib.dataset.RNADataset

class rnaglib.dataset.RNADataset(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=None, features_computer=None, representations=None, debug=False, get_pdbs=True, multigraph=False, transforms=None)

This class is the main object to hold RNA data, and is compatible with the PyTorch Dataset interface.

A key feature is a bidict that holds a mapping between RNA names and their index in the dataset. This allows for constant time access to an RNA with a given name.
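The idea of a two-way name/index mapping can be sketched with two plain dictionaries. This is only an illustration of the lookup pattern, not rnaglib's implementation: the class `NameIndexMap` and the names other than '4nlf' are hypothetical placeholders.

```python
class NameIndexMap:
    """Hypothetical two-way mapping between RNA names and dataset indices."""

    def __init__(self, names):
        names = list(names)
        if len(set(names)) != len(names):
            # A two-way mapping needs unique keys in both directions.
            raise ValueError("RNA names must be unique")
        # Forward direction: name -> index (average constant-time lookup).
        self.index_of = {name: i for i, name in enumerate(names)}
        # Reverse direction: index -> name.
        self.name_of = {i: name for name, i in self.index_of.items()}


# '1abc' and '2xyz' are made-up placeholder names.
mapping = NameIndexMap(["1abc", "4nlf", "2xyz"])
position = mapping.index_of["4nlf"]
name = mapping.name_of[2]
```

Both lookups are dictionary accesses, so finding an RNA by name does not require scanning the dataset.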

The RNAs contained in an RNADataset can either live in memory or be specified as files. In the latter case, an RNADataset can be seen as an ordered list of files in a given directory. A file-based dataset can be loaded into memory by calling to_memory().
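The difference between the two storage modes can be illustrated with a small stand-alone sketch. It assumes JSON files in a temporary folder and a hypothetical class `FileBackedDataset`; it mirrors the behaviour described above but is not how rnaglib stores or parses its graphs.

```python
import json
import tempfile
from pathlib import Path


class FileBackedDataset:
    """Hypothetical dataset: an ordered list of files, optionally held in memory."""

    def __init__(self, folder):
        # Sorting gives a stable order, so an index always maps to the same file.
        self.files = sorted(Path(folder).glob("*.json"))
        self.items = None  # filled in by to_memory()

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        if self.items is not None:
            return self.items[idx]  # in-memory access, no disk read
        return json.loads(self.files[idx].read_text())  # read on demand

    def to_memory(self):
        # Read every file once and keep the parsed objects.
        self.items = [json.loads(f.read_text()) for f in self.files]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a", "b"):
        (Path(tmp) / f"{name}.json").write_text(json.dumps({"name": name}))
    demo = FileBackedDataset(tmp)
    from_disk = demo[1]  # parsed from the file at access time
    demo.to_memory()

# The temporary files are gone now, but the in-memory copy is still usable.
from_memory = demo[1]
```

Reading on demand keeps memory use low for large datasets, while the in-memory mode avoids repeated disk reads during training.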

Once a dataset is built, you can save it, subset it and access its elements by name or by index.

Parameters:

rnas (Optional[list[Graph]]) – For use in memory, list of RNA objects represented as networkx graphs.
dataset_path (Union[str, PathLike, None]) – If using filenames, this is the path to the folder containing the graphs to load.
version – If using filenames and no dataset_path is provided, this is the version of the RNA dataset download that will be used and set as dataset_path.
redundancy – Same as version: sets the redundancy mode to use if neither rnas nor dataset_path is provided.
rna_id_subset (Optional[list[str]]) – List of graph filenames in dataset_path to keep, instead of using all available graphs.
recompute_mapping (bool) – When loading a dataset, set this to False to use an existing bidict_mapping instead of recomputing it (for instance if some graphs are irrelevant).
in_memory (Optional[bool]) – When loading a dataset from files, you can choose to load the data in memory by setting in_memory to True.
debug (bool) – If True, the dataset will only report 50 items.
get_pdbs (bool) – If True, will also fetch the corresponding structures.
multigraph (bool) – Whether to load RNAs as multi-graphs or simple graphs. Multigraphs can have backbone and base pairs between the same two residues.
transforms (Union[list[Transform], Transform, None]) – An optional list of transforms to apply to rnas before calling the features computer and the representations in get_item.
features_computer (Optional[FeaturesComputer]) – A FeaturesComputer object, useful to transform raw RNA data into tensors.
representations (Union[list[Representation], Representation, None]) – List of Representation objects to apply to each item.
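The order in which transforms, the features computer and the representations are applied can be sketched with plain callables. This is a toy model under stated assumptions: `get_item_sketch`, the helper functions and the toy RNA dictionary are hypothetical, items are plain dictionaries rather than networkx graphs, and the relative order of the features computer and the representations is assumed rather than taken from the library.

```python
def get_item_sketch(rna, transforms=(), features_computer=None, representations=None):
    """Hypothetical pipeline: transforms first, then features, then representations."""
    item = {"rna": rna}
    for transform in transforms:
        item = transform(item)  # each transform returns an updated item
    if features_computer is not None:
        item["features"] = features_computer(item)
    for name, represent in (representations or {}).items():
        item[name] = represent(item)  # one output entry per representation
    return item


def upper_case(item):
    # Toy transform: normalise the sequence without mutating the input RNA.
    rna = dict(item["rna"])
    rna["sequence"] = rna["sequence"].upper()
    return {**item, "rna": rna}


def sequence_length(item):
    # Toy features computer: a single number per RNA.
    return len(item["rna"]["sequence"])


def as_list(item):
    # Toy representation: the sequence as a list of characters.
    return list(item["rna"]["sequence"])


toy_rna = {"name": "toy", "sequence": "ggauc"}  # made-up data for illustration
item = get_item_sketch(
    toy_rna,
    transforms=[upper_case],
    features_computer=sequence_length,
    representations={"chars": as_list},
)
```

Because the transform runs first, both the feature and the representation see the upper-cased sequence.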

Examples:

Create a default dataset::

    >>> from rnaglib.dataset import RNADataset
    >>> dataset = RNADataset()

Access the first item in the dataset::

    >>> dataset[0]

Each item is a dictionary with the key 'rna' holding annotations as a networkx Graph::

    >>> dataset[0]['rna'].nodes()
    >>> dataset[0]['rna'].edges()

Access an RNA by its PDBID::

    >>> dataset.get_pdbid('4nlf')

Hint

Pass debug=True to RNADataset to quickly load a small dataset for testing.

__init__(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=None, features_computer=None, representations=None, debug=False, get_pdbs=True, multigraph=False, transforms=None)

Methods

`__init__`([rnas, dataset_path, version, ...])
`add_distance`(name, distance_mat)	Adds a distance matrix to the dataset.
`add_feature`(feature[, feature_level, is_input])	Add a feature to the dataset for model training.
`add_representation`(representations)	Add a representation object to dataset.
`check_consistency`()	Make sure all RNAs are actually present when in_memory is True.
`get_by_name`(rna_name)	Grab an RNA by its name.
`get_pdbid`(pdbid)	Grab an RNA by its pdbid.
`remove_distance`(name)	Removes a distance from the dataset.
`remove_representation`(names)	Removes specified representation.
`save`(dump_path, *[, recompute, verbose])	Save a local copy of the dataset.
`save_distances`([dump_path])	Saves distances to distance path.
`subset`([list_of_ids, list_of_names])	Create another dataset with only the specified graphs.
`to_memory`()	Make in_memory=True from a dataset not in memory.

Attributes

distances

Using a cached property is useful for loading precomputed data.
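The cached-property pattern can be sketched with the standard library's `functools.cached_property`. Whether rnaglib uses this exact decorator is an assumption; the class `DistanceStore`, the loader and the toy matrix are hypothetical.

```python
from functools import cached_property


class DistanceStore:
    """Hypothetical holder that loads precomputed distances on first access only."""

    def __init__(self, loader):
        self._loader = loader
        self.load_calls = 0  # counts how often the expensive load actually runs

    @cached_property
    def distances(self):
        # Runs once; the result is then stored on the instance and reused.
        self.load_calls += 1
        return self._loader()


# The loader stands in for reading a precomputed file; the matrix is toy data.
store = DistanceStore(lambda: {"toy_metric": [[0.0, 1.0], [1.0, 0.0]]})
first = store.distances
second = store.distances  # served from the cache, loader is not called again
```

The first access pays the loading cost, and later accesses return the same object without touching the loader.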