rnaglib.data_loading.RNADataset

class rnaglib.data_loading.RNADataset(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=True, features_computer=None, representations=None, debug=False, get_pdbs=True, overwrite=False, multigraph=False, pre_transforms=None, transforms=None, **kwargs)

This class is the main object to hold the core RNA data annotations.

The RNAglibDataset.all_rnas object is a list of networkx objects that holds all the annotations for each RNA in the dataset.

Parameters:
  • rnas (Optional[list[Graph]]) – A list of RNA graphs to instantiate the dataset from directly, instead of loading them from dataset_path.

  • dataset_path (Union[str, PathLike, None]) – The path to the folder containing the graphs.

  • rna_id_subset (Optional[list[str]]) – A list of graph filenames to keep from the directory given by 'dataset_path', instead of using all available graphs.

  • multigraph (bool) – Whether to load RNAs as multi-graphs or simple graphs. Multigraphs can have backbone and base pairs between the same two residues.

  • in_memory (bool) – Whether to load all RNA graphs in memory or to load them on the fly

  • features_computer (Optional[FeaturesComputer]) – A FeaturesComputer object, useful to transform raw RNA data into tensors.

  • representations (Union[list[Representation], Representation, None]) – List of Representation objects to apply to each item.
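The interplay of dataset_path, rna_id_subset and in_memory can be illustrated with a small stand-in. The sketch below is not rnaglib code: the TinyGraphDataset class, the JSON file layout and the file names are all hypothetical, chosen only to show the difference between reading every graph up front and reading each graph when it is indexed.

```python
import json
import tempfile
from pathlib import Path


class TinyGraphDataset:
    """Hypothetical stand-in for a folder-backed dataset (not rnaglib code)."""

    def __init__(self, dataset_path, id_subset=None, in_memory=True):
        self.dataset_path = Path(dataset_path)
        # One graph per JSON file; the file stem acts as the item name.
        names = sorted(p.stem for p in self.dataset_path.glob("*.json"))
        if id_subset is not None:
            keep = set(id_subset)
            names = [n for n in names if n in keep]
        self.names = names
        self.in_memory = in_memory
        # Eager mode reads every file once; lazy mode keeps no cache.
        self._cache = {n: self._load(n) for n in names} if in_memory else None

    def _load(self, name):
        with open(self.dataset_path / f"{name}.json") as handle:
            return json.load(handle)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        graph = self._cache[name] if self.in_memory else self._load(name)
        # Mirror the documented item layout: a dict with an 'rna' key.
        return {"rna": graph}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder names, written as minimal JSON records.
        for name in ("1abc", "2xyz", "4nlf"):
            Path(tmp, f"{name}.json").write_text(json.dumps({"name": name}))
        eager = TinyGraphDataset(tmp, in_memory=True)
        lazy = TinyGraphDataset(tmp, id_subset=["4nlf", "1abc"], in_memory=False)
        return len(eager), len(lazy), lazy[1]["rna"]["name"]
```

Eager loading trades memory for faster repeated access, while lazy loading keeps memory flat at the cost of a file read per item; which one suits a given dataset depends on its size.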

The dataset holds an attribute self.all_rnas = bidict({rna_name: i for i, rna_name in enumerate(all_rna_names)}), where rna_name is expected to match the name of the file the RNA is saved in.
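To make the name-to-index mapping concrete, here is a minimal sketch that emulates a bidirectional mapping with two plain dictionaries. It does not use the bidict package or rnaglib itself, and the RNA names are placeholders; it only assumes the comprehension shown above.

```python
# Placeholder names standing in for the files found in the dataset folder.
all_rna_names = ["1abc", "2xyz", "4nlf"]

# Forward direction, built exactly like the comprehension in the docs.
name_to_index = {rna_name: i for i, rna_name in enumerate(all_rna_names)}

# Reverse direction; a bidirectional mapping keeps both views in sync.
index_to_name = {i: rna_name for rna_name, i in name_to_index.items()}


def lookup(key):
    """Return the name for an integer index, or the index for a name."""
    if isinstance(key, int):
        return index_to_name[key]
    return name_to_index[key]
```

With such a mapping, an item can be resolved either by its position in the dataset or by its name without scanning the list of names.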

Examples:

Create a default dataset::

    >>> from rnaglib.data_loading import RNADataset
    >>> dataset = RNADataset()

Access the first item in the dataset::

    >>> dataset[0]

Each item is a dictionary with the key 'rna' holding annotations as a networkx Graph::

    >>> dataset[0]['rna'].nodes()
    >>> dataset[0]['rna'].edges()

Access an RNA by its PDBID::

    >>> dataset.get_pdbid('4nlf')

Hint

Pass debug=True to RNADataset to quickly load a small dataset for testing.

__init__(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=True, features_computer=None, representations=None, debug=False, get_pdbs=True, overwrite=False, multigraph=False, pre_transforms=None, transforms=None, **kwargs)

Methods

__init__([rnas, dataset_path, version, ...])

add_distance(name, distance_mat)

Adds a distance matrix to the dataset.

add_feature(feature[, feature_level, is_input])

Add a feature to the dataset for model training.

add_representation(representations)

Add a representation object to dataset.

check_consistency()

Make sure all RNAs are actually present when in_memory is True.

from_database([representations, ...])

Run the steps to build a dataset from scratch.

get_by_name(rna_name)

Grab an RNA by its name.

get_pdbid(pdbid)

Grab an RNA by its pdbid.

remove_distance(name)

Removes a distance from the dataset.

remove_representation(names)

Removes specified representation.

save(dump_path, *[, recompute, verbose])

Save a local copy of the dataset.

save_distances()

Saves distances to distance path.

subset([list_of_ids, list_of_names])

Create another dataset with only the specified graphs.

to_memory()

Load all RNAs into memory for a dataset that was not created in memory (sets in_memory=True).

Attributes

distances

Using a cached property is useful for loading precomputed data.