rnaglib.data_loading.RNADataset
- class rnaglib.data_loading.RNADataset(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=True, features_computer=None, representations=None, debug=False, get_pdbs=True, overwrite=False, multigraph=False, pre_transforms=None, transforms=None, **kwargs)
This class is the main object to hold the core RNA data annotations. The RNAglibDataset.all_rnas object is a list of networkx objects that holds all the annotations for each RNA in the dataset.
- Parameters:
  - rnas (Optional[list[Graph]]) – Instantiate the dataset directly from a list of RNA files.
  - dataset_path (Union[str, PathLike, None]) – The path to the folder containing the graphs.
  - rna_id_subset (Optional[list[str]]) – A list of graph file names to keep from the 'dataset_path' directory, instead of using all available graphs.
  - multigraph (bool) – Whether to load RNAs as multigraphs or simple graphs. Multigraphs can have both a backbone edge and a base pair edge between the same two residues.
  - in_memory (bool) – Whether to load all RNA graphs in memory or to load them on the fly.
  - features_computer (Optional[FeaturesComputer]) – A FeaturesComputer object, useful to transform raw RNA data into tensors.
  - representations (Union[list[Representation], Representation, None]) – List of Representation objects to apply to each item.
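To illustrate the trade-off behind in_memory, here is a minimal sketch using only the Python standard library. LazyJsonDataset is a hypothetical stand-in and not rnaglib code: it assumes one JSON file per item and either parses every file at construction time or reads a file each time an item is accessed.

```python
import json
import tempfile
from pathlib import Path


class LazyJsonDataset:
    """Hypothetical stand-in for a dataset with an in_memory switch."""

    def __init__(self, dataset_path, in_memory=True):
        self.paths = sorted(Path(dataset_path).glob("*.json"))
        self.in_memory = in_memory
        # Eager mode: parse every file once, up front.
        self.items = [self._load(p) for p in self.paths] if in_memory else None

    @staticmethod
    def _load(path):
        return json.loads(path.read_text())

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if self.in_memory:
            return self.items[idx]
        # Lazy mode: read from disk on every access.
        return self._load(self.paths[idx])


def demo():
    # Write two toy items to a temporary folder and read them back both ways.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("1abc", "2xyz"):
            Path(tmp, f"{name}.json").write_text(json.dumps({"name": name}))
        eager = LazyJsonDataset(tmp, in_memory=True)
        lazy = LazyJsonDataset(tmp, in_memory=False)
        return eager[0], lazy[1], len(lazy)
```

Eager loading costs memory and start-up time but makes repeated access cheap; lazy loading keeps the footprint small at the price of a disk read per item.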
The dataset holds an attribute self.all_rnas = bidict({rna_name: i for i, rna_name in enumerate(all_rna_names)}), where rna_name is expected to match the name of the file the RNA is saved in.
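The name-to-index mapping described above can be sketched without the third-party bidict package. The snippet below is an illustration only: it uses two plain dictionaries to mimic the two-way lookup, and the RNA names are made up.

```python
# Made-up RNA names, standing in for the file names found on disk.
all_rna_names = ["1abc", "2xyz", "3def"]

# Forward mapping, built with the same expression as in the description.
name_to_index = {rna_name: i for i, rna_name in enumerate(all_rna_names)}
# Inverse mapping; a bidict would expose this as `.inverse`.
index_to_name = {i: rna_name for rna_name, i in name_to_index.items()}

print(name_to_index["2xyz"])
print(index_to_name[2])
```

A two-way mapping of this kind lets a dataset resolve an item either by position or by name with a single dictionary lookup.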
Examples:

Create a default dataset::

    >>> from rnaglib.data_loading import RNADataset
    >>> dataset = RNADataset()

Access the first item in the dataset::

    >>> dataset[0]

Each item is a dictionary with the key 'rna' holding annotations as a networkx Graph::

    >>> dataset[0]['rna'].nodes()
    >>> dataset[0]['rna'].edges()

Access an RNA by its PDBID::

    >>> dataset.get_pdbid('4nlf')
Hint
Pass debug=True to RNADataset to quickly load a small dataset for testing.
- __init__(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=True, features_computer=None, representations=None, debug=False, get_pdbs=True, overwrite=False, multigraph=False, pre_transforms=None, transforms=None, **kwargs)
Methods
__init__([rnas, dataset_path, version, ...])
add_distance(name, distance_mat) – Adds a distance matrix to the dataset.
add_feature(feature[, feature_level, is_input]) – Add a feature to the dataset for model training.
add_representation(representations) – Add a representation object to the dataset.
check_consistency() – Make sure all RNAs are actually present when in_memory is True.
from_database([representations, ...]) – Run the steps to build a dataset from scratch.
get_by_name(rna_name) – Grab an RNA by its pdbid.
get_pdbid(pdbid) – Grab an RNA by its pdbid.
remove_distance(name) – Removes a distance from the dataset.
remove_representation(names) – Removes the specified representation.
save(dump_path, *[, recompute, verbose]) – Save a local copy of the dataset.
save_distances() – Saves distances to the distance path.
subset([list_of_ids, list_of_names]) – Create another dataset with only the specified graphs.
to_memory() – Make in_memory=True from a dataset not in memory.
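The subset entry above accepts either indices or names. As a rough sketch of that selection logic, the helper below works on a plain list of names. subset_names is hypothetical: it only mirrors the two keyword arguments listed for subset, and its behaviour (including rejecting both arguments at once) is an assumption, not rnaglib's implementation.

```python
def subset_names(all_names, list_of_ids=None, list_of_names=None):
    """Return the names kept by an index-based or name-based selection.

    Hypothetical helper for illustration; not part of rnaglib.
    """
    if list_of_ids is not None and list_of_names is not None:
        # Assumed policy for this sketch: the two selectors are exclusive.
        raise ValueError("pass either list_of_ids or list_of_names, not both")
    if list_of_ids is not None:
        return [all_names[i] for i in list_of_ids]
    if list_of_names is not None:
        wanted = set(list_of_names)
        # Preserve the original dataset order.
        return [name for name in all_names if name in wanted]
    return list(all_names)


names = ["1abc", "2xyz", "3def"]
print(subset_names(names, list_of_ids=[0, 2]))
print(subset_names(names, list_of_names=["3def", "1abc"]))
```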
Attributes
distances – Using a cached property is useful for loading precomputed data.
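As a minimal sketch of why a cached property suits precomputed data, the example below uses functools.cached_property from the standard library. DistanceHolder and its toy loader are hypothetical, and whether rnaglib uses this exact decorator is an assumption.

```python
from functools import cached_property


class DistanceHolder:
    """Hypothetical class for illustration; not part of rnaglib."""

    def __init__(self):
        self.load_count = 0

    @cached_property
    def distances(self):
        # Stand-in for an expensive read of precomputed matrices from disk.
        self.load_count += 1
        return {"toy_distance": [[0.0, 1.0], [1.0, 0.0]]}


holder = DistanceHolder()
first = holder.distances   # triggers the load
second = holder.distances  # served from the cache, no second load
print(holder.load_count)
```

The loader runs only on first access, so objects that never touch distances pay nothing, and those that do pay once.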