rnaglib.dataset.RNADataset
- class rnaglib.dataset.RNADataset(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=None, features_computer=None, representations=None, debug=False, get_pdbs=True, multigraph=False, transforms=None)
This class is the main object to hold RNA data, and is compatible with the PyTorch Dataset interface.
A key feature is a bidict that holds a mapping between RNA names and their index in the dataset. This allows for constant time access to an RNA with a given name.
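The name-to-index mapping described above can be pictured with two plain dictionaries. This is only a minimal sketch of the idea, not the library's own bidict: the class `NameIndexMap` and the names other than `4nlf` are hypothetical placeholders.

```python
class NameIndexMap:
    """Two-way mapping between RNA names and integer positions (illustrative only)."""

    def __init__(self, names):
        names = list(names)
        # Forward direction: name -> index, a constant-time dictionary lookup.
        self.name_to_index = {name: i for i, name in enumerate(names)}
        if len(self.name_to_index) != len(names):
            raise ValueError("RNA names must be unique")
        # Reverse direction: index -> name.
        self.index_to_name = {i: name for name, i in self.name_to_index.items()}


# Placeholder names; only '4nlf' appears in the examples of this page.
mapping = NameIndexMap(["1abc", "2xyz", "4nlf"])
idx = mapping.name_to_index["4nlf"]
name = mapping.index_to_name[0]
```

Both lookups are dictionary accesses, which is why fetching an RNA by name does not require scanning the dataset.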
The RNAs contained in an RNADataset can either live in memory or be specified as files. In the latter case, an RNADataset can be seen as an ordered list of files in a given directory. A file-based dataset can be put in memory by calling to_memory().
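To make the difference between the two modes concrete, here is a toy sketch that uses only the standard library. It assumes JSON files stand in for the graph files; `FileBackedList` is a hypothetical class written for illustration and does not reflect how RNADataset is implemented.

```python
import json
import tempfile
from pathlib import Path


class FileBackedList:
    """Toy dataset: an ordered list of files, optionally cached in memory."""

    def __init__(self, directory):
        # Sort so that the index of each file is deterministic.
        self.paths = sorted(Path(directory).glob("*.json"))
        self.items = None  # filled only once to_memory() is called

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        if self.items is not None:
            return self.items[index]
        # Not in memory: read the file on every access.
        return json.loads(self.paths[index].read_text())

    def to_memory(self):
        # Read every file once and keep the parsed content.
        self.items = [json.loads(p.read_text()) for p in self.paths]


with tempfile.TemporaryDirectory() as tmp:
    for name in ("b", "a"):
        (Path(tmp) / f"{name}.json").write_text(json.dumps({"name": name}))
    toy = FileBackedList(tmp)
    first_from_disk = toy[0]
    toy.to_memory()

# The temporary files are deleted here, but the in-memory copy still works.
first_from_memory = toy[0]
```

The trade-off in this sketch is the usual one: the file-backed mode keeps memory use low and pays a read on each access, while the in-memory mode pays the full loading cost once.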
Once a dataset is built, you can save it, subset it and access its elements by name or by index.
- Parameters:
rnas (Optional[list[Graph]]) – For use in memory, list of RNA objects represented as networkx graphs.
dataset_path (Union[str, PathLike, None]) – If using filenames, this is the path to the folder containing the graphs to load.
version – If using filenames and no dataset_path is provided, this is the version of the RNA dataset download that will be used and set as dataset_path.
redundancy – Same as version; sets the redundancy mode to use if neither rnas nor dataset_path is provided.
rna_id_subset (Optional[list[str]]) – List of graph filenames to keep from dataset_path, instead of using all available files.
recompute_mapping (bool) – When loading a dataset, you can choose to use an existing bidict_mapping (for instance if some graphs are irrelevant).
in_memory (Optional[bool]) – When loading a dataset from files, you can choose to load the data in memory by setting in_memory to True.
debug (bool) – If True, will only report 50 items.
get_pdbs (bool) – If True, will also fetch the corresponding structures.
multigraph (bool) – Whether to load RNAs as multigraphs or simple graphs. Multigraphs can have both a backbone edge and a base pair edge between the same two residues.
transforms (Union[list[Transform], Transform, None]) – An optional list of transforms to apply to RNAs before calling the features computer and the representations in get_item.
features_computer (Optional[FeaturesComputer]) – A FeaturesComputer object, useful to transform raw RNA data into tensors.
representations (Union[list[Representation], Representation, None]) – List of Representation objects to apply to each item.
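The order in which transforms, the features computer and the representations are applied can be sketched with plain functions. Everything below is hypothetical: `get_item`, `uppercase`, `one_hot` and `as_sequence` are stand-ins, and running the features computer before the representations is an assumption. The parameter descriptions only state that transforms run before both.

```python
def get_item(rna, transforms=(), features_computer=None, representations=()):
    """Hypothetical per-item pipeline: transforms, then features, then representations."""
    item = {"rna": rna}
    for transform in transforms:
        item = transform(item)
    # Assumed order: features are computed on the transformed item.
    features = features_computer(item) if features_computer is not None else None
    for name, represent in representations:
        item[name] = represent(item, features)
    return item


def uppercase(item):
    # Toy transform: normalise the sequence to upper case.
    rna = {**item["rna"], "sequence": item["rna"]["sequence"].upper()}
    return {**item, "rna": rna}


def one_hot(item):
    # Toy features computer: one-hot encode each nucleotide over "ACGU".
    alphabet = "ACGU"
    return [[int(base == letter) for letter in alphabet]
            for base in item["rna"]["sequence"]]


def as_sequence(item, features):
    # Toy representation: pair the sequence with its per-residue features.
    return {"sequence": item["rna"]["sequence"], "features": features}


result = get_item(
    {"sequence": "gau"},
    transforms=[uppercase],
    features_computer=one_hot,
    representations=[("seq", as_sequence)],
)
```

Because the transform runs first in this sketch, the lower-case input is already upper case by the time `one_hot` sees it.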
Examples:

Create a default dataset::

    >>> from rnaglib.dataset import RNADataset
    >>> dataset = RNADataset()

Access the first item in the dataset::

    >>> dataset[0]

Each item is a dictionary with the key 'rna' holding annotations as a networkx Graph::

    >>> dataset[0]['rna'].nodes()
    >>> dataset[0]['rna'].edges()

Access an RNA by its PDBID::

    >>> dataset.get_pdbid('4nlf')
Hint
Pass debug=True to RNADataset to quickly load a small dataset for testing.
- __init__(rnas=None, dataset_path=None, version='2.0.2', redundancy='nr', rna_id_subset=None, recompute_mapping=True, in_memory=None, features_computer=None, representations=None, debug=False, get_pdbs=True, multigraph=False, transforms=None)
Methods
__init__([rnas, dataset_path, version, ...])
add_distance(name, distance_mat) – Adds a distance matrix to the dataset.
add_feature(feature[, feature_level, is_input]) – Add a feature to the dataset for model training.
add_representation(representations) – Add a representation object to the dataset.
check_consistency() – Make sure all RNAs are actually present when in_memory is True.
get_by_name(rna_name) – Grab an RNA by its name.
get_pdbid(pdbid) – Grab an RNA by its pdbid.
remove_distance(name) – Removes a distance from the dataset.
remove_representation(names) – Removes the specified representation.
save(dump_path, *[, recompute, verbose]) – Save a local copy of the dataset.
save_distances([dump_path]) – Saves distances to the distance path.
subset([list_of_ids, list_of_names]) – Create another dataset with only the specified graphs.
to_memory() – Make in_memory=True from a dataset not in memory.
Attributes
distances – Using a cached property is useful for loading precomputed data.
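As a reminder of what a cached property does, the sketch below uses `functools.cached_property` from the standard library. `DistanceHolder` and the `placeholder_metric` key are hypothetical, and the sketch does not show how RNADataset actually loads its distances.

```python
from functools import cached_property


class DistanceHolder:
    """Illustrative holder that loads its distances lazily, and only once."""

    def __init__(self):
        self.load_count = 0

    @cached_property
    def distances(self):
        # Stand-in for an expensive read of precomputed matrices from disk.
        self.load_count += 1
        return {"placeholder_metric": [[0.0, 1.0], [1.0, 0.0]]}


holder = DistanceHolder()
first = holder.distances   # triggers the load
second = holder.distances  # served from the cache, no second load
```

The first access runs the loading code and stores the result on the instance; later accesses return the stored object directly.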