Building RNA Databases with RNAglib¶
This module (prepare_data) contains all the code needed to build databases of annotated RNA 3D structures. Users interact with it through the rnaglib_prepare_data command-line script. Dataset creation proceeds in the following steps:
- Fetch the raw RNA structures from either:
  - the RCSB PDB Databank (accepts the --nr flag to use only structures in the [BGSU Representative Set](https://www.bgsu.edu/research/rna/databases/non-redundant-list.html))
  - a local user-defined folder
- For each structure, run x3dna-dssr
- Store the x3dna-dssr output in a networkx Graph object
- If the --annotate flag is passed for pre-training:
  - Chop the whole RNAs into smaller chunks
  - Pre-compute local neighbourhoods
  - Extract all graphlets
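To make the "local neighbourhoods" step more concrete, here is a minimal sketch of a k-hop neighbourhood computed by breadth-first search. It assumes that a neighbourhood means all nodes within k edges of a given nucleotide, which may not match RNAglib's exact definition. The function name k_hop_neighbourhood, the plain-dict adjacency (a stand-in for the networkx Graph) and the toy node names are all hypothetical.

```python
from collections import deque


def k_hop_neighbourhood(adjacency, source, k):
    """Return the set of nodes within k edges of `source` (source included).

    `adjacency` maps each node to an iterable of its neighbours; it is a
    plain-dict stand-in for a graph object, used here only for illustration.
    """
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond the requested radius
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


# Made-up toy data: a 5-nucleotide chain whose two ends are also paired.
toy_graph = {
    "A.1": ["A.2", "A.5"],
    "A.2": ["A.1", "A.3"],
    "A.3": ["A.2", "A.4"],
    "A.4": ["A.3", "A.5"],
    "A.5": ["A.4", "A.1"],
}

# Pre-compute the 1-hop neighbourhood of every node once, up front.
neighbourhoods = {node: k_hop_neighbourhood(toy_graph, node, 1) for node in toy_graph}
```

Computing these sets once at build time, rather than on every access, is the usual motivation for a pre-computation step like this one.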
Quickstart¶
Print the help message:
$ rnaglib_prepare_data -h
To run a quick debug build with default values:
$ rnaglib_prepare_data -s structures/ --tag first_build -o builds/ -d
Data versioning¶
The optional --tag argument names the folder that contains the final output. For our distributions we use the pattern rnaglib-<'all' or 'nr'><'-annotated' or ''><'-chopped' or ''>-<version>, depending on the build options. We distribute two sets of builds: one with all available RNAs, tagged all, and one restricted to non-redundant structures according to the [BGSU Representative Set](https://www.bgsu.edu/research/rna/databases/non-redundant-list.html). For each of the two, we also provide a version pre-processed for the [graphlet kernel](https://rnaglib.cs.mcgill.ca/static/docs/html/rnaglib.kernels.html) computations used to compute node similarity, marked by the annot value in the tag.
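The naming pattern can be illustrated with a small helper. This is only a sketch of the convention described above: the function build_tag and its parameters are hypothetical and not part of RNAglib, and the version string in the example is a placeholder.

```python
def build_tag(nr, annotated, chopped, version):
    """Assemble a tag following the pattern
    rnaglib-<'all' or 'nr'><'-annotated' or ''><'-chopped' or ''>-<version>.

    Hypothetical helper for illustration; not an RNAglib function.
    """
    subset = "nr" if nr else "all"
    annotated_part = "-annotated" if annotated else ""
    chopped_part = "-chopped" if chopped else ""
    return f"rnaglib-{subset}{annotated_part}{chopped_part}-{version}"


# "1.0.0" is a placeholder version, not an actual release number.
example_tag = build_tag(nr=True, annotated=True, chopped=False, version="1.0.0")
print(example_tag)  # rnaglib-nr-annotated-1.0.0
```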
Output¶
After the --debug test run above, your ./builds/ folder will contain a single sub-folder, ./builds/graphs, holding 10 .json files and the file ./builds/graphs/errors.csv. The JSON files contain the annotated RNAs, and the CSV lists the RNAs that failed to build along with the reason for each failure.
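A quick way to check a build is to count the JSON files and read the error log. The sketch below uses only the Python standard library. The helpers summarize_build and load_graph_json are hypothetical names, and since the column layout of errors.csv is not specified here, the rows are returned as raw lists (the first row may be a header).

```python
import csv
import json
from pathlib import Path


def summarize_build(graphs_dir):
    """Return (sorted list of .json paths, list of rows from errors.csv).

    Hypothetical helper; the errors.csv columns are not assumed, so each
    row is returned as a plain list of strings.
    """
    graphs_dir = Path(graphs_dir)
    graph_files = sorted(graphs_dir.glob("*.json"))
    errors_path = graphs_dir / "errors.csv"
    failures = []
    if errors_path.exists():
        with errors_path.open(newline="") as handle:
            # Skip blank lines; keep every other row untouched.
            failures = [row for row in csv.reader(handle) if row]
    return graph_files, failures


def load_graph_json(path):
    """Load one annotated-RNA file as plain Python data (dicts and lists)."""
    with Path(path).open() as handle:
        return json.load(handle)
```

For the debug build above, calling summarize_build("builds/graphs") would be expected to return the 10 JSON paths together with whatever rows errors.csv contains.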
Data building options¶
--nr only outputs RNAs from the BGSU non-redundant set
--chop creates a sub-folder in the build called chops, which contains chunked RNAs for even batch sizes
--annot builds the annotations needed to compute node similarities