Data Reference
==============

Graph Format
------------

Each graph contains structure information for one model of a PDB entry containing at least one RNA chain.
Graphs are stored in `JSON` node-link format, which can be loaded by [networkx](https://networkx.org/documentation/stable/reference/readwrite/generated/networkx.readwrite.json_graph.node_link_data.html#networkx.readwrite.json_graph.node_link_data).
All data comes from the output of `x3dna-dssr`, which can be downloaded [here](https://x3dna.org/), and from our custom interface extraction tools.

.. code-block:: python

    import json
    from networkx.readwrite.json_graph import node_link_graph

    # node_link_graph expects the parsed JSON dictionary, not a file handle
    with open('path/to/graph', 'r') as f:
        G = node_link_graph(json.load(f))

Nodes
-----

Node IDs
~~~~~~~~

Node IDs are strings in the form `[pdb id].[chain name].[residue number]`.

Node Data
~~~~~~~~~

To access the data dictionary of a node:

.. code-block:: python

    # node_id is a node ID string in the form described above
    G.nodes[node_id]

These are the keys in the node data dictionary:

* `'index'`: (int) relative index along the chain, starting at 1 (e.g. `1`)
* `'index_chain'`: (int) (e.g. `26`)
* `'chain_name'`: (str) name of the chain (e.g. `'A'`)
* `'nt_resnum'`: (int) residue number according to the PDB (e.g. `101`)
* `'nt_name'`: (str) nucleotide name (e.g. `'G'`)
* `'nt_code'`: (str) (e.g. `'U'`)
* `'nt_id'`: (str) unique nucleotide ID generated by DSSR, in the form chain name, `.`, then nucleotide name and residue number (e.g. `'A.U42'`)
* `'nt_type'`: (str) molecule type of the residue (e.g. `'RNA'`)
* `'dbn'`: (str) dot-bracket notation for the residue (e.g. `')'`)
* `'summary'`: (str) additional residue info (e.g. `"anti,~C3'-endo,BI,canonical,non-pair-contact,helix,stem,coaxial-stack"`)
* `'alpha'`: (float) backbone torsion angle in degrees, in `[-180, 180]`
* `'beta'`: (float) backbone torsion angle
* `'gamma'`: (float) backbone torsion angle
* `'delta'`: (float) backbone torsion angle
* `'epsilon'`: (float)
* `'zeta'`: (float)
* `'epsilon_zeta'`: (float)
* `'bb_type'`: (str) backbone type (e.g. `'BI'`)
* `'chi'`: (float)
* `'glyco_bond'`: (str) (e.g. `'anti'`)
* `'C5prime_xyz'`: (list) 5' carbon xyz coordinates (e.g. `[-1.343, 8.453, 1.288]`)
* `'P_xyz'`: (list) phosphate xyz coordinates
* `'form'`: (str) (e.g. `'A'`) classification of a dinucleotide step comprising the bp above the given designation and the bp that follows it. Types include `'A'`, `'B'` or `'Z'` for the common A-, B- and Z-form helices, `'.'` for an unclassified step, and `'x'` for a step without a continuous backbone.
* `'ssZp'`: (float) (e.g. `4.41`)
* `'Dp'`: (float) (e.g. `4.404`)
* `'splay_angle'`: (float) (e.g. `21.6`)
* `'splay_distance'`: (float) (e.g. `3.612`)
* `'splay_ratio'`: (float) (e.g. `0.199`)
* `'eta'`: (float) (e.g. `169.652`)
* `'theta'`: (float) (e.g. `-167.457`)
* `'eta_prime'`: (float) (e.g. `-176.189`)
* `'theta_prime'`: (float) (e.g. `-167.27`)
* `'eta_base'`: (float) (e.g. `-135.681`)
* `'theta_base'`: (float) (e.g. `-141.003`)
* `'v0'`: (float) (e.g. `8.194`)
* `'v1'`: (float) (e.g. `-28.393`)
* `'v2'`: (float)
* `'v3'`: (float)
* `'v4'`: (float)
* `'amplitude'`: (float)
* `'phase_angle'`: (float)
* `'puckering'`: (str) (e.g. `"C3'-endo"`)
* `'sugar_class'`: (str) (e.g. `"~C3'-endo"`)
* `'bin'`: (str) (e.g. `'33t'`) name of one of the 12 bins based on [delta(i-1), delta, gamma], where delta(i-1) and delta can be either 3 (for C3'-endo sugar) or 2 (for C2'-endo) and gamma can be p/t/m (for gauche+/trans/gauche- conformations, respectively), giving 2x2x3 = 12 combinations (`33p`, `33t`, ... `22m`); `'inc'` refers to incomplete cases (i.e., with missing torsions), and `'trig'` to triages (i.e., with torsion angle outliers) \[1\]
* `'cluster'`: (str) (e.g. `'1c'`) 2-character suite name for one of 53 reported clusters (46 certain and 7 wannabes), `'__'` for incomplete cases, and `'!!'` for outliers \[1\]
* `'suiteness'`: (float) measure of conformer-match quality (low to high, in the range 0 to 1) \[1\]
* `'filter_rmsd'`: (float)
* `'frame'`: (dict) (e.g. `{'rmsd': 0.006, 'origin': [-4.856, 8.564, -1.171], 'x_axis': [0.922, 0.386, -0.006], 'y_axis': [0.098, -0.25, -0.963], 'z_axis': [-0.374, 0.888, -0.269], 'quaternion': [0.592, -0.781, -0.155, 0.122]}`)
* `'sse'`: (dict) secondary structure info (e.g. for a residue inside the third hairpin: `{'sse': 'hairpin_3'}`)
* `'binding_protein'`: (dict) RNA-protein interface. `None` if no interface is found, otherwise a dictionary (e.g. `{'nt-aa': 'C-arg', 'nt': 'A.C37', 'aa': 'A.ARG47', 'Tdst': '6.62', 'Rdst': '-114.00', 'Tx': '-1.15', 'Ty': '1.89', 'Tz': '6.23', 'Rx': '-53.57', 'Ry': '19.41', 'Rz': '-103.42', 'sse': 'a-helix'}`)
* `'binding_ion'`: (str) molecule ID of the ion if the residue is at a binding site, otherwise `None` (e.g. `'Ca'`)
* `'binding_small-molecule'`: (str) molecule ID of the small molecule if the residue is at a binding site, otherwise `None` (e.g. `'SAM'`)

Edge data
---------

Each edge also has an attribute dictionary:

.. code-block:: python

    # node_id_1 and node_id_2 are the IDs of the two nodes joined by the edge
    G.edges[(node_id_1, node_id_2)]

These are the keys in the edge data dictionary:

* `'index'`: (int) index of the edge in DSSR ordering
* `'nt1'`: (str) DSSR nucleotide ID of the first base (e.g. `'A.G17'`)
* `'nt2'`: (str) DSSR nucleotide ID of the second base (e.g. `'A.G29'`)
* `'bp'`: (str) nucleotide identity of the paired residues (e.g. `'G-C'`)
* `'name'`: (str) (e.g. `'WC'`)
* `'Saenger'`: (str) Saenger base pairing category (e.g. `'19-XIX'`)
* `'LW'`: (str) Leontis-Westhof base pair geometry category (e.g. `'cWW'`)
* `'DSSR'`: (str) custom DSSR base pair geometry category (e.g. `'cW-W'`)

Graph-level data
----------------

Each graph also has an attribute dictionary:

.. code-block:: python

    G.graph

These are the keys in the graph data dictionary:

* `'dbn'`: a dict containing information on the chains contained in the graph, such as their sequences or lengths
* `'resolution_{low,high}'`: bounds on the resolution, present in most (~80%) of the graphs
* `'proteins'`: a list of the residues in contact with a protein
* `'ligands'`: a list of the ligands interacting with the graph nucleotides. Each ligand is a dict with the ligand Biopython ID, its name, and the RNA residues it is bound to
* `'ions'`: same structure as `'ligands'`, for ions

Graph creation pipeline
-----------------------

* `dssr_to_graphs.py`: runs DSSR on the CIF file to build the initial networkx graph object. It also computes the RNA/protein interfaces (since this step relies on DSSR) and annotates them at the node level.
* `annotations.py`: completes the graph using the mmCIF file to include the resolution and the interactions with small molecules and ions.
* `main.py`: the script to call to build or update the data releases.

References
----------

\[1\] [Richardson et al. (2008): "RNA backbone: consensus all-angle conformers and modular string nomenclature (an RNA Ontology Consortium contribution)". RNA, 14(3):465-481](https://rnajournal.cshlp.org/content/14/3/465.short)
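
Worked example
--------------

The node, edge, and graph-level dictionaries described in this reference can be exercised on a toy graph. The following is a minimal sketch, not taken from the actual data release: the PDB ID `1abc`, the residue numbers, and all attribute values are made up for illustration, and the graph is assumed to be undirected. It round-trips the graph through the node-link JSON format and then queries a few of the documented keys.

```python
import json

import networkx as nx
from networkx.readwrite.json_graph import node_link_data, node_link_graph

# Build a toy graph by hand. Every value below is illustrative only;
# real graphs carry many more keys per node and edge.
toy = nx.Graph()
toy.graph['resolution_high'] = 2.1
toy.add_node('1abc.A.17', chain_name='A', nt_resnum=17, nt_code='G',
             binding_ion=None)
toy.add_node('1abc.A.29', chain_name='A', nt_resnum=29, nt_code='C',
             binding_ion='Ca')
toy.add_edge('1abc.A.17', '1abc.A.29', nt1='A.G17', nt2='A.C29',
             bp='G-C', LW='cWW')

# Round-trip through node-link JSON, as a stand-in for reading a file.
serialized = json.dumps(node_link_data(toy))
G = node_link_graph(json.loads(serialized))

# Node IDs follow [pdb id].[chain name].[residue number].
node_id = '1abc.A.17'
pdb_id, chain, resnum = node_id.split('.')

# Residues annotated as ion binding sites (binding_ion is None otherwise).
ion_sites = [n for n, d in G.nodes(data=True)
             if d.get('binding_ion') is not None]

# Edges whose Leontis-Westhof category is cis Watson-Crick/Watson-Crick.
canonical = [(u, v) for u, v, d in G.edges(data=True)
             if d.get('LW') == 'cWW']

print(G.nodes[node_id]['nt_code'], ion_sites, canonical)
```

Using `d.get(...)` rather than `d[...]` keeps the filters tolerant of nodes or edges where a key is absent, which may matter for attributes that are not present in every graph.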