rnaglib.splitters.CDHitSplitter

class rnaglib.splitters.CDHitSplitter(similarity_threshold=0.3, n_jobs=-1, seed=0, *args, **kwargs)

Splits a dataset based on sequence similarity using CD-HIT. NOTE: The cd-hit executable must be on your PATH.
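Since the splitter relies on an external binary, it can help to confirm that the executable is discoverable before constructing the splitter. The following is a minimal sketch using only the Python standard library; the helper name find_cd_hit is hypothetical and is not part of rnaglib.

```python
import shutil


def find_cd_hit(executable="cd-hit"):
    """Return the full path to the executable, or None if it is not on PATH.

    Hypothetical helper for illustration; not part of rnaglib.
    """
    return shutil.which(executable)


path = find_cd_hit()
if path is None:
    print("cd-hit was not found on PATH; install it or extend PATH first.")
else:
    print(f"cd-hit found at {path}")
```

If the lookup returns None, adding the directory that contains the cd-hit binary to PATH should resolve the problem.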

__init__(similarity_threshold=0.3, n_jobs=-1, seed=0, *args, **kwargs)

Methods

__init__([similarity_threshold, n_jobs, seed])

cluster_split(dataset, frac[, n])

Fast cluster-based splitting adapted from ProteinShake (https://github.com/BorgwardtLab/proteinshake_release/blob/main/structure_split.py).
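To illustrate the general idea of cluster-based splitting, the sketch below assigns whole clusters to one side of a split until roughly the requested fraction of items has been selected. It is a simplified illustration under stated assumptions, not the rnaglib or ProteinShake implementation: the function name cluster_aware_split, the list-of-lists input format, and the greedy selection rule are all hypothetical.

```python
import random


def cluster_aware_split(clusters, frac, seed=0):
    """Select whole clusters until about `frac` of all items are chosen.

    clusters: list of lists of item indices (assumed input format).
    Returns (selected, remaining) as sorted lists of indices.
    """
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)  # visit clusters in a seeded random order

    total = sum(len(c) for c in clusters)
    target = frac * total

    selected, remaining = [], []
    for i in order:
        # A cluster is never divided, so similar items stay on the same side.
        if len(selected) + len(clusters[i]) <= target:
            selected.extend(clusters[i])
        else:
            remaining.extend(clusters[i])
    return sorted(selected), sorted(remaining)


clusters = [[0, 1, 2], [3], [4, 5], [6, 7, 8, 9]]
test_idx, train_idx = cluster_aware_split(clusters, frac=0.3)
print("test:", test_idx)
print("train:", train_idx)
```

Because clusters are kept intact, the selected fraction can fall short of frac; the benefit is that no cluster contributes items to both sides of the split.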

compute_similarity_matrix(dataset)

Computes sequence similarity between all pairs of RNAs.
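As a toy illustration of what an all-pairs similarity matrix looks like, the sketch below scores every pair of sequences with difflib.SequenceMatcher from the Python standard library. This is a stand-in technique chosen for self-containment; it is an assumption made for illustration and does not reflect how rnaglib or CD-HIT computes similarity. The function name pairwise_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def pairwise_similarity(sequences):
    """Return a symmetric n x n matrix of similarity scores in [0, 1].

    Uses difflib's ratio as a stand-in score (an assumption for this sketch).
    """
    n = len(sequences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # every sequence is identical to itself
        for j in range(i + 1, n):
            score = SequenceMatcher(None, sequences[i], sequences[j]).ratio()
            # Score each unordered pair once and mirror it to keep symmetry.
            sim[i][j] = sim[j][i] = score
    return sim


seqs = ["GGGAAACCC", "GGGAAACCU", "AUAUAUAUA"]
matrix = pairwise_similarity(seqs)
for row in matrix:
    print([round(value, 2) for value in row])
```

Entry (i, j) holds the score between sequence i and sequence j, so a threshold such as similarity_threshold can be applied to the matrix to decide which pairs count as similar.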