rnaglib.dataset_transforms.ClusterSplitter

class rnaglib.dataset_transforms.ClusterSplitter(similarity_threshold=0.5, n_jobs=-1, seed=0, balanced=True, distance_name='USalign', verbose=False, *args, **kwargs)

Abstract class for splitting by clustering with a similarity function.

Parameters:

similarity_threshold (float) – similarity threshold (with similarity defined as 1 - distance) above which two RNAs are assigned to the same cluster (default 0.5)
n_jobs (int) – number of parallel jobs; if set to -1, use the maximum number of cores (default -1)
seed (int) – seed for shuffling (default 0)
balanced (bool) – whether to use balanced clusters (default True)
distance_name (str) – name of the distance metric used for clustering; it must already have been computed for this dataset (see DistanceComputer if it has not) (default “USalign”)
verbose (bool) – whether to display messages (default False)
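
To make the role of similarity_threshold concrete, here is a minimal sketch of threshold-based clustering on a precomputed distance matrix. It is an illustration only, not rnaglib's implementation: the function name `cluster_by_similarity`, the use of connected components, and the strict `>` comparison at the threshold are all assumptions made for this example.

```python
def cluster_by_similarity(distances, similarity_threshold=0.5):
    """Group items linked by a similarity (1 - distance) above the threshold.

    Hypothetical helper for illustration; not part of rnaglib.
    `distances` is a symmetric square matrix given as a list of lists.
    """
    n = len(distances)
    parent = list(range(n))  # union-find forest, one root per cluster

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            similarity = 1.0 - distances[i][j]
            # Assumption: "above" is read as strictly greater than.
            if similarity > similarity_threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy distance matrix for four RNAs (values are made up).
distances = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.4],
    [0.8, 0.9, 0.4, 0.0],
]
clusters = cluster_by_similarity(distances, similarity_threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

Raising the threshold produces more, smaller clusters; lowering it merges them.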

__init__(similarity_threshold=0.5, n_jobs=-1, seed=0, balanced=True, distance_name='USalign', verbose=False, *args, **kwargs)

Methods

`__init__`([similarity_threshold, n_jobs, ...])
`balancer`(clusters, label_counts, dataset, fracs)	Split clusters into train, val, test, taking label balance into account.
`cluster_split`(dataset, frac[, n, split])	Fast cluster-based splitting adapted from ProteinShake.
`forward`(dataset)	Split dataset into train, validation, and test sets using clustering.
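
The general idea of a cluster-based split is that whole clusters, not individual items, are assigned to train, validation, and test, so that similar RNAs never end up on both sides of a split. The sketch below shows one simple way to do this. It is not the code behind `cluster_split` or `forward`, and it does no label balancing: the function name `split_clusters`, the default fractions, and the greedy fill strategy are assumptions made for this example.

```python
import random


def split_clusters(clusters, fracs=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to train/val/test, roughly matching `fracs`.

    Hypothetical helper for illustration; not part of rnaglib.
    `clusters` is a list of lists of item indices.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    order = list(clusters)     # shallow copy; the input list is not reordered
    rng.shuffle(order)

    total = sum(len(cluster) for cluster in order)
    targets = [frac * total for frac in fracs]

    splits = [[], [], []]  # train, val, test
    current = 0
    for cluster in order:
        # Move on to the next split once the current one has reached its target.
        while current < 2 and len(splits[current]) >= targets[current]:
            current += 1
        splits[current].extend(cluster)
    return splits


example_clusters = [[0, 1, 2], [3, 4], [5], [6, 7, 8, 9], [10, 11]]
train, val, test = split_clusters(example_clusters, seed=0)
print(len(train), len(val), len(test))
```

Because clusters are indivisible, the resulting sizes only approximate the requested fractions, and with few or very uneven clusters a split can end up much smaller than requested.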