rnaglib.dataset_transforms.ClusterSplitter

class rnaglib.dataset_transforms.ClusterSplitter(similarity_threshold=0.5, n_jobs=-1, seed=0, balanced=True, distance_name='USalign', verbose=False, *args, **kwargs)[source]

Abstract class for splitting by clustering with a similarity function.

Parameters:
  • similarity_threshold (float) – similarity threshold above which two RNAs are assigned to the same cluster, where similarity is defined as 1 - distance (default 0.5)

  • n_jobs (int) – number of parallel jobs; if set to -1, all available cores are used (default -1)

  • seed (int) – seed for shuffling (default 0)

  • balanced (bool) – whether to use balanced clusters (default True)

  • distance_name (str) – name of the distance metric used for clustering; it must already have been computed for this dataset (see DistanceComputer if it has not) (default “USalign”)

  • verbose (bool) – whether to display messages (default False)
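
To illustrate what similarity_threshold means, here is a minimal sketch in plain Python. It converts a precomputed distance matrix to similarities (1 - distance) and groups items linked by a similarity above the threshold. The function name cluster_by_similarity and the single-linkage (connected components) grouping are assumptions made for illustration; this page does not state which clustering procedure the class uses internally.

```python
def cluster_by_similarity(distances, similarity_threshold=0.5):
    """Group item indices whose pairwise similarity (1 - distance) exceeds the threshold.

    Hypothetical helper for illustration only; not part of rnaglib.
    """
    n = len(distances)
    parent = list(range(n))  # union-find forest, one root per cluster

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            similarity = 1.0 - distances[i][j]
            if similarity > similarity_threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy symmetric distance matrix for four RNAs (made-up values).
distances = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.4],
    [0.8, 0.9, 0.4, 0.0],
]
clusters = cluster_by_similarity(distances, similarity_threshold=0.5)
```

With the default threshold of 0.5, items 0 and 1 (similarity 0.8) end up together, as do items 2 and 3 (similarity 0.6). Raising the threshold yields more, smaller clusters.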

__init__(similarity_threshold=0.5, n_jobs=-1, seed=0, balanced=True, distance_name='USalign', verbose=False, *args, **kwargs)[source]

Methods

__init__([similarity_threshold, n_jobs, ...])

balancer(clusters, label_counts, dataset, fracs)

Splits clusters into train, val, and test sets, taking label balance into account.

cluster_split(dataset, frac[, n, split])

Fast cluster-based splitting adapted from ProteinShake.

forward(dataset)
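
The idea behind cluster-based splitting is that whole clusters, not individual RNAs, are assigned to train, val, and test, so that similar structures never straddle two splits. The sketch below shows one simple way to do this: shuffle the clusters with a seed, then place each cluster in the split that is furthest below its target size. The function split_clusters and its greedy strategy are hypothetical; this is neither the rnaglib implementation nor the ProteinShake algorithm, and it ignores label balancing.

```python
import random


def split_clusters(clusters, fracs=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to splits, greedily filling each toward its target size.

    Hypothetical helper for illustration only; not part of rnaglib.
    """
    order = [list(c) for c in clusters]
    random.Random(seed).shuffle(order)  # seeded shuffle for reproducibility

    total = sum(len(c) for c in order)
    targets = [frac * total for frac in fracs]
    splits = [[] for _ in fracs]

    for cluster in order:
        # Choose the split with the largest remaining deficit.
        deficits = [target - len(split) for target, split in zip(targets, splits)]
        best = deficits.index(max(deficits))
        splits[best].extend(cluster)  # the cluster is never divided
    return splits


# Toy clusters of item indices (made-up values).
toy_clusters = [[0, 1], [2, 3], [4], [5, 6, 7], [8], [9]]
train, val, test = split_clusters(toy_clusters, fracs=(0.7, 0.15, 0.15), seed=0)
```

Because clusters are kept intact, the resulting split sizes only approximate the requested fractions; the approximation improves when clusters are small relative to the dataset.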