rnaglib.dataset_transforms.ClusterSplitter
- class rnaglib.dataset_transforms.ClusterSplitter(similarity_threshold=0.5, n_jobs=-1, seed=0, balanced=True, distance_name='USalign', verbose=False, *args, **kwargs)
Abstract class for splitting by clustering with a similarity function.
- Parameters:
similarity_threshold (float) – similarity threshold (with similarity defined as 1 - distance) above which two RNAs are placed in the same cluster (default 0.5)
n_jobs (int) – number of parallel jobs; if set to -1, use the maximum number of cores (default -1)
seed (int) – seed for shuffling (default 0)
balanced (bool) – whether to use balanced clusters (default True)
distance_name (str) – name of the distance metric to use to perform clustering (must have been computed for this dataset, see DistanceComputer if it hasn’t) (default “USalign”)
verbose (bool) – whether to display messages (default False)
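To make the role of similarity_threshold concrete, here is a minimal sketch of threshold-based grouping on a toy distance table. It is not rnaglib's implementation: the helper threshold_clusters, the RNA names, and the distance values are all hypothetical, and grouping by connected components (single linkage) is an assumption chosen for simplicity.

```python
from itertools import combinations


def threshold_clusters(names, distance, similarity_threshold=0.5):
    """Group items whose pairwise similarity (1 - distance) exceeds the threshold.

    Hypothetical helper for illustration only. Items are merged transitively
    (connected components), which is an assumption, not rnaglib's documented rule.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        similarity = 1.0 - distance[frozenset((a, b))]
        if similarity > similarity_threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy, made-up distances in [0, 1]; keys are unordered pairs.
names = ["rna_a", "rna_b", "rna_c", "rna_d"]
distance = {
    frozenset(("rna_a", "rna_b")): 0.2,
    frozenset(("rna_a", "rna_c")): 0.9,
    frozenset(("rna_a", "rna_d")): 0.8,
    frozenset(("rna_b", "rna_c")): 0.7,
    frozenset(("rna_b", "rna_d")): 0.9,
    frozenset(("rna_c", "rna_d")): 0.4,
}
clusters = threshold_clusters(names, distance, similarity_threshold=0.5)
print(clusters)
```

With these toy values, only the pairs (rna_a, rna_b) and (rna_c, rna_d) have a similarity above 0.5, so they form the two clusters.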
- __init__(similarity_threshold=0.5, n_jobs=-1, seed=0, balanced=True, distance_name='USalign', verbose=False, *args, **kwargs)
Methods
__init__([similarity_threshold, n_jobs, ...])
balancer(clusters, label_counts, dataset, fracs) – Splits clusters into train, val, and test sets, taking label balance into account.
cluster_split(dataset, frac[, n, split]) – Fast cluster-based splitting adapted from ProteinShake.
forward(dataset)
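The general idea behind a cluster-based split is that whole clusters, never individual items, are assigned to train, val, or test, so that similar RNAs do not leak across splits. The sketch below illustrates that idea only. The function split_clusters, its greedy fill rule, and the fractions are assumptions made for this example and do not reproduce the cluster_split or balancer methods (in particular, label balance is ignored here).

```python
import random


def split_clusters(clusters, fracs=(0.7, 0.15, 0.15), seed=0):
    """Assign whole clusters to splits, approximating the requested fractions.

    Hypothetical helper for illustration only; not rnaglib's algorithm.
    """
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    shuffled = [list(cluster) for cluster in clusters]
    rng.shuffle(shuffled)

    total = sum(len(cluster) for cluster in shuffled)
    targets = [frac * total for frac in fracs]
    splits = [[] for _ in fracs]

    for cluster in shuffled:
        # Greedy rule: give the cluster to the split furthest below its target size.
        deficits = [targets[i] - len(splits[i]) for i in range(len(fracs))]
        best = max(range(len(fracs)), key=lambda i: deficits[i])
        splits[best].extend(cluster)
    return splits


# Toy clusters of made-up item names.
toy_clusters = [["a", "b", "c"], ["d", "e"], ["f"], ["g", "h"], ["i"], ["j"]]
train, val, test = split_clusters(toy_clusters, fracs=(0.7, 0.15, 0.15), seed=0)
print(len(train), len(val), len(test))
```

Because clusters are indivisible, the resulting split sizes only approximate the requested fractions, and the approximation gets coarser as clusters get larger.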