rnaglib.utils.cdhit_wrapper
- rnaglib.utils.cdhit_wrapper(ids, sequences, sim_thresh=0.6, n_jobs=1)
Cluster sequences using CD-HIT. Adapted from ProteinShake.
Choice of word size (the CD-HIT -n option) by similarity threshold:
-n 5 for thresholds 0.7 ~ 1.0
-n 4 for thresholds 0.6 ~ 0.7
-n 3 for thresholds 0.5 ~ 0.6
-n 2 for thresholds 0.4 ~ 0.5
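The mapping from threshold to word size can be sketched as a small helper. This is only an illustration of the ranges listed above, not the code used inside cdhit_wrapper: the name choose_word_size is hypothetical, and the handling of boundary values (a threshold of exactly 0.7, 0.6 or 0.5 goes to the larger word size) is an assumption, since the listed ranges overlap at their endpoints.

```python
def choose_word_size(sim_thresh: float) -> int:
    """Return a CD-HIT word size (-n) for a given similarity threshold.

    Hypothetical helper: boundary values are assigned to the larger
    word size, which is an assumption about how the ranges are meant.
    """
    if not 0.4 <= sim_thresh <= 1.0:
        raise ValueError("sim_thresh must lie between 0.4 and 1.0")
    if sim_thresh >= 0.7:
        return 5
    if sim_thresh >= 0.6:
        return 4
    if sim_thresh >= 0.5:
        return 3
    return 2
```

Under this reading, the default sim_thresh=0.6 corresponds to a word size of 4.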
- Parameters:
sequences (list) – List of protein sequences to cluster.
- Returns:
representatives (list) – List of sequence indices to preserve as representatives.
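As a rough sketch of how the returned indices might be used, the snippet below keeps only the representative entries. It assumes the indices refer to positions in the ids and sequences lists passed in; the data and the representatives value are made-up stand-ins, and cdhit_wrapper itself is not called here because it needs CD-HIT installed.

```python
# Made-up example inputs (hypothetical identifiers and sequences).
ids = ["rna_a", "rna_b", "rna_c", "rna_d"]
sequences = ["GGGAAACCC", "GGGAAACCU", "AUAUAUAUAU", "CCCGGGAAA"]

# Stand-in for the list returned by cdhit_wrapper(ids, sequences);
# assumed to hold positions into the input lists.
representatives = [0, 2, 3]

# Keep only the entries selected as cluster representatives.
kept_ids = [ids[i] for i in representatives]
kept_sequences = [sequences[i] for i in representatives]
```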