rnaglib.utils.cdhit_wrapper

rnaglib.utils.cdhit_wrapper(ids, sequences, sim_thresh=0.6, n_jobs=1)

Cluster sequences using CD-HIT. Adapted from ProteinShake.

Choice of word size:
  • -n 5 for thresholds 0.7 ~ 1.0
  • -n 4 for thresholds 0.6 ~ 0.7
  • -n 3 for thresholds 0.5 ~ 0.6
  • -n 2 for thresholds 0.4 ~ 0.5
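The word-size rule above can be sketched as a small lookup. This is only an illustration: the helper name `word_size_for_threshold` is hypothetical and not part of rnaglib, and assigning a boundary value such as 0.7 to the larger word size is an assumption, since the listed ranges share their endpoints.

```python
def word_size_for_threshold(sim_thresh: float) -> int:
    """Return the CD-HIT word size (-n) for a similarity threshold.

    Hypothetical helper mirroring the ranges listed above. Boundary values
    (0.5, 0.6, 0.7) are assigned to the larger word size by assumption.
    """
    if not 0.4 <= sim_thresh <= 1.0:
        raise ValueError("sim_thresh must lie in [0.4, 1.0]")
    if sim_thresh >= 0.7:
        return 5
    if sim_thresh >= 0.6:
        return 4
    if sim_thresh >= 0.5:
        return 3
    return 2


# The default sim_thresh=0.6 from the signature maps to -n 4 under this rule.
default_word_size = word_size_for_threshold(0.6)
```

Thresholds outside the listed ranges raise an error in this sketch, because the text gives no word size for them.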

Parameters:
  • sequences (list) – List of sequences to cluster.

Returns:
  • representatives (list) – List of sequence indices to preserve as representatives.
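Because the return value is described as a list of indices, a caller would typically map it back onto the input lists. The following is a minimal sketch under that assumption: `select_representatives` and the example data are hypothetical, and the index list is hard-coded rather than produced by an actual CD-HIT run.

```python
def select_representatives(ids, sequences, representatives):
    """Keep only the ids and sequences at the given representative indices.

    Hypothetical helper; assumes `representatives` holds integer positions
    into `ids` and `sequences`, as the Returns description suggests.
    """
    kept_ids = [ids[i] for i in representatives]
    kept_sequences = [sequences[i] for i in representatives]
    return kept_ids, kept_sequences


# Made-up example data, for illustration only.
ids = ["rna_a", "rna_b", "rna_c", "rna_d"]
sequences = ["GGCAUUCG", "GGCAUUCC", "AAUGGCUA", "AAUGGCUU"]

# In real use this list would come from something like
#   cdhit_wrapper(ids, sequences, sim_thresh=0.6, n_jobs=1)
# Here it is a stand-in value so the sketch runs without CD-HIT installed.
representatives = [0, 2]

kept_ids, kept_sequences = select_representatives(ids, sequences, representatives)
```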