sRNAtoolbox

Remove Duplicates from a FASTA File and Manipulate Names:

This tool can:

Detect and remove duplicated IDs
Detect and remove duplicated sequences
Detect and remove duplicated sequences & generate a new ID by pasting the sequence IDs that have the same sequence
Manipulate the sequence names (remove a given string)

Consider the following input:

>seqA
ATCACTA
>seqA
CACTG
>seqB
ATCACTA

1) Detect and remove duplicated IDs:

Only the first entry for each ID is used; all later entries with the same ID are ignored (not written to the output).
After removing duplicated IDs, the output for the above example would be:
>seqA
ATCACTA
>seqB
ATCACTA
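
The "first ID wins" rule can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation; the function names `parse_fasta` and `remove_duplicated_ids` are hypothetical, and the sketch assumes the full header line after `>` is treated as the ID.

```python
def parse_fasta(text):
    """Return a list of (id, sequence) tuples from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the whole header after '>' is the ID.
            records.append([line[1:], ""])
        elif records:
            # Sequence lines are concatenated onto the current record.
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def remove_duplicated_ids(records):
    """Keep only the first record seen for each ID, preserving input order."""
    seen = set()
    unique = []
    for name, seq in records:
        if name not in seen:
            seen.add(name)
            unique.append((name, seq))
    return unique


example = """>seqA
ATCACTA
>seqA
CACTG
>seqB
ATCACTA
"""

for name, seq in remove_duplicated_ids(parse_fasta(example)):
    print(f">{name}")
    print(seq)
```

The second `seqA` record is dropped because its ID has already been seen; records with a new ID pass through unchanged.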

2) Detect and remove duplicated sequences:

If this checkbox is activated, entries that share a sequence are collapsed so that each sequence is written only once. The result for the above example would be:

>seqA
ATCACTA
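
A minimal Python sketch of sequence-level deduplication is shown below. The function name `remove_duplicated_sequences` is hypothetical, and the sketch assumes that sequences are compared as exact strings and that the first entry carrying each sequence is the one kept. The illustrative input is two records that share one sequence.

```python
def remove_duplicated_sequences(records):
    """Keep only the first record seen for each sequence.

    `records` is a list of (id, sequence) tuples.
    Assumption: sequences are compared as exact, case-sensitive strings.
    """
    seen = set()
    unique = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique


# Illustrative input: two entries with the same sequence.
records = [("seqA", "ATCACTA"), ("seqB", "ATCACTA")]

for name, seq in remove_duplicated_sequences(records):
    print(f">{name}")
    print(seq)
```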

3) Detect and remove duplicated sequences and generate a new ID by pasting the IDs of the entries that have the same sequence (activating this checkbox launches the tool in mode RDG):

If the "paste names" checkbox is activated, the output would be:

>seqA=seqB
ATCACTA
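
The "paste names" behaviour can be sketched in Python by grouping IDs by sequence. The function name `paste_names` is hypothetical; the `=` separator is taken from the example output above, and keeping the IDs in input order is an assumption.

```python
def paste_names(records, separator="="):
    """Collapse records sharing a sequence and join their IDs.

    `records` is a list of (id, sequence) tuples. The separator '=' follows
    the example output; ordering IDs by first appearance is an assumption.
    """
    names_by_seq = {}
    for name, seq in records:
        # Dicts preserve insertion order, so sequences keep their input order.
        names_by_seq.setdefault(seq, []).append(name)
    return [(separator.join(names), seq) for seq, names in names_by_seq.items()]


# Illustrative input: two entries with the same sequence.
records = [("seqA", "ATCACTA"), ("seqB", "ATCACTA")]

for name, seq in paste_names(records):
    print(f">{name}")
    print(seq)
```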

4) Manipulate the sequence names (eliminate a certain string):

The UCSC Table Browser can be used to obtain 3' UTR sequences, which are needed when searching for microRNA target genes. However, the output files have the following format:

>hg19_refGene_NM_001184906 range=chr17:37408897-37417712 5'pad=0 3'pad=0 strand=- repeatMasking=none
CAATGGAGGTGGTCAACCTTGGCGAACTGAGTATTTAATGACACTTCTAG
AGCTACCGTGGAGTCTCTCCAGTGGAAGCAACCCCAGTGTTCTGAGCAAG
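
Removing a string from the sequence names can be sketched in Python as a plain substring replacement on each header. The function name `strip_from_name` is hypothetical, and the string `hg19_refGene_` is chosen here only for illustration; the tool may match or trim names differently.

```python
def strip_from_name(name, unwanted):
    """Return the header with every occurrence of `unwanted` removed."""
    return name.replace(unwanted, "")


header = ("hg19_refGene_NM_001184906 range=chr17:37408897-37417712 "
          "5'pad=0 3'pad=0 strand=- repeatMasking=none")

# Hypothetical choice of string to eliminate.
cleaned = strip_from_name(header, "hg19_refGene_")
print(f">{cleaned}")
```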
