Data

We will work with FASTA and GenBank, the most common text formats for DNA, RNA and protein sequences in bioinformatics, and we will also review other well-known formats.

Introduction

Clone the project we have prepared, which has everything needed to try out the proposed examples and activities.

shell
git clone https://gitlab.com/xtec/bio/genfiles.git

FASTA Format

FASTA is probably the most widely used file format for sequences and one of the most common file formats in bioinformatics.

The FASTA file format has its origins in the FASTA software package, used for sequence alignment.

The format is simply a plain text file with one or more entries. Each entry consists of a definition line, or defline, which starts with a > symbol followed by a unique identifier, and then one or more lines of sequence data.
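To make the structure concrete, here is a minimal Python sketch of a reader that follows the definition above. The function name parse_fasta and the sample records are hypothetical, chosen only for illustration; dedicated libraries handle many more edge cases than this does.

```python
def parse_fasta(text):
    """Return a list of (defline, sequence) tuples from FASTA-formatted text."""
    records = []
    defline = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new defline closes the previous entry, if there was one.
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline = line[1:].strip()
            chunks = []
        elif defline is not None:
            # Sequence data may span several lines; collect them all.
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


# Hypothetical sample: the first entry spreads its sequence over two lines.
example = """>seq1 hypothetical example
ACGT
ACGT
>seq2
GGCC
"""
for defline, sequence in parse_fasta(example):
    print(defline, len(sequence))
```

The sequence lines of each entry are joined into a single string, so how the file wraps its lines does not change the result.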

Creating a FASTA text file is very easy, whether in a plain text editor like Notepad or in VSCode.

We can create a unifasta file (a single sequence) called uniseq.fasta with the VSCode editor: just copy the text below and save it.

txt
>Seqüència aminoàcids de prova
MTHCP*MTI*

Or create a multifasta file (more than one sequence) called sequences.fa from the Linux terminal:

shell
echo ">a
ACGCGTACGTGACGACGATCG
>b
ATTTCGCGACTCTGCCTACGCTAC
>c
GGGAAACCTTTTTTT" > sequences.fa

Bingo! We now have a multifasta file :) with sequences a, b and c.
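The same file could also be produced from Python. The sketch below is one possible approach, not the project's own code: the helper names format_fasta and write_fasta are hypothetical, and each sequence is written on a single line, just as in the echo example.

```python
from pathlib import Path


def format_fasta(records):
    """Build FASTA text from (defline, sequence) pairs."""
    lines = []
    for defline, sequence in records:
        lines.append(f">{defline}")  # the defline starts with '>'
        lines.append(sequence)       # the sequence goes on the next line
    return "\n".join(lines) + "\n"


def write_fasta(records, path):
    """Save the records as a plain text FASTA file at the given path."""
    Path(path).write_text(format_fasta(records), encoding="utf-8")


records = [
    ("a", "ACGCGTACGTGACGACGATCG"),
    ("b", "ATTTCGCGACTCTGCCTACGCTAC"),
    ("c", "GGGAAACCTTTTTTT"),
]
fasta_text = format_fasta(records)
print(fasta_text, end="")
```

Calling write_fasta(records, "sequences.fa") would save the text under the same name used in the shell example.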

The fundamental requirement is that the file be plain text so that it can be handled with any text processing application or programming language.

Therefore, these files are best handled in text editors like nano, Sublime Text or VSCode.

To view a FASTA file from the command line without editing it, you can use the cat command.

shell
cat uniseq.fasta
>Seqüència aminoàcids de prova
MTHCP*MTI*
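With a multifasta file, cat prints every sequence in full, so a quick overview of the deflines alone can be handy. The following is a small Python sketch of that idea; the function name fasta_deflines is hypothetical, and an in-memory io.StringIO stands in for an open file so the example runs without anything on disk.

```python
import io


def fasta_deflines(lines):
    """Return the deflines (without the leading '>') from an iterable of lines."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


# Stand-in for an open file; the content mirrors sequences.fa created earlier.
handle = io.StringIO(
    ">a\nACGCGTACGTGACGACGATCG\n"
    ">b\nATTTCGCGACTCTGCCTACGCTAC\n"
    ">c\nGGGAAACCTTTTTTT\n"
)
deflines = fasta_deflines(handle)
print(len(deflines), "sequence(s):", deflines)
```

To inspect a real file, pass an open file object instead, for example inside a block such as with open("sequences.fa", encoding="utf-8") as handle.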
