Biopython Sequence I/O Operations Tutorial: Read and Write FASTA, GenBank Files in Python

Introduction

In bioinformatics, biological data is stored in specialized file formats such as FASTA, GenBank, EMBL, and others. Efficient reading and writing of these files is essential for sequence analysis, genome research, and computational biology.

Biopython provides the SeqIO module, which makes sequence input/output (I/O) simple and powerful. With SeqIO, you can read, parse, write, and manipulate biological sequence files using just a few lines of Python code.

In this tutorial, you will learn how to perform sequence I/O operations using Biopython.


What is Sequence I/O in Biopython?

Sequence I/O refers to:

  • Reading biological sequence files
  • Writing sequences to files
  • Parsing multiple sequences
  • Converting between file formats
  • Handling sequence metadata

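SeqIO supports many formats, each selected by a lowercase format name (shown in parentheses):

  • FASTA ("fasta")
  • GenBank ("genbank", or the alias "gb")
  • EMBL ("embl")
  • Swiss-Prot ("swiss")
  • Clustal ("clustal"), an alignment format
  • Phylip ("phylip"), an alignment format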

Importing SeqIO Module

To start working with sequence files, import SeqIO:

from Bio import SeqIO

This module is the core tool for sequence file handling.


Understanding FASTA Format

FASTA is one of the most widely used sequence file formats. It stores only an identifier, an optional description, and the sequence itself.

Example FASTA file:

>Seq1
ATGCGATACGTT
>Seq2
ATGCCGTAGCTA

Each record has:

  • A header line starting with >, followed by the sequence ID and an optional description
  • One or more lines of sequence data below it

Reading a FASTA File

To read a FASTA file:

from Bio import SeqIO

for record in SeqIO.parse("sample.fasta", "fasta"):
    print(record.id)
    print(record.seq)

Output Example

Seq1
ATGCGATACGTT
Seq2
ATGCCGTAGCTA

Each record contains:

  • record.id → sequence ID
  • record.seq → biological sequence
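The same loop can be tried without any file on disk. The following is a minimal sketch, assuming the two-record FASTA text shown earlier; the variable names fasta_text and parsed are illustrative only. It relies on the fact that SeqIO.parse() accepts an open text handle as well as a filename.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA text standing in for sample.fasta
fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

# SeqIO.parse yields one SeqRecord per FASTA entry
parsed = [
    (record.id, str(record.seq), len(record.seq))
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
]

for seq_id, sequence, length in parsed:
    print(seq_id, sequence, length)
```

Converting record.seq with str() gives a plain Python string, which is convenient for comparisons and printing.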

Reading All Sequences as a List

records = list(SeqIO.parse("sample.fasta", "fasta"))

print(len(records))

Output:

2

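This is useful for batch processing of small files, because a list supports indexing and len(). It loads every record into memory at once, so it is a poor fit for very large files.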
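When records need to be looked up by ID rather than by position, a dictionary may be more convenient than a list. This sketch uses SeqIO.to_dict() on the same illustrative in-memory FASTA text; like list(), it holds every record in memory, so it is intended for small inputs.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA text standing in for sample.fasta
fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

# Build a dict keyed by record.id; duplicate IDs raise a ValueError
records_by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(sorted(records_by_id))
print(records_by_id["Seq2"].seq)
```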


Reading a Single Sequence

If your file contains exactly one sequence, use SeqIO.read(). It raises a ValueError when the file holds no records or more than one:

record = SeqIO.read("single.fasta", "fasta")

print(record.id)
print(record.seq)
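Because SeqIO.read() expects exactly one record, it can be worth handling the failure case explicitly. The helper name read_single below is hypothetical, and the inputs are in-memory strings rather than real files.

```python
from io import StringIO

from Bio import SeqIO


def read_single(fasta_text):
    """Return the only record in fasta_text, or None if there is not exactly one."""
    try:
        return SeqIO.read(StringIO(fasta_text), "fasta")
    except ValueError as error:
        # Raised for zero records and for more than one record
        print("Not a single-record input:", error)
        return None


one = read_single(">Seq1\nATGC\n")
many = read_single(">Seq1\nATGC\n>Seq2\nGGCC\n")

print(one.id)
print(many)
```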

Accessing Sequence Metadata

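Each record also carries metadata taken from the file.

print(record.description)
print(record.name)
print(record.id)

For FASTA input, id and name are both the first word of the header line, and description is the full header line. Richer formats such as GenBank fill these fields from separate parts of the file.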
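To see how a FASTA header maps onto these fields, here is a small sketch using a made-up header line with a description after the ID.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical single-record FASTA with a description after the ID
fasta_text = ">Seq1 Sample DNA sequence\nATGCGATACGTT\n"

record = SeqIO.read(StringIO(fasta_text), "fasta")

print(record.id)           # first word of the header
print(record.name)         # same as id for FASTA input
print(record.description)  # the whole header line, ID included
```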


Writing FASTA Files

You can also write sequences to a file.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

record = SeqRecord(
    Seq("ATGCGATACGTT"),
    id="Seq1",
    description="Sample DNA sequence"
)

SeqIO.write(record, "output.fasta", "fasta")

Writing Multiple Sequences

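# description="" keeps the placeholder "<unknown description>" out of the FASTA headers
records = [
    SeqRecord(Seq("ATGC"), id="S1", description=""),
    SeqRecord(Seq("GCTA"), id="S2", description="")
]

SeqIO.write(records, "multi.fasta", "fasta")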
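SeqIO.write() returns the number of records it wrote and accepts a text handle in place of a filename. The sketch below writes to an in-memory buffer so the resulting FASTA text can be inspected directly; the record contents are illustrative.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

records = [
    # An empty description yields a header containing only the ID
    SeqRecord(Seq("ATGC"), id="S1", description=""),
    SeqRecord(Seq("GCTA"), id="S2", description="second record"),
]

handle = StringIO()
written = SeqIO.write(records, handle, "fasta")
fasta_output = handle.getvalue()

print(written)
print(fasta_output)
```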

Converting File Formats

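Biopython can convert between formats in a single call. GenBank output requires a molecule type, which a FASTA file does not record, so it must be supplied:

from Bio import SeqIO

SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank", molecule_type="DNA")

This converts FASTA → GenBank format. Without a molecule type, recent Biopython versions raise a ValueError when writing GenBank.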
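GenBank output needs a molecule type for every record, and FASTA input does not provide one. One way to supply it is to set record.annotations["molecule_type"] on each record before writing. The generator name with_molecule_type is hypothetical, and the example assumes the input is DNA.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA input
fasta_text = ">Seq1\nATGCGATACGTT\n"


def with_molecule_type(records, molecule_type="DNA"):
    """Yield records with the annotation the GenBank writer requires."""
    for record in records:
        record.annotations["molecule_type"] = molecule_type
        yield record


genbank_handle = StringIO()
count = SeqIO.write(
    with_molecule_type(SeqIO.parse(StringIO(fasta_text), "fasta")),
    genbank_handle,
    "genbank",
)
genbank_text = genbank_handle.getvalue()

# Parse the GenBank text back to confirm the round trip
round_trip = SeqIO.read(StringIO(genbank_text), "genbank")
print(count, round_trip.id, round_trip.annotations["molecule_type"])
```

Because the records pass through a generator, the conversion still handles one record at a time.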


Working with GenBank Files

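GenBank files carry rich annotations alongside the sequence, such as the source organism, references, and features like genes and coding regions (CDS).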

Example:

for record in SeqIO.parse("sample.gb", "genbank"):
    print(record.id)
    print(record.description)

Extracting Features from GenBank

record = SeqIO.read("sample.gb", "genbank")

for feature in record.features:
    print(feature.type)

Output Example

source
gene
CDS
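Beyond the feature type, each feature has a location and a dictionary of qualifiers, and feature.extract() returns the subsequence it covers. The sketch below builds a small record in memory instead of reading sample.gb; the sequence, coordinates, and gene name are invented for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical record with one CDS feature covering positions 2..11 (0-based, end-exclusive)
record = SeqRecord(Seq("GGATGGCCTAAGG"), id="demo")
record.features.append(
    SeqFeature(
        FeatureLocation(2, 11, strand=1),
        type="CDS",
        qualifiers={"gene": ["demoA"]},
    )
)

cds_summaries = []
for feature in record.features:
    if feature.type != "CDS":
        continue
    # Qualifier values are lists, so take the first entry
    gene = feature.qualifiers.get("gene", ["unknown"])[0]
    cds_seq = feature.extract(record.seq)
    cds_summaries.append(
        (gene, int(feature.location.start), int(feature.location.end), str(cds_seq))
    )

print(cds_summaries)
```

The same loop should work on a record returned by SeqIO.read("sample.gb", "genbank"), since parsed GenBank features are SeqFeature objects as well.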

Filtering Sequences

You can filter sequences based on conditions.

for record in SeqIO.parse("sample.fasta", "fasta"):
    if len(record.seq) > 10:
        print(record.id)
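Filtering is often followed by writing the surviving records to a new file. A generator expression can be passed straight to SeqIO.write(), so only one record is held in memory at a time. The input text and the threshold of 10 are illustrative.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical input with one short and one long sequence
fasta_text = ">short\nATGC\n>long\nATGCGATACGTT\n"

# Lazily keep only the records longer than 10 bases
long_records = (
    record
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
    if len(record.seq) > 10
)

out = StringIO()
kept = SeqIO.write(long_records, out, "fasta")
filtered_fasta = out.getvalue()

print(kept)
print(filtered_fasta)
```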

Counting Sequences in File

count = 0

for record in SeqIO.parse("sample.fasta", "fasta"):
    count += 1

print(count)
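The counting loop can be written more compactly with sum() over a generator. For plain FASTA text, counting header lines is a parser-free alternative; it is shown here as a rough check, assuming every line starting with ">" begins a record.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA text standing in for sample.fasta
fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

# Count parsed records without storing them
record_count = sum(1 for _ in SeqIO.parse(StringIO(fasta_text), "fasta"))

# FASTA-only shortcut: count header lines directly
header_count = sum(1 for line in StringIO(fasta_text) if line.startswith(">"))

print(record_count, header_count)
```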

Calculating GC Content in Files

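from Bio import SeqIO

for record in SeqIO.parse("sample.fasta", "fasta"):
    # Uppercase first so lowercase (soft-masked) bases are counted too
    seq = record.seq.upper()

    if len(seq) == 0:
        continue  # skip empty records to avoid dividing by zero

    gc = (seq.count("G") + seq.count("C")) / len(seq) * 100

    print(record.id, round(gc, 2))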
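The GC calculation can be wrapped in a small helper that tolerates lowercase input and empty sequences. The function name gc_percent is hypothetical; it works on anything str() can convert, including a Seq object. Recent Biopython releases also provide a ready-made function in Bio.SeqUtils, which may be preferable where available.

```python
def gc_percent(sequence):
    """Return the GC content of a sequence as a percentage (0.0 for empty input)."""
    text = str(sequence).upper()
    if not text:
        return 0.0
    gc_count = text.count("G") + text.count("C")
    return gc_count / len(text) * 100


# Seq1 from the sample file has 5 G/C bases out of 12
seq1_gc = gc_percent("ATGCGATACGTT")
lower_gc = gc_percent("atgcgatacgtt")
empty_gc = gc_percent("")

print(round(seq1_gc, 2), round(lower_gc, 2), empty_gc)
```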

Renaming Sequences

for record in SeqIO.parse("sample.fasta", "fasta"):
    record.id = "NEW_" + record.id
    print(record.id)
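When renamed records are written back to FASTA, the old ID can reappear in the header, because a parsed FASTA description still contains the original header text. Clearing the description is one way to avoid that. The generator name renamed and the prefix are illustrative.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA input
fasta_text = ">Seq1\nATGCGATACGTT\n"


def renamed(records, prefix="NEW_"):
    """Yield records with a prefixed ID and no leftover description."""
    for record in records:
        record.id = prefix + record.id
        record.description = ""  # otherwise the header becomes "NEW_Seq1 Seq1"
        yield record


out = StringIO()
SeqIO.write(renamed(SeqIO.parse(StringIO(fasta_text), "fasta")), out, "fasta")
renamed_fasta = out.getvalue()

print(renamed_fasta)
```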

Reverse Complement in Files

from Bio import SeqIO

for record in SeqIO.parse("sample.fasta", "fasta"):
    print(record.seq.reverse_complement())
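To keep the reverse complement as a record that can be written to a file, SeqRecord has its own reverse_complement() method, which accepts a new ID and description. The "_rc" suffix below is an arbitrary naming choice.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA input
fasta_text = ">Seq1\nATGCGATACGTT\n"

rc_records = [
    # Passing strings sets the new record's id and description
    record.reverse_complement(id=record.id + "_rc", description="reverse complement")
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
]

for rc in rc_records:
    print(rc.id, rc.seq)
```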

Common File Formats Supported

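Format       Description
FASTA        Simple sequence format
GenBank      Annotated sequences
EMBL         European sequence format
PDB          Protein structure files (SeqIO reads their sequences via "pdb-seqres" or "pdb-atom")
Swiss-Prot   Protein database
Phylip       Alignment format used for phylogenetic data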

Real-World Applications

Sequence I/O is used in:

Genomics

  • Genome sequencing projects
  • Large-scale DNA analysis

Medical Research

  • Genetic disorder studies
  • Mutation detection

Drug Discovery

  • Protein and gene analysis

Evolutionary Biology

  • Species comparison
  • Phylogenetic studies

Best Practices

Always Validate Files

Check that the format name you pass to SeqIO matches the file's actual format, and confirm that parsing produced the records you expect.
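What counts as valid depends on the project. As one possible sketch, the hypothetical helper below checks that a DNA FASTA input contains at least one record, that no record is empty, and that only expected characters appear. The allowed alphabet "ACGTN" is an assumption and would need widening for ambiguity codes or protein data.

```python
from io import StringIO

from Bio import SeqIO


def validate_dna_fasta(handle, allowed="ACGTN"):
    """Return a list of problem descriptions; an empty list means no problems found."""
    problems = []
    count = 0
    for record in SeqIO.parse(handle, "fasta"):
        count += 1
        if len(record.seq) == 0:
            problems.append(f"{record.id}: empty sequence")
            continue
        unexpected = set(str(record.seq).upper()) - set(allowed)
        if unexpected:
            problems.append(f"{record.id}: unexpected characters {sorted(unexpected)}")
    if count == 0:
        problems.append("no records found")
    return problems


good = validate_dna_fasta(StringIO(">ok\nACGT\n"))
bad = validate_dna_fasta(StringIO(">bad\nACGX\n"))
empty = validate_dna_fasta(StringIO(""))

print(good, bad, empty)
```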

Use Iterators for Large Files

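SeqIO.parse() returns an iterator that yields one record at a time, so loop over it directly instead of loading the entire dataset into memory.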

Keep Data Organized

Separate input and output files.

Use Meaningful IDs

Rename sequences for clarity.

Combine with Analysis Tools

Use SeqIO together with other Biopython modules such as Bio.Seq, Bio.Align, and Bio.Entrez.


Performance Tips

  • Use SeqIO.parse() for large datasets
  • Avoid list() for huge files unless necessary
  • Process sequences in chunks when possible
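Chunked processing can be sketched with itertools.islice, which pulls a fixed number of records from the SeqIO.parse() iterator at a time. The helper name batches and the batch size of 2 are illustrative; real pipelines would pick a size that suits the available memory.

```python
from io import StringIO
from itertools import islice

from Bio import SeqIO


def batches(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical in-memory FASTA with five short records
fasta_text = "".join(f">rec{i}\nATGC\n" for i in range(5))

batch_ids = [
    [record.id for record in batch]
    for batch in batches(SeqIO.parse(StringIO(fasta_text), "fasta"), 2)
]

print(batch_ids)
```

Only one batch of records is held in memory at any point, which keeps the footprint bounded regardless of file size.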

Conclusion

Biopython’s SeqIO module is a powerful tool for handling biological sequence files. It allows you to easily read, write, convert, and process FASTA, GenBank, and other bioinformatics file formats.

Mastering sequence I/O operations is essential for working with real genomic data, building bioinformatics pipelines, and performing large-scale biological analysis. In the next tutorial, we will explore sequence alignment and how to compare biological sequences using Biopython.



