Biopython - Sequence I/O Operations

Introduction

In bioinformatics, biological sequence data is stored in specialized file formats such as FASTA, GenBank, and EMBL. Reading and writing these files efficiently is essential for sequence analysis, genome research, and computational biology.

Biopython provides the SeqIO module, which offers one uniform interface for sequence input/output (I/O). With SeqIO, you can read, parse, write, and convert biological sequence files in a few lines of Python code.

In this tutorial, you will learn how to perform sequence I/O operations using Biopython.

What is Sequence I/O in Biopython?

Sequence I/O refers to:

Reading biological sequence files
Writing sequences to files
Parsing multiple sequences
Converting between file formats
Handling sequence metadata

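Biopython supports many formats. Each one is selected by a lowercase format name passed to SeqIO, such as "fasta", "genbank", "embl", "swiss", "clustal", or "phylip":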

FASTA
GenBank
EMBL
Swiss-Prot
Clustal
Phylip

Importing SeqIO Module

To start working with sequence files, import SeqIO:

from Bio import SeqIO

This module is the core tool for sequence file handling.

Understanding FASTA Format

FASTA is one of the most widely used formats for biological sequences.

Example FASTA file:

>Seq1
ATGCGATACGTT
>Seq2
ATGCCGTAGCTA

Each sequence has:

Header line starting with >
Sequence data below it

Reading a FASTA File

To read a FASTA file:

from Bio import SeqIO

for record in SeqIO.parse("sample.fasta", "fasta"):
    print(record.id)
    print(record.seq)

Output Example

Seq1
ATGCGATACGTT
Seq2
ATGCCGTAGCTA

Each record contains:

record.id → sequence ID
record.seq → biological sequence
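If you do not have sample.fasta on disk, the same loop can be tried against an in-memory handle. The sketch below assumes the two-record FASTA text shown earlier and uses io.StringIO as a stand-in for the file; the variable names are illustrative only.

```python
from io import StringIO

from Bio import SeqIO

# In-memory stand-in for sample.fasta (same two records as the example file).
fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

# SeqIO.parse accepts an open handle as well as a filename.
summary = []
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    # record.seq is a Seq object; str() gives the plain letters.
    summary.append((record.id, str(record.seq), len(record)))

for rec_id, letters, length in summary:
    print(rec_id, letters, length)
```

Working from a handle is also convenient for quick experiments and unit tests, because nothing has to be written to disk.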

Reading All Sequences as a List

records = list(SeqIO.parse("sample.fasta", "fasta"))

print(len(records))

Output (for the two-record sample.fasta shown earlier):

2

This is useful for batch processing.
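Once the records are in a list, they can be indexed like any other Python list, and SeqIO.to_dict() can turn them into a dictionary keyed by record ID. A minimal sketch, again assuming the two-record sample data held in memory:

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# Positional access works as with any list.
last_id = records[-1].id

# Dictionary keyed by record.id for lookup by name.
by_id = SeqIO.to_dict(records)

print(len(records), last_id, str(by_id["Seq1"].seq))
```

Both approaches hold every record in memory, so they suit small and medium-sized files best.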

Reading a Single Sequence

If your file contains only one sequence:

record = SeqIO.read("single.fasta", "fasta")

print(record.id)
print(record.seq)
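SeqIO.read() expects exactly one record and raises ValueError when the input holds none or more than one. The sketch below shows one way to guard against that, using an in-memory two-record input as the failing case.

```python
from io import StringIO

from Bio import SeqIO

two_records = ">Seq1\nATGC\n>Seq2\nGCTA\n"

try:
    record = SeqIO.read(StringIO(two_records), "fasta")
    outcome = "read one record: " + record.id
except ValueError as err:
    # Raised because the input does not contain exactly one record.
    outcome = "ValueError: " + str(err)

print(outcome)
```

When the number of records is not known in advance, SeqIO.parse() is the safer choice.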

Accessing Sequence Metadata

Besides the sequence itself, each record carries metadata such as an identifier, a name, and a free-text description.

print(record.description)
print(record.name)
print(record.id)

These fields provide important biological context.
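For FASTA input, all three fields come from the header line: id and name are the first word, and description is the whole header. The sketch below assumes a made-up header to show the split.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical header: an ID followed by free text.
fasta_text = ">Seq1 Sample DNA sequence\nATGCGATACGTT\n"
record = SeqIO.read(StringIO(fasta_text), "fasta")

print(record.id)           # first word of the header
print(record.name)         # same as the ID for FASTA input
print(record.description)  # the full header line, ID included
```

Richer formats such as GenBank fill these fields from dedicated lines and also populate record.annotations and record.features.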

Writing FASTA Files

You can also write sequences to a file.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

record = SeqRecord(
    Seq("ATGCGATACGTT"),
    id="Seq1",
    description="Sample DNA sequence"
)

SeqIO.write(record, "output.fasta", "fasta")

Writing Multiple Sequences

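records = [
    # description="" keeps "<unknown description>" out of the FASTA header
    SeqRecord(Seq("ATGC"), id="S1", description=""),
    SeqRecord(Seq("GCTA"), id="S2", description="")
]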

SeqIO.write(records, "multi.fasta", "fasta")
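SeqIO.write() returns the number of records it wrote, which is handy as a sanity check. The sketch below writes to an in-memory handle instead of multi.fasta so the resulting text can be inspected directly.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

records = [
    SeqRecord(Seq("ATGC"), id="S1", description=""),
    SeqRecord(Seq("GCTA"), id="S2", description=""),
]

# Any writable text handle can replace the output filename.
handle = StringIO()
written = SeqIO.write(records, handle, "fasta")
fasta_output = handle.getvalue()

print(written)
print(fasta_output)
```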

Converting File Formats

Biopython can convert between formats easily.

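records = list(SeqIO.parse("input.fasta", "fasta"))
for record in records:
    # GenBank needs a molecule type that FASTA lacks; DNA is assumed here.
    record.annotations["molecule_type"] = "DNA"

SeqIO.write(records, "output.gb", "genbank")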

This converts FASTA → GenBank format.
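The same conversion can be done in one call with SeqIO.convert(). In recent Biopython releases it accepts a molecule_type argument, which supplies the information GenBank output requires. The sketch below assumes DNA input and uses in-memory handles in place of input.fasta and output.gb.

```python
from io import StringIO

from Bio import SeqIO

fasta_in = StringIO(">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n")
genbank_out = StringIO()

# molecule_type is assumed to be DNA for this sample data.
converted = SeqIO.convert(
    fasta_in, "fasta", genbank_out, "genbank", molecule_type="DNA"
)
genbank_text = genbank_out.getvalue()

print(converted)
print(genbank_text.splitlines()[0])  # the LOCUS line of the first record
```

Converting in the other direction, from GenBank to FASTA, needs no molecule type, but the annotations and features are lost because FASTA has nowhere to store them.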

Working with GenBank Files

GenBank files contain rich annotations.

Example:

for record in SeqIO.parse("sample.gb", "genbank"):
    print(record.id)
    print(record.description)

Extracting Features from GenBank

record = SeqIO.read("sample.gb", "genbank")

for feature in record.features:
    print(feature.type)

Output Example

source
gene
CDS
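Each feature also carries a location and a dictionary of qualifiers, and feature.extract() pulls out the matching part of the sequence. The sketch below builds a small record by hand as a stand-in for sample.gb; the gene name and coordinates are made up.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical record with one gene and one CDS feature.
record = SeqRecord(Seq("ATGCGATACGTT"), id="Demo1", description="demo")
for feature_type in ("gene", "CDS"):
    record.features.append(
        SeqFeature(
            FeatureLocation(0, 12, strand=1),
            type=feature_type,
            qualifiers={"gene": ["demoA"]},
        )
    )

cds_summaries = []
for feature in record.features:
    if feature.type != "CDS":
        continue  # keep coding sequences only
    gene = feature.qualifiers.get("gene", ["?"])[0]
    cds_seq = feature.extract(record.seq)
    cds_summaries.append(
        (gene, int(feature.location.start), int(feature.location.end), str(cds_seq))
    )

print(cds_summaries)
```

Qualifier values are stored as lists, which is why the gene name is taken with [0].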

Filtering Sequences

You can filter sequences based on conditions.

for record in SeqIO.parse("sample.fasta", "fasta"):
    if len(record.seq) > 10:
        print(record.id)
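Filtering is often followed by writing the surviving records to a new file. One way to do this without holding everything in memory is to pass a generator expression to SeqIO.write(). The sketch assumes a length threshold of 10 and in-memory handles; the record names are invented.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">short\nATGC\n>long\nATGCGATACGTT\n"
min_length = 10  # assumed threshold for this example

# The generator yields records lazily, one at a time.
long_records = (
    record
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
    if len(record.seq) > min_length
)

out_handle = StringIO()
kept = SeqIO.write(long_records, out_handle, "fasta")
filtered_text = out_handle.getvalue()

print(kept)
print(filtered_text)
```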

Counting Sequences in File

count = 0

for record in SeqIO.parse("sample.fasta", "fasta"):
    count += 1

print(count)
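The same pass over the file can collect other totals as well. A short sketch, assuming the two-record sample data, that counts records and sums their lengths in one loop:

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

count = 0
total_bases = 0
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    count += 1
    total_bases += len(record.seq)

print(count, total_bases)
```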

Calculating GC Content in Files

from Bio import SeqIO

for record in SeqIO.parse("sample.fasta", "fasta"):
    seq = record.seq

    gc = (
        (seq.count("G") + seq.count("C"))
        / len(seq)
    ) * 100

    print(record.id, gc)
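The calculation above counts uppercase letters only and divides by the sequence length, so it undercounts lowercase input and fails on an empty sequence. The helper below is a hypothetical sketch (gc_percent is not a Biopython function) that handles both cases.

```python
from io import StringIO

from Bio import SeqIO


def gc_percent(seq):
    """Return GC content as a percentage; 0.0 for an empty sequence."""
    letters = str(seq).upper()  # treat lowercase and uppercase alike
    if not letters:
        return 0.0
    return 100.0 * (letters.count("G") + letters.count("C")) / len(letters)


fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\natgccgtagcta\n"
gc_by_id = {
    record.id: round(gc_percent(record.seq), 2)
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
}
print(gc_by_id)
```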

Renaming Sequences

for record in SeqIO.parse("sample.fasta", "fasta"):
    record.id = "NEW_" + record.id
    print(record.id)
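Renaming only changes the record in memory. When the renamed records are written back to FASTA, note that the parser stored the full original header in record.description, so the old ID would reappear after the new one. The sketch below clears the description before writing; it assumes the two-record sample data.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

renamed = []
for record in SeqIO.parse(StringIO(fasta_text), "fasta"):
    record.id = "NEW_" + record.id
    record.description = ""  # drop the old header text
    renamed.append(record)

out_handle = StringIO()
SeqIO.write(renamed, out_handle, "fasta")
renamed_text = out_handle.getvalue()
print(renamed_text)
```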

Reverse Complement in Files

from Bio import SeqIO

for record in SeqIO.parse("sample.fasta", "fasta"):
    print(record.seq.reverse_complement())
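To keep the reverse complements as records that can be written to a file, SeqRecord has its own reverse_complement() method, which accepts a new id and description. The "_rc" suffix below is just an illustrative naming choice.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">Seq1\nATGCGATACGTT\n"
record = SeqIO.read(StringIO(fasta_text), "fasta")

# Returns a new SeqRecord; the original record is left unchanged.
rc_record = record.reverse_complement(
    id=record.id + "_rc", description="reverse complement"
)

out_handle = StringIO()
SeqIO.write(rc_record, out_handle, "fasta")
print(out_handle.getvalue())
```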

Common File Formats Supported

Format	Description
FASTA	Simple sequence format
GenBank	Annotated sequences
EMBL	European sequence format
PDB	Protein structure (sequence only, via the pdb-seqres and pdb-atom formats)
Swiss-Prot	Protein database
Phylip	Phylogenetic data

Real-World Applications

Sequence I/O is used in:

Genomics

Genome sequencing projects
Large-scale DNA analysis

Medical Research

Genetic disorder studies
Mutation detection

Drug Discovery

Protein and gene analysis

Evolutionary Biology

Species comparison
Phylogenetic studies

Best Practices

Always Validate Files

Ensure correct format before parsing.
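A cheap pre-check can catch obvious mix-ups before a file reaches SeqIO.parse(). The helper below is a hypothetical sketch, not part of Biopython: it only tests whether the first non-blank line starts with ">", which is a necessary but not sufficient sign of FASTA.

```python
from io import StringIO


def looks_like_fasta(handle):
    """Heuristic: the first non-blank line must start with '>'."""
    for line in handle:
        if line.strip():
            return line.startswith(">")
    return False  # empty input


checks = {
    "fasta": looks_like_fasta(StringIO(">Seq1\nATGC\n")),
    "genbank": looks_like_fasta(StringIO("LOCUS       Demo1\n")),
    "empty": looks_like_fasta(StringIO("")),
}
print(checks)
```

A stricter check would parse the file and confirm that at least one record comes back with a non-empty sequence.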

Use Iterators for Large Files

Avoid loading entire datasets into memory.

Keep Data Organized

Separate input and output files.

Use Meaningful IDs

Rename sequences for clarity.

Combine with Analysis Tools

Use SeqIO with Seq, Align, and Entrez modules.

Performance Tips

Use SeqIO.parse() for large datasets
Avoid list() for huge files unless necessary
Process sequences in chunks when possible
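Chunked processing can be sketched with itertools.islice, which pulls a fixed number of records at a time from the SeqIO.parse() iterator. The batched_records helper and the batch size of 2 are illustrative assumptions, not Biopython features.

```python
from io import StringIO
from itertools import islice

from Bio import SeqIO


def batched_records(records, batch_size):
    """Yield lists of up to batch_size records from any record iterator."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # iterator exhausted
        yield batch


# Five small made-up records stand in for a large file.
fasta_text = "".join(f">Seq{i}\nATGC\n" for i in range(1, 6))

batch_sizes = []
for batch in batched_records(SeqIO.parse(StringIO(fasta_text), "fasta"), 2):
    # Only one batch is held in memory at a time.
    batch_sizes.append(len(batch))

print(batch_sizes)
```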

Conclusion

Biopython’s SeqIO module is a powerful tool for handling biological sequence files. It allows you to easily read, write, convert, and process FASTA, GenBank, and other bioinformatics file formats.

Mastering sequence I/O operations is essential for working with real genomic data, building bioinformatics pipelines, and performing large-scale biological analysis. In the next tutorial, we will explore sequence alignment and how to compare biological sequences using Biopython.

Posted by: Roger John Williams
