Biopython - Sequence I/O Operations
Introduction
In bioinformatics, biological data is stored in specialized file formats such as FASTA, GenBank, EMBL, and others. Efficient reading and writing of these files is essential for sequence analysis, genome research, and computational biology.
Biopython provides the SeqIO module, which makes sequence input/output (I/O) simple and powerful. With SeqIO, you can read, parse, write, and manipulate biological sequence files using just a few lines of Python code.
In this tutorial, you will learn how to perform sequence I/O operations using Biopython.
What is Sequence I/O in Biopython?
Sequence I/O refers to:
- Reading biological sequence files
- Writing sequences to files
- Parsing multiple sequences
- Converting between file formats
- Handling sequence metadata
Biopython supports many formats:
- FASTA
- GenBank
- EMBL
- Swiss-Prot
- Clustal
- Phylip
Importing SeqIO Module
To start working with sequence files, import SeqIO:
from Bio import SeqIO
This module is the core tool for sequence file handling.
Understanding FASTA Format
FASTA is one of the most widely used formats for biological sequences.
Example FASTA file:
>Seq1
ATGCGATACGTT
>Seq2
ATGCCGTAGCTA
Each sequence has:
- A header line starting with >
- Sequence data on the lines below it
Reading a FASTA File
To read a FASTA file:
from Bio import SeqIO
for record in SeqIO.parse("sample.fasta", "fasta"):
    print(record.id)
    print(record.seq)
Output Example
Seq1
ATGCGATACGTT
Seq2
ATGCCGTAGCTA
Each record contains:
- record.id → sequence ID
- record.seq → biological sequence
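To try the loop without creating a file first, the same FASTA text can be passed to SeqIO.parse() through an in-memory handle. The following is a minimal sketch; the variable names (fasta_text, parsed) are illustrative and not part of Biopython.

```python
from io import StringIO
from Bio import SeqIO

# The two-record FASTA example from this tutorial, held in memory.
fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

# SeqIO.parse() accepts an open text handle as well as a filename.
parsed = [
    (record.id, str(record.seq), len(record.seq))
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
]

for seq_id, sequence, length in parsed:
    print(seq_id, sequence, length)
```

Converting record.seq with str() gives a plain Python string, which is convenient when the sequence is passed to code that does not know about Biopython objects.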
Reading All Sequences as a List
records = list(SeqIO.parse("sample.fasta", "fasta"))
print(len(records))
Output:
2
This is useful for batch processing when the whole file fits comfortably in memory.
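When records need to be looked up by ID rather than by position, SeqIO.to_dict() can build a dictionary from the same iterator. This is a small sketch using in-memory data; by_id is an illustrative name.

```python
from io import StringIO
from Bio import SeqIO

fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

# Build a dict keyed by record.id; duplicate IDs raise a ValueError.
by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(sorted(by_id))
print(by_id["Seq2"].seq)
```

Like list(), this holds every record in memory, so it suits small and medium-sized files.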
Reading a Single Sequence
If your file contains exactly one sequence, use SeqIO.read(). It raises a ValueError if the file holds no records or more than one:
record = SeqIO.read("single.fasta", "fasta")
print(record.id)
print(record.seq)
Accessing Sequence Metadata
Beyond the sequence itself, each record carries metadata:
print(record.description)
print(record.name)
print(record.id)
These fields provide important biological context.
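For FASTA input specifically, these three fields are all derived from the header line. The sketch below uses an in-memory record with a made-up header to show how the header is split.

```python
from io import StringIO
from Bio import SeqIO

# Illustrative single-record FASTA with an ID followed by free text.
fasta_text = ">Seq1 Sample DNA sequence\nATGCGATACGTT\n"
record = SeqIO.read(StringIO(fasta_text), "fasta")

print(record.id)           # first word of the header line
print(record.name)         # same as id for FASTA input
print(record.description)  # the whole header line, without the leading ">"
```

Richer formats such as GenBank fill these fields from dedicated lines in the file, so the values can differ from one another there.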
Writing FASTA Files
You can also write sequences to a file.
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
record = SeqRecord(
    Seq("ATGCGATACGTT"),
    id="Seq1",
    description="Sample DNA sequence"
)
SeqIO.write(record, "output.fasta", "fasta")
Writing Multiple Sequences
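# description="" keeps the placeholder "<unknown description>" out of the FASTA header
records = [
    SeqRecord(Seq("ATGC"), id="S1", description=""),
    SeqRecord(Seq("GCTA"), id="S2", description="")
]
SeqIO.write(records, "multi.fasta", "fasta")
Converting File Formats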
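Biopython can convert between formats, provided the target format's required fields are present. In recent Biopython releases, GenBank output needs a molecule type, which FASTA records do not carry, so set it before writing.
records = list(SeqIO.parse("input.fasta", "fasta"))
for record in records:
    record.annotations["molecule_type"] = "DNA"  # required by the GenBank writer
SeqIO.write(records, "output.gb", "genbank")
This converts FASTA → GenBank format.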
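The following is a minimal, self-contained sketch of a FASTA to GenBank conversion using in-memory handles. It assumes a recent Biopython release, where the GenBank writer expects a "molecule_type" annotation on every record. The helper with_molecule_type is hypothetical and written for this example; it is not a Biopython function.

```python
from io import StringIO
from Bio import SeqIO

fasta_text = ">Seq1\nATGCGATACGTT\n>Seq2\nATGCCGTAGCTA\n"

def with_molecule_type(records, molecule_type="DNA"):
    """Yield each record after tagging it with a molecule type (hypothetical helper)."""
    for record in records:
        record.annotations["molecule_type"] = molecule_type
        yield record

genbank_out = StringIO()
# SeqIO.write() consumes the generator and returns the number of records written.
written = SeqIO.write(
    with_molecule_type(SeqIO.parse(StringIO(fasta_text), "fasta")),
    genbank_out,
    "genbank",
)
genbank_text = genbank_out.getvalue()

# Read the GenBank text back to check that the sequences survived the round trip.
round_trip = [
    str(record.seq).upper()
    for record in SeqIO.parse(StringIO(genbank_text), "genbank")
]
print(written, round_trip)
```

Because the helper is a generator, records are tagged and written one at a time, which keeps memory use low for large inputs.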
Working with GenBank Files
GenBank files contain rich annotations.
Example:
for record in SeqIO.parse("sample.gb", "genbank"):
    print(record.id)
    print(record.description)
Extracting Features from GenBank
record = SeqIO.read("sample.gb", "genbank")
for feature in record.features:
    print(feature.type)
Output Example
source
gene
CDS
Filtering Sequences
You can filter sequences based on conditions.
for record in SeqIO.parse("sample.fasta", "fasta"):
    if len(record.seq) > 10:
        print(record.id)
Counting Sequences in File
count = 0
for record in SeqIO.parse("sample.fasta", "fasta"):
    count += 1
print(count)
Calculating GC Content in Files
from Bio import SeqIO
for record in SeqIO.parse("sample.fasta", "fasta"):
    seq = record.seq.upper()  # count() is case-sensitive
    # GC percentage; assumes the sequence is not empty
    gc = (seq.count("G") + seq.count("C")) / len(seq) * 100
    print(record.id, gc)
Renaming Sequences
for record in SeqIO.parse("sample.fasta", "fasta"):
    record.id = "NEW_" + record.id
    # record.description still holds the old header; clear it before writing FASTA
    print(record.id)
Reverse Complement in Files
from Bio import SeqIO
for record in SeqIO.parse("sample.fasta", "fasta"):
    print(record.seq.reverse_complement())
Common File Formats Supported
| Format | Description |
|---|---|
| FASTA | Simple sequence format |
| GenBank | Annotated sequences |
| EMBL | European sequence format |
| PDB | Protein structure (SeqIO reads the sequences only, via "pdb-seqres" or "pdb-atom") |
| Swiss-Prot | Protein database |
| Phylip | Phylogenetic data |
Real-World Applications
Sequence I/O is used in:
Genomics
- Genome sequencing projects
- Large-scale DNA analysis
Medical Research
- Genetic disorder studies
- Mutation detection
Drug Discovery
- Protein and gene analysis
Evolutionary Biology
- Species comparison
- Phylogenetic studies
Best Practices
Always Validate Files
Confirm that a file actually matches the format name you pass to SeqIO before parsing it.
Use Iterators for Large Files
Avoid loading entire datasets into memory.
Keep Data Organized
Separate input and output files.
Use Meaningful IDs
Rename sequences for clarity.
Combine with Analysis Tools
Use SeqIO with Seq, Align, and Entrez modules.
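As one way to act on the validation advice above, the sketch below runs a few basic checks on a DNA FASTA input: at least one record, no duplicate IDs, and only expected characters. The function validate_dna_fasta and its allowed alphabet are assumptions made for this example, not part of Biopython, and real projects may need stricter or different rules.

```python
from io import StringIO
from Bio import SeqIO

def validate_dna_fasta(handle, allowed="ACGTN"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    seen_ids = set()
    count = 0
    for record in SeqIO.parse(handle, "fasta"):
        count += 1
        if record.id in seen_ids:
            problems.append(f"duplicate ID: {record.id}")
        seen_ids.add(record.id)
        # Compare the sequence characters against the allowed alphabet.
        unexpected = set(str(record.seq).upper()) - set(allowed)
        if unexpected:
            problems.append(f"{record.id}: unexpected characters {sorted(unexpected)}")
    if count == 0:
        problems.append("no records found")
    return problems

good = validate_dna_fasta(StringIO(">Seq1\nATGCGATACGTT\n"))
bad = validate_dna_fasta(StringIO(">Seq1\nATGXGA\n>Seq1\nATGC\n"))
print(good)
print(bad)
```

The function iterates over the records once and keeps only the IDs in memory, so it also follows the advice to use iterators for large files.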
Performance Tips
- Use SeqIO.parse() for large datasets
- Avoid list() for huge files unless necessary
- Process sequences in chunks when possible
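These tips can be combined by passing a generator expression straight to SeqIO.write(), so that records are read, filtered, and written one at a time. The sketch below uses in-memory handles and made-up record names; with real data the handles would be filenames.

```python
from io import StringIO
from Bio import SeqIO

fasta_text = ">short\nATGC\n>long\nATGCGATACGTT\n"

# Generator expression: no list of records is ever built in memory.
long_records = (
    record
    for record in SeqIO.parse(StringIO(fasta_text), "fasta")
    if len(record.seq) > 10
)

output = StringIO()
# SeqIO.write() returns the number of records it wrote.
written = SeqIO.write(long_records, output, "fasta")
print(written)
print(output.getvalue())
```

Only one record is held at a time, so memory use stays roughly constant regardless of how many sequences the input contains.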
Conclusion
Biopython’s SeqIO module is a powerful tool for handling biological sequence files. It allows you to easily read, write, convert, and process FASTA, GenBank, and other bioinformatics file formats.
Mastering sequence I/O operations is essential for working with real genomic data, building bioinformatics pipelines, and performing large-scale biological analysis. In the next tutorial, we will explore sequence alignment and how to compare biological sequences using Biopython.

