Biopython - Entrez Database
The Entrez system is a powerful search and retrieval system developed by NCBI (National Center for Biotechnology Information). It provides access to a vast collection of biological databases including DNA sequences, protein sequences, genome data, and scientific literature.
Biopython provides a module called Bio.Entrez that allows you to interact with the NCBI databases directly from Python. This makes it easy to search, retrieve, and analyze biological data programmatically.
In this tutorial, you will learn how to use the Entrez module in Biopython step by step.
What is Entrez?
Entrez is a unified database search system that connects multiple biological databases such as:
- Nucleotide sequences
- Protein sequences
- Genome data
- PubMed articles
- Taxonomy data
- Structure databases
It allows researchers to retrieve biological information using keywords, accession numbers, or identifiers.
Why Use Entrez in Biopython?
Biopython’s Entrez module helps you to:
- Search biological databases
- Retrieve DNA and protein sequences
- Access scientific publications
- Automate data collection
- Integrate NCBI data into Python workflows
Important Requirement
Before using Entrez, NCBI requires you to provide an email address.
from Bio import Entrez
Entrez.email = "your_email@example.com"
NCBI uses this address to contact you about problematic usage before blocking access.
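If you would rather not hard-code the address in every script, the settings can be read from the environment. The sketch below is one possible approach: the variable names NCBI_EMAIL, NCBI_TOOL, and NCBI_API_KEY are assumptions chosen for this example, not names that Biopython reads by itself. The attributes it sets (Entrez.email, Entrez.tool, Entrez.api_key) are the ones Bio.Entrez exposes.

```python
import os


def entrez_settings_from_env(env=None):
    """Collect Entrez identification settings from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL", "").strip()
    if not email:
        raise ValueError("Set NCBI_EMAIL so NCBI can identify your requests")
    settings = {"email": email, "tool": env.get("NCBI_TOOL", "biopython")}
    api_key = env.get("NCBI_API_KEY", "").strip()
    if api_key:
        settings["api_key"] = api_key  # optional; raises the allowed request rate
    return settings


def configure_entrez(settings):
    """Apply the collected settings to Bio.Entrez."""
    # Imported here so the helper above can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = settings["email"]
    Entrez.tool = settings["tool"]
    if "api_key" in settings:
        Entrez.api_key = settings["api_key"]
```

Calling configure_entrez(entrez_settings_from_env()) once at the start of a script would then replace the hard-coded assignment.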
Available NCBI Databases
Some commonly used databases include:
| Database | Description |
|---|---|
| nucleotide | DNA sequences |
| protein | Protein sequences |
| pubmed | Scientific literature |
| genome | Genome data |
| taxonomy | Organism classification |
Searching NCBI Database
You can search for biological data using keywords.
from Bio import Entrez
Entrez.email = "your_email@example.com"
handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1"
)
record = Entrez.read(handle)
print(record)
Understanding Search Results
The search returns:
- List of IDs
- Total number of results
- Query information
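The fields listed above can be pulled out of the parsed record with ordinary dictionary access. The following is a minimal sketch: summarize_search is a hypothetical helper, and the example dictionary is an invented stand-in for what Entrez.read(handle) returns, so the block runs without a network connection.

```python
def summarize_search(record):
    """Pull the commonly used fields out of a parsed esearch record."""
    total = int(record["Count"])   # Count arrives as a string-like value
    ids = list(record["IdList"])   # IDs included in this response
    return {
        "total": total,
        "returned": len(ids),
        "truncated": len(ids) < total,  # more matches exist than were returned
        "ids": ids,
        "query": record.get("QueryTranslation", ""),
    }


# Stand-in for Entrez.read(handle); the values are invented for illustration.
example = {
    "Count": "250",
    "IdList": ["12345", "67890"],
    "QueryTranslation": "BRCA1[All Fields]",
}
summary = summarize_search(example)
```

By default esearch returns only the first 20 IDs, so "truncated" is often true. Passing a larger retmax to Entrez.esearch returns more IDs per call.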
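Example output of print(record) (abbreviated; a real record also includes keys such as RetMax and QueryTranslation):
{'IdList': ['12345', '67890'], 'Count': '250'}
Fetching Data from NCBI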
After searching, you can retrieve detailed data using IDs.
from Bio import Entrez
Entrez.email = "your_email@example.com"
handle = Entrez.efetch(
    db="nucleotide",
    id="12345",  # placeholder; use an ID returned by esearch
    rettype="fasta",
    retmode="text"
)
print(handle.read())
Fetching FASTA Sequences
handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="fasta",
    retmode="text"
)
print(handle.read())
Fetching GenBank Records
handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="gb",
    retmode="text"
)
print(handle.read())
Reading Structured Data
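To parse structured results, pass the handle to Entrez.read, which converts NCBI's XML response into Python dictionaries and lists:
from Bio import Entrez
handle = Entrez.esearch(
    db="pubmed",
    term="cancer genomics"
)
record = Entrez.read(handle)
print(record["IdList"])
Searching PubMed Articles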
handle = Entrez.esearch(
    db="pubmed",
    term="COVID-19 vaccine"
)
record = Entrez.read(handle)
print(record["IdList"])
Fetching PubMed Details
handle = Entrez.efetch(
    db="pubmed",
    id="12345678",  # placeholder PubMed ID
    rettype="medline",
    retmode="text"
)
print(handle.read())
Retrieving Multiple Records
# Join the IDs into one comma-separated string so a single request fetches them all
ids = ",".join(["12345", "67890"])
handle = Entrez.efetch(
    db="nucleotide",
    id=ids,
    rettype="fasta",
    retmode="text"
)
print(handle.read())
Working with Taxonomy Database
handle = Entrez.esearch(
    db="taxonomy",
    term="Homo sapiens"
)
record = Entrez.read(handle)
print(record["IdList"])
Parsing Entrez XML Data
handle = Entrez.esearch(
    db="nucleotide",
    term="insulin"
)
record = Entrez.read(handle)
# Loop over the returned IDs (named seq_id to avoid shadowing the built-in id)
for seq_id in record["IdList"]:
    print(seq_id)
EFetch vs ESearch
| Function | Purpose |
|---|---|
| esearch | Search a database and return matching IDs |
| efetch | Retrieve full records for given IDs |
Entrez Workflow
Search (esearch)
↓
Get IDs
↓
Fetch Data (efetch)
↓
Parse Results
↓
Analyze Data
Important Rules for Using Entrez
1. Always set email
Entrez.email = "your_email@example.com"
2. Avoid too many requests
NCBI may block excessive queries. Its guidelines allow up to 3 requests per second without an API key and up to 10 per second with one.
3. Use proper delays for large queries
import time
time.sleep(1)
Common Databases in Entrez
| Database | Use Case |
|---|---|
| nucleotide | DNA analysis |
| protein | Protein research |
| pubmed | Scientific articles |
| genome | Genome mapping |
| structure | Protein structures |
Real-World Applications
Genomics Research
- DNA sequence retrieval
- Gene annotation
Medical Research
- Disease gene discovery
- Cancer research
Drug Discovery
- Protein analysis
- Target identification
Academic Research
- Literature review
- Scientific data mining
Advantages of Entrez in Biopython
- Direct access to NCBI databases
- Automation of data retrieval
- Integration with Python workflows
- Supports multiple biological datasets
- Useful for large-scale research
Limitations
- Requires internet connection
- API rate limits apply
- Dependent on NCBI server availability
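Because requests can fail when the connection drops or the NCBI servers are busy, it can help to retry a failed call after a pause. The sketch below shows one way to do that with exponential backoff. call_with_retries is a hypothetical helper, not part of Biopython, and it assumes the failure surfaces as urllib's URLError (HTTPError is a subclass of it).

```python
import time
from urllib.error import URLError


def call_with_retries(request, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call request() and retry on network errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return request()
        except URLError:
            # Give up after the last attempt and let the caller see the error.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

A call such as call_with_retries(lambda: Entrez.efetch(db="nucleotide", id="NM_007294", rettype="fasta", retmode="text")) would then return the handle from the first attempt that succeeds.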
Best Practices
Always use email identification
NCBI requires user identification.
Cache results locally
Avoid repeated requests.
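A simple file-based cache is one way to avoid repeated requests. The helper below is a hypothetical sketch: cached_text and the entrez_cache directory name are assumptions for illustration, and the fetch argument stands for any function that returns the record text, for example one that wraps Entrez.efetch(...).read().

```python
from pathlib import Path


def cached_text(key, fetch, cache_dir="entrez_cache"):
    """Return cached text for key, calling fetch() only on a cache miss.

    key must be safe to use as a file name (for example "NM_007294_fasta").
    """
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{key}.txt"
    if path.exists():
        # Cache hit: no request is sent.
        return path.read_text(encoding="utf-8")
    text = fetch()
    path.write_text(text, encoding="utf-8")
    return text
```

The cache never expires in this sketch, so delete the file when you need a fresh copy of a record.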
Use structured queries
Improve search accuracy.
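One way to structure a query is to attach Entrez field tags such as [Gene Name] and [Organism] to each search value instead of passing free text. The helper below is a hypothetical convenience for building such terms, not a Biopython function. The resulting string is what you would pass as the term argument of Entrez.esearch.

```python
def build_term(clauses, operator="AND"):
    """Join (value, field) pairs into a fielded Entrez query string."""
    if operator not in {"AND", "OR"}:
        raise ValueError("operator must be 'AND' or 'OR'")
    parts = []
    for value, field in clauses:
        value = value.strip()
        # Quote multi-word values so they are searched as a phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return f" {operator} ".join(parts)


term = build_term([("BRCA1", "Gene Name"), ("Homo sapiens", "Organism")])
print(term)  # BRCA1[Gene Name] AND "Homo sapiens"[Organism]
```

Compared with the free-text term "BRCA1 human", the fielded version restricts each word to the field it belongs in, which usually returns fewer irrelevant hits.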
Respect API limits
Add delays between requests.
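For long ID lists, the delay can be combined with batching so that each request fetches many records at once. The following is a minimal sketch under stated assumptions: chunked and fetch_in_batches are hypothetical helpers, the batch size and delay are illustrative defaults, and fetch_batch stands for any function that takes a comma-separated ID string and returns the fetched text.

```python
import time


def chunked(ids, size):
    """Yield consecutive slices of ids, each with at most size items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_batches(ids, fetch_batch, batch_size=100, delay=0.4, sleep=time.sleep):
    """Fetch ids in batches, pausing between requests."""
    results = []
    for index, batch in enumerate(chunked(ids, batch_size)):
        if index:
            sleep(delay)  # pause before every request except the first
        results.append(fetch_batch(",".join(batch)))
    return results
```

Here fetch_batch could be a small function that calls Entrez.efetch(db="nucleotide", id=ids, rettype="fasta", retmode="text").read(). Biopython also throttles requests on its own, so the explicit delay is an extra safety margin rather than a strict requirement.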
Example Workflow
from Bio import Entrez
Entrez.email = "your_email@example.com"
# Step 1: search for matching nucleotide records
handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1 human"
)
record = Entrez.read(handle)
id_list = record["IdList"]
# Step 2: fetch the first hit as FASTA (assumes the search returned at least one ID)
handle = Entrez.efetch(
    db="nucleotide",
    id=id_list[0],
    rettype="fasta",
    retmode="text"
)
print(handle.read())
Conclusion
The Biopython Entrez module provides a powerful interface for accessing NCBI biological databases. It allows researchers to search, retrieve, and analyze vast amounts of genomic and biomedical data directly from Python.
Mastering Entrez is essential for bioinformatics workflows involving gene analysis, protein research, and scientific literature mining. It bridges the gap between online biological databases and Python-based data analysis.
In the next tutorial, we will explore protein structure analysis using Biopython’s PDB module.

