Biopython - Entrez Database

The Entrez system is a powerful search and retrieval system developed by NCBI (National Center for Biotechnology Information). It provides access to a vast collection of biological databases including DNA sequences, protein sequences, genome data, and scientific literature.

Biopython provides a module called Bio.Entrez that allows you to interact with the NCBI databases directly from Python. This makes it easy to search, retrieve, and analyze biological data programmatically.

In this tutorial, you will learn how to use the Entrez module in Biopython step by step.

What is Entrez?

Entrez is a unified database search system that connects multiple biological databases such as:

Nucleotide sequences
Protein sequences
Genome data
PubMed articles
Taxonomy data
Structure databases

It allows researchers to retrieve biological information using keywords, accession numbers, or identifiers.

Why Use Entrez in Biopython?

Biopython’s Entrez module helps you to:

Search biological databases
Retrieve DNA and protein sequences
Access scientific publications
Automate data collection
Integrate NCBI data into Python workflows

Important Requirement

NCBI requires you to identify yourself with an email address, so set it once before calling any Entrez function.

from Bio import Entrez

Entrez.email = "your_email@example.com"

NCBI uses this address to contact you about problematic usage before blocking access.
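One way to avoid hard-coding the address in every script is to read it from an environment variable. The sketch below assumes a variable named NCBI_EMAIL, which is a hypothetical name chosen for this example and not something NCBI or Biopython defines.

```python
import os


def get_entrez_email(env_var="NCBI_EMAIL"):
    """Return the contact email stored in an environment variable.

    NCBI_EMAIL is a hypothetical variable name used for illustration.
    """
    email = os.environ.get(env_var, "").strip()
    if "@" not in email:
        raise ValueError(f"Set {env_var} to a valid email address before using Entrez")
    return email


# Typical use (requires Biopython):
#   from Bio import Entrez
#   Entrez.email = get_entrez_email()
```

The check for "@" is only a basic sanity test; it does not validate the address.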

Available NCBI Databases

Some commonly used databases include:

Database	Description
nucleotide	DNA sequences
protein	Protein sequences
pubmed	Scientific literature
genome	Genome data
taxonomy	Organism classification

Searching NCBI Database

You can search a database by keyword with Entrez.esearch, which returns the IDs of matching records rather than the records themselves.

from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1"
)

record = Entrez.read(handle)

print(record)

Understanding Search Results

The search returns:

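A list of matching IDs (IdList)
The total number of results (Count)
Query information, such as how NCBI interpreted the search term (QueryTranslation)

Example output (abridged, with placeholder IDs):

{'IdList': ['12345', '67890'], 'Count': '250'}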
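Note that Count arrives as a string and that IdList holds only the IDs included in this response, not every match. A small sketch of summarizing a parsed record is shown here; summarize_search is a hypothetical helper, and the sample dictionary simply mirrors the example output above.

```python
def summarize_search(record):
    """Summarize a parsed ESearch record (a dict-like object)."""
    total = int(record["Count"])      # Count is returned as a string
    returned = len(record["IdList"])  # IDs included in this response
    return {"total": total, "returned": returned, "remaining": total - returned}


# Hypothetical record, shaped like the example output above
sample = {"IdList": ["12345", "67890"], "Count": "250"}
print(summarize_search(sample))
# {'total': 250, 'returned': 2, 'remaining': 248}
```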

Fetching Data from NCBI

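After searching, you can retrieve the full records with Entrez.efetch by passing one or more of the returned IDs. The ID below is a placeholder; substitute one from your own search results.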

from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.efetch(
    db="nucleotide",
    id="12345",
    rettype="fasta",
    retmode="text"
)

print(handle.read())

Fetching FASTA Sequences

handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="fasta",
    retmode="text"
)

print(handle.read())

Fetching GenBank Records

handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="gb",
    retmode="text"
)

print(handle.read())

Reading Structured Data

Entrez.read parses the XML that NCBI returns into nested Python dictionaries and lists:

from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.esearch(
    db="pubmed",
    term="cancer genomics"
)

record = Entrez.read(handle)

print(record["IdList"])

Searching PubMed Articles

handle = Entrez.esearch(
    db="pubmed",
    term="COVID-19 vaccine"
)

record = Entrez.read(handle)

print(record["IdList"])

Fetching PubMed Details

handle = Entrez.efetch(
    db="pubmed",
    id="12345678",
    rettype="medline",
    retmode="text"
)

print(handle.read())

Retrieving Multiple Records

ids = ",".join(["12345", "67890"])

handle = Entrez.efetch(
    db="nucleotide",
    id=ids,
    rettype="fasta",
    retmode="text"
)

print(handle.read())
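When the ID list is long, it is often more manageable to split it into batches and issue one efetch call per batch. The following is a minimal sketch; batch_ids is a hypothetical helper, and the batch size is an arbitrary choice for illustration rather than an NCBI requirement.

```python
def batch_ids(ids, batch_size=200):
    """Yield comma-separated ID strings, each holding at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


# Placeholder IDs for illustration only
example_ids = ["12345", "67890", "13579", "24680", "11111"]
for id_string in batch_ids(example_ids, batch_size=2):
    print(id_string)  # pass each string as id= in a separate efetch call
```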

Working with Taxonomy Database

handle = Entrez.esearch(
    db="taxonomy",
    term="Homo sapiens"
)

record = Entrez.read(handle)

print(record["IdList"])

Parsing Entrez XML Data

handle = Entrez.esearch(
    db="nucleotide",
    term="insulin"
)

record = Entrez.read(handle)

# Use a name other than "id" to avoid shadowing the Python built-in
for uid in record["IdList"]:
    print(uid)
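By default ESearch returns only the first 20 IDs, so a loop over IdList does not necessarily see every match. Here is a minimal sketch of computing the offsets needed to page through a larger result set; page_offsets is a hypothetical helper written for this tutorial, while retstart and retmax are standard ESearch parameters.

```python
def page_offsets(total, page_size=20):
    """Return the retstart values that cover `total` results page by page."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return list(range(0, total, page_size))


# With 45 matches and 20 IDs per page, three requests are needed
print(page_offsets(45))  # [0, 20, 40]

# Each offset would then be passed to a separate search, for example:
#   Entrez.esearch(db="nucleotide", term="insulin", retstart=offset, retmax=20)
```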

EFetch vs ESearch

Function	Purpose
esearch	Search a database and return the IDs of matching records
efetch	Retrieve the full records for given IDs

Entrez Workflow

Search (esearch)
     ↓
Get IDs
     ↓
Fetch Data (efetch)
     ↓
Parse Results
     ↓
Analyze Data

Important Rules for Using Entrez

1. Always set email

Entrez.email = "your_email@example.com"

2. Avoid too many requests

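NCBI allows at most 3 requests per second without an API key (10 per second with one) and may block clients that exceed these limits.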

3. Use proper delays for large queries

import time
time.sleep(1)
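A fixed sleep works, but it waits the full time even when the previous request was slow. As an alternative sketch, the small class below sleeps only as long as needed to keep a minimum interval between calls. Throttle is a hypothetical helper, and the 0.34 second interval is an assumption that corresponds to roughly three requests per second.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=0.34):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Sleep just long enough to honor min_interval, then record the time."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.34)  # roughly 3 requests per second
for accession in ["NM_007294"]:
    throttle.wait()
    # An Entrez.efetch(...) call for `accession` would go here
```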

Common Databases in Entrez

Database	Use Case
nucleotide	DNA analysis
protein	Protein research
pubmed	Scientific articles
genome	Genome mapping
structure	Protein structures

Real-World Applications

Genomics Research

DNA sequence retrieval
Gene annotation

Medical Research

Disease gene discovery
Cancer research

Drug Discovery

Protein analysis
Target identification

Academic Research

Literature review
Scientific data mining

Advantages of Entrez in Biopython

Direct access to NCBI databases
Automation of data retrieval
Integration with Python workflows
Supports multiple biological datasets
Useful for large-scale research

Limitations

Requires internet connection
API rate limits apply
Dependent on NCBI server availability
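Because requests can fail when the network or the NCBI servers are temporarily unavailable, scripts often wrap calls in a retry loop. The sketch below is one possible approach using exponential backoff; with_retries is a hypothetical helper, and it assumes failures surface as OSError, which is the base class of the urllib network errors.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on OSError with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Possible use with Biopython:
#   handle = with_retries(lambda: Entrez.efetch(
#       db="nucleotide", id="NM_007294", rettype="fasta", retmode="text"))
```

The sleep parameter exists so the delay can be replaced in tests; in normal use the default time.sleep is enough.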

Best Practices

Always use email identification

NCBI requires user identification.

Cache results locally

Avoid repeated requests.
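A simple way to cache is to store each downloaded record in a file and read it back on later runs. The following sketch assumes the key (for example an accession number) is safe to use as a file name; cached_fetch and the entrez_cache directory are hypothetical names chosen for this example.

```python
from pathlib import Path


def cached_fetch(key, fetch, cache_dir="entrez_cache"):
    """Return cached text for key, calling fetch(key) only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{key}.txt"
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(key)  # only reached when nothing is cached yet
    path.write_text(text, encoding="utf-8")
    return text


# Possible use with Biopython (fetch_fasta is a hypothetical helper):
#   def fetch_fasta(accession):
#       handle = Entrez.efetch(db="nucleotide", id=accession,
#                              rettype="fasta", retmode="text")
#       try:
#           return handle.read()
#       finally:
#           handle.close()
#   text = cached_fetch("NM_007294", fetch_fasta)
```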

Use structured queries

Improve search accuracy.
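Entrez lets you restrict each part of a query to a field by appending a tag in square brackets, such as [Gene Name] or [Organism]. A short sketch of assembling such a term is shown below; build_term is a hypothetical helper, and it only joins clauses with AND.

```python
def build_term(clauses):
    """Combine (value, field) pairs into one Entrez query joined with AND."""
    parts = []
    for value, field in clauses:
        # Quote multi-word values so they are searched as a phrase
        text = f'"{value}"' if " " in value else value
        parts.append(f"{text}[{field}]")
    return " AND ".join(parts)


term = build_term([("BRCA1", "Gene Name"), ("Homo sapiens", "Organism")])
print(term)  # BRCA1[Gene Name] AND "Homo sapiens"[Organism]
```

The resulting string can be passed as the term argument of Entrez.esearch.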

Respect API limits

Add delays between requests.

Example Workflow

from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1 human"
)

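record = Entrez.read(handle)
handle.close()

id_list = record["IdList"]

# Fetch the first hit only if the search returned any IDs
if id_list:
    handle = Entrez.efetch(
        db="nucleotide",
        id=id_list[0],
        rettype="fasta",
        retmode="text"
    )
    print(handle.read())
    handle.close()
else:
    print("No records found.")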

Conclusion

The Biopython Entrez module provides a powerful interface for accessing NCBI biological databases. It allows researchers to search, retrieve, and analyze vast amounts of genomic and biomedical data directly from Python.

Mastering Entrez is essential for bioinformatics workflows involving gene analysis, protein research, and scientific literature mining. It bridges the gap between online biological databases and Python-based data analysis.

In the next tutorial, we will explore protein structure analysis using Biopython’s PDB module.
