
Biopython Entrez Database Tutorial: Access NCBI Data Using Python

Biopython - Entrez Database

The Entrez system is a powerful search and retrieval system developed by NCBI (National Center for Biotechnology Information). It provides access to a vast collection of biological databases including DNA sequences, protein sequences, genome data, and scientific literature.

Biopython provides a module called Bio.Entrez that allows you to interact with the NCBI databases directly from Python. This makes it easy to search, retrieve, and analyze biological data programmatically.

In this tutorial, you will learn how to use the Entrez module in Biopython step by step.


What is Entrez?

Entrez is a unified database search system that connects multiple biological databases such as:

  • Nucleotide sequences
  • Protein sequences
  • Genome data
  • PubMed articles
  • Taxonomy data
  • Structure databases

It allows researchers to retrieve biological information using keywords, accession numbers, or identifiers.


Why Use Entrez in Biopython?

Biopython’s Entrez module helps you to:

  • Search biological databases
  • Retrieve DNA and protein sequences
  • Access scientific publications
  • Automate data collection
  • Integrate NCBI data into Python workflows

Important Requirement

Before using Entrez, NCBI requires you to provide an email address.

from Bio import Entrez

Entrez.email = "your_email@example.com"

NCBI uses this address to contact you about problematic usage before blocking access, so provide a real address that you monitor.
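
To avoid hard-coding the address in every script, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name NCBI_EMAIL and the helper contact_email are assumptions made for this example, not part of Biopython.

```python
import os


def contact_email(env=os.environ, var="NCBI_EMAIL"):
    """Return the contact email stored in an environment variable.

    The variable name NCBI_EMAIL is a hypothetical choice for this sketch.
    """
    email = env.get(var, "").strip()
    if "@" not in email:
        raise ValueError(f"Set {var} to a valid email address before using Entrez")
    return email


# Typical use, once the variable is set in your shell:
#   from Bio import Entrez
#   Entrez.email = contact_email()
```

Failing early with a clear message is usually easier to debug than a request that NCBI later rejects or throttles.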


Available NCBI Databases

Some commonly used databases include:

Database      Description
nucleotide    DNA sequences
protein       Protein sequences
pubmed        Scientific literature
genome        Genome data
taxonomy      Organism classification

Searching NCBI Database

Use esearch to search a database by keyword. It returns the IDs of matching records rather than the records themselves.

from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1"
)

record = Entrez.read(handle)
handle.close()

print(record)

Understanding Search Results

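The search returns:

  • List of matching IDs (IdList), limited to the first 20 by default
  • Total number of results (Count)
  • Query information, such as how NCBI interpreted the search term

Example output (abridged; the real record contains additional fields):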

{'IdList': ['12345', '67890'], 'Count': '250'}
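
Note that Count is returned as a string and that IdList usually holds only part of the full result set. A minimal sketch of a helper that makes this explicit is shown below; the function name summarize_search is made up for this example, and it works on any dict shaped like the output above.

```python
def summarize_search(record):
    """Summarize an esearch result that has already been parsed into a dict."""
    ids = list(record.get("IdList", []))
    total = int(record.get("Count", 0))  # Count arrives as a string
    return {
        "total": total,
        "returned": len(ids),
        "truncated": len(ids) < total,  # True when more results exist than IDs returned
    }


example = {"IdList": ["12345", "67890"], "Count": "250"}
print(summarize_search(example))
# {'total': 250, 'returned': 2, 'truncated': True}
```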

Fetching Data from NCBI

After searching, pass the returned IDs to efetch to retrieve the full records. The ID below is a placeholder; substitute one from your own search results.

from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.efetch(
    db="nucleotide",
    id="12345",
    rettype="fasta",
    retmode="text"
)

print(handle.read())
handle.close()

Fetching FASTA Sequences

handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="fasta",
    retmode="text"
)

print(handle.read())
handle.close()

Fetching GenBank Records

handle = Entrez.efetch(
    db="nucleotide",
    id="NM_007294",
    rettype="gb",
    retmode="text"
)

print(handle.read())
handle.close()

Reading Structured Data

To parse structured results:

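from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.esearch(
    db="pubmed",
    term="cancer genomics"
)

# Entrez.read parses the XML response into Python dicts and lists
record = Entrez.read(handle)
handle.close()

print(record["IdList"])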

Searching PubMed Articles

handle = Entrez.esearch(
    db="pubmed",
    term="COVID-19 vaccine"
)

record = Entrez.read(handle)
handle.close()

print(record["IdList"])

Fetching PubMed Details

handle = Entrez.efetch(
    db="pubmed",
    id="12345678",
    rettype="medline",
    retmode="text"
)

print(handle.read())
handle.close()

Retrieving Multiple Records

ids = ",".join(["12345", "67890"])

handle = Entrez.efetch(
    db="nucleotide",
    id=ids,
    rettype="fasta",
    retmode="text"
)

print(handle.read())
handle.close()
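
For long ID lists it is common to split the IDs into smaller groups and issue one efetch call per group. The sketch below shows only the batching step; the helper name id_batches and the default batch size of 200 are illustrative assumptions, not values prescribed by NCBI or Biopython.

```python
def id_batches(ids, batch_size=200):
    """Yield comma-separated ID strings with at most batch_size IDs each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = [str(i) for i in ids]
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


for batch in id_batches(["12345", "67890", "13579"], batch_size=2):
    print(batch)
# 12345,67890
# 13579
```

Each yielded string can be passed as the id argument of Entrez.efetch, exactly like the ids variable above.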

Working with Taxonomy Database

handle = Entrez.esearch(
    db="taxonomy",
    term="Homo sapiens"
)

record = Entrez.read(handle)
handle.close()

print(record["IdList"])

Parsing Entrez XML Data

handle = Entrez.esearch(
    db="nucleotide",
    term="insulin"
)

record = Entrez.read(handle)
handle.close()

# Use a name other than "id" to avoid shadowing the Python built-in
for uid in record["IdList"]:
    print(uid)
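
By default esearch returns only the first 20 IDs, so the loop above does not necessarily visit every match. The E-utilities parameters retstart and retmax can be used to page through larger result sets. The sketch below computes the retstart offsets only; the helper name page_offsets is made up for this example.

```python
def page_offsets(total, page_size=20):
    """Return the retstart values needed to page through total results."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return list(range(0, total, page_size))


print(page_offsets(45, page_size=20))
# [0, 20, 40]

# Each offset would then be used in its own request, for example:
#   Entrez.esearch(db="nucleotide", term="insulin", retstart=offset, retmax=20)
```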

EFetch vs ESearch

Function    Purpose
esearch     Search database
efetch      Retrieve data

Entrez Workflow

Search (esearch)
     ↓
Get IDs
     ↓
Fetch Data (efetch)
     ↓
Parse Results
     ↓
Analyze Data

Important Rules for Using Entrez

1. Always set email

Entrez.email = "your_email@example.com"

2. Avoid too many requests

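NCBI limits clients to 3 requests per second without an API key (10 per second with one) and may block clients that exceed these limits.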

3. Use proper delays for large queries

import time

time.sleep(1)  # pause between consecutive requests
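
A fixed sleep always waits the full duration, even when the previous request was made long ago. As an alternative, here is a minimal sketch of a throttle that sleeps only for the time still remaining. The class name Throttle and the 0.4 second default are assumptions chosen for this example to stay under 3 requests per second.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval=0.4):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Typical use: call throttle.wait() right before each Entrez request.
throttle = Throttle()
```

Bio.Entrez also spaces out its own requests, so a helper like this is mainly useful when you want a more conservative rate than the default.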

Common Databases in Entrez

Database      Use Case
nucleotide    DNA analysis
protein       Protein research
pubmed        Scientific articles
genome        Genome mapping
structure     Protein structures

Real-World Applications

Genomics Research

  • DNA sequence retrieval
  • Gene annotation

Medical Research

  • Disease gene discovery
  • Cancer research

Drug Discovery

  • Protein analysis
  • Target identification

Academic Research

  • Literature review
  • Scientific data mining

Advantages of Entrez in Biopython

  • Direct access to NCBI databases
  • Automation of data retrieval
  • Integration with Python workflows
  • Supports multiple biological datasets
  • Useful for large-scale research

Limitations

  • Requires internet connection
  • API rate limits apply
  • Dependent on NCBI server availability

Best Practices

Always use email identification

NCBI requires user identification.

Cache results locally

Avoid repeated requests.
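
One simple way to do this is to store each response in a file named after the request. The sketch below assumes the response is plain text (for example FASTA); the helper name cached_text, the directory name entrez_cache, and the key format are all illustrative choices, not part of Biopython.

```python
import hashlib
from pathlib import Path


def cached_text(key, fetch, cache_dir="entrez_cache"):
    """Return cached text for key, calling fetch() only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Hash the key so that any string can serve as a safe file name
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".txt"
    path = directory / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()
    path.write_text(text, encoding="utf-8")
    return text


# Typical use, where fetch wraps an Entrez.efetch call that returns text:
#   fasta = cached_text("nucleotide:NM_007294:fasta", fetch_function)
```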

Use structured queries

Improve search accuracy.
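
Entrez queries can restrict each search word to a specific field using square-bracket tags and can combine clauses with Boolean operators. Here is a minimal sketch of building such a term; the helper name build_term is made up for this example, and you should check which field tags the target database supports.

```python
def build_term(clauses, operator="AND"):
    """Join (value, field) pairs into a field-tagged Entrez query string."""
    parts = []
    for value, field in clauses:
        value = value.strip()
        if " " in value:
            value = f'"{value}"'  # quote multi-word phrases
        parts.append(f"{value}[{field}]")
    return f" {operator} ".join(parts)


term = build_term([("BRCA1", "Gene Name"), ("Homo sapiens", "Organism")])
print(term)
# BRCA1[Gene Name] AND "Homo sapiens"[Organism]
```

The resulting string can be passed as the term argument of Entrez.esearch.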

Respect API limits

Add delays between requests.


Example Workflow

from Bio import Entrez

Entrez.email = "your_email@example.com"

handle = Entrez.esearch(
    db="nucleotide",
    term="BRCA1 human"
)

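record = Entrez.read(handle)
handle.close()

id_list = record["IdList"]

if id_list:
    # Fetch the first matching record as FASTA text
    handle = Entrez.efetch(
        db="nucleotide",
        id=id_list[0],
        rettype="fasta",
        retmode="text"
    )
    print(handle.read())
    handle.close()
else:
    print("No records found")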
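
The same search-then-fetch pattern can be wrapped in a reusable function. The sketch below takes the Entrez module as a parameter so the logic can be tried without network access; the function name fetch_first_fasta and this parameter-passing design are assumptions made for illustration, not a Biopython convention.

```python
def fetch_first_fasta(entrez, term, db="nucleotide"):
    """Search db for term and return FASTA text for the first hit, or None.

    entrez is expected to behave like the Bio.Entrez module.
    """
    search_handle = entrez.esearch(db=db, term=term)
    try:
        record = entrez.read(search_handle)
    finally:
        search_handle.close()

    id_list = record["IdList"]
    if not id_list:
        return None  # nothing matched, so there is nothing to fetch

    fetch_handle = entrez.efetch(
        db=db, id=id_list[0], rettype="fasta", retmode="text"
    )
    try:
        return fetch_handle.read()
    finally:
        fetch_handle.close()


# Typical use:
#   from Bio import Entrez
#   Entrez.email = "your_email@example.com"
#   print(fetch_first_fasta(Entrez, "BRCA1 human"))
```

Returning None for an empty result set lets the caller decide how to handle a search with no matches.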

Conclusion

The Biopython Entrez module provides a powerful interface for accessing NCBI biological databases. It allows researchers to search, retrieve, and analyze vast amounts of genomic and biomedical data directly from Python.

Mastering Entrez is essential for bioinformatics workflows involving gene analysis, protein research, and scientific literature mining. It bridges the gap between online biological databases and Python-based data analysis.

In the next tutorial, we will explore protein structure analysis using Biopython’s PDB module.



