Key Biological Databases Every Researcher Should Know

Biological databases are essential tools in life sciences research, providing extensive collections of data on genes, proteins, and other biological molecules. This post outlines six of the databases that researchers use most often, focusing on their main features and practical applications.

1. GenBank

  • Overview: GenBank, maintained by the National Center for Biotechnology Information (NCBI), is a large nucleotide sequence database. It includes DNA sequences from a wide range of organisms, including viruses, bacteria, plants, and animals.

  • Key Features: GenBank offers a comprehensive collection of annotated sequences. Annotations mark features such as coding regions and regulatory elements, and each record links to related literature and other resources.
  • Practical Application: Researchers use GenBank to retrieve specific DNA sequences, compare them with sequences from other organisms, and analyze evolutionary relationships. For example, BLAST (Basic Local Alignment Search Tool) allows researchers to find similar sequences within the GenBank database.
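
To make the retrieval step concrete, here is a minimal Python sketch that builds a request for NCBI's E-utilities `efetch` endpoint and parses the FASTA text it returns. The function names are illustrative, not part of any NCBI library, and `U49845` is used only as an example accession. The `fetch_fasta` helper needs network access and ignores NCBI's usage guidelines (API keys, rate limits), so treat it as a starting point rather than production code.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for one record in the nucleotide database."""
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_fasta(accession):
    """Download one record as FASTA and parse it (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling `fetch_fasta("U49845")` should return a single `(header, sequence)` pair that can then be passed to an alignment tool such as BLAST.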

2. UniProt

  • Overview: The Universal Protein Resource (UniProt) is a major resource for protein sequence and functional information. It is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR).

  • Key Features: UniProt consists of three main components: UniProtKB (Knowledgebase), UniRef (Reference Clusters), and UniParc (Archive). UniProtKB is split into Swiss-Prot, whose entries are manually reviewed, and TrEMBL, whose entries are annotated automatically and have not yet been reviewed.
  • Practical Application: Researchers studying protein function and structure use UniProt to find information about protein sequences, domains, interactions, and post-translational modifications.
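
As a small illustration of working with UniProtKB records, the sketch below builds the URL for a single entry on the UniProt REST service and parses a UniProtKB FASTA header. The `sp` or `tr` prefix in the header shows whether the entry comes from Swiss-Prot or TrEMBL. The function names are made up for this example, and the parser assumes the usual `db|accession|entry_name` header layout.

```python
import re

UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"

# Two-letter tags used in UniProtKB FASTA headers (organism, taxon ID, gene, ...).
_FIELD = re.compile(r"\b(OS|OX|GN|PE|SV)=(.+?)(?=\s(?:OS|OX|GN|PE|SV)=|$)")


def build_entry_url(accession, fmt="fasta"):
    """Build the URL for one UniProtKB entry in the requested format."""
    return f"{UNIPROT_REST}/{accession}.{fmt}"


def parse_uniprot_header(header):
    """Split a UniProtKB FASTA header into its main parts."""
    identifier, _, description = header.lstrip(">").partition(" ")
    db, accession, entry_name = identifier.split("|")
    return {
        "reviewed": db == "sp",  # "sp" = Swiss-Prot, "tr" = TrEMBL
        "accession": accession,
        "entry_name": entry_name,
        "protein_name": re.split(r"\sOS=", description, maxsplit=1)[0],
        "fields": dict(_FIELD.findall(description)),
    }
```

For example, `parse_uniprot_header(">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2")` reports a reviewed entry with accession `P69905` and gene name `HBA1`.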

3. Protein Data Bank (PDB)

  • Overview: The Protein Data Bank (PDB) is a global repository for 3D structural data of biological molecules, such as proteins and nucleic acids. It is managed by the Worldwide Protein Data Bank (wwPDB) consortium.

  • Key Features: PDB provides 3D structures determined using methods like X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy. Each entry includes atomic coordinates, metadata, and experimental data.
  • Practical Application: Researchers use PDB to visualize the 3D structure of proteins and nucleic acids, which is important for understanding their function and interactions.
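
The atomic coordinates mentioned above can be read with a few lines of Python. The following sketch parses `ATOM` records from a legacy PDB-format file using the format's fixed column positions, collects the alpha carbons, and computes their centroid. It is a simplified illustration with hypothetical function names: it ignores alternate locations, multiple models, and the mmCIF format that the archive now uses as its standard, all of which a dedicated structure library would handle.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record using the fixed PDB column layout."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain_id": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def alpha_carbons(pdb_text):
    """Return the parsed alpha-carbon (CA) atoms found in ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atoms.append(parse_atom_line(line))
    return atoms


def centroid(atoms):
    """Return the mean (x, y, z) position of a non-empty list of atoms."""
    count = len(atoms)
    return tuple(sum(atom[axis] for atom in atoms) / count for axis in ("x", "y", "z"))
```

Given the text of a downloaded structure file, `centroid(alpha_carbons(pdb_text))` gives a rough center of the protein backbone, which is a common first step before measuring distances or aligning structures.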

4. Ensembl

  • Overview: Ensembl is a genome browser and database that provides detailed information on the genomes of vertebrates and other eukaryotic species. It is developed at the European Bioinformatics Institute (EBI) and began as a joint project with the Wellcome Trust Sanger Institute.

  • Key Features: Ensembl combines gene annotations, comparative genomics, variation data, and regulatory features in one place. It integrates data from various sources and provides a user-friendly interface for analyzing genomic information.
  • Practical Application: Ensembl is useful for researchers involved in comparative genomics or studying genetic variation. For example, researchers can explore genetic variants associated with diseases and compare them with variants in other species.
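
Ensembl data can also be queried programmatically through its public REST service. The sketch below builds URLs for two of its endpoints, a gene lookup by symbol and a variant lookup by identifier, and includes a small helper to download the JSON response. The function names are illustrative, the gene symbol and variant ID in the usage note are only examples, and `fetch_json` requires network access.

```python
import json
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def gene_lookup_url(symbol, species="homo_sapiens"):
    """Build the URL that looks up a gene by its symbol."""
    return f"{ENSEMBL_REST}/lookup/symbol/{species}/{symbol}?content-type=application/json"


def variant_url(variant_id, species="homo_sapiens"):
    """Build the URL that retrieves a known variant by its identifier."""
    return f"{ENSEMBL_REST}/variation/{species}/{variant_id}?content-type=application/json"


def fetch_json(url):
    """Download and decode a JSON response (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)
```

For instance, `fetch_json(gene_lookup_url("BRCA2"))` should return the gene's stable ID and genomic location, and `fetch_json(variant_url("rs699"))` should return the record for that variant.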

5. Gene Expression Omnibus (GEO)

  • Overview: The Gene Expression Omnibus (GEO) is a public repository for high-throughput gene expression data, including microarray and RNA-seq data. It is maintained by NCBI and is widely used in transcriptomics research.

  • Key Features: GEO provides access to a variety of gene expression datasets, including raw and processed data, experimental details, and metadata. It also offers tools for data visualization and analysis, such as GEO2R, which compares groups of samples within a dataset to identify differentially expressed genes.
  • Practical Application: GEO is commonly used by researchers studying gene expression patterns in various biological contexts. For example, a researcher can download a published dataset and run a differential expression analysis between treated and control samples.
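
To show what a differential expression comparison involves at its simplest, the sketch below ranks genes by log2 fold change between control and treated samples. This is a deliberately simplified stand-in and not the method GEO2R uses: GEO2R relies on established R packages such as limma, which also estimate statistical significance. The function names, the pseudocount, and the expression values in the usage note are all made up for illustration.

```python
import math
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """Return log2(treated mean / control mean), with a pseudocount to avoid zeros."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def rank_genes(expression, control_samples, treated_samples):
    """Rank genes by absolute log2 fold change, largest change first.

    `expression` maps gene name -> {sample name -> expression value}.
    """
    results = []
    for gene, values in expression.items():
        control = [values[sample] for sample in control_samples]
        treated = [values[sample] for sample in treated_samples]
        results.append((gene, log2_fold_change(control, treated)))
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: two genes measured in two control and two treated samples.
toy_expression = {
    "GENE_A": {"c1": 7, "c2": 7, "t1": 31, "t2": 31},
    "GENE_B": {"c1": 15, "c2": 15, "t1": 15, "t2": 15},
}
ranked = rank_genes(toy_expression, ["c1", "c2"], ["t1", "t2"])
```

Here `GENE_A` ranks first with a log2 fold change of 2.0 (a fourfold increase after the pseudocount), while `GENE_B` is unchanged. A real analysis would add normalization and a statistical test before drawing conclusions.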

6. KEGG (Kyoto Encyclopedia of Genes and Genomes)

  • Overview: KEGG is a resource for understanding biological systems, such as the cell, the organism, and the ecosystem, based on molecular-level information.

  • Key Features: KEGG provides databases for genes, proteins, and small molecules, with a focus on metabolic and signaling pathways. It includes graphical representations of these pathways and other cellular processes.
  • Practical Application: Researchers use KEGG to study metabolic pathways and model biological networks. For example, researchers studying a metabolic disorder might use KEGG to map the affected pathway and identify key enzymes involved.
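
Pathway listings like the ones described above can be retrieved from KEGG's REST service, which returns plain tab-delimited text. The sketch below builds request URLs for the `list` and `get` operations and parses a tab-delimited response into pairs. The function names are illustrative, the parser assumes one key and one value per line, and the code does not perform the download itself.

```python
KEGG_REST = "https://rest.kegg.jp"


def build_kegg_url(operation, *arguments):
    """Build a KEGG REST URL such as .../list/pathway/hsa or .../get/hsa00010."""
    return "/".join([KEGG_REST, operation, *arguments])


def parse_kegg_table(text):
    """Parse tab-delimited KEGG output into a list of (key, value) pairs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first tab only; descriptions may contain other characters.
        key, _, value = line.partition("\t")
        rows.append((key, value))
    return rows
```

For example, the text returned by `build_kegg_url("list", "pathway", "hsa")` lists human pathways one per line, and `parse_kegg_table` turns it into `(pathway_id, description)` pairs that can be searched for the pathway of interest.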

| Database | Focus | Key Features | Data Types | Use Cases |
|---|---|---|---|---|
| Ensembl | Genomic data for vertebrates and model organisms | Genome sequences, gene annotations, variation data, and comparative genomics | Genomes, genes, variants | Gene function, evolutionary studies, comparative genomics |
| Protein Data Bank (PDB) | 3D structures of proteins, nucleic acids, and complex assemblies | 3D structural data, molecular visualization, and detailed structural information | Protein structures, nucleic acids | Structural biology, drug design, protein function analysis |
| UniProt | Protein sequence and functional information | Comprehensive protein sequences, functional annotations, and protein family classifications | Protein sequences, functional data | Protein function, annotation, and classification |
| GenBank | Nucleotide sequences from various organisms | DNA and RNA sequences, annotations, and links to other databases | DNA sequences, RNA sequences | Gene discovery, sequence alignment, functional genomics |
| Gene Expression Omnibus (GEO) | Gene expression data from high-throughput experiments | Gene expression profiles, experimental metadata, and normalization methods | Gene expression data | Transcriptomics, gene expression studies, biomarker discovery |
| KEGG | Biological pathways and molecular interactions | Pathway maps, functional annotations, and integration with gene, protein, and compound data | Pathways, gene interactions | Pathway analysis, systems biology, drug development |


Biological databases like GenBank, UniProt, PDB, Ensembl, GEO, and KEGG are critical resources for researchers in life sciences. Together they support work ranging from sequence analysis to protein structure and gene expression studies. Knowing which database holds which kind of data, and how to query it, saves time and helps researchers build on existing results.
