Table I

Summary of the protein sequence databases that are commonly used in shotgun proteomic analysis

Database sizes and the number of sequences are given for the human subset of each database only. EBI, European Bioinformatics Institute; SIB, Swiss Institute of Bioinformatics.

| Database, date (version) | Number of sequences; size of file (human) | Description; source databases | Organisms | Release; update frequency; maintained by |
| --- | --- | --- | --- | --- |
| Uni-Prot/Swiss-Prot, 02/15/2005 | 11,898; 7.8 Mb | Expertly curated; high level of annotation; minimum level of redundancy; high level of integration with other databases. | Many | Release every 4 months; updates every 2 weeks; EBI, SIB, Georgetown University |
| Uni-Prot/TrEMBL, 02/15/2005 | 52,052; 23.3 Mb | Computer-annotated supplement to Uni-Prot/Swiss-Prot. Contains translated coding sequences from the GenBank nucleotide database and protein sequences extracted from the literature or submitted to Uni-Prot/Swiss-Prot but not yet manually curated. | Many | Release every 4 months; updates every 2 weeks; EBI, SIB, Georgetown University |
| RefSeq, 08/26/2004 (R 9) | 27,960; 17.7 Mb | Ongoing curation by NCBI staff; non-redundant; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. | Many | Release every ∼3 months; NCBI |
| Ensembl, 02/2005 (version 28-35a) | 33,860; 21.1 Mb | Created using an automated genome annotation pipeline; eukaryotic genomes only; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. Peptides identified by MS/MS can be mapped to the genome via the Ensembl Protein database and visualized using the Ensembl Genome Browser. | 16 organisms | Every 1–2 months; EBI and Wellcome Trust Sanger Institute |
| IPI, 02/2005 (version 3.03) | 48,953; 28.9 Mb | Good balance between degree of redundancy and completeness; references to the primary data sources; attempts to maintain stable identifiers (with incremental versioning), but still in flux. Assembled from Uni-Prot (Swiss-Prot + TrEMBL), RefSeq, Ensembl, and the H-Invitational database. | 5 organisms | Monthly; EBI |
| Entrez Protein (NCBInr), 02/17/2005 | 115,926; 58.5 Mb | More complete with regard to sequence polymorphisms and splice forms; annotations extracted from curated databases; high degree of sequence redundancy makes interpretation difficult. Assembled from GenBank and RefSeq coding sequence translations, Protein Information Resource (PIR), Protein Data Bank (PDB), Uni-Prot/Swiss-Prot, and Protein Research Foundation (PRF). | Many | Frequent updates; NCBI |
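
The two quantities reported in the second column (number of sequences and file size) can be checked for any locally downloaded database file. The following is a minimal sketch, assuming the database is stored in FASTA format, where each entry begins with a `>` header line. The function name `fasta_stats` and the two example entries are hypothetical and are used only for illustration; they are not taken from any of the databases listed above.

```python
import io


def fasta_stats(handle):
    """Return (number of entries, size in bytes) for a FASTA text stream.

    Assumes one '>' header line per sequence entry.
    """
    n_sequences = 0
    n_bytes = 0
    for line in handle:
        n_bytes += len(line.encode("utf-8"))
        if line.startswith(">"):
            n_sequences += 1
    return n_sequences, n_bytes


# Hypothetical two-entry FASTA content, used here instead of a real file.
example = io.StringIO(
    ">entry1 hypothetical protein A\n"
    "MKTAYIAK\n"
    "QRQISFVK\n"
    ">entry2 hypothetical protein B\n"
    "MSLLTEVE\n"
)

count, size = fasta_stats(example)
print(f"{count} sequences; {size / 1e6:.6f} MB")
```

For a real database file, the same function could be called on a handle opened with `open(path, encoding="utf-8")`. Counts obtained this way will differ from the values in the table for any release other than the ones listed.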