Getting More from Less

Algorithms for Rapid Protein Identification with Multiple Short Peptide Sequences*

Aaron J. Mackey§, Timothy A. J. Haystead and William R. Pearson**‡‡

Department of Microbiology, University of Virginia, Charlottesville, Virginia 22908

**Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, Virginia 22908

Department of Pharmacology, Duke University, Durham, North Carolina 27710

‡‡ Supported in part by Grant LM04969 from the National Library of Medicine, with additional support from the Compaq Computer Corporation. To whom correspondence should be addressed. Tel.: 434-924-2818; Fax: 434-924-5069; E-mail: wrp{at}virginia.edu.

Abstract

We describe two novel sequence similarity search algorithms, FASTS and FASTF, that use multiple short peptide sequences to identify homologous sequences in protein or DNA databases. FASTS searches with peptide sequences of unknown order, as obtained by mass spectrometry-based sequencing, evaluating all possible arrangements of the peptides. FASTF searches with mixed peptide sequences, as generated by Edman sequencing of unseparated mixtures of peptides. FASTF deconvolutes the mixture, using a greedy heuristic that allows rapid identification of high scoring alignments while reducing the total number of explored alternatives. Both algorithms use the heuristic FASTA comparison strategy to accelerate the search but use alignment probability, rather than similarity score, as the criterion for alignment optimality. Statistical estimates are calculated using an empirical correction to a theoretical probability. These calculated estimates were accurate within a factor of 10 for FASTS and 1000 for FASTF on our test dataset. FASTS requires only 15–20 total residues in three or four peptides to robustly identify homologues sharing 50% or greater protein sequence identity. FASTF requires about 25% more sequence data than FASTS for equivalent sensitivity, but additional sequence data are usually available from mixed Edman experiments. Thus, both algorithms can identify homologues that diverged 100 to 500 million years ago, allowing proteomic identification from organisms whose genomes have not been sequenced.
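The idea of searching with peptides of unknown order can be illustrated with a small toy example. The sketch below is not the FASTS implementation: it assumes ungapped placements, scores by raw identity counts rather than by alignment probability, and enumerates every peptide ordering exhaustively instead of using the FASTA heuristic. The function names and the example sequences are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def identity_scores(peptide, target):
    """Identity count for an ungapped placement of peptide at each target offset."""
    n_offsets = len(target) - len(peptide) + 1
    return [
        sum(a == b for a, b in zip(peptide, target[i:i + len(peptide)]))
        for i in range(max(n_offsets, 0))
    ]


def best_ordered_score(peptides, target):
    """Best total identities when peptides are placed left to right, no overlaps."""
    m = len(target)
    # prev[j]: best score for the peptides handled so far, confined to target[:j]
    prev = [0] * (m + 1)
    for pep in peptides:
        k = len(pep)
        scores = identity_scores(pep, target)
        cur = [float("-inf")] * (m + 1)
        for j in range(k, m + 1):
            placed = prev[j - k] + scores[j - k]  # this peptide ends exactly at j
            cur[j] = max(cur[j - 1], placed)      # or it ended somewhere earlier
        prev = cur
    return prev[m]


def best_unordered_score(peptides, target):
    """Try every arrangement of the peptides (toy stand-in for unknown order)."""
    return max(best_ordered_score(order, target) for order in permutations(peptides))


# Hypothetical example data: three short peptides reported in arbitrary order.
target = "MKTAYIAKQRQISFVKSHFSRQ"
peptides = ["SFVK", "KTAY", "AKQR"]
best = best_unordered_score(peptides, target)
```

Because the number of orderings grows factorially with the number of peptides, this brute-force enumeration is only practical for the three or four peptides mentioned in the abstract. A real search would also need a statistical measure, such as the alignment probability the authors describe, to compare candidates across database sequences of different lengths.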

Footnotes

  • Published, MCP Papers in Press, December 12, 2001, DOI 10.1074/mcp.M100004-MCP200

  • 1 The abbreviations used are: MS, mass spectrometry; MS/MS, tandem mass spectrometry.

  • 2 M.-Q. Huang and W. R. Pearson, manuscript in preparation.

  • * The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

    The on-line version of this article (available at http://www.mcponline.org) contains Supplemental Material.

  • § Supported by Grant T32AI07046 from the National Institutes of Health.

  • Supported by Grants HL19242-24 and DK52378-04 from the National Institutes of Health.

    • Received August 7, 2001.
    • Revision received November 13, 2001.