pepDESC: A method for the detection of differentially expressed proteins for mass spectrometry-based single-cell proteomics using peptide-level information
College of Chemistry and Molecular Engineering, Peking University, 100871 Beijing, China; Beijing Advanced Innovation Center for Genomics, Peking University, 100871 Beijing, China; Biomedical Pioneering Innovation Center, Peking University, 100871 Beijing, China
• pepDESC performs differential expression analysis for single-cell proteomics data.
• pepDESC uses peptide-level quantification signals to improve analysis results.
• pepDESC performs well on low-input and regular-size benchmark datasets.
• pepDESC is compatible with different search engines and various workflows.
• pepDESC is available as an R package.
Abstract
Single-cell proteomics, as an emerging field, has shown potential in revealing cellular heterogeneity at the functional level. However, accurate interpretation of single-cell proteomics data suffers from challenges such as measurement noise, internal heterogeneity, and the limited sample size of label-free quantitative mass spectrometry. Herein, the author describes pepDESC, a method for detecting differentially expressed proteins using peptide-level information, designed for label-free quantitative mass spectrometry-based single-cell proteomics. While this study focuses on heterogeneity among a limited number of samples, pepDESC is also applicable to regular-size proteomics data. By balancing proteome coverage and quantification accuracy using peptide quantification, pepDESC is demonstrated to be effective on real-world single-cell and spike-in benchmark datasets. By applying pepDESC to a published single-cell mouse macrophage dataset, the author found a large fraction of differentially expressed proteins among three types of cells, which revealed distinct dynamics of different cellular functions responding to lipopolysaccharide stimulation.
Single-cell proteomic measurements, which are capable of accessing cell identities at the functional level, have demonstrated potential in providing insights into cellular activities and pathological mechanisms. Techniques designed for mass spectrometry (MS)-based single-cell proteomics have witnessed tremendous progress in recent years. However, attention to quantification accuracy in this domain remains insufficient. In fact, analyzing single-cell MS-based proteomics data, for example, to detect the differential expression of proteins between two cell types, is quite challenging because of the low signal-to-noise ratio. Besides the relatively high measurement noise, the large fraction of missing values hinders the accurate interpretation of data. While label-free quantification has, to some extent, avoided the limitation due to the carrier proteome effect of isobaric-labeling quantification, it is constrained by the limited sample size, which is technically a major challenge concomitant to label-free quantitative MS and an inevitable difficulty when it comes to scarce samples.
Methods designed for analyzing bulk proteomics data are not always suitable for single-cell proteomics data. For instance, in most cases, proteins with only one reliable peptide would be removed to avoid false identification. However, applying such a stringent criterion to single-cell data would sacrifice proteome coverage when the number of quantified proteins is already very small. At the same time, although single-cell transcriptomics techniques have provided well-established tools for differential expression analysis, they may not be optimal choices for proteomics analysis. This is because of the hierarchical structure of bottom-up MS-based proteomics data, where protein abundances rely on aggregating peptide-level or peptide-spectrum match (PSM)-level information. However, whether existing methods are applicable to single-cell proteomics data is still in question, and suitable statistical tools for single-cell proteomics data are still required.
Herein, the author introduces pepDESC (PEPtide-level Differential Expression analysis for Single-Cell proteomics), a method that calculates differential expression scores at the peptide level according to the nature of single-cell proteomics data. This tool directly uses peptide-level results, is compatible with various search engines, and is available as an R package. Several commonly used statistical methods and peptide-based methods were used for comparison to demonstrate the strengths of pepDESC on low-input MS data as well as regular MS data. Moreover, the author applied pepDESC to a published single-cell experiment to demonstrate its performance on real-world data.
Experimental procedures
Experimental Design and Statistical Rationale
Three datasets were used to validate the performance of pepDESC, namely, D1 (a mixed single-cell dataset), D2 (a low-input spike-in dataset), and D3 (a regular-size spike-in dataset). Dataset D1 contains 10 biological replicates in each sample group. The spike-in datasets D2 and D3 contain seven and four technical replicates in each sample group, respectively. The group sizes were chosen to reflect the current sample size of label-free quantitative MS-based proteomics experiments. The sample preparation of dataset D2, the database searching of the three datasets, and the design of dataset D1 are described in detail below.
Sample preparation of D2 low-input spike-in dataset
The Escherichia coli DH5α cell pellet was lysed by sonication using a Covaris M220 in a buffer containing 8 M urea and 100 mM TEAB. The protein amount was measured using a micro BCA protein assay kit (23235, ThermoFisher). The sample was reduced with 0.25 M DTT and alkylated with 8 mM IAA. Trypsin (V5280, Promega) was then added at a 1:50 enzyme-to-protein ratio in 100 mM TEAB. The sample was acidified with 1% TFA before being loaded onto the SPE micro-column. The peptides were eluted after two washes and dried in a SpeedVac. Finally, the E. coli digest was reconstituted in a buffer of 0.1% TFA and 1% ACN. The HeLa digest (88329, ThermoFisher) was reconstituted in 0.1% TFA and 1% ACN. The HeLa digest and the E. coli digest were mixed at a total mass ratio of 94:6 or 97:3.
LC-MS/MS analysis of D2 low-input spike-in dataset
For each sample group, seven technical replicates were measured to mimic the sample size of current label-free single-cell proteomics data. First, 120 pg of digest was injected and separated on a commercial chromatography column (Aurora, IonOpticks) by nanoflow liquid chromatography (Ultimate 3000 RSLCnano, ThermoFisher) at a flow rate of 100 nL/min. Mobile phase A was 0.1% FA in 2% acetonitrile, while mobile phase B was 0.1% FA in 80% acetonitrile. The 70-min LC gradient was as follows: 5%–6.2% B for 2 min, 6.2%–31.2% B for 40 min, 31.2%–42.5% B for 16 min, 42.5%–99% B for 5 min, and then isocratic at 99% B for 10 min. Label-free quantification was performed using a tribrid mass spectrometer (Orbitrap Eclipse, ThermoFisher) with an ion mobility interface (FAIMS Pro, ThermoFisher). FAIMS compensation voltages of −55 V and −70 V were used with a cycle time of 1 s. The MS spectra were collected by the Orbitrap analyzer, and the MS2 spectra were collected by the linear ion trap analyzer with a maximum ion injection time of 200 ms.
Database searching of the datasets D1, D2, and D3
Database searching was conducted with Proteome Discoverer 2.4 (ThermoFisher) using Sequest. For the mixed single-cell dataset, the 293T cell samples and the mouse oocyte samples were searched separately against the UniProt human protein database (20,286 entries, downloaded on April 14, 2020) and the UniProt mouse database (17,015 entries, downloaded on July 31, 2020), respectively. Datasets D2 and D3 were searched against the UniProt human protein database (20,286 entries, downloaded on April 14, 2020) concatenated with the UniProt E. coli database (4,349 entries, downloaded on July 15, 2019). An in-house curated contamination database in Proteome Discoverer 2.4 (ThermoFisher) containing 284 entries was also included in each analysis. Carbamidomethylation of cysteine residues was set as a static modification. Dynamic modifications included acetylation of N-termini, methionine loss at N-termini, and oxidation of methionine residues. The mass tolerance was 10 ppm for precursor ions and 0.6 Da for fragment ions. At most two tryptic missed cleavages were allowed. PSMs and proteins were both filtered at 1% FDR.
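The two mass tolerances above behave differently: a ppm tolerance scales with m/z, whereas a Da tolerance is constant. The Python sketch below only illustrates that arithmetic. The function names are hypothetical and are not part of Proteome Discoverer or Sequest.

```python
def ppm_window(mz, ppm):
    """Return the (low, high) m/z bounds of a relative tolerance given in ppm."""
    delta = mz * ppm / 1e6  # absolute half-width of the window at this m/z
    return mz - delta, mz + delta


def within_precursor_tolerance(observed_mz, theoretical_mz, ppm=10.0):
    """True if the observed precursor m/z lies inside the ppm window."""
    low, high = ppm_window(theoretical_mz, ppm)
    return low <= observed_mz <= high


def within_fragment_tolerance(observed_mz, theoretical_mz, tol_da=0.6):
    """True if the observed fragment m/z lies within a fixed Da tolerance."""
    return abs(observed_mz - theoretical_mz) <= tol_da
```

At m/z 500, a 10 ppm tolerance corresponds to ±0.005, which is far tighter than the 0.6 Da fragment tolerance used for the ion trap MS2 spectra.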
For D1, the human dataset of 20 293T cells contains 1,409 high-confidence master proteins and 7,694 high-confidence peptides, while the mouse dataset of 20 mouse oocytes contains 3,800 high-confidence master proteins and 32,098 high-confidence peptides.
For D2, the result contains 1,409 high-confidence master proteins and 5,726 high-confidence peptides. For D3, the result contains 4,967 high-confidence master proteins and 29,117 high-confidence peptides.
Design of D1 mixed single-cell benchmark dataset
To design a benchmark dataset with a known set of differentially expressed proteins while retaining the characteristics of single-cell proteomics data, a few modifications were made to the dataset D1. First, among the 1,409 human proteins and the 3,800 mouse proteins identified by the search engine, 200 proteins in the 20 293T cells and 1,000 proteins in the 20 mouse oocytes were randomly selected, while the remaining proteins were removed. Next, the human cells with 200 proteins and the mouse cells with 1,000 proteins were combined into 20 samples, each containing the expression of 1,200 proteins from a particular 293T cell and a particular mouse oocyte. To create a known differentially expressed protein set, the abundances of the 200 human proteins in 10 samples were numerically halved. That is, the dataset D1 contains two sample groups of 10 biological replicates with 200 differentially expressed proteins (marked as human proteins) and 1,000 stable proteins (marked as mouse proteins).
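The construction of D1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's script: the function name, the dictionary layout, the fixed random seed, and the choice of halving the first `n_halved` samples are all hypothetical.

```python
import random


def build_mixed_benchmark(human, mouse, n_human=200, n_mouse=1000,
                          n_halved=10, seed=0):
    """Combine human and mouse single-cell tables into one benchmark table.

    human / mouse: dict mapping protein name -> list of abundances, one value
    per cell; cell k of each table is paired into combined sample k.
    Returns (combined, truth), where truth[protein] is True for proteins that
    are differentially expressed by construction.
    """
    rng = random.Random(seed)
    human_keep = rng.sample(sorted(human), n_human)  # random protein subset
    mouse_keep = rng.sample(sorted(mouse), n_mouse)

    combined = {}
    for protein in human_keep:
        # Assumption: the first n_halved samples form the "halved" group.
        combined[protein] = [value / 2 if k < n_halved else value
                             for k, value in enumerate(human[protein])]
    for protein in mouse_keep:
        combined[protein] = list(mouse[protein])  # mouse proteins stay stable

    truth = {protein: True for protein in human_keep}
    truth.update({protein: False for protein in mouse_keep})
    return combined, truth
```

With the defaults, 20 paired cells yield 20 samples of 1,200 proteins each, with 200 true positives and 1,000 true negatives.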
Applying different statistical methods to datasets D1, D2 and D3
In this section, the abundance of peptides or proteins is denoted by the letter X, where X_N denotes the abundances of a peptide or protein in sample group N.
Statistical methods based on protein abundances
High-confidence master proteins were normalized by the median value of each sample group, with missing values filled with 0s. These data were used for Student’s t-test, Wilcoxon test and the Limma method. The statistics for ordinary Student’s t-test are as follows:
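$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \tag{1}$$
$$s_p = \sqrt{\frac{(n_1 - 1)\, s_{X_1}^2 + (n_2 - 1)\, s_{X_2}^2}{n_1 + n_2 - 2}} \tag{2}$$
where $n_1$ and $n_2$ are the sample sizes of the two groups of samples, $\bar{X}_N$ is the mean abundance in sample group N, and $s_{X_N}^2$ is the sample variance in sample group N.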
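As a worked illustration of the ordinary (pooled-variance) Student's t statistic, the following Python sketch computes it with the standard library. It is not the code used in the study, and the function name is hypothetical. Note that R's t.test() uses the Welch variant unless var.equal = TRUE is passed; which setting was used here is not stated.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x1, x2):
    """Return (t, degrees_of_freedom) for the equal-variance two-sample t-test.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    n1, n2 = len(x1), len(x2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    t = (mean(x1) - mean(x2)) / standard_error
    return t, n1 + n2 - 2
```

The p-value would then follow from the t distribution with `n1 + n2 - 2` degrees of freedom, which the sketch leaves out to stay within the standard library.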
The two-sided Wilcoxon test (also known as the Mann–Whitney–Wilcoxon test) was based on the ranks of the data points in the two sample groups. The statistics, denoted as U, for the two sample groups are:
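$$U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 \tag{3}$$
$$U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2 \tag{4}$$
where $n_1$ and $n_2$ are the sample sizes of the two groups of samples, and $R_1$ and $R_2$ are the sums of ranks of the two groups of samples.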
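The rank-sum computation can be sketched as follows. This Python example uses one common convention for U, in which U1 + U2 = n1·n2, and assigns average ranks to ties. Ties matter here because missing values filled with 0s all share the same rank. The function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def mann_whitney_u(x1, x2):
    """Return (U1, U2) computed from the rank sums R1 and R2."""
    n1, n2 = len(x1), len(x2)
    ranks = average_ranks(list(x1) + list(x2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2
```

For two fully separated groups such as `[1, 2, 3]` and `[4, 5, 6]`, one of the two statistics is 0 and the other equals n1·n2.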
Limma was performed based on the empirical-Bayesian prior variance:
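$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\text{posterior}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \tag{5}$$
where $s_{\text{posterior}}$ was derived by the eBayes() function from the Limma package.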
Student’s t-test was performed with the t.test() function from the “stats” package. The Wilcoxon test was performed with the wilcox.test() function from the “stats” package. Limma was performed with the “Limma” package.
Statistical methods based on peptide abundances
For high-confidence unique peptides, the missing values were filled with 0s. These data were used for PECA, DEqMS, and pepDESC.
DEqMS only works for peptides whose minimum abundance is greater than zero. Each sample group was normalized by its median value after log-transformation of the data. DEqMS applies the core method of Limma to the peptide abundances and adjusts the statistics according to peptide counts, which is accomplished by the spectraCounteBayes() function in the “DEqMS” package.
PECA was performed by the one-line function PECA() from the “PECA” package. The PECA method calculates the statistics of Student’s t-test for all the peptides from a protein to discover differentially expressed proteins. Normalization in PECA was included for the analysis of datasets D2 and D3. No normalization was used for dataset D1, since group-wise normalization is not compatible with this method.
The first step of pepDESC is to filter out contamination peptides from the contamination protein database, peptides with missing values in over 60% of cells (or a customized threshold), and peptides with peaks identical to those of contamination peptides (retention time difference < 0.05 and m/z difference < 2). Normalization is done after removing outlier samples. pepDESC adopts median normalization by default; mean normalization is adopted instead under user-defined settings or when any sample has over 50% missing values. The mathematical details of the DE-scores, which mark the confidence that a protein or a peptide is changing, are given by the following equations.
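The filtering and normalization rules described above can be sketched in Python as follows. This is an illustration under assumptions, not the pepDESC implementation (which is an R package): the data layout, function names, and the encoding of missing values as 0 are hypothetical, and the units of the retention time tolerance are left as given in the text.

```python
from statistics import mean, median


def missing_fraction(abundances):
    """Fraction of values that are missing (encoded here as 0)."""
    return sum(1 for value in abundances if value == 0) / len(abundances)


def looks_like_contaminant(peak, contaminant_peaks, rt_tol=0.05, mz_tol=2.0):
    """True if a (retention_time, mz) peak matches any contaminant peak."""
    rt, mz = peak
    return any(abs(rt - c_rt) < rt_tol and abs(mz - c_mz) < mz_tol
               for c_rt, c_mz in contaminant_peaks)


def filter_peptides(peptides, contaminant_peaks, max_missing=0.6):
    """Drop contaminants, sparse peptides, and contaminant look-alikes.

    peptides: dict name -> {"abundances": [...], "peak": (rt, mz),
                            "contaminant": bool}
    """
    kept = {}
    for name, peptide in peptides.items():
        if peptide["contaminant"]:
            continue
        if missing_fraction(peptide["abundances"]) > max_missing:
            continue
        if looks_like_contaminant(peptide["peak"], contaminant_peaks):
            continue
        kept[name] = peptide
    return kept


def choose_normalizer(sample_columns, max_sample_missing=0.5):
    """Return median by default, or mean if any sample is mostly missing."""
    if any(missing_fraction(column) > max_sample_missing
           for column in sample_columns):
        return mean
    return median
```

One plausible reason for the fallback, under the assumption that missing values are stored as 0, is that the median of a sample with more than half of its values missing would itself be 0 and could not serve as a scaling factor.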
When the first sample group has M cells while the second sample group has N cells, denote the abundance of peptide i in different cells to be and .
The adjusted DE-score of a peptide is derived from the pairwise ratios of the peptide abundance between the two sample groups. A peptide DE-score whose absolute value was higher than 1.5 was set to 1.5 in Equation (6).
(6)
(7)
where the second term denotes the fidelity of a peptide based on its expression level and its correlation with the other peptides of this protein (Equation (8)). The third term describes whether the peptide is significantly different between the sample groups.
(8)
r_ij is the Pearson correlation coefficient of two peptides if they are positively correlated:
(9)
The weight coefficients were then normalized among peptides:
(10)
Finally, the DE-score of a protein is derived based on all the adjusted DE-scores of the belonging peptides:
(11)
(12)
When a protein is quantified based on a single peptide, the DE-score is given by Equation (13):
(13)
In Equations (12) and (13), p_i denotes the p-value of the Wilcoxon test for peptide i. The maximum allowed p-value was set to 0.05 by default, yet a customized setting is allowed in pepDESC. All the adjustable parameters mentioned above were set to their default values for the analysis of D1, D2, and D3.
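Two ingredients of the scoring scheme described above, a capped score built from pairwise ratios and a correlation weight that only counts positive correlations, can be illustrated with the Python sketch below. It is emphatically not the pepDESC formula: the use of the mean log2 ratio, the skipping of pairs that contain a missing value, and the mapping of non-positive correlations to 0 are all assumptions made for illustration, and both function names are hypothetical.

```python
import math


def capped_pairwise_score(group1, group2, cap=1.5):
    """Mean log2 ratio over all cross-group sample pairs, clipped to +/- cap.

    Assumption: pairs in which either abundance is 0 (missing) are skipped.
    A positive score means higher abundance in group2.
    """
    log_ratios = [math.log2(b / a)
                  for a in group1 for b in group2 if a > 0 and b > 0]
    if not log_ratios:
        return 0.0
    score = sum(log_ratios) / len(log_ratios)
    return max(-cap, min(cap, score))


def positive_pearson(x, y):
    """Pearson correlation of two peptides, with non-positive values set to 0."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return max(0.0, covariance / math.sqrt(var_x * var_y))
```

Under these assumptions, a peptide that doubles between groups scores 1.0, a 16-fold change is capped at 1.5, and an anti-correlated peptide pair contributes a weight of 0.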
Evaluating the performance of different statistical methods
The precision–recall curves were plotted using the R package “ROCR”. The number of true positive discoveries refers to the number of E. coli proteins (for D2 and D3) or human proteins (for D1) identified as changing by each statistical method. The precision refers to the ratio of true positive discoveries to the total positive discoveries.
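The evaluation metrics can be sketched in Python as follows. The study used the R package “ROCR”; this standalone example only illustrates the definitions. It assumes that proteins are ranked by a score where higher means more confident (for example, the absolute DE-score or the negated p-value) and that the F-score is the maximum F1 along the curve, which the text does not specify. Tied scores are treated as separate cut-offs for simplicity.

```python
def precision_recall_points(scores, is_true):
    """Return (true_positives, precision, recall) at every rank cut-off."""
    total_true = sum(is_true)
    ranked = sorted(zip(scores, is_true), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for k, (_, truth) in enumerate(ranked, start=1):
        if truth:
            true_positives += 1
        points.append((true_positives, true_positives / k,
                       true_positives / total_true))
    return points


def best_f_score(points):
    """Maximum F1 (harmonic mean of precision and recall) over all cut-offs."""
    return max(2 * p * r / (p + r) if p + r > 0 else 0.0 for _, p, r in points)


def max_true_positives_at_precision(points, min_precision=0.9):
    """Largest number of true positives at any cut-off with precision > min_precision."""
    return max((tp for tp, p, _ in points if p > min_precision), default=0)
```

For scores `[0.9, 0.8, 0.7, 0.6, 0.5]` with truth labels `[True, True, False, True, False]`, the best F1 is reached at the fourth cut-off (precision 0.75, recall 1.0), while only two true positives are available at a precision above 0.9.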
Applying pepDESC to single-cell proteomics result of mouse macrophages
The single-cell proteomics data of 164 mouse macrophages was downloaded via ProteomeXchange on March 22, 2022. The peptide result containing 31,391 unique peptides was used as the input data for pepDESC. The protein result containing 1,979 proteins was used as a reference. The application of pepDESC comprised two separate analyses: the analysis between the control group (abbreviated as CON) and the 24 h stimulation group (abbreviated as LPS24), and the analysis between LPS24 and the 48 h stimulation group (abbreviated as LPS48). Contamination peptides denoted by the search result and peptides with more than 80% missing values were removed in each analysis, as were outlier samples. Mean normalization within each group was applied, since a high fraction of missing values existed in several samples. Peptides whose accessions start with “CON” or “REV” were set as contamination peptides.
Pathway enrichment was conducted using the “Analyse gene list” function in the Reactome knowledgebase resource.
Building the peptide-level differential expression analysis method based on the nature of single-cell proteomics data
An optimal method for single-cell proteomics analysis needs to balance quantification accuracy and proteome coverage. While relaxed data filtration negatively impacts the reliability of the quantification result, strict data filtration limits the depth of the data. Therefore, processing the data at the peptide level would theoretically improve the overall performance (Fig. 1a). Based on the quantification results of single-cell MS measurements generated by a search engine, the author built pepDESC, in which a DE-score quantifies the result of differential expression analysis. pepDESC includes three major steps: data filtration, peptide DE-score calculation, and protein DE-score calculation (Fig. 1b).
Figure 1. Design of pepDESC, a method to discover differential expression of the proteome at the peptide level for single-cell proteomics. a. One reason to use peptide-level information for single-cell proteomics data. The lines show the quantification results of peptides, while the circles show the quantification results of proteins. Red indicates that a peptide or a protein is incorrectly quantified, while yellow indicates that a peptide or a protein is correctly quantified. In bulk proteomics (top), a protein is usually quantified by a number of peptides, so the accuracy of the quantification result is barely affected by one falsely quantified peptide. In single-cell proteomics (bottom), many proteins are quantified by a limited number of peptides, and one falsely quantified peptide has a non-negligible effect on the final result, resulting in a sacrifice in coverage or in accuracy. In this case, processing data at the peptide level has the potential to remove falsely quantified peptides to ensure the accuracy of the statistical result. b. The workflow of pepDESC illustrated by protein Z, an example protein with four quantified peptides. pepDESC is composed of four steps: data initialization, data filtration, peptide DE-score calculation, and protein DE-score calculation. In the first step, information regarding the four peptides is extracted from the input search result. In the second step, “untrusted” data, which in this case refers to peptide D, which had missing values in five samples, is identified and removed. In the third step, peptide DE-scores are calculated based on the pairwise ratios between peptide abundances in the two types of samples. However, since the expression of peptide A is not significantly different between the two groups, the peptide DE-score of peptide A is 0. Finally, protein DE-scores are derived from the adjusted peptide DE-scores, considering the expression levels and correlations of the three peptides.
At the beginning, two types of “untrusted” data need to be removed. The first type refers to contamination peptides assigned by the search engine and peptides that are highly suspected to be contaminants. The second type refers to peptides with a large fraction of missing values.
The difference in each peptide was calculated based on the pairwise ratios between each sample pair from the two sample groups, which was then defined as the peptide DE-score. A higher peptide DE-score means a higher magnitude of difference. However, owing to the stochasticity in gene expression, single-cell proteomics data is naturally noisier than bulk proteomics data, leading to internal heterogeneity among identical cells. Therefore, the author further measured the statistical significance of the abundance differences to identify DE-scores that are merely caused by internal heterogeneity.
When peptide DE-scores are aggregated into protein DE-scores, their credibility should be evaluated again. Falsely identified peptides could occur during database searching. Traditionally, when the sum of the peptide abundances is used as the abundance of a protein, the weight coefficient of each peptide depends solely on its expression level. However, in single-cell proteomics, where the number of peptides is limited for each protein, the impact of falsely quantified high-abundance peptides needs to be further reduced. As a protein is usually quantified by several different peptides, whether a peptide truthfully indicates the abundance of the protein can be verified against the other peptides. Thus, the reliability of peptide DE-scores can be informed by the correlation of the peptides belonging to the same protein. Therefore, the author calculated the adjusted peptide DE-scores considering the abundances as well as the pairwise abundance correlations, and determined the protein DE-scores considering all adjusted DE-scores of the belonging peptides, as described in the experimental procedures section.
Performance on mixed single-cell data reveals the strengths of pepDESC
The author first tested pepDESC using a mixed single-cell dataset, addressed as dataset D1, which was adapted from a label-free single-cell proteomics experiment on homogeneous human 293T cells and mouse MII oocytes. This dataset comprises 20 samples, each containing 200 proteins from a human 293T cell and 1,000 proteins from a mouse oocyte. Ten of them contain 1,000 proteins from 10 different mouse oocytes and 200 proteins from 10 different human 293T cells, while the other 10 samples contain 1,000 proteins from 10 different mouse oocytes and 200 proteins from 10 different 293T cells with numerically halved abundances (“half-293T cells”), as described in the experimental procedures section. This benchmark dataset represents a single-cell proteomics dataset with internal heterogeneity and external differences. In this scenario, an ideal method should identify all 200 human proteins and no mouse protein as differentially expressed.
To illustrate the performance of pepDESC, the author first compared the protein DE-scores with the results of the commonly used Student’s t-test. pepDESC found 140 differentially expressed human proteins with 33 falsely identified mouse proteins (absolute value of protein DE-score > 0.3), while Student’s t-test found 118 differentially expressed human proteins and 48 falsely identified mouse proteins (p-value < 0.05). Studying the proteins for which the two methods gave different results showed that some error-prone steps in traditional protein quantification could be circumvented using pepDESC.
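Using the definition of precision from the experimental procedures, and taking recall as the fraction of the 200 human proteins recovered, the counts above translate into the following values at these fixed thresholds:

```latex
\begin{align*}
\text{precision}_{\text{pepDESC}} &= \frac{140}{140 + 33} \approx 0.81, &
\text{recall}_{\text{pepDESC}} &= \frac{140}{200} = 0.70, \\
\text{precision}_{t\text{-test}} &= \frac{118}{118 + 48} \approx 0.71, &
\text{recall}_{t\text{-test}} &= \frac{118}{200} = 0.59.
\end{align*}
```

These single-threshold figures are a spot check only and differ from the threshold-free comparison given by the precision–recall curves.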
For real-world single-cell proteomics data, the most common challenge during differential expression analysis is dealing with the large fraction of missing values. Of the 140 proteins falsely identified by Student’s t-test, 54 were affected by peptides with missing values over 60%. Although summing the abundances over several peptides could generally alleviate this problem, for proteins with a limited number of quantified peptides, these inaccurately quantified peptides would affect the quantitative results of the proteins. For example, one peptide (peptide C in Fig. 2a) of Rpap1 had a large fraction of missing values, which would mislead the protein-level measurement if not removed in advance. High-abundance peptides might also need to be removed. For the mouse protein Puf60, the abundance was dominated by a peptide (peptide C in Fig. 2b) that was highly likely to be a contamination signal. pepDESC flagged this misidentified peptide because it had peak features and an expression level similar to those of a contaminant peptide (Fig. 2b). This misidentification might be ascribed to a mis-assignment during MBR (match between runs), which barely affects bulk experiments, where the true signals of samples are much higher than contamination signals.
Figure 2. Comparison between pepDESC and Student’s t-test with dataset D1. Student’s t-test identified a changing protein when the p-value was lower than 0.05, while pepDESC identified a changing protein when the absolute value of the DE-score was over 0.3. Human proteins were changing by design, while mouse proteins should be stable. a. Bubble plot for the mouse Rpap1 protein and its peptides. The size of a circle represents the abundance of the corresponding protein or peptide. A red circle indicates a missing value. The result of Student’s t-test for Rpap1 was incorrect (changed, p-value = 0.01). pepDESC removed peptide C during filtration, and the result of pepDESC for Rpap1 was correct (unchanged, DE-score = 0.26). b. Peak feature chart and line plot for the mouse Puf60 protein and its peptides. The chart (top) shows the similarity of the peak features of peptide C and a KRT10 peptide (contaminant peptide). At the same time, peptide C had an expression level similar to that of KRT10 (bottom). The result of Student’s t-test for Puf60 was incorrect (changed, p-value = 0.03). pepDESC removed peptide C during filtration, and the result of pepDESC for Puf60 was correct (unchanged, DE-score = 0.11). c. Boxplot for the mouse Asf1b protein and its peptides in the two sample groups, with p-values of Student’s t-test for peptide and protein abundances. The result of Student’s t-test for Asf1b was incorrect (changed, p-value = 0.03). The result of pepDESC for Asf1b was correct (unchanged, DE-score = 0). d. Line plot for the human MRI1 protein and peptide abundances (left) with the Pearson correlation of peptide abundances (right). The result of Student’s t-test for MRI1 was incorrect (unchanged, p-value = 0.16). The result of pepDESC for MRI1 was correct (changed, DE-score = 1.13).
A common mistake when using Student’s t-test to analyze single-cell data was caused by accumulated noise from “unchanged peptides”. In 13 of the 48 false-positive proteins, no peptide was significantly different between the two sample groups. For example, neither peptide of Asf1b was considered changing (Student’s t-test p-value > 0.1). Yet the numerical summation of the peptides, which is the abundance of the Asf1b protein, was significantly different between the two groups of samples (Student’s t-test p-value = 0.03; Fig. 2c).
Furthermore, the credibility of the peptide DE-scores was further tested in pepDESC. In the case of the human protein MRI1, the most abundant peptide (peptide C in Fig. 2d) correlated poorly with the other two peptides. If a traditional method were adopted and the sum of the peptide abundances were used as the protein abundance, this highly abundant peptide would affect the final result and lead to a wrong conclusion (Fig. 2d).
As evidenced by the result, processing data at the peptide level does improve the quantification performance for this benchmark dataset, which is based on real-world single-cell proteomics measurement.
Comparison of different differential expression detection methods
To assess the performance of pepDESC compared with other statistical tools, the author evaluated several approaches, including Student’s t-test, the Wilcoxon test, and the widely used “Limma”, using three label-free proteomics benchmark datasets. The author also compared pepDESC with the peptide-level quantification methods DEqMS and PECA.
First, the performance of all these methods on dataset D1 was determined (Fig. 3a and b). As evident from the precision–recall curves, pepDESC attained the highest F-score and outperformed the other methods. It also yielded the largest number of correct identifications at a precision over 0.9.
Figure 3. Performance of different methods on datasets D1, D2, and D3. The precision–recall curves are shown in a, c, and e, where different colors represent the performance of different methods. The bar plots in b, d, and f show the F-scores of each method (top) and the maximum number of true positive identifications when the overall precision was greater than 0.9 (bottom). Student’s t-test is abbreviated as “T test”. The colors of the precision–recall curves correspond to the colors in the bar plots. a and b. Performance of different methods on dataset D1. c and d. Performance of different methods on dataset D2. e and f. Performance of different methods on dataset D3.
Next, a series of spike-in samples with different compositions of HeLa digest and E. coli digest were measured. Each sample group of this benchmark dataset, addressed as dataset D2, contains seven technical replicates. The two groups of samples contained 3% or 6% of E. coli protein and 97% or 94% of human protein, respectively. The amount of peptides loaded into the MS was around 120 pg to imitate the protein content of a single cell. In this case, ideal quantification results should show that the E. coli proteins change approximately two-fold while the human proteins remain nearly constant. Again, the precision–recall curves and the F-scores confirmed the superiority of pepDESC, which also identified more changing proteins at a precision higher than 0.9 (Fig. 3c and d).
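The expected fold changes follow directly from the mixing ratios given in the sample preparation (HeLa to E. coli at 94:6 and 97:3), taking the 6% group as the numerator:

```latex
\begin{align*}
\text{E. coli:} \quad & \frac{6\%}{3\%} = 2, & \log_2 2 &= 1, \\
\text{human:} \quad & \frac{94\%}{97\%} \approx 0.97, & \log_2 0.97 &\approx -0.05.
\end{align*}
```

The human background therefore shifts by only about 3%, which is why human proteins serve as the stable reference set in this benchmark.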
Furthermore, all the methods were applied to a published benchmark dataset for regular proteomics, addressed as dataset D3. This spike-in dataset contains four samples with 3% E. coli proteins in human proteins and four samples with 6% E. coli proteins in human proteins. Although all the tested methods showed better performance than on the low-input data (Fig. 3e and f), pepDESC also outperformed the other methods for the regular-size proteomic measurements.
Besides the high performance of pepDESC, the analysis of three independent benchmark datasets also showed that the peptide-level quantification methods pepDESC, PECA, and DEqMS all performed satisfactorily. DEqMS yielded the second-best result on the two spike-in datasets but was not an ideal choice for dataset D1, as it does not handle missing values well. PECA outperformed all other methods except for pepDESC on dataset D1. This result illustrates the strength of peptide-based methods in studying proteomics data.
Applying pepDESC to single-cell mouse macrophage proteome data reveals distinct dynamics of different cellular functions responding to LPS stimulation
To demonstrate the practicability of pepDESC on real-world single-cell proteomics data, it was applied to a published MaxQuant search result of single-cell mouse macrophage proteomics measurements. This dataset describes the single-cell proteomes of 56 control cells, 56 24 h LPS-treated cells (abbreviated as LPS24), and 52 48 h LPS-treated cells (abbreviated as LPS48). The search result contains 1,727 proteins with a large fraction of missing values. More specifically, there are 1,264 missing values on average per cell, and only 383 proteins are quantified in over half of the cells. To tackle the problem caused by the missing values, the original work applied data imputation, despite the resulting distortion of the data, before differential expression analysis using ANOVA. The author wondered whether pepDESC, which is more tolerant of missing values, could boost the performance without losing the fidelity of the original data. With the peptide quantitative information of the published search result, pepDESC was applied in two independent rounds of comparison between consecutive periods of time. A total of 452 proteins, 323 in the first 24 h and 271 in the second 24 h, were found to be changing with LPS stimulation among the 630 proteins in the pepDESC result.
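The figures quoted above can be put in proportion with a few lines of arithmetic. The overlap line assumes that the 452 changing proteins are the union of the two per-period sets, which is the natural reading but is not stated explicitly:

```latex
\begin{align*}
\text{mean missing fraction per cell} &= \frac{1264}{1727} \approx 0.73, \\
\text{proteins quantified in over half of the cells} &= \frac{383}{1727} \approx 0.22, \\
\text{proteins changing in both periods} &= 323 + 271 - 452 = 142, \\
\text{changing fraction in the pepDESC result} &= \frac{452}{630} \approx 0.72.
\end{align*}
```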
With protein DE-scores indicating the dynamics of the proteome, it could be found that various biological functions responded differently to LPS stimulation. The 452 changing proteins were grouped into five clusters, each of which showed distinct dynamics and was involved in different biological pathways. As depicted in Fig. 4a, proteins related to gene expression and protein translation in clusters 1 and 5 were upregulated in the first 24 h, whereas only the abundance of mRNA splicing proteins in cluster 1 dropped on the second day. A rise in stimulus-responding proteins and immune-response proteins primarily happened on the second day, as shown in cluster 4. Clusters 2 and 3 confirmed a change in the metabolism of LPS-stimulated macrophages.
Figure 4. Proteome dynamics of single macrophages responding to LPS stimulation discovered by pepDESC. The ANOVA result and the protein abundances were adapted from the original publication of the dataset. a. Heatmap of protein DE-scores for the 452 changing proteins (protein DE-score > 0.3) during the first and the second 24 h after LPS stimulation (left). A positive DE-score means that the protein was upregulated during the period, and vice versa. The 452 proteins were grouped into five clusters according to their protein DE-scores. Typical Reactome pathway enrichment terms (FDR < 0.01) of each cluster are depicted in the middle. The FDR of each term is depicted in the bar plot (right). b. Dynamics of marker proteins found in the ANOVA result could also be found by pepDESC. The violin plot shows the log-transformed protein abundances from the protein-level search result (with no imputation). The protein DE-scores of the two consecutive time periods are shown on the top, where a changing protein is denoted in red. c. Venn plot of changing proteins identified using pepDESC (pink) or using ANOVA (purple). d and e. Dynamics of marker proteins newly identified using pepDESC. The violin plot shows the log-transformed protein abundances from the protein-level search result (with no imputation). The protein DE-scores of the two consecutive time periods are shown on the top, where a changing protein is denoted in red.
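The way per-period DE-scores translate into sets of changing proteins can be illustrated with a short sketch. This is not the pepDESC implementation: the function names and the toy scores are hypothetical, and it assumes that the 0.3 cutoff in the caption applies to the absolute value of the DE-score, with the sign giving the direction of change.

```python
def classify_changes(de_scores, threshold=0.3):
    """Label each protein in each period as 'up', 'down', or 'unchanged'.

    de_scores maps a protein name to a (first_24h, second_24h) score pair.
    Assumption: a protein is 'changing' when |DE-score| > threshold.
    """
    def label(score):
        if score > threshold:
            return "up"
        if score < -threshold:
            return "down"
        return "unchanged"

    return {protein: tuple(label(s) for s in scores)
            for protein, scores in de_scores.items()}


def changing_sets(labels):
    """Return the sets of proteins changing in the first and second period."""
    first = {p for p, (a, _) in labels.items() if a != "unchanged"}
    second = {p for p, (_, b) in labels.items() if b != "unchanged"}
    return first, second


# Toy DE-scores (hypothetical values, not taken from the paper).
scores = {
    "P1": (0.6, -0.4),
    "P2": (0.1, 0.5),
    "P3": (-0.2, 0.05),
    "P4": (0.45, 0.0),
}
labels = classify_changes(scores)
first, second = changing_sets(labels)
either = first | second   # changing in at least one period
both = first & second     # changing in both periods
```

Applying the same set logic to the reported counts, and taking the 452 changing proteins as the union of the two periods, inclusion-exclusion gives 323 + 271 - 452 = 142 proteins that changed in both periods.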
Key regulators mentioned in the original work were also found by pepDESC (Fig 4b). At the same time, most changing proteins found by ANOVA could also be found with the new method (Fig 4c), indicating a good overlap between the two methods. pepDESC additionally identified more marker proteins, including the inflammatory signaling protein Tap1 (Fig 4d), which is involved in antigen presentation via MHC class I [Differential regulation of the expression of transporters associated with antigen processing, TAP1 and TAP2, by cytokines and lipopolysaccharide in primary human macrophages].
In summary, pepDESC uncovered a system-wide picture of the proteome responding to LPS stimulation in a specific temporal order. The new method discovered a large number of differentially expressed proteins and offered an insightful interpretation of functional dynamics. pepDESC has thus been demonstrated to be a practical tool for real-world single-cell proteomics data that does not require imputation, which generally affects data fidelity.
Discussion
Single-cell proteomics data based on label-free quantitative MS has three main characteristics: high measurement noise, internal heterogeneity, and limited sample size. To gain insights from such complicated data, statistical methods must address two main concerns, proteome coverage and quantification accuracy. Carefully weighing depth against accuracy is crucial when dealing with single-cell proteomics data. Although various tools, such as data imputation, can be used to boost performance, improper methods may reduce data fidelity and mask the intrinsic nature of single cells. Based on these considerations, the author developed a method, pepDESC, for single-cell proteomics discovery.
To demonstrate the performance of pepDESC, which uses peptide-level information to discover differentially expressed proteins between two cell populations, three datasets were used for evaluation: a mixed single-cell dataset (dataset D1), a low-input spike-in dataset (dataset D2), and a published regular spike-in dataset (dataset D3). Although the performance of the tested methods clearly varied among datasets, the advantage of pepDESC was evident despite limited sample size (D1, D2, and D3), internal heterogeneity (D1), and low quantification signal (D1 and D2). Compared with widely used protein-level methods, pepDESC evaluates differences in the data at a higher resolution, with appreciation of the nature of single-cell data. Although the peptide-level methods PECA and DEqMS yielded good results as well, their stringent filtering conditions dampened the performance on low-input data. Therefore, the choice of analysis tool should take both the features of the data and the purpose of the analysis into consideration. As for the mouse macrophage data described in this work, although some key proteins such as cytokines could not be detected because of their low copy numbers and the limited detectability of current single-cell MS, marker proteins that play critical roles during the immunological response, such as Tap1 and Phb1, could be discovered with an effective analysis tool. Specifically, TAP1 would import endogenous peptides into the endoplasmic reticulum after LPS stimulation, in order to present exogenous antigens via MHC class I molecules and activate CD8+ T cells, while the increased expression of Phb1 during the early stage of LPS stimulation regulates the orchestration of cytokine production. Discovery from noisy data with a non-negligible fraction of missing values depends greatly on the choice of statistical method.
pepDESC was designed to be applicable to different quantification results, bulk or single-cell, obtained with different search engines. In the current implementation, the slowest step in measuring differentially expressed proteins among around 100 cells took a minute or less on a personal computer, which makes pepDESC an effective tool even for larger cohorts. At the same time, to make the method compatible with various data, all steps were separated into individual sections, allowing for customized workflows. Moreover, several parameters can be adjusted based on the nature of the data.
The author hopes that this novel statistical tool will be widely used to improve the performance of MS-based single-cell proteomics techniques. More significant discoveries can be expected from functional-level measurements at single-cell resolution.
Data availability
The raw data of dataset D3 is available on the ProteomeXchange Consortium via the PRIDE partner repository (identifier PXD003881). The raw data of the single-mouse macrophages is available on the ProteomeXchange Consortium via the MassIVE partner repository (identifier MSV000085937).
The raw data of D1 and D2 as well as the search engine result of D1, D2, and D3 are now private and have been deposited to MassIVE. For reviewer access of data, please visit https://doi.org/doi:10.25345/C5Q814X3C, user: MSV000090606_reviewer; password: DioneZhang.
The source code of pepDESC is available on GitHub (https://github.com/dionezhang/pepDESC).
Supplemental data
This article contains supplemental identification results of D2 and D3, as well as the two single-cell datasets for dataset D1 (Supplementary files S1 and S2). The randomly selected mixed single-cell data for dataset D1 is provided as Supplementary file S3. The pepDESC result for the single-macrophage data is provided as Supplementary file S4.
Author contributions
The project design, programming, and writing of this work were conducted by Y.T. Zhang.
Acknowledgement
I thank Dr. Mo Hu for his help in operating the mass spectrometer and his guidance in preparing this paper. I thank Yuan Yuan for his inspiring discussion. I thank Dr. Xiaoliang Sunney Xie for the opportunity to do this work and for his instruction during the whole project. I thank the Home for Researchers editorial team (www.home-for-researchers.com) for language editing service.
pepDESC is a statistical method built for differential expression analysis for single-cell mass spectrometry-based data. To overcome the difficulties caused by the small amount of input sample, pepDESC uses peptide-level quantification results to balance the proteome coverage and quantification accuracy. The application of pepDESC shows its superiority in benchmark datasets and real single-cell measurements. The use of pepDESC would improve the current implementation of single-cell proteomics measurements and boost our understanding of single-cell data at the functional level.