Proteogenomic Characterization of the Pathogenic Fungus Aspergillus flavus Reveals Novel Genes Involved in Aflatoxin Production

In Brief A. flavus is a pathogenic fungus capable of producing aflatoxin, which is harmful and carcinogenic to animals and human. In this study, we presented the first comprehensive draft map of the A. flavus proteome and a wide range of posttranslational modifications. This is the first study we are aware of to have employed such a proteomic approach to improve Aspergillus genomic annotation, which will serve as a valuable resource for future efforts to explore the synthesis and excretion of aflatoxins.

genome_assembly_id=28730&gi=-1) and Uniprot (https://www.uniprot.org/uniprot /?query=aspergillus+flavus+NRRL3357&sort=score) databases. It is important that the A. flavus genome be fully and accurately annotated in order to ensure a complete understanding of the molecular mechanisms governing aflatoxin production and associated pathogenicity. As such, there is a pressing need for a high quality annotation of protein coding genes for this fungal pathogen.
Proteogenomics is referred as the use of mass spectrometry (MS)-derived proteomic data to annotate the protein coding genes and improve genome annotation quality (13,14). In contrast to conventional MS data analysis, proteogenomic approach does not rely on an existing reference protein database, but use the database translated from six-frame translation of the whole genome and three-frame translation of RNA-seq sequences. Thus, such complementary annotation analyses can not only provide valuable insights that offer direct experimental support for predicted protein-coding genes, thereby improving genomic annotation (15), but also have the potential to provide supporting evidence for the novel transcripts, alternative predictions or single amino acid variants (SAAVs) at protein level (16)(17)(18)(19). Over the last few years, with the rapid advances in MS technology and computational tools, proteogenomic reseach has become a dramatically growing field (18)(19)(20). It has been successfully applied to improve the quality of genome annotation in many model organisms, including Homo sapiens (21,22), Arabidopsis (23)(24)(25), Zea mays (26), Oryza sativa (27,28), diatom (29), cyanobacteria (30) and bacteria (31)(32)(33)(34). These studies have provided invaluable information for genome annotation and the by guest on January 11, 2021 https://www.mcponline.org Downloaded from physiology of these model organisms. However, a high-quality annotation of the A. flavus genome is still not available, presenting a major obstacle for the understanding of the entire networks governing pathogenicity in this major fungal pathogen.
To address this gap, we performed high-quality comprehensive annotation of the A. flavus genome using an integrated proteogenomic approach. Through this approach, we were able to validate many predicted genes and identified novel protein-coding genes not currently represented in extant genomic annotations. We further provided a holistic view of post-translational modifications (PTMs) events in A. flavus. Functional analysis suggested the involvement of these PTMs in the cellular metabolic and aflatoxin biosynthetic pathways. Our current results represent a significant advance in the accurate annotation of the A. flavus genome and provide new insights into the mechanisms of pathogenicity and aflatoxin biosynthesis in A. flavus. 6 supplied with additional congo red (200µg/mL) and cultured in minimal medium (MM) (10 g/L glucose) and czapek-dox agar (CA) (6 g/L Ammonium tartrate) at 37 °C.

Protein extraction, proteolytic digestion and offline high-pH reversed phase C18 column prefractionation
The cultures from different treatments were then collected by filtration and washed three times with PBS buffer, respectively. The mycelia were ground into powders and reconstituted in buffer [20 mM Tris-Cl (pH 7.5), 150 mM NaCl, 50 mM nicotinamide, pH 7.5, 5 mM β-glycero-phosphate, 10 mM Na4P2O7, 10 mM NaF, 1 mM Na3VO4, 1% Triton X-100, 1×phenylmethanesulfonyl fluoride (PMSF)]. The cultures were shaken at 30 °C for 1 h and the debris was removed by centrifugation at 5,000 × g at 4 °C for 1 h. Finally, the protein concentration was measured by bicinchonininc acid (BCA) method (Tiangen, Beijing, China).
The constant protein amounts from different stress conditions were combined and precipitated by using 80% ice-cold acetone. The whole lysates were then washed three times with ice-cold acetone to remove the pigment and other small molecules.
The protein lysates were finally redissolved in 50 mM ammonium bicarbonate, and in-solution/in-gel digested by trypsin (Promega) according to previously described (29,35).
To reduce the complexity of tryptic digests, the in-solution proteolytic digested peptides were pre-fractionated on a self-packed SPE column as previously described (29,35). Briefly, the peptides were loaded onto the column and then desalted with 20 mM ammonium formate. The columns were eluted with a series of elution buffers by guest on January 11, 2021 https://www.mcponline.org Downloaded from 7 containing of 20 mM ammonium formate and different concentrations of acetonitrile (10,13,15,18,21,25,28,35,60,80, and 100%, vol/vol). Collected fractions were completely dried with a vacuum centrifuge and stored at −80 °C for further use.

Nano-LC-MS/MS analysis
The peptides were dissolved in 0.1% formic acid and separated on an online nano-flow EASY-nLC 1200 system with a 75 um × 15 cm analytical column (C18, 3 μm, Thermo Fisher Scientific), and analyzed on a Q Exactive HF mass spectrometer (Thermo Fisher Scientific). Peptides were eluted using a linear solvent gradient of 9-32% solvent B (0.1% formic acid/80% acetonitrile, v/v) over 100 min at 450 nL/min flow rate. The mass spectrometer was operated in data-dependent acquisition (DDA) mode with full scans (m/z range of 300-1,800) at 120,000 mass resolution using an automatic gain control (AGC) target value of 3e6. The top 20 most intense precursor ions were selected for following MS/MS fragmentation by higher-energy collision dissociation (HCD) with normalized collision energy (NCE) of 27% and analyzed with 30,000 resolution in the Orbitrap. The dynamic exclusion was set to 20 s and the isolation width of precursor ion was set to 1.6 m/z. The maximum injection times were 20 ms and 50 ms for both MS and MS/MS, respectively. The intensity threshold was set to 10,000.
The digested peptides were also separated on an UltimateTM 3000 nano-LC System (Dionex) with a 75 um × 15 cm analytical column (C18, 3 μm, Thermo Fisher Scientific) and analyzed on an Orbitrap Elite mass spectrometer (Thermo Fisher Scientific). Peptides were eluted at 300 nL/min flow rate with 70 min linear solvent by guest on January 11, 2021 https://www.mcponline.org Downloaded from 8 gradient of 4-25% solvent B (0.1% formic acid/80% acetonitrile, v/v). A full MS scan from m/z 300 to 1,800 was acquired at 60,000 resolution with a minimum signal intensity of 500. The top 15 most intense precursor ions were selected for following MS/MS fragmentation by HCD with NCE of 35% and analyzed with a resolution of 15,000 in the Orbitrap. The dynamic exclusion was set to 90 s and the isolation width of precursor ion was set to 2.0 m/z. The maximum injection times for both full MS and MS/MS were 120 ms and 120 ms, respectively.
The ion spray voltage was 2300 V, declustering potential 100 V, nebulizer gas 5 psi, curtain gas 35 psi, and interface heater temperature 150 °C. A full MS scan from 350 to 1,500 m/z was acquired in an Information Dependent Acquisition (IDA) mode. The top 40 most intense precursor ions were selected for following MS/MS fragmentation with a 2+ to 5+ charge state. The maximum injection times for both full MS and MS/MS were 250 ms and 50 ms, respectively. Dynamic exclusion was set for a period of 15 s and a tolerance of 50 ppm. Dynamic collision energy was used.

Retrieval of data sources
The A. flavus protein sequences, genome sequences and general feature format (GFF) file were downloaded from NCBI database. The A. flavus protein sequence 10 results into the GAPE pipeline, with regard to the performance characteristics of the two proprietary and commercial search engines. The search parameters were as follows: trypsin digestion with maximum of two missed cleavages allowed; fixed modification was carbamidomethylation of cysteine, whereas acetylation (protein N-terminal), deamidation (Asn/Gln) and oxidation (Met) were set as dynamic modifications. The mass errors for precursor ions and fragment ions were set to 10 ppm and 0.05 Da, respectively. The filtering strategies in GAPE tool, including the target-decoy strategy for all identified peptides (47) and a more stringent filtering strategy (separated FDR) for the novel peptides (48,49), were finally used to evaluate the identification error rates. Also, the identified spectra that were assigned different sequences in different searches were eliminated to filter the false discovery. All identified peptides were mapped to the existing reference protein database using BLASTP (50) and the remaining orphan peptides were designated as genome search-specific peptides (GSSPs).
All identified GSSPs were mapped to the A. flavus genome and the open reading frame (ORF) was predicted using GAPE tool (29). Novel protein-coding genes can be identified as the ORFs mapping to intergenic regions, while the ORFs partially overlapping within an annotated gene or exon will be reported as gene model revision.
Splice variants can be found either by exon-exon spanning peptides or the GSSPs that partially overlapping with an existing exon. The predicted ORFs containing specific amino acids mutations were defined as SAAVs.

PTMs identification
by guest on January 11, 2021 To gain a holistic view of PTMs events in A. flavus, the same experimental MS data were searched against the existing reference protein database (13,485 protein sequences) by using Open-search algorithm in pFind (version 3.1) (44, 45) with a precursor ion mass tolerance of 20 ppm and fragment ion mass tolerance of 20 ppm.
Two missed cleavages were allowed for trypsin. The target-decoy strategy based on score threshold was used to evaluate the error rates of identification. For the other parameters in pFind, we used the algorithm defaults.
We next performed restrictive database search strategy to confirm the localization accuracy of the modification sites by using MaxQuant program (version 1.6.2.4) (51).
All MS/MS spectra were searched individually twenty three times with different parameters as previously described (29,30). In brief, the identical search parameters were as follows: the precursor and fragment ion mass tolerances were 10 ppm and 0.05 Da, respectively; enzyme specifity was set as full cleavage by trypsin; proteins, modification sites and peptides were set at 1%, and the modified peptides with PTM score more than 40 were selected for further bioinformatics analysis.

Bioinformatics analysis
The functional annotation of all the identified proteins was performed by using Blast2GO tool and the subcellular localization was analyzed by CELLO web tool (52).
For the conservation analysis of all the predicted proteins in "non-coding genes analysis" section, we selected the protein databases of 811 fungi from NCBI (www.ncbi.nlm.nih.gov/), to provide evidence of the protein-coding ability as much as possible. For the conservation analysis of novel proteins identified in this work, the protein databases of 66 Aspergillus strains were retrieved from NCBI (www.ncbi.nlm.nih.gov/), JGI (www.jgi.doe.gov), AspGD (www.aspgd.org) and EnsEMBL (www.ensembl.org). Conservation analysis of the identified novel proteins was carried out by using reciprocal BLAST (50). The two-directional BLAST alignments were classified to be significant if the corresponding BLAST E values were lower than 1E-5. Finally, the best-scoring homologous protein was selected for further processing. The cluster analysis of identified proteins was performed by Cluster 3.0 and visualized by TreeView. The identified novel genes were visualized using the Integrative Genomics Viewer (IGV) browser (53). Potential non-coding genes analysis in A. flavus genome was performed as previous described (29). The homology analysis of novel protein was performed by ClustalW (54) and the novel protein were then mapped to KEGG pathway according to the known homologous proteins from other Aspergillus species. Signal peptide was predicted by SignalP (55) and PrediSi by guest on January 11, 2021 https://www.mcponline.org Downloaded from 14 (56) tool, and the motif was visualized by using WebLogo tool (57). R scripts and Excel were used for the statistical analyses. Statistical significance was analyzed by Student's t tests and expressed as a p value. p < 0.05 was considered to be statistically significant. For RNA-seq data analysis, the retrieved low-quality RNA reads from NCBI database were filtered and then mapped to the A. flavus genome using TopHat (version 2.

Real-Time PCR
Total RNA was extracted from the mycelia grown in different conditions (37 °C, 29 °C, salt and H2O2 stresses) using TRIzol reagent (Invitrogen) as the manufacturer's protocols. The concentration of total RNA was quantitated using Nanodrop 2000 spectrophotometer. RT-PCR validation of novel genes was performed using the

SYBR Green PCR Master Mix (Applied Biosystems) and the LightCycler 480
Real-Time PCR System (Roche). The actin gene was used as the endogenous control and three independent biological replicates were carried out for duplicate samples. The gene-specific primers were compiled in Table S1. The identified candidate novel peptides were randomly selected and synthesized by Bioyeargene Biotechnology (Wuhan, China). Approximately 1 pmol of synthetic peptide was analyzed on the LTQ-Orbitrap Elite mass spectrometers. The peptide fragments distributions of 49 synthetic peptides were then validated manually by aligning those from the identified peptides in proteogenomic analysis. To further ensure the reliability of the matching for all measured peptides, Cosine similarity analysis (Eq. 1) of these spectra were performed and the corresponding score value > 0.7 was considered reliable and selected as previously described (60)(61)(62)(63).

Experimental Design and Statistical Rationale
To achieve the in-deep protein coverage in this work, we performed the experiments for many times using different experimental procedures. The wild type strain of A.
flavus was grown under different treatment conditions, including different growth media (yeast extract-sucrose medium, minimal medium and czapek-dox agar medium) and stress conditions (H2O2, different pH, 28°C, sodium chloride, congo red). The cultures from these different treatments were collected by filtration and the mycelia were ground into powders to extract the whole cell lysates, respectively. After measuring the protein concentration, the constant protein amounts from different cultures were mixed as previous reports (29,35) and subsequently in-gel digested by trypsin and analyzed on the Q Exactive HF, Orbitrap Elite and Triple TOF™ 6600 mass spectrometers, respectively. Furthermore, the lysates were also processed with by guest on January 11,2021 in-solution tryptic digestion and pre-fractionated on a self-packed SPE column according to previously described (29,35). All the eluted fractions were then analyzed on the Q Exactive HF and Orbitrap Elite mass spectrometers, respectively. The final proteomic data was combined based on these different experimental results. The GAPE tool was used for the identification of peptides, expressed proteins and novel events. Post-translational modifications analysis was performed by using MODa and MaxQuant. Functional annotation of all proteins was performed by using Blast2GO tool and the subcellular localization was analyzed by CELLO web tool (52).
Conservation analysis was carried out by using reciprocal BLAST (50) and ClustalW (54). The cluster analysis of identified proteins was performed by Cluster 3.0 and visualized by TreeView. A protein-protein interaction map was constructed according to the interaction data from string database (https://string-db.org/). Potential non-coding genes analysis in A. flavus genome was performed as previous described (29). Signal peptide was predicted by SignalP (55) and PrediSi (56) tool, and the motif was visualized by using WebLogo tool (57). RNA-seq data analysis was performed using TopHat (version 2.1.1) (58) and cufflinks (v2.2.1) (59). R scripts and Excel were used for the statistical analyses. Statistical significance was analyzed by Student's t tests and expressed as a p value. p < 0.05 was considered to be statistically significant.

Results The A. flavus proteomic landscape
In order to ensure broad proteomic coverage, we began by preparing samples of A. by guest on January 11, 2021 flavus grown under 8 different culture conditions, including different growth media (yeast extract-sucrose medium, minimal medium and czapek-dox agar medium) and stress conditions (H2O2, different pH, 28°C, sodium chloride, congo red) (Fig. 1A).
These samples were then further processed and subjected to high resolution MS analysis (Fig. 1B). In addition to these data, additional MS sources from recently published large-scale MS experiments (38,39) were also incorporated and analyzed using the GAPE tool, which facilitates automated genomic annotation and post-translational modification (PTM) discovery in eukaryotic organisms (29). By applying our proteomic strategy to 88 raw MS files, we identified total 840,095 significant peptide spectrum matches (PSMs), corresponding to 150,458 unique peptides using a stringent false discovery rate (FDR) filtering threshold (FDR < 1%) (29). All the peptides identified from the different search engines, along with charge, score, peptide mass and mass error, were available at the iprox database (www.iprox.org) with the identifier IPX0001753001 (under the file name was ~25%, confirming the reliability of our proteomics datasets (Fig. S1A). Proteins that only matched a subset of peptides from another protein were designated as shared proteins as previous described (29). In addition to the 8,411 clearly identified proteins, we additionally identified 313 shared proteins (Table S2), which only evidenced by the presence of a subset of peptides that mapped to two or more different proteins. Figure 1C provides a summary of the number of genes with all the identified features.
However, the remaining 4,761 predicted genes were still not detected by any peptide evidence in this analysis, leading us to next determine whether they corresponded to non-coding genes in the A. flavus database according to previously described methods (64). Figure 1D exhibits the key details regarding the features of protein coding-genes identified in this analysis and summarized the identified result of each feature. We first noticed that the protein length of these unidentified proteins was shorter than those of detected proteins, suggesting that this may have influenced their proteomics-based detection owing to the relatively lower expression levels and number of peptides generated from these proteins (Fig. S1B). We next analyzed the functional features and localization of these predicted protein-coding genes based upon Gene Ontology (GO), KOG, and domain annotations, revealing that such functional annotations were present for many of these proteins (Table S3). However, roughly 3,000 of these undetected proteins lacked any functional evidence ( Fig. 2A), and roughly 50% of these undetected proteins were predicted to localize to the membrane (1,529) or extracellular environment (1,038) (Fig. 2B), potentially making by guest on January 11, 2021 https://www.mcponline.org Downloaded from them challenging to be detected via traditional MS approaches (64). With respect to the uniprot protein evidence pertaining to these genes, 82.82% of protein-coding genes were annotated based upon predictions and 16.89% were inferred from homology, with only 0.3% of proteins being annotated based upon protein and transcript-level experimental evidence (Fig. 2C and Table S4), suggesting the need for further proteomic analyses of A. flavus to ensure the robust and replicable characterization of this organism. We also found that there was a strong correlation between protein identification and the number of growth conditions in which a transcript was expressed ( Fig. 2D and Table S5). Many more transcripts were detectable for identified proteins, with 7,879 proteins having detected transcripts, whereas we could not detect any evidence of transcriptional expression for the ~25% of undetected proteins. Of these identified proteins with transcript-level evidence, ~89% (7,022) had the evidence of transcriptional expression in 10 or more samples, whereas only 334 (4%) were expressed in two or fewer samples. In contrast, many more of the undetected proteins (638) had transcripts that were detected in two or fewer samples (Fig. 2E). All the transcript abundances between protein-coding genes detected and not-detected in the proteomic experiments were provided in Table S5.
Based upon correlation analyses between protein detection and conservation in fungi, we found that the more conditions under which a transcript is expressed, the more likely proteins tended to be conserved ( Fig. 2F and Table S6). Also, the annotations of all predicted proteins from different databases were analyzed and we provided a complete annotation for the potential non-coding genes in Table S7, while we  (Table S8), a list of all potential non-coding genes (3,279) with at least one of above features is compiled in Table S9 and Figure 1D.

Identification of novel events in the A. flavus proteome
We next refined current gene predictions using genome search-specific peptides (GSSPs) in order to account for any missing or incorrect genes or exons. A total of 72,879 unique GSSPs resulted in the identification of 732 novel protein-coding genes using the identification standard of two unique peptides per protein that were not initially identified during A. flavus genome assembly (Table S10). Additionally, besides identifying novel genes, the GSSPs were also used to revise the incorrectly predicted gene structures within the A. flavus proteome. We proposed the revision of an existing gene model and designated it as a "revised gene". In total, 188 revised genes were identified in the current genomic annotation (Table S11). Moreover, the remaining novel peptides were then used to identify alternatively spliced (AS) proteins and proteins with SAAVs. Finally, we identified 269 AS proteins, of which 88 were novel and 181 were revised (Table S12). We additionally identified 447 SAAVs, of which 19 were novel, 7 were revised, and 421 which were amino acid mutations in the known predicted protein-coding genes ( Table S13). Each of the identified novel variants were supported by at least two unique GSSPs. additional RNA-seq evidence supporting the identification of this protein (Fig. 3B). As eukaryotic gene models are often complex, bioinformatics approaches often incorrectly predict the boundaries of many protein-coding genes. We detected a predicted L-arabinofuranosidase precursor protein-coding gene represented by 32 peptides across 2 exons based on the current genome annotation (Fig. 3C). Notably, 19 GSSPs were clustered in the intronic regions of this gene and 6 spanned exon-intron boundaries, suggesting that the exons of this gene were extended relative to the current annotated gene model. The validity of this observed partially-extended protein was also supported by transcriptomic evidence. Given the errors in and limited depth of whole-genome sequencing data, proteogenomic analyses such as those in the present study can offer direct evidence of translation for this type of gene. For example, besides the eight annotated peptides in the five exons, an additional three point-mutated GSSPs within the annotated gene of a short-chain dehydrogenases/reductase were observed, including the mutation of glycine to asparagic acid, valine to leucine, and isoleucine to valine (Fig. 3D). To further validate the four proteins, we finally used an alignment strategy to explore the conservation of by guest on January 11, 2021 these proteins more broadly, demonstrating orthologous proteins to be present across a range of other species (Fig. S2 and Table S14). The details of aligned information for the identified novel gene (Fig. 3A), revised gene (Fig. 3B), novel alternative splicing variant (Fig. 3C) and SAAV (Fig. 3D) in the other Aspergillus strains were uploaded onto the iprox database (www.iprox.org) with the identifier IPX0001753001 (under the file name "Results of sequence alignment"). In addition, to confirm the conservation of the putative mutation sites in the other Aspergillus strains, we performed sequence alignment analysis for all the identified SAAVs in this work and the conservation of the mutated amino acids that occurs in the other 66 different Aspergillus strains was analyzed (Table S15). Notably, the mutation of Val81 to Lue from the short-chain dehydrogenases/reductase (Fig. 3D)

was conserved in 31
Aspergillus strains and the mutation of Ile169 to Val was detected in 10 Aspergillus strains, while the third mutation site (Gly49 to Asp) was observed in 4 Aspergillus strains. For the other mutation sites, as depicted in Figure S3, 288 mutation sites identified in this study may also be detected in 5 or more Aspergillus strains, confirming the high conservation of these amino acids across a range of other Aspergillus species. While 114 sites were only observed in 5 or fewer Aspergillus strains and the remaining 45 mutation sites were not detected in the other Aspergillus strains. These findings suggested that most of the mutated residues identified in this work may have a potential function in Aspergillus. However, further experimental studies are needed to reveal the functional effects of these mutations in A. flavus.

Functional analysis of novelties in A. flavus
by guest on January 11, 2021 In order to further gain insights into the identified novel proteins, we conducted a conservative analysis of these novel events more broadly across Aspergillus species via a two-directional protein BLAST approach (50). As shown in Figure 4A, while the majority of these proteins appeared to exhibit evolutionary conservation, several species-specific proteins were also detected lacking orthologs in other Aspergillus species indicating that they may play unique regulatory roles in A. flavus (Table S14).
Additional functional annotation revealed that most of these evolutionary conserved novel proteins were associated with multiple metabolic processes, including biosynthetic and nitrogen metabolic processes ( Fig. 4B and Table S16). Since the samples were corrected from different stress treatments, a large subset of these novel proteins were stress-associated and found on the membrane or lumen, suggesting they are important for Aspergillus stress responses and pathogenicity.
To investigate the potential function of these novel proteins, the novel proteins identified in this work were further mapped to the KEGG pathways by homology search strategy. Consistently, we were able to identify many novel proteins that involved in the important pathways (Table S17). Among these pathways, a large subset of these novel proteins were stress-associated, such as rho signaling pathway, G-protein signaling system, MAPK signaling pathway and other stress-related signaling pathways, suggesting the potential role of these novel proteins in stress responses. Owing to the pivotal role of the signaling pathways in pathogenesis of Aspergillus (65), our results would facilitate the elucidation of signaling networks in A.
flavus. Furthermore, we identified six novel peroxisomal proteins in early aflatoxin by guest on January 11, 2021 https://www.mcponline.org Downloaded from synthesis (4) and many novel proteins that involved in aflatoxin biosynthesis (5) and transport pathway (12), indicating the potential role of these novel proteins in secondary metabolism. Although other groups have focused on the role of fungal organelles (vacuole, golgi apparatus and endoplasmic reticulum) in growth, differentiation, secondary metabolism and pathogenesis of Aspergillus (4,66,67), the detailed regulatory mechanism of these metabolism has not yet been reported, particularly the identification and functional analysis of the corresponding genes. As a result, these findings in this work could enable us to offer some potential insights into the important functional pathways and also provide a step tone for understanding the detailed regulatory mechanism of aflatoxin biosynthesis in A. flavus. However, further experimental procedures are needed to reveal how these novel events influence the synthesis and export of aflatoxin.
To further explore the biological roles of the novel proteins, we next constructed a protein-protein interaction network that can serve as an alternative strategy to analyze the physical and functional interactions, based upon known interactions among homologous proteins from other Aspergillus species (Table S18). The resultant network incorporated 115 novel proteins and 223 interactions (Fig. S4). While we do not have functional data to validate these interactions at present, this network highlights a potential framework for such interactions and offers some potential insights into the important physical and functional pathways in this pathogenic fungus.
As expected, a total of 12 stress-associated novel proteins were identified in this network, suggesting they may play key roles in stress responses. Due to the fact that

Validation of novelties in the A. flavus proteome
In order to confirm that the identified novel proteins in this analysis were valid, we sought to independently validate our results via MS analysis of synthetic peptides and via RT-PCR. We first synthesized 49 random GSSPs across a range of peptide types and analyzed them via MS. When we compared the results of the mass spectra for these peptides, we found that the relative intensity and distribution of a series of b/y-ions of the synthetic peptide spectra were in high agreement with the selected GSSPs, strongly supporting the results of our proteogenomic analyses. Furthermore, we performed the Cosine similarity analysis to confirm the matching for all measured peptides. This method was interpreted as a cosine of the angle and chosen as a measure of similarity between different mass spectra in this work. It's calculated as normalized inner product, with score values ranging between 0 and 1. Based on this analysis, the corresponding scores for the 42 measured peptides were more than 0.9 and five were more than 0.8, while the scores of the remaining two peptides were 0.768867 and 0.758584, respectively. We then manually checked these spectra with scores between 0.7-0.9 and concluded that these spectra of synthetic peptides agreed with the proteomic results in both the peptide fragments distribution and the relative intensity. All the spectra were available at www.iprox.org with the identifier We additionally validated the expression of 23 identified novel genes via RT-PCR using a range of different stress conditions, including low temperature (29 °C), oxidative (H2O2) and hyperosmotic (NaCl) stresses. Among these differently expressed novel genes, five genes were up-regulated under low temperature and two were up-regulated under oxidative stress, while most genes were down-regulated under hyperosmotic stress (Fig. 5D). The results of this analysis provided transcript-level results, suggesting that these genes may play an important role in stress responses.

PTM analysis and signal peptide prediction
We have previously conducted a specific PTM enrichment-based phosphoproteome and succinylome analysis of A. flavus (72,73). In the current study, we next  (Table S19).
We further identified common intracellular PTMs in our dataset as in previous reports (29), revealing 3,461 modification sites from 1,765 proteins in this study ( Table S20). post-translationally modified peptides were uploaded onto the iprox database (www.iprox.org) with the identifier IPX0001753001 (under the file name "Maxquant search results" and "Spectra of PTMs"). We next compared the score distribution of the different PTMs to the unmodified peptide identifications. We found that the Maxquant Andromeda score of modified peptides were lower than that of unmodified peptides ( Fig. S5 and Table S21). Consistent with our results, previous reports have shown that the stoichiometry of the modification is low, and the low stoichiometry of modified peptides in cell lysates may result in low identification score (74,75).
Additionally, we noticed that the four most common PTMs in this study were methylation, propionylation, malonylation, and crotonylation, with over 1,000 modification sites (Fig. 6A). The in-depth identification of more than 20 PTMs also enabled us to analyze the function of these modified proteins based upon the NCBI eukaryotic orthologous groups (KOG), revealing these modified proteins to play diverse roles as regulators of many cellular processes ( Fig. 6B and Table S22). We next investigated the different PTMs in A. flavus by mapping modified proteins to KEGG pathways. Consistent with the KOG annotations, most proteins were involved in metabolic pathways (225), the biosynthesis of secondary metabolites (104), and the biosynthesis of antibiotics (90) ( Table S23).
Based on our functional annotation of the identified predicted proteins (Table S3) and novel events ( Table S16 and Table S17), many proteins were likely to be involved in aflatoxin biosynthesis and export pathways, and predicted to localize to the membrane and extracellular. It should be noted that the export of secreted by guest on January 11, 2021 https://www.mcponline.org Downloaded from 28 proteins as well as proteins which are located in the inner or outer membrane or the periplasm usually requires an N-terminal signal sequence, while knowledge of signal peptides is important for understanding protein function. Additionally, signal peptide has been proved to play an important role in Aspergillus (76,77). Thus, apart from the post-translational chemical modifications, we next utilized two signal peptide prediction tools, PrediSi (56) and SignalP (55), to predict the presence of N-terminal signal peptides on identified proteins. In total, we identified 1,116 and 757 signal peptides using PrediSi (Table S24) and SignalP (Table S25), respectively, corresponding with clear sequence motifs (Fig. S6A). The majority of these peptides were confirmed by both tools (Fig. S6B), providing a robust basis for their validity.
Furthermore, the presence of N-terminal signal peptides on revised and novel proteins were also predicted by PrediSi and SignalP. As shown in Figure S6, there were similar motifs among the predicted, revised and novel proteins. We detected the conserved alanine and valine at the -1 and -3 positions on predicted, revised and novel proteins that predicted by SignalP and PrediSi, excepted the novel proteins from PrediSi (Table S24- signal peptide cleavages. We anticipated that these identified signal peptides likely play a role in A. flavus secondary metabolite transport.

Discussion
In this study, we presented the first comprehensive draft map of the A. flavus proteome. This is the first study we are aware of to have employed such a proteomic approach to improve Aspergillus genomic annotation. Importantly, we were able to leverage these results to gain novel insights into the mechanisms of aflatoxin synthesis and A. flavus pathogenicity, thus potentially highlight novel strategies for controlling aflatoxin contamination.
Through our proteogenomic analysis, we were able to definitively identify 732 novel protein-coding genes in A. flavus. Interestingly, some of these genes were relatively small in length, potentially explaining why they were overlooked in previous in silico analyse (8)(9)(10). Of these novel proteins, roughly 50% were conserved among 66 Aspergillus species (Fig. 4A and Table S14). The conserved novel genes (32) that could be detected having orthologs in 65 Aspergillus strains were selected for further functional annotation. We noted that these conserved proteins were involved in metabolic, reproductive, and developmental processes (Fig. S7). It seems that some of these proteins may play conserved roles in Aspergillus growth and survival, as such metabolic networks are critical in Aspergillus species (69)(70)(71)78). In order to remain an effective pathogen, it is vital that Aspergillus be able to adapt to environmental stressors. In our proteogenomic analysis, the samples were prepared by guest on January 11, 2021 https://www.mcponline.org Downloaded from 30 from 8 different stress treatments and many novel genes were differently expressed in response to these conditions (Fig. 1). As such, we hypothesized that novel proteins identified in A. flavus cells under stress conditions may be related to the survival, growth, pathogenicity, and stress response behavior of A. flavus in the infected host cell. In addition, the novel proteins involved in aflatoxin synthesis and export were also identified in the present work (Table S17). For instance, the novel proteins involved in carbon metabolism would regulate the flux of acetyl CoA, which is the initial substrate of aflatoxins biosynthesis (79). The vacuole-related novel proteins may affect the biosynthesis of aflatoxins and their consequent export by regulating the vacuolar homeostasis (66,67,80). Although the detailed regulatory mechanism of these novel proteins remains unclear, the identified novel proteins in this work provide a rich source to facilitate our understanding of the stress responses and aflatoxins biosynthesis in A. flavus.
Despite widespread recognition of the importance of PTMs in various biological processes (81)(82)(83)(84), there still lack of a holistic view of PTMs events on Aspergillus proteins and the functional roles of many types of PTMs have not yet been uncovered.
In this study, we identified 8,724 annotated protein-coding genes with high confidence and a significant number of these proteins were found to be modified with different  (Table S23). To the best of our knowledge, many important PTMs (such as acetylation, succinylation, propionylation and malonylation) could provide a mechanism to respond to changes in the energy status of the cell and regulate metabolic pathways (86)(87)(88)(89)(90)(91)(92). In addition, we have previously provided the first evidence that phosphorylation, methylation, acetylation, succinylation, ubiquitination and sumoylation may be a mechanism involved in aflatoxin biosynthesis by regulating the activities of enzymes in A. flavus (72,73,85,(93)(94)(95). Therefore, the identified PTM datasets from our work may provide a foundation for studying the biological functions of these PTM events in metabolic pathways and aflatoxin biosynthesis. Further experimental procedures are also needed to confirm the speculation and understand the potential regulatory mechanism of these PTMs in A. flavus. Moreover, secondary metabolites, produced by Aspergillus, exhibit harmful property for crops and humankind. Although only three secondary metabolite clusters have been characterized, those enzymes in the biosynthesis of secondary metabolites were also subjected to be modified with diverse PTMs, implying the importance of PTMs in the pathway. Together, our data sets provide an important resource for further functional analysis of these PTMs in A. flavus.
In summary, the proteogenomic strategy which we employed in this study allowed us to provide the high-quality comprehensive annotation of the A. flavus genome. The by guest on January 11, 2021 https://www.mcponline.org Downloaded from 32 results of this study will serve as a valuable resource for future efforts to explore the the entire networks governing the aflatoxin production in A. flavus.