Identification of the Linker-SH2 Domain of STAT as the Origin of the SH2 Domain Using Two-dimensional Structural Alignment*

The availability of large volumes of genomic sequences presents an unprecedented proteomic challenge to characterize the structure and function of various protein motifs. Primary structural alignment is often unable to accurately identify a given motif due to sequence divergence; however, with the aid of secondary structural prediction for analysis, it becomes feasible to explore protein motifs on a proteome-wide scale. Here we report the use of secondary structural alignment to characterize the Src homology 2 (SH2) domains of both conventional and divergent sequences and divide them into two groups, Src-type and STAT-type. In addition to the basic “αβββα” structure (βB), the Src-type SH2 domain contains an extra β-strand (βE or βE-βF motif). Alternatively, the linker domain-conjugated SH2 domain in STAT contains the αB′ motif. Combining BLAST data from βB core motif sequences with predicted secondary structural alignment, we have screened for SH2 domains in various eukaryotic model systems including Arabidopsis, Dictyostelium, and Saccharomyces. Two novel genes carrying the linker-SH2 domain of STAT were discovered and subsequently cloned from Arabidopsis. These genes, designated as STAT-type linker-SH2 domain factors (STATL), are found in a wide array of vascular and nonvascular plants, suggesting that the linker-SH2 domain evolved prior to the divergence of plants and animals. Using this approach, we expanded the number of putative SH2 domain-bearing genes in Dictyostelium and comparatively studied the secondary structural profiles of both typical and atypical SH2 domains. Our results indicate that the linker-SH2 domain of the transcription factor STAT is one of the most ancient and fully developed functional domains, serving as a template for the continuing evolution of the SH2 domain essential for phosphotyrosine signal transduction.


Molecular & Cellular Proteomics 3:704-714, 2004.
The Src homology 2 (SH2) domain is an ~100-aa-long motif that recognizes and interacts with phosphotyrosine-containing motifs on the same or different protein molecules during signal transduction in animal cells. About 200 SH2 domain-containing genes have been identified in human cells, suggesting that this domain is one of the most rapidly expanded protein modules (1). In animal cells, SH2 domains are predominantly present in signaling molecules, i.e. signaling-related enzymes (including protein tyrosine kinases, protein tyrosine phosphatases, inositol phosphatase, and phospholipase) and signaling adapters. However, the SH2 domain has also been found in members of the STAT family of transcription factors (2,3). In a signaling molecule with catalytic activity, the SH2 domain is often conjugated immediately upstream with another functional motif such as the SH3 domain, whereas in STAT the linker domain lies immediately upstream of the SH2 domain. Recently, two STAT proteins have been discovered in Dictyostelium, a facultative slime mold capable of both growing as a single cell and differentiating into multicellular structures (4,5). More recently, SHK, an SH2 domain-bearing protein kinase, has been identified in the same species (6). In both cases, the SH2 domains are conjugated to a linker domain. Because a typical SH2 domain has not been found in single-cell eukaryotes such as yeast or in microorganisms, SH2 domain formation and its phospho-signaling were proposed to be coincident with animal evolution, perhaps critical during the transition from single-cell to multicellular animals, or metazoans (7,8).
The characteristic structure of the SH2 domain is three β-strands flanked by two α-helices (αβββα). The first β-strand (βB), conserved in its sequence GXF/YBBR (9), is the core motif critical for binding phosphotyrosine (pTyr) (10). This sequence is required for the normal function of the SH2 domain and conveniently serves as the fingerprint structure in SH2 domain recognition. While βB core motif and motif-like sequences exist widely in different genes, some perfect βB sequences are not necessarily indicative of the SH2 domain (11). In these cases, secondary structural alignment resolves the ambiguity left by sequence alignment alone. Additionally, secondary structural analysis can provide reliable structural evidence for SH2 domains with ambiguous βB and βB-flanking sequences. A careful analysis of the amino acid sequence, motif orientation, and secondary structural features indicates that STAT SH2 domains differ from those involved in signal transduction. In a STAT protein, the SH2 domain is the immediate extension of five continuous α-helices representing the linker domain. Moreover, all STAT-like SH2 domains carry an αB′ motif between the βD and αB sequences. We combined βB core motif sequence BLAST with secondary structural screening to identify SH2 domains in genome databases of various eukaryotic model systems. We identified and analyzed a novel gene family, which carries the linker-SH2 domain of STAT, from the genome database of Arabidopsis as well as other plants. Two of these linker-SH2 domain-carrying genes were cloned from the cDNA library of Arabidopsis and sequenced. Using secondary structural alignment, we comprehensively analyzed the typical and atypical SH2 domains found in plants, yeast, and Dictyostelium.
According to our secondary structural analysis, the linker-SH2 domain of SHK in Dictyostelium, representing the modern SH2 domain in signal transduction, may share a common ancestor with, or may even have evolved directly from, the linker-SH2 domain of STAT. The discovery of the linker-SH2 domain of STAT in plants supports the notion that this domain had developed prior to the divergence of the plant and animal kingdoms.

Pat-match Search and Secondary Structural Analysis-The consensus sequence used for Pat-match screening (www.arabidopsis.org/cgi-bin/patmatch/nph-patmatch.pl for the TAIR Arabidopsis genome and seq.yeastgenome.org/cgi-bin/SGD/PATMATCH/nph-patmatch for the yeast genome) is "GXF/YBBR" (X = any amino acid; B = hydrophobic amino acid). This consensus represents 98% of the βB core motif sequences, based upon analysis of 775 SH2 domain sequences in the SMART database using the Simple Modular Architecture Research Tool (SMART; smart.embl-heidelberg.de). The sequences flanking the βB motif in these putative genes were analyzed with the secondary structure prediction program 3D-PSSM web tool version 2.5.6 (www.sbg.bio.ic.ac.uk/~3dpssm/), which predicts α-helices and β-strands. When the overall α-helix/β-strand arrangement around the motif sequences was evaluated, rather than individual amino acid residues, all predicted α-helices and β-sheets agreed with the structural information obtained from crystallographic analysis.
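The consensus screen described above can be illustrated with a short regular-expression sketch. This is not the Pat-match implementation; it only shows how "GXF/YBBR" translates into a pattern. The set of residues counted as hydrophobic (B) is not listed in the text, so the `HYDROPHOBIC` set below is an assumption, as are the function name and the toy sequence.

```python
import re

# Assumption: the text defines B only as "hydrophobic amino acid";
# this residue set is illustrative and may differ from the one used.
HYDROPHOBIC = "AVLIMFWYC"

# G, any residue, F or Y, two hydrophobic residues, R.
# The lookahead lets overlapping candidates be reported as well.
BETA_B = re.compile(rf"(?=(G[A-Z][FY][{HYDROPHOBIC}]{{2}}R))")

def find_beta_b(seq):
    """Return (0-based start, matched 6-mer) for each candidate βB motif."""
    return [(m.start(), m.group(1)) for m in BETA_B.finditer(seq.upper())]

# Toy sequence, not taken from any database entry.
hits = find_beta_b("MKAGTFLVRESQ")
```

A hit from such a scan is only a candidate: the flanking sequence still has to be submitted to secondary structure prediction before it is treated as a putative SH2 domain.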
Cloning and Northern Blot of at-STATLa and at-STATLb-Both the at-statla and at-statlb genes were cloned from the Arabidopsis cDNA library. Primers from known flanking regions were used to clone the full-length sequences of at-statla and at-statlb. Various primer combinations were designed so that overlapping PCR products covered the full-length sequences of both genes. PCR products were subsequently sequenced, and the gene sequences of both at-statla and at-statlb have been deposited in GenBank. Total RNAs prepared from different parts of Arabidopsis were used for Northern blot analysis.
Western Blotting and Immunoprecipitation-Arabidopsis thaliana, ecotype Columbia, was grown on Murashige and Skoog agar medium at 22°C under constant light for 3 weeks. Sodium orthovanadate (100 μM) and hydrogen peroxide (1 mM) were added for the indicated times. Whole Arabidopsis plants were homogenized in ice-cold extraction buffer (30 mM Tris-HCl, pH 8.5, 150 mM NaCl, 1 mM EDTA, 20% glycerol, 1 mM dithiothreitol, and proteinase inhibitors). Cell debris was separated from soluble material by centrifugation at 18,000 × g for 5 min. Full-length GST-at-STATLa and the GST-at-STATLb-SH2 domain (596-692) were constructed, expressed, and purified as described previously (12). Purified glutathione S-transferase (GST) recombinant proteins were incubated with the above-prepared extracts. Extensively washed GST protein precipitates were then subjected to standard Western blotting procedures using horseradish peroxidase-conjugated anti-pTyr (pY20) and enhanced chemiluminescence (Amersham Biosciences, Piscataway, NJ) for detection.

RESULTS AND DISCUSSION
STAT-type SH2 Domain Differs from Src-type SH2 Domain at Secondary Structural Level-Using the secondary structural prediction program, we analyzed the SH2 domains available in the SMART protein database. As expected, all the SH2 domains examined contain the basic "αβββα" structure. The SH2 domains of signaling factors, including enzymes and adapters, exclusively contain the predicted βE or βE-βF motif, consistent with crystallographic findings (Fig. 1A) (13-16). The small β-strand, βE or βE-βF motif, has proven to be critical for protein/ligand recognition in pTyr signaling factors (10). However, the βF fragment is not always detectable or predictable, presumably due to the instability of this sequence in β-strand formation (15,17). According to the distance between the βC and βD motifs, the SH2 domain-carrying enzymes can be further divided into the long βC-βD loop and the short βC-βD loop groups (Fig. 1A). Src family members all belong to the long βC-βD group. In most of the adapters, the SH2 domains carry a putative short βC-βD loop. For transcription factor STAT, the length of the predicted βC-αB sequence varies among family members and is overall shorter than that of signaling factors (Fig. 1A). But the most striking difference between Src-type and STAT-type SH2 domains is that STAT or STAT-like SH2 domains do not contain the βE or βE-βF motif. Instead, they all contain the αB′ or a nonsplit αB motif (Fig. 1A), which is considered the critical region for STAT dimerization (16,18). All these secondary structural features of STAT-type SH2 domains obtained from secondary structural prediction are consistent with the findings from crystallographic studies (16,18). In Drosophila STAT, the SH2 domain carries the putative αB′ motif despite lacking the βD motif (Fig. 1A). Two putative STAT-like sequences (ce-STATa and ce-STATb) were previously found by homology alignment in Caenorhabditis elegans genome data (19). Our secondary structural analysis predicts that these SH2 domains also contain αB′ motifs (Fig. 1A). Therefore, the level of detail obtained here can readily distinguish subtle domain differences between the SH2 domains of STAT and those of signaling factors.
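The distinction drawn above can be expressed as a simple rule on the order of predicted elements. The sketch below assumes a three-state prediction string (H = helix, E = strand, anything else = coil) covering the SH2 domain region only, which is a common predictor output format but not necessarily that of the program used here. The function names, the minimum run length, and the example strings are hypothetical, and cases such as a nonsplit αB or a missing βD are deliberately not handled.

```python
import re
from itertools import groupby

def topology(ss, min_len=3):
    """Collapse a per-residue H/E/C string into an ordered element string.

    'a' stands for an α-helix, 'b' for a β-strand; runs shorter than
    min_len (an arbitrary threshold) are ignored as noise.
    """
    out = []
    for state, run in groupby(ss.upper()):
        n = len(list(run))
        if state in ("H", "E") and n >= min_len:
            out.append("a" if state == "H" else "b")
    return "".join(out)

def classify_sh2(ss):
    """Label a predicted SH2 region by what lies between βD and αB."""
    topo = topology(ss)
    if re.fullmatch(r"ab{3}b{1,2}a", topo):   # extra βE or βE-βF strand(s)
        return "Src-type"
    if re.fullmatch(r"ab{3}aa", topo):        # extra helix (αB′) before αB
        return "STAT-type"
    if re.fullmatch(r"ab{3}a", topo):         # bare αβββα core
        return "core only"
    return "no SH2 core"

# Hypothetical prediction strings, written by hand for illustration.
src_like = "CHHHHHCCEEEECCEEEECCEEEECCEEECCHHHHHC"
stat_like = "HHHHHCCEEEECCEEEECCEEEECCHHHCCHHHHHH"
```

In practice the call would come after the prediction has been trimmed to the candidate domain, because extra helices from a neighboring linker would otherwise change the element string.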
We also analyzed the SH2 domains with divergent sequences. We investigated CBL and JAK family members that are all known to contain a βB-like motif flanked by ambiguous sequences. For CBL, the typical "αβββα" topology obtained from secondary structural prediction (Fig. 1B) agreed with the conclusion drawn in a previous crystallography study (20). However, unlike Src-type or STAT-type SH2 domains, the CBL SH2 domain lacks the small βE or αB′ motif. The sequence immediately upstream of the kinase-like domain in JAK has long been suspected to be an SH2-like domain (21). Two-dimensional alignment clearly reveals that this sequence, though ambiguous, represents a typical Src-type SH2 domain (Fig. 1C). Among all the JAK members analyzed, Drosophila JAK contains the longest loop between the βB and βC motifs. When other programs such as PHD (maple.bioc.columbia.edu/predictprotein) and JPRED (www.compbio.dundee.ac.uk) were used for prediction, similar results were obtained (data not shown). Thus, secondary structural prediction is particularly suited for, and reliable in, the detection of such fine structural differences in protein motifs with divergent sequences.
FIG. 1. ... (19) were included for analysis. All the SH2 domains as indicated were submitted to secondary structure program analysis and aligned. The SH2 domains of the enzyme group are divided into long βC-αB and short βC-αB subgroups according to the distance between putative βC and αB motifs. The predicted α-helices are printed in cyan, and the predicted β-sheets are printed in yellow. The βB motifs are shown in red. In the STAT SH2 domain, the βB motif is followed by a phenylalanine (F) residue, which is highlighted in blue. Hs-Src-Crys1 and Hs-Src-Crys2 represent the secondary structure obtained from two independent crystal studies of Src (16,17). B, the divergent amino acid sequence of the CBL-SH2 domain (U26170) was analyzed with secondary structure program and compared with the features obtained from crystallographic study (20). C, the suspected SH2-like sequences of human Jak1 (M64174, M35203), Jak2 (AF058925), Jak3 (U09607), Tyk2 (X54637), and Drosophila JAK (L26975) were submitted to secondary structure program analysis and aligned with each other.
STATLs Are Novel Genes Carrying the STAT-like Linker-SH2 Domain in Plants-To locate putative SH2 domains in plants, we used Pat-match to screen the genomes of various eukaryotes for the βB core motif (see "Materials and Methods"). In the Arabidopsis Information Resource (TAIR; www.arabidopsis.org), we found 604 βB sequences in 583 putative genes, of which secondary structural analysis confirmed two putative genes containing the typical "αβββα" structure of an SH2 domain [AC007651 (protein locus: AAD50031); AC007260 (protein locus: AAD30582)]. Subsequent cDNA cloning and DNA sequencing analysis indicated that these two genes are closely related to each other (65% identical at the amino acid level) (Fig. 2A). In both genes, the C termini are longer than, and divergent from, those predicted from the published genomic sequences. The SH2 domains reside in the C-terminal regions, and their predicted secondary structures match those found in STAT (Fig. 2B). Moreover, a sequence of 90 amino acids immediately N-terminal of the SH2 domain is also well conserved and resembles STAT's linker domain (Fig. 2B) (16,18,22). We therefore named the two novel genes STAT-type linker-SH2 domain factor a and b (STATLa and STATLb). The linker-SH2 domain is well conserved in putative STATL genes identified in both monocot and dicot plants including soybean, sorghum, potato, tomato, medicago, wheat, and rice (Fig. 2C). Surprisingly, STATL sequences were also retrieved from lower plants: the moss Physcomitrella patens (pp-STATL), from NCBI translated BLAST searches, and the green alga Chlamydomonas (cr-STATL), from the Chlamydomonas Resource Center (www.biology.duke.edu/chlamy_genome/crc.html) (Fig. 2C). The ubiquitous presence of STATL in plants suggests that the SH2 domain plays a fundamental role in plants and animals and argues against the possibility that this domain originated in plants through some accidental means (i.e. reverse horizontal gene transfer) or coevolved with tyrosine kinases or the SH3 domain (23).
Amino acid sequence alignment indicates that the linker-SH2 domain identity is 29% between at-STATLa and hs-STAT3 and 33% between at-STATLb and hs-STAT3. About the same range of amino acid identity was obtained when the same domain of dd-STATa (29%) or dd-STATc (30%) was aligned with that of hs-STAT3. In the SH2 domains of all STAT proteins regardless of the species, the βB motif is exclusively followed by a phenylalanine. All STATL members follow this rule without exception (Fig. 2C), whereas in the SH2 domains of signaling factors the βB motif has never been found to be followed immediately by phenylalanine. Another protein sequence feature is that, like most STAT members, the first residue of the αA motif in STATL is also a lysine rather than an arginine, which coordinates with a phosphate group in metazoan STAT (16). The secondary structure of STAT's linker-SH2 domain (the five α-helices and the "αβββα" sandwich) (16,18) is well conserved in at-STATL proteins according to our prediction (Fig. 2B). When aligned with STAT, sequence gaps in at-STATLa and at-STATLb occur at neutral positions and do not interrupt the arrangement of α-helices and β-strands. For instance, a 9-aa stretch (ENMAGKGFS), absent in the linker domain of at-STATLa or at-STATLb, forms an out loop between α9 and α10 in the STAT linker domain, which does not interrupt the helicity between α9 and α10 (Fig. 2B) (16). For all STATL sequences obtained from different plants, the SH2 domains do not carry the βE motif. Instead, they exclusively carry the αB′ motif or a nonsplit αB′αB (Fig. 2C), which matches the secondary structural characteristics of the STAT SH2 domain. The lack of the βD motif coupled with the presence of a large nonsplit αB in the SH2 domain of cr-STATL in Chlamydomonas may indicate a premature form of SH2 domain that arose during its development.
The region upstream of the linker-SH2 domain (Ser139-Pro232 in both at-STATLa and at-STATLb) is predicted to form a continuous β-sheet (Fig. 2D), which has some similarity to the DNA-binding domain of dd-STATa or dd-STATc but not that of human STAT (16,18,24). Therefore, the strong similarity between STAT and STATL within their linker-SH2 domains, at both the amino acid sequence and secondary structure levels, strongly indicates that these two domains share a common ancestor that evolved prior to the divergence of plants and animals.
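For readers who wish to reproduce identity figures of the kind quoted above from their own alignments, a minimal sketch follows. The text does not state how the denominator was chosen, so counting only columns in which neither sequence has a gap is an assumption; the function name and the toy alignment are likewise illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues between two rows of a pairwise alignment.

    Assumption: columns containing a gap in either row are excluded
    from both the match count and the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (x, y)
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x != gap and y != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

# Toy alignment rows, not real STAT or STATL sequences.
identity = percent_identity("GTFLVR-ES", "GSFLIRQES")
```

Other conventions, such as dividing by the full alignment length or by the shorter sequence, give lower values for gapped alignments, so the convention should be reported alongside any percentage.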
Messenger RNAs of both genes were detected ubiquitously in different parts of Arabidopsis (Fig. 2E). Plants are not known to contain JAK-like nonreceptor tyrosine kinases (25). However, receptor protein kinases, serine/threonine plus tyrosine dual-function protein kinases, as well as protein phosphatases exist in plants (13,26,27). In Fig. 2F, tyrosine-phosphorylated proteins ranging in size from 60 to 120 kDa were detected in Arabidopsis treated with vanadate, a naturally occurring transition metal that can function as a nonspecific protein phosphatase inhibitor and trigger protein tyrosine phosphorylation in cells (28). Purified GST-STATLa-full length and GST-STATLb-SH2 domain proteins but not the GST control (Fig. 2G, right panel) were able to pull down the 120-kDa tyrosine phosphorylated protein in vanadate-treated samples but not in the samples without vanadate treatment (Fig. 2G, left  panel). Therefore, the SH2 domain of STATL proteins might be involved in pTyr-dependent protein-protein interaction in plants.
SHK's Linker-SH2 Domain Is Homologous to That of STAT in Dictyostelium-How does the STAT-type linker-SH2 domain relate phylogenetically to the tyrosine signaling- or Src-type SH2 domain, which quickly expanded in number in animal cells? To answer this question, we studied the database of Dictyostelium, a slime mold considered more closely related to fungi and animals than to plants (29). SH2-bearing genes cloned from Dictyostelium include two transcription factors (i.e. STATa and STATc) and one signaling factor (i.e. SHK1) (4-6). From the Dictyostelium discoideum Genome Project (www.sanger.ac.uk/Projects/D_discoideum), two additional putative STAT sequences, designated as dd-STATb and dd-STATd, and four additional putative SHK genes, designated as dd-SHK2, dd-SHK3, dd-SHK4, and dd-SHK5, were identified using the βB core motif sequence as well as the whole SH2 domain for BLAST searches. Like SHK1, the protein kinases of these novel putative SHK members are most closely related to the protein kinases found in plants (6). However, these same kinases in plants are not conjugated to any SH2 or SH2-like sequences. Using the kinase domain (C-region) of the putative SHK2 for a BLAST search, a large number of homologous expressed sequence tags (ESTs) were identified in the databases of both Arabidopsis and Dictyostelium, but not in the databases of other eukaryotes (not shown). This suggests a close evolutionary relationship between plants and Dictyostelium.
Primary and predicted secondary structure alignment indicates that the SHK SH2 domains carry some features of the STAT SH2 domains in Dictyostelium. Phenylalanine (F) has been found to follow the βB motif in the SH2 domains of SHK1, SHK2, and SHK3 (Fig. 3A). However, secondary structural modifications that contrast with STAT have been noted in SHK's SH2 domain. In the region between the predicted βD and αB motifs, the αB′ motif has been replaced by the βF or βE-βF motif. The introduction of βE in SHK5 lengthened the distance between the αA and αB motifs (Fig. 3A). This seems to reflect the trend of SH2 maturation because this distance between the αA and αB motifs has become even longer in the Src SH2 domain in metazoans (Fig. 1A). In this region, both sequence similarities and gaps were observed, suggesting that the accumulation of favorable mutations, insertions, and deletions might all contribute to the α-helix/β-sheet switch (Fig. 3A). Thus, this αB′/βE-βF-containing region in the SH2 domain serves as an evolutionarily active region (EAR) within an otherwise conserved domain essential for its function. When STATc's linker domain was used for a BLAST search, the sequence between the protein kinase domain and the SH2 domain (the linker) of SHK was recovered, suggesting a close relationship among these molecules within this region. SHK's linker domain is predicted to contain an α-helix repeat composed of α7 to α10 motifs, which is indeed homologous to that of STAT (Fig. 3B). The C-terminal three α-helices (α9, α10, and α11) formed a large α-helix immediately upstream of the SH2 domain. While the linker domains of most SHK members are relatively similar in size, SHK5's linker domain is apparently much longer due to homopolymerism, a characteristic of many genes found in Dictyostelium (4,7). Compared with STAT or STATL, the predicted helical character of the linker domain is degenerating in SHK (Fig. 3B), perhaps due to a functional regression of this domain in tyrosine signaling.
FIG. 3. The SH2 domain coevolves with the linker domain but not with the SH3 domain. A, the SH2 domains of SHK and STAT members from D. discoideum were analyzed with secondary structure program and aligned. SHK family members include dd-SHK1, putative dd-SHK2 (Contig13319, Sanger Center), putative dd-SHK3 (JAX4a118b12.r1), putative dd-SHK4 (JC2d33h04.s1), and putative dd-SHK5 (Contig12179, Sanger Center). STAT family members include dd-STATa, dd-STATc, putative dd-STATb (Contig17339, Sanger Center), and putative dd-STATd (JC1b225h10.r1). In dd-STATd, an S/T-rich sequence (KDSLSKSSNDKLLQSPTTTTTTTTS) between βC and βD was omitted. B, the linker domains of dd-SHK and dd-STAT members were analyzed with secondary structure program and aligned. C, the SH3 domains of human Src (hs-Src, P12931), yeast NAP1-binding protein (sc-NBP, YDR162C), a putative gene of Dictyostelium (dd-SH3a, C94356), and three putative genes of A. thaliana (at-SH3a, AAG5264; at-SH3b, AAL32440; and at-SH3c, AAL32439) were analyzed with secondary structure program and compared with the structural features of the SH3 domain obtained from crystallization (15).
Although the linker domain in STAT may play a role in transcription (22), it was either lost or replaced by the SH3 domain in SH2-bearing signaling proteins of animal cells. For a long time, the linker domain of metazoan STAT was mistaken for the SH3 domain based upon amino acid sequence alignment (2). The SH3 domain consists of five β-strands arranged as two tightly packed anti-parallel β-sheets (14,15). Our secondary structural analysis revealed the typical SH3 domain structure in the C-terminal regions of three putative Arabidopsis genes among 14 putative SH3-carrying genes under both plant and prokaryote categories given by the SMART protein database (Fig. 3C). The SH3 domains of these three putative genes showed either an α-helix or an α/β-hybridized motif in the region of the βe motif (Fig. 3C), suggesting that this region is a putative EAR. Although a weak sequence homology exists between the STAT linker domain and the SH3 domain (2), the presence of well-developed SH3 domains as well as the fully developed linker-SH2 domain in independent genes in plants indicates that the SH3 domain is unlikely to have evolved from the linker domain. While SH3 domain-carrying genes exist in plants and have multiplied quickly in number in lower eukaryotes such as Dictyostelium and yeast, SH3-SH2 domain conjugation has not been discovered in these organisms.
SPT6 Gene Carries a Putative Immature Linker-SH2-like Domain-We next studied the yeast genome, in which no typical SH2 domains have been identified (30). Using the same approach, we identified 89 βB sequences from 86 putative genes in the Saccharomyces genome database (genome-www.stanford.edu/Saccharomyces). The suppressor of Ty 6 gene (SPT6) attracted our attention after secondary structural analysis of all these sequences. SPT6 was previously reported in yeast and animal cells and is involved in transcriptional initiation and DNA/RNA binding (31,32). Using the yeast SPT6 protein sequence for a BLAST search, putative SPT6 genes were identified in plants, Dictyostelium, and other eukaryotes (NCBI BLAST searches), suggesting that SPT6 is also an ancient gene that existed prior to the divergence of plants and animals. The conserved third residue should be either Phe or Tyr in the standard βB sequence GXF/YBBR (Fig. 4); however, sc-SPT6 of yeast is the only one to follow this rule (Fig. 4). Nevertheless, the putative αβββα-like structure, albeit less typical, is well maintained in all SPT6 proteins according to our secondary structural prediction, supporting the suspicion of a degenerate SH2 domain in SPT6 genes (33,34). The αA, βB, and βC motifs that compose the evolutionarily inactive region in the SH2 domains are conserved in SPT6s regardless of the origin. The putative αB′αB helical structure in SPT6 does not split as it does in STAT's SH2 domain. In the suspected EARs of most SPT6s, a short β-strand and an α-helix are predicted, and the βD motif was not fully extended according to the prediction analysis. Such a poorly developed putative βD motif may not efficiently form an anti-parallel β-sheet with βB and βC, and may eventually hamper its function in protein-protein interactions.
The large Lys/Glu-rich helical structure in the linker domain of SPT6 resembles the ancient STATL-like α9-α10-α11 continuous α-helix, which became three discrete helices in the linker domain of human STAT or was discontinued in SHK (Figs. 2A and 3B). Moreover, in the region upstream of the linker domain, SPT6 contains a putative 6-β-strand repeat resembling the DNA-binding domain in both STAT and at-STATL (not shown). Interestingly, our secondary structure prediction analysis indicates that SPT6 and cr-STATL of the unicellular plant Chlamydomonas share some critical features at the secondary structural level. Both are predicted to bear a large nonsplit αB′αB motif and a poorly developed βD motif (Figs. 2B and 4). Therefore, it is possible that SPT6 and STAT are evolutionarily related.
In terms of amino acid sequence, the SH2 domains of SPT6, STAT, and JAK are among the most divergent. Unlike JAK family members, both SPT6 and STAT are transcription factors that contain special secondary structural characteristics, especially in the EAR sequence as seen above. Based on the phylogenetic alignment, SH2 domains can be grouped into two categories, i.e. STAT-type and Src-type (Fig. 5). SHK family members fall between the STAT-type and the Src-type but are closer to the STAT-type (Fig. 5). This strongly indicates a close relationship between the SHK and STAT families in their SH2 domains and further supports the notion that SHK's linker-SH2 domain evolved from STAT or STATL. In SHK, STAT, and SPT6, the linker-SH2 domains all reside exclusively in the C-terminal regions. The appearance in Dictyostelium of the SHK gene, which bears both the linker-SH2 domain and the kinase domain, started a new era toward pTyr signal transduction.
The combination of amino acid sequence screening with secondary structural analysis improves the accuracy of protein motif prediction. The βB core motif-like sequence and its surrounding sequences in the Rht-B1/D1 gene of wheat have been considered the first SH2 domain discovered in plants (35). Moreover, GAI, RGA, and SCR, all members of a putative transcription factor family termed GRAS, were suspected to be STAT-like factors in plants and to carry SH2 domains because of βB-like core sequences found in their C-terminal regions (11,35). However, according to our secondary structural prediction, a "βαα" rather than an "αβββα" topology is present in all those SH2-like sequences. Therefore, a perfect βB sequence does not necessarily reveal an SH2 domain structure. In contrast, as long as the "αβββα" secondary structure is maintained, the amino acid sequence can be very divergent. The balance between evolution and conservation in SH2 domain development reflects evolution at the primary structural level but conservation at the secondary structural level.
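The two-step logic summarized above, a motif hit that counts only when it sits in the right structural context, can be sketched as a position-aware filter. This is an illustration under stated assumptions rather than the procedure used in this study: the prediction is taken to be a per-residue H/E/C string of the same length as the sequence, the hydrophobic residue set for B is assumed, and the function names, run-length threshold, and example inputs are hypothetical.

```python
import re
from itertools import groupby

# Assumption: illustrative hydrophobic set for the "B" positions.
BETA_B = re.compile(r"(?=(G[A-Z][FY][AVLIMFWYC]{2}R))")

def elements(ss, min_len=3):
    """Return [(state, start, end)] for helix (H) and strand (E) runs."""
    out, pos = [], 0
    for state, run in groupby(ss.upper()):
        n = len(list(run))
        if state in ("H", "E") and n >= min_len:
            out.append((state, pos, pos + n))
        pos += n
    return out

def sh2_like_hits(seq, ss):
    """Keep βB motif hits lying in the first strand of an α-β-β-β(-β-β)-α run."""
    if len(seq) != len(ss):
        raise ValueError("sequence and prediction must have equal length")
    elems = elements(ss)
    states = "".join(state for state, _, _ in elems)
    hits = []
    for m in BETA_B.finditer(seq.upper()):
        start, end = m.start(), m.start() + 6
        for i, (state, s, e) in enumerate(elems):
            overlaps = state == "E" and s < end and start < e
            # Preceding helix, three to five strands, then a helix.
            if overlaps and i >= 1 and re.match(r"HE{3,5}H", states[i - 1:]):
                hits.append((start, m.group(1)))
                break
    return hits

# Hand-made examples: the same motif in an αβββα and in a βαα context.
seq_sh2 = "AAAAAASSGTFLVRSSAAAASSAAAASSAAAAA"
ss_sh2 = "HHHHHHCCEEEEEECCEEEECCEEEECCHHHHH"
seq_gras = "SSGTFLVRSSAAAAASSAAAAA"
ss_gras = "CCEEEEEECCHHHHHCCHHHHH"
```

With these inputs the first sequence retains its motif hit and the second does not, which mirrors the contrast between a true SH2 core and a "βαα" context carrying a perfect βB sequence.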
The discovery of STAT linker-SH2 domain-bearing genes in plants underscores the proposal that SH2 domain development was an early step in the evolution of multicellularity. The SH2 domain formed most likely in a common eukaryote ancestor prior to divergence of any of the major eukaryote taxa. Hence, the linker-SH2 domain of the transcription factor STAT has been placed on center stage of transcriptional activation prior to the development of the SH2 domain into pTyr signal transduction (36,37).