December 28, 1998
Advisor: Guy Montelione
481 Advanced UG Research
The progress of protein structure determination is
evident from the increasing rate at which protein and protein subunit structures
are being added to the PDB every day1. With this progress, structural biologists
are questioning how they choose which proteins or protein
subunits to investigate so that their discoveries have the greatest
impact on existing protein structure knowledge. Since domains seem
to be the functionally independent pieces of proteins and are evolutionarily
conserved, a domain approach to structural characterization makes sense
and has been practiced for decades. However, how does one parse these domains
and choose which of them to investigate?
We choose to use sequence conservation to identify domains, and the subset of domains that we will structurally characterize will be those that have no known biochemical function. These pieces of "orphan genes" (Holm & Sander, 1996) have a relatively high potential to account for a substantial number of the remaining protein folds needed to complete the estimated 1000-1500 folds that populate life (Chothia, 1992).
We have decided to examine the nucleic acid sequence data resulting from large-scale sequencing projects in order to identify some of these domains for investigation. Tatusov et al. (1997) have recently shown that analysis of eight completely sequenced genomes can result in designation of groups of genes that show homology across at least three of the organisms. It is hypothesized that the genes conserved among these completely sequenced genomes may be the basic set of genes necessary for life. As an organism becomes more complicated, the set of genes becomes more complicated through mutational events (Tatusov et al., 1997). These clusters of orthologous groups (COGs) are the foundation for our survey of conserved domains among distinct organisms.
We have expanded upon the work of Tatusov et al. (1997) in order to look for members of these COGs in all domains of life. This analysis provides a source of domains that fall within the size limits of NMR structure determination for us to characterize structurally and functionally. In this paper, the computational biology approach to this goal is described as it was practiced on two of the COGs.
Protein sequence acquisition/comparison
A subset of the NCBI COG website2 was examined to determine which of the uncharacterized or poorly characterized COGs should be studied. COGs 0316S and 0011S (S stands for unknown function) were chosen because they have no function associated with their members and are within the size constraints for structural determination by NMR. The sequences for each of the COGs were obtained from the COG website, and a multiple sequence alignment (msa) was performed using the ClustalW algorithm of NCSA Biology Workbench3. The alignment was analyzed for similarities between the sequences, the positions of strongly conserved residues, and the boundaries of the potential domain. This analysis informed decisions about the sequence features of the COG members. In addition, each of the COG members (in the case of 0011S) or a few of the COG members and a derived consensus sequence of the COG (in the case of 0316S) were searched against the PDB4 database of NCBI using BLAST5. This was done to see whether COG members had common homologues in the PDB. None of the searches showed common homologues, suggesting that neither COG is represented in the PDB.
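The inspection of the alignment for strongly conserved positions and domain boundaries can be illustrated with a small sketch. It is not the procedure used with the Biology Workbench; it assumes the alignment is already available as equal-length gapped strings, and the function names and the 50% occupancy cutoff are hypothetical choices made only for illustration.

```python
from collections import Counter

def column_conservation(aligned, gap="-"):
    """For each alignment column, return (most common residue,
    fraction of all sequences carrying that residue)."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != gap)
        if not counts:
            # Column made up entirely of gaps.
            result.append((gap, 0.0))
            continue
        residue, n = counts.most_common(1)[0]
        result.append((residue, n / len(column)))
    return result

def domain_bounds(aligned, min_occupancy=0.5, gap="-"):
    """First and last column (0-based, inclusive) in which at least
    min_occupancy of the sequences have a residue rather than a gap.
    The cutoff is an illustrative assumption, not a value from the text."""
    occupied = [
        i for i, column in enumerate(zip(*aligned))
        if sum(c != gap for c in column) / len(column) >= min_occupancy
    ]
    return (occupied[0], occupied[-1]) if occupied else None

# Toy alignment (made-up sequences, not COG members).
toy = ["--MKVLA-", "A-MKILAG", "--MRVLA-"]
conservation = column_conservation(toy)
bounds = domain_bounds(toy)
```

Columns with a fraction near 1.0 are candidates for strongly conserved residues, and the bounds give a rough estimate of the region shared by most sequences.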
The next step was to try to expand the COG using the information contained in the msa. Protein sequences were searched first using HMMER6. This program uses the information contained in the msa to construct a hidden Markov model (HMM), a statistical profile of the residue preferences at each position of the alignment. The HMM was used to search the nonredundant (nr) database of NCBI7. Those sequences reported as possible homologues were recorded. These sequences were aligned against the COG members to determine whether they should be included in the COG. A phylogenetic tree was created with the NCSA Biology Workbench to help determine how closely the possible additional members are related to the COG members. Those that appeared to fit the COG were retained in the analysis.
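The idea of turning an alignment into a position-specific model and scanning candidate sequences with it can be sketched as follows. This is a deliberately simplified stand-in, not HMMER and not a full HMM: it has no insert or delete states, it assumes a uniform amino acid background, and the pseudocount of 1 and all function names are illustrative assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores (base 2) against a uniform background.
    Gaps and non-standard characters are ignored when counting."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def best_window(profile, sequence):
    """Slide the profile along the sequence without gaps and return
    (best score, start index), or None if the sequence is too short."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Unknown residues contribute a neutral score of 0.
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best

# Toy alignment and candidate (made-up sequences).
profile = build_profile(["MKVLA", "MKILA", "MRVLA"])
hit = best_window(profile, "GGGMKVLAGGG")
```

A candidate whose best window scores well above what unrelated sequences achieve would then be aligned against the COG members for closer inspection, as described above.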
Protein function analysis & additional search methods
Swiss-Prot8 was used to examine the possible cellular or biochemical function of any of the COG members or the possible COG additions. When there was no Swiss-Prot entry for a sequence, its NCBI report was examined. When necessary, the literature was searched for articles concerning the sequence.
The Swiss-Prot reports of COG members and those sequences to be added to the COG were examined for ProDom and ProSite information. ProDom9 contains a list and cartoon representation of all Swiss-Prot entries containing a sequence domain common to a query entry. ProSite10 contains Swiss-Prot sequences that are grouped by common sequence motifs; it also provides annotated function for those sequences of the group.
Some of the nucleic acid sequence data, especially for eukaryotes, does not appear to have reached the protein sequence databases yet. For that reason, BLAST searches against the NCBI EST database11 were performed using a hand-derived consensus sequence based on the msa of the original COG members. The possible EST homologues were examined for supporting, overlapping EST sequences by consulting the NCBI Unigene database in the cases of rat, mouse, and human12 and by doing BLAST searches of the NCBI EST database using the strongly matching EST(s) as query. Since EST sequences are known to have an error rate of around 1-2%, a consensus sequence was created (when possible) for the hypothetical domain in the overlapping EST sequences by building a contig using Baylor College of Medicine's (BCM) CAP feature13. In addition, the organism's genomic DNA database websites were searched when possible to verify the EST sequence. The contig was translated in its six possible reading frames using the BCM site, translations with stop codons in the hypothetical domain were excluded, and the best translation was compared with the original COG members in an msa.
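The six-frame translation and stop-codon filter described above can be sketched in a few lines. This is not the BCM tool; it assumes the standard genetic code, and it simplifies the filter by rejecting a frame with a stop codon anywhere in the translation rather than only within the hypothetical domain. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; ambiguous codons become 'X',
    stop codons become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Translations of frames +1..+3 and -1..-3, keyed by frame name."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

def frames_without_stops(dna):
    """Keep only frames whose translation contains no stop codon."""
    return {name: pep for name, pep in six_frames(dna).items() if "*" not in pep}

# Toy contig (made-up sequence, not an EST from this study).
frames = six_frames("ATGAAAGTTCTGGCA")
open_frames = frames_without_stops("ATGAAAGTTCTGGCA")
```

The surviving translations would then be aligned against the original COG members to pick the frame that best matches the conserved region.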
An msa was created to display the original COG members and the new additions. A phylogenetic tree was created to graphically display the evolutionary similarities between the sequences. This information was used to determine which of the members to target for protein expression, structural determination, and biochemical/cellular function assay.
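Tree-building methods of the distance-based kind typically start from pairwise distances between aligned sequences. The sketch below shows one simple way such distances could be computed; it is an illustration under stated assumptions, not the method used by the Biology Workbench. It assumes equal-length gapped strings, ignores columns where either sequence has a gap, and uses 1 minus fractional identity as the distance.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where both sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def distance_matrix(named):
    """Map every ordered pair of names to 1 - fractional identity."""
    names = list(named)
    return {
        (p, q): 1.0 - percent_identity(named[p], named[q]) / 100.0
        for p in names
        for q in names
    }

# Toy aligned sequences with hypothetical labels.
toy = {"seqA": "MKVLA-", "seqB": "MKILAG", "seqC": "MRVLA-"}
distances = distance_matrix(toy)
```

Sequences that sit far from all others in such a matrix would appear on long or peripheral branches of the resulting tree, which is the kind of signal used here to choose or exclude members.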
Results & Discussion
The msa of the original COG members and the additional peptides suspected to show homology to the COG is presented in Figure 1. The coloring highlights the conservation at certain positions. This conservation information may help us test which residues are necessary for proper folding, structure, and biochemical function. Two of the sequences, Ecol4 and Hinf3, contain a large insertion towards the C-terminus of the overlap. This, together with their noncentral position in the phylogenetic tree representation of the msa (see Figure 2), gives reason to believe that these two sequences may be paralogues of the COG and should be excluded for our purposes. However, their good overall similarity to the COG cannot be ignored, and it may be that the two are paralogues of the COG but orthologues of each other (as indicated in Figure 2). The msa shows the region of overlap among these 37 members (excluding Hinf3 & Ecol4) to be approximately 104 residues, which we hypothesize corresponds to an evolutionarily mobile peptide subunit, such as a structural domain or module. The presence of this COG in organisms ranging from bacteria to humans makes it very interesting. However, it is important to note that the members that have been added to either COG examined here may not be orthologues, because the genomes of the respective organisms have not been completely sequenced. The unsequenced part may contain the true orthologue that belongs in the COG.
Inspection of the phylogenetic tree guided our choice of 6 peptides to express (as indicated in Figure 2 and Table 1). We are distributing the targets as 3 prokaryotes and 3 eukaryotes and are trying to cover all main branches of the phylogenetic tree. Since these 6 homologues fall into distinct groups (as indicated by the branches of the tree), we may see structural/functional preservation within a subgroup (branch) but slight or possibly large structural/functional differences across the subgroups. As seen in the tree, we will analyze the protein properties of three of the four subgroups. The last subgroup, which consists of fairly "exotic" species, is a possible target for future analysis.
Throughout the course of the analysis, we came upon sequences that appeared to fit the COG partially (see Table 2). Constraints in their EST sequences caused their exclusion from the COG. However, they may actually be excluded from the COG because of human errors or technical limitations (e.g. sequencing errors that artificially produced a stop codon in the middle of the ORF; a partial reverse transcription of the mRNA or a low-fidelity PCR reaction that led to incomplete synthesis of the cDNA). These peptides will be "under watch" and may be incorporated into the COG if further EST or genomic data allows.
COG 0011S is very different from the previous one. Its population was originally low, and only about nine sequences could be added to it. It is hard to judge whether sequences should be excluded, given the small population. However, the phylogenetic tree (Figure 4) and the msa (Figure 3) make good arguments for the exclusion of Aful2 and Cper1. Bsub1 may or may not belong in the COG, depending on which residues are critical to the COG. Determining the significant residues of the COG is a difficult task. One would think that whatever is constant among the original COG members would set the constraints. However, sometimes all of the original COG members but one are consistent with the possible additional COG members; that is, conservation is lacking within the COG itself.
This COG showed no similarity to translated ESTs, unlike 0316S. Consequently, this COG's only eukaryotic representation comes from its original yeast member. The region of overlap common to all members is approximately 97 residues long. For the moment, this COG will not be the subject of immediate expression efforts. However, it may be of clinical importance because of its apparent absence from higher eukaryotes, which may make it a possible antibiotic target in certain bacteria.
The two COGs described above are very different from each other: one is heavily populated and shows homology to sequences from many higher eukaryotes, while the other is the opposite in both respects. Perhaps this is a result of one gene being retained throughout the continuum of life and the other not. Currently, we are working on cloning and expressing the selected members of 0316S. A structural and functional investigation will follow that may result in the discovery of a novel fold that is important in all walks of life.
Chothia, C. (1992) Nature 357, 543-544.
Holm, L. & Sander, C. (1996) Science 273, 595-602.
Tatusov, R.L., Koonin, E.V. & Lipman, D.J. (1997) Science 278, 631-637.
1 Statistics page of http://pdb.pdb.bnl.gov