Ram Mani
December 28, 1998
Advisor: Guy Montelione
481 Advanced UG Research
The progress of protein structure determination is
evident from the increasing rate at which protein and protein subunit
structures are being added to the PDB every day1. With this progress,
structural biologists are questioning how they choose which proteins or
protein subunits to investigate so that their discoveries may have the
greatest impact on existing knowledge of protein structure. Since domains
seem to be the functionally independent pieces of proteins and are
evolutionarily conserved, a domain approach to structural characterization
makes sense and has been practiced for decades. However, how does one parse
these domains and choose which domains to investigate?
We choose to use sequence conservation to identify
domains, and the subset of domains that we will structurally characterize
will be those that seem to have no known biochemical function. These pieces
of "orphan genes" (Holm & Sander, 1996) have a relatively high potential
to account for a substantial number of the remaining protein folds needed
to complete the estimated 1000 - 1500 folds that populate life (Chothia,
1992).
We have decided to examine the nucleic acid
sequence data produced by large-scale sequencing projects in order to
identify some of these previously described domains to investigate.
Tatusov et al. (1997) recently showed that analysis of eight completely
sequenced genomes can define groups of genes that show
homology across at least three of the organisms. It is hypothesized that
the genes conserved among these completely sequenced genomes may be the
basic set of genes necessary for life. As an organism becomes more complicated,
the set of genes becomes more complicated through mutational events (Tatusov
et al., 1997). These clusters of orthologous groups (COGs) are the foundation
for our survey of conserved domains among distinct organisms.
We have expanded upon the work of Tatusov et al.
(1997) in order to look for members of these COGs in all domains of life.
This analysis provides a source of domains that fall within the size boundaries
of NMR for us to structurally and functionally characterize. In this paper,
the computational biology approach to this goal will be described as it
was practiced on two of the COGs.
Experimental Methods
Protein sequence acquisition/comparison
Examination of a subset from the NCBI COG website2
was performed to determine which of the uncharacterized or not well-characterized
COGs should be examined. COGs 0316S and 0011S (S stands for unknown function)
were chosen because they have no function associated with their members
and are within the size constraints for structural determination by NMR.
The sequences for each of the COGs were obtained from the COG website and
a multiple sequence alignment (msa) was performed using the ClustalW algorithm
of NCSA Biology Workbench3. The alignment was analyzed for similarities
between the sequences, the positions of strongly conserved residues, and
the boundaries of the potential domain. This analysis guided decisions
about the sequence features of the COG members. In addition, each of the
COG members (in the case of 0011S) or a few of the COG members and a derived
consensus sequence of the COG (in the case of 0316S) were searched against
the PDB4 database of NCBI using BLAST5. This was done to see if COG members
had common homologues in the PDB. None of the searches showed common homologues,
suggesting that neither COG is represented in the PDB.
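As a rough illustration of how strongly conserved positions and a consensus sequence can be read off an alignment, the sketch below works column by column on already-aligned sequences. It is not the procedure used in this work: the toy alignment, the function names, and the 0.5 and 0.8 thresholds are all assumptions made for the example.

```python
from collections import Counter

def column_profile(aligned):
    """Return (most common symbol, its fraction) for each alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        symbol, n = counts.most_common(1)[0]
        profile.append((symbol, n / len(aligned)))
    return profile

def consensus(aligned, min_fraction=0.5):
    """Majority consensus; 'X' marks columns that are gap-dominated
    or where no residue reaches min_fraction (hypothetical threshold)."""
    return "".join(
        symbol if symbol != "-" and fraction >= min_fraction else "X"
        for symbol, fraction in column_profile(aligned)
    )

def conserved_positions(aligned, min_fraction=0.8):
    """Zero-based columns whose top residue reaches min_fraction."""
    return [
        i for i, (symbol, fraction) in enumerate(column_profile(aligned))
        if symbol != "-" and fraction >= min_fraction
    ]

# Toy alignment, invented for illustration (not COG data).
toy_msa = [
    "MKVLG-A",
    "MKILGDA",
    "MRVLG-A",
    "MKVIGEA",
]

print(consensus(toy_msa))
print(conserved_positions(toy_msa))
```

A consensus built this way is only as good as the alignment behind it, which is why the alignment itself was inspected by hand first.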
The next step was to try to expand the COG using
the information contained in the msa. Protein sequences were searched first
using HMMER6. This program uses the information contained in the msa to
construct a profile Hidden Markov Model (HMM), a statistical model of the
residue, insertion, and deletion probabilities at each position of the
alignment. The HMM was used to search the nonredundant (nr)
database of NCBI7. Those sequences reported as possible homologues were
recorded. These sequences were aligned against the COG members to determine
if they should be included in the COG. A phylogenetic tree was created
with the NCSA Biology Workbench to help judge how closely the possible
additional members are related to the existing COG members. Those that
appeared to fit the COG were retained in the analysis.
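The closeness judged from the tree can also be approximated numerically. The sketch below computes pairwise percent identity over gap-free columns of an alignment, a much cruder measure than the Biology Workbench tree used here. The sequence names and the toy sequences are hypothetical.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

def identity_table(named_sequences):
    """Map each unordered pair of names to its percent identity."""
    return {
        (name1, name2): percent_identity(seq1, seq2)
        for (name1, seq1), (name2, seq2) in combinations(named_sequences.items(), 2)
    }

# Hypothetical aligned sequences: two COG members and one candidate addition.
toy_alignment = {
    "memberA": "MKVLG-A",
    "memberB": "MKILGDA",
    "candidate": "MRV-GEA",
}

table = identity_table(toy_alignment)
for pair, value in table.items():
    print(pair, round(value, 1))
```

A candidate whose identity to the existing members is far below the identities among the members themselves would be a natural one to examine more closely before keeping it in the COG.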
Protein function analysis & additional search methods
Swiss-Prot8 was used to examine possible cellular
or biochemical functional significance of any of the COG members or the
possible COG additions. When there was no Swiss-Prot entry for a sequence,
its NCBI report was examined. When necessary, the literature was searched
for articles concerning the sequence.
The Swiss-Prot reports of COG members and those
sequences to be added to the COG were examined for ProDom and ProSite information.
ProDom9 contains a list and cartoon representation of all Swiss-Prot entries
containing a sequence domain common to a query entry. ProSite10 contains
Swiss-Prot sequences that are grouped by common sequence motifs; it also
provides annotated function for those sequences of the group.
EST acquisition/analysis
Some of the nucleic acid sequence data, especially
for eukaryotes, does not appear to have yet reached protein sequence databases.
For that reason, BLAST searches using a hand-derived consensus
sequence based on the original COG member msa were
performed against the NCBI EST database11. The possible EST homologues
were examined for supporting, overlapping EST sequences by consulting the
NCBI Unigene database in the cases of rat, mouse, and human12 and by doing
BLAST searches of the NCBI EST database using the strong matching EST(s)
as query. Since EST sequences are known to have an error rate of around 1-2%,
a consensus sequence was created (when possible) for the hypothetical domain
from the overlapping EST sequences by assembling a contig with Baylor College
of Medicine's (BCM) CAP feature13. In addition, the organism's genomic
DNA database websites were searched when possible to verify the EST sequence.
The contig was translated in its six possible open reading frames (ORFs)
using the BCM site. Translations with stop codons in the hypothetical domain
were excluded, and the best translation was compared with the original COG
members in an msa.
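The six-frame translation step can be sketched as follows. This is a minimal stand-in for the BCM web tool, assuming the standard genetic code. The contig is invented, the function names are hypothetical, and, as a simplification, a stop codon anywhere in a frame (rather than only within the hypothetical domain) causes that frame to be dropped.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Translations of the three forward and three reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

def open_frames(dna):
    """Keep only frames with no stop codon ('*') in the translation."""
    return {name: pep for name, pep in six_frames(dna).items() if "*" not in pep}

# Invented contig for illustration only.
toy_contig = "ATGAAAGTTCTGGGTGCT"

for name, peptide in six_frames(toy_contig).items():
    print(name, peptide)
```

The surviving frames would then be aligned against the original COG members to pick the best translation, as described above.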
Final Analysis
An msa was created to display the original COG members
and the new additions. A phylogenetic tree was created to graphically display
the evolutionary similarities between the sequences. This information was
used to determine which of the members to target for protein expression,
structural determination, and biochemical/cellular function assay.
Results & Discussion
COG 0316S
The msa of the original COG members and the additional
peptides suspected to show homology to the COG is presented in Figure 1.
The coloring highlights the conservation in certain positions. This conservation
information may aid us in testing what residues are necessary for proper
folding, structure, and biochemical function. Two of the sequences, Ecol4
and Hinf3, have a large insert towards the C-terminus of the overlap.
This, plus their noncentral position in the phylogenetic tree representation
of the msa (see Figure 2), gives reason to believe that these two sequences
may be paralogues of the COG and should be excluded for our purposes.
However, their overall good similarity to the COG cannot be ignored, and
it may be that the two are paralogues of the COG but orthologues of each
other (as indicated in Figure 2). The msa shows the region of overlap among
these 37 members (excluding Hinf3 & Ecol4) to be approximately 104 residues,
which we hypothesize corresponds to an evolutionarily mobile peptide subunit,
such as a structural domain or module. The prevalence of this COG in organisms
spanning from bacteria to humans makes it very interesting. However, it
is important to note that members that have been added to this COG and to
COG 0011S may not be orthologues because the genomes of the respective
organisms have not been completely sequenced. The unsequenced part may
contain the true orthologue that belongs in the COG.
Inspection of the phylogenetic tree guided our choice
of six peptides to express (as indicated in Figure 2 and Table 1). We are
distributing the targets as three prokaryotes and three eukaryotes and are
aiming to cover the main branches of the phylogenetic tree. Since these six homologues
are categorically different from each other (as indicated in the branches
of the tree), we may see structural/functional preservation within the
subgroup (branch) but slight or possibly large structural/functional differences
across the subgroups. As seen in the tree, we will analyze the protein
properties of three of the four subgroups. The last subgroup, which consists
of fairly "exotic" species, is a possible target for future
analysis.
Throughout the course of the analysis, we came upon
sequences that looked like they partially fit the COG (see Table 2). There
were constraints in their EST sequences that caused their exclusion from
the COG. However, they may actually have been excluded from the COG due to
human errors or limitations (e.g. sequencing errors that artificially produced
a stop codon in the middle of the ORF, a partial reverse transcription of
the mRNA, or a low-fidelity PCR reaction that led to incomplete synthesis
of the cDNA). These peptides will be "under watch" and may be incorporated
into the COG if further EST or genomic data allows.
COG 0011S
COG 0011S is very different from the previous one.
Its population was originally low, and only about nine sequences
could be added to it. Because of the small population, it is hard to judge
whether sequences should be excluded. However, the phylogenetic tree (Figure 4)
and the msa (Figure 3) make good arguments for the exclusion of Aful2 and
Cper1. Bsub1 may or may not belong in the COG, depending on which residues
are critical to the COG. Determining the significant residues of the COG is
a difficult task. One would think that whatever is constant among the critical
COG members would set the constraints. However, sometimes all of the original
COG members but one will be consistent with the additional, possible COG
members; that is, there is a lack of conservation within the COG.
This COG showed no similarity to translated ESTs,
unlike 0316S. Consequently, this COG's only eukaryotic representation comes
from its original yeast member. The region of overlap common to all members
is approximately 97 residues long. For the moment, this COG will not be
the subject of immediate expression efforts. However, it may be of clinical
importance because of its apparent absence from higher eukaryotes, which may
make it a possible antibiotic target for certain bacteria.
Conclusion
The two COGs described above are very different
from each other: one is heavily populated and shows homology
to sequences from many higher eukaryotes, and the other is the opposite in
both respects. Perhaps one gene has been retained throughout the evolution of
life while the other has not. Currently, we are working on cloning the selected
members of 0316S and expressing them. A structural and functional investigation
will follow that may result in the discovery of a novel fold that is important
in all walks of life.
References
Chothia, C. (1992) Nature 357, 543-544.
Holm, L. & Sander, C. (1996) Science 273, 595-602.
Tatusov, R.L., Koonin, E.V., & Lipman, D.J. (1997) Science 278, 631-637.
Online References
1 http://pdb.pdb.bnl.gov/ (statistics page)
2 http://www.ncbi.nlm.nih.gov/COG/
3 http://biology.ncsa.uiuc.edu/
4 http://pdb.pdb.bnl.gov/
5 http://www.ncbi.nlm.nih.gov/BLAST/
6 http://hmmer.wustl.edu/
7 http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html
8 http://www.expasy.ch/sprot/sprot-top.html
9 http://www.toulouse.inra.fr/prodom/doc/prodom.html
10 http://www.expasy.ch/sprot/prosite.html
11 http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html
12 http://www.ncbi.nlm.nih.gov/UniGene/index.html
13 http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html