Introduction In recent years, with the launch of the Human Genome Project and other genome projects, large numbers of new sequences have been generated and archived in databases. New techniques, including computational methods, are required to identify the biochemical and cellular functions of these novel gene products. Classifying conserved genes according to their homologous relationships provides a way to extract information from genomic data. Clusters of Orthologous Groups (COGs) were developed by Koonin and co-workers to classify gene families from microbes (and one yeast) (Science 278: 631, 1997). As a general strategy, we are interested in expanding the COGs to other organisms, including higher eukaryotic species.
Methods In this work, four COGs were studied. The original members of each COG family were obtained from the NCBI COG website (http://www.ncbi.nlm.nih.gov/COG/). The functions of three of the COGs (COG0325S, COG0398S and COG0217S) are unknown. The members of COG0615R are related to cytidylyltransferase. A multiple sequence alignment (MSA) was performed with ClustalW 1.7, provided by the Human Genome Center at Baylor College of Medicine (http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html). A Hidden Markov Model (HMM) was built from the MSA with HMMER 2.1 and then used to run a profile search against the non-redundant (nr) database at NCBI. In parallel, a profile (consensus sequence) was derived from the MSA and used to search SwissProt and nr with BLAST. The returned hits were studied by multiple sequence alignment and phylogenetic tree analysis using NCSA Biology Workbench 3.0 (http://biology.ncsa.uiuc.edu/). Searches were also carried out against the dbEST database at NCBI with profiles derived from alignments of the original COG members, or of the original members plus new members. EST hits were assembled (http://hercules.tigem.it/ASSEMBLY/assemble.html) and translated (http://dot.imgen.bcm.tmc.edu:9331/seq-util/seq-util.html), and a single translated peptide sequence representing each gene was then aligned with the other COG members. From this alignment, a phylogenetic tree was built to investigate the evolutionary relationships between members of the same COG. Finally, the results were summarized on our web site and made accessible to colleagues in the NMR group (http://www-nmr.cabm.rutgers.edu/~mani/cogs). Please refer to the lab protocol for more details.
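The step of deriving a profile (consensus sequence) from an MSA can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure actually used in this work: the majority-rule scheme, the function name consensus_sequence, the 50% thresholds and the toy alignment are all hypothetical.

```python
from collections import Counter


def consensus_sequence(aligned, gap="-", min_fraction=0.5, unknown="X"):
    """Majority-rule consensus of equal-length aligned sequences.

    Assumption: columns where fewer than half of the sequences have a
    residue are dropped, and columns without a clear majority residue
    are written as `unknown`.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        residues = [r for r in column if r != gap]
        # Drop columns that are mostly gaps.
        if len(residues) * 2 < len(column):
            continue
        residue, count = Counter(residues).most_common(1)[0]
        if count / len(residues) >= min_fraction:
            consensus.append(residue)
        else:
            consensus.append(unknown)
    return "".join(consensus)


# Toy alignment (made-up sequences, not real COG members).
toy_alignment = [
    "MKV-HLAR",
    "MKI-HLAR",
    "MRVGHLSR",
    "MKV-H-AR",
]
print(consensus_sequence(toy_alignment))  # prints MKVHLAR
```

The resulting string could then be pasted into a BLAST search form as an ordinary query sequence.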
Results COG0217S was expanded from 8 members to 25, but some of the new members come from the same species and could therefore be paralogues. The eukaryotic members of this COG come from yeast, one plant (Arabidopsis) and one nematode (not C. elegans). The worm gene is truncated; it may be a pseudogene, or the truncation may be due to a sequencing error. Comparison of the MSA of the 8 original members with the final MSA showed that the genes of this COG are very well conserved. COG0398S was expanded from 3 to 12 members. Although orthologous sequences were found in eukaryotes, they come only from yeast or Arabidopsis, not from other higher eukaryotic organisms. Only very weak hits were found in worm; these need further study. Therefore, neither COG0217S nor COG0398S provides higher eukaryotic candidate genes for further structural investigation.
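Members that share a species can be flagged for closer inspection with a few lines of code. The sketch below is an illustration only: the function name candidate_paralogues and the sequence identifiers are hypothetical, and sharing a species is merely a hint of paralogy that still has to be checked against the phylogenetic tree.

```python
from collections import defaultdict


def candidate_paralogues(members):
    """Group (sequence_id, species) pairs by species.

    Returns only the species that contribute more than one member,
    since those members are the ones that could be paralogues.
    """
    by_species = defaultdict(list)
    for seq_id, species in members:
        by_species[species].append(seq_id)
    return {sp: ids for sp, ids in by_species.items() if len(ids) > 1}


# Made-up identifiers for illustration; not the real COG0217S members.
members = [
    ("seq1", "Escherichia coli"),
    ("seq2", "Saccharomyces cerevisiae"),
    ("seq3", "Saccharomyces cerevisiae"),
    ("seq4", "Arabidopsis thaliana"),
]
print(candidate_paralogues(members))
```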
COG0325S was expanded from 5 to 25 members, including 10 eukaryotic and 12 microbial organisms. All higher eukaryotic animal members (worm, fly, mouse and human) are located on the same branch of the phylogenetic tree, which indicates their close evolutionary relationship. Many plant genes were also discovered for this cluster. Furthermore, comparison of the final MSA with the MSA of the original members indicated that the members of COG0325S are very well conserved. The data suggest that COG0325S is a good candidate COG for NMR structure determination. Unlike the other three COGs, COG0615R is probably associated with a known function, because most of its members are cytidylyltransferases (CT). The MSA data also show very good conservation, especially of 2 His (H) and 1 Arg (R) residues, which might play key roles in catalyzing the transfer of phosphate. For more detailed data from the analysis, please visit (http://lion.cabm.rutgers.edu/~mani/cogs/).
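Strictly conserved residues such as the His and Arg positions mentioned for COG0615R can be located by scanning the columns of the MSA. The following is a minimal sketch, assuming the alignment is available as a list of equal-length gapped strings; the function name invariant_columns and the toy sequences are hypothetical and do not come from the real alignment.

```python
def invariant_columns(aligned, gap="-"):
    """Return (column_index, residue) for every alignment column in which
    all sequences carry the same non-gap residue (0-based indices)."""
    result = []
    for index, column in enumerate(zip(*aligned)):
        first = column[0]
        if first != gap and all(residue == first for residue in column):
            result.append((index, first))
    return result


# Toy alignment (made-up sequences for illustration only).
toy = [
    "GHAGHLR",
    "GHSGHIR",
    "AHAGHLR",
]
conserved = invariant_columns(toy)
# Keep only invariant His (H) and Arg (R) positions.
his_arg = [(i, r) for i, r in conserved if r in "HR"]
print(his_arg)  # prints [(1, 'H'), (4, 'H'), (6, 'R')]
```

Column indices refer to alignment positions, so they would still need to be mapped back to residue numbers in each individual sequence.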
Discussion In this work, two steps required more personal judgement. One was how to derive a better profile for searching dbEST. The other was how to decide whether to include a sequence in the final alignment, which affects the construction of the phylogenetic tree. I derived a profile (consensus sequence) either from the MSA of the original COG members or, sometimes, from the MSA of the original COG members plus other well-conserved sequences. Generally, several versions of the profile were used to search the databases. Each profile sequence was first used in a BLAST search to confirm that it retrieved the original COG members before it was used to search dbEST. To produce a better MSA, long sequences and sequences containing long inserts were excluded from the final MSA. Conservation of critical residues was another factor in deciding whether a sequence was included in the final MSA. Finally, because new sequence data are generated and added to the genomic databases every day, we should keep updating our COG analysis results to represent the
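The exclusion of long sequences and of sequences with long inserts was done by personal judgement in this work. Purely as an illustration, the same kind of decision could be expressed as an explicit rule; the function names, the length ratio of 1.5 and the insert limit of 10 columns below are hypothetical assumptions, not values used in the analysis.

```python
from statistics import median


def longest_insert(candidate, references, gap="-"):
    """Longest run of alignment columns in which the candidate has a
    residue while every reference sequence has a gap."""
    longest = run = 0
    for i, residue in enumerate(candidate):
        if residue != gap and all(ref[i] == gap for ref in references):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def keep_for_final_msa(candidate, references, max_length_ratio=1.5,
                       max_insert=10, gap="-"):
    """Decide whether an aligned candidate stays in the final MSA.

    Assumed rule: reject it if its ungapped length exceeds
    max_length_ratio times the median reference length, or if it
    carries an insert longer than max_insert columns.
    """
    ungapped = len(candidate.replace(gap, ""))
    typical = median(len(ref.replace(gap, "")) for ref in references)
    if ungapped > max_length_ratio * typical:
        return False
    return longest_insert(candidate, references, gap) <= max_insert


# Toy aligned sequences (made up for illustration).
references = ["MKV----HLAR", "MKI----HLSR"]
print(keep_for_final_msa("MKV----HLAR", references, max_insert=3))
print(keep_for_final_msa("MKVGGGGHLAR", references,
                         max_length_ratio=2.0, max_insert=3))
```

A rule like this only covers the length criteria; conservation of critical residues would still have to be checked separately.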
I would like to thank Dr. Montelione for giving me the great opportunity to do this rotation. I would also like to acknowledge the kind help of Kris Gunsalus, Ram Mani and Alex Gardino.