New Methods in Structural Genomics
 
Introduction:
    To outline new method development in structural genomics, we must first define the field. Structural genomics relies on the fact that protein structure and function are closely linked. We can therefore use structure to understand function, and study structure in a functional context. Viewed this way, structural genomics is the study of all structures encoded in a genome, or of all of known structure space (Montelione, GT. and Anderson, S, 1999). To this end, we are developing methods for high-throughput structure determination. As with any task requiring high-throughput calculations, computers are an essential component.
    The determination of a structure is a long and difficult task. First, one must select a gene target for study and then clone it. Cloning requires the synthesis of primers and the ligation of the inserts generated by these primers into vectors. A multiplex expression kit has been designed by Kris Gunsalus to simplify this step (Gunsalus, KC, Palacios D, and Montelione GT, Submitted). After cloning the gene target, we need to express, purify, characterize and concentrate the protein. For NMR, the sample needs to be labeled with 15N and/or 13C. The chemical shifts measured for the labeled protein are characteristic of each amino acid type and can be used to distinguish residues from one another. There is a standard set of backbone experiments; once these are completed, we run AutoAssign, an expert-system-based computer program designed to assign chemical shifts to the amino acid residues in the backbone (Zimmerman, DE. et al., 1997). Next we run additional NMR experiments, assign the sidechains, and collect and assign NOESY spectra to determine H-H distances of less than five Angstroms (Moseley, HNB and Montelione, GT., 1999). This information is used as constraints in the molecular mechanics packages that derive the structure.
    The projects outlined in this progress report touch on many aspects of the structural genomics initiative. During the summer of 1999, there was a project to organize the data generated from target selection through the NMR experiments; more recently, my projects have moved to the AutoAssign, or backbone analysis, step of structure determination. The projects outlined are: a License Database; the analysis, using AutoAssign, of triple resonance data for mt1615, produced by Valerie Booth and Cheryl Arrowsmith at the University of Toronto; a preprocessing chemical shift validation script for AutoAssign (Montelione, GT. et al., 1999); and the AutoAssign documentation.
Experimental Methods:
    So far, the only major tools have been computers, so the experimental methods are limited to programming languages and applications.
PERL / CGI                                 : Primer Program and the modules extracted from it
PERL / CGI interfacing to postgresql       : License Database
AutoAssign and PERL preprocessing scripts  : Analysis of the mt1615 triple resonance NMR dataset
PERL                                       : Chemical shift validation script
HTML                                       : AutoAssign Documentation
 
    The Primer Program was written in PERL / CGI to allow web access. Because the program is currently inoperative, a few independent modules were extracted from it so that some of its features could be used in the meantime; these modules will eventually be reintegrated into the Primer Program. They are the Rare Enzyme Restriction Site Finder (RERSF) and the Rare Codon Finder (RCF). The PERL / CGI interface to postgresql is analogous to the method used in the NMR Database: the CGI interfaces to a set of PERL modules, which interface to the PERL DBI (database interface), which in turn interfaces to the postgres database. The analysis of the University of Toronto triple resonance NMR dataset started with the HNcoCACB and the HNCACB (PHASED CB no GLY phase) spectra. The validation script, written in PERL, uses the BMRB average amino acid chemical shift values to calculate the average expected shift. The AutoAssign documentation started as the documentation of the strategies used in analyzing the Toronto group's triple resonance NMR dataset, along with other minimal dataset strategies. Recently, it has expanded into a full HTML-based documentation of AutoAssign, which is still under construction.
 

Results and Discussion:

    As mentioned before, the Primer Program is currently inactive and is undergoing a full rewrite. In the meantime, two smaller modules have been constructed to make some of its features available. RERSF locates restriction sites in a nucleotide sequence and then displays the sequence with each restriction site highlighted. RCF finds certain rare codons. Stratagene has created a cell line that carries tRNAs for these rare codons, which are usually absent in normal cells (Carstens, CP and Waesche A). The program reports the total number of these rare codons in the sequence and the number within the first 10 codons. The license database was a simple programming project that helps automate the distribution of licenses for software developed in the lab. It has little scientific value; it was designed primarily to organize and track the distribution of laboratory-developed software and to make that distribution more efficient.
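
    The behaviour described above for RERSF and RCF could be sketched roughly as follows. This is an illustrative Python sketch, not the PERL code of either module: the function names, the highlighting convention (sites shown in upper case) and, in particular, the rare codon set are assumptions made for the example, since the report does not list the codons RCF searches for.

```python
import re

# Hypothetical rare-codon set, chosen for illustration only; the codons
# actually searched for by RCF are not listed in this report.
RARE_CODONS = {"AGA", "AGG", "ATA", "CTA", "CCC"}


def find_sites(sequence, site):
    """Return the 0-based start of every occurrence of `site` in `sequence`."""
    seq = sequence.upper()
    # The lookahead makes overlapping occurrences show up as separate matches.
    pattern = "(?=" + re.escape(site.upper()) + ")"
    return [m.start() for m in re.finditer(pattern, seq)]


def highlight_sites(sequence, site):
    """Return the sequence in lower case with each site shown in upper case."""
    chars = list(sequence.lower())
    for start in find_sites(sequence, site):
        for i in range(start, start + len(site)):
            chars[i] = chars[i].upper()
    return "".join(chars)


def count_rare_codons(sequence, rare=RARE_CODONS, window=10):
    """Return (total rare codons, rare codons within the first `window` codons)."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    total = sum(1 for codon in codons if codon in rare)
    early = sum(1 for codon in codons[:window] if codon in rare)
    return total, early


if __name__ == "__main__":
    example = "atgagacccgaattccta"  # made-up sequence containing one EcoRI site
    print(highlight_sites(example, "GAATTC"))
    print(count_rare_codons(example))
```

    The codon count assumes the sequence is already in the reading frame of interest; a real tool would also need to handle ambiguous bases and enzymes with degenerate recognition sites.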

    The triple resonance NMR dataset for mt1615 provided a challenge, because six NMR experiments had to be simulated from two collected NMR experiments. As a prelude to this project, I carried out a similar procedure on a proprietary dataset provided by Bristol-Myers Squibb. These results agreed with a previously determined set of assignments, which had been made using the following set of collected experiments: HSQC, HNCO, HNCA, HNcoCA, HNCACB, HNcoCACB, HNHA, HNcoHA. The minimal NMR experiments required for both datasets are as follows: HSQC, HNCO, HNCA, HNcoCA, HNCACB, HNcoCACB. The exact methodology is quite complex and will be included in the next release of AutoAssign as a part of the documentation. A temporary link to this page is Documentation/. Once at this page, click on the AutoAssign Strategy Page. On that page, go to the two rung strategy with the phased HNCACB, because with only the 2 experiments, you can match only on the CA and CB ladders. In my analysis of mt1615, I obtained 73/81 CA assignments, and only one proline was isolated (no links to any other residues). However, some of my assignments do not agree with the Toronto group's manual results. The process of first obtaining the assignments and then checking them against the manual results led to a hypothesis: when AutoAssign assigns equal chemical shift values to the CA and CB atoms of a serine or threonine residue, the assignments it made are not reliable. Normally, in these residues the average chemical shift of the CB is downfield of that of the CA. This hypothesis holds for mt1615; the serines and threonines assigned the same CA and CB chemical shift values do not agree with the manual assignment of the Toronto group. However, I will need to do further testing to see whether the hypothesis holds for other triple resonance datasets.
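
    The serine/threonine check behind this hypothesis is simple enough to sketch in a few lines. The Python below is illustrative only: the tuple layout, the function name, the tolerance parameter and the example values are all assumptions, and none of it reflects AutoAssign's actual output format or the mt1615 data.

```python
def flag_suspect_residues(assignments, tolerance=0.0):
    """Return labels of Ser/Thr residues whose CA and CB shifts coincide.

    `assignments` is a list of (label, residue_type, ca_shift, cb_shift)
    tuples, with shifts in ppm and None for an unassigned atom. This layout
    is hypothetical and is not AutoAssign's output format.
    """
    suspects = []
    for label, residue_type, ca_shift, cb_shift in assignments:
        if residue_type not in ("S", "T"):
            continue  # the hypothesis concerns serine and threonine only
        if ca_shift is None or cb_shift is None:
            continue  # nothing to compare
        if abs(ca_shift - cb_shift) <= tolerance:
            suspects.append(label)
    return suspects


if __name__ == "__main__":
    # Made-up values for illustration; these are not mt1615 assignments.
    example = [
        ("S12", "S", 58.3, 58.3),   # CA equals CB: flagged
        ("T20", "T", 61.9, 69.8),   # CB downfield of CA: not flagged
        ("A5", "A", 52.0, 52.0),    # not Ser/Thr: ignored
        ("T33", "T", 62.1, None),   # CB unassigned: ignored
    ]
    print(flag_suspect_residues(example))
```

    A nonzero tolerance would also catch near-identical values, but whether that is appropriate depends on further testing of the hypothesis.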

    The next project was the development of a script to validate the chemical shifts of peaks in a peakfile, a file containing the processed and picked peaks from the NMR spectra. The program validates the peaks by comparing two ratios: the actual shifts to the number of actual peaks, and the expected shifts to the expected number of peaks. The expected shifts are calculated using the average chemical shift values provided in the BMRB database. These two ratios should be quite similar; however, improper referencing could cause them to differ greatly. The script takes as input the peakfile, the amino acid sequence and a filter (either a standard defined experiment or a dimension-by-dimension definition of the peakfile). The output, for each dimension, is the ratio of the expected chemical shift to the expected peaks and the ratio of the actual chemical shift to the actual peaks. The purpose of this script is to find systematic deviations from the standard chemical shifts. With more analysis, this could lead to a method for detecting the percentage of deuteration or for gaining some idea of secondary structure (Bax A and Spera S, 1991). I am testing the validation script on some example datasets and will include the results of my studies in the help documentation of the PERL script in the next release of AutoAssign.
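
    The comparison of the two ratios can be illustrated for a single dimension as follows. This Python sketch is not the PERL validation script: it assumes that each ratio is the sum of the shifts divided by the number of peaks (that is, a mean), the function names are invented for the example, and the table holds only a few approximate CA averages as placeholders for the full BMRB table.

```python
# Approximate average CA shifts in ppm, included only as placeholders;
# a real run would load the complete BMRB table of average shifts.
AVERAGE_CA_SHIFT = {"A": 53.1, "G": 45.3, "S": 58.7, "T": 62.2}


def expected_ratio(sequence, averages):
    """Sum of expected shifts divided by the expected number of peaks."""
    shifts = [averages[residue] for residue in sequence if residue in averages]
    if not shifts:
        raise ValueError("no residue in the sequence has a tabulated average")
    return sum(shifts) / len(shifts)


def actual_ratio(peak_shifts):
    """Sum of observed shifts divided by the number of observed peaks."""
    if not peak_shifts:
        raise ValueError("the peak list is empty")
    return sum(peak_shifts) / len(peak_shifts)


def systematic_offset(sequence, peak_shifts, averages):
    """Actual ratio minus expected ratio for one dimension, in ppm."""
    return actual_ratio(peak_shifts) - expected_ratio(sequence, averages)


if __name__ == "__main__":
    sequence = "AGST"
    # Made-up peaks: each expected value shifted by +2.0 ppm, imitating a
    # spectrum that was referenced incorrectly.
    peaks = [55.1, 47.3, 60.7, 64.2]
    print(round(systematic_offset(sequence, peaks, AVERAGE_CA_SHIFT), 3))
```

    An offset close to zero suggests that the dimension is referenced consistently with the table, while a large offset of the same sign across datasets points to a systematic deviation of the kind the script is meant to detect.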

    The final and current project for this semester is the AutoAssign documentation page. This documentation includes the methods used to determine the preliminary assignments for mt1615, along with other minimal dataset strategies. The purpose of these strategies is to minimize data collection time, which is currently a major bottleneck. Along with these dataset strategies, the page now includes the general help documentation for AutoAssign. This project will help current and future users of AutoAssign use it more efficiently and effectively.
 

Conclusion:

    My work is still in progress, and I have not yet reached any definite conclusions. The Primer Program is incomplete, but two resulting tools, RERSF and RCF, are in use. The license database is completed and in operation. The assignments for mt1615 are pending further analysis. The validation script is currently being tested, and the AutoAssign documentation is usable but still under construction. The only hypothesis made so far is that equal chemical shift values assigned to the CA and CB atoms of serines and threonines signal that AutoAssign has made mistakes in its assignments. This hypothesis, however, needs more data to be validated.
 

Acknowledgements:

    Kristin Gunsalus for her help and algorithm design in the primer program (even though it hasn't come to fruition yet).
    Hunter Moseley for his design advice and patience in teaching the correct usage of AutoAssign.
 

References:

      1. Bax A, Spera S: Empirical Correlation between Protein Backbone Conformation and Ca and Cb 13C Nuclear Magnetic Resonance Chemical Shifts. J Am Chem Soc 1991, 113:5490-5492.
      2. Carstens CP, Waesche A: Codon Bias-Adjusted BL21 Derivatives for Protein Expression. Stratagene, 12:49-51.
      3. Gunsalus KC, Palacios D, Montelione GT: A Multiplex Protein Production System for Structural Genomics. Nature-Biotechnology Submitted.
      4. Montelione GT, Anderson S: Structural genomics: a keystone for a human proteome project. Nat Struct Biology 1999, 6:11-12.
      5. Montelione GT, Rios CB, Swapna GVT, Zimmerman DE: NMR pulse sequences and computational approaches for automated analysis of sequence-specific backbone resonance assignments of proteins. In Biological Magnetic Resonance, Volume 17: Structure Computation and Dynamics in Protein NMR. Edited by Krishna R, Berliner L. New York:Plenum; 1999:81-130.
      6. Moseley HNB, Montelione GT: Automated analysis of NMR assignments and structures. Current Opinion in Struct Biology 1999, 9:635-642.
      7. Zimmerman DE, Kulikowski CA, Huang Y, Feng W, Tashiro M, Shimotakahara S, Chien C-Y, Powers R, Montelione GT: Automated analysis of protein NMR assignments using methods from artificial intelligence. J Mol Biol 1997, 269:592-610.