Report to the National Science Foundation
Workshop on Structural Genomics - Understanding Proteins Universal to Life
Advanced Photon Source, Argonne National Laboratory
January 23 - 25, 1998

Prepared by G. T. Montelione and S. Anderson, CABM, Rutgers University


This meeting was organized by Paul Bash and other scientists at Argonne APS to consider the role of macromolecular structure analysis in the "postgenomic era". The meeting focused on two principal questions:  (1) What is required to determine, by experiment and/or homology modeling methods, the 3D structures of all the proteins in the genomes of organisms of interest?  (2) Can the structural analysis of these novel gene products provide useful clues to their "functions"?

In answer to the first question, it was proposed by several speakers that there are some 1000 - 1500 common protein-domain fold families (though there may be many more uncommon ones).  About 500 - 600 of these are represented in the set of ~ 6,000 protein structures deposited in the Protein Data Bank (PDB).  Most of the PDB entries actually represent the same protein species (e.g. NMR vs. crystal structure, free vs. ligand bound, single-site mutants, etc.); only 1 in 10 represents a distinct fold family.  While we will soon have 100% of the gene sequences of several organisms, only about 15% of the corresponding protein structures are known either experimentally or by "reliable" homology modeling.   It is estimated that structure determination of a carefully-selected set of 3,000 - 10,000 proteins will be required to generate a "basis set of domain structures" which would be sufficient to homology model most of the remaining structures.  It was also pointed out that the process of clustering gene products into fold families is an on-going intellectual activity which has not achieved consensus in the field, and that statisticians should be recruited into this area of bioinformatics.

From a practical point of view, it is clear that this effort will be carried out jointly by X-ray crystallography and NMR spectroscopy.  In this respect it was pointed out that in 1996 about 20% of the structures of new domain families were determined by NMR and 80% by X-ray crystallography.  This ratio should be normalized by the fact that there are many more crystallography labs in the world than NMR labs, in part because crystallography is a more mature field.  This leads to the conclusion that one can expect both NMR and crystallography to play major roles in any federally-funded structural genomics initiative.

A tremendous innovation in the field of X-ray crystallography is the development of multiwavelength anomalous dispersion (MAD) methods for obtaining phase information.  These methods still require special samples (e.g. isomorphous heavy-atom derivatives or SeMet-enriched proteins), but greatly facilitate the generation of electron-density maps.  Atomic coordinates then have to be fit into the electron-density maps using interactive graphics.  Regarding "high throughput" X-ray crystallography, it was pointed out that a single high-energy synchrotron beam line at Brookhaven Natl Labs that has been optimized for MAD data collection is currently generating about 25 protein structures per year.  Most of these were done using MAD analysis methods.  It was suggested that this throughput could be increased by a factor of ~10, but it was not clear how this increase would be obtained.   There was little discussion regarding the state-of-the-art for automated analysis of such crystallographic data, and it was suggested that this scale-up would be done with a large team of crystallographers who would do the actual analysis of data.

NMR faces limitations of rotational correlation times, and therefore is only useful for determining atomic-resolution structures of small proteins or protein fragments.  It was pointed out that the average domain size is about 175 amino acids, and that many of the target domains relevant to this project are within the size range (< 25 kD) suitable for structure determination with existing NMR techniques.  It was also noted that 4 - 8 weeks of NMR instrument time is needed to collect the data required for a high-resolution NMR structure determination.  However, automation of the NMR data analysis process is actually very advanced and will soon be capable of fully automated analysis of protein structures in several hours.  Accordingly, such NMR analyses will be rate-limited by the time required for the data collection.  Nonetheless, NMR spectrometers are relatively inexpensive compared with synchrotron sources; two to three NMR machines have throughput equivalent to that of a single synchrotron beam line (~25 structures per year).  Methods for increasing this throughput using deuteration technologies were mentioned by one speaker, but other approaches including the potential use of high-field (900 - 1200 MHz) magnets and super-conducting probe technologies were not discussed.

It was also pointed out in several talks and in the discussion following the talks that the most significant bottleneck in such a large-scale effort involves the production of suitable protein samples for structural analysis.  Both crystallography and NMR speakers pointed out special advantages of studying isolated domains of larger proteins, which tend to crystallize more easily and are easier to express in bacterial expression systems.  Several groups described preliminary progress in developing methods for screening protein samples for their suitability for NMR spectroscopy or crystallization, including light scattering measurements to screen for "crystallizability" or the use of simple NMR experiments to evaluate "foldedness" of candidate domains.   It was agreed that in the long term, the protein sample production problem is likely to be the most critical bottleneck.

The answer to the second question regarding the analysis of "function" from the structural analysis was complicated by a lack of a general consensus on the meaning of "biological function".  Many speakers argued that structural analysis was the best way to discover biochemical functions of novel gene products.  The basis for this argument is that homology can be detected over longer evolutionary timescales using 3D structure than using sequence information alone, and the general concept that "form defines function". However, others argued that structure was insufficient to understand or predict "function".  There are many examples of proteins with similar structures (even similar evolutionary lineage) that have different "functions".

One key problem in this discussion involves the definition of "biological function".  Geneticists, cellular biologists, structural biologists, bioinformaticians, and biophysical chemists use this term to mean different things.  Some speakers used the word "function" to refer to the general biochemical activity of the gene product (e.g. kinase activity), others referred to the cellular process in which the gene product is involved, while to others "function" meant an understanding of the details of the atomic mechanism of catalysis or recognition.  Still others referred to function in the genetic sense of a generalized phenotype.  This lack of consensus in defining what aspect of function one might learn about from an examination of protein structure prevented the group from reaching a consensus on the role that large scale structural genomics will have in the related area of functional genomics.

It was suggested in the discussion session that it is important that such an initiative move ahead in a highly collaborative and cooperative fashion.  For example, it was suggested that an "Electronic Bulletin Board" of genomic targets that are being pursued by various groups be created to avoid overlap of efforts.  It was also proposed that mechanisms be developed for sharing materials.  The discussion also focused on the concept of learning from the genomic sequencing projects how to divide up the structural genomics efforts so as to avoid redundancy and to develop a highly collegial effort.

Recommendations to NSF

NSF resources should be directed to developing "high throughput" data collection technologies for X-ray crystallography and NMR, including high field NMR facilities that will provide enhanced signal-to-noise and higher throughput.  NSF resources should also support efforts to automate the process of generating 3D protein structures from NMR or crystallographic data, and work to improve the efficiency and reliability of homology-modeling methods.  Bioinformatics methods for clustering novel gene products into clades that are likely to have common structures, for parsing gene products into domain-encoding regions, and for structural data base organization and searching are also essential to this effort.

It is critical to develop sound strategies for target selection.  Funding should support efforts to use bioinformatics methods to define the clusters of homologous proteins within and across genomes that are likely to belong to the same fold family with similar 3D structures.  In the first pass, it will only be necessary to do one structure determination from each of these clusters, as the other proteins within the family can then be homology-modeled with some degree of accuracy.  Later passes through the genome would address detailed structural analyses of these other fold-family members.

Funding mechanisms must be established for ensuring the development of high-throughput technologies for protein production, including high level production vectors and technologies for rapid protein purification.  These technologies are crucial to the efficiency and success of a high-throughput structural analysis initiative.  Technologies must also be developed for analyzing the "foldedness" of candidate protein domains, and for rapidly screening large numbers of samples for crystallizability and suitability for NMR analysis.  NSF resources should be directed to these severely underfunded technology areas.

Structural biology is a very competitive field.  While in the early days of X-ray crystallography, scientists attempted to avoid targets that were being analyzed in other laboratories, these days it is not uncommon to find multiple crystallography and NMR groups competing to complete a structure determination of the same protein.  Part of this is due to the fact that the process of structural analysis is much faster today, and groups are often unaware of one another's efforts.  In the first pass through the genomes, such redundancy will be highly inefficient.  The genomic sequencing projects have some experience in orchestrating non-redundant efforts.  These projects should be evaluated as possible role models for the structural genomics initiative.  NSF should consider how it can play a role in helping to organize this large-scale effort by creating mechanisms for coordinating a collegial effort in large-scale structural genomics.

Workshop Talks

A copy of the program of the meeting is attached.  The following summarizes our specific notes on each of the talks at the meeting.

John Wooley (DOE).

This meeting will feed into the "Computational Science Initiative" that DOE is beginning (see recent issue of Science magazine).  Funding for this initiative is $100 million / yr, and going up.

"Form Follows Function".  Analysis of protein structures can give critical clues to protein function.

The Genome Project will provide a "Periodic Table" of biology.  It will provide a new paradigm for molecular biology.  - W. Gilbert.

"High throughput" structural analysis can provide a "basis set" of protein structures that will allow better understanding of function and form the basis for structure prediction by homology modeling.

Harvey Drucker. (Argonne Natl. Labs)

This represents an unusual and special conjunction of genomic sequence data, protein structural data base, and funding opportunities.

Paul Bash (Argonne Natl. Labs)

Can call this: "Comparative Genomics-Directed Structural Biology"

One aim is to generate a representative set of structural motifs, a "Basis Set of Structures".

Key questions we hope to address at this workshop:

  •  How do we find meaningful clusters of protein structures?
  •  How many families are there?
  •  How can we use this basis set to homology model the remaining structures?
  •  How do we select targets for experimental structure analysis?
  •  What are the rate-limiting steps?
        - Expression
        - Protein purification
        - "High throughput" crystallography or NMR?
        - X-ray, NMR data collection?
        - Interpreting the data -- automated methods for X-ray and NMR analysis
  •  How much would it cost to solve structures in "high throughput" fashion?  [He estimated that it presently costs $100 - 200K per protein, but I do not think this includes capital costs, overhead, and protein expression/production expense.]



    Walter Gilbert (Harvard Univ)

    The growth of DNA sequence information has been exponential since the early 80's when rapid sequencing methods were developed.

    The sequence data base grows by a factor of 10 every 5 years, compared to science in general that grows a factor of 2 in every 10 years.

    For a graduate student, there is 10x more information available when she graduates than when she begins graduate studies.  Over the course of undergrad and graduate studies, the sequence information increases 100 fold.
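
The growth figures quoted above can be checked with a few lines of arithmetic.  This is a minimal sketch, assuming steady exponential growth; the function names are ours, and the 10-year span for undergraduate plus graduate studies is an assumption inferred from the 100-fold figure.

```python
import math

def doubling_time(factor: float, period_years: float) -> float:
    """Years needed to double, given growth by `factor` every `period_years`."""
    return period_years * math.log(2) / math.log(factor)

def fold_increase(factor: float, period_years: float, elapsed_years: float) -> float:
    """Total fold increase after `elapsed_years` of steady exponential growth."""
    return factor ** (elapsed_years / period_years)

# Sequence data base: factor of 10 every 5 years (from the talk).
print(f"sequence doubling time: {doubling_time(10, 5):.2f} years")
# "Science in general": factor of 2 every 10 years (from the talk).
print(f"general doubling time:  {doubling_time(2, 10):.2f} years")
# Assumed 10 years of undergraduate plus graduate study.
print(f"fold increase over 10 years: {fold_increase(10, 5, 10):.0f}")
```

Under these assumptions the sequence data base doubles roughly every year and a half, against every ten years for science in general.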

    We can expect to see some 10 Gbases of information by ~2002.

    Some 50 - 100 new bacterial genomes will be completed during the next year (??).  Many plant genomes will follow.

    Should we do "high throughput structure determination" as a major effort?  Yes!!

    What are the genes?
    What do the genes do?  How do they interact?
    How do we change what they do?  How do we enhance or alter the function of the genes?

    Gilbert's view is:  "Function" analysis involves the prediction of protein-protein interactions.

    Intron / Exon debate: Introns early vs. introns late.

    Introns can speed up recombination by shuffling.  Can get 10^6 speedup in recombination; exon shuffling can speed up evolution.

    Bacterial and protist genomes do not have introns, while vertebrates do.  This is one point which favors the intron-late theory.

    Exon Theory of Genes: Genes were constructed from primitive genes 15 - 20 AA long, which correspond to exons that were mixed and matched to create larger genes.  Predicts that proteins will be made up of modules, or "folding units" that correspond to these exons.

    Retrotransposition and other mechanisms can cause loss of introns over time, and give rise to very large exons.

    Modules - compact regions of protein structure; not necessarily independent folding units.   Ancient introns may define these compact module structures of proteins.

    Intron phase:  Most common exon-exon boundaries are phase 0.

                Phase 0           56%
                Phase 1           23%
                Phase 2           21%

    Suggests that the 34% phase 0 boundaries are "old", and the 2% phase 1 boundaries are old.  In other words, 50% of phase 0 are old, 10% of phase 1 are "old" boundaries representing exon shuffling; the remainder are inserted and are not correlated with 3D structure.  Of all the exon-exon boundaries, 40% are "old" and 60% are "added" during evolution.
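
For readers unfamiliar with the term, intron phase is simply the position of the intron relative to the codon reading frame.  The sketch below uses the standard definition (phase 0 between codons, phase 1 after the first base of a codon, phase 2 after the second) and tallies a phase distribution like the table above.  The function names and the example positions are hypothetical; this is not the analysis Gilbert presented.

```python
from collections import Counter

def intron_phase(coding_bases_before: int) -> int:
    """Phase of an intron, given the number of coding bases upstream of it.

    Phase 0: the intron lies between two codons.
    Phase 1: the intron lies after the first base of a codon.
    Phase 2: the intron lies after the second base of a codon.
    """
    if coding_bases_before < 0:
        raise ValueError("base count must be non-negative")
    return coding_bases_before % 3

def phase_distribution(positions):
    """Percentage of introns in each phase for a list of upstream base counts."""
    if not positions:
        raise ValueError("need at least one intron position")
    counts = Counter(intron_phase(p) for p in positions)
    total = len(positions)
    return {phase: 100.0 * counts.get(phase, 0) / total for phase in (0, 1, 2)}

# Made-up intron positions (coding bases upstream), for illustration only.
example_positions = [99, 150, 301, 452, 600]
print(phase_distribution(example_positions))
```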

    But -- S. Anderson points out work by L. Patthy that shows that many domains appear to shuffle on phase 1 exons.  Gilbert responded by saying that what is being shuffled today may not be the same as what was being shuffled early.

    P. Wolynes points out that this kind of analysis requires sophisticated statistics, and we should be looking at ways to recruit statisticians into these research areas.

    Tom Terwilliger (Los Alamos Natl. Labs)

    Nice talk outlining the general concept of "high throughput" structure analysis by X-ray crystallography.

    Depending on the level of functional information that we need, we will need different degrees of structural resolution.  We do not need all of the fine details of protein structure to answer some important functional questions, while other details require high resolution structural information.

    Opportunistic Approach: Let us get the structural info that we need from the proteins we can get structural information for.

    For example, GroEL -- 2 A.A. differences made a great effect on resolution of diffraction.

    At every step, you should take the easiest ones.  Express many homologues of proteins, try to find the ones that crystallize.

    Talked a lot about target selection.  Particular focus on pathogens.  Hypothesized that one day we may even be able to use protein structures to anticipate the evolution of antibiotic resistance mutations.

    NMR -- need to consider large-scale NMR facilities.  Need to develop automated methods for analysis of NMR data.

    Purification -- need to develop methods to rapidly identify soluble fragments.  "High throughput" purification.

    Eugene Koonin (NCBI - NIH)

    There are many multidomain proteins out there.  They are very common in eukaryotes, but also occur in prokaryotes.

    COGs - clusters of orthologous genes.  Genes that are highly conserved across many phyla (or even kingdoms) of organisms.

    Has 720 COGs on his web site.  These correspond to 6814 proteins (or domains), mapping to 6646 genes.

               Biochemical function assigned by sequence homology     81%
               General prediction of biochemical function             14%
               No function predicted                                   5%

    But -- COGs are by definition "highly conserved".  These COGs represent only about 30% of the genes in these organisms.  Many of the remaining genes have no clearly assignable functions.  Indeed, some 20 - 40% of genes in these genomes have "unknown biochemical functions", most of the rest can be guessed with good reliability from sequence homology.

    Of these 720 COGs, only 10% are unique to bacteria (these are potential targets for bacteria-specific drugs).

    Koonin has gone through these to identify COGs that are similar to proteins of known 3D structure.  The remaining "filtered" COGs represent good targets for structural analysis by NMR or X-ray crystallography since they are proteins for which you cannot easily predict 3D structure.

    *Ribosomal Protein S4* - important new class.

    Steve Benner (Univ. of Florida)

    Functional Inference from Reconstituted Evolutionary Biology Using Rectified Databases.

    Def: Function - adaptive behavior

    DARWIN program - "exhaustive matching" of fragments of the sequence data base

    Learn the transformation rules used by evolution

    SAINT: structural analysis with informative transport?

    Divergence of Function: similar proteins may have different functions; homology does not imply analogous function.  [But in the example where he showed similar structures but supposedly dissimilar functions, all the proteins carried out the same basic biochemical reaction -- only the finer details of the substrates were different -- so it looks like it depends on how one defines "similar" as it applies to function, e.g., such functions would be similar by the Chris Sander definition.]

    Terry Gaasterland (Argonne Natl. Labs)

    She is focusing on the Sulfolobus solfataricus genome project.

    15 public genomes
    3 more pending
    5 at 95%

    23 genomes

    She is doing something like Koonin's COGs, but with tighter criteria.

    *| Functional categories of biochemical groups |*   should look at her web page.

    Uses sequence stretch searches w/ DARWIN.  (ask Kris G. if she has looked at DARWIN for her project).  Need to use this method for multidomain proteins of eukaryotes.

    Steve Brenner (Stanford Univ, M. Levitt Lab)

    co-developed SCOP with Murzin and Chothia.

    A field whose time has come.

    His main interest is in evolutionary relationships.  Def: Homology - evolutionary relationships.

    More protein functions / structures have been determined by homology than by any other methods.

    A better way than sequence to see evolutionary relationships is by "structural similarity".

    Myoglobin/Hemoglobin:  first example of structural homology showing functional homology, even though sequence similarity is minimal.

    Homologies can give important clues to function, though they do not necessarily determine it.

    SCOP:  Database for structural classification of proteins.  Classification is done by initial structural homology analysis with a computer program, followed by manual inspection and assignments to family.  Current ver 1.35.

    Levels of structural classification in SCOP:

    Class: conventional groups (a, b, a/b, a+b)
    Fold: 3D structural similarity
    Superfamily: evidence of common evolutionary relationship, look at structures and functions
    Family: clear evolutionary relationship
    Domain: unit of protein structure

    SCOP includes:
    370 Folds
    519 Superfamilies
    751 Families
    ___ Domains

    Some superfamilies have many members, but most superfamilies have one or few families.

    What is being deposited in the PDB?  For new structures submitted to PDB in 1994:
    70% of new deposits are structures already in data base (e.g. complexes, mutants, NMR vs. x-ray)
    10% of new deposits are species variants of structures already in the data base
    11% of new deposits map to known protein families
    9% of new deposits are New Family representatives.

    Of the new families (9%):
    1/3 New folds
    1/3 New superfamilies

    About 40% of new structures map to new families.  This suggests that we know about half of the protein families.  Thus one can estimate that there are ~1500 protein domain families, or ~1000 superfamilies.   But -- there may also be a large number of minor superfamilies.

    He also evaluated BLAST and other programs for their abilities to identify family clusters.  In BLAST searches, the E values were the most reliable; the %-identity was the worst correlate with true homologies.  Alignments were done with SSEARCH.  WU-BLAST was the most reliable program for identifying homologies.  PSI-BLAST can be very good, but also finds a lot of false positives.

    Excellent talk!!

    Christine Orengo (University College, London)

    Developed CATH, together with Janet Thornton and others.  Automatic and manual procedure for grouping proteins into structural families.

    CATH currently has 28 architectures, with different folds and functions.

    Use CATH sequence alignments to classify structures in the PDB.
    SSAP > 80  homologous structures
    SSAP > 70  analogous structures

    OB fold.  Example of a superfamily.  Includes various toxins that bind to oligosaccharides and many ss nucleic-acid binding proteins.  Murzin has suggested that the DNA-binding and oligosaccharide-binding functions may be fundamentally similar.

    CATH Classification:
    3 classes
    28 architectures
    620 families
    870 ??
    1400 ??
    11,300 domains

    12 families have 2 or more functions; 10 have > 3 functions; 490 single-function families.
    Most families map to single class of biochemical functions.

    10 fold families have members with a very wide range of functions.  Call these superfamilies.

    She also talked about a new program, CORA, that calculates a consensus 3D structure for a group of structural homologs in order to derive a "core motif".  Use of this core motif structure can improve the sensitivity of 3D homology searches.

    Really great talk!

    Chris Sander (EBI)

    We will soon have all gene sequences.  However, considering NMR, X-ray, and "safe" homology modeling, we have only 15-20% of the structures covered.

    ~30% of sequences are covered by the known structures.  ~70% are either new families, or they map to existing families but this mapping cannot be made reliably from sequence information alone. The effort to characterize all of these chain folds will be a combination of commercial and academic efforts.

    A new structure is added to the PDB every 4 hours.  But many of these are essentially identical to structures already in the database.

    Described DALI - program for determining structural similarities by structural data base searches.

    Key Question: How often does structural homology give useful clues to biochemical function?

    Ans.  In principle, structural homology can be very useful in identifying biochemical functions.  However, 3D structures are usually determined for proteins for which the function is already known, so the structure analysis does not provide new clues.  Even so, sometimes it is very helpful.

    Example from work of Liisa Holm:  GalT (known enzyme) and PKCI-1 dimer (Hendrickson structure).  Structural homology provided key clues to functional homology.  Moreover, once the structure is available, can use this structure for structure-directed homology searching.  This improves the efficiency of sequence database searches.

    Cutoffs for sequence homology search (e.g., what kind of BLAST score provides reliable identification of homology) have been calibrated by Brenner (see last talk).  But -- these calibrations should be made individually on each family of genes.

    Have developed a program to parse 3D coordinates into domains.  (Holm & Sander).  A description of the Dali organization of domains is available on the DALI home page.

    All vs. All Dali search on protein domains.  Get 5 clusters or classes.  These correspond to the traditional fold classes.  About 26% of all domains fall into 3 fold classes.

    Structural homology sometimes involves a minimal structural and/or functional core.  It is very interesting to look at which parts of proteins are common -- the "homologous core".

    Described GeneQuiz.  Limited to 12 sequences / day on their server.

    In discussion, someone brought up the idea we have discussed in the lab of using GRASP or electrostatic surfaces to gain additional clues of potential functions from the structural analysis.

    Excellent talk!

    Andrej Sali (Rockefeller Univ)

    Homology Modeling in Genomics.

    Homology modeling the full set of yeast open reading frames.

    Relatively weak hits are accepted in alignments to yeast genome because it is better to evaluate the quality of a homology alignment in 3D space.

    MODELLER -- uses methods like those we use with CONGEN.  Extracts homologous distance and dihedral angle constraints, constructs a probability density function, and uses simulated annealing with the CHARMM force field.

    Aim: generate 6800 homology models from the yeast genome??

    Measure of model quality: define a z-score which is a measure of the model quality.  Using Bayesian methods, can develop a test of whether a model is a "good" model or a "bad" model.

    Generated 3993 models for known protein structures -- these defined "good" models.  Z-scores for these correct models provide the criterion for a "good" z-score.  Using the z-score, can decide if a new model is "good" or "bad".

    6218 ORFs in yeast.
    Av. ORF length: 466 AA
    #ORFs modeled: 2256
    # ORFs with "good" models: 1071
    Avg. model length: 171 AA (corresponds to avg. domain size)

    Of 1071 models of yeast proteins, 235 not related to protein of known structure(??), 41 not related to protein of known function.  Using these methods, he has found ca. 50 new homology assignments for previously unassigned yeast orphan genes.

    But -- need to develop improved methods of mapping structures into potential functions.

    In this way, he modeled 17% of yeast genes.  In addition:
      18% of E. coli ORFs
      19% of Mycoplasma ORFs
      20% of C. elegans ORFs
      16% of M. jannaschii ORFs

    ****|  Most pairs of structurally similar proteins have < 30% sequence identity  |****

    How about a "basis set of protein structures" for homology modeling all the rest:

    Need one new structure per family (not per fold).
    Structure is more directly related to function than is sequence.

    Need 3000 - 10,000 domains structures determined by experimental methods to homology model the rest.  Want 75% of C-alpha atoms within 3.5 Å of correct structure.
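
The accuracy target quoted above (75% of C-alpha atoms within 3.5 Å) is easy to state as a check on a model.  The following is a minimal sketch, not Sali's evaluation code: it assumes the model and reference C-alpha coordinates are residue-matched and already superimposed, and the function names and toy coordinates are ours.

```python
import math

def fraction_within(model_ca, reference_ca, cutoff=3.5):
    """Fraction of model C-alpha atoms within `cutoff` angstroms of the
    equivalent reference atom.

    Both inputs are equal-length sequences of (x, y, z) tuples, assumed to be
    residue-matched and already superimposed.
    """
    if len(model_ca) != len(reference_ca) or not model_ca:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    close = sum(
        1 for m, r in zip(model_ca, reference_ca) if math.dist(m, r) <= cutoff
    )
    return close / len(model_ca)

def meets_target(model_ca, reference_ca, cutoff=3.5, required=0.75):
    """True if at least `required` of the C-alpha atoms are within `cutoff`."""
    return fraction_within(model_ca, reference_ca, cutoff) >= required

# Toy four-residue example with made-up coordinates (angstroms).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.5, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 2.0), (11.4, 5.0, 0.0)]
print(fraction_within(model, reference))
print(meets_target(model, reference))
```

In the toy example three of the four atoms lie within the cutoff, so the model just meets the 75% target.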

    Peter Wolynes (UIUC)

    The "protein folding problem"

    The "forward folding problem" is NP-complete.  But -- the threading problem is not.  May make sense to address folding by "inverse folding" or threading.

    Need to optimize procedures and parameters for fitting sequences onto protein structure scaffolds.  How to "learn" an energy function.

    Proposed some method based on energy functions to get alignments correct for homology modeling -- it was published in Proteins in the last year.  It is some sort of threading approach with empirical pair-wise potential, using SARMD?.  Parag should look at this.

    May be better to get the complete set of "foldons" rather than complete basis set of domain structures.

    G. Rose (Johns Hopkins)

    Structure-based Genomics.

    Structural biology reveals surprising "connections between things".  New science will be discovered in this way.

    Very few of these conclusions depend on high-resolution structures.

    Homologous relationships can be detected over longer evolutionary distances using structural comparisons rather than sequence comparisons.  See paper by R.F. Doolittle, 1995. Ann Rev of something.

    Using 2° structure to find fold.  Divides Ramachandran plot into 6 X 6 matrix (60° boxes), then every peptide bond in a protein can be assigned to one box in this matrix.  Fold-recognition using 2° structure based on matrix box assignments; a new threading algorithm.  Takes about 7.5 hrs to search SwissProt.  He uses 2° structure prediction, and then uses these 2° structures to help with the alignment.  This is similar to ideas that Ron Levy is trying to develop.
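
The 6 x 6 binning of the Ramachandran plot can be sketched as follows.  This is only an illustration of the box assignment, not Rose's threading algorithm: the bin origin at -180°, the box numbering, and the function names are our assumptions, and the example angles are typical textbook values for helix and strand.

```python
BOX_SIZE = 60.0  # degrees per box, giving a 6 x 6 grid over the full plot
N_BOXES = 6

def angle_bin(angle_deg: float) -> int:
    """Index (0-5) of the 60-degree interval containing a dihedral angle.

    Angles are taken in degrees on [-180, 180]; +180 wraps onto -180.
    """
    wrapped = (angle_deg + 180.0) % 360.0
    return int(wrapped // BOX_SIZE)

def ramachandran_box(phi: float, psi: float) -> int:
    """Single box number (0-35) for a (phi, psi) pair."""
    return angle_bin(phi) * N_BOXES + angle_bin(psi)

def encode(dihedrals):
    """Turn a list of (phi, psi) pairs into a list of box numbers."""
    return [ramachandran_box(phi, psi) for phi, psi in dihedrals]

# Typical helix-like and strand-like angles (illustrative values only).
helix_like = (-57.0, -47.0)
strand_like = (-120.0, 130.0)
print(encode([helix_like, helix_like, strand_like]))
```

Once each position is reduced to a box number, a structure becomes a string over a 36-letter alphabet, which is what makes fast data base searching plausible.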

    J. Moult (CARB / NIST)

    Structural Genomics -- Deducing Function from Structure

    Joint Project with TIGR/NIST on Haemophilus influenzae genome.

    TIGR is across the road from CARB.  So -- have developed a collaboration between CARB, NIST, and TIGR

    To what extent might knowledge of novel gene product structure contribute to our understanding of function?

    PIs:  Poljak, Doyle, Eisenstein, Gilliland, Howard (X-ray), Herzberg (X-ray), Orban (NMR), Moult (Informatics), Richardson (TIGR), DiFrancisco (TIGR)

    Targets - everything in H. influenzae genome with no known function.
     -50 proteins on hit list, all with no known function
     -have expressed 17 out of 50
     - 2 have been worked up to crystallization by Osnat and Herzberg

    Levels of Function:
     Class, Primary Function, Primary Specificity, Control Network

    What info can you get from structure regarding function??  His opinion is that you need more than structure to say anything useful about function.  Electrostatic fields, docking, virtual screening, etc.

    David Moncton (Assoc. Laboratory Director, Argonne Labs)

    Prospects for Synchrotron Facilities.

    More than 50% of "high impact" structures are now being determined by synchrotron radiation.

    Intensity of available synchrotron radiation has increased historically with a doubling time of every 6 mos.  [But A. Joachimiak (below) mentioned that the beam at APS is already so intense that it has to be attenuated 50-fold to avoid radiation damage to the crystal, even with cryocrystallography!]

    APS has room for 70 beam lines; 40 are currently under contract.

    New beam line is being proposed by Wayne Hendrickson and colleagues for commercial purposes - may go ahead if funding is available.

    Throughput could be increased by x10; optimal lines might approach 1000 structures / yr.

    Cost per structure: $100,000  (I think that this is a wrong estimate -- maybe if you assume the protein is expressed, crystallization conditions are available, and the cost of the synchrotron itself is factored out).

    $100K / structure x 100,000 structures = $10 billion.  Not expensive compared to other government projects.

    (I think that this calculation is nonsense -- it costs more than $100K per structure with current technologies, and you do not need 100,000 structures.  But -- I put about the same price tag on the complete project).

    Wayne Hendrickson (Columbia University)

    Macromolecular Structures keeps track of new structures entering the PDB.  In 1996, X-ray contributed 461 new structures; NMR contributed 112.  Thus, NMR contributed 20%.

    The doubling time for novel structures is 2-3 years.

    MAD phasing is a big advance.  Still need heavy atom derivatives, SeMet, or Br-nucleic acids.

    For incorporating SeMet, they usually use a Met auxotroph in E. coli.  Have also used CHO cells (Endocrinology, 136: 640, 1995).

    Most work is now being done at the Brookhaven NSLS  on beam line X4A.  In 1996, there were 17 MAD structures published from data collected on this beam line, plus 8 other non-MAD structures.  This represents 25 structures / beam line - yr.

    Wayne thinks that this can be increased by a factor of 10, to about 200 structures / beam line -year.  However, this requires a large number of people for analyzing the structures, since no good automation software is available yet.
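
    To put these throughput figures in perspective, here is a rough back-of-the-envelope sketch.  It takes the 3,000 - 10,000 protein "basis set" quoted in the introduction of this report and asks how many beam line-years that would need at the observed X4A rate and at the projected rate.  It assumes, purely for illustration, that every structure goes through a synchrotron beam line and that the beam line is the only bottleneck; the function name is hypothetical.

```python
OBSERVED_1996 = 17 + 8   # MAD + non-MAD structures from X4A in 1996 = 25
PROJECTED = 200          # projected structures per beam line-year, as quoted in the talk


def beamline_years(n_structures, per_beamline_year):
    """Beam line-years needed if the beam line were the only limiting step."""
    return n_structures / per_beamline_year


for target in (3_000, 10_000):
    now = beamline_years(target, OBSERVED_1996)
    later = beamline_years(target, PROJECTED)
    print(f"{target} structures: {now:.0f} beam line-years now, {later:.0f} projected")
```

    Note that ten times the observed 25 structures would be 250 per beam line-year; the notes record the projection as about 200.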

    Sung-Ho Kim (UC - Berkeley)

    Berkeley group has selected the genome of Methanococcus jannaschii as a model for developing Structural Genomics.  The "E. coli" of Structural Genomics.

    This is a thermophilic archaebacterium.  Proteins from thermophilic bacteria often have better properties for crystallization.  They are also easy to purify when expressed in E. coli, since you can heat-denature the E. coli proteins and get a nearly pure sample in one step.  Steve Anderson also pointed out that these proteins are not as likely to be toxic in E. coli since they have evolved to function at higher temperatures than E. coli is grown at.

    They break their genomic targets into three classes of genes:

        A.  Proteins that have homologues in the data base with "known functions"
        B.  Proteins that have sequence homologues in database, none of which have "known functions".
        C.  Proteins from the plasmid of this organism.

    Points out that we still need better methods to go from structure to function.

    Have selected 12 genes from class A, 10 from class B, 5 from class C.

    Observed that the codon preferences of archaebacteria are different from expected -- need rare tRNA on a plasmid in the expression vector.

                                       Class A     Class B     Class C
          Soluble                       9/12        8/10        4/5
          Insoluble                     2/12        1/10        1/5
          Membrane Associated           1/12        1/10

    Heat Stability
          Stable                        8/12        8/10        2/5
          Unstable                      2/12                    2/5
          Insoluble                     2/12        1/10        1/5

    80% of genes examined provided soluble expression.  Of these, 4 proteins have been purified.  Three 3D structures have been determined by X-ray crystallography.
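
    As a cross-check on the 80% figure, the solubility counts from the table can be tallied directly.  This is only a sketch of the arithmetic; it assumes that the blank Class C cell for membrane-associated proteins means zero.

```python
# Expression outcomes per gene class, transcribed from the table in the notes.
# A blank cell (Class C, membrane associated) is assumed to mean 0.
expression = {
    "Class A": {"soluble": 9, "insoluble": 2, "membrane": 1},
    "Class B": {"soluble": 8, "insoluble": 1, "membrane": 1},
    "Class C": {"soluble": 4, "insoluble": 1, "membrane": 0},
}

total_genes = sum(sum(counts.values()) for counts in expression.values())
total_soluble = sum(counts["soluble"] for counts in expression.values())
soluble_fraction = total_soluble / total_genes

print(f"{total_soluble} of {total_genes} genes soluble ({soluble_fraction:.0%})")
```

    The tally gives 21 of 27 genes, or about 78%, consistent with the rounded figure quoted in the talk.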

    Protein Structure 1: Homologue of a eukaryotic translational initiation factor.
    Protein Structure 2: Structurally homologous to several methyltransferases.  Structure provides key clues to function.
    Protein Structure 3: Unusual heat-shock protein.

    Really excellent talk.

    Steve Burley (Rockefeller Univ)

    How will we choose a protein fragment for NMR or crystallography?

    It is very crucial to select the right target for structural analysis.  Suggests focussing on disease gene products as targets (same as us); is working on transcription factors.

    Light scattering data is essential for getting good crystallization conditions.

    P(crystal | monodisperse) = 75%
    P(crystal | polydisperse) = 10%
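
    These two conditional probabilities can be combined with the law of total probability to estimate the overall crystallization rate for a batch of samples.  The sketch below is illustrative only: the fraction of monodisperse samples was not given in the talk, so it is left as a parameter, and the value 0.5 used in the example is a made-up number.

```python
P_CRYSTAL_MONO = 0.75   # P(crystal | monodisperse), from the talk
P_CRYSTAL_POLY = 0.10   # P(crystal | polydisperse), from the talk


def p_crystal(frac_mono):
    """Overall P(crystal) for a batch in which frac_mono of samples are monodisperse."""
    return frac_mono * P_CRYSTAL_MONO + (1.0 - frac_mono) * P_CRYSTAL_POLY


def p_mono_given_crystal(frac_mono):
    """Bayes' rule: fraction of the crystals that came from monodisperse samples."""
    return frac_mono * P_CRYSTAL_MONO / p_crystal(frac_mono)


# Hypothetical batch: half of the samples are monodisperse by light scattering.
print(p_crystal(0.5), p_mono_given_crystal(0.5))
```

    Screening out polydisperse samples before setting up trials moves the success rate from the batch average up toward 75%, which is the practical argument for doing light scattering first.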

    !!!Domain parsing provides good way to prepare crystallizable samples.  Using a combination of proteolysis and mass spectrometry, can identify domains.  Once you have a small domain, nearly always get crystals!!!

    MALDI-TOF mass spectrometry has special advantages in this application.  Though it is lower resolution than other kinds of mass spectrometry, you can work with mixtures of proteins, like what you get from proteolysis.
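
    One way to turn a measured fragment mass into domain boundaries is to list every contiguous stretch of the sequence whose calculated mass matches.  The sketch below illustrates that idea only; it is not the speaker's procedure.  It uses standard average residue masses (rounded), assumes the observed value has already been converted to a neutral mass, ignores modifications, and uses a hypothetical 1 Da tolerance.

```python
# Standard average residue masses in Da, rounded to two decimals.
AVG_RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water per linear peptide (free N- and C-termini)


def fragment_mass(seq):
    """Average neutral mass of an unmodified linear peptide."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq) + WATER


def candidate_fragments(sequence, observed_mass, tol=1.0):
    """Return (start, end, mass) for every stretch within tol Da of observed_mass.

    start and end are 1-based, inclusive residue numbers.
    """
    prefix = [0.0]  # running sum of residue masses
    for aa in sequence:
        prefix.append(prefix[-1] + AVG_RESIDUE_MASS[aa])

    hits = []
    for start in range(len(sequence)):
        for end in range(start + 1, len(sequence) + 1):
            mass = prefix[end] - prefix[start] + WATER
            if abs(mass - observed_mass) <= tol:
                hits.append((start + 1, end, round(mass, 2)))
    return hits
```

    In practice several stretches can match a single mass, so the known cleavage specificity of the protease would be needed to narrow down the candidates.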

    After a target is chosen, he is an advocate of trying to express and crystallize each member of a homologous family in order to find (empirically) the one with the best properties for structure determination.  He gave an example where they expressed 6 homologs in a protein family, only 3 yielded crystals, and only one of these gave a well-diffracting crystal from which the structure could be determined.

    Paul Godowski (Genentech)

    How will we express and purify proteins and protein domains?

    Genentech is in the process of identifying, cloning, and expressing a large number of genes for functional genomics analysis.    (People have said the target is 1000 proteins / yr.).

    Critical database issues that they have identified:
         - give forethought to protein tracking and proper databasing of results as you go along.
         - recommended flexible prototype initially (e.g., FileMaker Pro), then porting to a more powerful database system (e.g., Informix) once important parameters are known.
         - use database to identify the rate-limiting step, then focus resources on this.
         - hot-link internal database to outside databases.
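
    The tracking and rate-limiting-step recommendations above can be sketched with a very small database.  This is a hypothetical illustration, not Genentech's system: the table layout, the stage names, the demo records, and the definition of "rate-limiting" as the stage transition with the lowest pass rate are all assumptions.

```python
import sqlite3

STAGES = ["cloned", "expressed", "assayed"]  # hypothetical pipeline stages, in order


def build_db(records):
    """Create an in-memory tracking table with one row per (target, stage reached)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE progress (target TEXT, stage TEXT)")
    con.executemany("INSERT INTO progress VALUES (?, ?)", records)
    return con


def stage_counts(con):
    """Number of distinct targets that reached each stage, in pipeline order."""
    rows = con.execute(
        "SELECT stage, COUNT(DISTINCT target) FROM progress GROUP BY stage"
    ).fetchall()
    counts = dict(rows)
    return [counts.get(stage, 0) for stage in STAGES]


def rate_limiting_step(con):
    """Return (transition, pass_rate) for the transition that loses the most targets."""
    counts = stage_counts(con)
    worst = None
    for prev, nxt, n_prev, n_next in zip(STAGES, STAGES[1:], counts, counts[1:]):
        if n_prev == 0:
            continue  # nothing entered this step, so no rate to report
        rate = n_next / n_prev
        if worst is None or rate < worst[1]:
            worst = (f"{prev} -> {nxt}", rate)
    return worst


# Made-up demo data: 8 targets cloned, 4 of them expressed, 3 of those assayed.
demo_records = (
    [(f"t{i}", "cloned") for i in range(1, 9)]
    + [(f"t{i}", "expressed") for i in range(1, 5)]
    + [(f"t{i}", "assayed") for i in range(1, 4)]
)
```

    With the demo data, `rate_limiting_step(build_db(demo_records))` points at the cloned-to-expressed transition, which is where resources would then be focused.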

    Higher level expression is obtained by optimization of the translation initiation region (TIR) -- see Simmons & Yansura, Nature Biotechnology 14: 629 (1996).

    All work is currently done in shake flasks.

    For work involving intracellular expression in E. coli, fusions generally require optimization of TIR (for each protein) for good levels of expression.  Criteria for good* results using this method:
          - length of protein ≤ 400 a.a.
          - no. of cysteine residues ≤ 16
          - no. of potential glycosylation sites ≤ 3
    [*Note: they are only assaying crude extracts for various biological activities; this protein expression methodology is not being used to prepare proteins for structural studies.]
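
    The three criteria above lend themselves to a simple sequence pre-filter.  The sketch below is a hypothetical illustration: it assumes the thresholds are upper bounds (the comparison symbols did not survive in these notes), and it counts only N-linked sequons (N-X-S/T, X not P) as "potential glycosylation sites", which may not be the definition the speaker used.

```python
import re

MAX_LENGTH = 400        # residues; assumed to be an upper bound
MAX_CYSTEINES = 16      # assumed to be an upper bound
MAX_GLYCO_SITES = 3     # assumed to be an upper bound

# N-linked sequon N-X-S/T with X != P.  The lookahead lets overlapping
# sequons (e.g. "NNTS") each be counted.
SEQUON = re.compile(r"N(?=[^P][ST])")


def count_sequons(seq):
    """Number of potential N-linked glycosylation sites in a protein sequence."""
    return len(SEQUON.findall(seq.upper()))


def passes_criteria(seq):
    """True if the sequence meets all three expression criteria from the notes."""
    seq = seq.upper()
    return (
        len(seq) <= MAX_LENGTH
        and seq.count("C") <= MAX_CYSTEINES
        and count_sequons(seq) <= MAX_GLYCO_SITES
    )
```

    A filter like this could be run over a list of candidate open reading frames before any cloning is started.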

    Also use N-terminal fusions with purification tag.  Are using a plasmid carrying the ArgU gene to increase efficiency of production of proteins with codon usage significantly different from that of highly-expressed E. coli genes.

    Expression is also being done in baculovirus.  Commercial kits for baculovirus expression are available.  These proteins exhibit posttranslational modifications that may be a problem for crystallization or NMR if they are not uniform: specifically, N-glycosylation, O-glycosylation, phosphorylation.  Another problem: cell stress due to lytic growth of the virus can also lead to heterogeneity.  But generally, results are very good.  Easy to generate subclones of domains.  One RA can handle about 10 constructs simultaneously -- about 3 weeks of work for 10 constructs, start to finish.  [Note: purified protein is not being generated here, either.]

    Pichia expression is not being done at Genentech, but in other labs expression levels of up to 6 g / L have been observed.

    Excellent talk.

    Andrzej Joachimiak (Argonne Labs)

    Using third generation synchrotrons for rapid protein structure determination.

    Interesting technical talk about the potential for using third generation light sources like Argonne for high throughput structure determination.

    Mentioned that even cryopreserved crystals are subject to rapid degradation by the intense APS X-rays.

    Joel Sussman (PDB, Weizmann Institute)

    Bridging the gap between 3D structure and the genome world.

    A nice talk about the current capabilities of the PDB, and how a PDB deposition is made.

    Angela Gronenborn (NIH)

    Steps toward improving NMR structure determination.

    A nice talk.  She described how her lab uses 2D HSQC spectra on mature recombinant proteins and on Protein G fusions of proteins of interest to evaluate "foldedness" and suitability for NMR.  This is a good screening process that is related to ideas we have been discussing for domain trapping using Hex-His or Z fusions.

    She also discussed improved methodologies including residual dipolar couplings for structure refinement.

    Did not really talk about "High Throughput" analysis ideas.

    Gaetano T. Montelione (Rutgers University)

    I sent the following abstract to the meeting:

    Prognosis for "High-Throughput" Protein NMR for Determining Structures and Biochemical Functions of the Products of Novel Genes
    G. T. MONTELIONE, S. ANDERSON, Y.-P. HUANG, C. KULIKOWSKI, G.V.T. SWAPNA, R. TEJERO, D. ZIMMERMAN

    The Human Genome Project is rapidly identifying all of the genes in humans.  The products of these genes are widely recognized as the next generation of therapeutics and targets for the development of pharmaceuticals.  While identification of these genes is proceeding quickly, elucidation of their biochemical functions lags far behind.  Knowledge of 3D structures of gene products can provide important insights into structural homology that is not easily recognized by sequence comparisons.  Thus, structure determination by NMR can provide key information connecting a protein sequence and its biochemical function, as well as setting the stage for subsequent drug development using combinatorial and/or rational design methods.  We are developing technology that will significantly accelerate this process.  This includes bioinformatics methods for parsing genes into domain encoding regions, high level domain expression systems, methods for automated analysis of protein structures from NMR spectral data, and approaches for comparing new 3D structures with structures available in the Protein Database.  This technology will be capable of bringing the processes of analyzing protein structures and discovering the corresponding biochemical functions into the same time frame as the process of gene identification, which is crucial to maximize the benefit of Genomics for pharmaceutical discovery.  Examples illustrating the use of NMR analysis as a means of discovering biochemical functions of pharmaceutically-important RNA-binding proteins will be presented.

    The lunch break was postponed for about 2 hours, and this was the last talk of the meeting.  So, people were very hungry and the talk had to be cut a bit short.

    Handout 1
    Handout 2
    Handout 3