NIGMS Structural Genomics Project Planning Meeting
The Protein Structure Initiative
Bethesda, MD
November 24, 1998




In opening comments, Marv Cassman said that this discussion should build on previous conclusions of the meetings at Argonne, NIH, and Avalon.

David Eisenberg

Summarized Avalon Meeting.

What is the goal of a Human Proteome Project?

     All domain structures?                        | Consensus of Avalon meeting is
     Most / all human structures?              | that you need both goals

Need to begin thinking about Protein Chips; Ligands; Protein binding partners

Goals of a Proteome Initiative:

 -Development of high throughput methods
 -Folds for targets / design  / interpretation
 -Expanded database for folding potentials
 -Completeness changes everything


Arguments against:

 -Structures might outpace functional studies
 -Centers are less efficient
 -Centers cannot solve hard problems


Desirable Features

 -Development and testing of htp methods
 -Pilot projects on small organisms / pathogens
 -Database / fold assignment
 -Structures of molecular machines (S. Harrison)

Hendrickson / Kim:  Should include RNA / not limited to domain structures
                    But crystallography of RNA is not trivial

Sigler:          Domain parsing is inefficient -- this is a major obstacle

Navia:  How will this affect the culture in academic institutions?  How will such a government-funded effort compete with industrial efforts?
 

Steve Burley
 

Expression            |
Purification           |   Robotics are crucial
Characterization   |


Assembly of src -- a big molecular complex -- the large system could not have been done if the structures of the individual domains had not been available

HKL2000 - Experiment management database system

Phase Problem - SOLVE, CNS, CCP4, SHARP

Model Building - automated model building in place

Refinement - CNS, XPLOR

Pilot Studies: several in progress

Phase 1: Determine structures of folds
Phase 2: Protein chips / antibody chips / drug discovery / NMR analysis

R. Matthews:  What about 2D protein gels?  Discussion expanded into interactions, differential display, etc.

P. Sigler: Concepts are maturing; but it is a question of efficiencies.

The Human Genome Project is moving out of its pilot phase.  As this happens the costs are dropping -- pilot projects allow you to develop efficiencies -- to see where the bottlenecks are.

S. Kim: There is value in determining non-hypothesis driven structures.

L. Masefsky (NSF): Must also address RNA / membrane-bound proteins.  Methods development in this area is crucial.

John Moult

Advances in Target Selection

Folds                 All topologies                          | Not
Superfamilies         All evolutionary lines                  | enough
Sequence families     All BLAST clusters                      | minimum sampling: 10^4 sequence families
"Good models"         All clusters with > ~40% sequence ID

Challenges:

 -Robustness of clustering algorithms
 -Domain identification


Results from European Group: SwissProt33 - 33,000 sequences
 

  Cutoff (E)      # Clusters      Avg Cluster Size

  10^-100         34,000          ~ 1 protein
  10^-35          20,000          2.5
  10^-5           11,367
  10^0            10,500          5                  < 30% have a representative structure


Doing only 2000 structures at E = 10^0 could provide 80% of clusters.  This is evident since some of these families have a large number of proteins per cluster.

In the 10^0 grouping:
 

  51 clusters with > 100 sequences / cluster
  94 clusters with 80 - 100 sequences / cluster
  etc


Suggests that we should focus on those families with large # proteins per cluster.

For the biggest clusters, we already have a representative structure.  But -- not for all of them.
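
One way to read the argument above: if one structure is solved per cluster, taking the largest clusters first, a small number of structures accounts for a large share of the sequences.  A minimal Python sketch, assuming "coverage" means the fraction of sequences whose cluster has a representative structure.  The cluster sizes are made up for illustration and are not the SwissProt33 numbers; the function name is hypothetical.

```python
def coverage_by_largest_first(cluster_sizes, n_structures):
    """Fraction of sequences covered when one structure is solved
    for each of the n_structures largest clusters."""
    ordered = sorted(cluster_sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:n_structures]) / total


# Hypothetical, skewed distribution: a few big families, many singletons.
sizes = [120, 90, 40, 10, 5, 5] + [1] * 30

for n in (1, 2, 3, 6):
    print(n, f"{coverage_by_largest_first(sizes, n):.0%}")
```

With this made-up distribution, 3 structures out of 36 clusters cover about 83% of the sequences.  A real estimate would need the actual cluster size distribution.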

Hendrickson:  This analysis is somewhat biased -- some gene types are overrepresented.

Moult then discussed the PSI Web Site and Database, a project in collaboration with C. Sander.

Identified 8 research groups: Holm, Koonin, PIR, etc
Each group asked to provide sequence clusters in a common format.
The families are annotated by target properties:

 -size of family
 -medical relevance
 -structural modules
 -etc


These are domains rather than full length sequences.
 

Paul Sigler

Why not think of a Structure-Function Initiative?  Why think in terms of the paradigm of sequence genomics?

Putting expression into centralized facilities will be very inefficient.

How do you focus bright minds?  Smart people need to be challenged; search for surprises.  Will have a very different type of scientist.

If you do not know the function in advance, you cannot make suitable samples.  The function is key to the design of the expression system and the modifications needed for structural analysis.

What about the difficult projects?  (Montelione -- these kinds of projects will still be preserved for hypothesis-driven research programs, and will be enabled by the larger database of structures.)

"The action is in the interactions".  Function requires a structural partner.

Rowena Matthews

Need appropriate functional annotations and links to each of the PDB entries.  Need links between ENTREZ and PDB.

In choosing proteins for structural analysis -- should select one or more members from a common functional class -- the most useful clusters are those with known function.

This initiative should include efforts to determine functions of proteins - a protein phylogeny in which structures and functions are linked.

Example from R. Matthews lab:

dihydropteroate synthetase

-very limited sequence similarity among homologues
-7 of 9 identical residues interact with substrate -- but can only tell this from a 3D structure, which is available for only one of the homologues.
** Need structure-function knowledge base correlating each 3D structure with what is known about its functions, and which parts of the protein carry out specific aspects of its function.

(P. Sigler feels that this should not be done by the PDB)

Brian Matthews:  Can expect 1 structure / FTE / yr

Navia: Perhaps this project would best be done by private enterprise.

Grant Person: Sequencing centers get ~ $30 MM / yr; 60 Mbases / yr
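
The two figures above imply a cost per base, worked out in the short Python sketch below.  It assumes "$30 MM" means 30 million dollars and "60 Mbases" means 60 million bases, and that the whole budget is attributed to finished sequence; the variable names are illustrative.

```python
# Figures quoted in the notes (approximate).
annual_budget_usd = 30_000_000   # ~$30 MM / yr per sequencing center
bases_per_year = 60_000_000      # 60 Mbases / yr

# Simple ratio; ignores technology development and other overhead.
cost_per_base = annual_budget_usd / bases_per_year
print(f"${cost_per_base:.2f} per base")
```

This comes to about $0.50 per base.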

Marv Cassman:
 

-need to begin with pilot projects
-Germany will put $20 MM into the Protein Structure Factory
-NIH may put in as much as $20 MM.  The money is there for a limited number of significantly funded projects.

 
 Cassman then outlined NIH support for X-ray crystallography:
 
  $150 MM / yr ($50 MM / yr NIGMS)

  There could be a 20 - 30% increase on top of this -- not all out of NIGMS

S. Smerdon (MRC)

Burgeoning interest in Structural Genomics.
MRC interested in US activities; interested in collaborations.
Use of synchrotron radiation; DIAMOND light source.

Adam Felsenfeld - Human Genome Sequencing Project

What should count in the review of a proposal?

Numbers of structures to be determined?  Maybe in the production phase, but in the pilot phase this may not be the most important measure.

Other key criteria:

 Cost reduction
 Technology development
 Evolution of technology


There should be close coordination between pilot and production facilities.

* Cooperation vs. competition -- need to keep cooperation high enough so that when these groups need to work together later they have maintained a high level of cooperative interactions. *

Need to help direct what will go into such centers -- the pipeline of targets.

Very difficult to compare costs between centers.  Costs may depend on the difficulty of the target choices.

Need quality control.  The sequencing project will soon get a "quality assessment center".

Data Distribution:  Data release - must develop policies and processes up front.  Intellectual property issues must be defined up front.  What kinds of intermediate data will be made available to the community?

Mistakes by sequencing projects. Would not have let tight definition of costs go so long.  e.g. $ / bp is not a good measure in the early days when technologies are being developed.  In later years as projects move more into production mode, need common basis to compare efficiencies of different centers.

One key problem encountered in these projects is resistance to changing a process to make it better.

Project Databases -- a critical component of the center -- willing to share project databases developed in these sequencing projects with the community.
 

Afternoon Discussion - How Do We Do It?

Discussion Leader: David Eisenberg

Is there a body of knowledge that will inform biology??

What will we gain from these pilot projects:
 

 Developing technologies
 Learning metrics -- what are efficiencies that will be gained?


We are planning with the assumption that this initiative is ADDED resources to NIH budget.

These projects will have to be organized differently from a normal lab effort.
 

Goals in Phase I

Main goals:

- developing htp technologies
- getting structures of members from unrepresented sequence families
- classifying unrepresented structures relative to the rest
- complexes; machines


Activities in Phase I:

Technology Development

 -Automated data analysis -- from data to structure
 -Sample preparation
  Cloning
  Expression
  Purification
  Crystallization
 -Heterologous expression of complexes
 -Membrane protein sample preparation
 -Informatics
  Target / Family Databases
  Methodologies for all vs all comparisons
  3D family comparisons
  Prediction of possible functions from structure
  Project databases and efficiencies
  Structure-function databases
   Sequence patterns
   3D patterns
  Quality assessment


Metric Determinations for Evaluating Centers

 -Expression metrics for different classes of organisms
 -Yields at various steps
 -Quality assessment
 -Success rate at each stage is weighted for difficulty
 -Cost weighted by difficulty
 -Metrics for lowering difficult barriers / speed
 -Not biological impact - since we do not know "what is important"
 -Difficult structures should be addressed in Phase II
 -Overlap may be OK -- but should be informed
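
The "yields at various steps" metric can be illustrated with a short Python sketch.  The overall yield of a pipeline is the product of the per-stage success rates, and the stage with the lowest rate is the bottleneck.  The stage names follow the sample preparation steps listed earlier; the success rates are hypothetical and were not reported at the meeting.  Difficulty weighting is not modeled here.

```python
from math import prod

# Hypothetical per-stage success rates (fraction of targets that pass).
stage_success = {
    "cloning": 0.9,
    "expression": 0.6,
    "purification": 0.7,
    "crystallization": 0.3,
}


def overall_yield(rates):
    """Fraction of starting targets that survive every stage."""
    return prod(rates.values())


def bottleneck(rates):
    """Name of the stage with the lowest success rate."""
    return min(rates, key=rates.get)


print(f"overall yield: {overall_yield(stage_success):.1%}")
print(f"bottleneck: {bottleneck(stage_success)}")
```

With these made-up rates, about 11% of targets reach a crystal, and crystallization is the limiting stage.
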

Establishment of Centers

 -Centers should serve many labs
 -Produce proteins / RNAs
 -Centers coordinate record keeping and communication


Goals of Phase II

 -Protein Chips
 -RNAs
 -Molecular Machines
 -High specialization

 
