In opening comments, Marv Cassman said that this discussion should build on previous conclusions of the meetings at Argonne, NIH, and Avalon.
Summarized Avalon Meeting.
What is the goal of a Human Proteome Project?
Most / all human structures?
Consensus of Avalon meeting is that you need both goals.
Need to begin thinking about Protein Chips; Ligands; Protein binding partners
Goals of a Proteome Initiative:
-Development of high throughput methods
-Folds for targets / design / interpretation
-Expanded database for folding potentials
-Completeness changes everything
-Structures might outpace functional studies
-Centers are less efficient
-Centers cannot solve hard problems
-Development and testing of htp methods
-Pilot projects on small organisms / pathogens
-Database / fold assignment
-Structures of molecular machines (S. Harrison)
Hendrickson / Kim: Should include RNA / not limited to domain structures
Sigler: Domain parsing is inefficient -- this is a major obstacle
How will this impact the culture in academic institutes? How will such a government-funded effort compete with industrial efforts?
Purification: Robotics are crucial
Assembly of src -- big molecular complex - could not have done large system if the structures of the individual domains were not available
HKL2000 - Experiment management database system
Phase Problem - SOLVE, CNS, CCP4, SHARP
Model Building - automated model building in place
Refinement - CNS, XPLOR
Pilot Studies: several in progress
Phase 1: Determine structures of
Phase 2: Protein chips / antibody chips / drug discovery / NMR analysis
R. Matthews: What about 2D protein gels? Discussion expanded into interactions, differential display, etc.
P. Sigler: Concepts are maturing; but it is a question of efficiencies.
The Human Genome Project is moving out of its pilot phase. As this happens the costs are dropping -- pilot projects allow you to develop efficiencies -- to see where the bottlenecks are.
S. Kim: There is value in determining non-hypothesis driven structures.
L. Masefsky (NSF): Must also address RNA / membrane-bound proteins. Methods development in this area is crucial.
Advances in Target Selection
Superfamilies All evolutionary lines | enough
Sequence families All BLAST clusters | minimum sampling: 10^4 sequence families
"Good models" All clusters with > ~40% sequence ID
-Robustness of clustering algorithms
Results from European Group: SwissProt33 - 33,000 sequences
Cutoff (E)    # Clusters    Avg Cluster Size
10^-100       34,000        ~ 1 protein
10^-35        20,000        2.5
10^0          10,500        5    (< 30% have a representative structure)
Doing only 2000 structures at E = 10^0 could provide 80% of clusters. This is evident since some of these families have a large # of proteins per cluster.
In the E = 10^0 grouping:
51 clusters: > 100 sequences / cluster
94 clusters: 80 - 100 sequences / cluster
Suggests that we should focus on those families with large # proteins per cluster.
For the biggest clusters, we already have a representative structure. But -- not for all of them.
Hendrickson: This analysis is somewhat biased -- some gene types are overrepresented.
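The cluster-size argument above can be illustrated with a minimal sketch. It assumes coverage is counted over sequences and that one structure per cluster is enough to represent that cluster. The function name and the example cluster sizes are hypothetical and are not the SwissProt figures presented at the meeting.

```python
def structures_needed(cluster_sizes, target_fraction):
    """Count how many clusters, taken largest first, must each receive one
    representative structure so that target_fraction of all sequences
    belong to a cluster with a structure."""
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    total = sum(cluster_sizes)
    covered = 0
    # Visit the biggest clusters first: each structure there covers the most sequences.
    for count, size in enumerate(sorted(cluster_sizes, reverse=True), start=1):
        covered += size
        if covered >= target_fraction * total:
            return count
    return len(cluster_sizes)


# Hypothetical cluster sizes, for illustration only.
example_sizes = [120, 90, 40, 10, 5, 5, 3, 2, 1, 1, 1, 1, 1]
print(structures_needed(example_sizes, 0.8))
```

A few large clusters dominate the total here, so a small number of structures covers most sequences. That is the same reasoning used to propose focusing on families with many proteins per cluster. As Hendrickson's comment suggests, such a count inherits any bias in how gene types are represented in the database.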
Moult then discussed the PSI Web Site and Database, a project in collaboration with C. Sander.
Identified 8 research groups: Holm, Koonin, PIR, etc.
Each group asked to provide sequence clusters in a common format.
The families are annotated by target properties:
-size of family
These are domains rather than full length sequences.
Why not think of a Structure-Function Initiative -- why think in terms of the paradigm of sequence genomics.
Putting expression into centralized facilities will be very inefficient.
How do you focus bright minds? Smart people need to be challenged; search for surprises. Will have a very different type of scientist.
If you do not know the function in advance, you cannot make suitable samples. The function is key to the design of expression system and modifications needed for structural analysis.
What about the difficult projects? (Montelione -- these kinds of projects will still be preserved for hypothesis-driven research programs, and will be enabled by the larger database of structures).
"The action is in the interactions". Function requires a structural partner.
Need appropriate functional annotations and links to each of the PDB entries. Need links between ENTREZ and PDB.
In choosing proteins for structural analysis - should select one or more members of a common functional class -- the most useful clusters are those with known function.
This initiative should include efforts to determine functions of proteins - a protein phylogeny in which structures and functions are linked.
Example from R. Matthews lab:
-very limited sequence similarity among homologues
-7 of 9 identical residues interact with substrate -- but can only tell this from a 3D structure, which is available for only one of the homologues.
** Need structure-function knowledge base correlating each 3D structure with what is known about its functions, and which parts of the protein carry out specific aspects of its function.
(P. Sigler feels that this should not be done by the PDB)
Brian Matthews: Can expect 1 structure / FTE / yr
Navia: Perhaps this project would best be done by private enterprise.
Grant Person: Sequencing centers get ~ $30 MM / yr; 60 Mbases / yr
-need to begin with pilot projects
-Germany will put $ 20 MM into protein structure factory
-NIH may put in as much as $20 MM. The money is there for a limited number of significantly funded projects.
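The funding and throughput figures quoted above lend themselves to a rough back-of-the-envelope calculation. The sketch below assumes that the ~$30 MM / yr and 60 Mbases / yr figures describe the same sequencing center, and it uses a purely hypothetical fully loaded cost per FTE, since none was stated at the meeting.

```python
def cost_per_unit(annual_budget_usd, units_per_year):
    """Average cost of one unit of output (a base, a structure, ...)."""
    return annual_budget_usd / units_per_year


def structures_per_year(annual_budget_usd, cost_per_fte_usd, structures_per_fte=1.0):
    """Estimate yearly structure output from a budget.

    structures_per_fte defaults to the quoted 1 structure / FTE / yr.
    cost_per_fte_usd is an ASSUMED figure; the notes do not give one.
    """
    ftes = annual_budget_usd / cost_per_fte_usd
    return ftes * structures_per_fte


# Quoted sequencing figures: ~$30 MM / yr and 60 Mbases / yr.
print(cost_per_unit(30e6, 60e6))            # dollars per base

# Quoted possible NIH contribution of $20 MM, with a HYPOTHETICAL $200,000 per FTE.
print(structures_per_year(20e6, 200_000))   # structures per year
```

The second number is only as good as the assumed cost per FTE, and it ignores any efficiency gains from high-throughput methods, which are the point of the pilot projects.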
Cassman then outlined NIH support for X-ray crystallography:
$150 MM / yr ($50 MM / yr NIGMS)
There could be a 20 - 30% increase on top of this -- not all out of NIGMS
S. Smerdon (MRC)
Burgeoning interest in Structural
MRC interested in US activities; interested in collaborations.
Use of synchrotron radiation; DIAMOND light source.
Adam Fasenfeld - Human Genome Sequencing Project
What should count in review of proposal?
Numbers of structures to be determined?? Maybe in production phase, but in pilot phase this may not be the most important measure.
Other key criteria:
Evolution of technology
There should be close coordination between pilot and production facilities.
* Cooperation vs. competition -- need to keep cooperation high enough so that when these groups need to work together later they have maintained a high level of cooperative interactions. *
Need to help direct what will go into such centers -- the pipeline of targets.
Very difficult to compare costs between centers. Costs may depend on the difficulty of the target choices.
Need quality control. The sequencing project will soon get a "quality assessment center".
Data Distribution: Data release - must develop policies and processes up front. Intellectual property issues must be defined up front. What kinds of intermediate data will be made available to the community.
Mistakes by sequencing projects. Would not have let tight definition of costs go so long. e.g. $ / bp is not a good measure in the early days when technologies are being developed. In later years as projects move more into production mode, need common basis to compare efficiencies of different centers.
One key problem encountered in these projects is resistance to changing a process to make it better.
Project Databases -- a critical component of the center -- willing to share project databases developed in these sequencing projects with the community.
Afternoon Discussion - How Do We Do It?
Discussion Leader: David Eisenberg
Is there a body of knowledge that will inform biology??
What will we gain from these pilot
Learning metrics -- what are efficiencies that will be gained?
We are planning with the assumption that this initiative is ADDED resources to NIH budget.
These projects will have to be organized differently from a normal lab effort.
Goals in Phase I
- developing htp technologies
- getting structures of members from unrepresented sequence families
- classifying unrepresented structures relative to the rest
- complexes; machines
Activities in Phase I:
-Automated data analysis -- from data to structure
-Sample preparation
Cloning
-Heterologous expression of complexes
-Membrane protein sample preparation
-Informatics
Target / Family Databases
Methodologies for all vs all comparisons
3D family comparisons
Prediction of possible functions from structure
Project databases and efficiencies
Structure-function databases
Sequence patterns
Quality assessment
Metric Determinations for Evaluating Centers
-Expression metrics for different classes of organisms
-Yields at various steps
-Success rate at each stage is weighted for difficulty
-Cost weighted by difficulty
-Metrics for lowering difficult barriers / speed
-Not biological impact - since we do not know "what is important"
Establishment of Centers
-Difficult structures should be addressed in Phase II
-Overlap may be OK -- but should be informed
-Centers should serve many labs
Goals of Phase II
-Produce proteins / RNAs
-Centers coordinate record keeping and communication
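The difficulty-weighted metrics listed under Metric Determinations for Evaluating Centers could be computed along the following lines. This is a minimal sketch of one possible weighting scheme, not a scheme agreed at the meeting. The class name, the field names, and the difficulty weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """Outcome of one pipeline stage for one class of targets."""
    attempted: int
    succeeded: int
    difficulty: float  # HYPOTHETICAL weight; higher means a harder target class


def weighted_success_rate(results):
    """Success rate in which each attempt counts in proportion to its difficulty."""
    weighted_attempts = sum(r.difficulty * r.attempted for r in results)
    if weighted_attempts == 0:
        return 0.0
    weighted_successes = sum(r.difficulty * r.succeeded for r in results)
    return weighted_successes / weighted_attempts


def cost_per_weighted_success(total_cost_usd, results):
    """Cost divided by difficulty-weighted successes, so hard targets earn more credit."""
    weighted_successes = sum(r.difficulty * r.succeeded for r in results)
    if weighted_successes == 0:
        raise ValueError("no successes to divide the cost over")
    return total_cost_usd / weighted_successes


# Hypothetical example: easy soluble targets versus a harder class weighted 3x.
results = [
    StageResult(attempted=100, succeeded=60, difficulty=1.0),
    StageResult(attempted=20, succeeded=4, difficulty=3.0),
]
print(weighted_success_rate(results))
print(cost_per_weighted_success(720_000, results))
```

Weighting by difficulty keeps a center that takes on hard targets from looking worse than one that selects only easy ones, which fits the earlier remark that costs are hard to compare when target choices differ in difficulty.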