Goals of Meeting
- Computational aspects of target selection
- Development of international collaborations
Marv Cassman
Have a little way to go to describe the value of this program to skeptics
Steve Burley - Medical Application
How - clearly feasible
What - this remains to be defined; what are the mistakes?
Acceleration of Fundamental Work
- Structures of folds
Phase II
Site Directed Mutagenesis
Models
Functional Hypothesis
Correlation similarities gained from 3D structure analysis
(provides infrastructure for Phase II)
- Human Disease Genes
- Bacterial / Viral Pathogens
Using the Infrastructure
I. Access to gram (kg) quantities of proteins
1. biol. chemistry
2. fabrication of protein chips
3. all-on-all binding study
4. facilitate ligand screening
II. High-Throughput Structure Determination
1. structure of protein therapeutics
2. structure of disease genes
3. -
4. infrastructure for X-ray crystal and NMR structure analysis
III. High-Throughput biophysical / functional characteristics (* Very critical for fleshing out the value of this project - my own opinion is that this is a significant potential value of NMR)
1. X-ray fluorescence
2. CD etc.
IV. Bioinformatics Databases
Main value - to invigorate structure-based biomedical research in general.
Greg Petsko - Pharmaceutical Applications
Q: M. Cassman - Why use a broad approach, rather than choosing things with already characterized functions?
Q: T. Blundell - Have to describe "level of function" relevant to discussion
Q: T. Terwilliger - in answer to M. Cassman's question, when you need the structure later - it is already done. Thinks the most difficult part of this project is getting the protein itself - need to think very hard about this.
S. Burley: 80% of effort is in getting samples and crystallography
Funding scientists of _________, a startup pharma company
Leigh English - Agricultural Applications
In drug development -
1/2 time getting to clinical target
1/2 time in clinical trials
All human-gene-directed drugs are targeted to 500 genes; 60% hit G-protein-coupled membrane-bound receptors
Less than 200 classes of pathogen-gene targets
Most drug development is not structure-based in the early stage; combinatorial libraries are used instead. Structures could help in defining these libraries.
From lead compound to drug; not a lot of role for structure.
I warn you against overselling the value of structural genomics in pharmaceutical design - structure has limited value in pharmaceutical development
Can possibly be valuable to assign function and folds.
Many times - proteins can serve multiple functions.
Structures can provide useful clues regarding how to design compound libraries for screening.
Q: B. Honig - Computational tools for deducing function from structure are developing. Need many more structures.
Q: W. Hul - It may be impossible to measure the value of a structural genomics project. Structures of human proteins will be very valuable for designing pathogen-specific molecules.
Paula Fitzgerald - Molecular Replacement
Targets:
1. Broad public utility
2. Value to health and agriculture
3. Class of proteins that can utilize structural information
4. Accessible to M.A.D. - __________ - aided design
Nice discussion regarding examples of using 3D structures of proteins to engineer better toxins and other agriproducts.
Q: C. Sanders - besides toxins - what other agriproducts could benefit from structure?
A: Mostly Toxins. Comment M. Navia: Infectious disease related to animals; production of animal protein (food) improves health.
Q: What will you get from having large numbers of structures?
A: Deducing function from structure. Extracting knowledge from structures.
Not only do we need all of the structures - we need multiple structures since things move when they bind things.
Andrej Sali
Could precalculate structure functions and spherical resonance function for all structures in PDB - then try molecular replacement on initial XTAL data set.
S. Kim: Steve Holbrook (?) at LBL has done this.
With molecular replacement - do not need synchrotron.
Can introduce biases that are very hard to refine out.
Maybe best to use MR structure to do electron density fitting.
NMR structures not particularly valuable for MR -- this is related to fact that many of these are small and pack in unusual space groups which are not easily addressed with MR - maybe also inaccuracies in these structures?
J. Moult - Deducing function from structure
NMR vs. NMR
NMR vs. Xray
Xray vs. Xray
Xray vs. Model
(backbone deviations are similar in all of these analyses)
In modeling yeast genomes, ~20% of models are in "high accuracy" class.
Structure-Function Motif Searching - motivates structural genomics. Using homology models to do this. It seems not to be crucial to have high-resolution structures to do this.
Predicting function from structure:
- electrostatic distributions
- structure-function motif searching
- large-scale modeling for binding specificity
Sali is putting all his models on the web.
For proteins with ~50% sequence identity - can model homologs with good accuracy.
Q: D. Eisenberg - how about homology modeling when you know the individual domain structures?
A: A. Sali - no one addressing this adequately
Q: B. Honig - what is prognosis for improved homology modeling?
A: A. Sali - some improvements in side-chain (S.C.) and loop modeling; no good progress in getting secondary structure shifts.
Richard Durbin - Pfam
What kind of function can we get:
- Phenotypic function
- Cellular function
- Molecular Function - these are things we can get from structural analysis
Resolution of Functional Information:
- Low Resolution - Functional class; enzyme class
- Med. Resolution - Primary function, e.g. β-lactamase, protease
- High Resolution - Specificity
Function from structure - Goals
- Assignment of molecular function to "hypothetical" proteins
- Assignment of molecular functions to proteins of known cellular function
Case 1: Assignment of molecular function to "hypothetical" protein.
3D structure of unknown ORF. Looked like existing enzyme. Subsequently could pick out other related enzymes.
Showed example from CASP in which medium-to-high resolution info could be deduced from modeling.
Tools for structure-function relationships:
Medium Resolution Tools:
- Electrostatic field distributions (Honig)
- Look for big depressions in surface (Thornton)
- Distributions of conserved / non-conserved residues (Fred Cohen)
- Distributions of conserved functional groups, e.g. oxyanion hole (Thornton)
High Resolution Tools:
- Mapping of specific locations of functional groups.
List of Knowledge that can come from Structural Genomics:
- Structure in hand
- Analogs on hand
- Structures for new models
- molecular replacement
- detection of new folds
- for templates
- function clues
- deducing families
What / Where to Deposit
F's, d's, Models, Vectors, Expression Protocols, NMR Constraints
T. Blundell: Need to allow industry some flexibility to keep some structures confidential, but they need to eventually get into the public domain
www.sanger.ac.uk/software/Pfam
-dbase of protein domain families
-for each family, have HMM
-using HMMER
-clusters of protein segments
Sanger Centre - Washington University (Sean Eddy)
Family - maximum set of sequences that can be defined by an HMM. Aim to reduce number of false positives by using conservative rules.
1407 protein families
52% of sequences, 42% of residues match
In C. elegans, 20% of genes are singletons
Pfam-B - all sequences that are not in a clustered family
Structures have a high priority for new family creation.
G25 from 1407 families have a structure
Can submit sequence to Pfam to parse it out
Take families linked to OMIM mutations (human disease gene mutants)
-76 have no structures; 19 of these are integral membrane, 4 are low complexity. ~50 left
-Probably will like to get multiple structures for each Pfam family
-~30% of worm genome is covered by Pfam families
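The Pfam figures above distinguish coverage by sequence (52%) from coverage by residue (42%). A minimal sketch of how the two numbers can differ is given here; the function name, data layout, and toy values are assumptions made for illustration and are not Pfam output or Pfam code.

```python
from typing import Dict, List, Tuple


def coverage(lengths: Dict[str, int],
             hits: Dict[str, List[Tuple[int, int]]]) -> Tuple[float, float]:
    """Return (fraction of sequences with at least one match,
    fraction of residues covered by matches).

    lengths: sequence id -> length in residues
    hits:    sequence id -> list of 1-based inclusive (start, end) matches
    """
    seqs_hit = 0
    residues_hit = 0
    for seq_id, length in lengths.items():
        covered = set()
        for start, end in hits.get(seq_id, []):
            # Clip each match to the sequence and merge overlaps via the set.
            covered.update(range(max(1, start), min(length, end) + 1))
        if covered:
            seqs_hit += 1
        residues_hit += len(covered)
    return seqs_hit / len(lengths), residues_hit / sum(lengths.values())


# Hypothetical toy data: two of three sequences carry a family match.
lengths = {"seqA": 100, "seqB": 200, "seqC": 100}
hits = {"seqA": [(1, 50)], "seqB": [(11, 60), (41, 110)]}
seq_frac, res_frac = coverage(lengths, hits)
print(f"sequences matched: {seq_frac:.0%}, residues matched: {res_frac:.1%}")
```

Residue coverage is lower than sequence coverage whenever matched sequences are only partly covered by a domain, which is the usual situation for multi-domain proteins.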
Daniel Kahn - ProDom
SYSTERS
Some ambiguities about what we mean by the words "family" and "domain"
Cluster sequences in "sequence - space"
Annotation searching using GLIMPSE - GLobal IMPlicit SEarch
Select clusters with the following features
- contain human sequence
- contain "cancer" in annotation
Domain annotations via Pfam
PIR - Barker (Washington, DC)
360 domains
58,000 families
12,000 with > 1 member
46,000 singletons
(50% identity seems too high a cutoff)
ProClass / GeneFIND - Wu (University of Texas)
130,600 sequences
61,000 clustered into families (~50%)
but somehow
80,000 now classified
-Family Attribute File
-Family Member File
-Family Alignment File
-Family Target File
976 PCFA families
1149 PCFB families
PROTOMAP - Michael Linial
- Many proteins are combinatorial in nature
- 104 Domain families
- If you have a single domain protein it is relatively easy to recruit an entire domain family using properly parameterized psi-blast
- SwissProt & TRMBL sequences
- Can submit sequence to ProDom to parse it out
- Generates phylogenetic trees. You define max number of clusters that you want to display
- Seems to be very similar to our "expanded COG" site
ProDom Proposal for Selection Targets
- 2600 families for structure analysis
- No 3D structure available
- At least 2 members
- Should contain at least one member that is single domain (climate, ~2000 families)
- Shorter than 500 AA
- 2 most distant sequences at least 10% similar (family homogeneity)
- Should try single domains - difficult to define domain boundaries
- Preferably human
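The selection criteria listed above can be sketched as a simple filter. The record layout, field names, and example families are hypothetical, and "shorter than 500 AA" is assumed here to refer to the member chosen as the target; the notes do not say which member the length limit applies to.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Family:
    name: str
    n_members: int
    has_3d_structure: bool
    has_single_domain_member: bool
    target_length: int              # residues in the chosen target (assumption)
    min_pairwise_similarity: float  # percent, between the 2 most distant members
    has_human_member: bool


def is_candidate(f: Family) -> bool:
    """Apply the hard criteria from the proposal."""
    return (not f.has_3d_structure
            and f.n_members >= 2
            and f.has_single_domain_member
            and f.target_length < 500
            and f.min_pairwise_similarity >= 10.0)


def select_targets(families: List[Family]) -> List[Family]:
    """Keep families that pass, listing those with a human member first
    ("preferably human" is treated as a preference, not a filter)."""
    passed = [f for f in families if is_candidate(f)]
    return sorted(passed, key=lambda f: not f.has_human_member)


# Hypothetical example families.
families = [
    Family("famA", 12, False, True, 320, 25.0, False),
    Family("famB", 1, False, True, 150, 100.0, True),   # fails: singleton
    Family("famC", 8, True, True, 210, 30.0, True),     # fails: structure known
    Family("famD", 5, False, True, 480, 12.0, True),
    Family("famE", 9, False, True, 650, 40.0, True),    # fails: too long
]
selected = [f.name for f in select_targets(families)]
print(selected)
```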
www.protomap.cs.huji.ac.il
Eugene Koonin - COGs
Transitivity - powerful tool for establishing remote relationships; use a lot of these to minimize false positives
<10% singletons
~90% of SwissProt is clustered
ProtoMap maps to "Family" level of SCOP
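The idea of transitivity noted above (if A is related to B and B to C, then A and C are placed together even without a direct hit) can be illustrated with a union-find clustering. This is only a sketch of the general principle with hypothetical sequence names; it is not ProtoMap's actual procedure, which according to the notes applies additional checks to limit false positives.

```python
from typing import Iterable, List, Tuple


def cluster_by_transitivity(ids: Iterable[str],
                            similar_pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group ids so that any chain of pairwise similarities links a cluster."""
    ids = list(ids)
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical data: A~B and B~C are detected; A~C is not detected directly.
clusters = cluster_by_transitivity(
    ["A", "B", "C", "D", "E"], [("A", "B"), ("B", "C")])
singletons = [c for c in clusters if len(c) == 1]
print(clusters, len(singletons))
```

A single false-positive pair merges two unrelated clusters in this naive form, which is why conservative thresholds matter when transitivity is used.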
We should look at these clusters. Tries to use statistics to compute the probability that a cluster has a new versus an existing fold: uses Bayes' theorem and counts steps from this cluster to known folds.
Should select member proteins that have a high probability of being a new fold
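The notes mention using Bayes' theorem to estimate whether a cluster has a new fold. A minimal sketch of that calculation follows; the prior and likelihood values are invented for illustration, and how such quantities would actually be estimated is not described in the notes.

```python
def posterior_new_fold(prior_new: float,
                       p_evidence_given_new: float,
                       p_evidence_given_known: float) -> float:
    """Bayes' theorem for two hypotheses (new fold vs. known fold).

    P(new | E) = P(E | new) P(new) / [P(E | new) P(new) + P(E | known) P(known)]
    """
    numerator = p_evidence_given_new * prior_new
    denominator = numerator + p_evidence_given_known * (1.0 - prior_new)
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator


# Hypothetical numbers: 30% of clusters are assumed to have new folds, and the
# evidence (e.g. no link to any cluster of known fold) is assumed to be seen
# in 80% of new-fold clusters but only 20% of known-fold clusters.
p = posterior_new_fold(prior_new=0.3,
                       p_evidence_given_new=0.8,
                       p_evidence_given_known=0.2)
print(round(p, 3))
```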
On Web
Liisa Holm
864 COGs from 8 Genomes
18% of C. elegans in COGs
26% of yeast
~50% of most prokaryotes
Filters to find ones without known structure - remove COGs without structural info
New ones will be published on the web.
Should ask Eugene to recheck list and run list against Protease
In order to get into details of biochemical function we need multiple structures per family
19% known structures
7% easily predictable
42% predictable
10% integral membrane
22% unknown structure
~200 targets
She has deposited 30,000 family targets
David Eisenberg
Two problems: (1) domains and (2) coverage
Family unification using BLAST and multiple alignments
30,000 families
- published 3D structures
- large complexes
- integral membrane proteins
++ large family size
++ conserved hypothetical proteins
++ disease related
data set              # of representations
nrdb                  260,000
90 nrdb               149,000
40 nrdb               68,000
families deposited    30,000 (20,000 singletons)
Proteome of P. aerophilum
Sung-Hou Kim
2300 ORFs
Targeted 207
126 proteins expressed
Crystals 9
NMR in Progress 2
X-ray Structures 3
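The counts above describe an attrition funnel. The short sketch below computes the stage-to-stage yield using only the numbers in the notes; treating the stages as one linear pipeline is a simplifying assumption, and the two NMR structures in progress are left out because they follow a parallel route.

```python
from typing import List, Tuple


def funnel_yields(stages: List[Tuple[str, int]]) -> List[Tuple[str, str, float]]:
    """Return (from_stage, to_stage, fraction retained) for consecutive stages."""
    return [(a_name, b_name, b_count / a_count)
            for (a_name, a_count), (b_name, b_count) in zip(stages, stages[1:])]


# Counts taken from the P. aerophilum notes above.
stages = [
    ("ORFs", 2300),
    ("Targeted", 207),
    ("Expressed", 126),
    ("Crystals", 9),
    ("X-ray structures", 3),
]
yields = funnel_yields(stages)
overall = stages[-1][1] / stages[1][1]  # structures per targeted ORF

for src, dst, frac in yields:
    print(f"{src} -> {dst}: {frac:.1%}")
print(f"Targeted -> X-ray structures overall: {overall:.1%}")
```

The largest loss is between expression and crystals, consistent with the earlier comments that obtaining the protein sample is the hardest part.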
Computational analysis
- Fold assignment
- Compute probability of a new fold - detecting new folds
- Med. relevant homologs
Bill Studier - Brookhaven
25 cloned genes          13 pure proteins
20 soluble proteins      11 crystallized
5 insoluble proteins     5 or 7 structures solved
"Hypothetical Proteins" with no known structure or function
1. Me Transferase - cellular function was implied, but 3D structure determined biochemical function
2. Small Heat Shock Protein - structure implies molecular function
3. New ATP - binding motif
Have had a website since the project was organized, to publicize targets and progress
Paul Bash
Coordination on potential overlap is critical
2 3D structures.
Now picking 100 new targets - already on the website - take a look at this site.
48 COGs selected
John Moult
$3000 / constant
Maf Protein Structure - X-ray structure
YCIH - initial NMR structure. strong sequence homology to domain of a repressor
http://www.s2f.carb.nist.gov
Ed Lattman
CARB / TIGR Structural Genomics Effort
H. influenzae
Total number of genes = 1855
Hypothetical genes = 701
Expressed = 21 - for 1 we have initial chain tracing
Purified
Crystallients = 7
Crystallized = 2
NMR = 4
Need to share data ASAP since there is a lot of potential overlap
Andre Javicianski (Argonne)
Focusing on X-ray crystallography
U. Heineman - Protein Structure Factory
Have selected 18 targets
12 Koonin COGs
6 Terry Gaasterland family
Applies 1-2 members from family
Expressed 12
Determined 1 new 3D structure
Any form to collect full data set ~15 min
MAD data
Using automatic chain tracing:
1 hour to initial structure
1 week to good structure
$20 million over 5 years including cost of beam line, aiming for 100 structures
Shigeyuki Yokoyama
Aim - Establish an infrastructure for low cost, high-throughput analysis of proteins
High throughput technology in human genome search
German Human Genome Project
DHGP 1500 fl cDNA clones
IMAGE 1000 fl cDNA clones
"Select the ones that are easy" - showed funnel. Need bioinformatics to set priorities.
Bioinformatics - help to decide which route to take - NMR vs X-ray
Using CD and DSC to evaluate stability of protein - I think that DSC may be better than CD. Developing project database; expand on db used for storing clones in genome sequencing center.
BESSY - II
NMR also being used for ligand screening.
Tom Blundell
Planning to do 100 structures / year using NMR Center
Now purchasing six 800's & ten 600's
Building construction will be done in June 2000
March 2001 - new proposal: four 900 MHz NMRs
Cell-free protein synthesis
SPring-8
61 beam lines will go on SPring-8
1 beam line will be dedicated for high throughput structure analysis
Target 1. Thermus thermophilus - Whole Cell Project
Complete sequence will be done soon. Crystallizability is very good. Can move from pET to cell-free system easily.
Whole Cell Structure Analysis - offers possibility of simulation.
1000 structures in 10 years.
European Commission
European Commission is very interested in supporting this
Huber / Dodoor and perhaps Blundell in a joint academic / industrial consortium.
Funding will also probably come from Wellcome Trust. A beam line of one synchrotron will be dedicated to this initiative
Important definitions -
Function
Fold
Structural Biology Industrial Platform
www.biodigm.com/science/ip.htm
Future US Projects - NIH
1. Structural genomics is already happening
2. GM will support program projects. ~$900 direct costs / year
3. GM considering larger centers. Need to set up pilot programs. Test some outcomes.
What do you want
A. Optimizing technologies; need guideposts
B. Target selection schemes - what do they give you in terms of outcome
C. Primary issue - what kind of interesting science will come out of this
- starting from the genome
Announcement in next 3 months for funding next year
Lattman - careful "observation" has tremendous value in science
Caskey - Information is valuable
Terwilliger - Need to coordinate structural genomics with functional genomics
Hendrickson - the pilot-project centers should be a test-bed of larger initiative. It is a bit early to set objectives for
Sung-hou Kim - Importance of totality of information is much greater than individual parts. Large amount of new otherwise unpredictable information
Barry Honig - we do not have any good ideas of numbers of singletons
-- this needs to be defined
Prioritization in Genome Sequencing Project (Skene, Wellcome Trust)
Objective - Complete sequence project
Target-Selection Web Site
But objectives not formally written down until July 1996.
The only way to coordinate is to communicate - rapid data release is essential. Need to get data / targets into public domain.
Who owns the information? Need rapid release
"Any project defined for determining a sequence of a specific gene should not be supported"
www.structuralgenomics.org
Steve Brenner
COGS PROCLASS
pfam PROTOMAP
PIR Picasso
-51,000 unique families
-Provide info for target selection
- Promote cooperation and prevent work duplication - can register a target
- Preselected targets
-Criteria for target selection
- Annotation and links for each sequence, including links to the families in each database
- Project status report for that sequence
- Registration. Group page. All your sequences can be presented
Presage published in Nucleic Acids Research
Chris Sander
- which groups working on it
- what states
- also can submit prediction
- database of homology models
Considering families with > 5 members
John Moult
How many structures should be determined from each family?
1000 - 2000 families in each database
~1000 are families with no known structures
500 - 1000 families with > 50 members
Say 30% - homology radius
At 40% - need to solve 10,000 - 15,000 structures
At 30% - need to solve 8,000 - 10,000 structures
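The dependence of the required number of structures on the "homology radius" can be illustrated with a greedy cover: pick sequences to solve until every sequence is within the chosen identity cutoff of a solved one. This is a sketch under assumptions, not the method behind the estimates above; the sequence names and pairwise identities are hypothetical.

```python
from typing import Dict, FrozenSet, List


def pick_structures(ids: List[str],
                    identity: Dict[FrozenSet[str], float],
                    radius: float) -> List[str]:
    """Greedily choose sequences to solve so that every sequence has at least
    `radius` percent identity to a chosen one (or is itself chosen)."""
    uncovered = set(ids)
    chosen = []

    def covers(a: str) -> set:
        # Still-uncovered sequences that a structure of `a` could model.
        return {b for b in uncovered
                if a == b or identity.get(frozenset((a, b)), 0.0) >= radius}

    while uncovered:
        # Ties are broken by name order so the result is deterministic.
        best = max(sorted(ids), key=lambda a: len(covers(a)))
        chosen.append(best)
        uncovered -= covers(best)
    return chosen


# Hypothetical pairwise identities (percent); unlisted pairs count as 0.
identity = {
    frozenset(("s1", "s2")): 45.0,
    frozenset(("s2", "s3")): 35.0,
    frozenset(("s4", "s5")): 32.0,
}
ids = ["s1", "s2", "s3", "s4", "s5"]
at_40 = pick_structures(ids, identity, radius=40.0)
at_30 = pick_structures(ids, identity, radius=30.0)
print(len(at_40), len(at_30))
```

Lowering the cutoff lets each solved structure serve more relatives, so fewer structures are needed, in line with the smaller estimate quoted for 30% than for 40%.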
Singletons - using only SwissProt, 10% singletons; using other db, get more singletons
Target Selection
Folds vs. Function -> no clear consensus
Domains
Medical Importance -> it is hard to predict what is likely to be of medical importance
Genome specific vs. Classes -> need full complement within an organism
basic molecular biophysics also crucial
model organism very crucial to human health
important protein - ones that are widely conserved
for practical reasons (crystallization, NMR), advantage to go across different species
E. Koonin - in sequence, these domains represent segments of sequence with different evolutionary destinies. Evolutionary independence. Many correspond to "structural domains", but need not necessarily do so.
Impact
~ Half the time, a domain identified by sequence will correspond to an "autonomous" folding unit.
HANDOUTS:
Don't want
- overlap with existing structural biology efforts
- students as technicians
Advantages
- having the info available and early
- we want the big picture - model of life on an atomic level
May make sense to focus on one organism and understand how it works
Japan
October 1999 Structural Genome Meeting
2000 2nd Meeting related to opening of building
Heineman
can share crystallization robot and beam time