NIH Protein Structure Initiative Meeting:
Target Selection
February 1999, Washington, D.C.

Goals of Meeting

- Computational aspects of target selection

- Development of international collaborations

Marv Cassman
Have a little way to go to describe the value of this program to skeptics

How - clearly feasible

What - this remains to be defined; what are the mistakes?

Steve Burley - Medical Application
Acceleration of Fundamental Work
- Structures of folds
Site Directed Mutagenesis

Functional Hypothesis

Correlation similarities gained from 3D structure analysis

(provides infrastructure for Phase II)
Phase II
- Human Disease Genes

- Bacterial / Viral Pathogens

Using the Infrastructure
I. Access to gram (kg) quantities of proteins
1. biol. chemistry

2. fabrication of protein chips

3. all-on-all binding study

4. facilitate ligand screening

II. High-Throughput Structure Determination
1. structure of protein therapeutics

2. structure of disease genes

3. -

4. infrastructure for X-ray crystal and NMR structure analysis

III. High-Throughput biophysical / functional characteristics (* Very critical for fleshing out the value of this project - my own opinion is that this is a significant potential value of NMR)
1. X-ray fluorescence

2. CD etc.

IV. Bioinformatics Databases
Main value - to invigorate structure-based biomedical research in general.

Q: M. Cassman - Why use a broad approach, rather than choosing things with already characterized functions?

Q: T. Blundell - Have to describe "level of function" relevant to discussion

Q: T. Terwilliger - in answer to M. Cassman's question, when you need the structure later - it is already done. Thinks the most difficult part of this project is getting the protein itself - need to think very hard about this.

S. Burley: 80% of effort is in getting samples and crystallography

Greg Petsko - Pharmaceutical Applications
Founding scientist of _________, a startup pharma company

In drug development -

1/2 time getting to clinical target

1/2 time in clinical trials

All human-gene directed drugs are targeted to 500 genes; 60% hit G-protein-coupled membrane-bound receptors

Less than 200 classes of pathogen-gene targets

Most drug development is not structure-based in the early stage; combinatorial libraries are used instead. Structures could help in defining these libraries.

From lead compound to drug; not a lot of role for structure.

I warn you against overselling the value of structural genomics in pharmaceutical design - structure has limited value in pharmaceutical development

Can possibly be valuable to assign function and folds.

Many times - proteins can serve multiple functions.

Structures can provide useful clues regarding how to design compound libraries for screening.

Q: B. Honig - Computational tools for deducing function from structure are developing. Need many more structures.

Q: W. Hol - It may be impossible to measure the value of a structural genomics project. Structures of human proteins will be very valuable for designing pathogen-specific molecules.

Leigh English - Agricultural Applications

1. Broad public utility

2. Value to health and agriculture

3. Class of proteins that can utilize structural information

4. Accessible to M.A.D. - __________ - aided design

Nice discussion regarding examples of using 3D structures of proteins to engineer better toxins and other agriproducts.

Q: C. Sander - besides toxins, what other agriproducts could benefit from structure?

A: Mostly Toxins. Comment M. Navia: Infectious disease related to animals; production of animal protein (food) improves health.

Q: What will you get from having large numbers of structure?

A: Deducing function from structure. Extracting knowledge from structures.

Paula Fitzgerald - Molecular Replacement
Not only do we need all of the structures - we need multiple structures since things move when they bind things.

Could precalculate structure factors and spherical resonance function for all structures in PDB - then try molecular replacement on initial crystal data set.

S. Kim: Steve Holbrook (?) at LBL has done this.

With molecular replacement - do not need synchrotron.

Can introduce biases that are very hard to refine out.

Maybe best to use MR structure to do electron density fitting.

NMR structures not particularly valuable for MR -- this is related to fact that many of these are small and pack in unusual space groups which are not easily addressed with MR - maybe also inaccuracies in these structures?

Andrej Sali

NMR vs. Xray

Xray vs. Xray

Xray vs. Model

(backbone deviations are similar in all of these analyses)

In modeling yeast genomes, ~20% of models are in "high accuracy" class.

Structure-function motif searching motivates structural genomics. Homology models are being used to do this. It does not seem to be crucial to have high-resolution structures for this.

Predicting function from structure:

- electrostatic distributions

- structure-function motif searching

- large - scale modeling for binding specificity

Sali is putting all his models on the web.

For proteins with ~50% sequence identity, homologs can be modeled with good accuracy.

Q: D. Eisenberg - how about homology modeling when you know the individual domain structures?

A: A. Sali- no one addressing this adequately

Q: B. Honig- what is prognosis for improved homology-modeling?

A: A. Sali - some improvements in side-chain and loop modeling; no good progress in modeling secondary structure shifts.

J. Moult - Deducing function from structure
What kind of function can we get:
Phenotypic function

Cellular function

Molecular Function - these are things we can get from structural analysis

Resolution of Functional Information:
Low Resolution
- Functional class; enzyme class
Med. Resolution
- Primary function, e.g. β-lactamase, protease
High Resolution
- Specificity
Function from structure - Goals
-Assignment of molecular function to "hypothetical" proteins

-Assignment of molecular functions to proteins of known cellular function

Case 1: Assignment of molecular function to "hypothetical" protein.
3D structure of unknown ORF. Looked like existing enzyme.

Subsequently could pick out other related enzymes.

Showed example from CASP in which medium-to-high resolution info could be deduced from modeling.

Tools for structure-function relationships:

Medium Resolution Tools:
-Electrostatic field distributions (Honig)

-Look for big depressions in surface (Thornton)

-Distributions of conserved / non-conserved residues (Fred Cohen)

-Distributions of conserved functional groups. e.g. oxyanion hole (Thornton)

High Resolution Tools:
-Mapping of specific locations of functional groups.
List of Knowledge that can come from Structural Genomics:
- Structure in hand

- Analogs on hand

- Structures for new models

- molecular replacement

- detection of new folds

- for templates

- function clues

- deducing families

What / Where to Deposit
F's, d's, Models, Vectors, Expression Protocols, NMR Constraints
T. Blundell: Need to allow some flexibility of industry to keep some confidentiality of structures, but need to eventually get into public domain
Richard Durbin - Pfam

-dbase of protein domain families

-for each family, have HMM

-using HMMER

-clusters of protein segments

Sanger Centre; Washington University (Sean Eddy)

Family - maximum set of sequences that can be defined by an HMM. Aim to reduce number of false positives by using conservative rules.

1407 protein families

52% of sequences, 42% of residues match

In C. elegans, 20% of genes are singletons

Pfam-B - all sequences that are not in a clustered family

Structures have a high priority for new family creation.

G25 from 1407 families have a structure

Can submit sequence to Pfam to parse it out

Take families linked to OMIM mutations (human disease gene mutants)

-76 have no structures; 19 of these are integral membrane, 4 are low complexity. ~50 left

-Probably would like to get multiple structures for each Pfam family

-~30% of worm genome is covered by Pfam families


Some ambiguities about what we mean by the words "family" and "domain"

Cluster sequences in "sequence - space"

Annotation searching using GLIMPSE - GLobal IMPlicit SEarch

Select clusters with the following features

- contain human sequence

- contains "cancer" in annotation
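The two selection features above amount to a simple filter over clusters. A minimal sketch, assuming each cluster is a plain dictionary with an "id" and a list of sequences carrying "organism" and "annotation" fields (this data layout and the function name are hypothetical; it is not the GLIMPSE interface):

```python
def select_clusters(clusters, keyword="cancer", organism="human"):
    """Return ids of clusters that contain a sequence from `organism`
    and at least one sequence whose annotation mentions `keyword`."""
    selected = []
    for cluster in clusters:
        sequences = cluster["sequences"]
        has_organism = any(s["organism"].lower() == organism for s in sequences)
        has_keyword = any(keyword in s["annotation"].lower() for s in sequences)
        if has_organism and has_keyword:
            selected.append(cluster["id"])
    return selected
```

Note that the two conditions are tested per cluster, not per sequence: a cluster qualifies if any member is human and any member (possibly a different one) carries the keyword.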

Domain annotations via pfam

PIR - Barker (Washington, DC)

360 domains

58,000 families

12,000 > 1 member

46,000 singletons

(50% identity seems too high a cutoff)

Proclass / Genefind Wu (University of Texas)

130,600 sequences

61,000 clustered into families (~50%)

but somehow

80,000 now classified

-Family Attribute File

-Family Member File

-Family Alignment File

-Family Target File

976 PCFA families

1149 PCFB families

Daniel Kahn ProDom
- Many proteins are combinatorial in nature

- 104 Domain families

- If you have a single domain protein it is relatively easy to recruit an entire domain family using properly parameterized psi-blast

- SwissProt & TrEMBL sequences

- Can submit sequence to ProDom to parse it out

- Generates phylogenetic trees. You define max number of clusters that you want to display

- Seems to be very similar to our "expanded COG" site

ProDom Proposal for Selection Targets

-2600 families for structure analysis
- No 3D structure available

- At least 2 members

- Should contain at least one member that is single domain (climate, ~2000 families)

- Shorter than 500 AA

- 2 most distant sequences at least 10% similar (family homogeneity)

- Should try single domains - it is difficult to define domain boundaries

- Preferably human
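The criteria listed above can be read as a filter plus a ranking preference. A minimal sketch, assuming a hypothetical DomainFamily record (not a ProDom data structure) and assuming that "shorter than 500 AA" means at least one member is under 500 residues:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DomainFamily:  # hypothetical record, for illustration only
    name: str
    member_lengths: List[int]       # length in residues of each member
    has_3d_structure: bool          # any member with a known 3D structure
    has_single_domain_member: bool
    min_pairwise_similarity: float  # % similarity of the two most distant members
    has_human_member: bool

def is_candidate(fam, max_length=500, min_similarity=10.0):
    """Apply the hard criteria from the proposal."""
    return (
        not fam.has_3d_structure
        and len(fam.member_lengths) >= 2
        and fam.has_single_domain_member
        and min(fam.member_lengths) < max_length   # assumption: one short member suffices
        and fam.min_pairwise_similarity >= min_similarity
    )

def rank_candidates(families):
    """Keep candidates; families with a human member sort first ("preferably human")."""
    candidates = [f for f in families if is_candidate(f)]
    return sorted(candidates, key=lambda f: not f.has_human_member)
```

"Preferably human" is treated here as a sort key and not as a hard filter, since the notes list it as a preference.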

PROTOMAP - Michael Linial

Transitivity - powerful tool for establishing remote relationships; use a lot of these to minimize false positives

<10% singletons

~90% of SwissProt is clustered

ProtoMap maps to "Family" level of SCOP

We should look at these clusters: tries to use statistics to compute the probability of a new fold - what is the probability for a cluster to have a new / existing fold? Uses Bayes' theorem - counts steps from this cluster to known folds.

Should select member proteins that have a high probability of being a new fold
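The notes do not record how the probabilities were estimated, only that Bayes' theorem was used. A minimal sketch of the two-hypothesis form (new fold vs. known fold), where the prior and both likelihoods are inputs the user must supply and the function name is hypothetical:

```python
def posterior_new_fold(prior_new, p_evidence_given_new, p_evidence_given_known):
    """P(new fold | evidence) by Bayes' theorem for two exclusive hypotheses.

    prior_new               -- prior probability that a cluster has a new fold
    p_evidence_given_new    -- probability of the observed evidence if the fold is new
    p_evidence_given_known  -- probability of the observed evidence if the fold is known
    """
    numerator = p_evidence_given_new * prior_new
    evidence = numerator + p_evidence_given_known * (1.0 - prior_new)
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence
```

Here "evidence" could be, for example, the observation that a cluster has no short path to any cluster of known fold; that choice is an assumption and is not stated in the talk.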

Eugene Koonin - COGs
On Web
864 COGs from 8 genomes

18% of C. elegans in COG's

26% of yeast

~50% of most prokaryotes

Filters to find ones without known structure - remove COGs with structural info

New list will be published on web.

Should ask Eugene to recheck list and run list against Protease
19% known structures

7% easily predictable

42% predictable

10% integral membrane

22% unknown structure

~200 targets
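As a quick check on the figures above, the percentages sum to 100, and the "unknown structure" share works out close to the ~200 targets quoted. A minimal sketch, assuming the percentages refer to the 864 COGs mentioned earlier (the notes do not state the base explicitly):

```python
TOTAL_COGS = 864  # assumption: the percentage breakdown is over all 864 COGs

BREAKDOWN_PERCENT = {
    "known structures": 19,
    "easily predictable": 7,
    "predictable": 42,
    "integral membrane": 10,
    "unknown structure": 22,
}

def estimated_counts(total, percent_by_class):
    """Convert a percentage breakdown into approximate family counts."""
    return {name: round(total * pct / 100) for name, pct in percent_by_class.items()}

counts = estimated_counts(TOTAL_COGS, BREAKDOWN_PERCENT)
print(sum(BREAKDOWN_PERCENT.values()))   # 100
print(counts["unknown structure"])       # 190, i.e. roughly 200 targets
```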
In order to get into details of biochemical function we need multiple structures per family
Liisa Holm
She has deposited 30,000 family targets

Two problems: (1) domains and (2) coverage

Family unification using BLAST and multiple alignments

30,000 families

- published 3D structures

- large complexes

- integral membrane proteins

++ large family size

++ conserved hypothetical proteins

++ disease related

data set

90 nrdb

40 nrdb

families deposited

# of representations



30,000 (20,000 singletons)

David Eisenberg
Proteome of P. aerophilum

2300 ORFs

Targeted 207

126 proteins expressed

Crystals 9

NMR in Progress 2

X-ray Structures 3
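The counts above form a funnel, and the stage-to-stage yields show where targets are lost. A minimal sketch using only the numbers in these notes (the NMR branch is left out because it runs in parallel to crystallization; the function name is hypothetical):

```python
# Pipeline counts as recorded in the notes.
STAGES = [
    ("ORFs in proteome", 2300),
    ("targeted", 207),
    ("expressed", 126),
    ("crystals", 9),
    ("X-ray structures", 3),
]

def stage_yields(stages):
    """Fraction of items surviving each step of the pipeline."""
    return [
        (prev_name, name, count / prev_count)
        for (prev_name, prev_count), (name, count) in zip(stages, stages[1:])
    ]

for prev_name, name, fraction in stage_yields(STAGES):
    print(f"{prev_name} -> {name}: {fraction:.1%}")
```

The largest drop after targeting is from expressed protein to crystals, which agrees with the earlier remarks that obtaining samples and crystals is the hard part.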

Computational analysis

- Fold assignment

- Med. relevant homologs

Compute probability of a new fold - detecting new folds
Sung-Hou Kim
25 cloned genes

20 soluble proteins

5 insoluble proteins

13 pure proteins

11 crystallized

5 or 7 structures solved

"Hypothetical Proteins" with no known structure or function

1. Methyltransferase - cellular function was implied, but the 3D structure determined the biochemical function

2. Small Heat Shock Protein - structure implies molecular function

3. New ATP - binding motif

Bill Studier - Brookhaven
Have had a website since the project was organized, to publicize targets and progress

Coordination on potential overlap is critical

2 3D structures.

Now picking 100 new targets - already on the website - take a look at this site.

Paul Bash
48 COGs selected

$3000 / construct

Maf Protein Structure - X-ray structure

YCIH - initial NMR structure. strong sequence homology to domain of a repressor

John Moult

CARB / TIGR Structural Genomics Effort

H. influenzae

Total number of genes = 1855

Hypothetical genes = 701

Expressed = 21 - for 1 we have initial chain tracing


Crystallients = 7

NMR = 4

Crystallized = 2
Ed Lattman
Need to share data ASAP since there is a lot of potential overlap
Andrzej Joachimiak (Argonne)
Focusing on X-ray crystallography

Have selected 18 targets

12 Koonin COGs

6 Terry Gaasterland family

Selects 1-2 members from each family

Expressed 12

Determined 1 new 3D structure

Average time to collect a full data set: ~15 min

MAD data
Using automatic chain tracing
1 hour to initial structure

1 week to good structure

U. Heinemann - Protein Structure Factory
$20 million over 5 years, including cost of beam line; aiming for 100 structures

Aim - Establish an infrastructure for low cost, high-throughput analysis of proteins

High throughput technology in human genome search

German Human Genome Project

DHGP 1500 fl cDNA clones

IMAGE 1000 fl cDNA clones

"Select the ones that are easy" - showed funnel. Need bioinformatics to set priorities.

Bioinformatics - help to decide which route to take - NMR vs X-ray

Using CD and DSC to evaluate stability of protein - I think that DSC may be better than CD. Developing project database; expand on db used for storing clones in genome sequencing center.


NMR also being used for ligand screening.

Shigeyuki Yokoyama

Planning to do 100 structures / year using NMR Center

Now purchasing six 800's & ten 600's

Building construction will be completed in June 2000

March 2001 - new proposal: four 900 MHz NMRs

Cell-free protein synthesis

SPring-8
61 beam lines will go on SPring-8

1 beam line will be dedicated for high throughput structure analysis

Target 1. Thermus thermophilus - Whole Cell Project
Complete sequence will be done soon. Crystallizability is very good. Can move from pET to cell-free system easily.

Whole Cell Structure Analysis - offers possibility of simulation.

1000 structures in 10 years.

Tom Blundell
European Commission is very interested in supporting this

Huber / Dodoor and perhaps Blundell in a joint academic / industrial consortium.

Funding will also probably come from the Wellcome Trust. A beam line of one synchrotron will be dedicated to this initiative

Important definitions -



European Commission
Structural Biology Industrial Platform
Future US Projects - NIH
1. Structural genomics is already happening

2. GM will support program projects. ~$900 direct costs / year

3. GM considering larger centers. Need to set up pilot programs. Test some outcomes.

What do you want

A. Optimizing technologies; need guideposts

B. Target selection schemes - what do they give you in terms of outcome

C. Primary issue - what kind of interesting science will come out of this

- starting from the genome

Announcement in next 3 months for funding next year
Lattman - careful "observation" has tremendous value in science

Caskey - Information is valuable

Terwilliger - Need to coordinate structural genomics with functional genomics

Hendrickson - the pilot-project centers should be a test-bed of larger initiative. It is a bit early to set objectives for

Sung-hou Kim - Importance of totality of information is much greater than individual parts. Large amount of new otherwise unpredictable information

Barry Honig - we do not have any good ideas of numbers of singletons -- this needs to be defined

Prioritization in Genome Sequencing Project (Skene, Wellcome Trust)

Objective - Complete sequence project

But objectives not formally written down until July 1996.

The only way to coordinate is to communicate - rapid data release is essential. Need to get data / targets into public domain.

Who owns the information? Need rapid release

"Any project defined for determining a sequence of a specific gene should not be supported"

Target-Selection Web Site



PIR Picasso

-51,000 unique families

-Provide info for target selection

- Promote cooperation and prevent work duplication - can register a target

- Preselected targets

-Criteria for target selection

- Annotation and links for each sequence, including links to the families in each database

- Project status report for that sequence

- Registration. Group page. All your sequences can be presented

Steve Brenner
PRESAGE published in Nucleic Acids Research

- which groups working on it

- what status

- also can submit prediction

- database of homology models

Chris Sander
Considering families with > 5 members
1000 - 2000 families in each database

~1000 are families with no known structures

500 - 1000 families with > 50 members

How many structures should be determined from each family?

Say 30% - homology radius

At 40% - need to solve 10,000 - 15,000 structures

At 30% - need to solve 8,000 - 10,000 structures
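The dependence of the structure count on the homology radius can be illustrated with a greedy cover: pick representatives until every sequence is within the chosen percent identity of one of them. A minimal sketch, assuming pairwise identities are already available in a dictionary (the function name, data layout, and toy numbers are hypothetical, and greedy selection is only one possible strategy, not necessarily the one behind the estimates above):

```python
def choose_representatives(names, identity, radius=30.0):
    """Greedy cover: a sequence counts as covered when it shares at least
    `radius` percent identity with a chosen representative."""
    def ident(a, b):
        if a == b:
            return 100.0
        # Pairs missing from the table are treated as unrelated (0% identity).
        return identity.get((a, b), identity.get((b, a), 0.0))

    uncovered = list(names)
    representatives = []
    while uncovered:
        # Choose the sequence that covers the most still-uncovered sequences.
        best = max(uncovered, key=lambda c: sum(ident(c, o) >= radius for o in uncovered))
        representatives.append(best)
        uncovered = [o for o in uncovered if ident(best, o) < radius]
    return representatives

# Toy example with made-up identities (percent).
pairs = {("a", "b"): 45.0, ("b", "c"): 35.0, ("c", "d"): 20.0}
print(choose_representatives(["a", "b", "c", "d"], pairs, radius=30.0))  # ['b', 'd']
print(choose_representatives(["a", "b", "c", "d"], pairs, radius=40.0))  # ['a', 'c', 'd']
```

In the toy data a stricter cutoff (40%) needs more representatives than a looser one (30%), the same direction as the estimates in the notes.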

John Moult
Singletons: using only SwissProt, 10% singletons; using other databases, get more singletons
Target Selection
Folds vs. Function -> no clear consensus

Medical Importance -> it is hard to predict what is likely to be of medical importance

basic molecular biophysics also crucial

model organism very crucial to human health

Genome specific vs. Classes -> need full complement within an organism
important protein - ones that are widely conserved

for practical reasons (crystallization, NMR), advantage to go across different species

E. Koonin - in sequence, these domains represent segments of sequence with different evolutionary destinies. Evolutionary independence. Many correspond to "structural domains", but need not necessarily do so.

~ Half the time, a domain identified by sequence will correspond to an "autonomous" folding unit.

Don't want

- overlap with existing structural biology efforts

- students as technicians


- having the info available and early

- we want the big picture - model of life on an atomic level

May want to focus on one organism and understand how it works


October 1999 Structural Genomics Meeting

2000 2nd Meeting related to opening of building

can share crystallization robot and beam time