The draft sequences: Comparing species
URL: http://www.nature.com/cgi-taf/DynaPage.taf?file=/nature/journal/v409/n6822/full/409820a0_fs.html
Date accessed: 15 March 2001
15 February 2001
Nature 409, 820-821 (2001) © Macmillan Publishers Ltd.
GERALD M. RUBIN
Comparing the human genome sequences with those of other species will not only
reveal what makes us genetically different. It may also help us understand what
our genes do.
How are the differences between humans and other organisms reflected in our
genomes? How similar are the numbers and types of proteins in humans, fruitflies,
worms, plants and yeast? And what does all of this tell us about what makes a
species unique? With the publication of the draft human genome sequences, on page
860 of this issue [1] and in this week's Science [2],
we can start to compare the sequences of vertebrate, invertebrate and plant
genomes in an attempt to answer these questions.

An obvious place to start our comparison is the total number of genes in each
species. Here is a real surprise: the human genome probably contains between
25,000 and 40,000 genes, only about twice the number needed to make a fruitfly [3],
worm [4] or plant [5]. We
know that there is a higher degree of 'alternative splicing' in humans than in
other species. In other words, there are often many more ways in which a gene's
protein-coding sections (exons) can be joined together to create a functional
messenger RNA molecule, ready to be translated into protein. So more proteins
are encoded per gene in humans than in other species.

Even so, we cannot escape the conclusion, drawn previously from
comparisons of simpler genomes [6], that physical
and behavioural differences between species are not related in any simple way to
gene number. Many researchers, struck by the fact that there are four times as
many genes in some gene families in the human genome as in the fruitfly genome [7],
extrapolated from these cases and suggested that the human genome might be the
product of two doublings of the whole of a simpler genome found in the common
ancestor of fruitflies and humans. But, as the analyses of the human genome show [1, 2],
if such doublings did occur, the evidence for them
has since been obscured by massive gene loss and amplification of particular
gene families in the human genome.

Individual proteins often feature discrete structural units, called domains,
that are conserved in evolution. More than 90% of the domains that can be
identified in human proteins are also present in fruitfly and worm proteins,
although they have been shuffled to create nearly twice as many different
arrangements in humans [1, 2]. Thus,
vertebrate evolution has required the invention of few new domains. Of the human
proteins that are predicted to exist, 60% have some sequence similarity to
proteins from other species whose genomes have been sequenced. Just over 40% of
the predicted human proteins share similarity with fruitfly or worm proteins.
And 61% of fruitfly proteins, 43% of worm proteins and 46% of yeast proteins
have sequence similarities to predicted human proteins.

But what about the proteins whose sequences show no strong similarity to
known proteins from other species? Over a third of the yeast, fruitfly, worm and
human proteins fall into this class. These proteins might retain similar
functions, even though their sequences have diverged. Or they might have
acquired species-specific functions.

Alternatively, we may need to entertain the possibility that the open reading
frames that encode these proteins are maintained in a new way, one that is
independent of the precise amino-acid sequence and thus is free to evolve
rapidly. (An open reading frame is the part of a gene encoding the amino-acid
sequence of its protein product.) After all, we know that cells have at least
one mechanism, called nonsense-mediated decay of mRNA, for detecting imperfect
open reading frames irrespective of the amino-acid sequence that they encode [8].

It will be interesting to see the extent to which the number of human
proteins in this rapidly evolving class decreases as the genomes of other
vertebrates, such as mice, are sequenced. This will give us an indication of
just how fast these proteins are changing. Indeed, there is already evidence
from studies of flies [9] and worms [10]
that these rapidly evolving proteins are less likely to have essential
functions, consistent with their being less likely to be conserved during
evolution.

Such comparisons of distantly related genomes are fascinating from an
evolutionary point of view. But comparison of closely related genomes will be
much more important in addressing the key problem now facing genomics —
determining the function of individual DNA segments. The concept is simple:
segments that have a function are more likely to retain their sequence during
evolution than non-functional segments. So DNA segments that are conserved
between species are likely to have important functions. The ideal species for
comparison are those whose form, physiology and behaviour are as similar as
possible, but whose genomes have evolved sufficiently that non-functional
sequences have had time to diverge. In practice, there may be no one ideal
species, because different genes and regulatory sites evolve at different rates.
Nevertheless, this approach has a long history of success, and becomes
progressively more efficient as the cost of DNA sequencing declines.

One use of such sequence comparisons is to determine the structure of genes:
which parts (the exons) make their way into a functional mRNA molecule and
which do not (the introns). The high degree of alternative splicing in
vertebrates makes this comparative approach particularly important. Gene-finding
computational algorithms cannot easily predict the existence of alternative
forms of an mRNA without experimental information, but this information is
difficult to come by in the case of rare mRNAs. For example, an exon that is
used in only a few cells of the human brain might never be experimentally
detected in an mRNA. But that exon's sequence would probably be conserved in the
mouse genome.

Comparing the genomes of closely related species can also help in identifying
gene-control regions. This approach has been used for over two decades [11],
and has been validated by showing that the conserved sequences indeed correspond
to functional control elements in individual genes [12].
But this computational problem is more difficult than identifying exons, and it
will be challenging to scale up to a genome-wide level. The proteins that
control gene expression by recognizing regulatory regions often detect sequence
features that elude the best computer algorithms, and may use information from
contacts with other proteins that is difficult to model. Proteins are simply
cleverer than computers.

That said, our knowledge of the DNA-binding properties of individual
proteins, as well as the structural features of the DNA sites to which they
bind, continues to increase. Moreover, we can use experimental evidence; for
example, genes that are expressed together might be expected to share control
elements. And, as methods for comparing sequences continue to improve, we can
expect to learn more about elusive features of the genome, such as genes
encoding RNAs that do not encode proteins [13], start
points of DNA replication, and genetic elements that control chromosome
structure.
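
The conservation argument made above, that segments with a function are more likely to retain their sequence than non-functional ones, can be illustrated with a minimal Python sketch. It assumes two sequences that have already been aligned (equal length, with '-' marking gaps), scores each sliding window by the fraction of identical columns, and merges the high-scoring windows into candidate conserved segments. The function names, the window size, the threshold and the toy "human" and "mouse" strings are all hypothetical choices made for illustration. They are not real genomic data, and this is not the method used in the analyses cited here.

```python
def window_identity(aligned_a, aligned_b, window=10):
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same base, 0 for mismatches and gaps.
    matches = [int(a == b and a != "-") for a, b in zip(aligned_a, aligned_b)]
    return [
        sum(matches[i:i + window]) / window
        for i in range(len(matches) - window + 1)
    ]


def conserved_segments(aligned_a, aligned_b, window=10, threshold=0.9):
    """Merge windows scoring at or above `threshold` into (start, end) ranges.

    Ranges are alignment columns, 0-based, with `end` exclusive.
    """
    segments = []
    scores = window_identity(aligned_a, aligned_b, window)
    for start, score in enumerate(scores):
        if score < threshold:
            continue
        end = start + window
        if segments and start <= segments[-1][1]:
            # Overlaps or touches the previous segment: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


# Toy, made-up alignment: two identical blocks flanking a diverged block.
human = "ACGTACGTACGT" + "AAAAAAAAAAAA" + "GGCCGGCCGGCC"
mouse = "ACGTACGTACGT" + "CCCCCCCCCCCC" + "GGCCGGCCGGCC"

print(conserved_segments(human, mouse, window=10, threshold=1.0))
# prints [(0, 12), (24, 36)]
```

In practice the window size and threshold would need tuning for each pair of species because, as noted above, different genes and regulatory sites evolve at different rates.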
1. International Human Genome Sequencing Consortium Nature 409, 860-921 (2001).
2. Venter, J. C. et al. Science 291, 1304-1351 (2001).
3. Adams, M. D. et al. Science 287, 2185-2195 (2000).
4. The C. elegans Sequencing Consortium Science 282, 2012-2018 (1998).
5. The Arabidopsis Genome Initiative Nature 408, 796-815 (2000).
6. Rubin, G. M. et al. Science 287, 2204-2215 (2000).
7. Spring, J. FEBS Lett. 400, 2-8 (1997).
8. Hentze, M. W. & Kulozik, A. E. Cell 96, 307-310 (1999).
9. Ashburner, M. et al. Genetics 153, 179-219 (1999).
10. Fraser, A. G. et al. Nature 408, 325-330 (2000).
11. Ravetch, J. V., Kirsch, I. R. & Leder, P. Proc. Natl Acad. Sci. USA 77, 6734-6738 (1980).
12. Fortini, M. E. & Rubin, G. M. Genes Dev. 4, 444-463 (1990).
13. Lee, R. C., Feinbaum, R. L. & Ambros, V. Cell 75, 843-854 (1993).
Category: 32. Genome Project and Genomics