Our genome unveiled
URL: http://www.nature.com/cgi-taf/DynaPage.taf?file=/nature/journal/v409/n6822/full/409814a0_fs.html
Date accessed: 25 February 2001
15 February 2001 |
Nature 409, 814 - 816 (2001) © Macmillan Publishers Ltd. |
DAVID BALTIMORE
The draft sequences of
the human genome are remarkable achievements. They provide an outline of the
information needed to create a human being and show, for the first time, the
overall organization of a vertebrate's DNA.
I've seen a lot of exciting biology emerge over the past 40 years. But chills
still ran down my spine when I first read the paper that describes the outline
of our genome and now appears on page
860 of this issue1. Not that many questions are
definitively answered for conceptual impact, it does not hold a candle to
Watson and Crick's 1953 paper2 describing the
structure of DNA. Nonetheless, it is a seminal paper, launching the era of
post-genomic science. This milestone of biology's megaproject is the long-promised draft DNA
sequence from the International Human Genome Sequencing Consortium (the public
project). The sequence itself is available to all those connected to the
Internet3. In the paper in this issue, we are
presented with a description of the strategy used to decipher the structures of
the huge DNA molecules that constitute the genome, and with analyses of the
content encoded in the genome. It is the achievement of a coordinated effort
involving 20 laboratories and hundreds of people around the world. It reflects
the scientific community at its best: working collaboratively, pooling its
resources and skills, keeping its focus on the goal, and making its results
available to all as they were acquired. Simultaneously, another draft sequence is being published4.
It is less freely available because it was generated by a company, Celera
Genomics, that hopes to sell the information. This week's Science
contains an account of the history of that project and the analyses of its data,
while another of the papers in this issue contains a comparison of the quality
of the two sequences5. To those who saw this as a
competitive sport, the papers make it appear to be roughly a tie. However, it is
important to remember that Celera had the advantage of all of the public
project's data. Nevertheless, Celera's achievement of producing a draft sequence
in only a year of data-gathering is a testament to what can be realized today
with the new capillary sequencers, sufficient computing power and the faith of
investors. Answers The sequences are about 90% complete for the euchromatic (weakly staining,
gene-rich) regions of the human chromosomes. The estimated total size of the
genome is 3.2 Gb (that is gigabases, the latest escalation of units needed to
contain the fruits of modern technology). Of that, about 2.95 Gb is euchromatic.
Only 1.1% to 1.4% is sequence that actually encodes protein; that is just 5% of
the 28% of the sequence that is transcribed into RNA. Over half of the DNA
consists of repeated sequences of various types: 45% in four classes of
parasitic DNA elements, 3% in repeats of just a few bases, and about 5% in
recent duplications of large segments of DNA. The amounts in the first and third
classes will certainly grow as our ability to characterize them increases in
effectiveness and we examine the darkly staining, heterochromatin regions of
chromosomes. As the co-discoverer of reverse transcriptase (the enzyme that
reverses the common mode of information transfer from DNA to RNA), I find it
striking that most of the parasitic DNA came about by reverse transcription from
RNA. In places, the genome looks like a sea of reverse-transcribed DNA with a
small admixture of genes. Repeats In humans, virtually all of the parasitic DNA repeats seem old and enfeebled,
with little evidence of continuing reinsertions. However, there has been very
little evolutionary scouring of these repeats from the human genome, making it a
rich record of evolutionary history. The mouse genome, by contrast, has many
actively reinserting parasitic sequences and is scoured more intensely, making
it a much younger and more dynamic genome. This difference might reflect the
shorter generation time of mice or something about their physiology, but I find
it an intriguingly enigmatic observation. Much of what we learn about the global organization of the genome is an
elaboration of previous notions. For instance, we knew that the genome had
regions with a relatively high content of GC bases and regions high in AT, but
now we have a very complete appreciation of this architecture. What maintains
the patchiness of the GC/AT ratio in the genome remains an unanswered question.
As was expected, most genes are located outside the heterochromatic regions;
interestingly, however, in regions of the genome rich in GC bases, the gene
density is greater and the average intron size is lower. These introns made
up of largely meaningless sequence that breaks up the protein-coding sequences (exons)
of genes are much longer in human DNA than in the genomes previously
sequenced. Their dilution of the coding sequence is one element that makes
finding genes by computer so difficult in human DNA. A major interest of the genome sequence to many biologists will be the
opportunity it provides to discover new genes in their favourite systems for
instance, cell biologists will search for new genes for signalling proteins, and
neurobiologists will look for new ion channels. This data-mining exercise was
carried out by various groups which report their initial findings in papers that
appear on pages 824859
of this issue. They found some new and interesting genes, but surprisingly few,
and occasionally could not find the full extent of genes that they knew were
there. The paucity of discoveries reflects their concentration on systems that
were previously heavily studied. Gene-regulatory sequences are now there for all to see, but initial attempts
to find them were also disappointing. This is where the genomic sequences of
other species in which the regulatory sequences, but not the functionally
insignificant DNA, are likely to be much the same will open up a cornucopia.
Basically, the human sequence at its present level of analysis allows us to
answer many global questions fairly well, but the detailed questions remain open
for the future. What interested me most about the genome? The number of genes is high on the
list. The public project estimates that there are 31,000 protein-encoding genes
in the human genome, of which they can now provide a list of 22,000. Celera
finds about 26,000. There are also about 740 identified genes that make the
non-protein-coding RNAs involved in various cell housekeeping duties, with many
more to be found. The number of coding genes in the human sequence compares with
6,000 for a yeast cell, 13,000 for a fly, 18,000 for a worm and 26,000 for a
plant. None of the numbers for the multicellular organisms is highly accurate
because of the limitations of gene-finding programs. But unless the human genome
contains a lot of genes that are opaque to our computers, it is clear that we do
not gain our undoubted complexity over worms and plants by using many more
genes. Understanding what does give us our complexity our enormous
behavioural repertoire, ability to produce conscious action, remarkable physical
coordination (shared with other vertebrates), precisely tuned alterations in
response to external variations of the environment, learning, memory. . . need I
go on? remains a challenge for the future. Complexity The most exciting new vista to come from the human genome is not tackling the
question "What makes us human?", but addressing a different one:
"What differentiates one organism from another?". The first question,
imprecise as it is, cannot be answered by staring at a genome. The second,
however, can be answered this way because our differences from plants, worms and
flies are mainly a consequence of our genetic endowments. The Celera team4
presents the more detailed analysis of the numbers of different protein motifs
and protein types, in extensive tables. From them, it is easy to see what types
of proteins and motifs have been amplified for specific types of organisms. In
vertebrates, not surprisingly, we see elaboration and the de novo
appearance of two types of genes: those for specific vertebrate abilities (such
as neuronal complexity, blood-clotting and the acquired immune response), and
those that provide increased general capabilities (such as genes for intra- and
intercellular signalling, development, programmed cell death, and control of
gene transcription). Someday soon we will have the mouse genome, and then those
of fish and dogs, and probably the kangaroo genome from the Australians. Each of
these will fill in a piece of the evolutionary puzzle and will provide exciting
comparisons. We wait with bated breath to see the chimpanzee genome. But knowing now how
few genes humans have, I wonder if we will learn much about the origins of
speech, the elaboration of the frontal lobes and the opposable thumb, the advent
of upright posture, or the sources of abstract reasoning ability, from a simple
genomic comparison of human and chimp. It seems likely that these features and
abilities have mainly come from subtle changes for example, in gene
regulation, in the efficiency with which introns are spliced out of RNA, and in
proteinprotein interactions that are not now easily visible to our
computers and will require much more experimental study to tease out. Another
half-century of work by armies of biologists may be needed before this key step
of evolution is fully elucidated. What is next? Lots of hard work, but with new tools and new aims. First, we
have to stay the course and get the most precise representation of the genome
that we can: this is a matter of filling the cracks, cleaning up the errors, and
getting rid of the uncertainties that plague each of the analytical methods.
Second, we need to see more genomes, with each one giving us a deeper insight
into our own. Third, we need to learn how to take advantage of this book of
life. Tools for scanning the activity levels of genes in different cells,
tissues and settings are becoming available and are already revolutionizing how
we do biological investigation. But we will have to move back from the general
to the particular, because each gene is a story in itself and its full
significance can be learned only from concentrating on its particular
properties. Fourth, we need to turn our new genomic information into an engine of
pharmaceutical discovery. Individual humans differ from one another by about one
base pair per thousand. These 'single nucleotide polymorphisms' (SNPs) are
markers that can allow epidemiologists to uncover the genetic basis of many
diseases. They can also provide information about our personal responses to
medicines in this way, the pharmaceutical industry will get new targets and
new tools to sharpen drug specificity. Moreover, the analysis of SNPs will
provide us with the power to uncover the genetic basis of our individual
capabilities such as mathematical ability, memory, physical coordination, and
even, perhaps, creativity. Biology today enters a new era, mainly with a new methodology for answering
old questions. Those questions are some of the deepest and simplest:
"Daddy, where did I come from?"; "Mommy, why am I different from
Sally?". As these and other questions get robust answers, biology will
become an engine of transformation of our society. Instead of guessing about how
we differ one from another, we will understand and be able to tailor our life
experiences to our inheritance. We will also be able, to some extent, to control
that inheritance. We are creating a world in which it will be imperative for
each individual person to have sufficient scientific literacy to understand the
new riches of knowledge, so that we can apply them wisely.
What have we learned from all of these AGCTs? The best way to answer the
question is to read the analytical sections of the papers. I will only make some
general comments. It is important to remember that no statements can be made
with high precision because the draft sequences have holes and imperfections,
and the tools for analysis remain limited (as described in a further paper6
in this issue, page
828). However, the answers provided by the draft will be of interest to many
investigators, and the value of having the draft published in its imperfect form
is unquestionable.
By contrast, the puffer fish another vertebrate has a genome that
contains very few repeats. But it encodes a perfectly functional creature, so it
seems likely that most of the repeats are simply parasitic, selfish DNA elements
that use the genome as a convenient host. People call this 'junk DNA', but from
the DNA's point of view it deserves more respect. In most places in the human
genome the selfish elements are tolerated, and in some places near the ends
of chromosomes, for instance, or near the chromosome constrictions called
centromeres it builds up to form huge segments. However, the repeated DNA
may have both negative and positive effects. For instance, the paucity of
repeats in certain highly regulated regions of the genome suggests that
insertions there can disrupt gene regulation and are deleterious. Conversely,
the enrichment of the so-called Alu class of repeated sequences in the
gene-rich, high-GC regions of the genome implies that they have a positive
function. The repeats can also be fodder for evolving new functions and act as
loci for gene rearrangements.
Where do our genes come from? Mostly from the distant evolutionary past. In
fact, only 94 of 1,278 protein families in our genome appear to be specific to
vertebrates. The most elementary of cellular functions basic metabolism,
transcription of DNA into RNA, translation of RNA into protein, DNA replication
and the like evolved just once and have stayed pretty well fixed since the
evolution of single-celled yeast and bacteria. The biggest difference between
humans and worms or flies is the complexity of our proteins: more domains
(modules) per protein and novel combinations of domains. The history is one of
new architectures being built from old pieces. A few of our genes seem to have
come directly from bacteria, rather than by evolution from bacteria
apparently bacterial genomes can be direct donors of genes to vertebrates. So
DNA chimaeras consisting of the genes from several organisms can arise naturally
as well as artificially (opponents of 'genetically modified foods' take note).
1. | International Human Genome Sequencing Consortium Nature 409, 860-921 (2001). |
2. | Watson, J. D. & Crick, F. H. C. Nature 171, 737-738 (1953). |
3. | http://genome.cse.ucsc.edu/ |
4. | Venter, J. C. et al. Science 291, 304-1351 (2001). |
5. | Aach, J. et al. Nature 409, 856-859 (2001). | Article | |
6. | Birney, E., Bateman, A., Clamp, M. E. & Hubbard, T. J. Nature 409, 827-828 (2001). | Article | |
Category: 32. Genome Project and Genomics