This column is about the genome alphabet and a new method
for expanding it, described in a paper recently published in Nature online. In
particular, the number of letters in the alphabet of the genome has been
expanded from 4 to 6. To understand the significance of this development, we'll
start by reviewing some basic molecular biology.
All life on earth, ranging from bacteria to humans, uses the
same genetic code of four nucleotide bases: thymidylic acid (T), cytidylic acid
(C), adenylic acid (A), and guanylic acid (G). The nucleotides form pairs, so
that A pairs with T and C pairs with G to form the ladder-like chains of DNA
that act as a library of instructions for assembling the proteins from which all
life forms are constructed. The transcriptase enzyme runs along the DNA chain,
transcribing the nucleotide sequence of A, C, G, and T to a "messenger
RNA" (mRNA) chain of nucleotides A, C, G, and U, with uridylic acid (U)
taking the place of T in the mRNA chain. The transcribed mRNA is then read like
a strip of punched paper tape, 3 letters at a time, by the ribosome enzyme (see
my column #106 "Decoding the Ribosome" in the May-2001 issue of
Analog) that, with the help of "transfer RNA" (tRNA), assembles a
protein molecule.
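For readers who like to see such things spelled out, here is a minimal Python sketch of the letter bookkeeping just described. It is an illustration, not a model of the enzymes: it assumes we start from the DNA coding strand, so that the mRNA is simply the same letters with U in place of T, and the example sequence and function names are hypothetical.

```python
def transcribe(coding_strand):
    """Return the mRNA letters for a DNA coding-strand sequence (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def codons(mrna):
    """Split an mRNA string into consecutive three-letter words.

    Any leftover letters that do not fill a complete word are dropped.
    """
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]


dna = "ATGCTTTAA"            # hypothetical example sequence
mrna = transcribe(dna)
print(mrna)                  # AUGCUUUAA
print(codons(mrna))          # ['AUG', 'CUU', 'UAA']
```

The three-at-a-time slicing is the "punched paper tape" reading performed by the ribosome, reduced to string handling.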
A protein is essentially a one-dimensional string of amino
acids that folds itself into a three-dimensional object. Such folded proteins
are the basic building blocks of all living things. Proteins form the structural
members, regulators, defenders, catalysts, communicators, and pumps, the movers
and the shakers of all living organisms. The active assembly instructions of
each protein are encoded in a strand of mRNA. DNA has a backbone of deoxyribose
and phosphate supporting a sequence of A, C, G, and T letters,
while mRNA has a backbone of ribose and phosphate supporting a
sequence of A, C, G, and U letters. These structural differences make mRNA a
more mobile linear sequence that does not have DNA's tendency to coil and form a
double helix.
The sequence of mRNA instructions for assembling a protein
can be thought of as a long sentence constructed from a string of words, each
consisting of three letters drawn from a four-letter alphabet, with the sentence
ultimately terminated by a period or stop command. The alphabet used consists
of the four RNA nucleotides A, C, G, and U. In principle, there should be 4 x 4
x 4 or 64 different three-letter combinations constructed from such an alphabet
of four letters. However, three of the possible three-letter combinations, UAA,
UAG, and UGA, are stop commands that inform the ribosome that the protein
sequence is completed and that assembly should halt. Many of the other
combinations are redundant, with several different three-letter codes specifying
the same amino acid. For example, the three-letter sequences UUA, UUG, CUU, CUC,
CUA, and CUG are all instructions to attach the same leucine amino acid to the
protein being assembled. Because of these redundancies, the 64 possible
three-letter combinations actually code for only 20 different amino acids. Those
20 amino acids specified by the mRNA code are lysine, arginine, histidine,
aspartic acid, glutamic acid, asparagine, glutamine, serine, threonine,
tyrosine, glycine, alanine, valine, leucine, isoleucine, proline, phenylalanine,
methionine, tryptophan, and cysteine. It is interesting to note that there are
many other natural amino acids that are not included in this list of 20 and
therefore cannot be included in the natural protein synthesis process.
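The counting in this paragraph can be checked with a short Python sketch. It builds the standard genetic code table from a compact 64-character string of one-letter amino acid abbreviations, with "*" marking a stop command. The variable names are my own, and the layout (bases in the order U, C, A, G, with the third letter varying fastest) is just one common way of writing the table down.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the 64 codons, '*' marks a stop command.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Pair each three-letter word (UUU, UUC, UUA, ...) with its meaning.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
amino_acids = set(CODON_TABLE.values()) - {"*"}

print(len(CODON_TABLE))     # 64
print(stop_codons)          # ['UAA', 'UAG', 'UGA']
print(len(leucine_codons))  # 6
print(len(amino_acids))     # 20
```

So 61 of the 64 words name an amino acid, and the redundancy squeezes those 61 words down to 20 distinct meanings.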
A group at the Scripps Institute in La Jolla, California has
recently made a significant change in these basic rules of genetics by expanding
the genome alphabet from 4 letters to 6. They used a trick involving
chloroplasts in plants, which have the ability to import nucleotides from
surrounding tissues. The gene responsible for this has been identified, and the
Scripps group extracted that gene from an algal cell and spliced it into the DNA
of an Escherichia coli bacterium.
The Scripps group had previously identified an
"unnatural" pair of nucleotides, d5SICSTP (X) and dNaMTP (Y), that are
about the same size and pair in the same way as the natural nucleotides of DNA.
In previous publications they demonstrated that these un-natural nucleotides
behaved like natural nucleotide pairs, could be amplified by the polymerase
chain reaction (PCR), and could be accurately transcribed from DNA to mRNA. In
the present work they introduced a small ring of DNA (a plasmid) containing one
X-Y base pair into the E. coli genome and demonstrated, tracking the process
through 24 bacterial reproduction cycles, that the normal cell replication
machinery of the bacteria always caused the DNA containing the X-Y base pair to
be reproduced along with the other genetic material of the organism. The work
was done with only a single X-Y base pair added to the cell, but it is expected
that cells with many such base pairs should reproduce in the same way.
The implication of this work is that two letters have been
added to the alphabet of the genome, leading to 6 x 6 x 6 or 216 possible
3-letter word combinations so that, even with additional redundancies and stop
codes, many more possible 3-letter words are available for commanding the
assembly of amino acids into proteins. The natural 4-letter genome code permits
only 20 amino acids to be used as protein components. It is estimated that with
the un-natural 6-letter code, as many as 172 different amino acids could be used
in protein assembly, leading to a vastly richer spectrum of possible proteins.
This has large implications for bacterium-produced designer drugs and medicines
of increased variety, potency, and effectiveness.
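One way to arrive at these numbers is sketched below in Python. The counting rule is an assumption on my part, not something taken from the Scripps paper: the 64 natural words keep their present meanings (20 amino acids plus stops), and every new word containing at least one X or Y is assigned its own new amino acid, with no added redundancy or stop codes.

```python
from itertools import product

NATURAL = "ACGU"
EXPANDED = NATURAL + "XY"   # X and Y stand for the two added letters

natural_words = {"".join(w) for w in product(NATURAL, repeat=3)}
expanded_words = {"".join(w) for w in product(EXPANDED, repeat=3)}

# Words that exist only in the 6-letter alphabet contain at least one X or Y.
new_words = expanded_words - natural_words

NATURAL_AMINO_ACIDS = 20
# Assumption: each new word codes for one distinct new amino acid.
max_amino_acids = NATURAL_AMINO_ACIDS + len(new_words)

print(len(natural_words))   # 64
print(len(expanded_words))  # 216
print(len(new_words))       # 152
print(max_amino_acids)      # 172
```

Under that assumption the 6-letter alphabet supplies 152 new words, which is where an upper limit of 172 amino acids would come from.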
How does the link between the new 3-letter words containing X
and/or Y and the extra 152 amino acids become established? This is a possible
problem for any use of the expanded alphabet. The connection between the mRNA
codes and the amino acids is implemented by special transfer RNA (tRNA)
molecules present in the cell's cytoplasm. These are rather short RNA strands
that function as the amino acid handlers in the protein synthesis process. The
tRNA molecule has a special section that contains a complementary sequence that
matches and docks with the three-letter mRNA code, and the tRNA molecule ends
with a sequence that attaches to an activated form of the specific amino acid to
which the three-letter code refers. These tRNA molecules collect their
designated amino acids, transport them to the ribosome, and participate in
assembling them into a protein.
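The docking rule can be written down in a few lines of Python. This is a deliberately simplified sketch: it uses strict letter-for-letter pairing (A with U, C with G) read in the opposite direction, ignores the looser "wobble" pairing that real tRNA allows at the third position, and treats X with Y as a pair by assumption. The function name is hypothetical.

```python
# Strict pairing partners. The X-Y entry is an assumption for illustration.
PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C", "X": "Y", "Y": "X"}


def anticodon(codon):
    """Return the tRNA sequence that docks with a three-letter mRNA word.

    The two strands pair in opposite directions, so the complement is
    reversed to give the anticodon in the usual reading direction.
    """
    return "".join(PAIRS[base] for base in reversed(codon))


print(anticodon("CUG"))  # CAG
print(anticodon("AXG"))  # CYU
```

Taking the anticodon of an anticodon returns the original word, which is a quick check that the pairing table is symmetric.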
For the process to work, new tRNA molecules that
dock with the mRNA codes containing X or Y must be available, and these must
attach to one selected member of the expanded set of amino acids on the other
end. It is unlikely that such tRNA would be present in a natural cell. Rather,
some section of the cell's DNA would have to be modified to have a sequence,
containing the new X or Y nucleotides, that would cause transcriptase to produce
the needed un-natural tRNA molecules. In principle, such DNA sequences could be
synthesized in the laboratory and spliced into cell DNA to provide the needed
un-natural tRNA. However, one would have to design the resulting tRNA molecule
so that it would perform the needed docking and amino acid assembly tasks. It is
not clear, at least to me, that we understand enough about molecular genetics to
do that design.
Another problem is that of protein folding. The accurate
prediction of just how a natural protein, which is a linear chain of links
selected from among the 20 natural amino acids, folds itself into a
three-dimensional molecule is a major problem of molecular biology. At this
writing there are about 71 dedicated digital computer servers specifically
constructed to address the protein folding problem. These giant systems are
being used to predict protein structure, but none of them have been completely
successful at the task. The protein folding problem is a major roadblock in
designing new drugs and in understanding the molecular processes of living
organisms. Now, suppose that we expand the field to include proteins made with
172 possible amino acids instead of just 20. The folding problem clearly becomes
more complex. How will we possibly be able to predict how these new un-natural
proteins fold, when we are having so much trouble in predicting the folding of
the simpler natural ones? As scientists like to say at the end of research
papers, much work remains to be done in this area.
Consequently, the work of the Scripps group is a good start,
but we are still far away from expanding the genome alphabet in a directly
useful way. Nevertheless, we have started on a path that could lead to many new
genetic marvels.
This is a science-fiction magazine, so let's consider some
of the SF implications of the expanded genome alphabet. First, it is now clear
that the standard 4-letter genome code forming 3-letter words that is present in
all known life forms on Earth is an accident of evolution. There's a better
code, which Nature did not choose. A 6-letter genome code is available, which
would greatly expand the range of amino acids that cells could use to assemble
proteins. Life elsewhere in the Universe might have settled on a different
standard, using 6 nucleotides instead of 4. Life forms using such a genome alphabet
might be superior to Earth life in adaptability because the range of proteins
that these alien cells could assemble would not be restricted to only 20 amino
acids.
Alternatively, technological progress in this area might lead
to new genetically engineered organisms that use the expanded genome alphabet to
open a wide variety of new chemical functions, including new medicines and new
foods. Is it likely that such organisms using un-natural proteins could
"run away" and cause plagues, death, and destruction? Not really.
Their reproduction depends on the availability of the synthetic X and Y
nucleotides, which are scarce or unavailable in natural cells. Modified natural
biological organisms are likely to be more dangerous.
In the far future, we can imagine that the descendants of
humanity may have taken their genetic heritage under their own management, and
they might graduate to a new genomic level, in which their minds and bodies
routinely use the 6-letter genome code to adapt to the challenging and dangerous
conditions found elsewhere in the universe, in space and on alien planets.
In any case, a new door has opened in the field of molecular biology.
John G. Cramer's 2016 nonfiction book (Amazon gives it 5 stars) describing his transactional interpretation of quantum mechanics, The Quantum Handshake - Entanglement, Nonlocality, and Transactions, (Springer, January-2016) is available online as a hardcover or eBook at: http://www.springer.com/gp/book/9783319246406 or https://www.amazon.com/dp/3319246402.
SF Novels by John Cramer: Printed editions of John's hard SF novels Twistor and Einstein's Bridge are available from Amazon at https://www.amazon.com/Twistor-John-Cramer/dp/048680450X and https://www.amazon.com/EINSTEINS-BRIDGE-H-John-Cramer/dp/0380975106. His new novel, Fermi's Question may be coming soon.
Alternate View Columns Online: Electronic reprints of 212 or more "The Alternate View" columns by John G. Cramer published in Analog between 1984 and the present are currently available online at: http://www.npl.washington.edu/av .
References:
PCR Amplification of the 6-Letter Genome Alphabet:
"PCR with an Expanded Genetic Alphabet", D. A. Malyshev, et al., J. Am. Chem. Soc. 131, 14260-14621 (2013).
E. coli with 6-Letter Genome Alphabet:
"A semi-synthetic organism with an expanded genetic alphabet", D. A. Malyshev, et al., Nature 509, 385-388 (2014).