Analog Science Fiction & Fact Magazine
"The Alternate View" columns of John G. Cramer 

Hacking the Genome Alphabet

by John G. Cramer

Alternate View Column AV-175
Keywords: genome, alphabet, protein, synthesis, 6-letter, amino acids
Published in the December-2014 issue of Analog Science Fiction & Fact Magazine;
This column was written and submitted 6/11/2014 and is copyrighted 2014 by John G. Cramer.
All rights reserved. No part may be reproduced in any form without
the explicit permission of the author.

    This column is about the genome alphabet and a new method for expanding it, described in a paper recently published in Nature online. In particular, the number of letters in the alphabet of the genome has been expanded from 4 to 6. To understand the significance of this development, we'll start by reviewing some basic molecular biology.

    All life on Earth, ranging from bacteria to humans, uses the same genetic code of four nucleotide bases: thymidylic acid (T), cytidylic acid (C), adenylic acid (A), and guanylic acid (G). The nucleotides form pairs, so that A pairs with T and C pairs with G to form the ladder-like chains of DNA that act as a library of instructions for assembling the proteins from which all life forms are constructed. The transcriptase enzyme (RNA polymerase) runs along the DNA chain, transcribing the nucleotide sequence of A, C, G, and T to a "messenger RNA" (mRNA) chain of nucleotides A, C, G, and U, with uridylic acid (U) taking the place of T in the mRNA chain. The transcribed mRNA is then read like a strip of punched paper tape, 3 letters at a time, by the ribosome (see my column #106 "Decoding the Ribosome" in the May-2001 issue of Analog) that, with the help of "transfer RNA" (tRNA), assembles a protein molecule.
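
    As a rough illustration of the pairing and transcription rules just described, here is a minimal Python sketch. The function names and the short example sequence are my own inventions for illustration, not anything from the literature. It also makes a simplifying assumption: the real enzyme reads the template strand of the DNA, but the resulting mRNA has the same letters as the opposite (coding) strand with U substituted for T, so the sketch simply performs that substitution.

```python
# Illustrative sketch only; all names here are hypothetical.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary_strand(strand):
    """Return the partner strand, base by base (A pairs with T, C with G)."""
    return "".join(DNA_PAIRS[base] for base in strand)

def transcribe(coding_strand):
    """Write the coding strand in the mRNA alphabet: U takes the place of T."""
    return coding_strand.replace("T", "U")

def codons(mrna):
    """Split an mRNA string into the 3-letter words read by the ribosome.
    Any leftover 1 or 2 letters at the end are dropped."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

gene = "ATGCTGTAA"  # made-up 9-letter example
print(complementary_strand(gene))  # TACGACATT
print(transcribe(gene))            # AUGCUGUAA
print(codons(transcribe(gene)))    # ['AUG', 'CUG', 'UAA']
```

    The last line shows the "punched paper tape" view of the message: three 3-letter words, read in order.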

    A protein is essentially a one-dimensional string of amino acids that folds itself into a three-dimensional object. Such folded proteins are the basic building blocks of all living things. Proteins form the structural members, regulators, defenders, catalysts, communicators, and pumps; they are the movers and the shakers of all living organisms. The active assembly instructions of each protein are encoded in a strand of mRNA. DNA has a backbone of deoxyribose and phosphate supporting a sequence of A, C, G, and T letters, while mRNA has a backbone of ribose and phosphate supporting a sequence of A, C, G, and U letters. These structural differences make mRNA a more mobile linear sequence that does not have DNA's tendency to coil and form a double helix.

    The sequence of mRNA instructions for assembling a protein can be thought of as a long sentence constructed from a string of words, each consisting of three letters drawn from a four-letter alphabet, with the sentence ultimately terminated by a period or stop command. The alphabet used consists of the four RNA nucleotides A, C, G, and U. In principle, there should be 4 x 4 x 4 or 64 different three-letter combinations constructed from such an alphabet of four letters. However, three of the possible three-letter combinations, UAA, UAG, and UGA, are stop commands that inform the ribosome that the protein sequence is completed and that assembly should halt. Many of the other combinations are redundant, with several different three-letter codes specifying the same amino acid. For example, the three-letter sequences UUA, UUG, CUU, CUC, CUA, and CUG are all instructions to attach the same leucine amino acid to the protein being assembled. Because of these redundancies, the 64 possible three-letter combinations actually code for only 20 different amino acids. Those 20 amino acids specified by the mRNA code are lysine, arginine, histidine, aspartic acid, glutamic acid, asparagine, glutamine, serine, threonine, tyrosine, glycine, alanine, valine, leucine, isoleucine, proline, phenylalanine, methionine, tryptophan, and cysteine. It is interesting to note that there are many other natural amino acids that are not included in this list of 20 and therefore cannot be included in the natural protein synthesis process.
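
    The counting in the paragraph above can be checked with a few lines of Python. The sketch below builds the standard genetic code table from its usual compact form, a 64-character string of one-letter amino acid codes with '*' marking a stop. The variable names are my own choices for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in the order
# UUU, UUC, UUA, UUG, UCU, ... GGG. '*' marks a stop command.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each 3-letter word to the amino acid (or stop) it specifies.
CODON_TABLE = {
    "".join(letters): amino_acid
    for letters, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
distinct_amino_acids = set(CODON_TABLE.values()) - {"*"}

print(len(CODON_TABLE))           # 64
print(stop_codons)                # ['UAA', 'UAG', 'UGA']
print(leucine_codons)             # ['CUA', 'CUC', 'CUG', 'CUU', 'UUA', 'UUG']
print(len(distinct_amino_acids))  # 20
```

    The 61 codes that are not stop commands are shared among only 20 amino acids, which is the redundancy described above.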

    A group at the Scripps Research Institute in La Jolla, California has recently made a significant change in these basic rules of genetics by expanding the genome alphabet from 4 letters to 6. They used a trick involving chloroplasts in plants, which have the ability to import nucleotides from surrounding tissues. The gene responsible for this has been identified, and the Scripps group extracted that gene from an algal cell and spliced it into the DNA of an Escherichia coli bacterium.

    The Scripps group had previously identified an "unnatural" pair of nucleotides, d5SICSTP (X) and dNaMTP (Y), that are about the same size and pair in the same way as the natural nucleotides of DNA. In previous publications they demonstrated that these unnatural nucleotides behaved like natural nucleotide pairs, could be amplified by the polymerase chain reaction (PCR), and could be accurately transcribed from DNA to mRNA. In the present work they introduced a small ring of DNA (a plasmid) containing one X-Y base pair into E. coli and demonstrated, tracking the process through 24 bacterial reproduction cycles, that the normal cell replication machinery of the bacteria reliably caused the DNA containing the X-Y base pair to be reproduced along with the other genetic material of the organism. The work was done with only a single X-Y base pair added to the cell, but it is expected that cells with many such base pairs should reproduce in the same way.

    The implication of this work is that two letters have been added to the alphabet of the genome, leading to 6 x 6 x 6 or 216 possible 3-letter word combinations so that, even with additional redundancies and stop codes, many more possible 3-letter words are available for commanding the assembly of amino acids into proteins. The natural 4-letter genome code permits only 20 amino acids to be used as protein components. It is estimated that with the unnatural 6-letter code, as many as 172 different amino acids could be used in protein assembly, leading to a vastly richer spectrum of possible proteins. This has large implications for bacterium-produced designer drugs and medicines of increased variety, potency, and effectiveness.
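
    The arithmetic behind these numbers can be sketched in Python. The derivation of the 172 estimate is not spelled out here, so the last step below is an assumption on my part: it supposes that every 3-letter word containing at least one X or Y is assigned its own new amino acid, while the 64 natural words keep their present 20. That reading does reproduce the quoted figure.

```python
from itertools import product

NATURAL = "ACGU"
EXPANDED = NATURAL + "XY"  # X and Y stand for the two unnatural letters

natural_codons = {"".join(c) for c in product(NATURAL, repeat=3)}
expanded_codons = {"".join(c) for c in product(EXPANDED, repeat=3)}

# Words that contain at least one X or Y, so have no natural meaning yet.
new_codons = expanded_codons - natural_codons

# Assumption: one distinct new amino acid per new word, plus the natural 20.
max_amino_acids = 20 + len(new_codons)

print(len(natural_codons))   # 64
print(len(expanded_codons))  # 216
print(len(new_codons))       # 152
print(max_amino_acids)       # 172
```

    In practice some of the new words would presumably be needed for redundancy or stop commands, so 172 is best read as an upper limit.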

    How does the link between the new 3-letter words containing X and/or Y and the extra 152 amino acids become established? This is a possible problem for any use of the expanded alphabet. The connection between the mRNA codes and the amino acids is implemented by special transfer RNA (tRNA) molecules present in the cell's cytoplasm. These are rather short RNA strands that function as the amino acid handlers in the protein synthesis process. The tRNA molecule has a special section that contains a complementary sequence that matches and docks with the three-letter mRNA code, and the tRNA molecule ends with a sequence that attaches to an activated form of the specific amino acid to which the three-letter code refers. These tRNA molecules collect their designated amino acids, transport them to the ribosome, and participate in assembling them into a protein.
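
    The docking rule can be illustrated with a short Python sketch. It makes two assumptions that go beyond the description above: it ignores the "wobble" flexibility that real tRNA shows at the third letter, and it supposes that RNA versions of X and Y pair only with each other, as their DNA versions do. The function names are hypothetical.

```python
# Strict pairing rules, extended (by assumption) with an X-Y pair.
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C", "X": "Y", "Y": "X"}

def anticodon(codon):
    """Return the strictly complementary tRNA sequence for an mRNA codon.
    The two strands run in opposite directions, so the result is the
    reversed complement when both are written in the same direction."""
    return "".join(RNA_PAIRS[base] for base in reversed(codon))

def docks(codon, trna_anticodon):
    """True if the tRNA's docking section matches the codon exactly."""
    return anticodon(codon) == trna_anticodon

print(anticodon("CUG"))     # CAG, one of the leucine codes
print(anticodon("AXC"))     # GYU, a made-up codon using an unnatural letter
print(docks("AXC", "GYU"))  # True
print(docks("AXC", "GAU"))  # False
```

    Under these assumptions, no natural tRNA can dock with a codon containing X or Y, because none of the natural letters pairs with either of them.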

    For the process to work, new tRNA molecules that dock with the mRNA codes containing X or Y must be available, and these must attach to one selected member of the expanded set of amino acids on the other end. It is unlikely that such tRNA would be present in a natural cell. Rather, some section of the cell's DNA would have to be modified to have a sequence, containing the new X or Y nucleotides, that would cause transcriptase to produce the needed unnatural tRNA molecules. In principle, such DNA sequences could be synthesized in the laboratory and spliced into cell DNA to provide the needed unnatural tRNA. However, one would have to design the resulting tRNA molecule so that it would perform the needed docking and amino acid assembly tasks. It is not clear, at least to me, that we understand enough about molecular genetics to do that design.

    Another problem is that of protein folding. The accurate prediction of just how a natural protein, which is a linear chain of links selected from among the 20 natural amino acids, folds itself into a three-dimensional molecule is a major problem of molecular biology. At this writing there are about 71 dedicated digital computer servers specifically constructed to address the protein folding problem. These giant systems are being used to predict protein structure, but none of them has been completely successful at the task. The protein folding problem is a major roadblock in designing new drugs and in understanding the molecular processes of living organisms. Now, suppose that we expand the field to include proteins made with 172 possible amino acids instead of just 20. The folding problem clearly becomes more complex. How will we possibly be able to predict how these new unnatural proteins fold, when we are having so much trouble in predicting the folding of the simpler natural ones? As scientists like to say at the end of research papers, much work remains to be done in this area.

    Consequently, the work of the Scripps group is a good start, but we are still far away from expanding the genome alphabet in a directly useful way. Nevertheless, we have started on a path that could lead to many new genetic marvels.

    This is a science-fiction magazine, so let's consider some of the SF implications of the expanded genome alphabet. First, it is now clear that the standard 4-letter genome code forming 3-letter words that is present in all known life forms on Earth is an accident of evolution. There's a better code, which Nature did not choose. A 6-letter genome code is available, which would greatly expand the range of amino acids that cells could use to assemble proteins. Life elsewhere in the Universe might have settled on a different standard, using 6 nucleotides instead of 4. Life forms using such a genome alphabet might be superior to Earth life in adaptability because the range of proteins that these alien cells could assemble would not be restricted to only 20 amino acids.

    Alternatively, technological progress in this area might lead to new genetically engineered organisms that use the expanded genome alphabet to open a wide variety of new chemical functions, including new medicines and new foods. Is it likely that such organisms using unnatural proteins could "run away" and cause plagues, death, and destruction? Not really. Their reproduction depends on the availability of the synthetic X and Y nucleotides, which are scarce or unavailable in natural cells. Modified natural biological organisms are likely to be more dangerous.

    In the far future, we can imagine that the descendants of humanity may have taken their genetic heritage under their own management, and they might graduate to a new genomic level, in which their minds and bodies routinely use the 6-letter genome code to adapt to the challenging and dangerous conditions present in other parts of the universe in space and on alien planets. In any case, a new door has opened in the field of molecular biology.


PCR Amplification of the 6-Letter Genome Alphabet:
"PCR with an Expanded Genetic Alphabet", D. A. Malyshev, et al., J. Am. Chem. Soc. 131, 14620-14621 (2009).

E. coli with 6-Letter Genome Alphabet:
"A semi-synthetic organism with an expanded genetic alphabet", D. A. Malyshev, et al., Nature 509, 385-388 (2014).

SF Novels by John Cramer: my two hard SF novels, Twistor and Einstein's Bridge, are newly released as eBooks by Book View Cafe.

AV Columns Online: Electronic reprints of about 177 "The Alternate View" columns by John G. Cramer, previously published in Analog, are available online.


 This page was created by John G. Cramer on 11/08/2014.