Methods for Computational Gene Prediction / Edition 1

by William H. Majoros
ISBN-10: 0521706947
ISBN-13: 9780521706940
Pub. Date: 08/16/2007
Publisher: Cambridge University Press
Paperback

Overview

Inferring the precise locations and splicing patterns of genes in DNA is a difficult but important task, with broad applications to biomedicine. The mathematical and statistical techniques that have been applied to this problem are surveyed and organized into a logical framework based on the theory of parsing. Both established approaches and methods at the forefront of current research are discussed. Numerous case studies of existing software systems are provided, in addition to detailed examples that work through the actual implementation of effective gene-predictors using hidden Markov models and other machine-learning techniques. Background material on probability theory, discrete mathematics, computer science, and molecular biology is provided, making the book accessible to students and researchers from across the life and computational sciences. This book is ideal for use in a first course in bioinformatics at graduate or advanced undergraduate level, and for anyone wanting to keep pace with this rapidly-advancing field.

Product Details

ISBN-13: 9780521706940
Publisher: Cambridge University Press
Publication date: 08/16/2007
Edition description: New Edition
Pages: 448
Product dimensions: 6.89(w) x 9.72(h) x 0.79(d)

About the Author

W. H. Majoros is Staff Scientist at the Center for Bioinformatics and Computational Biology, in the Institute for Genome Sciences and Policy at Duke University. He has worked as a research scientist in the fields of computational biology, natural language processing, and information retrieval for over a decade. He was part of the human genome project at Celera Genomics and has taken part in the sequencing and analysis of numerous organisms including human, mouse, fly and mosquito.

Read an Excerpt

Methods for Computational Gene Prediction
Cambridge University Press
9780521877510 - Methods for Computational Gene Prediction - by William H. Majoros
Excerpt



1 Introduction




The problem that we wish to address in this book is that of predicting computationally the one-dimensional structure of eukaryotic protein-coding genes. Our first order of business will be to define this problem more precisely, and to circumscribe the issue so as to reflect the set of assumptions and constraints which typically apply in the case of practical gene-finding systems. As we recognize that not all readers will be familiar with the relevant facts and theories from molecular biology, we begin with the so-called central dogma of molecular biology, in which the significance of genes and their genomic structure are defined in relation to current biological understanding and the goals of modern medicine.

1.1 The central dogma of molecular biology

Life on Earth began approximately 3.5 billion years ago, and since that time it has advanced through a number of stages of increasing complexity. Beginning with the first replicating molecules and unicellular organisms, our own evolutionary trajectory has taken us along an epic journey progressing from microscopic invertebrates to bony fishes, to the amphibians and reptiles who first established the vertebrate kingdom on dry land, to the basal mammals who hid from the dinosaurs among the primitive trees and bushes of the Mesozoic era, and then on to our closest living relatives, the great apes. Through all that time, one fact has remained constant: that our animal selves, both physically and cognitively (with the latter following, of course, the advent of that organ which we call the brain) have been extensively shaped by the action of our genes, through their influence both on our bodily development (our ontogeny) and also on our ongoing biological processes, both of which are overwhelmingly mediated through the action of proteins, the primary biochemical products of gene expression.

   It is this fundamental influence of the genes and their protein products on human health which is largely responsible for the success and continued momentum of the biological revolution and the tremendous insight into human genetics which has materialized in the late twentieth and early twenty-first centuries, and which at present drives the search for cures to such maladies as cancer, heart disease, diabetes, and dementia, all of which continue to afflict a significant portion of our population, despite the best efforts of modern medicine. It is precisely because of their direct influence on our development and on our physical and mental health that the academic and medical communities have focused so much of their attention on identifying and characterizing human genes. In addition, because many organisms act as pathogens either to ourselves or to animals or plants of economic importance to us, there is also a great interest in identifying and understanding the genes of many non-human species.

   Each of the trillion cells in the human body contains in its nucleus the complete human genome – 23 pairs of chromosomes – a collection of DNA molecules encoding the information necessary for building and maintaining a healthy human being. It is this information which we pass on to our children in the form of resemblances and inherited traits, and which, when unfavorably perturbed, can lead to disease.

   As depicted in Figure 1.1, the DNA in our chromosomes forms a double helix, and in its native state is tightly bound into a complex involving extra-genetic elements called histones, about which, though they may have substantial influence on the expression of genes, we will have relatively little to say except in relation to the methylation status of certain elements upstream of mammalian genes (i.e., CpG islands – see section ).

   In eukaryotic organisms (those having nucleated cells), the chromosomes are contained within the nucleus of the cell, though their influence can extend quite far beyond the nuclear walls, and indeed, even beyond the boundaries of the organism (e.g., Dawkins, 1982). Different organisms have different numbers of chromosomes, with the fruitfly Drosophila melanogaster having only four pairs as compared to our 23. Each member of a pair constitutes a single haplotype, which in diploid organisms (those whose chromosomes normally occur in pairs) is normally inherited from one of the two parents. These pairs of chromosomes should not be confused with the two strands of DNA; each of the two haplotypes in a diploid genome consists of a set of double-stranded DNA chromosomes.

Figure 1.1 The eukaryotic cell. Genetic information is encoded in the DNA making up the chromosomes in the nucleus. (Courtesy: National Human Genome Research Institute.)

   Figure 1.2 gives a closer look at the structure of the DNA molecule. DNA, or deoxyribonucleic acid, consists of a series of paired nucleotides, or bases, joined at their margins by a sugar–phosphate backbone. The four nucleotides occurring in DNA are adenine (denoted chemically as C5H5N5), thymine (C5H6N2O2), cytosine (C4H5N3O), and guanine (C5H5N5O). Adenine and guanine are known as purines, while cytosine and thymine are called pyrimidines. An important variant of DNA – ribonucleic acid, or RNA – substitutes uracil (C4H4N2O2) for thymine. The pairing of nucleotides in DNA and RNA follows a strict rule, in which adenine and thymine (or uracil) may bind to one another, and cytosine and guanine may bind to one another, but all other pairings are strictly avoided and occur only in extremely rare circumstances. This pairing is known as Watson–Crick complementarity, in honor of the two men who first discovered the structure of DNA and who correctly hypothesized its great importance to genetics (Watson and Crick, 1953). Abbreviating adenine to A, cytosine to C, guanine to G, and thymine to T, we can represent the Watson–Crick complementarity rule as: A–T, C–G (or in the case of RNA, A–U, C–G). Figure 1.3 shows the chemical structure of the individual nucleotides.

Figure 1.2 Chemical structure of DNA. Each half of the double helix is composed of a sequence of nucleotides linked by a sugar–phosphate backbone. (Courtesy: National Human Genome Research Institute.)

Figure 1.3 Chemical structure of the nucleotides making up DNA and RNA. The thymine in DNA is replaced by uracil in RNA. (Courtesy: National Human Genome Research Institute.)

   It is important to note that the nucleotides are not symmetric, so that they have a preferred orientation along the sugar–phosphate backbone. In particular, the 5′-phosphate group of a nucleotide orients toward the “upstream” direction of the strand, while the 3′-hydroxyl group orients toward the “downstream” direction, where upstream and downstream refer to the direction in which the DNA replicates during cell division (i.e., upstream-to-downstream). Thus, we say that DNA and RNA strands are synthesized in the 5′-to-3′ direction. Furthermore, the orientations of the two strands making up the double helix run in opposite directions (i.e., they are antiparallel), so that the 5′-to-3′ direction of one strand corresponds to the 3′-to-5′ direction of the opposite strand. As a convention, whenever describing features of a DNA or RNA sequence, one generally assumes that the sequence under consideration is given in the 5′-to-3′ direction along the denoted molecule, and this sequence is always referred to as the forward strand (or sense strand). The other strand is obviously referred to as the reverse strand (or antisense strand).

   The bonds which join complementary bases in a DNA or RNA molecule are hydrogen (H) bonds. C–G has three H-bonds, A–T has two H-bonds, and an extremely rare non-Watson–Crick base-pairing has only one H-bond. For this reason, when DNA is heated, the A–T pairings will disintegrate (or denature) at lower temperatures than will the C–G pairings, since more energy is required in order to break all of the bonds in a C–G pairing. It should be noted that the bonds joining complementary bases across the two strands of a DNA molecule are chemically different from those that join successive nucleotides along a DNA strand. The latter are termed phosphodiester bonds, and are denoted, e.g., CpG, for nucleotides C and G and phosphodiester bond p. When referring to arbitrary DNA sequences, we will generally omit the p and give only the sequence of nucleotides along one of the two DNA strands (most often the forward strand): e.g., ACTAGCTAGCTCTTGATCG for DNA, or ACUAGCUAGCUCUUGAUCG for the corresponding RNA sequence. Note, however, that we will rarely give explicit RNA sequences in this book, opting instead to give the corresponding DNA sequence whenever discussing RNA, for notational simplicity. Hence, we will typically substitute T for U in what would otherwise be RNA sequences, and will therefore confine our discussions to sequences of letters drawn from the set {A,C,G,T}.
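The bond counts and the T-for-U convention described above can be illustrated with a short Python sketch. The function names (`count_h_bonds`, `dna_to_rna`, `rna_to_dna`) are hypothetical and chosen only for this illustration, and the sketch assumes a fully paired duplex containing only Watson–Crick pairs.

```python
# Hydrogen bonds per Watson-Crick pair, as stated in the text:
# a C-G pair has three H-bonds, an A-T pair has two.
H_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def count_h_bonds(strand: str) -> int:
    """Total H-bonds in a fully paired duplex, given one of its strands."""
    return sum(H_BONDS[base] for base in strand.upper())


def dna_to_rna(seq: str) -> str:
    """Write a DNA sequence in the RNA alphabet (T becomes U)."""
    return seq.upper().replace("T", "U")


def rna_to_dna(seq: str) -> str:
    """Write an RNA sequence in the DNA alphabet (U becomes T)."""
    return seq.upper().replace("U", "T")


if __name__ == "__main__":
    dna = "ACTAGCTAGCTCTTGATCG"   # example sequence from the text
    print(dna_to_rna(dna))         # the corresponding RNA spelling
    print(count_h_bonds(dna))      # H-bonds holding the duplex together
```

A sequence richer in C and G yields a higher bond count for the same length, which is the reason given above for C–G pairings denaturing at higher temperatures.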

   Because of the strict Watson–Crick base-pairing observed in normal DNA and RNA, it is possible to infer the precise sequence of nucleotides along one DNA strand when given the sequence of the other strand. For example, given the sequence ATCTAGGCA, the reverse complement sequence (i.e., the complementary bases, in reverse order) making up the opposite strand can be confidently deduced (except in extremely rare circumstances) to be TGCCTAGAT, where both of these sequences are given in the 5′-to-3′ order for the respective strands (remembering that 5′-to-3′ for one strand is 3′-to-5′ for the other – hence our convention of describing the opposite strand using the reverse complement, rather than just the complement). Thus, all the information encoded in a complete DNA molecule is present in one strand of that molecule (with a few exceptions, which we will not consider), and as a general practice, we will consider only one strand or the other, whichever suits our present needs. We shall at this juncture make only a brief remark on the profundity of this equating of sequence to information content. Although we will delay to subsequent chapters a precise definition of sequence information content, the astute reader will perceive that the very potential for DNA to assume arbitrary sequences of As, Cs, Gs, and Ts has important and far-reaching implications for the robustness of the genetic code and of the evolutionary process and its associated transfer of genetic information between generations, even over millions of years of biological evolution.
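The reverse-complement operation just described can be written in a few lines of Python. This is a minimal sketch, assuming the input contains only the letters A, C, G, and T; the name `reverse_complement` is chosen for illustration and is not taken from the book.

```python
# Watson-Crick complements for the DNA alphabet: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5'-to-3'.

    Each base is replaced by its complement, and the result is reversed
    because the two strands of the double helix are antiparallel.
    """
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # The example from the text: ATCTAGGCA pairs with TGCCTAGAT.
    print(reverse_complement("ATCTAGGCA"))
```

Applying the function twice returns the original sequence, which reflects the point made above that either strand alone carries the full sequence information.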

Figure 1.4 The central dogma of molecular biology. DNA is transcribed via RNA polymerase into messenger RNA; RNA is translated via a ribosome complex into a polypeptide; finally, the polypeptide is folded into a completed protein, one of the fundamental building blocks of living organisms.

Table 1.1 The genetic code: beside each amino acid are listed the codons that are translated into that acid

   It is the very existence of a discrete genetic code which allows us to treat in so straightforward a fashion the relations between genes and their protein products using rigorous mathematical techniques. We now come to the central dogma of molecular biology (Figure 1.4), which stipulates that DNA sequences give rise to messenger RNAs (mRNAs), which are then translated into the polypeptides (chains of amino acids) that fold into functional proteins. These protein products perform most of the vital functions of the cell, and therefore of the organism; hence the great importance attached to this particular molecular pathway. Indeed, much of modern biomedical research is founded on the hope that a better understanding of the genes (both singly and in combination) will enable more successful intervention in the development of disease states, through the reshaping or altered regulation of relevant proteins.

   Let us now consider in greater detail the processes involved in gene expression as illustrated in Figure 1.4. The first step involves the transcription of DNA into RNA. It is a curious fact that much of eukaryotic DNA is non-genic in many organisms; in all the vertebrate genomes sequenced to date, long stretches of DNA can be readily found which do not appear to stimulate the production of proteins or other biologically active molecules in the cell. Punctuating these noncoding regions (or intergenic regions) are the actual genes, or loci (singular: locus), which individually encode the functional units of hereditary information, and which in many species constitute only a small percentage of the organism’s genome. Because a particular locus may take slightly different forms in the different individuals of a species (accounting for, e.g., some individuals having blue eyes and others having green or brown eyes), we may differentiate between these forms – called alleles – and their different effects on the biology of a particular organism. The differential effects and expression patterns of multi-allelic loci are of primary concern in the field of quantitative genetics (e.g., Falconer, 1996), in which single-nucleotide polymorphisms (SNPs) are often used as surrogates for the full complement of allelic variations in a population. For the purposes of computational gene prediction, however, these issues of individual variation are generally ignored in practice. Instead, given a nucleotide sequence along one DNA strand, the task is to identify the loci that are present in the DNA, and to delimit precisely the boundaries separating those loci from the noncoding regions which surround them, as well as the internal structure of each locus as defined by the splicing process (see below). 
It should also be noted that not all transcribed loci in a genome are protein-coding genes – that is, not all transcribed genes give rise to mRNAs which are translated into functional proteins. We will consider noncoding genes very briefly in Chapter 12.

   The process of transcription is carried out by an RNA polymerase molecule, which scans one DNA strand in the 5′-to-3′ direction, pairing off each DNA nucleotide (A,C,G,T) on the antisense strand with a complementary RNA nucleotide (U,G,C,A). In this way, the DNA of a gene acts as a template for the formation of an RNA sequence from the free ribonucleotides (RNA nucleotides) which are present individually in the nucleus. The RNA nucleotides are joined together with phosphodiester bonds to produce an RNA molecule known as a messenger RNA – abbreviated mRNA – which will later migrate out of the nucleus into the cytoplasm of the cell, where it then acts as a template for protein synthesis. First, however, the emergent mRNA – known as a pre-mRNA, or a transcript – is generally spliced by a molecule known as the spliceosome, which removes intervals of RNA known as introns from the sequence. This process is illustrated in Figure 1.5.
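The template-based pairing described above can be sketched in Python as follows. The sketch assumes that the antisense (template) strand is supplied in the 5′-to-3′ orientation, following the convention stated earlier, and the function name `transcribe_template` is hypothetical. It models only the base-pairing rule, not the biochemistry of the polymerase.

```python
# RNA nucleotide paired with each DNA nucleotide on the template strand,
# matching the (A,C,G,T) -> (U,G,C,A) correspondence given in the text.
RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe_template(template_5to3: str) -> str:
    """Return the RNA (written 5'-to-3') templated by an antisense strand.

    The template is assumed to be given 5'-to-3'. Because the RNA is
    antiparallel to its template, the paired bases are reversed before
    being returned.
    """
    paired = [RNA_COMPLEMENT[base] for base in template_5to3.upper()]
    return "".join(reversed(paired))


if __name__ == "__main__":
    sense = "ATCTAGGCA"
    antisense = "TGCCTAGAT"   # reverse complement of the sense strand
    rna = transcribe_template(antisense)
    # The transcript spells the sense strand with U in place of T.
    print(rna, rna == sense.replace("T", "U"))
```

This is why a transcript can be described using the sense-strand DNA sequence with T written for U, as is done throughout the text.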

   Each intron begins with a donor site (typically GT) and ends with an acceptor site (typically AG). The entire intron, including the donor and acceptor splice sites, is excised from the mRNA by the spliceosome via a two-step process, in which the donor site is first cleaved from the preceding ribonucleotide and brought into association with a region upstream from the acceptor site known as the branch point, to produce a loop-like structure known as a lariat, and then in the second step, the acceptor site is cleaved from the following ribonucleotide, thereby completing the separation of the intron from the mRNA. The excised intron is discarded, to be degraded by enzymes in the cell back into individual ribonucleotides for use in future transcription events, and the remaining portions of the mRNA are ligated (i.e., joined together by newly created chemical bonds – in this case, phosphodiester bonds) so as to close the gap created by the excised intron. The regions separated by introns are known as exons; these are the portions of the transcript that remain after all splicing of a particular mRNA has been completed, and which generally influence the resulting structure of the encoded protein.

Figure 1.5 Splicing of a two-exon pre-mRNA to produce a mature mRNA. Splicing removes the introns and ligates the remaining exons together to close the gap where the intron used to reside.
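The effect of splicing on a sequence can be sketched as simple string surgery. In the Python example below, the intron coordinates are assumed to be known in advance and are given as 0-based, half-open (start, end) intervals; predicting those coordinates is the gene-finding problem itself and is not attempted here. The function name `splice`, the `require_consensus` flag, and the toy sequences are illustrative assumptions, and the sequence is written in the DNA alphabet (T for U) as in the rest of the text.

```python
def splice(transcript, introns, require_consensus=True):
    """Remove introns from a pre-mRNA and join the remaining exons.

    introns: list of (start, end) pairs, 0-based and half-open, each
    covering a whole intron including its donor and acceptor sites.
    With require_consensus=True, every intron must begin with the
    typical donor site GT and end with the typical acceptor site AG.
    """
    transcript = transcript.upper()
    exons = []
    pos = 0  # first base not yet assigned to an exon or an intron
    for start, end in sorted(introns):
        if start < pos or end > len(transcript) or end - start < 4:
            raise ValueError(f"invalid intron interval ({start}, {end})")
        intron = transcript[start:end]
        if require_consensus and not (
            intron.startswith("GT") and intron.endswith("AG")
        ):
            raise ValueError(f"intron at {start} lacks GT...AG boundaries")
        exons.append(transcript[pos:start])  # exon preceding this intron
        pos = end
    exons.append(transcript[pos:])           # final exon
    return "".join(exons)


if __name__ == "__main__":
    exon1, intron, exon2 = "ATGGCC", "GTAAGTCTAACTTTCAG", "GATTGA"
    pre_mrna = exon1 + intron + exon2
    coords = [(len(exon1), len(exon1) + len(intron))]
    print(splice(pre_mrna, coords))  # exon1 joined directly to exon2
```

Passing `require_consensus=False` allows introns with atypical boundaries, since the text describes GT and AG only as the typical donor and acceptor sites.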

   Although every protein-coding gene contains at least one exon, introns are not always present. In prokaryotes (organisms whose cells lack a nucleus), introns do not occur, so that the identification of functional genes is synonymous with the identification of coding exons. Although we will largely limit our discussion to eukaryotes, it should be noted that even in the latter organisms, genes lacking introns can sometimes be found (though what may appear to be an intronless gene may often be a retrotransposed pseudogene – a mature mRNA which has been reverse-transcribed back into the chromosome in a random location and subsequently rendered nonfunctional through the accumulation of mutations; see Lewin, 2003). Another exceptional case in the biology of introns occurs in the case of ribozymes – introns that autocatalytically splice themselves out of a transcript – of which the interested reader may learn more elsewhere (e.g., Doudna and Cech, 2002).

Figure 1.6 Protein synthesis. RNA is translated by a ribosome complex into a polypeptide chain of amino acids. The polypeptide will then fold into a protein structure. (Courtesy: National Human Genome Research Institute.)

   The ends of the spliced transcript are additionally processed while still within the nucleus, as follows. The 5′ end of the mRNA is capped by the addition and methylation of a guanosine, to protect the mRNA from being destroyed by the cell’s RNA degradation processes (which are carried out by a molecule called an exonuclease). At the 3′ end of the mRNA a poly-A tail (a long string of adenine residues) is appended after the transcript is cleaved at a point roughly 20–40 nucleotides downstream from a special polyadenylation signal (typically ATTAAA or AATAAA). The poly-A tail likewise protects the 3′ end of the mRNA from exonuclease activity, and also aids in the export of the mRNA out of the nucleus.
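As a small illustration of how such a signal can be located in a sequence, the following Python sketch scans for the two typical polyadenylation hexamers named above. It performs exact string matching only; the function name `find_polya_signals` is hypothetical, and practical gene finders generally use probabilistic signal models rather than exact matches.

```python
# The two typical polyadenylation signals named in the text.
POLYA_SIGNALS = ("AATAAA", "ATTAAA")


def find_polya_signals(seq):
    """Return (position, signal) pairs for every exact signal occurrence.

    Positions are 0-based offsets of the first base of the hexamer.
    Every offset is examined, so overlapping occurrences are reported.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        hexamer = seq[i:i + 6]
        if hexamer in POLYA_SIGNALS:
            hits.append((i, hexamer))
    return hits


if __name__ == "__main__":
    print(find_polya_signals("CCGAATAAAGGTTATTAAACC"))
```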

   Once the splicing, 5′ capping, and polyadenylation processes are complete, the processed transcript – now known as a mature mRNA – can be exported from the nucleus in preparation for protein synthesis.

   Figure 1.6 illustrates the process of translation, whereby an mRNA is translated into a polypeptide. At the bottom of the figure can be seen the ribosome, a molecular complex which scans along the mRNA, much as the RNA polymerase scans the DNA during transcription. Whereas in transcription a series of free RNA nucleotides are polymerized into a contiguous RNA molecule, the ribosome instead attracts molecules known as transfer RNAs (tRNAs) which aid in the formation of the polypeptide product. Each tRNA possesses at one end an amino acid, and at the other end an anti-codon which will bind only to a particular combination of three nucleotides. Such a sequence of three nucleotides is called a codon. Although there are 64 possible nucleotide triplets (4 × 4 × 4 = 64), there are only 20 amino acids (A = alanine, R = arginine, N = asparagine, D = aspartic acid, C = cysteine, Q = glutamine, E = glutamic acid, G = glycine, H = histidine, I = isoleucine, L = leucine, K = lysine, F = phenylalanine, P = proline, S = serine, T = threonine, W = tryptophan, Y = tyrosine, V = valine, and lastly M = methionine which is encoded by the start codon – see below), so that the mapping from codons to amino acids is degenerate, in the sense that two codons may map to the same amino acid (though one codon always maps unambiguously to a single amino acid; see Table 1.1). Furthermore, the frequencies of the various codons within protein-coding genes tend to be significantly nonuniform; this codon bias (i.e., profile of codon usage statistics) is an essential component of nearly all gene-finding programs in use today.
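The codon-to-amino-acid mapping and the notion of codon usage can be made concrete with a short Python sketch. The table below is the standard genetic code written in a compact form, with `*` marking the three stop codons (TAA, TAG, and TGA), which encode no amino acid. The names `translate` and `codon_usage` are illustrative, and sequences are written in the DNA alphabet as elsewhere in the text. The sketch assumes the input begins in frame and makes no attempt to locate a start codon.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order at
# each of the three positions (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons beginning with T
    "LLLLPPPPHHQQRRRR"   # codons beginning with C
    "IIIMTTTTNNKKSSRR"   # codons beginning with A
    "VVVVAAAADDEEGGGG"   # codons beginning with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons mapping to each amino acid (the degeneracy of the code).
DEGENERACY = Counter(CODON_TABLE.values())


def translate(cds):
    """Translate an in-frame coding sequence, stopping at a stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def codon_usage(cds):
    """Relative frequency of each codon observed in an in-frame sequence."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


if __name__ == "__main__":
    print(len(CODON_TABLE))           # 64 possible triplets
    print(translate("ATGGCCGATTGA"))  # Met-Ala-Asp, then a stop codon
    print(DEGENERACY["L"], DEGENERACY["M"])
    print(codon_usage("ATGGCCGCCTGA"))
```

Codon usage tabulated in this way over a set of known genes gives the kind of nonuniform codon profile that the text describes as codon bias.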



© Cambridge University Press

Table of Contents

Foreword by Steven Salzberg; 1. Introduction; 2. Mathematical preliminaries; 3. Overview of gene prediction; 4. Gene finder evaluation; 5. A toy exon finder; 6. Hidden Markov models; 7. Signal and content sensors; 8. Generalized hidden Markov models; 9. Comparative gene finding; 10. Machine learning methods; 11. Tips and tricks; 12. Advanced topics; Appendix - online resources; References; Index.