Proteome Informatics

Hardcover

$251.00 

Overview

The field of proteomics has developed rapidly over the past decade, creating the need for a detailed introduction to the various informatics topics that underpin the main liquid chromatography tandem mass spectrometry (LC-MS/MS) protocols used for protein identification and quantitation. Proteins are a key component of any biological system, and monitoring proteins using LC-MS/MS proteomics is becoming commonplace in a wide range of biological research areas. However, many researchers treat proteomics software tools as a black box, drawing conclusions from the output of such tools without considering the nuances and limitations of the algorithms on which the software is based. This book seeks to address this situation by bringing together world experts to provide clear explanations of the key algorithms, workflows and analysis frameworks, so that users of proteomics data can be confident that they are using appropriate tools in suitable ways.

Product Details

ISBN-13: 9781782624288
Publisher: RSC
Publication date: 11/23/2016
Series: #5
Pages: 412
Product dimensions: 6.14(w) x 9.21(h) x (d)

About the Author

Conrad Bessant is Professor of Bioinformatics at Queen Mary University of London. He has particular interests in proteomics, software development and machine learning and is striving to ensure that everyone using proteomics data can access the latest analysis methods and knows how to use them in the most effective way.

Read an Excerpt

Proteome Informatics


By Conrad Bessant

The Royal Society of Chemistry

Copyright © 2017 The Royal Society of Chemistry
All rights reserved.
ISBN: 978-1-78262-673-2



CHAPTER 1

Introduction to Proteome Informatics

CONRAD BESSANT


1.1 Introduction

In an era of biology dominated by genomics, and next generation sequencing (NGS) in particular, it is easy to forget that proteins are the real workhorses of biology. Among other tasks, proteins give organisms their structure, they transport molecules, and they take care of cell signalling. Proteins are even responsible for creating proteins when and where they are needed and disassembling them when they are no longer required. Monitoring proteins is therefore essential to understanding any biological system, and proteomics is the discipline tasked with achieving this.

Since the ground-breaking development of soft ionisation technologies by Masamichi Yamashita and John Fenn in 1984, liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS, introduced in the next section) has emerged as the most effective method for high throughput identification and quantification of proteins in complex biological mixtures. Recent years have seen a succession of new and improved instruments bringing higher throughput, accuracy and sensitivity. Alongside these instrumental improvements, researchers have developed an extensive range of protocols which optimally utilise the available instrumentation to answer a wide range of biological questions. Some protocols are concerned only with protein identification, whereas others seek to quantify the proteins as well. Depending on the particular biological study, one protocol may be selected because it provides the widest possible coverage of proteins present in a sample, whereas another may be selected to target individual proteins of interest. Protocols have also been developed for specific applications, for example to study post-translational modifications of proteins, to localise proteins to particular subcellular locations, and to study particular classes of protein.

A common feature of all LC-MS/MS-based proteomics protocols is that they generate a large quantity of data. At the time of writing, a raw data file from a single LC-MS/MS run on a modern instrument is over a gigabyte (GB) in size, containing thousands of individual high resolution mass spectra. Because of their complexity, biological samples are often fractionated prior to analysis and ten individual LC-MS/MS runs per sample is not unusual, so a single sample can yield 10–20 GB of data. Given that most proteomics studies are intended to answer questions about protein dynamics, e.g. differences in protein expression between populations or at different time points, an experiment is likely to include many individual samples. Technical and biological replicates are always recommended, at least doubling the number of runs and volume of data collected. Hundreds of gigabytes of data per experiment is therefore not unusual.

Such data volumes are impossible to interpret without computational assistance. The volume of data per experiment is actually relatively modest compared to other fields, such as next generation sequencing or particle physics, but proteomics poses some very specific challenges due to the complexity of the samples involved, the many different proteins that exist, and the particularities of mass spectrometry. The path from spectral peaks to confident protein identification and quantitation is complex, and must be optimised according to the particular laboratory protocol used and the specific biological question being asked. As laboratory proteomics continues to evolve, so do the computational methods that go with it. It is a fast-moving field, which has grown into a discipline in its own right. Proteome informatics is the term we have given this discipline for this book, but many alternative terms are in use. The aim of the book is to provide a snapshot of current thinking in the field, and to impart the fundamental knowledge needed to use, assess and develop the proteomics algorithms and software that are now essential in biological research.

Proteomics is a truly interdisciplinary endeavour. Biological knowledge is required to appreciate the motivations of proteomics, understand the research questions being asked, and interpret results. Analytical science expertise is essential – despite instrument vendors' best efforts at making instruments reliable and easy to use, highly skilled analysts are needed to operate such instruments and develop the protocols needed for a given study. At least a basic knowledge of chemistry, biochemistry and physics is required to understand the series of processes that happen between a sample being delivered to a proteomics lab and data being produced. Finally, specialised computational expertise is needed to handle the acquired data, and it is this expertise that this book seeks to impart. These computational skills span a wide range of specialities: algorithm design for peptide identification (Chapters 2 and 3); statistics to score and validate identifications (Chapter 4), infer the presence of proteins (Chapter 5) and perform downstream analysis (Chapter 14); signal processing to quantify proteins from acquired mass spectrometry peaks (Chapters 7 and 8); and the software skills needed to devise and utilise data standards (Chapter 11) and analysis frameworks (Chapters 12–14), and to integrate proteomics data with NGS data (Chapters 15 and 16).


1.2 Principles of LC-MS/MS Proteomics

The wide range of disciplines that overlap with proteome informatics draws in a great diversity of people including biologists, biochemists, computer scientists, physicists, statisticians, mathematicians and analytical chemists. This poses a challenge when writing a book on the subject as a core set of prior knowledge cannot be assumed. To mitigate this, this section provides a brief overview of the main concepts underlying proteomics, from a data-centric perspective, together with citations to sources of further detail.


1.2.1 Protein Fundamentals

A protein is a relatively large (median molecular weight around 40 000 Daltons) molecule that has evolved to perform a specific role within a biological organism. The role of a protein is determined by its chemical composition and 3D structure. In 1949 Frederick Sanger provided conclusive proof that proteins consist of a polymer chain of amino acids (the 20 amino acids that occur naturally in proteins are listed in Table 1.1). Proteins are synthesised within cells by assembling amino acids in a sequence dictated by a gene – a specific region of DNA within the organism's genome. As it is produced, physical interactions between the amino acids cause the string of amino acids to fold up into the 3D structure of the finished protein. Because the folding process is deterministic (albeit difficult to model), it is convenient to assume a one-to-one relationship between amino acid sequence and structure, so a protein is often represented by the sequence of letters corresponding to its amino acid sequence. These letters are said to represent residues, rather than amino acids, as two hydrogens and an oxygen are lost from each amino acid when it is incorporated into a protein, so the letters cannot strictly be said to represent amino acid molecules.
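To make the residue concept concrete, here is a minimal Python sketch (an illustration, not from the book) that computes a peptide's monoisotopic mass as the sum of standard residue masses plus one water, accounting for the two hydrogens and one oxygen retained at the chain's termini:

# Monoisotopic residue masses in daltons (amino acid minus H2O),
# standard values tabulated in mass spectrometry references.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # one H2O per intact peptide (N- and C-termini)

def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of a peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

print(round(peptide_mass("PEPTIDE"), 4))  # ≈ 799.36 Da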

Organisms typically have thousands of genes, e.g. around 20 000 in humans. The human body is therefore capable of producing over 20 000 distinct proteins, which illustrates one of the major challenges for proteomics – the large number of distinct proteins that may be present in a given sample (referred to as the search space when seeking to identify proteins). The situation is further complicated by alternative splicing, where different combinations of segments of a gene are used to create different versions of the protein sequence, called protein isoforms. Because of alternative splicing, each human gene produces on average around five distinct protein isoforms, so the search space expands to ~100 000 distinct proteins. If we are working with samples from a population of different individuals, the search space expands still further as some individual genome variations will translate into variations in protein sequence, some of which have transformative effects on protein structure and function.

However, the situation is yet more complex because, after synthesis, a protein may be modified by covalent addition (and possibly later removal) of a chemical entity at one or more amino acids within the protein sequence. Phosphorylation is a very common example, known to be important in regulating the activity of many proteins. Phosphorylation involves the addition of a phosphoryl group, typically (but not exclusively) to an S, T or Y. Such post-translational modifications (PTMs) change the mass of proteins, and often their function. Because each protein contains many sites at which PTMs may occur, there is a large number of distinct combinations of PTMs that may be seen on a given protein. This increases the search space massively, and it is not an exaggeration to state that the number of distinct proteins that could be produced by a human cell exceeds one million. We will never find a million proteins in a single cell – a few thousand is more typical – but the fact that these few thousand must be identified from a potential list of over a million represents one of the biggest challenges in proteomics.
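The scale of this combinatorial growth is easy to illustrate: considering phosphorylation alone, a sequence with n candidate S, T or Y sites has 2**n possible phospho-forms. A deliberately simplified, hypothetical snippet (the sequence is an arbitrary example and all other PTM types are ignored):

# Each S, T or Y site can be phosphorylated or not, giving 2**n
# possible phospho-forms for n candidate sites (all other PTM types
# ignored in this simplified illustration).
def phospho_states(sequence: str) -> int:
    n_sites = sum(sequence.count(aa) for aa in "STY")
    return 2 ** n_sites

print(phospho_states("MSTEYKLVVVGAGGVGKSALTIQ"))  # 5 sites -> 32 forms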


1.2.2 Shotgun Proteomics

The obvious way to identify proteins from a complex sample would be to separate them from each other, then analyse each protein one by one to determine what it is. Although conceptually simple, practical challenges of this so-called top-down method have led the majority of labs to adopt the alternative bottom-up methodology, often called shotgun proteomics. This book therefore deals almost exclusively with the analysis of data acquired using this methodology, which is shown schematically in Figure 1.1.

In shotgun proteomics, proteins are broken down into peptides – amino acid chains that are much shorter than the average protein. These peptides are then separated, identified and used to infer which proteins were in the sample. The cleavage of proteins to peptides is achieved using a proteolytic enzyme which is known to cleave the protein into peptides at specific points. Trypsin, a popular choice for this task, generally cuts proteins after K and R, unless these residues are followed by P. The majority of the peptides produced by trypsin have a length of between 4 and 26 amino acids, equivalent to a mass range of approximately 450–3000 Da, which is well suited to analysis by mass spectrometry. Given the sequence of a protein, it is computationally trivial to determine the set of peptides that will be produced by tryptic digestion. However, digestion is not always 100% efficient so any data analysis must also consider longer peptides that result from one or more missed cleavage sites.
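Since in silico digestion is computationally trivial, the following Python sketch implements the tryptic rule just described (cleave after K or R unless the next residue is P), with optional missed cleavages; the function name and example sequence are illustrative assumptions, not drawn from the book:

import re

def tryptic_peptides(protein: str, missed_cleavages: int = 0) -> list[str]:
    """In silico tryptic digest: cleave after K or R unless followed by P."""
    # Zero-width split after every K or R not immediately followed by P;
    # filter out the empty string produced if the protein ends in K or R.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    # Rejoin up to `missed_cleavages` consecutive fragments to model
    # incomplete digestion.
    for i in range(len(fragments)):
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides

print(tryptic_peptides("MKWVTFISLLLLFSSAYSRGVFRRDTHKPK"))
# ['MK', 'WVTFISLLLLFSSAYSR', 'GVFR', 'R', 'DTHKPK']
print(tryptic_peptides("MKWVTFISLLLLFSSAYSRGVFRRDTHKPK", missed_cleavages=1))

Note how DTHKPK stays intact in the fully cleaved output: the K before P is not a tryptic cleavage site.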


1.2.3 Separation of Peptides by Chromatography

Adding an enzyme such as trypsin to a complex mixture of proteins results in an even more complex mixture of peptides. The next step in shotgun proteomics is therefore to separate these peptides. To achieve high throughput this is typically performed using high performance liquid chromatography (HPLC). Explanations of HPLC can be found in analytical chemistry textbooks but, in simple terms, it works by dissolving the sample in a liquid, known as the mobile phase, and passing this under pressure through a column packed with a solid material called the solid phase. The solid phase is specifically selected such that it interacts with, and therefore retards, some compounds more than others based on their physical properties. This phenomenon is used to separate different compounds as they are retained in the column for different amounts of time (their individual retention time, RT) and therefore emerge from the column (elute) separately. In shotgun proteomics, the solid phase is usually chosen to separate peptides based on their hydrophobicity. Protocols vary, but a typical proteomics chromatography run takes 30–240 minutes depending on expected sample complexity and, after sample preparation, is the main rate-limiting step in most proteomic analyses.

While HPLC provides some form of peptide separation, the complexity of biological samples is such that many peptides co-elute, so further separation is needed. This is done in the subsequent mass spectrometry step, which also leads to peptide identification.


1.2.4 Mass Spectrometry

In the very simplest terms, mass spectrometry (MS) is a method for sorting molecules according to their mass. In shotgun proteomics, MS is used to separate co-eluting peptides after HPLC and to determine their mass. A detailed explanation of mass spectrometry is beyond the scope of this chapter. The basic principles can be found in analytical chemistry textbooks, and an in-depth introduction to peptide MS can be found in ref. 11, but a key detail is that a molecule must be carrying a charge if it is to be detected. Peptides in the liquid phase must be ionised and transferred to the gas phase prior to entering the mass spectrometer. The so-called soft ionisation methods of electrospray ionisation (ESI) and matrix assisted laser desorption–ionisation (MALDI) are popular for this because they bestow charge on peptides without fragmenting them. In these methods a positive charge is endowed by transferring one or more protons to the peptide, a process called protonation. If a single proton is added, the peptide becomes a singly charged (1+) ion, but higher charge states (typically 2+ or 3+) are also possible as more than one proton may be added. The mass of the ion correspondingly increases by one proton (~1.007 Da) for each charge added. Not every copy of every peptide gets ionised (this depends on the ionisation efficiency of the instrument) and it is worth noting that many peptides are very difficult to ionise, making them essentially undetectable in MS – this has a significant impact on how proteomics data are analysed, as we will see in later chapters.

The charge state is denoted by z (e.g. z = 2 for a doubly charged ion) and the mass of a peptide by m. Mass spectrometers measure the mass to charge ratio of ions, so always report m/z, from which mass can be calculated if z can be determined. In a typical shotgun proteomics analysis, the mass spectrometer is programmed to perform a survey scan – a sweep across its whole m/z range – at regular intervals as peptides elute from the chromatography column. This results in a mass spectrum consisting of a series of peaks representing peptides, whose horizontal position is indicative of their m/z (there are invariably additional peaks due to contaminants or other noise). This set of peaks is often referred to as an MS1 spectrum, and thousands are usually acquired during one HPLC run, each at a specific retention time.

The current generation of mass spectrometers, such as those based on orbitrap technology, can provide a mass accuracy better than 1 ppm so, for example, the mass of a singly charged peptide with m/z of 400 can be determined to an accuracy of 0.0004 Da. Determining the mass of a peptide with this accuracy provides a useful indication of its composition, but does not reveal its amino acid sequence because many different sequences can share the exact same mass.
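As a quick sketch of the arithmetic just described (the function names are illustrative): the neutral peptide mass M follows from an observed m/z and charge z via m/z = (M + z × 1.007276)/z, and a ppm accuracy converts to an absolute m/z window by simple proportion:

PROTON = 1.007276  # proton mass in daltons

def neutral_mass(mz: float, z: int) -> float:
    """Neutral peptide mass M from observed m/z and charge z:
    m/z = (M + z*PROTON) / z, hence M = z*(m/z) - z*PROTON."""
    return z * mz - z * PROTON

def mz_tolerance(mz: float, ppm: float = 1.0) -> float:
    """Absolute m/z window corresponding to a relative accuracy in ppm."""
    return mz * ppm / 1e6

print(round(neutral_mass(400.0, 1), 5))  # 398.99272 Da
print(mz_tolerance(400.0, 1.0))          # 0.0004, as quoted in the text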

To discover the sequence of a peptide we must break it apart and analyse the fragments generated. Typically, a data dependent acquisition (DDA) approach is used, where ions are selected in real time at each retention time by considering the MS1 spectrum, with the most abundant peptides (inferred from peak height) being passed to a collision chamber for fragmentation. Peptides are passed one at a time, providing a final step of separation, based on mass. A second stage of mass spectrometry is performed to produce a spectrum of the fragment ions (also called product ions) emerging from the peptide fragmentation – this is often called an MS2 spectrum (or MS/MS spectrum). Numerous methods have been developed to fragment peptides, including electron transfer dissociation (ETD) and collision induced dissociation (CID). The crucial feature of these methods is that they predominantly break the peptide along its backbone, rather than at random bonds. This phenomenon, shown graphically in Figure 1.2, produces fragment ions whose masses can be used to determine the peptide's sequence.
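To see why backbone cleavage is so informative, the sketch below (reusing the standard monoisotopic residue masses from the earlier example; an illustration, not the book's own code) computes the singly charged b- and y-ion ladders of a peptide. Consecutive ions within a ladder differ by exactly one residue mass, which is what allows the sequence to be read off an MS2 spectrum:

# Standard monoisotopic residue masses (Da), as in the earlier sketch.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.010565, 1.007276

def by_ladders(peptide: str) -> tuple[list[float], list[float]]:
    """Singly charged b- and y-ion m/z values for backbone cleavage:
    b_i = sum of first i residues + proton;
    y_i = sum of last i residues + water + proton."""
    b, y = [], []
    total = 0.0
    for aa in peptide[:-1]:           # b1 .. b(n-1)
        total += RESIDUE_MASS[aa]
        b.append(total + PROTON)
    total = WATER + PROTON
    for aa in reversed(peptide[1:]):  # y1 .. y(n-1)
        total += RESIDUE_MASS[aa]
        y.append(total)
    return b, y

b_ions, y_ions = by_ladders("PEPTIDE")
print([round(x, 3) for x in b_ions])  # e.g. b2 = 227.103 for "PE"
print([round(x, 3) for x in y_ions])  # e.g. y1 = 148.060 for "E"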

The DDA approach has two notable limitations: it is biased towards peptides of high abundance, and there is no guarantee that a given peptide will be selected in different runs, making it difficult to combine data from multiple samples into a single dataset. Despite this, DDA remains popular at the time of writing, but two alternative methods are gaining ground. Selected reaction monitoring (SRM) aims to overcome DDA's limitations by a priori selection of peptides to monitor (see Chapter 9) at the expense of breadth of coverage, whereas data independent acquisition (DIA) simply aims to fragment every peptide (see Chapter 10).


(Continues...)

Excerpted from Proteome Informatics by Conrad Bessant. Copyright © 2017 The Royal Society of Chemistry. Excerpted by permission of The Royal Society of Chemistry.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.

Table of Contents

Introduction to Proteome Informatics; De Novo Sequencing; Peptide-Spectrum Matching; PSM Scoring and Validation; Protein Grouping; Identification and Localisation of Post Translational Modifications; Algorithms for MS1-Based Quantitation; Algorithms for MS2-Based Quantitation; Informatics Solutions for Selected Reaction Monitoring; Data Analysis for Data Independent Acquisition; Mining Proteomics Repositories; Data Formats of the Proteomics Standards Initiative; OpenMS; Using Galaxy for Proteomics; R for Proteomics; Proteogenomics: Proteomics for Genome Annotation; Proteomics Informed by Transcriptomics; Subject Index