The Cache Coherence Problem in Shared-Memory Multiprocessors: Software Solutions / Edition 1

ISBN-10: 0818670967
ISBN-13: 9780818670961
Pub. Date: 02/13/1996
Publisher: Wiley
Paperback

$93.95

Overview

Almost all software solutions have been developed through academic research and implemented only in prototype machines, leaving the field of software techniques for maintaining cache coherence wide open for future research and development. This book collects representative approaches to software coherence maintenance, together with a number of related efforts in the performance evaluation field.

The book presents a selection of 27 papers dealing with state-of-the-art software solutions for cache coherence maintenance in shared-memory multiprocessors. It begins with a set of four introductory readings that provide a brief overview of the cache coherence problem and introduce software solutions to it. The text then defines and illustrates static and dynamic software schemes, techniques for modeling and performance evaluation of coherence mechanisms, and performance evaluation studies.

The book is intended for readers experienced in computer engineering who may be new to the topic of cache coherence. It provides both an in-depth treatment of the problem and a comprehensive overview for multicomputer designers, computer architects, and compiler writers. It can also serve as a software coherence reference handbook for advanced undergraduate and graduate students in the multiprocessing and multiprogramming areas.

Product Details

ISBN-13: 9780818670961
Publisher: Wiley
Publication date: 02/13/1996
Series: Systems, #10
Pages: 358
Product dimensions: 8.15(w) x 11.00(h) x 0.80(d)

About the Author

Igor Tartalja is currently with the Department of Computer Engineering, School of Electrical Engineering, University of Belgrade. He received the BSEE in 1984 and the MSEE in 1989, both from the School of Electrical Engineering, University of Belgrade, Belgrade, Serbia, Yugoslavia. He is in the final phase of his PhD thesis on dynamic software maintenance of cache coherence in shared-memory multiprocessors. From 1984 to 1989 he was with the Laboratory for Computer Engineering, Institute for Nuclear Sciences, Vinca, Serbia, Yugoslavia, working primarily on the development of a real-time computer for applications in biophysics and a distributed operating system for a special-purpose multicomputer. His current research interests include multiprocessor and multicomputer architectures, heterogeneous processing, and system software support for shared-memory multiprocessors and distributed systems.

Veljko Milutinovic has been with the Department of Computer Engineering, School of Electrical Engineering, University of Belgrade, Belgrade, Serbia, Yugoslavia, since 1990. From 1982 to 1990 he was on the faculty of the School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana. He has been active in the RISC field for the last decade, in both technology-related research (a 32-bit GaAs RISC for RCA) and application-related research (the multimedia-oriented RISC-based multiprocessor efforts of NCR). He has published 40 papers in IEEE journals and presented over 200 invited talks all over the world. He is a senior member of the IEEE.

Read an Excerpt

The Cache Coherence Problem in Shared-Memory Multiprocessors

Software Solutions
By Igor Tartalja and Veljko Milutinovic

John Wiley & Sons

ISBN: 0-8186-7096-7


Chapter One

How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs

Leslie Lamport

Abstract: Many large sequential computers execute operations in a different order than is specified by the program. A correct execution is achieved if the results produced are the same as would be produced by executing the program steps in order. For a multiprocessor computer, such a correct execution by each processor does not guarantee the correct execution of the entire program. Additional conditions are given which do guarantee that a computer correctly executes multiprocess programs.

Index Terms: Computer design, concurrent computing, hardware correctness, multiprocessing, parallel processing.

A high-speed processor may execute operations in a different order than is specified by the program. The correctness of the execution is guaranteed if the processor satisfies the following condition: the result of an execution is the same as if the operations had been executed in the order specified by the program. A processor satisfying this condition will be called sequential. Consider a computer composed of several such processors accessing a common memory. The customary approach to designing and proving the correctness of multiprocess algorithms for such a computer assumes that the following condition is satisfied: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. A multiprocessor satisfying this condition will be called sequentially consistent. The sequentiality of each individual processor does not guarantee that the multiprocessor computer is sequentially consistent. In this brief note, we describe a method of interconnecting sequential processors with memory modules that insures the sequential consistency of the resulting multiprocessor.

We assume that the computer consists of a collection of processors and memory modules, and that the processors communicate with one another only through the memory modules. (Any special communication registers may be regarded as separate memory modules.) The only processor operations that concern us are the operations of sending fetch and store requests to memory modules. We assume that each processor issues a sequence of such requests. (It must sometimes wait for requests to be executed, but that does not concern us.)

We illustrate the problem by considering a simple two-process mutual exclusion protocol. Each process contains a critical section, and the purpose of the protocol is to insure that only one process may be executing its critical section at any time. The protocol is as follows.

process 1

a := 1;

if b = 0 then critical section; a := 0 else *** fi

process 2

b := 1;

if a = 0 then critical section; b := 0 else *** fi

The else clauses contain some mechanism for guaranteeing eventual access to the critical section, but that is irrelevant to the discussion. It is easy to prove that this protocol guarantees mutually exclusive access to the critical sections. (Devising a proof provides a nice exercise in using the assertional techniques of [2] and [3], and is left to the reader.) Hence, when this two-process program is executed by a sequentially consistent multiprocessor computer, the two processors cannot both be executing their critical sections at the same time.

We first observe that a sequential processor could execute the "a := 1" and "fetch b" operations of process 1 in either order. (When process 1's program is considered by itself, it does not matter in which order these two operations are performed.) However, it is easy to see that executing the "fetch b" operation first can lead to an error: both processes could then execute their critical sections at the same time. This immediately suggests our first requirement for a multiprocessor computer.
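This observation can be checked mechanically. The sketch below (function and label names are ours, not from the paper) enumerates every interleaving of the two processes' memory operations that preserves each process's program order, and confirms that both fetches can never return 0; letting process 1 issue its "fetch b" before "a := 1" admits an execution in which both processes enter their critical sections.

```python
def interleavings(xs, ys):
    """All merges of xs and ys that preserve each list's internal order."""
    if not xs:
        yield list(ys)
        return
    if not ys:
        yield list(xs)
        return
    for rest in interleavings(xs[1:], ys):
        yield [xs[0]] + rest
    for rest in interleavings(xs, ys[1:]):
        yield [ys[0]] + rest

def both_enter_possible(p1_ops, p2_ops):
    """True if some order-preserving interleaving lets both fetches see 0."""
    for run in interleavings(p1_ops, p2_ops):
        mem = {"a": 0, "b": 0}
        seen = {}
        for op, var in run:
            if op == "store":
                mem[var] = 1
            else:               # process 1 fetches b, process 2 fetches a
                seen[var] = mem[var]
        if seen["b"] == 0 and seen["a"] == 0:
            return True
    return False

# Program order respected: the protocol excludes simultaneous entry.
print(both_enter_possible([("store", "a"), ("fetch", "b")],
                          [("store", "b"), ("fetch", "a")]))   # False

# Process 1 executes "fetch b" before "a := 1": the protocol fails.
print(both_enter_possible([("fetch", "b"), ("store", "a")],
                          [("store", "b"), ("fetch", "a")]))   # True
```

The exhaustive search is exactly the sequential-consistency argument: an execution is legal only if it is some interleaving of the per-processor instruction streams.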

Requirement R1: Each processor issues memory requests in the order specified by its program.

Satisfying Requirement R1 is complicated by the fact that storing a value is possible only after the value has been computed. A processor will often be ready to issue a memory fetch request before it knows the value to be stored by a preceding store request. To minimize waiting, the processor can issue the store request to the memory module without specifying the value to be stored. Of course, the store request cannot actually be executed by the memory module until it receives the value to be stored.

Requirement R1 is not sufficient to guarantee correct execution. To see this, suppose that each memory module has several ports, and each port services one processor (or I/O channel). Let the values of "a" and "b" be stored in separate memory modules, and consider the following sequence of events.

1) Processor 1 sends the "a := 1" request to its port in memory module 1. The module is currently busy executing an operation for some other processor (or I/O channel).

2) Processor 1 sends the "fetch b" request to its port in memory module 2. The module is free, and execution is begun.

3) Processor 2 sends its "b := 1" request to memory module 2. This request will be executed after processor 1's "fetch b" request is completed.

4) Processor 2 sends its "fetch a" request to its port in memory module 1. The module is still busy.

There are now two operations waiting to be performed by memory module 1. If processor 2's "fetch a" operation is performed first, then both processes can enter their critical sections at the same time, and the protocol fails. This could happen if the memory module uses a round robin scheduling discipline in servicing its ports.

In this situation, an error occurs only if the two requests to memory module 1 are not executed in the same order in which they were received. This suggests the following requirement.

Requirement R2: Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of entering the request on this queue.

Condition R1 implies that a processor may not issue any further memory requests until after its current request has been entered on the queue. Hence, it must wait if the queue is full. If two or more processors are trying to enter requests in the queue at the same time, then it does not matter in which order they are serviced.

Note. If a fetch requests the contents of a memory location for which there is already a write request on the queue, then the fetch need not be entered on the queue. It may simply return the value from the last such write request on the queue.
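The note's optimization is what is now commonly called store-to-load forwarding. A minimal sketch (the class and method names are ours, for illustration only): a fetch first scans the module's pending-store queue and returns the value of the last matching store, falling back to memory otherwise.

```python
from collections import deque

class MemoryModule:
    """One memory module with a FIFO queue of pending store requests."""

    def __init__(self):
        self.memory = {}
        self.queue = deque()        # pending (addr, value) store requests

    def store(self, addr, value):
        # Issuing a store means entering it on the queue (Requirement R2).
        self.queue.append((addr, value))

    def fetch(self, addr):
        # Scan pending stores newest-first; forward from the last match,
        # so the fetch need not be entered on the queue at all.
        for pending_addr, value in reversed(self.queue):
            if pending_addr == addr:
                return value
        return self.memory.get(addr, 0)

    def drain(self):
        # The module services queued stores in FIFO order.
        while self.queue:
            addr, value = self.queue.popleft()
            self.memory[addr] = value

m = MemoryModule()
m.store("a", 1)
m.store("a", 2)
print(m.fetch("a"))   # 2: forwarded from the last pending store for "a"
m.drain()
print(m.memory["a"])  # 2: memory agrees once the queue is serviced
```

Forwarding is safe precisely because the value returned is the one the fetch would have seen had it waited its turn behind the queued stores.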

Requirements R1 and R2 insure that if the individual processors are sequential, then the entire multiprocessor computer is sequentially consistent. To demonstrate this, one first introduces a relation → on memory requests as follows. Define A → B if and only if 1) A and B are issued by the same processor and A is issued before B, or 2) A and B are issued to the same memory module, and A is entered in the queue before B (and is thus executed before B). It is easy to see that R1 and R2 imply that → is a partial ordering on the set of memory requests. Using the sequentiality of each processor, one can then prove the following result: each fetch and store operation fetches or stores the same value as if all the operations were executed sequentially in any order such that A → B implies that A is executed before B. This in turn proves the sequential consistency of the multiprocessor computer.
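As a concrete illustration (the request labels and edges below are ours, built for the four requests of the mutual-exclusion example), the partial order can be represented as a dependency graph, and any topological order of that graph is one of the equivalent sequential executions the proof refers to:

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Map each memory request to the set of requests that must precede it.
# Program order (R1): each processor's store precedes its fetch.
# Queue order (R2): at module 1, "a := 1" is entered before "fetch a".
preceding = {
    "P1: a := 1":  set(),
    "P1: fetch b": {"P1: a := 1"},
    "P2: b := 1":  set(),
    "P2: fetch a": {"P2: b := 1", "P1: a := 1"},
}

# Any topological order of this partial order is a legal sequential
# execution of all four memory requests.
order = list(TopologicalSorter(preceding).static_order())
print(order)
```

Because "P1: a := 1" precedes "P2: fetch a" in every such order, process 2's fetch sees a = 1, which is why both processes cannot enter their critical sections in this execution.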

Requirement R2 states that a memory module's request queue must be serviced in a FIFO order. This implies that the memory module must remain idle if the request at the head of its queue is a store request for which the value to be stored has not yet been received. Condition R2 can be weakened to allow the memory module to service other requests in this situation. We need only require that all requests to the same memory cell be serviced in the order that they appear in the queue. Requests to different memory cells may be serviced out of order. Sequential consistency is preserved because such a service policy is logically equivalent to considering each memory cell to be a separate memory module with its own request queue. (The fact that these modules may share some hardware affects the rate at which they service requests and the capacity of their queues, but it does not affect the logical property of sequential consistency.)

The requirements needed to guarantee sequential consistency rule out some techniques which can be used to speed up individual sequential processors. For some applications, achieving sequential consistency may not be worth the price of slowing down the processors. In this case, one must be aware that conventional methods for designing multiprocess algorithms cannot be relied upon to produce correctly executing programs. Protocols for synchronizing the processors must be designed at the lowest level of the machine instruction code, and verifying their correctness becomes a monumental task.

Synchronization, Coherence, and Event Ordering in Multiprocessors

Michel Dubois and Christoph Scheurich Computer Research Institute, University of Southern California

Faye A. Briggs, Sun Microsystems

Multiprocessors, especially those constructed of relatively low-cost microprocessors, offer a cost-effective solution to the continually increasing need for computing power and speed. These systems can be designed either to maximize the throughput of many jobs or to speed up the execution of a single job; they are respectively called throughput-oriented and speedup-oriented multiprocessors. In the first type of system, jobs are distinct from each other and execute as if they were running on different uniprocessors. In the second type an application is partitioned into a set of cooperating processes, and these processes interact while executing concurrently on different processors. The partitioning of a job into cooperating processes is called multitasking or multithreading. In both systems global resources must be managed correctly and efficiently by the operating system. The problems addressed in this article apply to both throughput- and speedup-oriented multiprocessor systems, either at the user level or the operating-system level.

Multitasked multiprocessors are capable of efficiently executing the many cooperating numerical or nonnumerical tasks that comprise a large application. In general, the speedup provided by multitasking reduces the turnaround time of a job and therefore ultimately improves the user's productivity. For applications such as real-time processing, CAD/CAM, and simulations, multitasking is crucial because the multiprocessor structure improves the execution speed of a given algorithm within a time constraint that is ordinarily impossible to meet on a single processor employing available technology.

Designing and programming multiprocessor systems correctly and efficiently pose complex problems. Synchronizing processes, maintaining data coherence, and ordering events in a multiprocessor are issues that must be addressed from the hardware design level up to the programming language level. The goal of this article is not only to review these problems in some depth but also to show that in the design of multiprocessors these problems are intricately related. The definitions and concepts presented here provide a solid foundation on which to reason about the logical properties of a specific multiprocessor and to demonstrate that the hardware adheres to the logical model expected by the programmer. This foundation aids in understanding complex but useful architectures such as multiprocessors with private caches or with recombining interconnection networks (Figure 1). Other important issues, such as scheduling and partitioning, have been addressed in a previous survey article. Readers who are not familiar with the concept of cache memory should consult the survey by Smith.

Basic definitions

The instruction set of a multiprocessor usually contains basic instructions that are used to implement synchronization and communication between cooperating processes. These instructions are usually supported by special-purpose hardware. Some primary hardware functions are necessary to guarantee correct interprocess communication and synchronization, while other, secondary hardware functions simplify the design of parallel applications and operating systems. The notions of synchronization and communication are difficult to separate because communication primitives can be used to implement synchronization protocols, and vice versa. In general, communication refers to the exchange of data between different processes. Usually, one or several sender processes transmit data to one or several receiver processes. Interprocess communication is mostly the result of explicit directives in the program. For example, parameters passed to a coroutine and results returned by such a coroutine constitute interprocess communications. Synchronization is a special form of communication, in which the data are control information. Synchronization serves the dual purpose of enforcing the correct sequencing of processes and ensuring the mutually exclusive access to certain shared writable data. For example, synchronization primitives can be used to

(1) Control a producer process and a consumer process such that the consumer process never reads stale data and the producer process never overwrites data that have not yet been read by the consumer process.

(2) Protect the data in a database such that concurrent write accesses to the same record in the database are not allowed. (Such accesses can lead to the loss of one or more updates if two processes first read the data in sequence and then write the updated data back to memory in sequence.)

In shared-memory multiprocessor systems, communication and synchronization are usually implemented through the controlled sharing of data in memory.
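Point (1) above is the classic producer/consumer discipline. A minimal one-slot sketch of it (class and method names are ours), built on a condition variable over shared memory, exactly as the paragraph describes: the consumer never reads a value twice, and the producer never overwrites an unread value.

```python
import threading

class Slot:
    """A one-value channel between a producer and a consumer."""

    def __init__(self):
        self.cond = threading.Condition()
        self.value = None
        self.full = False

    def put(self, value):
        with self.cond:
            while self.full:            # never overwrite unread data
                self.cond.wait()
            self.value, self.full = value, True
            self.cond.notify_all()

    def get(self):
        with self.cond:
            while not self.full:        # never read absent/stale data
                self.cond.wait()
            self.full = False
            self.cond.notify_all()
            return self.value

slot = Slot()
out = []
consumer = threading.Thread(
    target=lambda: out.extend(slot.get() for _ in range(3)))
consumer.start()
for i in range(3):
    slot.put(i)
consumer.join()
print(out)   # [0, 1, 2]
```

The `while` loops (rather than `if`) re-check the predicate after every wakeup, which is what makes the synchronization correct even with spurious wakeups or more than one waiter.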

A second issue addressed in this article is memory coherence, a system's ability to execute memory operations correctly. Censier and Feautrier define a coherent memory scheme as follows: "A memory scheme is coherent if the value returned on a Load instruction is always the value given by the latest Store instruction with the same address." This definition has been useful in the design of cache coherence mechanisms. As it stands, however, the definition is difficult to interpret in the context of a multiprocessor, in which data accesses may be buffered and may not be atomic. Accesses are buffered if multiple accesses can be queued before reaching their destination, such as main memory or caches. An access by processor i on a variable X is atomic if no other processor is allowed to access any copy of X while the access by processor i is in progress. It has been shown that memory accesses need not be atomic at the hardware level for correct execution of concurrent programs. Correctness of execution depends on the expected behavior of the machine. Two major classes of logical machine behavior have been identified because they are common in existing multiprocessor systems: the strongly ordered and the weakly ordered models of behavior. The hardware of the machine must enforce these models by proper ordering of storage accesses and execution of synchronization and communication primitives. This leads to the third issue, the ordering of events.

(Continues...)



Excerpted from The Cache Coherence Problem in Shared-Memory Multiprocessors by Igor Tartalja and Veljko Milutinovic. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.

Table of Contents

Preface.

Introduction.

Chapter 1: Introductory Readings.

How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs (L. Lamport).

Synchronization, Coherence, and Event Ordering in Multiprocessors (M. Dubois, C. Scheurich, and F.A. Briggs).

Cache Coherence in Large-Scale Shared-Memory Multiprocessors: Issues and Comparisons (D. Lilja).

Software Cache Consistency in Shared-Memory Multiprocessors: A Survey of Approaches and Performance Evaluation Studies (I. Tartalja and V. Milutinovic).

Chapter 2: Static Software Cache Coherence Schemes.

Compiler-Directed Cache Management in Multiprocessors (H. Cheong and A.V. Veidenbaum).

RP3 Processor-Memory Element (W.C. Brantley, K.P. McAuliffe, and J. Weiss).

A Compiler-Assisted Cache Coherence Solution for Multiprocessors (A.V. Veidenbaum).

A Cache Coherence Scheme With Fast Selective Invalidation (H. Cheong and A.V. Veidenbaum).

Automatic Management of Programmable Caches (R. Cytron, S. Karlovsky, and K.P. McAuliffe).

A Version Control Approach to Cache Coherence (H. Cheong and A.V. Veidenbaum).

Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps (S.L. Min and J.-L. Baer).

A Generational Algorithm to Multiprocessor Cache Coherence (T.C. Chiueh).

Cache Coherence Using Local Knowledge (E. Darnell and K. Kennedy).

Chapter 3: Dynamic Software Cache Coherence Schemes.

Software-Controlled Caches in the VMP Multiprocessor (D.R. Cheriton, G.A. Slavenburg, and P.D. Boyle).

CPU Cache Consistency with Software Support and Using "One Time Identifiers" (A.J. Smith).

An Approach to Dynamic Software Cache Consistency Maintenance Based on Conditional Invalidation (I. Tartalja and V. Milutinovic).

Adaptive Software Cache Management for Distributed Shared Memory Architectures (J.K. Bennett, J.B. Carter, and W. Zwaenepoel).

Chapter 4: Techniques for Modeling and Performance Evaluation of Cache Memories and Cache Coherence Maintenance Mechanisms.

Analysis of Multiprocessors with Private Cache Memories (J.H. Patel).

Effectiveness of Private Caches in Multiprocessor Systems with Parallel-Pipelined Memories (F.A. Briggs and M. Dubois).

On the Validity of Trace-Driven Simulation for Multiprocessors (E.J. Koldinger, S.J. Eggers, and H.M. Levy).

Multiprocessor Cache Simulation Using Hardware Collected Address Traces (A.W. Wilson).

Cache Invalidation Patterns in Shared-Memory Multiprocessors (A. Gupta and W.-D. Weber).

Benchmark Characterization for Experimental System Evaluation (T.M. Conte and W.W. Hwu).

A Model of Workloads and Its Use in Miss-Rate Prediction for Fully Associative Caches (J.P. Singh, H.S. Stone, and D.F. Thiebaut).

Chapter 5: Performance Evaluation Studies of Software Coherence Schemes.

A Performance Comparison of Directory-Based and Timestamp-Based Cache Coherence Schemes (S.L. Min and J.-L. Baer).

Evaluating the Performance of Software Cache Coherence (S. Owicki and A. Agarwal).

Comparison of Hardware and Software Cache Coherence Schemes (S.V. Adve, V.S. Adve, M.D. Hill, and M.K. Vernon).

About the Author.