
Fault-tolerant distributed shared memories


Date Issued:
1993
Summary:
Distributed shared memory (DSM) implements a shared-memory programming interface on message-passing hardware. The shared-memory programming paradigm offers several advantages over the message-passing paradigm. DSM is recognized as an important technology for massively parallel computing. However, as the number of processors in a system increases, the probability of a failure increases. To be widely useful, the DSM must be able to tolerate failures. This dissertation presents a method of implementing fault-tolerant DSM (FTDSM) that is based on the idea of a snooper. The snooper monitors DSM protocol messages and keeps a backup of the current state of the DSM. The snooper can respond on behalf of failed processors. The snooper-based FTDSM is an improvement over existing FTDSMs because it is based on the efficient dynamic distributed manager DSM algorithm, does not require the repair of a failed processor to access the DSM, and does not query all nodes to rebuild the state of the DSM. Three snooper-based FTDSM systems are developed. The single-snooper (SS) FTDSM has one snooper and is restricted to a broadcast network. Additional snoopers are added in the multiple-snooper (MS) FTDSM to improve performance. Two-phase commit (2PC) protocols are developed to coordinate the activities of the snoopers, and a special data structure is used to store causality information to reduce the amount of snooper activity. Snooping is integrated with each processor in the integrated snooper (IS) FTDSM. The IS FTDSM is scalable because it is not restricted to a broadcast network. The concept of dynamic snooping is introduced for the IS FTDSM and several snooper migration algorithms are studied. Several recovery algorithms are developed to allow failed processors to rejoin the system. The properties of data structures used to locate owners and snoopers are studied and used to prove that the system can tolerate any single fault. A flexible method of integrating application-level recovery with the FTDSM is presented, and a reliability analysis is conducted using a Markov-chain modeling tool to show that the snooper-based FTDSM is a cost-effective way to improve the reliability of DSM.
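The snooper idea described in the summary can be illustrated with a minimal sketch. This is not the dissertation's protocol: the message type `PageTransfer`, the class `Snooper`, and all method names are hypothetical, chosen only to show a component that watches ownership-transfer messages on a broadcast network, keeps a backup copy, and answers for an owner that has failed. The sketch ignores writes an owner makes locally after receiving a page, which a real FTDSM protocol would have to account for.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageTransfer:
    """Hypothetical DSM protocol message: a page moves to a new owner."""
    page: int
    new_owner: int
    data: bytes


@dataclass
class Snooper:
    """Hypothetical snooper: backs up owner and contents of each observed page."""
    owner: dict = field(default_factory=dict)    # page -> owning processor
    backup: dict = field(default_factory=dict)   # page -> last observed contents
    failed: set = field(default_factory=set)     # processors known to have failed

    def observe(self, msg: PageTransfer) -> None:
        # Called for every protocol message seen on the broadcast network.
        self.owner[msg.page] = msg.new_owner
        self.backup[msg.page] = msg.data

    def mark_failed(self, processor: int) -> None:
        # Record that a processor has failed (failure detection is assumed).
        self.failed.add(processor)

    def answer_for(self, page: int) -> Optional[bytes]:
        # Respond on behalf of the page's owner only if that owner has failed.
        if self.owner.get(page) in self.failed:
            return self.backup[page]
        return None


snooper = Snooper()
snooper.observe(PageTransfer(page=7, new_owner=2, data=b"abc"))
snooper.mark_failed(2)
reply = snooper.answer_for(7)  # the snooper supplies the backed-up page
```

Because the snooper already holds the ownership map and page contents, it does not need to query every node to rebuild state after a failure, which is the property the summary highlights.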
Title: Fault-tolerant distributed shared memories.
Name(s): Brown, Larry.
Florida Atlantic University, Degree grantor
Wu, Jie, Thesis advisor
College of Engineering and Computer Science
Department of Computer and Electrical Engineering and Computer Science
Type of Resource: text
Genre: Electronic Thesis Or Dissertation
Issuance: monographic
Date Issued: 1993
Publisher: Florida Atlantic University
Place of Publication: Boca Raton, Fla.
Physical Form: application/pdf
Extent: 253 p.
Language(s): English
Identifier: 12349 (digitool), FADT12349 (IID), fau:9251 (fedora)
Collection: FAU Electronic Theses and Dissertations Collection
Note(s): College of Engineering and Computer Science
Thesis (Ph.D.)--Florida Atlantic University, 1993.
Subject(s): Fault-tolerant computing
Electronic data processing--Distributed processing
Parallel processing (Electronic computers)
Computer networks
Held by: Florida Atlantic University Libraries
Persistent Link to This Record: http://purl.flvc.org/fcla/dt/12349
Sublocation: Digital Library
Use and Reproduction: Copyright © is held by the author, with permission granted to Florida Atlantic University to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Use and Reproduction: http://rightsstatements.org/vocab/InC/1.0/
Host Institution: FAU
Is Part of Series: Florida Atlantic University Digital Library Collections.