Current Search: Fault-tolerant computing
- Title
- USING A SUPERPROCESS TO ACCELERATE CONVERSATIONS FOR FAULT-TOLERANT CONCURRENT SOFTWARE.
- Creator
- GAO, LIXIN., Florida Atlantic University, Fernandez, Eduardo B.
- Abstract/Description
-
Since computer systems are applied to many critical areas, fault tolerance is a necessary requirement for their operation. Many techniques for dealing with hardware faults have been developed. Progress in fault-tolerant software has been much slower. Concurrent software adds another dimension to the problem of fault-tolerant software. This thesis uses an intermediate structure between two major schemes, conversation and programmer-transparent coordination. The scheme proposed here accelerates conversations by using a special process, or superprocess, which executes at the same system level as the run-time system and which, by having access to the history of all interprocess communications, can allow a process that passes its acceptance test to proceed conditionally. If a process does not pass its acceptance test, all processes recover immediately without waiting to reach their own acceptance tests. This work presents a set of algorithms to implement these ideas.
- Date Issued
- 1987
- PURL
- http://purl.flvc.org/fcla/dt/14398
- Subject Headings
- Fault-tolerant computing
- Format
- Document (PDF)
- Title
- A multiprocessor simulator to test fault detection and reconfiguration algorithms.
- Creator
- Bhathija, Unmesh Jethanand., Florida Atlantic University, Fernandez, Eduardo B.
- Abstract/Description
-
In recent years, multiprocessor systems have become increasingly important in critical applications. In particular, their fault tolerance properties are of great importance for their ability to be used in these types of applications. We have developed a multiprocessor simulator that can be used to test different fault detection algorithms. The processors must have four communication links. This simulator operates by passing messages between processors. An algorithm was developed for routing the messages among the processors. The simulator can also be used to try different reconfiguration strategies. In particular, we have tested Malek's comparison algorithm using different multiprocessor configurations. We also developed a program which determines the configuration of an unknown network of transputers.
- Date Issued
- 1990
- PURL
- http://purl.flvc.org/fcla/dt/14622
- Subject Headings
- Multiprocessors, Fault-tolerant computing
- Format
- Document (PDF)
- Title
- Fault-tolerant routing in two-dimensional and three-dimensional meshes.
- Creator
- Chen, Xiao., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Mesh-connected multicomputers are one of the simplest and least expensive structures to build a system using hundreds and even thousands of processors. The nodes communicate with each other by sending and receiving messages. As the system gets larger and larger, the routing algorithms must be not only efficient but also fault-tolerant. The fault model we use in 2-D meshes is a faulty block, while in 3-D meshes the fault model is a faulty cube. In order to route messages through feasible minimum paths, the extended safety level is used to determine the existence of a minimal path, and faulty block (cube) information is used to guide the routing. This dissertation presents an in-depth study of fault-tolerant minimal routing in 2-D tori, 3-D meshes, and tree-based fault-tolerant multicasting in 2-D and 3-D meshes using extended safety levels. Path-based fault-tolerant deadlock-free multicasting in 2-D and 3-D meshes is also studied. In fault-tolerant minimal routing in 2-D meshes, if no faulty block is encountered, any adaptive minimal routing can be used until the message encounters a faulty block. The next step is guided by the faulty block information until the message gets away from the faulty block. After that, any minimal adaptive routing can be used again. The minimal routing in 2-D tori is similar to that in 2-D meshes if at the beginning of the routing a conversion is made from a 2-D torus to a 2-D mesh. The fault-tolerant minimal routing in 3-D meshes can be done in a similar way. In the tree-based multicasting in 2-D and 3-D meshes, a time-step optimal and traffic-step suboptimal algorithm is proposed. Several heuristic strategies are presented to resolve a conflict, and they are compared by simulations. A path-based fault-tolerant deadlock-free multicast algorithm in 2-D meshes with inter-block distance of at least three is presented to solve the deadlock problem in tree-based multicast algorithms. The approach is then extended to 3-D meshes and to inter-block distance of at least two in 2-D meshes. The path is a Hamiltonian path that is updated only locally, in the neighborhood of a faulty block, when a faulty block is encountered. Two virtual channels are used to prevent deadlock in 2-D and 3-D meshes with inter-block (inter-cube) distance of at least three, and two more virtual channels are added if the inter-block distance is at least two.
- Date Issued
- 1999
- PURL
- http://purl.flvc.org/fcla/dt/12597
- Subject Headings
- Fault-tolerant computing, Computer algorithms
- Format
- Document (PDF)
- Title
- A unified methodology for software and hardware fault tolerance.
- Creator
- Wang, Yijun., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The growing demand for high availability of computer systems has led to a wide range of applications for fault-tolerant systems. In some real-time applications, ultrareliable computer systems are required. Such computer systems should be capable of tolerating failures of not only their hardware components but also of their software components. This dissertation discusses three aspects of designing an ultrareliable system: (a) a hierarchical ultrareliable system structure; (b) a set of unified methods to tolerate both software and hardware faults in combination; and (c) formal specifications in the system structure. The proposed hierarchical structure has four layers: Application, Software Fault Tolerance, Combined Fault Tolerance and Configuration. The Application Layer defines the structure of the application software in terms of the modular structure using a module interconnection language. The failure semantics of the service provided by the system is also defined at this layer. At the Software Fault Tolerance Layer, each module can use software fault tolerance methods. The implementation of the software and hardware fault tolerance is achieved at the Combined Fault Tolerance Layer, which utilizes the combined software/hardware fault tolerance methods. The Configuration Layer performs actual software and hardware resource management for the requests of fault identification and recovery from the Combined Fault Tolerance Layer. A combined software and hardware fault model is used as the system fault model. This model uses the concepts of fault pattern and fault set to abstract the various occurrences of software and hardware faults. We also discuss extended comparison models that consider faulty software as well. The combined software/hardware fault tolerance methods are based on recovery blocks, N-version programming, extended comparison methods and both forward and backward recovery methods. Formal specifications and verifications are used in the system design process and the system structure to show that the design and implementation of a fault-tolerant system satisfy the functional and non-functional requirements. Brief discussions and examples of using formal specifications in the hierarchical structure are given.
- Date Issued
- 1995
- PURL
- http://purl.flvc.org/fcla/dt/12424
- Subject Headings
- Fault-tolerant computing, Computer architecture
- Format
- Document (PDF)
- Title
- THE IMPLEMENTATION OF SOFTWARE FAULT TOLERANCE IN THE INTEL 80286 PROCESSOR.
- Creator
- OZAKI, BRENDA., Florida Atlantic University, Fernandez, Eduardo B., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
This thesis analyzes how the architecture of the Intel 80286 microprocessor may be used to implement fault tolerant software structures. The Multi-Micro Programming Line, MML, and the Intel 80286 kernel, K286, are used as tools to illustrate the implementation of software fault tolerance in an 80286 environment. The recovery metaprogram approach is supported by software layers which use the privilege levels in the 80286. Implementation of recovery blocks, N-version programming, exceptions, and conversations using a recovery metaprogram are covered. While the details are specific to the 80286 architecture, the general results apply to any architecture with three or more rings of privilege and a segmented virtual memory using descriptors.
- Date Issued
- 1987
- PURL
- http://purl.flvc.org/fcla/dt/14399
- Subject Headings
- Fault-tolerant computing, Intel 80286 (Microprocessor)
- Format
- Document (PDF)
- Title
- Fault-tolerant multicasting in hypercube multicomputers.
- Creator
- Yao, Kejun., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Interprocessor communication plays an important role in the performance of multicomputer systems, such as hypercube multicomputers. In this thesis, we consider the multicast problem for a hypercube system in the presence of faulty components. Two types of algorithms are proposed. Type 1 algorithms, which are developed based on local network information, can tolerate both node failures and link failures. Type 2 algorithms, which are developed based on limited global network information, ensure that each destination receives the message through the shortest path. Simulation results show that Type 2 algorithms achieve very good results on both time and traffic steps, two main criteria in measuring the performance of interprocessor communication.
- Date Issued
- 1993
- PURL
- http://purl.flvc.org/fcla/dt/14896
- Subject Headings
- Hypercube networks (Computer networks), Computer architecture, Fault-tolerant computing
- Format
- Document (PDF)
- Title
- Dynamic routing in grid-connected networks.
- Creator
- Jiang, Zhen., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
This dissertation describes the effect of collection and distribution of fault information on routing capacity in grid-connected networks with faults occurring during the routing process. The grid-connected network, such as hypercubes, 2-D meshes, and 3-D meshes, is one of the simplest and least expensive structures to build a system using hundreds and even thousands of processors. In such a system, efficient communication among the processors is critical to performance. Hence, the routing of messages is an important issue that needs to be addressed. As the number of nodes in the networks increases, the chance of failure also increases. The complex nature of networks also makes them vulnerable to disturbances. Therefore, the ability to route messages efficiently in the presence of faulty components, especially those that might occur during the routing process, is becoming increasingly important. A central issue in designing a fault-tolerant routing algorithm is the way fault information is collected and used. The safety level model is a special coded fault information model in hypercubes which is more cost effective and more efficient than other information models. In this model, each node is associated with an integer, called its safety level, which is an approximate measure of the number and distribution of faulty nodes in the neighborhood. The safety level of each node in an n-dimensional hypercube can be easily calculated through (n - 1) rounds of information exchange among neighboring nodes. A k-safe node indicates the existence of at least one Hamming distance path (also called optimal path or minimal path) from this node to any node within Hamming distance k. We focus on routing capacity using safety levels in a dynamic system. In this case, the update of safety levels and the routing process proceed hand-in-hand. During the converging period, the routing process may experience extra hops based on unstable (inconsistent) information. Under the assumption that the total number of faults is less than n, we provide an upper bound on the number of extra hops and show its accuracy and effectiveness. After that, we extend the results to meshes. Our simulation results show the effectiveness of our information model and the scalability of our fault-information-based routing in grid-connected networks with dynamic faults. Because our information is easy to update and maintain and optimality is still preserved, it is more cost effective than the others.
- Date Issued
- 2002
- PURL
- http://purl.flvc.org/fcla/dt/12002
- Subject Headings
- Fault-tolerant computing, Hypercube networks (Computer networks)
- Format
- Document (PDF)
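The safety-level computation mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the dissertation's algorithm: the abstract does not give the update rule, so the sketch assumes a rule commonly stated in the safety-level literature (a faulty node has level 0; a non-faulty node sorts its neighbors' levels in ascending order as S_0 <= ... <= S_(n-1) and takes the smallest k with S_k < k, or n if no such k exists). The function name `safety_levels` and the integer node encoding are hypothetical.

```python
def safety_levels(n, faulty):
    """Safety levels for an n-dimensional hypercube (assumed update rule).

    Nodes are the integers 0 .. 2**n - 1; two nodes are neighbors when
    their binary labels differ in exactly one bit. `faulty` is a set of
    faulty node labels.
    """
    size = 1 << n
    # Faulty nodes start (and stay) at 0; all others start fully safe.
    level = [0 if v in faulty else n for v in range(size)]
    # The abstract states that n - 1 rounds of neighbor exchange suffice.
    for _ in range(n - 1):
        updated = level[:]
        for v in range(size):
            if v in faulty:
                continue
            # Levels reported by the n neighbors, in ascending order.
            s = sorted(level[v ^ (1 << d)] for d in range(n))
            # Assumed rule: smallest k with S_k < k, or n if none exists.
            updated[v] = next((k for k, sk in enumerate(s) if sk < k), n)
        level = updated
    return level
```

For example, `safety_levels(3, {0b000, 0b011})` assigns level 1 to nodes 001 and 010, which sit between the two faults, and level 3 to the remaining non-faulty nodes.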
- Title
- Design and modeling of hybrid software fault-tolerant systems.
- Creator
- Zhang, Man-xia Maria., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Fault-tolerant programming methods improve software reliability using the principles of design diversity and redundancy. Design diversity and redundancy, on the other hand, escalate the cost of software design and development. In this thesis, we study the reliability of hybrid fault-tolerant systems. Probability models based on fault trees are developed for the recovery block (RB), N-version programming (NVP) and hybrid schemes that combine RB and NVP. Two heuristic methods are developed to construct hybrid fault-tolerant systems under total cost constraints. The algorithms provide a systematic approach to the design of hybrid fault-tolerant systems.
- Date Issued
- 1992
- PURL
- http://purl.flvc.org/fcla/dt/14783
- Subject Headings
- Computer software--Reliability, Fault-tolerant computing, Algorithms
- Format
- Document (PDF)
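The two building blocks named in the abstract above, the recovery block (RB) and N-version programming (NVP), can be illustrated with a generic textbook-style sketch. It is not taken from the thesis, and it does not reproduce the fault-tree models or the cost heuristics; the function names `recovery_block` and `n_version` are hypothetical.

```python
import copy
from collections import Counter


def recovery_block(alternates, acceptance_test, state):
    """Try alternates in order; return the first result that is accepted."""
    checkpoint = copy.deepcopy(state)  # saved state for backward recovery
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # roll back before each try
        try:
            result = alternate(candidate)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(versions, x):
    """Run every version on x and return the strict-majority output."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass  # a crashed version simply casts no vote
    if not outputs:
        raise RuntimeError("no version produced an output")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among the versions")
    return value
```

A hybrid scheme in the sense of the abstract would nest one construct inside the other, for instance by using an `n_version` voter as one alternate of a `recovery_block`; which combinations the thesis actually evaluates is not stated in the abstract.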
- Title
- Fault tolerant scheduling for multiprocessor systems.
- Creator
- Mahanthi, Gangadhar., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
In the last few years, it has become difficult to achieve higher computer performance solely by upgrading logic technology. This has required a move to parallel processing or multiprocessor systems in order to build faster computer systems. The importance of multiprocessor systems is increasing for many reasons, one of which is reliability. In a multiprocessor system, a number of tasks may exist concurrently. To operate the system efficiently, one must carefully schedule the tasks. This thesis proposes a set of algorithms to schedule these tasks, exploiting the inherent redundancy of processors in a multiprocessor system. Some reliability issues and applications to different networks are also discussed, with examples.
- Date Issued
- 1992
- PURL
- http://purl.flvc.org/fcla/dt/14822
- Subject Headings
- Multiprocessors, Fault-tolerant computing, Electronic digital computers--Reliability
- Format
- Document (PDF)
- Title
- Reliability modeling of fault-tolerant software.
- Creator
- Leu, Shao-Wei., Florida Atlantic University, Fernandez, Eduardo B., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
We have developed reliability models for a variety of fault-tolerant software constructs including those based on two well-known methodologies: recovery block and N-version programming, and their variations. We also developed models for the conversation scheme which provides fault tolerance for concurrent software and a newly proposed system architecture, the recovery metaprogram, which attempts to unify most of the existing fault-tolerant strategies. Each model is evaluated using either GSPN, a software package based on Generalized Stochastic Petri Nets, or Sharpe, an evaluation tool for Markov models. The numerical results are then analyzed and compared. Major results derived from this process include the identification of critical parameters for each model, the comparisons of relative performance among different software constructs, the justification of a preliminary approach to the modeling of complex conversations, and the justification of recovery metaprogram regarding improvement of reliability.
- Date Issued
- 1990
- PURL
- http://purl.flvc.org/fcla/dt/12256
- Subject Headings
- Fault-tolerant computing, Computer software--Reliability
- Format
- Document (PDF)
- Title
- The balanced hypercube: A versatile cube-based multicomputer system.
- Creator
- Huang, Ke., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
We propose the balanced hypercube (BH), which is a variant of the standard hypercube (Q), as a multicomputer topological structure. An n-dimensional balanced hypercube BHn has the same desirable topological properties as the 2n-dimensional standard hypercube Q2n, such as size (2^(2n) nodes and n·2^(2n) edges), regularity and symmetry, connectivity (2n node-disjoint paths between any pair of nodes), and diameter (2n when n = 1 or n is even). Moreover, BHn has a smaller diameter (2n-1) than Q2n's (2n) when n is odd and other than 1. In addition, BHn is load balanced, i.e., for every node v of BHn, there exists another node v', called v's matching node, such that v and v' share the same adjacent node set. Therefore, BHn has a desirable fault tolerance feature: when a node v fails, we can simply shift the job execution on v to its matching node v' and the communication pattern between jobs remains the same. In this dissertation, we study the topological properties of BHn and explore its fault tolerance feature. Other design issues are considered, such as communication primitives, capability of simulating other multicomputer systems through graph embedding, resource placement, and VLSI/WSI layout. Finally, the use of BHn is illustrated by an application.
- Date Issued
- 1997
- PURL
- http://purl.flvc.org/fcla/dt/12519
- Subject Headings
- Hypercube networks (Computer networks), Fault-tolerant computing
- Format
- Document (PDF)
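The matching-node property described in the abstract above can be checked on a small instance. The abstract does not define the adjacency rule of BHn, so the sketch below assumes one formulation: a node is a tuple (a_0, ..., a_(n-1)) of digits in {0, 1, 2, 3}, and its neighbors are obtained by changing a_0 by ±1 (mod 4), optionally also changing one other digit a_j by (-1)^(a_0). Under that assumption the matching node of v is v with a_0 increased by 2 (mod 4). The names `bh_neighbors` and `matching_node` are hypothetical.

```python
from itertools import product


def bh_neighbors(v):
    """Neighbors of node v (a tuple of base-4 digits) under the assumed rule."""
    a0, rest = v[0], v[1:]
    step = 1 if a0 % 2 == 0 else -1  # (-1) ** a0
    out = set()
    for d in (1, -1):
        b0 = (a0 + d) % 4
        out.add((b0,) + rest)  # change only the first digit
        for j in range(len(rest)):
            # change the first digit and one other digit
            moved = rest[:j] + ((rest[j] + step) % 4,) + rest[j + 1:]
            out.add((b0,) + moved)
    return out


def matching_node(v):
    """Node assumed to share v's neighbor set: first digit shifted by 2."""
    return ((v[0] + 2) % 4,) + v[1:]


# Check the properties quoted in the abstract on BH2.
n = 2
nodes = list(product(range(4), repeat=n))
edge_count = sum(len(bh_neighbors(v)) for v in nodes) // 2
print(len(nodes) == 2 ** (2 * n))       # size: 2^(2n) nodes
print(edge_count == n * 2 ** (2 * n))   # size: n * 2^(2n) edges
print(all(bh_neighbors(v) == bh_neighbors(matching_node(v)) for v in nodes))
```

Because a node and its matching node have identical neighbor sets, moving a failed node's job to the matching node leaves every communication partner one hop away, which is the fault tolerance feature the abstract describes.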
- Title
- Load balancing on multiprocessor systems.
- Creator
- More, Hemant B., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The utilization of a multiprocessor system is enhanced when the idle time of processors is reduced. Moving processes from overloaded processors to idle processors can balance the load on multiprocessor systems and increase system throughput by reducing the process execution time. This thesis presents a study of parameters, issues and existing algorithms related to load balancing. The performance of load balancing on hypercubes using three new algorithms is explored and analyzed. A new algorithm to balance load on hypercubes in the presence of link faults is presented and analyzed. Another algorithm to balance load on hypercube systems containing faulty processors is proposed and studied. The applicability of load balancing to real-life problems is demonstrated by showing that the execution of a branch-and-bound problem on hypercubes speeds up when load balancing is used.
- Date Issued
- 1993
- PURL
- http://purl.flvc.org/fcla/dt/14957
- Subject Headings
- Hypercube networks (Computer networks), Multiprocessors, Fault-tolerant computing
- Format
- Document (PDF)
- Title
- Massively parallel fault simulator.
- Creator
- Parigi, Eshwar V., Florida Atlantic University, Mazuera, Olga, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Fault simulators can be used for various purposes, such as the determination of fault coverage, automatic test pattern generation and the preparation of fault dictionaries. As the size of digital circuits increases, the number of gates increases and the time taken for fault simulation also increases. In order to reduce the fault simulation time, massively parallel computers are being used. We have developed a fault simulator on MASPAR, a massively parallel Single Instruction Multiple Data machine, based on the principles of parallel pattern parallel fault simulation. In order to overcome the memory limitation of MASPAR, we have designed an algorithm which reduces the amount of memory required for storing the circuit. We have implemented these algorithms in two different ways. These algorithms were tested on ISCAS85 benchmark circuits. The results have shown an improvement over other parallel algorithms.
- Date Issued
- 1994
- PURL
- http://purl.flvc.org/fcla/dt/15050
- Subject Headings
- Fault-tolerant computing, Parallel processing (Electronic computers)
- Format
- Document (PDF)
- Title
- Fault tolerance and reliability patterns.
- Creator
- Buckley, Ingrid A., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Achieving dependability in critical infrastructures has become indispensable for government and commercial enterprises. This need has become more pressing with the proliferation of malicious attacks on critical systems, such as healthcare, aerospace and airline applications. Additionally, due to the widespread use of web services in critical systems, the need to ensure their reliability is paramount. We believe that patterns can be used to achieve dependability. We conducted a survey of fault tolerance, reliability and web service products and patterns to better understand them. One objective of our survey is to evaluate the state of these patterns, and to investigate which standards are being used in products and their tool support. Our survey found that these patterns are insufficient, and many web services products do not use them. In light of this, we wrote some fault tolerance and web services reliability patterns and present an analysis of them.
- Date Issued
- 2008
- PURL
- http://purl.flvc.org/FAU/166447
- Subject Headings
- Fault-tolerant computing, Computer software, Reliability, Reliability (Engineering), Computer programs
- Format
- Document (PDF)
- Title
- Towards a methodology for building reliable systems.
- Creator
- Buckley, Ingrid A., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Reliability is a key system characteristic that is an increasing concern for current systems. Greater reliability is necessary due to the new ways in which services are delivered to the public. Services are used by many industries, including health care, government, telecommunications, tools, and products. We have defined an approach to incorporate reliability along the stages of system development. We first did a survey of existing dependability patterns to evaluate their possible use in...
Show moreReliability is a key system characteristic that is an increasing concern for current systems. Greater reliability is necessary due to the new ways in which services are delivered to the public. Services are used by many industries, including health care, government, telecommunications, tools, and products. We have defined an approach to incorporate reliability along the stages of system development. We first did a survey of existing dependability patterns to evaluate their possible use in this methodology. We have defined a systematic methodology that helps the designer apply reliability in all steps of the development life cycle in the form of patterns. A systematic failure enumeration process to define corresponding countermeasures was proposed as a guideline to define where reliability is needed. We introduced the idea of failure patterns which show how failures manifest and propagate in a system. We also looked at how to combine reliability and security. Finally, we defined an approach to certify the level of reliability of an implemented web service. All these steps lead towards a complete methodology.
- Date Issued
- 2012
- PURL
- http://purl.flvc.org/FAU/3342037
- Subject Headings
- Computer software, Reliability, Reliability (Engineering), Computer programs, Fault-tolerant computing
- Format
- Document (PDF)
- Title
- A fault-tolerant memory architecture for storing one hour of D-1 video in real time on long polyimide tapes.
- Creator
- Monteiro, Pedro Cox de Sousa., Florida Atlantic University, Glenn, William E., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Research is under way to fabricate large-area thin-film transistor arrays produced on a thin polyimide substrate. The polyimide substrate is available in long thirty-centimeter-wide rolls of tape, and lithography hardware is being developed to expose hundreds of meters of this tape with electrically addressable light modulators which can resolve 2 µm features. A fault-tolerant memory architecture is proposed that is capable of storing one hour of D-1 component digital video (almost 10^12 bits) in real time, on eight two-hundred-meter-long tapes. Appropriate error correcting codes and error concealment are proposed to compensate for drop-outs resulting from manufacturing defects so as to yield video images with error rates low enough to survive several generations of copies.
- Date Issued
- 1992
- PURL
- http://purl.flvc.org/fcla/dt/14869
- Subject Headings
- Polyimides, Computer architecture, Memory hierarchy (Computer science), Fault-tolerant computing
- Format
- Document (PDF)
- Title
- The design of reliable decentralized computer systems.
- Creator
- Wu, Jie., Florida Atlantic University, Fernandez, Eduardo B., College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
With the increase in the applications of computer technology, there are more and more demands for the use of computer systems in the area of real-time applications and critical systems. Reliability and performance are fundamental design requirements for these applications. In this dissertation, we develop some specific aspects of a fault-tolerant decentralized system architecture. This system can execute concurrent processes and it is composed of processing elements that have only local memories with point-to-point communication. A model using hierarchical layers describes this system. Fault tolerance techniques are discussed for the applications, software, operating system, and hardware layers of the model. Scheduling of communicating tasks to increase performance is also addressed. Some special problems such as the Byzantine Generals problem are considered. We have shown that, by combining reliable techniques on different layers and with consideration of system performance, one can provide a system with a very high level of reliability as well as performance.
- Date Issued
- 1989
- PURL
- http://purl.flvc.org/fcla/dt/12237
- Subject Headings
- Electronic digital computers--Reliability, Fault-tolerant computing, System design, Computer software--Reliability
- Format
- Document (PDF)
- Title
- Fault-tolerant parallel computing using shuffle exchange hypercube and cube-connected cubes.
- Creator
- Goyal, Praduemn K., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
The hypercube has become one of the most popular architectures for a wide variety of parallel processing applications and has been used in several commercial and research multiprocessors. Its topological and reliability properties have been studied extensively and several techniques have been proposed for enhancing its reliability. We first present a survey of the techniques that have been used for analyzing and enhancing the reliability of the hypercube and propose a classification framework in which the surveyed reliability analysis techniques can be critically evaluated. Invariably, the techniques for enhancing the fault tolerance of the hypercube require modification of the processing nodes to include redundant elements, or alternatively, degrade the hypercube to a lower dimension cube when faults occur. We propose a technique using unmodified processing elements that takes advantage of the dataflow patterns of a specific class of parallel algorithms belonging to the divide-and-conquer paradigm. It is shown that by incorporating shuffles and exchanges, the execution of the divide-and-conquer class of algorithms on the hypercube can be made fault-tolerant. We develop this technique into a fault-tolerant computing scheme for execution of the divide-and-conquer class of parallel algorithms, which we call Shuffle Exchange Hypercube (SEH). We propose a new recursively defined interconnection architecture for parallel computation called Cube-Connected Cubes (CCCubes). It is shown that the CCCubes architecture can emulate both the hypercube and the Cube-Connected Cycles (CCC) architectures. The CCCubes architecture is recursively extended into the kth order Generalized Cube-Connected Cubes (GCCCubes) architecture. We propose several classes of CCCubes and GCCCubes architectures and study their topological and reliability properties.
A comparison of the reliability and topological properties of the proposed architectures with those of the hypercube is provided and it is shown that the CCCubes and GCCCubes architectures present practical alternatives to the hypercube. Finally, some areas worthy of further pursuit are presented, which include the problem of determining a switch route schedule for SEH, extension of shuffles and exchanges to CCCubes and GCCCubes, and the determination of a VLSI layout for the proposed CCCubes and GCCCubes architectures.
- Date Issued
- 1998
- PURL
- http://purl.flvc.org/fcla/dt/12581
- Subject Headings
- Fault-tolerant computing, Hypercube networks (Computer networks), Parallel processing (Electronic computers)
- Format
- Document (PDF)
- Title
- Fault-tolerant distributed shared memories.
- Creator
- Brown, Larry., Florida Atlantic University, Wu, Jie, College of Engineering and Computer Science, Department of Computer and Electrical Engineering and Computer Science
- Abstract/Description
-
Distributed shared memory (DSM) implements a shared-memory programming interface on message-passing hardware. The shared-memory programming paradigm offers several advantages over the message-passing paradigm. DSM is recognized as an important technology for massively parallel computing. However, as the number of processors in a system increases, the probability of a failure increases. To be widely useful, the DSM must be able to tolerate failures. This dissertation presents a method of implementing fault-tolerant DSM (FTDSM) that is based on the idea of a snooper. The snooper monitors DSM protocol messages and keeps a backup of the current state of the DSM. The snooper can respond on behalf of failed processors. The snooper-based FTDSM is an improvement over existing FTDSMs because it is based on the efficient dynamic distributed manager DSM algorithm, does not require the repair of a failed processor to access the DSM, and does not query all nodes to rebuild the state of the DSM. Three snooper-based FTDSM systems are developed. The single-snooper (SS) FTDSM has one snooper and is restricted to a broadcast network. Additional snoopers are added in the multiple-snooper (MS) FTDSM to improve performance. Two-phase commit (2PC) protocols are developed to coordinate the activities of the snoopers, and a special data structure is used to store causality information to reduce the amount of snooper activity. Snooping is integrated with each processor in the integrated snooper (IS) FTDSM. The IS FTDSM is scalable because it is not restricted to a broadcast network. The concept of dynamic snooping is introduced for the IS FTDSM and several snooper migration algorithms are studied. Several recovery algorithms are developed to allow failed processors to rejoin the system. The properties of data structures used to locate owners and snoopers are studied and used to prove that the system can tolerate any single fault.
A flexible method of integrating application-level recovery with the FTDSM is presented, and a reliability analysis is conducted using a Markov-chain modeling tool to show that the snooper-based FTDSM is a cost-effective way to improve the reliability of DSM.
- Date Issued
- 1993
- PURL
- http://purl.flvc.org/fcla/dt/12349
- Subject Headings
- Fault-tolerant computing, Electronic data processing--Distributed processing, Parallel processing (Electronic computers), Computer networks
- Format
- Document (PDF)
- Title
- Checkpointing schemes for high-performance parallel applications in networks of workstations.
- Creator
- He, Fusen., Florida Atlantic University, Wu, Jie
- Abstract/Description
-
In this thesis, a high-performance data-parallel application model with low interprocessor communication overhead is proposed for a network of workstations (NOW). Checkpointing and rollback technologies are used in this model for performance enhancement. The proposed model is analyzed both theoretically and numerically. The simulation results indicate that the parallel application model should achieve high performance. As a case study, the proposed model is applied to the parallel Everglades Landscape Fire Model (ELFM) code, which was developed by the South Florida Water Management District (SFWMD). The parallel programming environment is the Message-Passing Interface (MPI). A synchronous checkpointing and rollback mechanism is used to handle the spread of fire, which is a dynamic and irregular component of the model. Results show that the performance of the parallel ELFM using MPI is significantly enhanced by the application of checkpointing and rollback.
- Date Issued
- 1998
- PURL
- http://purl.flvc.org/fcla/dt/15597
- Subject Headings
- Computer networks, Electronic data processing--Distributed processing, Fault-tolerant computing
- Format
- Document (PDF)