This report captures in detail a Two-level Software Health Management strategy on a real-life example of an Inertial Measurement Unit subsystem. We describe in detail the design of the component and system level health management strategy. Results are expressed as relevant portions of the detailed logs that shows the successful adaptation of the monitor/ detect/ diagnose/ mitigate approach to Software Health Management.
A fractionated spacecraft is a cluster of independent modules that interact wirelessly to maintain cluster flight and realize the functions usually performed by a monolithic satellite. This spacecraft architecture poses novel software challenges because the hardware platform is inherently distributed, with highly fluctuating connectivity among the modules. It is critical for mission success to support autonomous fault management and to satisfy real-time performance requirements. It is also both critical and challenging to support multiple organizations and users whose diverse software applications have changing demands for computational and communication resources, while operating on different levels and in separate domains of security. The solution proposed in this paper is based on a layered architecture consisting of a novel operating system, a middleware layer, and component-structured applications. The operating system provides primitives for concurrency, synchronization, and secure information flows; it also enforces application separation and resource management policies. The middleware provides higher-level services supporting request/response and publish/subscribe interactions for distributed software. The component model facilitates the creation of software applications from modular and reusable components that are deployed in the distributed system and interact only through well-defined mechanisms. Two cross-cutting aspects - multi-level security and multilayered fault management - are addressed at all levels of the architecture. The complexity of creating applications and performing system integration is mitigated through the use of a domain-specific model-driven development process that relies on a dedicated modeling language and its accompanying graphical modeling tools, software generators for synthesizing infrastructure code, and the extensive use of model-based analysis for verification and validation.
While traditional design-time and off-line approaches to testing and verification contribute significantly to improving and ensuring high dependability of software, they may not cover all possible fault scenarios that a system could encounter at runtime. Thus, runtime `health management' of complex embedded software systems is needed to improve their dependability. Our approach to Software Health Management uses concepts from the field of `Systems Health Management': detection, diagnosis and mitigation. In earlier work we had shown how to use a reactive mitigation strategy specified using a timed state machine model for system health manager. This paper describes the algorithm and key concepts for an alternative approach to system mitigation using a deliberative strategy, which relies on a function-allocation model to identify alternative component-assembly configurations that can restore the functions needed for the goals of the system.
Statecharts is a model-based formalism for simulating and analyzing reactive systems. In our previous work, we developed Polyglot, a unified framework for analyzing different semantic variants of Statechart models. However, for systems containing communicating, asynchronous components deployed on a distributed platform, additional features not inherent to the basic Statecharts paradigm are needed. These include a connector mechanism for communication, a scheduling framework for sequencing the execution of individual components, and a method for specifying verification properties spanning multiple components. This paper describes the addition of these features to Polyglot, along with an example NASA case study using these new features. Furthermore, the paper describes on-going work on modeling Plexil execution plans with Polyglot, which enables the study of interaction issues for future manned and unmanned missions.
Component-based software development for real-time systems necessitates a well-defined `component model' that allows compositional analysis and reasoning about systems. Such a model defines what a component is, how it works, and how it interacts with other components. It is especially important for real-time systems to have such a component model, as many problems in these systems arise from poorly understood and analyzed component interactions. In this paper we describe a component model for hard real-time systems that relies on the services of an ARINC-653 compliant real-time operating system platform. The model provides high-level abstractions of component interactions, both for the synchronous and asynchronous case. We present a formalization of the component model in the form of timed transition traces. Such formalization is necessary to be able to derive interesting system level properties such as fault propagation graphs from models of component assemblies. We provide a brief discussion about such system level fault propagation templates for this component model.
Complex real-time software systems require an active fault management capability. While testing, verification and validation schemes and their constant evolution help improve the dependability of these systems, an active fault management strategy is essential to potentially mitigate the unacceptable behaviors at run-time. In our work we have applied the experience gained from the field of Systems Health Management towards component-based software systems. The software components interact via well-defined concurrency patterns and are executed on a real-time component framework built upon ARINC-653 platform services. In this paper, we present the lessons learned in architecting and applying a two-level health management strategy to assemblies of software components.
Many-core Graphics Processing Units (GPU) provide a high-performance parallel hardware platform on the desktop at an incredibly low cost. However, the widespread use of this computational capacity is hindered by the fact that programming GPUs is difficult. The state-of-the-art is to develop code utilizing the NVIDIA Compute Unified Device Architecture (CUDA). However, effective use of CUDA requires developers highly skilled in both low-level systems programming and parallel processing. Recognizing this roadblock to widespread adaption of General-Purpose Computing on GPUs (GPGPU), the NVIDIA Performance Primitives (NPP) library was released recently. While greatly easing the burden, utilizing NPP still requires one to learn CUDA. In this paper, we introduce a graphical environment for the design of image processing workflows that automatically generates all the CUDA code including NPP calls necessary to run the application on a GPU. Experimental results show that the generated code is almost as efficient as the equivalent hand written program and 10 times faster than running on the CPU alone in the typical case.
Aircraft and spacecraft electrical power distribution systems are critical to overall system operation, but these systems may experience faults. Early fault detection makes it easier for system operators to respond and avoid catastrophic failures. This paper discusses a fault detection scheme based on a tunable generalized likelihood algorithm. We discuss the detector algorithm, and then demonstrate its performance on test data generated from a spacecraft power distribution testbed at NASA Ames. Our results show high detection accuracy and low false alarm rates.
Modern electrical power disribution systems play a critical role in system operations. Therefore, early fault detection and isolation is essential to maintaining system safety and avoiding catastrophic failures. This paper discusses a fault isolation scheme based on a qualitative fault signature-based isolation mechanism that applies to abrupt, incipient and intermittent faults in the system. We discuss the isolation algorithms for a combination of these faults, and demonstrate their performance on a set of test cases generated from a NASA Ames spacecraft power distribution testbed. Our results show good isolation accuracy with 103 out of 134 faulty scenarios isolated correctly. Most of the isolation errors can be attributed to errors in the detection scheme.
Failure of electronic devices is a concern for future electric aircrafts that will see an increase of electronics to drive and control safety-critical equipment throughout the aircraft. As a result, investigation of precursors to failure in electronics and prediction of remaining life of electronic components is of key importance. DC-DC power converters are power electronics systems employed typically as sourcing elements for avionics equipment. Current research efforts in prognostics for these power systems focuses on the identification of failure mechanisms and the development of accelerated aging methodologies and systems to accelerate the aging process of test devices, while continuously measuring key electrical and thermal parameters. Preliminary model-based prognostics algorithms have been developed making use of empirical degradation models and physics-inspired degradation model with focus on key components like electrolytic capacitors and power MOSFETs (metal-oxide-semiconductor-field-effect-transistor). This paper presents current results on the development of validation methods for prognostics algorithms of power electrolytic capacitors. Particularly, in the use of accelerated aging systems for algorithm validation. Validation of prognostics algorithms present difficulties in practice due to the lack of run-to-failure experiments in deployed systems. By using accelerated experiments, we circumvent this problem in order to define initial validation activities.
Electrolytic capacitors are used in several applications ranging from power supplies for safety critical avionics equipment to power drivers for electro-mechanical actuators. Past experiences show that capacitors tend to degrade and fail faster under high electrical and thermal stress conditions that they are often subjected to during operations. This makes them good candidates for prognostics and health management. Model based prognostics captures system knowledge in the form of physics-based models of components in order to obtain accurate predictions of end of life based on their current state of health and their anticipated future use and operational conditions. The focus of this paper is on deriving first principles degradation models for thermal stress conditions and implementing Bayesian framework for making remaining useful life predictions. Data collected from simultaneous experiments are used to validate the models. Our overall goal is to derive accurate models of capacitor degradation, and use them to remaining useful life in DC-DC converters.