1 CSC 808 (FAULT-TOLERANCE)
INTRODUCTION AND OVERVIEW OF FAULT-TOLERANT SCHEMES; FAULT AND ERROR MODELLING; TEST GENERATION AND FAULT SIMULATION
Presented By:
- ABDULLAHI OPEOLUWA ABAYOMI (CSC/16/9778)
- ADESUNMIBOLA LINDA NSIKAN (CSC/24/3828)
- AFOLAYAN DAMILOLA ODUNAYO (CSC/24/4200)
- ALLI NAJEEM IDOWU (CSC/24/4217)
2 Outline
- Introduction
- Overview of Fault-Tolerant Schemes
- Faults, Errors, and Failures
- Test Generation and Fault Simulation in Fault-Tolerant Design
3 Think About This Imagine cruising at 36,000 feet in a commercial airplane. Suddenly, one of the engines fails. Do passengers panic? No. Do alarms blare? Not likely. The aircraft keeps flying, calmly and smoothly. Why? Because its designers anticipated failure. Spare engine power, redundant control systems, and robust fail-safes are all working quietly in the background. This isn’t luck; it’s fault tolerance. From the biological marvel of the human body to the digital intricacy of banking systems, redundancy is the hidden scaffolding of resilience. We do not build systems to be perfect. We build them to survive imperfection.
4 What Is a System? A system is a collection of interconnected parts working together toward a specific goal. Whether natural or man-made, systems exist to perform functions and to do so reliably. For example:
- The human body
- An aircraft
- A computer system
Despite their differences, all systems share one unavoidable truth: they are susceptible to faults. Left unchecked, a fault can degrade performance, cause total failure, or trigger cascading consequences. This vulnerability drives the need for a new kind of thinking, one that assumes faults are inevitable and prepares systems to endure them.
5 Why Fault Tolerance Matters No system is infallible. Even the most rigorously tested, expertly engineered systems can and do fail. Causes range from software bugs and hardware degradation to cyberattacks, natural disasters, and even human error.
6 Why Fault Tolerance Matters In response, fault tolerance has emerged not as an afterthought, but as an architectural necessity. A fault-tolerant system doesn’t promise perfection. It promises continuity — even when things go wrong. It detects, isolates, and recovers from faults so critical services continue uninterrupted. In today’s interconnected, high-stakes world, fault tolerance is more than a feature — it’s a foundation of trust.
7 The Backbone of Fault Tolerance Redundancy is the first line of defense against failure. By incorporating duplicate components or alternative pathways, systems gain the ability to detect and recover from faults automatically. Redundancy finds expression in:
- The human body
- Commercial aircraft
- Banking and financial systems
- Car braking systems
- Cloud infrastructure
- Military communications
8 Designing for Survival The common thread in all these examples? Preparation for failure. Systems that matter — to safety, economy, national security, or human life — are not built to be flawless. They are built to withstand imperfection. They anticipate faults, design for detection and response, and recover gracefully. Next, we’ll look into the how: fault-tolerant architectures, modeling techniques, test generation, and simulation strategies — all of which aim to build systems that not only function… but endure.
9 Importance of Fault Tolerance A fault-tolerant scheme is a planned design approach or architecture that allows a system to detect, isolate, and recover from hardware or software faults without compromising its overall functionality. Fault tolerance is critical to the stability, safety, and efficiency of modern society. In today’s interconnected world, even a brief system failure can result in huge economic losses, safety hazards, and productivity declines.
10 Key Fault-Tolerant Techniques
Scheme | Description | Common Use
Redundancy | Duplicating hardware/software so that if one fails, the other takes over. | Servers, storage (RAID), aircraft systems.
Checkpointing and Rollback | Saving system state periodically so it can recover after failure. | Distributed computing, databases.
Replication | Maintaining multiple copies of a service or data. | Cloud services (AWS, Google Cloud), blockchain.
Failover Systems | Automatic switching to standby systems upon failure. | Network infrastructure, telecommunication systems.
Error Detection and Correction | Detecting and correcting data errors (e.g., ECC memory). | Communication systems, memory hardware.
Self-Healing Systems | Systems that autonomously detect and fix faults. | Cloud-native apps, Kubernetes clusters.
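The checkpointing and rollback scheme in the table above can be sketched in a few lines. This is only an illustrative toy, not a production design: the `CheckpointedCounter` class, its `process` method, and the rule that negative inputs trigger a fault are all hypothetical names and assumptions introduced here.

```python
import copy


class CheckpointedCounter:
    """Toy stateful service that can save and restore its state (hypothetical)."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a deep copy so later mutations cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the last known-good state.
        self.state = copy.deepcopy(self._checkpoint)

    def process(self, value):
        # The first update happens before the fault, leaving a partial update.
        self.state["processed"] += 1
        if value < 0:  # assumed fault trigger, for illustration only
            raise ValueError("simulated fault while processing item")
        self.state["total"] += value


def run_with_recovery(values):
    """Checkpoint after each successful item; roll back when an item fails."""
    svc = CheckpointedCounter()
    skipped = []
    for v in values:
        try:
            svc.process(v)
            svc.checkpoint()
        except ValueError:
            svc.rollback()  # discard the half-applied update
            skipped.append(v)
    return svc.state, skipped


state, skipped = run_with_recovery([1, 2, 3, -1, 4])
print(state, skipped)
```

Without the rollback, the failed item would leave `processed` incremented while `total` was not updated, which is exactly the kind of inconsistent internal state the scheme is meant to undo.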
11 Categories of Fault-Tolerant Schemes
Hardware Redundancy
Purpose: Enhance system reliability by duplicating physical components
Types / Techniques: Passive, Active, Standby redundancy; Triple Modular Redundancy (TMR); Hot/Warm/Cold Standby
Examples: Airplane control systems; RAID storage; Dual power supplies
Pros: Fast fault recovery; Reliable for hardware failures
Cons: High cost; Complex systems
Software Redundancy
Purpose: Tolerate software faults via diverse implementations
Types / Techniques: N-Version Programming; Recovery Blocks; Data Redundancy (e.g., checksums)
Examples: Avionics systems; NASA mission software; Safety-critical medical software
Pros: Robust software; Detects logic/design flaws
Cons: Expensive; Vulnerable to common-mode failures
Time Redundancy
Purpose: Detect transient faults by repeating operations
Types / Techniques: Retry Mechanisms; Temporal Voting; Data Retransmission
Examples: ECC memory refresh; TCP/IP retransmissions; Retry logic in software
Pros: Cost-effective; No extra hardware needed
Cons: Causes delay; Ineffective for permanent faults
Information Redundancy
Purpose: Improve data integrity by adding extra information
Types / Techniques: Parity bits; Checksums; Hamming, CRC, ECC codes
Examples: Data networks; ECC RAM; Digital media storage
Pros: Corrects data corruption; Critical in communications
Cons: Data overhead; Limited fault correction
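Triple Modular Redundancy, listed above under hardware redundancy, comes down to a majority voter over three module outputs. The sketch below models the modules as plain Python functions; `good_module` and `faulty_module` are hypothetical stand-ins for replicated hardware units, and the single flipped bit is an assumed fault.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr(modules, x):
    """Run three replicated modules on the same input and vote on the result."""
    out_a, out_b, out_c = (m(x) for m in modules)
    return majority_vote(out_a, out_b, out_c)


def good_module(x):
    # Hypothetical 8-bit increment unit.
    return (x + 1) & 0xFF


def faulty_module(x):
    # Same unit with an assumed fault that flips bit 2 of the output.
    return good_module(x) ^ 0b0000_0100


# One faulty replica is masked by the two healthy ones.
print(tmr([good_module, good_module, faulty_module], 41))  # 42

# Two replicas with the same fault outvote the healthy one.
print(tmr([good_module, faulty_module, faulty_module], 41))  # 46
```

The second call illustrates the common-mode weakness noted in the table: voting only helps while a majority of replicas remain correct.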
12 Important Lessons Learned from Real-World Failures
1. Redundancy Alone Isn’t Enough
Explanation: Redundancy must be correctly configured, regularly tested, and free of single points of failure.
Examples:
- Delta Airlines: Backup systems failed due to poor failover.
- Boeing 737 MAX: Lack of sensor redundancy led to fatal crashes.
2. Resilience Must Extend Across Layers
Explanation: Fault tolerance should cover hardware, software, network, and operational levels.
Examples:
- Facebook: DNS/BGP misconfiguration at the network layer disrupted the whole stack.
- Azure: Health probe bug at the software level took down global DNS.
3. Systems Must Anticipate Human Error
Explanation: Misconfigurations and update errors are common; systems must be robust against them.
Examples:
- AWS: Bug in traffic management system caused a cascading failure.
- Facebook: Misconfigured BGP update locked out access.
4. Systemic Thinking Is Critical
Explanation: Local failures can cascade; holistic system awareness and modeling are essential.
Examples:
- AWS & Facebook: Local issues triggered global outages due to lack of isolation.
13 Why We Must Model Faults
- Helps identify weak spots before deployment.
- Reveals interdependencies that may cause cascading failures.
- Enables simulation of “what-if” scenarios and failure modes.
14 Fault Modeling Techniques
- Fault Tree Analysis (FTA): Visual mapping of how faults lead to system failures.
- Failure Mode and Effects Analysis (FMEA): Identifies potential faults, their causes, and impacts.
- Markov Models: Used in probabilistic reliability modeling.
- Chaos Engineering: Intentionally injecting faults into a system to observe how it behaves under stress (used by Netflix, Google).
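As a rough illustration of Fault Tree Analysis, the sketch below evaluates a tiny tree with AND and OR gates. It assumes the basic events are statistically independent, and the power-loss scenario and all probabilities are made-up values chosen only for the example.

```python
from math import prod


def and_gate(*probs):
    """Output event occurs only if all independent input events occur."""
    return prod(probs)


def or_gate(*probs):
    """Output event occurs if at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


# Hypothetical tree: the site loses power if the mains supply fails AND
# the backup path fails. The backup path fails if the generator fails
# OR the transfer switch fails.
p_mains = 0.01
p_generator = 0.05
p_switch = 0.02

p_backup = or_gate(p_generator, p_switch)  # about 0.069
p_top = and_gate(p_mains, p_backup)        # about 0.00069

print(f"P(backup path fails) = {p_backup:.4f}")
print(f"P(power loss)        = {p_top:.6f}")
```

Real trees often contain repeated or dependent events, in which case these simple gate formulas no longer apply directly and methods such as minimal cut sets are needed.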
15 Predicting Faults with AI/ML
Machine learning models can analyze logs, sensor data, and telemetry to:
- Detect anomalies in real time
- Predict component degradation
- Trigger alerts before actual failure
Predictive maintenance and intelligent fault diagnostics are widely used in:
- Cloud infrastructure
- Industrial automation
- Smart grids and IoT systems
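The anomaly-detection idea above can be illustrated with a rolling z-score check. This is a simple statistical baseline standing in for a trained ML model, not a description of any particular product; the `detect_anomalies` function, the window size, the threshold, and the temperature readings are all assumptions made for the example.

```python
from statistics import mean, stdev


def detect_anomalies(readings, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if readings[i] != mu:
                anomalies.append(i)
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical bearing-temperature telemetry with one spike at index 7.
telemetry = [50.0, 50.2, 49.9, 50.1, 50.0, 50.1, 49.8, 58.0, 50.0, 50.1]
print(detect_anomalies(telemetry))  # [7]
```

Note that the spike inflates the variance of the windows that follow it, so a second fault arriving shortly afterwards could be missed; production systems typically use more robust statistics or learned models.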
16 Faults, Errors, and Failures Fault = Root cause or latent defect (e.g., design flaw, bug, crack). Error = Internal incorrect state caused by a fault (e.g., wrong data, miscalculation). Failure = Observable malfunction due to an error (e.g., breakdown, crash). Not every fault leads to an error, and not every error leads to failure — robust systems prevent or contain them.
17 Types and Classifications of Faults, Errors, and Failures
Fault Categories:
- Design Faults (e.g., poor logic, miscalculations)
- Implementation Faults (e.g., coding bugs, manufacturing defects)
- Operational Faults (e.g., user mistakes, wear & tear)
- Environmental Faults (e.g., power surges, extreme heat)
- Interaction Faults (e.g., unexpected system conflicts)
Failure Types:
- Transient, Intermittent, Permanent
- Fail-Safe, Fail-Operational, Byzantine
18 Engineering Tools for Fault Modeling and Analysis
Key Techniques:
- Fault Injection: Test system resilience by introducing errors.
- Reliability Block Diagrams (RBD): Visualize component dependencies.
- Fault Tree Analysis (FTA): Trace causes of top-level failures.
- Failure Mode & Effects Analysis (FMEA): Bottom-up approach to prioritize risks using severity, occurrence, and detection scores.
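FMEA prioritization is commonly done with a Risk Priority Number, the product of the severity, occurrence, and detection scores. The sketch below ranks a few failure modes this way; the failure modes and their 1 to 10 scores are hypothetical values chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (almost always detected) .. 10 (almost never detected)

    @property
    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Highest RPN first, so the riskiest failure modes get attention first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical worksheet entries.
modes = [
    FailureMode("Loose connector", severity=4, occurrence=6, detection=2),
    FailureMode("Relief valve stuck closed", severity=10, occurrence=2, detection=6),
    FailureMode("Pressure sensor drift", severity=6, occurrence=5, detection=3),
]

for m in rank_by_risk(modes):
    print(f"{m.rpn:4d}  {m.name}")
```

Because the RPN multiplies ordinal scores, two very different risks can share the same number, so teams often review high-severity items separately regardless of rank.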
19 Applications in Engineering and Manufacturing
Manufacturing:
- Predictive Maintenance (e.g., vibration monitoring)
- Quality Control (e.g., vision inspection systems)
- Safety Systems (e.g., FMEA for pressure relief valves)
Other Domains:
- Software: Fault tolerance via redundant systems
- Civil: FTA to prevent bridge collapses
- Aerospace: Redundancy in flight systems
- Automotive: On-Board Diagnostics (e.g., error codes)
20 The Value of Reliability Engineering
Benefits of Managing Faults, Errors, and Failures:
- Higher system reliability and availability
- Improved safety and cost-efficiency
- Greater customer satisfaction and brand trust
- Data-driven design improvements
21 Takeaway A fault is the flaw, an error is its internal effect, and a failure is what the world sees. Understanding and managing these is central to engineering excellence.
22 Test Generation & Fault Simulation in Fault-Tolerant Design
Purpose:
- Ensure systems function correctly despite faults.
- Detect and mitigate faults before deployment.
Test Generation:
- Manual: Human-created test cases (flexible but limited).
- Automated (ATPG, Automatic Test Pattern Generation): Algorithm-driven, scalable for complex systems.
Testing Approaches:
- Black-box: Tests I/O behavior (no internal access).
- White-box: Uses system internals (code/logic-based testing).
Testing Focus:
- Structural Testing: Tests physical layout (e.g., gates, connections).
- Functional Testing: Verifies system meets user requirements.
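To make structural test generation and fault simulation concrete, the sketch below uses the single stuck-at fault model on a tiny circuit, y = (a AND b) OR c. It simulates every fault against every input pattern and greedily picks patterns until all detectable faults are covered. The circuit, the net names, and the greedy selection are assumptions for illustration; real ATPG tools use dedicated algorithms rather than exhaustive search.

```python
from itertools import product

NETS = ["a", "b", "c", "n1", "y"]  # n1 is the internal AND output
ALL_FAULTS = [(net, value) for net in NETS for value in (0, 1)]


def simulate(a, b, c, fault=None):
    """Evaluate y = (a AND b) OR c, optionally forcing one net to a stuck value."""
    def net(name, value):
        if fault is not None and fault[0] == name:
            return fault[1]  # stuck-at value overrides the computed value
        return value

    a, b, c = net("a", a), net("b", b), net("c", c)
    n1 = net("n1", a & b)
    return net("y", n1 | c)


def detected_faults(pattern):
    """Faults whose output differs from the fault-free output for this pattern."""
    good = simulate(*pattern)
    return {f for f in ALL_FAULTS if simulate(*pattern, fault=f) != good}


def greedy_test_set():
    """Pick patterns one at a time, each covering the most undetected faults."""
    patterns = list(product((0, 1), repeat=3))
    remaining = set(ALL_FAULTS)
    tests = []
    while remaining:
        best = max(patterns, key=lambda p: len(detected_faults(p) & remaining))
        gained = detected_faults(best) & remaining
        if not gained:
            break  # whatever is left cannot be detected by any pattern
        tests.append(best)
        remaining -= gained
    return tests, remaining


def fault_coverage(tests):
    """Fraction of modelled faults detected by at least one test pattern."""
    detected = set()
    for t in tests:
        detected |= detected_faults(t)
    return len(detected) / len(ALL_FAULTS)


tests, undetected = greedy_test_set()
print("test patterns (a, b, c):", tests)
print("fault coverage:", fault_coverage(tests))
```

For this circuit, four of the eight possible input patterns are enough to detect all ten modelled faults, which is the kind of reduction test generation aims for on larger designs.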
24 Fault Injection
Definition: Deliberately introduce faults to test system resilience and error handling.
Main Goals:
- Validate fault tolerance.
- Reveal hidden bugs.
- Test recovery logic and system stability.
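A minimal software fault-injection sketch follows. It wraps a function so that chosen calls raise an exception, then checks whether a retry-based recovery path copes. `FaultInjector`, `call_with_retry`, and `fetch_balance` are hypothetical names written for this example and do not belong to any real tool.

```python
class FaultInjector:
    """Wraps a callable and raises an injected exception on chosen call numbers."""

    def __init__(self, func, fail_on_calls, exc=ConnectionError):
        self.func = func
        self.fail_on_calls = set(fail_on_calls)
        self.exc = exc
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise self.exc(f"injected fault on call {self.calls}")
        return self.func(*args, **kwargs)


def call_with_retry(func, *args, attempts=3):
    """Recovery logic under test: retry on connection errors, then give up."""
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError("all attempts failed") from last_error


def fetch_balance(account_id):
    # Hypothetical stand-in for a remote service call.
    return {"account": account_id, "balance": 100}


# Two injected faults: the third attempt succeeds, so recovery works.
flaky = FaultInjector(fetch_balance, fail_on_calls={1, 2})
print(call_with_retry(flaky, "A1"), "after", flaky.calls, "calls")

# Three injected faults: the retry budget is exhausted.
dead = FaultInjector(fetch_balance, fail_on_calls={1, 2, 3})
try:
    call_with_retry(dead, "A1")
except RuntimeError as err:
    print("recovery failed:", err)
```

Making the injected call numbers explicit keeps the experiment repeatable, which helps when a failing recovery path needs to be debugged.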
25 Types of Fault Injection
Category | Techniques | Example
Software-Based | Code manipulation, runtime injection | Simulate memory leaks in flight booking systems
Hardware-Based | EMI, voltage variation, clock glitching | Test ECU response to simulated lightning
26 Tools, Benefits, and Limitations of Fault Injection
Popular Tools:
- Chaos Monkey: Random service shutdowns (Netflix).
- LFI, QEMU FI, DOCTOR: Software fault simulators.
- Hardware Simulators: For ASIC/FPGA fault testing.
Benefits:
- Proves fault tolerance and system robustness.
- Detects rare or unexpected faults.
- Boosts confidence in system reliability.
Challenges:
- Incomplete coverage → false confidence.
- Hardware risks (damage).
- Resource and time intensive.
27 Applications of Fault Injection Across Industries
Domain | Application
Aerospace | Test flight control under component failure
Automotive | Simulate ECU and sensor faults
Cloud Systems | Chaos engineering in production
Medical Devices | Fault resilience in life-support systems
Banking Systems | Ensure transaction rollback in failures
28 Chaos Engineering
- Chaos Engineering: Introduce faults in live or production-like environments.
- Example: Netflix’s Chaos Monkey tests microservice resilience.
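The idea behind a chaos experiment can be sketched with a toy in-memory cluster: terminate a randomly chosen replica, then check whether requests are still served. This does not use or reproduce Chaos Monkey itself; the `Cluster` class, the replica names, and the `chaos_experiment` function are hypothetical and exist only for this illustration.

```python
import random


class Cluster:
    """Toy service with several replicas and simple failover (hypothetical)."""

    def __init__(self, names):
        self.alive = {name: True for name in names}

    def handle_request(self):
        # Failover: route to the first replica that is still healthy.
        for name, up in self.alive.items():
            if up:
                return f"served by {name}"
        raise RuntimeError("no healthy replica")


def chaos_experiment(cluster, kills, rng):
    """Terminate random replicas one at a time and record whether service survives."""
    log = []
    for _ in range(kills):
        candidates = [name for name, up in cluster.alive.items() if up]
        if not candidates:
            break
        victim = rng.choice(candidates)
        cluster.alive[victim] = False
        try:
            cluster.handle_request()
            log.append((victim, True))
        except RuntimeError:
            log.append((victim, False))
    return log


# A fixed seed keeps the experiment reproducible.
rng = random.Random(0)
for victim, still_serving in chaos_experiment(Cluster(["a", "b", "c"]), kills=3, rng=rng):
    print(f"killed {victim}: service {'up' if still_serving else 'DOWN'}")
```

With three replicas the service survives the first two terminations and fails on the third, which shows how such an experiment exposes the actual limit of the redundancy in place.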
29 Hardware & Software Tools
Tool Type | Examples | Use Case
Hardware Simulation | ModelSim, Cadence | Simulate hardware faults in SoCs
Software FI | SWIFI, LLFI | Inject software bugs in apps & OS