# MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Siva Kumar Sastry Hari, Manlap Li, Pradeep Ramachandran, Byn Choi and Sarita Adve University of Illinois, Urbana-Champaign GSRC Annual Symposium Sep 29-30, 2008 Resilient Theme, Task # 1.2.2.5 ### Motivation SoftWare Anomaly Treatment (SWAT) effective for HW faults in single-threaded apps > Low SDC rate of 0.2% (dedicated poster) But multicore systems w/ multithreaded apps here to stay Does the SWAT approach work for multicore? # **Key Challenge: Cross-Core Fault Propagation** ### Multithreaded apps share data across threads - ⇒ Fault may propagate across cores - ⇒ Is SWAT effective in detecting these faults? - ⇒ Symptom causing core is no longer faulty Implicit assumption in prior SWAT work Need to detect fault and diagnose faulty core ### **MSWAT Fault Detection** # **Symptom Detection** Fatal Traps, Hangs, High OS, Kernel Panic, No-Forward-Progress ### **Key Results** >99% of unmasked permanent faults detected Low SDC rate of 0.2% of injected faults Several detections from fault-free cores ### **MSWAT:** Diagnosis Challenges and Approaches # **MSWAT Fault Diagnosis Algorithm** ### Capture fault activating trace Native execution ⇒ No added support for replay Record inputs to each thread (loads) for replay Low hardware overhead for buffering ### **Re-Execute Captured Trace** Firmware emulated isolated deterministic replay ⇒ Zero hardware overhead Compare retiring mem/ctrl instructions for divergence ⇒ Less comparisons ### Iterative Diagnosis to reduce overheads E.g., capture replay every 100k instructions till divergence ### **Diagnosis Results** >95% of detected faults successfully diagnosed µarch non-determinism ⇒ undiagnosed faults 97% faults diagnosed in <10m cycles <10ms on a 1GHz processor ⇒ invisible 93% diagnosed in 1 iteration w/ 100k instructions <200kB logs ⇒ fit in lower level caches #### **Conclusions and Future Work** SWAT effective even for multicore systems with multithreaded apps ### **Future Work** Studying SWAT for faults in memory and off-core components **Distributed Client/Server Applications** Prototyping SWAT on FPGA in collaboration with University of Michigan Faulty trace is a test vector ⇒ Apply SWAT for post-silicon debug and test