Software Based Fault Tolerance Against Byzantine Failures

The proposed software technique is a low-cost and effective approach to designing Byzantine fault-tolerant computing applications that are not highly safety critical. It does not rely on multiple versions of software running simultaneously on multiple machines. Instead, it masks random hardware errors while an application executes by adopting the so-called ESVP (enhanced single-version program) scheme. It is not intended to eliminate software design bugs: the code is assumed to be correct, and faulty behavior is assumed to arise only from transient or Byzantine faults affecting the application system. The approach is easy to implement. The present state of a test program is also compared with its pre-computed state in order to detect state-transition faults. ESVP is intended to be suitable for computer-based process monitoring systems.


INTRODUCTION
The demand for more dependable software, and for software fault tolerance, is increasing rapidly as our high-tech society becomes increasingly dependent on computers. A processor system is said to have a Byzantine failure when a computation yields an incorrect and inconsistent answer. A Byzantine failure [1] is the most difficult type of fault that a faulty processing system can produce. Clearly, a system that can protect itself from Byzantine faults is also protected against the less severe classes of fault. In a distributed environment the cause is not necessarily a processor; it may be the communication medium, since a message sent over a network can become corrupted in some manner. If a processor returns an incorrect result, no hardware is available to reassure us that the correct result was computed. Whether a processor system returns a wrong result because of a bug or because of a malicious design, the incorrect and inconsistent computation is simply classified as a Byzantine failure. Byzantine failures do not occur often, but the consequences of incorrect results can be disastrous in real-life applications. A fault-tolerant system is one that can sustain a reasonable number of process or communication failures, of both permanent and intermittent nature. Developing software that tolerates Byzantine faults is a challenging task.

The objective of this paper is to demonstrate a low-cost design technique for gaining Byzantine fault tolerance in a computing system without using an extra processing system. It is assumed that the application program has been rigorously tested and has no software design bugs. It is also assumed that conventional hardware-implemented error detecting and correcting codes will complement the proposed approach; software implementations of such codes (e.g., Hamming codes, CRC) impose significant overhead on both run time and memory space.

The proposed enhanced single-version program (ESVP) technique uses three copies of an application program, executed one after another, each with its own similar input data set, together with a test program executed on known data. It detects Byzantine faults and hardware transient faults as early as possible, and it still delivers a correct result by masking the erroneous computed answers. It can also initiate the recovery actions needed to repair the erroneous code immediately (so that it is available at the next run), preventing disastrous consequences or further damage to the computing system; such recovery is optional here. When a system is not highly safety critical, it is difficult to justify the conventional, expensive techniques that rely on triple modular redundancy (TMR) to tolerate one fault. An ESVP with three images tolerates one fault; with two or more faults the system crashes. Although the ESVP applies time redundancy in its simplest form, it is intended as a low-cost way to design a dependable system (e.g., a commodity system) that is not highly safety critical, using an ordinary off-the-shelf computer. The scheme is not intended to eliminate software design bugs. It is intended to mask various hardware faults, including transient and Byzantine faults in the process, the processor registers or memory, during the run time of an application. It is assumed that hardware error detection mechanisms (e.g., microprocessor exceptions) handle the non-maskable errors that may occur while an ESVP application executes. We adopted three copies of an application program along with a test program and its test data. All copies run in sequence, one after another, each with its own similar set of input data (in order to mask errors in the input data), and the results collected from the executions are compared (voted upon) to mask potential errors.

This scheme has been tested on a Boiler Turbine Efficiency Computation application for the Kolaghat Thermal Power Station in India. The efficiency computation uses various online sensor data held in a buffer, and its result is displayed for further monitoring by the plant engineers, who can then take appropriate action based on a reliable efficiency figure. Manual verification over about the last six months found the computation to be very reliable. Errors of 1% to 2% were reported, attributed to communication or sensor hardware errors. As with any other software-based fault tolerance technique, the time redundancy is significant (2.65 times). We believe that a faster microprocessor will be good enough to absorb this time redundancy and compute a reliable answer for an ESVP-based application within the required response time. The space redundancy (2.75 times for the executable code) is not a significant problem nowadays, because memory, like other hardware, is becoming cheaper day by day. We also examined the ESVP scheme with a few C programs as benchmarks.

The conventional Recovery Block Scheme (RBS) and N-Version Programming (NVP) scheme depend on multiple versions of software and hardware, which makes them costlier than the ESVP; such schemes are meant to mask software design bugs. The algorithm-based fault tolerance (ABFT) and control flow checking (CFC) schemes (based on one copy with a robust data structure) lack generality and universal applicability, whereas the ESVP offers generality, wide applicability and simplicity. Both ABFT [2,11,12,13] and CFC [3,12,13] are intended to detect errors only, and they are intended to complement the intrinsic hardware error detection mechanisms (e.g., microprocessor exceptions) in a microprocessor system. The ESVP incurs time and space redundancy similar to that of the conventional fault tolerance schemes RBS [4], ABFT and CFC. NVP [4] also suffers from high synchronization overhead, because the versions of an application run concurrently on different machines (three, to tolerate one fault) and the answers are then voted upon for agreement. Like NVP, the ESVP uses comparison code to obtain the majority answer. Unlike NVP and RBS, the proposed ESVP does not rely on multiple versions of software and hardware; it relies on a single reliable version of each.

WORK DESCRIPTION
The proposed software technique does not need multiple processors or multiple software versions of the application. It needs only one reliable version of the application software, executed on a single-processor system. We need three copies (A1, A2 and A3) of the same version of the application program. One sample test program (TP) is executed on a similar processor with known inputs, and the computed result is stored in an output data file, STPR, for future reference during the run time of an application based on ESVP.

CLEI ELECTRONIC JOURNAL, VOLUME 9, NUMBER 2, PAPER 10, DECEMBER 2006

The enhanced technique thus comprises only the three copies (replicas) of the application program, A1, A2 and A3, the test program TP with its known input data, and the test program's pre-computed output data file STPR. In one run, the application copies are executed one after another with similar inputs, producing the results (states) R1, R2 and R3 from A1, A2 and A3 respectively. Where a program relies on the timing of certain events such as external interrupts, or on reading values such as the system time, these inputs need to be buffered. After A1, A2 and A3 have executed, the built-in test program is executed on its known input in the same run, and its observed run-time result (state), TPR, is compared with the expected, pre-computed state stored in STPR. The following steps explain how the proposed technique detects and tolerates Byzantine faults or hardware faults during the run time of an application without using costly multiple hardware resources. The recovery procedure is optional: for a single run of an ESVP-based application, the recovery routine need not be executed. It is required only to make the code available for subsequent use.
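The run sequence described above can be sketched in C as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the application body `app_sum`, the test program `test_program`, the buffer capacity `MAX_INPUT`, the driver `esvp_run` and the value chosen for `STPR` are all hypothetical, and a single function stands in for the three separately loaded images A1, A2 and A3.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_COPIES 3      /* A1, A2, A3 */
#define MAX_INPUT 16    /* hypothetical buffer capacity */
#define STPR 10L        /* hypothetical pre-computed result of TP on its known input */

typedef long (*app_fn)(const int *input, size_t n);

/* Hypothetical application body: sums n buffered readings. */
static long app_sum(const int *input, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += input[i];
    return total;
}

/* Hypothetical built-in test program TP: runs on a fixed, known input. */
static long test_program(void)
{
    static const int known_input[4] = {1, 2, 3, 4};
    return app_sum(known_input, 4);
}

/* Runs the three copies one after another, each on its own buffered copy of
   the input, then runs TP. Stores R1..R3 in results and returns 1 when the
   observed TPR equals the stored STPR, 0 otherwise. */
int esvp_run(const int *input, size_t n, long results[N_COPIES])
{
    /* Assumption: in a real deployment these would be three separate images. */
    const app_fn copies[N_COPIES] = {app_sum, app_sum, app_sum};
    int buffers[N_COPIES][MAX_INPUT];

    assert(n <= MAX_INPUT);
    for (int c = 0; c < N_COPIES; c++) {
        memcpy(buffers[c], input, n * sizeof buffers[c][0]);
        results[c] = copies[c](buffers[c], n);
    }
    return test_program() == STPR;
}
```

A caller would pass the buffered sensor readings to `esvp_run`, then hand the three results and the test-program verdict to the comparison (voting) code.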
The basic steps involved in an ESVP Scheme with three images are stated below.

DISCUSSION
The above steps are self-describing and easy to understand. We execute the three copies of the application program one after another, each with its own similar set of input data (to tolerate errors in the input data), compare the three corresponding outputs, and select the output agreed upon by voting as the correct answer of the ESVP-based application. When all three outputs differ from one another (an instance with faults in two or three copies), it may be a case of Byzantine failure that results in a system crash. Likewise, when the run-time result of the test program (TPR) does not match the pre-stored expected result (STPR), we should also consider it a Byzantine failure. In the most disastrous case (as stated in the first step of the algorithm), when the outputs of A1, A2 and A3 from one application run do not match one another and the run-time test program result TPR does not match the expected one in STPR, we need to recover the code of A1, A2, A3 and TP, together with STPR, by reloading them from stable storage, and then restart the application program for re-computation (i.e., a fail-safe kind of fault tolerance).
For other related recovery work, interested readers may refer to the software techniques discussed in [5,8,9,10]. In the event of a single Byzantine fault, however, we do not need to restart the application for re-computation: we can carry on with the correctly computed result (i.e., the majority answer), and the faulty program code is then repaired by copying from the master copy on stable storage for later use. When a restart is not wanted, byte-by-byte recovery can be carried out using the algorithm discussed in [6]. The proposed approach is also suitable for tolerating random errors in the application code.
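The comparison and recovery decisions discussed above can be sketched as a small voting routine. This is a minimal sketch, not the paper's step listing: the names `esvp_vote`, `esvp_status` and `esvp_outcome` are hypothetical, results are modelled as single `long` values, and it is an assumption that any mismatch between TPR and STPR triggers a reload and restart even when two application copies agree.

```c
#include <assert.h>

typedef enum {
    ESVP_OK,       /* R1, R2, R3 agree and TPR matches STPR */
    ESVP_MASKED,   /* one copy disagrees: majority answer used, copy flagged for repair */
    ESVP_RESTART   /* no majority, or TPR != STPR: reload from stable storage and re-run */
} esvp_status;

typedef struct {
    esvp_status status;
    long result;       /* majority answer; meaningless when status is ESVP_RESTART */
    int faulty_copy;   /* 0, 1 or 2 for A1, A2 or A3; -1 when no single copy is flagged */
} esvp_outcome;

/* Votes on the three results and checks the test program state. */
esvp_outcome esvp_vote(long r1, long r2, long r3, long tpr, long stpr)
{
    esvp_outcome out = {ESVP_RESTART, 0, -1};

    /* Assumption: a test-program mismatch alone is treated as a Byzantine failure. */
    if (tpr != stpr)
        return out;

    if (r1 == r2 && r2 == r3) {
        out.status = ESVP_OK;
        out.result = r1;
    } else if (r1 == r2) {
        out.status = ESVP_MASKED;
        out.result = r1;
        out.faulty_copy = 2;   /* A3 disagrees */
    } else if (r1 == r3) {
        out.status = ESVP_MASKED;
        out.result = r1;
        out.faulty_copy = 1;   /* A2 disagrees */
    } else if (r2 == r3) {
        out.status = ESVP_MASKED;
        out.result = r2;
        out.faulty_copy = 0;   /* A1 disagrees */
    }
    /* All three differ: out keeps its initial ESVP_RESTART value. */

    assert(out.faulty_copy >= -1 && out.faulty_copy < 3);
    return out;
}
```

With `ESVP_MASKED`, the caller can continue with `result` and later repair the flagged copy from the master copy; with `ESVP_RESTART`, it reloads A1, A2, A3, TP and STPR before re-computing.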
In general, we need (f + 2) replicas of the application program in order to tolerate f hardware or memory faults. Based on a Bayesian statistical approach, the probability of uncertainty for an application adopting this ESVP approach is the ratio (3*Tc + Tv) / (5*(3*Tc + Tv)), or only 0.2, over an application-execution session, where Tc denotes the execution time of one copy (replica) and Tv denotes the time for comparing the answers. Here, we do not carry out the recovery work for a single fault in any copy of the application.
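As a check on the arithmetic, and assuming the denominator of the ratio above is five times (3 Tc + Tv), the common factor cancels, so the quoted value does not depend on the actual values of Tc and Tv. The origin of the factor 5 is not derived in the text. With f = 1, the replica rule gives the three copies used here.

```latex
\[
\frac{3T_c + T_v}{5\,(3T_c + T_v)} = \frac{1}{5} = 0.2,
\qquad
f + 2 = 1 + 2 = 3 \quad \text{for } f = 1 .
\]
```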

EXPERIMENTAL RESULTS
We have also tested a C program that bubble sorts one hundred integers, as a benchmark, to evaluate the feasibility and effectiveness of the ESVP approach. We manually modified the source code of the following two programs according to the proposed scheme.
a) The approach has been applied to the bubble sort algorithm with 12 integer elements. The source code size increases by a factor of 3.82. The executable code size increases by a factor of 2.72 and performance slows down by a factor of 2.6.
b) The approach has also been applied to the multiplication of two 8 x 8 integer matrices. The source code size increases by a factor of 3.69. The executable code size increases by a factor of 2.79 and performance slows down by a factor of 2.7.
We used a Motorola 68040 processor and the SDS Single Step 7.4 compiler. An average slow-down (time redundancy) of about 2.65 times was observed.
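As a quick consistency check, the quoted average slow-down matches the arithmetic mean of the two slow-down factors reported in (a) and (b). That the average was taken as a simple mean of these two benchmarks is an assumption. The same mean over the two executable-size factors is shown alongside for comparison.

```latex
\[
\frac{2.6 + 2.7}{2} = 2.65,
\qquad
\frac{2.72 + 2.79}{2} = 2.755 .
\]
```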
We adopted the fault model of a single bit flip in a memory location, using an environment [7] built around an application board hosting a 25 MHz Motorola 68040 processor, 2 MB of RAM and some peripheral devices. Fault injection was performed with an ad hoc hardware device that monitors the execution of the application program and triggers a fault injection procedure when a given point is reached. We injected 1000 bit flips into one of the application copies and observed that, on average, 45% of the bit-flip errors were masked by the ESVP scheme. Hardware error detection mechanisms (e.g., microprocessor exceptions) detected about 34% of the errors. The remaining 21% were classified as fail silent (causing no difference in the program behavior).
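The single-bit-flip fault model can be illustrated in software with a few lines of C. This is only a sketch of the fault model: the experiments above used an ad hoc hardware injection device, and the function name `flip_bit`, its parameters and the bit-numbering convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flips exactly one bit of a memory image in place.
   bit_index counts from the least significant bit of byte 0 (assumed convention). */
void flip_bit(uint8_t *image, size_t image_len, size_t bit_index)
{
    assert(bit_index < image_len * 8);
    image[bit_index / 8] ^= (uint8_t)(1u << (bit_index % 8));
}
```

Applying `flip_bit` twice with the same index restores the original image, which makes it convenient for repeated injection runs against one copy of the application.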

CONCLUSION
The proposed ESVP technique is a low-cost but effective approach to designing a Byzantine fault-tolerant system. It is a generalized scheme that does not rely on conventional, costly multiple versions of the application software and hardware, and it is easy to implement. We believe that the significant run-time and space overhead can easily be afforded with a faster, modern microprocessor, gaining higher dependability in a computing system at no extra cost for multiple versions of hardware and software. The ESVP approach does not lack generality or applicability. In particular, system engineers might opt for this low-cost ESVP scheme in various computer-based process monitoring applications. The approach has also been implemented and tested on a thermal power plant application and is trusted there as a reliable low-cost solution.