A Design and Implementation of Cluster Heartbeat Network for Efficient Fault Detection

Ahmad Shukri Mohd Noor, Emma Ahmad Sirajudin


To achieve fault tolerance in a server cluster, fault detection capability is a primary prerequisite. Efficient fault detection is prompt, correct and complete. This paper revisits Reactive Failure Detection (RFD), a technique that dynamically predicts the heartbeat delay of a cluster node, and identifies the requirements for deploying RFD on actual servers. A new concurrent cluster heartbeat network is proposed that uses push and pull interaction for live monitoring and for determining each node's status. A prototype of the new model is tested on a platform running multiple independent web applications and analyzed for implementation and design correctness.
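
The combination of a predicted heartbeat delay with push and pull interaction can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the paper's RFD algorithm: the class name `HeartbeatMonitor`, the moving-average prediction, the `margin` and `initial_interval` parameters, and the `probe` callback are all hypothetical. The node pushes heartbeats; when one is later than the predicted deadline, the monitor pulls (probes the node) before declaring it failed.

```python
from collections import deque


class HeartbeatMonitor:
    """Illustrative push/pull monitor (hypothetical, not the paper's RFD)."""

    def __init__(self, window=5, margin=0.5, initial_interval=1.0):
        # Keep window + 1 timestamps so that `window` gaps can be averaged.
        self.arrivals = deque(maxlen=window + 1)
        self.margin = margin                      # assumed safety margin (s)
        self.initial_interval = initial_interval  # used until two samples exist

    def on_heartbeat(self, now):
        """Push path: the node reports that it is alive at time `now`."""
        self.arrivals.append(now)

    def predicted_deadline(self):
        """Latest time the next heartbeat is expected, or None if no data."""
        if not self.arrivals:
            return None
        times = list(self.arrivals)
        gaps = [b - a for a, b in zip(times, times[1:])]
        mean_gap = sum(gaps) / len(gaps) if gaps else self.initial_interval
        return times[-1] + mean_gap + self.margin

    def status(self, now, probe):
        """Return 'alive' or 'failed'.

        `probe` is the pull path: a callable that contacts the node and
        returns True if it answers.
        """
        deadline = self.predicted_deadline()
        if deadline is not None and now <= deadline:
            return "alive"
        # Heartbeat is late (or was never seen): pull before declaring failure.
        if probe():
            self.on_heartbeat(now)
            return "alive"
        return "failed"
```

In a real deployment `probe` would be a network request and each monitored node would typically be handled concurrently; both are left out here to keep the sketch deterministic.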


Heartbeat Network; Fault Detection; High Availability; Concurrency

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.

ISSN: 2180-1843

eISSN: 2289-8131