Real-time Fault Detection and Stability Enhancement Mechanism Based on Large Models

Authors

  • Chuanyong Zhao, Beijing Didichuxing Technology Development Co., Ltd.
  • Yuan Xi, Beijing Dongchezhu Technology Co., Ltd.

DOI:

https://doi.org/10.62677/IJETAA.2502132

Keywords:

Large language models, Fault detection, Real-time monitoring, Stability enhancement, Self-supervised learning

Abstract

This paper proposes a framework for real-time fault detection based on large models. The framework rapidly identifies potential faults by analyzing system logs and operational data, and then triggers stability enhancement mechanisms. We design a self-supervised learning algorithm that enables the large model to continuously improve its detection accuracy in dynamic environments. The method adopts a transformer architecture with an attention mechanism to capture temporal dependencies and complex patterns in system behavior. A comparison with traditional methods validates the advantages of the proposed method in accuracy and real-time performance. Experiments across various fault scenarios show that the method reduces fault response time in high-concurrency systems, decreasing average detection latency by 47.3% and shortening system recovery time by 35.8%, thereby improving overall stability. In addition, the self-supervised nature of the method enables continuous adaptation to new fault patterns, providing an innovative solution for reliability assurance in distributed systems.
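
As an illustration of how an attention mechanism can relate events within a window of system logs, the following minimal sketch computes scaled dot-product self-attention in plain Python. It is not the authors' model: the function names, the identity projections (queries, keys, and values all equal to the input), and the 2-dimensional event embeddings are assumptions made only for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with identity projections.

    Assumption for illustration: Q = K = V = seq, with no learned weights.
    Returns the context vectors and the attention weight matrix.
    """
    d = len(seq[0])
    context, weights = [], []
    for q in seq:
        # Similarity of this event to every event in the window
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all event vectors
        context.append([sum(w_j * v[i] for w_j, v in zip(w, seq)) for i in range(d)])
    return context, weights


# Hypothetical embeddings for four consecutive log events;
# the last one differs from the rest.
window = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
context, attn = self_attention(window)
```

In this toy window, the first event attends more strongly to the events that resemble it than to the dissimilar last event. A trained model would learn the projections and could, for example, use such weights or a reconstruction error as an anomaly signal.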

Published

2025-03-31

Section

Research Articles

How to Cite

[1]
C. Zhao and Y. Xi, “Real-time Fault Detection and Stability Enhancement Mechanism Based on Large Models”, ijetaa, vol. 2, no. 2, pp. 1–12, Mar. 2025, doi: 10.62677/IJETAA.2502132.
