A Scalable and Flexible Monitoring System Framework for Supercomputers

Tong XIAO, Kai LU

Abstract


The demand for more powerful computing capabilities is never fully met, which drives the continuous improvement of supercomputer performance. A more powerful supercomputer tends to have a larger system scale, which poses serious challenges for system management; among these, monitoring the system’s state is a critical problem. To address this problem, this paper proposes a scalable and flexible monitoring system framework for supercomputers that can effectively and efficiently monitor systems with tens of thousands of nodes. We first give an overview of the framework and then focus on the Super Computer System Description Language (SCSDL), which is key to the framework. Finally, we explain some techniques used to implement the framework and present the client GUIs of a job monitoring system and an error monitoring system built on it for Tianhe-2. These show that the framework is scalable and flexible enough to monitor Tianhe-2, which has 16,000 nodes, effectively and efficiently.

Keywords


Monitoring system, Framework, Supercomputer, Scalable, Flexible


DOI
10.12783/dtcse/cece2017/14383
