In December 2023, we embarked upon an ambitious initiative to develop a comprehensive digital twin of the Frontier supercomputer. This twin includes 3D asset modeling with virtual and augmented reality capabilities, telemetry data assimilation, AI/ML integration, simulations, and reinforcement learning for optimization. The initial goal was to develop four main modules:
- A transient simulation of the thermo-fluid cooling system from cooling tower to cold plate.
- A resource allocator and power simulator, which models workloads and the resulting dynamic power, along with energy conversion losses.
- A visual analytics module consisting of an augmented reality model based on Unreal Engine 5 and a web-based dashboard for launching experiments.
- A network digital twin to study dynamic network power and congestion.
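To make the idea behind the power simulator module more concrete, here is a minimal sketch of how workload utilization might map to dynamic power and conversion losses. It is not taken from ExaDigiT: the linear utilization-to-power model, the single fixed conversion efficiency, and every name and number (`NodePowerModel`, `idle_w`, `peak_w`, `system_power`, `conversion_efficiency`) are hypothetical assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class NodePowerModel:
    """Hypothetical linear power model for one compute node (watts)."""
    idle_w: float = 500.0    # assumed idle power, illustrative only
    peak_w: float = 2500.0   # assumed peak power, illustrative only

    def power(self, utilization: float) -> float:
        # Clamp utilization to [0, 1], then interpolate idle -> peak.
        u = min(max(utilization, 0.0), 1.0)
        return self.idle_w + u * (self.peak_w - self.idle_w)


def system_power(
    utilizations: Iterable[float],
    model: NodePowerModel,
    conversion_efficiency: float = 0.95,  # assumed fixed efficiency
) -> Tuple[float, float]:
    """Return (power delivered to nodes, conversion loss), both in watts."""
    if not 0.0 < conversion_efficiency <= 1.0:
        raise ValueError("conversion_efficiency must be in (0, 1]")
    node_power = sum(model.power(u) for u in utilizations)
    # Power drawn upstream of the converters exceeds the delivered power.
    input_power = node_power / conversion_efficiency
    return node_power, input_power - node_power


if __name__ == "__main__":
    delivered, loss = system_power([0.2, 0.9, 1.0], NodePowerModel())
    print(f"delivered: {delivered:.1f} W, conversion loss: {loss:.1f} W")
```

A real simulator would likely replace the fixed efficiency with a load-dependent curve and drive the utilizations from scheduled jobs over time; the sketch only shows how the two quantities relate.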
Once we were able to model Frontier, we set out to generalize these modules into a framework called ExaDigiT for modeling a variety of supercomputer architectures. This digital twin framework offers insights into operational strategies and “what-if” scenarios, and elucidates complex, cross-disciplinary transient behaviors. It also serves as a design tool for future system prototyping. ExaDigiT is built on an open software stack (Modelica, SST Macro, Unreal Engine) with the aim of fostering community-driven development, and we have formed a partnership with supercomputer centers around the world to develop an open framework for modeling supercomputers. The source code is available here:
For more information, contact Wes Brewer at brewerwh@ornl.gov.