This use case describes the total expected compute needs for the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) by 2029, based on the most recent estimates.
Background and objectives
The success of the physics program of the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) critically depends on having sufficient computing resources to process, store, and analyze collision and simulated data samples in a timely manner. This will be no less true in the HL-LHC era starting in 2029. Early estimates from 2017 and 2018 projected the resource needs to be several times greater than what the project's budget could provide, assuming a flat funding profile.
Today, HPCs are used in production by CMS for all kinds of central workflows, including event generation, simulation, digitization, pileup mixing, reconstruction, and the creation of the MiniAOD and NanoAOD analysis data formats. During the past years, HPC machines have contributed significantly to the processing of the Run 2 data as well as to the generation of the related Monte Carlo samples: between five and ten percent of the total computing power dedicated to that activity came from HPCs. CMS extrapolations to Phase-II show a very steep dependence of computing needs on the number of simultaneous collisions per bunch crossing (pileup, PU).
The reconstruction task executes all of the algorithms needed to interpret signals as being due to the interaction of identifiable particles with the detector. The reconstruction time per event increases superlinearly with the number of pileup interactions per bunch crossing, due to the combinatorial nature of the most resource-intensive algorithms (PU of 52 in 2023, 140 for Run 4, 200 for Run 5).
Timescale and scope
The following describes projections for Runs 4 and 5 of the LHC; Run 4, and with it the HL-LHC era, is expected to begin in 2029.
Scientific and technical challenges
Looking ahead to the HL-LHC (High-Luminosity Large Hadron Collider), the data rates and data complexity caused by the increased luminosity will grow dramatically. This places severe pressure on the current CMS computing infrastructure, which will simply not be sufficient to meet the demands of the experiment in the HL-LHC era.
The ability to fully utilize large allocations on HPCs depends on having a submission infrastructure capable of scaling up to a sufficient number of tasks and execution slots in HTCondor. This implies work on federated access to HPC sites, standardized transfer tooling and policies, and common service and access methods.
Scientific user community
The CMS experiment is a general-purpose particle physics detector at the LHC at CERN. CMS comprises over 4000 particle physicists, engineers, computer scientists, technicians, and students from around 240 institutes and universities in more than 50 countries. The collaboration operates and collects data from the CMS detector, one of the general-purpose particle detectors at CERN's LHC. Collaborators from all over the world helped design and fabricate components of the detector, which were brought to CERN for final assembly. Data collected by CMS are shared with several computing centres via the Worldwide LHC Computing Grid (WLCG). From there, they are distributed to CMS institutions in over forty countries for physics analysis.
Technical Details
Storage
Storage use consists of software, scratch, and input/output data. Access to calibration data and CMS software is normally provided via the CERN Virtual Machine File System (CVMFS), deployed as a service at a compute site and mounted on each node (or alternatively provided from the local network file system). Nodes generally require 20-50 GB of CVMFS cache.
Low-latency scratch space for actively running jobs is very useful for HEP workflows, which often deal with many small configuration files. Requirements vary by job type, but are typically less than 2 GB of scratch space per core.
Input/output data storage is highly dependent on the workload (for example, event generation has no input data). Event reconstruction (RECO) jobs, which process RAW datasets, are the most demanding. Typical RAW input datasets are a small fraction of the total produced, O(PB). Derived output datasets (results) are highly reduced, O(MB-GB). Dataset files typically contain many thousands of events per file and are binned to a common size (e.g. a 1 TB dataset split into 200 GB segments).
The availability of outbound internet access from worker nodes to experiment-owned storage at other sites (either directly or through a proxy edge service) allows input data reads and output data writes from jobs to bypass the local storage available at the HPC, at the cost of higher loads on networks and on the storage at CMS sites.
Event type | Volume | Size per event | Total
RAW | 34 × 10⁹ events | 4.30 MB | 146.20 PB
MC | 85 × 10⁹ events | 0.18 MB | 6.12 PB
Annual data volume by data type
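The totals above are the product of the event count and the per-event size. A minimal Python cross-check of the RAW row; the only assumption is the decimal convention 1 PB = 10⁹ MB:

```python
# Cross-check of the RAW row above: total volume = number of events x size per event.
# The only assumption is the decimal convention 1 PB = 1e9 MB.

raw_events = 34e9       # RAW events per year (from the table)
raw_size_mb = 4.30      # MB per RAW event (from the table)

total_pb = raw_events * raw_size_mb / 1e9
print(f"RAW annual volume: {total_pb:.2f} PB")  # -> 146.20 PB
```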
Input datasets are ideally staged on shared storage for a period of weeks to months, until the jobs corresponding to that dataset are fulfilled. They are then evicted and a new dataset is fetched. If there is insufficient quota for long-term rotating staging, a smaller cache space may be used and data streamed from external (WAN) storage pools, although this impacts throughput and scalability. There is no long-term retention expectation for input datasets.
Output data is transferred to external (WAN) storage as soon as is reasonable. It may be temporarily staged on local/shared storage while awaiting transfer services.
There is no foreseen retention period once results are verified as successfully transferred.
No long-term archival or backup is foreseen at HPC sites; all source data is already duplicated and archived via WLCG resources.
Data is embargoed for physics analysis for a period of years, after which all data is publicly available. This is described in the section on Open Science and Open Data below.
Input data will dominantly be read sequentially, and the Analysis Object Data (AOD) formats permit the use of object storage technologies. Jobs are independent and may therefore generate a high load at startup, when the required software and conditions data are loaded. There is no restriction on how many jobs may read from the same input data concurrently; each job writes its own independent output.
CMS event storage formats are RAW/AOD/MiniAOD/NanoAOD, ranked by projected size.
CMS projected annual total events by type for Runs 4 and 5.
Data transport
Transfer throughput is expected to increase from 2.5 MB/s/core in 2023 to an upper bound of 5 MB/s/core in 2029. CMS does not currently require low-latency storage, but depending on the size of the allocation at the HPC center, i.e. the number of cores, throughput from storage can become a bottleneck.
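To make the scale concrete, the sketch below converts the per-core figures into aggregate throughput for a hypothetical allocation; the 50,000-core allocation size is an illustrative assumption, not a CMS requirement.

```python
# Aggregate throughput implied by the per-core figures above.
# The allocation size is an illustrative assumption.

cores = 50_000                         # hypothetical HPC allocation (assumption)
rates = {2023: 2.5, 2029: 5.0}         # MB/s per core, from the text

for year, mb_per_s_per_core in rates.items():
    total_gb_s = cores * mb_per_s_per_core / 1000   # MB/s -> GB/s
    print(f"{year}: {total_gb_s:.0f} GB/s aggregate ({total_gb_s * 8:.0f} Gb/s)")
```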
Large LHC data transfers should ideally be routed through private research networks (such as GÉANT) and avoid the public internet, to guarantee acceptable latency, bandwidth, and overall quality of service. This is achieved through the various national research and education networks, which are interconnected worldwide into the LHCONE and LHCOPN networks that serve the WLCG.
Transfer volume scales approximately linearly with the number of jobs. Transfer sizes will increase in 2030 and again in 2035, following the LHC schedule, as event size, complexity, and rate increase.
For running jobs with pre-placed datasets and the availability of edge services and caches such as CVMFS, remote data streaming can be reduced to small conditions data and job-management communication (a few Mbps per node). If data is not local, it can be streamed via transfer protocols, entirely bypassing the local storage system, at a much higher transfer requirement (5 MB/s/core). On heavily restricted sites with no outbound node connectivity, this has been worked around to varying degrees via message passing over the shared file system.
For the volume of datasets required, SSH-based tools will be insufficient. Rucio is used by CMS for orchestrating transfers between Data Transfer Nodes (DTNs) over a variety of protocols and services, including FTS, S3, and Globus (although Globus is not preferred). XRootD is the preferred protocol for streaming job data on demand. CMS is following new projects such as SENSE/AutoGOLE and NOTED, and is leveraging modern advancements in SDN, to further reduce transfer requirements and increase throughput.
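As an illustration of how a dataset could be staged to HPC-attached storage, the sketch below uses the Rucio Python client; the dataset name and the RSE name are hypothetical, and the exact client options may vary between Rucio versions.

```python
# Minimal sketch: ask Rucio to replicate one dataset to an HPC storage endpoint.
# The DID and the RSE name ("HPC_SITE_DISK") are hypothetical; the client is
# assumed to be configured and authenticated in the usual Rucio environment.
from rucio.client import Client

client = Client()

dids = [{"scope": "cms", "name": "/Example/Run4-RAW/v1"}]   # hypothetical dataset DID
rule_ids = client.add_replication_rule(
    dids=dids,
    copies=1,
    rse_expression="HPC_SITE_DISK",    # hypothetical Rucio Storage Element
    lifetime=60 * 24 * 3600,           # evict after ~60 days, matching the rotating-staging model
)
print("created rules:", rule_ids)
```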
Compute
Batch submission, data-driven parallel computing.
Primarily x86, with support for ARM and IBM POWER; RISC-V is possible once production hardware matures. GPU support is primarily CUDA, with ports to AMD and Intel GPUs in progress. CMS is investing heavily in replacing the most compute-expensive CPU functions with AI-based approaches for several operations.
CMS workloads consist of several chainable steps that may begin with collision event data (real or simulated) and result in analysis datasets. All steps, or only a subset, may be performed in a single job with no intermediate files, as the software stack is common to all stages.
Jobs consist of a batch of events to be processed. In 2023, computing is benchmarked on the basis of:
- CMS GEN-SIM (MC production), 4 threads: 1 GB/thread, 100 events/thread
- CMS DIGI (simulation digitization), 4 threads: 2 GB/thread, 100 events/thread
- CMS RECO (reconstruction), 4 threads: 2 GB/thread, 100 events/thread
Multiple jobs are started until all CPU cores are consumed; for example, 64 four-thread jobs fill a modern 256-core node, as illustrated in the sketch below.
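A simple packing calculation based on the benchmark job shapes above illustrates how jobs map onto a node; the node size used here (256 cores, 512 GB of RAM) is an example configuration, not a requirement.

```python
# Node-packing sketch using the benchmark job shapes above.
# The node size (256 cores, 512 GB RAM) is an example, not a requirement.

node_cores = 256
node_mem_gb = 512
threads_per_job = 4
mem_per_thread_gb = 2          # DIGI/RECO figure; GEN-SIM needs only 1 GB/thread

jobs_by_cores = node_cores // threads_per_job
jobs_by_memory = node_mem_gb // (threads_per_job * mem_per_thread_gb)
jobs_per_node = min(jobs_by_cores, jobs_by_memory)

print(f"core-limited:   {jobs_by_cores} jobs")
print(f"memory-limited: {jobs_by_memory} jobs")
print(f"schedulable:    {jobs_per_node} jobs per node")   # 64 on this example node
```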
From the CMS detector, RAW events will be recorded at around 5,000 per second. For reconstruction jobs, 34 billion RAW events and another 85 billion simulated events are expected to be processed in 2029, with several hundred thousand simultaneous jobs running 24/7 to meet this need.
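A back-of-envelope estimate reproduces this scale; the average CPU time per event used below is an illustrative assumption only, not a CMS benchmark figure.

```python
# Order-of-magnitude check of the required concurrency.
# The average CPU time per event (60 s) is an illustrative assumption only.

raw_events = 34e9
mc_events = 85e9
cpu_s_per_event = 60                   # assumption for illustration
seconds_per_year = 365 * 24 * 3600

total_cpu_s = (raw_events + mc_events) * cpu_s_per_event
busy_slots = total_cpu_s / seconds_per_year
print(f"~{busy_slots:,.0f} continuously busy execution slots")   # order of 1e5
```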
HS06 (HEPSpec06) accounting depends on the CPU: 2023 CPUs deliver up to 16 HS06 per core, and 2029 CPUs may be closer to 20, depending on R&D.
Within the same pileup (PU) schedule delivered by the LHC, compute requirements will scale linearly with event production. Large step increases will occur when the pileup rises from 52 in 2023 to 70, 140, and then 200, as the complexity of the track reconstruction algorithms in RECO grows more than linearly with pileup, making tracking one of the main consumers of compute resources in the HL-LHC environment.
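The toy model below illustrates why these pileup steps dominate the projection; the power-law exponent is an illustrative stand-in for the "more than linear" growth of the tracking algorithms, not a measured CMS scaling.

```python
# Toy illustration of superlinear RECO cost growth with pileup.
# The exponent is an assumption (anything > 1 is "more than linear");
# real CMS projections come from detailed workflow measurements.

pu_reference = 52          # 2023 running conditions
exponent = 1.5             # illustrative assumption

for pu in (70, 140, 200):
    relative_cost = (pu / pu_reference) ** exponent
    print(f"PU {pu:3d}: ~{relative_cost:.1f}x the per-event RECO cost at PU {pu_reference}")
```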
The majority of the CMS offline software (CMSSW) is written in C++, with components in Fortran and Python. The CMS software framework is already capable of running production workflows on GPUs, offloading work to anything external to a CPU thread, whether an accelerator or another process. GPU algorithms currently under development for 2029 are written in a mixture of Python, CUDA, and GPU-portable frameworks.
Software is generally read from CVMFS, or provided via container or local filesystem.
Workflow Management
Workflow description
For the purposes of this example use case, the description below covers central production workflows rather than user (physicist) submitted workflows, although the general principle is similar.
CMS employs heavily automated workflow management to coordinate roughly 500k CPUs across a global batch system. CMS computing jobs execute on independent, federated compute nodes via a pull mechanism (in contrast with the SLURM push paradigm): a (partially) empty node first runs a pilot step (via GlideinWMS) that reports the available resources back to the federated job pool (HTCondor). A job is then pulled by the node according to matching resources, and execution begins. HTCondor also provides remote job monitoring and reporting, as well as dynamic resizing of resources.
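As an illustration of the pull model, the sketch below submits a pilot-style wrapper to an HTCondor pool using the htcondor Python bindings; in production, GlideinWMS generates and submits the pilots, and the script name, resource request, and pool details here are hypothetical.

```python
# Illustrative pilot-style submission via the HTCondor Python bindings.
# In production, GlideinWMS generates and submits the pilots; the wrapper
# script, resource request, and pool details below are hypothetical.
import htcondor

pilot = htcondor.Submit({
    "executable": "pilot_wrapper.sh",          # hypothetical pilot script
    "arguments": "--pool cms-global-pool",     # hypothetical pool name
    "request_cpus": "8",
    "request_memory": "16GB",
    "output": "pilot_$(ClusterId).$(ProcId).out",
    "error": "pilot_$(ClusterId).$(ProcId).err",
    "log": "pilot.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(pilot, count=4)         # start four pilots
print("submitted cluster", result.cluster())
```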
CMS compute steps are capable of running as a chained pipeline within the same node, potentially eliminating the need for intermediate storage transfers between stages. In all cases, the resulting job output is transferred to the collector.
In general, CMS workflows are neither latency-dependent nor time-critical; however, in high-pileup runs there is a risk of data production exceeding the computing capacity available to process it, resulting in a several-year delay for physics analysis (or a reduced precision of measurements).
Job definitions are simple bash scripts which contain the necessary CMSSW commands and input data references.
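The CMSSW commands referenced by such a script take Python configuration files as input. A minimal, purely illustrative configuration fragment is sketched below; the input file name and output module content are hypothetical placeholders, and real production configurations are generated by the workflow system and include the full processing sequences.

```python
# Minimal, illustrative CMSSW-style configuration (FWCore).
# The input file and output name are hypothetical placeholders; real
# production configurations are generated by the workflow system and
# include the full processing sequences.
import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")

process.maxEvents = cms.untracked.PSet(input=cms.untracked.int32(100))
process.source = cms.Source(
    "PoolSource",
    fileNames=cms.untracked.vstring(
        "root://xrootd.example.org//store/data/Run4/Example/RAW/file.root"
    ),
)

process.out = cms.OutputModule(
    "PoolOutputModule",
    fileName=cms.untracked.string("reco_output.root"),
)
process.ep = cms.EndPath(process.out)
```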
Workflows need WAN access, ideally from the compute node, to communicate job monitoring and scheduling information and to pull external datasets. This may entail a dedicated or virtualized DTN service, both for file transfer (via FTS, S3, etc.) and as an XRootD-based DTN for on-demand streams. Squid proxy caching services further reduce the cost of these transfers. CMS jobs expect CVMFS, ideally deployed as an edge service (either a dedicated machine or via Kubernetes) and mounted on each node. User namespaces provided by the kernel allow portable CVMFS access on sites where it has not been installed, at a higher cost in efficiency and networking. Workloads can execute in Apptainer by this same mechanism.
Access and Analysis
How do you interact with the resources?
Compute sites are foreseen to be fully automated, apart from initial setup. A small amount of compute time may be used interactively, most likely via a web interface (Jupyter notebooks), for analysis and development tasks by users.
Do you need access to interactive facilities for computational steering?
Interactive facilities are not needed for computational steering, but physicists often use interactive jobs, e.g. with Jupyter notebooks, for analysis.
Will data be analyzed in-situ, or transferred to another site (e.g. local workstation) for analysis? In the latter case, how much data will be transferred?
All batch job results will be transferred off-site to CMS collection points. Results of interactive jobs are normally not downloaded to local workstations when they are large, as efficient transfer mechanisms to CMS storage are already in place.
If analysing data in situ, which capabilities are required (e.g remote visualization, Jupyter notebook, shell access)?
The largest request is for Jupyter notebooks via web interface.
Non-technical challenges
How do you or your collaboration obtain access to computational and data storage resources?
Access to CMS computational and data storage resources is typically obtained through institutional affiliations, collaborations, or project-specific agreements. User certificates and tokens are used for authentication.
How many individuals need to access the resources? What sort of authorization system do you rely on?
The CMS batch system is fully automated. Jobs are placed by the batch system on behalf of an individual or service authenticated via a token or certificate issued under the Interoperable Global Trust Federation (IGTF) framework.
Do you provide or make use of training, assistance, or support facilities provided by the e-infrastructure?
Any resources that individuals have access to are supported by both CMS and CERN training services and support ticket systems.
Is your data covered by data protection or other regulatory frameworks (e.g. GDPR, AI Act)?
Event data is not subject to GDPR, the AI Act, or similar regulatory frameworks.
Gap Analysis
Work with HPC providers to:
- develop a plan for providing multi-year allocations for storage and compute such that HPC resources can be incorporated into CMS planning horizons. [2]
- develop a plan for providing large storage allocations that can federate with existing CMS data management software, such that HPC storage can be used in a manner similar to how WLCG site storage is used today (i.e., as Rucio Storage Endpoints). [2]
- standardize on a technology for providing a facility API that permits both interactive and automated access to manage job workflows and data, in order to integrate better with CMS and other large-scale multi-institutional scientific collaborations. If such an API were standardized, it would require some significant R&D up-front to integrate with our respective workflow management systems but would ultimately reduce the costs of initial integration and ongoing operations in the long run. [2]
Furthermore, HPC facility security models are increasingly moving toward a posture in which multi-factor authentication (MFA) is the only acceptable method of user authentication at a site. While this model works well for individual (or small groups of) researchers submitting tasks to an HPC by hand via an interactive shell, the process by which CMS workloads are submitted to computing resources would ideally be entirely automated, modulo initial setup. Sites with restrictive MFA policies that require a human interaction component make automation extremely difficult. As such, there is a general, demonstrable need to broaden acceptable authentication factors in HPC site policies to include some additional authentication element that does not preclude automation, while still meeting the security needs of these facilities. [2]
While the largest HPC sites generally provide dedicated Data Transfer Nodes (DTNs), typically only the proprietary Globus Online service is supported for large-scale transfers. While Globus is appropriate for many of the users that HPC facilities serve, LHC experiments instead rely on the XRootD protocol and SciToken-based authentication for large-scale transfer of data between and within WLCG computing sites. Working together with HPC facilities to provide an XRootD-based DTN for use by LHC experiments would be of great help. If successful, this approach would significantly simplify integrating HPCs into the CMS data federations, reducing unnecessary proxy transfers via the WAN while making the facility appear more similar to other WLCG sites. [2]
Due to queue times in the HPC batch system, pilot jobs often remain pending for some time before they run, meaning that the CMS jobs that originally triggered them will have run elsewhere in the meantime. As long as CMS can maintain a continuous job queue for HPC resources, this does not matter: the pilots will simply execute other CMS jobs targeted at the HPC. This does not necessarily mean that utilizing HPC resources requires extra effort, but it does mean that, in this model, good HPC utilization depends on a sufficiently large fraction of the workflow mix in the system at any given time being able to run on HPC. [2]
Open Science and Open Data
The CMS Collaboration recognizes the unique nature of CMS data and is committed to preserving its data and to allowing their re-use, long after the data are taken, by a wide community: collaboration members, experimental and theoretical HEP scientists who were not members of the collaboration, educational and outreach initiatives, and citizen scientists in the general public.
For the widest possible re-use of the data, while protecting the Collaboration's liability and reputation, data will be released under the Creative Commons CC0 waiver. Data will also be identified with persistent data identifiers, and third parties are expected to cite the public CMS data through these identifiers.
Data produced by the LHC experiments are usually categorised in four different levels (DPHEP Study Group, 2009). The CERN Open Data portal focuses on the release of event data from levels 2 and 3. The LHC collaborations may also provide small samples of level 4 data.
CMS will provide open access to its data at different points in time with appropriate delays, which will allow CMS collaborators to fully exploit the scientific potential of the data before open access is triggered.
- At level 1, additional data, such as extra figures and tables, are made available at the moment of publication.
- At level 2, simplified data format samples are released promptly as determined by the Collaboration Board.
- At level 3, public data releases, accompanied by stable, open-source software and suitable documentation, will take place regularly. CMS will normally make 50% of its data available 6 years after they have been taken. The proportion will rise to 100% within 10 years, or when the main analysis work on these data in CMS has ended. However, the amount of open data will be limited to 20% of data with a similar centre-of-mass energy and collision type while such data are still planned to be taken. The Collaboration Board can, in exceptional circumstances, decide to release some particular data sets either earlier or later.
- At level 4, small samples of raw data, potentially useful for studies in the machine learning domain and beyond, can be released together with level 3 formats. If storage space is available, raw data can be made public after the end of all data taking and analysis.