NCP-AII NVIDIA AI Infrastructure Questions and Answers
During cluster validation, the Cable Validation Tool (CVT) reports " Underperforming (BER) " for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?
You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?
An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?
Your tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?
A system administrator wants to configure MIG for seven slices on an H100 GPU in an NVIDIA HGX system. Which command should be used?
An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?
During HPL execution on a DGX cluster, the benchmark fails with " not enough memory " errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?
You are an infrastructure engineer tasked with validating a new AI training cluster before releasing it to users. Your team wants to perform a NeMo burn-in to ensure both hardware and software are reliable and ready for production workloads. Which of the following actions are required as part of a proper NeMo burn-in process?
Pick the 2 correct responses below.
A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)
A user wants to restrict a Docker container to use only GPUs 0 and 2. Which command achieves this?
You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?
Pick the 2 correct responses below.
If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which solution would best meet this requirement?
What is the primary purpose of running an NCCL burn-in test on a new GPU cluster?
You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager cluster. Which two of the following actions are essential for a successful OS installation on the cluster’s head node?
Pick the 2 correct responses below.
After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?
A systems engineer is updating firmware across a large DGX cluster using automation. What is the best practice for minimizing risk and ensuring cluster health during and after the process?
An enterprise IT team has completed the physical installation of an AI Factory with a Spectrum-X Ethernet network connected to all GPU servers. They now need to ensure the environment is ready for scalable AI workload deployment. What is the recommended sequence of validation steps?
A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?
As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?
As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware according to NVIDIA’s documentation and recommended operational steps?
One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?
You are evaluating the integration of NVIDIA BlueField DPUs into your data center ' s storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?
A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?
Refer to the output:
~ $ sudo nvsm show healthinfo
—Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks—BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016)..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated)
Verify Ethernet controllers...........................Healthy
Verify installed GPU ' s..............................Unhealthy
Checking output of ' lspci ' for expected GPU ' s
Missing GPU at PCI address ' 07:00.0 '
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated)
What insights can a system administrator gain regarding the DGX system ' s health?
An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?
A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?
An engineer wants to verify that an NVIDIA GPU is accessible inside a Docker container for running deep learning workloads. The NVIDIA Container Toolkit is installed on a machine with working NVIDIA drivers. Which command demonstrates the correct way to run a container that can access all available GPUs?
You must validate all physical cabling as part of the network bring-up phase in a new NVIDIA GPU cluster deployment. The design requires you to confirm that each cable matches the intended topology, all links are functional, and future troubleshooting and scalability are supported. Which two steps are essential to an effective recommended cabling validation process during cluster deployment?
Pick the 2 correct responses below.
A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?
Which of the following tests should be used to check for the lowest possible latency between two nodes in a fabric?
After ClusterKit reports " GPU-Host latency exceeds threshold, " which NVIDIA diagnostic tool should be used to isolate hardware faults?
You are responsible for ensuring interoperability between AI applications deployed across a diverse IT landscape, including an on-premises data center equipped with NVIDIA GPUs and multiple cloud platforms from different vendors. These environments need to support complex AI workflows that involve large-scale data processing, real-time analytics, and machine learning model training. To maintain consistent performance and flexibility, which strategy should you prioritize?
An infrastructure engineer is preparing a new AI cluster for production use, relying on NVIDIA switches and high-speed optical transceivers for node connectivity. The team is finalizing network validation before launching large-scale training jobs. Why is it critical to confirm and align the firmware version on all switch transceivers prior to production?