NCP-AII NVIDIA AI Infrastructure Questions and Answers
An engineer needs to validate 400G DAC cable signal integrity in a DGX cluster. Which CVT metric best identifies marginal cables needing replacement?
You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNIC. The Bluefield-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. You need to use specific configuration commands to achieve this setup. Which of the following steps and commands are necessary to configure the Bluefield-3 devices in NIC mode for the E/W Spectrum-X fabric using SuperNIC? (Pick the 2 correct responses below)
ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?
An engineer needs to verify the current firmware versions of all components (ATF, BSP, NIC, UEFI) on a BlueField-3 DPU's BMC. Which Redfish API command provides this information?
After running a 24-hour stress test on a DGX node, the administrator should verify which two key metrics to ensure system stability?
An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?
Your tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?
For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?
A user encounters "permission denied" errors when running GPU-accelerated containers on a Secure Boot-enabled system. What resolves this?
An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?
A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?
You are validating the environment of an NVIDIA GPU-accelerated data center during post-deployment checks. Which one action is essential to confirm that power and cooling are sufficient for the stable operation of NVIDIA DGX H100 systems?
An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?
If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE HOST CHANNEL ADAPTER to a QSFP port capable of both 100 GbE and 25 GbE, which of the following solutions would best meet this requirement?
A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?
During HPL execution on a DGX cluster, the benchmark fails with "not enough memory" errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?
Refer to the output:
~ $ sudo nvsm show healthinfo
—Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks—BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016)..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated)
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated)
What insights can a system administrator gain regarding the DGX system's health?