Pre-Summer Sale 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: best70

NCP-AII NVIDIA AI Infrastructure Questions and Answers

Questions 4

During cluster validation, the Cable Validation Tool (CVT) reports " Underperforming (BER) " for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?

Options:

A.

Rx power variance > 3dB between lanes

B.

Effective BER > 0 during the first 125 minutes of link operation

C.

Raw BER > 1e-12 or Effective BER > 1.5E-254 for < 6hr measurements

D.

Temperature > 85°C on transceiver module

Buy Now
Questions 5

You are expanding a DGX-based deep learning cluster to train on large, high-resolution images that cannot fit into local cache. Multiple nodes will access this data concurrently and require high performance. Which storage and networking solution best meets these requirements?

Options:

A.

Increase the SSD RAID-0 local cache size in each node so it can absorb most training data, making network storage type and speed less important for performance.

B.

Implement a standard NFS server on a 10GbE network because the cluster can access the export and job performance will not be impacted.

C.

Deploy a high-performance parallel file system across InfiniBand or 40/100GbE, ensuring at least 3 GB/s per node and scalable aggregate bandwidth for all cluster workloads.

D.

Recommend general-purpose object storage for all training data because it is optimized for deep learning workloads and distributed data access at any scale.

Buy Now
Questions 6

An administrator needs to perform a comprehensive pre-production stress test on a DGX H100 system. Which command validates GPU, CPU, memory, and storage components while following NVIDIA’s recommended procedure?

Options:

A.

nvidia-smi -q | grep " GPU Stress Test "

B.

sudo nvsm stress-test --force

C.

stress --cpu $(nproc) --io $(nproc) --timeout 600

D.

./gpu_burn 60

Buy Now
Questions 7

Your tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?

Options:

A.

Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.

B.

Validate the driver version post-install since the fresh install will overwrite the legacy drivers.

C.

Keep the older driver running alongside the new version in case you need to roll back the upgrade.

D.

Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.

Buy Now
Questions 8

A system administrator wants to configure MIG for seven slices on an H100 GPU in an NVIDIA HGX system. Which command should be used?

Options:

A.

mig-parted

B.

nvidia-smi

C.

nvcc

D.

nvlink-config

Buy Now
Questions 9

An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?

Options:

A.

tcpdump

B.

ib_write_lat

C.

ibdiagnet

D.

perfmon

Buy Now
Questions 10

During HPL execution on a DGX cluster, the benchmark fails with " not enough memory " errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?

Options:

A.

Reduce the problem size while maintaining the same block size.

B.

Set PMAP to 1 to enable process mapping.

C.

Increase block size to 6144 to maximize GPU utilization.

D.

Disable double-buffering via BCAST parameter.

Buy Now
Questions 11

You are an infrastructure engineer tasked with validating a new AI training cluster before releasing it to users. Your team wants to perform a NeMo burn-in to ensure both hardware and software are reliable and ready for production workloads. Which of the following actions are required as part of a proper NeMo burn-in process?

Pick the 2 correct responses below.

Options:

A.

Download a pre-trained NeMo model and use it for a quick accuracy check on a user dataset, then consider the burn-in complete if results are reasonable.

B.

Test inference using the NeMo API and approve the environment if the model outputs valid predictions.

C.

Run the configured NeMo training job repeatedly or for an extended duration, monitoring for errors, stalls, or performance drops across all GPUs and nodes.

D.

Configure a representative NeMo training or pretraining recipe and set up an executor to launch the job across intended nodes and GPUs.

Buy Now
Questions 12

A team is installing the NVIDIA Run:ai control plane on a Kubernetes cluster. Which two (2) options are most critical to validate before proceeding? (Pick the 2 correct responses below)

Options:

A.

Helm is installed on the installer machine.

B.

Ensure Kubernetes is running on the cluster.

C.

All cluster nodes have NVIDIA GPUs installed.

D.

NTP is disabled to simplify time synchronization.

Buy Now
Questions 13

A user wants to restrict a Docker container to use only GPUs 0 and 2. Which command achieves this?

Options:

A.

docker run --gpus ' " device=0,2 " ' nvidia/cuda:12.1-base nvidia-smi

B.

docker run -e NVIDIA_VISIBLE_DEVICES=0,2 nvidia/cuda:12.1-base nvidia-smi

C.

docker run --gpus all nvidia/cuda:12.1-base nvidia-smi -id=0,2

D.

docker run --device /dev/nvidia0,/dev/nvidia2 nvidia/cuda:12.1-base nvidia-smi

Buy Now
Questions 14

You are tasked with setting up High Availability (HA) for NVIDIA Base Command Manager (BCM) in a new GPU cluster. The cluster consists of a primary head node, a secondary head node, and several compute nodes. The requirements are automatic failover of BCM services, minimal disruption to workloads, and proper cluster health monitoring during and after installation. During your BCM HA installation and configuration process, which two of the following actions are mandatory for ensuring a robust and verified HA cluster configuration?

Pick the 2 correct responses below.

Options:

A.

Assign a floating Virtual IP address that can automatically migrate between the primary and secondary head nodes during failover.

B.

Compute nodes must be powered on and performing work to initiate synchronization of the head nodes.

C.

After configuration is complete, simulate a failover by stopping BCM services on the active head node to verify that all services are running on the secondary node with no interruption.

D.

Configure both head nodes to use independent static IP addresses for BCM services instead of relying on a shared virtual IP address.

E.

During configuration, explicitly synchronize both the configuration and state data directories from the primary to the secondary head node to ensure consistency.

Buy Now
Questions 15

If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which solution would best meet this requirement?

Options:

A.

QSA adapter.

B.

SFP connectors.

C.

SFP-to-1G BASE-T RJ45 adapter.

D.

Standard QSFP-to-QSFP DAC cable.

Buy Now
Questions 16

What is the primary purpose of running an NCCL burn-in test on a new GPU cluster?

Options:

A.

To test whether GPUs are properly detected by the operating system and have the correct drivers installed.

B.

To maximize GPU utilization for machine learning workloads and automatically tune deep learning frameworks.

C.

To detect and resolve hardware or interconnect issues before production by stressing GPU communication links.

D.

To benchmark application-specific runtime performance of AI models using real user data and production training scripts.

Buy Now
Questions 17

You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager cluster. Which two of the following actions are essential for a successful OS installation on the cluster’s head node?

Pick the 2 correct responses below.

Options:

A.

Download the latest BCM ISO and verify its integrity using the provided checksum, then start the installation.

B.

Configure network switches for PXE boot to all compute nodes before installing the OS on the head node.

C.

Set the desired time zone and configure NTP synchronization during the OS installation wizard.

D.

Start the head node OS installation process with the system BIOS set to legacy boot mode instead of UEFI.

Buy Now
Questions 18

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Options:

A.

Reduction of problem size (N) to accelerate computation.

B.

MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

C.

Doubling of GPU clock speeds through firmware updates and relevant configuration.

D.

Automatic NVLink bandwidth doubling via driver updates.

Buy Now
Questions 19

A systems engineer is updating firmware across a large DGX cluster using automation. What is the best practice for minimizing risk and ensuring cluster health during and after the process?

Options:

A.

Drain nodes from the scheduler, run pre-update diagnostics, update firmware in batches, and verify health post-update before scaling to the next batch.

B.

To save time, simultaneously update all nodes in the cluster without draining or diagnostics.

C.

Update nodes that have reported faults, leaving others on older firmware.

D.

Drain nodes from the scheduler, update firmware in batches, skip diagnostics and verify health post-update before scaling to the next batch.

Buy Now
Questions 20

An enterprise IT team has completed the physical installation of an AI Factory with a Spectrum-X Ethernet network connected to all GPU servers. They now need to ensure the environment is ready for scalable AI workload deployment. What is the recommended sequence of validation steps?

Options:

A.

Set up Active Directory and LDAP, configure role-based access controls and security settings first, install users, and skip network or hardware performance validation.

B.

Perform application benchmarking first, use performance logs to identify bottlenecks, update switch and server firmware afterward, and then tune the network using performance tests.

C.

Validate the software stack, test link connectivity and port health, run network benchmarks, run OSPF, ensure neighbors are exchanging route information, then stage AI workload tests.

D.

Confirm switch and server firmware configuration, test link connectivity and port health, run network benchmarks, validate the software stack, then stage AI workload tests.

Buy Now
Questions 21

A systems administrator is preparing a new DGX server for deployment. What is the most secure approach to configuring the BMC port during initial setup?

Options:

A.

Enable remote access to the BMC over the internet using the default admin credentials for initial troubleshooting.

B.

Connect the BMC port directly to the production network and retain default admin credentials for convenience.

C.

Leave the BMC port disconnected until after the operating system is fully configured and in production.

D.

Connect the BMC port to a dedicated and firewalled network and change the default admin credentials.

Buy Now
Questions 22

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status, and reissue update commands if any firmware appears inactive afterward.

B.

Execute a single AC power cycle on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node for confirmation of all component updates.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Initiate a cold power cycle on the system to activate firmware for components, reset the BMC using the recommended command, and perform an AC power cycle to ensure EROT and CPLD firmware is activated.

Buy Now
Questions 23

As the infrastructure lead for an NVIDIA AI Factory deployment, you have just uploaded the latest supported firmware packages to your DGX system. It is now critical to ensure all hardware components run the new firmware and the DGX returns to full operational capability. Which sequence best guarantees that all relevant components are correctly running updated firmware according to NVIDIA’s documentation and recommended operational steps?

Options:

A.

Perform a software-driven restart on the operating system of every compute node, then use advanced tools to check firmware status and reissue update commands if any firmware appears inactive afterward.

B.

Initiate the required cold reset or power cycle to activate updated firmware, reset the BMC using the recommended command, and perform an AC power cycle when required for EROT and CPLD firmware activation.

C.

Initiate a cold power cycle on all node trays to activate firmware, follow with a DGX reboot procedure, and use the management interface to finish activating CPLD firmware on the host.

D.

Execute a single operating system reboot on the DGX after the update process, then reset the software stack and verify status using diagnostic commands on each node.

Buy Now
Questions 24

One of the nodes in a cluster is not running as fast as the others and the system administrator needs to check the status of the GPUs on that system. What command should be used?

Options:

A.

lspci | grep NVIDIA

B.

nvidia-smi

C.

nvidia-gpu-status

D.

iblinkinfo

Buy Now
Questions 25

You are evaluating the integration of NVIDIA BlueField DPUs into your data center ' s storage architecture to optimize AI workloads. The storage solution chosen has incorporated BlueField DPUs to enhance performance and efficiency. Which of the following benefits directly results from this integration?

Options:

A.

Unlimited scalability by adding more DPUs without architectural changes.

B.

Elimination of latency issues in data processing tasks.

C.

Reduced CPU load by offloading data processing tasks to DPUs.

D.

Enhanced I/O performance with NVMe storage access speeds.

Buy Now
Questions 26

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

Options:

A.

The command output is ignored if the system powers on without errors.

B.

At least half of the GPUs report Status_Health = OK.

C.

All GPUs report Status_Health = OK and Health = OK for each device.

D.

Only the head node ' s GPUs need to be healthy.

Buy Now
Questions 27

Refer to the output:

~ $ sudo nvsm show healthinfo

—Timestamp: Sat Dec 16 16:26:32 2017 -0800

Version: 17.12-5

Checks—BIOS Revision [5.11].........................

DGX Serial Number [YSY72800016)..................

Verify installed DIMM memory sticks........................Healthy

...[output truncated)

Verify Ethernet controllers...........................Healthy

Verify installed GPU ' s..............................Unhealthy

Checking output of ' lspci ' for expected GPU ' s

Missing GPU at PCI address ' 07:00.0 '

Verify installed InfiniBand controllers....................Healthy

Verify PCIe switches..................................Healthy

...[output truncated)

What insights can a system administrator gain regarding the DGX system ' s health?

Options:

A.

A GPU tray upgrade failed.

B.

A GPU is missing on the DGX system.

C.

A GPU driver upgrade has failed.

D.

The system has passed the hardware health check successfully.

Buy Now
Questions 28

An administrator needs to verify HA functionality after configuring BCM (Bright Cluster Manager). Which command confirms the active head node and failover readiness?

Options:

A.

cmsh status to check HA status and active/standby roles.

B.

nvsm show health to validate GPU status on both head nodes.

C.

systemctl restart cmdaemon to force a failover test.

D.

ping < secondary-head-node-ip > to test basic connectivity.

Buy Now
Questions 29

A System Administrator needs to change the scheduling behavior of a single GPU to use a fixed share scheduler. What command achieves this?

Options:

A.

esxcli system module parameters set -m nvidia -p

B.

esxcli -i 0 -mig 18

C.

nvidia-smi -i 0 -mig 1

D.

mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1 =2

Buy Now
Questions 30

An engineer wants to verify that an NVIDIA GPU is accessible inside a Docker container for running deep learning workloads. The NVIDIA Container Toolkit is installed on a machine with working NVIDIA drivers. Which command demonstrates the correct way to run a container that can access all available GPUs?

Options:

A.

docker run --rm --runtime=docker nvidia/cuda nvidia-smi

B.

docker run --rm -it ubuntu:22.04 nvidia-smi

C.

docker run --rm --gpus all nvidia/cuda:12.4.6-base-ubuntu22.04 nvidia-smi

D.

docker run --rm nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Buy Now
Questions 31

You must validate all physical cabling as part of the network bring-up phase in a new NVIDIA GPU cluster deployment. The design requires you to confirm that each cable matches the intended topology, all links are functional, and future troubleshooting and scalability are supported. Which two steps are essential to an effective recommended cabling validation process during cluster deployment?

Pick the 2 correct responses below.

Options:

A.

Focus on validating the highest bandwidth links first, deferring non-critical cable mislabeling until after initial workloads are deployed and tested.

B.

Run link tests only after the entire network is built and powered on to avoid redundant troubleshooting during bring-up.

C.

Run the cable validation process incrementally during deployment, section by section, to catch and resolve errors as early as possible.

D.

Compare every cable’s physical connection to the planned topology diagram and validate correct ports and link paths.

Buy Now
Questions 32

A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?

Options:

A.

Navigate to ’Devices " > select a switch > " Cables ' tab to see ASIC firmware and transceiver versions.

B.

Use " Topology’ view to visually inspect cable icons.

C.

Run mlxlink -d lid- < LID > -m on each port manually.

D.

Export all switch logs and grep for ’FW Version " .

Buy Now
Questions 33

Which of the following tests should be used to check for the lowest possible latency between two nodes in a fabric?

Options:

A.

ib_read_bw

B.

ib_read_lat

C.

ib_write_bw

D.

ib_write_lat

Buy Now
Questions 34

After ClusterKit reports " GPU-Host latency exceeds threshold, " which NVIDIA diagnostic tool should be used to isolate hardware faults?

Options:

A.

Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration

B.

nvidia-smi topo -m to inspect GPU topology connections

C.

DCGM Diags dcgmi diag -r 2

D.

ib_write_bw to measure InfiniBand bandwidth between nodes

Buy Now
Questions 35

You are responsible for ensuring interoperability between AI applications deployed across a diverse IT landscape, including an on-premises data center equipped with NVIDIA GPUs and multiple cloud platforms from different vendors. These environments need to support complex AI workflows that involve large-scale data processing, real-time analytics, and machine learning model training. To maintain consistent performance and flexibility, which strategy should you prioritize?

Options:

A.

Choose one vendor and standardize on one storage solution across all environments to simplify management and improve interoperability.

B.

Implement a multi-cloud strategy that uses only native storage solutions in each cloud platform while relying on middleware to ensure interoperability and data consistency.

C.

Ensure that all environments use compatible storage protocols and APIs, such as NFS or S3, to facilitate data exchange and integration across platforms.

D.

Focus only on increasing network bandwidth between locations to reduce latency and improve data transfer speeds.

Buy Now
Questions 36

An infrastructure engineer is preparing a new AI cluster for production use, relying on NVIDIA switches and high-speed optical transceivers for node connectivity. The team is finalizing network validation before launching large-scale training jobs. Why is it critical to confirm and align the firmware version on all switch transceivers prior to production?

Options:

A.

To guarantee that hardware inventory tools can report serial numbers and manufacturer codes for asset management, which is critical for future support and troubleshooting.

B.

To ensure stability, bandwidth, and compatibility across the cluster, avoiding link issues and performance loss.

C.

To allow the network operating system to automatically discover all connected transceivers with heterogeneous firmware.

D.

To reduce GPU memory consumption during distributed training jobs.

Buy Now
Exam Code: NCP-AII
Exam Name: NVIDIA AI Infrastructure
Last Update: May 23, 2026
Questions: 123

PDF + Testing Engine

$134.99

Testing Engine

$99.99

PDF (Q&A)

$84.99