NVIDIA AI Infrastructure - NCP-AII Exam Practice Test
A platform engineer uses NVIDIA DCGM to monitor hundreds of GPUs in production. One GPU repeatedly reports increasing corrected ECC memory errors, although training jobs continue to complete successfully. What is the most appropriate operational response?
Correct Answer: B
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
You are training a deep neural network using NCCL to coordinate communication across four GPUs in a single node. During early performance testing, you notice inconsistent scaling and longer-than-expected training times, even though all GPUs are being used. Which strategy would most effectively improve NCCL efficiency and collective operation performance in this setting?
Correct Answer: C
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
A system administrator needs to enable MIG so that the end user can run multiple jobs on an NVIDIA A100 GPU. What command should be used?
Correct Answer: D
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
For a 48-hour NCCL burn-in test, which parameters ensure sustained fabric stress while detecting silent data corruption?
Correct Answer: C
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps? (Choose two.)
Correct Answer: B,D
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?
Correct Answer: D
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
To validate bisectional bandwidth across two racks in a Spectrum-X Ethernet fabric, which NCCL test configuration isolates East-West traffic?
Correct Answer: B
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
An engineer needs to verify the current firmware versions of all components (ATF, BSP, NIC, UEFI) on a BlueField-3 DPU's BMC. Which Redfish API command provides this information?
Correct Answer: C
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
An InfiniBand administrator needs to run performance benchmarks on new devices added to the fabric. What tool should be used to check the latency?
Correct Answer: C
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
During cluster deployment, the UFM Cable Validation Tool reports "Wrong-neighbor" errors on multiple InfiniBand links. What is the most efficient way to resolve this issue?
Correct Answer: B
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
A system administrator is installing a GPU into a server and needs to avoid damaging the device.
What item should be used?
What item should be used?
Correct Answer: C
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
An engineer wants to verify that their NVIDIA GPU is accessible inside a Docker container for running deep learning workloads. They have installed the NVIDIA Container Toolkit on a machine with working NVIDIA drivers. Which command demonstrates the correct way to run a container that can access all available GPUs?
Correct Answer: B
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
A cluster administrator needs to validate transceiver firmware versions across 200 ports using UFM. Which GUI-based method provides a consolidated view?
Correct Answer: C
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).
During a scale-out expansion, the AI Factory team needs to ensure minimal communication latency between GPUs across multiple nodes. Which NVIDIA technology should be prioritized in the network design to achieve this?
Correct Answer: D
Vote an answer
Explanation: Only visible for PassTestking members. You can sign-up / login (it's free).