Part 2: Taming the Beast – Deep Dive into the Proxy-Aware GPU Initialization Action
In Part 1 of this series, we laid the network foundation for running
secure Dataproc clusters. Now, let’s zoom in on the core component
responsible for installing and configuring NVIDIA GPU drivers in this
restricted environment: the install_gpu_driver.sh script
from the GoogleCloudDataproc/initialization-actions
repository.
This isn’t just any installation script; it has been significantly
enhanced to handle the nuances of Secure Boot and to operate seamlessly
behind an HTTP/S proxy.
The Challenge: Installing GPU Drivers Without Direct Internet
The install_gpu_driver.sh script needs to:
- Download NVIDIA drivers, CUDA toolkits, cuDNN, NCCL, etc.
- Potentially compile kernel modules.
- Install OS packages and dependencies.
- Configure the system for GPU workloads.
All of these steps traditionally require internet access, which is
blocked in our target environment. Additionally, for Secure Boot, any
newly compiled kernel modules must be signed.
Key Enhancements in install_gpu_driver.sh
To address these challenges, the script incorporates several key
features:
- Robust Proxy Handling (set_proxy function): This function, also used
  in gce-proxy-setup.sh, is called early. It reads metadata like
  http-proxy, https-proxy, and http-proxy-pem-uri to configure:
  - System-wide environment variables (HTTP_PROXY, HTTPS_PROXY,
    NO_PROXY). NO_PROXY is carefully set to include .google.com and
    .googleapis.com to allow direct access to Google APIs and services
    via Private Google Access, bypassing the SWP for these endpoints.
  - Package managers (apt, dnf).
  - curl and dirmngr (for GPG keys).
  - Java key stores.
  - Conda/Mamba configurations.
  - If http-proxy-pem-uri is provided, the proxy’s CA certificate is
    downloaded from GCS and installed into all relevant trust stores,
    and proxy URLs are adjusted to use HTTPS.
- GCS Caching: To minimize reliance on the proxy for large, frequently
  used files, the script implements GCS caching. Before downloading
  large binaries like NVIDIA drivers or CUDA runfiles from the internet,
  it checks a predefined path in the dataproc-temp-bucket. If the file
  exists, it’s copied from GCS, saving time and reducing proxy load. If
  not found, it downloads through the proxy and then uploads to the
  cache for future use.
- Proxy-Aware GPG Key Import: The script uses a custom import_gpg_keys
  function that fetches keys over HTTPS (through the proxy) rather than
  using the default HKP protocol, which often fails in restricted
  networks.
- Resilient Conda/Mamba Environment Creation: Installing complex Conda
  environments with libraries like PyTorch, TensorFlow, and RAPIDS
  through a proxy can be fragile. The script includes:
  - Refined package lists for better dependency resolution.
  - Use of Mamba for speed, with a fallback to Conda.
  - GCS caching for the entire Conda pack, further speeding up
    subsequent builds or cluster node initializations.
  - Sentinel files to detect and purge corrupted or incomplete cache
    entries.
- Secure Boot Signing Integration: While the main signing logic is part
  of the custom image build process (Part 3), the install_gpu_driver.sh
  script is designed to work in tandem with it. It ensures that when
  kernel modules are built (e.g., from NVIDIA’s
  open-gpu-kernel-modules), they are ready to be signed by the keys
  provided via metadata.
- Deferred Configuration Mode: When the metadata
  invocation-type=custom-images is set (as is done in Part 3), the
  script focuses on installing software and drivers, deferring
  Hadoop/Spark-specific configurations. These deferred steps are handled
  by a systemd service (dataproc-gpu-config.service) on the first boot
  of a cluster node created from the custom image.
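
To make the GCS caching behavior from the list above more concrete, here
is a minimal sketch of a cache-first download in bash. It is not the
code from install_gpu_driver.sh: the function name cache_fetch, its
argument order, and the use of gsutil are assumptions made for
illustration, and the real script may structure this differently.

```shell
# Hypothetical helper (not taken from install_gpu_driver.sh):
#   cache_fetch <source_url> <gcs_cache_uri> <local_path>
# Tries the GCS cache first; on a miss, downloads via curl (which picks
# up the proxy from the HTTPS_PROXY environment variable) and then
# populates the cache for later runs.
cache_fetch() {
  local url="$1" cache_uri="$2" dest="$3"

  # Cache hit: copy from GCS without touching the proxy at all.
  if gsutil -q stat "${cache_uri}" 2>/dev/null; then
    gsutil -q cp "${cache_uri}" "${dest}"
    return
  fi

  # Cache miss: download from the upstream URL. Fail the whole call if
  # the download does not succeed, so a partial file is never cached.
  if ! curl -fsSL --retry 3 -o "${dest}" "${url}"; then
    rm -f "${dest}"
    return 1
  fi

  # Best-effort upload: a failed cache write should not fail the install.
  gsutil -q cp "${dest}" "${cache_uri}" \
    || echo "warning: could not cache ${dest} at ${cache_uri}" >&2
}
```

A call might look like
`cache_fetch "https://example.com/driver.run" "gs://my-temp-bucket/cache/driver.run" /tmp/driver.run`,
where both the URL and the bucket path are placeholders. Treating the
upload as best-effort is a design choice in this sketch: a node that
cannot write to the cache can still finish its installation.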
Conclusion of Part 2
The install_gpu_driver.sh initialization action is more
than just an installer; it’s a carefully crafted tool designed to handle
the complexities of secure, proxied environments. Its robust proxy
support, GCS caching, and awareness of the custom image build lifecycle
make it a critical enabler for GPU workloads on Dataproc clusters that
have no direct internet access.
In Part 3, we’ll see how the
GoogleCloudDataproc/custom-images repository uses this
initialization action as a customization script to build the Secure
Boot-ready images.