Part 2: Taming the Beast – Deep Dive into the Proxy-Aware GPU Initialization Action



In Part 1 of this series, we laid the network foundation for running
secure Dataproc clusters. Now, let’s zoom in on the core component
responsible for installing and configuring NVIDIA GPU drivers in this
restricted environment: the install_gpu_driver.sh script
from the GoogleCloudDataproc/initialization-actions
repository.

This isn’t just any installation script; it has been significantly
enhanced to handle the nuances of Secure Boot and to operate seamlessly
behind an HTTP/S proxy.

The Challenge: Installing GPU Drivers Without Direct Internet

The install_gpu_driver.sh script needs to:

  1. Download NVIDIA drivers, CUDA toolkits, cuDNN, NCCL, etc.
  2. Potentially compile kernel modules.
  3. Install OS packages and dependencies.
  4. Configure the system for GPU workloads.

All of these steps traditionally require internet access, which is
blocked in our target environment. Additionally, for Secure Boot, any
newly compiled kernel modules must be signed.
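For context on that signing requirement: out-of-tree modules are typically signed with the kernel's `sign-file` utility, which appends a signature to the `.ko` in place. Below is a hypothetical helper sketching that step; the helper name, key paths, and the `SIGN_FILE` override are assumptions for illustration, not the script's actual code (real key handling is covered in Part 3).

```shell
#!/bin/bash
# Hypothetical helper illustrating Secure Boot module signing with the
# kernel's sign-file tool. Key paths and the SIGN_FILE override are
# assumptions for illustration only.
sign_nvidia_modules() {
  local priv="$1" cert="$2" moddir="$3"
  local sign_file="${SIGN_FILE:-/usr/src/linux-headers-$(uname -r)/scripts/sign-file}"
  local ko
  for ko in "${moddir}"/nvidia*.ko; do
    [[ -e "${ko}" ]] || continue
    # Usage: sign-file <hash-algo> <private-key> <x509-cert> <module>
    # The signature is appended to the module file in place.
    "${sign_file}" sha256 "${priv}" "${cert}" "${ko}"
  done
}
```

Without a valid signature from a key enrolled in the machine owner key (MOK) database, a Secure Boot kernel will refuse to load the module.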

Key Enhancements in install_gpu_driver.sh

To address these challenges, the script incorporates several key
features:

  • Robust Proxy Handling (set_proxy
    function):
    This function, also used in
    gce-proxy-setup.sh, is called early. It reads metadata like
    http-proxy, https-proxy, and
    http-proxy-pem-uri to configure:

    • System-wide environment variables (HTTP_PROXY,
      HTTPS_PROXY, NO_PROXY).
    • NO_PROXY is carefully set to include
      .google.com and .googleapis.com to allow
      direct access to Google APIs and services via Private Google Access,
      bypassing the SWP for these endpoints.
    • Package managers (apt, dnf).
    • curl and dirmngr (for GPG keys).
    • Java key stores.
    • Conda/Mamba configurations.
    • If http-proxy-pem-uri is provided, the proxy’s CA
      certificate is downloaded from GCS and installed into all relevant trust
      stores, and proxy URLs are adjusted to use HTTPS.
  • GCS Caching: To minimize reliance on the proxy
    for large, frequently used files, the script implements GCS caching.
    Before downloading large binaries like NVIDIA drivers or CUDA runfiles
    from the internet, it checks a predefined path in the
    dataproc-temp-bucket. If the file exists, it’s copied from
    GCS, saving time and reducing proxy load. If not found, it downloads
    through the proxy and then uploads to the cache for future use.

  • Proxy-Aware GPG Key Import: The script uses a
    custom import_gpg_keys function that fetches keys over
    HTTPS (through the proxy) rather than using the default HKP protocol,
    which often fails in restricted networks.

  • Resilient Conda/Mamba Environment Creation:
    Installing complex Conda environments with libraries like PyTorch,
    TensorFlow, and RAPIDS through a proxy can be fragile. The script
    includes:

    • Refined package lists for better dependency resolution.
    • Use of Mamba for speed, with a fallback to Conda.
    • GCS caching for the entire Conda pack, further speeding up
      subsequent builds or cluster node initializations.
    • Sentinel files to detect and purge corrupted or incomplete cache
      entries.
  • Secure Boot Signing Integration: While the main
    signing logic is part of the custom image build process (Part 3), the
    install_gpu_driver.sh script is designed to work in tandem
    with it. It ensures that when kernel modules are built (e.g., from
    NVIDIA’s open-gpu-kernel-modules), they are ready to be signed by the
    keys provided via metadata.

  • Deferred Configuration Mode: When the metadata
    invocation-type=custom-images is set (as is done in Part
    3), the script focuses on installing software and drivers, deferring
    Hadoop/Spark-specific configurations. These deferred steps are handled
    by a systemd service (dataproc-gpu-config.service) on the
    first boot of a cluster node created from the custom image.
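The deferred-configuration handoff is a natural fit for a oneshot systemd unit. The unit name comes from the post; everything else below, in particular the `ExecStart` path, is a placeholder sketch rather than the unit the script actually installs.

```ini
# Hypothetical sketch of dataproc-gpu-config.service; the real unit
# installed by install_gpu_driver.sh may differ.
[Unit]
Description=Deferred Dataproc GPU configuration (first boot)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# Placeholder path: re-run the configuration-only steps on first boot.
ExecStart=/usr/local/share/google/dataproc-gpu-config.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

`RemainAfterExit=yes` keeps the unit reported as active after the one-time configuration completes, so it does not re-run on subsequent boots once enabled and finished.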
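To make the proxy handling concrete, here is a minimal sketch of the pattern a `set_proxy`-style function follows: read the proxy metadata, export the standard environment variables, and persist them for apt. The helper names and the apt conf path are assumptions for illustration, not the exact implementation.

```shell
#!/bin/bash
# Illustrative sketch of the set_proxy pattern; not the exact code from
# install_gpu_driver.sh.

# Fetch one instance metadata attribute; prints nothing if unset.
get_metadata_attribute() {
  curl -fs -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/$1" \
    || echo ""
}

set_proxy() {
  local http https
  http="$(get_metadata_attribute http-proxy)"
  https="$(get_metadata_attribute https-proxy)"
  [[ -z "${http}" ]] && return 0   # no proxy configured: nothing to do

  export HTTP_PROXY="${http}"
  export HTTPS_PROXY="${https:-${http}}"
  # Google APIs stay on Private Google Access, bypassing the proxy.
  export NO_PROXY="localhost,127.0.0.1,metadata.google.internal,.google.com,.googleapis.com"
  export http_proxy="${HTTP_PROXY}" https_proxy="${HTTPS_PROXY}" no_proxy="${NO_PROXY}"

  # Persist the same settings for apt (path overridable for testing).
  local apt_conf="${APT_PROXY_CONF:-/etc/apt/apt.conf.d/95proxy}"
  if [[ -w "$(dirname "${apt_conf}")" ]]; then
    printf 'Acquire::http::Proxy "%s";\nAcquire::https::Proxy "%s";\n' \
      "${HTTP_PROXY}" "${HTTPS_PROXY}" > "${apt_conf}"
  fi
}
```

Calling this before any download is what lets the rest of the script treat the proxy as ambient configuration rather than threading it through every command.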
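The GCS caching pattern described above can likewise be sketched as a small check-then-fetch helper. The function name, cache prefix, and `TEMP_BUCKET` variable are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of the GCS cache-then-download pattern. TEMP_BUCKET should hold
# the cluster's dataproc-temp-bucket name; names here are illustrative.
cache_fetch() {
  local url="$1" dest="$2"
  local cache="gs://${TEMP_BUCKET}/gpu-init-cache/$(basename "${url}")"
  if gsutil -q stat "${cache}"; then
    # Cache hit: copy from GCS over Private Google Access, no proxy needed.
    gsutil cp "${cache}" "${dest}"
  else
    # Cache miss: download through the proxy, then populate the cache.
    curl -fL "${url}" -o "${dest}"
    gsutil cp "${dest}" "${cache}" || true  # best effort; a miss next time is harmless
  fi
}
```

Because every node of a cluster runs the same initialization action, the first node to finish a download effectively pre-warms the cache for the rest.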

Conclusion of Part 2

The install_gpu_driver.sh initialization action is more
than just an installer; it’s a carefully crafted tool designed to handle
the complexities of secure, proxied environments. Its robust proxy
support, GCS caching, and awareness of the custom image build lifecycle
make it a critical enabler for running GPU workloads on clusters with no
direct internet access.

In Part 3, we’ll see how the
GoogleCloudDataproc/custom-images repository uses this
initialization action as a customization script to build the Secure
Boot-ready images.
