Part 2: Taming the Beast – Deep Dive into the Proxy-Aware GPU Initialization Action

In Part 1 of this series, we laid the network foundation for running
secure Dataproc clusters. Now, let’s zoom in on the core component
responsible for installing and configuring NVIDIA GPU drivers and the
associated ML stack in this restricted environment: the
install_gpu_driver.sh script from the LLC-Technologies-Collier/initialization-actions
repository (branch gpu-202601).

This isn’t just any installation script; it has been significantly
enhanced to handle the nuances of Secure Boot and to operate seamlessly
behind an HTTP/S proxy.

The Challenge: Installing GPU Drivers Without Direct Internet

Our goal was to create a Dataproc custom image with NVIDIA GPU
drivers, sign the kernel modules for Secure Boot, and ensure the entire
process works end to end when the build VM and the eventual cluster
nodes have no direct internet access, relying solely on an HTTP/S proxy.
This involved:

  1. Proxy-Aware Build: Ensuring all build steps within
    the custom image creation process (package downloads, driver downloads,
    GPG keys, etc.) correctly use the customer’s proxy.
  2. Secure Boot Signing: Integrating kernel module
    signing using keys managed in GCP Secret Manager, especially when
    drivers are built from source.
  3. Conda Environment: Reliably and speedily installing
    a complex Conda environment with PyTorch, TensorFlow, Rapids, and other
    GPU-accelerated libraries through the proxy.
  4. Dataproc Integration: Making sure the custom image
    works correctly with Dataproc’s own startup, agent processes, and
    cluster-specific configurations like YARN.

The Development Journey: Key Enhancements in install_gpu_driver.sh

To address these challenges, the script incorporates several key
features (minimal sketches of each follow the list):

  • Robust Proxy Handling (set_proxy
    function):

    • Challenge: Initial script versions had spotty proxy
      support. Many tools like apt, curl,
      gpg, and even gsutil failed in proxy-only
      environments.
    • Enhancements: The set_proxy function
      (also used in gce-proxy-setup.sh) was completely overhauled
      to parse various proxy metadata (http-proxy,
      https-proxy, proxy-uri,
      no-proxy). Critically, environment variables
      (HTTP_PROXY, HTTPS_PROXY,
      NO_PROXY) are now set before any network
      operations. NO_PROXY is carefully set to include
      .google.com and .googleapis.com to allow
      direct access to Google APIs via Private Google Access. System-wide
      trust stores (OS, Java, Conda) are updated with the proxy’s CA
      certificate if provided via http-proxy-pem-uri.
      gcloud, apt, dnf, and
      dirmngr are also configured to use the proxy.
  • Reliable GPG Key Fetching (import_gpg_keys
    function):

    • Challenge: Importing GPG keys for repositories often failed
      because keyservers use non-HTTP ports (e.g., HKP on 11371) that
      firewalls block, and gpg --recv-keys is not proxy-friendly.
    • Solution: A new import_gpg_keys
      function now fetches keys over HTTPS using curl, which
      respects the environment’s proxy settings. This replaced all direct
      gpg --recv-keys calls.
  • GCS Caching is King:
    • Challenge: Repeatedly downloading large files
      (drivers, CUDA, source code) through a proxy is slow and
      inefficient.
    • Solution: Implemented extensive GCS caching for
      NVIDIA drivers, CUDA runfiles, NVIDIA Open Kernel Module source
      tarballs, compiled kernel modules, and even packed Conda environments.
      Scripts now check a GCS bucket (dataproc-temp-bucket)
      before hitting the internet.
    • Impact: Dramatically speeds up subsequent runs and
      init action execution times on cluster nodes after the cache is
      warmed.
  • Conda Environment Stability & Speed:
    • Challenge: Large Conda environments are prone to
      solver conflicts and slow installation times.
    • Solution: Integrated Mamba for faster package
      solving. Refined package lists for better compatibility. Added logic to
      force-clean and rebuild the Conda environment cache on GCS and locally
      if inconsistencies are detected (e.g., driver installed but Conda env
      not fully set up).
  • Secure Boot & Kernel Module Signing:
    • Challenge: Custom-compiled kernel modules must be
      signed to load when Secure Boot is enabled.
    • Solution: The script integrates with GCP Secret
      Manager to fetch signing keys. The build_driver_from_github
      function now includes robust steps to compile, sign (using
      sign-file), install, and verify the signed modules.
  • Custom Image Workflow & Deferred Configuration:
    • Challenge: Cluster-specific settings (like YARN GPU
      configuration) should not be baked into the image.
    • Solution: The install_gpu_driver.sh
      script detects when it’s run during image creation
      (--metadata invocation-type=custom-images). In this mode,
      it defers cluster-specific setups to a systemd service
      (dataproc-gpu-config.service) that runs on the first boot
      of a cluster instance. This ensures that YARN and Spark configurations
      are applied in the context of the running cluster, not at image build
      time.
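
To make these enhancements concrete, the sketches below illustrate each
pattern in turn. First, the proxy bootstrap: a minimal sketch of the
approach set_proxy takes, not a verbatim excerpt. The metadata helper
and the apt snippet are illustrative.

    #!/bin/bash
    # Minimal sketch of the set_proxy pattern (illustrative, not verbatim).

    # Read an instance metadata attribute; --noproxy ensures the metadata
    # server is always reached directly, never through the proxy.
    metadata() {
      curl -fs --noproxy '*' -H "Metadata-Flavor: Google" \
        "http://metadata.google.internal/computeMetadata/v1/instance/attributes/$1"
    }

    HTTP_PROXY="$(metadata http-proxy)"
    HTTPS_PROXY="$(metadata https-proxy || echo "${HTTP_PROXY}")"
    # Keep Google APIs off the proxy so Private Google Access serves them directly.
    NO_PROXY="$(metadata no-proxy || echo localhost)"
    NO_PROXY="${NO_PROXY},.google.com,.googleapis.com,metadata.google.internal"

    # Export BEFORE any network operation so curl, gsutil, pip, etc. inherit them.
    export HTTP_PROXY HTTPS_PROXY NO_PROXY
    export http_proxy="${HTTP_PROXY}" https_proxy="${HTTPS_PROXY}" no_proxy="${NO_PROXY}"

    # apt does not reliably read the environment, so persist its own config too.
    cat >/etc/apt/apt.conf.d/99proxy <<EOF
    Acquire::http::Proxy "${HTTP_PROXY}";
    Acquire::https::Proxy "${HTTPS_PROXY}";
    EOF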
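
Next, GPG key fetching. The import_gpg_keys idea is to retrieve keys
over HTTPS with curl, which honors the proxy environment variables,
instead of calling gpg --recv-keys over HKP port 11371. The URL, keyring
path, and fingerprint below are placeholders, not values from the
script:

    # Fetch a repository signing key over HTTPS instead of gpg --recv-keys.
    curl -fsSL https://example.com/repo-signing-key.asc \
      | gpg --dearmor -o /usr/share/keyrings/example-archive-keyring.gpg

    # Keyservers also serve HKP over HTTPS on port 443, so fingerprint
    # lookups can stay proxy-friendly too:
    curl -fsSL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x<FINGERPRINT>" \
      | gpg --import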
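
The GCS caching pattern reduces to “check the bucket, else fetch and
backfill.” A sketch, assuming a TEMP_BUCKET variable that points at the
cluster’s dataproc-temp-bucket; the CUDA runfile name and URL are just
one example:

    # Check the cluster temp bucket first; fall back to the vendor URL
    # through the proxy, then backfill the cache for the next run.
    cache_fetch() {
      local fname="$1" url="$2"
      local gcs="gs://${TEMP_BUCKET}/install-cache/${fname}"
      if gsutil -q stat "${gcs}"; then
        gsutil cp "${gcs}" "/tmp/${fname}"     # warm cache: copy from GCS
      else
        curl -fSL -o "/tmp/${fname}" "${url}"  # cold cache: download via proxy
        gsutil cp "/tmp/${fname}" "${gcs}"     # backfill for subsequent runs
      fi
    }

    cache_fetch "cuda_12.4.0_550.54.14_linux.run" \
      "https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run"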
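
For the Conda environment, the combination is Mamba for fast solving
plus a packed environment cached in GCS. The channels and version pins
below are illustrative, not the script’s exact package list, and
conda-pack is assumed for the packing step:

    # Solve and create the environment with mamba (much faster than conda).
    conda install -y -n base -c conda-forge mamba
    mamba create -y -n dpgce -c rapidsai -c conda-forge -c nvidia \
      python=3.10 rapids=24.04 cuda-version=12.0 pytorch tensorflow

    # Pack the environment and stash it in GCS so later runs skip the solver.
    mamba install -y -n base -c conda-forge conda-pack
    conda pack -n dpgce -o /tmp/dpgce.tar.gz
    gsutil cp /tmp/dpgce.tar.gz "gs://${TEMP_BUCKET}/conda-cache/dpgce.tar.gz"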
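
Kernel module signing follows the standard sign-file flow, with the key
material pulled from Secret Manager at build time. The secret names, key
formats, and module path are assumptions for illustration; the real
build_driver_from_github logic is more involved:

    # Pull the MOK private key (PEM) and certificate (DER) from Secret Manager.
    # Secret names are illustrative placeholders.
    gcloud secrets versions access latest --secret=mok-private-key > /tmp/mok.pem
    gcloud secrets versions access latest --secret=mok-certificate > /tmp/mok.der

    KVER="$(uname -r)"
    MODULE="/lib/modules/${KVER}/updates/dkms/nvidia.ko"  # assumed module path

    # sign-file ships with the kernel headers; it appends the signature in place.
    "/usr/src/linux-headers-${KVER}/scripts/sign-file" sha256 \
      /tmp/mok.pem /tmp/mok.der "${MODULE}"

    # Verify the signature before trusting the module to load under Secure Boot.
    modinfo "${MODULE}" | grep -E '^sig(ner|_key|_hashalgo)'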
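
Finally, deferred configuration. During image build the script installs
a one-shot systemd unit rather than configuring YARN directly. The unit
name matches the one above, while the ExecStart script path and the
configure_yarn_gpu helper are hypothetical (metadata() is the helper
from the first sketch):

    if [[ "$(metadata invocation-type)" == "custom-images" ]]; then
      # Image build: defer cluster-specific setup to first boot.
      cat >/etc/systemd/system/dataproc-gpu-config.service <<'EOF'
    [Unit]
    Description=Deferred GPU configuration (YARN/Spark) on first cluster boot
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=oneshot
    ExecStart=/usr/local/sbin/dataproc-gpu-config.sh
    RemainAfterExit=yes

    [Install]
    WantedBy=multi-user.target
    EOF
      systemctl enable dataproc-gpu-config.service
    else
      configure_yarn_gpu  # hypothetical helper: configure YARN/Spark right away
    fi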

Conclusion of Part 2

The install_gpu_driver.sh initialization action is more
than just an installer; it’s a carefully crafted tool designed to handle
the complexities of secure, proxied environments. Its robust proxy
support, comprehensive GCS caching, refined Conda management, Secure
Boot signing capabilities, and awareness of the custom image build
lifecycle make it a critical enabler for GPU workloads in locked-down,
proxy-only networks.

In Part 3, we’ll explore how the LLC-Technologies-Collier/custom-images
repository (branch proxy-exercise-2025-11) uses this
initialization action to build the complete, ready-to-deploy Secure Boot
GPU custom images.

