Part 3: Building the Keystone – Dataproc Custom Images for Secure Boot & GPUs



In Part 1, we established a secure, proxy-only network. In Part 2, we
explored the enhanced install_gpu_driver.sh initialization
action. Now, in Part 3, we’ll use the LLC-Technologies-Collier/custom-images
repository (branch proxy-exercise-2025-11) to build the
actual custom Dataproc images, with NVIDIA drivers installed and signed for
Secure Boot, all within our proxied environment.

Why Custom Images?

To run NVIDIA GPUs on Shielded VMs with Secure Boot enabled, the
NVIDIA kernel modules must be signed with a key trusted by the VM’s EFI
firmware. Since standard Dataproc images don’t include these
custom-signed modules, we need to build our own. This process also
allows us to pre-install a full stack of GPU-accelerated software.
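Conceptually, the trust chain is simple: a certificate is enrolled in the VM’s EFI db, and kernel modules signed with the matching private key then load under Secure Boot. The key material that the toolkit manages is of the kind you could generate locally with OpenSSL. The sketch below is purely illustrative; the file names, subject, and validity period are assumptions, not what the repository actually produces:

```shell
# Illustrative sketch: generate a self-signed cert + RSA key pair of the
# kind used for kernel module signing. (The real toolkit stores its keys
# in Secret Manager; these names and the subject are made up.)
openssl req -new -x509 -newkey rsa:2048 -nodes -days 3650 \
  -subj "/CN=dataproc-secure-boot-db/" \
  -keyout db.rsa -out db.pem

# EFI db enrollment typically wants the certificate in DER form.
openssl x509 -in db.pem -outform DER -out db.der

# Sanity-check the certificate.
openssl x509 -in db.pem -noout -subject -dates
```

The private half signs each NVIDIA kernel module; the public half is what the image build injects into the VM’s EFI db so the firmware trusts those signatures.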

The custom-images Toolkit (examples/secure-boot)

The examples/secure-boot directory within the
custom-images repository contains the scripts and
configurations, refined over many iterations, that handle the proxy
and Secure Boot challenges.

Key Components & Development Insights:

  • env.json: The central configuration
    file (as used in Part 1) for project, network, proxy, and bucket
    details. This became the single source of truth to avoid configuration
    drift.
  • create-key-pair.sh: Manages the Secure
    Boot signing keys (PK, KEK, DB) in Google Secret Manager, essential
    for module signing.
  • build-and-run-podman.sh: Orchestrates
    the image build process in an isolated Podman container. This was
    introduced to standardize the build environment and encapsulate
    dependencies, simplifying what the user needs to install locally.
  • pre-init.sh: Sets up the build
    environment within the container and calls
    generate_custom_image.py. It crucially passes metadata
    derived from env.json (like proxy settings and Secure Boot
    key secret names) to the temporary build VM.
  • generate_custom_image.py: The core
    Python script that automates GCE VM creation, runs the customization
    script, and creates the final GCE image.
  • gce-proxy-setup.sh: This script from
    startup_script/ is vital. It’s injected into the temporary
    build VM and runs first to configure the OS, package
    managers (apt, dnf), tools (curl, wget, GPG), Conda, and Java to use the
    proxy settings passed in the metadata. This ensures the entire build
    process is proxy-aware.
  • install_gpu_driver.sh: Used as the
    --customization-script within the build VM. As detailed in
    Part 2, this script handles the driver/CUDA/ML stack installation and
    signing, now able to function correctly due to the proxy setup by
    gce-proxy-setup.sh.
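As a concrete illustration of what “proxy-aware” means here, the following sketch writes the standard apt, curl, and wget proxy configuration files, in the spirit of gce-proxy-setup.sh. The function name, target paths, and proxy URL are illustrative, not the script’s actual contents, and the real script additionally covers GPG, Conda, and Java:

```shell
#!/bin/bash
# Sketch: point apt, curl, and wget at an HTTP proxy, as gce-proxy-setup.sh
# does on the build VM. configure_proxy and its arguments are illustrative.
configure_proxy() {
  local proxy_url="$1" root="${2:-/}"

  # apt honors Acquire::*::Proxy for both http and https transports.
  mkdir -p "${root}etc/apt/apt.conf.d"
  printf 'Acquire::http::Proxy "%s";\nAcquire::https::Proxy "%s";\n' \
    "$proxy_url" "$proxy_url" > "${root}etc/apt/apt.conf.d/99-build-proxy"

  # curl reads root's ~/.curlrc.
  mkdir -p "${root}root"
  printf 'proxy = "%s"\n' "$proxy_url" >> "${root}root/.curlrc"

  # wget reads the system-wide wgetrc.
  mkdir -p "${root}etc"
  {
    printf 'http_proxy = %s\n' "$proxy_url"
    printf 'https_proxy = %s\n' "$proxy_url"
  } >> "${root}etc/wgetrc"
}
```

Run early on the build VM with the proxy URL read from instance metadata, configuration like this makes every subsequent apt-get, curl, and wget invocation go through the proxy.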

Layered Image Strategy:

The pre-init.sh script employs a layered approach:

  1. secure-boot Image: Base image with
    Secure Boot certificates injected.
  2. tf Image: Based on
    secure-boot, this image runs the full
    install_gpu_driver.sh within the proxy-configured build VM
    to install NVIDIA drivers, CUDA, ML libraries (TensorFlow, PyTorch,
    RAPIDS), and sign the modules. This is the primary target image for our
    use case.

(Note: secure-proxy and proxy-tf layers
were experiments, but the -tf image combined with runtime
metadata emerged as the most effective solution for 2.2-debian12).

Build Steps:

  1. Clone Repos & Configure
    env.json:
    Ensure you have cloned the
    custom-images and cloud-dataproc repos and have a
    complete env.json as described in Part 1.

  2. Run the Build:
    ```bash
    # Example: Build a 2.2-debian12 based image set
    # Run from the custom-images repository root
    bash examples/secure-boot/build-and-run-podman.sh 2.2-debian12
    ```
    This command will build the layered images, leveraging the proxy
    settings from env.json via the metadata injected into the
    build VM. Note the final image name produced (e.g.,
    dataproc-2-2-deb12-YYYYMMDD-HHMMSS-tf).
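For orientation, an env.json along these lines is what the build consumes. Every field name below is illustrative; use the keys your checkout of examples/secure-boot actually documents:

```json
{
  "PROJECT_ID": "my-project",
  "REGION": "us-central1",
  "ZONE": "us-central1-a",
  "SUBNET": "proxy-only-subnet",
  "HTTP_PROXY": "http://proxy.example.internal:3128",
  "NO_PROXY": "metadata.google.internal,169.254.169.254,.googleapis.com",
  "BUCKET": "my-dataproc-build-bucket",
  "SECRET_NAME": "efi-db-priv-key"
}
```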

Conclusion of Part 3

Through an iterative process, we’ve developed a robust workflow
within the custom-images repository to build Secure
Boot-compatible GPU images in a proxy-only environment. The key was
isolating the build in Podman, ensuring the build VM is fully
proxy-aware using gce-proxy-setup.sh, and leveraging the
enhanced install_gpu_driver.sh from Part 2.

In Part 4, we’ll bring it all together, deploying a Dataproc cluster
using this custom -tf image within the secure network, and
verifying the end-to-end functionality.

