{"id":2117,"date":"2026-01-28T02:45:41","date_gmt":"2026-01-28T10:45:41","guid":{"rendered":"https:\/\/wp.c9h.org\/cj\/?p=2117"},"modified":"2026-01-29T00:57:37","modified_gmt":"2026-01-29T08:57:37","slug":"part-2-taming-the-beast-deep-dive-into-the-proxy-aware-gpu-initialization-action","status":"publish","type":"post","link":"https:\/\/wp.c9h.org\/cj\/?p=2117","title":{"rendered":"Part 2: Taming the Beast &#8211; Deep Dive into the Proxy-Aware GPU Initialization Action"},"content":{"rendered":"<h1\nid=\"part-2-taming-the-beast---deep-dive-into-the-proxy-aware-gpu-initialization-action\">Part<br \/>\n2: Taming the Beast &#8211; Deep Dive into the Proxy-Aware GPU Initialization<br \/>\nAction<\/h1>\n<p>In Part 1 of this series, we laid the network foundation for running<br \/>\nsecure Dataproc clusters. Now, let\u2019s zoom in on the core component<br \/>\nresponsible for installing and configuring NVIDIA GPU drivers and the<br \/>\nassociated ML stack in this restricted environment: the<br \/>\n<code>install_gpu_driver.sh<\/code> script from the <a\nhref=\"https:\/\/github.com\/LLC-Technologies-Collier\/initialization-actions\">LLC-Technologies-Collier\/initialization-actions<\/a><br \/>\nrepository (branch <code>gpu-202601<\/code>).<\/p>\n<p>This isn\u2019t just any installation script; it has been significantly<br \/>\nenhanced to handle the nuances of Secure Boot and to operate seamlessly<br \/>\nbehind an HTTP\/S proxy.<\/p>\n<h2\nid=\"the-challenge-installing-gpu-drivers-without-direct-internet\">The<br \/>\nChallenge: Installing GPU Drivers Without Direct Internet<\/h2>\n<p>Our goal was to create a Dataproc custom image with NVIDIA GPU<br \/>\ndrivers, sign the kernel modules for Secure Boot, and ensure the entire<br \/>\nprocess works seamlessly when the build VM and the eventual cluster<br \/>\nnodes have no direct internet access, relying solely on an HTTP\/S proxy.<br \/>\nThis involved:<\/p>\n<ol type=\"1\">\n<li><strong>Proxy-Aware Build:<\/strong> 
Ensuring all build steps within<br \/>\nthe custom image creation process (package downloads, driver downloads,<br \/>\nGPG keys, etc.) correctly use the customer\u2019s proxy.<\/li>\n<li><strong>Secure Boot Signing:<\/strong> Integrating kernel module<br \/>\nsigning using keys managed in GCP Secret Manager, especially when<br \/>\ndrivers are built from source.<\/li>\n<li><strong>Conda Environment:<\/strong> Reliably and speedily installing<br \/>\na complex Conda environment with PyTorch, TensorFlow, Rapids, and other<br \/>\nGPU-accelerated libraries through the proxy.<\/li>\n<li><strong>Dataproc Integration:<\/strong> Making sure the custom image<br \/>\nworks correctly with Dataproc\u2019s own startup, agent processes, and<br \/>\ncluster-specific configurations like YARN.<\/li>\n<\/ol>\n<h2\nid=\"the-development-journey-key-enhancements-in-install_gpu_driver.sh\">The<br \/>\nDevelopment Journey: Key Enhancements in<br \/>\n<code>install_gpu_driver.sh<\/code><\/h2>\n<p>To address these challenges, the script incorporates several key<br \/>\nfeatures:<\/p>\n<ul>\n<li><strong>Robust Proxy Handling (<code>set_proxy<\/code><br \/>\nfunction):<\/strong>\n<ul>\n<li><strong>Challenge:<\/strong> Initial script versions had spotty proxy<br \/>\nsupport. Many tools like <code>apt<\/code>, <code>curl<\/code>,<br \/>\n<code>gpg<\/code>, and even <code>gsutil<\/code> failed in proxy-only<br \/>\nenvironments.<\/li>\n<li><strong>Enhancements:<\/strong> The <code>set_proxy<\/code> function<br \/>\n(also used in <code>gce-proxy-setup.sh<\/code>) was completely overhauled<br \/>\nto parse various proxy metadata (<code>http-proxy<\/code>,<br \/>\n<code>https-proxy<\/code>, <code>proxy-uri<\/code>,<br \/>\n<code>no-proxy<\/code>). Critically, environment variables<br \/>\n(<code>HTTP_PROXY<\/code>, <code>HTTPS_PROXY<\/code>,<br \/>\n<code>NO_PROXY<\/code>) are now set <em>before<\/em> any network<br \/>\noperations. 
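As a minimal sketch of that export step (function name, variable names, and the example proxy URI are illustrative, not the script's exact internals):

```shell
#!/usr/bin/env bash
# Illustrative sketch: export the standard proxy variables before any
# network call. The real set_proxy function derives these values from
# instance metadata rather than a hard-coded literal.
set_proxy_env() {
  local proxy_uri="$1"
  # Export both cases: some tools read HTTP_PROXY, others http_proxy.
  export HTTP_PROXY="${proxy_uri}"  http_proxy="${proxy_uri}"
  export HTTPS_PROXY="${proxy_uri}" https_proxy="${proxy_uri}"
  # Bypass the proxy for the metadata server and for Google APIs,
  # which are reached directly via Private Google Access.
  export NO_PROXY="localhost,127.0.0.1,metadata.google.internal,169.254.169.254,.google.com,.googleapis.com"
  export no_proxy="${NO_PROXY}"
}

# On a build VM the URI would come from the http-proxy / proxy-uri
# metadata keys; a placeholder value is used here:
set_proxy_env "http://proxy.example.internal:3128"
```

Anything not matched by <code>NO_PROXY</code>, such as driver downloads from NVIDIA, then flows through the proxy automatically.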
<code>NO_PROXY<\/code> is carefully set to include<br \/>\n<code>.google.com<\/code> and <code>.googleapis.com<\/code> to allow<br \/>\ndirect access to Google APIs via Private Google Access. System-wide<br \/>\ntrust stores (OS, Java, Conda) are updated with the proxy\u2019s CA<br \/>\ncertificate if provided via <code>http-proxy-pem-uri<\/code>.<br \/>\n<code>gcloud<\/code>, <code>apt<\/code>, <code>dnf<\/code>, and<br \/>\n<code>dirmngr<\/code> are also configured to use the proxy.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Reliable GPG Key Fetching (<code>import_gpg_keys<\/code><br \/>\nfunction):<\/strong>\n<ul>\n<li><strong>Challenge:<\/strong> Importing GPG keys for repositories<br \/>\noften failed because keyservers listen on non-standard ports (e.g., HKP<br \/>\non port 11371) that firewalls commonly block, and<br \/>\n<code>gpg --recv-keys<\/code> does not honor proxy settings reliably.<\/li>\n<li><strong>Solution:<\/strong> A new <code>import_gpg_keys<\/code><br \/>\nfunction now fetches keys over HTTPS using <code>curl<\/code>, which<br \/>\nrespects the environment\u2019s proxy settings. 
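A sketch of the approach (the function body and the commented example URL are illustrative, assuming <code>curl</code> and <code>gpg</code> are on the PATH):

```shell
#!/usr/bin/env bash
# Fetch a repository signing key over plain HTTPS (port 443), which
# passes through the proxy like any other download, then import it
# with gpg. By contrast, `gpg --recv-keys` speaks HKP on port 11371
# and frequently ignores the https_proxy environment variable.
import_key_via_https() {
  local key_url="$1" dest="$2"
  curl -fsSL "${key_url}" -o "${dest}"
  gpg --batch --import "${dest}"
}

# Example (the NVIDIA CUDA repository key location, for illustration):
# import_key_via_https \
#   "https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/3bf863cc.pub" \
#   /tmp/cuda-repo-key.pub
```
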
This replaced all direct<br \/>\n<code>gpg --recv-keys<\/code> calls.<\/li>\n<\/ul>\n<\/li>\n<li><strong>GCS Caching is King:<\/strong>\n<ul>\n<li><strong>Challenge:<\/strong> Repeatedly downloading large files<br \/>\n(drivers, CUDA, source code) through a proxy is slow and<br \/>\ninefficient.<\/li>\n<li><strong>Solution:<\/strong> Implemented extensive GCS caching for<br \/>\nNVIDIA drivers, CUDA runfiles, NVIDIA Open Kernel Module source<br \/>\ntarballs, compiled kernel modules, and even packed Conda environments.<br \/>\nScripts now check a GCS bucket (<code>dataproc-temp-bucket<\/code>)<br \/>\nbefore hitting the internet.<\/li>\n<li><strong>Impact:<\/strong> Dramatically speeds up subsequent runs and<br \/>\ninit action execution times on cluster nodes after the cache is<br \/>\nwarmed.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Conda Environment Stability &amp; Speed:<\/strong>\n<ul>\n<li><strong>Challenge:<\/strong> Large Conda environments are prone to<br \/>\nsolver conflicts and slow installation times.<\/li>\n<li><strong>Solution:<\/strong> Integrated Mamba for faster package<br \/>\nsolving. Refined package lists for better compatibility. Added logic to<br \/>\nforce-clean and rebuild the Conda environment cache on GCS and locally<br \/>\nif inconsistencies are detected (e.g., driver installed but Conda env<br \/>\nnot fully set up).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Secure Boot &amp; Kernel Module Signing:<\/strong>\n<ul>\n<li><strong>Challenge:<\/strong> Custom-compiled kernel modules must be<br \/>\nsigned to load when Secure Boot is enabled.<\/li>\n<li><strong>Solution:<\/strong> The script integrates with GCP Secret<br \/>\nManager to fetch signing keys. 
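Sketched roughly (the secret names, file paths, and <code>SIGN_FILE</code> override are illustrative assumptions, not the script's exact code):

```shell
#!/usr/bin/env bash
# Pull the module-signing key pair from Secret Manager, then sign a
# compiled .ko with the kernel's sign-file helper so it can load under
# Secure Boot. Secret names below are examples only.
fetch_signing_material() {
  local keydir="$1"
  gcloud secrets versions access latest \
      --secret=gpu-signing-private-key > "${keydir}/signing.key"
  gcloud secrets versions access latest \
      --secret=gpu-signing-cert > "${keydir}/signing.der"
}

sign_module() {
  local module="$1" keydir="$2"
  # sign-file ships with the kernel headers for the running kernel;
  # SIGN_FILE can override the path for testing.
  local sign_file="${SIGN_FILE:-/usr/src/linux-headers-$(uname -r)/scripts/sign-file}"
  "${sign_file}" sha256 "${keydir}/signing.key" "${keydir}/signing.der" "${module}"
}
```

For the signed module to actually load, the matching certificate must also be trusted by the firmware (e.g., enrolled in the image's EFI <code>db</code>).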
The <code>build_driver_from_github<\/code><br \/>\nfunction now includes robust steps to compile, sign (using<br \/>\n<code>sign-file<\/code>), install, and verify the signed modules.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Custom Image Workflow &amp; Deferred Configuration:<\/strong>\n<ul>\n<li><strong>Challenge:<\/strong> Cluster-specific settings (like YARN GPU<br \/>\nconfiguration) should not be baked into the image.<\/li>\n<li><strong>Solution:<\/strong> The <code>install_gpu_driver.sh<\/code><br \/>\nscript detects when it\u2019s run during image creation<br \/>\n(<code>--metadata invocation-type=custom-images<\/code>). In this mode,<br \/>\nit defers cluster-specific setups to a systemd service<br \/>\n(<code>dataproc-gpu-config.service<\/code>) that runs on the first boot<br \/>\nof a cluster instance. This ensures that YARN and Spark configurations<br \/>\nare applied in the context of the running cluster, not at image build<br \/>\ntime.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h2 id=\"conclusion-of-part-2\">Conclusion of Part 2<\/h2>\n<p>The <code>install_gpu_driver.sh<\/code> initialization action is more<br \/>\nthan just an installer; it\u2019s a carefully crafted tool designed to handle<br \/>\nthe complexities of secure, proxied environments. 
Its robust proxy<br \/>\nsupport, comprehensive GCS caching, refined Conda management, Secure<br \/>\nBoot signing capabilities, and awareness of the custom image build<br \/>\nlifecycle make it a critical enabler.<\/p>\n<p>In Part 3, we\u2019ll explore how the <a\nhref=\"https:\/\/github.com\/LLC-Technologies-Collier\/custom-images\">LLC-Technologies-Collier\/custom-images<\/a><br \/>\nrepository (branch <code>proxy-exercise-2025-11<\/code>) uses this<br \/>\ninitialization action to build the complete, ready-to-deploy Secure Boot<br \/>\nGPU custom images.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Part 2: Taming the Beast &#8211; Deep Dive into the Proxy-Aware GPU Initialization Action In Part 1 of this series, we laid the network foundation for running secure Dataproc clusters. Now, let\u2019s zoom in on the core component responsible for installing and configuring NVIDIA GPU drivers and the associated ML stack in this restricted environment: 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[322,60,17,79,316,330,49,47,323,102,18,125,45,166,1,100],"tags":[],"class_list":["post-2117","post","type-post","status-publish","format-standard","hentry","category-bookworm","category-colliertech","category-debian","category-free-software","category-gcp","category-google-cloud-dataproc","category-images","category-linux","category-nvidia","category-open-source","category-perl","category-pgp","category-release-announcements","category-software","category-uncategorized","category-x509"],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1YDIB-y9","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=\/wp\/v2\/posts\/2117","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2117"}],"version-history":[{"count":2,"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=\/wp\/v2\/posts\/2117\/revisions"}],"predecessor-version":[{"id":2123,"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=\/wp\/v2\/posts\/2117\/revisions\/2123"}],"wp:attachment":[{"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2117"}],"wp:term":[{"taxonomy
":"category","embeddable":true,"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2117"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wp.c9h.org\/cj\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2117"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}