Part 1: Building a Secure Network Foundation for Dataproc with GPUs & SWP
Welcome to the first post in our series on running GPU-accelerated
Dataproc workloads in secure, enterprise-grade environments. Many
organizations need to operate within VPCs that have no direct internet
egress, instead routing all traffic through a Secure Web Proxy (SWP).
Additionally, security mandates often require the use of Shielded VMs
with Secure Boot enabled. This series will show you how to meet these
requirements for your Dataproc GPU clusters.
In this post, we’ll focus on laying the network foundation using
tools from the LLC-Technologies-Collier/cloud-dataproc
repository (branch proxy-sync-2026-01).
The Challenge: Network Isolation & Control
Before we can even think about custom images or GPU drivers, we need
a network environment that:
- Prevents direct internet access from Dataproc cluster nodes.
- Forces all egress traffic through a manageable and auditable SWP.
- Provides the necessary connectivity for Dataproc to function and for us to build images later.
- Supports Secure Boot for all VMs.
The Toolkit: LLC-Technologies-Collier/cloud-dataproc
To make setting up and tearing down these complex network
environments repeatable and consistent, we’ve developed a set of bash
scripts within the gcloud directory of the
cloud-dataproc repository. These scripts handle the
creation of VPCs, subnets, firewall rules, service accounts, and the
Secure Web Proxy itself.
Key Script: gcloud/bin/create-dpgce-private
This script is the cornerstone for creating the private, proxied
environment. It automates:
- VPC and Subnet creation (for the cluster, SWP, and management).
- Setup of Certificate Authority Service and Certificate Manager for SWP TLS interception.
- Deployment of the SWP Gateway instance.
- Configuration of a Gateway Security Policy to control egress.
- Creation of necessary firewall rules.

Result: Cluster nodes in this VPC have NO default internet route and MUST use the SWP.
Configuration via env.json
We use a single env.json file to drive the
configuration. This file will also be used by the
custom-images scripts in Part 3. This env.json
should reside in your custom-images repository clone, and
you’ll symlink it into the cloud-dataproc/gcloud
directory.
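As a concrete illustration, here is a minimal env.json of the kind these scripts consume. Only the BUCKET field is confirmed by this post (the gsutil command later reads it with jq); PROJECT_ID and REGION are assumed field names shown purely for shape.

```shell
# Minimal illustrative env.json. BUCKET is read elsewhere in this post via
# `jq -r .BUCKET env.json`; PROJECT_ID and REGION are assumed field names.
cat > env.json <<'EOF'
{
  "PROJECT_ID": "my-gcp-project",
  "REGION": "us-central1",
  "BUCKET": "my-dataproc-images-bucket"
}
EOF
# Verify it parses and the bucket field is present:
jq -r .BUCKET env.json
```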
Running the Setup:
# Assuming you have cloud-dataproc and custom-images cloned side-by-side
# And your env.json is in the custom-images root
cd cloud-dataproc/gcloud
# Symlink to the env.json in custom-images
ln -sf ../../custom-images/env.json env.json
# Run the creation script, but don't create a cluster yet
bash bin/create-dpgce-private --no-create-cluster
cd ../../custom-images
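If you want to confirm the relative symlink resolves before running the script, the side-by-side layout can be exercised in a scratch directory first (the /tmp path here is purely illustrative):

```shell
# Recreate the side-by-side clone layout in a scratch directory and verify
# that a relative symlink like the one above dereferences to valid JSON.
mkdir -p /tmp/layout-check/custom-images /tmp/layout-check/cloud-dataproc/gcloud
echo '{"BUCKET":"example-bucket"}' > /tmp/layout-check/custom-images/env.json
cd /tmp/layout-check/cloud-dataproc/gcloud
ln -sf ../../custom-images/env.json env.json
# The symlink should resolve through to the custom-images copy:
jq -r .BUCKET env.json
```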
Node Configuration: The Metadata Startup Script for Runtime
For the Dataproc cluster nodes to function correctly in this proxied
environment, they need to be configured to use the SWP on boot. We
achieve this using a GCE metadata startup script.
The script startup_script/gce-proxy-setup.sh (from the
custom-images repository) is designed to be run on each
cluster node at boot. It reads metadata like http-proxy and
http-proxy-pem-uri (which our cluster creation scripts in
Part 4 will pass) to configure the OS environment, package managers, and
other tools to use the SWP.
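The exact contents of gce-proxy-setup.sh live in the custom-images repository; the following is only a rough sketch of what such a script typically does. The helper name, fallback behavior, and apt configuration path are assumptions for illustration, not the actual implementation.

```shell
#!/usr/bin/env bash
# Sketch only: reads the http-proxy instance attribute (as named in this post)
# and exports it for the OS environment and apt. The real gce-proxy-setup.sh
# may differ in structure and cover more tools.
get_attr() {
  curl -sf --max-time 2 -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/$1" \
    || echo ""
}
PROXY="$(get_attr http-proxy)"
if [ -n "${PROXY}" ]; then
  export http_proxy="${PROXY}" https_proxy="${PROXY}"
  export no_proxy="metadata.google.internal,localhost,127.0.0.1"
  # Package managers need their own proxy settings; apt as an example:
  printf 'Acquire::http::Proxy "%s";\nAcquire::https::Proxy "%s";\n' \
    "${PROXY}" "${PROXY}" > /etc/apt/apt.conf.d/95-swp-proxy
fi
```

Outside GCE (or when the attribute is unset), get_attr returns an empty string and the script is a no-op, which keeps it safe to source unconditionally at boot.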
Upload this script to your GCS bucket:
# Run from the custom-images repository root
gsutil cp startup_script/gce-proxy-setup.sh gs://$(jq -r .BUCKET env.json)/custom-image-deps/
This script is essential for the runtime behavior of the
cluster nodes.
Conclusion of Part 1
With the cloud-dataproc scripts, we’ve laid the
groundwork by provisioning a secure VPC with controlled egress through
an SWP. We’ve also prepared the essential node-level proxy configuration
script (gce-proxy-setup.sh) in GCS, ready to be used by our
clusters.
Stay tuned for Part 2, where we’ll dive into the
install_gpu_driver.sh initialization action from the
LLC-Technologies-Collier/initialization-actions repository
(branch gpu-202601) and how it’s been adapted to install
all GPU-related software through the proxy during the image build
process.