Dataproc GPUs, Secure Boot, & Proxies

Part 1: Building a Secure Network Foundation for Dataproc with GPUs & SWP

Welcome to the first post in our series on running GPU-accelerated
Dataproc workloads in secure, enterprise-grade environments. Many
organizations need to operate within VPCs that have no direct internet
egress, instead routing all traffic through a Secure Web Proxy (SWP).
Additionally, security mandates often require the use of Shielded VMs
with Secure Boot enabled. This series will show you how to meet these
requirements for your Dataproc GPU clusters.

In this post, we’ll focus on laying the network foundation using
tools from the GoogleCloudDataproc/cloud-dataproc
repository.

The Challenge: Network Isolation & Control

Before we can even think about custom images or GPU drivers, we need
a network environment that:

  1. Prevents direct internet access from Dataproc cluster nodes.
  2. Forces all egress traffic through a manageable and auditable SWP.
  3. Provides the necessary connectivity for Dataproc to function and for
     us to build images.
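
If you want to confirm the first and third properties on a VPC before (or
after) running the setup scripts, a couple of read-only gcloud commands are
enough. This is a minimal sketch; the network and subnet names are
placeholders, and Private Google Access is our assumption about how nodes
without public IPs reach Google APIs.

# Hypothetical names -- substitute your own.
NETWORK=dataproc-private-vpc
SUBNET=main-subnet
REGION=us-central1

# 1. No default internet route: look for a 0.0.0.0/0 route whose next hop
#    is the default internet gateway. An empty result is what we want.
gcloud compute routes list \
  --filter="network:${NETWORK} AND destRange=0.0.0.0/0 AND nextHopGateway:default-internet-gateway"

# 2. Private Google Access on the cluster subnet lets nodes without public
#    IPs still reach Google APIs. Expect "True" here.
gcloud compute networks subnets describe "${SUBNET}" \
  --region="${REGION}" \
  --format="value(privateIpGoogleAccess)"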

The Toolkit: GoogleCloudDataproc/cloud-dataproc

To make setting up and tearing down these complex network
environments repeatable and consistent, we’ve developed a set of bash
scripts within the gcloud directory of the
cloud-dataproc repository. These scripts handle the
creation of VPCs, subnets, firewall rules, service accounts, and the
Secure Web Proxy itself.
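
If you are following along, both repositories are public on GitHub. The
later steps in this post assume they are cloned side by side:

# Clone the two repositories next to each other; later steps assume this layout.
git clone https://github.com/GoogleCloudDataproc/cloud-dataproc.git
git clone https://github.com/GoogleCloudDataproc/custom-images.git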

Key Script: bin/create-dpgce-private

This script is the cornerstone for creating the private, proxied
environment. It automates:

  • VPC and Subnet creation (for the cluster, SWP, and management).
  • Setup of Certificate Authority Service and Certificate Manager for
    SWP TLS interception.
  • Deployment of the SWP Gateway instance.
  • Configuration of a Gateway Security Policy to control egress.
  • Creation of necessary firewall rules.

The result: cluster nodes in this VPC have NO default internet route and
MUST use the SWP.
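
To make that "SWP only" posture concrete, here is a minimal sketch of the
kind of egress firewall rules involved. The rule names, priorities, and
ranges below are illustrative placeholders, not the exact rules that
bin/create-dpgce-private creates.

# Illustrative only -- not the exact rules created by the script.
NETWORK=dataproc-private-vpc
SWP_RANGE=10.12.0.0/24
CLUSTER_RANGE=10.10.0.0/24

# Deny all egress by default (lowest precedence)...
gcloud compute firewall-rules create deny-all-egress \
  --network="${NETWORK}" --direction=EGRESS --action=DENY \
  --rules=all --destination-ranges=0.0.0.0/0 --priority=65000

# ...allow egress to the SWP so nodes can reach the proxy...
gcloud compute firewall-rules create allow-egress-to-swp \
  --network="${NETWORK}" --direction=EGRESS --action=ALLOW \
  --rules=tcp:3128 --destination-ranges="${SWP_RANGE}" --priority=1000

# ...and allow egress within the cluster range so nodes can talk to each other.
gcloud compute firewall-rules create allow-egress-internal \
  --network="${NETWORK}" --direction=EGRESS --action=ALLOW \
  --rules=all --destination-ranges="${CLUSTER_RANGE}" --priority=1000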

Configuration via env.json

While this script is in the cloud-dataproc repo, we’ve
designed it to be driven by a single env.json file that
will also be used by the custom-images scripts in Part 2.
This env.json should reside in your
custom-images repository clone.

Example env.json Snippet for Networking:

{
  "PROJECT_ID": "YOUR_GCP_PROJECT_ID",
  "REGION": "YOUR_GCP_REGION",
  "ZONE": "YOUR_GCP_ZONE",
  "SUBNET": "main-subnet",
  "BUCKET": "YOUR_GCS_BUCKET",
  "TEMP_BUCKET": "YOUR_GCS_TEMP_BUCKET",
  "RANGE": "10.10.0.0/24",
  "PRIVATE_RANGE": "10.11.0.0/24",
  "SWP_RANGE": "10.12.0.0/24",
  "SWP_IP": "10.11.0.250",
  "SWP_PORT": "3128",
  "SWP_HOSTNAME": "swp.your.domain.com",
  "PROXY_CERT_GCS_PATH": "gs://YOUR_PROXY_CERT_BUCKET/proxy.cer"
}
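
Because both repositories are driven by this same file, it is worth
sanity-checking it before running anything. A small jq loop like the
following (an illustrative helper, not part of either repo) catches missing
keys early:

# Quick sanity check of env.json -- illustrative helper, not part of the repos.
for key in PROJECT_ID REGION ZONE SUBNET BUCKET TEMP_BUCKET \
           RANGE PRIVATE_RANGE SWP_RANGE SWP_IP SWP_PORT \
           SWP_HOSTNAME PROXY_CERT_GCS_PATH; do
  value=$(jq -r --arg k "$key" '.[$k] // empty' env.json)
  if [[ -z "$value" ]]; then
    echo "env.json is missing a value for ${key}" >&2
  fi
done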

Running the Setup:

# Assuming you have cloud-dataproc and custom-images cloned side-by-side
cd cloud-dataproc/gcloud
# Symlink to the env.json in custom-images
ln -sf ../../custom-images/env.json env.json
# Run the creation script, but don't create a cluster yet
bash bin/create-dpgce-private --no-create-cluster
cd ../../custom-images
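
Before moving on, it is worth confirming that the SWP actually answers from
inside the VPC. From a management VM in the new network, something like the
following should succeed, assuming the destination is allowed by the gateway
security policy and the interception CA (PROXY_CERT_GCS_PATH) has been
copied locally as proxy.cer. The values below are placeholders from the
env.json example above.

# Run from a VM inside the new VPC. proxy.cer is the CA used for SWP TLS
# interception, copied down from PROXY_CERT_GCS_PATH.
SWP_IP=10.11.0.250   # from env.json
SWP_PORT=3128        # from env.json

curl --proxy "http://${SWP_IP}:${SWP_PORT}" \
     --cacert proxy.cer \
     https://www.googleapis.com/ -o /dev/null -w '%{http_code}\n'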

Node Configuration: The Metadata Startup Script

One of the key lessons from developing this solution was that the proxy
settings must be applied to the virtual machines very early in the boot
process, before the Dataproc agent even starts. To achieve this, we use a
GCE metadata startup script.

The script startup_script/gce-proxy-setup.sh (from the
custom-images repository) is designed to be run on each
cluster node at boot. It reads metadata like http-proxy and
http-proxy-pem-uri to configure the OS environment, package
managers, and other tools to use the SWP.
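
To give a feel for what such a script does, here is a simplified sketch of
the core steps -- read the proxy metadata from the metadata server, export
the standard proxy environment variables, point the package manager at the
proxy, and trust the interception CA. This is not the actual contents of
gce-proxy-setup.sh; paths and details are our assumptions for a
Debian-based image.

#!/usr/bin/env bash
# Simplified sketch of a proxy-setup startup script -- not the actual
# contents of startup_script/gce-proxy-setup.sh.
set -euo pipefail

MD="http://metadata.google.internal/computeMetadata/v1/instance/attributes"
HDR="Metadata-Flavor: Google"

# Read the proxy URL and CA cert location from instance metadata.
HTTP_PROXY_URL=$(curl -fs -H "${HDR}" "${MD}/http-proxy")
PROXY_PEM_URI=$(curl -fs -H "${HDR}" "${MD}/http-proxy-pem-uri" || true)

# Export the standard proxy variables for everything that sources /etc/profile.
cat > /etc/profile.d/proxy.sh <<EOF
export http_proxy="${HTTP_PROXY_URL}"
export https_proxy="${HTTP_PROXY_URL}"
export no_proxy="localhost,127.0.0.1,metadata.google.internal,169.254.169.254"
EOF

# Point apt at the proxy as well (Debian/Ubuntu images).
cat > /etc/apt/apt.conf.d/95proxy <<EOF
Acquire::http::Proxy "${HTTP_PROXY_URL}";
Acquire::https::Proxy "${HTTP_PROXY_URL}";
EOF

# Trust the SWP's interception CA if one was provided.
if [[ -n "${PROXY_PEM_URI}" ]]; then
  gsutil cp "${PROXY_PEM_URI}" /usr/local/share/ca-certificates/swp-proxy.crt
  update-ca-certificates
fi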

Upload this script to your GCS bucket:

# Run from the custom-images repository root
gsutil cp startup_script/gce-proxy-setup.sh gs://$(jq -r .BUCKET env.json)/custom-image-deps/

We will point the startup-script-url metadata key at this script (via the
--metadata flag) when creating the Dataproc cluster in Part 4, as previewed
below.
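
As a rough preview of Part 4 (the cluster name is a placeholder and the
real command carries many more options), the relevant flags will look
something like this:

# Rough preview of the Part 4 cluster creation -- values are placeholders.
METADATA="startup-script-url=gs://$(jq -r .BUCKET env.json)/custom-image-deps/gce-proxy-setup.sh"
METADATA+=",http-proxy=http://$(jq -r .SWP_IP env.json):$(jq -r .SWP_PORT env.json)"
METADATA+=",http-proxy-pem-uri=$(jq -r .PROXY_CERT_GCS_PATH env.json)"

gcloud dataproc clusters create my-gpu-cluster \
  --project="$(jq -r .PROJECT_ID env.json)" \
  --region="$(jq -r .REGION env.json)" \
  --subnet="$(jq -r .SUBNET env.json)" \
  --no-address \
  --metadata="${METADATA}"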

Conclusion of Part 1

With the cloud-dataproc scripts, we’ve laid the
groundwork by provisioning a secure VPC with controlled egress through
an SWP. We’ve also prepared the essential node-level proxy configuration
script in GCS.

Stay tuned for Part 2, where we’ll dive into the
install_gpu_driver.sh initialization action and how it’s
been adapted to thrive in this proxied world.

