Kubernetes on Google Cloud - Details
Here are key technical details about the Metaflow deployment on Google Cloud.
Architecture Diagram
GCP Resource List
Category | Resources | Purpose |
---|---|---|
Access Control | Service account | An identity with all the permissions required to run Metaflow workloads, whether they run locally against Google Cloud Storage or entirely within the GKE cluster. |
Access Control | Service account key | Used by Metaflow to authenticate as the service account above. Note: the key is needed for local runs and for Metaflow logic that executes before workload tasks are submitted to GKE. It is not required for GCP access from within a GKE pod. |
Access Control | Role Assignments | Grants the service account above sufficient access to: Google Cloud Storage, GKE, and Cloud SQL (PostgreSQL). For specific details and conditions tied to these role assignments, please refer to the source code. |
Networking | Virtual network | Top-level private virtual network to house all Metaflow-related GCP resources. |
Networking | Subnet | Houses the PostgreSQL DB. |
Storage | Google Cloud Storage bucket | Metaflow artifacts are stored here. |
Kubernetes | GKE cluster | The cluster has built-in compute node autoscaling and serves two purposes. First, Metaflow services run on it. Second, compute tasks from running flows run as pods in it. |
Database | Cloud SQL instance | This is a PostgreSQL DB instance for indexing Metaflow run metadata. |
Required GCP Permissions for Deployment
The required permissions can be described by the following custom role (output of gcloud iam roles describe):
description: <DESCRIPTION>
includedPermissions:
- cloudsql.backupRuns.create
- cloudsql.backupRuns.delete
- cloudsql.backupRuns.get
- cloudsql.backupRuns.list
- cloudsql.databases.create
- cloudsql.databases.delete
- cloudsql.databases.get
- cloudsql.databases.list
- cloudsql.databases.update
- cloudsql.instances.addServerCa
- cloudsql.instances.clone
- cloudsql.instances.connect
- cloudsql.instances.create
- cloudsql.instances.createTagBinding
- cloudsql.instances.delete
- cloudsql.instances.deleteTagBinding
- cloudsql.instances.demoteMaster
- cloudsql.instances.export
- cloudsql.instances.failover
- cloudsql.instances.get
- cloudsql.instances.import
- cloudsql.instances.list
- cloudsql.instances.listEffectiveTags
- cloudsql.instances.listServerCas
- cloudsql.instances.listTagBindings
- cloudsql.instances.login
- cloudsql.instances.promoteReplica
- cloudsql.instances.resetSslConfig
- cloudsql.instances.restart
- cloudsql.instances.restoreBackup
- cloudsql.instances.rotateServerCa
- cloudsql.instances.startReplica
- cloudsql.instances.stopReplica
- cloudsql.instances.truncateLog
- cloudsql.instances.update
- cloudsql.sslCerts.create
- cloudsql.sslCerts.delete
- cloudsql.sslCerts.get
- cloudsql.sslCerts.list
- cloudsql.users.create
- cloudsql.users.delete
- cloudsql.users.get
- cloudsql.users.list
- cloudsql.users.update
- compute.globalAddresses.createInternal
- compute.globalAddresses.deleteInternal
- compute.globalAddresses.get
- compute.instanceGroupManagers.get
- compute.networks.create
- compute.networks.delete
- compute.networks.get
- compute.networks.removePeering
- compute.networks.updatePolicy
- compute.networks.use
- compute.subnetworks.create
- compute.subnetworks.delete
- compute.subnetworks.get
- container.clusterRoleBindings.create
- container.clusterRoleBindings.delete
- container.clusterRoleBindings.get
- container.clusterRoleBindings.list
- container.clusterRoleBindings.update
- container.clusterRoles.bind
- container.clusterRoles.create
- container.clusterRoles.delete
- container.clusterRoles.escalate
- container.clusterRoles.get
- container.clusterRoles.list
- container.clusterRoles.update
- container.clusters.create
- container.clusters.delete
- container.clusters.get
- container.configMaps.create
- container.configMaps.delete
- container.configMaps.get
- container.configMaps.list
- container.configMaps.update
- container.customResourceDefinitions.create
- container.customResourceDefinitions.delete
- container.customResourceDefinitions.get
- container.customResourceDefinitions.getStatus
- container.customResourceDefinitions.list
- container.customResourceDefinitions.update
- container.customResourceDefinitions.updateStatus
- container.deployments.create
- container.deployments.delete
- container.deployments.get
- container.deployments.getScale
- container.deployments.getStatus
- container.deployments.list
- container.deployments.rollback
- container.deployments.update
- container.deployments.updateScale
- container.deployments.updateStatus
- container.namespaces.create
- container.namespaces.delete
- container.namespaces.finalize
- container.namespaces.get
- container.namespaces.getStatus
- container.namespaces.list
- container.namespaces.update
- container.namespaces.updateStatus
- container.operations.get
- container.priorityClasses.create
- container.priorityClasses.delete
- container.priorityClasses.get
- container.priorityClasses.list
- container.priorityClasses.update
- container.roleBindings.create
- container.roleBindings.delete
- container.roleBindings.get
- container.roleBindings.list
- container.roleBindings.update
- container.roles.bind
- container.roles.create
- container.roles.delete
- container.roles.escalate
- container.roles.get
- container.roles.list
- container.roles.update
- container.secrets.create
- container.secrets.delete
- container.secrets.get
- container.secrets.list
- container.secrets.update
- container.serviceAccounts.create
- container.serviceAccounts.createToken
- container.serviceAccounts.delete
- container.serviceAccounts.get
- container.serviceAccounts.list
- container.serviceAccounts.update
- container.services.create
- container.services.delete
- container.services.get
- container.services.getStatus
- container.services.list
- container.services.proxy
- container.services.update
- container.services.updateStatus
- edgecontainer.clusters.create
- iam.serviceAccountKeys.create
- iam.serviceAccountKeys.get
- iam.serviceAccounts.actAs
- iam.serviceAccounts.create
- iam.serviceAccounts.delete
- iam.serviceAccounts.get
- iam.serviceAccounts.getIamPolicy
- iam.serviceAccounts.list
- iam.serviceAccounts.setIamPolicy
- resourcemanager.projects.get
- resourcemanager.projects.setIamPolicy
- servicenetworking.services.addPeering
- servicenetworking.services.get
- storage.buckets.create
- storage.buckets.delete
- storage.buckets.get
- storage.objects.delete
- storage.objects.list
name: projects/<PROJECT>/roles/metaflow_admin
stage: GA
title: Metaflow admin
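When reviewing a role this long, it can help to see how the permissions are distributed across GCP services. The following is a minimal sketch, not part of the Metaflow tooling: the function name `count_permissions_by_service` is hypothetical, and the sample text is an abbreviated stand-in for the full role description above.

```python
from collections import Counter


def count_permissions_by_service(role_yaml: str) -> Counter:
    """Count includedPermissions entries per service prefix (text before the first dot)."""
    counts = Counter()
    in_permissions = False
    for line in role_yaml.splitlines():
        stripped = line.strip()
        if stripped == "includedPermissions:":
            in_permissions = True
        elif in_permissions and stripped.startswith("- "):
            permission = stripped[2:].strip()
            counts[permission.split(".", 1)[0]] += 1
        elif stripped:
            # Any other non-empty line (e.g. "name:") ends the list.
            in_permissions = False
    return counts


# Abbreviated sample (assumption); paste the full role description here instead.
sample = """\
description: <DESCRIPTION>
includedPermissions:
- cloudsql.instances.create
- cloudsql.instances.delete
- container.clusters.create
- storage.buckets.create
name: projects/<PROJECT>/roles/metaflow_admin
stage: GA
title: Metaflow admin
"""

for service, count in sorted(count_permissions_by_service(sample).items()):
    print(f"{service}: {count}")
```

The parser deliberately handles only the flat list layout shown above, so it needs nothing beyond the standard library.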
Required GCP Permissions for Running Flows
Kubernetes Engine Developer Role
Note: As of Q3 2022, there is no direct way to scope this role to a specific GKE cluster.
Storage Object Admin Role
This should be granted under this IAM condition:
resource.name.startsWith("projects/_/buckets/<BUCKET_NAME>")
The bucket name can be found in the end-user output of the Terraform run. For example:
…
"METAFLOW_DATASTORE_SYSROOT_GS": "gs://ob-metaflow-storage-bucket-ci/tf-full-stack-sysroot",
…
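The bucket name, and the IAM condition built from it, can be derived from that setting. The sketch below assumes the value is a `gs://` URL like the one in the example output; the helper names `bucket_from_sysroot` and `bucket_condition` are hypothetical and exist only for illustration.

```python
from urllib.parse import urlparse


def bucket_from_sysroot(sysroot: str) -> str:
    """Return the bucket name from a gs:// datastore root."""
    parsed = urlparse(sysroot)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {sysroot!r}")
    return parsed.netloc


def bucket_condition(bucket: str) -> str:
    """Build the IAM condition expression that scopes access to one bucket."""
    return f'resource.name.startsWith("projects/_/buckets/{bucket}")'


# Value copied from the example Terraform output above.
sysroot = "gs://ob-metaflow-storage-bucket-ci/tf-full-stack-sysroot"
condition = bucket_condition(bucket_from_sysroot(sysroot))
print(condition)
```

The printed expression is the string to supply as the condition when granting the Storage Object Admin role.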
GKE services list
We deploy these services in the GKE cluster:
Metaflow
- Metadata service: supports reading and writing metadata. For example:
  - While a flow is running, it POSTs metadata to this service.
  - The Metaflow Client library calls this service to read metadata.
- UI static service: serves the web UI frontend bundle.
- UI backend: supplies the data the UI needs.
Argo Workflows
The quickstart Kubernetes manifest published by Argo Workflows spins up the following services:
kubectl get services -n argo
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
argo-server ClusterIP 10.0.26.126 <none> 2746/TCP 32m
httpbin ClusterIP 10.0.66.229 <none> 9100/TCP 32m
minio ClusterIP 10.0.173.242 <none> 9000/TCP,9001/TCP 32m
postgres ClusterIP 10.0.51.199 <none> 5432/TCP 32m
workflow-controller-metrics ClusterIP 10.0.139.237 <none> 9090/TCP 32m