Cloud Composer — GCE Persistent Disk Volume for KubernetesPodOperator
The Kubernetes executor was introduced in Apache Airflow 1.10.0. It creates a new pod for every task instance. Sometimes a task needs to read configuration files or write output data to files.
This post explains how to add a GCE Persistent Disk to the GKE cluster using the ReadOnlyMany access mode. This mode allows multiple Pods on different nodes to mount the disk for reading. If we want to use the storage in ReadWriteMany mode, then something like NFS may be the ideal solution, as GCE Persistent Disk doesn't provide that capability.
There are some restrictions when using a gcePersistentDisk:
- the nodes on which Pods are running must be GCE VMs
- those VMs need to be in the same GCE project and zone as the persistent disk
Let’s take a closer look at how we can configure it.
We start by creating a GCE persistent disk in the same zone as the Cloud Composer GKE cluster and attaching it to a GCE VM.
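For example, a sketch using the placeholder names my-persistent-disk, <instance-name> and <zone> (match the zone to your Composer environment):
# create a blank persistent disk in the same zone as the cluster
gcloud compute disks create my-persistent-disk --size=10GB --zone=<zone>
# attach the disk to an existing VM so we can format it and load data
gcloud compute instances attach-disk <instance-name> --disk=my-persistent-disk --zone=<zone>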
Once the disk is attached to the VM, we need to format and mount it. SSH into the VM and run the commands below. In this example, sdb is the device name for the new blank persistent disk.
- Format the disk
sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb
Replace sdb with the device name of the disk that you are formatting.
Create a directory that serves as the mount point for the new disk on the VM, and give read and write permissions on the disk.
sudo mkdir -p /mnt/disk # Creating mount directory
# mount the disk to the instance
sudo mount -o discard,defaults /dev/sdb /mnt/disk
# give read and write permission
sudo chmod a+w /mnt/disk
Now, to populate the mounted disk with data files (assuming the data files are on your local system), the command below uploads a local data directory to the mount path.
# from host machine upload data directory to VM mount disk path:
gcloud compute scp --zone <zone> --recurse data <user>@<instance-name>:/mnt/disk
After the data files are prepopulated on the GCE persistent disk, unmount the disk and detach it from the VM.
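For example, with the same placeholder names as before: unmount inside the VM, then detach from the host machine.
# inside the VM: unmount the disk
sudo umount /mnt/disk
# from the host machine: detach the disk from the VM
gcloud compute instances detach-disk <instance-name> --disk=my-persistent-disk --zone=<zone>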
Next, we define our PersistentVolume (PV) and PersistentVolumeClaim (PVC). To learn more about PVs and PVCs, see the brief explanation in the section at the end of this post.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-volume
spec:
  storageClassName: ""
  capacity:
    storage: 10G
  accessModes:
    - ReadOnlyMany
  claimRef:
    namespace: default
    name: my-volume-claim
  gcePersistentDisk:
    pdName: my-persistent-disk
    fsType: ext4
    readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-volume-claim
spec:
  storageClassName: ""
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 10G
kubectl apply -f gce-persistent-volume.yml
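Before wiring the claim into Airflow, we can verify that the PVC is bound to our PV (names as defined in the manifests above):
# check that STATUS is Bound for both objects
kubectl get pv my-volume
kubectl get pvc my-volume-claim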
And this is our scenario: we are asking for a PV that supports the ReadOnlyMany mode in GCP. The PersistentVolume meets the requirements we listed in the accessModes section, but the underlying GCE persistent disk also supports the ReadWriteOnce mode. Because the PV bound to the PVC supports two accessModes, we have to specify explicitly in the KubernetesPodOperator definition that we want to mount it in read-only mode; by default it is mounted in read-write mode.
import datetime

from airflow import models
from kubernetes.client import models as k8s
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

volume_mount = k8s.V1VolumeMount(
    name="my-volume",
    mount_path="/mnt/disk",
    sub_path=None,
    read_only=True,
)

volume = k8s.V1Volume(
    name="my-volume",
    persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
        claim_name="my-volume-claim",
        read_only=True,  # Pod claims the volume in read-only access mode.
    ),
)

with models.DAG(
    dag_id="my-dag",
    schedule_interval=None,
    start_date=datetime.datetime(2021, 7, 19),
) as dag:
    kubernetes_min_pod = KubernetesPodOperator(
        task_id="my_task",
        name="my_pod_name",
        cmds=[...],
        startup_timeout_seconds=300,
        arguments=[...],
        env_vars={
            "PATH_TO_DATA_DIR": "/mnt/disk/data",  # ENV will be populated in the container.
        },
        namespace=...,
        image=...,
        volumes=[volume],
        volume_mounts=[volume_mount],
    )
Persistent Volume & Persistent Volume Claim
Think of a Persistent Volume as a cluster resource for storing data: an abstract component that takes its storage from actual physical storage, like a local hard drive, an external NFS server, or cloud storage. Kubernetes itself doesn't care about the storage, so it gives us the Persistent Volume component as an interface to the actual storage, which a maintainer or administrator has to configure and take care of. There are many types of persistent volumes that Kubernetes supports. Persistent Volumes are not namespaced, which means they are accessible to the whole cluster.
Now, an application that needs to access a persistent volume has to claim that storage, and Kubernetes provides us with a component called a Persistent Volume Claim. A PVC claims a PV with a certain storage size or capacity and some additional characteristics, like access modes; whatever persistent volume matches these conditions, or satisfies this claim, will be used. We then have to use this claim in our Pod definition to access the PV: the Pod requests the volume through the PV claim.
Basically it’s like saying:
“Hey, persistent volume claim! Give me a volume that supports ReadOnlyMany mode.”
*Note:* A Persistent Volume Claim must exist in the same namespace as the Pod using the claim, whereas Persistent Volumes are not namespaced and are accessible to the whole cluster.