Overview¶
The documentation in this section illustrates officially HPE supported procedures to perform maintenance tasks on the CSI driver outside the scope of deploying and uninstalling the driver.
- Overview
- Migrate Encrypted Volumes
- Upgrade NFS Servers
- Manual Node Configuration
- Expose NFS Services Outside of the Kubernetes Cluster
Migrate Encrypted Volumes¶
Persistent volumes created with v2.1.1 or below using volume encryption, the CSI driver use LUKS2 (WikiPedia: Linux Unified Key Setup) and can't expand the PersistentVolumeClaim
. With v2.2.0 and above, LUKS1 is used and the CSI driver is capable of expanding the PVC
.
This procedure migrate (copy) data from LUKS2 to LUKS1 PVCs to allow expansion of the volume.
Note
It's not a limitation of LUKS2 to not allow expansion but rather how the CSI driver interact with the host.
Assumptions¶
These are the assumptions made throughout this procedure.
- Data to be migrated has a good backup to restore to, not just a snapshot.
- HPE CSI Driver for Kubernetes v2.3.0 or later installed.
- Worker nodes with access to the Quay registry and SCOD.
- Access to the commands
kubectl
,curl
,jq
andyq
. - Cluster privileges to manipulate
PersistentVolumes
. - None of the commands executed should return errors or have non-zero exit codes.
- Only
ReadWriteOnce
PVCs
are covered. - No custom
PVC
annotations.
Tip
There are many different ways to copy PVCs
. These steps outlines and uses one particular method developed and tested by HPE and similar workflows may be applied with other tools and procedures.
Prepare the Workload and Persistent Volume Claims¶
First, identify the PersistentVolume
to migrate from and set shell variables.
export OLD_SRC_PVC=<insert your existing PVC name here>
export OLD_SRC_PV=$(kubectl get pvc -o json | \
jq -r ".items[] | \
select(.metadata.name | \
test(\"${OLD_SRC_PVC}\"))".spec.volumeName)
Important
Ensure these shell variables are set at all times.
In order to copy data out of a PVC
, the running workload needs to be disassociated with the PVC
. It's not possible to scale the replicas to zero, the exception being ReadWriteMany
PVCs
which could lead to data inconsistency problems. These procedures assumes application consistency by having the workload shut down.
It's out of scope for this procedure to demonstrate how to shut down a particular workload. Ensure there are no volumeattachments
associated with the PersistentVolume
.
kubectl get volumeattachment -o json | \
jq -r ".items[] | \
select(.spec.source.persistentVolumeName | \
test(\"${OLD_SRC_PV}\"))".spec.source
Tip
For large volumeMode: Filesystem
PVCs
where copying data may take days, it's recommended to use the Optional Workflow with Filesystem Persistent Volume Claims that utilizes the PVC
dataSource
capability.
Create a new Persistent Volume Claim and Update Retain Policies¶
Create a new PVC
named "new-pvc" with enough space to host the data from the old source PVC
.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: new-pvc
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 32Gi
volumeMode: Filesystem
Important
If the source PVC
is a raw block volume, ensure volumeMode: Block
is set on the new PVC
.
Edit and set the shell variables for the newly created PVC
.
export NEW_DST_PVC_SIZE=32Gi
export NEW_DST_PVC_VOLMODE=Filesystem
export NEW_DST_PVC=new-pvc
export NEW_DST_PV=$(kubectl get pvc -o json | \
jq -r ".items[] | \
select(.metadata.name | \
test(\"${NEW_DST_PVC}\"))".spec.volumeName)
Hint
The PVC
name "new-pvc" is a placeholder name. When the procedure is done, the PVC
will have its original name restored.
Important Validation Steps¶
At this point, there should be six shell variables declared. Example:
$ env | grep _PV
NEW_DST_PVC_SIZE=32Gi
NEW_DST_PVC=new-pvc
OLD_SRC_PVC=old-pvc <-- This should be the original name of the PVC
NEW_DST_PVC_VOLMODE=Filesystem
NEW_DST_PV=pvc-ad7a05a9-c410-4c63-b997-51fb9fc473bf
OLD_SRC_PV=pvc-ca7c2f64-641d-4265-90f8-4aed888bd2c5
Regardless of the retainPolicy
set in the StorageClass
, ensure the persistentVolumeReclaimPolicy
is set to "Retain" for both PVs
.
kubectl patch pv/${OLD_SRC_PV} pv/${NEW_DST_PV} \
-p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
Data Loss Warning
It's EXTREMELY important no errors are returned from the above command. It WILL lead to data loss.
Validate the "persistentVolumeReclaimPolicy".
kubectl get pv/${OLD_SRC_PV} pv/${NEW_DST_PV} -o json | \
jq -r ".items[] | \
select(.metadata.name)".spec.persistentVolumeReclaimPolicy
Important
The above command should output nothing but two lines with the word "Retain" on it.
Copy Persistent Volume Claim and Reset¶
In this phase, the data will be copied from the original PVC
to the new PVC
with a Job
submitted to the cluster. Different tools are being used to perform the copy operation, ensure to pick the correct volumeMode
.
PVCs with volumeMode: Filesystem¶
curl -s https://scod.hpedev.io/csi_driver/examples/operations/pvc-copy-file.yaml | \
yq "( select(.spec.template.spec.volumes[] | \
select(.name == \"src-pv\") | \
.persistentVolumeClaim.claimName = \"${OLD_SRC_PVC}\")
" | kubectl apply -f-
Wait for the Job
to complete.
kubectl get job.batch/pvc-copy-file -w
Once the Job
has completed, validate exit status and log files.
kubectl get job.batch/pvc-copy-file -o jsonpath='{.status.succeeded}'
kubectl logs job.batch/pvc-copy-file
Delete the Job
from the cluster.
kubectl delete job.batch/pvc-copy-file
Proceed to restart the workload.
PVCs with volumeMode: Block¶
curl -s https://scod.hpedev.io/csi_driver/examples/operations/pvc-copy-block.yaml | \
yq "( select(.spec.template.spec.volumes[] | \
select(.name == \"src-pv\") | \
.persistentVolumeClaim.claimName = \"${OLD_SRC_PVC}\")
" | kubectl apply -f-
Wait for the Job
to complete.
kubectl get job.batch/pvc-copy-block -w
Hint
Data is copied block for block, verbatim, regardless of how much application data is stored in the block devices.
Once the Job
has completed, validate exit status and log files.
kubectl get job.batch/pvc-copy-block -o jsonpath='{.status.succeeded}'
kubectl logs job.batch/pvc-copy-block
Delete the Job
from the cluster.
kubectl delete job.batch/pvc-copy-block
Proceed to restart the workload.
Restart the Workload¶
This step requires both the old source PVC
and the new destination PVC
to be deleted. Once again, ensure the correct persistentVolumeReclaimPolicy
is set on the PVs
.
kubectl get pv/${OLD_SRC_PV} pv/${NEW_DST_PV} -o json | \
jq -r ".items[] | \
select(.metadata.name)".spec.persistentVolumeReclaimPolicy
Important
The above command should output nothing but two lines with the word "Retain" on it, if not revisit Important Validation Steps to apply the policy and ensure environment variables are set correctly.
Delete the PVCs
.
kubectl delete pvc/${OLD_SRC_PVC} pvc/${NEW_DST_PVC}
Next, allow the new PV
to be reclaimed.
kubectl patch pv ${NEW_DST_PV} -p '{"spec":{"claimRef": null }}'
Next, create a PVC
with the old source name and ensure it matches the size of the new destination PVC
.
curl -s https://scod.hpedev.io/csi_driver/examples/operations/pvc-copy.yaml | \
yq ".spec.volumeName = \"${NEW_DST_PV}\" | \
.metadata.name = \"${OLD_SRC_PVC}\" | \
.spec.volumeMode = \"${NEW_DST_PVC_VOLMODE}\" | \
.spec.resources.requests.storage = \"${NEW_DST_PVC_SIZE}\" \
" | kubectl apply -f-
Verify the new PVC
is "Bound" to the correct PV
.
kubectl get pvc/${OLD_SRC_PVC} -o json | \
jq -r ". | \
select(.spec.volumeName == \"${NEW_DST_PV}\").metadata.name"
If the command is successful, it should output your original PVC
name.
At this point the original workload should be deployed, verified and resumed.
Optionally, the old source PV
may be removed.
kubectl delete pv/${OLD_SRC_PV}
Optional Workflow with Filesystem Persistent Volume Claims¶
If there's a lot of content (millions of files, terabytes of data) that needs to be transferred in a volumeMode: Filesystem
PVC
it's recommended to transfer content incrementally. This is achieved by substituting the "old-pvc" with a dataSource
clone of the running workload and perform the copy from the clone onto the "new-pvc".
After the first transfer completes, the copy job may be recreated as many times as needed with a fresh clone of "old-pvc" until the downtime window has shrunk to an acceptable duration. For the final transfer, the actual source PVC
will be used instead of the clone.
This is an example PVC
.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: clone-of-pvc
spec:
dataSource:
name: this-is-the-current-prod-pvc
kind: PersistentVolumeClaim
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 32Gi
Tip
The capacity of the dataSource
clone must match the original PVC
.
Enabling and setting up the CSI snapshotter and related CRDs
is not necessary but it's recommended to be familiar with using CSI snapshots.
Upgrade NFS Servers¶
In the event the CSI driver contains updates to the NFS Server Provisioner, any running NFS server needs to be updated manually.
Upgrade to v2.5.0¶
Any prior deployed NFS servers may be upgraded to v2.5.0.
Upgrade to v2.4.2¶
No changes to NFS Server Provisioner image between v2.4.1 and v2.4.2.
Upgrade to v2.4.1¶
Any prior deployed NFS servers may be upgraded to v2.4.1.
Important
With v2.4.0 and onwards the NFS servers are deployed with default resource limits and in v2.5.0 resource requests were added. Those won't be applied on running NFS servers, only new ones.
Assumptions¶
- HPE CSI Driver or Operator v2.4.1 installed.
- All running NFS servers are running in the "hpe-nfs"
Namespace
. - Worker nodes with access to the Quay registry and SCOD.
- Access to the commands
kubectl
,yq
andcurl
. - Cluster privileges to manipulate resources in the "hpe-nfs"
Namespace
. - None of the commands executed should return errors or have non-zero exit codes.
Seealso
If NFS Deployments
are scattered across Namespaces
, use the Validation steps to find where they reside.
Patch Running NFS Servers¶
When patching the NFS Deployments
, the Pods
will restart and cause a pause in I/O for the NFS clients with active mounts. The clients will recover gracefully once the NFS Pod
is running again.
Patch all NFS Deployments
with the following.
curl -s https://scod.hpedev.io/csi_driver/examples/operations/patch-nfs-server-2.5.0.yaml | \
kubectl patch -n hpe-nfs \
$(kubectl get deploy -n hpe-nfs -o name) \
--patch-file=/dev/stdin
Tip
If it's desired to patch one NFS Deployment
at a time, replace the shell substituion with a Deployment
name.
Validation¶
This command will list all "hpe-nfs" Deployments
across the entire cluster. Each Deployment
should be using v3.0.5 of the "nfs-provisioner" image after the uprade is complete.
kubectl get deploy -A -o yaml | \
yq -r '.items[] | [] + { "Namespace": select(.spec.template.spec.containers[].name == "hpe-nfs").metadata.namespace, "Deployment": select(.spec.template.spec.containers[].name == "hpe-nfs").metadata.name, "Image": select(.spec.template.spec.containers[].name == "hpe-nfs").spec.template.spec.containers[].image }'
Note
The above line is very long.
Manual Node Configuration¶
With the release of HPE CSI Driver v2.4.0 it's possible to completely disable the node conformance and node configuration performed by the CSI node driver at startup. This transfers the responsibilty from the HPE CSI Driver to the Kubernetes cluster administrator to ensure worker nodes boot with a supported configuration.
Important
This feature is mainly for users who require 100% control of the worker nodes.
Stages of Initialization¶
There are two stages of initialization the administrator can control through parameters in the Helm chart.
disableNodeConformance¶
The node conformance runs with the entrypoint of the node driver container. The conformance inserts and runs a systemd service on the node that installs all required packages on the node to allow nodes to attach block storage devices and mount NFS exports. It starts all the required services and configure an important udev rule on the worker node.
This flag was intended to allow administrators to run the CSI driver on nodes with an unsupported or unconfigured package manager.
If node conformance needs to be disabled for any reason, these packages and services needs to be installed and running prior to installing the HPE CSI Driver:
- iSCSI (not necessary when using FC)
- Multipath
- XFS programs/utilities
- NFSv4 client
Package names and services vary greatly between different Linux distributions and it's the system administrator's duty to ensure these are available to the HPE CSI Driver.
disableNodeConfiguration¶
When disabling node configuration the CSI node driver will not touch the node at all. Besides indirectly disabling node conformance, all attempts to write configuration files or manipulate services during runtime are disabled.
Mandatory Configuration¶
These steps are REQUIRED for disabling either node configuration or conformance.
On each current and future worker node in the cluster:
# Don't let udev automatically scan targets(all luns) on Unit Attention.
# This will prevent udev scanning devices which we are attempting to remove.
if [ -f /lib/udev/rules.d/90-scsi-ua.rules ]; then
sed -i 's/^[^#]*scan-scsi-target/#&/' /lib/udev/rules.d/90-scsi-ua.rules
udevadm control --reload-rules
fi
iSCSI Configuration¶
Skip this step if only Fibre Channel is being used. This step is only required when node configuration is disabled.
iscsid.conf¶
This example is taken from a Rocky Linux 9.2 node with the HPE parameters applied. Certain parameters may differ for other distributions of either iSCSI or the host OS.
Note
The location of this file varies between Linux and iSCSI distributions.
Ensure iscsid
is stopped.
systemctl stop iscsid
Download: /etc/iscsi/iscsid.conf
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = manual
node.leading_login = No
node.session.timeo.replacement_timeout = 10
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 10
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 512
node.session.queue_depth = 256
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.reopen_max = 0
node.session.iscsi.FastAbort = Yes
node.session.scan = auto
Pro tip!
When nodes are provisioned from some sort of templating system with iSCSI pre-installed, it's notoriously common that nodes are provisioned with identical IQNs. This will lead to device attachment problems that aren't obvious to the user. Make sure each node has a unique IQN.
Ensure iscsid
is running and enabled:
systemctl enable --now iscsid
Seealso
Some Linux distributions requires the iscsi_tcp
kernel module to be loaded. Where kernel modules are loaded varies between Linux distributions.
Multipath Configuration¶
This step is only required when node configuration is disabled.
multipath.conf¶
The defaults section of the configuration file is merely a preference, make sure to leave the device and blacklist stanzas intact when potentially adding more entries from foreign devices.
Note
The location of this file varies between Linux and iSCSI distributions.
Ensure multipathd
is stopped.
systemctl stop multipathd
Download: /etc/multipath.conf
defaults {
user_friendly_names yes
find_multipaths no
uxsock_timeout 10000
}
blacklist {
devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
devnode "^hd[a-z]"
device {
product ".*"
vendor ".*"
}
}
blacklist_exceptions {
property "(ID_WWN|SCSI_IDENT_.*|ID_SERIAL)"
device {
vendor "Nimble"
product "Server"
}
device {
product "VV"
vendor "3PARdata"
}
device {
vendor "TrueNAS"
product "iSCSI Disk"
}
device {
vendor "FreeNAS"
product "iSCSI Disk"
}
}
devices {
device {
product "Server"
rr_min_io_rq 1
dev_loss_tmo infinity
path_checker tur
rr_weight uniform
no_path_retry 30
path_selector "service-time 0"
failback immediate
fast_io_fail_tmo 5
vendor "Nimble"
hardware_handler "1 alua"
path_grouping_policy group_by_prio
prio alua
}
device {
path_grouping_policy group_by_prio
path_checker tur
rr_weight "uniform"
prio alua
failback immediate
hardware_handler "1 alua"
no_path_retry 18
fast_io_fail_tmo 10
path_selector "round-robin 0"
vendor "3PARdata"
dev_loss_tmo infinity
detect_prio yes
features "0"
rr_min_io_rq 1
product "VV"
}
device {
path_selector "queue-length 0"
rr_weight priorities
uid_attribute ID_SERIAL
vendor "TrueNAS"
product "iSCSI Disk"
path_grouping_policy group_by_prio
}
device {
path_selector "queue-length 0"
hardware_handler "1 alua"
rr_weight priorities
uid_attribute ID_SERIAL
vendor "FreeNAS"
product "iSCSI Disk"
path_grouping_policy group_by_prio
}
}
Ensure multipathd
is running and enabled:
systemctl enable --now multipathd
Important Considerations¶
While both disabling conformance and configuration parameters lends itself to a more predictable behaviour when deploying nodes from templates with less runtime configuration, it's still not a complete solution for having immutable nodes. The CSI node driver creates a unique identity for the node and stores it in /etc/hpe-storage/node.gob
. This file must persist across reboots and redeployments of the node OS image. Immutable Linux distributions such as CoreOS persist the /etc
directory, some don't.
Expose NFS Services Outside of the Kubernetes Cluster¶
In certain situations it's practical to expose the NFS exports outside the Kubernetes cluster to allow external applications to access data as part of an ETL (Extract, Transform, Load) pipeline or similar.
Since this is an untested feature with questionable security standards, HPE does not recommend using this facility in production at this time. Reach out to your HPE account representative if this is a critical feature for your workloads.
Danger
The exports on the NFS servers does not have any network Access Control Lists (ACL) without root squash. Anyone with an NFS client that can reach the load balancer IP address have full access to the filesystem.
From ClusterIP to LoadBalancer¶
The NFS server Service
must be transformed into a "LoadBalancer".
In this example we'll assume a "RWX" PersistentVolumeClaim
named "my-pvc-1" and NFS resources deployed in the default Namespace
, "hpe-nfs".
Retrieve NFS UUID
export UUID=$(kubectl get pvc my-pvc-1 -o jsonpath='{.spec.volumeName}{"\n"}' | awk -Fpvc- '{print $2}')
Patch the NFS Service
:
kubectl patch -n hpe-nfs svc/hpe-nfs-${UUID} -p '{"spec":{"type": "LoadBalancer"}}'
The Service
will be assigned an external IP address by the load balancer deployed in the cluster. If there is no load balancer deployed, a MetalLB example is provided below.
MetalLB Example¶
Deploying MetalLB is outside the scope of this document. In this example, MetalLB was deployed on OpenShift 4.16 (Kubernetes v1.29) using the Operator provided by Red Hat in the "metallb-system" Namespace
.
Determine the IP address range that will be assigned to the load balancers. In this example, 192.168.1.40 to 192.168.1.60 is being used. Note that the worker nodes in this cluster already have reachable IP addresses in the 192.168.1.0/24 network, which is a requirement.
Create the MetalLB instances, IP address pool and Layer 2 advertisement.
---
apiVersion: metallb.io/v1beta1
kind: MetalLB
metadata:
name: metallb
namespace: metallb-system
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
namespace: metallb-system
name: hpe-nfs-servers
spec:
protocol: layer2
addresses:
- 192.168.1.40-192.168.1.60
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
name: l2advertisement
namespace: metallb-system
spec:
ipAddressPools:
- hpe-nfs-servers
Shortly, the external IP address of the NFS Service
patched in the previous steps should have an IP address assigned.
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
hpe-nfs-UUID LoadBalancer 172.30.217.203 192.168.1.40 <long list of ports>
Mount the NFS Server from an NFS Client¶
Mounting the NFS export externally is now possible.
As root:
mount -t nfs4 192.168.1.40:/export /mnt
Note
If the NFS server is rescheduled in the Kubernetes cluster, the load balancer IP address follows, and the client will recover and resume IO after a few minutes.