Musings of an Eon...

GitLab Managed App: Cert Manager

Published 2019 Dec 31 @ 05:26

For anyone using GitLab’s Managed Apps, specifically cert-manager, be advised that you need to re-install (uninstall, then install) the app if you first installed it prior to September 22, 2019. Most people who need to have probably already done so, but my cert was valid until right around Christmas, so by the time I realized it hadn’t renewed it was the holiday season. To my added frustration, the re-installation process was not smooth.

The Issue

Certs expire because the old chart’s ACME client is no longer being updated, and Let’s Encrypt will not renew certs for it. The old cert-manager Helm chart was deprecated, so its ACME client received no further updates. The new cert-manager chart that gets installed in GitLab 12.3 and beyond is under active development and has an up-to-date ACME client that can properly renew certs.

Unfortunately, because it’s an entirely new chart rather than a new version of the existing chart, getting onto the latest version isn’t as simple as running helm upgrade. You have to uninstall the previous chart and delete all of its resources before installing the new chart.

Note: GitLab uses an image with Helm pre-installed, but they are still using the 2.x version, not the latest 3.x. If you need to run your own Helm commands at any point, ensure that you are looking specifically at the 2.x documentation when using GitLab’s Helm image.

The Fix

There is shamefully no notification in GitLab’s Managed Apps section that alerts you to this change. Essentially, if you aren’t keeping up-to-date with their release notes, then you might have a breaking change you don’t know about with no easy way to identify the fix.

After about an hour of tracking down the issue, I found that the official documentation has a very brief note on this very problem. You can read the official documentation if you’d like, but ultimately this is the important bit:

If you have installed Cert-Manager prior to GitLab 12.3, Let’s Encrypt will block requests from older versions of Cert-Manager.

If you’re using gitlab.com rather than your own personal GitLab deployment, you might not know whether you were running version 12.3 when you installed the cert manager. So I went through their release notes to find the official date that 12.3 was released, which was September 22, 2019. Any cert manager installations from this date or prior will likely need to be re-installed.

Also thankfully, there’s a simple way to look at your GKE console and tell if you need to update. If you are on the latest version of cert manager, you will have two corresponding workloads; three, if you didn’t manually disable the webhooks deployment. If you aren’t on the latest version you’ll have a single workload.
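The same check can be sketched from the command line instead of the GKE console. This is a minimal sketch, assuming the cert-manager deployments live in the gitlab-managed-apps namespace and have cert-manager or certmanager in their names; count_certmanager_deployments is a hypothetical helper name.

```shell
# Count deployments whose name contains cert-manager or certmanager.
# Reads the output of `kubectl get deployments -o name` on stdin.
# Assumption: the old chart shows up as one matching deployment and
# the new chart as two or three, as described above.
count_certmanager_deployments() {
  # grep -c prints the count but exits non-zero when the count is 0,
  # so `|| true` keeps the helper usable under `set -e`.
  grep -Ec 'cert-?manager' || true
}
```

Usage: kubectl get deployments -n gitlab-managed-apps -o name | count_certmanager_deployments. A result of 1 would suggest the old chart is still installed.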

The resolution GitLab proposes: uninstall and install again.

Caveats

The uninstall phase will almost certainly succeed. Installing the newest version is less of a sure thing. I encountered multiple issues: various errors kept popping up, and ultimately I had to create a custom deployment YAML (based on the one GitLab was generating) to perform a bunch of cleanup and prep to get the installation working.

Upgrade Kubernetes

I’m not sure if this was actually a problem, but I figured it could have been since older K8S APIs might be hindering the installation process. My master and nodes were at 1.12.x and the latest was 1.15.x, which is a significant bump. Although no APIs would be missing between 1.12.x and 1.15.x, it’s possible that new APIs exist that might be used in the installation process.

When I attempted to perform the upgrade, however, I received an error stating “Zone X is out of resources, try Y instead.” After much web searching I could find nothing on what this error actually meant. However, I noticed that there was a beta feature in GKE called “surge upgrade,” which allows a cluster to temporarily increase the number of nodes to accommodate a version upgrade. Once I enabled that feature and set the maximum surge nodes to 1, I was able to successfully upgrade my single node.
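For anyone who prefers the command line to the console, the surge setting can be sketched with gcloud. This is a sketch under assumptions: the cluster, node pool, and zone are placeholders for your own values, enable_surge_upgrade is a hypothetical helper name, and the max-unavailable value of 0 is my assumption rather than something I set in the console. While the feature was in beta, the command may have needed to be run as gcloud beta.

```shell
# Allow one extra (surge) node during node pool upgrades.
# Arguments (all placeholders for your own values):
#   $1 = cluster name, $2 = node pool name, $3 = zone
# Assumption: --max-unavailable-upgrade 0, so that no node is allowed
# to be unavailable at the same time during the upgrade.
enable_surge_upgrade() {
  gcloud container node-pools update "$2" \
    --cluster "$1" \
    --zone "$3" \
    --max-surge-upgrade 1 \
    --max-unavailable-upgrade 0
}
```

Usage: enable_surge_upgrade my-cluster default-pool us-central1-a, where all three values are illustrative.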

However, for some reason I was unable to run my stack on a single node after the upgrade. I believe that one of the upgraded versions increased how many resources the K8S stack takes up on a node, and one of my pods could not be started. I had to allow auto-scaling, which brought my node pool to a count of 2, meaning double the price, which I’m not thrilled about.

Unavailable API Services

If you get an error saying something like “unable to retrieve list of API servers” then that probably means one or more of the API services is unavailable. You can check this by executing kubectl get apiservice and looking for any row that shows False in the AVAILABLE column. If it’s a service related to cert-manager, just nuke it. If it’s for something else, check if you actually need it (or can re-deploy it) before continuing. Ultimately, however, all API services need to be available; otherwise you will keep getting that error message.

If you can safely nuke an unavailable service, just use kubectl delete apiservice <service-name>.
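To avoid scanning the table by eye, the unavailable rows can be filtered out. This is a minimal sketch, assuming the default kubectl get apiservice table layout with AVAILABLE as the third column; unavailable_apiservices is a hypothetical helper name.

```shell
# Print the names of API services whose AVAILABLE column starts with
# "False" (kubectl may append a reason in parentheses after it).
# Reads the output of `kubectl get apiservice` on stdin.
# Assumption: columns are NAME, SERVICE, AVAILABLE, AGE, in that order.
unavailable_apiservices() {
  # NR > 1 skips the header row.
  awk 'NR > 1 && $3 ~ /^False/ { print $1 }'
}
```

Usage: kubectl get apiservice | unavailable_apiservices. Review each name it prints before deleting anything.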

Service Account Already Exists

It’s likely that you will receive this error if your re-install phase gave you any other errors. Because cert-manager does not fail atomically, meaning you can have some state changes even if the operation is ultimately a failure, there is a good chance that your overall state will be invalid when you are finally ready to install the newest cert-manager chart.

If you get a message about a service account already existing, check the Services & Ingress section to see if anything with cert-manager or certmanager in the name exists, and delete it. If you don’t see it there, then welcome to the wonder of poor user experience, where a service account exists but you can’t view it from the web UI. Thankfully this is simple to fix from the command line.

Similar to the Unavailable API Services problem, we’ll use kubectl get serviceaccount -n gitlab-managed-apps. The -n gitlab-managed-apps is very important, because anything related to cert-manager won’t be in the default namespace. If the output shows anything with cert-manager or certmanager in the name, use the following command to delete each service account one-by-one:

kubectl delete serviceaccount -n gitlab-managed-apps <service-name>
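If several service accounts are left over, the one-by-one deletion can be wrapped in a loop. This is a minimal sketch, assuming every service account with cert-manager or certmanager in its name in the given namespace is safe to delete; both function names are hypothetical helpers.

```shell
# Keep only the lines that contain cert-manager or certmanager.
# `|| true` keeps the exit status zero when nothing matches.
filter_certmanager() {
  grep -E 'cert-?manager' || true
}

# Delete every matching service account in the namespace given as $1.
delete_certmanager_serviceaccounts() {
  sa_namespace="$1"
  kubectl get serviceaccount -n "$sa_namespace" -o name | filter_certmanager |
    while read -r sa_name; do
      # -o name yields entries like serviceaccount/<name>, which
      # kubectl delete accepts directly. stdin is redirected so the
      # command cannot consume the rest of the list.
      kubectl delete -n "$sa_namespace" "$sa_name" < /dev/null
    done
}
```

Usage: delete_certmanager_serviceaccounts gitlab-managed-apps. Running the kubectl get part of the pipeline on its own first shows what would be deleted.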

Done.

Continued Difficulty

Let’s say you did all that and you’re still getting errors. Do all the above to ensure you’ve got something akin to a clean slate, then do the following:

  1. Install cert-manager via GitLab’s interface.
  2. Check your GCP Kubernetes Workloads for the failed install-certmanager pod.
  3. Click on install-certmanager and look at the YAML tab.
  4. Copy the YAML that GitLab generated into a file in Cloud Shell (e.g. custom-install-certmanager.yaml).
  5. Modify the value for COMMAND_SCRIPT to the following:
       value: |-
         set -xeo pipefail
         helm init --upgrade
         helm repo add certmanager https://charts.jetstack.io
         helm repo update
         kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.12/deploy/manifests/00-crds.yaml --validate=false
         kubectl label --overwrite namespace gitlab-managed-apps certmanager.k8s.io/disable-validation=true
         helm del --purge certmanager --tls --tls-ca-cert /data/helm/certmanager/config/ca.pem --tls-cert /data/helm/certmanager/config/cert.pem --tls-key /data/helm/certmanager/config/key.pem
         helm install certmanager/cert-manager --name certmanager --version v0.12.0 --tls --tls-ca-cert /data/helm/certmanager/config/ca.pem --tls-cert /data/helm/certmanager/config/cert.pem --tls-key /data/helm/certmanager/config/key.pem --set rbac.create\=true,rbac.enabled\=true --namespace gitlab-managed-apps -f /data/helm/certmanager/config/values.yaml
         kubectl apply -f /data/helm/certmanager/config/cluster_issuer.yaml
  6. Use kubectl apply -f <file> to start a pod that will execute the installation.

The script might fail on the helm del command if a release named certmanager does not already exist for some reason. If that’s the case, simply remove that line and try again.
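An alternative to removing the line is to let the script tolerate a failing command. Under set -e, a command on the left-hand side of || does not abort the script, so a small wrapper is enough. This is only a sketch: tolerate is a hypothetical helper name, not part of Helm or GitLab, and I did not use it in my own install.

```shell
# Run the given command; if it fails, log the failure to stderr and
# return success so that a script running under `set -e` continues.
tolerate() {
  "$@" || echo "ignored failure: $*" >&2
}
```

Usage: define the function near the top of COMMAND_SCRIPT, then prefix the helm del line with tolerate, keeping its arguments unchanged.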

Hopefully, if you got this far in your own debugging, you managed to find a solution among the lessons I learned. If not, then I wish you much luck.