GitLab Managed App: Cert Manager
Published 2019 Dec 31 @ 05:26
For anyone using GitLab’s Managed Apps, specifically the Cert Manager app: be advised that you need to re-install (uninstall/install) the app if you first installed it prior to September 22, 2019. Most people who needed to do this have probably already done so, but my cert was active until right around Christmas, so by the time I realized my cert hadn’t renewed it was the holiday season. And, to my added frustration, the re-installation process was not smooth.
Certs expire because the ACME client isn’t being updated and therefore
Let’s Encrypt will not renew your certs. The old
cert-manager Helm chart was deprecated and so there were no
additional updates to the ACME client. The new
cert-manager chart that gets
installed in GitLab 12.3 and beyond is under active development and has an
up-to-date ACME client that can properly renew certs.
Unfortunately, because it’s an entirely new chart rather than simply a new
version of an existing chart, it’s not as simple as running
helm upgrade to
be on the latest version. This requires uninstalling the previous chart and
deleting all corresponding resources before installing the new chart.
Note: GitLab uses an image with Helm pre-installed, but they are still using the 2.x version, not the latest 3.x. If you need to run your own Helm commands at any point, ensure that you are looking specifically at the 2.x documentation when using GitLab’s Helm image.
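If you ever need to double-check which Helm a given shell or image is running before trusting any documentation, the version command works on both major versions:

```shell
# Prints something like "v2.16.x" or "v3.x.x"; the major version tells you
# which set of Helm docs applies. On Helm 2 this also tries to query the
# Tiller server, so it may warn if it can't reach the cluster.
helm version
```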
There is shamefully no notification in GitLab’s Managed Apps section that alerts you to this change. Essentially, if you aren’t keeping up-to-date with their release notes, then you might have a breaking change you don’t know about with no easy way to identify the fix.
Thankfully, I spent about an hour tracking down the issue and found that the official documentation had a very brief note on this very problem. You can read the official documentation if you’d like, but ultimately this is the important bit:
If you have installed Cert-Manager prior to GitLab 12.3, Let’s Encrypt will block requests from older versions of Cert-Manager.
If you’re using
gitlab.com rather than your own personal GitLab deployment,
you might not know whether you were running version 12.3 when you installed the
cert manager. So I went through their release notes to find the official date
that 12.3 was released, which was September 22, 2019. Any cert manager
installations from this date or prior will likely need to be re-installed.
Also thankfully, there’s a simple way to look at your GKE console and tell if you need to update. If you are on the latest version of cert manager, you will have two corresponding workloads; three, if you didn’t manually disable the webhooks deployment. If you aren’t on the latest version you’ll have a single workload.
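If you prefer the CLI over the GKE console, a rough equivalent check (assuming the app was installed into the usual gitlab-managed-apps namespace) is to list the deployments there:

```shell
# The new chart yields certmanager and certmanager-cainjector (plus
# certmanager-webhook unless you disabled webhooks); the old chart
# yields a single deployment.
kubectl get deployments -n gitlab-managed-apps | grep certmanager
```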
The resolution GitLab proposes: uninstall and install again.
The uninstall phase will almost certainly succeed. Installing the newest version is less of a sure thing. I encountered multiple issues: various errors kept popping up, and ultimately I had to create a custom deployment YAML (based off the one GitLab was generating) to perform a bunch of cleanup and prep to get the installation working.
I’m not sure if this was actually a problem, but I figured it could have been, since older K8S APIs might be hindering the installation process. My master and nodes were at 1.12.x and the latest was 1.15.x, which is a significant bump. Although no APIs would be missing between 1.12.x and 1.15.x, it’s possible that new APIs exist that might be used in the installation process.
When I attempted to perform the upgrade, however, I received an error stating
Zone X is out of resources, try Y instead. After much web searching I could
find nothing on what this error actually meant. However, I noticed that there
was a beta feature in GKE called “surge upgrade” which allowed a cluster to
temporarily increase the number of nodes to accommodate a version upgrade. Once
I enabled that feature and set the maximum surge nodes to
1 I was able to
successfully upgrade my single node.
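Surge upgrades can also be enabled from the CLI rather than the console. This is a sketch with placeholder names (cluster, node pool, and zone are all assumptions you would substitute):

```shell
# Enable surge upgrades on an existing node pool (names are placeholders).
# --max-surge-upgrade 1 lets GKE add one temporary node during an upgrade;
# --max-unavailable-upgrade 0 keeps the original node serving until its
# replacement is ready.
gcloud beta container node-pools update default-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0
```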
However, for some reason I was unable to run my stack on a single node after the upgrade. I believe one of the upgraded components increased how many resources the K8S stack itself takes up on a node, and one of my pods could no longer be scheduled. I had to allow auto-scaling, which brought my node pool to a count of 2, meaning double the price, which I’m not thrilled about.
Unavailable API Services
If you get an error saying something like “unable to retrieve list of API
servers” then that probably means “one or more of the API services is
unavailable.” You can check this by executing
kubectl get apiservice and
looking for anything that has
False in its row. If it’s a service related to
cert-manager, just nuke it. If it’s for something else, check if you actually
need it (or can re-deploy it) before continuing. Ultimately, however, all
API services need to be available; otherwise you get that error message.
If you can safely nuke an unavailable service, just use
kubectl delete apiservice <service-name>.
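To make the check concrete, here’s how you could pick out the unavailable rows. The sample data below is hypothetical and shaped like `kubectl get apiservice` output; in practice you would pipe the real command’s output into the same awk filter:

```shell
# Hypothetical output shaped like `kubectl get apiservice`; substitute
# the real command for this variable in practice.
apiservices='NAME                                 SERVICE                                   AVAILABLE
v1.apps                              Local                                     True
v1beta1.webhook.certmanager.k8s.io   gitlab-managed-apps/certmanager-webhook   False'

# Print the name of every API service that is not available.
echo "$apiservices" | awk 'NR > 1 && $3 != "True" { print $1 }'
```

Anything the filter prints is a candidate for `kubectl delete apiservice`, after you’ve confirmed you don’t need it.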
Service Account Already Exists
It’s likely that you will receive this error if your re-install phase gave you any other errors. Because cert-manager does not fail atomically, meaning you can have some state changes even if the operation is ultimately a failure, there is a good chance that your overall state will be invalid when you are finally ready to install the newest chart.
If you get a message about a service account already existing, check the Services & Ingress section to see if anything with certmanager in the name exists, and delete it. If you don’t see it there, then welcome to the wonder of poor user experience, where a service exists but you can’t view it from the web UI. Thankfully, this is simple to fix from the command line.
Similar to the Unavailable API Services problem, we’ll use
kubectl get serviceaccount -n gitlab-managed-apps. The
-n gitlab-managed-apps is very
important, because anything related to
cert-manager won’t be in the default
namespace. If the output shows anything with certmanager in the name, use the following command to delete each service account one-by-one:
kubectl delete serviceaccount -n gitlab-managed-apps <service-name>
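If there are several to remove, you can generate the delete commands in a loop. The service account names below are hypothetical stand-ins for whatever your cluster actually reports:

```shell
# Hypothetical names; substitute the real output of
#   kubectl get serviceaccount -n gitlab-managed-apps -o name
service_accounts='serviceaccount/default
serviceaccount/certmanager
serviceaccount/certmanager-cainjector'

# Emit one delete command per certmanager-related account. Drop the
# leading "echo" to actually run the deletions against your cluster.
echo "$service_accounts" | grep certmanager | while read -r sa; do
  echo kubectl delete -n gitlab-managed-apps "$sa"
done
```

The dry-run `echo` lets you eyeball exactly what would be deleted before committing to it.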
Let’s say you did all that and you’re still getting errors. Do all the above to ensure you’ve got something akin to a clean slate, then do the following:
- Attempt to install cert-manager via GitLab’s interface.
- Check your GCP Kubernetes Workloads for the failed install-certmanager pod.
- Click on install-certmanager and look at the YAML that GitLab generated.
- Copy that YAML into a file in Cloud Shell.
- Modify the COMMAND_SCRIPT value to the following:

```yaml
value: |-
  set -xeo pipefail
  helm init --upgrade
  helm repo add certmanager https://charts.jetstack.io
  helm repo update
  kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.12/deploy/manifests/00-crds.yaml --validate=false
  kubectl label --overwrite namespace gitlab-managed-apps certmanager.k8s.io/disable-validation=true
  helm del --purge certmanager --tls --tls-ca-cert /data/helm/certmanager/config/ca.pem --tls-cert /data/helm/certmanager/config/cert.pem --tls-key /data/helm/certmanager/config/key.pem
  helm install certmanager/cert-manager --name certmanager --version v0.12.0 --tls --tls-ca-cert /data/helm/certmanager/config/ca.pem --tls-cert /data/helm/certmanager/config/cert.pem --tls-key /data/helm/certmanager/config/key.pem --set rbac.create\=true,rbac.enabled\=true --namespace gitlab-managed-apps -f /data/helm/certmanager/config/values.yaml
  kubectl apply -f /data/helm/certmanager/config/cluster_issuer.yaml
```

- Run kubectl apply -f <file> to start a pod that will execute the installation.
You might fail on the helm del command if a release named certmanager does not already exist for some reason. If that’s the case, simply remove that line and try again.
Hopefully if you got this far in your own debugging you managed to find some solution from these lessons I learned. If not, then I wish you much luck.