“Help! I Can’t Logon” or Restarting OpenShift After a Week of Downtime
Introduction
Being a good corporate citizen, I decided that I would save some power and leave the Single Node OpenShift (SNO) server down whilst I had a week away from the keyboard. We usually shut it down in the evenings (it is a large, old Dell server) and switch it on again in the morning to warm the office up.
When I got back from holiday, everything appeared healthy at first: the server booted and the console messages all looked normal, but the web console was nowhere to be found. What was going on?
Attempts to connect to the API server using the “oc” CLI reported that the API server wasn’t there to talk to. This all felt rather worrying.
It’s tricky to diagnose a problem if you can’t get onto the service, so it was very good news that we had followed the instructions to establish an SSH key when we installed SNO (sshKey in the installation config file).
Once we were connected to the node over SSH, we could establish a local connection to the API server, even though OAuth was unavailable, using a local KUBECONFIG file that OpenShift generates, and from there resolve the problem. Our issue related to expired certificates, which (see below) was fairly easy to fix.
Method
Use the following steps to re-establish connectivity and authorise the outstanding certificate renewal requests:
- SSH as the “core” user onto your SNO server using the sshKey that you provided when you installed OpenShift:
- ssh core@openshift.mydomain.com
- Check the service status using the “crictl” CLI:
- sudo -i
- crictl ps -a | less
Examine the logs for running services using a container ID returned by the previous command:
- crictl logs <container-id>
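When scanning those logs, a quick filter helps. Below is a minimal sketch; the patterns are common Go x509/TLS error fragments typical of this failure mode, not quotes from our logs, and the exact wording varies by component and OpenShift version.

```shell
# Filter helper for spotting certificate problems in container logs.
# The patterns are common Go x509/TLS error fragments (assumption, not
# a quote from this incident); adjust to taste.
cert_errors() {
  grep -iE 'x509|certificate (has expired|signing request)'
}
# On the node (container ID from "crictl ps -a"):
#   crictl logs --tail 200 <container-id> 2>&1 | cert_errors
```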
At this point we realised that our problem was certificate renewal requests that required action, but we could not use the normal service logon method (“oc login ...”).
To overcome this, locate the internal load balancer KubeConfig file:
- cd /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs
- export KUBECONFIG=$(pwd)/lb-int.kubeconfig
Verify that we can talk to the API server now:
- oc get nodes
- oc get pods -A | grep -v Running
Check for certificate renewals:
- oc get csr
We had 11 – approve them all:
- oc get csr -o name | xargs oc adm certificate approve
Check on progress every couple of minutes – we had an additional CSR after a few minutes that needed approving. Keep an eye on pod startups and be patient – it takes a while for OpenShift to find its feet again and get everything running.
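The approve-and-recheck cycle above can be sketched as a small loop. This is a sketch under assumptions: the go-template filter (selecting CSRs with no .status, i.e. Pending) is a common pattern rather than the exact command we ran, and the two-minute interval and three quiet rounds are arbitrary choices.

```shell
# Approve Pending CSRs repeatedly, since more requests can appear a few
# minutes after the first batch. Selecting CSRs with an empty .status is
# a common way to pick out Pending ones (assumption, not the article's
# exact command).
approve_pending_csrs() {
  pending=$(oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}} {{end}}{{end}}')
  [ -n "$pending" ] || return 1   # nothing pending
  # word splitting is intentional: approve all pending CSRs in one call
  oc adm certificate approve $pending
}

# Only loop when the API server is actually reachable.
if oc get nodes >/dev/null 2>&1; then
  quiet=0
  while [ "$quiet" -lt 3 ]; do
    if approve_pending_csrs; then quiet=0; else quiet=$((quiet + 1)); fi
    sleep 120   # check every couple of minutes, as above
  done
fi
```

The quiet-round counter simply stops the loop once three consecutive checks find nothing pending, which covers the late-arriving CSR we saw.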
Managing Expectations…
Getting logged in over SSH and authorising the certificate renewals takes a few minutes. As mentioned above, there is a bit of SSH console watching and checking for other certificate renewals for five minutes or so.
Then, OpenShift needs to get all of its services restarted and working again. This does not happen as quickly as a normal boot, which typically takes a couple of minutes; expect something in the range of 15 to 20 minutes instead. Even after all services are operational again, allow another five minutes or so for everything to settle down.
When you rush to the web console because everything appears to be restarted and you still can’t logon, give yourself five minutes before retrying to avoid unnecessary anxiety!
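One way to take the guesswork out of the waiting is to poll for pods that have not yet settled. A minimal sketch; the column position and the “Completed” exclusion match the usual `oc get pods -A` output, but verify against your version.

```shell
# List pods that are neither Running nor Completed; an empty result
# suggests the cluster has settled. STATUS is the 4th column of
# "oc get pods -A" output (NAMESPACE NAME READY STATUS ...).
unsettled_pods() {
  oc get pods -A --no-headers 2>/dev/null |
    awk '$4 != "Running" && $4 != "Completed"'
}
# e.g. poll until quiet:
#   while [ -n "$(unsettled_pods)" ]; do unsettled_pods; sleep 60; done
```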
Lessons Learned
Whilst this was largely an own goal, it was a valuable “learning opportunity” and we did come out the other side! These were my takeaways:
- Being able to SSH to the service and sudo to root was extremely useful. Without this, there was no real recovery route. Ensuring that the sshKey was configured during the installation was crucial to the recovery of the service.
- The “crictl” CLI is very useful for checking the status of the standalone containers that the service depends on and for reviewing their logs. The command is documented upstream in the Kubernetes cri-tools project. Note – you do not need to install it; it is delivered with OpenShift.
- The lb-int KubeConfig used above provided the only route to get connected to the API server and fix the outstanding certificate renewal requests.
- Because the OpenShift connection provided by “lb-int.kubeconfig” is “system:admin” / KubeAdmin, the SSH key must be secured.
- OpenShift depends on a lot of internally generated certificates that are – by design – short lived. For shorter periods of downtime (a couple of days), OpenShift will recover all on its own. As evidenced here, longer periods – a week, maybe two – create more excitement, but can still be recovered from. Much longer than that and all bets are off: recovery will be esoteric (fiddling with clocks and hoping OpenShift doesn’t notice) or will require reinstallation. Both of these are probably best avoided!
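On the certificate-lifetime point, it can be handy to check expiry dates directly with openssl. A sketch; the kubelet client certificate path shown is the usual RHCOS location, but verify it on your own node.

```shell
# Print the expiry date of a certificate file.
cert_enddate() {
  openssl x509 -noout -enddate -in "$1"
}
# On the node, e.g. (path is the usual RHCOS location; verify locally):
#   sudo openssl x509 -noout -enddate \
#     -in /var/lib/kubelet/pki/kubelet-client-current.pem
```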
All good stuff, that hopefully you will never need!
This article was first published via The Triton Perspective, our monthly LinkedIn newsletter. Subscribe on LinkedIn to read future insights first, before they are published on our website.