feat: add troubleshooting guide

2026-06-21 12:10:49 +10:00 · 2022-02-06 02:38:45 +09:00 · 2022-02-06 02:38:45 +09:00 · 5392114d72
commit 5392114d72
parent 154c443f1a
6 changed files with 339 additions and 4 deletions
--- a/README.md
+++ b/README.md
@ -437,3 +437,4 @@ kubectl apply -f awx-secret-tls.yaml
  - [📝Uninstall deployed resouces](tips/uninstall.md)
  - [📝Deploy older version of AWX Operator](tips/deploy-older-operator.md)
  - [📝Upgrade AWX Operator and AWX](tips/upgrade-operator.md)
  - [📝Troubleshooting Guide](tips/troubleshooting.md)
--- a/galaxy/README.md
+++ b/galaxy/README.md
@ -184,9 +184,9 @@ This project is still under active development and there is no support, however,
 ### Patch K3s
-If you use Traefik which is K3s' Ingress controller as completely default, the Pod may not be able to get the client's IP address (see [k3s-io/k3s#2997](https://github.com/k3s-io/k3s/discussions/2997) for detail). In the current implementation of Pulp, this causes problems with the web UI being unreachable.
+If you use Traefik which is K3s' Ingress controller as completely default, the Pod may not be able to get the client's IP address (see [k3s-io/k3s#2997](https://github.com/k3s-io/k3s/discussions/2997) for details). In the current implementation of Pulp, this causes problems with the web UI being unreachable.
-For this reason, fix the Traefik configuration. For a single node like doing in this repository, the following command is easy to use.
+For this reason, you should fix the Traefik configuration. For a single node like doing in this repository, the following command is easy to use.
 ```bash
 kubectl -n kube-system patch deployment traefik --patch '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
@ -198,7 +198,7 @@ Then wait until your `traefik` by the following command is `1/1` `READY`.
 kubectl -n kube-system get deployment traefik
 ```
-Now your client's IP address can be passed correctly through X-Forwarded-For and X-Real-Ip headers.
+Now your client's IP address can be passed correctly through `X-Forwarded-For` and `X-Real-Ip` headers.
 ### Install Pulp Operator
--- a/tips/README.md
+++ b/tips/README.md
@ -5,3 +5,4 @@
 - [📝Uninstall deployed resouces](uninstall.md)
 - [📝Deploy older version of AWX Operator](deploy-older-operator.md)
 - [📝Upgrade AWX Operator and AWX](upgrade-operator.md)
 - [📝Troubleshooting Guide](troubleshooting.md)
--- a/tips/expose-hosts.md
+++ b/tips/expose-hosts.md
@ -35,6 +35,10 @@ One easy way to do this is to use `dnsmasq`.
 4. Add `--resolv-conf /etc/rancher/k3s/resolv.conf` as an argument for `k3s server` command.
   ```bash
   # Change configuration using script:
   $ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --resolv-conf /etc/rancher/k3s/resolv.conf
   # If you don't want to use the script, modify /etc/systemd/system/k3s.service manually:
   $ cat /etc/systemd/system/k3s.service
   ...
   ExecStart=/usr/local/bin/k3s \
--- a/tips/troubleshooting.md
+++ b/tips/troubleshooting.md
@ -0,0 +1,329 @@
 <!-- omit in toc -->
 # Troubleshooting Guide
 Some hints and guides for when you got stuck during deployment and daily use of AWX.
 <!-- omit in toc -->
 ## Table of Contents
 - [Troubles during Deployment](#troubles-during-deployment)
  - [First Step: Investigate your Situation](#first-step-investigate-your-situation)
    - [Investigate Status and Events of the Pods](#investigate-status-and-events-of-the-pods)
    - [Investigate Logs of the Containers inside the Pods](#investigate-logs-of-the-containers-inside-the-pods)
  - [The Pod is `Pending` with "1 Insufficient cpu, 1 Insufficient memory." event](#the-pod-is-pending-with-1-insufficient-cpu-1-insufficient-memory-event)
  - [The Pod is `Pending` with "1 pod has unbound immediate PersistentVolumeClaims." event](#the-pod-is-pending-with-1-pod-has-unbound-immediate-persistentvolumeclaims-event)
  - [The Pod is `Running` but stucked with "[wait-for-migrations] Waiting for database migrations..." message](#the-pod-is-running-but-stucked-with-wait-for-migrations-waiting-for-database-migrations-message)
  - [The Pod for PostgreSQL is in `CrashLoopBackOff` state and shows "Permission denied" log](#the-pod-for-postgresql-is-in-crashloopbackoff-state-and-shows-permission-denied-log)
 - [Troubles during Daily Use](#troubles-during-daily-use)
  - [Job failed with no output](#job-failed-with-no-output)
  - [Provisioning Callback does not work](#provisioning-callback-does-not-work)
 ## Troubles during Deployment
 ### First Step: Investigate your Situation
 You can start investigating troubles during deployment with following two things.
 - **Status** and **Events** of the Pods
 - **Logs** of the Containers inside the Pods
 #### Investigate Status and Events of the Pods
 First, check the `STATUS` for the Pods by this command.
 ```bash
 kubectl -n awx get pod
 ```
 If the Pods are working properly, its `STATUS` are `Running`. If your Pods are not in `Running` state e.g. `Pending`, `ImagePullBackOff` or `CrashLoopBackOff` etc., the Pods might have some problems. In the following example, the Pod `awx-84d5c45999-h7xm4` is in `Pending` state.
 ```bash
 $ kubectl -n awx get pod
 NAME                                               READY   STATUS    RESTARTS   AGE
 awx-operator-controller-manager-68d787cfbd-j6k7z   2/2     Running   0          7m43s
 awx-postgres-0                                     1/1     Running   0          4m6s
 awx-84d5c45999-h7xm4                               0/4     Pending   0          3m59s
 ```
 If you have the Pods which has the unexpected state instead of `Running`, the next step is checking `Events` for the Pod. The command to get `Events` for the pod is:
 ```bash
 kubectl -n awx describe pod <Pod Name>
 ```
 By this command, you can get the `Events` for the Pod you specified at the end of the output.
 ```bash
 $ kubectl -n awx describe pod awx-84d5c45999-h7xm4
 ...
 Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  106s  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.     👈👈👈
  Warning  FailedScheduling  105s  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.     👈👈👈
 ```
 In most cases, you can find the reason why the Pod is not `Running` from `Events`. In the example above, I can see that it is due to lack of CPU or memory.
 #### Investigate Logs of the Containers inside the Pods
 The logs also helpful to get the reason why something went wrong. In particular, if the status of the Pod is `Running` but the Pod does not works as expected, you should check the logs.
 The commands to get the logs are following. `-f` is optional, useful to watch the logs as well as get the logs.
 ```bash
 # Get the logs of specific Pod.
 # If the Pod includes multiple containers, container name has to be specified.
 kubectl -n awx logs -f <Pod Name>
 kubectl -n awx logs -f <Pod Name> -c <Container Name>
 # Get the logs of specific Pod which is handled by Deployment resource.
 # If the Pod includes multiple containers, container name has to be specified.
 kubectl -n awx logs -f deployment/<Deployment Name>
 kubectl -n awx logs -f deployment/<Deployment Name> -c <Container Name>
 # Get the logs of specific Pod which is handled by StatefulSet resource
 # If the Pod includes multiple containers, container name has to be specified.
 kubectl -n awx logs -f statefulset/<Deployment Name>
 kubectl -n awx logs -f statefulset/<Deployment Name> -c <Container Name>
 ```
 For AWX Operator and AWX, specifically, the following commands are helpful.
 - Logs of AWX Operator
  - `kubectl -n awx logs -f deployment/awx-operator-controller-manager -c awx-manager`
 - Logs of AWX related containers
  - `kubectl -n awx logs -f deployment/awx -c awx-web`
  - `kubectl -n awx logs -f deployment/awx -c awx-task`
  - `kubectl -n awx logs -f deployment/awx -c awx-ee`
  - `kubectl -n awx logs -f deployment/awx -c redis`
 - Logs of PostgreSQL
  - `kubectl -n awx logs -f statefulset/awx-postgres`
 ### The Pod is `Pending` with "1 Insufficient cpu, 1 Insufficient memory." event
 If your Pod is in `Pending` state and its `Events` shows following events, the reason is that the node does not have enough CPU and memory to start the Pod. By default AWX requires at least 2 CPUs and 4 GB RAM. In addition more resources are required to run K3s and the OS itself.
 ```bash
 $ kubectl -n awx describe pod awx-84d5c45999-h7xm4
 ...
 Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  106s  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.     👈👈👈
  Warning  FailedScheduling  105s  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.     👈👈👈
 ```
 Typical solutions are one of the following:
 - **Add more CPUs or memory to your K3s node.**
  - If you have at least 3 CPUs and 5 GB RAM, AWX may work.
 - **Reduce resource requests for the containers.**
  - The minimum resouce requirements can be ignored by adding three lines in `base/awx.yml`.
    ```yaml
    ...
    spec:
      ...
      web_resource_requirements: {}     👈👈👈
      task_resource_requirements: {}     👈👈👈
      ee_resource_requirements: {}     👈👈👈
    ```
  - You can specify more specific value for each containers. Refer [official documentation](https://github.com/ansible/awx-operator/blob/0.16.1/README.md#containers-resource-requirements) for details.
  - In this way you can run AWX with fewer resources, but you may encounter performance issues.
 ### The Pod is `Pending` with "1 pod has unbound immediate PersistentVolumeClaims." event
 If your Pod is in `Pending` state and its `Events` shows following events, the reason is that no usable Persisten Volumes are available.
 ```bash
 $ kubectl -n awx describe pod awx-84d5c45999-h7xm4
 ...
 Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24s   default-scheduler  0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.     👈👈👈
 ```
 Check the `STATUS` of your PVs and ensure your PVs doesn't have `Available` or `Bound` state.
 ```bash
 $ kubectl get pv
 NAME                     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                         STORAGECLASS             REASON   AGE
 awx-projects-volume      2Gi        RWO            Retain           Released   awx/awx-projects-claim        awx-projects-volume               17h
 awx-postgres-volume      2Gi        RWO            Retain           Released   awx/postgres-awx-postgres-0   awx-postgres-volume               17h
 ```
 Probably this is the second (or more) time to deploy AWX for you. These PVs which have `Released` state are tied to your old (and probably no longer exists now) PVCs you created in the past.
 There are a few things you should to know about the PVs in Kubernetes.
 - Once a PV is bound from a PVC, it keeps the PVC name in its `claimRef` entry. This will be shown in the `CLAIM` column in the result of the command `kubectl get pv`.
 - The `Released` state of the PV means that the PV was bound by PVC in the `claimRef` entry in the past but now the PVC does not exist. **The PV in this state cannot be bound by any PVC other than the one recorded in `claimRef`.**
 - To allow the PV to bind from a PVC other than the one recorded in `claimRef`, the `claimRef` entry must be empty and the PV must has `Available` state.
 To solve this, typical solutions are one of the following:
 - **Patch the PV to empty `claimRef` entry for the PV.**
  - Invoke following commands:
    ```bash
    kubectl patch pv <PV Name> -p '{"spec":{"claimRef": null}}'
    ```
 - **Delete the PV and recreate it.**
  - Invoke following commands:
    ```bash
    # Delete the PV
    kubectl delete pv <PV Name>
    # Recreate the PV
    kubectl apply -k base
    ```
 ### The Pod is `Running` but stucked with "[wait-for-migrations] Waiting for database migrations..." message
 Sometimes your AWX pod is `Running` state correctly but not functional at all, and its log shows following message repeatedly.
 ```bash
 kubectl -n awx logs -f deployment/awx -c awx-web
 [wait-for-migrations] Waiting for database migrations...
 [wait-for-migrations] Attempt 1 of 30
 [wait-for-migrations] Waiting 0.5 seconds before next attempt
 [wait-for-migrations] Attempt 2 of 30
 [wait-for-migrations] Waiting 1 seconds before next attempt
 [wait-for-migrations] Attempt 3 of 30
 [wait-for-migrations] Waiting 2 seconds before next attempt
 [wait-for-migrations] Attempt 4 of 30
 [wait-for-migrations] Waiting 4 seconds before next attempt
 ...
 ```
 This problem occurs when the AWX pod and the PostgreSQL pod cannot communicate properly. In most cases, the cause of this is the network on your K3s.
 To solve this, check or try the following:
 - **Ensure your PostgreSQL (typically the Pod named `awx-postgres-0`)is in `Running` state.**
 - **Ensure your `firewalld` or `ufw` has been disabled on your K3s host.**
 - **Ensure your `awx-postgres-configuration` has correct values, especially if you're using external PostgreSQL.**
 - **Uninstall K3s and install it again.**
 ### The Pod for PostgreSQL is in `CrashLoopBackOff` state and shows "Permission denied" log
 In this situation, your Pod for PostgreSQL is in `CrashLoopBackOff` state and its log shows following error message.
 ```bash
 $ kubectl -n awx get pod
 NAME                                               READY   STATUS             RESTARTS   AGE
 awx-operator-controller-manager-68d787cfbd-j6k7z   2/2     Running            0          7m43s
 awx-postgres-0                                     1/1     CrashLoopBackOff   3          4m6s
 awx-84d5c45999-h7xm4                               4/4     Running            0          3m59s
 $ kubectl -n awx logs statefulset/awx-postgres
 mkdir: cannot create directory '/var/lib/postgresql/data': Permission denied
 ```
 You should check the permissions and the owner of directories where used as PV on your K3s host. If you followed my guide, it would be `/data/postgres`. There is additional `data` directory created by K3s under `/data/postgres`.
 ```bash
 $ ls -ld /data/postgres /data/postgres/data
 drwxr-xr-x. 2 root root 18 Aug 20 10:09 /data/postgres
 drwxr-xr-x. 3 root root 20 Aug 20 10:09 /data/postgres/data
 ```
 In my environment, `755` and `root:root` (`0:0`) works correctly. So you can try:
 ```bash
 sudo chmod 755 /data/postgres /data/postgres/data
 sudo chown 0:0 /data/postgres /data/postgres/data
 ```
 Or, you can also try `999:0` as owner/group for the directory.
 ```bash
 sudo chmod 755 /data/postgres /data/postgres/data
 sudo chown 999:0 /data/postgres /data/postgres/data
 ```
 `999` is [the UID of the `postgres` user which used in the container](https://github.com/docker-library/postgres/blob/master/12/bullseye/Dockerfile#L23).
 ## Troubles during Daily Use
 ### Job failed with no output
 If the job is invoked to a large number of hosts or is running long time, sometimes the job is marked as failed and no log will be displayed in the Output tab.
 This is a problem caused by log rotation on Kubernetes. Refer [ansible/awx#10366](https://github.com/ansible/awx/issues/10366) for details.
 In the case of K3s, you can reduce the possibility of this issue by changing the configuration as follows.
 ```bash
 # Change configuration using script:
 $ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"
 # If you don't want to use the script, modify /etc/systemd/system/k3s.service manually:
 $ cat /etc/systemd/system/k3s.service
 ...
 ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \     👈👈👈
        'container-log-max-files=4' \     👈👈👈
        '--kubelet-arg' \     👈👈👈
        'container-log-max-size=50Mi' \     👈👈👈
 ```
 Then restart K3s. The K3s service can be safely restarted without affecting the running resources.
 ```bash
 sudo systemctl daemon-reload
 sudo systemctl restart k3s
 ```
 ### Provisioning Callback does not work
 If you use Traefik which is K3s' Ingress controller as completely default, the Pod may not be able to get the client's IP address (see [k3s-io/k3s#2997](https://github.com/k3s-io/k3s/discussions/2997) for details). Therefore, the feature called Provisioning Callback in AWX does not work properly sinse AWX can't determine actual IP address of the remote host who request callback.
 For this reason, you should fix the Traefik configuration. For a single node like doing in this repository, the following command is easy to use.
 ```bash
 kubectl -n kube-system patch deployment traefik --patch '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
 ```
 Then wait until your `traefik` by the following command is `1/1` `READY`.
 ```bash
 kubectl -n kube-system get deployment traefik
 ```
 Now your client's IP address can be passed correctly through `X-Forwarded-For` and `X-Real-Ip` headers.
 The last step is modifying AWX. By default, AWX uses only `REMOTE_ADDR` and `REMOTE_HOST` headers to determine the remote host (means HTTP client). Therefore, you have to make AWX to use `X-Forwarded-For` header.
 Modify your `base/awx.yaml` and add following three lines.
 ```bash
 ...
 spec:
    ...
    extra_settings:     👈👈👈
    - setting: REMOTE_HOST_HEADERS     👈👈👈
        value: "['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'REMOTE_HOST']"     👈👈👈
 ```
 Then apply this change and wait for your AWX will be reconfigured.
 ```bash
 kubectl apply -k base
 ```
 You can watch its progress by following command as did when you deploy AWX at the first time.
 ```bash
 kubectl -n awx logs -f deployments/awx-operator-controller-manager -c awx-manager
 ```
 Now your Provisioning Callback should work. In my environment, the name of the host in the inventory have to be defined using IP address instead of DNS hostname.
--- a/tips/upgrade-operator.md
+++ b/tips/upgrade-operator.md
@ -163,7 +163,7 @@ localhost                  : ok=54   changed=0    unreachable=0    failed=0    s
 ## ❓ Troubleshooting
-Some hists for when you got stuck during upgrade.
+Some hints for when you got stuck during upgrade.
 ### New Pod gets stuck in `Pending` state