doc/troubleshooting.md

# Troubleshooting

## Where are the logs?

In addition to the monitoring and examining instructions in the
[Deployment](installation-guide.md#deployment) section of the
installation guide, ICN records its execution in various log files.
These logs can be found in the 'logs' subdirectory of each component,
for example 'deploy/ironic/logs'.

The logs of the Bare Metal Operator, Cluster API, and Flux controllers
can be examined using standard Kubernetes tools such as `kubectl logs`.

## Early provisioning fails

First, confirm that the BMC and PXE boot configuration is correct, as
described in the [Configuration](installation-guide.md#configuration)
section of the installation guide.

It is also recommended to enable the KVM console of the machine, using
the Raritan console or the Intel web BMC console, to observe early boot
output during provisioning.

  ![BMC console](figure-3.png)

Examining the BareMetalHost resource of the failing machine and the
logs of the Bare Metal Operator and Ironic Pods may also reveal why
provisioning is failing.

A description of the BareMetalHost states can be found in the [Bare
Metal Operator
documentation](https://github.com/metal3-io/baremetal-operator/blob/main/docs/baremetalhost-states.md).

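One possible way to examine the BareMetalHost resources at a glance is sketched below. The helper name `bmh_summary` is hypothetical, and the `metal3` namespace is assumed from the other commands in this document; the columns read the provisioning state and error fields from the resource status.

```shell
# Print one line per BareMetalHost with its provisioning state and any
# error reported by the Bare Metal Operator.
# $1: namespace holding the BareMetalHost resources (default: metal3)
bmh_summary() {
    local ns="${1:-metal3}"
    kubectl -n "$ns" get bmh \
        -o custom-columns=NAME:.metadata.name,STATE:.status.provisioning.state,ERROR_TYPE:.status.errorType,ERROR:.status.errorMessage
}

# Usage:
#   bmh_summary
```

For the full event history of a single host, `kubectl -n metal3 describe bmh <name>` gives more detail.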
### openstack baremetal

In rare cases, the Ironic and Bare Metal Operator information may get
out of sync. In this case, the 'openstack baremetal' tool can be used
to delete the stale information.

The general procedure (shown on the jump server) is:

- Locate the UUID of the active node.

      # kubectl -n metal3 get bmh -o json | jq '.items[]|.status.provisioning.ID'
      "99f64101-04f3-47bf-89bd-ef374097fcdc"

- Examine the Ironic information for stale node and port values.

      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal node list
      +--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
      | UUID                                 | Name        | Instance UUID                        | Power State | Provisioning State | Maintenance |
      +--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
      | 0ec36f3b-80d1-41e6-949a-9ba40a87f625 | None        | None                                 | None        | enroll             | False       |
      | 99f64101-04f3-47bf-89bd-ef374097fcdc | pod11-node3 | 6e16529d-a1a4-450c-8052-46c82c87ca7b | power on    | manageable         | False       |
      +--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal port list
      +--------------------------------------+-------------------+
      | UUID                                 | Address           |
      +--------------------------------------+-------------------+
      | c65b1324-2cdd-44d0-8d25-9372068add02 | 00:1e:67:f1:5b:91 |
      +--------------------------------------+-------------------+

- Delete the stale node and port.

      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal node delete 0ec36f3b-80d1-41e6-949a-9ba40a87f625
      Deleted node 0ec36f3b-80d1-41e6-949a-9ba40a87f625
      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal port delete c65b1324-2cdd-44d0-8d25-9372068add02
      Deleted port c65b1324-2cdd-44d0-8d25-9372068add02

- Create a new port.

      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal port create --node 99f64101-04f3-47bf-89bd-ef374097fcdc 00:1e:67:f1:5b:91
      +-----------------------+--------------------------------------+
      | Field                 | Value                                |
      +-----------------------+--------------------------------------+
      | address               | 00:1e:67:f1:5b:91                    |
      | created_at            | 2021-04-27T22:24:08+00:00            |
      | extra                 | {}                                   |
      | internal_info         | {}                                   |
      | local_link_connection | {}                                   |
      | node_uuid             | 99f64101-04f3-47bf-89bd-ef374097fcdc |
      | physical_network      | None                                 |
      | portgroup_uuid        | None                                 |
      | pxe_enabled           | True                                 |
      | updated_at            | None                                 |
      | uuid                  | 93366f0a-aa12-4815-b524-b95839bfa05d |
      +-----------------------+--------------------------------------+

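The comparison in the first two steps of the procedure above can be sketched as a small filter. The helper name `stale_ironic_nodes` and the file name `bmh-ids.txt` are hypothetical. The sketch only lists candidate nodes; it assumes that any Ironic node whose UUID is not a BareMetalHost provisioning ID is stale, which should be confirmed by hand before deleting anything.

```shell
# Read Ironic node UUIDs on stdin (one per line) and print those that do
# not appear in the file of BareMetalHost provisioning IDs.
# $1: file containing the known provisioning IDs, one per line
stale_ironic_nodes() {
    # -v: invert the match, -x: whole-line match, -F: fixed strings
    grep -vxF -f "$1"
}

# Usage (run on the jump server):
#   kubectl -n metal3 get bmh \
#       -o jsonpath='{range .items[*]}{.status.provisioning.ID}{"\n"}{end}' > bmh-ids.txt
#   OS_TOKEN=fake-token OS_URL=http://localhost:6385/ \
#       openstack baremetal node list -f value -c UUID | stale_ironic_nodes bmh-ids.txt
```

Deleting the stale node and port and recreating the port is left as a manual step, as in the procedure above.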
## Pod deployment fails due to Docker rate limits

If a Pod fails to start and the Pod status (`kubectl describe pod
...`) shows that the Docker pull rate limit has been reached, it is
possible to point ICN to a [Docker registry
mirror](https://docs.docker.com/registry/recipes/mirror/).

To enable the mirror on the jump server, set `DOCKER_REGISTRY_MIRROR`
in `user_config.sh` before installing the jump server, or follow
Docker's
[instructions](https://docs.docker.com/registry/recipes/mirror/#configure-the-docker-daemon)
to configure the daemon.

To enable the mirror in the provisioned cluster, set the
`dockerRegistryMirrors` value of the cluster chart.

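For the manual daemon configuration on the jump server, a minimal sketch follows. It prints the `registry-mirrors` fragment described in Docker's instructions; the helper name and the mirror URL are hypothetical, and the fragment has to be merged by hand into any existing `/etc/docker/daemon.json` before restarting Docker.

```shell
# Print a daemon.json fragment that points the Docker daemon at a mirror.
# $1: URL of the registry mirror
docker_mirror_config() {
    printf '{\n  "registry-mirrors": ["%s"]\n}\n' "$1"
}

# Usage (https://mirror.example.com is a placeholder):
#   docker_mirror_config https://mirror.example.com
```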
## Helm release stuck in 'pending-install'

If the HelmRelease status for a chart in the workload cluster shows
that an install or upgrade is pending and, for example, no Pods are
being created, it is possible that the Helm controller was restarted
during the installation of the HelmRelease.

The fix is to remove the Helm Secret of the failing release.  After
this, Flux will complete reconciliation successfully.

    kubectl --kubeconfig=icn-admin.conf -n emco delete secret sh.helm.release.v1.db.v1

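The Secret name in the command above follows Helm's `sh.helm.release.v1.<release>.v<revision>` naming scheme. As a sketch, the hypothetical helpers below build that name and list the release Secrets that Helm has labelled as pending; the `emco` namespace and `icn-admin.conf` kubeconfig are taken from the command above and will differ for other releases.

```shell
# Print the name of the Secret that stores a Helm release revision.
# $1: release name, $2: revision number (default 1)
helm_release_secret() {
    printf 'sh.helm.release.v1.%s.v%s\n' "$1" "${2:-1}"
}

# List Helm release Secrets in a namespace that are stuck in a pending state.
# $1: namespace, $2: kubeconfig of the workload cluster (default icn-admin.conf)
pending_helm_secrets() {
    kubectl --kubeconfig="${2:-icn-admin.conf}" -n "$1" get secret \
        -l 'owner=helm,status in (pending-install,pending-upgrade)' -o name
}

# Usage:
#   helm_release_secret db 1
#   pending_helm_secrets emco
```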
## No change in BareMetalHost state

Provisioning can take a fair amount of time. Refer to [Monitoring
progress](installation-guide.md#monitoring-progress) to see where the
process is.

A description of the BareMetalHost states can be found in the [Bare
Metal Operator
documentation](https://github.com/metal3-io/baremetal-operator/blob/main/docs/baremetalhost-states.md).

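To avoid checking by hand, a polling loop along the following lines can report the state until the host reaches a target state or a timeout expires. This is only a sketch: the function name, the default timeout, and the polling interval are arbitrary choices, and the `metal3` namespace is assumed from the other commands in this document.

```shell
# Poll a BareMetalHost until it reaches the target provisioning state.
# $1: BareMetalHost name, $2: target state (for example "provisioned")
# $3: timeout in seconds (default 3600), $4: poll interval (default 30)
# Returns 0 when the target state is reached, 1 on timeout.
wait_for_bmh_state() {
    local name="$1" target="$2" timeout="${3:-3600}" interval="${4:-30}"
    local elapsed=0 state
    while :; do
        state=$(kubectl -n metal3 get bmh "$name" \
            -o jsonpath='{.status.provisioning.state}')
        echo "$(date -u +%H:%M:%S) $name: ${state:-unknown}"
        if [ "$state" = "$target" ]; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Usage:
#   wait_for_bmh_state pod11-node3 provisioned
```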
## BareMetalHost never transitions from Available to Provisioned

If the BareMetalHost has an owner but is not transitioning from
Available to Provisioned, it is possible that the chart values are
misconfigured. Examine the capm3-controller-manager logs for error
messages:

    # kubectl -n capm3-system logs capm3-controller-manager-7db896996c-7dls7 | grep ^E
    ...
    E0512 18:00:24.781426       1 controller.go:304] controller/metal3data "msg"="Reconciler error" "error"="Failed to create secrets: Nic name not found ens5" "name"="icn-nodepool-0" "namespace"="metal3" "reconciler group"="infrastructure.cluster.x-k8s.io" "reconciler kind"="Metal3Data"

In the above instance, the NIC name in the chart values (`ens5`) was
incorrect and setting the correct name resolved the issue.

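Because the error lines are long, it can help to reduce them to the distinct `"error"` messages. The filter below is a sketch with a hypothetical name; the usage comment assumes the Pod belongs to a Deployment named `capm3-controller-manager` (inferred from the Pod name above), which avoids looking up the generated Pod suffix.

```shell
# Read controller logs on stdin and print each distinct "error" message
# found on error-level lines (those starting with "E").
capm3_errors() {
    grep '^E' | sed -n 's/.*"error"="\([^"]*\)".*/\1/p' | sort -u
}

# Usage (the Deployment name is an assumption):
#   kubectl -n capm3-system logs deployment/capm3-controller-manager | capm3_errors
```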
## Vagrant destroy fails with `cannot undefine domain with nvram`

The fix is to destroy each machine individually.  For the default ICN
virtual machine deployment:

    vagrant destroy -f jump
    virsh -c qemu:///system destroy vm-machine-1
    virsh -c qemu:///system undefine --nvram --remove-all-storage vm-machine-1
    virsh -c qemu:///system destroy vm-machine-2
    virsh -c qemu:///system undefine --nvram --remove-all-storage vm-machine-2
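
For deployments with a different number of machines, the same commands can be wrapped in a loop. The function name below is hypothetical, and the machine names passed to it are assumed to match the libvirt domain names, as they do in the default deployment.

```shell
# Destroy the jump server with Vagrant, then remove each listed machine
# with virsh, including its NVRAM file and storage volumes.
# $@: libvirt domain names of the machines to remove
destroy_icn_vms() {
    local vm
    vagrant destroy -f jump
    for vm in "$@"; do
        # "destroy" fails if the domain is already stopped; keep going.
        virsh -c qemu:///system destroy "$vm" || true
        virsh -c qemu:///system undefine --nvram --remove-all-storage "$vm"
    done
}

# Usage:
#   destroy_icn_vms vm-machine-1 vm-machine-2
```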
 149     virsh -c qemu:///system destroy vm-machine-2
 150     virsh -c qemu:///system undefine --nvram --remove-all-storage vm-machine-2