# Troubleshooting

## Where are the logs?

In addition to the monitoring and examining instructions in the
[Deployment](installation-guide.md#deployment) section of the
installation guide, ICN records its execution in various log files.
These logs can be found in the 'logs' subdirectory of the component,
for example 'deploy/ironic/logs'.

The logs of the Bare Metal Operator, Cluster API, and Flux controllers
can be examined using standard K8s tools.

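As a minimal sketch of what "standard K8s tools" can look like in
practice, the helper below tails the logs of every Pod in a namespace
with `kubectl logs`. The function name `tail_ns_logs` is hypothetical
and not part of ICN. The 'metal3' namespace appears elsewhere in this
guide; 'capi-system' and 'flux-system' are the usual upstream defaults
for Cluster API and Flux and are assumptions that may not match a
given deployment.

```shell
# Print the most recent log lines of every Pod in a namespace.
# Usage: tail_ns_logs NAMESPACE [LINES]   (LINES defaults to 50)
# Hypothetical helper; the body runs in a subshell so its variables
# do not leak into the calling shell.
tail_ns_logs() (
    ns=$1
    lines=${2:-50}
    # "get pods -o name" prints one "pod/<name>" entry per line.
    for pod in $(kubectl -n "$ns" get pods -o name); do
        echo "=== $ns $pod ==="
        kubectl -n "$ns" logs "$pod" --all-containers --tail="$lines"
    done
)
```

For example, `tail_ns_logs metal3` shows the last 50 lines per Pod, and
`tail_ns_logs flux-system 200` widens the window for the Flux
controllers (assuming that namespace is in use).
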
## Early provisioning fails

First, confirm that the BMC and PXE Boot configuration are correct as
described in the [Configuration](installation-guide.md#configuration)
section of the installation guide.

It is also recommended to enable the KVM console in the machine using
the Raritan console or the Intel web BMC console to observe early boot
output during provisioning.

  ![BMC console](figure-3.png)

Examining the BareMetalHost resource of the failing machine and the
logs of the Bare Metal Operator and Ironic Pods may also explain why
provisioning is failing.

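One way to examine the BareMetalHost resources at a glance is sketched
below. The helper name `bmh_summary` is hypothetical, and the field
paths `.status.provisioning.state` and `.status.errorMessage` are
assumed from the upstream Metal3 BareMetalHost API; verify them
against the installed CRD if the columns come back empty.

```shell
# Summarize each BareMetalHost: name, provisioning state, last error.
# Usage: bmh_summary [NAMESPACE]   (NAMESPACE defaults to metal3)
# Hypothetical helper; field paths are assumptions about the BMH CRD.
bmh_summary() {
    kubectl -n "${1:-metal3}" get bmh \
        -o custom-columns='NAME:.metadata.name,STATE:.status.provisioning.state,ERROR:.status.errorMessage'
}
```

For the full status and recent events of a single host,
`kubectl -n metal3 describe bmh NAME` is the usual follow-up.
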
### openstack baremetal

In rare cases, the Ironic and Bare Metal Operator information may get
out of sync. In this case, the 'openstack baremetal' tool can be used
to delete the stale information.

The general procedure (shown on the jump server) is:

- Locate the UUID of the active node.

      # kubectl -n metal3 get bmh -o json | jq '.items[]|.status.provisioning.ID'
      "99f64101-04f3-47bf-89bd-ef374097fcdc"

- Examine the Ironic information for stale node and port values.

      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal node list
      +--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
      | UUID                                 | Name        | Instance UUID                        | Power State | Provisioning State | Maintenance |
      +--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
      | 0ec36f3b-80d1-41e6-949a-9ba40a87f625 | None        | None                                 | None        | enroll             | False       |
      | 99f64101-04f3-47bf-89bd-ef374097fcdc | pod11-node3 | 6e16529d-a1a4-450c-8052-46c82c87ca7b | power on    | manageable         | False       |
      +--------------------------------------+-------------+--------------------------------------+-------------+--------------------+-------------+
      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal port list
      +--------------------------------------+-------------------+
      | UUID                                 | Address           |
      +--------------------------------------+-------------------+
      | c65b1324-2cdd-44d0-8d25-9372068add02 | 00:1e:67:f1:5b:91 |
      +--------------------------------------+-------------------+

- Delete the stale node and port.

      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal node delete 0ec36f3b-80d1-41e6-949a-9ba40a87f625
      Deleted node 0ec36f3b-80d1-41e6-949a-9ba40a87f625
      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal port delete c65b1324-2cdd-44d0-8d25-9372068add02
      Deleted port c65b1324-2cdd-44d0-8d25-9372068add02

- Create a new port.

      # OS_TOKEN=fake-token OS_URL=http://localhost:6385/ openstack baremetal port create --node 99f64101-04f3-47bf-89bd-ef374097fcdc 00:1e:67:f1:5b:91
      +-----------------------+--------------------------------------+
      | Field                 | Value                                |
      +-----------------------+--------------------------------------+
      | address               | 00:1e:67:f1:5b:91                    |
      | created_at            | 2021-04-27T22:24:08+00:00            |
      | extra                 | {}                                   |
      | internal_info         | {}                                   |
      | local_link_connection | {}                                   |
      | node_uuid             | 99f64101-04f3-47bf-89bd-ef374097fcdc |
      | physical_network      | None                                 |
      | portgroup_uuid        | None                                 |
      | pxe_enabled           | True                                 |
      | updated_at            | None                                 |
      | uuid                  | 93366f0a-aa12-4815-b524-b95839bfa05d |
      +-----------------------+--------------------------------------+

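When several nodes are involved, comparing the two listings by eye is
error-prone. The sketch below prints the node UUIDs that Ironic knows
about but that no BareMetalHost references. The helper name
`stale_uuids` and the file names in the comments are hypothetical, and
the approach assumes every legitimate node has a matching
`.status.provisioning.ID`; review the output before deleting anything.

```shell
# Print the UUIDs listed in file $1 (nodes known to Ironic) that do not
# appear in file $2 (provisioning IDs referenced by BareMetalHosts).
# Both files hold one UUID per line. Hypothetical helper.
stale_uuids() {
    # -F: fixed strings, -x: whole-line match, -v: keep non-matching lines
    grep -F -x -v -f "$2" "$1"
}

# Example input files (hypothetical names), built on the jump server:
#   kubectl -n metal3 get bmh -o json \
#       | jq -r '.items[]|.status.provisioning.ID' > bmh-ids.txt
#   OS_TOKEN=fake-token OS_URL=http://localhost:6385/ \
#       openstack baremetal node list -f value -c UUID > ironic-ids.txt
#   stale_uuids ironic-ids.txt bmh-ids.txt
```

With the listings shown in the procedure, the only UUID printed would
be 0ec36f3b-80d1-41e6-949a-9ba40a87f625, the node in the 'enroll' state.
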
## Helm release stuck in 'pending-install'

If the HelmRelease status for a chart in the workload cluster shows
that an install or upgrade is pending but, for example, no Pods are
being created, it is possible the Helm controller was restarted during
installation of the HelmRelease.

The fix is to remove the Helm Secret of the failing release. After
this, Flux will complete reconciliation successfully.

     kubectl --kubeconfig=icn-admin.conf -n emco delete secret sh.helm.release.v1.db.v1

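The Secret name in the command above follows Helm's
'sh.helm.release.v1.RELEASE.vREVISION' pattern, so it differs for each
release and revision. A minimal sketch for finding the right Secret is
shown below. It assumes Helm's default Secret storage driver, which
labels release Secrets with 'owner=helm' and a 'status' value; the
helper name `pending_helm_secrets` is hypothetical.

```shell
# List Helm release Secrets that are stuck in a pending state.
# Usage: pending_helm_secrets KUBECONFIG NAMESPACE
# Hypothetical helper; relies on the labels Helm normally sets on its
# release Secrets (an assumption about the storage driver in use).
pending_helm_secrets() {
    kubectl --kubeconfig="$1" -n "$2" get secret \
        -l 'owner=helm,status in (pending-install,pending-upgrade)' \
        -o name
}
```

For example, `pending_helm_secrets icn-admin.conf emco` lists the
candidates in the 'emco' namespace; each name it prints can then be
passed to `kubectl delete` as in the command above.
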
## No change in BareMetalHost state

Provisioning can take a fair amount of time; refer to [Monitoring
progress](installation-guide.md#monitoring-progress) to see where the
process is.