path: root/requirements.txt
author     Rafael Weingärtner <rafael@apache.org>  2021-04-08 09:43:36 -0300
committer  Rafael Weingärtner <rafael@apache.org>  2021-05-04 15:34:21 -0300
commit     b664d4ea01c4a94254601ea5af4ea7d1da70a941 (patch)
tree       e5a1da203e82614543c09ae8a3471089a7f46b53 /requirements.txt
parent     122c55591fa90989e66fb803d9a5aac2db8a7211 (diff)
download   ceilometer-b664d4ea01c4a94254601ea5af4ea7d1da70a941.tar.gz
Ceilometer compute `retry_on_disconnect` using `no-wait`
A problem was discovered on a production Ceilometer compute setup where metrics stopped being gathered. While troubleshooting, we found the following error message:

```
ERROR ceilometer.polling.manager [-] Prevent pollster cpu from polling
```

That error message appeared after the following warning:

```
WARNING ceilometer.compute.pollsters [-] Cannot inspect data of CPUPollster for <UUID>, non-fatal reason: Failed to inspect instance <UUID> stats, can not get info from libvirt: Unable to read from monitor: Connection reset by peer: NoDataException: Failed to inspect instance <UUID> stats, can not get info from libvirt: Unable to read from monitor: Connection reset by peer
```

The instance was running just fine on the host. It appears to have been a concurrency issue with some other process that left the instance locked/unavailable to the Ceilometer compute pollsters. Ceilometer was unable to connect to libvirt (after 2 retries), and the code is designed to prevent Ceilometer from continuing to try. Therefore, the "CPU" metric pollster was put into a permanent error state. To fix the issue, we needed to restart Ceilometer on the affected hosts. However, until we discovered this issue, we lost about 3 days of data.

```
@libvirt_utils.raise_nodata_if_unsupported
@libvirt_utils.retry_on_disconnect
def inspect_instance(self, instance, duration=None):
    domain = self._get_domain_not_shut_off_or_raise(instance)
```

This method tries to retrieve the domain (VM) object (XML description) via libvirt. If that fails, it retries via `@libvirt_utils.retry_on_disconnect`; if the retry also fails, `@libvirt_utils.raise_nodata_if_unsupported` marks the metric as being in permanent error. Other metrics continued working. Therefore, I investigated a bit deeper, and the problem seems to be here:

```
retry_on_disconnect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    stop=tenacity.stop_after_attempt(2))
```

The `retry_on_disconnect` decorator does not configure a wait for the tenacity retry library, and the default is "no wait". Therefore, the retries have a bigger chance of being affected by very minor instabilities (connection hiccups lasting only microseconds can trigger a failure with this configuration). One alternative to avoid such problems in the future is to use a wait configuration such as the one being proposed. The Ceilometer compute pollsters would then wait/sleep before retrying, which would give the system some time to become available to the compute pollsters again. In this proposal, we would wait 2^x * 3 seconds between each retry, starting with 1 second and going up to 60 seconds.

Change-Id: I9a2d46f870dc2d2791a7763177773dc0cf8aed9d
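For illustration, here is a minimal sketch of what the proposed decorator could look like. The `wait_exponential(multiplier=3, min=1, max=60)` parameters are an assumption mapping the "2^x * 3 seconds, starting at 1 second, up to 60 seconds" description above onto tenacity's API, and the `is_disconnection_exception` stand-in is simplified so the snippet is self-contained; the actual change in the Ceilometer tree may differ.

```python
import tenacity


def is_disconnection_exception(exc):
    # Simplified stand-in for the predicate referenced in the snippet above;
    # Ceilometer's real helper inspects libvirt error codes.
    return 'Connection reset by peer' in str(exc)


# Same retry/stop policy as before, plus an exponential wait so the second
# attempt is not fired immediately after a transient libvirt disconnect.
retry_on_disconnect = tenacity.retry(
    retry=tenacity.retry_if_exception(is_disconnection_exception),
    wait=tenacity.wait_exponential(multiplier=3, min=1, max=60),
    stop=tenacity.stop_after_attempt(2))
```

Note that the diff shown below is limited to requirements.txt (the tenacity version bump); the decorator itself lives elsewhere in the Ceilometer source tree.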
Diffstat (limited to 'requirements.txt')
-rw-r--r--  requirements.txt  2
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/requirements.txt b/requirements.txt
index 3713e08b..db677365 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -31,7 +31,7 @@ python-cinderclient>=3.3.0 # Apache-2.0
PyYAML>=5.1 # MIT
requests!=2.9.0,>=2.8.1 # Apache-2.0
stevedore>=1.20.0 # Apache-2.0
-tenacity>=4.12.0 # Apache-2.0
+tenacity>=6.3.1,<7.0.0 # Apache-2.0
tooz[zake]>=1.47.0 # Apache-2.0
os-xenapi>=0.3.3 # Apache-2.0
oslo.cache>=1.26.0 # Apache-2.0