From: Anil Belur Date: Wed, 8 Oct 2025 07:26:45 +0000 (+1000) Subject: CI: Disable stack cost retrieval to stop job hang X-Git-Tag: v0.92.6^0 X-Git-Url: https://gerrit.linuxfoundation.org/infra/gitweb?a=commitdiff_plain;h=c40abb0ddc7fe40e31d8ef66355c341e0b15d447;p=releng%2Fglobal-jjb.git CI: Disable stack cost retrieval to stop job hang This is a temporary workaround to address critical job hangs caused by lftools openstack stack cost command lacking timeout handling. Problem (IT-28614): The lftools stack cost command hangs indefinitely when: - Pricing API (pricing.vexxhost.net) is slow/unresponsive - OpenStack SDK queries for nested stacks take too long - Network operations lack timeout parameters This causes: - Jobs stuck at 'Retrieving stack cost for: ' - Downstream jobs waiting on checkpoints indefinitely - Jenkins queue buildup (66+ jobs waiting reported) - Requires manual intervention to cancel stuck jobs Solution: Disable stack cost retrieval by commenting out the lftools call and hardcode stack cost to 0 in the stack-cost file. Known Regression: Stack costs will be reported as 0 in archived cost.csv files, losing cost tracking data for OpenStack resources. Instance-level costs are still collected via job-cost.sh pricing API calls. Root Cause: lftools/openstack/stack.py cost() function lacks: - timeout parameter for urllib.request.urlopen() (line 87) - timeout handling for OpenStack SDK resource enumeration - graceful degradation on network failures Next Steps: 1. Implement timeout handling in lftools 2. Once lftools is fixed, revert this workaround 3. Create tracking issue for the regression (stack costs = 0) This stopgap prioritizes job reliability over cost data collection. Issue-ID: IT-28614 Change-Id: I9d054347e485f4843c7287aec49fb7aff0962e96 Signed-off-by: Anil Belur --- diff --git a/releasenotes/notes/disable-stack-cost-retrieval-6ab5fd84c23b994e.yaml b/releasenotes/notes/disable-stack-cost-retrieval-6ab5fd84c23b994e.yaml new file mode 100644 index 00000000..b08d8a20 --- /dev/null +++ b/releasenotes/notes/disable-stack-cost-retrieval-6ab5fd84c23b994e.yaml @@ -0,0 +1,32 @@ +--- +fixes: + - | + **STOPGAP FIX:** Temporarily disable stack cost retrieval in + openstack-stack-delete.sh to prevent jobs from hanging indefinitely. + + **Problem:** The lftools openstack stack cost command hangs without + timeout handling when querying the pricing API or OpenStack resources, + causing: + + - Jobs stuck at "Retrieving stack cost" phase + - Subsequent jobs waiting on checkpoints + - Jenkins queue buildup and capacity issues + + **Solution:** Stack cost is now hardcoded to 0 until lftools implements + proper timeout handling for network operations. + + **Known Regression:** Stack costs will be reported as 0 in cost.csv + files, losing cost tracking data for OpenStack resources. Instance-level + costs from the pricing API are still collected via job-cost.sh. + + **Root Cause:** lftools lacks timeout parameters for urllib.request.urlopen() + and OpenStack SDK operations in the stack cost calculation. + + **Tracking:** + + - Fixes: IT-28614 (Jobs hanging on stack cost retrieval) + - Regression: Stack costs always reported as 0 (needs lftools fix) + - Upstream: lftools needs timeout implementation for openstack stack cost command + + **Next Steps:** Once lftools implements timeouts, this workaround should + be reverted to restore cost collection functionality. diff --git a/shell/openstack-stack-delete.sh b/shell/openstack-stack-delete.sh index 158dcfc2..a04fdb33 100644 --- a/shell/openstack-stack-delete.sh +++ b/shell/openstack-stack-delete.sh @@ -21,13 +21,17 @@ lf-activate-venv --python python3 "lftools[openstack]" \ python-openstackclient \ urllib3~=1.26.15 -echo "INFO: Retrieving stack cost for: $OS_STACK_NAME" -if ! lftools openstack --os-cloud "$OS_CLOUD" stack cost "$OS_STACK_NAME" > stack-cost; then - echo "WARNING: Unable to get stack costs, continuing anyway" - echo "total: 0" > stack-cost -else - echo "DEBUG: Successfully retrieved stack cost: $(cat stack-cost)" -fi +# Disabled stack cost retrieval due to lftools hanging without timeout +# See: OpenDaylight Jenkins issue with stuck jobs +# echo "INFO: Retrieving stack cost for: $OS_STACK_NAME" +# if ! lftools openstack --os-cloud "$OS_CLOUD" stack cost "$OS_STACK_NAME" > stack-cost; then +# echo "WARNING: Unable to get stack costs, continuing anyway" +# echo "total: 0" > stack-cost +# else +# echo "DEBUG: Successfully retrieved stack cost: $(cat stack-cost)" +# fi +echo "INFO: Stack cost retrieval disabled, setting cost to 0" +echo "total: 0" > stack-cost # Delete the stack even if the stack-cost script fails lftools openstack --os-cloud "$OS_CLOUD" stack delete "$OS_STACK_NAME" \