
Memory leak in NewBatchSpanProcessor following upgrade to v1.18.0 #5410

Open
criscola opened this issue Feb 5, 2025 · 0 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments


criscola commented Feb 5, 2025

/kind bug

What steps did you take and what happened:

We are experiencing a strange memory leak in the CAPZ controller after upgrading to v1.18.0. We enabled pprof and saw that memory attributed to trace.NewBatchSpanProcessor grows constantly. The call graph shows that ARMClientOptions calls OTLPTracerProvider, which in turn calls the method mentioned above. Please see the attached pprof heap profiles: heap3.zip.

(screenshot: pprof call graph showing allocations rooted in trace.NewBatchSpanProcessor)

We also see periodic errors in the logs, which may be related:

traces export: context deadline exceeded: rpc error: code = Unavailable desc = name resolver error: produced zero addresses

Note that we disabled tracing entirely.

Anything else you would like to add:

Slack thread: https://kubernetes.slack.com/archives/CEX9HENG7/p1738688304962559

capz-controller-manager deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  creationTimestamp: '2025-02-03T16:25:02Z'
  generation: 3
  labels:
    cluster.x-k8s.io/provider: infrastructure-azure
    clusterctl.cluster.x-k8s.io: ''
    control-plane: capz-controller-manager
  name: capz-controller-manager
  namespace: capz-system
  ownerReferences:
    - apiVersion: operator.cluster.x-k8s.io/v1alpha2
      kind: InfrastructureProvider
      name: azure
      uid: a183bcbb-ce1a-47dd-aa0c-58097667637e
  resourceVersion: '482372454'
  uid: 9045ec53-108c-40df-b0bc-6f6d75b05513
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      cluster.x-k8s.io/provider: infrastructure-azure
      control-plane: capz-controller-manager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: manager
      creationTimestamp: null
      labels:
        azure.workload.identity/use: 'true'
        cluster.x-k8s.io/provider: infrastructure-azure
        control-plane: capz-controller-manager
    spec:
      containers:
        - args:
            - '--leader-elect'
            - '--diagnostics-address=:8443'
            - '--insecure-diagnostics=true'
            - '--feature-gates=MachinePool=true'
            - '--v=4'
            - '--profiler-address=localhost:6060'
            - '--service-reconcile-timeout=2m'
            - '--azurecluster-concurrency=2'
            - '--azuremachine-concurrency=200'
            - '--azuremachinepool-concurrency=30'
            - '--azuremachinepoolmachine-concurrency=200'
          env:
            - name: AZURE_SUBSCRIPTION_ID
              valueFrom:
                secretKeyRef:
                  key: subscription-id
                  name: capz-manager-bootstrap-credentials
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: GOMEMLIMIT
              value: 850MiB
            - name: POD_UID
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid
          image: >-
            registry.k8s.io/cluster-api-azure/cluster-api-azure-controller:v1.18.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: manager
          ports:
            - containerPort: 9443
              name: webhook-server
              protocol: TCP
            - containerPort: 9440
              name: healthz
              protocol: TCP
            - containerPort: 8443
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 10m
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsGroup: 65532
            runAsUser: 65532
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /tmp/k8s-webhook-server/serving-certs
              name: cert
              readOnly: true
            - mountPath: /var/run/secrets/azure/tokens
              name: azure-identity-token
              readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: capz-manager
      serviceAccountName: capz-manager
      terminationGracePeriodSeconds: 10
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
      volumes:
        - name: cert
          secret:
            defaultMode: 420
            secretName: capz-webhook-service-cert
        - name: azure-identity-token
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  audience: api://AzureADTokenExchange
                  expirationSeconds: 3600
                  path: azure-identity-token
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: '2025-02-03T16:25:02Z'
      lastUpdateTime: '2025-02-04T15:22:05Z'
      message: >-
        ReplicaSet "capz-controller-manager-b4bbbd857" has successfully
        progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
    - lastTransitionTime: '2025-02-05T12:55:05Z'
      lastUpdateTime: '2025-02-05T12:55:05Z'
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: 'True'
      type: Available
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Environment:

  • cluster-api-provider-azure version: v1.18.0
  • Kubernetes version: (use kubectl version): Client Version: v1.31.3, Kustomize Version: v5.4.2, Server Version: v1.30.8-gke.1162000
  • OS (e.g. from /etc/os-release): cos-113-18244-236-77 with Docker v24.0.9 and containerd v1.7.24
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 5, 2025
@nawazkh nawazkh added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 5, 2025
Projects
Status: Todo
Development

No branches or pull requests

3 participants