
Memory leak in NewBatchSpanProcessor following upgrade to v1.18.0 #5410

Open
criscola opened this issue Feb 5, 2025 · 0 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments


criscola commented Feb 5, 2025

/kind bug

What steps did you take and what happened:

We are experiencing a strange memory leak in the CAPZ controller after upgrading to v1.18.0. We enabled pprof and saw that memory attributed to trace.NewBatchSpanProcessor grows constantly. The call graph shows that ARMClientOptions calls OTLPTracerProvider, which in turn calls the method mentioned above. Please see the attached pprof heap profiles: heap3.zip.

(screenshot: pprof call graph showing allocations rooted in trace.NewBatchSpanProcessor)

We also see periodic errors in the logs, which may be related:

traces export: context deadline exceeded: rpc error: code = Unavailable desc = name resolver error: produced zero addresses

Note that we disabled tracing entirely.

Anything else you would like to add:

Slack thread: https://kubernetes.slack.com/archives/CEX9HENG7/p1738688304962559

capz-controller-manager deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: '3'
  creationTimestamp: '2025-02-03T16:25:02Z'
  generation: 3
  labels:
    cluster.x-k8s.io/provider: infrastructure-azure
    clusterctl.cluster.x-k8s.io: ''
    control-plane: capz-controller-manager
  name: capz-controller-manager
  namespace: capz-system
  ownerReferences:
    - apiVersion: operator.cluster.x-k8s.io/v1alpha2
      kind: InfrastructureProvider
      name: azure
      uid: a183bcbb-ce1a-47dd-aa0c-58097667637e
  resourceVersion: '482372454'
  uid: 9045ec53-108c-40df-b0bc-6f6d75b05513
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      cluster.x-k8s.io/provider: infrastructure-azure
      control-plane: capz-controller-manager
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: manager
      creationTimestamp: null
      labels:
        azure.workload.identity/use: 'true'
        cluster.x-k8s.io/provider: infrastructure-azure
        control-plane: capz-controller-manager
    spec:
      containers:
        - args:
            - '--leader-elect'
            - '--diagnostics-address=:8443'
            - '--insecure-diagnostics=true'
            - '--feature-gates=MachinePool=true'
            - '--v=4'
            - '--profiler-address=localhost:6060'
            - '--service-reconcile-timeout=2m'
            - '--azurecluster-concurrency=2'
            - '--azuremachine-concurrency=200'
            - '--azuremachinepool-concurrency=30'
            - '--azuremachinepoolmachine-concurrency=200'
          env:
            - name: AZURE_SUBSCRIPTION_ID
              valueFrom:
                secretKeyRef:
                  key: subscription-id
                  name: capz-manager-bootstrap-credentials
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: GOMEMLIMIT
              value: 850MiB
            - name: POD_UID
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid
          image: >-
            registry.k8s.io/cluster-api-azure/cluster-api-azure-controller:v1.18.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /healthz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: manager
          ports:
            - containerPort: 9443
              name: webhook-server
              protocol: TCP
            - containerPort: 9440
              name: healthz
              protocol: TCP
            - containerPort: 8443
              name: metrics
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: healthz
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 10m
              memory: 512Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsGroup: 65532
            runAsUser: 65532
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /tmp/k8s-webhook-server/serving-certs
              name: cert
              readOnly: true
            - mountPath: /var/run/secrets/azure/tokens
              name: azure-identity-token
              readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: capz-manager
      serviceAccountName: capz-manager
      terminationGracePeriodSeconds: 10
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
      volumes:
        - name: cert
          secret:
            defaultMode: 420
            secretName: capz-webhook-service-cert
        - name: azure-identity-token
          projected:
            defaultMode: 420
            sources:
              - serviceAccountToken:
                  audience: api://AzureADTokenExchange
                  expirationSeconds: 3600
                  path: azure-identity-token
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: '2025-02-03T16:25:02Z'
      lastUpdateTime: '2025-02-04T15:22:05Z'
      message: >-
        ReplicaSet "capz-controller-manager-b4bbbd857" has successfully
        progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
    - lastTransitionTime: '2025-02-05T12:55:05Z'
      lastUpdateTime: '2025-02-05T12:55:05Z'
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: 'True'
      type: Available
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Environment:

  • cluster-api-provider-azure version: v1.18.0
  • Kubernetes version: (use kubectl version): Client Version: v1.31.3, Kustomize Version: v5.4.2, Server Version: v1.30.8-gke.1162000
  • OS (e.g. from /etc/os-release): cos-113-18244-236-77 with Docker v24.0.9 and containerd v1.7.24
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 5, 2025
@nawazkh nawazkh added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 5, 2025
Projects
Status: Todo
Development

No branches or pull requests

3 participants