AKS Spot Nodepool error #2217

Open
ra002890 opened this issue Jul 9, 2024 · 5 comments

Comments

@ra002890

ra002890 commented Jul 9, 2024

Bicep version
0.28.1 (ba1e9f8c1e)

Describe the bug
When we try to create an AKS node pool with scaleSetPriority set to Spot, we receive the following error:
{"code": "InvalidParameter", "message": "Preflight validation check for resource(s) for container service aks-neuralsearchx-prod in resource group neuralsearchx_group failed. Message: The value of parameter agentPoolProfile.upgrade.maxSurge is invalid. Error details: Spot pools can't set max surge. Please see https://aka.ms/aks-naming-rules for more details.. Details: "}

We are not setting maxSurge or any other upgradeSettings. We have also tried setting this parameter to null, but with no success.

To Reproduce
Try to create an AKS node pool with spot nodes using a Bicep script.
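
In short, the failing pool boils down to an agentPoolProfile along these lines (a trimmed sketch; the pool name and VM size are placeholders, and the full script is in the comment below):

{
  name: 'spotpool' // placeholder name
  mode: 'User'
  count: 1
  minCount: 1
  maxCount: 3
  enableAutoScaling: true
  vmSize: 'Standard_NC4as_T4_v3'
  scaleSetPriority: 'Spot'
  spotMaxPrice: json('0.3')
  // no upgradeSettings / maxSurge set anywhere
}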

@alex-frankel
Collaborator

Can you share the Bicep code you are using to deploy AKS? What happens if you don't use a parameter and hard-code the value in the resource declaration?

@ra002890
Author

ra002890 commented Jul 9, 2024

Thanks for the prompt answer!
This is the Bicep script that I am trying to use.

param location string = resourceGroup().location
param projectName string = 'nsx-${uniqueString(resourceGroup().id)}'
param projectEnv string = 'test'

@description('Specifies the id of the virtual network.')
param virtualNetworkId string

@description('Specifies the name of the default subnet hosting the AKS cluster.')
param aksSubnetName string = 'AksSubnet'

@description('Specifies the CIDR notation IP range from which to assign pod IPs when kubenet is used.')
param aksClusterPodCidr string = '10.244.0.0/16'

@description('A CIDR notation IP range from which to assign service cluster IPs. It must not overlap with any Subnet IP ranges.')
param aksClusterServiceCidr string = '10.2.0.0/16'

@description('Specifies the IP address assigned to the Kubernetes DNS service. It must be within the Kubernetes service address range specified in serviceCidr.')
param aksClusterDnsServiceIP string = '10.2.0.10'

var virtualNetworkName = last(split(virtualNetworkId, '/'))

resource virtualNetwork 'Microsoft.Network/virtualNetworks@2020-08-01' existing = {
  name: virtualNetworkName
}

resource aksSubnet 'Microsoft.Network/virtualNetworks/subnets@2020-08-01' existing = {
  parent: virtualNetwork
  name: aksSubnetName
}

resource networkContributorRole 'Microsoft.Authorization/roleDefinitions@2022-04-01' existing = {
  name: '4d97b98b-1d4f-4787-a291-c67834d212e7'
  scope: subscription()
}

// AKS Cluster
resource aksCluster 'Microsoft.ContainerService/managedClusters@2024-01-02-preview' = {
  name: 'aks-${projectName}-${projectEnv}'
  location: location
  sku: {
    name: 'Base'
    tier: 'Free'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'aks-${projectName}-${projectEnv}-k8s'
    agentPoolProfiles: [
      {
        name: 'agentpool'
        count: 1
        vmSize: 'Standard_B4ms'
        vnetSubnetID: aksSubnet.id
        mode: 'System'
      }
      {
        name: 't4spotpool'
        count: 1
        minCount: 1
        maxCount: 3
        vmSize: 'Standard_NC4as_T4_v3'
        spotMaxPrice: json('0.3')
        mode: 'User'
        vnetSubnetID: aksSubnet.id
        nodeLabels: {
          gpu: 't4'
        }
        nodeTaints: [
          'sku=gpu:NoSchedule'
        ]
        linuxOSConfig: {
          transparentHugePageEnabled: 'madvise'
          transparentHugePageDefrag: 'defer+madvise'
          swapFileSizeMB: 26000
          sysctls: {
            netCoreSomaxconn: 163849
            netIpv4TcpTwReuse: true
            netIpv4IpLocalPortRange: '32000 60000'
          }
        }
        kubeletConfig: {
          cpuManagerPolicy: 'static'
          cpuCfsQuota: true
          cpuCfsQuotaPeriod: '200ms'
          imageGcHighThreshold: 90
          imageGcLowThreshold: 70
          topologyManagerPolicy: 'best-effort'
          allowedUnsafeSysctls: [
            'kernel.msg*'
            'net.*'
          ]
          failSwapOn: false
        }
        upgradeSettings: {
          maxSurge: null
        }
        scaleSetPriority: 'Spot'
        enableAutoScaling: true
      }
    ]
    networkProfile: {
      networkPlugin: 'kubenet'
      podCidr: aksClusterPodCidr
      serviceCidr: aksClusterServiceCidr
      dnsServiceIP: aksClusterDnsServiceIP
    }
  }
  tags: {
    environment: projectEnv
  }
}

// Assign the Network Contributor role to the AKS cluster's system-assigned managed identity, with the resource group as scope
resource aksNetworkContributorRoleAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(resourceGroup().name, 'aksManagedEntity', networkContributorRole.id)
  scope: resourceGroup()
  properties: {
    roleDefinitionId: networkContributorRole.id
    principalId: aksCluster.identity.principalId
    principalType: 'ServicePrincipal'
  }
}

@alex-frankel
Collaborator

Got it - this is more than likely not an issue with Bicep itself. I would recommend opening up a support case so this can be routed to the AKS team. I will also share this with the AKS PG to see if they can help in the meantime.

@matthchr
Member

matthchr commented Jul 9, 2024

This looks like an AKS validation bug to me. If you set the Type field of the spot AgentPool to VirtualMachineScaleSets, I believe the error will stop.
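
Roughly, applied to the spot pool profile from the script above (only the type line is new; the other properties are unchanged from the original):

{
  name: 't4spotpool'
  count: 1
  minCount: 1
  maxCount: 3
  enableAutoScaling: true
  vmSize: 'Standard_NC4as_T4_v3'
  mode: 'User'
  vnetSubnetID: aksSubnet.id
  scaleSetPriority: 'Spot'
  spotMaxPrice: json('0.3')
  type: 'VirtualMachineScaleSets' // explicit pool type; works around the maxSurge preflight error
  // ... remaining settings (nodeLabels, nodeTaints, linuxOSConfig, kubeletConfig) as before
}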

@stephaniezyen stephaniezyen transferred this issue from Azure/bicep Jul 10, 2024
@matthchr
Member

We'll get a fix in so that in the future you don't need this workaround (setting Type).

It won't roll out for a few weeks, though.
