Canary Releases in Istio with Flagger

Updated: January 2, 2024

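🔰 Note:

This hands-on section has not been tested: the Istio gateway deployment failed in my environment, so the demo could not be run end to end. The document is kept up to date for now, and the gateway deployment problem will be revisited later.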

Hands-on exercises in this section

Exercise name
🚩 Hands-on: Installing and testing the Gateway API in Istio, 2023.12.23 (tested successfully)

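Introduction

A gray release, also called a canary deployment, replaces an old version with a new one gradually by controlling the share of traffic each version receives. Suppose service A has two versions (a blue one and a green one) deployed side by side, with version1 receiving 90% of the traffic and version2 receiving 10%. We observe how version2 behaves in practice; if it meets expectations, we shift the ratio step by step, for example 80:20 -> 70:30 -> 10:90 -> 0:100, until version1 is retired and version2 serves everything. If validation fails, 100% of the traffic is switched back to version1 (a rollback).

There are several options for implementing gray releases on Istio, such as Flagger and Argo Rollouts.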

Flagger

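Flagger is a progressive delivery Kubernetes operator that automates the release process for applications running on Kubernetes. It reduces the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring metrics and running conformance tests.

Flagger uses a service mesh (App Mesh, Istio, Linkerd, Kuma, Open Service Mesh) or an ingress controller (Contour, Gloo, NGINX, Skipper, Traefik, APISIX) for traffic routing. For release analysis, Flagger can query Prometheus, InfluxDB, Datadog, New Relic, CloudWatch, Stackdriver, or Graphite, and it can send alerts through Slack, MS Teams, Discord, and Rocket.

Note:

Flagger is configured through Kubernetes CRDs and is compatible with any CI/CD solution built for Kubernetes. Because Flagger is declarative and reacts to Kubernetes events, it can be used in GitOps pipelines alongside tools of that kind.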

Installing Flagger

To use Flagger, first choose a supported routing provider (Istio in our case), then install Flagger with Helm or Kustomize.

Flagger requires Kubernetes v1.16 or later and Istio v1.5 or later.

Istio has to be installed first, with the Prometheus addon enabled:

# either the demo or the default profile works
istioctl manifest install --set profile=demo -y

# run from the Istio root directory
kubectl apply -f samples/addons/prometheus.yaml

Then install Flagger in the istio-system namespace:

$ git clone https://github.com/fluxcd/flagger && cd flagger
$ kubectl apply -k kustomize/istio
$ kubectl get pods -n istio-system -l app=flagger
NAME                      READY   STATUS    RESTARTS   AGE
flagger-ff76bfdff-kkcmz   1/1     Running   0          17m

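Test application

Next we create a namespace named test and enable automatic Istio sidecar injection for it:

kubectl create ns test
kubectl label namespace test istio-injection=enabled

We then deploy the podinfo application that the Flagger project provides for testing:

kubectl apply -k kustomize/podinfo

This command creates a Deployment and an HPA object for the podinfo application.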

$ kubectl get deployment -n test
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
podinfo   2/2     2            0           96s
$ kubectl get hpa -n test
NAME      REFERENCE            TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
podinfo   Deployment/podinfo   <unknown>/99%   2         4         2          60s

After the deployment, each podinfo Pod runs 2 containers (the Istio sidecar was injected automatically), and the HPA is in place as well.

[root@master1 ~]#kubectl get pods -n test
NAME                       READY   STATUS    RESTARTS   AGE
podinfo-584c4546df-5rbdz   2/2     Running   0          2d10h
podinfo-584c4546df-lkw2k   2/2     Running   0          2d10h

Next we deploy a load-testing service that generates traffic during the canary analysis:

kubectl apply -k kustomize/tester

Check the result:

[root@master1 flagger]#kubectl get po -ntest
NAME                                  READY   STATUS    RESTARTS   AGE
flagger-loadtester-78dd9787d4-957xk   2/2     Running   0          40m
podinfo-584c4546df-5rbdz              2/2     Running   0          2d10h
podinfo-584c4546df-lkw2k              2/2     Running   0          2d10h

Canary CRD

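Flagger automates the release process for Kubernetes workloads through a custom resource (CRD) named Canary.

For the podinfo application above, for example, we can define a canary release with progressive traffic shifting as follows: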

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

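The object above means that when a new version of the application is deployed, Flagger gradually shifts traffic to the canary while measuring the request success rate and the average response duration. The canary analysis can also be extended with custom metrics, acceptance tests, and load tests to strengthen validation of the release process.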
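To make the stepWeight and maxWeight settings concrete, here is a minimal Python sketch of the weight schedule they imply. The function names are made up for illustration and are not part of Flagger. The sketch assumes the canary weight rises by stepWeight on every successful analysis run until maxWeight is reached, so the number of steps times the interval gives a rough lower bound on the duration of a healthy rollout.

```python
def canary_weights(step_weight, max_weight):
    """Traffic percentages the canary passes through, one per successful run."""
    return list(range(step_weight, max_weight + 1, step_weight))


def minimum_analysis_minutes(interval_minutes, step_weight, max_weight):
    """Rough lower bound: one analysis interval per weight step (an assumption)."""
    return interval_minutes * len(canary_weights(step_weight, max_weight))


# Values taken from the Canary example: interval 1m, stepWeight 5, maxWeight 50.
print(canary_weights(5, 50))
print(minimum_analysis_minutes(1, 5, 50))
```

With the values from the example, the canary would pass through ten weight steps, so a rollout with no failed checks needs at least about ten minutes of analysis under these assumptions.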

If multiple service meshes or ingress controllers run in the same cluster, spec.provider can be used to override the global provider for a specific Canary.

Canary target

A Canary resource can target a Kubernetes Deployment or DaemonSet. For example, a Kubernetes Deployment target can be defined as follows:

spec:
  progressDeadlineSeconds: 60
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
    primaryScalerReplicas:
      minReplicas: 2
      maxReplicas: 5

Based on the configuration above, Flagger generates the following Kubernetes objects:

deployment/<targetRef.name>-primary
hpa/<autoscalerRef.name>-primary

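The primary deployment is treated as the stable release of the application. By default all traffic is routed to it, and the target deployment is scaled to zero. Flagger detects changes to the target deployment and runs a canary analysis before promoting the new version to primary.

In addition, .spec.autoscalerRef.primaryScalerReplicas can override the replica scaling settings of the generated primary HPA. This is useful when the primary workload should scale differently from the values in the original workload's HPA.

Note: the target Deployment must have a single label selector in the format app: <DEPLOYMENT-NAME>, for example: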

apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
spec:
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo

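Besides the app label, Flagger also supports the name and app.kubernetes.io/name selectors. If you follow a different convention, you can specify the label with the -selector-labels=my-app-label command flag in Flagger's startup arguments.

If the target Deployment uses Secrets or ConfigMaps, Flagger creates a copy of each object with the -primary suffix and references those copies in the primary Deployment. If you annotate a ConfigMap or Secret with flagger.app/config-tracking: disabled, Flagger uses the same object for the primary deployment instead of creating a primary copy. Secret and ConfigMap tracking can also be disabled globally with the -enable-config-tracking=false command flag in Flagger's startup arguments.

The autoscalerRef field is optional. When it is specified, Flagger pauses the traffic increase while the target and primary deployments are scaling up or down. An HPA can help reduce resource usage during the canary analysis. With an autoscaler reference in place, any change made to the autoscaler takes effect in the primary HPA only after a rollout of the Deployment has started and completed successfully. Alternatively, you can create two HPAs, one for the canary and one for the primary, so that the HPA can be updated without a new rollout; because the canary deployment is scaled to 0, the HPA on the canary stays inactive.

The progressDeadlineSeconds field is the maximum time in seconds that a canary deployment may take to make progress before it is rolled back. The default is ten minutes.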

Canary service

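The Canary resource determines how the target workload is exposed inside the cluster. The canary target should expose a TCP port, which Flagger uses to create the ClusterIP Services.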

spec:
  service:
    name: podinfo
    port: 9898
    portName: http
    appProtocol: http
    targetPort: 9898
    portDiscovery: true

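The container port of the target workload should match service.port or service.targetPort. service.name is optional and defaults to spec.targetRef.name. service.targetPort can be a container port number or a port name. service.portName is optional and defaults to http; if your workload uses gRPC, set the port name to grpc. service.appProtocol is optional as well.

If port discovery is enabled, Flagger scans the target workload and extracts the container ports, excluding the port specified in the Canary service and the service mesh sidecar ports. These ports are used when the ClusterIP Services are generated.

Based on the Canary service spec, Flagger creates the following Kubernetes ClusterIP Services:

<service.name>.<namespace>.svc.cluster.local: selector app=<name>-primary
<service.name>-primary.<namespace>.svc.cluster.local: selector app=<name>-primary
<service.name>-canary.<namespace>.svc.cluster.local: selector app=<name>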
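The naming scheme of the generated Services can be spelled out with a small Python sketch. The helper below is hypothetical and only mirrors the three entries listed above; it assumes the selector value is built from the target workload name.

```python
def generated_services(service_name, target_name, namespace):
    """Map each generated Service DNS name to the pod selector it uses."""
    suffix = f"{namespace}.svc.cluster.local"
    return {
        # apex and primary Services both select the primary pods
        f"{service_name}.{suffix}": f"app={target_name}-primary",
        f"{service_name}-primary.{suffix}": f"app={target_name}-primary",
        # the canary Service selects the pods of the target workload itself
        f"{service_name}-canary.{suffix}": f"app={target_name}",
    }


for host, selector in generated_services("podinfo", "podinfo", "test").items():
    print(host, "->", selector)
```

For podinfo in the test namespace this yields podinfo, podinfo-primary, and podinfo-canary, with only the canary Service pointing at the pods of the original Deployment.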

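These Services ensure that traffic to podinfo.test:9898 is routed to the latest stable release of the application, while the podinfo-canary.test:9898 address is available only during the canary analysis and can be used for conformance or load testing. Flagger can also be configured to set annotations and labels on the generated Services: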

spec:
  service:
    port: 9898
    apex:
      annotations:
        test: "test"
      labels:
        test: "test"
    canary:
      annotations:
        test: "test"
      labels:
        test: "test"
    primary:
      annotations:
        test: "test"
      labels:
        test: "test"

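The apex annotations are added to both the generated Kubernetes Service and the generated service mesh or ingress object. Beyond port mapping and metadata, spec.service can also carry URI match and rewrite rules as well as timeout and retry policies: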

spec:
  service:
    port: 9898
    match:
      - uri:
          prefix: /
    rewrite:
      uri: /
    retries:
      attempts: 3
      perTryTimeout: 1s
    timeout: 5s

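With Istio as the mesh provider, you can additionally specify HTTP headers, CORS and traffic policies, Istio gateways, and hosts. The following object, for example, exposes the workload inside the mesh at frontend.test.svc.cluster.local:9898 and outside the mesh at frontend.example.com. For the external host, an Istio ingress gateway must be specified.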

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: frontend
  namespace: test
spec:
  service:
    # container port
    port: 9898
    # service port name (optional, will default to "http")
    portName: http-frontend
    # Istio gateways (optional)
    gateways:
      - istio-system/public-gateway
      - mesh
    # Istio virtual service host names (optional)
    hosts:
      - frontend.example.com
    # Istio traffic policy
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # HTTP match conditions (optional)
    match:
      - uri:
          prefix: /
    # HTTP rewrite (optional)
    rewrite:
      uri: /
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
    # Add headers (optional)
    headers:
      request:
        add:
          x-some-header: "value"
    # cross-origin resource sharing policy (optional)
    corsPolicy:
      allowOrigin:
        - example.com
      allowMethods:
        - GET
      allowCredentials: false
      allowHeaders:
        - x-some-header
      maxAge: 24h

For the spec above, Flagger generates the following VirtualService object:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
  namespace: test
spec:
  gateways:
    - istio-system/public-gateway
    - mesh
  hosts:
    - frontend.example.com
    - frontend
  http:
    - corsPolicy:
        allowHeaders:
          - x-some-header
        allowMethods:
          - GET
        allowOrigin:
          - example.com
        maxAge: 24h
      headers:
        request:
          add:
            x-some-header: "value"
      match:
        - uri:
            prefix: /
      rewrite:
        uri: /
      route:
        - destination:
            host: podinfo-primary
          weight: 100
        - destination:
            host: podinfo-canary
          weight: 0
      retries:
        attempts: 3
        perTryTimeout: 1s
        retryOn: "gateway-error,connect-failure,refused-stream"

Flagger keeps the VirtualService and DestinationRule in sync with the Canary service spec.

Canary status

kubectl can show the current status of canary deployments across the cluster:

$ kubectl get canaries --all-namespaces

NAMESPACE   NAME      STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo   Progressing   15       2019-06-30T14:05:07Z
prod        frontend  Succeeded     0        2019-06-30T16:15:07Z
prod        backend   Failed        0        2019-06-30T17:05:07Z

status.conditions reflects the latest state of the canary analysis.

$ kubectl -n test get canary/podinfo -oyaml | awk '/status/,0'
status:
  canaryWeight: 0
  failedChecks: 0
  iterations: 0
  lastAppliedSpec: "14788816656920327485"
  lastPromotedSpec: "14788816656920327485"
  conditions:
  - lastTransitionTime: "2019-07-10T08:23:18Z"
    lastUpdateTime: "2019-07-10T08:23:18Z"
    message: Canary analysis completed successfully, promotion finished.
    reason: Succeeded
    status: "True"
    type: Promoted

The possible status conditions are Initialized, Waiting, Progressing, WaitingPromotion, Promoting, Finalising, Succeeded, and Failed.

The following commands can be used to check the canary status:

# update the container image
kubectl set image deployment/podinfo podinfod=stefanprodan/podinfo:3.0.1

# wait for Flagger to detect the change
ok=false
until ${ok}; do
    kubectl get canary/podinfo | grep 'Progressing' && ok=true || ok=false
    sleep 5
done

# wait for the canary analysis to finish
kubectl wait canary/podinfo --for=condition=promoted --timeout=5m

# check if the deployment was successful
kubectl get canary/podinfo | grep Succeeded

Canary analysis

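The canary analysis defines:

the type of deployment strategy
the metrics used to validate the canary version
the webhooks used for conformance and load testing
the alerting settings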

analysis:
  # schedule interval (default 60s)
  interval:
  # max number of failed metric checks before rollback
  threshold:
  # max traffic percentage routed to canary
  # percentage (0-100)
  maxWeight:
  # canary increment step
  # percentage (0-100)
  stepWeight:
  # promotion increment step
  # percentage (0-100)
  stepWeightPromotion:
  # total number of iterations
  # used for A/B Testing and Blue/Green
  iterations:
  # threshold of primary pods that need to be available to consider it ready
  # before starting rollout. this is optional and the default is 100
  # percentage (0-100)
  primaryReadyThreshold: 100
  # threshold of canary pods that need to be available to consider it ready
  # before starting rollout. this is optional and the default is 100
  # percentage (0-100)
  canaryReadyThreshold: 100
  # canary match conditions
  # used for A/B Testing
  match:
    -  # HTTP header
  # key performance indicators
  metrics:
    -  # metric check
  # alerting
  alerts:
    -  # alert provider
  # external checks
  webhooks:
    -  # hook

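The canary analysis runs periodically until it reaches the maximum traffic weight or the number of iterations. On each run, Flagger calls the webhooks and checks the metrics; if the failed-checks threshold is reached, it stops the analysis and rolls back the canary. If alerting is configured, Flagger posts the analysis result through the alert provider.

Canary deployment

We can now create a Canary custom resource to carry out our canary release. The Canary object is the core of Flagger and describes the target of the canary release.

Create a Canary object for the podinfo application as follows: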

# podinfo-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  targetRef: # reference to the Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
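  progressDeadlineSeconds: 60 # maximum time in seconds for the canary deployment to make progress (default 600); consider a larger value, because image pulls can be slow on some nodes
  autoscalerRef: # HPA reference (optional)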
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
    targetPort: 9898
    gateways: # Istio gateways (optional); the Gateway must exist, so create it yourself
      - istio-system/public-gateway
    hosts: # VirtualService host names (optional)
      - podinfo.k8s.local
    trafficPolicy: # Istio traffic policy (optional)
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    retries: # Istio retry policy (optional)
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis: # canary analysis
    interval: 1m # interval between analysis runs (default 60s)
    threshold: 5 # number of failed checks before rollback (default 5)
    maxWeight: 50 # maximum traffic weight for the canary (default 50)
    stepWeight: 10 # traffic weight increment per step (default 10)
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          max: 500
        interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          # load-test podinfo-canary with hey for 1 minute: -c 2 runs 2 concurrent workers and -q 10 limits each worker to 10 requests per second
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

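The manifest above defines the canary release policy for the podinfo application. targetRef specifies the Deployment to release, service specifies the Service for the canary release, and analysis specifies the canary analysis policy. We use the two built-in metric checks here: request-success-rate, the HTTP request success rate, and request-duration, the request duration. For each metric, thresholdRange sets the acceptable range of values and interval sets the time window over which the metric is evaluated. The built-in checks are available for every service mesh and ingress controller and are implemented with Prometheus queries.

In service we specify the Istio Gateway (istio-system/public-gateway) and the host name the VirtualService should use.

First, create a Gateway object for the application:

The gateway deployment in my Kubernetes cluster is broken, so I could not test this step. 😥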

# public-gateway.yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: public-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"

Deploy it:

[root@master1 flagger]#kubectl apply -f public-gateway.yaml 
gateway.networking.istio.io/public-gateway created

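In the Canary object above, the webhooks field also specifies the tests to run around the canary analysis: acceptance-test runs before the analysis starts, and load-test runs during the analysis.

We can now create the Canary object: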

kubectl apply -f podinfo-canary.yaml

Once the Canary object is created, Flagger automatically creates a Deployment named podinfo-primary, along with the primary and canary Service objects and an HPA object:

$ kubectl get deploy -n test
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
flagger-loadtester   1/1     1            1           42m
podinfo              0/0     0            0           46m
podinfo-primary      2/2     2            2           7m3s
$ kubectl get svc -ntest
NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
flagger-loadtester   ClusterIP   10.106.172.190   <none>        80/TCP     35m
podinfo-canary       ClusterIP   10.101.184.213   <none>        9898/TCP   39s
podinfo-primary      ClusterIP   10.110.105.36    <none>        9898/TCP   39s
$ kubectl get hpa -ntest
NAME              REFERENCE                    TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
podinfo           Deployment/podinfo           <unknown>/99%   2         4         0          37h
podinfo-primary   Deployment/podinfo-primary   <unknown>/99%   2         4         2          37h

As shown, the podinfo application has moved from the podinfo Deployment to the podinfo-primary Deployment.

There are also Istio objects:

$ kubectl get vs -ntest
NAME      GATEWAYS                          HOSTS                             AGE
podinfo   ["istio-system/public-gateway"]   ["podinfo.k8s.local","podinfo"]   91s
$ kubectl get dr -ntest
NAME              HOST              AGE
podinfo-canary    podinfo-canary    95s
podinfo-primary   podinfo-primary   95s

Take a look at the generated VirtualService object:

$ kubectl get vs -ntest podinfo -oyaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: podinfo
  namespace: test
spec:
  gateways:
  - istio-system/public-gateway
  hosts:
  - podinfo.k8s.local
  - podinfo
  http:
  - retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: podinfo-primary
      weight: 100
    - destination:
        host: podinfo-canary
      weight: 0

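The configuration shows that all traffic for the podinfo application is currently routed to podinfo-primary, while podinfo-canary has a traffic weight of 0. The DestinationRule objects can be inspected in the same way: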

$ kubectl get dr podinfo-primary -ntest -oyaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: podinfo-primary
  namespace: test
spec:
  host: podinfo-primary
  trafficPolicy:
    tls:
      mode: DISABLE
$ kubectl get dr podinfo-canary -ntest -oyaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: podinfo-canary
  namespace: test
spec:
  host: podinfo-canary
  trafficPolicy:
    tls:
      mode: DISABLE

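By default, then, requests now reach the podinfo-primary Deployment, which is the current stable version. Opening podinfo in a browser shows the current version.

Automated canary release

The current version is podinfo v6.0.0. Next we upgrade the application to trigger a canary release. A canary release is triggered by a change to any of the following objects:

Deployment PodSpec (container image, command, ports, environment variables, resources, and so on)
ConfigMaps mounted as volumes or mapped to environment variables
Secrets mounted as volumes or mapped to environment variables

For example, changing the image version of the Deployment triggers an automated canary release: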

kubectl -n test set image deployment/podinfo podinfod=ghcr.io/stefanprodan/podinfo:6.0.1

Once Flagger detects the change to the Deployment, it starts a new rollout:

$ kubectl describe canaries podinfo -ntest
# ......
Events:
  Type     Reason  Age                From     Message
  ----     ------  ----               ----     -------
  Warning  Synced  15m                flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
  Normal   Synced  14m (x2 over 15m)  flagger  all the metrics providers are available!
  Normal   Synced  14m                flagger  Initialization done! podinfo.test
  Normal   Synced  56s                flagger  New revision detected! Scaling up podinfo.test

Note that if new changes are applied to the Deployment during the canary analysis, Flagger restarts the analysis.

The first step is to scale up the podinfo Deployment:

$ kubectl get pods -ntest
NAME                                  READY   STATUS    RESTARTS   AGE
flagger-loadtester-78dd9787d4-dq5fc   2/2     Running   0          67m
podinfo-5d5dbc4d84-f2mp6              2/2     Running   0          31s
podinfo-5d5dbc4d84-gd8ln              2/2     Running   0          31s
podinfo-primary-64f865cf4-bhr79       2/2     Running   0          3m31s
podinfo-primary-64f865cf4-tgsdj       2/2     Running   0          3m31s

Flagger then runs the analysis according to the policy defined in the Canary object and raises the canary weight step by step.

Warning Synced 4m4s flagger podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
Normal Synced 3m4s (x2 over 4m4s) flagger all the metrics providers are available!
Normal Synced 3m4s flagger Initialization done! podinfo.test
Normal Synced 64s flagger New revision detected! Scaling up podinfo.test
Normal Synced 4s flagger Starting canary analysis for podinfo.test
Normal Synced 4s flagger Pre-rollout check acceptance-test passed
Normal Synced 4s flagger Advance podinfo.test canary weight 10

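At the end, the new version is promoted automatically and receives all traffic.

The whole canary release is driven by adjusting the route weights in the VirtualService.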

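Automated rollback

During the canary analysis, we can generate HTTP 500 errors and high latency to test whether Flagger halts the rollout.

Trigger another canary release:

kubectl -n test set image deployment/podinfo podinfod=ghcr.io/stefanprodan/podinfo:6.0.2

Then open a shell in the loadtester container:

kubectl -n test exec -it flagger-loadtester-xx-xx -- sh

Generate HTTP 500 errors with the following command:

watch curl http://podinfo-canary:9898/status/500

Latency can be added as well:

watch curl http://podinfo-canary:9898/delay/1

When the number of failed checks reaches the threshold configured in the canary analysis, traffic is routed back to the primary, the canary is scaled to zero, and the rollout is marked as failed.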

Normal   Synced  8m10s (x3 over 45m)  flagger  New revision detected! Scaling up podinfo.test
Normal   Synced  7m10s (x2 over 44m)  flagger  Pre-rollout check acceptance-test passed
Normal   Synced  7m10s (x2 over 44m)  flagger  Advance podinfo.test canary weight 10
Normal   Synced  7m10s (x2 over 44m)  flagger  Starting canary analysis for podinfo.test
Warning  Synced  6m10s                flagger  Halt podinfo.test advancement success rate 55.86% < 99%
Warning  Synced  5m10s                flagger  Halt podinfo.test advancement success rate 97.61% < 99%
Warning  Synced  4m10s                flagger  Halt podinfo.test advancement success rate 8.00% < 99%
Warning  Synced  3m10s                flagger  Halt podinfo.test advancement success rate 98.13% < 99%
Warning  Synced  2m10s                flagger  Halt podinfo.test advancement success rate 7.69% < 99%
Warning  Synced  70s (x2 over 14m)    flagger  Canary failed! Scaling down podinfo.test
Warning  Synced  70s                  flagger  Rolling back podinfo.test failed checks threshold reached 5
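The rollback decision visible in this event log can be modeled with a short Python sketch. This is a simplified illustration, not Flagger's implementation: the function name is hypothetical, only the success-rate metric is considered, and weight advancement on passing checks is left out. The inputs are the five success rates reported in the events above, with min 99 and threshold 5 as in the Canary object.

```python
def run_analysis(success_rates, min_success_rate=99.0, threshold=5):
    """Count failed checks and roll back once the threshold is reached."""
    failed_checks = 0
    events = []
    for rate in success_rates:
        if rate < min_success_rate:
            failed_checks += 1
            events.append(
                f"Halt advancement success rate {rate:.2f}% < {min_success_rate:g}%"
            )
        if failed_checks >= threshold:
            events.append(
                f"Rolling back failed checks threshold reached {failed_checks}"
            )
            return "Failed", events
    return "Progressing", events


observed = [55.86, 97.61, 8.00, 98.13, 7.69]  # rates from the event log
status, events = run_analysis(observed)
print(status)
for event in events:
    print(event)
```

Every one of the five observed rates is below 99%, so the fifth failed check reaches the threshold and the sketch ends in the Failed state, matching the order of events in the log.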

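Session affinity

Flagger can perform weighted routing and A/B testing on their own, and with Istio it can combine the two into a canary release with session affinity. Canary releases are helpful when rolling out a new feature to users gradually, but because routing is weight-based, a user who has already been routed to the new version can still be routed to the old version on a later request, which may not be what we want.

Because A/B testing is particularly useful for applications that need session affinity, this strategy integrates cookie-based routing with regular weight-based routing. Once a user has reached the new version of the application (based on the traffic weight), they are always routed to that version and never back to the old one.

The feature is enabled by specifying .spec.analysis.sessionAffinity in the Canary object: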

analysis:
  # schedule interval (default 60s)
  interval: 1m
  # max number of failed metric checks before rollback
  threshold: 10
  # max traffic percentage routed to canary
  # percentage (0-100)
  maxWeight: 50
  # canary increment step
  # percentage (0-100)
  stepWeight: 2
  # session affinity config
  sessionAffinity:
    # name of the cookie used
    cookieName: flagger-cookie
    # max age of the cookie (in seconds)
    # optional; defaults to 86400
    maxAge: 21600

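Here .spec.analysis.sessionAffinity.cookieName is the name of the cookie that is stored. The cookie value is a randomly generated string that acts as a unique identifier. With the configuration above, responses to requests routed to the canary during a canary run carry the following header: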

Set-Cookie: flagger-cookie=LpsIaLdoNZ; Max-Age=21600

Once the canary run ends and all traffic has shifted back to the primary, all responses carry the following header:

Set-Cookie: flagger-cookie=LpsIaLdoNZ; Max-Age=-1

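This tells the client to delete the cookie, so that no stale cookie is left on the user's system.

If a new canary run is triggered, the response header sets a new cookie for all requests routed to the canary: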

Set-Cookie: flagger-cookie=McxKdLQoIN; Max-Age=21600
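How a client interprets these headers can be sketched in a few lines of Python. The helpers below are illustrative only and handle just the simple name=value; Max-Age=N shape shown above. They rely on the cookie rule from RFC 6265 that a Max-Age of zero or less expires the cookie immediately.

```python
def parse_set_cookie(header_value):
    """Split a Set-Cookie value into (name, value, attributes)."""
    first, *rest = [part.strip() for part in header_value.split(";")]
    name, _, value = first.partition("=")
    attributes = {}
    for part in rest:
        key, _, attr_value = part.partition("=")
        attributes[key.lower()] = attr_value
    return name, value, attributes


def is_deletion(attributes):
    """A Max-Age of zero or less tells the client to drop the cookie."""
    return int(attributes.get("max-age", "1")) <= 0


during_run = parse_set_cookie("flagger-cookie=LpsIaLdoNZ; Max-Age=21600")
after_run = parse_set_cookie("flagger-cookie=LpsIaLdoNZ; Max-Age=-1")
print(during_run, is_deletion(during_run[2]))
print(after_run, is_deletion(after_run[2]))
```

The first header keeps the cookie for six hours (21600 seconds), while the second, with Max-Age=-1, is treated as a deletion.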

To enable session affinity for our podinfo Canary object, configure it like this:

# podinfo-canary-session-affinity.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
    targetPort: 9898
    gateways:
      - istio-system/public-gateway
    hosts:
      - podinfo.k8s.local
    trafficPolicy:
      tls:
        mode: DISABLE
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    sessionAffinity: # session affinity settings
      cookieName: flagger-cookie # cookie name
      maxAge: 21600 # maximum cookie age in seconds (default 86400)
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"

Update the Canary object:

kubectl apply -f podinfo-canary-session-affinity.yaml

Then trigger a canary release again:

kubectl -n test set image deployment/podinfo podinfod=ghcr.io/stefanprodan/podinfo:6.0.0

During the canary run, a client that has been routed to version 6.0.0 keeps being routed to 6.0.0 until the release finishes, and the corresponding cookie is visible in its requests.

The VirtualService object now contains the session affinity configuration:

$ kubectl get vs -ntest podinfo -oyaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: podinfo
  namespace: test
spec:
  gateways:
  - istio-system/public-gateway
  hosts:
  - podinfo.k8s.local
  - podinfo
  http:
  - match:
    - headers:
        Cookie:
          regex: .*flagger-cookie.*tmVCwNFKaj.*
    name: sticky-route
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: podinfo-primary
      weight: 0
    - destination:
        host: podinfo-canary
      weight: 100
  - retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: gateway-error,connect-failure,refused-stream
    route:
    - destination:
        host: podinfo-primary
      weight: 50
    - destination:
        host: podinfo-canary
      headers:
        response:
          add:
            Set-Cookie: flagger-cookie=tmVCwNFKaj; Max-Age=21600
      weight: 50

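The VirtualService adds a Set-Cookie header to responses from the canary route and routes subsequent requests by their Cookie header, which is how session affinity is achieved.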
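The two-route structure can be illustrated with a minimal Python sketch. It only mimics the order in which the routes are evaluated: the regex is the one from the sticky-route match above, while the function name and return labels are hypothetical. Python's re module stands in for Istio's regex engine here, and the sketch assumes the pattern has to match the whole header value.

```python
import re

# Pattern copied from the sticky-route match in the VirtualService above.
STICKY_PATTERN = re.compile(r".*flagger-cookie.*tmVCwNFKaj.*")


def pick_route(cookie_header):
    """Return which HTTP route of the VirtualService would handle the request."""
    if cookie_header is not None and STICKY_PATTERN.fullmatch(cookie_header):
        return "sticky-route"  # 100% of this traffic goes to podinfo-canary
    return "weighted-route"  # split between primary and canary by weight


print(pick_route("flagger-cookie=tmVCwNFKaj"))
print(pick_route("theme=dark; flagger-cookie=tmVCwNFKaj"))
print(pick_route("flagger-cookie=McxKdLQoIN"))
print(pick_route(None))
```

A request carrying the cookie of the current run is pinned to the canary. A request without it, or with the cookie of an earlier run, falls through to the weighted route, where a canary response sets the cookie.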

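Traffic mirroring

For applications that perform read operations, Flagger can be configured to drive canary releases with traffic mirroring. Istio traffic mirroring copies each incoming request, sending one request to the primary service and one to the canary service. The response from the primary is sent back to the user, and the response from the canary is discarded. Metrics are collected for both requests, so the rollout proceeds only if the canary metrics stay within the threshold values.

Traffic mirroring is enabled by replacing stepWeight/maxWeight with iterations and setting analysis.mirror to true: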

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  analysis:
    interval: 1m
    threshold: 5
    # total number of iterations
    iterations: 10
    # enable traffic mirroring
    mirror: true
    # percentage of traffic mirrored to the canary (default 100%)
    mirrorWeight: 100
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"

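With this configuration, Flagger runs a canary release through the following steps:

detect the new revision
scale the canary up from zero
wait for the HPA to set the canary minimum replicas
check the health of the canary Pods
run the acceptance tests
abort the canary release if the tests fail
start the load tests
mirror 100% of the traffic from the primary to the canary
check the request success rate and request duration every minute
abort the canary release if the failed-checks threshold is reached
stop traffic mirroring once the number of iterations is reached
route live traffic to the canary Pods
promote the canary
wait for the primary rollout to finish
wait for the HPA to set the primary minimum replicas
check the health of the primary Pods
switch live traffic back to the primary
scale the canary to zero

The process can be extended with custom metric checks, webhooks, manual promotion approval, and Slack or MS Teams notifications.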

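Istio A/B testing

Next we look at how to run A/B tests with Istio and Flagger. Besides weighted routing, Flagger can be configured to route traffic to the canary based on HTTP match conditions. In an A/B testing scenario, HTTP headers or cookies are used to target a specific segment of users, which is particularly useful for frontend applications that require session affinity.

An A/B test generally serves two purposes:

Deciding which variant is better: for example, after a change to an app's UI, data is needed to judge whether it actually performs better.
Measuring the gain: for example, after a live-streaming feature launches, how much additional DAU it brings to the platform, how much additional usage time, how much additional viewing time for non-live video, and so on.

The first purpose is the familiar one. The second, quantifying the gain and computing ROI, is often very important to data analysts and managers. A typical A/B test essentially splits the platform's traffic evenly into several groups, applies a different strategy to each group, and then compares the groups on core user metrics such as retention, average viewing time per user, and basic interaction rate, before rolling out the best-performing group.

A/B testing therefore has no fixed standard and is usually tailored to the business scenario. Users can be grouped by region, device, version, channel, behavior, and so on, with a different strategy per group, and the best group is chosen according to the metrics.

Here we run an A/B test against the same podinfo application by creating a Canary object like the following: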

# podinfo-ab.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
    gateways:
      - istio-system/public-gateway
    hosts:
      - podinfo.k8s.local
    trafficPolicy:
      tls:
        mode: DISABLE
  analysis:
    interval: 1m
    iterations: 10
    # maximum number of failed checks before rollback
    threshold: 2
    # canary match conditions
    match:
      - headers:
          user-agent:
            regex: ".*Chrome.*"
      - headers:
          cookie:
            regex: "^(.*?;)?(type=insider)(;.*)?$"
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
    webhooks: # generate traffic during the canary analysis
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 15s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 -H 'Cookie: type=insider' http://podinfo.test:9898/"

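The object above adds a match field that specifies the canary match conditions. Two conditions are given, one on user-agent and one on cookie, so the canary analysis runs for 10 minutes (10 iterations at a 1m interval) against Chrome users and users carrying the type=insider cookie.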
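The effect of the two match conditions can be checked with a small Python sketch. The patterns are copied from the Canary object; the function name is hypothetical, the list entries are treated as alternatives (a request qualifies if any entry matches), and Python's re module is used as a stand-in for Istio's regex engine with whole-value matching assumed.

```python
import re

# Match conditions copied from the Canary object; list entries are alternatives.
MATCH_CONDITIONS = [
    {"user-agent": r".*Chrome.*"},
    {"cookie": r"^(.*?;)?(type=insider)(;.*)?$"},
]


def routes_to_canary(headers):
    """Return True if any match entry is satisfied by the request headers."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for condition in MATCH_CONDITIONS:
        if all(
            name in lowered and re.fullmatch(pattern, lowered[name])
            for name, pattern in condition.items()
        ):
            return True
    return False


print(routes_to_canary({"User-Agent": "Mozilla/5.0 Chrome/120.0 Safari/537.36"}))
print(routes_to_canary({"User-Agent": "curl/8.0", "Cookie": "type=insider"}))
print(routes_to_canary({"User-Agent": "Firefox/121.0"}))
```

One detail worth noticing in this sketch: the cookie pattern as written expects type=insider to follow a semicolon directly, so a header such as "a=b; type=insider" (with a space after the semicolon) does not match, while "a=b;type=insider" does.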

Update the Canary object:

kubectl apply -f podinfo-ab.yaml

Then trigger a canary release again:

kubectl -n test set image deployment/podinfo podinfod=ghcr.io/stefanprodan/podinfo:6.0.1

The canary release process then starts:

$ kubectl describe canary podinfo -ntest

Events:
  Type    Reason  Age                  From     Message
  ----    ------  ----                 ----     -------
  Normal  Synced  39m                  flagger  Pre-rollout check acceptance-test passed
  Normal  Synced  39m                  flagger  Advance podinfo.test canary weight 10
  Normal  Synced  38m                  flagger  Advance podinfo.test canary weight 20
  Normal  Synced  37m                  flagger  Advance podinfo.test canary weight 30
  Normal  Synced  36m                  flagger  Advance podinfo.test canary weight 40
  Normal  Synced  35m                  flagger  Advance podinfo.test canary weight 50
  Normal  Synced  34m                  flagger  Copying podinfo.test template spec to podinfo-primary.test
  Normal  Synced  32m (x2 over 33m)    flagger  (combined from similar events): Promotion completed! Scaling down podinfo.test
  Normal  Synced  2m14s (x2 over 40m)  flagger  New revision detected! Scaling up podinfo.test
  Normal  Synced  74s (x2 over 39m)    flagger  Starting canary analysis for podinfo.test
  Normal  Synced  74s                  flagger  Advance podinfo.test canary iteration 1/10
  Normal  Synced  14s                  flagger  Advance podinfo.test canary iteration 2/10

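At this point, opening the podinfo application in Chrome routes us to the canary version, while any other browser is routed to the primary version.

That completes the A/B test. In real-world work the conditions are often more complex, for example grouping users by region, device, version, channel, or behavior, and they have to be configured to fit the actual business scenario.

Custom metrics

As part of the analysis process, Flagger can validate service level objectives (SLOs) such as availability, error rate percentage, average response time, and any other objective based on application-specific metrics. If a drop in performance is noticed during the SLO analysis, the release is rolled back automatically with minimal impact on end users.

Flagger ships with two built-in metric checks: HTTP request success rate and duration.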

analysis:
  metrics:
    - name: request-success-rate
      interval: 1m
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
    - name: request-duration
      interval: 1m
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500

By default, Flagger measures the request success rate and duration with Prometheus queries.

For Istio, the PromQL query behind the HTTP request success rate is:

sum(
    rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace=~"{{ namespace }}",
          destination_workload=~"{{ target }}",
          response_code!~"5.*"
        }[{{ interval }}]
    )
)
/
sum(
    rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace=~"{{ namespace }}",
          destination_workload=~"{{ target }}"
        }[{{ interval }}]
    )
)

Likewise, the PromQL query for the HTTP request duration in Istio is:

histogram_quantile(0.99,
  sum(
    irate(
      istio_request_duration_milliseconds_bucket{
        reporter="destination",
        destination_workload=~"{{ target }}",
        destination_workload_namespace=~"{{ namespace }}"
      }[{{ interval }}]
    )
  ) by (le)
)

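Both istio_requests_total and istio_request_duration_milliseconds_bucket are metrics that Istio provides out of the box; they were covered in the earlier observability chapter.

If the two built-in checks are not enough, Flagger can be extended with custom metric checks. With the MetricTemplate custom resource, Flagger can be configured to connect to a metric provider and run a query that returns a float64 value, which is then used to validate the canary against the specified threshold range.

To define a custom Prometheus metric, for example, set the provider type to prometheus and write the query in PromQL to create a custom check against the Prometheus server. A metric template could look like this: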

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!="404"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"
            }[{{ interval }}]
        )
    ) * 100
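The arithmetic of this query can be reproduced in plain Python to see what value ends up being compared with the threshold. The function names and the sample request rates are invented for illustration; the expression follows PromQL operator precedence, where the division and multiplication are evaluated before the subtraction. Treating the max bound as inclusive is an assumption of this sketch.

```python
def not_found_percentage(non_404_rate, total_rate):
    """Percentage of requests answered with 404: 100 - (non-404 / total) * 100."""
    return 100 - non_404_rate / total_rate * 100


def within_threshold(value, max_allowed=5):
    """Check the value against thresholdRange.max (inclusive bound assumed)."""
    return value <= max_allowed


# Hypothetical request rates over one analysis interval.
healthy = not_found_percentage(non_404_rate=62, total_rate=64)
degraded = not_found_percentage(non_404_rate=56, total_rate=64)
print(healthy, within_threshold(healthy))
print(degraded, within_threshold(degraded))
```

With 62 of 64 requests per second returning something other than 404, the metric is 3.125 and stays under a max of 5. With only 56 of 64 it rises to 12.5, which would count as a failed check.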

Then reference the metric template in the Canary object:

analysis:
  metrics:
    - name: "404s percentage"
      templateRef:
        name: not-found-percentage
        namespace: istio-system
      thresholdRange:
        max: 5
      interval: 1m