配置健康探测

概述

在应用程序中实施健康探测是 Kubernetes 自动执行一些任务的好方法，以便在发生错误时提高可用性。

由于 osm-edge 重新配置应用 Pod，使其通过代理 sidecar 重定向所有传入和传出的网络流量，kubelet 调用的 httpGet 和 tcpSocket 健康探测会因为代理缺少 mTLS 要求的上下文而失败。

osm-edge 增加了配置，通过代理暴露探测端点，并重写新 Pod 的探测定义，以引用代理暴露的端点，使得 httpGet 健康探测可以在服务网格中工作。原始探针的所有功能仍可用，osm-edge 只是将其与代理前置，以便 kubelet 能够与之通信。

需要特殊的配置来支持服务网中的tcpSocket健康探针。由于 osm-edge 通过 Pipy 重定向所有网络流量，所有的端口在 Pod 中都是开放的。这导致所有的 TCP 连接被路由到注入了 Pipy sidecar 的 Pod，看起来是成功的。为了使 tcpSocket 健康探测在网格中正常工作，osm-edge 将探测改写为 httpGet 探测，并添加了一个 iptables 命令，以绕过 osm-healthcheck 暴露端点的 Pipy 代理。osm-healthcheck 容器被添加到 Pod 中，处理来自 kubelet 的 HTTP 健康探测请求。处理程序从请求的 Original-Tcp-port 头中获取原始 TCP 端口，并尝试在指定端口上打开一个 socket。httpGet 探针的响应状态代码反映 TCP 连接是否成功。

Probe	Path	Port
Liveness	/osm-liveness-probe	15901
Readiness	/osm-readiness-probe	15902
Startup	/osm-startup-probe	15903
Healthcheck	/osm-healthcheck	15904

对于HTTP和tcpSocket探测，端口和路径会被修改。对于HTTPS探针，端口被修改，但路径保持不变。

只有预定义的httpGet和tcpSocket探针会被修改。如果一个探针未被定义，则不会在其位置上添加。只要exec探针（包括使用grpc_health_probe的探针）的命令不访问localhost以外的网络，则不会被修改，并正常工作。

例子

下面的例子显示 osm-edge 如何处理网格中的 Pod 的健康探测。

HTTP

假设Pod中的一个容器定义了如下livenessProbe探测：

livenessProbe:
  httpGet:
    path: /liveness
    port: 14001
    scheme: HTTP

当 Pod 被创建时，osm-edge 将修改探针为以下内容：

livenessProbe:
  httpGet:
    path: /osm-liveness-probe
    port: 15901
    scheme: HTTP

该 Pod 的代理将包含以下 Pipy 配置。

一个 Pipy 集群，它映射到原始探针端口 14001：

{
  "Probes": {
      "ReadinessProbes": null,
      "LivenessProbes": [
        {
          "httpGet": {
            "path": "/osm-liveness-probe",
            "port": 15901,
            "scheme": "HTTP"
          },
          "timeoutSeconds": 1,
          "periodSeconds": 10,
          "successThreshold": 1,
          "failureThreshold": 3
        }
      ],
      "StartupProbes": null
    }
  }
}

为新的代理暴露的 HTTP 端点 /osm-liveness-probe 建立监听器，端口为 15901，映射到上述集群：

.listen(probeScheme ? 15901 : 0)
.link(
  'http_liveness', () => probeScheme === 'HTTP',
  'connection_liveness', () => Boolean(probeTarget),
  'deny_liveness'
)

`tcpSocket`

假设Pod中的一个容器定义了如下livenessProbe探测：

livenessProbe:
  tcpSocket:
    port: 14001

当 Pod 被创建时，osm-edge 将修改探针为以下内容：

livenessProbe:
  httpGet:
    httpHeaders:
    - name: Original-Tcp-Port
      value: "14001"
    path: /osm-healthcheck
    port: 15903
    scheme: HTTP

访问 15904 端口的请求绕过了 Pipy 代理，被引向 osm-healthcheck 端点。

如何在网格中验证 POD 的健康状态

Kubernetes将自动轮询配置了启动（startup）、存活（liveness）和就绪（readiness）探测器的Pod的健康检查端点。

当启动探测失败时，Kubernetes将生成一个事件（通过kubectl describe pod <pod name>可见）并重新启动Pod。kubectl describe的输出如下：

...
Events:
  Type     Reason     Age              From               Message
  ----     ------     ----             ----               -------
  Normal   Scheduled  17s              default-scheduler  Successfully assigned bookstore/bookstore-v1-699c79b9dc-5g8zn to osm-control-plane
  Normal   Pulled     16s              kubelet            Successfully pulled image "flomesh/init:v1.1.0" in 26.5835ms
  Normal   Created    16s              kubelet            Created container osm-init
  Normal   Started    16s              kubelet            Started container osm-init
  Normal   Pulling    16s              kubelet            Pulling image "flomesh/init:v1.1.0"
  Normal   Pulling    15s              kubelet            Pulling image "flomesh/pipy:0.5.0"
  Normal   Pulling    15s              kubelet            Pulling image "flomesh/bookstore:v1.1.0"
  Normal   Pulled     15s              kubelet            Successfully pulled image "flomesh/bookstore:v1.1.0" in 319.9863ms
  Normal   Started    15s              kubelet            Started container bookstore-v1
  Normal   Created    15s              kubelet            Created container bookstore-v1
  Normal   Pulled     14s              kubelet            Successfully pulled image "flomesh/pipy:0.5.0" in 755.2666ms
  Normal   Created    14s              kubelet            Created container sidecar
  Normal   Started    14s              kubelet            Started container sidecar
  Warning  Unhealthy  13s              kubelet            Startup probe failed: Get "http://10.244.0.23:15903/osm-startup-probe": dial tcp 10.244.0.23:15903: connect: connection refused
  Warning  Unhealthy  3s (x2 over 8s)  kubelet            Startup probe failed: HTTP probe failed with statuscode: 503

当存活探测失败时，Kubernetes将生成一个事件（通过kubectl describe pod <pod name>可见）并重新启动Pod。kubectl describe的输出如下：

...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  59s                default-scheduler  Successfully assigned bookstore/bookstore-v1-746977967c-jqjt4 to osm-control-plane
  Normal   Pulling    58s                kubelet            Pulling image "flomesh/init:v1.1.0"
  Normal   Created    58s                kubelet            Created container osm-init
  Normal   Started    58s                kubelet            Started container osm-init
  Normal   Pulled     58s                kubelet            Successfully pulled image "flomesh/init:v1.1.0" in 23.415ms
  Normal   Pulled     57s                kubelet            Successfully pulled image "flomesh/pipy:0.5.0" in 678.1391ms
  Normal   Pulled     57s                kubelet            Successfully pulled image "flomesh/bookstore:v1.1.0" in 230.3681ms
  Normal   Created    57s                kubelet            Created container sidecar
  Normal   Pulling    57s                kubelet            Pulling image "flomesh/pipy:0.5.0"
  Normal   Started    56s                kubelet            Started container sidecar
  Normal   Pulled     44s                kubelet            Successfully pulled image "flomesh/bookstore:v1.1.0" in 20.6731ms
  Normal   Created    44s (x2 over 57s)  kubelet            Created container bookstore-v1
  Normal   Started    43s (x2 over 57s)  kubelet            Started container bookstore-v1
  Normal   Pulling    32s (x3 over 58s)  kubelet            Pulling image "flomesh/bookstore:v1.1.0"
  Warning  Unhealthy  32s (x6 over 50s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    32s (x2 over 44s)  kubelet            Container bookstore-v1 failed liveness probe, will be restarted

当就绪探测失败时，Kubernetes将生成一个事件（通过kubectl describe pod <pod name>可以看到），并确保服务流量不会路由到这些不健康的POD上。一个准备就绪探测失败的Pod的kubectl describe输出如下：

...
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  32s               default-scheduler  Successfully assigned bookstore/bookstore-v1-5848999cb6-hp6qg to osm-control-plane
  Normal   Pulling    31s               kubelet            Pulling image "flomesh/init:v1.1.0"
  Normal   Pulled     31s               kubelet            Successfully pulled image "flomesh/init:v1.1.0" in 19.8726ms
  Normal   Created    31s               kubelet            Created container osm-init
  Normal   Started    31s               kubelet            Started container osm-init
  Normal   Created    30s               kubelet            Created container bookstore-v1
  Normal   Pulled     30s               kubelet            Successfully pulled image "flomesh/bookstore:v1.1.0" in 314.3628ms
  Normal   Pulling    30s               kubelet            Pulling image "flomesh/bookstore:v1.1.0"
  Normal   Started    30s               kubelet            Started container bookstore-v1
  Normal   Pulling    30s               kubelet            Pulling image "flomesh/pipy:0.5.0"
  Normal   Pulled     29s               kubelet            Successfully pulled image "flomesh/pipy:0.5.0" in 739.3931ms
  Normal   Created    29s               kubelet            Created container sidecar
  Normal   Started    29s               kubelet            Started container sidecar
  Warning  Unhealthy  0s (x3 over 20s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

Pod的 status 也可看出其尚未可用，在 kubectl get pod 输出中可以看到。例如：

NAME                            READY   STATUS    RESTARTS   AGE
bookstore-v1-5848999cb6-hp6qg   1/2     Running   0          85s

Pod的健康探测也可以通过转发Pod的必要端口并使用curl或任何其他HTTP客户端发出请求手动调用。例如，为了验证bookstore-v1 demo Pod的有效性探测，获取Pod的名称并转发15901端口：

kubectl port-forward -n bookstore deployment/bookstore-v1 15901

然后，在一个单独的终端里，可以使用curl来检查端点。下面是一个健康的bookstore-v1的例子：

$ curl -i localhost:15901/osm-liveness-probe
HTTP/1.1 200 OK
date: Wed, 31 Mar 2021 16:00:01 GMT
content-length: 1396
content-type: text/html; charset=utf-8

<!doctype html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en">
  ...
</html>

已知问题

#3773

排错

如果有健康探测持续失败，请执行以下步骤以确定根本原因：

验证网格中的 Pod 上的 httpGet 和 tcpSocket 探针是否被修改。

启动、存活和就绪的 httpGet 探针必须被 osm-edge 修改。端口必须被修改为 15901、15902 和 15903，分别适用于存活、就绪和启动 httpGet 探针。只有 HTTP（不包括 HTTPS）探针的路径将被修改，此外还有 /osm-liveness-probe、/osm-readiness-probe 或 /osm-starttup-probe。

同时，验证 Pod 的 Pipy 配置中是否包含修改后的端点的监听。

为了让 tcpSocket 探针在网格中生效，必须将其改写为 httpGet 探针。端口必须被修改为 15904，以用于存活、就绪和启动探测。路径必须设置为 /osm-healthcheck。HTTP 头 Original-TCP-Port，必须设置为 tcpSocket 探针定义中指定的原始端口。另外，验证 osm-healthcheck 容器是否正在运行。检查 osm-healthcheck 日志以获得更多信息。

更多细节见上面的例子。
确定Kubernetes在调度或启动Pod时是否遇到了任何其他错误。

使用kubectl describe命令查找关于不健康的Pod近期的错误。解决这些错误并再次验证POD的健康状态。
确定Pod是否遇到了一个运行时错误。

使用kubectl logs检查容器日志，寻找容器启动后发生的错误。解决这些错误并再次验证Pod的健康状况。

反馈

该内容是否有帮助？

Glad to hear it! Please tell us how we can improve.

Sorry to hear that. Please tell us how we can improve.