Trying Cilium on EKS, Part 2

When I tried Cilium on EKS before, it did not work properly as-is, so about half a year later I tried again with the latest version of each component. These are my notes.

Component          Version   Notes
eksctl             0.54.0
Kubernetes         1.20
Platform version   eks.1
VPC CNI Plugin     1.7.10
Cilium             1.10.1

The latest VPC CNI Plugin release is 1.8.0, but the latest 1.7 patch version appears to be the recommended one, so I use that here.
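
The available and default addon versions can be checked with the AWS CLI; this is just a convenient way to confirm the recommended version (the JMESPath query is my own, and the default may differ by region and cluster version):

aws eks describe-addon-versions \
  --addon-name vpc-cni \
  --kubernetes-version 1.20 \
  --query 'addons[].addonVersions[].{Version: addonVersion, Default: compatibilities[0].defaultVersion}' \
  --output table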

Creating the cluster

Create the cluster.

cat <<EOF > cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: cilium
  region: ap-northeast-1
  version: "1.20"
vpc:
  cidr: "10.0.0.0/16"

availabilityZones:
  - ap-northeast-1a
  - ap-northeast-1c

managedNodeGroups:
  - name: managed-ng-1
    minSize: 2
    maxSize: 2
    desiredCapacity: 2
    ssh:
      allow: true
      publicKeyName: default
      # enableSsm: true

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

iam:
  withOIDC: true
EOF
eksctl create cluster -f cluster.yaml
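
Once the creation finishes, a quick sanity check that both nodes have joined and are Ready (output omitted):

kubectl get nodes -o wide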

Updating the VPC CNI Plugin

Check the current version. The Cilium documentation also says to use 1.7.9 or later.

$ k get ds -n kube-system -o wide
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS   IMAGES                                                                                SELECTOR
aws-node     2         2         2       2            2           <none>          23m   aws-node     602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1    k8s-app=aws-node
kube-proxy   2         2         2       2            2           <none>          23m   kube-proxy   602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/eks/kube-proxy:v1.20.4-eksbuild.2   k8s-app=kube-proxy

eksctl automatically sets up aws-node to run with IRSA, so check the ARN of that role.

$ k get sa -n kube-system aws-node -o yaml | grep role-arn
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXXXX:role/eksctl-cilium-addon-iamserviceaccount-kube-s-Role1-PUQJWEEGQJXC
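
The same information can also be listed with eksctl (assuming the cluster name cilium used above):

eksctl get iamserviceaccount --cluster cilium --namespace kube-system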

Turn it into an EKS addon and upgrade the version.

eksctl create addon --cluster cilium \
  --name vpc-cni --version 1.7.10 \
  --service-account-role-arn=arn:aws:iam::XXXXXXXXXXXX:role/eksctl-cilium-addon-iamserviceaccount-kube-s-Role1-PUQJWEEGQJXC \
  --force
$ k get ds -n kube-system -o wide
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE   CONTAINERS   IMAGES                                                                                SELECTOR
aws-node     2         2         2       2            2           <none>          31m   aws-node     602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/amazon-k8s-cni:v1.7.10-eksbuild.1   k8s-app=aws-node
kube-proxy   2         2         2       2            2           <none>          31m   kube-proxy   602401143452.dkr.ecr.ap-northeast-1.amazonaws.com/eks/kube-proxy:v1.20.4-eksbuild.2   k8s-app=kube-proxy
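
The addon itself can also be checked with eksctl; it should eventually show an ACTIVE status:

eksctl get addon --cluster cilium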

Installing Cilium

Add the Helm repository.

helm repo add cilium https://helm.cilium.io/
helm repo update
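
To confirm that the chart version we want is actually in the repository:

helm search repo cilium/cilium --versions | head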

Install Cilium with Helm, in CNI chaining mode on top of the VPC CNI Plugin.

helm install cilium cilium/cilium --version 1.10.1 \
  --namespace kube-system \
  --set cni.chainingMode=aws-cni \
  --set enableIPv4Masquerade=false \
  --set tunnel=disabled \
  --set nodeinit.enabled=true \
  --set endpointRoutes.enabled=true
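
If you prefer a values file over inline --set flags, the equivalent settings look like this (a sketch; the file name values.yaml is arbitrary):

cat <<EOF > values.yaml
cni:
  chainingMode: aws-cni
enableIPv4Masquerade: false
tunnel: disabled
nodeinit:
  enabled: true
endpointRoutes:
  enabled: true
EOF
helm upgrade --install cilium cilium/cilium --version 1.10.1 \
  --namespace kube-system -f values.yaml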

Confirm that Cilium has been installed.

$ kubectl get po -A
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE
kube-system   aws-node-v8zjq                     1/1     Running   0          4m11s
kube-system   aws-node-zsc4s                     1/1     Running   0          3m37s
kube-system   cilium-57vtb                       1/1     Running   0          39s
kube-system   cilium-dfr7x                       1/1     Running   0          39s
kube-system   cilium-node-init-5cxj2             1/1     Running   0          39s
kube-system   cilium-node-init-rnt69             1/1     Running   0          39s
kube-system   cilium-operator-689d85cb47-bmtjd   1/1     Running   0          39s
kube-system   cilium-operator-689d85cb47-jk4tm   1/1     Running   0          39s
kube-system   coredns-54bc78bc49-bmqkk           1/1     Running   0          13s
kube-system   coredns-54bc78bc49-kphgv           1/1     Running   0          28s
kube-system   kube-proxy-n6sdq                   1/1     Running   0          20m
kube-system   kube-proxy-rcp65                   1/1     Running   0          20m
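
Each agent's health can also be checked from inside the cilium DaemonSet Pods (kubectl picks one Pod of the DaemonSet for the exec):

kubectl -n kube-system exec ds/cilium -- cilium status --brief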

After installing Cilium, the problem where CoreDNS, which is automatically restarted so that Cilium can apply policies to it, fails to come back up did not occur, so things look fine.

The related issue has also been closed.

As described in the installation steps, check for Pods that need to be restarted (Pods that were started before Cilium and are therefore not yet managed by it).

for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
     # CiliumEndpoints that already exist in this namespace
     ceps=$(kubectl -n "${ns}" get cep \
         -o jsonpath='{.items[*].metadata.name}')
     # Pods that are not running in the host network namespace
     pods=$(kubectl -n "${ns}" get pod \
         -o custom-columns=NAME:.metadata.name,NETWORK:.spec.hostNetwork \
         | grep -E '\s(<none>|false)' | awk '{print $1}' | tr '\n' ' ')
     # Pods without a matching CiliumEndpoint, i.e. not yet managed by Cilium
     ncep=$(echo "${pods} ${ceps}" | tr ' ' '\n' | sort | uniq -u | paste -s -d ' ' -)
     for pod in $(echo $ncep); do
       echo "${ns}/${pod}";
     done
done

No Pods needed to be restarted.
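
If any Pods had been listed, restarting them would bring them under Cilium's management; for Deployment-managed Pods that would look something like this (namespace and name are placeholders):

kubectl -n <namespace> rollout restart deployment <name>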

Testing

A connectivity test can now be run from a CLI, so let's test with the CLI.

Install the CLI.

curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-darwin-amd64.tar.gz{,.sha256sum}
shasum -a 256 -c cilium-darwin-amd64.tar.gz.sha256sum
tar xzvfC cilium-darwin-amd64.tar.gz ${HOME}/bin
rm cilium-darwin-amd64.tar.gz{,.sha256sum}
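
The commands above are for macOS; on Linux the same procedure should work with the linux-amd64 artifact:

curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-amd64.tar.gz.sha256sum
tar xzvfC cilium-linux-amd64.tar.gz ${HOME}/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}
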
$ cilium version
cilium-cli: v0.8.2 compiled with go1.16.5 on darwin/amd64
$ cilium status --wait
    /¯¯\
 /¯¯\__/¯¯\    Cilium:         OK
 \__/¯¯\__/    Operator:       OK
 /¯¯\__/¯¯\    Hubble:         disabled
 \__/¯¯\__/    ClusterMesh:    disabled
    \__/

DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Containers:       cilium             Running: 2
                  cilium-operator    Running: 2
Image versions    cilium             quay.io/cilium/cilium:v1.10.1@sha256:f5fcdfd4929af5a8903b02da61332eea41dcdb512420b8c807e2e2904270561c: 2
                  cilium-operator    quay.io/cilium/operator-generic:v1.10.1@sha256:a1588ee00a15f2f2b419e4acd36bd57d64a5f10eb52d0fd4de689e558a913cd8: 2

Run the connectivity test.

$ cilium connectivity test
ℹ️  Monitor aggregation detected, will skip some flow validation steps
✨ [cilium.ap-northeast-1.eksctl.io] Creating namespace for connectivity check...
✨ [cilium.ap-northeast-1.eksctl.io] Deploying echo-same-node service...
✨ [cilium.ap-northeast-1.eksctl.io] Deploying same-node deployment...
✨ [cilium.ap-northeast-1.eksctl.io] Deploying client deployment...
✨ [cilium.ap-northeast-1.eksctl.io] Deploying client2 deployment...
✨ [cilium.ap-northeast-1.eksctl.io] Deploying echo-other-node service...
✨ [cilium.ap-northeast-1.eksctl.io] Deploying other-node deployment...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for deployments [echo-other-node] to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for CiliumEndpoint for pod cilium-test/client-7b7bf54b85-75nmw to appear...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for CiliumEndpoint for pod cilium-test/client2-666976c95b-n29pg to appear...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-697d5d69b7-qxfm5 to appear...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-7967996674-qvm6t to appear...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for Service cilium-test/echo-other-node to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for NodePort 10.0.1.129:30561 (cilium-test/echo-other-node) to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for NodePort 10.0.1.129:31548 (cilium-test/echo-same-node) to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for NodePort 10.0.58.170:30561 (cilium-test/echo-other-node) to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for NodePort 10.0.58.170:31548 (cilium-test/echo-same-node) to become ready...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for Cilium pod kube-system/cilium-57vtb to have all the pod IPs in eBPF ipcache...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for Cilium pod kube-system/cilium-dfr7x to have all the pod IPs in eBPF ipcache...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for pod cilium-test/client-7b7bf54b85-75nmw to reach kube-dns service...
⌛ [cilium.ap-northeast-1.eksctl.io] Waiting for pod cilium-test/client2-666976c95b-n29pg to reach kube-dns service...
🔭 Enabling Hubble telescope...
⚠️  Unable to contact Hubble Relay, disabling Hubble telescope and flow validation: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp [::1]:4245: connect: connection refused"
ℹ️  Expose Relay locally with: kubectl port-forward -n kube-system deployment/hubble-relay 4245:4245
🏃 Running tests...

[=] Test [no-policies]
.............................
[=] Test [client-ingress]
..
[=] Test [echo-ingress]
....
[=] Test [to-fqdns]
..
  ℹ️  📜 Applying CiliumNetworkPolicy 'client-egress-to-fqdns-google' to namespace 'cilium-test'..
  [-] Scenario [to-fqdns/pod-to-world]
  [.] Action [to-fqdns/pod-to-world/https-to-google: cilium-test/client2-666976c95b-n29pg (10.0.39.210) -> google-https (google.com:443)]
  [.] Action [to-fqdns/pod-to-world/http-to-google: cilium-test/client-7b7bf54b85-75nmw (10.0.31.143) -> google-http (google.com:80)]
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://google.com:80" failed: command terminated with exit code 22
  [.] Action [to-fqdns/pod-to-world/http-to-www-google: cilium-test/client-7b7bf54b85-75nmw (10.0.31.143) -> www-google-http (www.google.com:80)]
  ℹ️  📜 Deleting CiliumNetworkPolicy 'client-egress-to-fqdns-google' from namespace 'cilium-test'..

[=] Test [to-entities-world]
...
[=] Test [allow-all]
.........................
[=] Test [dns-only]
.......
[=] Test [client-egress]
....
[=] Test [to-cidr-1111]
....
📋 Test Report
❌ 1/9 tests failed (1/81 actions), 0 warnings, 0 tests skipped, 0 scenarios skipped:
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-google: cilium-test/client-7b7bf54b85-75nmw (10.0.31.143) -> google-http (google.com:80)

Error: Connectivity test failed: 1 tests failed

The connectivity check to Google is failing.
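
The failing test can be re-run on its own to see whether it is reproducible (the --test flag filters tests by name):

cilium connectivity test --test to-fqdns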

Without any policy applied, there is no problem.

$ k run pod1 --image=nginx
pod/pod1 created
$ k get po
NAME   READY   STATUS    RESTARTS   AGE
pod1   1/1     Running   0          10s
$ k exec -it pod1 -- bash
root@pod1:/# curl http://google.com/
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
root@pod1:/# exit
exit
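
To get a bit closer to what the failing test does, a toFQDNs egress policy can be applied by hand and the curl repeated. The following is a minimal sketch, not the exact policy the test uses; the policy name and the run: pod1 selector are just for this example:

cat <<EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: pod1-egress-to-google
spec:
  endpointSelector:
    matchLabels:
      run: pod1
  egress:
    # Allow DNS to kube-dns and inspect lookups so toFQDNs can map names to IPs
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Allow HTTP/HTTPS to the resolved Google addresses
    - toFQDNs:
        - matchName: "google.com"
        - matchName: "www.google.com"
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
            - port: "443"
              protocol: TCP
EOF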

The documentation's caveats say that some advanced features are limited in this setup, so this may be one of those limitations.