Container Insights の EKS Fargate サポートを試す

AWS Distro for OpenTelemetry を使用した Container Insights の EKS Fargate サポートを試すメモ。

https://aws.amazon.com/jp/blogs/containers/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/

原理的には、API Server をプロキシとして利用して cAdvisor のメトリクスを取得するもので、以前から Container Insights の Prometheus サポートを使ってできたことと同じに思える。

以前試したのがこちら。

クラスターの作成

1.21 のクラスターをノードなしで作成する。

CLUSTER_NAME="container-insights-fargate"
cat << EOF > cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ${CLUSTER_NAME}
  region: ap-northeast-1
  version: "1.21"
vpc:
  cidr: "10.0.0.0/16"

availabilityZones:
  - ap-northeast-1a
  - ap-northeast-1c

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

iam:
  withOIDC: true
EOF

eksctl create cluster -f cluster.yaml

ノードを作成する。

cat << EOF > managed-ng-1.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ${CLUSTER_NAME}
  region: ap-northeast-1

managedNodeGroups:
  - name: managed-ng-1
    minSize: 2
    maxSize: 10
    desiredCapacity: 2
    privateNetworking: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
EOF

eksctl create nodegroup -f managed-ng-1.yaml

Admin ロールにも権限をつけておく。

USER_NAME="Admin:{{SessionName}}"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/Admin"
eksctl create iamidentitymapping --cluster ${CLUSTER_NAME} --arn ${ROLE_ARN} --username ${USER_NAME} --group system:masters

Fargate プロファイルの作成

Fargate プロファイルを 2 つ作成する。

eksctl create fargateprofile \
    --cluster  ${CLUSTER_NAME} \
    --name fargate-container-insights \
    --namespace fargate-container-insights
eksctl create fargateprofile \
    --cluster  ${CLUSTER_NAME} \
    --name applications \
    --namespace golang

ADOT コレクターのデプロイ

ブログ記事のヘルパースクリプトを実行して ServiceAccount と IAM ロールを作成する。

#!/bin/bash
CLUSTER_NAME="container-insights-fargate"
REGION="ap-northeast-1"
SERVICE_ACCOUNT_NAMESPACE=fargate-container-insights
SERVICE_ACCOUNT_NAME=adot-collector
SERVICE_ACCOUNT_IAM_ROLE=EKS-Fargate-ADOT-ServiceAccount-Role
SERVICE_ACCOUNT_IAM_POLICY=arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

eksctl utils associate-iam-oidc-provider \
--cluster=$CLUSTER_NAME \
--approve

eksctl create iamserviceaccount \
--cluster=$CLUSTER_NAME \
--region=$REGION \
--name=$SERVICE_ACCOUNT_NAME \
--namespace=$SERVICE_ACCOUNT_NAMESPACE \
--role-name=$SERVICE_ACCOUNT_IAM_ROLE \
--attach-policy-arn=$SERVICE_ACCOUNT_IAM_POLICY \
--approve

ADOT コレクターをデプロイする。この手順のガイドはここでマニフェストはここにある。

cat << EOF > adot-collector.yaml
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
      - pods/proxy
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: [ "/metrics/cadvisor"]
    verbs: ["get", "list", "watch"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role-binding
subjects:
  - kind: ServiceAccount
    name: adot-collector
    namespace: fargate-container-insights
roleRef:
  kind: ClusterRole
  name: adotcol-admin-role
  apiGroup: rbac.authorization.k8s.io

# collector configuration section
# update `ClusterName=YOUR-EKS-CLUSTER-NAME` in the env variable OTEL_RESOURCE_ATTRIBUTES
# update `region=YOUR-AWS-REGION` in the emfexporter with the name of the AWS Region where you want to collect Container Insights metrics.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-config
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector-config
data:
  adot-collector-config: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 1m
            scrape_timeout: 40s

          scrape_configs:
          - job_name: 'kubelets-cadvisor-metrics'
            sample_limit: 10000
            scheme: https

            kubernetes_sd_configs:
            - role: node
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

            relabel_configs:
              - action: labelmap
                regex: __meta_kubernetes_node_label_(.+)
                # Only for Kubernetes ^1.7.3.
                # See: https://github.com/prometheus/prometheus/issues/2916
              - target_label: __address__
                # Changes the address to Kube API server's default address and port
                replacement: kubernetes.default.svc:443
              - source_labels: [__meta_kubernetes_node_name]
                regex: (.+)
                target_label: __metrics_path__
                # Changes the default metrics path to kubelet's proxy cadvdisor metrics endpoint
                replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
            metric_relabel_configs:
              # extract readable container/pod name from id field
              - action: replace
                source_labels: [id]
                regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
                target_label: rkt_container_name
                replacement: '$${2}-$${1}'
              - action: replace
                source_labels: [id]
                regex: '^/system\.slice/(.+)\.service$'
                target_label: systemd_service_name
                replacement: '$${1}'
    processors:
      # rename labels which apply to all metrics and are used in metricstransform/rename processor
      metricstransform/label_1:
        transforms:
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: name
                new_label: container_id
              - action: update_label
                label: kubernetes_io_hostname
                new_label: NodeName
              - action: update_label
                label: eks_amazonaws_com_compute_type
                new_label: LaunchType

      # rename container and pod metrics which we care about.
      # container metrics are renamed to `new_container_*` to differentiate them with unused container metrics
      metricstransform/rename:
        transforms:
          - include: container_spec_cpu_quota
            new_name: new_container_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: new_container_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: new_container_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: new_container_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: new_container_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: new_container_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: new_container_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: new_container_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: new_container_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: new_container_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: new_container_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_fs_limit_bytes
            new_name: new_container_filesystem_capacity
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_fs_usage_bytes
            new_name: new_container_filesystem_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          # POD LEVEL METRICS
          - include: container_spec_cpu_quota
            new_name: pod_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: pod_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: pod_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: pod_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: pod_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: pod_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: pod_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: pod_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: pod_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: pod_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: pod_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_network_receive_bytes_total
            new_name: pod_network_rx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_dropped_total
            new_name: pod_network_rx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_errors_total
            new_name: pod_network_rx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_total
            new_name: pod_network_rx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_bytes_total
            new_name: pod_network_tx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_dropped_total
            new_name: pod_network_tx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_errors_total
            new_name: pod_network_tx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_total
            new_name: pod_network_tx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}

      # filter out only renamed metrics which we care about
      filter:
        metrics:
          include:
            match_type: regexp
            metric_names:
              - new_container_.*
              - pod_.*

      # convert cumulative sum datapoints to delta
      cumulativetodelta:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault

      # convert delta to rate
      deltatorate:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault

      experimental_metricsgeneration/1:
        rules:
          - name: pod_network_total_bytes
            unit: Bytes/Second
            type: calculate
            metric1: pod_network_rx_bytes
            metric2: pod_network_tx_bytes
            operation: add
          - name: pod_memory_utilization_over_pod_limit
            unit: Percent
            type: calculate
            metric1: pod_memory_working_set
            metric2: pod_memory_limit
            operation: percent
          - name: pod_cpu_usage_total
            unit: Millicore
            type: scale
            metric1: pod_cpu_usage_seconds_total
            operation: multiply
            # core to millicore: multiply by 1000
            # millicore seconds to millicore nanoseconds: multiply by 10^9
            scale_by: 1000
          - name: pod_cpu_limit
            unit: Millicore
            type: scale
            metric1: pod_cpu_limit_raw
            operation: divide
            scale_by: 100

      experimental_metricsgeneration/2:
        rules:
          - name: pod_cpu_utilization_over_pod_limit
            type: calculate
            unit: Percent
            metric1: pod_cpu_usage_total
            metric2: pod_cpu_limit
            operation: percent

      # add `Type` and rename metrics and labels
      metricstransform/label_2:
        transforms:
          - include: pod_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: "Pod"
          - include: new_container_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: Container
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: namespace
                new_label: Namespace
              - action: update_label
                label: pod
                new_label: PodName
          - include: ^new_container_(.*)$$
            match_type: regexp
            action: update
            new_name: container_$$1

      # add cluster name from env variable and EKS metadata
      resourcedetection:
        detectors: [env, eks]

      batch:
        timeout: 60s

    # only pod level metrics in metrics format, details in https://aws-otel.github.io/docs/getting-started/container-insights/eks-fargate
    exporters:
      awsemf:
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{PodName}'
        namespace: 'ContainerInsights'
        region: ap-northeast-1
        resource_to_telemetry_conversion:
          enabled: true
        eks_fargate_container_insights_enabled: true
        parse_json_encoded_attr_values: ["kubernetes"]
        dimension_rollup_option: NoDimensionRollup
        metric_declarations:
          - dimensions: [ [ClusterName, LaunchType], [ClusterName, Namespace, LaunchType], [ClusterName, Namespace, PodName, LaunchType]]
            metric_name_selectors:
              - pod_cpu_utilization_over_pod_limit
              - pod_cpu_usage_total
              - pod_cpu_limit
              - pod_memory_utilization_over_pod_limit
              - pod_memory_working_set
              - pod_memory_limit
              - pod_network_rx_bytes
              - pod_network_tx_bytes

    extensions:
      health_check:

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [metricstransform/label_1, resourcedetection, metricstransform/rename, filter, cumulativetodelta, deltatorate, experimental_metricsgeneration/1, experimental_metricsgeneration/2, metricstransform/label_2, batch]
          exporters: [awsemf]
      extensions: [health_check]

# configure the service and the collector as a StatefulSet
---
apiVersion: v1
kind: Service
metadata:
  name: adot-collector-service
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  ports:
    - name: metrics # default endpoint for querying metrics.
      port: 8888
  selector:
    component: adot-collector
  type: ClusterIP

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: adot-collector
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  selector:
    matchLabels:
      app: aws-adot
      component: adot-collector
  serviceName: adot-collector-service
  template:
    metadata:
      labels:
        app: aws-adot
        component: adot-collector
    spec:
      serviceAccountName: adot-collector
      securityContext:
        fsGroup: 65534
      containers:
        - image: amazon/aws-otel-collector:v0.15.1
          name: adot-collector
          imagePullPolicy: Always
          command:
            - "/awscollector"
            - "--config=/conf/adot-collector-config.yaml"
          env:
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "ClusterName=container-insights-fargate"
          resources:
            limits:
              cpu: 2
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi
          volumeMounts:
            - name: adot-collector-config-volume
              mountPath: /conf
      volumes:
        - configMap:
            name: adot-collector-config
            items:
              - key: adot-collector-config
                path: adot-collector-config.yaml
          name: adot-collector-config-volume
---
EOF

$ k apply -f adot-collector.yaml
clusterrole.rbac.authorization.k8s.io/adotcol-admin-role created
clusterrolebinding.rbac.authorization.k8s.io/adotcol-admin-role-binding created
configmap/adot-collector-config created
service/adot-collector-service created
statefulset.apps/adot-collector created

サンプルアプリのデプロイ

サンプルアプリをデプロイする。

cat << EOF > samle-deploy.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: golang
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
      role: webapp-service
  template:
    metadata:
      labels:
        app: webapp
        role: webapp-service             
    spec: 
      containers:          
        - name: go  
          image: public.ecr.aws/awsvijisarathy/prometheus-webapp:latest
          imagePullPolicy: Always   
          resources:
            requests:
              cpu: "256m"
              memory: "512Mi"
            limits:
              cpu: "512m"
              memory: "1024Mi"            
EOF

$ k create ns golang
namespace/golang created
$ k apply -f samle-deploy.yaml
deployment.apps/webapp created

Pod とノードを確認する。

$ k get po -A
NAMESPACE                    NAME                       READY   STATUS    RESTARTS   AGE
fargate-container-insights   adot-collector-0           1/1     Running   0          27h
golang                       webapp-8b785b667-hcrts     1/1     Running   0          27h
golang                       webapp-8b785b667-v6gfk     1/1     Running   0          27h
kube-system                  aws-node-vxwcs             1/1     Running   0          31h
kube-system                  aws-node-xmcfb             1/1     Running   0          31h
kube-system                  coredns-76f4967988-qb6gd   1/1     Running   0          31h
kube-system                  coredns-76f4967988-sdlnc   1/1     Running   0          31h
kube-system                  kube-proxy-g7m2g           1/1     Running   0          31h
kube-system                  kube-proxy-tf7dq           1/1     Running   0          31h
$ k get no
NAME                                                      STATUS   ROLES    AGE   VERSION
fargate-ip-10-0-103-250.ap-northeast-1.compute.internal   Ready    <none>   27h   v1.21.2-eks-06eac09
fargate-ip-10-0-80-22.ap-northeast-1.compute.internal     Ready    <none>   27h   v1.21.2-eks-06eac09
fargate-ip-10-0-92-178.ap-northeast-1.compute.internal    Ready    <none>   27h   v1.21.2-eks-06eac09
ip-10-0-118-159.ap-northeast-1.compute.internal           Ready    <none>   31h   v1.21.5-eks-9017834
ip-10-0-93-77.ap-northeast-1.compute.internal             Ready    <none>   31h   v1.21.5-eks-9017834

メトリクスの確認

パフォーマンスログを確認する。

f:id:sotoiwa:20220301180331p:plain

ログイベントでは以下のように EMF のログが確認できる。

{
    "ClusterName": "container-insights-fargate",
    "LaunchType": "fargate",
    "Namespace": "golang",
    "NodeName": "ip-10-0-103-250.ap-northeast-1.compute.internal",
    "PodName": "webapp-8b785b667-hcrts",
    "Type": "Pod",
    "_aws": {
        "CloudWatchMetrics": [
            {
                "Namespace": "ContainerInsights",
                "Dimensions": [
                    [
                        "ClusterName",
                        "LaunchType"
                    ],
                    [
                        "ClusterName",
                        "LaunchType",
                        "Namespace"
                    ],
                    [
                        "ClusterName",
                        "LaunchType",
                        "Namespace",
                        "PodName"
                    ]
                ],
                "Metrics": [
                    {
                        "Name": "pod_cpu_usage_total",
                        "Unit": "Millicore"
                    },
                    {
                        "Name": "pod_cpu_utilization_over_pod_limit",
                        "Unit": "Percent"
                    }
                ]
            }
        ],
        "Timestamp": 1646124298710
    },
    "beta_kubernetes_io_arch": "amd64",
    "beta_kubernetes_io_os": "linux",
    "cloud.platform": "aws_eks",
    "cloud.provider": "aws",
    "cpu": "total",
    "failure_domain_beta_kubernetes_io_region": "ap-northeast-1",
    "failure_domain_beta_kubernetes_io_zone": "ap-northeast-1c",
    "host.name": "fargate-ip-10-0-103-250.ap-northeast-1.compute.internal",
    "id": "/kubepods/burstable/pod9e99f29e-5e25-40c1-aa39-35e198dd0997",
    "instance": "fargate-ip-10-0-103-250.ap-northeast-1.compute.internal",
    "job": "kubelets-cadvisor-metrics",
    "kubernetes": {
        "host": "ip-10-0-103-250.ap-northeast-1.compute.internal",
        "namespace_name": "golang",
        "pod_name": "webapp-8b785b667-hcrts"
    },
    "kubernetes_io_arch": "amd64",
    "kubernetes_io_os": "linux",
    "opencensus.resourcetype": "cloud",
    "pod_cpu_usage_seconds_total": 0.00044769273935354483,
    "pod_cpu_usage_total": 0.4476927393535448,
    "pod_cpu_utilization_over_pod_limit": 0.08953854787070896,
    "port": "",
    "scheme": "https",
    "service.name": "kubelets-cadvisor-metrics",
    "topology_kubernetes_io_region": "ap-northeast-1",
    "topology_kubernetes_io_zone": "ap-northeast-1c"
}

EMF が含まれないログもあり、なぜ EMF が含まれるものとそうでないものがあるのかが理解できていない。

{
    "ClusterName": "container-insights-fargate",
    "LaunchType": "fargate",
    "Namespace": "golang",
    "NodeName": "ip-10-0-103-250.ap-northeast-1.compute.internal",
    "PodName": "webapp-8b785b667-hcrts",
    "Type": "Pod",
    "beta_kubernetes_io_arch": "amd64",
    "beta_kubernetes_io_os": "linux",
    "cloud.platform": "aws_eks",
    "cloud.provider": "aws",
    "failure_domain_beta_kubernetes_io_region": "ap-northeast-1",
    "failure_domain_beta_kubernetes_io_zone": "ap-northeast-1c",
    "failure_type": "pgfault",
    "host.name": "fargate-ip-10-0-103-250.ap-northeast-1.compute.internal",
    "id": "/kubepods/burstable/pod9e99f29e-5e25-40c1-aa39-35e198dd0997",
    "instance": "fargate-ip-10-0-103-250.ap-northeast-1.compute.internal",
    "job": "kubelets-cadvisor-metrics",
    "kubernetes": {
        "host": "ip-10-0-103-250.ap-northeast-1.compute.internal",
        "namespace_name": "golang",
        "pod_name": "webapp-8b785b667-hcrts"
    },
    "kubernetes_io_arch": "amd64",
    "kubernetes_io_os": "linux",
    "opencensus.resourcetype": "cloud",
    "pod_memory_hierarchical_pgfault": 0,
    "port": "",
    "scheme": "https",
    "scope": "hierarchy",
    "service.name": "kubelets-cadvisor-metrics",
    "topology_kubernetes_io_region": "ap-northeast-1",
    "topology_kubernetes_io_zone": "ap-northeast-1c"
}

メトリクスとしても Pod 名毎に取得されている。

f:id:sotoiwa:20220301180402p:plain

自動ダッシュボードでもメトリクスが確認できる。ノードのメトリクスはとれていない。

f:id:sotoiwa:20220301180417p:plain

やはり、やっていることは以前から Container Insights の Prometheus サポートを使えばできたことと同じだが、CloudWatch エージェントではなく ADOT コレクターを使用して、かつ適切なセットアップが提供されているということかと思う。