Notes from checking how to send kube-state-metrics to Container Insights.
Creating the cluster
Create a cluster on Kubernetes 1.24, initially without nodes.
```sh
CLUSTER_NAME="mycluster"
cat << EOF > cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ap-northeast-1
  version: "1.24"
vpc:
  cidr: "10.0.0.0/16"
  availabilityZones:
    - ap-northeast-1a
    - ap-northeast-1c
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
iam:
  withOIDC: true
EOF
```
```sh
eksctl create cluster -f cluster.yaml
```
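As a quick sanity check (eksctl writes the kubeconfig by default, so kubectl should already work):

```sh
# The API server should respond; there are no nodes yet at this point
kubectl get svc kubernetes
```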
Create the nodes.
```sh
cat << EOF > m1.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ap-northeast-1
managedNodeGroups:
  - name: m1
    minSize: 2
    maxSize: 2
    desiredCapacity: 2
    privateNetworking: true
    iam:
      attachPolicyARNs:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
EOF
```
```sh
eksctl create nodegroup -f m1.yaml
```
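Then confirm both nodes joined and are Ready:

```sh
kubectl get nodes
```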
Also grant cluster access to the Admin role.
```sh
CLUSTER_NAME="mycluster"
USER_NAME="Admin:{{SessionName}}"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/Admin"
eksctl create iamidentitymapping --cluster ${CLUSTER_NAME} --arn ${ROLE_ARN} --username ${USER_NAME} --group system:masters
```
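The mapping can be verified with:

```sh
eksctl get iamidentitymapping --cluster ${CLUSTER_NAME}
```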
Deploying Prometheus
Add the following to helmfile.yaml.
```yaml
repositories:
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: prometheus
    namespace: prometheus
    createNamespace: true
    chart: prometheus-community/prometheus
    version: 23.0.0
    values:
      - ./prometheus/values.yaml
```
Create prometheus/values.yaml.
```yaml
# https://github.com/prometheus-community/helm-charts/blob/main/charts/prometheus/values.yaml
server:
  persistentVolume:
    enabled: false
alertmanager:
  enabled: false
kube-state-metrics:
  enabled: true
prometheus-node-exporter:
  enabled: true
prometheus-pushgateway:
  enabled: false
```
Deploy.
```sh
helmfile apply
```
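Before moving on, it is worth confirming that the kube-state-metrics Service is up, since its Namespace and Service name are referenced later in the scrape config:

```sh
kubectl -n prometheus get pods,svc
```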
Deploying Container Insights log collection
Create an IAM role for Fluent Bit.
```sh
eksctl create iamserviceaccount \
  --cluster=${CLUSTER_NAME} \
  --namespace=amazon-cloudwatch \
  --name=fluent-bit \
  --attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve \
  --role-only
```
Add the following to helmfile.yaml.
```yaml
releases:
  - name: container-insights
    chart: ./container_insights
    namespace: amazon-cloudwatch
    createNamespace: true
```
Download the manifest and place it in the container_insights folder.
```sh
curl -O https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml
```
Create the ConfigMap manifest.
```sh
ClusterName=${CLUSTER_NAME}
RegionName=ap-northeast-1
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off' || FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'
kubectl create configmap fluent-bit-cluster-info --dry-run=client -o yaml \
  --from-literal=cluster.name=${ClusterName} \
  --from-literal=http.server=${FluentBitHttpServer} \
  --from-literal=http.port=${FluentBitHttpPort} \
  --from-literal=read.head=${FluentBitReadFromHead} \
  --from-literal=read.tail=${FluentBitReadFromTail} \
  --from-literal=logs.region=${RegionName} -n amazon-cloudwatch > fluent-bit-cluster-info.yaml
```
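For reference, helmfile treats ./container_insights as a local Helm chart, so the downloaded and generated manifests presumably sit under templates/ next to a minimal Chart.yaml. This layout is an assumption, not something shown in the original:

```
container_insights/
  Chart.yaml        # minimal chart metadata (apiVersion: v2, name, version)
  templates/
    fluent-bit.yaml
    fluent-bit-cluster-info.yaml
```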
Fix the IRSA annotation in fluent-bit.yaml. The role ARN can be found with:

```sh
eksctl get iamserviceaccount --cluster ${CLUSTER_NAME}
```
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXXXX:role/eksctl-mycluster-addon-iamserviceaccount-ama-Role1-1OSUU3XOAFAJZ
```
Deploy.
```sh
helmfile apply
```
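A quick way to confirm logs are flowing (the log group names are the Container Insights defaults):

```sh
kubectl -n amazon-cloudwatch get ds fluent-bit
aws logs describe-log-groups --log-group-name-prefix /aws/containerinsights/${CLUSTER_NAME}/
```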
Deploying Container Insights infrastructure metrics collection
Create an IAM role for the ADOT Collector.
```sh
kubectl create ns aws-otel-eks
eksctl create iamserviceaccount \
  --cluster=${CLUSTER_NAME} \
  --namespace=aws-otel-eks \
  --name=aws-otel-sa \
  --attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve \
  --role-only
```
Add the following to helmfile.yaml.
```yaml
releases:
  - name: adot-collector
    chart: ./adot_collector
    namespace: aws-otel-eks
    createNamespace: true
```
Download the manifest and place it in the adot_collector folder.
```sh
curl -O https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml
```
The Namespace already exists, so remove it from the manifest.
Add the IRSA annotation.
```yaml
---
# create cwagent service account and role binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-otel-sa
  namespace: aws-otel-eks
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXXXX:role/eksctl-mycluster-addon-iamserviceaccount-aws-Role1-1SKD8JAEOLGKW
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: aoc-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch", "get"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create", "get"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get", "update", "create"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create", "get", "update"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["otel-container-insight-clusterleader"]
    verbs: ["get", "update", "create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: aoc-agent-role-binding
subjects:
  - kind: ServiceAccount
    name: aws-otel-sa
    namespace: aws-otel-eks
roleRef:
  kind: ClusterRole
  name: aoc-agent-role
  apiGroup: rbac.authorization.k8s.io
---
# create Daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-otel-eks-ci
  namespace: aws-otel-eks
spec:
  selector:
    matchLabels:
      name: aws-otel-eks-ci
  template:
    metadata:
      labels:
        name: aws-otel-eks-ci
    spec:
      containers:
        - name: aws-otel-collector
          image: public.ecr.aws/aws-observability/aws-otel-collector:latest
          env:
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          imagePullPolicy: Always
          command:
            - "/awscollector"
            - "--config=/conf/otel-agent-config.yaml"
          volumeMounts:
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
            - name: otel-agent-config-vol
              mountPath: /conf
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
      volumes:
        - configMap:
            name: otel-agent-conf
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
          name: otel-agent-config-vol
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      serviceAccountName: aws-otel-sa
```
Split the ConfigMap out into a separate file.
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: aws-otel-eks
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    extensions:
      health_check:
    receivers:
      awscontainerinsightreceiver:
    processors:
      batch/metrics:
        timeout: 60s
    exporters:
      awsemf:
        namespace: ContainerInsights
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{NodeName}'
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        parse_json_encoded_attr_values: [Sources, kubernetes]
        metric_declarations:
          # node metrics
          - dimensions: [[NodeName, InstanceId, ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - node_cpu_utilization
              - node_memory_utilization
              - node_network_total_bytes
              - node_cpu_reserved_capacity
              - node_memory_reserved_capacity
              - node_number_of_running_pods
              - node_number_of_running_containers
              - node_cpu_usage_total
              - node_cpu_limit
              - node_memory_working_set
              - node_memory_limit
          # pod metrics
          - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_utilization
              - pod_memory_utilization
              - pod_network_rx_bytes
              - pod_network_tx_bytes
              - pod_cpu_utilization_over_pod_limit
              - pod_memory_utilization_over_pod_limit
          - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - pod_cpu_reserved_capacity
              - pod_memory_reserved_capacity
          - dimensions: [[PodName, Namespace, ClusterName]]
            metric_name_selectors:
              - pod_number_of_container_restarts
          # cluster metrics
          - dimensions: [[ClusterName]]
            metric_name_selectors:
              - cluster_node_count
              - cluster_failed_node_count
          # service metrics
          - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - service_number_of_running_pods
          # node fs metrics
          - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
            metric_name_selectors:
              - node_filesystem_utilization
          # namespace metrics
          - dimensions: [[Namespace, ClusterName], [ClusterName]]
            metric_name_selectors:
              - namespace_number_of_running_pods
    service:
      pipelines:
        metrics:
          receivers: [awscontainerinsightreceiver]
          processors: [batch/metrics]
          exporters: [awsemf]
      extensions: [health_check]
```
Deploy.
```sh
helmfile apply
```
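To confirm the performance log group is being written (names follow the awsemf config above):

```sh
kubectl -n aws-otel-eks get ds aws-otel-eks-ci
aws logs describe-log-streams \
  --log-group-name /aws/containerinsights/${CLUSTER_NAME}/performance --max-items 5
```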
Deploying Container Insights Prometheus metrics collection
Create an IAM role for the ADOT Collector.
```sh
eksctl create iamserviceaccount \
  --cluster=${CLUSTER_NAME} \
  --namespace=aws-otel-eks \
  --name=aws-otel-collector \
  --attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve \
  --role-only
```
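The role ARN needed for the annotation below can be looked up the same way as before:

```sh
eksctl get iamserviceaccount --cluster ${CLUSTER_NAME} --namespace aws-otel-eks
```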
Download the manifest and place it in the adot_collector folder.
```sh
curl -O https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-prometheus.yaml
```
The Namespace already exists, so remove it from the manifest.
Add the IRSA annotation. This manifest ships with a fixed configuration for collecting AppMesh/HAProxy/JMX/Memcached/Nginx metrics, so to use a customized configuration instead, mount a ConfigMap, following the pattern of the infrastructure metrics manifest.
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-otel-collector
  namespace: aws-otel-eks
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXXXXX:role/eksctl-mycluster-addon-iamserviceaccount-aws-Role1-T5JD5TS9NP53
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-prometheus-role
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/proxy
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - ingresses
    verbs:
      - get
      - list
      - watch
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-prometheus-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-prometheus-role
subjects:
  - kind: ServiceAccount
    name: aws-otel-collector
    namespace: aws-otel-eks
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-otel-collector
  namespace: aws-otel-eks
  labels:
    name: aws-otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      name: aws-otel-collector
  template:
    metadata:
      labels:
        name: aws-otel-collector
    spec:
      serviceAccountName: aws-otel-collector
      containers:
        - name: aws-otel-collector
          image: public.ecr.aws/aws-observability/aws-otel-collector:latest
          command:
            - "/awscollector"
            - "--config=/conf/otel-agent-config.yaml"
          env:
            - name: AWS_REGION
              value: ap-northeast-1
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: ClusterName=mycluster
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 256m
              memory: 512Mi
            requests:
              cpu: 32m
              memory: 24Mi
          volumeMounts:
            - name: otel-agent-config-vol
              mountPath: /conf
      volumes:
        - name: otel-agent-config-vol
          configMap:
            name: otel-agent-conf-prometheus
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
```
Referring to this Issue comment, create a ConfigMap with the custom definitions (shown after the check below). The scrape target is a fixed setting via static_configs, so it has to match the Namespace and Service name where kube-state-metrics was deployed.
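The actual names can be confirmed from the chart's Service (prometheus-kube-state-metrics here, because the release is named prometheus):

```sh
kubectl -n prometheus get svc prometheus-kube-state-metrics
```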
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf-prometheus
  namespace: aws-otel-eks
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    extensions:
      health_check:
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 1m
            scrape_timeout: 10s
          scrape_configs:
            - job_name: 'kube-state-metrics'
              static_configs:
                - targets: [ 'prometheus-kube-state-metrics.prometheus.svc.cluster.local:8080' ]
    processors:
      resourcedetection/ec2:
        detectors: [ env ]
        timeout: 2s
        override: false
      resource:
        attributes:
          - key: TaskId
            from_attribute: job
            action: insert
          - key: receiver
            value: "prometheus"
            action: insert
    exporters:
      awsemf:
        namespace: ContainerInsights/Prometheus
        log_group_name: "/aws/containerinsights/{ClusterName}/prometheus"
        log_stream_name: "{TaskId}"
        resource_to_telemetry_conversion:
          enabled: true
        dimension_rollup_option: NoDimensionRollup
        metric_declarations:
          - dimensions: [ [ ClusterName, deployment, namespace ], [ ClusterName, namespace ], [ ClusterName ] ]
            metric_name_selectors:
              - "^kube_deployment_spec_replicas$"
              - "^kube_deployment_status_replicas$"
              - "^kube_deployment_status_replicas_ready$"
              - "^kube_deployment_status_replicas_available$"
              - "^kube_deployment_status_replicas_unavailable$"
            label_matchers:
              - label_names:
                  - service.name
                regex: "^kube-state-metrics$"
          - dimensions: [ [ ClusterName, statefulset, namespace ], [ ClusterName, namespace ], [ ClusterName ] ]
            metric_name_selectors:
              - "^kube_statefulset_replicas$"
              - "^kube_statefulset_status_replicas$"
              - "^kube_statefulset_status_replicas_ready$"
              - "^kube_statefulset_status_replicas_available$"
            label_matchers:
              - label_names:
                  - service.name
                regex: "^kube-state-metrics$"
          - dimensions: [ [ ClusterName, daemonset, namespace ], [ ClusterName, namespace ], [ ClusterName ] ]
            metric_name_selectors:
              - "^kube_daemonset_status_desired_number_scheduled$"
              - "^kube_daemonset_status_number_ready$"
              - "^kube_daemonset_status_number_available$"
              - "^kube_daemonset_status_number_unavailable$"
            label_matchers:
              - label_names:
                  - service.name
                regex: "^kube-state-metrics$"
          - dimensions: [ [ ClusterName, namespace, phase ], [ ClusterName, phase ], [ ClusterName ] ]
            metric_name_selectors:
              - "^kube_pod_status_ready$"
              - "^kube_pod_status_scheduled$"
              - "^kube_pod_status_unschedulable$"
              - "^kube_pod_status_phase$"
            label_matchers:
              - label_names:
                  - service.name
                regex: "^kube-state-metrics$"
          - dimensions: [ [ ClusterName, condition ] ]
            metric_name_selectors:
              - "^kube_node_status_condition$"
            label_matchers:
              - label_names:
                  - service.name
                regex: "^kube-state-metrics$"
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [resourcedetection/ec2, resource]
          exporters: [awsemf]
```
Deploy.
```sh
helmfile apply
```
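If nothing shows up afterwards, the collector logs are the first place to look for scrape errors:

```sh
kubectl -n aws-otel-eks get deploy aws-otel-collector
kubectl -n aws-otel-eks logs deploy/aws-otel-collector
```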
Verification
Check the /aws/containerinsights/mycluster/prometheus log group. The log stream name is not being set as configured, possibly because TaskId is not resolved, but EMF-format logs are being delivered.
{ "ClusterName": "mycluster", "OTelLib": "otelcol/prometheusreceiver", "Version": "1", "_aws": { "CloudWatchMetrics": [ { "Namespace": "ContainerInsights/Prometheus", "Dimensions": [ [ "ClusterName", "daemonset", "namespace" ], [ "ClusterName", "namespace" ], [ "ClusterName" ] ], "Metrics": [ { "Name": "kube_daemonset_status_number_available" }, { "Name": "kube_daemonset_status_number_unavailable" }, { "Name": "kube_daemonset_status_desired_number_scheduled" }, { "Name": "kube_daemonset_status_number_ready" } ] } ], "Timestamp": 1688664351790 }, "daemonset": "kube-proxy", "http.scheme": "http", "kube_daemonset_annotations": 1, "kube_daemonset_created": 1688650419, "kube_daemonset_labels": 1, "kube_daemonset_metadata_generation": 1, "kube_daemonset_status_current_number_scheduled": 2, "kube_daemonset_status_desired_number_scheduled": 2, "kube_daemonset_status_number_available": 2, "kube_daemonset_status_number_misscheduled": 0, "kube_daemonset_status_number_ready": 2, "kube_daemonset_status_number_unavailable": 0, "kube_daemonset_status_observed_generation": 1, "kube_daemonset_status_updated_number_scheduled": 2, "namespace": "kube-system", "net.host.name": "prometheus-kube-state-metrics.prometheus.svc.cluster.local", "net.host.port": "8080", "prom_metric_type": "gauge", "receiver": "prometheus", "service.instance.id": "prometheus-kube-state-metrics.prometheus.svc.cluster.local:8080", "service.name": "kube-state-metrics" }
Which metrics to collect still needs more scrutiny, but the metrics themselves are also coming through.
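The metrics can also be listed from the CLI (the namespace is the one configured in the awsemf exporter):

```sh
aws cloudwatch list-metrics --namespace ContainerInsights/Prometheus \
  --dimensions Name=ClusterName,Value=mycluster
```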
Additional notes
The same thing is possible with the CloudWatch Prometheus Agent instead of the ADOT Collector.
The scrape configuration looks like this.
```yaml
---
# create configmap for prometheus scrape config
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: amazon-cloudwatch
data:
  # prometheus config
  prometheus.yaml: |
    global:
      scrape_interval: 1m
      scrape_timeout: 10s
    scrape_configs:
      - job_name: 'kube-state-metrics'
        static_configs:
          - targets: [ 'kube-state-metrics.kube-system.svc.cluster.local:8080' ]
```
The mapping configuration looks like this. Note that it differs subtly from the ADOT Collector version: here the label_matcher is applied to the job label via source_labels, whereas the ADOT config matched on service.name.
```yaml
---
# create configmap for prometheus cwagent config
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-cwagentconfig
  namespace: amazon-cloudwatch
data:
  # cwagent json config
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/prometheusconfig/prometheus.yaml",
            "emf_processor": {
              "metric_declaration": [
                {
                  "source_labels": ["job"],
                  "label_matcher": "^kube-state-metrics$",
                  "dimensions": [["ClusterName", "deployment", "namespace"], ["ClusterName", "namespace"], ["ClusterName"]],
                  "metric_selectors": [
                    "^kube_deployment_spec_replicas$",
                    "^kube_deployment_status_replicas$",
                    "^kube_deployment_status_replicas_ready$",
                    "^kube_deployment_status_replicas_available$",
                    "^kube_deployment_status_replicas_unavailable$"
                  ]
                },
                {
                  "source_labels": ["job"],
                  "label_matcher": "^kube-state-metrics$",
                  "dimensions": [["ClusterName", "statefulset", "namespace"], ["ClusterName", "namespace"], ["ClusterName"]],
                  "metric_selectors": [
                    "^kube_statefulset_replicas$",
                    "^kube_statefulset_status_replicas$",
                    "^kube_statefulset_status_replicas_ready$",
                    "^kube_statefulset_status_replicas_available$"
                  ]
                },
                {
                  "source_labels": ["job"],
                  "label_matcher": "^kube-state-metrics$",
                  "dimensions": [["ClusterName", "daemonset", "namespace"], ["ClusterName", "namespace"], ["ClusterName"]],
                  "metric_selectors": [
                    "^kube_daemonset_status_desired_number_scheduled$",
                    "^kube_daemonset_status_number_ready$",
                    "^kube_daemonset_status_number_available$",
                    "^kube_daemonset_status_number_unavailable$"
                  ]
                },
                {
                  "source_labels": ["job"],
                  "label_matcher": "^kube-state-metrics$",
                  "dimensions": [["ClusterName", "namespace", "phase"], ["ClusterName", "phase"], ["ClusterName"]],
                  "metric_selectors": [
                    "^kube_pod_status_ready$",
                    "^kube_pod_status_scheduled$",
                    "^kube_pod_status_unschedulable$",
                    "^kube_pod_status_phase$"
                  ]
                },
                {
                  "source_labels": ["job"],
                  "label_matcher": "^kube-state-metrics$",
                  "dimensions": [["ClusterName", "condition"]],
                  "metric_selectors": [
                    "^kube_node_status_condition$"
                  ]
                }
              ]
            }
          }
        },
        "force_flush_interval": 5
      }
    }
```
The delivered EMF logs looked like this.
{ "CloudWatchMetrics": [ { "Metrics": [ { "Name": "kube_daemonset_status_number_ready" }, { "Name": "kube_daemonset_status_desired_number_scheduled" }, { "Name": "kube_daemonset_status_number_available" }, { "Name": "kube_daemonset_status_number_unavailable" } ], "Dimensions": [ [ "ClusterName", "daemonset", "namespace" ], [ "ClusterName", "namespace" ], [ "ClusterName" ] ], "Namespace": "ContainerInsights/Prometheus" } ], "ClusterName": "fully-private", "Timestamp": "1688665159898", "Version": "0", "daemonset": "efs-csi-node", "instance": "kube-state-metrics.kube-system.svc.cluster.local:8080", "job": "kube-state-metrics", "kube_daemonset_annotations": 1, "kube_daemonset_created": 1687945935, "kube_daemonset_labels": 1, "kube_daemonset_metadata_generation": 1, "kube_daemonset_status_current_number_scheduled": 2, "kube_daemonset_status_desired_number_scheduled": 2, "kube_daemonset_status_number_available": 2, "kube_daemonset_status_number_misscheduled": 0, "kube_daemonset_status_number_ready": 2, "kube_daemonset_status_number_unavailable": 0, "kube_daemonset_status_observed_generation": 1, "kube_daemonset_status_updated_number_scheduled": 2, "namespace": "kube-system", "prom_metric_type": "gauge" }