Deploying Prometheus on a Kubernetes Cluster to Monitor the Cluster, with Grafana for Data Visualization

Prometheus Deployment
  1. First, create a directory to hold and manage the configuration files

    mkdir prometheus

    This deployment stores Prometheus data on an NFS mount and manages the configuration file with a ConfigMap. All of the Prometheus resources are placed in the kube-system namespace.

    # Generate the configuration file
    # (the heredoc delimiter is quoted so that $1:$2 in the relabel rules is not expanded by the shell)

    cat >> prometheus-configmap.yaml <<'EOF'
    # Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      namespace: kube-system
      labels:
        kubernetes.io/cluster-service: "true"
        addonmanager.kubernetes.io/mode: EnsureExists
    data:
      prometheus.yml: |
        rule_files:
        - /etc/config/rules/*.rules

        scrape_configs:
        - job_name: prometheus
          static_configs:
          - targets:
            - localhost:9090

        - job_name: kubernetes-apiservers
          kubernetes_sd_configs:
          - role: endpoints
          relabel_configs:
          - action: keep
            regex: default;kubernetes;https
            source_labels:
            - __meta_kubernetes_namespace
            - __meta_kubernetes_service_name
            - __meta_kubernetes_endpoint_port_name
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        - job_name: kubernetes-nodes-kubelet
          kubernetes_sd_configs:
          - role: node
          relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        - job_name: kubernetes-nodes-cadvisor
          kubernetes_sd_configs:
          - role: node
          relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __metrics_path__
            replacement: /metrics/cadvisor
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            insecure_skip_verify: true
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

        - job_name: kubernetes-service-endpoints
          kubernetes_sd_configs:
          - role: endpoints
          relabel_configs:
          - action: keep
            regex: true
            source_labels:
            - __meta_kubernetes_service_annotation_prometheus_io_scrape
          - action: replace
            regex: (https?)
            source_labels:
            - __meta_kubernetes_service_annotation_prometheus_io_scheme
            target_label: __scheme__
          - action: replace
            regex: (.+)
            source_labels:
            - __meta_kubernetes_service_annotation_prometheus_io_path
            target_label: __metrics_path__
          - action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            source_labels:
            - __address__
            - __meta_kubernetes_service_annotation_prometheus_io_port
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - action: replace
            source_labels:
            - __meta_kubernetes_namespace
            target_label: kubernetes_namespace
          - action: replace
            source_labels:
            - __meta_kubernetes_service_name
            target_label: kubernetes_name

        - job_name: kubernetes-services
          kubernetes_sd_configs:
          - role: service
          metrics_path: /probe
          params:
            module:
            - http_2xx
          relabel_configs:
          - action: keep
            regex: true
            source_labels:
            - __meta_kubernetes_service_annotation_prometheus_io_probe
          - source_labels:
            - __address__
            target_label: __param_target
          - replacement: blackbox
            target_label: __address__
          - source_labels:
            - __param_target
            target_label: instance
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels:
            - __meta_kubernetes_namespace
            target_label: kubernetes_namespace
          - source_labels:
            - __meta_kubernetes_service_name
            target_label: kubernetes_name

        - job_name: kubernetes-pods
          kubernetes_sd_configs:
          - role: pod
          relabel_configs:
          - action: keep
            regex: true
            source_labels:
            - __meta_kubernetes_pod_annotation_prometheus_io_scrape
          - action: replace
            regex: (.+)
            source_labels:
            - __meta_kubernetes_pod_annotation_prometheus_io_path
            target_label: __metrics_path__
          - action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            source_labels:
            - __address__
            - __meta_kubernetes_pod_annotation_prometheus_io_port
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - action: replace
            source_labels:
            - __meta_kubernetes_namespace
            target_label: kubernetes_namespace
          - action: replace
            source_labels:
            - __meta_kubernetes_pod_name
            target_label: kubernetes_pod_name

        alerting:
          alertmanagers:
          - static_configs:
            - targets: ["alertmanager:80"]
    EOF

    Explanation of the configuration (the ConfigMap here is effectively the Prometheus configuration itself)

    A Prometheus configuration is built around three main sections: global, rule_files and scrape_configs (global is not set above, so the defaults apply).

    • The global section controls the Prometheus server's global settings
    • scrape_interval: how often Prometheus scrapes metrics; the built-in default is 1m (example configurations commonly set it to 15s), and it can be overridden per job
    • evaluation_interval: how often rules are evaluated; Prometheus uses rules to produce new time series or to generate alerts
    • rule_files specifies where rule files live; Prometheus loads rules from there to produce new time series or alerting data. No rules are configured yet; they will be added later
    • scrape_configs controls which resources Prometheus monitors. Since Prometheus exposes its own metrics over HTTP, it can monitor its own health. The default configuration contains a single job named prometheus that collects Prometheus's own time series; it has one statically configured target, localhost on port 9090
    • By default Prometheus scrapes metrics from a target's /metrics path, so the default job scrapes http://localhost:9090/metrics. The collected time series describe the state and performance of the Prometheus server itself. Any other resources that need monitoring can simply be added to this section
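
    Since the manifest above leaves global unset, the defaults apply. To make the two intervals explicit, a global block could be added at the top of the prometheus.yml key in the ConfigMap; a minimal sketch (the values are only examples, not part of the original manifest):

    global:
      scrape_interval: 15s      # scrape targets every 15s instead of the built-in 1m default
      evaluation_interval: 15s  # evaluate recording/alerting rules every 15s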

    Create the ConfigMap

    kubectl create -f prometheus-configmap.yaml

    To make Prometheus pick up an updated ConfigMap, trigger a reload (once Prometheus is running):

    curl -XPOST http://<Prometheus Service IP>:<port>/-/reload

    Check that the ConfigMap was created:

    kubectl get configmaps -n kube-system |grep prometheus

    The configuration file is now in place; if new resources need to be monitored later, we only have to update this ConfigMap object.
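
    For example, because the kubernetes-service-endpoints job only keeps targets whose Service carries the prometheus.io/scrape annotation, a new exporter can often be picked up without editing the ConfigMap at all. A rough sketch (the Service name and port are hypothetical, for illustration only):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-exporter                  # hypothetical exporter Service
      namespace: default
      annotations:
        prometheus.io/scrape: "true"     # matched by the keep rule in kubernetes-service-endpoints
        prometheus.io/port: "9100"       # rewritten into __address__ by the relabel rules above
        prometheus.io/path: "/metrics"   # optional; /metrics is already the default
    spec:
      selector:
        app: my-exporter
      ports:
      - port: 9100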

  2. Create the Prometheus Pod resources

    cat >>prometheus-deploy.yaml <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus
      namespace: kube-system
      labels:
        app: prometheus
    spec:
      selector:
        matchLabels:
          app: prometheus
      template:
        metadata:
          labels:
            app: prometheus
        spec:
          serviceAccountName: prometheus
          containers:
          - image: prom/prometheus:v2.4.3
            name: prometheus
            command:
            - "/bin/prometheus"
            args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--storage.tsdb.path=/prometheus"
            - "--storage.tsdb.retention=30d"
            - "--web.enable-admin-api" # enables the admin HTTP API, which includes features such as deleting time series
            - "--web.enable-lifecycle" # enables hot reload: a POST to localhost:9090/-/reload applies config changes immediately
            ports:
            - containerPort: 9090
              protocol: TCP
              name: http
            volumeMounts:
            - mountPath: "/prometheus"
              subPath: prometheus
              name: data
            - mountPath: "/etc/prometheus"
              name: config-volume
            resources:
              requests:
                cpu: 100m
                memory: 512Mi
              limits:
                cpu: 100m
                memory: 512Mi
          securityContext:
            runAsUser: 0
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: prometheus
          - configMap:
              name: prometheus-config
            name: config-volume
    EOF

    A brief explanation of the startup parameters

    Besides pointing the process at prometheus.yml (from the ConfigMap), we use storage.tsdb.path to choose where the TSDB data is stored and storage.tsdb.retention to control how long data is kept. The web.enable-admin-api flag turns on access to the admin HTTP API, and web.enable-lifecycle enables hot reload: once prometheus.yml (the ConfigMap) has been updated, a request to localhost:9090/-/reload makes the change take effect immediately.
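
    Before the NodePort Service exists, an easy way to try both endpoints is kubectl port-forward; a sketch (the deployment name matches the manifest above, and the series selector in the last command is only an example):

    # Forward the Prometheus port to the local machine
    kubectl port-forward -n kube-system deploy/prometheus 9090:9090 &

    # --web.enable-lifecycle: reload the configuration without restarting the Pod
    curl -XPOST http://localhost:9090/-/reload

    # --web.enable-admin-api: e.g. delete a time series (-g stops curl from globbing the [] in the URL)
    curl -g -XPOST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=up{job="prometheus"}'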

    We also added a securityContext with runAsUser set to 0: Prometheus normally runs as the nobody user, and without this setting it may run into permission problems on the data directory.

    The ConfigMap object that holds prometheus.yaml is mounted into the Pod as a volume, so when the ConfigMap is updated the file inside the Pod is refreshed as well; after that, issuing the reload request above makes the new configuration take effect. In addition, to persist the time series data, the data directory is bound to a PVC object, so we need to create that PVC first.

    cat >>prometheus-volume.yaml <<EOF
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: prometheus
    spec:
      capacity:
        storage: 10Gi
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Recycle
      nfs:
        server: 10.4.82.138
        path: /data/k8s

    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: prometheus
      namespace: kube-system
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
    EOF

    Change the NFS server address and path above to match your environment.
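
    Also make sure the directory is actually exported on the NFS server, otherwise the PV will bind but the Pod will fail to mount it. A rough sketch, run on the NFS server itself (the export options are an assumption, adjust them to your environment):

    # On the NFS server (10.4.82.138 in the example above)
    mkdir -p /data/k8s
    echo "/data/k8s *(rw,sync,no_root_squash)" >> /etc/exports
    exportfs -r     # re-export everything listed in /etc/exports
    showmount -e    # verify the export is visible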

    kubectl create -f prometheus-volume.yaml

    Check that the PV and PVC are bound:

    kubectl get pvc --all-namespaces
    kubectl get pv prometheus

    Configure RBAC

    cat >>prometheus-rbac.yaml <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes
      - services
      - endpoints
      - pods
      - nodes/proxy
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - ""
      resources:
      - configmaps
      - nodes/metrics
      verbs:
      - get
    - nonResourceURLs:
      - /metrics
      verbs:
      - get
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: kube-system
    EOF

    Create the RBAC resources

    kubectl create -f prometheus-rbac.yaml

    Deploy Prometheus

    kubectl create -f prometheus-deploy.yaml

    Check that the Pod is up:

    kubectl get pod -n kube-system |grep prometheus
    # It is OK once the status is Running

    The Prometheus Pod is now running normally, but we still cannot reach the Prometheus web UI; for that we need to create a Service.

    The Service is named prometheus-svc

    cat >>prometheus-svc.yaml <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus-svc
      namespace: kube-system
      labels:
        app: prometheus
    spec:
      selector:
        app: prometheus
      type: NodePort
      ports:
      - name: web
        port: 9090
        targetPort: http
    EOF

    Create the Service

    kubectl create -f prometheus-svc.yaml

    Check which NodePort was assigned, then access the UI with a node IP plus that port.

    kubectl get svc -n kube-system |grep prometheus
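
    If you prefer to grab the port non-interactively, a jsonpath query works as well; a small sketch (replace <node-ip> with the address of any cluster node):

    NODE_PORT=$(kubectl get svc prometheus-svc -n kube-system -o jsonpath='{.spec.ports[0].nodePort}')
    echo "Prometheus UI: http://<node-ip>:${NODE_PORT}"
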
  3. Create the Grafana manifests

    cat >>grafana.yaml <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: prometheus
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana
            ports:
            - containerPort: 3000
              protocol: TCP
            resources:
              limits:
                cpu: 100m
                memory: 256Mi
              requests:
                cpu: 100m
                memory: 256Mi
            volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
              subPath: grafana
          securityContext:
            fsGroup: 472
            runAsUser: 472
          volumes:
          - name: grafana-data
            persistentVolumeClaim:
              claimName: grafana

    ---

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: grafana
      namespace: prometheus
    spec:
      storageClassName: "managed-nfs-storage"
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi

    ---

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: prometheus
    spec:
      type: NodePort
      ports:
      - port: 80
        targetPort: 3000
        nodePort: 30007
      selector:
        app: grafana
    EOF

    Create grafana-ingress.yaml

    # Optional: if you do not need access from outside the cluster, skip this step

    Ingress documentation: https://kubernetes.io/zh/docs/concepts/services-networking/ingress/

    cat >>grafana-ingress.yaml <<EOF
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: grafana
      namespace: prometheus
    spec:
      rules:
      - host: k8s.grafana
        http:
          paths:
          - path: /
            backend:
              serviceName: grafana
              servicePort: 80
    EOF

    Apply the Grafana manifests

    kubectl apply -f grafana.yaml -f grafana-ingress.yaml
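
    If you created the optional Ingress, one way to test it is to send the Host header straight to your ingress controller; a sketch (the controller address is a placeholder you need to fill in, or add a matching entry to /etc/hosts):

    curl -H "Host: k8s.grafana" http://<ingress-controller-ip>/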

    You can look up the assigned IP address and port in KubeSphere. Once Grafana is configured with Prometheus as a data source, import the dashboard templates below to get the visualizations.
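
    When adding the data source in Grafana, the in-cluster URL of the Service created earlier can be used directly. If you prefer to provision it from a file instead of clicking through the UI, here is a minimal sketch of Grafana's datasource provisioning format (the URL assumes the prometheus-svc Service in kube-system from above):

    # e.g. placed under /etc/grafana/provisioning/datasources/prometheus.yaml
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus-svc.kube-system.svc:9090
      isDefault: true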

    Container monitoring dashboards: 315, 8588, 3146, 8685
    Host monitoring dashboards: 8919, 9276, 10467, 10171, 9965

Monitoring type             Template IDs
Kubernetes node_exporter    14249, 8919, 10262, 5228
Mikrotik RouterOS           10950, 14933 (SwOS)
Synology                    14284, 14364
SNMP                        14857 (Mikrotik)
Windows                     9837
Proxmox PVE                 13307
OpenWrt                     11147
Linux                       12633
WireGuard                   12177

Dashboards that look nice but that I have not tuned yet: 8727, 9733, 9916, 12798, 11414, 12095

Fun dashboards: 3287 (PUBG), 11993, 11994 (Minecraft), 12237 (COVID-19), 14199 (network latency)

Some screenshots of my dashboards:

(Dashboard screenshots: 微信截图_20220323214600.png, 微信截图_20220323231243.png, 微信截图_20220323230620.png, 微信截图_20220323214702.png, 微信截图_20220324002923.png, 微信截图_20220323231021.png)

Installing Prometheus and Grafana (the lazy way)

This is something someone else wrote three years ago, and I do not recommend using it.
Also, it is best not to pull the images from an environment that is fully routed through a proxy; the pulls are quite likely to fail, and when a pull does fail, consider whether your network environment is the cause.

Prometheus and Grafana

Quick install

kubectl apply \
--filename https://raw.githubusercontent.com/giantswarm/prometheus/master/manifests-all.yaml

Not recommended; it has a few bugs.

You need to change the kube-state-metrics image to bitnami/kube-state-metrics:latest.
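
One way to swap the image after applying the manifest is kubectl set image; a sketch (the monitoring namespace, deployment name, and container name are assumptions about that manifest, so verify them with kubectl get deploy -A first):

kubectl set image deployment/kube-state-metrics \
  kube-state-metrics=bitnami/kube-state-metrics:latest -n monitoring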