Deploying Prometheus in a Kubernetes Cluster to Monitor the Cluster, with Grafana for Data Visualization
Deploying Prometheus
First, create a directory to hold the configuration files:
mkdir prometheus
This deployment stores Prometheus data on an NFS mount and manages the configuration with a ConfigMap. All the Prometheus resources are created in the kube-system namespace.
# Generate the configuration file
cat >> prometheus-configmap.yaml <<'EOF'
# Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
EOF

Explanation of the configuration (this ConfigMap is, in effect, Prometheus's own configuration file):
The configuration involves three top-level sections: global, rule_files, and scrape_configs (global is not set explicitly here, so Prometheus falls back to its defaults).
- The global section controls the Prometheus server's global settings.
- scrape_interval: how often Prometheus scrapes metrics; the default is 15s and can be overridden.
- evaluation_interval: how often rules are evaluated; Prometheus uses rules to produce new time series or to fire alerts.
- rule_files specifies where the rule files live; Prometheus loads rules from there to generate new time series or alerting data. No rules are configured for now; they will be added later.
- scrape_configs controls which resources Prometheus monitors. Since Prometheus exposes its own metrics over HTTP, it can also monitor its own health. The default configuration contains a single job named prometheus with one statically configured target that scrapes Prometheus itself on localhost, port 9090.
- By default Prometheus scrapes metrics from each target's /metrics path, so the default job collects from http://localhost:9090/metrics; the collected time series describe the state and performance of the Prometheus server itself. Any other resources that need to be monitored can simply be configured under this section, for example through the prometheus.io/* annotations used by the jobs above (see the example below).
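To illustrate the kubernetes-service-endpoints job above: a Service can opt in to scraping through annotations. This is a minimal sketch with a hypothetical application called my-app (not part of this post); the annotation names follow directly from the relabel rules in the ConfigMap:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                      # hypothetical application
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # picked up by the kubernetes-service-endpoints job
    prometheus.io/port: "8080"      # port that exposes metrics
    prometheus.io/path: "/metrics"  # optional, /metrics is already the default
spec:
  selector:
    app: my-app
  ports:
  - port: 8080
```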
Create the ConfigMap:
kubectl create -f prometheus-configmap.yaml
To make Prometheus pick up later ConfigMap changes, trigger a hot reload:
curl -XPOST http://<Prometheus Service IP>:<port>/-/reload
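Putting the two steps together, a typical flow after editing the configuration looks like this (a sketch; use the address of the Prometheus Service created later in this post):

```bash
# Re-apply the edited ConfigMap
kubectl apply -f prometheus-configmap.yaml
# The kubelet syncs ConfigMap volumes periodically, so allow up to a minute
# for the new file to appear inside the Pod, then trigger the hot reload
curl -X POST http://<Prometheus Service IP>:<port>/-/reload
```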
Check that it was created successfully:
kubectl get configmaps -n kube-system |grep prometheus
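To inspect the full rendered configuration, standard kubectl works (not part of the original steps):

```bash
kubectl get configmap prometheus-config -n kube-system -o yaml
```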
The configuration file is now in place. Whenever new resources need to be monitored later, we only have to update this ConfigMap object.

Create the Prometheus Pod resources
cat >> prometheus-deploy.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    app: prometheus
spec:
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.4.3
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=30d"
        - "--web.enable-admin-api" # enables the admin HTTP API, e.g. deleting time series
        - "--web.enable-lifecycle" # enables hot reload via a POST to localhost:9090/-/reload
        ports:
        - containerPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          subPath: prometheus
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 100m
            memory: 512Mi
      securityContext:
        runAsUser: 0
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus
      - configMap:
          name: prometheus-config
        name: config-volume
EOF

A few notes on these parameters:

Besides pointing Prometheus at prometheus.yml (the ConfigMap), we use --storage.tsdb.path to set where the TSDB data is stored and --storage.tsdb.retention to control how long data is kept. --web.enable-admin-api turns on the admin HTTP API, which covers operations such as deleting time series (see the sketch below), and --web.enable-lifecycle enables hot reloading: once the prometheus.yml ConfigMap is updated, a request to localhost:9090/-/reload makes the change take effect immediately.

We also added a securityContext with runAsUser set to 0. Prometheus normally runs as the nobody user, and without this setting it may run into permission problems.

The ConfigMap that holds prometheus.yml is mounted into the Pod as a volume, so when the ConfigMap is updated, the file inside the Pod is refreshed as well; executing the reload request above then makes the new configuration take effect. In addition, to persist the time series data, the data directory is bound to a PVC, so we need to create the PVC object first.
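Because --web.enable-admin-api is enabled, the admin endpoints become available. A hedged sketch of deleting a time series; the label selector job="old-app" is purely illustrative, and the address should be replaced with your Prometheus Service address:

```bash
# Mark all samples of the hypothetical series for deletion (irreversible, use with care)
curl -g -X POST 'http://<Prometheus Service IP>:<port>/api/v1/admin/tsdb/delete_series?match[]={job="old-app"}'
# Free the disk space occupied by the deleted series
curl -X POST 'http://<Prometheus Service IP>:<port>/api/v1/admin/tsdb/clean_tombstones'
```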
cat >> prometheus-volume.yaml <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 10.4.82.138
    path: /data/k8s
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF

Change the NFS server address and path above to match your own environment.
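Before relying on the PV, it can help to confirm that the export is actually reachable from the Kubernetes nodes. A quick check, assuming the NFS client utilities (showmount) are installed on the node:

```bash
showmount -e 10.4.82.138   # should list /data/k8s among the exports
```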
kubectl create -f prometheus-volume.yaml
Check that the PV and PVC are bound:

kubectl get pvc --all-namespaces
kubectl get pv prometheus

Configure RBAC authentication
cat >> prometheus-rbac.yaml <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
EOF

Create the RBAC objects:
kubectl create -f prometheus-rbac.yaml
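As an optional sanity check (not from the original post), kubectl can impersonate the new service account to confirm the ClusterRole works, assuming your current context is allowed to impersonate:

```bash
kubectl auth can-i list nodes --as=system:serviceaccount:kube-system:prometheus   # expect "yes"
kubectl auth can-i list pods  --as=system:serviceaccount:kube-system:prometheus   # expect "yes"
```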
Create the Prometheus Deployment:
kubectl create -f prometheus-deploy.yaml
Check that the Pod is up:
kubectl get pod -n kube-system |grep prometheus
# If the status is Running, everything is OK

The Prometheus Pod is now healthy, but we still cannot reach the Prometheus web UI, so we also need to create a Service.
Its name will be prometheus-svc:
cat >> prometheus-svc.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: prometheus-svc
  namespace: kube-system
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: http
EOF

Create this Service:
kubectl create -f prometheus-svc.yaml
Check which NodePort was assigned; the web UI can then be reached at <node IP>:<NodePort>:
kubectl get svc -n kube-system |grep prometheus
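If only the port number is wanted, a jsonpath query extracts it directly (standard kubectl, not part of the original steps):

```bash
kubectl get svc prometheus-svc -n kube-system -o jsonpath='{.spec.ports[0].nodePort}'
```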
Create the Grafana configuration file
cat >> grafana.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        ports:
        - containerPort: 3000
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
      - name: grafana-data
        persistentVolumeClaim:
          claimName: grafana
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana
  namespace: prometheus
spec:
  storageClassName: "managed-nfs-storage"
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: prometheus
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30007
  selector:
    app: grafana
EOF

Create grafana-ingress.yaml
# Optional: if you do not need access from outside the cluster, you can skip this step.
Ingress documentation: https://kubernetes.io/zh/docs/concepts/services-networking/ingress/
cat >> grafana-ingress.yaml <<EOF
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: prometheus
spec:
  rules:
  - host: k8s.grafana
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 80
EOF

Apply Grafana:
kubectl apply -f grafana.yaml -f grafana-ingress.yaml
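If the optional Ingress was created, it can be tested without DNS by sending the Host header manually; a sketch that assumes an ingress controller is already running and <ingress address> is its node IP or load-balancer address:

```bash
curl -H 'Host: k8s.grafana' http://<ingress address>/
```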
The assigned IP address and port can also be looked up in KubeSphere. Once Grafana is configured with Prometheus as a data source, you can import dashboard templates and see the visualizations.
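When adding the data source in Grafana (Configuration -> Data Sources -> Prometheus), the in-cluster address of the prometheus-svc Service can be used as the URL. A sketch, assuming the default cluster DNS and a kubectl version that accepts deploy/<name> with exec:

```bash
# Data source URL to enter in Grafana:
#   http://prometheus-svc.kube-system:9090
# Optional connectivity check from inside the Grafana pod (wget ships in the Grafana image):
kubectl -n prometheus exec deploy/grafana -- wget -qO- http://prometheus-svc.kube-system:9090/-/healthy
```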
Container monitoring templates: 315, 8588, 3146, 8685
Host monitoring templates: 8919, 9276, 10467, 10171, 9965
| Monitoring type | Template ID |
|---|---|
| Kubernetes node_exporter | 14249, 8919, 10262, 5228 |
| Mikrotik RouterOS | 10950, 14933 (SwOS) |
| Synology | 14284, 14364 |
| SNMP | 14857 (Mikrotik) |
| Windows | 9837 |
| Proxmox PVE | 13307 |
| OpenWrt | 11147 |
| Linux | 12633 |
| WireGuard | 12177 |
Dashboards that look nice but that I have not tuned yet: 8727, 9733, 9916, 12798, 11414, 12095
Fun dashboards: 3287 (PUBG), 11993 and 11994 (Minecraft), 12237 (COVID-19), 14199 (network latency)
Some screenshots of my dashboards:
Installing Prometheus and Grafana (the lazy way)
Someone else wrote this three years ago; it is no longer recommended.
It is best not to pull the images in an environment where all traffic goes through a proxy; the pulls are quite likely to fail, and when they do, consider whether the network environment is the cause.
Quick install:
kubectl apply \
Not recommended; it has a few bugs.
You need to change the kube-state-metrics image to bitnami/kube-state-metrics:latest.
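One way to do that after the manifests are applied; a sketch only, because the Deployment name, container name, and namespace (kube-state-metrics in kube-system) are assumptions about that third-party bundle rather than something stated here:

```bash
kubectl -n kube-system set image deployment/kube-state-metrics \
  kube-state-metrics=bitnami/kube-state-metrics:latest
```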