Prometheus Monitoring Targets, Metrics, and Alert Rules

news/2024/9/8 10:41:54 / Article source: https://blog.csdn.net/eighters/article/details/140707649

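Contents

Linux Docker monitoring

Linux process monitoring

Linux OS monitoring

Windows OS monitoring

Configuration files & alert rules

Prometheus configuration file

node_alert.rules

docker_container.rules

mysql_alert.rules

vmware.rules

Alertmanager alerting configuration

Consul service registration

Dashboard JSON files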

Linux Docker monitoring

cAdvisor collects the same statistics that the docker stats command reports and displays them in a web page.

cadvisor.tar

Upload the cadvisor.tar package, load the image, retag it, and run the container:

docker load -i cadvisor.tar
docker tag gcr.io/cadvisor/cadvisor:latest google/cadvisor:latest
docker run -d --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --name=cadvisor google/cadvisor:latest

Once the container is running, open cAdvisor at http://ip:8080
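Besides the web page, cAdvisor serves its data in Prometheus text format at /metrics, which is the path the cadvisor scrape job in this article uses. The following is a minimal Python sketch of how such lines break down into metric name, labels, and value. The sample payload is hand-written for illustration and is not real cAdvisor output, and parse_metrics is a hypothetical helper, not part of any Prometheus library.

```python
import re

# Hypothetical sample in Prometheus text exposition format; real cAdvisor
# output contains many more series and labels.
SAMPLE = '''\
# HELP container_memory_working_set_bytes Current working set in bytes.
# TYPE container_memory_working_set_bytes gauge
container_memory_working_set_bytes{name="cadvisor"} 52428800
container_last_seen{name="cadvisor"} 1725763314
'''

# metric_name, optional {labels}, then a single value (no timestamp support).
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group('labels') or ''))
        samples.append((match.group('name'), labels, float(match.group('value'))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

The metric names in the sample (container_memory_working_set_bytes, container_last_seen) are the ones the docker_container.rules file later in this article alerts on.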

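Linux process monitoring

The process exporter reports the running state of selected processes, which can be matched by regular expression, absolute path, name, and so on.

process-exporter-0.7.5.linux-amd64.tar.gz

See my other article:

Prometheus监控主机进程 (Monitoring Host Processes with Prometheus), CSDN blog

Default port: 9256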

Linux OS monitoring

The node exporter collects the system's OS resource metrics such as CPU, memory, and disk.

After placing node_exporter in the target path, create the systemd unit:

cat /etc/systemd/system/node-exporter.service

[Unit]
Description=Prometheus Node exporter
After=network.target

[Service]
ExecStart=/opt/monitoring/node_exporter

[Install]
WantedBy=multi-user.target

Default port: 9100

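Windows OS monitoring

The Windows exporter collects the system's OS resource metrics such as CPU, memory, and disk.

windows_exporter-0.26.0-amd64.msi

1. Turn off the firewall.

2. Run the installer as administrator (double-click).

3. In services.msc, check that the windows-exporter service is set to start automatically.

Default port: 9182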

Configuration files & alert rules

All files are under the /opt/monitor/prometheus directory.

Prometheus configuration file

cat /opt/monitor/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 10s      # Scrape targets every 10 seconds.
  scrape_timeout: 5s        # Give up on a scrape after 5 seconds.
  evaluation_interval: 10s  # Evaluate rules every 10 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'zqa_monitor'

# Load and evaluate rules in these files every 'evaluation_interval'.
rule_files:
  - 'node_alert.rules'
  - 'mysql_alert.rules'
  - 'docker_container.rules'
  # - "first.rules"
  # - "second.rules"

# alert
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "alertmanager:9093"

# Scrape configurations. The first job scrapes Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  #- job_name: 'cadvisor'
  #  # Override the global default and scrape targets from this job every 5 seconds.
  #  scrape_interval: 5s
  #  dns_sd_configs:
  #    - names:
  #        - 'tasks.cadvisor'
  #      type: 'A'
  #      port: 8080
  #  static_configs:
  #    - targets: ['10.33.70.218:8080']

  - job_name: 'node-exporter'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    # 9182 is the windows_exporter port; Linux hosts are discovered through Consul.
    static_configs:
      - targets: ['10.100.10.100:9182']
    consul_sd_configs:
      - server: '10.33.70.203:8500'
        services: ['node-exporter-dev']

  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.218:9104', '10.33.70.166:9104', '10.33.70.224:9104']

  - job_name: 'postgres-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.190.129:9187']

  - job_name: 'vsphere-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.33.70.22:9272']

  - job_name: 'es-exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['123.57.216.51:9114']

  - job_name: 'pushgateway'
    scrape_interval: 30s
    honor_labels: true
    static_configs:
      - targets: ['39.104.94.83:19091']
        labels:
          instance: pushgateway

  - job_name: "cadvisor"
    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ["47.93.21.11:8080"]

  #- job_name: 'kafka-exporter'
  #  scrape_interval: 5s
  #  static_configs:
  #    - targets: ['10.100.7.1:9308']

  #- job_name: 'pushgateway'
  #  scrape_interval: 10s
  #  dns_sd_configs:
  #    - names:
  #        - 'tasks.pushgateway'
  #      type: 'A'
  #      port: 9091
  #  static_configs:
  #    - targets: ['node-exporter:9100']

node_alert.rules

groups:
- name: zqaalert
  rules:
  - alert: 机器宕机  # host down
    expr: up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."

  - alert: 负载率  # load average
    expr: node_load1 > 8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} under high load"
      description: "{{ $labels.instance }} of job {{ $labels.job }} is under high load."

  - alert: 可用内存小于5%  # available memory below 5%
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory alert (< 5% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: 磁盘使用率  # disk usage
    expr: (100 - ((node_filesystem_avail_bytes{device!~'rootfs'} * 100) / node_filesystem_size_bytes{device!~'rootfs'}) > 90)
    for: 5m
    labels:
      severity: High
    annotations:
      summary: "{{$labels.instance}}: High Disk usage detected"
      description: "{{$labels.instance}}: disk usage is above 90% (current value: {{ $value }})"

  - alert: Cpu使用率  # CPU usage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 95
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Cpu usage detected"
      description: "{{$labels.instance}}: CPU usage is above 95% (current value: {{ $value }})"

  # - alert: 进程恢复  # process restarted
  #   expr: ceil(time() - max by(instance, groupname) (namedprocess_namegroup_oldest_start_time_seconds)) < 60
  #   for: 0s
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: "Process restarted"
  #     description: "Process {{ $labels.groupname }} restarted {{ $value }} seconds ago"

  - alert: 进程退出告警  # process exited
    # expr: max by(instance, groupname) (rate(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor*|^lizhu_agent*|^lizhurunner*"}[5m])) < 0
    expr: namedprocess_namegroup_num_procs{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"} == 0
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Process exited"
      description: "Process {{ $labels.groupname }} has exited"

  # - alert: 进程退出告警  # process exited
  #   expr: max_over_time(namedprocess_namegroup_oldest_start_time_seconds{groupname=~"^vsftpd.*|^proxy.*|^goproxy.*|^lizhu_monitor.*|^lizhu_agent.*|^lizhurunner.*"}[1d]) < (time() - 10*60)
  #   for: 1s
  #   labels:
  #     severity: warning
  #   annotations:
  #     description: A process in group {{ $labels.groupname }} exited within the last 10 minutes
  #     summary: Process exited

  # - alert: 机器硬盘读取速率  # host disk read rate
  #   expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 200
  #   for: 5m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk read rate (instance {{ $labels.instance }})
  #     description: "Disk is probably reading too much data (> 200 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # - alert: 机器硬盘写入速率  # host disk write rate
  #   expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 120
  #   for: 2m
  #   labels:
  #     severity: warning
  #   annotations:
  #     summary: Host unusual disk write rate (instance {{ $labels.instance }})
  #     description: "Disk is probably writing too much data VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostOomKillDetected
    expr: increase(node_vmstat_oom_kill[1m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: Esxi主机连接丢失  # ESXi host connection lost
    expr: vmware_host_power_state != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "ESXi host {{ $labels.host_name }} lost connection"
      description: "VMware host {{ $labels.host_name }} is not connected to the virtualization platform."
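The memory, disk, and CPU rules in node_alert.rules are plain percentage arithmetic over node exporter metrics. As a rough illustration, here is the same arithmetic in Python with made-up sample values. The function names are hypothetical, and real PromQL evaluates these expressions per time series rather than on single numbers.

```python
def mem_available_pct(mem_available_bytes, mem_total_bytes):
    """node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100"""
    return mem_available_bytes / mem_total_bytes * 100


def disk_used_pct(avail_bytes, size_bytes):
    """100 - (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes"""
    return 100 - (avail_bytes * 100) / size_bytes


def cpu_used_pct(idle_rates):
    """100 - avg(rate(node_cpu_seconds_total{mode="idle"})) * 100

    idle_rates holds one value per CPU core: the fraction of each second
    that core spent idle over the rate window.
    """
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


# Made-up sample values, chosen so that every rule's condition is true.
mem = mem_available_pct(4, 100)      # 4% of memory still available
disk = disk_used_pct(5, 100)         # 5% of the filesystem still free
cpu = cpu_used_pct([0.02, 0.04])     # two cores, almost never idle

print("memory rule condition:", mem < 5)
print("disk rule condition:", disk > 90)
print("cpu rule condition:", cpu > 95)
```

A true condition alone does not fire the alert: it has to stay true for the rule's for: duration (10m, 5m, and 10m respectively in the rules above).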

docker_container.rules

groups:
- name: zqaalert
  rules:
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "No containers, instance: {{ $labels.instance }}"
      description: "No containers seen for 5 minutes, current value: {{ $value }}"

  - alert: ContainerCpuUsage
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage alert, instance: {{ $labels.instance }}"
      description: "Container CPU usage is above 300%, current value: {{ $value }}"

  - alert: ContainerMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container memory usage alert, instance: {{ $labels.instance }}"
      description: "Container memory usage is above 80%, current value: {{ $value }}"

  - alert: ContainerVolumeIOUsage
    expr: (sum(container_fs_io_current{name!=""}) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container storage IO usage alert, instance: {{ $labels.instance }}"
      description: "Container storage IO usage is above 80%, current value: {{ $value }}"

  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container throttling alert, instance: {{ $labels.instance }}"
      description: "Container is being throttled, current value: {{ $value }}"
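The 300% threshold in ContainerCpuUsage can look odd at first: the expression sums CPU seconds consumed per second across all cores, so 100% corresponds to one fully used core and 300% to three. The sketch below works through that arithmetic and the memory ratio in Python with invented counter samples. The function names are hypothetical, and cpu_percent only approximates rate(), which also handles counter resets and extrapolation.

```python
def cpu_percent(counter_start, counter_end, window_seconds):
    """Approximate rate(container_cpu_usage_seconds_total[window]) * 100.

    The counter is cumulative CPU seconds, so its increase divided by the
    wall-clock window gives the number of cores kept busy on average.
    """
    return (counter_end - counter_start) / window_seconds * 100


def memory_percent(working_set_bytes, limit_bytes):
    """working set / memory limit * 100, or None for containers without a limit.

    A limit of 0 means "no limit"; the rule filters those out with
    container_spec_memory_limit_bytes > 0.
    """
    if limit_bytes <= 0:
        return None
    return working_set_bytes / limit_bytes * 100


# Invented samples: 630 CPU seconds consumed during a 3 minute window,
# and a 900 MiB working set against a 1 GiB limit.
cpu = cpu_percent(1000.0, 1630.0, 180)
mem = memory_percent(900 * 2**20, 2**30)

print("cpu %:", cpu, "condition:", cpu > 300)
print("mem %:", mem, "condition:", mem > 80)
```

On a host with fewer than four cores the 300% threshold may never be reachable, so it is worth adjusting to the size of the machines being monitored.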

mysql_alert.rules

groups:
- name: zqaalert
  rules:
  - alert: Mysql 宕机  # MySQL down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: MySQL down (instance {{ $labels.instance }})
      description: "MySQL instance is down on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: MysqlTooManyConnections(>80%)
    expr: max_over_time(mysql_global_status_threads_connected[1m]) / mysql_global_variables_max_connections * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL too many connections (> 80%) (instance {{ $labels.instance }})
      description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: MysqlHighThreadsRunning
    expr: max_over_time(mysql_global_status_threads_running[1m]) / mysql_global_variables_max_connections * 100 > 60
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: MySQL high threads running (instance {{ $labels.instance }})
      description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: Mysql慢查询  # MySQL slow queries
    expr: increase(mysql_global_status_slow_queries[1m]) > 0
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: MySQL slow queries (instance {{ $labels.instance }})
      description: "MySQL server has some new slow queries.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

vmware.rules

groups:
- name: VMware Host Connection State
  rules:
  - alert: HostDisconnected
    # PromQL cannot compare a metric with a string, so this uses the same
    # numeric check as the ESXi rule in node_alert.rules.
    expr: vmware_host_power_state != 1
    for: 5m  # the condition must hold for 5 minutes before the alert fires
    labels:
      severity: warning
    annotations:
      summary: "VMware host {{ $labels.instance }} disconnected"
      description: "VMware host {{ $labels.instance }} is not connected to the virtualization platform."

Alertmanager alerting configuration

Alerts are grouped, and notifications are sent per group of machines.

cat /opt/monitor/alertmanager/config.yml

global:
  resolve_timeout: 5m
  smtp_from: 'ops@xxx.com'
  smtp_smarthost: 'smtp.feishu.cn:465'
  smtp_auth_username: 'ops@xxx.com'
  smtp_auth_password: 'ydWhsFDk3pF50TZg'
  smtp_require_tls: false
  smtp_hello: 'ZQA监控告警'
route:
  group_by: ['zqaalert']
  group_wait: 60s       # How long to wait after the first alert of a new group fires, so that other alerts in the same group can be batched into one notification.
  group_interval: 10m   # How long to wait before notifying about new alerts added to a group that has already been notified.
  repeat_interval: 60m  # How long to wait before re-sending a notification for alerts in a group that are still firing.
  receiver: 'web.hook'
receivers:
#- name: 'web.hook.prometheusalert'
- name: 'web.hook'
  webhook_configs:
  - url: 'http://10.33.70.22:9094/prometheusalert?type=fs&tpl=prometheus-fs&fsurl=https://open.feishu.cn/open-apis/bot/v2/hook/7fe7f42d-242b-42eb-837c-028cfc84adb8'
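To get a feel for how group_wait, group_interval, and repeat_interval interact, the sketch below models a single alert group that starts firing at time 0 and never changes. It is a deliberately simplified model written for this article, not Alertmanager's actual implementation: it ignores resolved notifications, new alerts joining the group, silences, and inhibition. The function name notification_times is hypothetical.

```python
def notification_times(group_wait, group_interval, repeat_interval, horizon):
    """Return the seconds (since the alert first fired) at which a
    notification would be sent, up to and including `horizon`.

    Simplified model: the group is evaluated first after group_wait and
    then every group_interval; a notification goes out if none has been
    sent yet or if the last one is at least repeat_interval old.
    """
    times = []
    last_sent = None
    tick = group_wait
    while tick <= horizon:
        if last_sent is None or tick - last_sent >= repeat_interval:
            times.append(tick)
            last_sent = tick
        tick += group_interval
    return times


# Values from the configuration above, in seconds: 60s, 10m, 60m.
print(notification_times(60, 600, 3600, horizon=8000))
```

Under this model the first notification arrives one minute after the alert fires, and a reminder follows once per hour for as long as it keeps firing.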

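Consul service registration

The following crontab entry registers the host's node exporter with Consul. It takes the host's first global IPv4 address that matches 10.33 and PUTs a service definition named node-exporter-dev, with an HTTP health check on port 9100, to the Consul agent API. Note that the schedule * */1 * * * runs every minute, and that gensub requires GNU awk.

* */1 * * * ip addr | awk '/^[0-9]+: / {}; /inet.*global/ {print gensub(/(.*)\/(.*)/, "\\1", "g", $2)}' |grep "10.33"|head -1|xargs -i curl -X PUT -d  '{"id": "node-exporter-{}","name": "node-exporter-dev","address": "{}","port": 9100,"tags": ["env-dev"],"checks": [{"http": "http://{}:9100/metrics", "interval": "5s"}]}'  http://consul.intra.xxx.net/v1/agent/service/register

A ready-made Consul container image is available; just run it.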
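The JSON body in the crontab one-liner is easier to read and adjust when built programmatically. The following Python sketch builds the same payload and shows how it could be sent with the standard library. The helper names build_registration and register are hypothetical, the IP address is only an example, and the Consul URL is the placeholder used in the crontab entry above.

```python
import json
import urllib.request

# Placeholder agent endpoint, copied from the crontab entry.
CONSUL_REGISTER_URL = "http://consul.intra.xxx.net/v1/agent/service/register"


def build_registration(ip, port=9100, name="node-exporter-dev", tags=("env-dev",)):
    """Build the service definition that the crontab one-liner sends."""
    return {
        "id": f"node-exporter-{ip}",
        "name": name,
        "address": ip,
        "port": port,
        "tags": list(tags),
        "checks": [{"http": f"http://{ip}:{port}/metrics", "interval": "5s"}],
    }


def register(ip, url=CONSUL_REGISTER_URL):
    """PUT the service definition to the Consul agent (performs a network call)."""
    body = json.dumps(build_registration(ip)).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Only print the payload here; call register("10.33.70.218") on a host
# that can actually reach the Consul agent.
print(json.dumps(build_registration("10.33.70.218"), indent=2))
```

Because the service name is node-exporter-dev, hosts registered this way are picked up by the consul_sd_configs entry of the node-exporter job in prometheus.yml.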

Dashboard JSON files

Below are the Grafana dashboard files that I find most useful:

Grafana dashboards | Grafana Labs
