Prometheus + Grafana

Friday, August 2, 2024 (edited)

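If you have bought several servers, you need a monitoring program to keep an eye on how they are running.

I started out with probe-style tools such as Nezha Monitoring (哪吒监控) and ServerStatus.

However, Nezha Monitoring frequently runs into bugs of one kind or another, and ServerStatus depends on individual maintainers, several of whose versions are no longer updated.

So I went looking for an open-source tool to build on instead.

Grafana + Prometheus + node_exporter is a very good solution for monitoring server status:

  • node_exporter runs on each client, collects system data, and serves it in a fixed text format over HTTP
  • Prometheus periodically scrapes the data from the clients and stores it as time series
  • Grafana reads the data from Prometheus and displays the time series as charts and other visualizations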
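
To make "a fixed text format over HTTP" concrete, here is a minimal sketch of reading that format in Python. The sample text is hand-written in the style of node_exporter output, and the parser is a simplification of my own (it ignores timestamps and escaped quotes), not the parser Prometheus itself uses.

```python
import re

# Hand-written sample in the style of what node_exporter serves at /metrics.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_network_receive_bytes_total{device="eth0"} 123456
"""

# metric_name, optional {labels}, then the value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # anything this simplified parser cannot read is ignored
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == '__main__':
    for sample in parse_metrics(SAMPLE):
        print(sample)
```

Prometheus stores each such (name, labels) pair as its own time series, and Grafana selects series by those same labels.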

Result

Install

apt update
apt install ufw
ufw allow 22
ufw enable

(Clients) node_exporter

  • Latest Release · prometheus/node_exporter

    Replace the version in the commands below with the latest release

  • Run the following on each client

    wget https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz
    tar -xzvf node_exporter-1.8.2.linux-amd64.tar.gz
    sudo mv node_exporter-1.8.2.linux-amd64/node_exporter /usr/local/bin
    rm node_exporter-*.tar.gz
    rm -r node_exporter-*.linux-amd64*
    sudo useradd -rs /bin/false node_exporter
    vim /etc/systemd/system/node_exporter.service

    [Unit]
    Description=node_exporter
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    Restart=on-failure
    RestartSec=5s
    ExecStart=/usr/local/bin/node_exporter
    
    [Install]
    WantedBy=multi-user.target

    sudo systemctl daemon-reload
    sudo systemctl enable --now node_exporter
    sudo systemctl status node_exporter
  • The exported metrics are now served at <Client IP>:9100/metrics; allow only the Prometheus server to reach that port

    ufw allow from <Server IP> to any port 9100 comment 'node_exporter'
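
The version number appears several times in the download commands, so it is easy to miss one when switching to a newer release. A small sketch, assuming the release URLs keep the pattern used above; `release_url` is a helper name of my own:

```python
def release_url(project, version, os_name="linux", arch="amd64"):
    """Build the GitHub release tarball URL for a project under github.com/prometheus."""
    archive = f"{project}-{version}.{os_name}-{arch}.tar.gz"
    return (f"https://github.com/prometheus/{project}/releases/download/"
            f"v{version}/{archive}")


if __name__ == '__main__':
    print(release_url("node_exporter", "1.8.2"))
    print(release_url("prometheus", "2.53.1"))
```

Only the `version` argument needs to change when a new release comes out.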

(Server) Prometheus

  • Prometheus website: Prometheus

  • Prometheus latest release: Latest Release · prometheus/prometheus

    Replace the version in the commands below with the latest release

    wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz
    tar -xzvf prometheus-2.53.1.linux-amd64.tar.gz
    cd prometheus-2.53.1.linux-amd64
    sudo mv prometheus promtool /usr/local/bin/
    sudo mkdir -p /etc/prometheus /var/lib/prometheus
    sudo mv prometheus.yml /etc/prometheus/prometheus.yml
    sudo mv consoles/ console_libraries/ /etc/prometheus/
    cd ..
    rm -r prometheus-2.53.1.linux-amd64.tar.gz
    rm -r prometheus-2.53.1.linux-amd64
    sudo useradd -rs /bin/false prometheus
    sudo chown -R prometheus: /etc/prometheus /var/lib/prometheus
    vim /etc/systemd/system/prometheus.service

    [Unit]
    Description=Prometheus
    Wants=network-online.target
    After=network-online.target
    
    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    Restart=on-failure
    RestartSec=5s
    ExecStart=/usr/local/bin/prometheus \
        --config.file /etc/prometheus/prometheus.yml \
        --storage.tsdb.path /var/lib/prometheus/ \
        --web.console.templates=/etc/prometheus/consoles \
        --web.console.libraries=/etc/prometheus/console_libraries \
        --web.listen-address=0.0.0.0:9090 \
        --web.enable-lifecycle \
        --log.level=info
    
    [Install]
    WantedBy=multi-user.target

    sudo systemctl daemon-reload
    sudo systemctl enable --now prometheus
    sudo systemctl status prometheus
  • The Prometheus web UI is now reachable at http://<Server IP>:9090

    ufw allow 9090 comment 'prometheus'
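
The unit file above starts Prometheus with --web.enable-lifecycle. To my knowledge that flag enables an HTTP endpoint, POST /-/reload, which makes Prometheus re-read its configuration without a full restart. A minimal sketch, assuming Prometheus listens on localhost:9090 as configured above; the function names are my own:

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request that asks Prometheus to reload its config."""
    url = f"{base_url.rstrip('/')}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == '__main__':
    request = build_reload_request()
    print(request.get_method(), request.full_url)
```

Calling `reload_prometheus()` on the server after editing prometheus.yml is an alternative to restarting the service; an invalid configuration should come back as an HTTP error rather than stopping the running instance.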

(Server) Add Clients to Server

  • Each time a client is added, update the Prometheus configuration as follows

    vim /etc/prometheus/prometheus.yml

    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
          - targets: ["localhost:9090"]
    
        # Add the following
      - job_name: "remote_collector"
        scrape_interval: 1m
        static_configs:
          - targets: ["<Client 1 IP>:9100", "<Client 2 IP>:9100"]
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 1 Name>'
            regex: '<Client 1 IP>:9100'
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 2 Name>'
            regex: '<Client 2 IP>:9100'

    systemctl restart prometheus
  • Open <Server IP>:9090 and go to Status, Targets; all clients should be listed
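
The relabel_configs entries above give each target a readable instance name. Here is a rough Python model of what one such rule does, as far as I understand it: the default action is replace, and the regex has to match the whole address. The addresses and names are made-up stand-ins for the <Client N IP> and <Client N Name> placeholders, and I escape the dots here, which the YAML above does not.

```python
import re

# Hypothetical values standing in for the placeholders in prometheus.yml.
RELABEL_RULES = [
    {"regex": r"192\.0\.2\.10:9100", "replacement": "tokyo-1"},
    {"regex": r"192\.0\.2\.11:9100", "replacement": "frankfurt-1"},
]


def relabel_instance(address, rules):
    """Model `source_labels: [__address__]` + `target_label: instance` with action replace."""
    instance = address  # instance falls back to the target address if no rule sets it
    for rule in rules:
        # The regex is anchored at both ends, hence fullmatch instead of search.
        if re.fullmatch(rule["regex"], address):
            instance = rule["replacement"]
    return instance


if __name__ == '__main__':
    for target in ("192.0.2.10:9100", "192.0.2.11:9100", "192.0.2.99:9100"):
        print(target, "->", relabel_instance(target, RELABEL_RULES))
```

A target that matches no rule keeps its IP:port as the instance label, which is an easy way to spot a client that was added to targets but not to relabel_configs.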

(Server) Grafana

sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install grafana
systemctl daemon-reload
systemctl enable --now grafana-server.service
systemctl status grafana-server
ufw allow 3000 comment 'grafana'
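  • Grafana is now reachable at <Server IP>:3000
  1. Open <Server IP>:3000 in a browser. The initial username and password are both admin; change the password after logging in
  2. Click the three-line menu icon in the top-left corner, expand Connections, and click Data sources
  3. Click Add data source, then Prometheus
    • URL: http://localhost:9090
  4. Click Save & test
  5. Click the three-line menu icon in the top-left corner, then Dashboards, New, Import
  6. Enter a dashboard ID such as 1860, click Load, select Prometheus as the data source at the bottom, and click Import
  7. Done. The dashboard can now be viewed at <Server IP>:3000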
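
Steps 2 to 4 can also be scripted. Grafana has an HTTP API for data sources (POST /api/datasources), and the sketch below builds that request with basic auth. The address, credentials, and function name are placeholders of my own, and I have only sketched the request here, not tested it against every Grafana version.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password,
                             prometheus_url="http://localhost:9090"):
    """Build (but do not send) a request that registers Prometheus as a data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend queries Prometheus, not the browser
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == '__main__':
    request = build_datasource_request("http://localhost:3000", "admin", "changeme")
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would actually send it.
```

For a single server the web UI is quicker; a script like this mostly pays off when Grafana is reinstalled often.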

Traffic Statistics: vnstat

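Grafana + Prometheus + node_exporter monitors the data reported by the clients in real time and works well for all kinds of live metrics.

For tasks that need totals over a period of time, such as traffic accounting, it is much less effective, and the figures can differ noticeably from the real values.

The reason is that it only records the value at each point in time; unlike a database, it cannot update the traffic total of each period from the data the clients send.

So I use vnstat to do the traffic accounting and export the results to Prometheus, so that they can be shown on the Grafana dashboard.

I looked around but could not find a vnstat exporter, so I hacked one together myself. The result is a bit rough and is offered for reference only.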
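
To illustrate why totals derived from point-in-time samples can drift from the real figure, here is a simplified model of summing a byte counter from scraped samples. It is my own approximation of the idea, not Prometheus's actual increase() implementation (which, among other things, also extrapolates), and the numbers are invented.

```python
def increase(samples):
    """Sum the growth between consecutive samples of a rising counter.

    A drop is treated as a counter reset (for example after a reboot), so the
    new value is counted from zero. Traffic between the last sample before the
    reset and the reset itself was never scraped and is therefore lost.
    """
    total = 0
    for previous, current in zip(samples, samples[1:]):
        total += (current - previous) if current >= previous else current
    return total


# Byte counter scraped once a minute; the machine reboots after the third sample.
scraped = [1_000, 4_000, 9_000, 500, 2_500]

if __name__ == '__main__':
    print(increase(scraped))
```

In this toy series the estimate is 10500 bytes, and anything transferred between the 9000-byte sample and the reboot is missing. vnstat keeps its own per-period totals in a local database, so it does not depend on when the scrapes happen.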

(Clients) vnstat

  1. Install vnstat

    apt install vnstat
    systemctl enable vnstat
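    vnstat starts collecting traffic statistics as soon as it is installed and updates every five minutes; if there is no data yet, wait a little
    • View traffic by hour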

      root@localhost:~# vnstat -h
      
       eth0  /  hourly
      
               hour        rx      |     tx      |    total    |   avg. rate
           ------------------------+-------------+-------------+---------------
           2024-08-01
               21:00      9.07 MiB |   22.31 MiB |   31.39 MiB |   73.13 kbit/s
               22:00      9.36 MiB |   22.78 MiB |   32.14 MiB |   74.90 kbit/s
               23:00     10.91 MiB |   44.68 MiB |   55.59 MiB |  129.53 kbit/s
           2024-08-02
               00:00    118.04 MiB |    6.00 GiB |    6.11 GiB |   14.59 Mbit/s
               01:00    124.24 GiB |    7.18 GiB |  131.42 GiB |  313.57 Mbit/s
               02:00     45.43 GiB |    8.37 GiB |   53.80 GiB |  128.36 Mbit/s
           ------------------------+-------------+-------------+---------------
      Traffic can also be viewed per 5 minutes `-5`, per day `-d`, per month `-m`, or per year `-y`, and exported as JSON with `--json`
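  2. Create the vnstat_exporter.py script

    I only managed to get a Python script working with prometheus_client; it is fairly resource-hungry, and I may rewrite it as a shell script when I get the chance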

    apt install python3-pip
    pip install prometheus-client
    vim /usr/local/bin/vnstat_exporter.py

    from prometheus_client import start_http_server, Gauge
    import subprocess
    import json
    import time
    import argparse
    import re
    
    # Define metrics
    traffic_gauge = Gauge('vnstat_traffic', 'Traffic usage from vnstat',
                           ['interface', 'time_unit', 'type', 'direction'])
    
    available_traffic_gauge = Gauge('available_traffic', 'Available traffic',
                                      ['available_traffic_cycle', 'available_traffic_direction'])
    
    def convert_to_bytes(traffic_str):
        """
        Converts a traffic string (e.g. '2TB', '500GB', '250MB') to bytes.
        """
        unit_multipliers = {
            'B': 1,
            'KB': 1024,
            'MB': 1024**2,
            'GB': 1024**3,
            'TB': 1024**4,
            'PB': 1024**5,  # the regex below accepts 'PB', so it needs a multiplier too
        }
    
        # Match the number and the unit
        match = re.match(r'(\d+(?:\.\d+)?)\s*([KMGTP]?B)', traffic_str.strip())
        if match:
            value = float(match.group(1))
            unit = match.group(2)
            return value * unit_multipliers[unit]
        else:
            raise ValueError(f"Invalid traffic string: {traffic_str}")
    
    def parse_vnstat_output(output):
        """
        Parses the vnstat JSON output and updates Prometheus metrics.
        """
        data = json.loads(output)
    
        for interface in data.get('interfaces', []):
            iface_name = interface.get('name', 'unknown')
            traffic = interface.get('traffic', {})
    
            # Process traffic data for each time unit
    
            # Process 5-minute data if available
            for entry in traffic.get('fiveminute', []):
                timestamp = f"{entry['date']['year']}-{entry['date']['month']:02d}-{entry['date']['day']:02d} {entry['time']['hour']:02d}:{entry['time']['minute']:02d}"
                rx = entry.get('rx', 0)
                tx = entry.get('tx', 0)
                total = rx + tx
                print(f"5-min data: {iface_name}, {timestamp}, rx={rx}, tx={tx}, total={total}")
                traffic_gauge.labels(interface=iface_name, time_unit='five_minute', type='total', direction='in').set(rx)
                traffic_gauge.labels(interface=iface_name, time_unit='five_minute', type='total', direction='out').set(tx)
                traffic_gauge.labels(interface=iface_name, time_unit='five_minute', type='total', direction='total').set(total)
    
            # Process hourly data if available
            for entry in traffic.get('hour', []):
                timestamp = f"{entry['date']['year']}-{entry['date']['month']:02d}-{entry['date']['day']:02d} {entry['time']['hour']:02d}:00"
                rx = entry.get('rx', 0)
                tx = entry.get('tx', 0)
                total = rx + tx
                print(f"Hour data: {iface_name}, {timestamp}, rx={rx}, tx={tx}, total={total}")
                traffic_gauge.labels(interface=iface_name, time_unit='hour', type='total', direction='in').set(rx)
                traffic_gauge.labels(interface=iface_name, time_unit='hour', type='total', direction='out').set(tx)
                traffic_gauge.labels(interface=iface_name, time_unit='hour', type='total', direction='total').set(total)
    
            # Process daily data if available
            for entry in traffic.get('day', []):
                date = f"{entry['date']['year']}-{entry['date']['month']:02d}-{entry['date']['day']:02d}"
                rx = entry.get('rx', 0)
                tx = entry.get('tx', 0)
                total = rx + tx
                print(f"Day data: {iface_name}, {date}, rx={rx}, tx={tx}, total={total}")
                traffic_gauge.labels(interface=iface_name, time_unit='day', type='total', direction='in').set(rx)
                traffic_gauge.labels(interface=iface_name, time_unit='day', type='total', direction='out').set(tx)
                traffic_gauge.labels(interface=iface_name, time_unit='day', type='total', direction='total').set(total)
    
            # Process monthly data if available
            for entry in traffic.get('month', []):
                date = f"{entry['date']['year']}-{entry['date']['month']:02d}"
                rx = entry.get('rx', 0)
                tx = entry.get('tx', 0)
                total = rx + tx
                print(f"Month data: {iface_name}, {date}, rx={rx}, tx={tx}, total={total}")
                traffic_gauge.labels(interface=iface_name, time_unit='month', type='total', direction='in').set(rx)
                traffic_gauge.labels(interface=iface_name, time_unit='month', type='total', direction='out').set(tx)
                traffic_gauge.labels(interface=iface_name, time_unit='month', type='total', direction='total').set(total)
    
            # Process yearly data if available
            for entry in traffic.get('year', []):
                date = f"{entry['date']['year']}"
                rx = entry.get('rx', 0)
                tx = entry.get('tx', 0)
                total = rx + tx
                print(f"Year data: {iface_name}, {date}, rx={rx}, tx={tx}, total={total}")
                traffic_gauge.labels(interface=iface_name, time_unit='year', type='total', direction='in').set(rx)
                traffic_gauge.labels(interface=iface_name, time_unit='year', type='total', direction='out').set(tx)
                traffic_gauge.labels(interface=iface_name, time_unit='year', type='total', direction='total').set(total)
    
    def update_metrics(available_traffic_cycle, available_traffic_direction, available_traffic):
        """
        Fetches vnstat data and updates Prometheus metrics.
        """
        try:
            output = subprocess.check_output(['vnstat', '--json'], text=True)
            print("Raw vnstat JSON output:")
            print(output)  # Print the raw JSON data for inspection
            parse_vnstat_output(output)
    
            # Check if available traffic is unlimited
            if available_traffic == '0':
                # Set available traffic to infinity or a very high value
                available_traffic_bytes = float('inf')  # unlimited traffic
            else:
                # Convert available traffic to bytes
                available_traffic_bytes = convert_to_bytes(available_traffic)
    
            # Set available traffic gauge
            available_traffic_gauge.labels(available_traffic_cycle=available_traffic_cycle, available_traffic_direction=available_traffic_direction).set(available_traffic_bytes)
        except subprocess.CalledProcessError as e:
            print(f"Error fetching vnstat data: {e}")
            print(f"Command output: {e.output}")
    
    if __name__ == '__main__':
        # Argument parsing
        parser = argparse.ArgumentParser(description='vnstat exporter for Prometheus')
        parser.add_argument('--available_traffic_cycle', required=True, help='Cycle for available traffic (e.g. monthly)')
        parser.add_argument('--available_traffic_direction', required=True, help='Direction for available traffic (e.g. total)')
        parser.add_argument('--available_traffic', required=True, help='Amount of available traffic (e.g. 2TB or 0 for unlimited)')
    
        args = parser.parse_args()
    
        # Start Prometheus metrics server
        start_http_server(9112)
        while True:
            update_metrics(args.available_traffic_cycle, args.available_traffic_direction, args.available_traffic)
            time.sleep(60)  # Update every 60 seconds
  3. Create the vnstat_exporter service

    vim /etc/systemd/system/vnstat_exporter.service

    [Unit]
    Description=vnstat exporter
    
    [Service]
    ExecStart=/usr/bin/python3 /usr/local/bin/vnstat_exporter.py \
      --available_traffic_cycle "Monthly" \
      --available_traffic_direction "In/Out" \
      --available_traffic "2TB"
    WorkingDirectory=/root
    Restart=always
    User=root
    
    [Install]
    WantedBy=multi-user.target
    • Adjust available_traffic_cycle, available_traffic_direction, and available_traffic to match your plan
    • Setting available_traffic to 0 means unlimited

      systemctl daemon-reload
      systemctl enable --now vnstat_exporter
  4. The output can now be viewed at <Client IP>:9112/metrics; if nothing is there yet, wait five minutes

  5. Allow the port through the firewall

    ufw allow from <Server IP> to any port 9112 comment 'vnstat_exporter'
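
One thing worth knowing about the script above: for each time unit it loops over every entry and calls .set() on the same label combination, so each gauge ends up holding the value of the last entry in the list. The sketch below reads only that last entry and uses nothing outside the standard library. The sample JSON is hand-written in the shape the script reads, and I am assuming vnstat lists entries oldest first.

```python
import json

# Trimmed, hand-written sample in the shape the exporter reads from `vnstat --json`.
SAMPLE = json.dumps({
    "interfaces": [{
        "name": "eth0",
        "traffic": {
            "day": [
                {"date": {"year": 2024, "month": 8, "day": 1}, "rx": 100, "tx": 50},
                {"date": {"year": 2024, "month": 8, "day": 2}, "rx": 300, "tx": 200},
            ],
            "month": [
                {"date": {"year": 2024, "month": 8}, "rx": 400, "tx": 250},
            ],
        },
    }]
})


def latest_traffic(vnstat_json, units=("fiveminute", "hour", "day", "month", "year")):
    """Return {(interface, unit): {"rx", "tx", "total"}} for the newest entry of each unit."""
    result = {}
    for interface in json.loads(vnstat_json).get("interfaces", []):
        name = interface.get("name", "unknown")
        traffic = interface.get("traffic", {})
        for unit in units:
            entries = traffic.get(unit, [])
            if not entries:
                continue  # this unit has no data yet
            newest = entries[-1]  # assumption: entries are ordered oldest first
            rx, tx = newest.get("rx", 0), newest.get("tx", 0)
            result[(name, unit)] = {"rx": rx, "tx": tx, "total": rx + tx}
    return result


if __name__ == '__main__':
    for key, value in latest_traffic(SAMPLE).items():
        print(key, value)
```

Feeding these values into the gauges once per unit would give the same metrics with far fewer .set() calls and no per-entry printing, which may help with the resource usage mentioned earlier.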

(Server) Add Clients to Server

  1. Edit the Prometheus configuration

    vim /etc/prometheus/prometheus.yml

    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
          - targets: ["localhost:9090"]
      - job_name: "remote_collector"
        scrape_interval: 1m
        static_configs:
          - targets: ["<Client 1 IP>:9100", "<Client 2 IP>:9100"]
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 1 Name>'
            regex: '<Client 1 IP>:9100'
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 2 Name>'
            regex: '<Client 2 IP>:9100'
    
        # Add the following
      - job_name: 'vnstat_exporter'
        scrape_interval: 1m
        static_configs:
          - targets: ["<Client 1 IP>:9112", "<Client 2 IP>:9112"]
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 1 Name>'
            regex: '<Client 1 IP>:9112'
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 2 Name>'
            regex: '<Client 2 IP>:9112'
  2. Add a Grafana variable

    On the dashboard, click the gear icon, then Variables, + New variable

    • Name: Traffic_Unit

    • Label: Traffic Unit

    • Query type: Label values

    • Label: time_unit

    • Metric: vnstat_traffic

      Click Apply

  3. Configure the Grafana panels

    Click Add, Visualization, switch the query editor to Code, and enter

    • Outbound traffic

      vnstat_traffic{time_unit="$Traffic_Unit",type="total",direction="out",instance="$node"}
    • Inbound traffic

      vnstat_traffic{time_unit="$Traffic_Unit",type="total",direction="in",instance="$node"}
    • Traffic in both directions

      vnstat_traffic{time_unit="$Traffic_Unit",type="total",direction="total",instance="$node"}
    • Available traffic

      available_traffic{instance="$node"}
    • Traffic direction

      available_traffic{instance="$node"}

      • Under Options below the query, set Legend to Custom and enter {{available_traffic_direction}}
      • On the right, search for Text mode and select Name
    • Traffic cycle

      available_traffic{instance="$node"}

      • Under Options below the query, set Legend to Custom and enter {{available_traffic_cycle}}
      • On the right, search for Text mode and select Name
    • In the panel options on the right, under Standard options, set Unit to bytes(IEC)
  4. The host can now be selected under Host at the top, and the accounting period under Traffic Unit
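
On the bytes(IEC) unit: the exporter converts "2TB" with 1024-based multipliers and vnstat prints MiB and GiB, so a 1024-based display unit keeps the panels consistent with both. A rough sketch of what that formatting amounts to; Grafana's own rounding and number of decimals will differ, and `format_iec` is a name of my own:

```python
def format_iec(num_bytes):
    """Format a byte count with 1024-based (IEC) prefixes."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


if __name__ == '__main__':
    print(format_iec(2 * 1024**4))  # the "2TB" quota from the service file
    print(format_iec(1536))
```

With a decimal (SI) unit the same 2 * 1024**4 bytes would show up as roughly 2.2 TB, which is easy to mistake for a different quota.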

Server Renewal Information

(Clients) server_exporter

vim /usr/local/bin/server_exporter.py

from prometheus_client import start_http_server, Gauge
import time
import argparse

# Define Prometheus metrics
renewal_date_gauge = Gauge('renewal_date', 'Renewal date of the service (timestamp)',
                            ['renewal_cycle'])
renewal_price_gauge = Gauge('renewal_price', 'Renewal price of the service',
                             ['renewal_currency'])

def update_metrics(renewal_date, renewal_cycle, renewal_price, renewal_currency):
    """
    Update Prometheus metrics
    """
    # Convert renewal date to timestamp
    try:
        # Here we assume renewal_date is a valid date string, e.g., '2024-12-31'
        timestamp = time.mktime(time.strptime(renewal_date, '%Y-%m-%d'))
        renewal_date_gauge.labels(renewal_cycle=renewal_cycle).set(timestamp)
    except ValueError as e:
        print(f"Invalid renewal date format: {renewal_date}. Error: {e}")

    # Update renewal price metric
    try:
        renewal_price_value = float(renewal_price)  # Ensure price is a number
        renewal_price_gauge.labels(renewal_currency=renewal_currency).set(renewal_price_value)
    except ValueError as e:
        print(f"Invalid renewal price format: {renewal_price}. Error: {e}")

if __name__ == '__main__':
    # Argument parsing
    parser = argparse.ArgumentParser(description='Renewal information exporter for Prometheus')
    parser.add_argument('--renewal_date', required=True, help='Renewal date of the service (e.g. YYYY-MM-DD)')
    parser.add_argument('--renewal_cycle', required=True, help='Renewal cycle (e.g. monthly, yearly)')
    parser.add_argument('--renewal_price', required=True, help='Renewal price (e.g. 29.99)')
    parser.add_argument('--renewal_currency', required=True, help='Currency for the renewal price (e.g. USD, EUR)')

    args = parser.parse_args()

    # Start Prometheus metrics server
    start_http_server(9113)  # Use a different port to avoid conflicts with vnstat_exporter
    while True:
        update_metrics(args.renewal_date, args.renewal_cycle, args.renewal_price, args.renewal_currency)
        time.sleep(60*60*24)  # Update once a day

vim /etc/systemd/system/server_exporter.service

[Unit]
Description=server exporter

[Service]
ExecStart=/usr/bin/python3 /usr/local/bin/server_exporter.py \
  --renewal_date "2024-12-31" \
  --renewal_cycle "Annually" \
  --renewal_price "12.34" \
  --renewal_currency "USD"
WorkingDirectory=/root
Restart=always
User=root

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable --now server_exporter
ufw allow from <Server IP> to any port 9113 comment 'server_exporter'

(Server) Add Clients to Server

  1. Edit the Prometheus configuration

    vim /etc/prometheus/prometheus.yml

    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: "prometheus"
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
          - targets: ["localhost:9090"]
      - job_name: "remote_collector"
        scrape_interval: 1m
        static_configs:
          - targets: ["<Client 1 IP>:9100", "<Client 2 IP>:9100"]
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 1 Name>'
            regex: '<Client 1 IP>:9100'
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 2 Name>'
            regex: '<Client 2 IP>:9100'
    
      - job_name: 'vnstat_exporter'
        scrape_interval: 1m
        static_configs:
          - targets: ["<Client 1 IP>:9112", "<Client 2 IP>:9112"]
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 1 Name>'
            regex: '<Client 1 IP>:9112'
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 2 Name>'
            regex: '<Client 2 IP>:9112'
    
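      # Add the following
      - job_name: 'server_exporter'
        # Kept short on purpose: Prometheus only looks back 5 minutes by default,
        # so a 1d interval would leave the panels empty most of the time.
        scrape_interval: 1m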
        static_configs:
          - targets: ["<Client 1 IP>:9113", "<Client 2 IP>:9113"]
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 1 Name>'
            regex: '<Client 1 IP>:9113'
          - source_labels: [__address__]
            target_label: instance
            replacement: '<Client 2 Name>'
            regex: '<Client 2 IP>:9113'
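  2. Add panels in Grafana

    • Follow the same settings as the available-traffic panel

    • Renewal date

      renewal_date{instance=~"$host"} * 1000
      • Under Options below the query, set Legend to Custom and enter {{renewal_cycle}}
      • On the right, remove the panel title
    • Renewal price

      renewal_price{instance=~"$host"}
      • Under Options below the query, set Legend to Custom and enter {{renewal_currency}}
      • On the right, remove the panel title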
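
About the `* 1000` in the renewal date query: the exporter publishes the date as Unix time in seconds, while Grafana's date/time units expect milliseconds, as far as I can tell. A small sketch of the conversion. Note that it treats the date as UTC via calendar.timegm so the result does not depend on the machine's timezone, whereas server_exporter.py above uses time.mktime (local time); the function name is my own.

```python
import calendar
import time


def renewal_epoch_ms(renewal_date):
    """Convert 'YYYY-MM-DD' to milliseconds since the Unix epoch, treating the date as UTC."""
    seconds = calendar.timegm(time.strptime(renewal_date, "%Y-%m-%d"))
    return seconds * 1000


if __name__ == '__main__':
    print(renewal_epoch_ms("2024-12-31"))
```

For 2024-12-31 this gives 1735603200000, which a date/time unit in Grafana should render as midnight UTC on that day.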
