
Prometheus: the all-seeing eye of infrastructure

Comprehensive infrastructure surveillance system

Potato Energy Team, ponfertato

Table of Contents

Monitoring Platform 📊
Alerting System 🚨

Monitoring Platform 📊

Purpose

24/7 monitoring of key indicators:

Service availability (HTTP/ICMP/DNS)
Resource utilization (CPU/RAM/Disk)
Abnormal activity
SLA compliance

Technical Implementation

Metrics collection: 20s interval
Storage: 30 days retention
Probes: Blackbox exporter for 8 types of checks
Exporters: Node, cAdvisor, ASF, HA
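
The settings above could be expressed in a Prometheus configuration roughly as follows. This is only a sketch, not the team's actual file: the hostnames, ports, job names, and the `http_2xx` module name are assumptions chosen for illustration, and only two of the listed exporters are shown.

```yaml
# prometheus.yml (illustrative sketch; hostnames and ports are placeholders)
global:
  scrape_interval: 20s        # the 20s collection interval stated above
  evaluation_interval: 20s    # assumption: rules are evaluated at the same cadence

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter.example.internal:9100"]

  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor.example.internal:8080"]

  # Blackbox probes: Prometheus scrapes the Blackbox exporter and passes
  # the real target to it as a URL parameter.
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]      # assumed module name defined in blackbox.yml
    static_configs:
      - targets: ["https://potatoenergy.ru"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "blackbox-exporter.example.internal:9115"

# Retention is not configured in this file; it is a server flag:
#   prometheus --storage.tsdb.retention.time=30d
```

Each additional probe type (ICMP, DNS, and so on) would typically get its own job with a different `module` value.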

Security and Access

Dashboard: potatoenergy.ru/prometheus (group dev)
Alerts: Discord/Telegram for critical incidents
Encryption: TLS for all exporters
Audit: Metric signing
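
As an illustration of TLS between Prometheus and an exporter, the sketch below shows both sides. The file paths and hostname are hypothetical, and it assumes the exporter supports the standard `--web.config.file` option (as the Node exporter does); other exporters may need a TLS-terminating proxy instead.

```yaml
# web-config.yml, passed to the exporter with --web.config.file=web-config.yml
# (paths are placeholders)
tls_server_config:
  cert_file: /etc/node_exporter/tls/node.crt
  key_file: /etc/node_exporter/tls/node.key
---
# Matching scrape job in prometheus.yml
scrape_configs:
  - job_name: node
    scheme: https
    tls_config:
      # CA that signed the exporter certificate
      ca_file: /etc/prometheus/tls/ca.crt
    static_configs:
      - targets: ["node-exporter.example.internal:9100"]
```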

Features

Automatic anomaly detection
Custom Grafana dashboards
Integration with 15+ data sources
Incident escalation system

Alerting System 🚨

Principles of Operation

200+ pre-defined rules for:
- Service availability
- Resource utilization thresholds
- Network traffic anomalies
- Application errors
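Multi-level routing:

route:
  receiver: grafana
  routes:
    # A route takes a single receiver, so critical alerts are matched twice;
    # continue: true lets the second route fire after the first.
    - matchers: ['severity="critical"']
      receiver: discord
      continue: true
    - matchers: ['severity="critical"']
      receiver: telegram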

Notifications for the dev group only:

Discord: Channel #infra-alerts
Telegram: Private channel with a bot
Escalation after 30 minutes without acknowledgement
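
To make the rule categories above concrete, here is a minimal sketch of two Prometheus alerting rules, one for service availability and one for a resource threshold. The alert names, thresholds, and `for` durations are assumptions for illustration and are not taken from the actual rule set; the `severity` label is what the routing configuration matches on.

```yaml
# rules.yml (illustrative; names and thresholds are assumptions)
groups:
  - name: availability
    rules:
      - alert: EndpointDown
        # probe_success is exported by the Blackbox exporter (1 = ok, 0 = failed)
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"

  - name: resources
    rules:
      - alert: HighCpuUsage
        # CPU busy percentage per instance, derived from Node exporter idle time
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 90% on {{ $labels.instance }}"
```

With labels like these, only the first rule would reach Discord and Telegram; the second would fall through to the default receiver.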

Why is this important?

Proactive detection of issues before they affect users
Single source of truth for incident analysis
Automated documentation via tags
Resource optimization through historical data

All non-critical alerts are handled during project business hours.