Building a Cloud-Native Observability Stack in Practice

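1. Introduction

Observability is a core capability of cloud-native architectures: it lets developers understand a system's internal state and locate problems quickly. This article covers the core concepts of cloud-native observability, the choice of technology stack, working configurations, and best practices.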

2. Core Concepts of Observability

2.1 The Three Pillars of Observability

    graph TD
        A[Observability] --> B[Metrics]
        A --> C[Logs]
        A --> D[Tracing]
        B --> E[Prometheus]
        B --> F[Grafana]
        C --> G[ELK]
        C --> H[Loki]
        D --> I[Jaeger]
        D --> J[SkyWalking]

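2.2 Comparing the Three Pillars

| Pillar | Purpose | Tools | Storage type |
| --- | --- | --- | --- |
| Metrics | Metric monitoring | Prometheus, Grafana | Time-series database |
| Logs | Log management | ELK, Loki | Full-text index (Elasticsearch) or label index (Loki) |
| Tracing | Distributed tracing | Jaeger, SkyWalking | Distributed tracing backend |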

3. Metrics Monitoring

3.1 Prometheus Configuration

    global:
      scrape_interval: 15s      # how often targets are scraped
      evaluation_interval: 15s  # how often rules are evaluated

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    scrape_configs:
      # Prometheus scraping itself
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']

      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # bearer_token_file belongs to the scrape config, not to tls_config
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          # Copy every Kubernetes node label onto the scraped series
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Scrape only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Honor a custom metrics path from the prometheus.io/path annotation
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
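The `keep` relabel action in the kubernetes-pods job can be hard to picture from the YAML alone. The following is a minimal Python sketch of the filtering idea, not Prometheus code: the function name `keep_targets` and the sample pod names are hypothetical, and the only behavior it models is that a target survives when its source label fully matches the regex.

```python
import re

SCRAPE_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def keep_targets(targets, source_label, regex):
    """Keep only targets whose source label value fully matches the regex."""
    pattern = re.compile(regex)
    # A missing label is treated as an empty string, which "true" does not match.
    return [t for t in targets if pattern.fullmatch(t.get(source_label, ""))]


# Hypothetical discovered pods; only the first carries the opt-in annotation.
pods = [
    {"__meta_kubernetes_pod_name": "order-7d9f", SCRAPE_LABEL: "true"},
    {"__meta_kubernetes_pod_name": "batch-1", SCRAPE_LABEL: "false"},
    {"__meta_kubernetes_pod_name": "legacy-2"},
]

kept = keep_targets(pods, SCRAPE_LABEL, "true")
kept_names = [p["__meta_kubernetes_pod_name"] for p in kept]
print(kept_names)
```

Pods without the annotation are dropped along with those that set it to a value other than `true`, which is why the annotation works as an opt-in switch.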

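3.2 Collecting Custom Metrics

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    // @Configuration is the conventional home for @Bean methods
    @Configuration
    public class CustomMetrics {

        private final MeterRegistry meterRegistry;

        public CustomMetrics(MeterRegistry meterRegistry) {
            this.meterRegistry = meterRegistry;
        }

        // Request latency, with client-side P50/P90/P99 percentiles
        @Bean
        public Timer requestTimer() {
            return Timer.builder("http.request.duration")
                    .description("HTTP request duration")
                    .tags("service", "order-service")
                    .publishPercentiles(0.5, 0.9, 0.99)
                    .register(meterRegistry);
        }

        // Monotonically increasing request count
        @Bean
        public Counter requestCounter() {
            return Counter.builder("http.request.count")
                    .description("HTTP request count")
                    .tags("service", "order-service")
                    .register(meterRegistry);
        }

        // Current value, sampled each time the registry is scraped
        @Bean
        public Gauge activeConnections() {
            return Gauge.builder("http.active.connections", () -> getActiveConnections())
                    .description("Active HTTP connections")
                    .register(meterRegistry);
        }

        private int getActiveConnections() {
            // Placeholder: return the current number of active connections
            return 0;
        }
    }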
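To see why the timer publishes the 0.5, 0.9 and 0.99 percentiles rather than only an average, here is a small Python sketch using the nearest-rank method on made-up latency samples. It illustrates the concept only; it is not how Micrometer computes percentiles internally, and the sample values are hypothetical.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least a fraction q at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request durations in milliseconds, with two slow outliers.
durations_ms = [12, 15, 11, 240, 14, 13, 18, 16, 950, 17]

mean = sum(durations_ms) / len(durations_ms)
p50 = percentile(durations_ms, 0.5)
p90 = percentile(durations_ms, 0.9)
p99 = percentile(durations_ms, 0.99)
print(mean, p50, p90, p99)
```

The mean lands far above what a typical request experiences, while P50 shows the typical case and P99 exposes the slow tail.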

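3.3 Grafana Dashboard Configuration

The queries below assume the metric names that Micrometer's Prometheus registry derives from the meters in 3.2: the counter http.request.count is exposed as http_request_count_total, and the timer's percentiles appear as http_request_duration_seconds with a quantile label. The error-rate panel also assumes a status tag holding the HTTP status code, which the counter in 3.2 does not define. The panel types use the current timeseries and stat panels in place of the deprecated graph and singlestat.

    {
      "dashboard": {
        "id": null,
        "title": "Service Health Dashboard",
        "panels": [
          {
            "id": 1,
            "title": "Request Rate",
            "type": "timeseries",
            "targets": [
              {
                "expr": "sum by (service) (rate(http_request_count_total[5m]))",
                "legendFormat": "{{service}}"
              }
            ]
          },
          {
            "id": 2,
            "title": "Response Time",
            "type": "timeseries",
            "targets": [
              {
                "expr": "avg(http_request_duration_seconds{quantile=\"0.5\"})",
                "legendFormat": "P50"
              },
              {
                "expr": "avg(http_request_duration_seconds{quantile=\"0.9\"})",
                "legendFormat": "P90"
              },
              {
                "expr": "avg(http_request_duration_seconds{quantile=\"0.99\"})",
                "legendFormat": "P99"
              }
            ]
          },
          {
            "id": 3,
            "title": "Error Rate",
            "type": "stat",
            "targets": [
              {
                "expr": "sum(rate(http_request_count_total{status=~\"5..\"}[5m])) / sum(rate(http_request_count_total[5m])) * 100"
              }
            ]
          }
        ]
      }
    }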
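The error-rate panel divides one per-second rate by another. A rough Python sketch of that arithmetic, using hypothetical counter samples taken five minutes apart, is shown below. It deliberately ignores counter resets and the extrapolation that PromQL's `rate()` performs, so treat it as an illustration of the ratio, not of Prometheus internals.

```python
def rate(samples):
    """Per-second increase between the first and last (timestamp_seconds, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter readings at the start and end of a 5-minute window.
all_requests = [(0, 1000), (300, 4000)]
server_errors = [(0, 10), (300, 160)]

error_percent = 100 * rate(server_errors) / rate(all_requests)
print(error_percent)
```

Because both counters only ever increase, the ratio stays meaningful regardless of how long the service has been running; only the growth inside the window matters.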

4. Log Management

4.1 Loki Configuration

    auth_enabled: false

    server:
      http_listen_port: 3100
      grpc_listen_port: 9096

    common:
      instance_addr: 127.0.0.1
      path_prefix: /tmp/loki
      storage:
        # Local filesystem storage; suitable for a single-node test setup
        filesystem:
          chunks_directory: /tmp/loki/chunks
          rules_directory: /tmp/loki/rules
      replication_factor: 1
      ring:
        instance_addr: 127.0.0.1
        kvstore:
          store: inmemory

    schema_config:
      configs:
        - from: 2024-01-01
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h

    ruler:
      alertmanager_url: http://localhost:9093

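4.2 Promtail Configuration

    server:
      http_listen_port: 9080
      grpc_listen_port: 0

    positions:
      # Promtail records how far it has read each file here
      filename: /tmp/positions.yaml

    clients:
      - url: http://localhost:3100/loki/api/v1/push

    scrape_configs:
      - job_name: system
        static_configs:
          - targets:
              - localhost
            labels:
              job: varlogs
              __path__: /var/log/*log

      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_name]
            action: replace
            target_label: container
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          # Added: discovered pods need a __path__ label or Promtail has no files to tail.
          # The path assumes the default kubelet log layout under /var/log/pods.
          - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
            separator: /
            target_label: __path__
            replacement: /var/log/pods/*$1/*.log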
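The `clients.url` above points at Loki's push endpoint. As a rough illustration of what a client sends there, the Python sketch below builds a request body in the streams/values shape that the endpoint accepts. The helper name `build_push_payload` and the sample labels and lines are hypothetical, and the sketch only constructs the JSON; it does not send anything.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a push body with one stream and one entry per log line."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = build_push_payload(
    {"job": "varlogs", "namespace": "default"},
    ["order created", "payment accepted"],
    now_ns=1_700_000_000_000_000_000,  # fixed timestamp so the output is reproducible
)
print(json.dumps(payload, indent=2))
```

The labels identify the stream, which is why the relabel rules in the Promtail configuration matter: they decide which labels (container, pod, namespace) each stream carries.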

4.3 ELK Configuration

    version: '3.8'

    services:
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
        environment:
          - discovery.type=single-node
          - ES_JAVA_OPTS=-Xms512m -Xmx512m
          # Added for local testing only: 8.x enables security by default,
          # which would otherwise require credentials for Logstash and Kibana
          - xpack.security.enabled=false
        ports:
          - "9200:9200"
        volumes:
          - es_data:/usr/share/elasticsearch/data

      logstash:
        image: docker.elastic.co/logstash/logstash:8.10.0
        volumes:
          - ./logstash/config:/usr/share/logstash/config
          - ./logstash/pipeline:/usr/share/logstash/pipeline
        ports:
          - "5000:5000"
        depends_on:
          - elasticsearch

      kibana:
        image: docker.elastic.co/kibana/kibana:8.10.0
        ports:
          - "5601:5601"
        depends_on:
          - elasticsearch

    volumes:
      es_data:

5. Distributed Tracing

5.1 Jaeger Configuration

    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: jaeger
    spec:
      # All components in a single pod; intended for development and testing
      strategy: allInOne
      allInOne:
        image: jaegertracing/all-in-one:1.49
        options:
          query:
            base-path: /jaeger
      ingress:
        enabled: true
        hosts:
          - jaeger.example.com

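5.2 SkyWalking Configuration

The layout below is a simplified illustration. The application.yml shipped with the OAP server chooses each module's provider through a selector key, so verify the key names against the version you deploy.

    server:
      port: 12800
      servlet:
        context-path: /

    spring:
      application:
        name: skywalking-oap-server

    storage:
      type: elasticsearch
      elasticsearch:
        clusterNodes: localhost:9200
        indexShardsNumber: 2
        indexReplicasNumber: 1

    receiver:
      otlp:
        protocols:
          grpc:
            port: 4317
          http:
            port: 4318
      zipkin:
        host: 0.0.0.0
        port: 9411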

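5.3 OpenTelemetry Configuration

    receivers:
      otlp:
        protocols:
          grpc:
          http:

    exporters:
      # Traces are forwarded to Jaeger over OTLP/gRPC
      otlp:
        endpoint: "jaeger:4317"
        tls:
          insecure: true
      # The prometheus exporter listens on this address for Prometheus to scrape;
      # it does not push to the Prometheus server. Port 8889 is an assumed value.
      prometheus:
        endpoint: "0.0.0.0:8889"

    processors:
      batch:

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]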
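Both pipelines above pass data through the `batch` processor before exporting. The Python sketch below shows the basic idea of batching under simplified assumptions: the class name `BatchBuffer` and the batch size of 3 are hypothetical, and the real collector processor also flushes on a timeout, which is represented here only by the final explicit `flush()` call.

```python
class BatchBuffer:
    """Collect items and hand them to `export` in groups of `max_size`."""

    def __init__(self, export, max_size=3):
        self.export = export
        self.max_size = max_size
        self.items = []

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.max_size:
            self.flush()

    def flush(self):
        # Export a copy so that clearing the buffer does not affect the sent batch.
        if self.items:
            self.export(list(self.items))
            self.items.clear()


sent = []  # stands in for the exporter: every call receives one batch
buffer = BatchBuffer(sent.append, max_size=3)
for span_id in range(7):
    buffer.add({"span_id": span_id})
buffer.flush()  # send the remainder, as a timeout would

batch_sizes = [len(batch) for batch in sent]
print(batch_sizes)
```

Seven spans leave the buffer in three export calls instead of seven, which is the reason batching reduces the number of outgoing requests.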

6. Alerting and Notification

6.1 Prometheus Alerting Rules

    groups:
      - name: service-alerts
        rules:
          - alert: HighCPUUsage
            # rate() of CPU seconds is measured in cores, so 0.8 means 0.8 cores;
            # it corresponds to 80% only when containers are limited to one core
            expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High CPU usage detected"
              description: "CPU usage is above 80% for more than 5 minutes"

          - alert: HighMemoryUsage
            # cAdvisor exposes the limit as container_spec_memory_limit_bytes;
            # the "> 0" filter skips containers that have no limit set
            expr: avg(container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)) > 0.85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High Memory usage detected"
              description: "Memory usage is above 85% for more than 5 minutes"

          - alert: ServiceDown
            expr: up == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Service is down"
              # The up metric carries job and instance labels, not a service label
              description: "{{ $labels.job }} target {{ $labels.instance }} is not responding"

          - alert: HighErrorRate
            # Assumes a status label holding the HTTP status code
            expr: sum(rate(http_request_count_total{status=~"5.."}[5m])) / sum(rate(http_request_count_total[5m])) > 0.05
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is above 5% for more than 3 minutes"
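The `for:` clause in each rule delays firing until the condition has held continuously for the stated duration. The Python sketch below models that behavior under simplifying assumptions: one value per evaluation, and the duration expressed as a number of evaluation intervals (`for_steps`, a hypothetical parameter). It is an illustration of the pending-to-firing transition, not Prometheus's implementation.

```python
def alert_states(values, threshold, for_steps):
    """Return 'inactive', 'pending' or 'firing' for each evaluation."""
    states = []
    breached = 0  # consecutive evaluations above the threshold
    for value in values:
        if value > threshold:
            breached += 1
            # Firing requires the condition to have held for for_steps full intervals
            # after the first breach, so one more evaluation than for_steps.
            states.append("firing" if breached > for_steps else "pending")
        else:
            breached = 0
            states.append("inactive")
    return states


# Hypothetical error ratios, evaluated once per minute, with `for: 3m`.
ratios = [0.02, 0.08, 0.09, 0.03, 0.07, 0.08, 0.09, 0.10]
states = alert_states(ratios, threshold=0.05, for_steps=3)
print(states)
```

The short spike at the second and third evaluations never fires because the ratio recovers before three minutes pass, which is exactly the flapping that `for:` is meant to filter out.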

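6.2 Alertmanager Configuration

    global:
      resolve_timeout: 5m

    route:
      group_by: ['alertname']
      group_wait: 10s       # wait before sending the first notification for a new group
      group_interval: 10s   # wait before notifying about new alerts in an existing group
      repeat_interval: 1h   # wait before resending a notification that is still firing
      receiver: 'slack'

    receivers:
      - name: 'slack'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
            channel: '#alerts'
            send_resolved: true
            title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'

    inhibit_rules:
      # A firing critical alert mutes warnings with the same alertname and service.
      # source_matchers/target_matchers replace the deprecated source_match/target_match.
      - source_matchers:
          - severity="critical"
        target_matchers:
          - severity="warning"
        equal: ['alertname', 'service']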
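The inhibit rule above can be read as a three-part test: the muted alert matches the target side, some firing alert matches the source side, and both agree on every label listed under `equal`. The Python sketch below spells out that logic for a single rule. The function name `is_inhibited` and the sample alerts are hypothetical, and this is a simplified model, not Alertmanager's code.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` is muted by any firing alert under one inhibit rule."""

    def matches(labels, wanted):
        return all(labels.get(key) == value for key, value in wanted.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # A missing label is treated like an empty value when comparing.
        if all(source.get(key, "") == target.get(key, "") for key in equal):
            return True
    return False


critical = {"alertname": "HighErrorRate", "service": "order-service", "severity": "critical"}
same_service = {"alertname": "HighErrorRate", "service": "order-service", "severity": "warning"}
other_service = {"alertname": "HighErrorRate", "service": "payment-service", "severity": "warning"}
firing = [critical, same_service, other_service]

rule = dict(
    source_match={"severity": "critical"},
    target_match={"severity": "warning"},
    equal=["alertname", "service"],
)
print(is_inhibited(same_service, firing, **rule))
print(is_inhibited(other_service, firing, **rule))
```

Note that because `alertname` is in the `equal` list, a critical alert only mutes warnings that share its alert name; a critical HighCPUUsage alert would not mute a HighMemoryUsage warning.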

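7. Observability Best Practices

7.1 Metric Design Principles

- [ ] Use a consistent naming convention
- [ ] Add the labels you need (service, instance, endpoint)
- [ ] Avoid high-cardinality labels
- [ ] Set a reasonable scrape interval
- [ ] Record latency and size distributions as histograms instead of relying on counters and averages alone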
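The advice to avoid high-cardinality labels follows from simple multiplication: in the worst case, one metric produces a time series for every combination of label values. The Python sketch below computes that upper bound for a hypothetical label set; the label names and value counts are made up for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical, bounded label set.
bounded = {
    "service": ["order", "payment"],
    "endpoint": ["/orders", "/pay", "/health"],
    "status": ["2xx", "4xx", "5xx"],
}
bounded_series = series_count(bounded)

# Adding a per-user label multiplies the bound by the number of users.
with_user = dict(bounded, user_id=range(100_000))
unbounded_series = series_count(with_user)
print(bounded_series, unbounded_series)
```

A single unbounded label such as a user ID turns 18 possible series into 1.8 million, which is why identifiers like these belong in logs or traces instead of metric labels.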

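7.2 Log Design Principles

- [ ] Use a structured log format (JSON)
- [ ] Include trace_id and span_id
- [ ] Include a timestamp and a level
- [ ] Mask sensitive information
- [ ] Choose log levels deliberately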
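Several of the principles above can be combined in one log formatter. The following Python sketch uses only the standard `logging` module to emit JSON lines with a timestamp, level, trace_id and span_id, and to mask part of any 11-digit number in the message. The class name `JsonFormatter`, the masking pattern, and the sample IDs are assumptions made for illustration; real masking rules depend on what counts as sensitive in your system.

```python
import json
import logging
import re
from datetime import datetime, timezone

# Hypothetical masking rule: hide the middle four digits of 11-digit numbers.
PHONE = re.compile(r"\b(\d{3})\d{4}(\d{4})\b")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": PHONE.sub(r"\1****\2", record.getMessage()),
            # trace_id and span_id arrive through the `extra` argument of the log call.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


logger = logging.getLogger("order-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "order placed, contact 13812345678",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
)
```

With trace_id present in every line, a log query can be narrowed to a single request, and a trace view can link back to the matching log lines.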

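7.3 Tracing Design Principles

- [ ] Inject trace context into every service call
- [ ] Set a reasonable sampling rate
- [ ] Add custom span tags
- [ ] Correlate traces with logs and metrics
- [ ] Define a trace retention policy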
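Context injection and sampling can be sketched together using the W3C `traceparent` header format (`version-traceid-spanid-flags`). The Python code below is a simplified illustration: the function names are hypothetical, the sampling rule is a basic trace-ID-ratio check and not the OpenTelemetry SDK's implementation, and in practice an instrumentation library would handle all of this for you.

```python
import secrets


def should_sample(trace_id, ratio):
    """Ratio sampling keyed on the trace ID, so every service reaches the same decision."""
    # Compare the low 64 bits of the trace ID against a threshold.
    return int(trace_id[-16:], 16) < ratio * 2**64


def new_traceparent(ratio=0.1):
    """Start a trace: random 16-byte trace ID, random 8-byte span ID, sampled flag."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if should_sample(trace_id, ratio) else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent):
    """Keep the trace ID and sampling flag, replace the span ID for the outgoing call."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing_headers = {"traceparent": child_traceparent(incoming)}
print(incoming)
print(outgoing_headers)
```

Because the decision depends only on the trace ID, a downstream service that applies the same ratio keeps or drops the same traces, so sampled traces stay complete across services.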

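7.4 Observability Checklist

Metrics monitoring:
- [ ] Configure Prometheus scraping
- [ ] Create Grafana dashboards
- [ ] Define alerting rules
- [ ] Configure Alertmanager

Log management:
- [ ] Configure log collection (Loki/ELK)
- [ ] Use a structured log format
- [ ] Configure a log retention policy
- [ ] Set log levels

Distributed tracing:
- [ ] Configure a tracing system (Jaeger/SkyWalking)
- [ ] Inject trace context
- [ ] Set the sampling rate
- [ ] Correlate traces with logs and metrics

Alerting and notification:
- [ ] Set reasonable alert thresholds
- [ ] Configure notification channels
- [ ] Define alert inhibition rules
- [ ] Configure alert grouping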

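8. Summary

Observability is a core capability of cloud-native systems. Its three pillars, metrics, logs, and distributed tracing, together give a complete view of how a system is running. A stack built from well-configured tools such as Prometheus, Grafana, Loki, and Jaeger makes it possible to locate problems quickly, tune performance, and improve reliability.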

