Building a Cloud-Native Observability Stack in Practice
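1. Introduction
Observability is a core capability of cloud-native architectures: it lets developers understand a system's internal state and locate problems quickly. This article covers the core concepts of cloud-native observability, the choice of technology stack, hands-on configuration, and best practices.
2. Core Concepts of Observability
2.1 The Three Pillars of Observability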
```mermaid
graph TD
    A[Observability] --> B[Metrics]
    A --> C[Logs]
    A --> D[Tracing]
    B --> E[Prometheus]
    B --> F[Grafana]
    C --> G[ELK]
    C --> H[Loki]
    D --> I[Jaeger]
    D --> J[SkyWalking]
```
2.2 Comparison of the Three Pillars
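| Pillar | Purpose | Tools | Storage type |
|---|---|---|---|
| Metrics | Metric monitoring | Prometheus (collection and storage), Grafana (visualization) | Time-series database |
| Logs | Log management | ELK, Loki | Full-text index (Elasticsearch); label index over log chunks (Loki) |
| Tracing | Distributed tracing | Jaeger, SkyWalking | Trace and span store |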
3. Metrics Monitoring
3.1 Prometheus Configuration
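```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus scrapes itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    # bearer_token_file is a job-level field, not part of tls_config
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Keep only the API server endpoints; without this filter the job
      # would try to scrape every endpoint in the cluster over HTTPS
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path from the prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```
3.2 Custom Metrics Collection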
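```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// @Configuration (rather than @Component) so that the @Bean methods are
// processed as regular bean definitions.
@Configuration
public class CustomMetrics {

    private final MeterRegistry meterRegistry;

    public CustomMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    // Exposed to Prometheus as http_request_duration_seconds{quantile="..."}
    // plus the _count, _sum and _max series.
    @Bean
    public Timer requestTimer() {
        return Timer.builder("http.request.duration")
                .description("HTTP request duration")
                .tags("service", "order-service")
                .publishPercentiles(0.5, 0.9, 0.99)
                .register(meterRegistry);
    }

    // Exposed to Prometheus as http_request_count_total. The dashboard and
    // alert queries in this article filter on a status label, which would
    // have to be added as a tag when the counter is incremented.
    @Bean
    public Counter requestCounter() {
        return Counter.builder("http.request.count")
                .description("HTTP request count")
                .tags("service", "order-service")
                .register(meterRegistry);
    }

    @Bean
    public Gauge activeConnections() {
        return Gauge.builder("http.active.connections", () -> getActiveConnections())
                .description("Active HTTP connections")
                .register(meterRegistry);
    }

    private int getActiveConnections() {
        // Return the current number of active connections (placeholder)
        return 0;
    }
}
```
3.3 Grafana Dashboard Configuration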
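The panels in this subsection are built on `rate()` over a counter and on the ratio of two such rates. As a rough illustration of what those expressions compute, here is a minimal sketch; the class and method names are hypothetical, and the calculation is deliberately simpler than what Prometheus does (in particular it skips the extrapolation to the window edges).

```java
final class CounterRate {

    /**
     * Simplified per-second rate over a window of counter samples.
     * A drop in value is treated as a counter reset (for example a process
     * restart), so the post-reset value counts as the increase for that step.
     * Returns NaN when fewer than two samples are available.
     */
    static double perSecondRate(double[] timestampsSeconds, double[] values) {
        if (values.length < 2 || values.length != timestampsSeconds.length) {
            return Double.NaN;
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double prev = values[i - 1];
            double curr = values[i];
            increase += (curr >= prev) ? (curr - prev) : curr;
        }
        double elapsed = timestampsSeconds[timestampsSeconds.length - 1] - timestampsSeconds[0];
        return increase / elapsed;
    }

    /** Error percentage in the style of the "Error Rate" panel: errors / total * 100. */
    static double errorPercent(double errorRate, double totalRate) {
        return totalRate == 0.0 ? 0.0 : errorRate / totalRate * 100.0;
    }
}
```

Because a reset is detected only as a decrease between two consecutive samples, a counter that restarts and climbs past its previous value within one scrape interval is not recognized as a reset. That is one reason the scrape interval should be short relative to the query window.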
```json
{
  "dashboard": {
    "id": null,
    "title": "Service Health Dashboard",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum by (service) (rate(http_request_count_total[5m]))",
            "legendFormat": "{{service}}"
          }
        ]
      },
      {
        "id": 2,
        "title": "Response Time",
        "type": "timeseries",
        "targets": [
          {
            "expr": "avg(http_request_duration_seconds{quantile=\"0.5\"})",
            "legendFormat": "P50"
          },
          {
            "expr": "avg(http_request_duration_seconds{quantile=\"0.9\"})",
            "legendFormat": "P90"
          },
          {
            "expr": "avg(http_request_duration_seconds{quantile=\"0.99\"})",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_request_count_total{status=~\"5..\"}[5m])) / sum(rate(http_request_count_total[5m])) * 100"
          }
        ]
      }
    ]
  }
}
```
4. Log Management
4.1 Loki Configuration
```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  # /tmp is suitable for a local trial only; data is lost on cleanup
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093
```
4.2 Promtail Configuration
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log

  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      # Promtail tails files, so each discovered pod needs a __path__;
      # this assumes the usual /var/log/pods layout on the node
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```
4.3 ELK Configuration
```yaml
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.10.0
    environment:
      - discovery.type=single-node
      # Elasticsearch 8.x enables security by default; disabled here for
      # local development only so Logstash and Kibana can connect without
      # credentials
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"
    volumes:
      - es_data:/usr/share/elasticsearch/data

  logstash:
    image: docker.elastic.co/logstash/logstash:8.10.0
    volumes:
      - ./logstash/config:/usr/share/logstash/config
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.10.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  es_data:
```
5. Distributed Tracing
5.1 Jaeger Configuration
```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: allInOne
  allInOne:
    image: jaegertracing/all-in-one:1.49
    options:
      query:
        base-path: /jaeger
  ingress:
    enabled: true
    hosts:
      - jaeger.example.com
```
5.2 SkyWalking Configuration
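```yaml
# Schematic settings as given in the original article. The key names should
# be checked against the application.yml shipped with the OAP release in use,
# since the real file is organized by module and selector.
server:
  port: 12800
  servlet:
    context-path: /

spring:
  application:
    name: skywalking-oap-server

storage:
  type: elasticsearch
  elasticsearch:
    clusterNodes: localhost:9200
    indexShardsNumber: 2
    indexReplicasNumber: 1

receiver:
  otlp:
    protocols:
      grpc:
        port: 4317
      http:
        port: 4318
  zipkin:
    host: 0.0.0.0
    port: 9411
```
5.3 OpenTelemetry Configuration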
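```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: "jaeger:4317"
    tls:
      insecure: true
  prometheus:
    # The prometheus exporter opens a listener that Prometheus scrapes, so
    # this is a local listen address rather than the Prometheus server
    # address; port 8889 is an example and needs a matching scrape job
    endpoint: "0.0.0.0:8889"

processors:
  batch:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
6. Alerting and Notification
6.1 Prometheus Alerting Rules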
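Each rule in this subsection carries a `for` clause, which means the condition has to hold on every evaluation for that long before the alert moves from pending to firing. The following is a minimal sketch of that state machine, using hypothetical class and method names; it illustrates the semantics and is not Prometheus's implementation.

```java
final class AlertState {

    enum State { INACTIVE, PENDING, FIRING }

    private final long forSeconds;
    // Evaluation time at which the condition first held; -1 while inactive.
    private long activeSince = -1;

    AlertState(long forSeconds) {
        this.forSeconds = forSeconds;
    }

    /** Feeds one rule evaluation and returns the resulting alert state. */
    State evaluate(long nowSeconds, boolean conditionHolds) {
        if (!conditionHolds) {
            activeSince = -1; // a single false evaluation resets the timer
            return State.INACTIVE;
        }
        if (activeSince < 0) {
            activeSince = nowSeconds;
        }
        return (nowSeconds - activeSince >= forSeconds) ? State.FIRING : State.PENDING;
    }
}
```

With `for: 2m` and a 15-second evaluation interval, a target that flaps once per minute never leaves the pending state. This is the noise reduction the clause is meant to provide, at the cost of a two-minute delay on real outages.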
```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighCPUUsage
        # The value is in CPU cores, so 0.8 corresponds to 80% only for
        # containers limited to one core
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"

      - alert: HighMemoryUsage
        # cAdvisor exposes the limit as container_spec_memory_limit_bytes;
        # the "> 0" filter drops containers that have no limit set
        expr: avg(container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Memory usage detected"
          description: "Memory usage is above 85% for more than 5 minutes"

      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          # "up" carries job and instance labels; a service label exists
          # only if relabeling adds one
          description: "{{ $labels.job }} ({{ $labels.instance }}) is not responding"

      - alert: HighErrorRate
        # "5.." matches any 5xx status code; "5xx" would match only the literal text
        expr: sum(rate(http_request_count_total{status=~"5.."}[5m])) / sum(rate(http_request_count_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for more than 3 minutes"
```
6.2 Alertmanager Configuration
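The Alertmanager configuration in this subsection ends with an inhibit rule: a warning alert is muted while a critical alert with the same `alertname` and `service` is firing. The matching logic can be sketched roughly as follows. The class and method names are hypothetical, and the severities are hard-coded to mirror that one rule.

```java
import java.util.List;
import java.util.Map;

final class Inhibition {

    /**
     * Returns true if the target alert should be muted because at least one
     * firing alert is critical and agrees with it on every label in equalLabels.
     * A missing label is treated like an empty value, so a label absent from
     * both alerts counts as equal.
     */
    static boolean isInhibited(Map<String, String> target,
                               List<Map<String, String>> firing,
                               List<String> equalLabels) {
        if (!"warning".equals(target.get("severity"))) {
            return false; // the rule only targets warnings
        }
        for (Map<String, String> source : firing) {
            if (!"critical".equals(source.get("severity"))) {
                continue; // only critical alerts can act as a source
            }
            boolean allEqual = true;
            for (String label : equalLabels) {
                String a = source.getOrDefault(label, "");
                String b = target.getOrDefault(label, "");
                if (!a.equals(b)) {
                    allEqual = false;
                    break;
                }
            }
            if (allEqual) {
                return true;
            }
        }
        return false;
    }
}
```

Since `alertname` is in the `equal` list, the rule only takes effect when the same alert name is defined at both severities. Alerts with different names never inhibit each other under this rule.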
```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/XXX/XXX'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

inhibit_rules:
  # Mute warning alerts while a critical alert with the same alertname and
  # service is firing. The matcher syntax replaces the deprecated
  # source_match / target_match fields.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'service']
```
7. Observability Best Practices
7.1 Metric Design Principles
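One reason to record latency as a histogram is that bucket counts from several instances can be summed before a percentile is estimated, which is not possible with percentiles pre-computed on each instance. Below is a minimal sketch of that idea with hypothetical names; it uses linear interpolation within a bucket, in the style of PromQL's `histogram_quantile`, and is simplified rather than a faithful reimplementation.

```java
final class HistogramQuantile {

    /** Element-wise sum of cumulative bucket counts from two instances. */
    static long[] merge(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    /**
     * Estimates the q-quantile (0 <= q <= 1) from cumulative bucket counts.
     * upperBounds must be ascending and may end with Double.POSITIVE_INFINITY;
     * cumulativeCounts[i] is the number of observations <= upperBounds[i].
     */
    static double estimate(double q, double[] upperBounds, long[] cumulativeCounts) {
        long total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                if (Double.isInfinite(upperBounds[i])) {
                    // Quantile falls in the overflow bucket: report the highest finite bound.
                    return i == 0 ? Double.NaN : upperBounds[i - 1];
                }
                double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
                long below = (i == 0) ? 0 : cumulativeCounts[i - 1];
                long inBucket = cumulativeCounts[i] - below;
                if (inBucket == 0) {
                    return lower;
                }
                // Assume observations are spread evenly inside the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        return Double.NaN;
    }
}
```

The estimate can be no more precise than the bucket layout, so bucket boundaries should be placed around the latency targets that matter.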
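- [ ] Use a consistent naming convention
- [ ] Add the necessary labels (service, instance, endpoint)
- [ ] Avoid high-cardinality labels
- [ ] Set a reasonable scrape interval
- [ ] Use histograms, rather than plain counters or averages, for latency and size distributions

7.2 Log Design Principles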
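The checklist in this subsection asks for JSON-formatted log lines that carry a timestamp, a level, `trace_id` and `span_id`, with sensitive values masked. A minimal sketch of such a line, built with the standard library only, is given here. The class name, the set of sensitive field names and the masking policy (keep the last four characters) are all assumptions made for illustration; a real service would normally rely on its logging framework's JSON encoder.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class JsonLog {

    // Hypothetical field names treated as sensitive in this sketch.
    private static final Set<String> SENSITIVE =
            new HashSet<>(Arrays.asList("password", "card_number"));

    /** Hides all but the last four characters of a value. */
    static String mask(String value) {
        if (value.length() <= 4) {
            return "****";
        }
        return "****" + value.substring(value.length() - 4);
    }

    /** Escapes the characters that may not appear raw inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds one JSON log line; extra fields are appended after the standard ones. */
    static String line(Instant timestamp, String level, String traceId, String spanId,
                       String message, Map<String, String> extraFields) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", timestamp.toString());
        fields.put("level", level);
        fields.put("trace_id", traceId);
        fields.put("span_id", spanId);
        fields.put("message", message);
        for (Map.Entry<String, String> e : extraFields.entrySet()) {
            String v = SENSITIVE.contains(e.getKey()) ? mask(e.getValue()) : e.getValue();
            fields.put(e.getKey(), v);
        }
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }
}
```

Keeping `trace_id` as a top-level field with a fixed name is what lets a log backend link a log line to its trace.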
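- [ ] Use a structured log format (JSON)
- [ ] Include trace_id and span_id
- [ ] Include a timestamp and a log level
- [ ] Mask sensitive information
- [ ] Set log levels sensibly

7.3 Tracing Design Principles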
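Two items in this subsection, propagating trace context on every service call and choosing a sampling rate, can be illustrated with the W3C `traceparent` header (`00-<trace-id>-<parent-id>-<flags>`). The sketch below uses hypothetical class and method names. The ratio sampler is a simplified, deterministic scheme keyed on the trace ID, not the exact algorithm of any particular SDK, and in practice an OpenTelemetry SDK or agent handles both concerns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TraceParent {

    // Version 00: 32 hex chars of trace-id, 16 of parent-id, 2 of flags.
    private static final Pattern FORMAT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Parses an incoming traceparent header, rejecting malformed or all-zero IDs. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches() || m.group(1).matches("0+") || m.group(2).matches("0+")) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return new TraceParent(m.group(1), m.group(2), sampled);
    }

    /** Context for an outgoing call: same trace, new span ID, same sampling decision. */
    TraceParent childWithSpan(String childSpanId) {
        return new TraceParent(traceId, childSpanId, sampled);
    }

    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /**
     * Deterministic ratio sampling: every service that sees the same trace ID
     * reaches the same decision, so traces are kept or dropped as a whole.
     */
    static boolean sampleByRatio(String traceId, double ratio) {
        if (ratio >= 1.0) {
            return true;
        }
        if (ratio <= 0.0) {
            return false;
        }
        // Use the low 64 bits of the trace ID as a non-negative number.
        long low = Long.parseUnsignedLong(traceId.substring(16), 16) >>> 1;
        return low < (long) (ratio * Long.MAX_VALUE);
    }
}
```

Making the decision once at the edge and carrying it in the flags byte avoids traces in which some spans are recorded and others are missing.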
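- [ ] Inject trace context into every service call
- [ ] Set a reasonable sampling rate
- [ ] Add custom span tags
- [ ] Correlate traces with logs and metrics
- [ ] Define a trace retention policy

7.4 Observability Checklist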
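Metrics monitoring:
- [ ] Configure Prometheus scraping
- [ ] Create Grafana dashboards
- [ ] Define alerting rules
- [ ] Configure Alertmanager

Log management:
- [ ] Configure log collection (Loki/ELK)
- [ ] Use a structured log format
- [ ] Configure a log retention policy
- [ ] Set log levels

Distributed tracing:
- [ ] Configure a tracing system (Jaeger/SkyWalking)
- [ ] Inject trace context
- [ ] Set the sampling rate
- [ ] Correlate traces with logs and metrics

Alerting and notification:
- [ ] Set reasonable alert thresholds
- [ ] Configure notification channels
- [ ] Define alert inhibition rules
- [ ] Configure alert grouping

8. Summary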
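Observability is a core capability of cloud-native systems. Its three pillars, metrics monitoring, log management and distributed tracing, together give a full picture of how a system is running. With Prometheus, Grafana, Loki, Jaeger and related tools configured sensibly, a well-built observability stack makes it possible to locate problems quickly, tune performance and improve reliability.
References: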
- Prometheus Documentation: https://prometheus.io/docs/
- Grafana Documentation: https://grafana.com/docs/
- Loki Documentation: https://grafana.com/docs/loki/latest/
- Jaeger Documentation: https://www.jaegertracing.io/docs/
- SkyWalking Documentation: https://skywalking.apache.org/docs/
