
Deployment Guide: Kubernetes Monitoring and Log Collection System

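1. Environment Preparation

1.1 Check the Kubernetes cluster status

# Check node status
kubectl get nodes -o wide

# Check control-plane component status
# (componentstatuses is deprecated since Kubernetes v1.19 and may report incomplete results)
kubectl get cs

# List the available storage classes
kubectl get storageclass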
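For scripted pre-flight checks, the node status output can be filtered rather than read by eye. The following is a minimal sketch, not part of the original procedure: `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (node name first, status second).

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Input: the output of `kubectl get nodes --no-headers` on stdin.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}
```

A typical call would be `kubectl get nodes --no-headers | not_ready_nodes`; empty output means every node reports plain `Ready`. A cordoned node shows `Ready,SchedulingDisabled` and is therefore listed as well, which is usually worth knowing before a rollout.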

1.2 Create the working directories

# Create the working directory and its subdirectories
mkdir -p k8s-monitoring
cd k8s-monitoring
mkdir -p manifests logs

2. Deploying the Resource Monitoring System

2.1 Create the monitoring namespace

kubectl create namespace monitoring

2.2 Prepare the monitoring configuration

2.2.1 Create the Prometheus Stack values file

cat > prometheus-values.yaml << 'EOF'
# Replace the following placeholders:
#   <INTERNAL_REGISTRY> - address of the internal image registry
#   <STORAGE_CLASS>     - name of the storage class to use
#   <GRAFANA_PASSWORD>  - Grafana administrator password
global:
  imageRegistry: "<INTERNAL_REGISTRY>"
  # The "regcred" pull secret must already exist in the monitoring namespace.
  imagePullSecrets: ["regcred"]

prometheus:
  prometheusSpec:
    retention: "10d"
    scrapeInterval: "30s"
    evaluationInterval: "30s"
    resources:
      requests:
        memory: "400Mi"
        cpu: "200m"
      limits:
        memory: "2Gi"
        cpu: "1000m"
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: "<STORAGE_CLASS>"
          resources:
            requests:
              storage: "50Gi"
    # The selector settings belong under prometheusSpec (not prometheusOperator).
    # Empty selectors with *NilUsesHelmValues set to false make Prometheus pick up
    # ServiceMonitors, PodMonitors and rules regardless of their Helm release labels.
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    podMonitorSelectorNilUsesHelmValues: false
    podMonitorSelector: {}
    ruleSelectorNilUsesHelmValues: false
    ruleSelector: {}

kube-state-metrics:
  resources:
    requests:
      memory: "32Mi"
      cpu: "10m"
    limits:
      memory: "128Mi"
      cpu: "100m"

# Values for the node exporter subchart go under its chart name.
prometheus-node-exporter:
  resources:
    requests:
      memory: "30Mi"
      cpu: "10m"
    limits:
      memory: "50Mi"
      cpu: "200m"

grafana:
  adminUser: "admin"
  adminPassword: "<GRAFANA_PASSWORD>"
  persistence:
    enabled: true
    size: "10Gi"
    storageClassName: "<STORAGE_CLASS>"

alertmanager:
  enabled: false
EOF

# Replace the placeholders with sed (or edit the file by hand).
# The values below are examples only; in particular, choose a strong Grafana password.
sed -i 's/<INTERNAL_REGISTRY>/registry.internal.company.com/g' prometheus-values.yaml
sed -i 's/<STORAGE_CLASS>/standard/g' prometheus-values.yaml
sed -i 's/<GRAFANA_PASSWORD>/admin123/g' prometheus-values.yaml
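Because the values file is only usable once every placeholder has been substituted, it can help to check for leftovers before installing. The sketch below is an illustration rather than part of the original procedure: `check_placeholders` is a hypothetical helper, and it assumes placeholders always look like `<UPPER_CASE_NAME>` and that placeholder names appearing only in comment lines can be ignored.

```shell
# Fail if any <UPPER_CASE> placeholder remains outside of comment lines.
# Usage: check_placeholders FILE
check_placeholders() {
    # grep -n prefixes each match with "N:"; the second grep drops comment lines.
    if grep -n '<[A-Z][A-Z_]*>' "$1" | grep -v '^[0-9]*:[[:space:]]*#'; then
        echo "unresolved placeholders in $1" >&2
        return 1
    fi
}
```

Running `check_placeholders prometheus-values.yaml` prints the offending lines with their line numbers and returns a non-zero status, so it can gate the install step in a script.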

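2.3 Install the Prometheus Stack

# 1. On a machine with internet access, add the chart repository and download the chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/kube-prometheus-stack --version 45.0.0

# 2. Transfer the chart package to the intranet environment
#    (the following assumes it is in the current directory)

# 3. Unpack and install
tar -xzf kube-prometheus-stack-45.0.0.tgz
helm install prometheus-stack ./kube-prometheus-stack \
  -n monitoring \
  -f prometheus-values.yaml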
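In an intranet environment, every image the chart references has to be available in the internal registry before the install can succeed. One way to get a first list is to render the chart locally and collect the `image:` fields. This is a rough sketch under assumptions: `list_images` is a hypothetical helper, it only sees images declared in `image:` fields, and images passed to a component as command-line arguments are not detected.

```shell
# Read rendered manifests on stdin and print each distinct image reference once.
list_images() {
    awk '$1 == "image:" || ($1 == "-" && $2 == "image:") { print $NF }' \
        | tr -d "\"'" | sort -u
}
```

With the chart unpacked as above, `helm template prometheus-stack ./kube-prometheus-stack -n monitoring -f prometheus-values.yaml | list_images` prints the references to mirror. Treat the result as a starting point and compare it against the chart's documentation.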

2.4 Verify the monitoring installation

# Watch Pod status (press Ctrl+C once every Pod is Running)
kubectl get pods -n monitoring -w

# Then run the following checks
kubectl get all -n monitoring

# Check the persistent volume claims
kubectl get pvc -n monitoring

# Test the Prometheus service
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# Open http://localhost:9090 in a browser

# Test the Grafana service
kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80 &
# Open http://localhost:3000 in a browser
# Username: admin, password: <GRAFANA_PASSWORD>

# Stop the port forwards
kill %1 %2
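Watching the Pod list is fine interactively, but a scripted rollout needs a check that terminates on its own. The sketch below polls until no Pod is pending. Both helper names (`pods_not_ready`, `wait_until`) are hypothetical, and the parser assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...). The built-in `kubectl wait --for=condition=Ready` command is an alternative when its semantics fit.

```shell
# Print Pods that are not fully ready. Completed Pods (finished Jobs) are ignored.
# Input: the output of `kubectl get pods --no-headers` on stdin.
pods_not_ready() {
    awk '$3 != "Completed" {
        split($2, r, "/")                       # READY column, e.g. "1/2"
        if ($3 != "Running" || r[1] != r[2]) print $1
    }'
}

# wait_until TIMEOUT INTERVAL COMMAND...
# Re-run COMMAND until it succeeds or roughly TIMEOUT seconds have passed.
wait_until() {
    timeout_s="$1"; interval_s="$2"; shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            echo "timed out after ${timeout_s}s: $*" >&2
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
}

# Example wiring (requires cluster access):
#   all_ready() { [ -z "$(kubectl get pods -n monitoring --no-headers | pods_not_ready)" ]; }
#   wait_until 300 10 all_ready
```

The elapsed time only counts the sleep intervals, so the real timeout is slightly longer than the value given. An empty namespace also counts as "all ready", so run the check only after the install command has returned.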

3. Deploying the Log Collection System

3.1 Create the logging namespace

kubectl create namespace logging

3.2 Deploy the RBAC permissions

3.2.1 Create the ServiceAccount and ClusterRole

cat > fluent-bit-rbac.yaml << 'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
      # The log subresource is named "pods/log" (singular)
      - pods/log
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: logging
EOF

kubectl apply -f fluent-bit-rbac.yaml

3.3 Create the Fluent Bit configuration

3.3.1 Create the ConfigMap

cat > fluent-bit-configmap.yaml << 'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        # Enables the /api/v1/health endpoint used by the liveness and readiness probes
        Health_Check  On

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-file.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        # The docker parser expects JSON log lines. Nodes running containerd or CRI-O
        # write CRI-format lines and need a CRI parser defined in parsers.conf instead.
        Parser            docker
        # /var/log is mounted read-only in the DaemonSet, so the offset database
        # lives on the writable flb-storage mount.
        DB                /var/log/flb-storage/flb_kube.db
        DB.Sync           Normal
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10

  filter-kubernetes.conf: |
    [FILTER]
        Name                 kubernetes
        Match                kube.*
        Kube_URL             https://kubernetes.default.svc:443
        Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix      kube.var.log.containers.
        Merge_Log            On
        Merge_Log_Key        log_processed
        Keep_Log             Off
        K8S-Logging.Parser   On
        K8S-Logging.Exclude  Off
        Labels               On
        Annotations          Off

    [FILTER]
        Name   modify
        Match  *
        Add    node_name ${NODE_NAME}
        Add    host_ip ${HOST_IP}

  output-file.conf: |
    [OUTPUT]
        Name         file
        Match        *
        Path         /var/log/k8s-logs/
        # No File option is set, so each output file is named after the record tag,
        # e.g. kube.var.log.containers.<pod>_<namespace>_<container>-<id>.log
        # The "plain" format writes each record as one JSON line. The "template" format
        # formats line content (not file names) and only substitutes top-level keys,
        # so nested kubernetes[...] lookups are not resolved.
        Format       plain
        Retry_Limit  False

  parsers.conf: |
    [PARSER]
        Name             docker
        Format           json
        Time_Key         time
        Time_Format      %Y-%m-%dT%H:%M:%S.%LZ
        Time_Keep        On
        Decode_Field_As  escaped_utf8 log do_next
        Decode_Field_As  json log
EOF

kubectl apply -f fluent-bit-configmap.yaml

3.4 Deploy the Fluent Bit DaemonSet

3.4.1 Create the DaemonSet

cat > fluent-bit-daemonset.yaml << 'EOF'
# Replace <INTERNAL_REGISTRY> with the address of the internal image registry
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit-logging
  template:
    metadata:
      labels:
        k8s-app: fluent-bit-logging
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: fluent-bit
          image: <INTERNAL_REGISTRY>/fluent/fluent-bit:2.1.9
          imagePullPolicy: IfNotPresent
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          resources:
            requests:
              memory: "50Mi"
              cpu: "10m"
            limits:
              memory: "200Mi"
              cpu: "500m"
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
            - name: flb-storage
              mountPath: /var/log/flb-storage/
            # The file output writes to /var/log/k8s-logs, so that directory needs its
            # own writable mount on top of the read-only /var/log mount.
            - name: k8s-logs
              mountPath: /var/log/k8s-logs/
          livenessProbe:
            httpGet:
              path: /api/v1/health
              port: 2020
            initialDelaySeconds: 30
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /api/v1/health
              port: 2020
            initialDelaySeconds: 5
            periodSeconds: 10
      # The ServiceAccount token and ca.crt are mounted automatically at
      # /var/run/secrets/kubernetes.io/serviceaccount. A custom token-only volume at
      # that path would hide ca.crt, so none is defined here.
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
        - name: flb-storage
          hostPath:
            path: /var/log/flb-storage
            type: DirectoryOrCreate
        - name: k8s-logs
          hostPath:
            path: /var/log/k8s-logs
            type: DirectoryOrCreate
EOF

# Substitute the image registry address
sed -i 's/<INTERNAL_REGISTRY>/registry.internal.company.com/g' fluent-bit-daemonset.yaml

kubectl apply -f fluent-bit-daemonset.yaml

3.5 Configure log storage on the nodes

3.5.1 Run the log directory setup on every node

# Create the setup script
cat > setup-node-logs.sh << 'EOF'
#!/bin/bash
# Create the log storage directories
LOG_DIR="/var/log/k8s-logs"
FLB_STORAGE_DIR="/var/log/flb-storage"

mkdir -p "$LOG_DIR" "$FLB_STORAGE_DIR"
chmod 755 "$LOG_DIR" "$FLB_STORAGE_DIR"

# Create the logrotate configuration
cat > /etc/logrotate.d/k8s-pod-logs << 'LOGROTATE_EOF'
/var/log/k8s-logs/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0644 root root
    dateext
    dateformat -%Y%m%d
    sharedscripts
    postrotate
        find /var/log/k8s-logs/ -name "*.log.*.gz" -mtime +60 -delete
    endscript
}
LOGROTATE_EOF

echo "Node log configuration complete"
echo "Log directory: $LOG_DIR"
echo "Fluent Bit storage directory: $FLB_STORAGE_DIR"
EOF

# Make the script executable
chmod +x setup-node-logs.sh

# Copy the script to every node and run it
# Note: this requires SSH access to all nodes
# Example (with an assumed node list):
NODES="node1 node2 node3"
for NODE in $NODES; do
  scp setup-node-logs.sh "$NODE":/tmp/
  ssh "$NODE" "sudo /tmp/setup-node-logs.sh"
done

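3.6 Verify the log collection deployment

# Check DaemonSet status
kubectl get daemonset -n logging

# Check Pod status
kubectl get pods -n logging -o wide

# View the Fluent Bit logs
kubectl logs -n logging -l k8s-app=fluent-bit-logging --tail=20

# Inspect the Fluent Bit configuration
# (read it from the ConfigMap; the standard fluent-bit image is distroless
#  and has no cat or shell for kubectl exec)
kubectl describe configmap fluent-bit-config -n logging

# Test the Fluent Bit health endpoint
# (this guide does not create a fluent-bit Service, so forward to a Pod directly)
FLB_POD=$(kubectl get pod -n logging -l k8s-app=fluent-bit-logging -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n logging "pod/$FLB_POD" 2020:2020 &
sleep 2
curl http://localhost:2020/api/v1/health
kill %1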

4. Functional Verification

4.1 Create a test application

# Create the test namespace
kubectl create namespace test-monitoring

# Deploy the test application
# (in an intranet environment, nginx:alpine and busybox must be available from the internal registry)
cat > test-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app
  namespace: test-monitoring
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test-app
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
        - name: nginx
          image: nginx:alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
        - name: log-generator
          image: busybox
          command: ["sh", "-c"]
          args:
            - |
              # Write one line to the container's stdout every 10 seconds
              counter=0
              while true; do
                echo "Test log message $counter at $(date)"
                counter=$((counter+1))
                sleep 10
              done
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
EOF

kubectl apply -f test-deployment.yaml

# Create the test service
kubectl expose deployment test-app -n test-monitoring --port=80

4.2 Verify monitoring

# Wait for the Pods to start (press Ctrl+C once they are Running)
kubectl get pods -n test-monitoring -w

# Check the metrics
# 1. Open Prometheus
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# In the Prometheus UI, query:
#   container_cpu_usage_seconds_total{namespace="test-monitoring"}
#   container_memory_working_set_bytes{namespace="test-monitoring"}

# 2. Open Grafana
kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80 &
# Log in to Grafana and open the preinstalled Kubernetes dashboards

# Stop the port forwards
kill %1 %2

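4.3 Verify log collection

# Find the nodes the test Pods are running on
kubectl get pods -n test-monitoring -o wide

# Log in to one of those nodes and look at the log files
# (the following assumes a Pod is running on node1)
ssh node1 "ls -la /var/log/k8s-logs/ | head -10"
ssh node1 "tail -f /var/log/k8s-logs/*test-app*.log"

# Alternatively, check what Fluent Bit reports about the test Pods
kubectl logs -n logging -l k8s-app=fluent-bit-logging --tail=50 | grep -i test-app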
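When checking the collected files it helps to know which Pod a file belongs to. Kubernetes names container log files `<pod>_<namespace>_<container>-<container-id>.log`, and the tail input turns that path into the tag `kube.var.log.containers.<...>.log`. Assuming the file output names each file after the record tag (its behaviour when no `File` option is set), the parts can be recovered with plain parameter expansion. `parse_log_name` is a hypothetical helper for illustration.

```shell
# Print "namespace pod container" for a collected log file name.
# Accepts names with or without the "kube.var.log.containers." tag prefix.
parse_log_name() {
    name=$(basename "$1")
    name=${name#kube.var.log.containers.}   # drop the tag prefix, if present
    name=${name%.log}                       # drop the extension
    pod=${name%%_*}                         # text before the first "_"
    rest=${name#*_}
    namespace=${rest%%_*}                   # text between the two "_"
    container_and_id=${rest#*_}
    container=${container_and_id%-*}        # strip the trailing "-<container-id>"
    printf '%s %s %s\n' "$namespace" "$pod" "$container"
}
```

Splitting on `_` is safe because Pod, namespace, and container names cannot contain underscores. On a node, `for f in /var/log/k8s-logs/*.log; do parse_log_name "$f"; done | sort -u` gives a quick inventory of what has been collected.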

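4.4 Verify dynamic scaling

# Scale the test application up
kubectl scale deployment test-app -n test-monitoring --replicas=5

# Check that logs from the new Pods are collected
kubectl get pods -n test-monitoring -o wide
# Log in to the nodes hosting the new Pods and look at the log files

# Scale the test application down
kubectl scale deployment test-app -n test-monitoring --replicas=2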

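5. Cleaning Up Test Resources

# Remove the test application
kubectl delete namespace test-monitoring

# Optional: remove the monitoring and logging systems
# helm uninstall prometheus-stack -n monitoring
# kubectl delete namespace monitoring
# kubectl delete namespace logging

# Clean up the log directories on the nodes (if needed)
# Run on every node:
# sudo rm -rf /var/log/k8s-logs/*
# sudo rm -rf /var/log/flb-storage/*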

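6. Maintenance Command Reference

6.1 Monitoring system maintenance

# Check the status of the monitoring components
kubectl get all -n monitoring

# Check Prometheus storage usage
# (look the Pod up by label, because its exact name depends on the Helm release name)
PROM_POD=$(kubectl get pod -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n monitoring "$PROM_POD" -c prometheus -- df -h

# Restart Prometheus (if needed)
kubectl delete pod -n monitoring -l app.kubernetes.io/name=prometheus

# Apply an updated monitoring configuration
# (use the local chart directory; the chart repository is not reachable from the intranet)
helm upgrade prometheus-stack ./kube-prometheus-stack -n monitoring -f prometheus-values.yaml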

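6.2 Logging system maintenance

# Check Fluent Bit status
kubectl get daemonset -n logging
kubectl get pods -n logging -o wide

# Restart Fluent Bit (rolling restart)
kubectl rollout restart daemonset fluent-bit -n logging

# View log collection statistics
# (this guide does not create a fluent-bit Service, so forward to a Pod directly)
FLB_POD=$(kubectl get pod -n logging -l k8s-app=fluent-bit-logging -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n logging "pod/$FLB_POD" 2020:2020 &
sleep 2
curl http://localhost:2020/api/v1/metrics
kill %1

# Check disk space (on every node)
df -h /var/log
du -sh /var/log/k8s-logs/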
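The manual `df` check can be turned into a threshold test that is easy to run from cron or over SSH. This is a minimal sketch: the function names and the 80 percent threshold in the usage note are illustrative choices, and it relies on the POSIX output format of `df -P`, where the fifth column of the second line is the usage percentage.

```shell
# Print the usage percentage (digits only) of the filesystem holding PATH.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# check_disk_usage PATH THRESHOLD
# Returns 0 if usage is at or below THRESHOLD percent, 1 if above, 2 if unknown.
check_disk_usage() {
    usage=$(disk_usage_percent "$1")
    [ -n "$usage" ] || return 2
    if [ "$usage" -gt "$2" ]; then
        echo "WARNING: $1 is at ${usage}% (threshold $2%)" >&2
        return 1
    fi
}
```

For example, `check_disk_usage /var/log 80` on a node returns a non-zero status once the filesystem holding `/var/log` passes 80 percent, which can trigger an alert or an early cleanup.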

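6.3 Log rotation management

# Force a log rotation (on every node)
sudo logrotate -f /etc/logrotate.d/k8s-pod-logs

# Dry run: show what logrotate would do without changing anything
sudo logrotate -d /etc/logrotate.d/k8s-pod-logs

# Remove old logs (uncompressed logs older than 30 days, archives older than 60 days)
find /var/log/k8s-logs/ -name "*.log" -mtime +30 -delete
find /var/log/k8s-logs/ -name "*.log.*.gz" -mtime +60 -delete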

7. Troubleshooting

7.1 Common checks

# 1. A Pod does not start
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

# 2. Image pull failures
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events

# 3. Storage volume problems
kubectl describe pvc <pvc-name> -n <namespace>

# 4. Permission problems
kubectl auth can-i get pods --as=system:serviceaccount:logging:fluent-bit

# 5. Network connectivity problems
# (the standard fluent-bit image is distroless and ships no curl;
#  run this from a Pod whose image includes curl)
kubectl exec -n logging <pod-with-curl> -- curl -k https://kubernetes.default.svc:443/healthz

7.2 Troubleshooting log collection

# Check the Fluent Bit configuration
kubectl describe configmap fluent-bit-config -n logging

# Check log file permissions
# (run on the node; the standard fluent-bit image has no ls or shell for kubectl exec)
ls -la /var/log/containers/

# Enable debug mode:
# edit the ConfigMap, change Log_Level to debug, then restart the DaemonSet

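8. Upgrades and Extensions

8.1 Upgrade the monitoring system

# Show the current version
helm list -n monitoring

# Upgrade to a new version
# (download the new chart version on a machine with internet access as in section 2.3,
#  transfer it to the intranet, unpack it, and upgrade from the local directory)
helm upgrade prometheus-stack ./kube-prometheus-stack -n monitoring -f prometheus-values.yaml

# Roll back (if needed): list the revisions, then pick the one to return to
helm history prometheus-stack -n monitoring
helm rollback prometheus-stack 1 -n monitoring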

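8.2 Extend log collection

# Raise the Fluent Bit resource limits:
# edit fluent-bit-daemonset.yaml, change the resources section, then apply it

# Add new log filter rules:
# edit fluent-bit-configmap.yaml and add the new filters to filter-kubernetes.conf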

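Deployment Completion Checklist

  • All Prometheus Stack Pods are running
  • Grafana is reachable and login works
  • Prometheus returns metrics for queries
  • Fluent Bit is running on every node
  • The /var/log/k8s-logs directory exists on every node
  • The logrotate configuration is installed
  • Logs from the test application are collected
  • Monitoring metrics are displayed correctly
  • The dynamic scaling test passes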

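Notes

  1. All images must be imported into the internal image registry in advance
  2. The StorageClass must be configured to match the actual environment
  3. In production, always change the Grafana administrator password
  4. Adjust the resource settings (requests/limits) to the size of the cluster
  5. Check disk space regularly so that logs do not fill the disk
  6. Back up important configuration regularly (for example, Grafana dashboards)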

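This guide covers the complete steps for deploying the monitoring and log collection systems from scratch. Replace the placeholders with values for your environment and run the commands in order.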
