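A Complete Guide to Spring Boot Monitoring and Distributed Tracing

Author: 霸道流氓气质

This article covers Spring Boot Actuator, Sleuth, and Zipkin in a microservice architecture, focusing on health checks, metrics, distributed tracing, and Prometheus and Grafana integration, to help you build a complete observability stack.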

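1. Why Monitoring and Tracing Matter

In a microservice architecture, a single user request may cross several services. Monitoring and tracing address two core questions: monitoring tells you how the system is doing right now, and tracing tells you where a request went and where it went wrong.

User request → Gateway → Order service → Inventory service → Database
                              ↓
                        Payment service → Third-party payment

Question: why was this request 3 seconds slow, and at which step did it stall?
Answer: distributed tracing tells you.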

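2. Spring Boot Actuator: Health and Metrics

2.1 What Is Actuator

Actuator is Spring Boot's module of production-ready features. It provides a set of HTTP endpoints that expose the application's runtime state, metrics, and configuration.

2.2 Adding the Dependency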

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

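2.3 Core Endpoints

Endpoint | Path | Purpose
health | /actuator/health | Application health status (UP/DOWN)
info | /actuator/info | Application information (version, Git details, etc.)
metrics | /actuator/metrics | Application metrics (JVM, HTTP, custom)
env | /actuator/env | Environment variables and configuration properties
loggers | /actuator/loggers | Change log levels at runtime
threaddump | /actuator/threaddump | Thread dump
heapdump | /actuator/heapdump | Heap dump file download
prometheus | /actuator/prometheus | Metrics in Prometheus format
beans | /actuator/beans | List of all Spring beans
mappings | /actuator/mappings | All URL mappings

2.4 Configuration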

# application.yml
management:
  # Which endpoints are exposed
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus,loggers,threaddump
      base-path: /actuator  # default path prefix
  # Health check detail
  endpoint:
    health:
      show-details: when-authorized  # always / never / when-authorized
      show-components: always
    loggers:
      enabled: true
  # Health check components
  health:
    db:
      enabled: true
    redis:
      enabled: true
    elasticsearch:
      enabled: true
    diskspace:
      enabled: true
      threshold: 10MB
  # Metrics
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:default}
    export:
      prometheus:
        enabled: true  # Boot 2.x key; Boot 3.x uses management.prometheus.metrics.export.enabled
  # Actuator port (optional; keeps management traffic off the business port)
  server:
    port: 8081

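2.5 Health Checks in Detail

Default health indicators

Spring Boot detects the dependencies on the classpath and registers the matching health checks automatically:

Indicator | What it checks | Registered when
DataSourceHealthIndicator | Database connection (e.g. MySQL) | A DataSource is present
RedisHealthIndicator | Redis connection | spring-data-redis is present
ElasticsearchRestHealthIndicator | Elasticsearch cluster status | spring-data-elasticsearch is present
KafkaHealthIndicator | Kafka broker | spring-kafka is present
RabbitHealthIndicator | RabbitMQ connection | spring-amqp is present
DiskSpaceHealthIndicator | Disk space | Enabled by default
NacosDiscoveryHealthIndicator | Nacos registry | nacos-discovery is present

Custom health indicator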

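import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

/**
 * Custom health check that probes the availability of the third-party payment service.
 */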
@Component
public class PaymentGatewayHealthIndicator implements HealthIndicator {

    private final RestTemplate restTemplate;
    private final String paymentHealthUrl;

    public PaymentGatewayHealthIndicator(RestTemplate restTemplate,
                                         @Value("${payment.gateway.health-url}") String url) {
        this.restTemplate = restTemplate;
        this.paymentHealthUrl = url;
    }

    @Override
    public Health health() {
        try {
            long start = System.currentTimeMillis();
            ResponseEntity<String> response = restTemplate.getForEntity(
                    paymentHealthUrl, String.class);
            long latency = System.currentTimeMillis() - start;

            if (response.getStatusCode().is2xxSuccessful()) {
                return Health.up()
                        .withDetail("url", paymentHealthUrl)
                        .withDetail("latency_ms", latency)
                        .build();
            } else {
                return Health.down()
                        .withDetail("statusCode", response.getStatusCode().value())
                        .build();
            }
        } catch (Exception e) {
            return Health.down()
                    .withDetail("error", e.getMessage())
                    .build();
        }
    }
}

Sample health check response

{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "MySQL",
        "validationQuery": "isValid()"
      }
    },
    "redis": {
      "status": "UP",
      "details": {
        "version": "7.0.12"
      }
    },
    "elasticsearch": {
      "status": "UP",
      "details": {
        "cluster_name": "my-cluster",
        "status": "green",
        "number_of_nodes": 3
      }
    },
    "paymentGateway": {
      "status": "UP",
      "details": {
        "url": "https://pay.example.com/health",
        "latency_ms": 45
      }
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 107374182400,
        "free": 53687091200,
        "threshold": 10485760
      }
    }
  }
}
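
The top-level "status" in a response like this is rolled up from the component statuses. The sketch below illustrates that roll-up under a stated assumption: it uses the severity order DOWN, OUT_OF_SERVICE, UP, UNKNOWN, which matches the default order documented for Spring Boot, but the class HealthRollup and its method are hypothetical and are not Actuator types.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative roll-up of component statuses; not an Actuator class. */
final class HealthRollup {

    // Most severe status first (assumed to mirror Spring Boot's default order).
    private static final List<String> ORDER =
            Arrays.asList("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Returns the most severe known status, or UNKNOWN when there are no components. */
    static String aggregate(Map<String, String> componentStatuses) {
        String worst = "UNKNOWN";
        for (String status : componentStatuses.values()) {
            int index = ORDER.indexOf(status);
            // Statuses outside the known list are ignored in this sketch.
            if (index >= 0 && index < ORDER.indexOf(worst)) {
                worst = status;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        Map<String, String> components = new LinkedHashMap<>();
        components.put("db", "UP");
        components.put("redis", "UP");
        components.put("paymentGateway", "DOWN");
        System.out.println(aggregate(components));
    }
}
```

One failing component is enough to turn the overall status to DOWN, which is why an optional dependency such as a third-party gateway is often kept out of the readiness check.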

2.6 Kubernetes Readiness and Liveness Probes

# application.yml
management:
  endpoint:
    health:
      probes:
        enabled: true
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

# Kubernetes Deployment (Pod template excerpt; a separate file from application.yml)
spec:
  containers:
    - name: order-service
      livenessProbe:
        httpGet:
          path: /actuator/health/liveness
          port: 8081
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /actuator/health/readiness
          port: 8081
        initialDelaySeconds: 20
        periodSeconds: 5
        failureThreshold: 3

Probe | Meaning | On failure
Liveness | Is the application still alive? | Kubernetes restarts the container
Readiness | Is the application ready to receive traffic? | The Pod is removed from the Service endpoints

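2.7 Metrics

Built-in metrics

# List all available metric names
GET /actuator/metrics

# Inspect a specific metric
GET /actuator/metrics/jvm.memory.used
GET /actuator/metrics/http.server.requests
GET /actuator/metrics/system.cpu.usage

Commonly used built-in metrics:

Metric | Description
jvm.memory.used | JVM memory in use
jvm.memory.max | JVM maximum memory
jvm.gc.pause | GC pause time
jvm.threads.live | Live thread count
http.server.requests | HTTP request statistics (status code, URI, duration)
system.cpu.usage | System CPU usage
process.cpu.usage | Process CPU usage
hikaricp.connections.active | Active connections in the connection pool
kafka.consumer.records-lag | Kafka consumer lag (the exact meter name varies with the Micrometer and Kafka client versions)

Custom business metrics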

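import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.atomic.AtomicInteger;
import javax.annotation.PostConstruct; // jakarta.annotation.PostConstruct on Spring Boot 3.x
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

/**
 * Custom business metrics.
 */
@Slf4j
@Component
@RequiredArgsConstructor
public class OrderMetrics {

    private final MeterRegistry meterRegistry;

    private Counter orderCreateCounter;
    private Counter orderCreateFailCounter;
    private Timer orderProcessTimer;
    private AtomicInteger pendingOrderGauge;

    @PostConstruct
    public void init() {
        // Counter: total orders created
        orderCreateCounter = Counter.builder("order.create.total")
                .description("Total orders created")
                .tag("service", "order-service")
                .register(meterRegistry);

        // Counter: failed order creations
        orderCreateFailCounter = Counter.builder("order.create.failure")
                .description("Failed order creations")
                .tag("service", "order-service")
                .register(meterRegistry);

        // Timer: order processing time
        orderProcessTimer = Timer.builder("order.process.duration")
                .description("Order processing time")
                .publishPercentiles(0.5, 0.95, 0.99)  // P50, P95, P99
                .register(meterRegistry);

        // Gauge: number of pending orders
        pendingOrderGauge = new AtomicInteger(0);
        Gauge.builder("order.pending.count", pendingOrderGauge, AtomicInteger::get)
                .description("Pending orders")
                .register(meterRegistry);
    }

    /**
     * Records a created order.
     */
    public void recordOrderCreated() {
        orderCreateCounter.increment();
    }

    /**
     * Records a failed order creation.
     */
    public void recordOrderFailed() {
        orderCreateFailCounter.increment();
    }

    /**
     * Starts timing an order; pass the returned sample to stopTimer when processing ends.
     */
    public Timer.Sample startTimer() {
        return Timer.start(meterRegistry);
    }

    public void stopTimer(Timer.Sample sample) {
        sample.stop(orderProcessTimer);
    }

    /**
     * Updates the number of pending orders.
     */
    public void updatePendingCount(int count) {
        pendingOrderGauge.set(count);
    }
}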

Usage:

@Slf4j
@Service
@RequiredArgsConstructor
public class OrderServiceImpl implements OrderService {

    private final OrderMetrics orderMetrics;

    @Override
    public OrderResultDto createOrder(OrderCreateDto dto) {
        Timer.Sample timer = orderMetrics.startTimer();
        try {
            // Business logic...
            OrderResultDto result = doCreateOrder(dto);
            orderMetrics.recordOrderCreated();
            return result;
        } catch (Exception e) {
            orderMetrics.recordOrderFailed();
            throw e;
        } finally {
            orderMetrics.stopTimer(timer);
        }
    }
}
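
The P50, P95, and P99 values published by the timer are percentiles of the recorded durations. As a rough illustration of what those numbers mean, the sketch below computes nearest-rank percentiles over a small array. It is only a conceptual aid: Micrometer computes percentiles with its own approximating data structures, and the class Percentiles and the sample values are made up for this example.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over raw samples; a conceptual sketch, not Micrometer's algorithm. */
final class Percentiles {

    /** Returns the smallest sample such that at least fraction p of all samples are at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Nine fast requests and one slow outlier (hypothetical durations in ms).
        long[] durations = {10, 20, 30, 40, 50, 60, 70, 80, 90, 1000};
        System.out.println("P50 = " + percentile(durations, 0.5) + " ms");
        System.out.println("P95 = " + percentile(durations, 0.95) + " ms");
    }
}
```

The outlier barely moves the median but dominates the high percentiles, which is why latency alerts are usually written against P95 or P99 rather than the average.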

2.8 Prometheus + Grafana Integration

Adding the Prometheus dependency

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Prometheus scrape configuration

# prometheus.yml
scrape_configs:
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['order-service:8081', 'inventory-service:8081']
    # For services registered in Nacos, Prometheus does not ship a nacos_sd_configs block;
    # one option is an adapter that exports the Nacos instance list, loaded through
    # http_sd_configs or file_sd_configs.

Sample /actuator/prometheus output

# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space"} 1.048576E7
jvm_memory_used_bytes{area="heap",id="G1 Old Gen"} 5.24288E7
# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{method="GET",uri="/api/orders",status="200"} 1523
http_server_requests_seconds_sum{method="GET",uri="/api/orders",status="200"} 45.67
# HELP order_create_total Total orders created
# TYPE order_create_total counter
order_create_total{service="order-service"} 8923.0
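
The _sum and _count series of a summary are enough to derive an average latency: divide the total time by the number of requests. The sketch below applies that to the sample values shown above; the class name MeanLatency is hypothetical.

```java
import java.util.Locale;

/** Average latency from a Prometheus summary's _sum and _count values. */
final class MeanLatency {

    /** Returns the mean duration in seconds, or 0 when no requests were recorded. */
    static double meanSeconds(double sumSeconds, long count) {
        return count == 0 ? 0.0 : sumSeconds / count;
    }

    public static void main(String[] args) {
        // Values taken from the sample output: 45.67 s spent across 1523 requests.
        double mean = meanSeconds(45.67, 1523);
        System.out.println(String.format(Locale.ROOT, "%.1f ms", mean * 1000));
    }
}
```

This is the average since the process started. In PromQL the same idea is normally applied to rate() of both series so that the result reflects a recent window.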

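3. Sleuth + Zipkin: Distributed Tracing

3.1 Core Concepts

Concept | Description | Analogy
Trace | One complete request path | The full record of one user action
Span | One unit of work within the trace | One step handled by a microservice
TraceId | Globally unique ID shared by the whole trace | A parcel tracking number
SpanId | Unique ID of each span | The ID of each transit hub
ParentSpanId | ID of the parent span | The previous transit hub
Annotation | Timestamped event (request received, response sent, etc.) | Arrival and departure times

TraceId: abc123 (shared across the whole trace)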

[Gateway]        SpanId: 001, ParentSpan: null
    ↓
[Order Service]  SpanId: 002, ParentSpan: 001
    ↓           ↘
[Inventory]     [Payment]
SpanId: 003     SpanId: 004
Parent: 002     Parent: 002
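
A tracing backend rebuilds this tree purely from the SpanId and ParentSpanId of each reported span. The sketch below shows one way to do that for a single trace. The SpanRecord type and its fields are illustrative stand-ins, not Sleuth or Zipkin classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Rebuilds a call tree from flat span records; types here are hypothetical. */
final class TraceTree {

    static final class SpanRecord {
        final String service;
        final String spanId;
        final String parentId; // null for the root span

        SpanRecord(String service, String spanId, String parentId) {
            this.service = service;
            this.spanId = spanId;
            this.parentId = parentId;
        }
    }

    /** Renders the spans of one trace as an indented tree, children under their parent. */
    static String render(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> childrenByParent = new LinkedHashMap<>();
        SpanRecord root = null;
        for (SpanRecord span : spans) {
            if (span.parentId == null) {
                root = span;
            } else {
                childrenByParent.computeIfAbsent(span.parentId, k -> new ArrayList<>()).add(span);
            }
        }
        StringBuilder out = new StringBuilder();
        if (root != null) {
            append(root, 0, childrenByParent, out);
        }
        return out.toString();
    }

    private static void append(SpanRecord span, int depth,
                               Map<String, List<SpanRecord>> childrenByParent, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(span.service).append(" [").append(span.spanId).append("]\n");
        List<SpanRecord> children = childrenByParent.get(span.spanId);
        if (children != null) {
            for (SpanRecord child : children) {
                append(child, depth + 1, childrenByParent, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(render(Arrays.asList(
                new SpanRecord("gateway", "001", null),
                new SpanRecord("order-service", "002", "001"),
                new SpanRecord("inventory", "003", "002"),
                new SpanRecord("payment", "004", "002"))));
    }
}
```

Because only IDs are needed, spans can arrive from different services in any order and still be assembled into the same tree.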

3.2 Spring Boot 2.x: Sleuth + Zipkin

Adding the dependencies

<!-- Tracing core -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>
<!-- Zipkin reporting (over HTTP unless another sender is configured) -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>
<!-- To report through Kafka (recommended for production), keep the two dependencies above and add spring-kafka -->
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
</dependency>

Configuration

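# application.yml
spring:
  application:
    name: order-service
  sleuth:
    enabled: true
    sampler:
      probability: 1.0        # sampling probability: 1.0 = 100% (development); 0.1 = 10% is a common production value
      rate: 100                # max traces sampled per second; set only one of the two, since a configured rate is expected to take precedence over probability
    propagation:
      type: B3                 # propagation format: B3 / W3C
    async:
      enabled: true            # propagate the TraceId to async threads
  zipkin:
    enabled: true
    sender:
      type: kafka              # sender type: kafka / web / rabbit
    kafka:
      topic: zipkin            # Kafka topic name
    # For HTTP reporting instead:
    # base-url: http://zipkin-server:9411
    # sender:
    #   type: web
  kafka:
    bootstrap-servers: localhost:9092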

3.3 Spring Boot 3.x: Micrometer Tracing + Zipkin

In Spring Boot 3.x, Sleuth is replaced by Micrometer Tracing:

Adding the dependencies

<!-- Micrometer Tracing (replaces Sleuth) -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<!-- Zipkin Reporter -->
<dependency>
    <groupId>io.zipkin.reporter2</groupId>
    <artifactId>zipkin-reporter-brave</artifactId>
</dependency>
<!-- Kafka sender -->
<dependency>
    <groupId>io.zipkin.reporter2</groupId>
    <artifactId>zipkin-sender-kafka</artifactId>
</dependency>

Configuration

# application.yml (Spring Boot 3.x)
management:
  tracing:
    enabled: true
    sampling:
      probability: 1.0       # sampling probability
    propagation:
      type: B3               # propagation format
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
      # Kafka transport is not selected by a property under this key; it is typically
      # wired by registering a Kafka sender bean from zipkin-sender-kafka
      # (verify the details against your Spring Boot version).

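3.4 Components Instrumented Automatically by Sleuth

Sleuth traces the following components without any manual code:

Component | Behavior
Spring MVC | Creates a span for every HTTP request
RestTemplate | Propagates the TraceId on HTTP calls
WebClient | Propagates the TraceId on reactive HTTP calls
OpenFeign | Propagates the TraceId on Feign calls
Spring Kafka | Propagates the TraceId when sending and consuming messages
RabbitMQ | Propagates the TraceId when sending and consuming messages
Spring Data Redis | Creates a span for Redis operations
JDBC | Creates a span for database operations
@Async | Propagates the TraceId to async methods
@Scheduled | Starts a new trace for each scheduled run

3.5 TraceId in Logs

Sleuth puts the TraceId and SpanId into the MDC (Mapped Diagnostic Context), so they can be printed on every log line: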

<!-- logback-spring.xml -->
<configuration>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>
                %d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] [%X{traceId:-},%X{spanId:-}] %-5level %logger{36} - %msg%n
            </pattern>
        </encoder>
    </appender>
    <root level="INFO">
        <appender-ref ref="CONSOLE" />
    </root>
</configuration>

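Resulting log output:

2024-01-15 10:23:45.123 [http-nio-8080-exec-1] [abc123def456,001] INFO  OrderController - Received create-order request, userId: U001
2024-01-15 10:23:45.234 [http-nio-8080-exec-1] [abc123def456,002] INFO  InventoryService - Calling inventory service to deduct stock
2024-01-15 10:23:45.345 [http-nio-8080-exec-1] [abc123def456,003] INFO  PaymentService - Calling payment service to create a payment order

The same TraceId, abc123def456, appears in every service, so a single search in the logging system (ELK) retrieves the whole request path.

3.6 Creating Spans Manually

Where Sleuth's automatic instrumentation does not reach, create spans by hand: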

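import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.stereotype.Service;

/**
 * Manual span example: tracing an external HTTP call or complex business logic.
 */
@Slf4j
@Service
@RequiredArgsConstructor
public class ExternalApiService {

    private final Tracer tracer;

    /**
     * Calls a third-party shipping API inside a manually created span.
     */
    public ShippingResult queryShippingStatus(String trackingNo) {
        // Create a new span
        Span span = tracer.nextSpan().name("query-shipping-status").start();

        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            // Add tags (useful for searching and filtering in Zipkin later)
            span.tag("tracking.no", trackingNo);
            span.tag("provider", "sf-express");

            // Record an event
            span.event("shipping API call started");

            // The actual call
            ShippingResult result = doCallShippingApi(trackingNo);

            span.event("shipping API call succeeded");
            span.tag("shipping.status", result.getStatus());
            return result;

        } catch (Exception e) {
            span.error(e);  // record the exception on the span
            throw e;
        } finally {
            span.end();  // always end the span
        }
    }
}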

Annotation style (more concise)

@Service
public class InventoryService {

    /**
     * @NewSpan creates a new child span.
     */
    @NewSpan("check-inventory")
    public boolean checkInventory(@SpanTag("sku") String sku,
                                  @SpanTag("quantity") int quantity) {
        // Business logic
        return doCheck(sku, quantity);
    }

    /**
     * @ContinueSpan adds tags to the current span instead of creating a new one.
     */
    @ContinueSpan(log = "reduce-stock")
    public void reduceStock(@SpanTag("sku") String sku,
                            @SpanTag("quantity") int quantity) {
        // Business logic
    }
}

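3.7 Kafka Sender: How It Works and How to Configure It

Why send through Kafka rather than HTTP

Aspect | HTTP sender | Kafka sender
Coupling | Services talk to Zipkin directly; a Zipkin outage breaks reporting | Decoupled; a Zipkin outage does not affect the services
Performance | Reporting speed depends on how quickly Zipkin answers HTTP calls | Spans are handed to Kafka asynchronously, adding very little latency
Reliability | Spans are lost while Zipkin is unavailable | Spans are persisted in Kafka until Zipkin consumes them
Throughput | Limited by Zipkin's processing capacity | Kafka buffers bursts and smooths the load
Best for | Development and test environments | Recommended for production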
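
The decoupling in the table rests on one idea: the request thread only drops a span into a bounded in-memory buffer, and a background sender drains that buffer in batches. The sketch below illustrates the idea under stated assumptions. BoundedSpanBuffer is a hypothetical class written for this example, spans are represented as plain strings, and it is not the Zipkin reporter's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Non-blocking span buffer with batch draining; an illustration, not Zipkin's reporter. */
final class BoundedSpanBuffer {

    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BoundedSpanBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called on the request thread: never blocks, and drops the span when the buffer is full. */
    boolean report(String encodedSpan) {
        if (queue.offer(encodedSpan)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Called by a background sender: removes up to maxBatch spans to ship in one message. */
    List<String> nextBatch(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        BoundedSpanBuffer buffer = new BoundedSpanBuffer(2);
        buffer.report("span-1");
        buffer.report("span-2");
        buffer.report("span-3"); // buffer is full, so this one is dropped
        System.out.println(buffer.nextBatch(10) + " dropped=" + buffer.droppedCount());
    }
}
```

Dropping on overflow is a deliberate trade-off in this sketch: losing a few spans is preferable to slowing down business requests when the transport falls behind.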

Architecture

Service A ─┐
Service B ─┼── send spans to Kafka ──→ Kafka topic: zipkin
Service C ─┘                                  │
                                              ▼
                                    Zipkin Server (consumes from Kafka)
                                              │
                                              ▼
                                    Storage (Elasticsearch / MySQL / Cassandra)
                                              │
                                              ▼
                                    Zipkin UI (query and display traces)

Configuring Zipkin Server to consume from Kafka

# Start Zipkin Server (Docker)
docker run -d --name zipkin \
  -p 9411:9411 \
  -e KAFKA_BOOTSTRAP_SERVERS=kafka:9092 \
  -e STORAGE_TYPE=elasticsearch \
  -e ES_HOSTS=http://elasticsearch:9200 \
  openzipkin/zipkin

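Environment variables:

Variable | Description
KAFKA_BOOTSTRAP_SERVERS | Kafka cluster address
KAFKA_TOPIC | Topic to consume (default: zipkin)
STORAGE_TYPE | Storage type: elasticsearch / mysql / cassandra3 / mem
ES_HOSTS | Elasticsearch address (when STORAGE_TYPE=elasticsearch)

4. Deploying and Using Zipkin

4.1 Full Deployment with Docker Compose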

# docker-compose-tracing.yml
version: '3.8'
services:
  zipkin:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"
    environment:
      - KAFKA_BOOTSTRAP_SERVERS=kafka:9092
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=http://elasticsearch:9200
      - ES_INDEX=zipkin
      - ES_INDEX_SHARDS=3
      - ES_INDEX_REPLICAS=1
    depends_on:
      - kafka
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    ports:
      - "9092:9092"
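    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
      # Single broker: the consumer offsets topic cannot use the default replication factor of 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1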
    depends_on:
      - zookeeper
  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
volumes:
  es-data:

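4.2 Zipkin UI Features

Open http://localhost:9411

Feature | Description
Trace search | Search by service name, time range, duration, or tag
Trace topology | Visual map of the calls between services
Span details | Duration, tags, and error information for each span
Dependency graph | Automatically generated service dependency topology

4.3 Data Retention

# Zipkin on Elasticsearch: Zipkin writes daily indices and does not expire them itself
docker run -d --name zipkin \
  -e STORAGE_TYPE=elasticsearch \
  -e ES_HOSTS=http://es:9200 \
  -e ES_INDEX_SHARDS=3 \
  -e ES_INDEX_REPLICAS=1 \
  openzipkin/zipkin
# Use an Elasticsearch ILM policy to clean up old data automatically,
# or delete old daily indices on a schedule. With Elasticsearch 7+ the index names are
# expected to look like zipkin-span-2024-01-15; confirm with GET _cat/indices first:
# curl -X DELETE "es:9200/zipkin-span-2024-01-*"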

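5. Sampling Strategy

Why sampling is needed

Collecting 100% of traces in production inflates trace storage and adds reporting overhead to every request, so most deployments sample.

Sampling configuration

spring:
  sleuth:
    sampler:
      # Option 1: probability sampling (recommended)
      probability: 0.1   # 10% of requests are traced
      # Option 2: rate-limited sampling. Choose one option, since a configured rate
      # is expected to take precedence over probability.
      rate: 100           # sample at most 100 requests per second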
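
One way to picture probability sampling is as a decision computed from the trace ID itself: map each ID into one of 10,000 buckets and keep the trace when its bucket falls below a boundary. The sketch below is loosely modeled on boundary-style samplers and is an assumption-laden illustration. BoundarySamplerSketch is a hypothetical class and is not the sampler Sleuth or Brave actually instantiates.

```java
/** Probability sampling decided from the trace ID; an illustrative sketch. */
final class BoundarySamplerSketch {

    private static final long BUCKETS = 10_000L;

    // Traces whose bucket number is below this boundary are kept.
    private final long boundary;

    BoundarySamplerSketch(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be between 0 and 1");
        }
        this.boundary = Math.round(probability * BUCKETS);
    }

    /** Returns true when the trace should be recorded. */
    boolean isSampled(long traceId) {
        // floorMod keeps the bucket in 0..9999 even for negative IDs.
        long bucket = Math.floorMod(traceId, BUCKETS);
        return bucket < boundary;
    }

    public static void main(String[] args) {
        BoundarySamplerSketch sampler = new BoundarySamplerSketch(0.1);
        int kept = 0;
        for (long id = 0; id < 100_000; id++) {
            if (sampler.isSampled(id)) {
                kept++;
            }
        }
        System.out.println("kept " + kept + " of 100000");
    }
}
```

Because the decision depends only on the trace ID, it is repeatable for a given trace. In practice the decision made at the entry point is also carried downstream in the propagation headers, so later services do not need to decide again.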

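Custom sampler

/**
 * Custom sampling strategy (brave.sampler.Sampler).
 * This sampler decides when a trace starts, before the outcome of the request is known,
 * so it cannot keep 100% of failed or slow requests. It applies a flat 10% rate and
 * marks the place where additional rules could be added.
 */
@Configuration
public class CustomSamplerConfig {

    @Bean
    public Sampler customSampler() {
        return new Sampler() {
            // Base sampling rate: 10%
            private final Sampler defaultSampler = Sampler.create(0.1f);

            @Override
            public boolean isSampled(long traceId) {
                // Plain probability sampling; add custom rules here if needed
                return defaultSampler.isSampled(traceId);
            }
        };
    }
}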

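Path-based sampling

import brave.http.HttpRequest;
import brave.http.HttpTracingCustomizer;
import brave.sampler.SamplerFunction;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * Different sampling per path.
 * High-frequency endpoints such as health checks are not sampled; core business APIs are sampled heavily.
 */
@Configuration
public class PathBasedSamplerConfig {

    @Bean
    public HttpTracingCustomizer httpTracingCustomizer() {
        SamplerFunction<HttpRequest> pathSampler = request -> {
            String path = request.path();
            if (path == null) {
                return null; // no path available: defer to the default sampler
            }

            // Do not sample health checks and Actuator endpoints
            if (path.startsWith("/actuator") || path.equals("/health")) {
                return false;
            }

            // Sample 100% of payment APIs
            if (path.startsWith("/api/payment")) {
                return true;
            }

            // Everything else uses the default sampling rate
            return null; // null defers the decision to the default sampler
        };
        return builder -> builder.serverSampler(pathSampler);
    }
}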

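Recommended sampling rates

Environment | Sampling rate | Notes
Development / test | 1.0 (100%) | Full collection makes debugging easier
Staging | 0.5 (50%) | Enough to investigate problems
Production (low traffic) | 0.1-0.3 (10%-30%) | Balances observability against resource usage
Production (high traffic) | 0.01-0.05 (1%-5%) | Avoids runaway storage growth
Critical business paths | Forced to 1.0 | Payments, refunds, and similar flows must be collected in full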
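
To see why the recommended rate drops as traffic grows, it helps to estimate how many spans a given rate produces per day. The sketch below does that arithmetic. The traffic figures (1,000 requests per second, 5 spans per request) are invented for illustration and do not come from a real system.

```java
/** Back-of-the-envelope span volume estimate; all inputs are hypothetical. */
final class SpanVolumeEstimate {

    private static final long SECONDS_PER_DAY = 86_400L;

    /** Spans stored per day = requests/s x spans per request x sampling probability x seconds per day. */
    static long spansPerDay(long requestsPerSecond, int spansPerRequest, double samplingProbability) {
        return Math.round(requestsPerSecond * spansPerRequest * samplingProbability * SECONDS_PER_DAY);
    }

    public static void main(String[] args) {
        System.out.println("100% sampling: " + spansPerDay(1000, 5, 1.0) + " spans/day");
        System.out.println("  5% sampling: " + spansPerDay(1000, 5, 0.05) + " spans/day");
    }
}
```

Under these assumptions full collection stores 432 million spans per day, while 5% sampling stores about 21.6 million, which is the kind of gap that decides whether the trace store stays affordable.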

6. How the TraceId Is Propagated

6.1 Propagation over HTTP

Sleuth propagates the trace context through HTTP headers:

# B3 format headers (default)
X-B3-TraceId: 463ac35c9f6413ad48485a3953bb6124
X-B3-SpanId: a2fb4a1d1a96d312
X-B3-ParentSpanId: 0020000000000001
X-B3-Sampled: 1

# W3C Trace Context format
traceparent: 00-463ac35c9f6413ad48485a3953bb6124-a2fb4a1d1a96d312-01
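
The W3C traceparent value packs four fields into one header: version, trace ID, parent span ID, and flags, separated by hyphens. The sketch below splits a version 00 header into those fields. The class TraceParent is a hypothetical helper written for this example, and it performs only basic length checks rather than full validation.

```java
/** Minimal parser for a version 00 W3C traceparent header; not a complete validator. */
final class TraceParent {

    final String version;
    final String traceId;   // 32 hex characters
    final String parentId;  // 16 hex characters
    final boolean sampled;  // lowest bit of the flags field

    private TraceParent(String version, String traceId, String parentId, boolean sampled) {
        this.version = version;
        this.traceId = traceId;
        this.parentId = parentId;
        this.sampled = sampled;
    }

    static TraceParent parse(String header) {
        String[] parts = header.trim().split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("not a version 00 traceparent: " + header);
        }
        int flags = Integer.parseInt(parts[3], 16);
        return new TraceParent(parts[0], parts[1], parts[2], (flags & 0x01) == 0x01);
    }

    public static void main(String[] args) {
        TraceParent tp = parse("00-463ac35c9f6413ad48485a3953bb6124-a2fb4a1d1a96d312-01");
        System.out.println(tp.traceId + " " + tp.parentId + " sampled=" + tp.sampled);
    }
}
```

The same three pieces of information travel in the B3 headers as separate header lines, which is why a service can translate between the two formats when it sits between systems that use different ones.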

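6.2 Propagation over Kafka Messages

Sleuth writes the trace context into the headers of Kafka messages automatically:

// Producer side (injected automatically, no manual code needed)
// Sleuth adds the tracing headers to the ProducerRecord

// Consumer side (extracted automatically)
@KafkaListener(topics = "order-topic")
public void onMessage(ConsumerRecord<String, String> record) {
    // The TraceId has already been extracted from the headers into the current context,
    // so log lines print the inherited TraceId
    log.info("Processing order message, orderId: {}", record.value());
}

6.3 Propagation across Thread Pools

By default the TraceId does not cross thread boundaries. Sleuth provides wrapper classes for this: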

/**
 * Thread pool configuration that propagates the TraceId across threads.
 */
@Configuration
public class AsyncConfig {

    private final BeanFactory beanFactory;

    public AsyncConfig(BeanFactory beanFactory) {
        this.beanFactory = beanFactory;
    }

    @Bean
    public Executor asyncExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(10);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(200);
        executor.setThreadNamePrefix("async-");
        executor.initialize();

        // Wrap the executor with Sleuth so the TraceId is propagated automatically
        return new LazyTraceExecutor(beanFactory, executor);
    }
}

Alternatively, use the @Async annotation, which Sleuth instruments automatically:

@Async
public CompletableFuture<Void> asyncProcess(String data) {
    // The TraceId is propagated here automatically
    log.info("Async processing, data: {}", data);
    return CompletableFuture.completedFuture(null);
}
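
The reason the TraceId gets lost across threads is that it lives in thread-local storage, and a pool thread has its own, empty copy. Wrappers fix this by capturing the context on the submitting thread and restoring it around the task. The sketch below demonstrates that mechanism with a plain ThreadLocal. TraceContextPropagation and its methods are hypothetical and only approximate what tracing libraries do internally.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Capture-and-restore context propagation with a plain ThreadLocal; an illustration only. */
final class TraceContextPropagation {

    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Captures the caller's trace ID now and installs it while the task runs on another thread. */
    static Runnable wrap(Runnable task) {
        final String captured = TRACE_ID.get(); // read on the submitting thread
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker thread's earlier state so pooled threads do not leak context.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    /** Runs a task on a worker thread and returns the trace ID that the task observed. */
    static String observedOnWorker(boolean wrapped) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set("abc123");
            final String[] seen = new String[1];
            Runnable task = () -> seen[0] = TRACE_ID.get();
            pool.submit(wrapped ? wrap(task) : task).get();
            return seen[0];
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("plain task sees:   " + observedOnWorker(false));
        System.out.println("wrapped task sees: " + observedOnWorker(true));
    }
}
```

The restore step in the finally block matters as much as the capture: without it, a pooled thread would carry a stale TraceId into the next unrelated task.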

6.4 Passing the TraceId Manually (Special Cases)

/**
 * Cases that need the TraceId passed by hand,
 * for example a custom HTTP client or a third-party SDK.
 */
@Service
@RequiredArgsConstructor
public class ThirdPartyCallService {

    private final Tracer tracer;

    public void callThirdPartyApi() {
        Span currentSpan = tracer.currentSpan();
        if (currentSpan != null) {
            TraceContext context = currentSpan.context();
            String traceId = context.traceId();
            String spanId = context.spanId();

            // Copy the context into the outgoing request headers by hand
            HttpHeaders headers = new HttpHeaders();
            headers.set("X-B3-TraceId", traceId);
            headers.set("X-B3-SpanId", spanId);
            // Forward the real sampling decision instead of always sending "1"
            headers.set("X-B3-Sampled", Boolean.TRUE.equals(context.sampled()) ? "1" : "0");

            // Send the request...
        }
    }
}

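7. Returning the TraceId to the Frontend

Return the TraceId to the frontend so that users can supply it as a lead when they report an error:

/**
 * Unified response wrapper that carries the TraceId.
 */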
@Data
public class Result<T> {

    private Integer code;
    private String message;
    private T data;
    private String traceId;

    public static <T> Result<T> success(T data) {
        Result<T> result = new Result<>();
        result.setCode(200);
        result.setMessage("success");
        result.setData(data);
        return result;
    }

    public static <T> Result<T> fail(String message) {
        Result<T> result = new Result<>();
        result.setCode(500);
        result.setMessage(message);
        return result;
    }
}

/**
 * Response advice that injects the TraceId automatically.
 */
@RestControllerAdvice
@RequiredArgsConstructor
public class TraceIdResponseAdvice implements ResponseBodyAdvice<Object> {

    private final Tracer tracer;

    @Override
    public boolean supports(MethodParameter returnType, Class converterType) {
        return true;
    }

    @Override
    public Object beforeBodyWrite(Object body, MethodParameter returnType,
                                   MediaType selectedContentType,
                                   Class selectedConverterType,
                                   ServerHttpRequest request,
                                   ServerHttpResponse response) {
        // Add the TraceId to the response headers
        Span currentSpan = tracer.currentSpan();
        if (currentSpan != null) {
            String traceId = currentSpan.context().traceId();
            response.getHeaders().add("X-Trace-Id", traceId);

            // If the body is the unified wrapper, fill in its traceId field as well
            if (body instanceof Result) {
                ((Result<?>) body).setTraceId(traceId);
            }
        }
        return body;
    }
}

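Once the frontend has the TraceId, users can attach it to their problem reports, and operations staff can search for it in Zipkin or ELK to locate the whole request path.

8. Integrating with the ELK Logging Stack

Architecture

Service (logs contain the TraceId)
    ↓ (Filebeat / Logstash)
Elasticsearch (log storage)
    ↓
Kibana (log search)
    ↔ Zipkin (trace search)

The TraceId lets you jump between the two systems.

Logback configuration for JSON log output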

<!-- logback-spring.xml -->
<configuration>
    <springProperty scope="context" name="appName" source="spring.application.name"/>
    <appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>logs/${appName}.json</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>logs/${appName}.%d{yyyy-MM-dd}.json</fileNamePattern>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <customFields>{"service":"${appName}"}</customFields>
            <!-- Sleuth puts the TraceId/SpanId into the MDC automatically -->
        </encoder>
    </appender>
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>
                %d{HH:mm:ss.SSS} [%thread] [%X{traceId:-},%X{spanId:-}] %-5level %logger{36} - %msg%n
            </pattern>
        </encoder>
    </appender>
    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="JSON_FILE"/>
    </root>
</configuration>

Resulting JSON log entry:

{
  "@timestamp": "2024-01-15T10:23:45.123+08:00",
  "level": "INFO",
  "thread": "http-nio-8080-exec-1",
  "logger": "com.example.order.service.OrderServiceImpl",
  "message": "Order created, orderId: ORD20240115001",
  "service": "order-service",
  "traceId": "abc123def456",
  "spanId": "002",
  "userId": "U001"
}

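In Kibana, searching for traceId: abc123def456 returns the logs for that request from every service.

9. Complete Production Configuration Template

# application-prod.yml: production monitoring and tracing configuration
spring:
  application:
    name: order-service
  # === Distributed tracing ===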
  sleuth:
    enabled: true
    sampler:
      probability: 0.1          # 10% sampling in production
    propagation:
      type: B3
    async:
      enabled: true
    scheduled:
      enabled: true             # trace scheduled tasks as well
    log:
      slf4j:
        enabled: true           # put trace IDs into the MDC automatically
  zipkin:
    enabled: true
    sender:
      type: kafka               # send through Kafka
    kafka:
      topic: zipkin
    service:
      name: ${spring.application.name}
  kafka:
    bootstrap-servers: kafka-cluster:9092
# === Monitoring endpoints ===
management:
  server:
    port: 8081                  # management port kept separate from the business port
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
      base-path: /actuator
  endpoint:
    health:
      show-details: when-authorized
      probes:
        enabled: true           # Kubernetes probes
    shutdown:
      enabled: false            # disable remote shutdown
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
    db:
      enabled: true
    redis:
      enabled: true
    kafka:
      enabled: true
  metrics:
    tags:
      application: ${spring.application.name}
      env: prod
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 100ms, 500ms, 1s, 5s
    export:
      prometheus:
        enabled: true
        step: 15s               # step size for windowed statistics; Prometheus pulls the metrics, nothing is pushed

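10. Security

10.1 Securing Actuator Endpoints

/**
 * Actuator security configuration (Spring Security 5.x style, as used with Spring Boot 2.x).
 * In production, never expose Actuator directly to the public internet.
 */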
 */
@Configuration
@EnableWebSecurity
public class ActuatorSecurityConfig {

    @Bean
    public SecurityFilterChain actuatorSecurity(HttpSecurity http) throws Exception {
        http
            .requestMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeRequests()
                // health and info allow anonymous access (needed by Kubernetes probes)
                .requestMatchers(EndpointRequest.to("health", "info")).permitAll()
                // All other endpoints require authentication
                .anyRequest().hasRole("ACTUATOR_ADMIN")
            .and()
            .httpBasic();
        return http.build();
    }
}

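10.2 Masking Sensitive Values

management:
  endpoint:
    env:
      # Mask the values of configuration keys that look sensitive
      # (a Spring Boot 2.x property; Boot 3.x controls this through show-values instead)
      keys-to-sanitize: password,secret,key,token,credentials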
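
The effect of such a key list is that property values are replaced with a mask whenever the property name looks sensitive. The sketch below illustrates the idea with a simple case-insensitive substring match. That matching rule is a simplification chosen for this example: Actuator's own matching rules differ in detail, and PropertySanitizer is a hypothetical class.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Masks values whose property name looks sensitive; a simplified illustration. */
final class PropertySanitizer {

    private static final List<String> SENSITIVE_PARTS =
            Arrays.asList("password", "secret", "key", "token", "credentials");

    private static final String MASK = "******";

    static boolean isSensitive(String propertyName) {
        String lower = propertyName.toLowerCase(Locale.ROOT);
        for (String part : SENSITIVE_PARTS) {
            if (lower.contains(part)) {
                return true;
            }
        }
        return false;
    }

    /** Returns a copy of the properties with sensitive values replaced by a mask. */
    static Map<String, String> sanitize(Map<String, String> properties) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            result.put(entry.getKey(), isSensitive(entry.getKey()) ? MASK : entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("server.port", "8080");
        props.put("spring.datasource.password", "s3cr3t");
        System.out.println(sanitize(props));
    }
}
```

Masking is a last line of defense. The /env endpoint is still best left unexposed or placed behind authentication in production.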

11. Alert Rules (Prometheus + AlertManager)

# prometheus-alerts.yml
groups:
  - name: service-alerts
    rules:
      # Service is down
      - alert: ServiceDown
        expr: up{job="spring-boot-apps"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
      # API response time above 2 seconds
      - alert: HighLatency
        expr: http_server_requests_seconds_max{uri!~"/actuator.*"} > 2
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.application }} endpoint {{ $labels.uri }} latency above 2 seconds"
      # 5xx error rate above 5%
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) by (application)
          /
          sum(rate(http_server_requests_seconds_count[5m])) by (application)
          > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.application }} 5xx error rate above 5%"
      # JVM heap usage above 85% (summed across heap pools, because the metric is reported per pool)
      - alert: HighMemoryUsage
        expr: |
          sum by (application, instance) (jvm_memory_used_bytes{area="heap"})
          /
          sum by (application, instance) (jvm_memory_max_bytes{area="heap"})
          > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.application }} heap usage above 85%"
      # Kafka consumer lag (the metric name depends on the Micrometer and Kafka client
      # versions; check /actuator/prometheus for the name your application exports)
      - alert: KafkaConsumerLag
        expr: kafka_consumer_records_lag_max > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.application }} Kafka consumer lag above 10000"
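
The HighErrorRate expression divides the per-second rate of 5xx responses by the per-second rate of all responses over the same window. The sketch below reproduces that arithmetic for two counter readings taken 5 minutes apart. The readings are invented for illustration, and the code ignores counter resets, which PromQL's rate() handles for you.

```java
/** Error ratio from two counter readings over a window; ignores counter resets. */
final class ErrorRate {

    /** Per-second increase of a monotonically increasing counter between two readings. */
    static double ratePerSecond(double earlier, double later, double windowSeconds) {
        return (later - earlier) / windowSeconds;
    }

    /** Share of requests that failed during the window, or 0 when there was no traffic. */
    static double errorRatio(double errorsEarlier, double errorsLater,
                             double totalEarlier, double totalLater, double windowSeconds) {
        double totalRate = ratePerSecond(totalEarlier, totalLater, windowSeconds);
        if (totalRate == 0.0) {
            return 0.0;
        }
        return ratePerSecond(errorsEarlier, errorsLater, windowSeconds) / totalRate;
    }

    public static void main(String[] args) {
        // Hypothetical readings 300 s apart: 3000 new requests, 180 of them 5xx.
        double ratio = errorRatio(100, 280, 10_000, 13_000, 300);
        System.out.println("error ratio = " + ratio + ", alert fires: " + (ratio > 0.05));
    }
}
```

With these numbers the ratio is about 6%, so the alert condition holds. The rule would still wait for the 2-minute "for" period before firing.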

12. Full Stack Architecture Diagram

┌───────────────────────────────────────────────────────────────────┐
│                Monitoring and Tracing Architecture                │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐                  │
│  │ Service A │    │ Service B │    │ Service C │  ← microservices │
│  │ Actuator  │    │ Actuator  │    │ Actuator  │                  │
│  │ Sleuth    │    │ Sleuth    │    │ Sleuth    │                  │
│  └────┬──┬───┘    └────┬──┬───┘    └────┬──┬───┘                  │
│       │  │             │  │             │  │                      │
│metrics│  │spans metrics│  │spans metrics│  │spans                 │
│       │  │             │  │             │  │                      │
│       ▼  ▼             ▼  ▼             ▼  ▼                      │
│  ┌────────────────────────────┐    ┌─────────────────────┐        │
│  │         Prometheus         │    │        Kafka        │        │
│  │           (pull)           │    │   (zipkin topic)    │        │
│  └────┬───────────────┬───────┘    └──────────┬──────────┘        │
│       │               │                       │                   │
│       ▼               ▼                       ▼                   │
│  ┌──────────┐  ┌──────────────┐    ┌─────────────────────┐        │
│  │ Grafana  │  │ Alertmanager │    │    Zipkin Server    │        │
│  │ (charts) │  │ (alerts)     │    │  (consumes Kafka)   │        │
│  └──────────┘  └──────┬───────┘    └──────────┬──────────┘        │
│                       │                       │                   │
│                       │                  ┌────┴───────────┐       │
│                       ▼                  ▼                ▼       │
│                ┌──────────────┐  ┌───────────────┐  ┌───────────┐ │
│                │ DingTalk,    │  │ Elasticsearch │  │ Zipkin UI │ │
│                │ email alerts │  │ (trace store) │  │ (queries) │ │
│                └──────────────┘  └───────┬───────┘  └───────────┘ │
│                                          │                        │
│                                          ▼                        │
│                                  ┌───────────────┐                │
│                                  │ Kibana        │                │
│                                  │ (logs, traces)│                │
│                                  └───────────────┘                │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

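13. Summary

Component | Responsibility | Core capability
Actuator | Health checks and metric exposure | Whether the service is alive, resource usage
Prometheus | Metric collection and storage | Time-series storage, alert rules
Grafana | Metric visualization | Dashboards, trend analysis
Sleuth | TraceId generation and propagation | Injects the trace context automatically
Zipkin | Trace collection and display | End-to-end trace visualization, locating performance bottlenecks
Kafka (sender) | Transport channel for trace data | Decoupling, load smoothing, reliable delivery
ELK | Log collection and search | Searching logs for a whole request path by TraceId
AlertManager | Alert notification | Notifies people of anomalies in real time

In one sentence: Actuator tells you how the system is doing right now, Sleuth + Zipkin tell you where a request went and where it went wrong, and together with Prometheus + Grafana they form a complete observability stack.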
