
A Complete Guide to Intelligent Document Search with SpringBoot + ElasticSearch

2025-08-01 09:14:37 Author: 墨夶

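Spring Boot is a framework for quickly developing, running, and deploying Spring applications, and Elasticsearch is an open-source, distributed full-text search engine. This article walks through building intelligent document search on top of the two.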

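1. Background and technology choices

In enterprise applications, intelligent search over document content is a frequent requirement. For example:

Automatically extract text from uploaded PDF/Word documents
Support Chinese word segmentation and fuzzy matching
Highlight keywords in search results

Technology choices

Technology	Role
SpringBoot	Rapid construction of the service
ElasticSearch	Full-text search and highlighting
Jieba analysis plugin	Chinese word segmentation
Ingest Attachment Processor Plugin	Document content extraction (PDF, Word, etc.)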

2. Environment setup

2.1 Maven dependencies

<!-- pom.xml -->
<dependencies>
    <!-- Spring Boot basics -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Elasticsearch connectivity -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>

    <!-- File handling utilities -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.3</version>
    </dependency>

    <!-- The Jieba analyzer is an Elasticsearch server plugin (installed in section 3.1), not a Maven dependency, so nothing is declared for it here -->
</dependencies>

2.2 Configuration file

# application.yml
spring:
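  # The legacy spring.data.elasticsearch.cluster-name / cluster-nodes properties belonged to the
  # transport client and are not read by the REST client, so only the REST settings are kept.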
  elasticsearch:
    rest:
      uris: http://localhost:9200
      username: elastic
      password: your_password

3. Core implementation steps

3.1 Installing the ElasticSearch plugins

Ingest Attachment Processor Plugin

# Install the plugin (local ES)
elasticsearch-plugin install ingest-attachment

# Install the plugin (inside a Docker container)
docker exec -it elasticsearch bin/elasticsearch-plugin install ingest-attachment

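Note: the plugin version must match the ES version, and ES must be restarted before the plugin takes effect.

Jieba Chinese analysis plugin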

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-jieba/releases/download/v7.17.0/elasticsearch-analysis-jieba-7.17.0.zip

3.2 Creating the document extraction pipeline

An ElasticSearch ingest pipeline processes the uploaded file content automatically at index time.

3.2.1 Defining the pipeline

PUT _ingest/pipeline/attachment-extract
{
  "description": "Extract attachment content",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "target_field": "attachment",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content",
        "ignore_missing": true
      }
    }
  ]
}

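Key points:

The attachment processor parses the Base64-encoded file content into text, which it writes under the attachment field.
The remove processor deletes the original Base64 field so that only the extracted text is stored.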
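
As a minimal sketch of what the pipeline expects as input, the snippet below Base64-encodes raw file bytes and builds the JSON body that would be sent to the index API. The class AttachmentPayload and its methods are hypothetical names introduced for illustration, and the hand-rolled escaping only covers backslashes and double quotes; a real project would use a JSON library.

```java
import java.util.Base64;

class AttachmentPayload {

    // Encode the raw file bytes as Base64, the input format of the attachment processor.
    static String encode(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    // Escape backslashes and double quotes so the value can sit inside a JSON string.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build the request body; "content" is the field the pipeline definition reads from.
    static String buildBody(String fileName, byte[] fileBytes) {
        return "{\"fileName\":\"" + escape(fileName) + "\",\"content\":\"" + encode(fileBytes) + "\"}";
    }
}
```

Such a body would be indexed with the pipeline query parameter, for example PUT /fileinfo/_doc/1?pipeline=attachment-extract.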

3.3 Defining the index and mapping

The index mapping and settings determine how data is stored and how text is analyzed.

3.3.1 Creating the index

PUT /fileinfo
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "fileName": { "type": "text" },
      "fileType": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "jieba" }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "jieba": {
          "type": "custom",
          "tokenizer": "jieba_tokenizer"
        }
      }
    }
  }
}

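Note: attachment.content should be mapped with a Chinese analyzer. Without one, the default standard analyzer splits Chinese text into single characters, so word-level matching and highlighting degrade noticeably.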
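
To illustrate why the analyzer choice matters, the sketch below contrasts per-character tokens (roughly what the standard analyzer yields for Chinese) with word-level tokens from a toy forward-maximum-matching segmenter. This is not Jieba's actual algorithm; ToySegmenter, the dictionary passed in, and the maximum word length are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ToySegmenter {

    // One token per character (assumes characters in the Basic Multilingual Plane).
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when no dictionary word matches.
    static List<String> maxMatch(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With a dictionary containing 法国 and 红酒, maxMatch splits 法国红酒 into those two words, while perCharacter produces four single-character tokens. A query for 红酒 then matches the word rather than any document that happens to contain both characters.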

3.4 Document handling in Java

3.4.1 File upload endpoint

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.util.*;

@RestController
@RequestMapping("/api/files")
public class FileUploadController {

    // IndexRequest is executed by RestHighLevelClient (the bean defined in section 5.1)
    @Autowired
    private RestHighLevelClient client;

    @PostMapping("/upload")
    public ResponseEntity<String> uploadFile(@RequestParam("file") MultipartFile file) throws IOException {
        // 1. Convert the file to Base64
        String base64Content = Base64.getEncoder().encodeToString(file.getBytes());

        // 2. Build the document
        String id = UUID.randomUUID().toString();
        Map<String, Object> document = new HashMap<>();
        document.put("id", id);
        document.put("fileName", file.getOriginalFilename());
        document.put("fileType", getFileType(file.getOriginalFilename()));
        document.put("content", base64Content);  // Base64 field consumed by the pipeline

        // 3. Index the document through the pipeline
        IndexRequest request = new IndexRequest("fileinfo")
                .id(id)
                .setPipeline("attachment-extract")  // key step: bind the pipeline
                .source(document);

        client.index(request, RequestOptions.DEFAULT);

        return ResponseEntity.ok("File indexed successfully");
    }

    // Returns the lower-cased extension (e.g. "pdf"), or "unknown" when there is none
    private String getFileType(String fileName) {
        if (fileName == null || !fileName.contains(".")) {
            return "unknown";
        }
        return fileName.substring(fileName.lastIndexOf('.') + 1).toLowerCase();
    }
}

Code walkthrough:

Base64.getEncoder() converts the file to a Base64 string, the input format the attachment processor expects.
setPipeline("attachment-extract") runs the predefined pipeline on the document.
client.index() sends the index request through RestHighLevelClient.

3.5 Full-text search with segmentation and highlighting

3.5.1 Search endpoint

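import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;

// Method of FileUploadController; `client` is the injected RestHighLevelClient bean from section 5.1
@GetMapping("/search")
public ResponseEntity<Map<String, Object>> searchFiles(@RequestParam String keyword) throws IOException {
    // 1. Build the query
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.query(QueryBuilders.matchQuery("attachment.content", keyword)
            .analyzer("jieba")  // segment the keyword with Jieba
            .fuzziness("AUTO"));

    // 2. Enable highlighting
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    highlightBuilder.field("attachment.content").preTags("<mark>").postTags("</mark>");
    sourceBuilder.highlighter(highlightBuilder);

    // 3. Run the search
    SearchRequest searchRequest = new SearchRequest("fileinfo");
    searchRequest.source(sourceBuilder);
    SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);

    // 4. Collect the hits together with their first highlighted fragment
    List<Map<String, Object>> results = new ArrayList<>();
    for (SearchHit hit : response.getHits().getHits()) {
        Map<String, Object> source = hit.getSourceAsMap();
        Map<String, HighlightField> highlights = hit.getHighlightFields();
        HighlightField contentHighlight = highlights.get("attachment.content");
        if (contentHighlight != null && contentHighlight.fragments().length > 0) {
            source.put("highlight", contentHighlight.fragments()[0].string());
        }
        results.add(source);
    }

    return ResponseEntity.ok(Collections.singletonMap("results", results));
}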

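Key points:

matchQuery("attachment.content", keyword) runs an analyzed (segmented) match against the content field.
HighlightBuilder sets the highlight tags (here <mark>).
The highlight field of each result holds the highlighted fragment.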
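
Because the highlighted fragment is meant to be rendered as HTML, and its text comes from uploaded files, it is worth escaping before display. The sketch below escapes the whole fragment and then restores only the <mark> tags. HighlightSanitizer is a hypothetical helper, not part of the original code; the highlighter's own encoder option is an alternative worth checking in the ES documentation.

```java
class HighlightSanitizer {

    // Escape the characters that are significant in HTML text content.
    static String escapeHtml(String raw) {
        return raw.replace("&", "&amp;")
                  .replace("<", "&lt;")
                  .replace(">", "&gt;")
                  .replace("\"", "&quot;");
    }

    // Escape the whole fragment, then turn the escaped highlight tags back into real tags.
    static String sanitize(String fragment) {
        return escapeHtml(fragment)
                .replace("&lt;mark&gt;", "<mark>")
                .replace("&lt;/mark&gt;", "</mark>");
    }
}
```

A literal <mark> in the source document would also be restored as a tag, which is harmless for this particular element but would not be for arbitrary tags.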

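4. Performance tuning and caveats

4.1 Caching

ElasticSearch cache: the shard request cache (request_cache) cuts the cost of repeated queries, but by default it only caches requests with size=0 (counts and aggregations), not the hits themselves.
Application-level cache: cache the results of high-frequency searches in Redis.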
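
As a rough illustration of the application-level idea, the sketch below is a small in-memory LRU cache keyed by search keyword. It is a stand-in for Redis rather than a Redis client; the class name SearchResultCache and the capacity are assumptions, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SearchResultCache<V> extends LinkedHashMap<String, V> {

    private final int capacity;

    SearchResultCache(int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > capacity;
    }
}
```

A search endpoint could check the cache for the keyword first and only query ElasticSearch on a miss. Cached entries also need invalidation or a time-to-live once new documents are indexed, which this sketch leaves out.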

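4.2 Pagination and filtering

// Pagination example
sourceBuilder.from(0).size(10);  // first page, 10 hits per page
sourceBuilder.sort(SortBuilders.fieldSort("createTime").order(SortOrder.DESC));  // newest first; assumes a createTime date field, which the mapping in 3.3.1 does not define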
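
The from value is a zero-based offset, not a page number, and from + size may not exceed index.max_result_window, which defaults to 10,000. A minimal sketch of the conversion follows; PageWindow is a hypothetical helper name, and the constant only reflects the default setting.

```java
class PageWindow {

    // Default value of the index.max_result_window index setting.
    static final int MAX_RESULT_WINDOW = 10_000;

    // Convert a 1-based page number into the zero-based "from" offset.
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be positive");
        }
        long from = (long) (page - 1) * size;
        if (from + size > MAX_RESULT_WINDOW) {
            throw new IllegalArgumentException("from + size exceeds " + MAX_RESULT_WINDOW);
        }
        return (int) from;
    }
}
```

The result would be passed as sourceBuilder.from(PageWindow.fromOffset(page, size)).size(size). Paging deeper than the window calls for a different mechanism such as search_after.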

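4.3 Security and fault tolerance

File type validation: reject file types that are not allowed before indexing.
Exception handling: catch ElasticsearchException and return a friendly error message.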
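
File type validation could look like the sketch below, which checks both the extension and the leading bytes of the upload. The whitelist (pdf, docx) and the class name UploadValidator are assumptions for illustration. The byte check relies on PDF files starting with "%PDF" and on .docx files being ZIP archives that start with "PK".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class UploadValidator {

    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("pdf", "docx"));

    // Lower-cased extension without the dot, or "" when the name has none.
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True when the leading bytes equal the given ASCII signature.
    static boolean startsWith(byte[] data, String signature) {
        byte[] sig = signature.getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < sig.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, sig.length), sig);
    }

    // Accept only whitelisted extensions whose content starts with the matching signature.
    static boolean isAllowed(String fileName, byte[] data) {
        String ext = extensionOf(fileName);
        if (!ALLOWED.contains(ext)) {
            return false;
        }
        return ext.equals("pdf") ? startsWith(data, "%PDF") : startsWith(data, "PK");
    }
}
```

In the upload endpoint this would run before Base64 encoding, for example returning a 400 response when isAllowed(file.getOriginalFilename(), file.getBytes()) is false.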

5. Putting the code together

5.1 Configuration class (ElasticSearch connection)

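import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;

@Configuration
public class ElasticsearchConfig {

    @Value("${spring.elasticsearch.rest.uris}")
    private String esUri;

    @Bean
    public RestHighLevelClient elasticsearchClient() {
        // HttpHost.create parses the full URI (e.g. http://localhost:9200) into scheme, host and port.
        // Note: this hand-built client does not send the username/password from application.yml.
        return new RestHighLevelClient(RestClient.builder(HttpHost.create(esUri)));
    }

    @Bean
    public ElasticsearchRestTemplate elasticsearchRestTemplate(RestHighLevelClient client) {
        return new ElasticsearchRestTemplate(client);
    }
}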
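
The configured value is a full URI such as http://localhost:9200, so splitting it on ":" yields "http", "//localhost" and "9200" rather than a host and a port. As a minimal sketch of robust parsing with only the JDK, java.net.URI can separate the parts. EsEndpoint is a hypothetical holder class, and the fallback to port 9200 is an assumption based on the ES default HTTP port.

```java
import java.net.URI;

class EsEndpoint {

    final String scheme;
    final String host;
    final int port;

    EsEndpoint(String scheme, String host, int port) {
        this.scheme = scheme;
        this.host = host;
        this.port = port;
    }

    // Parse a value such as "http://localhost:9200"; 9200 is assumed when no port is given.
    static EsEndpoint parse(String uri) {
        URI parsed = URI.create(uri);
        if (parsed.getScheme() == null || parsed.getHost() == null) {
            throw new IllegalArgumentException("Expected scheme://host[:port], got: " + uri);
        }
        int port = parsed.getPort() == -1 ? 9200 : parsed.getPort();
        return new EsEndpoint(parsed.getScheme(), parsed.getHost(), port);
    }
}
```

The three parts map directly onto the HttpHost(host, port, scheme) constructor used when building the REST client.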

5.2 Example response with highlighting

{
  "results": [
    {
      "id": "123",
      "fileName": "进口红酒.pdf",
      "fileType": "pdf",
      "attachment": {
        "content": "这款红酒产自法国波尔多地区，口感醇厚..."
      },
      "highlight": "这款红酒产自法国波尔多地区，<mark>口感醇厚</mark>..."
    }
  ]
}

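6. The document search loop, end to end

Step	Core code/config	Purpose
1. Dependencies	pom.xml	Add the Spring Boot and ElasticSearch dependencies
2. Pipeline definition	PUT _ingest/pipeline/attachment-extract	Extract file content automatically
3. Index mapping	PUT /fileinfo	Define field types and analysis rules
4. File upload	FileUploadController.uploadFile()	Convert the file to Base64 and index it
5. Full-text search	FileUploadController.searchFiles()	Search with Jieba segmentation and highlighting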

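7. Next steps: try it yourself

"Document search no longer has to be a hard problem. Build your own intelligent search system now!"

Try the basics: upload a PDF and verify that its content was extracted.
Tune the segmentation: add a custom Jieba dictionary to improve match accuracy.
Add search dimensions: filter by file type and time range.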
