Converting DOCX to Markdown with Python and Flask

2026-01-29 14:59:54 Author: weixin_46244623

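Summary

This article demonstrates a project with a Flask backend and a simple front-end page that converts an uploaded .docx file to Markdown and extracts its images for download or preview.

It covers how to run the project, a walkthrough of the key code (app.py), the features of the front-end page templates/index.html, and common caveats.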

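Project Overview

Features: upload a .docx file and get back the Markdown text plus the extracted images (each image is served at a URL by the API). The front end supports preview and download as a ZIP archive.

Use cases: quickly generating Markdown when migrating Word content to a blog, technical documentation, or a knowledge base.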

Environment and Dependencies

Python 3.8+ (Python 3.13 also works for this example)

Dependencies are listed in requirements.txt:

python-docx==0.8.11
Flask==2.3.0
Flask-CORS==4.0.0
Werkzeug==2.3.0

Installation commands:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

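Project Structure (brief)

app.py: the Flask backend. It receives uploads, parses the DOCX, extracts images, and returns Markdown.

templates/index.html: the front-end page, built with Vue 3. It provides upload, display, and download.

uploads/: stores uploaded files and extracted images (created at runtime).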

Full Code

#!/usr/bin/env python3
"""
Flask API backend that converts DOCX files to Markdown
"""

from flask import Flask, request, jsonify, send_file
from flask_cors import CORS
from docx import Document
import os
from pathlib import Path
from werkzeug.utils import secure_filename

app = Flask(__name__, static_folder='templates', static_url_path='')

CORS(app)

# Upload folder configuration
UPLOAD_FOLDER = 'uploads'
# python-docx can only read .docx; the legacy binary .doc format is not supported
ALLOWED_EXTENSIONS = {'docx'}
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB

os.makedirs(UPLOAD_FOLDER, exist_ok=True)

app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.config['MAX_CONTENT_LENGTH'] = MAX_FILE_SIZE


def allowed_file(filename):
    """Check whether the file extension is allowed"""
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


def extract_images_from_run(run, image_dir, image_counter, doc_part=None):
    """Extract images from a run element"""
    images = []
    
    for drawing in run.element.findall('.//{http://schemas.openxmlformats.org/wordprocessingml/2006/main}drawing'):
        for blip in drawing.findall('.//{http://schemas.openxmlformats.org/drawingml/2006/main}blip'):
            embed_id = blip.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
            if embed_id:
                try:
                    # Look up the image part through the relationship id
                    image_part = run.part.rels[embed_id].target_part
                    image_data = image_part.blob
                    
                    # Determine the image extension from the content type
                    content_type = image_part.content_type
                    ext_map = {
                        'image/jpeg': 'jpg',
                        'image/png': 'png',
                        'image/gif': 'gif',
                        'image/bmp': 'bmp',
                        'image/tiff': 'tiff',
                        'image/webp': 'webp'
                    }
                    ext = ext_map.get(content_type, 'png')
                    
                    image_counter += 1
                    image_filename = f"image_{image_counter}.{ext}"
                    image_path = os.path.join(image_dir, image_filename)
                    
                    # Save the image
                    with open(image_path, 'wb') as f:
                        f.write(image_data)
                    
                    images.append({
                        'filename': image_filename,
                        'path': image_path,
                        'counter': image_counter
                    })
                except Exception as e:
                    # Log and skip images that cannot be extracted instead of failing silently
                    app.logger.warning("Failed to extract image %s: %s", embed_id, e)
    
    return images, image_counter


def convert_docx_to_markdown(docx_path, image_dir=None):
    """Convert a DOCX file to Markdown"""
    
    if image_dir is None:
        image_dir = os.path.join(UPLOAD_FOLDER, 'images')
    
    os.makedirs(image_dir, exist_ok=True)
    
    # Load the DOCX document
    doc = Document(docx_path)
    
    markdown_content = []
    image_counter = 0
    
    # Process paragraphs and images
    for para in doc.paragraphs:
        # Check for images in the paragraph
        for run in para.runs:
            images, image_counter = extract_images_from_run(run, image_dir, image_counter)
            for img in images:
                # Use "/" explicitly: convert() later searches for "images/<name>",
                # and os.path.join would produce a backslash on Windows
                rel_path = f"images/{img['filename']}"
                markdown_content.append(f"![image]({rel_path})")
                markdown_content.append("")
        
        text = para.text.strip()
        
        if not text:
            markdown_content.append("")
            continue
        
        # Check the paragraph style
        style = para.style.name if para.style else ""
        
        # Handle headings
        if "Heading 1" in style:
            markdown_content.append(f"# {text}")
        elif "Heading 2" in style:
            markdown_content.append(f"## {text}")
        elif "Heading 3" in style:
            markdown_content.append(f"### {text}")
        elif "Heading 4" in style:
            markdown_content.append(f"#### {text}")
        elif "Heading 5" in style:
            markdown_content.append(f"##### {text}")
        elif "Heading 6" in style:
            markdown_content.append(f"###### {text}")
        else:
            # Handle inline text formatting
            formatted_text = process_runs(para.runs)
            if formatted_text:
                markdown_content.append(formatted_text)
            else:
                markdown_content.append(text)
    
    # Process tables (they are appended after all paragraphs, not at their original position)
    for table in doc.tables:
        markdown_content.append("")
        markdown_table, image_counter = convert_table_to_markdown(table, image_dir, image_counter)
        markdown_content.extend(markdown_table)
        markdown_content.append("")
    
    result = "\n".join(markdown_content)
    
    return result, image_counter


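def process_runs(runs):
    """Apply bold, italic, and underline formatting to the text runs"""
    result = []
    
    for run in runs:
        text = run.text
        
        if not text:
            continue
        
        # Bold
        if run.bold:
            text = f"**{text}**"
        
        # Italic
        if run.italic:
            text = f"*{text}*"
        
        # Underline: Markdown has no underline syntax ("__text__" renders as bold),
        # so fall back to inline HTML
        if run.underline:
            text = f"<u>{text}</u>"
        
        result.append(text)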
    
    return "".join(result).strip()


def convert_table_to_markdown(table, image_dir, image_counter):
    """Convert a DOCX table to Markdown table syntax"""
    markdown_lines = []
    
    # Process each row
    for i, row in enumerate(table.rows):
        cells = row.cells
        row_content = []
        
        for cell in cells:
            # Collect the text and images of the cell
            cell_parts = []
            for para in cell.paragraphs:
                # Check for images in the paragraph
                for run in para.runs:
                    images, image_counter = extract_images_from_run(run, image_dir, image_counter)
                    for img in images:
                        rel_path = f"images/{img['filename']}"
                        cell_parts.append(f"![img]({rel_path})")
                
                # Escape "|" so cell text cannot break the table layout
                para_text = para.text.strip().replace("|", "\\|")
                if para_text:
                    cell_parts.append(para_text)
            
            cell_text = " ".join(cell_parts).strip()
            row_content.append(cell_text)
        
        # Add the row to the Markdown output
        markdown_lines.append("| " + " | ".join(row_content) + " |")
        
        # Add the separator after the header row (the first row)
        if i == 0:
            separator = "|" + "|".join([" --- " for _ in row_content]) + "|"
            markdown_lines.append(separator)
    
    return markdown_lines, image_counter


@app.route('/', methods=['GET'])
def index():
    """Serve the main page"""
    return send_file('templates/index.html', mimetype='text/html')


@app.route('/api/health', methods=['GET'])
def health():
    """Health check"""
    return jsonify({'status': 'ok'})


@app.route('/api/convert', methods=['POST'])
def convert():
    """Convert a DOCX file to Markdown and images"""
    try:
        # Check that a file was uploaded
        if 'file' not in request.files:
            return jsonify({'status': 'error', 'message': 'No file uploaded'}), 400
        
        file = request.files['file']
        
        if file.filename == '':
            return jsonify({'status': 'error', 'message': 'Empty filename'}), 400
        
        if not allowed_file(file.filename):
            return jsonify({'status': 'error', 'message': 'Only .docx files are supported'}), 400
        
        # Save the uploaded file
        filename = secure_filename(file.filename)
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
        file.save(filepath)
        
        # Create a dedicated image directory for this document
        image_dir = os.path.join(app.config['UPLOAD_FOLDER'], Path(filename).stem + '_images')
        
        # Convert DOCX to Markdown
        markdown_content, image_count = convert_docx_to_markdown(filepath, image_dir)
        
        # Collect the list of extracted images
        images = []
        if os.path.exists(image_dir):
            for img_file in os.listdir(image_dir):
                if os.path.isfile(os.path.join(image_dir, img_file)):
                    images.append({
                        'name': img_file,
                        'path': f"/api/image/{Path(filename).stem + '_images'}/{img_file}"
                    })
        
        # Replace the image paths in the Markdown
        for img in images:
            # Swap the relative path for the API path
            old_path = f"images/{img['name']}"
            markdown_content = markdown_content.replace(old_path, img['path'])
        
        # Clean up the uploaded docx file (optional)
        try:
            os.remove(filepath)
        except OSError:
            pass
        
        return jsonify({
            'status': 'success',
            'markdown': markdown_content,
            'images': images,
            'image_count': image_count
        })
    
    except Exception as e:
        return jsonify({'status': 'error', 'message': str(e)}), 500


@app.route('/api/image/<path:filepath>', methods=['GET'])
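def get_image(filepath):
    """Serve an extracted image"""
    try:
        upload_root = os.path.abspath(app.config['UPLOAD_FOLDER'])
        full_path = os.path.abspath(os.path.join(upload_root, filepath))
        
        # Safety check: the resolved path must stay inside the upload folder.
        # A plain startswith() would also accept sibling folders such as "uploads_old".
        if os.path.commonpath([upload_root, full_path]) != upload_root:
            return jsonify({'status': 'error', 'message': 'Forbidden request'}), 403
        
        if not os.path.exists(full_path):
            return jsonify({'status': 'error', 'message': 'File not found'}), 404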
        
        return send_file(full_path)
    
    except Exception as e:
        return jsonify({'status': 'error', 'message': str(e)}), 500


if __name__ == '__main__':
    app.run(debug=False, host='0.0.0.0', port=5000)

Running the Project

Start the service:

python app.py
# or
python3 app.py

The service listens on 0.0.0.0:5000 by default. Open http://localhost:5000/ in a browser to use the front-end page.

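Key Code Walkthrough

1. Application and configuration

Create the Flask instance: app = Flask(__name__, static_folder='templates', static_url_path='')
Configure the upload directory and the maximum file size: UPLOAD_FOLDER = 'uploads' and MAX_FILE_SIZE = 50 * 1024 * 1024, the latter applied through app.config['MAX_CONTENT_LENGTH'].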

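2. Allowed file check

allowed_file(filename): accepts only file names whose extension is listed in ALLOWED_EXTENSIONS. python-docx reads only the .docx format, so a legacy .doc upload cannot be converted.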
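
An extension check trusts whatever name the client sends. As a minimal sketch that is not part of the original app.py, the check could be complemented by looking at the content: a .docx file is a ZIP archive that contains word/document.xml. The helper names looks_like_docx and make_fake_docx are hypothetical.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Return True if the bytes look like a DOCX package.

    Hypothetical helper: it only checks that the payload is a ZIP archive
    containing word/document.xml, not that the document is valid.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return "word/document.xml" in archive.namelist()


def make_fake_docx() -> bytes:
    """Build a tiny in-memory ZIP that mimics the DOCX layout (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
    return buffer.getvalue()
```

In the upload route, such a check would run on the uploaded bytes before the file is handed to python-docx.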

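3. Image extraction: extract_images_from_run(run, image_dir, image_counter, doc_part=None)

The function looks for drawing/blip elements inside the run, uses the relationship id (the embed attribute) to locate the image part, and writes its blob to a file. The file extension is inferred from content_type.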
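
The drawing/blip lookup can be tried without a Word file. The sketch below uses only the standard library on a simplified, hand-written run XML (an assumption for illustration; real documents nest more elements between w:drawing and a:blip). The names find_embed_ids and extension_for are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for WordprocessingML, DrawingML, and relationships.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Simplified sample run; not taken from a real document.
SAMPLE_RUN_XML = f"""
<w:r xmlns:w="{W_NS}" xmlns:a="{A_NS}" xmlns:r="{R_NS}">
  <w:drawing>
    <a:graphic>
      <a:blip r:embed="rId5"/>
    </a:graphic>
  </w:drawing>
</w:r>
"""


def find_embed_ids(run_xml: str) -> list:
    """Collect the r:embed relationship ids of every a:blip inside w:drawing."""
    run = ET.fromstring(run_xml)
    embed_ids = []
    for drawing in run.findall(f".//{{{W_NS}}}drawing"):
        for blip in drawing.findall(f".//{{{A_NS}}}blip"):
            embed_id = blip.get(f"{{{R_NS}}}embed")
            if embed_id:
                embed_ids.append(embed_id)
    return embed_ids


# Subset of the content-type map used in app.py.
EXT_MAP = {"image/jpeg": "jpg", "image/png": "png", "image/gif": "gif"}


def extension_for(content_type: str) -> str:
    """Map a MIME type to an extension, falling back to png as app.py does."""
    return EXT_MAP.get(content_type, "png")
```

The png fallback mirrors app.py. It will mislabel any format missing from the map, so extending the map or skipping unknown types may be preferable.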

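4. DOCX to Markdown: convert_docx_to_markdown(docx_path, image_dir=None)

Loads the document with python-docx's Document(docx_path).
Iterates over doc.paragraphs: extracts images from each run and emits a Markdown heading when the paragraph style is a heading (such as Heading 1). Non-heading paragraphs go through process_runs for bold, italic, and underline. It then iterates over doc.tables and converts each table with convert_table_to_markdown. Because tables are handled after all paragraphs, they appear at the end of the output rather than at their original position in the document.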
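
The chain of heading checks can also be written as a single lookup. This is a sketch rather than the article's code, assuming the style names are the English "Heading N" names that app.py tests for; heading_prefix and paragraph_to_markdown are hypothetical names.

```python
import re

# Matches the style names "Heading 1" through "Heading 6" exactly.
HEADING_RE = re.compile(r"^Heading ([1-6])$")


def heading_prefix(style_name: str) -> str:
    """Return the Markdown prefix ('#', '##', ...) for a heading style, or ''."""
    match = HEADING_RE.match(style_name or "")
    if not match:
        return ""
    return "#" * int(match.group(1))


def paragraph_to_markdown(style_name: str, text: str) -> str:
    """Prefix heading paragraphs; return other paragraphs unchanged."""
    prefix = heading_prefix(style_name)
    return f"{prefix} {text}" if prefix else text
```

Unlike a substring test, the anchored pattern does not treat a custom style such as "My Heading 1 Box" as a heading. Whether that is desirable depends on the documents being converted.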

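5. Text formatting: process_runs(runs)

Wraps each run's text in markers according to run.bold, run.italic, and run.underline, then joins the results. Note that Markdown has no underline syntax: __text__ renders as bold, so inline HTML (<u>text</u>) is the usual way to express underline.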
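
One detail worth handling is whitespace at the edges of a run: in CommonMark, "**world **" with a space before the closing marker does not render as bold. The sketch below keeps edge whitespace outside the markers. FakeRun is a hypothetical stand-in for a python-docx run, and format_run/format_runs are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class FakeRun:
    """Stand-in for a python-docx Run, carrying only the fields used here."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False


def format_run(run) -> str:
    """Wrap a run's text in markers, keeping edge whitespace outside them."""
    text = run.text
    core = text.strip()
    if not core:
        return text  # whitespace-only runs pass through unchanged
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    if run.bold:
        core = f"**{core}**"
    if run.italic:
        core = f"*{core}*"
    if run.underline:
        core = f"<u>{core}</u>"
    return f"{lead}{core}{trail}"


def format_runs(runs) -> str:
    """Format every run and join the pieces into one paragraph string."""
    return "".join(format_run(r) for r in runs).strip()
```

Adjacent runs with the same formatting still produce separate marker pairs. Merging them would be a further refinement.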

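6. Table conversion: convert_table_to_markdown(table, image_dir, image_counter)

Processes the table row by row and cell by cell. Cells may contain images, which are extracted the same way. The output is a Markdown table with a separator line after the first row, which is treated as the header.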
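
The table layout itself can be illustrated independently of python-docx. The following sketch takes rows as plain lists of strings (an assumption for illustration) and shows two safeguards: escaping "|" inside cells and padding short rows. escape_cell and rows_to_markdown are hypothetical names.

```python
def escape_cell(text: str) -> str:
    """Make text safe for a Markdown table cell."""
    return text.replace("|", "\\|").replace("\n", " ").strip()


def rows_to_markdown(rows) -> list:
    """Convert a list of rows (each a list of strings) to Markdown table lines."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    lines = []
    for index, row in enumerate(rows):
        cells = [escape_cell(cell) for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows to full width
        lines.append("| " + " | ".join(cells) + " |")
        if index == 0:
            # Separator after the first row, which is treated as the header
            lines.append("| " + " | ".join(["---"] * width) + " |")
    return lines
```

For example, rows_to_markdown([["Name", "Value"], ["a|b", "1"]]) yields a header line, a separator line, and one data row in which the pipe is escaped.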

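7. API routes

GET /: returns the front-end page templates/index.html.
GET /api/health: health check; returns {'status': 'ok'}.
POST /api/convert: accepts an uploaded file (form field file), saves it, creates the image directory {stem}_images, runs the conversion, and returns JSON containing markdown, the images list, and image_count.
GET /api/image/<path:filepath>: serves image files from uploads/ after validating the path.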
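
Path validation for the image route deserves care. A string-prefix test on paths accepts sibling directories whose names share the prefix, while comparing with os.path.commonpath does not. A minimal sketch with hypothetical helper names:

```python
import os


def is_within(base_dir: str, requested: str) -> bool:
    """Return True if base_dir/requested resolves to a path inside base_dir."""
    base = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base, requested))
    try:
        return os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised, for example, on Windows when the paths are on different drives.
        return False


def naive_check(base_dir: str, requested: str) -> bool:
    """A prefix-based check, shown only for comparison."""
    base = os.path.abspath(base_dir)
    target = os.path.abspath(os.path.join(base, requested))
    return target.startswith(base)
```

With base directory "uploads", the request "../uploads_old/image_1.png" passes naive_check but is rejected by is_within.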

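Front End: templates/index.html Overview

Built with Vue 3 (loaded from a CDN). Interactions include:

Drag and drop or click to upload a .docx file (the front end validates the extension).
Shows the selected file's name and size, uploads it to /api/convert, and displays a loading state while waiting for the result.
After a successful conversion, a dialog shows the Markdown content, a preview (simple parsing), and the image list.
Supports downloading the Markdown and images as a ZIP archive (using JSZip).

Rendering notes:

renderMarkdown: a simple Markdown-to-HTML conversion (basic support for headings, bold, italic, code blocks, images, links, and lists) that also replaces image paths with the img.path returned by the backend.
Download logic: writes document.md into the zip first, then fetches each image URL and writes its Blob into the zip, and finally triggers the browser download.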
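
The front end builds the ZIP in the browser with JSZip. For readers who would rather produce the archive in a script or on the server, the same packaging steps can be sketched in Python with the standard zipfile module. This is an alternative, not what index.html does; build_zip and the images/ folder inside the archive are assumptions.

```python
import io
import zipfile


def build_zip(markdown: str, images: dict) -> bytes:
    """Pack document.md plus images (name -> bytes) into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Write the Markdown first, then each image under images/
        archive.writestr("document.md", markdown)
        for name, data in images.items():
            archive.writestr(f"images/{name}", data)
    return buffer.getvalue()
```

The returned bytes could be written to disk or sent as a response body. Image URLs inside the Markdown would need to point at the images/ folder for the archive to be self-contained.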

Example: Calling the API with curl

curl -F "file=@/path/to/test.docx" http://localhost:5000/api/convert

Once the response arrives, the images can be accessed directly:

http://localhost:5000/api/image/<yourfile_stem>_images/image_1.png
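
The same call can be made from Python without third-party packages. The sketch below assumes the service is running locally on port 5000 and that the form field is named file, as in the curl example; build_multipart and convert_via_api are hypothetical helpers.

```python
import json
import os
import uuid
import urllib.request


def build_multipart(field: str, filename: str, data: bytes):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_via_api(docx_path: str, url: str = "http://localhost:5000/api/convert") -> dict:
    """POST a .docx file to the running service and return the decoded JSON."""
    with open(docx_path, "rb") as f:
        body, content_type = build_multipart("file", os.path.basename(docx_path), f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling convert_via_api("/path/to/test.docx") returns the dictionary with the markdown, images, and image_count keys. urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so error replies need a try/except around the call.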

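Common Issues and Caveats

python-docx does not preserve Word's complex layout and styles in full. Complex structures (nested lists, tables with merged cells) may need extra handling.

Images are located through the relationship table (rels) and drawing/blip nodes. Images embedded by other means (for example OLE objects) may not be extracted.

With concurrent uploads, keep the server's disk and memory limits in mind. For production, consider:

Using a temporary directory and cleaning up expired files regularly;
Setting upload size limits and timeouts at a reverse proxy (such as Nginx);
Storing images and generated files in object storage (OSS/S3).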
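
The cleanup of expired files mentioned above could look roughly like this. It is a sketch under assumptions: remove_expired is a hypothetical helper, file age is judged by modification time, and scheduling (cron, a background thread, and so on) is left out.

```python
import os
import time


def remove_expired(directory: str, max_age_seconds: float, now=None) -> list:
    """Delete files under directory older than max_age_seconds; return removed paths."""
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if now - os.path.getmtime(path) > max_age_seconds:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                # The file may have vanished or be in use; skip it.
                continue
    return removed
```

For example, remove_expired("uploads", 3600) would delete extracted images that are more than an hour old. Empty {stem}_images directories are left in place by this sketch.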
