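A One-Command Docker Ops Triage Shell Script: 6 Functions That Handle 90% of Container Production Incidents
Author: Flying_Fish_Xuan
Introduction
Backend developers, Linux operations engineers, and cloud-native engineers cannot avoid Docker container operations in their daily work: containers that restart for no obvious reason, port conflicts, full server disks, failed DNS resolution inside containers, bloated images, container security and permission compliance checks, and so on.
Traditional troubleshooting means repeatedly typing scattered commands such as docker ps, docker inspect, docker logs, ss, and du, which is repetitive, inefficient, and makes it easy to miss a key check.
Today I am sharing a self-developed Docker triage shell script that I use in production. It is written in plain Bash and depends only on the Docker CLI and standard Linux tools (the disk checks rely on GNU extensions such as find -printf, so minimal or non-GNU userlands may need adjustments). It wraps six core operations scenarios so that one command runs a full set of diagnostics, and each module ends with troubleshooting hints that help newcomers narrow down production problems quickly.
I. Overview of Core Capabilities
The script has six built-in subcommands that cover the most frequent scenarios in day-to-day Docker operations: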
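| Command | Scenario | Core capabilities |
|---|---|---|
| triage restart <container> | Abnormal container restarts and crashes | Container state, exit code, OOM kill flag, logs, restart policy, health check |
| triage port <port> | Port conflicts, broken port mappings | Docker port mappings, host listening processes, docker-proxy processes |
| triage disk | Server disk full | Docker disk usage summary, per-directory usage, ranking of large container log files |
| triage net | Container network / DNS resolution failures | Network list, container vs. host DNS comparison, routes, hints on MTU blackholes |
| triage image | Bloated images, build cache cleanup | Docker version, large-image ranking, build cache analysis, image optimization advice |
| triage sec <container> | Container security baseline check | Privileged mode, read-only root filesystem, mounts, Linux capabilities, security options |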
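II. Installing the Script (Done in a Minute)
1. Save the script
Create a file named triage and paste in the full source code from the end of this article.
2. Make it available system-wide
# Move it to a directory on the global PATH (use sudo if you are not root)
mv triage /usr/local/bin/
# Add execute permission
chmod +x /usr/local/bin/triage
3. Verify
Type the command in a terminal to see the help text:
triage
III. Basic Usage
# Show help
triage -h
triage --help
# 1. Troubleshoot container restarts/crashes
triage restart <container name or ID>
# 2. Troubleshoot port conflicts/mappings
triage port 8080
# 3. Troubleshoot Docker disk usage
triage disk
# 4. Troubleshoot container network/DNS failures
triage net
# 5. Inspect image sizes and build cache
triage image
# 6. Run the container security check
triage sec <container name or ID>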
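IV. The Six Modules in Detail
1. triage restart: container restart triage
Targets: containers that restart on their own, fail to start, get OOM-killed, or fail health checks
- Prints a status table filtered to the matching container
- Extracts the container state, exit code, OOMKilled flag, and start/finish times
- Prints the last 200 lines of container logs
- Shows the restart policy and health check configuration
- Built-in hints: exit code 137 means the process was killed by SIGKILL (128 + 9), most often by the OOM killer; repeated health check failures may lead an orchestrator to restart the container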
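The 137 hint generalizes: an exit code above 128 usually means the process was terminated by a signal whose number is the code minus 128. The following is a minimal sketch of that arithmetic. explain_exit_code is a hypothetical helper, not part of the triage script, and it only interprets the number without contacting Docker.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of triage): interpret a container exit code.
explain_exit_code() {
  local code="$1"
  if (( code == 0 )); then
    echo "exit ${code}: clean exit"
  elif (( code > 128 && code <= 192 )); then
    local sig=$(( code - 128 ))
    local name
    # The bash builtin "kill -l <number>" prints the signal name, e.g. KILL.
    name="$(kill -l "${sig}" 2>/dev/null || echo "unknown")"
    echo "exit ${code}: terminated by signal ${sig} (SIG${name})"
  else
    echo "exit ${code}: application error, check the logs"
  fi
}

explain_exit_code 137
explain_exit_code 143
explain_exit_code 1
```

In practice, cross-check a 137 against OOMKilled=true in the inspect output, because docker kill and a docker stop that exceeds its grace period also end the process with SIGKILL.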
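2. triage port: port triage
Targets: ports already in use, Docker port mappings that do not take effect, containers unreachable from outside
- Filters the Docker containers that publish the target port
- Checks host listeners on the port with ss (or netstat)
- Looks for the docker-proxy process
- Explains the difference between listening on 127.0.0.1 and 0.0.0.0, and how the service inside the container should bind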
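The 127.0.0.1 versus 0.0.0.0 distinction can be read straight off the listener table. The sketch below labels each listener on a port by how widely it is reachable. classify_listeners is a hypothetical helper, the sample lines are made up, and the column position assumes the layout of ss -lnt (TCP only, so there is no Netid column).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `ss -lnt`-style lines on stdin and label each
# listener on the given port by how widely it is reachable.
classify_listeners() {
  local port="$1"
  awk -v p="${port}" '
    {
      addr = $4                          # 4th column: Local Address:Port
      n = split(addr, parts, ":")
      if (parts[n] != p) next            # text after the last ":" is the port
      host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
      if (host == "127.0.0.1" || host == "[::1]")
        print addr " -> loopback only (local clients)"
      else if (host == "0.0.0.0" || host == "[::]" || host == "*")
        print addr " -> all interfaces (reachable externally, firewall permitting)"
      else
        print addr " -> bound to one address"
    }'
}

# Made-up sample input in the shape of `ss -lnt` output.
classify_listeners 8080 <<'EOF'
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    127.0.0.1:8080     0.0.0.0:*
LISTEN 0      128    0.0.0.0:8081       0.0.0.0:*
LISTEN 0      4096   [::]:8080          [::]:*
EOF
```

On a real host the input would come from ss -lnt itself, for example ss -lnt | classify_listeners 8080.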
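3. triage disk: disk triage
Targets: server disk alerts, Docker consuming large amounts of space for no obvious reason
- Prints the global disk usage summary from docker system df
- Breaks down usage under /var/lib/docker by directory
- If /var/lib/docker does not exist (for example with a custom data-root), prints the Docker Root Dir reported by docker info so you know where to look
- Measures the container json-file logs and lists the 20 largest
- Suggests cleanup commands and a log size cap (max-size/max-file)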
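The log size cap is set with the json-file driver's max-size and max-file options. The sketch below only prints a candidate daemon.json fragment. The 10m and 3 values are illustrative assumptions, and print_log_cap_config is a hypothetical helper that deliberately does not touch /etc/docker/daemon.json.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a daemon.json fragment that caps json-file logs.
# Arguments: max size per log file (default 10m), number of files kept (default 3).
print_log_cap_config() {
  local max_size="${1:-10m}"
  local max_file="${2:-3}"
  cat <<EOF
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "${max_size}",
    "max-file": "${max_file}"
  }
}
EOF
}

print_log_cap_config 10m 3
```

Merge the keys into the existing daemon.json by hand and restart the Docker daemon. The cap applies only to containers created afterwards, so existing containers have to be recreated to pick it up.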
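4. triage net: network triage
Targets: containers that can ping an IP but cannot resolve domain names, broken cross-container networking
- Lists all Docker networks and key environment information
- Reads the DNS and route configuration of the first running container
- Compares the resolv.conf of the host and the container
- Points out common pitfalls such as MTU blackholes and iptables forwarding rules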
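The host-versus-container resolv.conf comparison can be automated rather than eyeballed. The following sketch extracts the nameserver entries from two files and describes the container side. nameservers and compare_dns are hypothetical helpers, and the sample files are made up; 127.0.0.11 is the address of Docker's embedded DNS server on user-defined networks.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the nameserver addresses found in resolv.conf text on stdin.
nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}

# Hypothetical helper: compare two resolv.conf files and describe the container side.
compare_dns() {
  local host_file="$1" container_file="$2"
  local host_ns container_ns
  host_ns="$(nameservers < "${host_file}" | sort | tr '\n' ' ')"
  container_ns="$(nameservers < "${container_file}" | sort | tr '\n' ' ')"
  echo "host:      ${host_ns}"
  echo "container: ${container_ns}"
  if [[ " ${container_ns}" == *" 127.0.0.11 "* ]]; then
    echo "note: 127.0.0.11 is Docker's embedded DNS (user-defined networks)"
  elif [[ "${host_ns}" != "${container_ns}" ]]; then
    echo "note: container and host use different resolvers"
  else
    echo "note: container inherits the host resolvers"
  fi
}

# Made-up sample files standing in for the host and container resolv.conf.
tmp="$(mktemp -d)"
printf 'nameserver 10.0.0.2\nsearch example.internal\n' > "${tmp}/host"
printf 'nameserver 127.0.0.11\noptions ndots:0\n' > "${tmp}/container"
compare_dns "${tmp}/host" "${tmp}/container"
rm -rf "${tmp}"
```

For a live check, the container file can be captured with docker exec <container> cat /etc/resolv.conf and compared against the host's /etc/resolv.conf.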
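5. triage image: image triage
Targets: oversized images, accumulated build cache, version compatibility issues
- Shows the Docker client and server versions
- Lists the 20 largest images
- Analyzes build cache and image disk usage
- Recommends multi-stage builds, layer optimization, and pinned base image versions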
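docker images lists the most recently created images first, so a "largest images" ranking needs an explicit sort on the size column. A minimal sketch, assuming GNU sort (its -h option orders human-readable sizes); rank_by_size is a hypothetical helper and the sample rows are made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rank "SIZE<TAB>IMAGE" lines on stdin, largest first.
# Relies on sort -h, which orders sizes such as 12.3kB < 77.8MB < 1.2GB.
rank_by_size() {
  local top="${1:-20}"
  # awk reads all of sort's output before the pipeline ends, so sort never
  # receives SIGPIPE (which would count as a failure under pipefail).
  sort -rh | awk -v n="${top}" 'NR <= n'
}

# Made-up sample in the shape of:
#   docker images --format '{{.Size}}\t{{.Repository}}:{{.Tag}}'
printf '%s\t%s\n' \
  '77.8MB' 'example/web:1.0' \
  '1.2GB'  'example/api:latest' \
  '12.3kB' 'example/hello:latest' \
  '523MB'  'example/worker:2.3' \
  | rank_by_size 3
```

Putting the size in the first column keeps the sort key simple; the same pipeline works on live data by replacing the printf with the docker images command shown in the comment.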
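6. triage sec: security triage
Targets: container security baseline hardening, privileged-container risk checks
- Checks privileged mode, the run-as user, and the read-only root filesystem setting
- Shows the network mode and PID/IPC namespace settings
- Shows mounts, added/dropped capabilities, and security options
- Gives best-practice advice for container security in production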
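The security fields are printed as Key=Value pairs, which makes them easy to post-process into warnings. The sketch below flags the risky values. flag_risky_settings is a hypothetical helper, the sample line is made up, and the WARN/INFO wording is an assumption rather than output of the triage script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read one line of space-separated Key=Value pairs (the
# shape printed by the inspect template in triage sec) and flag risky settings.
flag_risky_settings() {
  local -a pairs=()
  local pair
  read -r -a pairs || true
  for pair in "${pairs[@]}"; do
    case "${pair}" in
      Privileged=true)         echo "WARN: privileged mode is enabled" ;;
      ReadonlyRootfs=false)    echo "INFO: root filesystem is writable" ;;
      User=|User=root|User=0)  echo "WARN: container runs as root" ;;
      NetworkMode=host)        echo "WARN: shares the host network namespace" ;;
      PidMode=host)            echo "WARN: shares the host PID namespace" ;;
      IpcMode=host)            echo "WARN: shares the host IPC namespace" ;;
    esac
  done
}

# Made-up sample line.
echo 'Name=/my-api User= Privileged=true ReadonlyRootfs=false NetworkMode=bridge PidMode= IpcMode=private' \
  | flag_risky_settings
```

An empty User= is treated as root here because a container with no user configured runs as root by default.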
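V. Real-World Production Scenarios
Scenario 1: a business container keeps restarting
Run:
triage restart my-api
The script prints the exit code, whether the container was OOM-killed, the recent log errors, and the health check configuration, so you can quickly tell whether the cause is a code bug, memory exhaustion, or a misconfiguration.
Scenario 2: startup on port 8080 fails with "address already in use"
Run:
triage port 8080
It shows whether another Docker container or a host process is holding the port, which pinpoints the source of the conflict.
Scenario 3: server disk usage at 100%
Run:
triage disk
You can quickly see whether images, volumes, build cache, or container logs are filling the disk. Then follow the hints and clean up with docker system prune, after confirming that the stopped containers, unused networks, dangling images, and build cache it removes are really no longer needed.
Scenario 4: a container cannot reach external domain names
Run:
triage net
Compare the DNS configuration of the host and the container to locate a misconfigured resolver.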
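VI. Design Highlights
- Strict Bash mode: set -euo pipefail makes the script stop on failed commands, unset variables, and failed pipeline stages instead of silently carrying on; checks that may legitimately find nothing end with || true;
- Dependency detection: checks whether docker, ss, and netstat are installed and prints a friendly message if not;
- Readable layout: separators and table-formatted output keep the report easy to scan;
- Non-intrusive: the script only runs queries and never modifies or deletes containers or images, so it is safe to run in production;
- Built-in troubleshooting knowledge: every module ends with ops hints, so newcomers do not have to dig through the docs;
- Graceful fallbacks: reports the Docker root directory when /var/lib/docker is missing, falls back to netstat when ss is unavailable, and prints a clear message when no container is running.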
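Strict mode is the reason nearly every check in the script ends with || true: grep exits with status 1 when nothing matches, and under set -e that would abort the whole report. The sketch below demonstrates the effect in a child shell. run_strict is a hypothetical helper written only for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a snippet in a child bash with strict mode enabled
# and report whether the child reached its last command.
run_strict() {
  if bash -c "set -euo pipefail; $1; echo 'reached the end'" 2>/dev/null; then
    echo "=> script completed"
  else
    echo "=> script aborted early"
  fi
}

# grep finds nothing, exits 1, and set -e stops the child script.
run_strict "echo 'no match here' | grep -E 'docker-proxy'"

# With || true the empty result is tolerated and execution continues.
run_strict "echo 'no match here' | grep -E 'docker-proxy' || true"
```

The trade-off is that || true also hides real failures of the guarded command, which is acceptable for a read-only report but worth remembering when extending the script.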
VII. Full Source Code
#!/usr/bin/env bash
set -euo pipefail

usage() {
  cat <<'EOF'
Docker ops triage (one command).
Usage:
  triage restart <container>
  triage port <port>
  triage disk
  triage net
  triage image
  triage sec <container>
Examples:
  triage restart my-api
  triage port 8080
  triage disk
  triage net
  triage image
  triage sec my-api
EOF
}

# Succeeds if the given command exists in PATH.
need_cmd() { command -v "$1" >/dev/null 2>&1; }

# Prints a separator between report sections.
hr() { echo; echo "------------------------------------------------------------"; echo; }

# Exits early when the docker CLI is not installed.
must_docker() {
  if ! need_cmd docker; then
    echo "docker not found in PATH." >&2
    exit 1
  fi
}
cmd_restart() {
  must_docker
  local C="${1:-}"
  if [[ -z "${C}" ]]; then
    echo "Missing <container>" >&2
    exit 2
  fi
  echo "== docker ps (matching) =="
  # Keep the header row plus rows where some column equals the name or ID.
  # One awk replaces "(head -n 1; grep ...)": head may read more than one line
  # from the pipe and leave nothing for grep.
  docker ps -a --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Image}}' \
    | awk -v c="${C}" 'NR == 1 { print; next } { for (i = 1; i <= NF; i++) if (($i "") == (c "")) { print; next } }' || true
  hr
  echo "== docker inspect (state) =="
  docker inspect -f 'Name={{.Name}} Status={{.State.Status}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} StartedAt={{.State.StartedAt}} FinishedAt={{.State.FinishedAt}}' "${C}" 2>/dev/null || true
  hr
  echo "== last 200 logs =="
  docker logs --tail 200 "${C}" 2>/dev/null || true
  hr
  echo "== restart policy / healthcheck =="
  docker inspect -f 'RestartPolicy={{json .HostConfig.RestartPolicy}} Healthcheck={{json .Config.Healthcheck}}' "${C}" 2>/dev/null || true
  hr
  echo "== hints =="
  echo "- ExitCode=137 often means SIGKILL (commonly OOMKilled)."
  echo "- If Healthcheck keeps failing, container may be restarted by orchestrator policies."
  echo "- If logs show bind errors, check host port conflicts (triage port <port>)."
}
cmd_port() {
  must_docker
  local PORT="${1:-}"
  if [[ -z "${PORT}" ]]; then
    echo "Missing <port>" >&2
    exit 2
  fi
  echo "== docker ps port publish mapping =="
  # Header row plus rows that publish exactly this host port.
  docker ps --format 'table {{.ID}}\t{{.Names}}\t{{.Ports}}' \
    | awk -v p="${PORT}" 'NR == 1 || $0 ~ ("(^|[^0-9])" p "->")' || true
  hr
  echo "== host listeners (ss/netstat) =="
  # grep -E has no (?:...) groups; match ":<port>" followed by a non-digit or end of line.
  if need_cmd ss; then
    ss -lntup 2>/dev/null | grep -E ":${PORT}([^0-9]|$)" || true
  elif need_cmd netstat; then
    netstat -lntup 2>/dev/null | grep -E ":${PORT}([^0-9]|$)" || true
  else
    echo "ss/netstat not found."
  fi
  hr
  echo "== docker-proxy process (if exists) =="
  # docker-proxy receives the published port as "-host-port <port>".
  ps -ef 2>/dev/null | grep -E "docker-proxy.*-host-port ${PORT}( |$)" | grep -v grep || true
  hr
  echo "== hints =="
  echo "- If you see 0.0.0.0:${PORT}->... in docker ps, the host port is published by Docker."
  echo "- If host is listening on 127.0.0.1:${PORT}, external clients can't reach it."
  echo "- If container listens only on 127.0.0.1 inside, publish won't help; it must listen on 0.0.0.0 in the container."
}
cmd_disk() {
  must_docker
  echo "== docker system df =="
  docker system df || true
  hr
  echo "== docker system df -v (top part) =="
  docker system df -v 2>/dev/null | sed -n '1,200p' || true
  hr
  echo "== /var/lib/docker usage (top 20 dirs) =="
  if [[ -d /var/lib/docker ]]; then
    # Reading /var/lib/docker usually requires root.
    du -h --max-depth=2 /var/lib/docker 2>/dev/null | sort -h | tail -n 20 || true
  else
    echo "/var/lib/docker not found (non-Linux host or custom data-root)."
    echo "Check: docker info | grep -i 'Docker Root Dir'"
    docker info 2>/dev/null | grep -i 'Docker Root Dir' || true
  fi
  hr
  echo "== largest container logs (json-file) top 20 =="
  if [[ -d /var/lib/docker/containers ]]; then
    # find -printf is a GNU extension: prints "<bytes> <path>" per log file.
    find /var/lib/docker/containers -name '*-json.log' -type f -printf '%s %p\n' 2>/dev/null \
      | sort -n | tail -n 20 | awk '{printf "%.2fMB %s\n",$1/1024/1024,$2}' || true
  fi
  hr
  echo "== hints =="
  echo "- Biggest offenders are usually: images, build cache, volumes, and container json logs."
  echo "- Cleanups (review first): docker system prune removes stopped containers, unused networks, dangling images and build cache; docker image prune -a removes every image not used by a container."
  echo "- Long-term: configure logging max-size/max-file to cap json-file logs."
}
cmd_net() {
  must_docker
  echo "== docker networks =="
  docker network ls || true
  hr
  echo "== docker info (network related) =="
  # grep -E replaces egrep, which recent GNU grep releases flag as obsolescent.
  docker info 2>/dev/null | grep -Ei 'Server Version|Cgroup|Security Options|Docker Root Dir|Registry Mirrors|Live Restore|Network' || true
  hr
  echo "== resolv.conf inside a running container (first one) =="
  local CID
  CID="$(docker ps -q | head -n 1 || true)"
  if [[ -n "${CID}" ]]; then
    docker exec "${CID}" sh -lc 'echo "container: $(hostname)"; cat /etc/resolv.conf; ip route 2>/dev/null || true' 2>/dev/null || true
  else
    echo "No running container found."
  fi
  hr
  echo "== host resolv.conf =="
  cat /etc/resolv.conf 2>/dev/null || true
  hr
  echo "== hints =="
  echo "- If only domain fails inside container, check DNS server in /etc/resolv.conf."
  echo "- MTU issues often show: small requests ok, large requests timeout (blackhole)."
  echo "- For bridge networks, iptables/nft rules and FORWARD policy can break connectivity."
}
cmd_image() {
  must_docker
  echo "== docker version =="
  docker version 2>/dev/null | sed -n '1,120p' || true
  hr
  echo "== largest images (top 20) =="
  # docker images lists newest first, so sort on the human-readable size column.
  docker images --format '{{.Size}}\t{{.Repository}}:{{.Tag}}\t{{.ID}}' 2>/dev/null \
    | sort -rh | head -n 20 || true
  hr
  echo "== build cache / system df =="
  docker system df -v 2>/dev/null | sed -n '1,220p' || true
  hr
  echo "== hints =="
  echo "- Use multi-stage builds; keep runtime image minimal."
  echo "- Split dependency layers to maximize cache hits."
  echo "- Pin base image versions for reproducibility."
}
cmd_sec() {
  must_docker
  local C="${1:-}"
  if [[ -z "${C}" ]]; then
    echo "Missing <container>" >&2
    exit 2
  fi
  echo "== inspect security-related fields =="
  docker inspect -f 'Name={{.Name}} User={{.Config.User}} Privileged={{.HostConfig.Privileged}} ReadonlyRootfs={{.HostConfig.ReadonlyRootfs}} NetworkMode={{.HostConfig.NetworkMode}} PidMode={{.HostConfig.PidMode}} IpcMode={{.HostConfig.IpcMode}}' "${C}" 2>/dev/null || true
  hr
  docker inspect -f 'CapAdd={{json .HostConfig.CapAdd}} CapDrop={{json .HostConfig.CapDrop}} SecurityOpt={{json .HostConfig.SecurityOpt}}' "${C}" 2>/dev/null || true
  hr
  echo "== mounts =="
  docker inspect -f 'Mounts={{json .Mounts}}' "${C}" 2>/dev/null | sed -n '1,200p' || true
  hr
  echo "== hints =="
  echo "- Prefer non-root user; avoid privileged and host PID/IPC unless necessary."
  echo "- Use read-only rootfs + writable volumes for data."
  echo "- Drop capabilities by default; add only what is needed."
}
main() {
  local cmd="${1:-}"
  if [[ -z "${cmd}" || "${cmd}" == "-h" || "${cmd}" == "--help" ]]; then
    usage
    exit 0
  fi
  shift || true
  case "${cmd}" in
    restart) cmd_restart "${1:-}" ;;
    port)    cmd_port "${1:-}" ;;
    disk)    cmd_disk ;;
    net)     cmd_net ;;
    image)   cmd_image ;;
    sec)     cmd_sec "${1:-}" ;;
    *)
      echo "Unknown command: ${cmd}" >&2
      usage >&2
      exit 2
      ;;
  esac
}
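main "$@"
VIII. Ops Summary
This script wraps Docker's scattered troubleshooting commands into one tool and builds ops experience into its hints. It is directly usable for day-to-day development debugging, production incident response, server inspections, and container security baseline checks.
I recommend bookmarking it and deploying it to your Linux servers. It can then be extended with features such as container CPU/memory rankings, one-command cleanup of unused images, and scheduled container health checks, growing into your own Docker ops toolbox.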
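As one possible starting point for the CPU/memory ranking extension, the sketch below sorts container statistics by CPU usage. rank_by_cpu is a hypothetical helper that is not in the script, and the sample rows are made up in the shape that docker stats --no-stream prints with the format string shown in the comment.

```bash
#!/usr/bin/env bash
# Hypothetical extension: rank "NAME CPU% MEM%" lines on stdin by CPU usage.
rank_by_cpu() {
  local top="${1:-10}"
  # Strip the % signs so sort can compare the second column numerically.
  tr -d '%' \
    | sort -k2,2 -rn \
    | awk -v n="${top}" 'NR <= n { printf "%-20s cpu=%s%% mem=%s%%\n", $1, $2, $3 }'
}

# Made-up sample in the shape of:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemPerc}}'
rank_by_cpu 2 <<'EOF'
my-api 12.50% 3.10%
redis 0.35% 0.80%
worker 87.20% 10.45%
EOF
```

Wired into the script, this could become a new subcommand whose body pipes the docker stats command from the comment into rank_by_cpu, keeping the same read-only behavior as the existing modules.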
