docker

关注公众号 jb51net

关闭
首页 > 网站技巧 > 服务器 > 云和虚拟化 > docker > Docker排查Shell脚本

Docker运维一键排查Shell脚本,6大功能搞定容器90%线上故障

作者:Flying_Fish_Xuan

本文介绍了一款自研的Docker一键排查Shell脚本,该脚本主要用于Docker容器运维,包括容器重启、端口占用、磁盘使用、网络故障、镜像臃肿、安全权限等常见问题的排查,感兴趣的朋友一起看看吧

文章简介

做后端开发、Linux运维、云原生工程师,日常避不开Docker容器运维:容器莫名重启、端口占用冲突、服务器磁盘爆满、容器域名解析失败、镜像体积臃肿、容器安全权限合规排查……

传统排障需要反复敲 docker psdocker inspectdocker logsssdu 等零散命令,重复低效还容易遗漏关键排查点。

今天给大家分享一款自研生产级Docker一键排查Shell脚本,纯原生Bash编写、无第三方依赖、兼容所有Linux发行版,封装6大核心运维场景,一条命令即可完成全维度故障诊断,自带排障提示,新手也能秒定位线上问题。

一、脚本核心能力概览

脚本内置6个子命令,覆盖Docker日常运维所有高频场景:

命令作用场景核心能力
triage restart 容器名容器异常重启、崩溃排查容器状态、退出码、OOM杀死、日志、重启策略、健康检查
triage port 端口号端口冲突、端口映射失效Docker端口映射、主机监听进程、docker-proxy进程排查
triage disk服务器磁盘爆满Docker磁盘占用统计、目录层级占用、容器日志大文件排行
triage net容器网络/DNS解析故障网络列表、容器/主机DNS对比、路由、MTU网络黑洞排查
triage image镜像臃肿、构建缓存清理Docker版本、大镜像排行、构建缓存分析、镜像优化建议
triage sec 容器名容器安全基线检查特权模式、只读根目录、挂载权限、Linux能力集、安全参数

二、脚本安装部署(一分钟搞定)

1. 保存脚本

新建文件 triage,将文末完整源码复制粘贴进去。

2. 授权全局可用

# 移动到系统全局命令目录
mv triage /usr/local/bin/
# 添加执行权限
chmod +x /usr/local/bin/triage

3. 验证使用

直接终端输入,查看帮助文档:

triage

三、基础使用语法

# 查看帮助
triage -h
triage --help
# 1. 排查容器重启/崩溃问题
triage restart 容器名/容器ID
# 2. 排查端口占用/映射问题
triage port 8080
# 3. 排查Docker磁盘占用爆满
triage disk
# 4. 排查容器网络/DNS故障
triage net
# 5. 排查镜像体积/构建缓存
triage image
# 6. 容器安全合规检查
triage sec 容器名/容器ID

四、六大功能模块深度解析

1. triage restart 容器排查

专门解决:容器自动重启、启动失败、OOM被杀、健康检查异常

2. triage port 端口排查

专门解决:端口被占用、Docker端口映射不生效、外部无法访问容器

3. triage disk 磁盘排查

专门解决:服务器磁盘告警、Docker莫名占用大量空间

4. triage net 网络排查

专门解决:容器能ping通IP但解析不了域名、跨容器网络不通

5. triage image 镜像排查

专门解决:镜像体积过大、构建缓存堆积、版本兼容问题

6. triage sec 安全排查

专门解决:容器安全基线加固、特权容器风险排查

五、线上真实故障实战场景

场景1:业务容器频繁重启

执行命令:

triage restart my-api

脚本自动输出:容器退出码、是否OOM杀死、日志报错、健康检查失败原因,快速定位是代码Bug、内存溢出还是配置问题。

场景2:8080端口启动报错:address already in use

执行命令:

triage port 8080

自动查出是其他Docker容器占用、还是主机进程监听,一键定位冲突源。

场景3:服务器磁盘使用率100%

执行命令:

triage disk

快速看到是镜像、卷、构建缓存还是容器日志占满磁盘,根据脚本提示执行docker system prune安全清理。

场景4:容器内无法访问外网域名

执行命令:

triage net

对比主机和容器DNS配置,直接定位解析服务器配置问题。

六、脚本设计亮点

七、完整源码

#!/usr/bin/env bash
set -euo pipefail
usage() {
  cat <<'EOF'
Docker ops triage (one command).
Usage:
  triage restart <container>
  triage port <port>
  triage disk
  triage net
  triage image
  triage sec <container>
Examples:
  triage restart my-api
  triage port 8080
  triage disk
  triage net
  triage image
  triage sec my-api
EOF
}
need_cmd() { command -v "$1" >/dev/null 2>&1; }
hr() { echo; echo "------------------------------------------------------------"; echo; }
must_docker() {
  if ! need_cmd docker; then
    echo "docker not found in PATH." >&2
    exit 1
  fi
}
cmd_restart() {
  must_docker
  local C="${1:-}"
  if [[ -z "${C}" ]]; then
    echo "Missing <container>" >&2
    exit 2
  fi
echo "== docker ps (matching) =="
docker ps -a --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}\t{{.Image}}' | (head -n 1; grep -E "(^|\\s)${C}(\\s|$)" || true)
hr
echo "== docker inspect (state) =="
docker inspect -f 'Name={{.Name}} Status={{.State.Status}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} StartedAt={{.State.StartedAt}} FinishedAt={{.State.FinishedAt}}' "${C}" 2>/dev/null || true
hr
echo "== last 200 logs =="
docker logs --tail 200 "${C}" 2>/dev/null || true
hr
echo "== restart policy / healthcheck =="
docker inspect -f 'RestartPolicy={{json .HostConfig.RestartPolicy}} Healthcheck={{json .Config.Healthcheck}}' "${C}" 2>/dev/null || true
hr
echo "== hints =="
echo "- ExitCode=137 often means SIGKILL (commonly OOMKilled)."
echo "- If Healthcheck keeps failing, container may be restarted by orchestrator policies."
echo "- If logs show bind errors, check host port conflicts (triage port <port>)."
}
cmd_port() {
  must_docker
  local PORT="${1:-}"
  if [[ -z "${PORT}" ]]; then
    echo "Missing <port>" >&2
    exit 2
  fi
echo "== docker ps port publish mapping =="
docker ps --format 'table {{.ID}}\t{{.Names}}\t{{.Ports}}' | (head -n 1; grep -E "(:|\\b)${PORT}->" || true)
hr
echo "== host listeners (ss/netstat) =="
if need_cmd ss; then
  ss -lntup 2>/dev/null | grep -E ":(?:${PORT})\\b" || true
elif need_cmd netstat; then
  netstat -lntup 2>/dev/null | grep -E ":(?:${PORT})\\b" || true
else
  echo "ss/netstat not found."
fi
hr
echo "== docker-proxy process (if exists) =="
ps -ef 2>/dev/null | grep -E "docker-proxy.*:${PORT}\\b" | grep -v grep || true
hr
echo "== hints =="
echo "- If you see 0.0.0.0:${PORT}->... in docker ps, the host port is published by Docker."
echo "- If host is listening on 127.0.0.1:${PORT}, external clients can't reach it."
echo "- If container listens only on 127.0.0.1 inside, publish won't help; it must listen on 0.0.0.0 in the container."
}
cmd_disk() {
  must_docker
  echo "== docker system df =="
  docker system df || true
  hr
  echo "== docker system df -v (top part) =="
  docker system df -v 2>/dev/null | sed -n '1,200p' || true
  hr
  echo "== /var/lib/docker usage (top 20 dirs) =="
  if [[ -d /var/lib/docker ]]; then
    du -h --max-depth=2 /var/lib/docker 2>/dev/null | sort -h | tail -n 20 || true
  else
    echo "/var/lib/docker not found (non-Linux host or custom data-root)."
    echo "Check: docker info | grep -i 'Docker Root Dir'"
    docker info 2>/dev/null | grep -i 'Docker Root Dir' || true
  fi
  hr
  echo "== largest container logs (json-file) top 20 =="
  if [[ -d /var/lib/docker/containers ]]; then
    find /var/lib/docker/containers -name '*-json.log' -type f -printf '%s %p\n' 2>/dev/null \
      | sort -n | tail -n 20 | awk '{printf "%.2fMB %s\n",$1/1024/1024,$2}' || true
  fi
  hr
  echo "== hints =="
  echo "- Biggest offenders are usually: images, build cache, volumes, and container json logs."
  echo "- Safe cleanups: docker system prune -f (unused objects), docker image prune -a -f (unused images)."
  echo "- Long-term: configure logging max-size/max-file to cap json-file logs."
}
cmd_net() {
  must_docker
  echo "== docker networks =="
  docker network ls || true
  hr
  echo "== docker info (network related) =="
  docker info 2>/dev/null | egrep -i 'Server Version|Cgroup|Security Options|Docker Root Dir|Registry Mirrors|Live Restore|Network' || true
  hr
  echo "== resolv.conf inside a running container (first one) =="
  local CID
  CID="$(docker ps -q | head -n 1 || true)"
  if [[ -n "${CID}" ]]; then
    docker exec "${CID}" sh -lc 'echo "container: $(hostname)"; cat /etc/resolv.conf; ip route 2>/dev/null || true' 2>/dev/null || true
  else
    echo "No running container found."
  fi
  hr
  echo "== host resolv.conf =="
  cat /etc/resolv.conf 2>/dev/null || true
  hr
  echo "== hints =="
  echo "- If only domain fails inside container, check DNS server in /etc/resolv.conf."
  echo "- MTU issues often show: small requests ok, large requests timeout (blackhole)."
  echo "- For bridge networks, iptables/nft rules and FORWARD policy can break connectivity."
}
cmd_image() {
  must_docker
  echo "== docker version =="
  docker version 2>/dev/null | sed -n '1,120p' || true
  hr
  echo "== largest images (top 20) =="
  docker images --format '{{.Repository}}:{{.Tag}} {{.ID}} {{.Size}}' 2>/dev/null | head -n 20 || true
  hr
  echo "== build cache / system df =="
  docker system df -v 2>/dev/null | sed -n '1,220p' || true
  hr
  echo "== hints =="
  echo "- Use multi-stage builds; keep runtime image minimal."
  echo "- Split dependency layers to maximize cache hits."
  echo "- Pin base image versions for reproducibility."
}
cmd_sec() {
  must_docker
  local C="${1:-}"
  if [[ -z "${C}" ]]; then
    echo "Missing <container>" >&2
    exit 2
  fi
  echo "== inspect security-related fields =="
  docker inspect -f 'Name={{.Name}} User={{.Config.User}} Privileged={{.HostConfig.Privileged}} ReadonlyRootfs={{.HostConfig.ReadonlyRootfs}} NetworkMode={{.HostConfig.NetworkMode}} PidMode={{.HostConfig.PidMode}} IpcMode={{.HostConfig.IpcMode}}' "${C}" 2>/dev/null || true
  hr
  docker inspect -f 'CapAdd={{json .HostConfig.CapAdd}} CapDrop={{json .HostConfig.CapDrop}} SecurityOpt={{json .HostConfig.SecurityOpt}}' "${C}" 2>/dev/null || true
  hr
  echo "== mounts =="
  docker inspect -f 'Mounts={{json .Mounts}}' "${C}" 2>/dev/null | sed -n '1,200p' || true
  hr
  echo "== hints =="
  echo "- Prefer non-root user; avoid privileged and host PID/IPC unless necessary."
  echo "- Use read-only rootfs + writable volumes for data."
  echo "- Drop capabilities by default; add only what is needed."
}
main() {
  local cmd="${1:-}"
  if [[ -z "${cmd}" || "${cmd}" == "-h" || "${cmd}" == "--help" ]]; then
    usage
    exit 0
  fi
  shift || true
  case "${cmd}" in
    restart) cmd_restart "${1:-}" ;;
    port)    cmd_port "${1:-}" ;;
    disk)    cmd_disk ;;
    net)     cmd_net ;;
    image)   cmd_image ;;
    sec)     cmd_sec "${1:-}" ;;
    *)
      echo "Unknown command: ${cmd}" >&2
      usage >&2
      exit 2
      ;;
  esac
}
main "$@"

八、运维总结

这款脚本把Docker零散的排查命令做了封装整合,把运维排障经验内置到脚本Hint中,无论是日常开发调试、线上故障应急、服务器巡检、容器安全基线检查都能直接用。

建议收藏部署到所有Linux服务器,后续可以基于脚本扩展:增加容器CPU/内存排行、一键清理无用镜像、容器定时健康检测等功能,打造成个人专属Docker运维工具箱。

到此这篇关于Docker运维一键排查Shell脚本,6大功能搞定容器90%线上故障的文章就介绍到这了,更多相关Docker排查Shell脚本内容请搜索脚本之家以前的文章或继续浏览下面的相关文章希望大家以后多多支持脚本之家!

您可能感兴趣的文章:
阅读全文