by aliyun
Unifies ACK cluster management, native Kubernetes operations, observability, security audit and diagnostic capabilities into a single AI‑native toolset, allowing natural‑language interaction with AI assistants to perform complex container‑oriented AIOps tasks.
The project provides a Model Context Protocol (MCP) server that wraps Alibaba Cloud Container Service (ACK) APIs, Kubernetes API, Prometheus, SLS logs and other cloud observability services. By exposing these capabilities through a standardized MCP interface, AI agents and large language models can execute container lifecycle management, diagnostics, monitoring queries, and audit operations using natural language.
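For example, an MCP-capable client can launch the server over stdio and invoke its tools programmatically. The sketch below uses the official `mcp` Python SDK; the entry point `python -m src.main_server` and the `list_clusters` tool name come from this README, while the rest (passing credentials via environment variables, empty tool arguments) is illustrative.

```python
# Minimal sketch: connect to ack-mcp-server over stdio and call a tool.
# Assumes the `mcp` Python SDK is installed and the repo's virtualenv is active.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Launch the server as a subprocess (stdio transport, default for local dev).
    params = StdioServerParameters(
        command="python",
        args=["-m", "src.main_server"],
        env={**os.environ},  # pass ACCESS_KEY_ID / ACCESS_KEY_SECRET through
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("available tools:", [t.name for t in tools.tools])

            # `list_clusters` is one of the ACK management tools named in this README.
            result = await session.call_tool("list_clusters", arguments={})
            print(result)


if __name__ == "__main__":
    asyncio.run(main())
```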
Quick start options:

- Prerequisites: Alibaba Cloud credentials with read-only RAM permissions for the cs, log, and arms services (see README for policy JSON).
- Helm: helm install ack-mcp-server ./deploy/helm -n kube-system --set accessKeyId=<ID> --set accessKeySecret=<SECRET> --set transport=sse
- Docker: docker run -d -e ACCESS_KEY_ID=… -e ACCESS_KEY_SECRET=… -p 8000:8000 registry-cn-beijing.ack.aliyuncs.com/acs/ack-mcp-server:latest python -m main_server --transport sse --host 0.0.0.0 --port 8000
- Binary: make build-binary, then run ./dist/ack-mcp-server with the desired flags.
- From source:
  git clone https://github.com/aliyun/alibabacloud-ack-mcp-server
  cd alibabacloud-ack-mcp-server
  uv sync && source .venv/bin/activate
  make run # stdio mode
  # or
  make run-http # HTTP streaming
  make run-sse # SSE mode
- Inspect the available tools: npx @modelcontextprotocol/inspector --config ./mcp.json
Kubernetes native operations (ack_kubectl): CRUD resources, fetch logs/events, execute arbitrary kubectl-like commands with fine-grained RBAC.

Typical scenarios:

| Scenario | How it works |
|---|---|
| Pod OOM remediation | AI agent asks “Why is pod X OOM?” → server queries metrics, logs, suggests kubectl exec or resource limit adjustments, can apply changes if --allow-write is enabled. |
| Cluster health audit | Agent requests “Run a health check on cluster Y” → server runs built‑in inspection, returns a summary report with recommendations. |
| Metric‑driven troubleshooting | Natural‑language query “Show CPU usage of namespace prod over last 2h” → server translates to PromQL, returns data. |
| Security audit | “List recent audit events for role admin” → server fetches Kubernetes audit logs and presents them. |
| Third‑party AI integration | Plugged into Claude Code, Cursor, Gemini CLI, etc., enabling developers to manage ACK resources directly from IDEs or CLI tools. |
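To make the metric-driven troubleshooting row concrete, the sketch below shows roughly what an agent-side tool call could look like once a session is open (see the stdio example above). The tool name `query_prometheus` appears in this README; the argument names (`cluster_id`, `query`, `start`, `end`) are assumptions about its schema, not the documented interface.

```python
# Hypothetical agent step: "Show CPU usage of namespace prod over last 2h".
# The agent (or LLM) first translates the request into PromQL, then calls the tool.
import time

PROMQL = 'sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m]))'


async def show_prod_cpu(session, cluster_id: str):
    end = int(time.time())
    start = end - 2 * 3600  # last 2 hours
    # Argument names below are illustrative; check the schema reported by
    # list_tools() for the real parameter names.
    return await session.call_tool(
        "query_prometheus",
        arguments={
            "cluster_id": cluster_id,
            "query": PROMQL,
            "start": start,
            "end": end,
        },
    )
```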
- Permissions: read-only operation requires RAM permissions for cs, log, and arms. Write-capable operations need additional permissions and the --allow-write flag.
- Binary: make build-binary and run it directly, or use the Python entry point python -m src.main_server.
- Transports: stdio (default for local development), http (streaming HTTP), and sse (Server-Sent Events). Choose the mode that matches your AI agent's connector.
- Security: report vulnerabilities to kubernetes-security@service.aliyun.com as described in SECURITY.md.

Alibaba Cloud Container Service MCP Server toolset: ack-mcp-server.
Unifies ACK cluster/resource management, native Kubernetes operations, and container-scenario operational capabilities such as observability, security auditing, diagnostics, and inspection into an AI-native, standardized toolset.
The toolset's capabilities are integrated into the Alibaba Cloud Container Service intelligent assistant. Based on the MCP (Model Context Protocol) protocol, it can also be integrated with third-party AI agents (kubectl-ai, QWen Code, Claude Code, Cursor, Gemini CLI, VS Code, etc.) or automation systems.
This enables natural-language interaction with AI assistants to complete complex container operations tasks, and helps you build your own AIOps system for container scenarios.
https://github.com/user-attachments/assets/9e48cac3-0af1-424c-9f16-3862d047cc68
- Full-lifecycle resource management for Alibaba Cloud ACK (e.g., list_clusters)
- Native Kubernetes operations (ack_kubectl): kubectl-style operations with controllable read/write permissions
- AI-native observability for container scenarios: Prometheus queries (query_prometheus / query_prometheus_metric_guidance), control-plane logs (query_controlplane_logs), and audit logs (query_audit_log)
- Alibaba Cloud ACK diagnostics and inspection (diagnose_resource, query_inspect_report)
- Enterprise-grade engineering capabilities
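As a rough illustration of the diagnostics and inspection tools listed above, a client session could trigger a resource diagnosis and then fetch the latest inspection report. The tool names come from this README; the argument names are assumptions and should be confirmed against the schema returned by list_tools().

```python
# Illustrative only: argument names are guesses, confirm them via list_tools().
async def diagnose_pod(session, cluster_id: str, namespace: str, name: str):
    # Run a diagnosis on a single resource (here, a Pod).
    diagnosis = await session.call_tool(
        "diagnose_resource",
        arguments={
            "cluster_id": cluster_id,
            "kind": "Pod",
            "namespace": namespace,
            "name": name,
        },
    )
    # Pull the most recent cluster inspection report for additional context.
    report = await session.call_tool(
        "query_inspect_report",
        arguments={"cluster_id": cluster_id},
    )
    return diagnosis, report
```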
Scenario-based evaluation of AI capabilities, supporting comparison of results across multiple AI agents and large language models:
| Task scenario | AI Agent | LLM | Success rate | Average processing time |
|---|---|---|---|---|
| Pod OOM remediation | qwen_code | qwen3-coder-plus | ✅ 100% | 2.3min |
| Cluster health check | qwen_code | qwen3-coder-plus | ✅ 95% | 6.4min |
| Resource anomaly diagnosis | kubectl-ai | qwen3-32b | ✅ 90% | 4.1min |
| Historical resource analysis | qwen_code | qwen3-coder-plus | ✅ 85% | 3.8min |
See the benchmarks/results/ directory for the latest benchmark reports.
It is recommended to authenticate ack-mcp-server with a RAM sub-account of your Alibaba Cloud account rather than the root account, follow the principle of least privilege, and grant that sub-account the permission policy set below.
Required RAM permission policies
For how to grant the required permissions to a RAM account, see the documentation: RAM permission policies.
The read-only permission set currently required by ack-mcp-server is:
{
"Version": "1",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cs:Check*",
"cs:Describe*",
"cs:Get*",
"cs:List*",
"cs:Query*",
"cs:RunClusterCheck",
"cs:RunClusterInspect"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "arms:GetPrometheusInstance",
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"log:Describe*",
"log:Get*",
"log:List*"
],
"Resource": "*"
}
]
}
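As a quick way to confirm that the credentials you grant actually cover the read-only cs actions above, you can call a Describe API directly with the Alibaba Cloud Python SDK. This is a sketch under assumptions: it uses the `alibabacloud_cs20151215` and `alibabacloud_tea_openapi` packages and the DescribeClustersV1 operation, and it is not part of ack-mcp-server itself.

```python
# Sanity check (sketch): verify the RAM user can list ACK clusters.
# Assumes: pip install alibabacloud_cs20151215 alibabacloud_tea_openapi
import os

from alibabacloud_cs20151215.client import Client as CSClient
from alibabacloud_cs20151215 import models as cs_models
from alibabacloud_tea_openapi import models as open_api_models


def check_readonly_access() -> None:
    config = open_api_models.Config(
        access_key_id=os.environ["ACCESS_KEY_ID"],
        access_key_secret=os.environ["ACCESS_KEY_SECRET"],
        # Example endpoint; use the region where your clusters live.
        endpoint="cs.cn-hangzhou.aliyuncs.com",
    )
    client = CSClient(config)
    resp = client.describe_clusters_v1(cs_models.DescribeClustersV1Request())
    # A successful call means the cs:Describe*/List* permissions are in place.
    print(resp.body)


if __name__ == "__main__":
    check_readonly_access()
```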
Deploy in a Kubernetes cluster:
# Clone the repository
git clone https://github.com/aliyun/alibabacloud-ack-mcp-server
cd alibabacloud-ack-mcp-server
# Deploy with Helm
helm install \
--set accessKeyId=<your-access-key-id> \
--set accessKeySecret=<your-access-key-secret> \
--set transport=sse \
ack-mcp-server \
./deploy/helm \
-n kube-system
After deployment, expose the ack-mcp-server Service externally (for example via a load balancer) so that AI agents can connect to it.
Parameters
- accessKeyId: AccessKey ID of the Alibaba Cloud account
- accessKeySecret: AccessKey Secret of the Alibaba Cloud account

Run with Docker:
# Pull the image
docker pull registry-cn-beijing.ack.aliyuncs.com/acs/ack-mcp-server:latest
# Run the container
docker run \
-d \
--name ack-mcp-server \
-e ACCESS_KEY_ID="your-access-key-id" \
-e ACCESS_KEY_SECRET="your-access-key-secret" \
-p 8000:8000 \
registry-cn-beijing.ack.aliyuncs.com/acs/ack-mcp-server:latest \
python -m main_server --transport sse --host 0.0.0.0 --port 8000 --allow-write
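Once the server is reachable over SSE (for example the container started above, or the Helm deployment exposed through a load balancer), an MCP client can connect with the SDK's SSE transport. A minimal sketch, assuming the server serves the conventional `/sse` endpoint on port 8000; substitute your actual address.

```python
# Sketch: connect to a deployed ack-mcp-server over SSE and list its tools.
import asyncio

from mcp import ClientSession
from mcp.client.sse import sse_client

SERVER_URL = "http://localhost:8000/sse"  # assumption: default MCP SSE path


async def main() -> None:
    async with sse_client(SERVER_URL) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])


asyncio.run(main())
```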
Download a pre-built binary, or build one locally and run it:
# Build the binary (local build)
make build-binary
# Run
./dist/ack-mcp-server --help
Build environment requirements
# Clone the project
git clone https://github.com/aliyun/alibabacloud-ack-mcp-server
cd alibabacloud-ack-mcp-server
# Install dependencies
uv sync
# Activate the virtual environment (Bash)
source .venv/bin/activate
# Configure the environment
cp .env.example .env
vim .env
# Run the development server
make run
Install dependencies
Using uv (recommended):
uv sync
source .venv/bin/activate
Or using pip:
pip install -r requirements.txt
Create a .env file (you can use .env.example as a reference):
# Alibaba Cloud credentials and region
ACCESS_KEY_ID=your-access-key-id
ACCESS_KEY_SECRET=your-access-key-secret
# Cache configuration
CACHE_TTL=300
CACHE_MAX_SIZE=1000
# Logging configuration
FASTMCP_LOG_LEVEL=INFO
DEVELOPMENT=false
⚠️ Note: if ACCESS_KEY_ID/ACCESS_KEY_SECRET are not set, features that depend on cloud APIs will be unavailable.
npx @modelcontextprotocol/inspector --config ./mcp.json
Run ack-mcp-server locally in stdio mode (suitable for local development)
make run
# or
python -m src.main_server
Run ack-mcp-server locally in streaming HTTP mode (recommended for integration with online systems)
make run-http
# or
python -m src.main_server --transport http --host 0.0.0.0 --port 8000
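The streaming HTTP mode pairs with the MCP Python SDK's streamable HTTP client. Below is a minimal sketch assuming the server exposes the SDK's default `/mcp` path on port 8000; adjust the URL for your deployment.

```python
# Sketch: connect to ack-mcp-server running with --transport http.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "http://localhost:8000/mcp"  # assumption: default streamable HTTP path


async def main() -> None:
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("list_clusters", arguments={})
            print(result)


asyncio.run(main())
```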
Run ack-mcp-server locally in SSE mode
make run-sse
# or
python -m src.main_server --transport sse --host 0.0.0.0 --port 8000
Common parameters

| Parameter | Description | Default |
|---|---|---|
| --access-key-id | AccessKey ID | Alibaba Cloud account credential (AK) |
| --access-key-secret | AccessKey Secret | Alibaba Cloud account credential (SK) |
| --allow-write | Enable write operations | Disabled by default |
| --transport | Transport mode | stdio / sse / http |
| --host | Bind host | localhost |
| --port | Port | 8000 |
# Run all unit tests
make test
Tech stack: Python 3.12+, FastMCP 2.12.2+, Alibaba Cloud SDK, Kubernetes client.
See DESIGN.md for the detailed architecture design.
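The FastMCP layer is what turns these capabilities into MCP tools. The snippet below is not the project's actual code, only a minimal sketch of how a tool is typically registered with FastMCP 2.x, to show where a capability such as cluster listing would plug in.

```python
# Minimal FastMCP 2.x sketch (illustrative, not ack-mcp-server's real code).
from fastmcp import FastMCP

mcp = FastMCP("ack-mcp-server-demo")


@mcp.tool()
def list_clusters_demo(region_id: str = "cn-hangzhou") -> list[dict]:
    """Return a stub list of ACK clusters; a real tool would call the CS API."""
    return [{"cluster_id": "c-example", "region_id": region_id, "state": "running"}]


if __name__ == "__main__":
    # stdio by default; FastMCP also supports SSE and streamable HTTP transports.
    mcp.run()
```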
Supported scenarios:

| Scenario | Description | Modules involved |
|---|---|---|
| Pod OOM remediation | Diagnose and fix out-of-memory issues | kubectl, diagnostics |
| Cluster health check | Comprehensive cluster status inspection | diagnostics, inspection |
| Resource anomaly diagnosis | Root-cause analysis of abnormal resources | kubectl, diagnostics |
| Historical resource analysis | Resource usage trend analysis | prometheus, sls |
Based on the latest benchmark results; for details, see the Benchmark README.md.
# Run the benchmark
cd benchmarks
./run_benchmark.sh --openai-api-key your-key --agent qwen_code --model qwen3-coder-plus
Apache-2.0. See LICENSE for details.