by groundlight
Expose HuggingFace zero‑shot object detection models as tools for large language or vision‑language models, enabling object localisation and zoom functionality on images.
mcp-vision provides a Model Context Protocol (MCP) server that wraps HuggingFace computer-vision models (primarily zero-shot object detection pipelines) into callable tools. These tools can be invoked by LLMs (e.g., Claude) to locate objects in an image or to crop and zoom into a specific object.
Quick start: build a local Docker image with `make build-docker` (or use the pre-built image `groundlight/mcp-vision:latest`), then add an entry to `claude_desktop_config.json` that runs the Docker container, choosing the GPU or CPU variant. For local development, use the `uv` Python package manager.

Q: Do I need a GPU?
A: A GPU speeds up inference dramatically. The default `google/owlvit-large-patch14` model is slow on CPU; you can set `DEFAULT_OBJDET_MODEL` to a smaller model.

Q: Can I run the server without building the Docker image?
A: Yes, pull `groundlight/mcp-vision:latest` and run it directly, though the initial model download may delay startup.

Q: Why does Claude sometimes ignore the tools?
A: If web search is enabled, Claude may prefer it over local MCP tools. Disable web search for best results.

Q: How do I add more models or tools?
A: Extend the codebase under `mcp_vision/tools` and update the Docker image; a TODO mentions hosting the best models online to avoid local downloads.
A Model Context Protocol (MCP) server exposing HuggingFace computer vision models such as zero-shot object detection as tools, enhancing the vision capabilities of large language or vision-language models.
This repo is in active development. See below for details of currently available tools.
Clone the repo:
```shell
git clone git@github.com:groundlight/mcp-vision.git
```
Build a local docker image:
```shell
cd mcp-vision
make build-docker
```
Add this to your `claude_desktop_config.json`:
If your local environment has access to an NVIDIA GPU:
```json
{
  "mcpServers": {
    "mcp-vision": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "mcp-vision"],
      "env": {}
    }
  }
}
```
Or, CPU only:
```json
{
  "mcpServers": {
    "mcp-vision": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "mcp-vision"],
      "env": {}
    }
  }
}
```
When running on CPU, the default large object detection model may take a long time to load and run inference. Consider setting a smaller model as `DEFAULT_OBJDET_MODEL` (you can also tell Claude directly to use a specific model).
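For example, one way to pass a smaller model into the container is a `-e` flag in the Docker args. This assumes the server reads the `DEFAULT_OBJDET_MODEL` environment variable inside the container; `google/owlvit-base-patch32` is one smaller OWL-ViT variant:

```json
{
  "mcpServers": {
    "mcp-vision": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "-e", "DEFAULT_OBJDET_MODEL=google/owlvit-base-patch32", "mcp-vision"],
      "env": {}
    }
  }
}
```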
(Beta) It is possible to run the public Docker image directly without building locally; however, the download time may interfere with Claude's loading of the server.
```json
{
  "mcpServers": {
    "mcp-vision": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "--runtime=nvidia", "--gpus", "all", "groundlight/mcp-vision:latest"],
      "env": {}
    }
  }
}
```
The following tools are currently available through the mcp-vision server:

- An object detection tool. Parameters: `image_path` (string): URL or file path; `candidate_labels` (list of strings): possible objects to detect; `hf_model` (optional string): defaults to `"google/owlvit-large-patch14"`, which can be slow on a non-GPU machine.
- A zoom tool that finds an object and crops the image to it. Parameters: `image_path` (string): URL or file path; `label` (string): the object label to find, zoom, and crop to; `hf_model` (optional string): defaults to `"google/owlvit-large-patch14"`, which can be slow on a non-GPU machine.

Run Claude Desktop with Claude Sonnet 3.7 and mcp-vision configured as an MCP server in `claude_desktop_config.json`.
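For reference, the zoom-and-crop step behind these tools can be sketched in a few lines of Python. The `zoom_to_box` helper and the 10% margin are illustrative, not the server's actual implementation; the `box` dict follows the HuggingFace zero-shot-object-detection pipeline's output format:

```python
from PIL import Image

def zoom_to_box(image, box, margin=0.1):
    """Crop to a detection box, padded by a relative margin on each side.

    `box` uses the HuggingFace zero-shot-object-detection output format:
    {"xmin": ..., "ymin": ..., "xmax": ..., "ymax": ...}.
    """
    w, h = image.size
    bw = box["xmax"] - box["xmin"]
    bh = box["ymax"] - box["ymin"]
    # Expand the box by the margin, clamped to the image bounds.
    left = max(0, int(box["xmin"] - margin * bw))
    top = max(0, int(box["ymin"] - margin * bh))
    right = min(w, int(box["xmax"] + margin * bw))
    bottom = min(h, int(box["ymax"] + margin * bh))
    return image.crop((left, top, right, bottom))

# Synthetic image and a hypothetical detection box:
img = Image.new("RGB", (640, 480))
crop = zoom_to_box(img, {"xmin": 100, "ymin": 100, "xmax": 300, "ymax": 200})
print(crop.size)  # (240, 120): the 200x100 box plus a 10% margin per side
```

In the real server, the box would come from a detection on the requested `label`, and the crop is what gives the model a higher-resolution view of small image regions.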
The prompt used in the example video and blog post was:
From the information on that advertising board, what is the type of this shop?
Options:
The shop is a yoga studio.
The shop is a cafe.
The shop is a seven-eleven.
The shop is a milk tea shop.
The image is the first image in the V*Bench/GPT4V-hard dataset and can be found here: https://huggingface.co/datasets/craigwu/vstar_bench/blob/main/GPT4V-hard/0.JPG (use the download link).
Run locally using the `uv` package manager:
```shell
uv install
uv run python mcp_vision
```
Build the Docker image locally:
```shell
make build-docker
```
Run the Docker image locally:
```shell
make run-docker-cpu   # or: make run-docker-gpu
```
[Groundlight Internal] Push the Docker image to Docker Hub (requires DockerHub credentials):
```shell
make push-docker
```
If Claude Desktop is failing to connect to `mcp-vision`:

On accounts that have web search enabled, Claude tends to prefer web search over local MCP tools. Disable web search for best results.