A comprehensive Gradio-based interface for running multiple state-of-the-art Vision-Language Models (VLMs) for Optical Character Recognition (OCR) and Visual Question Answering (VQA) tasks. This application supports both image and video inference across 5 different pre-trained models.
- Multi-Model Support: Integration of 5 different VLMs with specialized capabilities
- Dual Input Modes: Support for both image and video input processing
- Real-time Streaming: Live text generation with streaming output
- Advanced Configuration: Customizable generation parameters (temperature, top-p, top-k, etc.)
- User-friendly Interface: Clean Gradio web interface with example datasets
- GPU Acceleration: Optimized for CUDA-enabled environments
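To make the dual input modes concrete, here is a minimal, hypothetical Gradio layout with separate image and video tabs; the component names and stub handler are illustrative, not the exact wiring in app.py:

```python
import gradio as gr

def run_image(text, image):
    # Stub handler; the real app dispatches the query to the selected VLM.
    return f"(model output for: {text})"

with gr.Blocks() as demo:
    with gr.Tab("Image Inference"):
        query = gr.Textbox(label="Query")
        image = gr.Image(type="pil")
        output = gr.Textbox(label="Output")
        gr.Button("Run").click(run_image, inputs=[query, image], outputs=output)
    with gr.Tab("Video Inference"):
        gr.Video()  # the video tab mirrors the image tab, taking a file path input

demo.launch()
```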
- Source: Yuting6/Vision-Matters-7B
- Specialty: Enhanced visual perturbation framework for better reasoning
- Architecture: Based on Qwen2.5-VL with improved visual comprehension
- Source: prithivMLmods/WR30a-Deep-7B-0711
- Specialty: Image captioning, visual analysis, and image reasoning
- Training: Fine-tuned on 1.5 million image pairs for improved image understanding
- Source: yunfeixie/ViGaL-7B
- Specialty: Game-trained model with transferable reasoning skills
- Performance: Improved results on the MathVista and MMMU benchmarks
- Source: echo840/MonkeyOCR-pro-1.2B
- Specialty: Document OCR with Structure-Recognition-Relation (SRR) paradigm
- Focus: Efficient full-page document processing
- Source: maifoundations/Visionary-R1
- Specialty: Reinforcement learning-based visual reasoning
- Approach: Scalable training using only visual question-answer pairs
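Internally, a simple registry mapping display names to checkpoints is enough to drive the model picker. The display names below are illustrative; only the Hugging Face repo IDs come from the list above:

```python
# Illustrative display names -> Hugging Face repo IDs (from the model list above).
MODEL_CHOICES = {
    "Vision Matters 7B": "Yuting6/Vision-Matters-7B",
    "WR30a Deep 7B": "prithivMLmods/WR30a-Deep-7B-0711",
    "ViGaL 7B": "yunfeixie/ViGaL-7B",
    "MonkeyOCR Pro 1.2B": "echo840/MonkeyOCR-pro-1.2B",
    "Visionary R1": "maifoundations/Visionary-R1",
}
```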
- Python 3.8+
- CUDA-compatible GPU (recommended)
- 16GB+ RAM
- 50GB+ storage for models
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers gradio spaces opencv-python pillow numpy
```
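After installing the dependencies, a quick check confirms PyTorch can see the GPU (CPU fallback works but is much slower):

```python
import torch

# Confirm the CUDA build of PyTorch is installed and a GPU is visible.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```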
```bash
git clone https://github.com/PRITHIVSAKTHIUR/Multimodal-VLMs.git
cd Multimodal-VLMs
python app.py
```
- `MAX_INPUT_TOKEN_LENGTH`: Maximum input token length (default: 4096)
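A minimal sketch of how such a limit is typically consumed, assuming it is read from the environment (the exact mechanism in app.py may differ):

```python
import os

# Cap on prompt length; falls back to 4096 tokens when the variable is unset.
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))
```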
Once launched, access the interface through your browser at the provided local URL.
```python
generate_image(
    model_name: str,                 # selected model identifier
    text: str,                       # query text
    image: Image.Image,              # PIL Image object
    max_new_tokens: int = 1024,
    temperature: float = 0.6,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.2,
)
```
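Assuming `generate_image` is a streaming generator that yields progressively longer output (consistent with the real-time streaming feature above), a call might look like this; the model name and file path are illustrative:

```python
from PIL import Image

image = Image.open("sample_document.png")  # hypothetical input file

# Each yield is the accumulated output so far; the last one is the full answer.
for partial in generate_image(
    model_name="MonkeyOCR Pro 1.2B",  # illustrative display name
    text="Extract all text from this document.",
    image=image,
):
    print(partial)
```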
```python
generate_video(
    model_name: str,                 # selected model identifier
    text: str,                       # query text
    video_path: str,                 # path to the video file
    max_new_tokens: int = 1024,
    temperature: float = 0.6,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1.2,
)
```
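The video entry point is called the same way, with a file path in place of the PIL image; again a hedged sketch that keeps only the final streamed result:

```python
# Hypothetical call; `result` ends up holding the complete answer.
result = ""
for partial in generate_video(
    model_name="Vision Matters 7B",  # illustrative display name
    text="Describe the main action in this clip.",
    video_path="clips/demo.mp4",     # hypothetical path
):
    result = partial
print(result)
```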
- max_new_tokens: Maximum number of tokens to generate (1-2048)
- temperature: Randomness in generation (0.1-4.0)
- top_p: Nucleus sampling threshold (0.05-1.0)
- top_k: Top-k sampling parameter (1-1000)
- repetition_penalty: Penalty for repeated tokens (1.0-2.0)
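These sliders map directly onto standard Hugging Face generation arguments; a sketch of how they are typically forwarded (`model` and `inputs` are placeholders for a loaded model and a processed prompt batch):

```python
generation_kwargs = dict(
    max_new_tokens=1024,
    do_sample=True,   # sampling must be enabled for temperature/top-p/top-k to apply
    temperature=0.6,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.2,
)
output_ids = model.generate(**inputs, **generation_kwargs)
```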
- Frame Sampling: Extracts 10 evenly spaced frames from the input video, as sketched below
- Format Support: Standard video formats (MP4, AVI, MOV)
- Preprocessing: Frames are automatically preprocessed and converted to RGB
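A minimal sketch of the even-spacing strategy with OpenCV, assuming frames are handed to the processor as PIL images; the function name `downsample_video` is hypothetical:

```python
from typing import List

import cv2
import numpy as np
from PIL import Image

def downsample_video(video_path: str, num_frames: int = 10) -> List[Image.Image]:
    """Extract `num_frames` evenly spaced RGB frames from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; the models expect RGB.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```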
- Mathematical problem solving
- Document content extraction
- Scene explanation and analysis
- Expression simplification
- Variable solving
- Detailed video content analysis
- Action recognition and description
- Sequential frame understanding
- Models loaded with float16 precision
- Automatic device detection (CUDA/CPU)
- Efficient memory utilization across multiple models
- Streaming text generation for real-time feedback
- Threaded model inference to prevent UI blocking
- Optimized tokenization and preprocessing
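A hedged sketch of the loading pattern, assuming the Qwen2.5-VL-style checkpoints load through Transformers' `Qwen2_5_VLForConditionalGeneration` (the class may vary per model; the repo ID comes from the list above):

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# float16 halves memory use relative to float32.
processor = AutoProcessor.from_pretrained("Yuting6/Vision-Matters-7B", trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Yuting6/Vision-Matters-7B",
    torch_dtype=torch.float16,
).to(device).eval()
```

Streaming plus threading typically combines Transformers' `TextIteratorStreamer` with a background thread so the UI can yield tokens as they arrive; a sketch under those assumptions (`inputs` is a processed prompt batch):

```python
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(
    processor.tokenizer, skip_prompt=True, skip_special_tokens=True
)

# Run generation off the main thread so the interface stays responsive.
Thread(target=model.generate,
       kwargs=dict(**inputs, streamer=streamer, max_new_tokens=1024)).start()

buffer = ""
for new_text in streamer:
    buffer += new_text  # yield `buffer` to Gradio for a live-updating textbox
```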
- Video inference performance varies across models
- GPU memory requirements scale with model size
- Processing time depends on input complexity and generation length
- Some models may not perform optimally for all task types
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
This project is open-source and available under the MIT License.
- Hugging Face Transformers library
- Gradio framework for web interface
- Model developers and research teams
- Open-source computer vision community
For issues and questions:
- Create an issue in the GitHub repository
- Check the Hugging Face Space discussions
- Review model documentation for specific capabilities