A comprehensive multi-modal AI application that combines document analysis, optical character recognition (OCR), video understanding, and object detection capabilities using state-of-the-art vision-language models.
- Document Analysis: Extract and convert document content to structured formats (text, tables, markdown)
- OCR Processing: Advanced optical character recognition for various document types
- Video Understanding: Analyze and describe video content with temporal awareness
- Object Detection: Locate and identify objects in images with bounding box annotations
- Multi-Model Support: Choose from four specialized vision-language models
- Camel-Doc-OCR-062825: Fine-tuned Qwen2.5-VL model optimized for document retrieval and content extraction
- OCRFlux-3B: Specialized 3B parameter model for OCR tasks with high accuracy
- ViLaSR-7B: Advanced spatial reasoning model with visual drawing capabilities
- ShotVL-7B: Cinematic language understanding model trained on high-quality video datasets
- Python 3.8+
- CUDA-compatible GPU (recommended)
- Git
# Clone the repository
git clone https://github.com/PRITHIVSAKTHIUR/Doc-VLMs-v2-Localization.git
cd Doc-VLMs-v2-Localization
# Install dependencies
pip install -r requirements.txt
gradio
spaces
torch
numpy
pillow
opencv-python
transformers
qwen-vl-utils
python app.py
The application will launch a Gradio interface accessible through your web browser.
- Upload images for analysis
- Query the model with natural language
- Get structured outputs including text extraction and document conversion
- Upload video files for analysis
- Generate detailed descriptions of video content
- Temporal understanding with frame-by-frame analysis
- Upload images for object localization
- Specify objects to detect using natural language
- View annotated images with bounding boxes
- Get precise coordinate information
- Max New Tokens: Control response length (1-2048)
- Temperature: Adjust creativity/randomness (0.1-4.0)
- Top-p: Nucleus sampling parameter (0.05-1.0)
- Top-k: Token selection threshold (1-1000)
- Repetition Penalty: Prevent repetitive outputs (1.0-2.0)
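These controls correspond to the standard Hugging Face `generate()` sampling arguments. As a rough, non-authoritative sketch of how they can be wired up (values and names below are illustrative, not the application's exact defaults):

```python
# Illustrative mapping of the advanced-option sliders to Hugging Face generation kwargs.
# Default values here are assumptions, not the application's actual settings.
generation_kwargs = dict(
    max_new_tokens=1024,     # Max New Tokens (1-2048)
    do_sample=True,
    temperature=0.6,         # Temperature (0.1-4.0)
    top_p=0.9,               # Top-p / nucleus sampling (0.05-1.0)
    top_k=50,                # Top-k (1-1000)
    repetition_penalty=1.2,  # Repetition Penalty (1.0-2.0)
)
```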
Process image inputs with the selected model
- Parameters: Model selection, query text, PIL image, generation parameters
- Returns: Streaming text response
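A minimal sketch of what such a call can look like with the Transformers API, assuming a Qwen2.5-VL-style model and processor; the model id, prompt, and streaming setup below are illustrative and not the app's exact implementation:

```python
from threading import Thread
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration, TextIteratorStreamer

# Assumed Hugging Face repo id for one of the four models listed above.
model_id = "prithivMLmods/Camel-Doc-OCR-062825"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

image = Image.open("document.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "convert this page to doc [text] precisely for markdown"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Stream tokens as they are produced, mirroring the Gradio streaming output.
streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=1024)).start()
for chunk in streamer:
    print(chunk, end="")
```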
Analyze video content with temporal understanding
- Parameters: Model selection, query text, video file path, generation parameters
- Returns: Streaming analysis response
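Video inference generally reuses the image pipeline: frames are sampled from the clip (see the frame-extraction helper further down) and passed as a sequence of images in a single chat turn. A hedged sketch of how that message might be assembled, with the helper name and prompt wording assumed:

```python
def build_video_messages(query, frames_with_timestamps):
    """Assumed helper: interleave the user query with timestamped frames
    so the model can reason about temporal order."""
    content = [{"type": "text", "text": query}]
    for _image, timestamp in frames_with_timestamps:
        content.append({"type": "text", "text": f"Frame at {timestamp:.1f}s:"})
        content.append({"type": "image"})
    return [{"role": "user", "content": content}]
```

The resulting messages then go through the same `apply_chat_template`, processor, and streamer path as image inference, with `images=[frame for frame, _ in frames_with_timestamps]`.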
Perform object detection with bounding box output
- Parameters: PIL image, detection query, system prompt
- Returns: Detection results, parsed coordinates, annotated image
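A rough sketch of the post-processing half of that flow: pulling coordinate lists out of the model's text response and drawing them on a copy of the input image. The exact response format the models emit may differ, so treat the regex and parsing below as assumptions:

```python
import ast
import re
from PIL import ImageDraw

def extract_boxes(response_text):
    """Assumed parsing step: pull [x1, y1, x2, y2] lists out of the model's reply."""
    candidates = [ast.literal_eval(m) for m in re.findall(r"\[[\d.,\s]+\]", response_text)]
    return [box for box in candidates if isinstance(box, list) and len(box) == 4]

def annotate(image, boxes):
    """Draw bounding boxes on a copy of the input image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for box in boxes:
        draw.rectangle(box, outline="red", width=3)
    return annotated
```

Between these two steps, the raw coordinates are mapped from the model's 512x512 frame of reference to the actual image size; that is what the rescaling helper described below is for.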
Extract representative frames from video files
- Parameters: Path to video file
- Returns: List of PIL images with timestamps
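A minimal sketch of such a helper using OpenCV (already listed in the requirements), assuming evenly spaced sampling of 10 frames per video as noted in the performance section; the function name is an assumption:

```python
import cv2
from PIL import Image

def downsample_video(video_path, num_frames=10):
    """Assumed helper: sample evenly spaced frames and return (PIL image, timestamp) pairs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    frames = []
    for index in (int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:
            continue
        frames.append((Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)), index / fps))
    cap.release()
    return frames
```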
Rescale normalized bounding-box coordinates to the actual image dimensions
- Parameters: Bounding box coordinates, image dimensions
- Returns: Rescaled coordinate arrays
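Since the models report boxes in a normalized 512x512 space (see the notes below), the rescaling step is a simple linear mapping. A hedged sketch, with the function name assumed:

```python
def rescale_bounding_boxes(boxes, image_width, image_height, base=512.0):
    """Assumed helper: map [x1, y1, x2, y2] boxes from the model's 512x512
    frame of reference onto the actual image dimensions."""
    sx, sy = image_width / base, image_height / base
    return [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]
```

For example, a box of `[256, 256, 384, 384]` on a 1024x768 image maps to `[512.0, 384.0, 768.0, 576.0]`.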
- Base Model: Qwen2.5-VL-7B-Instruct
- Specialization: Document comprehension and OCR
- Use Cases: Text extraction, table conversion, document analysis
- Base Model: Qwen2.5-VL-3B-Instruct
- Specialization: Optical character recognition
- Use Cases: Text recognition, document digitization
- Base Model: Advanced spatial reasoning architecture
- Specialization: Visual drawing and spatial understanding
- Use Cases: Complex visual reasoning tasks
- Base Model: Qwen2.5-VL-7B-Instruct
- Specialization: Cinematic content understanding
- Use Cases: Video analysis, shot detection, narrative understanding
# Query: "convert this page to doc [text] precisely for markdown"
# Input: Document image
# Output: Structured markdown format
# Query: "detect red and yellow cars"
# Input: Street scene image
# Output: Bounding boxes around detected vehicles
# Query: "explain the ad video in detail"
# Input: Advertisement video file
# Output: Comprehensive video content analysis
- GPU acceleration recommended for optimal performance
- Video processing involves frame sampling (10 frames per video)
- Object detection uses a normalized 512x512 coordinate system
- Streaming responses provide real-time feedback
- Video inference performance may vary across models
- GPU memory requirements scale with model size
- Processing time depends on input complexity and hardware
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is licensed under the MIT License. See LICENSE file for details.
- Built on Hugging Face Transformers
- Powered by Gradio for the web interface
- Utilizes Qwen vision-language model architecture
- Integrated with Spaces GPU acceleration
For issues and questions:
- Open an issue on GitHub
- Check the Hugging Face model documentation
- Review the examples and documentation