An experimental document-focused Vision-Language Model application that provides advanced document analysis, text extraction, and multimodal understanding capabilities. This application features a streamlined Gradio interface for processing both images and videos using state-of-the-art vision-language models specialized in document understanding.
- Document-Specialized Models: Three cutting-edge vision-language models optimized for document processing
- Image Processing: Advanced document analysis and text extraction from images
- Video Processing: Frame-by-frame video analysis with temporal understanding
- Real-time Streaming: Live output generation with immediate feedback
- Dual Output Format: Raw text stream and formatted markdown results
- Canvas-Style Interface: Clean, professional output display
- Advanced Parameter Control: Fine-tune generation settings for optimal results
The Document Retrieval and Extraction Expert (DREX) model is a specialized fine-tune of DocScopeOCR-7B, optimized for document retrieval, content extraction, and analysis. It is built on the Qwen2.5-VL architecture for superior document understanding.
The Video Information Retrieval and Extraction Expert (VIREX) model is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for advanced video understanding, image comprehension, and natural-language decision-making through Chain-of-Thought reasoning.
Based on Qwen2-VL-7B, the olmOCR-7B-0225 model is optimized for document-level optical character recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting, with a focus on high-fidelity visual-textual comprehension.
- Clone the repository:

```bash
git clone https://github.com/PRITHIVSAKTHIUR/Doc-VLMs-exp.git
cd Doc-VLMs-exp
```

- Install required dependencies:

```bash
pip install -r requirements.txt
```
- torch
- transformers
- gradio
- spaces
- numpy
- Pillow (PIL)
- opencv-python
Run the application:

```bash
python app.py
```
The application will launch a Gradio interface accessible through your web browser.
- Navigate to the "Image Inference" tab
- Enter your query in the text field
- Upload an image file
- Select your preferred model (DREX-062225-exp recommended for documents)
- Adjust advanced parameters if needed
- Click Submit to process
- Navigate to the "Video Inference" tab
- Enter your query describing what you want to analyze
- Upload a video file
- Select your preferred model (VIREX-062225-exp recommended for videos)
- Configure generation parameters as needed
- Click Submit to process
The application extracts 10 evenly spaced frames from videos and processes them with timestamp information for comprehensive analysis.
Customize the generation process with these parameters (a usage sketch follows the list):
- Max New Tokens: Control output length (1-2048 tokens)
- Temperature: Adjust creativity/randomness (0.1-4.0)
- Top-p: Nucleus sampling threshold (0.05-1.0)
- Top-k: Top-k sampling limit (1-1000)
- Repetition Penalty: Reduce repetitive output (1.0-2.0)
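These names correspond closely to standard Hugging Face transformers generation arguments. A minimal sketch of passing such settings to `generate()`, assuming a loaded `model` and processed `inputs` (the app's exact wiring may differ):

```python
# Illustrative values only; `model` and `inputs` are assumed to be a loaded
# VLM and its processed inputs (see the loading sketch further below).
generation_kwargs = dict(
    max_new_tokens=1024,     # Max New Tokens: caps output length
    temperature=0.6,         # Temperature: lower = more focused, higher = more creative
    top_p=0.9,               # Top-p: nucleus sampling threshold
    top_k=50,                # Top-k: limits the sampling pool
    repetition_penalty=1.2,  # Repetition Penalty: discourages repeated phrases
    do_sample=True,          # sampling must be enabled for temperature/top-p/top-k
)
output_ids = model.generate(**inputs, **generation_kwargs)
```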
- "Convert this page to doc [text] precisely."
- "Extract all text from this document."
- "Analyze the structure of this document."
- "Convert chart to OTSL."
- "Identify key information in this form."
- "Explain the video in detail."
- "Describe what happens in this advertisement."
- "Analyze the content of this presentation."
- "Extract text visible in the video frames."
The application provides two output formats (see the streaming sketch below):
- Raw Output Stream: Real-time text generation as it happens
- Formatted Result: Clean markdown formatting for better readability
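In transformers, live streaming of this kind is typically built on `TextIteratorStreamer`, which yields text chunks while generation runs in a background thread. A minimal sketch, assuming `model`, `processor`, and `inputs` from the surrounding examples:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# `model`, `processor`, and `inputs` are assumed from the packing/loading sketches.
streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=1024, streamer=streamer)).start()

buffer = ""
for chunk in streamer:
    buffer += chunk                   # accumulate for the formatted markdown result
    print(chunk, end="", flush=True)  # raw output stream, printed as it arrives
```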
All models are based on the Qwen2.5-VL and Qwen2-VL architectures, fine-tuned for specific document and video understanding tasks.
The application automatically uses CUDA when available, with automatic fallback to CPU processing.
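In PyTorch terms, that fallback is a one-line check:

```python
import torch

# Prefer the GPU when one is available; otherwise run on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
```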
Videos are processed by the following steps (see the code sketch after this list):
- Extracting 10 evenly distributed frames
- Converting frames to PIL images
- Adding timestamp information
- Processing frames sequentially with the selected model
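A minimal sketch of that pipeline using OpenCV and Pillow (function and variable names are illustrative, not the app's actual code):

```python
import cv2
from PIL import Image

def downsample_video(video_path: str, num_frames: int = 10):
    """Extract `num_frames` evenly spaced frames plus timestamps from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # guard against a zero FPS report
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB before handing to PIL.
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames.append((image, round(idx / fps, 2)))  # (PIL image, timestamp in seconds)
    cap.release()
    return frames
```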
Models are loaded with 16-bit precision (float16) to optimize memory usage while maintaining performance.
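Loading then looks roughly like this (the repository ID is a placeholder for one of the three checkpoints; see app.py for the exact paths):

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder checkpoint ID
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # 16-bit weights: roughly half the memory of float32
).to(device).eval()
```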
- Minimum: 8GB RAM, modern CPU
- Recommended: 16GB+ RAM, CUDA-compatible GPU with 8GB+ VRAM
- Storage: 15GB+ free space for model downloads
- For Document Analysis: Use DREX-062225-exp for best results with text extraction and document structure analysis
- For Video Content: Use VIREX-062225-exp for comprehensive video understanding and reasoning
- For OCR Tasks: Use olmOCR-7B-0225 for high-accuracy text recognition and LaTeX formatting
- Use GPU acceleration when available for faster processing
- Adjust max_new_tokens based on expected output length
- Lower temperature values for more focused, factual outputs
- Higher temperature values for more creative responses
- Use appropriate models for specific tasks (document vs video)
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is open source. Please check individual model licenses for specific usage terms.
- Hugging Face Transformers library
- Gradio for the user interface framework
- Qwen2.5-VL and Qwen2-VL model teams
- Allen Institute for AI (olmOCR model)
For bug reports and feature requests, please visit the GitHub repository or use the "Report Bug" link in the application interface.