An experimental document-focused Vision-Language Model application that provides advanced document analysis, text extraction, and multimodal understanding capabilities. This application features a streamlined Gradio interface for processing both images and videos using state-of-the-art vision-language models specialized in document understanding.
- Document-Specialized Models: Three cutting-edge vision-language models optimized for document processing
- Image Processing: Advanced document analysis and text extraction from images
- Video Processing: Frame-by-frame video analysis with temporal understanding
- Real-time Streaming: Live output generation with immediate feedback
- Dual Output Format: Raw text stream and formatted markdown results
- Canvas-Style Interface: Clean, professional output display
- Advanced Parameter Control: Fine-tune generation settings for optimal results
The Document Retrieval and Extraction Expert (DREX) model is a specialized fine-tune of DocScopeOCR-7B, optimized for document retrieval, content extraction, and analysis. It is built on the Qwen2.5-VL architecture for superior document understanding.
The Video Information Retrieval and Extraction Expert (VIREX) model is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for advanced video understanding, image comprehension, and natural-language decision-making through Chain-of-Thought reasoning.
Based on Qwen2-VL-7B, the olmOCR-7B-0225 model is optimized for document-level optical character recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting, with a focus on high-fidelity visual-textual comprehension.
- Clone the repository:

```bash
git clone https://github.com/PRITHIVSAKTHIUR/Doc-VLMs-exp.git
cd Doc-VLMs-exp
```

- Install required dependencies:

```bash
pip install -r requirements.txt
```
- torch
- transformers
- gradio
- spaces
- numpy
- Pillow (PIL)
- opencv-python
Run the application:

```bash
python app.py
```
The application will launch a Gradio interface accessible through your web browser.
- Navigate to the "Image Inference" tab
- Enter your query in the text field
- Upload an image file
- Select your preferred model (DREX-062225-exp recommended for documents)
- Adjust advanced parameters if needed
- Click Submit to process
- Navigate to the "Video Inference" tab
- Enter your query describing what you want to analyze
- Upload a video file
- Select your preferred model (VIREX-062225-exp recommended for videos)
- Configure generation parameters as needed
- Click Submit to process
The application extracts 10 evenly spaced frames from videos and processes them with timestamp information for comprehensive analysis.
Customize the generation process with these parameters (a usage sketch follows the list):
- Max New Tokens: Control output length (1-2048 tokens)
- Temperature: Adjust creativity/randomness (0.1-4.0)
- Top-p: Nucleus sampling threshold (0.05-1.0)
- Top-k: Top-k sampling limit (1-1000)
- Repetition Penalty: Reduce repetitive output (1.0-2.0)
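These names correspond closely to standard Hugging Face transformers generation arguments. A minimal sketch of passing such settings to `generate()`, assuming a loaded `model` and processed `inputs` (the app's exact wiring may differ):

```python
# Illustrative values only; `model` and `inputs` are assumed to be a loaded
# VLM and its processed inputs (see the loading sketch further below).
generation_kwargs = dict(
    max_new_tokens=1024,     # Max New Tokens: caps output length
    temperature=0.6,         # Temperature: lower = more focused, higher = more creative
    top_p=0.9,               # Top-p: nucleus sampling threshold
    top_k=50,                # Top-k: limits the sampling pool
    repetition_penalty=1.2,  # Repetition Penalty: discourages repeated phrases
    do_sample=True,          # sampling must be enabled for temperature/top-p/top-k
)
output_ids = model.generate(**inputs, **generation_kwargs)
```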
- "Convert this page to doc [text] precisely."
- "Extract all text from this document."
- "Analyze the structure of this document."
- "Convert chart to OTSL."
- "Identify key information in this form."
- "Explain the video in detail."
- "Describe what happens in this advertisement."
- "Analyze the content of this presentation."
- "Extract text visible in the video frames."
The application provides two output formats (see the streaming sketch below):
- Raw Output Stream: Real-time text generation as it happens
- Formatted Result: Clean markdown formatting for better readability
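In transformers, live streaming of this kind is typically built on `TextIteratorStreamer`, which yields text chunks while generation runs in a background thread. A minimal sketch, assuming `model`, `processor`, and `inputs` from the surrounding examples:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# `model`, `processor`, and `inputs` are assumed from the packing/loading sketches.
streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs=dict(**inputs, max_new_tokens=1024, streamer=streamer)).start()

buffer = ""
for chunk in streamer:
    buffer += chunk                   # accumulate for the formatted markdown result
    print(chunk, end="", flush=True)  # raw output stream, printed as it arrives
```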
All models are based on the Qwen2.5-VL and Qwen2-VL architectures, fine-tuned for specific document and video understanding tasks.
The application automatically uses CUDA when available, with automatic fallback to CPU processing.
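In PyTorch terms, that fallback is a one-line check:

```python
import torch

# Prefer the GPU when one is available; otherwise run on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
```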
Videos are processed by the following steps (see the code sketch after this list):
- Extracting 10 evenly distributed frames
- Converting frames to PIL images
- Adding timestamp information
- Processing frames sequentially with the selected model
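A minimal sketch of that pipeline using OpenCV and Pillow (function and variable names are illustrative, not the app's actual code):

```python
import cv2
from PIL import Image

def downsample_video(video_path: str, num_frames: int = 10):
    """Extract `num_frames` evenly spaced frames plus timestamps from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # guard against a zero FPS report
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB before handing to PIL.
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames.append((image, round(idx / fps, 2)))  # (PIL image, timestamp in seconds)
    cap.release()
    return frames
```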
Models are loaded with 16-bit precision (float16) to optimize memory usage while maintaining performance.
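Loading then looks roughly like this (the repository ID is a placeholder for one of the three checkpoints; see app.py for the exact paths):

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder checkpoint ID
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # 16-bit weights: roughly half the memory of float32
).to(device).eval()
```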
- Minimum: 8GB RAM, modern CPU
- Recommended: 16GB+ RAM, CUDA-compatible GPU with 8GB+ VRAM
- Storage: 15GB+ free space for model downloads
- For Document Analysis: Use DREX-062225-exp for best results with text extraction and document structure analysis
- For Video Content: Use VIREX-062225-exp for comprehensive video understanding and reasoning
- For OCR Tasks: Use olmOCR-7B-0225 for high-accuracy text recognition and LaTeX formatting
- Use GPU acceleration when available for faster processing
- Adjust max_new_tokens based on expected output length
- Lower temperature values for more focused, factual outputs
- Higher temperature values for more creative responses
- Use appropriate models for specific tasks (document vs video)
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is open source. Please check individual model licenses for specific usage terms.
- Hugging Face Transformers library
- Gradio for the user interface framework
- Qwen2.5-VL and Qwen2-VL model teams
- Allen Institute for AI (olmOCR model)
For bug reports and feature requests, please visit the GitHub repository or use the "Report Bug" link in the application interface.