A comprehensive multimodal OCR application that supports both image and video document processing using state-of-the-art vision-language models. This application provides an intuitive Gradio interface for extracting text, converting documents to markdown, and performing advanced document analysis.
Note: A live demo is available at https://huggingface.co/spaces/prithivMLmods/Multimodal-OCR2
- Multiple Model Support: Choose from 4 different OCR models optimized for various use cases
- Image Processing: Extract text and convert documents from images
- Video Processing: Process video content with OCR capabilities
- Document Conversion: Convert documents to structured markdown format
- Real-time Streaming: Get results as they are generated
- Advanced Configuration: Fine-tune generation parameters for optimal results
A multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features and stays fully compatible with the Docling ecosystem through seamless support for DoclingDocuments.
A powerful, state-of-the-art image-to-markdown OCR model that goes far beyond traditional text extraction. It transforms documents into structured markdown with intelligent content recognition and semantic tagging.
This model adopts a Structure-Recognition-Relation (SRR) triplet paradigm, which simplifies the multi-tool pipeline of modular approaches while avoiding the inefficiency of using large multimodal models for full-page document processing.
A bilingual document parsing model built specifically for real-world documents in Thai and English. It extracts and interprets embedded text, including chart labels and captions, in both languages.
- Clone the repository:
```bash
git clone https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR2.git
cd Multimodal-OCR2
```
- Install required dependencies:
```bash
pip install -r requirements.txt
```

Key dependencies:
- torch
- transformers
- gradio
- spaces
- numpy
- PIL (Pillow)
- opencv-python
- docling-core
```bash
python app.py
```
The application will launch a Gradio interface accessible through your web browser.
- Select the "Image Inference" tab
- Enter your query (e.g., "OCR the image", "Convert this page to docling")
- Upload an image file
- Choose your preferred model
- Adjust advanced parameters if needed
- Click Submit to process
- Select the "Video Inference" tab
- Enter your query (e.g., "Explain the video in detail")
- Upload a video file
- Choose your preferred model
- Adjust advanced parameters if needed
- Click Submit to process (the hosted demo can also be queried programmatically; see the sketch below)
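Because the hosted demo is a Gradio Space, it can also be queried programmatically with `gradio_client`. A minimal sketch; the endpoint names and parameters are whatever the Space exposes, so inspect them with `view_api()` instead of assuming them:

```python
from gradio_client import Client

# Connect to the hosted demo Space linked at the top of this README.
client = Client("prithivMLmods/Multimodal-OCR2")

# Print the callable endpoints and their expected parameters; use that
# output to build a matching client.predict(...) call.
client.view_api()
```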
The application provides several tunable generation parameters (a sketch of how they map onto a typical `transformers` generation call follows this list):
- Max New Tokens: Maximum number of tokens to generate (1-2048)
- Temperature: Controls randomness in generation (0.1-4.0)
- Top-p: Nucleus sampling parameter (0.05-1.0)
- Top-k: Top-k sampling parameter (1-1000)
- Repetition Penalty: Penalty for repetitive text (1.0-2.0)
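These sliders correspond to the standard `transformers` sampling arguments. Below is a minimal sketch of a streamed generation call that uses them, assuming a `TextIteratorStreamer`-based loop; the helper name and default values are illustrative, not the exact code in `app.py`:

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_generate(model, tokenizer, inputs, max_new_tokens=1024,
                    temperature=0.7, top_p=0.9, top_k=50,
                    repetition_penalty=1.2):
    """Yield partial output as it is generated (illustrative helper).

    For multimodal models, pass `processor.tokenizer` as `tokenizer` and the
    processor's encoded `inputs` (input ids plus pixel values).
    """
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
    )
    # Run generation in a background thread so tokens can be consumed from
    # the streamer as they arrive; this is what enables real-time streaming.
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        yield buffer
```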
- "OCR the image"
- "Convert this page to docling"
- "Convert chart to OTSL"
- "Convert code to text"
- "Convert this table to OTSL"
- "Convert formula to latex"
- "Explain the video in detail"
- "Extract text from video frames"
The application loads all models once at startup, moving them to the GPU when one is available. Models are loaded in 16-bit precision (float16) to reduce memory usage and speed up inference.
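In outline, loading one of the checkpoints looks roughly like this (a minimal sketch; the model ID below is only an example, and the actual classes and IDs in `app.py` may differ):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# Use CUDA when available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example checkpoint only; app.py loads several different OCR models.
MODEL_ID = "ds4sd/SmolDocling-256M-preview"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    # 16-bit weights halve GPU memory use; keep float32 on CPU for stability.
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device).eval()
```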
Videos are handled by extracting 10 evenly spaced frames, which the selected model then processes as a sequence of images.
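A minimal sketch of that frame-sampling step using OpenCV (already listed in the dependencies); the function name and details are illustrative rather than the exact code in `app.py`:

```python
import cv2
import numpy as np
from PIL import Image


def sample_frames(video_path, num_frames=10):
    """Return `num_frames` evenly spaced frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB before handing to the model.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```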
- Automatic padding for OTSL and code conversion tasks
- Value normalization for OCR and element identification
- Advanced postprocessing for structured document output
- Automatic conversion to markdown format (see the conversion sketch below)
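For models that emit DocTags, the markdown conversion typically runs through `docling-core`. A minimal sketch, following the standard DocTags workflow; the exact calls in `app.py`, and the API across docling-core versions, may differ:

```python
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument


def doctags_to_markdown(doctags: str, image) -> str:
    """Build a DoclingDocument from DocTags output and export it as markdown."""
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)
    return doc.export_to_markdown()
```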
The application uses CUDA acceleration when available and falls back to CPU processing otherwise.
- Minimum: 12GB RAM, CPU
- Recommended: 40GB+ RAM, CUDA-compatible GPU
- Storage: 50GB+ free space for model downloads
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source. Please check individual model licenses for specific usage terms.
- Hugging Face Transformers library
- Gradio for the user interface
- All model creators and maintainers
- Docling team for document processing capabilities
Important: The community GPU grant was provided by Hugging Face. Special thanks to them! 🤗🚀
For issues and questions, please open an issue on the GitHub repository.