A comprehensive multimodal OCR application that supports both image and video document processing using state-of-the-art vision-language models. This application provides an intuitive Gradio interface for extracting text, converting documents to markdown, and performing advanced document analysis.
Note: A live demo is available at https://huggingface.co/spaces/prithivMLmods/Multimodal-OCR2
- Multiple Model Support: Choose from 4 different OCR models optimized for various use cases
- Image Processing: Extract text and convert documents from images
- Video Processing: Process video content with OCR capabilities
- Document Conversion: Convert documents to structured markdown format
- Real-time Streaming: Get results as they are generated
- Advanced Configuration: Fine-tune generation parameters for optimal results
A multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features and stays fully compatible with the Docling ecosystem through seamless support for DoclingDocuments.
A powerful, state-of-the-art image-to-markdown OCR model that goes far beyond traditional text extraction. It transforms documents into structured markdown with intelligent content recognition and semantic tagging.
This model adopts a Structure-Recognition-Relation (SRR) triplet paradigm, which simplifies the multi-tool pipeline of modular approaches while avoiding the inefficiency of using large multimodal models for full-page document processing.
A bilingual document parsing model built specifically for real-world documents in Thai and English. It extracts and interprets embedded text, including chart labels and captions, in both languages.
- Clone the repository:
```bash
git clone https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR2.git
cd Multimodal-OCR2
```
- Install required dependencies:
```bash
pip install -r requirements.txt
```

Key dependencies:
- torch
- transformers
- gradio
- spaces
- numpy
- PIL (Pillow)
- opencv-python
- docling-core
```bash
python app.py
```
The application will launch a Gradio interface accessible through your web browser.
- Select the "Image Inference" tab
- Enter your query (e.g., "OCR the image", "Convert this page to docling")
- Upload an image file
- Choose your preferred model
- Adjust advanced parameters if needed
- Click Submit to process
- Select the "Video Inference" tab
- Enter your query (e.g., "Explain the video in detail")
- Upload a video file
- Choose your preferred model
- Adjust advanced parameters if needed
- Click Submit to process (the hosted demo can also be queried programmatically; see the sketch below)
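Because the hosted demo is a Gradio Space, it can also be queried programmatically with `gradio_client`. A minimal sketch; the endpoint names and parameters are whatever the Space exposes, so inspect them with `view_api()` instead of assuming them:

```python
from gradio_client import Client

# Connect to the hosted demo Space linked at the top of this README.
client = Client("prithivMLmods/Multimodal-OCR2")

# Print the callable endpoints and their expected parameters; use that
# output to build a matching client.predict(...) call.
client.view_api()
```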
The application provides several tunable generation parameters (a sketch of how they map onto a typical `transformers` generation call follows this list):
- Max New Tokens: Maximum number of tokens to generate (1-2048)
- Temperature: Controls randomness in generation (0.1-4.0)
- Top-p: Nucleus sampling parameter (0.05-1.0)
- Top-k: Top-k sampling parameter (1-1000)
- Repetition Penalty: Penalty for repetitive text (1.0-2.0)
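These sliders correspond to the standard `transformers` sampling arguments. Below is a minimal sketch of a streamed generation call that uses them, assuming a `TextIteratorStreamer`-based loop; the helper name and default values are illustrative, not the exact code in `app.py`:

```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_generate(model, tokenizer, inputs, max_new_tokens=1024,
                    temperature=0.7, top_p=0.9, top_k=50,
                    repetition_penalty=1.2):
    """Yield partial output as it is generated (illustrative helper).

    For multimodal models, pass `processor.tokenizer` as `tokenizer` and the
    processor's encoded `inputs` (input ids plus pixel values).
    """
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
    )
    # Run generation in a background thread so tokens can be consumed from
    # the streamer as they arrive; this is what enables real-time streaming.
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        yield buffer
```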
- "OCR the image"
- "Convert this page to docling"
- "Convert chart to OTSL"
- "Convert code to text"
- "Convert this table to OTSL"
- "Convert formula to latex"
- "Explain the video in detail"
- "Extract text from video frames"
The application loads all models once at startup, moving them to the GPU when one is available. Models are loaded in 16-bit precision (float16) to reduce memory usage and speed up inference.
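In outline, loading one of the checkpoints looks roughly like this (a minimal sketch; the model ID below is only an example, and the actual classes and IDs in `app.py` may differ):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

# Use CUDA when available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Example checkpoint only; app.py loads several different OCR models.
MODEL_ID = "ds4sd/SmolDocling-256M-preview"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    # 16-bit weights halve GPU memory use; keep float32 on CPU for stability.
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device).eval()
```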
Videos are handled by extracting 10 evenly spaced frames, which the selected model then processes as a sequence of images.
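A minimal sketch of that frame-sampling step using OpenCV (already listed in the dependencies); the function name and details are illustrative rather than the exact code in `app.py`:

```python
import cv2
import numpy as np
from PIL import Image


def sample_frames(video_path, num_frames=10):
    """Return `num_frames` evenly spaced frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        # OpenCV decodes to BGR; convert to RGB before handing to the model.
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```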
- Automatic padding for OTSL and code conversion tasks
- Value normalization for OCR and element identification
- Advanced postprocessing for structured document output
- Automatic conversion to markdown format (see the conversion sketch below)
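For models that emit DocTags, the markdown conversion typically runs through `docling-core`. A minimal sketch, following the standard DocTags workflow; the exact calls in `app.py`, and the API across docling-core versions, may differ:

```python
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument


def doctags_to_markdown(doctags: str, image) -> str:
    """Build a DoclingDocument from DocTags output and export it as markdown."""
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)
    return doc.export_to_markdown()
```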
The application uses CUDA acceleration when available and falls back to CPU processing otherwise.
- Minimum: 12GB RAM, CPU
- Recommended: 40GB+ RAM, CUDA-compatible GPU
- Storage: 50GB+ free space for model downloads
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open source. Please check individual model licenses for specific usage terms.
- Hugging Face Transformers library
- Gradio for the user interface
- All model creators and maintainers
- Docling team for document processing capabilities
Important: The community GPU grant was provided by Hugging Face. Special thanks to them! 🤗🚀
For issues and questions, please open an issue on the GitHub repository.