A specialized optical character recognition (OCR) application built on advanced vision-language models, designed for document-level OCR, long-context understanding, and mathematical LaTeX formatting. Supports both image and video processing with multiple state-of-the-art models.
- Advanced OCR Models: Three specialized models optimized for different document processing tasks
- Document-Level OCR: Comprehensive text extraction with context understanding
- Mathematical LaTeX Support: Accurate conversion of mathematical expressions to LaTeX format
- Image & Video Processing: Handle both static images and video content
- Interactive Web Interface: User-friendly Gradio interface with real-time streaming
- Long-Context Understanding: Process complex documents with extended context
- Structure-Recognition-Relation: Advanced document understanding paradigm
docscopeOCR-7B-050425-exp: a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for document-level optical character recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting.
MonkeyOCR-Recognition: adopts a Structure-Recognition-Relation (SRR) triplet paradigm, which simplifies the multi-tool pipeline of modular approaches while avoiding the inefficiency of processing full pages with a single large multimodal model.
coreOCR-7B-050325-preview: a fine-tuned version of Qwen2-VL-7B, optimized for document-level optical character recognition (OCR), long-context vision-language understanding, and accurate image-to-text conversion with mathematical LaTeX formatting.
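As a minimal sketch, any of these checkpoints can be loaded directly with Transformers in half precision. The repository ID below is a placeholder (the actual Hugging Face paths are an assumption here), and the Qwen2-VL-based coreOCR checkpoint would use `Qwen2VLForConditionalGeneration` instead of the Qwen2.5-VL class shown:

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder repository ID -- replace with the actual Hugging Face path of the
# checkpoint you want (docscopeOCR, MonkeyOCR-Recognition, or coreOCR).
MODEL_ID = "your-namespace/docscopeOCR-7B-050425-exp"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision, as noted in the technical specifications below
    device_map="auto",
).eval()
```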
- Clone the repository:
git clone https://github.com/PRITHIVSAKTHIUR/Core-OCR.git
cd Core-OCR
- Install required dependencies:
pip install torch torchvision torchaudio
pip install transformers
pip install gradio
pip install spaces
pip install opencv-python
pip install pillow
pip install numpy
- Run the application:
python app.py
The application will launch a Gradio interface accessible through your web browser at the displayed URL.
- Navigate to the "Image Inference" tab
- Enter your query in the text box:
  - "fill the correct numbers" - For mathematical problem solving
  - "ocr the image" - For general text extraction
  - "explain the scene" - For comprehensive image analysis
- Upload an image file (PNG, JPG, JPEG supported)
- Select your preferred model from the radio buttons
- Click "Submit" to process
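For reference, a minimal programmatic sketch of the same image flow, reusing the `model` and `processor` loaded above; the file name and prompt are illustrative:

```python
from PIL import Image

image = Image.open("sample_document.png").convert("RGB")  # illustrative file name
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "ocr the image"},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(answer)
```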
- Navigate to the "Video Inference" tab
- Enter your query:
  - "Explain the ad in detail" - For advertisement analysis
  - "Identify the main actions in the coca cola ad..." - For action recognition
- Upload a video file (MP4, AVI, MOV supported)
- Select your preferred model
- Click "Submit" to process
The application automatically extracts 10 evenly distributed frames from videos for comprehensive analysis.
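A sketch of how such uniform frame sampling can be implemented with OpenCV; the actual `downsample_video` helper in `app.py` may differ in details:

```python
import cv2
from PIL import Image

def downsample_video(video_path, num_frames=10):
    """Return `num_frames` evenly spaced (PIL.Image, timestamp-in-seconds) pairs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frames = []
    for i in range(num_frames):
        index = int(i * max(total - 1, 0) / max(num_frames - 1, 1))
        cap.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = cap.read()
        if not ok:
            continue
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append((Image.fromarray(rgb), round(index / fps, 2)))
    cap.release()
    return frames
```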
- Max New Tokens: Control response length (1-2048 tokens)
- Temperature: Adjust creativity/randomness (0.1-4.0)
- Top-p: Configure nucleus sampling (0.05-1.0)
- Top-k: Set vocabulary consideration range (1-1000)
- Repetition Penalty: Prevent repetitive outputs (1.0-2.0)
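These options correspond directly to Hugging Face `generate()` arguments. A sketch with illustrative values (the app's actual defaults may differ), reusing `model` and `inputs` from the earlier snippets:

```python
generation_kwargs = dict(
    max_new_tokens=1024,     # Max New Tokens (1-2048)
    temperature=0.6,         # Temperature (0.1-4.0)
    top_p=0.9,               # Top-p (0.05-1.0)
    top_k=50,                # Top-k (1-1000)
    repetition_penalty=1.2,  # Repetition Penalty (1.0-2.0)
    do_sample=True,          # sampling must be enabled for temperature/top-p/top-k to take effect
)
output_ids = model.generate(**inputs, **generation_kwargs)
```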
- docscopeOCR-7B-050425-exp: Best for complex documents with mathematical content
- MonkeyOCR-Recognition: Optimal for structured document processing with relation understanding
- coreOCR-7B-050325-preview: Excellent for general OCR tasks with long-context requirements
- GPU: CUDA-compatible GPU recommended (minimum 8GB VRAM)
- RAM: 16GB+ system memory recommended
- Storage: 30GB+ free space for model downloads
- CPU: Multi-core processor for video processing
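A quick, optional check (not part of the project) that a suitable CUDA GPU is visible before launching the app:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; inference will fall back to CPU and be much slower.")
```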
- Base Models: Qwen2-VL and Qwen2.5-VL architectures
- Precision: Half-precision (float16) for memory efficiency
- Context Length: Up to 4096 input tokens
- Streaming: Real-time text generation with TextIteratorStreamer
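The real-time streaming follows the standard `TextIteratorStreamer` pattern: generation runs in a background thread while partial text is consumed as it is produced. A sketch, reusing `model`, `processor`, and `inputs` from the earlier snippets:

```python
from threading import Thread
from transformers import TextIteratorStreamer

streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=1024),
)
thread.start()

# Print tokens as they arrive; the Gradio UI updates its output box the same way.
for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```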
- Frame Extraction: 10 evenly distributed frames per video
- Timestamp Preservation: Each frame tagged with temporal information
- Format Support: MP4, AVI, MOV, and other OpenCV-compatible formats
- Resolution: Maintains original aspect ratio with PIL processing
- Academic papers with mathematical equations
- Financial documents with tables and charts
- Legal documents with complex formatting
- Technical manuals with diagrams
- LaTeX equation extraction
- Formula recognition and conversion
- Mathematical problem solving
- Scientific notation processing
- Advertisement content analysis
- Educational video transcription
- Presentation slide extraction
- Tutorial step identification
generate_image(model_name, text, image, max_new_tokens, temperature, top_p, top_k, repetition_penalty)
Processes single images with specified model and parameters.
generate_video(model_name, text, video_path, max_new_tokens, temperature, top_p, top_k, repetition_penalty)
Processes video files with frame extraction and temporal analysis.
downsample_video(video_path)
Extracts evenly distributed frames from video files with timestamps.
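A hypothetical usage sketch, assuming `app.py` is importable and that these helpers are generators that yield the text produced so far (consistent with the streaming interface); the file name and parameter values are illustrative:

```python
from PIL import Image
from app import generate_image  # assumes app.py is on the Python path

image = Image.open("invoice.png").convert("RGB")  # illustrative file name

text = ""
for text in generate_image(
    "docscopeOCR-7B-050425-exp",  # model_name
    "ocr the image",              # text
    image,                        # image
    max_new_tokens=1024,
    temperature=0.6,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.2,
):
    pass  # each yield is assumed to be the accumulated output so far
print(text)
```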
- Automatic GPU memory optimization
- Model loading with efficient tensor operations
- Batch processing for multiple requests
- Asynchronous text generation
- Threaded model inference
- Optimized frame extraction algorithms
- GPU Memory Error: Reduce max_new_tokens or use CPU inference
- Model Loading Failed: Ensure sufficient disk space and internet connection
- Video Processing Slow: Consider reducing video resolution or length
- Out of Memory: Lower batch size or use smaller models
- Use GPU acceleration when available
- Optimize generation parameters for your use case
- Consider model size vs. accuracy trade-offs
- Monitor system resources during processing
We welcome contributions to improve Core OCR:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
git clone https://github.com/PRITHIVSAKTHIUR/Core-OCR.git
cd Core-OCR
pip install -e .
GitHub: https://github.com/PRITHIVSAKTHIUR/Core-OCR.git
This project is open source. Please refer to the LICENSE file for specific terms and conditions.
If you use Core OCR in your research or projects, please cite:
@software{core_ocr_2024,
title={Core OCR: Advanced Document-Level OCR with Vision-Language Models},
author={PRITHIVSAKTHIUR},
year={2024},
url={https://github.com/PRITHIVSAKTHIUR/Core-OCR}
}
- Hugging Face for the Transformers library and model hosting
- Qwen team for the base vision-language models
- Gradio for the web interface framework
- OpenCV community for video processing capabilities
- The broader OCR and computer vision research community
For questions, issues, or feature requests:
- Open an issue on GitHub
- Check existing documentation
- Review model-specific guides on Hugging Face