Run generative AI models on Sophgo BM1684X/BM1688.
Dedicated Colab notebooks for experimenting with OCR models (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B, and more) on a free-tier T4 GPU.
A batched implementation for efficient Qwen2.5-VL inference.
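As a rough illustration of what batching buys here, the sketch below runs one prompt over several images in a single forward pass using the Hugging Face transformers implementation of Qwen2.5-VL. The `batched_ocr` helper and its prompt are hypothetical, not this repository's actual code; only the `Qwen/Qwen2.5-VL-7B-Instruct` checkpoint and the transformers APIs are real.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
# Left padding so every sequence in the batch ends right at the
# generation boundary.
processor = AutoProcessor.from_pretrained(MODEL_ID, padding_side="left")

def batched_ocr(images, prompt="Read all the text in the image."):
    """Hypothetical helper: one prompt over a list of PIL images, one pass."""
    conversations = [
        [{"role": "user",
          "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
        for _ in images
    ]
    texts = [processor.apply_chat_template(c, tokenize=False,
                                           add_generation_prompt=True)
             for c in conversations]
    inputs = processor(text=texts, images=images, padding=True,
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    trimmed = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
    return processor.batch_decode(trimmed, skip_special_tokens=True)
```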
A vision-language model tailored for tasks involving messy optical character recognition (OCR), image-to-text conversion, and math problem solving with LaTeX formatting.
A comprehensive multimodal OCR application that supports both image and video document processing using state-of-the-art vision-language models. This application provides an intuitive Gradio interface for extracting text, converting documents to markdown, and performing advanced document analysis.
A specialized optical character recognition (OCR) application built on advanced vision-language models, designed for document-level OCR, long-context understanding, and mathematical LaTeX formatting. It supports both image and video processing with multiple state-of-the-art models.
A comprehensive Gradio-based interface for running multiple state-of-the-art Vision-Language Models (VLMs) for Optical Character Recognition (OCR) and Visual Question Answering (VQA) tasks.
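In outline, a multi-model front end like this reduces to a dropdown that dispatches to per-model inference functions. The sketch below is a minimal Gradio skeleton; `run_model` is a hypothetical stand-in for real VLM dispatch (e.g. the batched helper sketched above), and the model names in the dropdown are illustrative.

```python
import gradio as gr

def run_model(model_name, image, prompt):
    # Hypothetical stand-in: a real app would route to the chosen VLM here.
    return f"[{model_name}] would answer {prompt!r} for the uploaded image."

demo = gr.Interface(
    fn=run_model,
    inputs=[
        gr.Dropdown(["Qwen2.5-VL-7B-Instruct", "Nanonets-OCR"], label="Model"),
        gr.Image(type="pil", label="Document image"),
        gr.Textbox(label="Prompt", value="OCR this page to Markdown."),
    ],
    outputs=gr.Textbox(label="Output"),
    title="Multi-VLM OCR / VQA demo",
)

if __name__ == "__main__":
    demo.launch()
```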
The Qwen2.5-VL-7B-Instruct model is a multimodal AI model developed by Alibaba Cloud that excels at understanding both text and images. It is a Vision-Language Model (VLM) designed to handle a range of visual understanding tasks, including image understanding and video analysis, with multilingual support.
A thinking/reasoning multimodal vision-language model (VLM) trained to enhance spatial reasoning.
An experimental document-focused Vision-Language Model application that provides advanced document analysis, text extraction, and multimodal understanding capabilities. This application features a streamlined Gradio interface for processing both images and videos using state-of-the-art vision-language models specialized in document understanding.
The drex-062225-exp (Document Retrieval and Extraction eXpert) model is a specialized fine-tune of docscopeocr-7b-050425-exp, optimized for document retrieval, content extraction, and analysis recognition. It is built on top of the Qwen2.5-VL architecture.
Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!
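The retrieval half of such a pipeline is small enough to sketch. Below is a minimal version using sentence-transformers for semantic search; the embedding model, chunk list, and prompt template are illustrative, and the retrieved context would then be fed to the fine-tuned generator (LLaMA or Qwen2.5VL-3B) much as in the inference sketch above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; the project's actual choice may differ.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["chunked passage 1 ...", "chunked passage 2 ..."]  # your knowledge base
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query, k=3):
    q = embedder.encode([query], normalize_embeddings=True)
    scores = (doc_vecs @ q.T).ravel()   # cosine similarity on unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def build_prompt(query):
    context = "\n\n".join(retrieve(query))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```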
Doc-VLMs-v2-Localization is a demo app for the Camel-Doc-OCR-062825 model, fine-tuned from Qwen2.5-VL-7B-Instruct for advanced document retrieval, extraction, and analysis. It enhances document understanding and also integrates other notable Hugging Face models.
Understands physical common sense and generates appropriate embodied decisions. Optimized for document-level optical character recognition and long-context vision-language understanding. Built with a hand-curated dataset for text-to-image models, providing significantly more detailed descriptions or captions of given images.
A generic API for practical large models.
Optimized for document-level optical character recognition and long-context vision-language understanding.
Brings together mainstream and cutting-edge zero-shot, training-free stock price prediction algorithms, including ARIMA, large time-series models, and large vision models, and compares them on real market data to find the best forecasting model for China's A-share stocks.
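Of the methods listed, ARIMA is the simplest to demonstrate. The sketch below fits an ARIMA(1, 1, 1) to a synthetic price series with statsmodels and forecasts a few steps ahead; the order, series, and horizon are illustrative, not the repository's actual configuration.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic closing prices standing in for a real A-share series.
rng = np.random.default_rng(0)
closes = 100 + np.cumsum(rng.normal(0, 1, 250))

fit = ARIMA(closes, order=(1, 1, 1)).fit()  # fit on the full history
print(fit.forecast(steps=5))                # predict the next 5 trading days
```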
MedScan AI is a modular multimodal medical assistant designed to streamline clinical data processing.
Code and dataset for evaluating Multimodal LLMs on indexical, iconic, and symbolic gestures (Nishida et al., ACL 2025)