#Vision-Language Model

34 posts

[Paper Review] PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

[Paper Review] GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

[Paper Review] MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

[Paper Review] Mario: Multimodal Graph Reasoning with Large Language Models

[Paper Review] Code2World: A GUI World Model via Renderable Code Generation

[Paper Review] PaperBanana: Automating Academic Illustration for AI Scientists

[Paper Review] Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

[Paper Review] Typhoon OCR: Open Vision-Language Model For Thai Document Extraction

[Paper Review] LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

[Paper Review] Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

[Paper Review] VINO: A Unified Visual Generator with Interleaved OmniModal Context

[Paper Review] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

[Paper Review] Jina-VLM: Small Multilingual Vision Language Model

[Paper Review] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

[Paper Review] Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

[Paper Review] MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

[Paper Review] CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition

[Paper Review] SAIL-VL2 Technical Report

[Paper Review] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

[Paper Review] MedVista3D: Vision-Language Modeling for Reducing Diagnostic Errors in 3D CT Disease Detection, Understanding and Reporting

[Paper Review] Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

[Paper Review] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

[Paper Review] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
