#Vision-Language Models

184 posts

[Paper Review] Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

[Paper Review] GEM: Generative Supervision Helps Embodied Intelligence

[Paper Review] Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

[Paper Review] DocAtlas: Multilingual Document Understanding Across 80+ Languages

[Paper Review] Unlocking Dense Metric Depth Estimation in VLMs

[Paper Review] Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

[Paper Review] 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

[Paper Review] AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

[Paper Review] LLaTiSA: Towards Difficulty-Stratified Time Series Reasoning from Visual Perception to Semantics

[Paper Review] RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

[Paper Review] PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

[Paper Review] MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

[Paper Review] Watch Before You Answer: Learning from Visually Grounded Post-Training

[Paper Review] Vero: An Open RL Recipe for General Visual Reasoning

[Paper Review] CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

[Paper Review] VOID: Video Object and Interaction Deletion

[Paper Review] Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

[Paper Review] PEARL: Personalized Streaming Video Understanding Model

[Paper Review] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

[Paper Review] EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

[Paper Review] MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

[Paper Review] FVG-PT: Adaptive Foreground View-Guided Prompt Tuning for Vision-Language Models

[Paper Review] Large Multimodal Models as General In-Context Classifiers

[Paper Review] UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

[Paper Review] Beyond Language Modeling: An Exploration of Multimodal Pretraining

[Paper Review] Half-Truths Break Similarity-Based Retrieval

[Paper Review] Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

[Paper Review] From Perception to Action: An Interactive Benchmark for Vision Reasoning

[Paper Review] TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

[Paper Review] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

[Paper Review] Selective Training for Large Vision Language Models via Visual Information Gain

[Paper Review] DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

[Paper Review] GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics

[Paper Review] P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

[Paper Review] EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

[Paper Review] PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

[Paper Review] OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

[Paper Review] STEP3-VL-10B Technical Report

[Paper Review] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

[Paper Review] Action100M: A Large-scale Video Action Dataset

[Paper Review] OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

[Paper Review] ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands

[Paper Review] What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

[Paper Review] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

[Paper Review] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

[Paper Review] RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing

[Paper Review] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

[Paper Review] Pay Less Attention to Function Words for Free Robustness of Vision-Language Models

[Paper Review] InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

[Paper Review] BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

[Paper Review] Relational Visual Similarity

[Paper Review] Decouple to Generalize: Context-First Self-Evolving Learning for Data-Scarce Vision-Language Reasoning

[Paper Review] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

[Paper Review] AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

[Paper Review] TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

[Paper Review] Structured Extraction from Business Process Diagrams Using Vision-Language Models

[Paper Review] Seeing the Wind from a Falling Leaf

[Paper Review] World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

[Paper Review] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

[Paper Review] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

[Paper Review] MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

[Paper Review] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

[Paper Review] VisPlay: Self-Evolving Vision-Language Models from Images

[Paper Review] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

[Paper Review] WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation

[Paper Review] $\left|\,\circlearrowright\,\text{BUS}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

[Paper Review] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

[Paper Review] Revisiting Multimodal Positional Encoding in Vision-Language Models

[Paper Review] RefAM: Attention Magnets for Zero-Shot Referral Segmentation

[Paper Review] ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models

[Paper Review] MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

[Paper Review] ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

[Paper Review] EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

[Paper Review] 3D Aware Region Prompted Vision Language Model

[Paper Review] Lost in Embeddings: Information Loss in Vision-Language Models

[Paper Review] Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language Models

[Paper Review] Visual Representation Alignment for Multimodal Large Language Models

[Paper Review] MobiAgent: A Systematic Framework for Customizable Mobile Agents

[Paper Review] Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

[Paper Review] OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning

[Paper Review] MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for N-level Assessment

[Paper Review] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

[Paper Review] Explain Before You Answer: A Survey on Compositional Visual Reasoning

[Paper Review] Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping

[Paper Review] OpenCUA: Open Foundations for Computer-Use Agents

[Paper Review] SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

[Paper Review] Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

[Paper Review] DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

[Paper Review] Multimodal Referring Segmentation: A Survey

[Paper Review] AgroBench: Vision-Language Model Benchmark in Agriculture

[Paper Review] CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning

[Paper Review] VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

[Paper Review] IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

[Paper Review] From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

[Paper Review] StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation

[Paper Review] Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

[Paper Review] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

[Paper Review] Unified Reinforcement and Imitation Learning for Vision-Language Models

[Paper Review] Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

[Paper Review] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

[Paper Review] VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

[Paper Review] Code2Video: A Code-centric Paradigm for Educational Video Generation

[Paper Review] More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models