#Visual Grounding

33 posts

[Paper Review] Advancing Creative Physical Intelligence in Large Multimodal Models

[Paper Review] GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

[Paper Review] Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

[Paper Review] CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

[Paper Review] CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents

[Paper Review] From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

[Paper Review] GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

[Paper Review] Selective Training for Large Vision Language Models via Visual Information Gain

[Paper Review] MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

[Paper Review] GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

[Paper Review] A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

[Paper Review] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

[Paper Review] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

[Paper Review] Revisiting Multimodal Positional Encoding in Vision-Language Models

[Paper Review] Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

[Paper Review] IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding

[Paper Review] TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

[Paper Review] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
