#Large Vision-Language Models

10개의 포스트

[논문리뷰] Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

본 논문은 현대의 LVLM이 일상적인 비디오 이해와 조작 과제를 해결하기 위한 세밀한 시공간적 추론 능력이 부족하다는 문제에서 시작한다.

#Review #Large Vision-Language Models #Video Understanding #Spatio-Temporal Reasoning #Furniture Assembly #Object Tracking #Contact Reasoning

2026년 5월 31일

[논문리뷰] VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

본 논문은 기존의 LLM 기반 비디오 이해 모델들이 겪는 공간적·시간적 참조의 모호성 문제를 해결하기 위해 VideoSeeker를 제안한다.

#Review #Large Vision-Language Models #Instance-level Video Understanding #Visual Prompts #Agentic Tool Invocation #Reinforcement Learning #Data Synthesis Pipeline

2026년 5월 18일

[논문리뷰] MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

본 연구는 LVLM과 Memory-Augmented Agents 간의 기억 능력을 체계적으로 비교할 수 있는 표준화된 벤치마크의 부재를 해결합니다. 기존의 장기 문맥 벤치마크는 주로 텍스트 기반이거나 시각적 정보의 필요성이 낮아 진정한 다중 모달 추론 능력을 검증하지 못한다는 한계가 있습니다.

#Review #Multimodal Memory #Large Vision-Language Models #Long-Context #Benchmark #Retrieval-Augmented Generation #Multi-Session Reasoning

2026년 5월 14일

[논문리뷰] Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

본 논문은 Autoregressive LVLM이 긴 문맥 생성 시 겪는 Visual Signal Dilution 문제를 해결하고자 한다.

#Review #Large Vision-Language Models #Visual Signal Dilution #Persistent Visual Memory #Autoregressive Generation #Multimodal Reasoning #Bottleneck Adapter

2026년 5월 4일

[논문리뷰] AutoMIA: Improved Baselines for Membership Inference Attack via Agentic Self-Exploration

본 논문은 기존 MIA가 의존하는 정적인 핸드크래프트 휴리스틱의 낮은 적응성과 확장성 문제를 해결하기 위해 에이전트 기반의 자동화된 공격 프레임워크를 제안합니다.

#Review #Membership Inference Attack #Agentic Framework #Strategy Self-Exploration #Large Vision-Language Models #Privacy Auditing

2026년 4월 2일

[논문리뷰] VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning

이 논문은 기존 Tool-integrated Visual Reasoning (TiVR) 패러다임이 부정확하거나 오류 있는 도구 출력에 취약하여 환각적인 추론으로 이어지는 문제를 해결하고자 합니다.

#Review #Tool-integrated Visual Reasoning #Referring Grounded Reasoning #Agentic Reinforcement Learning #Self-Correction #Large Vision-Language Models #Chain-of-Thought #Tool Refinement

2025년 12월 8일

[논문리뷰] SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization

기존 LVLM(Large Vision-Language Models) 기반의 VLN(Vision-Language Navigation) 에이전트가 겪는 지각, 추론, 계획 오류로 인한 낮은 내비게이션 성능 문제를 해결하고자 합니다.

#Review #Vision-Language Navigation #Large Vision-Language Models #Visual Prompt #Reinforcement Fine-Tuning #Policy Optimization #Embodied AI #Spatial Reasoning #Perception Errors

2025년 12월 4일

[논문리뷰] Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings

대규모 비전-언어 모델(LVLM)이 시각적 정보를 불충분하게 활용하고 텍스트 우선(textual priors)에 과도하게 의존하여 발생하는 환각(hallucinations) 문제를 해결하는 것을 목표로 합니다. 이를 통해 모델의 시각적 grounding을 강화하고 더 균형 잡힌 멀티모달 추론을 촉진하고자 합니다.

#Review #Hallucination Mitigation #Large Vision-Language Models #Textual Embeddings #Multimodal Reasoning #Attention Mechanism #Visual Grounding #Modality Imbalance

2025년 11월 9일

[논문리뷰] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

본 논문은 에이전트 시대의 추론 및 행동을 위한 시각 중심 코딩의 미개척 영역을 탐구합니다. 기존 RGB 픽셀 기반 이미지 표현의 제한된 상징적 추상화를 넘어서, 이미지를 SVG 코드 와 같은 압축적이고 해석 가능하며 실행 가능한 시각적 표현으로 변환하는 것을 목표로 합니다.

#Review #Multimodal AI #Code Generation #SVG #Visual Representation #Benchmark #Large Vision-Language Models #Agentic AI #Reasoning

2025년 11월 9일

[논문리뷰] CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

GUI(Graphical User Interface) 기반 자율 에이전트의 핵심 난제인 장기 계획(long-horizon planning) 능력과 정밀한 미세 실행(fine-grained execution) 능력 사이의 고질적인 트레이드오프를 해결하는 것을 목표로 합니다.

#Review #GUI Agents #Reinforcement Learning #Planner-Executor Architecture #Decoupled Training #Large Vision-Language Models #Specialization #Generalization #Computer Use Agent

2025년 8월 28일