Review

[Paper Review] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

This paper addresses reasoning over fine-grained evidence in text-rich videos, in particular the hallucinations and errors of existing single-pass video QA models.

#Review #Video Reasoning #Large Multimodal Models #Reinforcement Learning #Visual Rumination #Text-Rich Video #Video Question Answering #Iterative Perception

November 23, 2025

[Paper Review] VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

This paper proposes VLA-4D, a spatiotemporally coherent VLA model for robotic manipulation that addresses the spatiotemporal discontinuity and lack of fine-grained control in existing VLA models.

#Review #Vision-Language-Action Models #Robotic Manipulation #SpatioTemporal Coherence #4D Awareness #Visual Representation #Action Representation #Cross-Attention

November 23, 2025

[Paper Review] Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story

This paper aims to identify the text-based determinants of Intrinsic Dimension (ID), an important tool in the analysis of modern LLMs.

#Review #Intrinsic Dimension #LLMs #Text Complexity #Sparse Autoencoders #Text Semantics #Genre Analysis #Embedding Space #Text Generation

November 23, 2025

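[Paper Review] Taming Generative Synthetic Data for X-ray Prohibited Item Detection

The main goal is to address the shortage of data for training prohibited-item detection models on X-ray security images, as well as the labor-intensive preprocessing steps (e.g., foreground extraction) of existing synthetic data generation methods. The paper proposes an efficient one-step pipeline that synthesizes high-quality X-ray security images without additional manual work.

#Review #X-ray Security #Synthetic Data Generation #Diffusion Models #Object Detection #Cross-Attention #Image Inpainting #Data Augmentation

November 23, 2025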

[Paper Review] SAM 3: Segment Anything with Concepts

This paper aims to go beyond a limitation of the existing SAM (Segment Anything Model), namely single-object segmentation (PVS), and to detect, segment, and track all object instances in images and videos based on a concept.

#Review #Segment Anything Model #Open-Vocabulary Segmentation #Multimodal Foundation Model #Instance Segmentation #Video Object Tracking #Prompt Engineering #Data Engine #Human-in-the-loop

November 23, 2025

[Paper Review] RynnVLA-002: A Unified Vision-Language-Action and World Model

This paper aims to unify a VLA model and a world model in a single framework, in order to overcome the limitations of existing VLA models (insufficient understanding of action dynamics, lack of imagination and physical knowledge) and of world models (inability to generate actions directly).

#Review #Vision-Language-Action (VLA) Model #World Model #Robotics #Unified Framework #Multi-modal Learning #Action Generation #Attention Mask #Continuous Control

November 23, 2025

[Paper Review] Rethinking Saliency Maps: A Cognitive Human Aligned Taxonomy and Evaluation Framework for Explanations

This study aims to address the problem that saliency maps, a visual explanation technique for deep learning models, lack a clear purpose and alignment with user queries, which hinders their evaluation and practical utility.

#Review #Saliency Maps #Explainable AI (XAI) #Taxonomy #Evaluation Framework #Faithfulness Metrics #Contrastive Explanations #Granularity

November 23, 2025

[Paper Review] Planning with Sketch-Guided Verification for Physics-Aware Video Generation

This paper aims to address the difficulty video generation models have in following complex motion instructions and producing physically realistic, temporally consistent sequences.

#Review #Video Generation #Motion Planning #Physics-Aware AI #Multimodal Verification #Diffusion Models #Test-Time Optimization #Sketch-Guided

November 23, 2025

[Paper Review] Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs

This study presents a robustness-focused framework for measuring sycophancy in large language models (LLMs): the phenomenon in which a model, under social pressure such as authority or persuasion, distorts the truth and loses accuracy.

#Review #LLM Sycophancy #Model Robustness #AI Alignment #Benchmark #Confidence Calibration #Behavioral Taxonomy #Social Influence #Epistemic Collapse

November 23, 2025

[Paper Review] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

The goal is to overcome the limits on scalability research in multimodal reasoning that stem from the absence of transparent, reproducible data curation and training strategies.

#Review #Multimodal Reasoning #Large Multimodal Models #Supervised Fine-tuning #Reinforcement Learning #Data Curation #Open-source #Multimodal Benchmarks

November 23, 2025

[Paper Review] OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists

This paper addresses the limitation that existing AI Scientist systems treat scientific discovery purely as an isolated search/optimization problem and overlook the social, collaborative nature of scientific research.

#Review #AI Scientist #Large Language Models (LLMs) #Human-AI Collaboration #Scientific Ecosystem #Research Automation #Omni Scientific Protocol (OSP) #ScienceArena #Knowledge Graph

November 23, 2025

[Paper Review] O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

The goal is to overcome the limitations existing LLM-based agents face in long-term interaction, contextual consistency, and dynamic personalization.

#Review #Memory System #LLM Agents #Personalization #User Profiling #Hierarchical Retrieval #Long-Term Interaction #Self-Evolving Agents #Contextual Consistency

November 23, 2025

[Paper Review] Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models

This paper aims to systematically expose safety vulnerabilities in Vision-Language Models (VLMs) equipped with a range of defense mechanisms, including RLHF (Reinforcement Learning from Human Feedback), system prompts, and input/output content filters.

#Review #Vision-Language Models (VLMs) #Adversarial Attack #Jailbreaking #Reward Hacking #Content Moderation Bypass #Cross-Model Transferability #Safety Vulnerabilities

November 23, 2025

[Paper Review] MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

This paper addresses two challenges in genomic sequence modeling: varying information density and the absence of an inherent vocabulary unit.

#Review #Genome Modeling #Dynamic Tokenization #Token Merging #Context-aware Learning #DNA Foundation Models #Transformer Architecture #Multi-omics

November 23, 2025

[Paper Review] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

This paper addresses limitations of existing Vision-Language-Action (VLA) models: sparse action supervision signals, the excessive cost of predicting visual states, information bottlenecks, and degraded comprehension and reasoning caused by insufficient language supervision.

#Review #Vision-Language-Action (VLA) Models #Visual Foresight #Diffusion Transformer (DiT) #Robotics #Multimodal Learning #Adaptive Temporal Ensemble #Latent Actions

November 23, 2025

[Paper Review] Loomis Painter: Reconstructing the Painting Process

This paper aims to generate a realistic, consistent step-by-step painting process for any input image, addressing the temporal discontinuity, structural inconsistency, and poor generalization across artistic media seen in existing generative models.

#Review #Painting Process Generation #Video Diffusion Models #Media Transfer #Reverse Painting #Dataset Curation #Perceptual Distance Profile #Artistic Workflow #Generative AI

November 23, 2025

[Paper Review] Insights from the ICLR Peer Review and Rebuttal Process

This paper aims to understand the nature and dynamics of the peer review and rebuttal process at the ICLR 2024 and 2025 conferences, and to contribute to improving the process's efficiency and effectiveness as well as the quality of published papers.

#Review #Peer Review #Rebuttal Process #ICLR #Score Dynamics #LLM Analysis #Reviewer Engagement #Academic Publishing #OpenReview

November 23, 2025

[Paper Review] GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization

This study addresses the limitation that existing agentic visual reasoning models focus mainly on image manipulation tools and are therefore hard to extend to general-purpose use.

#Review #Geolocalization #Agentic Models #Visual Reasoning #Web-Augmented #Multimodal LLMs #Reinforcement Learning #Tool Use #GeoBench

November 23, 2025

[Paper Review] Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models

This study aims to systematically analyze the loss of intelligence that occurs when large multimodal models (MLLMs) are scaled down, and in particular to determine which capabilities are affected most and why.

#Review #Small Multimodal Models #LLM Downscaling #Perception Bottleneck #Reasoning Bottleneck #Visual Extraction Tuning #Chain-of-Thought Reasoning #Multimodal Learning

November 23, 2025

[Paper Review] Diversity Has Always Been There in Your Visual Autoregressive Models

The goal is to resolve the diversity collapse problem in Visual Autoregressive (VAR) models and to bring out the model's inherent generative diversity without additional training, while effectively preserving image quality and text-image alignment.

#Review #Visual Autoregressive Models #Diversity Collapse #Generative Diversity #Soft-Suppression Regularization #Soft-Amplification Regularization #Training-Free #Image Generation #Singular Value Decomposition

November 23, 2025