Review

[논문리뷰] Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

본 연구는 VLM이 다단계 시각적 상호작용 및 효과적인 도구 통합 추론에서 겪는 한계를 해결하고자 합니다. 특히, 도구 선택, 호출 및 조율 능력이 부족한 기존 VLM의 문제를 극복하고, 확장 가능한 훈련 환경과 에이전트 학습 전략을 통해 VLM의 도구 통합 시각적 추론 능력 을 체계적으로 향상시키는 것을 목표로 합니다.

#Review #Vision-Language Models (VLMs)#Reinforcement Learning (RL)#Tool-Integrated Reasoning (TIR)#Agentic AI #VQA #Training Environment #Behavioral Cloning #Policy Optimization

2025년 11월 25일

[논문리뷰] SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

대규모 언어 모델(LLM)에서 quadratic 연산 복잡성 을 갖는 full attention 의 한계를 극복하기 위해, sparse attention 의 성능 저하 및 부족한 sparsity 문제를 해결하고자 합니다.

#Review #Sparse Attention #Full Attention #Large Language Models (LLMs)#Context Length #Attention Sparsity #Alignment Loss #Long-Context Extrapolation

2025년 11월 25일

[논문리뷰] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

본 연구는 기존 비디오 리테이크 생성 방법론이 가변 길이 입력, 동적 카메라 모션, 분포 외 카메라 궤적에 취약하며, 종종 워핑 아티팩트나 흐릿한 객체를 생성하는 한계를 해결하고자 합니다.

#Review #Video Retake Generation #Camera Control #Rotary Position Embedding (RoPE)#Rotary Camera Encoding (RoCE)#Geometric Consistency #Video Generative Models #Transformer Architecture #Multi-view Synthesis

2025년 11월 25일

[논문리뷰] PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

기존 비디오 생성 모델들이 시각적 품질은 뛰어나지만, 명시적인 물리적 제어 가능성과 현실성이 부족하다는 문제를 해결하는 것을 목표로 합니다. 단일 이미지로부터 객체의 물리적 특성을 추론하고, 이를 기반으로 물리적으로 정확하며 역동적인 비디오를 생성하는 새로운 프레임워크를 제안합니다.

#Review #Video Generation #Physics Simulation #Controllable AI #Part-Aware #Semantic Grounding #Material Properties #Image-to-Video #Diffusion Models

2025년 11월 25일

[논문리뷰] OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

본 연구는 RGBA(Red, Green, Blue, Alpha) 이미지 조작을 위한 기존의 파편화된 단일 태스크 전문 모델과, 알파 채널 처리 능력이 없는 통합 RGB 멀티태스크 프레임워크 간의 격차를 해소하는 것을 목표로 합니다.

#Review #RGBA Generation #Multi-Task Learning #Diffusion Transformers #Image Matting #Layer Decomposition #Object Removal #Alpha-aware VAE #MSROPE-BiL

2025년 11월 25일

[논문리뷰] MedSAM3: Delving into Segment Anything with Medical Concepts

의료 영상 분할 분야에서 기존 모델들의 일반화 부족과 광범위한 수동 주석 요구 사항을 해결하고, 순전히 기하학적 프롬프트에 의존하는 한계를 극복하는 것을 목표로 합니다.

#Review #Medical Image Segmentation #Segment Anything Model (SAM)#Promptable Concept Segmentation (PCS)#Multimodal Large Language Models (MLLMs)#Agentic AI #Domain Adaptation #Text-guided Segmentation

2025년 11월 25일

[논문리뷰] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

기존 3D 도시 생성 방법론의 한계인 텍스트 기반 생성의 창의적 유연성과 객체 수준 편집 가능성 및 구조적 일관성 부족 문제를 해결하는 것을 목표로 합니다.

#Review #3D City Generation #Natural Language Processing #Aesthetic Adaptation #Controllable Assets #Layout Generation #Interactive Editing #Diffusion Models #Multimodal Dataset

2025년 11월 25일

[논문리뷰] HunyuanOCR Technical Report

기존 파이프라인 기반 OCR 시스템의 에러 전파 및 높은 유지보수 비용 문제를 해결하고, 대규모 일반 VLM의 높은 컴퓨팅 자원 요구사항 과 OCR 특화 VLM의 불완전한 엔드투엔드 최적화 한계를 극복하는 것을 목표로 합니다.

#Review #Optical Character Recognition #Multimodal Large Language Model #End-to-End Learning #Reinforcement Learning #Document Parsing #Information Extraction #Text Spotting

2025년 11월 25일

[논문리뷰] GigaWorld-0: World Models as Data Engine to Empower Embodied AI

본 논문은 GigaWorld-0 라는 통합 월드 모델 프레임워크를 개발하여 Embodied AI 를 위한 확장 가능하고 데이터 효율적인 데이터 엔진 으로 활용하는 것을 목표로 합니다.

#Review #World Models #Embodied AI #Data Generation #Video Generation #3D Scene Reconstruction #Robotics #Vision-Language-Action

2025년 11월 25일

[논문리뷰] GigaEvo: An Open Source Optimization Framework Powered By LLMs And Evolution Algorithms

이 논문은 LLM(대규모 언어 모델) 기반 진화 컴퓨테이션 을 위한 확장 가능한 오픈소스 프레임워크인 GigaEvo 를 소개하는 것을 목표로 합니다.

#Review #LLM-driven Evolutionary Computation #Quality-Diversity #MAP-Elites #Program Synthesis #Open-source Framework #Algorithmic Discovery #Genetic Algorithms

2025년 11월 25일

[논문리뷰] Fara-7B: An Efficient Agentic Model for Computer Use

본 논문은 컴퓨터 사용 에이전트(CUA) 훈련을 위한 고품질 상호작용 데이터의 부족 문제 를 해결하고, 적은 연산 자원으로 온디바이스에서 실행 가능한 효율적인 에이전트 모델 을 개발하는 것을 목표로 합니다. 이를 통해 CUA 기술의 상업적 활용 가능성을 확장하고 범용 개인 디지털 비서의 길을 열고자 합니다.

#Review #Computer Use Agents #Synthetic Data Generation #Multi-modal LLM #On-device AI #Web Automation #Pixel-in Action-out #Fara-7B #WebTailBench

2025년 11월 25일

[논문리뷰] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

본 논문은 통합 멀티모달 모델(UMMs)에서 '이해' 능력이 '생성' 과정에 실제로 정보를 제공하고 안내하는지 여부를 조사합니다.

#Review #Unified Multimodal Models #Understanding-Generation Gap #Reasoning #Knowledge Transfer #Chain-of-Thought #Self-Training #Synthetic Data #Evaluation Framework

2025년 11월 25일

[논문리뷰] DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

이 논문은 AI 생성 콘텐츠(AIGC) 탐지에서 전체 이미지 분류에 집중하는 기존 방식의 한계를 극복하고, 확산 모델 기반의 로컬 편집 에 대한 동시적인 편집 영역 위치 파악(localization) 및 모델 귀속(attribution) 을 목표로 합니다.

#Review #AIGC Detection #Diffusion Models #Image Editing #Semantic Segmentation #Localization #Model Attribution #Benchmark #Multi-turn Editing

2025년 11월 25일

[논문리뷰] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

본 논문은 기존 비전-언어 에이전트가 인간 주석 기반 지도 학습의 한계와 복잡한 시각적 추론 단계 검증의 어려움, 그리고 평가 환각 문제로 인해 연속적인 자가 발전이 어렵다는 문제를 해결하고자 합니다.

#Review #Self-Evolving Agent #Vision-Language Models #Tool-Integrated Reasoning #Reinforcement Learning #Self-Correction #Multimodal AI #Generative AI

2025년 11월 25일

[논문리뷰] UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

본 논문은 기존 Diffusion Transformer(DiT) 모델을 다양한 종횡비(AR)의 4K 해상도 로 확장할 때 발생하는 한계를 극복하는 것을 목표로 합니다.

#Review #Text-to-Image Generation #Diffusion Transformers #4K Resolution #Aspect Ratio Extrapolation #Data-Model Co-Design #VAE Post-training #Positional Encoding #Diffusion Models

2025년 11월 24일

[논문리뷰] Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets?

본 논문은 최신 세계 모델(World Models, WMs)이 텍스트로 지정된 암묵적인 의미론적 목표를 가진 길 없는 경로 계획(mapless path planning) 작업을 실제 환경에서 얼마나 잘 수행하는지 정량적으로 평가하는 것을 목표로 합니다.

#Review #World Models #Mapless Navigation #Semantic Path Planning #Robot Learning #Video Prediction #Benchmark #Trajectory Generation

2025년 11월 24일

[논문리뷰] SyncMV4D: Synchronized Multi-view Joint Diffusion of Appearance and Motion for Hand-Object Interaction Synthesis

본 논문은 단일 뷰(single-view) HOI 비디오 생성의 기하학적 왜곡 및 비현실적인 모션 문제와 3D HOI 방법론의 제한된 일반화 능력 문제를 해결하고자 합니다.

#Review #Hand-Object Interaction #Multi-view Video Generation #4D Motion Synthesis #Diffusion Models #Spatio-temporal Consistency #Geometric Consistency #Appearance and Motion Joint Modeling

2025년 11월 24일

[논문리뷰] Plan-X: Instruct Video Generation via Semantic Planning

기존 비디오 확산 모델(DiT)이 복잡한 사용자 지시 및 장기 계획에서 겪는 높은 수준의 의미론적 추론 및 계획 능력 부족 문제를 해결하는 것이 목표입니다.

#Review #Video Generation #Semantic Planning #Multimodal LLM #Diffusion Transformer #Spatio-temporal Guidance #Visual Hallucination #Prompt Alignment #Instruction Following

2025년 11월 24일

[논문리뷰] Pillar-0: A New Frontier for Radiology Foundation Models

본 논문은 급증하는 영상 판독량과 인력 부족으로 인한 의료 시스템의 부담을 해결하기 위해, 기존 의료 AI 모델의 한계를 극복하는 새로운 방사선과 파운데이션 모델 Pillar-0 을 제안합니다.

#Review #Radiology Foundation Model #Volumetric Imaging #Multi-window Tokenization #Multi-scale Attention #Contrastive Learning #Clinical Evaluation #Data Efficiency #Medical Imaging

2025년 11월 24일

[논문리뷰] PRInTS: Reward Modeling for Long-Horizon Information Seeking

본 논문은 기존 Process Reward Model (PRM) 의 한계, 즉 짧은 추론 단위에 대한 이진 판단과 급증하는 컨텍스트 처리의 어려움을 극복하는 것을 목표로 합니다.

#Review #Reward Modeling #Long-Horizon Tasks #Information Seeking #Large Language Models #Trajectory Summarization #Reinforcement Learning #Tool Use #Process Reward Models

2025년 11월 24일