Review

[Paper Review] HybridStitch: Pixel and Timestep Level Model Stitching for Diffusion Acceleration

Text-to-Image (T2I) diffusion models generate impressive images, but large models with billions of parameters incur severe computational overhead and high latency, which makes them difficult to deploy in latency-sensitive applications.

#Review #Diffusion model #Mixture of models #Acceleration #Text-to-Image #Model stitching #Latency reduction #Pixel-level #Timestep-level

March 15, 2026

[Paper Review] HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

As embodied agents are rapidly introduced into homes, unpredictable safety risks are increasing. Existing safety evaluations are largely limited to static images, text, or general hazards, and fail to adequately benchmark dynamic unsafe action detection in household scenarios.

#Review #Embodied Agents #Unsafe Action Detection #Vision-Language Models (VLMs) #Household Scenarios #HomeSafe-Bench #HD-Guard #Real-time Safety Monitoring

March 15, 2026

[Paper Review] From Sparse to Dense: Multi-View GRPO for Flow Models via Augmented Condition Space

Diffusion and flow models have recently shown groundbreaking capability in visual content generation, but aligning generated outputs with human preferences and task-specific constraints remains an important challenge.

#Review #Reinforcement Learning #GRPO #Diffusion Models #Flow Models #Preference Alignment #Condition Enhancement #Multi-View Learning

March 15, 2026

[Paper Review] ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

Prior Time-Series Anomaly Detection (TSAD) research has mostly compared and optimized only detection quality (mainly accuracy), under unconstrained execution on workstation-class hardware.

#Review #Time-series anomaly detection #Deployment-oriented evaluation #Compute reduction #CPU parallelism #Throughput #Latency #Automotive telemetry #AUC-PR

March 15, 2026

[Paper Review] Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a unique measurement problem.

#Review #AI safety #self-preservation #instrumental convergence #Quantum Boltzmann Machine #entanglement entropy #alignment

March 15, 2026

[Paper Review] CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

The success of Large Language Models (LLMs) has been driven by scaling up internet-scale data, but the saturation of high-quality data now limits further scaling of model intelligence.

March 15, 2026

[Paper Review] Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

In multimodal modeling, recent work that unifies visual comprehension and generation within a single model is regarded as an important step toward human-like multimodal intelligence. This unification, however, faces two fundamental problems.

#Review #Unified multimodal model #Visual generation and comprehension #Unified vision encoder #Cascaded flow matching #Token compression

March 15, 2026

[Paper Review] Can Vision-Language Models Solve the Shell Game?

Vision-Language Models (VLMs) perform well on overall video understanding and reasoning, but they hit a significant bottleneck in low-level perceptual abilities such as tracking entities over time (Visual Entity Tracking).

#Review #Visual Entity Tracking #Shell Game #Vision-Language Models (VLMs) #VET-Bench #Spatiotemporal Grounded Chain-of-Thought (SGCoT) #NC1-complete #Transformer-based VLMs

March 15, 2026

[Paper Review] XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Multimodal agents can now handle complex visual reasoning tasks and a wide range of tools, but they still face two fundamental bottlenecks: inefficient tool use and inflexible orchestration in open-ended environments.

#Review #Multimodal Agents #Continual Learning #Experience Learning #Skill Learning #Tool Use #Knowledge Base #Visual Reasoning

March 12, 2026

[Paper Review] WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

The authors point out that, within instruction-based image editing, text-centric image editing remains underexplored despite its significant application potential.

#Review #Text-centric Image Editing #Diffusion Models #Glyph-Guided Fine-tuning #Reinforcement Learning #Multilingual Benchmark #Dataset Construction

March 12, 2026

[Paper Review] Video-Based Reward Modeling for Computer-Use Agents

Computer-use agents (CUAs) are emerging as a promising paradigm for general computer automation, but evaluating whether an agent trajectory truly fulfills the user's instruction remains a difficult challenge.

#Review #Reward Modeling #Computer-Use Agents #Execution Video #Spatiotemporal Token Pruning #Dataset #Task Success

March 12, 2026

[Paper Review] Understanding by Reconstruction: Reversing the Software Development Process for LLM Pretraining

Large Language Models (LLMs) have achieved remarkable success in code generation, but they still struggle with the deep, long-horizon reasoning required for complex software engineering.

March 12, 2026

[Paper Review] Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Advances in diffusion models and autoregressive models have brought substantial progress in T2I generation and image editing tasks, but RL-based approaches to improving these models are held back by the unreliability of the reward model.

#Review #Reinforcement Learning #Reward Modeling #Image Editing #Image Generation #MLLM #Data Curation #Fidelity #Instruction Following

March 12, 2026

[Paper Review] TeamHOI: Learning a Unified Policy for Cooperative Human-Object Interactions with Any Team Size

Physics-based humanoid control has made significant progress in enabling realistic, high-performance single-agent behavior, but extending it to cooperative Human-Object Interaction (HOI) remains a difficult challenge.

#Review #Human-Object Interaction (HOI) #Reinforcement Learning (RL) #Transformer-based Policy #Adversarial Motion Prior (AMP) #Decentralized Policy #Multi-agent Systems #Scalable Coordination

March 12, 2026

[Paper Review] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Multimodal agents offer a promising direction for automating complex document-based workflows, but a fundamental question has remained open: do these agents exhibit genuine strategic reasoning, or do they merely rely on stochastic trial-and-error search?

#Review #Multimodal Agents #Document QA #Agentic Reasoning #RAG #Benchmark #PDFs #Effort Calibration

March 12, 2026

[Paper Review] Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Humans perceive and understand real-world space through streams of visual observations, so the ability to maintain and update spatial evidence in a streaming fashion from potentially unbounded video streams is essential to spatial intelligence.

#Review #Spatial Intelligence #Test-Time Training #MLLM #Streaming Video #Hybrid Architecture #Spatiotemporal Convolution

March 12, 2026

[Paper Review] ShotVerse: Advancing Cinematic Camera Control for Text-Driven Multi-Shot Video Creation

Text-driven video generation models have democratized filmmaking, but camera control in cinematic multi-shot scenarios remains a major bottleneck.

March 12, 2026

[Paper Review] One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Existing Diffusion Transformers (DiTs) achieve high generation quality, but their compute cost is fixed by the input image resolution, which makes the latency-quality trade-off rigid.

March 12, 2026

[Paper Review] OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Modern visual agents need general, causal, and physically structured representations to operate in real-time streaming environments such as robots and AR devices.

#Review #streaming visual backbone #causal spatiotemporal attention #3D-ROPE #multi-task learning #real-time inference #embodied agents #vision-language alignment

March 12, 2026

[Paper Review] Mobile-GS: Real-time Gaussian Splatting for Mobile Devices

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for high-quality novel view synthesis, but its high computational demands and large storage costs make it very difficult to deploy on mobile devices for real-time rendering.

#Review #Gaussian Splatting #Mobile Rendering #Order-Independent Transparency #Neural Quantization #Real-time Rendering #View-dependent Enhancement #Spherical Harmonics Distillation #Resource-constrained Devices

March 12, 2026