Latest Posts

[SGLang] Single Batch Overlap for MoE Models

Improves inference performance on Hopper GPUs by overlapping compute and communication in MoE models.

#SGLang #MoE #GPU Optimization #Inference

December 3, 2025

[Paper Review] YingVideo-MV: Music-Driven Multi-Stage Video Generation

This paper aims to overcome the limitations of existing audio-driven avatar video generation models, which have rarely addressed music performance video generation or camera motion control.

#Review #Music-Driven Video Generation #Diffusion Models #Multi-Stage Framework #Camera Control #Lip-Sync #Temporal Coherence #Video Diffusion Transformer

December 2, 2025

[Paper Review] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

This paper addresses the limited context capacity and loss of visual detail that existing video LLMs face when processing long videos (hours to days).

#Review #Long Video Reasoning #Multimodal Memory #Adaptive Retrieval #Video Large Language Models #Knowledge Graph #Multiscale Temporal Reasoning #Episodic Memory #Semantic Memory

December 2, 2025

[Paper Review] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

This paper explores whether a video generation model can exhibit visuospatial intelligence resembling human cognition using only visual data (video context).

#Review #Video Generation #Spatial Reasoning #Visuospatial Intelligence #Diffusion Models #Context-Guided Generation #Scene Navigation #Object Grounding #Out-of-Domain Generalization

December 2, 2025

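[Paper Review] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

This paper addresses two limitations: existing video-to-audio generation models are restricted to mono output and lack spatial immersion, and existing binaural approaches suffer from error accumulation and spatiotemporal inconsistency caused by their two-stage pipeline (mono generation followed by spatialization).

#Review #Binaural Audio Generation #Spatial Audio #Video-Driven #End-to-End #Conditional Flow Matching #Multimodal AI #Deep Learning #Audio-Visual Synthesis

December 2, 2025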

[Paper Review] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

This study explores the internal mechanisms of large language models (LLMs) to understand how they perform analogical reasoning. In particular, it seeks to show how LLMs extract relational concepts and apply them to new situations, and how they identify parallel relations through structural alignment rather than surface similarity.

#Review #Analogical Reasoning #Large Language Models #Mechanistic Interpretability #Proportional Analogies #Story Analogies #Structural Alignment #Attention Knockout #Patchscopes

December 2, 2025

[Paper Review] TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition

This paper addresses the problem that open-source models lag behind proprietary models in table recognition (TR) because large-scale labeled data is costly and hard to access.

#Review #Table Recognition #Self-supervised Learning #Vision-Language Models #Reinforcement Learning #Question Answering #Data Augmentation #GRPO

December 2, 2025

[Paper Review] SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

This paper aims to reduce the high inference latency and memory usage of large VLA models and to overcome the limited spatiotemporal reasoning of lightweight VLA models. In particular, it integrates 4D spatiotemporal information into a compact VLA model to provide strong scene understanding and action planning while preserving efficiency.

#Review #Vision-Language-Action (VLA) #Lightweight Models #Spatiotemporal Dynamics #4D Features #Masked Autoencoding #Robotics #Edge AI

December 2, 2025

[Paper Review] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

The goal is to address the limitations of existing multimodal agent systems: the separation of image manipulation from web search, reliance on expensive reinforcement learning (RL), and planning that is disconnected from actual tool execution.

#Review #Multimodal AI #Agentic Models #Interleaved Reasoning #Image Manipulation #DeepSearch #Supervised Fine-tuning (SFT) #Tool-Augmented LLM

December 2, 2025

[Paper Review] SimWorld: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds

This paper presents SimWorld, a simulator for developing and evaluating autonomous agents in realistic, open-ended environments, overcoming the limitations of existing simulators (restricted environments, unrealistic physical and social rules, and no support for LLM/VLM agents).

#Review #Autonomous Agents #Realistic Simulator #Unreal Engine 5 #LLM/VLM Agents #Procedural Generation #Multi-Agent Systems #Physical Simulation #Social Interaction

December 2, 2025

[Paper Review] SimScale: Learning to Drive via Real-World Simulation at Scale

The goal is to address the scarcity of real-world data for safety-critical and out-of-distribution (OOD) scenarios, which are essential to the safety of autonomous driving systems, and to present a method that uses large-scale simulation data to systematically improve the robustness and generalization of end-to-end (E2E) planners when real data is limited.

#Review #Autonomous Driving #Simulation #Neural Rendering #3D Gaussian Splatting #Sim-to-Real #Data Scaling #End-to-End Planning #Pseudo-Expert

December 2, 2025

[Paper Review] Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization

This paper aims to systematically analyze how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning in Vision-Language Models (VLMs).

#Review #Chain-of-Thought (CoT) #Vision-Language Models (VLMs) #Visual Reasoning #Generalization #Supervised Fine-Tuning (SFT) #Reinforcement Learning (RL) #Grounding CoT #Maze Solving

December 2, 2025

[Paper Review] PAI-Bench: A Comprehensive Benchmark For Physical AI

It remains unclear whether current multimodal large language models (MLLMs) and video generation models (VGMs) adequately support perceiving and predicting real-world physical dynamics.

#Review #Physical AI #Benchmark #Video Generation #Conditional Video Generation #Video Understanding #Multimodal LLMs #Physical Plausibility #Embodied Reasoning

December 2, 2025

[Paper Review] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework

This paper aims to develop a multi-shot video generation framework that goes beyond single-shot video generation, offering flexible shot arrangement, a consistent narrative, and controllability beyond text prompts.

#Review #Multi-Shot Video Generation #Controllable Video Generation #Diffusion Models #RoPE #Spatiotemporal Consistency #Reference Injection #Data Curation Framework

December 2, 2025

[Paper Review] Mixture of Horizons in Action Chunking

This paper addresses the fundamental limitations caused by a fixed action chunk length (horizon) in Vision-Language-Action (VLA) models.

#Review #Vision-Language-Action Models #Action Chunking #Robotic Manipulation #Multi-horizon Planning #Transformer Architecture #Gated Fusion #Dynamic Inference

December 2, 2025

[Paper Review] Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models

This study systematically investigates the context comprehension of Masked Diffusion Language Models (MDLMs) and aims to determine how locality bias and the use of mask tokens affect performance.

#Review #Diffusion Language Models #Masked Diffusion Language Models #Context Comprehension #Locality Bias #Mask Tokens #Fine-tuning #Mask-agnostic Loss #Long-context Processing

December 2, 2025

[Paper Review] MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

This paper aims to achieve robust zero-shot visual navigation in dynamic, previously unseen environments.

#Review #Visual Navigation #Dual-Scale Framework #Sparse Spatial Memory Graph #Memory-Guided Planning #Geometry-Enhanced Control #Zero-Shot Navigation #Embodied AI

December 2, 2025

[Paper Review] Guided Self-Evolving LLMs with Minimal Human Supervision

This paper addresses the instability, performance plateaus, concept drift, and diversity collapse that existing self-evolving language models (LLMs) suffer from.

#Review #Self-Evolving LLMs #Self-Play #Reinforcement Learning #Curriculum Learning #Few-shot Learning #Human Supervision #Concept Drift #Diversity Collapse

December 2, 2025

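[Paper Review] Glance: Accelerating Diffusion Models with 1 Sample

This paper addresses the high computational cost and large number of inference steps of image-generation diffusion models. In particular, it aims to provide a lightweight solution that achieves efficient acceleration and strong generalization from a single sample, without retraining cost or degraded generalization.

#Review #Diffusion Models #Acceleration #Distillation #LoRA #Few-shot Learning #Phase-aware #Image Generation #Computational Efficiency

December 2, 2025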

[Paper Review] GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning

This study addresses the difficulty GUI (Graphical User Interface) agents have in obtaining the comprehensive environment information needed to perform complex screen navigation tasks in real environments.

#Review #GUI Agents #Screen Navigation #Reinforcement Learning #Multi-Turn RL #Simulation #Supervised Fine-tuning #Generalization

December 2, 2025