Review

[Paper Review] Collaborative Multi-Modal Coding for High-Quality 3D Generation

This paper addresses the problem that existing 3D generation models rely on a single modality (e.g., RGB images), which limits the range of usable training data and overlooks the complementary benefits of multi-modal data.

#Review #3D Generation #Multi-modal Learning #Diffusion Models #Triplane Representation #Collaborative Coding #Image-to-3D #Latent Space

August 29, 2025

[Paper Review] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

This paper aims to address the high computational overhead of existing Vision-Language-Action (VLA) models and the semantic fragmentation across modalities, two factors that limit the scalability and deployability of VLA models.

#Review #Vision-Language-Action Model #Sparsification #Instruction-Driven Routing #Cognition-Aligned AI #Robotics #Computational Efficiency #Multimodal AI

August 29, 2025

[Paper Review] AWorld: Orchestrating the Training Recipe for Agentic AI

This paper aims to address inefficient experience generation, a key bottleneck in developing agentic AI systems, so that the "learning from practice" paradigm becomes practical and scalable in complex environments.

#Review #Agentic AI #Reinforcement Learning #Distributed Systems #Experience Generation #LLM Fine-tuning #GAIA Benchmark #Scalability #AWORLD Framework

August 29, 2025

[Paper Review] Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Traditional autoscalers are inefficient in large language model (LLM) inference environments that use a Prefill-Decode (P/D) disaggregated architecture. The result is poor utilization of heterogeneous hardware, network bottlenecks, and an imbalance between the prefill and decode stages.

#Review #LLM Inference #Autoscaling #Disaggregated Architecture #Heterogeneous Hardware #Resource Management #Topology-aware Scheduling #GPU Utilization

August 28, 2025

[Paper Review] StepWiser: Stepwise Generative Judges for Wiser Reasoning

This paper aims to address the challenge of supervising the logical validity of each intermediate step in the multi-step reasoning (Chain-of-Thought) strategies that large language models (LLMs) use to solve complex problems.

#Review #LLM Reasoning #Process Reward Models #Reinforcement Learning #Generative Judges #Stepwise Feedback #Chain-of-Thought #Meta-Reasoning

August 28, 2025

[Paper Review] Self-Rewarding Vision-Language Model via Reasoning Decomposition

This paper aims to address the visual hallucinations and language shortcuts that Vision-Language Models (VLMs) suffer from.

#Review #Vision-Language Models #Reinforcement Learning #Self-Rewarding #Reasoning Decomposition #Visual Perception #Language Reasoning #Hallucinations #Language Shortcuts

August 28, 2025

[Paper Review] Predicting the Order of Upcoming Tokens Improves Language Modeling

This paper seeks to address the problem that existing Multi-Token Prediction (MTP) performs inconsistently as an auxiliary objective because predicting exact future tokens is difficult.

#Review #Language Modeling #Next-Token Prediction #Multi-Token Prediction #Token Order Prediction #Auxiliary Objective #Learning-to-Rank #Transformer #Large Language Models

August 28, 2025

[Paper Review] MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment

This paper addresses two problems in existing text-guided motion generation methods: imprecise alignment between linguistic descriptions and motion semantics, and a slow, inefficient multi-step inference process. The ultimate goal is to develop a framework that enables strong semantic alignment, high-quality motion generation, and real-time synthesis.

#Review #Text-Guided Motion Generation #Rectified Flow Matching #Preference Alignment #Human Motion Synthesis #Real-time AI #Transformer Architecture #Self-supervised Learning

August 28, 2025

[Paper Review] Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents

This paper aims to build the first large-scale benchmark for systematically evaluating the privacy awareness of MLLM-powered smartphone agents, and to verify whether these agents take appropriate privacy-protection measures when they access sensitive user information.

#Review #Multimodal LLMs (MLLMs) #Smartphone Agents #Privacy Awareness #Benchmarking #Sensitive Data Detection #Risk Assessment #UI Automation

August 28, 2025

[Paper Review] MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time Autoregressive Video Generation

This paper aims to build an interactive digital human video generation system that responds to diverse input signals in real time while maintaining low latency and high visual consistency. It seeks to overcome the limitations of existing approaches, including high latency, high computational cost, and limited controllability.

#Review #Multimodal Generation #Digital Human Synthesis #Real-time Video Generation #Autoregressive LLM #Diffusion Models #Deep Compression Autoencoder #Exposure Bias Mitigation #Streaming Inference

August 28, 2025

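[Paper Review] Gaze into the Heart: A Multi-View Video Dataset for rPPG and Health Biomarkers Estimation

This paper aims to overcome the limitations of existing rPPG (remote photoplethysmography) datasets (small scale, privacy concerns, limited diversity of conditions, and restricted access) and to accelerate the development of remote health monitoring and AI medical assistant systems by building a comprehensive, large-scale multi-view video dataset together with baseline models.

#Review #rPPG #Multi-View Video Dataset #Health Biomarkers #Physiological Monitoring #Deep Learning #Telemedicine #Biosignals

August 28, 2025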

[Paper Review] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

This paper seeks to address the limitations of decoders in existing Vision-Language-Action (VLA) models: either fixed-order autoregressive generation, or continuous diffusion / flow matching heads that are separated from the backbone.

#Review #Vision-Language-Action (VLA) #Discrete Diffusion #Action Decoding #Transformer #Robot Control #Masked Modeling #Adaptive Decoding #Reinforcement Learning

August 28, 2025

[Paper Review] Diffusion Language Models Know the Answer Before Decoding

This paper aims to address slow inference, a major drawback of diffusion language models (DLMs).

#Review #Diffusion Language Models #DLM Acceleration #Early Answer Convergence #Early Commit Decoding #Confidence Gap #Inference Speedup #Training-Free

August 28, 2025

[Paper Review] DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

This work proposes DeepScholar-Bench, a live benchmark and automated evaluation framework for effectively assessing generative research synthesis systems, overcoming the limitations of existing question-answering benchmarks and manually curated datasets.

#Review #Generative Research Synthesis #Live Benchmark #Automated Evaluation #LLM-as-a-judge #Related Work Generation #Retrieval-Augmented Generation #Verifiability

August 28, 2025

[Paper Review] CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

This paper aims to resolve a persistent trade-off at the core of GUI (Graphical User Interface) based autonomous agents: the tension between long-horizon planning and precise fine-grained execution.

#Review #GUI Agents #Reinforcement Learning #Planner-Executor Architecture #Decoupled Training #Large Vision-Language Models #Specialization #Generalization #Computer Use Agent

August 28, 2025

[Paper Review] Beyond Transcription: Mechanistic Interpretability in ASR

This paper applies mechanistic interpretability techniques, which have been used successfully on large language models (LLMs), to automatic speech recognition (ASR), with the aim of understanding the internal behavior and dynamics of modern ASR systems and large audio-language models (LALMs).

#Review #ASR #Mechanistic Interpretability #Logit Lens #Linear Probing #Activation Patching #Hallucinations #Repetitions #Encoder-Decoder

August 28, 2025

[Paper Review] AudioStory: Generating Long-Form Narrative Audio with Large Language Models

This paper addresses the limitations of existing Text-to-Audio (TTA) models, which excel at generating short audio clips but struggle with long-form narrative audio generation, a task that requires temporal coherence and compositional reasoning.

#Review #Text-to-Audio #Long-Form Audio Generation #Large Language Models #Narrative Reasoning #Diffusion Models #Multimodal AI #Progressive Training

August 28, 2025

[Paper Review] Wan-S2V: Audio-Driven Cinematic Video Generation

This work addresses the problem that existing audio-driven character animation models fall short in complex film and TV production scenarios, which involve subtle interactions, realistic body movements, and dynamic camera work.

#Review #Audio-Driven Video Generation #Cinematic Video #Diffusion Models #Transformer Architecture #Long Video Consistency #Human Animation #Multimodal Control #Data Curation

August 27, 2025

[Paper Review] VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

This paper aims to overcome the inconsistency and imprecision of existing 3D editing methods based on 2D images, and to perform precise, coherent local 3D editing in the native 3D latent space without any training (training-free).

#Review #3D Editing #Training-Free #Diffusion Models #Latent Space #3D Inversion #Contextual Feature Replacement #3D Consistency #Edit3D-Bench

August 27, 2025

[Paper Review] VibeVoice Technical Report

This paper aims to address problems that existing systems have left unsolved in long-form, multi-speaker conversational audio synthesis: scalability, natural turn-taking, and content-aware generation.

#Review #Speech Synthesis #Long-form Audio #Multi-speaker #Next-token Diffusion #Speech Tokenizer #Large Language Model #Variational Autoencoder #Audio Compression

August 27, 2025