#End-to-End

6 posts

[Paper Review] Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

This paper aims to address four problems in real-time object detection models: dependence on NMS, unnecessary growth in model parameters, reduced training efficiency, and failures in detecting small objects.

#Review #YOLO26 #Real-Time Object Detection #End-to-End #NMS-Free #MuSGD #STAL #Instance Segmentation

June 2, 2026

[Paper Review] WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

This paper aims to simultaneously improve the semantic intelligence (IQ) and the vocal expressiveness (EQ) of unified End-to-End Spoken Dialogue Models.

#Review #Spoken Dialogue Models #Post-Training #Reinforcement Learning #Preference Optimization #Modality Alignment #End-to-End #Acoustic Expressiveness

April 22, 2026

[Paper Review] AURA: Always-On Understanding and Real-Time Assistance via Video Streams

This paper aims to address a limitation of existing VideoLLMs: most are optimized for offline analysis and therefore struggle to respond continuously and immediately to video streams that change in real time.

#Review #VideoLLMs #Streaming Video Understanding #End-to-End #Context Management #Proactive Response #Real-Time Inference

April 6, 2026

[Paper Review] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Reconstructing articulated 3D objects from a single image remains a fundamental challenge, because the object's geometry, part structure, and motion parameters must be inferred jointly from limited visual evidence.

#Review #Monocular 3D Reconstruction #Articulated Objects #Progressive Structural Reasoning #Kinematic Estimation #PartNet-Mobility #End-to-End

March 19, 2026

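[Paper Review] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

This paper aims to address two limitations: existing video-to-audio generation models are confined to mono output and therefore lack spatial immersion, and existing binaural approaches rely on a two-stage pipeline (mono generation followed by spatialization) that suffers from error accumulation and spatio-temporal inconsistency.

#Review #Binaural Audio Generation #Spatial Audio #Video-Driven #End-to-End #Conditional Flow Matching #Multimodal AI #Deep Learning #Audio-Visual Synthesis

December 2, 2025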

[Paper Review] OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

This paper aims to address the latency of text-only translation LLMs and their inability to use multimodal context, as well as the limited multilingual translation performance and coverage of multimodal foundation models (MMFMs).

#Review #Multimodal Translation #Speech Translation #Simultaneous Translation #Large Language Models #Multimodal Foundation Models #Modular Fusion #End-to-End #Gated Fusion #OCR

December 1, 2025