#End-to-End Learning

12 posts

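[Paper Review] See like a Robot: Robot-Centric Pointmaps for Vision-Language-Action Models

This paper addresses the performance degradation that camera viewpoint changes cause when training VLA models on large-scale datasets. Existing models take RGB data in the camera frame as input, which creates a frame mismatch with the robot frame in which the robot's actions are actually defined.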

#Review #VLA #Manipulation #3D Geometry #Pointmap #Robot-Centric #Viewpoint Variation #End-to-End Learning

July 19, 2026

[Paper Review] Representation Forcing for Bottleneck-Free Unified Multimodal Models

This paper proposes Representation Forcing (RF) to resolve the structural bottleneck that arises because existing UMMs rely on a frozen VAE.

#Review #Unified Multimodal Models #Representation Forcing #Pixel-space Diffusion #Vector Quantization #End-to-End Learning #Bottleneck-Free #Mixture-of-Transformers

May 31, 2026

[Paper Review] From Pixels to Words -- Towards Native One-Vision Models at Scale

This paper proposes a unified native one-vision architecture to address the complex pipelines and fragmented visual-language information of existing modular VLMs.

#Review #Native Vision-Language Models #Monolithic Backbone #Spatiotemporal Attention #One-Vision Foundation Model #End-to-End Learning #Spatial Intelligence

May 27, 2026

[Paper Review] MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

This paper addresses the problem that recent Multimodal Large Language Models (MLLMs), while strong at basic Visual Question Answering (VQA), struggle to understand the subtle cultural, emotional, and contextual implications embedded in images, image metaphors in particular.

#Review #Image Metaphor Understanding #Visual Reasoning #Reinforcement Learning #MLLMs #TFQ-GRPO #End-to-End Learning #Cognitive AI

February 12, 2026

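[Paper Review] MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

The goal is to overcome the limits on reconstruction fidelity and scalability that existing audio tokenizers incur through their reliance on pretrained encoders, semantic distillation, and heterogeneous CNN-based architectures.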

#Review #Audio Tokenizer #Transformer Architecture #End-to-End Learning #Residual Vector Quantization #Speech Synthesis #Audio Foundation Models #Scalability #Autoregressive Models

February 12, 2026

[Paper Review] LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

The paper proposes LightOnOCR-2-1B, a 1-billion-parameter end-to-end multilingual vision-language model that converts document images into clean, naturally ordered text without a complex multi-stage OCR pipeline.

#Review #OCR #Vision-Language Model #End-to-End Learning #Multilingual #Reinforcement Learning #Document Understanding #Bounding Box Prediction #Task Arithmetic Merging

January 20, 2026

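[Paper Review] UniUGP: Unifying Understanding, Generation, and Planing For End-to-end Autonomous Driving

The goal is to address the difficulties that autonomous driving systems face in long-tail scenarios because of limited world knowledge and insufficient modeling of visual dynamics.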

#Review #Autonomous Driving #End-to-End Learning #Vision-Language Models #World Model #Chain-of-Thought #Video Generation #Trajectory Planning #Multimodal Learning

December 10, 2025

[Paper Review] OpenREAD: Reinforced Open-Ended Reasoing for End-to-End Autonomous Driving with LLM-as-Critic

The goal is to improve the limited reasoning generalization and open-ended task handling of existing SFT (Supervised Fine-tuning) based VLMs (Vision-Language Models) in autonomous driving systems.

#Review #Autonomous Driving #Reinforcement Fine-tuning #LLM-as-Critic #Vision-Language Model #End-to-End Learning #Chain-of-Thought #Trajectory Planning

December 1, 2025

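[Paper Review] HunyuanOCR Technical Report

The goal is to resolve the error propagation and high maintenance cost of existing pipeline-based OCR systems, and to overcome both the high compute requirements of large general-purpose VLMs and the incomplete end-to-end optimization of OCR-specialized VLMs.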

#Review #Optical Character Recognition #Multimodal Large Language Model #End-to-End Learning #Reinforcement Learning #Document Parsing #Information Extraction #Text Spotting

November 25, 2025

[Paper Review] EVTAR: End-to-End Try on with Additional Unpaired Visual Reference

This work addresses the lack of realism in existing virtual try-on models, which depend on complex inputs such as agnostic person images, human pose, and densepose and offer little support for reference images.

#Review #Virtual Try-on #Diffusion Models #End-to-End Learning #Reference Images #Unpaired Data #Flow Matching #Transformer Architecture #Generative AI

November 9, 2025

[Paper Review] Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

This paper aims to overcome the limitations of the existing multi-agent system (MAS) and tool-integrated reasoning (TIR) paradigms by internalizing multi-agent collaboration within a single LLM (Large Language Model), building an end-to-end Agent Foundation Model (AFM) for complex problem solving.

#Review #Chain-of-Agents #Agent Foundation Models #Multi-Agent Systems #Tool-Integrated Reasoning #Multi-agent Distillation #Agentic Reinforcement Learning #LLMs #End-to-End Learning

August 20, 2025

[Paper Review] PixNerd: Pixel Neural Field Diffusion

This paper aims to resolve the accumulated errors and decoding artifacts caused by existing diffusion models built on a Variational Autoencoder (VAE).

#Review #Diffusion Models #Neural Fields #Pixel Space #Generative Models #Image Synthesis #Transformer Architecture #End-to-End Learning

August 4, 2025