#Visual Tokenization

5 posts

[Paper Review] ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

This work aims to move beyond existing multimodal models, which simply attach a visual encoder to a language model, and to achieve genuine integration across modalities.

#Review #Autoregressive Model #Large Multimodal Model #Discrete Representation #Visual Tokenization #Unified Architecture

June 9, 2026

[Paper Review] Channel-wise Vector Quantization

This work aims to address fundamental limitations of existing Vector Quantization (VQ)-based image tokenization and autoregressive generation.

#Review #Channel-wise Vector Quantization #Autoregressive Generation #Next-Channel Prediction #Codebook Utilization #Visual Tokenization #Image Reconstruction #Text-to-Image Generation #Nested Channel Dropout

May 25, 2026

[Paper Review] A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

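Rather than modeling entire video frames, the authors propose DeltaTok, which compresses only the change (delta) between frames, and DeltaWorld, which performs generative inference on top of it. Conditioned on the previous frame's features, DeltaTok encodes the difference between that frame and the current one as a single token, turning a video into a purely temporal sequence.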

#Review #Generative World Modeling #Delta Tokens #Visual Tokenization #Vision Foundation Models #Best-of-Many Training #Spatio-temporal Redundancy #Efficient Inference

April 8, 2026

[Paper Review] Next Visual Granularity Generation

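This paper addresses the problem that existing image generation models treat images as flat or unstructured data, which limits fine-grained control and leaves them prone to error accumulation.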

#Review #Image Generation #Granularity Control #Structured Representation #Hierarchical Generation #Coarse-to-fine #Visual Tokenization #Latent Space

August 19, 2025

[Paper Review] Heptapod: Language Modeling on Visual Signals

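This paper criticizes the reliance of visual generation models on injected external semantic information and on CFG (Classifier-Free Guidance), and aims to return to the basic principles of language modeling: a reconstruction-focused tokenizer and the Transformer's intrinsic semantic learning.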

#Review #Autoregressive Models #Image Generation #Language Modeling #Causal Transformer #2D Distribution Prediction #Visual Tokenization #Self-Supervised Learning #Generative Models

October 9, 2025