#Visual Reference Tokens (VRTs)

1개의 포스트

[논문리뷰] Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

기존 MLLM이 시각 작업을 위해 텍스트로 좌표를 생성하는 등 간접적인 표현 방식 에 의존하여 성능이 제한되고 분할(Segmentation)과 같은 밀집 예측(Dense Prediction) 작업 이 어려웠던 문제를 해결하는 것입니다.

#Review #Multimodal Large Language Models (MLLMs)#Visual Reference Tokens (VRTs)#Dense Prediction #Referring Expression Comprehension (REC)#Open-Vocabulary Detection (OVD)#Image Captioning #Unified Architecture #Autoregressive Generation

2025년 10월 9일