#CLIP

15 posts

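[Paper Review] How can embedding models bind concepts?

This paper focuses on the problem that recent vision-language embedding models such as CLIP recognize individual concepts well but fail at concept binding, that is, at correctly combining those concepts to compose objects.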

#Review #Concept Binding #Embedding Models #Compositional Generalization #Multiplicative Interaction #Representation Geometry #CLIP #Transformer

May 31, 2026

[Paper Review] SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

This paper aims to address the data dependence and model performance degradation of existing approaches to making text-to-image (T2I) models safe.

#Review #Diffusion Models #Safety Alignment #Online Reinforcement Learning #GRPO #CLIP #Concept Erasure

May 18, 2026

[Paper Review] Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

This paper aims to address the out-of-distribution (OOD) performance degradation that occurs when large-scale vision-language models such as CLIP are fine-tuned for downstream tasks.

#Review #CLIP #Sparse Autoencoders #Robust Fine-tuning #Interpretability #Representational Drift #Computer Vision

May 17, 2026

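[SGLang] Vision-Language Models: CLIP, InternVL, and LLaVA Processors

This post analyzes SGLang's vision-language model processors. It walks through image preprocessing, token mapping, and embedding insertion for major VLMs such as CLIP, InternVL, and LLaVA, alongside the code.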

#sglang #Vision Language #CLIP #InternVL #LLaVA

April 14, 2026

[Paper Review] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Vision-language models (VLMs) such as CLIP are a core component of multimodal AI, but their large-scale, uncurated training data leaves them with severe social and spurious biases.

#Review #Vision-Language Models #CLIP #Debiasing #Sparse Autoencoder #Post-Hoc #Zero-Shot #Feature Disentanglement #Bias Mitigation

March 23, 2026

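[Paper Review] Large Multimodal Models as General In-Context Classifiers

This paper revisits the prevailing view that large multimodal models (LMMs) underperform contrastively trained vision-language models (VLMs) on image classification, and explores how much in-context learning (ICL) can improve the classification ability of LMMs.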

#Review #Large Multimodal Models #In-Context Learning #Image Classification #Open-World Classification #Zero-Shot Learning #Vision-Language Models #CLIP

March 5, 2026

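[Paper Review] HDINO: A Concise and Efficient Open-Vocabulary Detector

This paper aims to address the over-reliance of existing open-vocabulary object detection (OVD) models on manually curated training datasets and resource-intensive cross-modal feature extraction. Its goal is to remove these dependencies and develop a concise yet efficient open-vocabulary object detector.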

#Review #Open-Vocabulary Object Detection #Transformer #DINO #CLIP #Semantic Alignment #Hard Example Mining #Feature Fusion #Two-stage Training

March 4, 2026

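[Paper Review] Half-Truths Break Similarity-Based Retrieval

This paper aims to address the vulnerability of CLIP-style dual encoders to "half-truths": for a given image, they assign higher similarity to a longer description that is plausible but contains an added error (a half-truth) than to a shorter, accurate description.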

#Review #Vision-Language Models #CLIP #Compositional Reasoning #Image-Text Retrieval #Fine-tuning #Hard Negatives #Unit-level Supervision #Half-Truths

March 2, 2026

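[Paper Review] Compositional Generalization Requires Linear, Orthogonal Representations in Vision Embedding Models

This paper aims to identify which essential representational properties modern vision embedding models must have in order to generalize compositionally to concept combinations not seen during training.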

#Review #Compositional Generalization #Vision-Language Models #Linear Representations #Orthogonal Representations #Neural Networks #Embedding Geometry #CLIP

March 1, 2026

[Paper Review] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

This study aims to resolve two main problems that degrade CLIP's performance in text-based person retrieval.

#Review #Text-based Person Retrieval #CLIP #MLLM #Data Curation #Dual-Masking #Gradient-Attention #WebPerson Dataset

September 12, 2025

[Paper Review] CLIPSym: Delving into Symmetry Detection with CLIP

This paper aims to detect reflection and rotation symmetries in images more accurately and robustly by leveraging CLIP, an existing large-scale vision-language model (VLM).

#Review #Symmetry Detection #Vision-Language Models #CLIP #Equivariant Networks #Prompt Engineering #Geometric Deep Learning

September 1, 2025

[Paper Review] Selective Contrastive Learning for Weakly Supervised Affordance Grounding

This paper seeks to overcome a limitation in weakly supervised affordance grounding (WSAG), where models focus on general class patterns instead of affordance-relevant parts.

#Review #Weakly Supervised Learning #Affordance Grounding #Contrastive Learning #CLIP #Part Discovery #Object Localization #DINO #Generative Models

August 25, 2025

[Paper Review] Processing and acquisition traces in visual encoders: What does CLIP know about your camera?

This study aims to explore how foundation visual encoders encode metadata related to image processing (e.g., JPEG compression) and acquisition (e.g., camera model), and how this information affects semantic predictions.

#Review #Visual Encoders #Metadata #Image Processing #Image Acquisition #Robustness #CLIP #Foundation Models #Distribution Shift

August 15, 2025

[Paper Review] Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation

This paper seeks to overcome the limitations of existing methods in self-supervised monocular depth estimation (MDE).

#Review #Self-supervised Monocular Depth Estimation #Foundation Models #CLIP #DINO #Language Guidance #Coarse-to-fine Learning #Feature Aggregation #3D Perception

October 13, 2025

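[Paper Review] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder

The goal is to address three limitations of the existing CLIP text encoder: a 77-token length limit, English-only support, and insufficient fine-grained semantic understanding.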

#Review #Vision-Language Models #CLIP #LLM-based Embedder #Knowledge Distillation #Contrastive Learning #Curriculum Learning #Multimodal Alignment #Progressive Alignment

October 22, 2025