#MLLM-as-a-Judge

7개의 포스트

[논문리뷰] Benchmark Everything Everywhere All at Once

본 논문은 기존의 수동적인 벤치마크 구축 방식이 가진 한계인 노동 집약성, 재사용 불가능성, 그리고 모델 성능 향상에 따른 빠른 벤치마크 포화(Saturation) 문제를 해결하고자 합니다.

#Review #Benchmark Agent #Autonomous Evaluation #Benchmark Construction #MLLM-as-a-Judge #Agentic Workflow #Performance Saturation

2026년 6월 4일

[논문리뷰] MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

본 연구는 29개의 기존 데이터셋에서 추출한 1,804개의 샘플을 바탕으로 9가지 유형의 편향을 분석하는 MM-JudgeBias 벤치마크를 구축하였다. 제안된 프레임워크는 각 샘플에 대해 편향되지 않은(unbiased) triplet과 편향을 주입한(biased) triplet을 생성하여 평가 결과의 차이를 비교한다.

#Review #Multimodal Large Language Models #MLLM-as-a-Judge #Compositional Bias #Benchmark #Bias-Deviation #Bias-Conformity

2026년 4월 21일

[논문리뷰] T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

텍스트-오디오-비디오 (T2AV) 생성 모델의 평가 방식이 파편화되어 있고, 단일 모달 메트릭에 의존하며 복잡한 프롬프트에서 크로스-모달 정렬, 지시 준수 및 인지적 사실성을 제대로 포착하지 못하는 문제를 해결하고자 합니다. 본 연구는 T2AV 시스템의 포괄적인 평가를 위한 통합 벤치마크 를 제시하는 것을 목표로 합니다.

#Review #Text-to-Audio-Video Generation #Multimodal Evaluation #Benchmark #MLLM-as-a-Judge #Cross-modal Alignment #Instruction Following #Perceptual Realism #Audio Realism

2025년 12월 24일

[논문리뷰] MultiRef: Controllable Image Generation with Multiple Visual References

이 연구는 텍스트 프롬프트나 단일 이미지 참조에 의존하는 기존 이미지 생성 모델의 한계를 극복하고, 다중 시각 참조(multiple visual references)를 활용한 제어 가능한 이미지 생성 이라는 새로운 문제에 초점을 맞춥니다.

#Review #Controllable Image Generation #Multi-modal Generation #Visual References #Image-to-Image #Benchmark #Dataset #MLLM-as-a-Judge

2025년 8월 20일

[논문리뷰] UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning

기존 multimodal 임베딩 모델의 한계인 hard negative 샘플의 다양성 부족 과 의미적 미묘한 차이 포착 능력 부족 을 해결하여, discriminative ability 를 향상시키는 보편적인 multimodal 임베딩 모델을 개발하는 것을 목표로 합니다.

#Review #Multimodal Embeddings #MLLM-as-a-Judge #Hard Negative Mining #Semantic Alignment #Representation Learning #Reranking #Contrastive Learning

2025년 10월 16일

[논문리뷰] MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

본 논문은 사용자 인터페이스(UI) 디자인 평가 과정에서 발생하는 리소스 제약을 해결하기 위해 Multimodal Large Language Models (MLLMs) 이 인간의 UI 인식과 선호도를 얼마나 정확하게 예측할 수 있는지 벤치마킹하는 것을 목표로 합니다.

#Review #Multimodal LLMs #UI Evaluation #Human Perception #Benchmarking #UX Research #MLLM-as-a-Judge #Cognitive Factors #Pairwise Comparison

2025년 10월 15일

[논문리뷰] VISTA: A Test-Time Self-Improving Video Generation Agent

본 논문은 텍스트-투-비디오(T2V) 생성 모델이 사용자 프롬프트에 매우 민감 하여 고품질 비디오를 얻기 위한 반복적인 프롬프트 수정과 필터링이 필요하다는 문제를 해결하고자 합니다.

#Review #Text-to-Video Generation #Prompt Optimization #Multi-Agent System #Test-Time Improvement #MLLM-as-a-Judge #Video Evaluation #Audio-Video Synthesis

2025년 10월 20일