#Embedding Models

8 posts

[Paper Review] How can embedding models bind concepts?

This paper focuses on the problem that CLIP, a state-of-the-art vision-language embedding model, recognizes individual concepts well but fails at concept binding, that is, correctly combining those concepts to compose objects.

#Review #Concept Binding #Embedding Models #Compositional Generalization #Multiplicative Interaction #Representation Geometry #CLIP #Transformer

May 31, 2026

[Paper Review] Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

This paper builds CSR-L, a human-annotated benchmark for evaluating code-switching retrieval systems, and quantifies the impact of code-switching through CS-MTEB, which comprises 11 tasks. The experiments show that code-switching in the query alone causes significant performance degradation in most systems, including strong multilingual models.

#Review #Information Retrieval #Code-Switching #Benchmark #Embedding Models #Robustness #Late-Interaction #Lexicon-Based Adaptation

April 21, 2026

[Paper Review] DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

This paper aims to address the difficulty LLM agents have in using the rich statistical methods of the R statistical ecosystem, a difficulty that stems from Python-centric training data.

#Review #LLM Agents #R Statistical Ecosystem #Retrieval-Augmented Generation #Distribution-Aware Retrieval #R Package Knowledge Base #Statistical Analysis #Embedding Models

March 5, 2026

[Paper Review] Legal RAG Bench: an end-to-end benchmark for legal RAG

This paper aims to address the lack of high-quality benchmarks and evaluation methodologies for assessing the end-to-end performance of legal RAG systems.

#Review #Retrieval-Augmented Generation (RAG) #Legal AI #Benchmark #Evaluation Methodology #Embedding Models #Large Language Models (LLMs) #Error Decomposition #Information Retrieval

March 2, 2026

[Paper Review] Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

This paper develops the Mecellem models, language models specialized for the Turkish legal domain, with the goal of addressing the performance degradation of large language models in non-English and specialized domains (Turkish law in particular). To this end, it presents two approaches: encoder models trained from scratch and decoder models adapted through continual pre-training (CPT).

#Review #Turkish Legal NLP #Domain Adaptation #ModernBERT #Continual Pre-training (CPT) #Embedding Models #Legal LLMs #Retrieval-Augmented Generation (RAG) #Curriculum Learning

January 25, 2026

[Paper Review] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

This paper introduces the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, which unify data of diverse modalities such as text, images, document images, and video to perform high-precision multimodal retrieval.

#Review #Multimodal Retrieval #Multimodal Ranking #Foundation Models #Embedding Models #Reranking Models #Contrastive Learning #Knowledge Distillation #Matryoshka Representation Learning #Quantization-Aware Training

January 11, 2026

[Paper Review] GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings

This study addresses the deployment problem of large language model (LLM)-based embedding models. It aims to overcome a limitation of existing pruning methods: they cannot distinguish general semantic representations from domain-specific patterns, which leads to suboptimal pruning decisions.

#Review #Model Pruning #Domain Adaptation #Embedding Models #Gradient Alignment #Fisher Information #Model Compression #LLMs

September 16, 2025

[Paper Review] The Massive Legal Embedding Benchmark (MLEB)

This paper aims to address the limitations of existing legal information retrieval (IR) benchmarks, namely low quality, insufficient diversity, and failure to predict real-world performance.

#Review #Legal Information Retrieval #Embedding Models #Benchmark Dataset #Natural Language Processing #Retrieval-Augmented Generation #Jurisdictional Diversity #Legal Tech

October 24, 2025