#Multimodal

18 posts

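[Paper Review] Gemma 4 Technical Report

This paper proposes the Gemma 4 model family to simultaneously achieve the strong multimodal understanding, complex reasoning ability, and compute efficiency demanded by the current LLM ecosystem.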

#Review #Multimodal #Mixture-of-Experts #Reasoning Trace #Speculative Decoding #Quantization-Aware Training #Long-context #Encoder-free

July 7, 2026

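[Paper Review] RedVox: Safety and Fairness Gaps in Speech Models Across Languages

This paper points out that safety and fairness evaluations of recent speech recognition models are overly English-centric and lean on synthetic data rather than natural, real-world usage.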

#Review #Speech Models #Safety #Fairness #Multilingual #Benchmark #Red Teaming #Multimodal

June 30, 2026

[Paper Review] DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

This paper proposes DreamForge-World 0.1 Preview, which enables real-time interactive simulation in compute-constrained environments.

#Review #World Model #Interactive Generation #Real-time #Consumer GPU #Autoregressive #Multimodal #LoRA

June 29, 2026

[Paper Review] ChartWalker: Benchmarking the Cross-Chart RAG Task

This paper aims to address the lack of structural information and the limits on logical reasoning in prior Cross-Chart RAG research.

#Review #Cross-Chart RAG #Knowledge Graph #Multimodal #Reasoning Paths #Benchmark #Agentic Retrieval

June 23, 2026

[vllm] Analysis of vLLM's Qwen3-VL Multi-Video Prompt Processing Optimization

Replacing text-based prompt expansion with token-level substitution improved performance and fixed an EVS bug.

#vLLM #Qwen3-VL #Optimization #LLM #Multimodal

June 20, 2026

[Paper Review] Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

This paper aims to address "epistemic rigidity": the failure of modern multimodal deep research systems to properly resolve the cross-modal conflicts that arise during information gathering.

#Review #Multimodal #Deep Research Agents #Belief Revision Theory #Structural Thinking #Multimodal Structural Graph (MSG) #Conflict-aware

June 9, 2026

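[Paper Review] CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Existing GUI agents have made substantial progress on web navigation and simple OS tasks, but their ability to handle professional creative workflows such as sophisticated media post-production has barely been validated.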

#Review #GUI Agents #Media Post-Production #Benchmark #Multimodal #Long-Horizon #Grounding #Vibe Cutting

May 20, 2026

[Paper Review] WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

This study addresses the limitation that existing agent benchmarks fail to properly reflect realistic deployment environments.

#Review #Agent Evaluation #Long-Horizon #Native-Runtime #Multimodal #Reproducible #Hybrid Verification

May 14, 2026

[Paper Review] Nexus: An Agentic Framework for Time Series Forecasting

This paper proposes Nexus to address the structural limitations of existing TSFM-based and LLM-based time series forecasting research.

#Review #Time Series Forecasting #Large Language Models #Agentic Framework #Multimodal #Reasoning #Temporal Dynamics #Calibration

May 14, 2026

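[SGLang] Multimodal Processing Pipeline Overview: Unifying Vision/Audio/Video

An analysis of SGLang's multimodal processing pipeline. It walks through the preprocessing of image, audio, and video inputs, their conversion to embeddings, and how they are combined with the LLM, alongside the code.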

#sglang #Multimodal #Vision #Audio #Video #Pipeline

April 14, 2026

[Paper Review] MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

The authors propose MolmoWeb, an instruction-conditioned visual-language action policy, and build the MolmoWebMix dataset to train it. MolmoWeb is based on the Molmo2 architecture; it takes a web screenshot and a task instruction as input and outputs an immediate browser action.

#Review #Web Agents #Multimodal #Vision-Language Models #Open Data #Browser-use #GUI Perception #Instruction-conditioned Policies

April 9, 2026

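[sglang] [VLM] Multimodal Embedding Optimization: Introducing Chunk-Aware Encoding and Per-Image Caching

An analysis of code changes that dramatically improve SGLang's VLM inference performance: chunk-aware encoding, per-image caching, and deferred device transfer.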

#VLM #Optimization #SGLang #Multimodal #Caching #Performance

April 4, 2026

[SGLang] Optimizing Multimodal Transfer with CUDA IPC Pool Handle Caching

Caches CUDA IPC handles at the pool level during multimodal data transfer, eliminating repeated cudaIpcOpenMemHandle calls.

#SGLang #CUDA IPC #Multimodal #Performance

March 29, 2026

[sglang] VLM ShmPointerMMData Optimization: multi-pickle Safety and deferred unwrap

An analysis of the refactoring of SGLang's shared-memory wrapper for VLM multimodal data, which ensures multi-pickle safety and introduces a deferred unwrap pattern after broadcast.

#SGLang #VLM #Shared Memory #Multimodal #Optimization #IPC

March 27, 2026

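[Paper Review] MAEB: Massive Audio Embedding Benchmark

This paper addresses the fragmentation of evaluation protocols for audio embedding models, which makes it hard to compare models and track meaningful progress. To that end, it builds MAEB (Massive Audio Embedding Benchmark), a broad and unified evaluation framework, with the goal of fostering the development of general-purpose audio embedding models.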

#Review #Audio Embedding #Benchmark #Multimodal #Zero-shot Classification #Clustering #Representation Learning #MTEB Ecosystem #Cross-modal Audio-Text #Multilingual Audio

February 18, 2026

[Paper Review] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples

This paper aims to perform zero-shot anomaly classification (AC) and anomaly segmentation (AS) on 2D images and 3D point clouds of industrial products, without any labeling of the training dataset.

#Review #Zero-Shot Learning #Anomaly Detection #Anomaly Segmentation #Multimodal #Industrial Inspection #Mutual Scoring #Unsupervised Learning #Transformer

November 13, 2025

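[Paper Review] M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision

The goal is to overcome the limitations of model architectures and training strategies that are fragmented across 2D, 3D, and video data in medical imaging, and to enable zero-shot multimodal medical image retrieval through a single visual representation learning framework.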

#Review #Medical Image Retrieval #Self-Supervised Learning #Multimodal #Zero-shot #Foundation Models #MAE #SimDINO #Vision Transformer

September 3, 2025

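[Paper Review] MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data

This paper aims to optimize the Masked Autoencoder (MAE) framework to efficiently handle the inherently multimodal, multitemporal, and multispectral nature of Earth observation (EO) data. In doing so, it seeks to effectively integrate the complex heterogeneity of EO data and learn useful, general-purpose representations.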

#Review #Self-supervised Learning #Masked Autoencoder #Earth Observation #Multimodal #Multitemporal #Multispectral #Fusion Strategies #Target Normalization

August 18, 2025