#Multimodal Large Language Models

60 posts

[Paper Review] EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

[Paper Review] SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

[Paper Review] Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

[Paper Review] LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

[Paper Review] Bernini: Latent Semantic Planning for Video Diffusion

[Paper Review] Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

[Paper Review] IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

[Paper Review] Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

[Paper Review] CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

[Paper Review] Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

[Paper Review] Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

[Paper Review] UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

[Paper Review] Visual Reasoning through Tool-supervised Reinforcement Learning

[Paper Review] MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

[Paper Review] Small Vision-Language Models are Smart Compressors for Long Video Understanding

[Paper Review] OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

[Paper Review] Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization

[Paper Review] Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

[Paper Review] Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

[Paper Review] Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

[Paper Review] PLUME: Latent Reasoning Based Universal Multimodal Embedding

[Paper Review] Token Warping Helps MLLMs Look from Nearby Viewpoints

[Paper Review] Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

[Paper Review] Automatic Image-Level Morphological Trait Annotation for Organismal Images

[Paper Review] VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

[Paper Review] VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

[Paper Review] Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models

[Paper Review] CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

[Paper Review] SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

[Paper Review] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

[Paper Review] Toward Cognitive Supersensing in Multimodal Large Language Model

[Paper Review] STEP3-VL-10B Technical Report

[Paper Review] MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning

[Paper Review] Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

[Paper Review] Measuring Epistemic Humility in Multimodal Large Language Models

[Paper Review] InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

[Paper Review] Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

[Paper Review] Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

[Paper Review] Detect Anything via Next Point Prediction
