#Multimodal LLM

37 posts

[Paper Review] From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

[Paper Review] Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

[Paper Review] VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

[Paper Review] UniWeTok: An Unified Binary Tokenizer with Codebook Size 2^{128} for Unified Multimodal Large Language Model

[Paper Review] REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

[Paper Review] MoRL: Reinforced Reasoning for Unified Motion Understanding and Generation

[Paper Review] Innovator-VL: A Multimodal Large Language Model for Scientific Discovery

[Paper Review] The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

[Paper Review] Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

[Paper Review] SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL

[Paper Review] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

[Paper Review] JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

[Paper Review] Thinking with Programming Vision: Towards a Unified View for Thinking with Images

[Paper Review] HiconAgent: History Context-aware Policy Optimization for GUI Agents

[Paper Review] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

[Paper Review] M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

[Paper Review] MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model

[Paper Review] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

[Paper Review] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

[Paper Review] LatticeWorld: A Multimodal Large Language Model-Empowered Framework for Interactive Complex World Generation

[Paper Review] VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

[Paper Review] Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

[Paper Review] L^2M^3OF: A Large Language Multimodal Model for Metal-Organic Frameworks

[Paper Review] DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

[Paper Review] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

[Paper Review] LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models

[Paper Review] BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

[Paper Review] Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
