#Multimodal AI

114 posts

[Paper Review] Advancing Creative Physical Intelligence in Large Multimodal Models

[Paper Review] Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

[Paper Review] LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

[Paper Review] VecGlypher: Unified Vector Glyph Generation with Language Models

[Paper Review] NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

[Paper Review] SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning

[Paper Review] Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

[Paper Review] Selective Training for Large Vision Language Models via Visual Information Gain

[Paper Review] MMA: Multimodal Memory Agent

[Paper Review] UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

[Paper Review] DeepSight: An All-in-One LM Safety Toolkit

[Paper Review] P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

[Paper Review] Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

[Paper Review] Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

[Paper Review] SkyReels-V3 Technique Report

[Paper Review] FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

[Paper Review] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

[Paper Review] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

[Paper Review] LTX-2: Efficient Joint Audio-Visual Foundation Model

[Paper Review] DreamOmni3: Scribble-based Editing and Generation

[Paper Review] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

[Paper Review] Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

[Paper Review] VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

[Paper Review] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

[Paper Review] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

[Paper Review] TV2TV: A Unified Framework for Interleaved Language and Video Generation

[Paper Review] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

[Paper Review] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

[Paper Review] Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

[Paper Review] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

[Paper Review] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

[Paper Review] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

[Paper Review] TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

[Paper Review] MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

[Paper Review] GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models

[Paper Review] Music Flamingo: Scaling Music Understanding in Audio Language Models

[Paper Review] UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

[Paper Review] VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

[Paper Review] RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

[Paper Review] OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

[Paper Review] X-Streamer: Unified Human World Modeling with Audiovisual Interaction

[Paper Review] Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

[Paper Review] Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

[Paper Review] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

[Paper Review] Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided Role-playing Agents

[Paper Review] EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

[Paper Review] AToken: A Unified Tokenizer for Vision

[Paper Review] PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

[Paper Review] Lost in Embeddings: Information Loss in Vision-Language Models

[Paper Review] Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding

[Paper Review] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

[Paper Review] Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

[Paper Review] CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

[Paper Review] AudioStory: Generating Long-Form Narrative Audio with Large Language Models

[Paper Review] Explain Before You Answer: A Survey on Compositional Visual Reasoning

[Paper Review] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

[Paper Review] ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?

[Paper Review] MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence

[Paper Review] A Survey on Diffusion Language Models

[Paper Review] Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

[Paper Review] MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

[Paper Review] Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

[Paper Review] Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

[Paper Review] Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMS

[Paper Review] LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

[Paper Review] Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

[Paper Review] UniVideo: Unified Understanding, Generation, and Editing for Videos

[Paper Review] SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

[Paper Review] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

[Paper Review] WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

[Paper Review] A Definition of AGI

[Paper Review] Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

[Paper Review] MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning

[Paper Review] BLIP3o-NEXT: Next Frontier of Native Image Generation