Review

[논문리뷰] Medal S: Spatio-Textual Prompt Model for Medical Segmentation

의료 영상 분할에서 다양한 모달리티와 해부학적 변이로 인한 문제를 해결하고, 기존 모델의 해상도 불일치 및 순차 처리 비효율성을 극복하는 것이 목표입니다.

#Review #Medical Segmentation #Foundation Model #Spatio-Textual Prompts #3D Convolution #Multi-modal Imaging #Dynamic Resampling #Parallel Inference #Iterative Refinement

2025년 11월 19일

[논문리뷰] MHR: Momentum Human Rig

본 논문은 ATLAS 모델의 골격/형상 분리 패러다임 에 Momentum 라이브러리에서 영감을 받은 유연하고 현대적인 리그 및 자세 보정 시스템을 결합하여, 산업 및 AR/VR 파이프라인에 통합 가능한 표현력 있고 해부학적으로 타당한 파라메트릭 인체 모델(MHR) 을 제안합니다.

#Review #Parametric Body Model #Human Animation #Character Rigging #Pose Correctives #Skeletal Decoupling #Computer Graphics #AR/VR

2025년 11월 19일

[논문리뷰] Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

본 논문은 고품질의 일관되고 제어 가능한 이미지 및 비디오 생성을 위한 AI/ML 분야의 핵심 과제를 해결하고자 합니다. 특히, 최신 이미지 및 10초 비디오 합성을 위한 Kandinsky 5.0 이라는 최첨단 파운데이션 모델 제품군을 개발하여 최고 수준의 품질과 운영 효율성을 달성하는 것을 목표로 합니다.

#Review #Image Generation #Video Generation #Diffusion Models #Flow Matching #Diffusion Transformer #NABLA #RLHF #Supervised Fine-tuning

2025년 11월 19일

[논문리뷰] Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

본 연구는 흉부 X-ray(CXR)에서 병변 분할 모델의 제한적인 타겟 레이블 수와 전문가 수준의 상세 텍스트 입력 의존성을 해결하고자 합니다.

#Review #Medical Imaging #Chest X-ray #Lesion Segmentation #Vision-Language Models #Instruction Following #Data Generation #MIMIC-CXR

2025년 11월 19일

[논문리뷰] FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI

본 논문은 기존 VLN(Vision-and-Language Navigation) 시스템의 정적인 지시, 사회적 의도 모델링 부족, 비현실적인 상호작용 환경 등의 한계를 극복하고자 합니다.

#Review #Embodied AI #Vision-and-Language Navigation (VLN)#LLM-driven Simulation #Human-Agent Interaction #Closed-Loop #Benchmark Dataset #Social Cognition

2025년 11월 19일

[논문리뷰] Aligning Generative Music AI with Human Preferences: Methods and Challenges

본 논문은 생성형 음악 AI 시스템이 계산적 최적화와 인간의 미적 감각 사이의 근본적인 격차로 인해 발생하는 문제를 해결하고, 인간의 미묘한 음악적 선호도에 더욱 잘 부합하도록 정렬하는 방법을 모색합니다.

#Review #Generative Music AI #Preference Alignment #Reinforcement Learning from Human Feedback (RLHF)#Direct Preference Optimization (DPO)#Inference-Time Optimization #Music Generation #Human-Computer Interaction

2025년 11월 19일

[논문리뷰] ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries

본 논문은 기존 비디오 챕터링 방법론이 짧고 거친 주석에 의해 제한되어 장시간 비디오의 미묘한 전환에 대한 일반화가 어렵다는 문제를 해결하고자 합니다.

#Review #Video Chaptering #Long-form Video Understanding #Large Language Models #Multimodal Learning #Hierarchical Summarization #Video Segmentation #Reinforcement Learning #Dataset Creation

2025년 11월 19일

[논문리뷰] Φeat: Physically-Grounded Feature Representation

기존의 자기 지도 시각 백본이 고수준의 의미론적 특징과 저수준의 물리적 요소를 혼합하여 물리적 추론을 방해하는 문제를 해결하고자 합니다.

#Review #Self-supervised Learning #Physically-Grounded Features #Material Representation #Intrinsic Scene Understanding #Vision Transformer #Synthetic Data #Contrastive Learning

2025년 11월 18일

[논문리뷰] VIDEOP2R: Video Understanding from Perception to Reasoning

기존 비디오 RFT 프레임워크가 인식(perception)과 추론(reasoning) 과정을 단일 절차로 처리하여 신용 할당(credit assignment)이 모호해지고 오류 수정 효율성이 떨어진다는 문제를 해결하고자 합니다.

#Review #Video Understanding #Reinforcement Fine-Tuning (RFT)#Large Video Language Models (LVLMs)#Perception and Reasoning #Chain-of-Thought (CoT)#Process-Aware Learning #Policy Optimization #Credit Assignment

2025년 11월 18일

[논문리뷰] TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

Large Vision-Language Models (LVLMs)가 시각적 인코더의 정보 병목 현상 과 로컬 단축키 로 인해 전역 시각 정보를 제대로 인지하지 못하는 문제를 해결하는 것이 목표입니다.

#Review #LVLM Evaluation #Global Visual Perception #Topological Properties #Shortcut-Free Benchmark #Visual Bottleneck #Multimodal AI #Synthetic Data

2025년 11월 18일

[논문리뷰] REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

본 논문은 기존 텍스트 기반 자기 성찰(self-reflection) 메커니즘 이 풍부하고 동적인 시각 정보를 처리하는 데 한계가 있어, 장문 비디오 이해(long-form video understanding) 태스크에서 성능 저하를 겪는 문제를 해결하고자 합니다.

#Review #Multimodal Reasoning #Long-Form Video Understanding #Self-Reflection #Reinforcement Learning #Tool-Augmented MLLMs #Visual Rethinking #Video Question Answering #Causal Attribution

2025년 11월 18일

[논문리뷰] Proactive Hearing Assistants that Isolate Egocentric Conversations

본 논문은 사용자의 명시적인 프롬프트 없이도 대화 상대를 자동으로 식별하고 분리하여 다른 방해 음성을 억제하는 선제적(proactive) 보청 보조 장치 를 개발하는 것을 목표로 합니다. 이는 복잡한 다자간 대화 환경에서 실시간으로 작동하며, 착용자의 자율적인 대화 참여를 지원하는 데 중점을 둡니다.

#Review #Proactive Hearing Assistant #Egocentric Audio Processing #Speech Separation #Turn-taking Dynamics #Dual-Model Architecture #Real-time Inference #Wearable Devices #Dialogue Modeling

2025년 11월 18일

[논문리뷰] Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

본 논문은 기존의 단일(monolithic) VLM(Vision-Language Model)이 가진 정밀성, 결정론적 제어 및 복합적 시각 작업 처리 능력의 한계를 극복하고자 합니다.

#Review #Visual Agent #Multimodal Perception #Tool-Augmented LLM #Agentic AI #Visual Reasoning #Computer Vision #Structured Outputs #ReAct Framework

2025년 11월 18일

[논문리뷰] OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

옴니모달 대규모 언어 모델(OmniLLMs)이 직면한 오디오-비디오 토큰의 과도한 수 와 주의 메커니즘의 2차 복잡성 으로 인한 계산 및 메모리 병목 현상 을 해결하는 것을 목표로 합니다. 특히, 기존의 단일 모달 압축 방법으로는 멀티모달 토큰의 공동 압축 요구사항을 충족하기 어렵다는 문제를 해결하고자 합니다.

#Review #Omnimodal LLMs #Token Compression #Audio-Video Understanding #Dynamic Pruning #Inference Acceleration #Spatio-Temporal Compression #Large Language Models

2025년 11월 18일

[논문리뷰] Mitigating Label Length Bias in Large Language Models

논문은 대규모 언어 모델(LLMs)이 다중 토큰 클래스 레이블을 예측할 때 발생하는 '레이블 길이 편향(label length bias)' 문제를 해결하는 것을 목표로 합니다.

#Review #Large Language Models #Label Bias #Calibration #In-Context Learning #Text Classification #Multi-token Labels #Label Length Bias #Multiple Choice QA

2025년 11월 18일

[논문리뷰] MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

기존 Large Vision-Language Models (LVLMs) 강건성 벤치마크들이 환각이나 오해의 소지가 있는 텍스트 입력에만 집중하고, 시각적 이해 평가에서 오해의 소지가 있는 시각적 입력 을 간과하는 문제를 해결하는 것이 목표입니다.

#Review #LVLM Robustness #Misleading Visual Inputs #VQA Benchmark #Visual Perception #Visual Reasoning #MVI-Sensitivity #Multimodal AI

2025년 11월 18일

[논문리뷰] Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework

본 연구는 Extreme Multi-label Classification (XMC)에서 Large Language Models (LLMs) 의 잠재력을 효과적으로 활용하고, 시각적 정보 를 효율적으로 통합하여 성능을 향상하는 것을 목표로 합니다.

#Review #Extreme Multi-label Classification (XMC)#Large Language Models (LLMs)#Multi-modal Learning #Dual-decoder Learning #Vision Transformers #Contrastive Learning #Prompt Engineering

2025년 11월 18일

[논문리뷰] LLM-Powered Fully Automated Chaos Engineering: Towards Enabling Anyone to Build Resilient Software Systems at Low Cost

본 논문은 카오스 엔지니어링(CE)의 수동적이고 노동 집약적인 단계(가설 설정, 실험 계획, 시스템 재구성)를 자동화하여, 누구나 저비용으로 탄력적인 소프트웨어 시스템을 구축할 수 있도록 하는 것을 목표로 합니다.

#Review #Chaos Engineering #Large Language Models #System Resilience #Kubernetes #Software Automation #AI Agents #Fault Injection

2025년 11월 18일

[논문리뷰] Error-Driven Scene Editing for 3D Grounding in Large Language Models

본 논문은 현재 3D-LLMs 가 3D 환경에서 언어를 시각적 및 공간적 요소에 정확하게 연결하지 못하는 문제점을 해결하고자 합니다.

#Review #3D Grounding #3D-LLMs #Scene Editing #Counterfactual Augmentation #Error-Driven Learning #Spatial Reasoning #Visual Grounding

2025년 11월 18일

[논문리뷰] Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

본 논문은 최신 비디오 생성 모델 이 단순한 시각적 품질을 넘어 실제 세계의 물리 법칙과 연속성을 이해하며 추론하는 Chain-of-Frames (CoF) 추론 능력 을 체계적으로 평가할 수 있는 벤치마크의 부재를 해결하는 것을 목표로 합니다.

#Review #Generative Visual Reasoning #Chain-of-Frames (CoF)#Video Generation Models #World Simulators #AI Benchmarking #Cognitive Reasoning #VLM Evaluation

2025년 11월 18일