Latest Posts

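[PaddleOCR] Fix the MCP server to parse all OCR result batches

Fixes a bug where only the first batch of local OCR results was processed, so that the full set of results is now parsed correctly.

#PaddleOCR #MCP #Bug Fix #OCR #Python

March 20, 2026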

[Ultralytics] Optimizing the keypoint batch loop in Pose Loss with vectorized operations

Replaces the for loop that groups keypoints per batch during Pose model training with scatter_add-based vectorization.

#Ultralytics #YOLO #Pose Estimation #Vectorization #PyTorch

March 20, 2026
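
As a rough illustration of the idea in the Ultralytics entry above, the sketch below groups per-instance keypoints into a padded per-image tensor, first with a Python loop and then without one. It is a minimal sketch, not the actual Ultralytics code: the function names, tensor shapes, and the use of a stable sort to derive each instance's slot are all assumptions made for illustration. Only `scatter_add_` for counting instances per image follows the technique the post names.

```python
import torch


def group_keypoints_loop(batch_idx, keypoints, batch_size):
    """Baseline (hypothetical): pad keypoints per image with a Python for loop."""
    counts = [int((batch_idx == i).sum()) for i in range(batch_size)]
    max_n = max(counts)
    out = torch.zeros(batch_size, max_n, *keypoints.shape[1:], dtype=keypoints.dtype)
    for i in range(batch_size):
        kpts = keypoints[batch_idx == i]  # instances belonging to image i
        out[i, : len(kpts)] = kpts
    return out


def group_keypoints_vectorized(batch_idx, keypoints, batch_size):
    """Same result without the per-image loop (illustrative sketch)."""
    batch_idx = batch_idx.long()
    # Count instances per image with scatter_add_.
    counts = torch.zeros(batch_size, dtype=torch.long)
    counts.scatter_add_(0, batch_idx, torch.ones_like(batch_idx))
    max_n = int(counts.max())
    # A stable sort keeps the original order of instances within each image.
    order = torch.argsort(batch_idx, stable=True)
    sorted_idx = batch_idx[order]
    # Slot of each instance inside its image: global rank minus the image's start offset.
    starts = torch.cumsum(counts, 0) - counts
    pos = torch.arange(len(batch_idx)) - starts[sorted_idx]
    out = torch.zeros(batch_size, max_n, *keypoints.shape[1:], dtype=keypoints.dtype)
    out[sorted_idx, pos] = keypoints[order]
    return out


# Six instances spread over three images, each with 2 keypoints of (x, y, visibility).
batch_idx = torch.tensor([0, 1, 0, 2, 1, 0])
keypoints = torch.arange(6 * 2 * 3, dtype=torch.float32).reshape(6, 2, 3)
looped = group_keypoints_loop(batch_idx, keypoints, batch_size=3)
vectorized = group_keypoints_vectorized(batch_idx, keypoints, batch_size=3)
```

Both functions return a tensor of shape `(batch_size, max_instances, num_keypoints, 3)`; the vectorized version trades the loop for one sort and one indexed assignment.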

[axolotl] Fixing the Tensor Parallelism batch_size calculation bug: switching to dp_world_size

Analyzes a fix for a bug where batch_size and total_num_steps were miscalculated under Tensor Parallelism. The calculation now uses dp_world_size, and parameterized tests were added.

#Axolotl #Tensor Parallelism #Distributed Training #Bug Fix

March 20, 2026
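
The arithmetic behind the Tensor Parallelism entry above can be sketched as follows. This is a minimal illustration, not axolotl's implementation: the function names and the step formula are assumptions. The underlying point is that ranks within one tensor-parallel group consume the same data, so only the data-parallel world size should multiply the batch size.

```python
import math


def dp_world_size(world_size: int, tensor_parallel_size: int = 1) -> int:
    """Number of data-parallel replicas (hypothetical helper)."""
    if world_size % tensor_parallel_size != 0:
        raise ValueError("world_size must be divisible by tensor_parallel_size")
    return world_size // tensor_parallel_size


def effective_batch_size(micro_batch_size: int, gradient_accumulation_steps: int,
                         world_size: int, tensor_parallel_size: int = 1) -> int:
    """Samples consumed per optimizer step across all data-parallel replicas."""
    replicas = dp_world_size(world_size, tensor_parallel_size)
    return micro_batch_size * gradient_accumulation_steps * replicas


def total_num_steps(num_samples: int, num_epochs: int, batch_size: int) -> int:
    """Optimizer steps needed to cover the dataset (assumed formula)."""
    return math.ceil(num_samples * num_epochs / batch_size)


# 8 GPUs with tensor_parallel_size=2 form 4 data-parallel replicas.
correct_bs = effective_batch_size(2, 4, world_size=8, tensor_parallel_size=2)
naive_bs = 2 * 4 * 8  # multiplying by the full world size overcounts
correct_steps = total_num_steps(1000, 1, correct_bs)
naive_steps = total_num_steps(1000, 1, naive_bs)
```

With these numbers the naive calculation doubles the batch size (64 instead of 32) and halves the step count (16 instead of 32), which is the kind of mismatch that would also skew anything derived from the step count, such as a learning-rate schedule.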

[axolotl] Improving the Gemma 3 QLoRA config: freezing the Vision Tower and removing model_type

Analyzes a pattern in the Gemma 3 QLoRA training config that removes the unnecessary explicit model_type and freezes the Vision Tower via unfrozen_parameters.

#Axolotl #Gemma3 #QLoRA #Fine-tuning #Configuration

March 20, 2026

[Axolotl] ScatterMoE LoRA optimization: benchmarks, kernel splitting, and autograd integration

An analysis of adding a benchmark tool to the ScatterMoE LoRA Triton kernels and optimizing automatic fused/split forward selection and autograd integration for large-expert models.

#Axolotl #ScatterMoE #LoRA #Triton #MoE #Benchmark #GPU #Performance

March 19, 2026

[Paper Review] VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Recent MLLMs are evolving into agentic problem solvers through integration with external tools, but they face a persistent bottleneck in accurately executing and effectively composing diverse tools for complex visual tasks.

#Review #Multimodal Large Language Models #Visual Tool Chaining #Agentic Models #Benchmark #OpenCV #Compositional Reasoning #Tool-use Evaluation

March 19, 2026

[Paper Review] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Current instruction-guided video editing models struggle to balance fine-grained semantic modifications with faithful motion preservation.

#Review #Instruction-Guided Video Editing #Diffusion Models #Semantic Anchoring #Motion Alignment #Factorized Pre-training #Zero-shot Learning #Temporal Consistency

March 19, 2026

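[Paper Review] Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Current evaluation of the mathematical and scientific reasoning abilities of language models (LMs) relies mainly on simplified answer formats such as numeric values or multiple-choice questions.

March 19, 2026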

[Paper Review] Prompt-Free Universal Region Proposal Network

Existing Region Proposal Network (RPN) and Open-Vocabulary Object Detection (OVD) methods tend to rely on external prompts, such as exemplar images, predefined categories, or textual descriptions, to identify potential objects.

#Review #Prompt-Free #Region Proposal Network #Universal Object Detection #Cross-Domain Generalization #Learnable Embedding #Self-Prompting #Centerness-Guided

March 19, 2026

[Paper Review] ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

Multi-turn LLM agents are becoming increasingly important for solving complex, interactive tasks, and Reinforcement Learning (RL) plays a key role in improving long-horizon behavior.

#Review #Multi-turn LLM Agents #Reinforcement Learning #Rollout-as-a-Service #Training-Rollout Decoupling #Sandbox Environments #HPC #Token-in/Token-out #Scalability

March 19, 2026

[Paper Review] OSM-based Domain Adaptation for Remote Sensing VLMs

Vision-Language Models (VLMs) for remote sensing face a persistent problem: despite the abundance of satellite and aerial imagery, high-quality domain-specific image-text annotations are scarce and expensive to produce.

March 19, 2026

[Paper Review] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Reinforcement Learning (RL) has emerged as a core part of LLM post-training, contributing to advances in reasoning, agentic capabilities, and real-world problem solving.

#Review #LLM Post-Training #Cascade RL #Multi-Domain On-Policy Distillation #Mixture-of-Experts #Reasoning #Agentic Capabilities #Competitive Programming #Mathematical Olympiad

March 19, 2026

[Paper Review] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Reconstructing articulated 3D objects from a single image remains a fundamental challenge, because the object's geometry, part structure, and motion parameters must be inferred jointly from limited visual evidence.

#Review #Monocular 3D Reconstruction #Articulated Objects #Progressive Structural Reasoning #Kinematic Estimation #PartNet-Mobility #End-to-End

March 19, 2026

[Paper Review] Memento-Skills: Let Agents Design Agents

Modern Large Language Models (LLMs) perform remarkably well across diverse scenarios through few-shot learning, supervised fine-tuning, and post-training, but practical use requires parameter optimization, which demands massive amounts of data and compute.

#Review #LLM Agents #Continual Learning #Skill Learning #Reinforcement Learning #Memory-based Agents #Agent Design #Read-Write Reflective Learning #Offline RL

March 19, 2026

[Paper Review] Matryoshka Gaussian Splatting

For practical deployment of 3D Gaussian Splatting (3DGS), an LoD capability that renders a scene at adjustable fidelity from a single model is critical.

#Review #3D Gaussian Splatting #Level of Detail (LoD) #Continuous LoD #Matryoshka Representation Learning #Stochastic Budget Training #Neural Rendering

March 19, 2026

[Paper Review] MOSS-TTS Technical Report

Text-to-Speech (TTS) now behaves like a foundation model and is evolving into a broad speech generation paradigm: it must generalize across diverse speakers, languages, styles, and acoustic conditions, be controllable, support low-latency synthesis, and remain stable on long-form content.

#Review #Speech Generation #Foundation Model #Audio Tokenizer #Autoregressive Modeling #Voice Cloning #Duration Control #Multilingual TTS

March 19, 2026

[Paper Review] Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Multimodal Large Language Models (MLLMs) have made significant progress in connecting vision and language, but their spatial understanding and viewpoint-aware reasoning remain lacking.

#Review #Vision-Language Models #3D Reasoning #Language-based Localization #Spatial Understanding #Situation Modeling #Global Layout Reconstruction #Monocular Video

March 19, 2026

[Paper Review] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Recent Multimodal Large Language Models (MLLMs) show impressive semantic capability, but suffer from "spatial blindness" in fine-grained geometric reasoning and physical dynamics.

#Review #Video Generation Models #3D Priors #Scene Understanding #Spatial Reasoning #Multimodal Large Language Models (MLLMs) #Latent World Simulator #Adaptive Gated Fusion #Generative AI

March 19, 2026

[Paper Review] FASTER: Rethinking Real-Time Flow VLAs

Real-time execution is critical when deploying Vision-Language-Action (VLA) models on real robots.

#Review #Vision-Language-Action (VLA) Models #Real-Time Robotics #Action Chunking #Reaction Latency #Flow Matching #Horizon-Aware Schedule (HAS) #Time to First Action (TTFA)

March 19, 2026

[Paper Review] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

The recent shift from encoder-based architectures to decoder-based LLM embeddings has improved performance, but current research has two major limitations.

#Review #Multilingual Embedding #LLM #Matryoshka Representation Learning #Knowledge Distillation #Model Pruning #MTEB Benchmark #Low-resource Languages #Open-source

March 19, 2026