#Multi-modal LLM

4 posts

[Paper Review] KlingAvatar 2.0 Technical Report

This study aims to address the inefficiency, temporal drift, quality degradation, and prompt misalignment that arise when generating long-duration, high-resolution avatar videos.

#Review #Avatar Generation #Video Diffusion #Multi-modal LLM #Long-duration Video #High-resolution Video #Lip Synchronization #Multi-character Control #Spatio-temporal Cascade

December 15, 2025

[Paper Review] Fara-7B: An Efficient Agentic Model for Computer Use

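This paper aims to address the shortage of high-quality interaction data for training computer use agents (CUAs) and to develop an efficient agent model that can run on-device with modest compute. In doing so, it seeks to broaden the commercial viability of CUA technology and pave the way for general-purpose personal digital assistants.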

#Review #Computer Use Agents #Synthetic Data Generation #Multi-modal LLM #On-device AI #Web Automation #Pixel-in Action-out #Fara-7B #WebTailBench

November 25, 2025

[Paper Review] MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks

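This study addresses the limitations of existing instruction-based image editing (IBIE) methods, in particular the performance degradation on complex editing tasks caused by limited dataset diversity and quality.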

#Review #Instruction-based Image Editing #Dataset #Multi-modal LLM #Image Generation #Style Transfer #Multi-task Learning #Fine-tuning

September 19, 2025

[Paper Review] Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

This paper proposes Lumina-DiMOO, an omni-style multimodal generation and understanding model that can process data of multiple modalities (text and images).

#Review #Multi-modal LLM #Discrete Diffusion #Image Generation #Image Understanding #Omni-modal #Interactive Retouching #Generative AI #Reinforcement Learning

October 9, 2025