[논문리뷰] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

2026년 3월 15일수정: 2026년 3월 15일

링크: 논문 PDF로 바로 열기

저자: Chong Xia, Kai Zhu, et al.

1. Key Terms & Definitions

SimRecon : Real-world 비디오로부터 simulation-ready하고 compositional한 3D scene을 재구성하기 위한 제안 프레임워크입니다.
Perception-Generation-Simulation Pipeline : cluttered input video를 physically assembled scene으로 변환하는 3단계 파이프라인으로, scene-level semantic reconstruction (Perception), single-object generation (Generation), 그리고 simulator 내에서의 asset assembly (Simulation)로 구성됩니다.
Active Viewpoint Optimization (AVO) : Perception 단계에서 Generation 단계로의 전환을 위한 브리징 모듈로, 3D 공간에서 optimal projected image를 actively search하여 single-object completion을 위한 조건을 확보합니다.
Scene Graph Synthesizer (SGS) : Generation 단계에서 Simulation 단계로의 전환을 위한 브리징 모듈로, physical simulator에서 scene을 constructive하게 조립하는 과정을 guiding하며, 객체 간의 supportive 및 attached relation을 모델링합니다.
Compositional 3D Scene Reconstruction : holistic scene 표현 대신 object-centric representation을 생성하는 방식으로, 각 객체에 대한 완전한 geometry와 명확한 boundary를 목표로 하여 simulation 및 interaction에 적합합니다.

2. Motivation & Problem Statement

기존 3D scene reconstruction 방법론들은 대개 scene을 holistic 하게 표현하여 시각적 fidelity는 뛰어나지만, 완전한 object geometry와 명확한 object boundary가 부족하여 simulation 및 interaction에 부적합하다는 근본적인 한계점을 가집니다. 특히, real-world video에서 compositional 3D reconstruction 을 시도하는 최근 연구들은 다음과 같은 문제에 직면합니다. 첫째, small, large, 또는 occluded object에 대해 complete하고 plausible한 geometry를 생성하기 어렵습니다. 이는 종종 input image의 heuristic view selection이나 3D representation에 의존하기 때문입니다. 둘째, 최종 결과가 simulation-ready scene이 아닌 단순한 visual representation에 머물러 physical implausibility 를 야기하는 "real-to-sim" gap이 발생합니다. 셋째, semantic reconstruction과 object generation을 위한 specialized method들이 파이프라인에 tightly coupled되어 있어 advanced approach를 쉽게 활용하기 어렵습니다. 이러한 문제들은 cluttered하고 complex한 scene에서 generated asset 의 visual infidelity와 final scene의 physical implausibility로 이어집니다.

3. Method & Key Results

저자들은 이러한 문제들을 해결하기 위해 SimRecon 이라는 "Perception-Generation-Simulation" 파이프라인을 제안합니다. 이 프레임워크는 input video로부터 scene-level semantic reconstruction을 수행하여 3D scene을 복원하고 개별 object를 구분한 다음, Active Viewpoint Optimization (AVO) 을 통해 actively optimized된 image projection을 조건으로 single-object generation을 수행합니다. 마지막으로, Scene Graph Synthesizer (SGS) 가 구성한 scene graph를 기반으로 physical simulator에서 이러한 asset들을 조립합니다

Figure 1: We propose SimRecon, a framework for reconstructing simulation-ready, compositional 3D scenes from real-world videos. Our method introduces a "Perception-Generation-Simulation" pipeline which transforms cluttered input videos into physically assembled scenes. To ensure visually faithful generated assets and physically plausible final scenes, we propose two bridging strategies: Active View Optimization to acquire optimal generation conditions, and Constructive Assembly to follow a native building principle.

AVO 는 정보 이론적 관점에서 optimal view projection 문제를 정보 획득(information gaining) 작업으로 모델링하며, 3D Gaussian Splatting 렌더링 파이프라인의 미분 가능성(differentiability)을 활용하여 accumulated opacity A(v)를 최대화하는 방향으로 viewpoint v를 최적화합니다. 이와 함께 Ldepth regularization term을 도입하여 viewpoint가 object surface에 너무 가깝게 collapse하는 극단적인 경우를 방지합니다. SGS 는 DBSCAN clustering을 통해 object instance들을 spatial region으로 분할하고, 각 region에서 optimal observation viewpoint vk를 얻은 후 Vision-Language Model (VLM) 을 사용하여 local graph Gk를 추론합니다. 이 local graph들은 iterative merging 및 conflict resolution 과정을 거쳐 global scene graph G를 형성하며, 이는 hierarchical, physics-based assembly를 통해 simulator에서 physically plausible한 scene construction을 guiding합니다.

ScanNet dataset에 대한 광범위한 실험 결과는 SimRecon의 우수한 성능을 입증합니다

Table 1: Quantitative Comparison for Compositional 3D Reconstruction. We evaluate our method against single-view (Gen3DSR [2], SceneGen [36]) and scene-level (DPRecon [41], InstaScene [74]) baselines. The comparison includes metrics for geometric fidelity (CD, F-Score, NC), novel-view rendering quality (PSNR, SSIM, LPIPS, MUSIQ), and inference time.

. compositional 3D reconstruction 측면에서, SimRecon은 InstaScene 대비 CD 를 6.90 에서 4.34 로, F-Score 를 49.69% 에서 62.65% 로, NC 를 82.55% 에서 87.37% 로 개선하여 geometric fidelity에서 상당한 우위를 보였습니다. 렌더링 품질에서도 PSNR 24.43 , SSIM 0.924 , LPIPS 0.153 를 달성하여 기존 방법론들을 능가했습니다. 특히, AVO 는 heuristic view selection이나 canonical view sampling보다 최적화된 projection view를 제공하여 occluded object에서도 높은 fidelity의 asset generation을 가능하게 했습니다

Figure 4: Qualitative comparison of viewpoint sampling strategies. We uniformly use a single image as the condition and utilize the same generative model.

. 또한, SGS 는 물리적 상호 의존성을 모델링하여 MetaScenes 와 달리 floating이나 penetration 없는 physically plausible 한 scene construction을 보장합니다 [Figure 5].

4. Conclusion & Impact

본 연구는 real-world video로부터 object-centric 하고 simulation-ready 한 3D scene을 재구성하기 위한 SimRecon 프레임워크를 성공적으로 제안했습니다. 핵심 기여는 Active Viewpoint Optimization (AVO) 과 Scene Graph Synthesizer (SGS) 라는 두 가지 브리징 모듈을 통해 기존 파이프라인 조합에서 발생하는 visual infidelity와 physical implausibility 문제를 해결한 것입니다. AVO 는 optimal projection을 actively 탐색하여 고품질의 generative condition을 보장하며, SGS 는 real-world의 constructive principle을 반영하여 물리적으로 그럴듯한 scene assembly를 안내합니다. ScanNet dataset에 대한 실험 결과는 SimRecon이 reconstruction quality와 physical adherence 모두에서 기존 state-of-the-art 방법론 대비 우수한 성능을 달성함을 검증했습니다. 이 연구는 arbitrary video로부터 diverse simulation environment를 생성할 잠재력을 열어, embodied AI 및 interactive scene generation 분야에 중요한 시사점을 제공합니다.

⚠️ 알림: 이 리뷰는 AI로 작성되었습니다.

Review 의 다른글

이전글 [논문리뷰] OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
현재글 : [논문리뷰] SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
다음글 [논문리뷰] Spend Less, Reason Better: Budget-Aware Value Tree Search for LLM Agents