Converting a single image to a 3D scene with geometry, materials, and lighting is a challenging problem. While geometry reconstruction from a single view has been extensively studied, material recovery for the single-photo-to-scene task remains underexplored. Recent advances in differentiable procedural materials, inverse rendering, and texture generation could potentially be applied to this task, but they have not been systematically evaluated on a common benchmark.
In this project, we establish a comprehensive benchmark for material recovery in the single-image-to-scene task. We assume ground-truth 3D geometry as input to isolate material estimation from geometric error. We evaluate three families of methods inspired by recent state-of-the-art approaches in inverse rendering, texture generation, and single-image-to-scene conversion.
Our results show that single-view inverse rendering baselines outperform procedural material baselines (19.40 vs 13.60 in PSNR for albedo on original views), highlighting the strong potential of methods based on single-view inverse rendering for material recovery in the single-image-to-scene task. We will release the full dataset, evaluation code, and baseline implementations to support future work.
MatEval consists of 180 scenes from 3D-FRONT and 4 high-quality scenes from Bitterli. Each scene provides one original camera view plus four novel views (5° camera-center rotations), with ground-truth RGB, albedo, and part-level segmentation. Compared with OpenRooms, our scenes use professionally designed 3D-FUTURE furniture, are more densely populated (13.2 objects/scene vs. 9.4), and look more photorealistic.
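As an illustration of what a camera-center rotation means, the sketch below rotates a camera's orientation about its own center while leaving its position fixed. The choice of four directions (yaw left/right, pitch up/down), the axis convention, and the function names are assumptions made for this example; the benchmark's actual view-sampling procedure may differ.

```python
import math

def rot_x(deg):
    """Rotation about the camera's local x-axis (pitch), as a 3x3 nested list."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(deg):
    """Rotation about the camera's local y-axis (yaw), as a 3x3 nested list."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def novel_view_rotations(cam_to_world_rot, deg=5.0):
    """Return four camera-to-world rotations, each turned `deg` degrees about
    the camera's own axes. The camera position is not touched, so the rotation
    is about the camera center. The four directions are an assumption."""
    local_turns = [rot_y(deg), rot_y(-deg), rot_x(deg), rot_x(-deg)]
    # Post-multiplying applies each turn in the camera's local frame.
    return [matmul(cam_to_world_rot, turn) for turn in local_turns]
```

Because only the orientation changes, every novel view shares the original camera position and differs from the original viewing direction by the chosen angle.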
We evaluate three families of baselines:
MonoIR variants use ColorfulShading, RGBX, or Marigold to estimate camera-space materials, which are then projected onto the 3D geometry. All baselines run under a three-stage framework: material initialization → light initialization → joint optimization.
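The projection of camera-space material estimates onto known geometry can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: it assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in camera coordinates (x right, y down, z forward), and it skips occlusion testing. The function name `lift_albedo_to_points` is hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Color = Tuple[float, float, float]
Point = Tuple[float, float, float]

def lift_albedo_to_points(
    points_cam: Sequence[Point],
    albedo: Sequence[Sequence[Color]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Optional[Color]]:
    """Give each camera-space 3D point the albedo of the pixel it projects to.

    Returns None for points behind the camera or outside the image.
    No visibility check is done, so occluded points would also receive a color.
    """
    height, width = len(albedo), len(albedo[0])
    lifted: List[Optional[Color]] = []
    for x, y, z in points_cam:
        if z <= 0.0:  # behind the camera, cannot be observed
            lifted.append(None)
            continue
        # Pinhole projection to pixel indices (pixel u covers [u, u + 1)).
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            lifted.append(albedo[v][u])
        else:
            lifted.append(None)
    return lifted
```

Points that come back as `None` are the surfaces the input view does not cover; how those are filled in is left out of this sketch.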
Single-view inverse rendering substantially outperforms procedural-material optimization. On albedo recovery from the original camera view (3D-FRONT), the best MonoIR baseline reaches PSNR 19.40 versus 13.60 for the strongest DiffProcMat baseline. With ground-truth albedo as input, the same lifting pipeline reaches 32.98, showing how much headroom remains for the 2D estimators themselves.
| Method family | Best baseline | Albedo PSNR ↑ |
|---|---|---|
| DiffProcMat | PSDR-Room | 13.60 |
| MonoIR + NN | RGBXa | 19.40 |
| MonoIR + TEXGen | RGBXa | 19.71 |
| Upper bound | GT albedo → NN | 32.98 |
3D-FRONT, original camera view, unaligned albedo. See the paper for full results across RGB, novel views, and scale-aligned albedo.
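For readers unfamiliar with the metric, the sketch below shows how albedo PSNR can be computed, with and without a scale alignment step. It assumes pixel values in [0, 1] (peak of 1.0) and images flattened to plain sequences of floats. The least-squares scalar alignment is one common choice and is an assumption here; the paper's exact alignment procedure may differ.

```python
import math
from typing import List, Sequence

def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred: Sequence[float], target: Sequence[float], peak: float = 1.0) -> float:
    """PSNR in dB. `peak` is the maximum possible pixel value (assumed 1.0)."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / err)

def scale_align(pred: Sequence[float], target: Sequence[float]) -> List[float]:
    """Rescale `pred` by the scalar s that minimizes ||s * pred - target||^2.

    This removes a global brightness ambiguity before scoring. It is an
    assumed alignment, not necessarily the one used in the paper.
    """
    denom = sum(p * p for p in pred)
    if denom == 0.0:
        return list(pred)  # all-zero prediction cannot be rescaled
    s = sum(p * t for p, t in zip(pred, target)) / denom
    return [s * p for p in pred]
```

Calling `psnr(pred, target)` gives the unaligned score, while `psnr(scale_align(pred, target), target)` gives a scale-aligned one. The aligned score is never lower for the same prediction, because s = 1 is among the candidate scales.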
@inproceedings{yang2026mateval,
title = {{MatEval}: Evaluating Indoor {3D} Scene Material Recovery from a Single Image},
author = {Yang, Dongchen and Savva, Manolis},
booktitle = {Proceedings of the Conference on Robots and Vision (CRV)},
year = {2026}
}