MatEval: Evaluating Indoor 3D Scene Material Recovery from a Single Image

**Overview of our problem statement.** The inputs are a single-view RGB image I₀ and the geometry of the scene G. To focus on material recovery, we use ground-truth geometry, factoring out geometric errors. The outputs are the reconstructed lighting L(p,e) and either the recovered procedural materials M(θ) or the image-based materials UV^{a, m, r} of the scene, which we render to images I_DiffProcMat and I_MonoIR, respectively.

Abstract

Converting a single image to a 3D scene with geometry, materials, and lighting is a challenging problem. While geometry reconstruction from a single view has been extensively studied, material recovery for the single-photo-to-scene task remains underexplored. Recent advances in differentiable procedural materials, inverse rendering, and texture generation could potentially be applied to this task, but they have not been systematically evaluated on a common benchmark.

In this project, we establish a comprehensive benchmark for material recovery in the single-image-to-scene task. We assume ground-truth 3D geometry as input to isolate material estimation from geometric error. We evaluate three families of methods inspired by recent state-of-the-art approaches in inverse rendering, texture generation, and single-image-to-scene.

Our results show that single-view inverse rendering baselines outperform procedural material baselines (albedo PSNR of 19.40 vs. 13.60 on original views), highlighting the strong potential of methods based on single-view inverse rendering for material recovery in the single-image-to-scene task. We will release the full dataset, evaluation code, and baseline implementations to support future work.

Benchmark Dataset

MatEval consists of 180 scenes from 3D-FRONT and 4 high-quality scenes from Bitterli. Each scene provides one original camera view plus four novel views (5° camera-center rotations), with ground-truth RGB, albedo, and part-level segmentation. Compared with OpenRooms, our scenes use professionally designed 3D-FUTURE furniture, are more densely populated (13.2 objects/scene vs. 9.4), and look more photorealistic.
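As an illustration of how such novel views could be produced, the sketch below rotates a camera about its own center by 5° while keeping its position fixed. The choice of rotation axes (yaw left/right and pitch up/down), the camera-to-world pose convention, and the function names are assumptions made for this example; the text above only states that the four novel views are 5° camera-center rotations.

```python
import numpy as np

def rotation_about_axis(axis, angle_deg):
    """Rodrigues' formula: rotation matrix for angle_deg about the given axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    theta = np.deg2rad(angle_deg)
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def novel_view_poses(cam_to_world, angle_deg=5.0):
    """Return four 4x4 camera-to-world poses rotated in place about the camera center.

    Assumed layout (hypothetical): yaw by +/- angle about the camera's local
    y axis, then pitch by +/- angle about its local x axis.
    """
    base = np.array(cam_to_world, dtype=float)  # copy of the original pose
    poses = []
    for axis in ([0, 1, 0], [1, 0, 0]):
        for sign in (1.0, -1.0):
            R_local = rotation_about_axis(axis, sign * angle_deg)
            pose = base.copy()
            # Right-multiplying rotates in the camera's own frame; the
            # translation column (the camera center) is left untouched.
            pose[:3, :3] = base[:3, :3] @ R_local
            poses.append(pose)
    return poses
```

Because only the rotation block changes, all four views share the original camera center, which is what distinguishes an in-place rotation from an orbit around the scene.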

**Comparison between MatEval (ours) and OpenRooms.** MatEval includes scenes curated from the 3D-FRONT (left) and Bitterli (middle) datasets. Compared with OpenRooms (right), our dataset features furniture with more diverse and photorealistic material textures, leading to richer visual variety across indoor scenes.

Benchmarked Baselines

We evaluate three families of baselines:

Differentiable Procedural Materials (DiffProcMat)
Monocular Inverse Rendering + Nearest Neighbor (MonoIR + NN)
Monocular Inverse Rendering + TEXGen (MonoIR + TEXGen)

MonoIR variants use ColorfulShading, RGBX, or Marigold to estimate camera-space materials, which are then projected onto the 3D geometry. All baselines run under a three-stage framework: material initialization → light initialization → joint optimization.

**Overview of the pipeline for lifting 2D materials to 3D.** We lift 2D estimates {a, m, r} to 3D in UV space for each object i. For the nearest neighbor approach, we (1) map the estimated 2D material properties {a, m, r} to a point cloud p_partial^{a,m,r} and fill unobserved regions by nearest neighbor lookup to get p_nn^{a,m,r}, and (2) project p_nn^{a,m,r} to UV space as UV_nn^{a,m,r}. For the texture generation approach, we (1) map the albedo and object mask from camera space to UV space as UV_partial^{a} and UV_partial^{M_i}, (2) obtain the position map and mask map UV_full^{M_i} from the 3D model of the object *G_i*, and (3) apply TEXGen to fill unobserved regions and obtain UV_texgen^{a}.
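Step (1) of the nearest neighbor approach can be sketched as follows: observed pixels are unprojected to a partial point cloud that carries their estimated material values, and every unobserved surface point copies the value of its closest observed point. This is a minimal NumPy sketch under assumed conventions (pinhole intrinsics `K`, a camera-space depth map, brute-force search); the function names are hypothetical, and the projection to UV space in step (2) is omitted.

```python
import numpy as np

def unproject_pixels(depth, K, mask):
    """Lift masked pixels to camera-space 3D points using a pinhole model."""
    v, u = np.nonzero(mask)            # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]    # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]    # (v - cy) * z / fy
    return np.stack([x, y, z], axis=1), (v, u)

def nearest_neighbor_fill(observed_pts, observed_vals, query_pts):
    """Give each query point the material value of its closest observed point."""
    # Pairwise squared distances, shape (num_query, num_observed).
    diff = query_pts[:, None, :] - observed_pts[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    return observed_vals[np.argmin(d2, axis=1)]

# Toy usage: two observed pixels with albedo values, two unobserved points.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 2), 2.0)
mask = np.array([[True, False], [False, True]])
albedo_2d = np.zeros((2, 2, 3))
albedo_2d[0, 0] = [0.1, 0.2, 0.3]
albedo_2d[1, 1] = [0.9, 0.8, 0.7]

p_partial, (rows, cols) = unproject_pixels(depth, K, mask)
vals_partial = albedo_2d[rows, cols]
unobserved = np.array([[-0.9, -1.0, 2.0], [0.1, 0.0, 2.1]])
vals_nn = nearest_neighbor_fill(p_partial, vals_partial, unobserved)
```

The brute-force distance matrix keeps the example dependency-free but grows with the product of the two point counts; for realistic point clouds a spatial index such as a k-d tree would be the usual replacement.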

Key Finding

Single-view inverse rendering substantially outperforms procedural-material optimization. On albedo recovery from the original camera view (3D-FRONT), the best MonoIR + NN baseline reaches a PSNR of 19.40 (19.71 with TEXGen), versus 13.60 for the strongest DiffProcMat baseline. With ground-truth albedo as input, the same nearest neighbor lifting pipeline reaches 32.98, which shows how much headroom remains for the 2D estimators themselves.

Method family	Best baseline	Albedo PSNR ↑
DiffProcMat	PSDR-Room	13.60
MonoIR + NN	RGBX^a	19.40
MonoIR + TEXGen	RGBX^a	19.71
Upper bound (GT albedo → NN)		32.98

3D-FRONT, original camera view, unaligned albedo. See the paper for full results across RGB, novel views, and scale-aligned albedo.
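For reference, PSNR on albedo maps can be computed as in the sketch below. It assumes images are float arrays with values in [0, 1] and that scale alignment means multiplying the prediction by a single least-squares scalar before scoring. The alignment used in the paper may differ (for example, it could be per channel), so `scale_align` is an illustrative assumption rather than the benchmark's evaluation code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")            # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def scale_align(pred, target):
    """Scale pred by the scalar s that minimizes ||s * pred - target||^2."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    denom = np.sum(pred * pred)
    # Closed-form least-squares solution; fall back to s = 1 for an all-zero prediction.
    s = np.sum(pred * target) / denom if denom > 0 else 1.0
    return s * pred

# Toy usage: a prediction that is uniformly too dark.
target = np.full((4, 4, 3), 0.5)
pred = np.full((4, 4, 3), 0.4)
unaligned_db = psnr(pred, target)                      # about 20 dB
aligned_db = psnr(scale_align(pred, target), target)   # much higher after alignment
```

In the toy case the error is purely a global brightness offset, so alignment removes almost all of it. Reporting both numbers separates overall scale errors from errors in texture and color.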

Qualitative Results

**Qualitative comparisons on the original camera view on 3D-FRONT scenes.** “PSDR-Room^*” indicates the PSDR-Room baseline using meshes segmented by PartField. Superscripts denote which material channels are estimated by MonoIR methods. MonoIR methods (last 3 columns) recover diverse material textures and avoid the local minima issue seen in DiffProcMat (2nd and 3rd columns).

**Qualitative comparison of aligned albedo for novel view renderings on 3D-FRONT scenes.** Darkened regions are previously observed regions. TEXGen generates reasonable textures for unobserved regions.

Limitations

**Mirror region failures.** All baselines struggle with mirrors. In this scene from the Bitterli dataset, for the mirror on the closet, DiffProcMat assigns a median color and MonoIR produces a “baked in” appearance.

**TEXGen noise issues.** The darkened regions are previously observed regions. TEXGen generates noise in featureless regions (e.g., ceilings).

BibTeX

@inproceedings{yang2026mateval,
  title     = {{MatEval}: Evaluating Indoor {3D} Scene Material Recovery from a Single Image},
  author    = {Yang, Dongchen and Savva, Manolis},
  booktitle = {Proceedings of the Conference on Robots and Vision (CRV)},
  year      = {2026}
}
