The test data can be split into three categories:
(1) Novel view and light synthesis (NVLS) where novel views and novel light conditions are tested
(2) Novel view synthesis (NVS) where novel views and trained light conditions are tested
(3) Novel light synthesis (NLS) where trained views and novel light conditions are tested
Switch between "NVLS", "NVS" and "NLS" tabs in the sheet below to view the per-dataset and per-scene results for each baseline.
Below we provide videos for a handful of the NVLS, NVS and NLS video results. Note that we compile the videos per-dataset. Each tile is essentially its own 1080p video, so YouTube will compress this even with 4K settings. You can click here for the original image-based side-by-side results. More videos are available on the main page.
The NVLS results for each ablation experiment are provided in the Google Sheet below. The video results and extended descriptions/analysis of each ablation experiment are provided in the sections below. The first presents analysis of experiments (i-ii). The second presents the statistical approach and statistical outcomes of experiment (iii), where we evaluate the effect of selecting K=1 IBL scenes based on texture statistics. The third presents the additional ablation results, evaluating the impact of reformulating the training of the canonical scene under different practical scenarios.
In the carousel below, we present two important ablations concerning the canonical scene:
(a) The case when no "unlit" references are available, so one of the lit scenes acts as the canonical scene
(b) The case when the canonical scene is left unconstrained by removing the canonical loss from training
As done previously, we test this on Dataset 3 and show the video results below.
...
In the carousel below, we present the outcomes and insights of experiment (iii). This concerns the capture scenario where only 1x IBL scene/background is captured, as was done in experiment (ii.a). However, we question whether a texture could be selected to optimize for VSR quality. To accomplish this, we evaluate two important traits associated with spatially-varying RIC-IBL textures:
(1) The frequency density of the image, i.e. how varied is the frequency distribution of the texture?
(2) The texture regularity/uniformity, i.e. how unique are local image patches?
For (1) we use the energy of the Gabor-wavelet coefficients to indicate the variance and magnitude of texture frequencies in a single metric. Gabor wavelets are typically preferred for natural images over e.g. Fourier or other wavelet schemes.
For (2) we use the homogeneity heuristic from the Grey-Level Co-occurrence Matrix (GLCM). This is another common scheme for assessing the regularity/uniformity of a texture. A set of heuristics is available under the GLCM algorithm, though we find the results are relatively similar, so we select the homogeneity metric for this paper. The GLCM algorithm relies on computing metrics over a set of filter directions and scales, i.e. the direction along the image in which we move to compare the current pixel patch to the next, and how far we move. In this paper we evaluate four directions (0deg, 45deg, 90deg, 135deg) and distances of 1, 2 and 4 pixels. The
GLCM scikit-image package
handles this all for us.
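For completeness, below is a minimal sketch of how the two texture statistics could be computed with scikit-image. The Gabor frequency/orientation bank and the way the energies are aggregated (mean squared magnitude of the complex response, summarised by its mean and variance) are illustrative assumptions, not the exact settings used for the paper; the GLCM directions and distances mirror those listed above, and "texture.png" is a placeholder path.

import numpy as np
from skimage import io, color, img_as_ubyte
from skimage.filters import gabor
from skimage.feature import graycomatrix, graycoprops

# Load the IBL texture and convert to greyscale
gray = color.rgb2gray(io.imread("texture.png"))

# (1) Gabor-wavelet energy over a small bank of frequencies and orientations
frequencies = [0.05, 0.1, 0.2, 0.4]
thetas = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
energies = []
for f in frequencies:
    for theta in thetas:
        real, imag = gabor(gray, frequency=f, theta=theta)
        energies.append(np.mean(real ** 2 + imag ** 2))  # energy of the complex response
gabor_energy = np.mean(energies)   # magnitude of the frequency content
gabor_spread = np.var(energies)    # how varied the frequency distribution is

# (2) GLCM homogeneity over four directions and distances of 1, 2 and 4 pixels
glcm = graycomatrix(
    img_as_ubyte(gray),
    distances=[1, 2, 4],
    angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
    levels=256,
    symmetric=True,
    normed=True,
)
homogeneity = graycoprops(glcm, "homogeneity").mean()

print(f"Gabor energy: {gabor_energy:.4f} (spread {gabor_spread:.4f}), "
      f"GLCM homogeneity: {homogeneity:.4f}")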
The results show that selecting the right texture can have a large impact on both reconstruction and lighting quality. This experiment effectively shows that textures can be statistically selected to boost VSR performance. Carrying this into the VP use-case, this suggests that backgrounds can be designed for VSR purposes. In layman's terms, directors would no longer need to make decisions about VFX and scene lighting before or during filming. If they capture scenes for VSR rather than for the final pixel, important creative decisions could be made later in the process with much more flexibility and fine-grained control, especially considering that VSR inherits the editability and AOVs associated with Gaussian Splatting tools and research.
The study was completed in three stages:
(1) Partner-YY was employed to render edits for WP A and B
(2) We rendered edits for WP C and D
(3) Experts were asked to assess the outputs of WP A-D
The goal of this study was to assess the impact of VSR in practical scenarios, and to compare our results to those produced by consumer-ready generative models. As part of our social
responsibility, this study was not designed to, and should not be used to, infer comparisons between WP A-B and WP D; this is why we only show the paired t-test results for WP D in comparison to WP C.
The study also limits the time that Partner-YY has to edit the renders for WP A-B to one hour. Clearly, this does not consider the time it took to train our VSR model to provide the AOVs for WP B,
though it is important to understand the economy of time in VFX workflows, whereby handing off VSR training liberates time for VFX artists to focus on other tasks/shots. Still, due to this minor
assumption, we choose not to directly compare WP A and C, as we believe it would be misleading.
Ultimately, we chose the single-image relighting/re-coloring task as it was (1) simple enough to simulate in the paper without introducing too many unknowns, and (2) simple enough to complete
within a short time limit. The "unknowns" in VFX workflows pertain to the tools and personal decisions VFX artists may make to handle various working conditions. Hence, for a more complex problem
the sample size (i.e., three professional artists) would not be sufficient to capture the broad range of approaches to relighting. Moreover, as VSR provides AOVs that are not generally acquirable
in traditional VFX pipelines, this introduced uncertainty on how to use the AOVs effectively. So, by simplifying the problem to a single-image relighting task, Partner-YY could develop an
editing pipeline that reduces unknowns, allowing us to make more direct comparisons between WP A, B and C. In the following section we present the VFX workflow that was designed to handle WP A and B.
The workflows for WP A and B were accomplished in Adobe Photoshop. For WP A, Partner-YY opted for the following process, where "optional" indicates tasks to complete if time was available:
(1) Roto-scope/segment the foreground objects
(2) Input the new background texture as a base layer and re-position it
(3) Re-grade the foreground using the HSV and RGB per-channel histograms to adjust the color, saturation and exposure settings to match the background image
(4) Deal with transparent objects using an alpha-channel clipping mask
(5) Apply coarse shadows using a multiply clipping layer on the foreground, and use an additive clipping layer to apply coarse highlights. These should match
the light direction in the background image
(Optional 6) Apply bounce lighting and diffuse reflections
(Optional 7) Do (4-6) with a hard brush and normal clipping layers, for rendering fine details
For WP B, Partner-YY designed the following pipeline (a rough compositing sketch of steps 2-4 is provided after the list):
(1) Use the alpha map to segment the foreground and background.
(2) Layer the intensity map as a multiply clipping layer over the UV color map
(3) Layer the UV color map as an additive clipping layer over the base color maps
(4) Order the base color map on top of the relighting prediction and erase regions in the base color map where relighting is not required
(5) Iteratively smooth the UV color and intensity maps based on the object-level texture and material
(Optional 6) Apply additional highlights and shadows to add contrast effects
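To make steps (2)-(4) concrete, here is a minimal compositing sketch, assuming Photoshop's multiply and additive (linear dodge) blend modes correspond to element-wise product and sum on linear [0, 1] images, and that the erase step is approximated by a user-painted mask. The file names are placeholders, and this is one reading of the layer order rather than the exact layer stack Partner-YY built.

import numpy as np
import imageio.v3 as iio

def load(path):
    """Load an image as float in [0, 1]; expand greyscale maps so they broadcast over RGB."""
    img = iio.imread(path).astype(np.float32) / 255.0
    return img[..., None] if img.ndim == 2 else img

alpha      = load("alpha.png")        # (1) foreground/background segmentation
intensity  = load("intensity.png")    # per-pixel lighting intensity AOV
uv_color   = load("uv_color.png")     # sampled IBL/UV color AOV
base_color = load("base_color.png")   # unlit base color AOV

# (2) Multiply the intensity map over the UV color map
lit_uv = uv_color * intensity
# (3) Add the result over the base color map to form the relighting prediction
relit = np.clip(base_color + lit_uv, 0.0, 1.0)
# (4) Keep the base color where relighting is not required (keep_mask is user painted)
keep_mask = load("keep_mask.png")
composite = keep_mask * base_color + (1.0 - keep_mask) * relit
# (1) Finally, composite the relit foreground over the new background plate
background = load("background.png")
out = alpha * composite + (1.0 - alpha) * background

iio.imwrite("wp_b_composite.png", (out * 255).astype(np.uint8))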
Examples of each workflow are shown in the carousel below. Note that the timelapse captures every 30th frame - it is not based on time. For rotoscoping and histogram editing functions,
we were only able to capture the frame once we committed the edits, so while the frame count for these steps in WP A and B seems short, they are in actuality quite time consuming.
The full set of results is shown in the sheet below. The code used to run the paired t-tests and generate the box plots is also provided below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations
from scipy.stats import ttest_rel

# ---- Load ----
df = pd.read_csv("results_full.csv")

# Drop the non-data header row (ID is blank and cells are a/b/c)
df = df.dropna(subset=["ID"]).copy()

# ---- Likert -> numeric ----
scale = {
    "Disagree": 1,
    "Somewhat Disagree": 2,
    "Unsure": 3,
    "Somewhat Agree": 4,
    "Agree": 5
}
df = df.replace(scale)

# Ensure numeric
for c in df.columns:
    if c != "ID":
        df[c] = pd.to_numeric(df[c], errors="coerce")

models = ["WP A", "WP B", "WP C", "WP D"]
labels = ["WP A", "WP B", "WP C", "WP D"]

# Map subquestion -> column suffix
# a = base, b = .1, c = .2
subq_map = {"A": "", "B": ".1", "C": ".2"}

# ---- Make one plot per subquestion ----
for subq_label, suffix in subq_map.items():
    # participant means for THIS subquestion
    part = df.groupby("ID")[[m + suffix for m in models]].mean()

    print(f"\nPaired t-tests for Question {subq_label}")
    alpha = 0.05
    print(f"Alpha = {alpha:.4f}")
    for m1, m2 in combinations(models, 2):
        x = part[m1 + suffix]
        y = part[m2 + suffix]
        # drop any participants missing either value
        mask = x.notna() & y.notna()
        t, p = ttest_rel(x[mask], y[mask])
        sig = " *" if p < alpha else ""
        print(f"{m1} vs {m2}: t = {t:.3f}, p = {p:.4f}{sig}")

    data = [part[m + suffix].dropna().values for m in models]
    means = [np.mean(d) for d in data]
    xpos = np.arange(1, len(models) + 1)

    plt.figure()
    # Boxplot: IQR box, median line; whiskers set to min/max
    plt.boxplot(
        data,
        labels=labels,
        whis=(0, 100),      # min/max whiskers
        showfliers=False    # hide outlier markers since whiskers already span min/max
    )
    # Overlay mean as a point
    plt.scatter(xpos, means, marker="D", label="Mean")
    plt.ylabel("Score (1–5)")
    plt.title(f"Participant Summary by Model — Question {subq_label}")
    # legend to side
    plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.savefig(f"Raw_{subq_label}_boxplot.png", dpi=300, bbox_inches="tight")
Not all filmmaking is dynamic. In VP, the range of applications exceeds human-centric capture. Hence, there is value in developing
robust static VSR pipelines.
A good use case for this is car advertisement, where cars are placed within a VP volume to simulate driving in various foreign locations. Through
VP, numerous costs linked to transportation, insurance and hiring car-specific
camera hardware can be avoided. Pre-production planning is also simplified as directors no longer need to consider location-specific challenges. Still,
VP sets are expensive to hire, various shot types are still challenging to achieve, and baked lighting on highly reflective and transparent surfaces
(i.e. cars) is non-trivial to edit in post. Hence, VSR provides a clear solution for reducing VP
production costs by only having to capture the scene once, while also supporting downstream video/shot editing. However, we chose this case
because it also highlights flaws in our current VSR approach that need to be addressed in future work. The results from our model show that one of the
largest challenges with VSR is rendering high-resolution reflections in regions where the Gaussian distribution is sparse. This mostly concerns the relighting of
highly reflective transparent surfaces, so it would present major problems when filming cars for advertisement on a VP stage.
We believe this issue mainly lies with our choice of adaptive Gaussian densification scheme. We use the vanilla approach that instances points based on the
gradient changes propagated from the reconstruction loss. Hence, image regions with greater reconstruction errors garner greater attention. However, this
method of instancing new Gaussians relies on splitting and cloning pre-existing Gaussians. This presents problems in regions with sparse Gaussians,
such as those relating to transparent objects. Hence, during training our VSR pipeline is unlikely to populate these spatial regions. This highlights the
need for content-aware densification schemes.
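For context, the vanilla densification decision (as in standard 3DGS implementations) looks roughly like the sketch below; the threshold values are indicative defaults rather than the exact settings we use. Because every new Gaussian is derived from an existing one, regions that start with few Gaussians (e.g. transparent surfaces) offer few candidates to split or clone, regardless of their reconstruction error.

import numpy as np

def densify(positions, scales, grad_accum, grad_count,
            grad_threshold=0.0002, percent_dense=0.01, scene_extent=1.0):
    """One densification pass: clone small high-gradient Gaussians, split large ones.

    positions : (N, 3) Gaussian centres
    scales    : (N, 3) per-axis scales
    grad_accum: (N,)   accumulated view-space positional gradient norms
    grad_count: (N,)   number of iterations each Gaussian was visible
    """
    avg_grad = grad_accum / np.maximum(grad_count, 1)
    needs_densify = avg_grad > grad_threshold               # driven by reconstruction error
    is_large = scales.max(axis=1) > percent_dense * scene_extent

    clone_mask = needs_densify & ~is_large                  # duplicate small Gaussians in place
    split_mask = needs_densify & is_large                   # replace large Gaussians with two children

    keep = ~split_mask
    out_pos = [positions[keep], positions[clone_mask]]
    out_scale = [scales[keep], scales[clone_mask]]

    # Split: children are sampled from the parent Gaussian and shrunk
    parents, parent_scales = positions[split_mask], scales[split_mask]
    for _ in range(2):
        offsets = np.random.normal(scale=parent_scales)
        out_pos.append(parents + offsets)
        out_scale.append(parent_scales / 1.6)

    return np.concatenate(out_pos), np.concatenate(out_scale)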
Beyond this, we believe that the initialization of per-Gaussian texture sample coordinates could be improved. When viewing the novel view synthesis results
with a moving camera, we notice view-dependent flickering artifacts. This occurs when the view-dependent change in texture sample coordinates is sensitive
and the IBL sample scale is low, i.e. when the Gaussian samples a high-resolution mipmap. For IBL textures with high-frequency patterns, this can be
detrimental to viewing smoothness. In part this is a dataset limitation - as we only use 18 cameras for training, sparse-view reconstruction
problems arise, leading to view-dependent overfitting. Thus, future work may want to explore methods of smoothing view-dependent features to suppress
flickering artifacts.
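As a purely illustrative example of what such smoothing could look like, the sketch below applies an exponential moving average to the per-Gaussian, view-dependent sample coordinates along a rendered camera path. This is a hypothetical mitigation, not part of the current pipeline, and the momentum value is arbitrary.

import numpy as np

def smooth_sample_coords(uv_per_frame, momentum=0.8):
    """Temporally smooth per-Gaussian texture sample coordinates.

    uv_per_frame: (T, N, 2) view-dependent UVs for T frames and N Gaussians.
    Returns an array of the same shape with an exponential moving average applied,
    damping the frame-to-frame jumps that show up as flicker on high-frequency textures.
    """
    smoothed = np.empty_like(uv_per_frame)
    smoothed[0] = uv_per_frame[0]
    for t in range(1, uv_per_frame.shape[0]):
        smoothed[t] = momentum * smoothed[t - 1] + (1.0 - momentum) * uv_per_frame[t]
    return smoothed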
Developing a dynamic VSR pipeline requires first understanding the four types of illumination events. First, illumination can change due to changes in
the RIC-IBL texture. Second, illumination can change depending on the viewing angle. Third, illumination can change depending on the global position of
the object within the scene. Fourth, illumination can change depending on the object's local orientation. In the paper, we deal with disentangling the
first two types of lighting events. The third and fourth events arise when dealing with dynamic or editable static scenes.
Dynamic VSR pipelines that deal with all four events could conceivably take a number of approaches:
(1) Modelling 3D neural exposure and texture sampling fields with occlusion awareness
(2) Implicitly modelling per-Gaussian temporal lighting and shadow features
Regarding option (1), future work could rely on signed distance fields (SDFs), as typically found in neural surface reconstruction research. The challenge here
would be to adapt the per-Gaussian texture sampling and exposure parameters (proposed in the paper) into a 3D neural field, such that as an object moves
within the 3D field the lighting response changes. This problem can be simplified by assuming that the VP LED wall is static, hence the 3D lighting
field can be modelled as time-independent. For example, using a neural radiance field (i.e. a coordinate MLP) would only require inputting the current
position to return the gamma and mu lighting parameters proposed in the paper. To deal with local temporal events (i.e. the fourth type of illumination event),
a dynamic signed distance field could then be employed to handle lighting-based occlusions. This would ideally modify the lighting response from the
3D light field to account for dynamic shadows and highlights. For filmmaking, we believe this is the strongest option as, ideally, this solution
models the 3D lighting field independently of the scene composition. Hence, the step towards instancing new objects, removing objects, or moving objects
is trivialized, as the 3D light field and occlusion model should be capable of handling these types of events.
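A rough sketch of such a coordinate MLP is shown below, assuming (for illustration only) that gamma is a scalar exposure-like parameter and mu a 2D texture-coordinate offset per query position; the positional encoding, layer widths and output dimensions are placeholders rather than a proposed architecture.

import math
import torch
import torch.nn as nn

class LightingField(nn.Module):
    """Time-independent field: 3D position -> (gamma, mu) lighting parameters."""

    def __init__(self, num_freqs=6, hidden=128, mu_dim=2):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs                      # sin/cos positional encoding of xyz
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + mu_dim),              # 1 for gamma, mu_dim for the sample offset
        )

    def encode(self, x):
        freqs = 2.0 ** torch.arange(self.num_freqs, device=x.device) * math.pi
        angles = x[..., None] * freqs                    # (N, 3, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

    def forward(self, positions):
        out = self.mlp(self.encode(positions))
        gamma = torch.nn.functional.softplus(out[..., :1])   # keep the exposure term positive
        mu = torch.tanh(out[..., 1:])                         # bounded texture-coordinate offset
        return gamma, mu

# Example query for a batch of Gaussian centres
field = LightingField()
gamma, mu = field(torch.rand(1024, 3))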
On the other hand, option (2) may provide a stronger approach to fitting a dynamic scene for cases where geometry editing is not desired/necessary.
The challenge here would be to discover a per-Gaussian representation that inherits the prior accomplishments of dynamic GS while also handling local
lighting changes. A naive solution may involve the hexplane representation, which could be used to approximate residual changes in
gamma and mu w.r.t. time, such that c' = c + gamma(t)*delta_c(mu(t), I_k^B). We could then implicitly deal with occlusions by introducing an
additional temporal intensity parameter that models c'' = psi(t) * c'. Still, work would need to be done towards designing a training
strategy capable of disentangling/constraining gamma(t) and psi(t). Otherwise, this naive solution would be prone to overfitting the psi(t) parameter,
as it undergoes a larger degree of gradient flow than gamma(t) during backpropagation.
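To illustrate how this naive composition could be wired up, the sketch below treats gamma(t), mu(t) and psi(t) as outputs of a small per-Gaussian temporal head (a stand-in for hexplane-interpolated features) and assumes delta_c is the existing texture-sampling term from the static model. Everything here is hypothetical scaffolding rather than a tested design.

import torch
import torch.nn as nn

class TemporalLightingHead(nn.Module):
    """Per-Gaussian temporal residuals: features(t) -> gamma(t), mu(t), psi(t)."""

    def __init__(self, feat_dim=32, mu_dim=2):
        super().__init__()
        # feat_dim stands in for hexplane features interpolated at (position, t)
        self.head = nn.Sequential(
            nn.Linear(feat_dim + 1, 64), nn.ReLU(),
            nn.Linear(64, 1 + mu_dim + 1),              # gamma, mu offset, psi
        )

    def forward(self, feats, t):
        # t: normalized frame time in [0, 1], shared by all Gaussians in the frame
        t_col = torch.full((feats.shape[0], 1), float(t), device=feats.device)
        out = self.head(torch.cat([feats, t_col], dim=-1))
        gamma = torch.nn.functional.softplus(out[..., :1])
        mu = torch.tanh(out[..., 1:-1])                 # passed to the texture sampler to fetch delta_c
        psi = torch.sigmoid(out[..., -1:])              # temporal intensity / occlusion factor
        return gamma, mu, psi

def relight(c, delta_c, gamma, psi):
    """c'' = psi(t) * (c + gamma(t) * delta_c(mu(t), I_k^B)); delta_c is sampled upstream using mu."""
    return psi * (c + gamma * delta_c)

# Example usage for a batch of Gaussians at one frame time
head = TemporalLightingHead()
gamma, mu, psi = head(torch.rand(1024, 32), t=0.25)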