Deformable Gaussian Splatting (GS) accomplishes photorealistic dynamic 3-D reconstruction from dense multi-view video (MVV) by learning to deform a canonical GS representation. However, in filmmaking, tight budgets can result in sparse camera configurations, which limits state-of-the-art (SotA) methods when capturing complex dynamic features. To address this issue, we introduce an approach that splits the canonical Gaussians and deformation field into foreground and background components using a sparse set of masks for frames at t=0. Each representation is separately trained with different loss functions during canonical pre-training. Then, during dynamic training, different parameters are modeled for each deformation field following common filmmaking practices. The foreground stage contains diverse dynamic features, so changes in color, position and rotation are learned. The background, which contains the film crew and equipment, is typically dimmer and less dynamic, so only changes in point position are learned. Experiments on 3-D and 2.5-D entertainment datasets show that our method produces SotA qualitative and quantitative results: up to 3 dB higher PSNR with half the model size on 3-D scenes. Unlike the SotA, and without the need for dense mask supervision, our method also produces segmented dynamic reconstructions, including transparent and dynamic textures.
As per the ablations in the paper, we compare with:
(1) Unified Λ as in 4DGS, (2) Canonical training as in all other plane-based methods, (3) 3DGS/4DGS densification strategy, (4) Temporal Opacity function in STG.
An area of research that has recently caught our eye is SaRO-GS, which tackles the issue of sampling hex-planes
with multi-scale grid resolutions - we use a single resolution, while other methods fuse, via concatenation, features sampled from a set of multi-scale feature planes.
This also sparked an internal debate for us between using bilinear and nearest-neighbour interpolation for grid sampling.
For any temporal event (concerning the XT, YT and ZT planes) the temporal axis' resolution
is typically set to half the number of frames. This allows multiple samples (at different time steps) to inhabit the same grid-cell at various intervals along the cell's temporal axis.
Thus, sampling this axis with linear interpolation (bilinearly in 2-D) means that we get a relatively smooth transition between timesteps, leading to a nearly-smooth temporal feature space - this is
good for us. However, for space-only features (concerning XY, XZ and YZ planes) we have to think a bit harder. The obvious point to make is that proximal points
(in our canonical space) share different portions of the same features due to the bilinear interpolation operation, which is good for points that pertain to the same
object or limb for humans. However, what if two nearby points pertain to different objects? For small grid resolutions this becomes a problem, yet with large grid
resolutions, objects pertaining to the same body no longer share features smoothly. Furthermore, in sparse-view settings, the initial geometry may not be reliable, so proximal points pertaining
to different objects may be initially mixed together (before hopefully later separating).
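To make the trade-off concrete, below is a minimal sketch of how a single hex-plane could be sampled with bilinear versus nearest-neighbour interpolation. The plane shapes, the [-1, 1] coordinate normalisation and the helper name are our assumptions for illustration, not the exact implementation of any method discussed here.

```python
import torch
import torch.nn.functional as F

def sample_plane(plane, coords, mode="bilinear"):
    """
    plane:  (1, C, H, W) learnable feature grid, e.g. the XT plane where W indexes
            space (x) and H indexes time (t, at roughly half the frame count).
    coords: (N, 2) point coordinates already normalised to [-1, 1], ordered (x, t).
    mode:   'bilinear' blends the four neighbouring cells (smooth transitions),
            'nearest' snaps every point to a single cell's feature.
    """
    grid = coords.view(1, 1, -1, 2)                       # grid_sample expects (1, H_out, W_out, 2)
    feats = F.grid_sample(plane, grid, mode=mode, align_corners=True)  # (1, C, 1, N)
    return feats.squeeze(0).squeeze(1).T                  # (N, C)

plane = torch.randn(1, 32, 50, 128, requires_grad=True)   # e.g. a 100-frame clip -> 50 temporal cells
coords = torch.rand(1000, 2) * 2 - 1                       # (x, t) in [-1, 1]
smooth_feats = sample_plane(plane, coords, mode="bilinear")
hard_feats = sample_plane(plane, coords, mode="nearest")
```

With 'bilinear', two samples falling in the same cell at slightly different coordinates receive smoothly blended features; with 'nearest' they collapse to identical features, which is exactly the trade-off discussed above for the space-only planes.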
Thus, we have to think a bit harder. SaRO-GS explores a solution that essentially simulates subsampling a canonical Gaussian based on the covariance of a canonical point splatted onto
the hex-planes. They also do some work to avoid large canonical Gaussians having influence on the high-resolution grids (reminder: hex-planes fuse features sampled from multi-scale planes)
by restricting the subsampling of multi-scale planes based on Gaussian size. Their method works - we use something similar in our approach, whereby we uniformly subsample 12 additional points at fixed
Mahalanobis distances from the center, derived from the canonical rotation and scale parameters. As we use a single feature-plane scale, we do not enforce multi-scale filtering based on point scale as in SaRO-GS,
though this is perhaps not as necessary as one may think, considering that each distant sub-sample produced by a large canonical point affects the final feature less, as it accounts for only 1/13th of the final
feature values, such that some fine features can be spatially preserved. Still, our solution is far from ideal and requires further investigation.
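For clarity, a hypothetical sketch of that sub-sampling step follows. We only state that 12 extra points are drawn at fixed Mahalanobis distances using the canonical rotation and scale; the specific directions (icosahedron vertices), the radius argument and the function name here are our own illustrative choices.

```python
import torch

PHI = (1 + 5 ** 0.5) / 2
# 12 unit directions (icosahedron vertices) in the Gaussian's local frame.
_ICO = torch.tensor([[0.0, s1, s2 * PHI] for s1 in (-1, 1) for s2 in (-1, 1)]
                    + [[s1, s2 * PHI, 0.0] for s1 in (-1, 1) for s2 in (-1, 1)]
                    + [[s1 * PHI, 0.0, s2] for s1 in (-1, 1) for s2 in (-1, 1)])
_ICO = _ICO / _ICO.norm(dim=-1, keepdim=True)

def subsample_points(mu, R, scale, radius=1.0):
    """
    mu:    (N, 3) canonical centres.
    R:     (N, 3, 3) rotation matrices built from the canonical quaternions.
    scale: (N, 3) anisotropic canonical scales.
    Returns (N, 13, 3): each centre plus 12 offsets at a fixed Mahalanobis
    distance `radius`, i.e. offsets of R @ diag(scale) @ (radius * direction).
    """
    d = radius * _ICO.to(mu.device)                  # (12, 3) local directions
    local = d.unsqueeze(0) * scale.unsqueeze(1)      # (N, 12, 3) stretch per axis
    world = torch.einsum("nij,nkj->nki", R, local)   # rotate into world space
    return torch.cat([mu.unsqueeze(1), mu.unsqueeze(1) + world], dim=1)
```

Each of the 13 locations is then projected onto the hex-planes and the sampled features averaged, which is why a single distant sub-sample from a large Gaussian only contributes 1/13th of the final feature.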
Tied to this is the interpretation of the canonical Gaussian and deformation field as entangled components. We note that for most hex-plane GS methods (not ours), the canon is expected to learn the
optimal position of points for sampling G, as well as acting as a basis vector such that x' = x + deformation. In our approach, we try to tie the canon to t=0 through the proposed canonical training
strategy. Then, when training the dynamic scene, the canonical points should organize themselves in the same way as 4DGS and other approaches. Though, due to the initialisation, there is still some
link tying the canon to t=0. This doesn't necessarily present a dilemma for us to investigate, though it does get our minds thinking about better and less entangled approaches to treating the canon
w.r.t. the dynamic
field. One idea we would like to try out is to treat the deformation feature outputs (the changes in color, rotation and position) as being normalized between 0 and 1 (e.g. using a Cosine activation on the deformation
decoder outputs).
Considering that both rotation and
color already lead to values between 0 and 1 (either after decoding or during the SH to RGB operation), we think there may be benefit in forcing point position to adhere to similar conditions. This may
level the importance of various dynamic Gaussian parameters during gradient backpropagation. As one may expect, a change in position (w.r.t. the scene scale) may be much larger than a change in color or
rotation, so backpropagation will affect color and positional changes unevenly. This would make learning position challenging and, as we know from countless reports, dynamic GS does have an underlying
issue learning scene geometry. By forcing x' = deformation, we may need to rescale to have x' = k * deformation, such that the final position can move more than 1 unit in space. Considering that
for most filmmaking applications dynamic motion and effects are contained within the stationary stage, re-scaling may be as simple as multiplying the normalized positional output by the stage's extent. These
measurements are easy to generate by, for example, analysing the initial point cloud.
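A minimal sketch of what this could look like is given below. It assumes a cosine activation remapped to [0, 1] and an axis-aligned stage extent computed from the initial point cloud; the padding margin, the shifted form lo + u * (hi - lo) and the module name are all our assumptions rather than anything we have validated.

```python
import torch
import torch.nn as nn

def stage_extent(init_points, margin=0.05):
    """Axis-aligned bounds of the initial (t=0) point cloud, slightly padded."""
    lo = init_points.min(dim=0).values
    hi = init_points.max(dim=0).values
    pad = margin * (hi - lo)
    return lo - pad, hi + pad                            # each (3,)

class NormalisedPositionHead(nn.Module):
    """Decodes a deformation feature into a position constrained to the stage bounds."""
    def __init__(self, feat_dim, lo, hi):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 3)
        self.register_buffer("lo", lo)
        self.register_buffer("hi", hi)

    def forward(self, feat):
        # cos(.) lies in [-1, 1]; remap to [0, 1] so position behaves like the colour
        # and rotation heads, then rescale by the stage extent (the 'k' above).
        u = 0.5 * (torch.cos(self.linear(feat)) + 1.0)   # (N, 3) in [0, 1]
        return self.lo + u * (self.hi - self.lo)         # x' = k * deformation (plus offset)

init_points = torch.randn(5000, 3)                       # e.g. the t=0 SfM point cloud
lo, hi = stage_extent(init_points)
head = NormalisedPositionHead(feat_dim=64, lo=lo, hi=hi)
x_prime = head(torch.randn(5000, 64))                    # positions guaranteed inside the stage
```

This would also keep the colour, rotation and position heads on a comparable numerical scale, which is the gradient-balancing argument made above.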
This may also open up new doors. We tinkered with the idea of a deformation field that did not update the initial Gaussian G but instead replaced it, such that x' = deformation instead of
x' = x + deformation. This allows us to simplify the aforementioned interpretations of the canon and deformation field, as the canon acts only to sample the deformation field, and the deformation
field no longer acts to predict residuals but instead directly produces the Gaussian field at t. While this didn't work, it also didn't completely fail, and it has pushed us to think about other potential
avenues for re-interpreting the relatively confusing hex-plane field representations. We believe the ideal solution lies with a grounded interpretation that allows for disentangling visual changes
(color) from geometric changes (position and rotation). We experimented with learning a different set of (XT, YT and ZT) planes for the color feature and found little quantitative change but some very
small visual improvements - which are really only evident under a microscope, so do not motivate any paper. Along the same lines, we also looked at modelling the temporal opacity parameters (h, omega and mu)
using the (XY, XZ and YZ) planes and decoding the feature from the three space-only planes. We found more promising results, but in the Splatography paper it made more sense to leave them as explicit
parameters, as intended in STG, as we wanted a direct comparison with our proposed modifications to the temporal opacity function.
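For reference, a hypothetical sketch of decoding (h, omega, mu) from the fused space-only features is below. The STG-style Gaussian form h * exp(-omega * (t - mu)^2), the sigmoid/softplus activations and the small MLP are assumptions for illustration; neither STG's exact parameterisation nor our modified opacity function is reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalOpacityHead(nn.Module):
    """Decodes per-point (h, omega, mu) from fused XY/XZ/YZ plane features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 3))        # -> (h, omega, mu)

    def forward(self, space_feats, t):
        h_raw, w_raw, mu = self.mlp(space_feats).unbind(-1)
        h = torch.sigmoid(h_raw)                          # peak opacity in (0, 1)
        omega = F.softplus(w_raw)                         # temporal width must stay positive
        return h * torch.exp(-omega * (t - mu) ** 2)      # per-point opacity at time t
```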
While it's unexplored, this latter attempt at research does indicate that benefit can come from re-interpreting the space-only and space-time planes, such that, again, visual and geometric changes can be
disentangled. One approach we rather like is the STG method, which assumes no visual change until rendering, whereby the renderer is assumed to be in control of the dynamic visual changes. This method,
however, treats all dynamic Gaussian parameters as explicit and disconnected components, so working this into a hex-plane method could be rather challenging - though perhaps using the hex-planes to only
model geometry and using 2-D (1-to-1) feature-to-image decoders (as in STG) to model view-dependent and visual changes could prove a powerful solution. The only caveat is that the STG rendering
approach is heavy and limited by one's ability to work with CUDA code.
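To illustrate the kind of decoder we have in mind, here is a hypothetical per-pixel (1-to-1) feature-to-image head in the spirit of STG's lightweight rendering network. The channel counts, the 1x1-convolution design and the use of per-pixel ray directions are our assumptions and not STG's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureToImageDecoder(nn.Module):
    """Turns a splatted feature image into view-dependent colour, pixel by pixel."""
    def __init__(self, feat_ch=9, hidden=32):
        super().__init__()
        # 1x1 convolutions act strictly per pixel (1-to-1), so no spatial mixing occurs.
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + 3, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=1), nn.Sigmoid())

    def forward(self, splatted_feats, view_dirs):
        # splatted_feats: (B, feat_ch, H, W) features rasterised by the splatter
        # view_dirs:      (B, 3, H, W) per-pixel ray directions
        return self.net(torch.cat([splatted_feats, view_dirs], dim=1))  # (B, 3, H, W) RGB
```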
@article{...}