Deformable Gaussian Splatting (GS) accomplishes photorealistic dynamic 3-D reconstruction from dense multi-view video (MVV) by learning to deform a canonical GS representation. However, in filmmaking, tight budgets can result in sparse camera configurations, which limits state-of-the-art (SotA) methods when capturing complex dynamic features. To address this issue, we introduce an approach that splits the canonical Gaussians and deformation field into foreground and background components using a sparse set of masks for frames at t=0. Each representation is separately trained with different loss functions during canonical pre-training. Then, during dynamic training, different parameters are modeled for each deformation field following common filmmaking practices. The foreground stage contains diverse dynamic features, so changes in color, position and rotation are learned. The background, which contains the film crew and equipment, is typically dimmer and less dynamic, so only changes in point position are learned. Experiments on 3-D and 2.5-D entertainment datasets show that our method produces SotA qualitative and quantitative results: up to 3 dB higher PSNR with half the model size on 3-D scenes. Unlike the SotA, and without the need for dense mask supervision, our method also produces segmented dynamic reconstructions, including transparent and dynamic textures.
As per the ablations in the paper, we compare with:
(1) Unified Λ as in 4DGS, (2) Canonical training as in all other plane-based methods, (3) 3DGS/4DGS densification strategy, (4) Temporal Opacity function in STG.
An area of research that has recently caught our eye is SaRO-GS, which tackles the issue of sampling hex-planes
with multi-scale grid resolutions - we use a single resolution, while other methods fuse, via concatenation, features sampled from a set of multi-scale feature planes.
This also sparked an internal debate for us between using bilinear and nearest-neighbour interpolation for grid sampling.
For any temporal event (concerning the XT, YT and ZT planes) the temporal axis' resolution
is typically set to half the number of frames. This allows multiple samples (at different time steps) to inhabit the same grid-cell at various intervals along the cell's temporal axis.
Thus, sampling this axis with linear interpolation (bilinearly in 2-D) means that we get a relatively smooth transition between timesteps, leading to a nearly-smooth temporal feature space - this is
good for us. However, for space-only features (concerning XY, XZ and YZ planes) we have to think a bit harder. The obvious point to make is that proximal points
(in our canonical space) share different portions of the same features due to the bilinear interpolation operation, which is good for points that pertain to the same
object or limb for humans. However, what if two nearby points pertain to different objects? For small grid resolutions this becomes a problem, yet with large grid
resolutions, objects pertaining to the same body no longer share features smoothly. Furthermore, in sparse-view settings, the initial geometry may not be reliable, so proximal points pertaining
to different objects may be initially mixed together (before hopefully later separating).
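To make the trade-off concrete, below is a minimal sketch of how a single hex-plane could be sampled with bilinear versus nearest-neighbour interpolation. The plane shapes, the [-1, 1] coordinate normalisation and the helper name are our assumptions for illustration, not the exact implementation of any method discussed here.

```python
import torch
import torch.nn.functional as F

def sample_plane(plane, coords, mode="bilinear"):
    """
    plane:  (1, C, H, W) learnable feature grid, e.g. the XT plane where W indexes
            space (x) and H indexes time (t, at roughly half the frame count).
    coords: (N, 2) point coordinates already normalised to [-1, 1], ordered (x, t).
    mode:   'bilinear' blends the four neighbouring cells (smooth transitions),
            'nearest' snaps every point to a single cell's feature.
    """
    grid = coords.view(1, 1, -1, 2)                       # grid_sample expects (1, H_out, W_out, 2)
    feats = F.grid_sample(plane, grid, mode=mode, align_corners=True)  # (1, C, 1, N)
    return feats.squeeze(0).squeeze(1).T                  # (N, C)

plane = torch.randn(1, 32, 50, 128, requires_grad=True)   # e.g. a 100-frame clip -> 50 temporal cells
coords = torch.rand(1000, 2) * 2 - 1                       # (x, t) in [-1, 1]
smooth_feats = sample_plane(plane, coords, mode="bilinear")
hard_feats = sample_plane(plane, coords, mode="nearest")
```

With 'bilinear', two samples falling in the same cell at slightly different coordinates receive smoothly blended features; with 'nearest' they collapse to identical features, which is exactly the trade-off discussed above for the space-only planes.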
Thus, we have to think a bit harder. SaRO-GS explores a solution that essentially simulates subsampling a canonical Gaussian based on the covariance of a canonical point splatted onto
the hex-planes. They also do some work to avoid large canonical Gaussians having influence on the high-resolution grids (reminder: hex-planes fuse features sampled from multi-scale planes)
by restricting the subsampling of multi-scale planes based on Gaussian size. Their method works - we use something similar in our approach, whereby we uniformly subsample 12 additional points at fixed
Mahalanobis distances from the center, derived from the canonical rotation and scale parameters. As we use a single feature-plane scale, we do not enforce multi-scale filtering based on point scale as in SaRO-GS,
though this is perhaps not as necessary as one may think, considering that each distant sub-sample produced by a large canonical point affects the final feature less, as it accounts for only 1/13th of the final
feature values, such that some fine features can be spatially preserved. Still, our solution is far from ideal and requires further investigation.
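For clarity, a hypothetical sketch of that sub-sampling step follows. We only state that 12 extra points are drawn at fixed Mahalanobis distances using the canonical rotation and scale; the specific directions (icosahedron vertices), the radius argument and the function name here are our own illustrative choices.

```python
import torch

PHI = (1 + 5 ** 0.5) / 2
# 12 unit directions (icosahedron vertices) in the Gaussian's local frame.
_ICO = torch.tensor([[0.0, s1, s2 * PHI] for s1 in (-1, 1) for s2 in (-1, 1)]
                    + [[s1, s2 * PHI, 0.0] for s1 in (-1, 1) for s2 in (-1, 1)]
                    + [[s1 * PHI, 0.0, s2] for s1 in (-1, 1) for s2 in (-1, 1)])
_ICO = _ICO / _ICO.norm(dim=-1, keepdim=True)

def subsample_points(mu, R, scale, radius=1.0):
    """
    mu:    (N, 3) canonical centres.
    R:     (N, 3, 3) rotation matrices built from the canonical quaternions.
    scale: (N, 3) anisotropic canonical scales.
    Returns (N, 13, 3): each centre plus 12 offsets at a fixed Mahalanobis
    distance `radius`, i.e. offsets of R @ diag(scale) @ (radius * direction).
    """
    d = radius * _ICO.to(mu.device)                  # (12, 3) local directions
    local = d.unsqueeze(0) * scale.unsqueeze(1)      # (N, 12, 3) stretch per axis
    world = torch.einsum("nij,nkj->nki", R, local)   # rotate into world space
    return torch.cat([mu.unsqueeze(1), mu.unsqueeze(1) + world], dim=1)
```

Each of the 13 locations is then projected onto the hex-planes and the sampled features averaged, which is why a single distant sub-sample from a large Gaussian only contributes 1/13th of the final feature.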
Tied to this is the interpretation of the canonical Gaussian and deformation field as entangled components. We note that for most hex-plane GS methods (not ours), the canon is expected to learn the
optimal position of points for sampling G, as well as acting as a basis vector such that x' = x + deformation. In our approach, we try to tie the canon to t=0 through the proposed canonical training
strategy. Then, when training the dynamic scene, the canonical points should organize themselves in the same way as 4DGS and other approaches. Though, due to the initialisation, there is still some
link tying the canon to t=0. This doesn't necessarily present a dilemma for us to investigate, though it does get our minds thinking about better and less entangled approaches to treating the canon
w.r.t. the dynamic
field. One idea we would like to try out is to treat the deformation feature outputs (the changes in color, rotation and position) as being normalized between 0 and 1 (e.g. using a Cosine activation on the deformation
decoder outputs).
Considering that both rotation and
color already lead to values between 0 and 1 (either after decoding or during the SH to RGB operation), we think there may be benefit in forcing point position to adhere to similar conditions. This may
level the importance of various dynamic Gaussian parameters during gradient backpropagation. As one may expect, a change in position (w.r.t. the scene scale) may be much larger than a change in color or
rotation, so backpropagation will affect color and positional changes unevenly. This would make learning position challenging and, as we know from countless reports, dynamic GS does have an underlying
issue learning scene geometry. By forcing x' = deformation, we may need to rescale to have x' = k * deformation, such that the final position can move more than 1 unit in space. Considering that
for most filmmaking applications dynamic motion and effects are contained within the stationary stage, re-scaling may be as simple as multiplying the normalized positional output by the stage's extent. These
measurements are easy to generate by, for example, analysing the initial point cloud.
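A minimal sketch of what this could look like is given below. It assumes a cosine activation remapped to [0, 1] and an axis-aligned stage extent computed from the initial point cloud; the padding margin, the shifted form lo + u * (hi - lo) and the module name are all our assumptions rather than anything we have validated.

```python
import torch
import torch.nn as nn

def stage_extent(init_points, margin=0.05):
    """Axis-aligned bounds of the initial (t=0) point cloud, slightly padded."""
    lo = init_points.min(dim=0).values
    hi = init_points.max(dim=0).values
    pad = margin * (hi - lo)
    return lo - pad, hi + pad                            # each (3,)

class NormalisedPositionHead(nn.Module):
    """Decodes a deformation feature into a position constrained to the stage bounds."""
    def __init__(self, feat_dim, lo, hi):
        super().__init__()
        self.linear = nn.Linear(feat_dim, 3)
        self.register_buffer("lo", lo)
        self.register_buffer("hi", hi)

    def forward(self, feat):
        # cos(.) lies in [-1, 1]; remap to [0, 1] so position behaves like the colour
        # and rotation heads, then rescale by the stage extent (the 'k' above).
        u = 0.5 * (torch.cos(self.linear(feat)) + 1.0)   # (N, 3) in [0, 1]
        return self.lo + u * (self.hi - self.lo)         # x' = k * deformation (plus offset)

init_points = torch.randn(5000, 3)                       # e.g. the t=0 SfM point cloud
lo, hi = stage_extent(init_points)
head = NormalisedPositionHead(feat_dim=64, lo=lo, hi=hi)
x_prime = head(torch.randn(5000, 64))                    # positions guaranteed inside the stage
```

This would also keep the colour, rotation and position heads on a comparable numerical scale, which is the gradient-balancing argument made above.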
This may also open up new doors. We tinkered with the idea of a deformation field that did not update the initial Gaussian G but instead replaced it, such that x' = deformation instead of
x' = x + deformation. This allows us to simplify the aforementioned interpretations of the canon and deformation field, as the canon acts only to sample the deformation field, and the deformation
field no longer acts to predict residuals but instead directly produces the Gaussian field at t. While this didn't work, it also didn't completely fail, and it has pushed us to think about other potential
avenues for re-interpreting the relatively confusing hex-plane field representations. We believe the ideal solution lies with a grounded interpretation that allows for disentangling visual changes
(color) from geometric changes (position and rotation). We experimented with learning a different set of (XT, YT and ZT) planes for the color feature and found little quantitative change but some very
small visual improvements - which are really only evident under a microscope, so do not motivate any paper. Along the same lines, we also looked at modelling the temporal opacity parameters (h, omega and mu)
using the (XY, XZ and YZ) planes and decoding the feature from the three space-only planes. We found more promising results, but in the Splatography paper it made more sense to leave them as explicit
parameters, as intended in STG, as we wanted a direct comparison with our proposed modifications to the temporal opacity function.
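For reference, a hypothetical sketch of decoding (h, omega, mu) from the fused space-only features is below. The STG-style Gaussian form h * exp(-omega * (t - mu)^2), the sigmoid/softplus activations and the small MLP are assumptions for illustration; neither STG's exact parameterisation nor our modified opacity function is reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalOpacityHead(nn.Module):
    """Decodes per-point (h, omega, mu) from fused XY/XZ/YZ plane features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 3))        # -> (h, omega, mu)

    def forward(self, space_feats, t):
        h_raw, w_raw, mu = self.mlp(space_feats).unbind(-1)
        h = torch.sigmoid(h_raw)                          # peak opacity in (0, 1)
        omega = F.softplus(w_raw)                         # temporal width must stay positive
        return h * torch.exp(-omega * (t - mu) ** 2)      # per-point opacity at time t
```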
While it's unexplored, this latter attempt at research does indicate that benefit can come from re-interpreting the space-only and space-time planes, such that, again, visual and geometric changes can be
disentangled. One approach we rather like is the STG method, which assumes no visual change until rendering, whereby the renderer is assumed to be in control of the dynamic visual changes. This method,
however, treats all dynamic Gaussian parameters as explicit and disconnected components, so working this into a hex-plane method could be rather challenging - though perhaps using the hex-planes to only
model geometry and using 2-D (1-to-1) feature-to-image decoders (as in STG) to model view-dependent and visual changes could prove a powerful solution. The only caveat is that the STG rendering
approach is heavy and limited by one's ability to work with CUDA code.
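To illustrate the kind of decoder we have in mind, here is a hypothetical per-pixel (1-to-1) feature-to-image head in the spirit of STG's lightweight rendering network. The channel counts, the 1x1-convolution design and the use of per-pixel ray directions are our assumptions and not STG's exact architecture.

```python
import torch
import torch.nn as nn

class FeatureToImageDecoder(nn.Module):
    """Turns a splatted feature image into view-dependent colour, pixel by pixel."""
    def __init__(self, feat_ch=9, hidden=32):
        super().__init__()
        # 1x1 convolutions act strictly per pixel (1-to-1), so no spatial mixing occurs.
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + 3, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=1), nn.Sigmoid())

    def forward(self, splatted_feats, view_dirs):
        # splatted_feats: (B, feat_ch, H, W) features rasterised by the splatter
        # view_dirs:      (B, 3, H, W) per-pixel ray directions
        return self.net(torch.cat([splatted_feats, view_dirs], dim=1))  # (B, 3, H, W) RGB
```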
@article{...}