Road to 3D Gaussian Splatting

A simplified and self-contained tutorial for understanding 3D Gaussian Splatting [Under Construction]

Introduction

The goal of this blog is to cover the concepts and explanations necessary to understand 3D Gaussian Splatting. I created it for my own learning, and to make learning easier for others, without the need to scour the internet gathering and piecing together the relevant background. This tutorial is based on several readings from different sources, which are cited with links.

Problem Overview

3D Gaussian Splatting is a method for learning a 3D representation of a scene, which allows projecting the scene onto a 2D surface from different viewpoints. The representation is learned from a few available images of the same scene taken from different viewpoints, along with their camera pose information. The goal is to make the representation general enough that we can project the scene onto novel viewpoints. Moreover, whatever representation we use, we need an algorithm for rendering it to a 2D image.


Representations of a 3D Scene

Classical Representations

A 3D scene can be represented by one of the following (a code sketch of all three follows the list):

  1. 3D Mesh: a set of vertices, edges, and faces that outline the 3D shape.
  2. Point Clouds: a collection of points in 3D that represent an object. They can carry additional numerical information like color, density, etc.
  3. Voxel Grids: the 3D version of pixels. Essentially cubes that partition 3D space where each cube can carry additional information.
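
To make these concrete, here is a minimal NumPy sketch of all three representations. The shapes, sizes, and attribute choices are purely illustrative and do not follow any standard file format:

```python
import numpy as np

# 1. 3D Mesh: vertex positions plus faces that index into them
#    (edges are implied by the faces). This is a single tetrahedron.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])   # (V, 3) xyz positions
faces = np.array([[0, 1, 2],
                  [0, 1, 3],
                  [0, 2, 3],
                  [1, 2, 3]])            # (F, 3) vertex indices per triangle

# 2. Point cloud: an unordered set of 3D points, optionally carrying
#    per-point attributes such as color.
points = np.random.rand(1000, 3)         # (N, 3) xyz positions
colors = np.random.rand(1000, 3)         # (N, 3) rgb values in [0, 1]

# 3. Voxel grid: a dense 3D array partitioning space into cubes, where
#    each cell stores information such as occupancy or density.
voxels = np.zeros((64, 64, 64), dtype=np.float32)
voxels[20:40, 20:40, 20:40] = 1.0        # mark a block of cells as occupied
```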

Learned Representations

These include Neural Radiance Fields (NeRFs), which encode the scene implicitly in the weights of a neural network, and 3D Gaussian Splatting, which represents it explicitly as a set of 3D Gaussians. Both are discussed in detail in the following sections.


Neural Radiance Fields (NeRFs)

In order to train a NeRF, we first construct a dataset of \(N\) images, each with its corresponding camera pose information. A neural network (usually an MLP) is then trained to take a 5D coordinate \((x, y, z, \theta, \phi)\) as input, where \((x, y, z)\) is a 3D location and \((\theta, \phi)\) are the angles determining the viewing direction. The network outputs the color and volume density at that point. To form an image, these outputs are composited along camera rays via volumetric rendering (covered in the next section), giving a color for each pixel of a particular view. The network can then be optimized by matching the rendered image against the ground-truth image over the views available in the dataset.
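
As a rough sketch of this setup, the toy PyTorch module below maps the 5D input directly to a color and a density. The layer widths are arbitrary, and the real NeRF architecture additionally applies a positional encoding to its inputs and injects the viewing direction later in the network, both of which are omitted here:

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """A toy NeRF field: maps (x, y, z, theta, phi) to (rgb, sigma)."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 4 outputs: (r, g, b, sigma)
        )

    def forward(self, coords: torch.Tensor):
        out = self.mlp(coords)             # (B, 4)
        rgb = torch.sigmoid(out[..., :3])  # colors squashed to [0, 1]
        sigma = torch.relu(out[..., 3:])   # density must be non-negative
        return rgb, sigma

# Query the field at a batch of 5D coordinates.
model = TinyNeRF()
coords = torch.rand(1024, 5)  # random (x, y, z, theta, phi) samples
rgb, sigma = model(coords)    # rgb: (1024, 3), sigma: (1024, 1)
```

Rendering a pixel then amounts to querying this field at many sample points along the corresponding camera ray and compositing the colors weighted by density, which is exactly what volumetric rendering does; the photometric loss between the rendered and ground-truth images is what drives the optimization.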

Volumetric Rendering

3D Gaussian Splatting

Representation